dgit.raspbian.org Git

xen/arm: smccc: Fix indentation in ARM_SMCCC_ARCH_WORKAROUND_1_FID

Signed-off-by: Julien Grall <julien.grall@arm.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>

xen/arm: Kconfig: Move HARDEN_BRANCH_PREDICTOR under "Architecture features"

At the moment, HARDEN_BRANCH_PREDICTOR is not in any section, making it
impossible for the user to unselect it.

Also, it looks like we are required to use 'expert = "y"' for showing the
option in expert mode.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>

xen/arm64: Implement a fast path for handling SMCCC_ARCH_WORKAROUND_2

The function ARM_SMCCC_ARCH_WORKAROUND_2 will be called by the guest for
enabling/disabling the ssbd mitigation. So we want the handling to
be as fast as possible.

The new sequence will forward guest's ARCH_WORKAROUND_2 call to EL3 and
also track the state of the workaround per-vCPU.

Note that since we need to execute branches, this always executes after
the spectre-v2 mitigation.

This code is based on KVM counterpart "arm64: KVM: Handle guest's
ARCH_WORKAROUND_2 requests" written by Marc Zyngier.

This is part of XSA-263.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>

xen/arm64: Add generic assembly macros

Add assembly macros to simplify assembly code:
- adr_cpu_info: Get the address to the current cpu_info structure
- ldr_this_cpu: Load a per-cpu value

This is part of XSA-263.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>

xen/arm: alternatives: Add dynamic patching feature

This is based on the Linux commit dea5e2a4c5bc "arm64: alternatives: Add
dynamic patching feature" written by Marc Zyngier:

    We've so far relied on a patching infrastructure that only gave us
    a single alternative, without any way to provide a range of potential
    replacement instructions. For a single feature, this is an all or
    nothing thing.

    It would be interesting to have a more fine-grained way of patching the
    kernel though, where we could dynamically tune the code that gets injected.

    In order to achieve this, let's introduce a new form of dynamic patching,
    associating a callback to a patching site. This callback gets source and
    target locations of the patching request, as well as the number of
    instructions to be patched.

    Dynamic patching is declared with the new ALTERNATIVE_CB and alternative_cb
    directives:
                    asm volatile(ALTERNATIVE_CB("mov %0, #0\n", callback)
                                 : "=r" (v));
    or

                    alternative_cb callback
                            mov x0, #0
                    alternative_cb_end

    where callback is the C function computing the alternative.
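
As a rough illustration of the callback idea described above, here is a
minimal C sketch. The type alt_cb_t, the struct alt_site and both helper
functions are hypothetical names made up for this example; they are not the
Linux or Xen definitions. The patch site is modelled as a plain array of
32-bit instruction words.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback type: it receives the original instructions, the
 * location to update, and the number of 32-bit instructions to patch. */
typedef void (*alt_cb_t)(const uint32_t *orig, uint32_t *updated,
                         size_t nr_insn);

/* Hypothetical record associating a callback with one patch site. */
struct alt_site {
    uint32_t *insns;   /* first instruction of the patch site */
    size_t nr_insn;    /* number of instructions at the site */
    alt_cb_t cb;       /* computes the replacement at run time */
};

/* Example callback: overwrite every instruction with an A64 NOP. */
static void patch_with_nops(const uint32_t *orig, uint32_t *updated,
                            size_t nr_insn)
{
    (void)orig;        /* this callback does not look at the old code */
    for (size_t i = 0; i < nr_insn; i++)
        updated[i] = 0xd503201fu;   /* A64 NOP encoding */
}

/* Let the callback decide what ends up at the patch site. */
static void apply_alt_site(struct alt_site *site)
{
    site->cb(site->insns, site->insns, site->nr_insn);
}
```

A real implementation would also need cache maintenance and a writable
mapping of the text; both are left out of this sketch.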

    Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
    Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
    Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>

This is part of XSA-263.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>

xen/arm: Simplify alternative patching of non-writable region

During the MMU setup process, Xen will set the SCTLR_EL2.WXN
(write permission implies execute-never) bit. Because of that, the
alternative code needs to re-map the region in a different place in order
to modify the text section.

At the moment, the function patching the code is only aware of the
re-mapped region. This requires the caller to mess with Xen internals in
order to have functions such as is_active_kernel_text() working.

All the interactions with Xen internals can be removed by specifying the
offset between the region to patch and the writable region used for updating
the instruction.

This simplification will also make it easier to integrate dynamic patching
in a follow-up patch. Indeed, the callback address should be in the
original region and not in the re-mapped one, which is writable but
non-executable.
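
The offset-based approach can be sketched in C as follows. This is only a
model under stated assumptions: patch_insn() is a hypothetical helper, and
the executable and writable mappings are represented by whatever memory the
caller passes in, rather than by two real mappings of the same pages.

```c
#include <stddef.h>
#include <stdint.h>

/* Write one instruction.
 *
 * 'origptr' is the address in the original (executable, read-only)
 * mapping, which is what callers and callbacks reason about. 'offset' is
 * the distance in bytes from that mapping to its writable alias. The store
 * itself goes through the alias. */
static void patch_insn(const uint32_t *origptr, ptrdiff_t offset,
                       uint32_t insn)
{
    uint32_t *writeptr = (uint32_t *)((uintptr_t)origptr + (uintptr_t)offset);

    *writeptr = insn;
}
```

With this shape the caller only ever handles addresses in the original
region, and the offset is the single piece of knowledge about the re-mapping.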

This is part of XSA-263.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>

xen/arm: Add ARCH_WORKAROUND_2 support for guests

In order to offer ARCH_WORKAROUND_2 support to guests, we need to track the
state of the workaround per-vCPU. The field 'pad' in cpu_info is now
repurposed to store flags easily accessible in assembly.

As the hypervisor will always run with the workaround enabled, we may
need to enable (on guest exit) or disable (on guest entry) the
workaround.

A follow-up patch will add a fast path for the workaround for arm64 guests.

Note that check_workaround_ssbd() is used instead of ssbd_get_state()
because the former is implemented using an alternative. Therefore the code
will be shortcut on affected platforms.
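
A minimal sketch of the per-vCPU tracking described above, in C. All names
here (the flag bit, struct cpu_info_sketch, set_mitigation() and the
entry/exit helpers) are hypothetical and only model the idea; the firmware
call is replaced by a counter.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bit: set when the vCPU wants the workaround enabled. */
#define SKETCH_WORKAROUND_2_FLAG (1u << 0)

struct cpu_info_sketch {
    uint32_t flags;               /* stands in for the repurposed 'pad' */
};

static int firmware_calls;        /* counts simulated firmware calls */
static bool mitigation_on = true; /* the hypervisor runs with it enabled */

static void set_mitigation(bool enable)
{
    firmware_calls++;
    mitigation_on = enable;
}

/* Guest entry: switch the workaround off if the vCPU asked for that. */
static void enter_guest(const struct cpu_info_sketch *ci)
{
    if (!(ci->flags & SKETCH_WORKAROUND_2_FLAG))
        set_mitigation(false);
}

/* Guest exit: switch it back on, because the hypervisor always wants it. */
static void leave_guest(const struct cpu_info_sketch *ci)
{
    if (!(ci->flags & SKETCH_WORKAROUND_2_FLAG))
        set_mitigation(true);
}
```

A vCPU that keeps the workaround enabled costs no extra firmware calls in
this model; only vCPUs that disabled it pay on entry and exit.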

This is part of XSA-263.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>

xen/arm: Add command line option to control SSBD mitigation

On a system where the firmware implements ARCH_WORKAROUND_2, it may be
useful to either permanently enable or disable the workaround for cases
where the user decides that they'd rather not get a trap overhead, and
keep the mitigation permanently on or off instead of switching it on
exception entry/exit. In any case, default to mitigation being enabled.

The new command line option is implemented as a list of one option, to
follow the x86 option and also to make it easier to extend in the future.

Note that for convenience, the full implementation of the workaround is
done in the .matches callback.

Lastly, an accessor is provided to know the state of the mitigation.

After this patch, there are 3 methods complementing each other to find the
state of the mitigation:
    - The capability ARM_SSBD indicates the platform is affected by the
      vulnerability. This will also return false if the user decides to
      force-disable the mitigation (spec-ctrl="ssbd=force-disable"). The
      capability is useful for putting shortcuts in place using alternatives.
    - ssbd_state indicates the global state of the mitigation (e.g.
      unknown, force enable...). The global state is required to report
      the state to a guest.
    - The per-cpu ssbd_callback_required indicates whether a pCPU
      is required to call the SMC. This allows shortcutting the SMC call
      and saving an entry/exit to EL3.
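
To illustrate the "list of one option" shape, here is a small C sketch of a
parser for a comma-separated value such as "ssbd=force-disable". The enum,
the function names and the exact set of accepted strings are assumptions for
this example (only force-disable and force-enable are taken from the text
above); this is not the Xen parser.

```c
#include <string.h>

/* Hypothetical result values for this sketch. */
enum ssbd_opt {
    SSBD_OPT_DEFAULT,        /* nothing specified: mitigation enabled */
    SSBD_OPT_FORCE_DISABLE,
    SSBD_OPT_FORCE_ENABLE,
    SSBD_OPT_INVALID
};

/* True if the 'len' bytes at 's' are exactly the literal 'lit'. */
static int opt_is(const char *s, size_t len, const char *lit)
{
    return len == strlen(lit) && strncmp(s, lit, len) == 0;
}

/* Walk a comma-separated list; a later entry overrides an earlier one. */
static enum ssbd_opt parse_spec_ctrl(const char *s)
{
    enum ssbd_opt opt = SSBD_OPT_DEFAULT;

    while (*s) {
        const char *end = strchr(s, ',');
        size_t len = end ? (size_t)(end - s) : strlen(s);

        if (opt_is(s, len, "ssbd=force-disable"))
            opt = SSBD_OPT_FORCE_DISABLE;
        else if (opt_is(s, len, "ssbd=force-enable"))
            opt = SSBD_OPT_FORCE_ENABLE;
        else
            return SSBD_OPT_INVALID;

        s += len;
        if (*s == ',')
            s++;             /* skip the separator */
    }

    return opt;
}
```

Parsing a list, even when it currently holds a single option, means new
sub-options can be added later without changing the command line syntax.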

This is part of XSA-263.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>

xen/arm: Add ARCH_WORKAROUND_2 probing

As for Spectre variant-2, we rely on SMCCC 1.1 to provide the discovery
mechanism for detecting the SSBD mitigation.

A new capability is also allocated for that purpose, and a config
option.

This is part of XSA-263.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>

xen/arm: setup: Check errata for boot CPU later on

Some errata will rely on the SMCCC version which is detected by
psci_init().

This is part of XSA-263.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>

xen/arm64: entry: Use named label in guest_sync

This will improve readability for future changes.

This is part of XSA-263.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>

xen/arm: domain: Zero the per-vCPU cpu_info

A stack is allocated per vCPU to be used by Xen. The allocation is done
with alloc_xenheap_pages, which does not zero the memory returned. However,
the top of the stack contains information that will be used to
store the initial state of the vCPU (see struct cpu_info). Some of the
fields may not be initialized, which can lead to using/leaking bits of
previous memory in some cases on the first run of a vCPU (AFAICT this only
happens on vCPU0 for Dom0).

This is part of XSA-263.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>

xen/arm: Enable errata for secondary CPU on hotplug after the boot

On boot, enabling errata workarounds will be triggered by the boot CPU
from start_xen(). On CPU hotplug (non-boot scenario) this would not be
done. This patch adds the code required to enable errata workarounds for
a CPU being hotplugged after the system boots. This is triggered using
a notifier. If the CPU fails to enable workarounds the notifier will
return an error and Xen will hit the BUG_ON() in notify_cpu_starting().
To avoid the BUG_ON() in an error case either enabling notifiers should
be fixed to return void (not propagate error to notify_cpu_starting())
and the errata notifier will always return success for CPU_STARTING
event, or the notify_cpu_starting() and other common code should be
fixed to expect an error at CPU_STARTING phase.

Signed-off-by: Mirela Simonovic <mirela.simonovic@aggios.com>
Reviewed-by: Julien Grall <julien.grall@arm.com>

xen/arm: Free memory allocated for sibling/core maps on CPU hot-unplug

The memory allocated in setup_cpu_sibling_map() when a CPU is hotplugged
has to be freed when the CPU is hot-unplugged. This is done in
remove_cpu_sibling_map() and called when the CPU dies. The call to
remove_cpu_sibling_map() is made from a notifier callback when
CPU_DEAD event is received.

Signed-off-by: Mirela Simonovic <mirela.simonovic@aggios.com>
Acked-by: Julien Grall <julien.grall@arm.com>

xen/arm: Disable timers and release their interrupts on CPU hot-unplug

When a CPU is hot-unplugged we need to disable timers and release
their interrupts in order to free the memory that was allocated when
interrupts were requested (using request_irq()). The request_irq()
is called for each timer interrupt when the CPU gets hotplugged
(start_secondary->init_timer_interrupt->request_irq).
With this patch timers will be disabled and interrupts will be
released when the newly added callback receives CPU_DYING event.

Signed-off-by: Mirela Simonovic <mirela.simonovic@aggios.com>
Acked-by: Julien Grall <julien.grall@arm.com>

xen/arm: Release maintenance interrupt when CPU is hot-unplugged

When a CPU is hot-unplugged the maintenance interrupt has to be
released in order to free the memory that was allocated when the CPU
was hotplugged and interrupt requested. The interrupt was requested
using request_irq() which is called from start_secondary->
init_maintenance_interrupt. With this patch the interrupt will be
released when the CPU_DYING event is received by the callback which
is added in gic.c.

Signed-off-by: Mirela Simonovic <mirela.simonovic@aggios.com>
Acked-by: Julien Grall <julien.grall@arm.com>

xen/common: Restore IRQ affinity when hotplugging a pCPU

Non-boot pCPUs are being hot-unplugged during the system suspend to
RAM and hotplugged during the resume. When non-boot pCPUs are
hot-unplugged the interrupts that were targeted to them are migrated
to the boot pCPU.
On suspend, each guest could have its own wake-up devices/interrupts
(passthrough) that could trigger the system resume. These interrupts
could be targeted to a non-boot pCPU, e.g. if the guest's vCPU is
pinned to a non-boot pCPU. Due to the hot-unplug of non-boot pCPUs
during the suspend such interrupts will be migrated from non-boot pCPUs
to the boot pCPU (this is fine). However, when non-boot pCPUs are
hotplugged on resume, these interrupts are not migrated back to non-boot
pCPUs, i.e. IRQ affinity is not restored on resume (this is wrong).
This patch adds the restoration of IRQ affinity when a pCPU is hotplugged.

Signed-off-by: Mirela Simonovic <mirela.simonovic@aggios.com>
Reviewed-by: Dario Faggioli <dfaggioli@suse.com>

xen/arm: Setup virtual paging for non-boot CPUs on hotplug/resume

In existing code the virtual paging for non-boot CPUs is set up only on boot.
The setup is triggered from start_xen() after all CPUs are brought online.
In other words, the initialization of VTCR_EL2 register is done out of the
cpu_up/start_secondary() control flow. However, the cpu_up flow is also used
to hotplug non-boot CPUs on resume from suspend to RAM state, in which case
the virtual paging will not be configured.

With this patch the setting of paging is triggered from the start_secondary()
function using the CPU starting notifier (notify_cpu_starting() call). The
notifier is registered in p2m.c using an init call. This has to be done with
an init call rather than presmp_init because the registered callback depends
on the vtcr configuration value, which is set up after the presmp init calls
are executed (do_presmp_initcalls() called from start_xen()). Init calls
are executed after initial virtual paging is set up for all CPUs on boot.
This ensures that no callback can fire until the vtcr value is calculated
by Xen and virtual paging is set up initially for all CPUs. Also, this way
the virtual paging setup in boot scenario remains unchanged.

It is assumed here that after the system has completed the boot, CPUs that
execute start_secondary() were booted as well when Xen itself was
booted. According to this assumption non-boot CPUs will always be compliant
with the VTCR_EL2 value that was selected by Xen on boot.
Currently, there is no mechanism to trigger hotplugging of a CPU. This
will be added with the suspend to RAM support for ARM, where the hotplug
of non-boot CPUs will be triggered via enable_nonboot_cpus() call.

Signed-off-by: Mirela Simonovic <mirela.simonovic@aggios.com>
Reviewed-by: Julien Grall <julien.grall@arm.com>

xen/arm: Remove __initdata and __init to enable CPU hotplug

CPU up flow is currently used during the initial boot to start secondary
CPUs. However, the same flow should be used for CPU hotplug, e.g. when
hotplugging secondary CPUs within the resume procedure (resume from the
suspend to RAM). Therefore, the __initdata and __init annotations had to be
removed from a few data structures and functions that are used within the
CPU up flow.

Signed-off-by: Mirela Simonovic <mirela.simonovic@aggios.com>
Acked-by: Julien Grall <julien.grall@arm.com>

xen/arm: Implement CPU_OFF PSCI call (physical interface)

During the system suspend to RAM non-boot CPUs will be hot-unplugged.
This will be triggered via the disable_nonboot_cpus() call. When
hot-unplugged the CPU will end up in an infinite wfi loop in stop_cpu().
This patch adds the PSCI CPU_OFF call to EL3 in order to power
down the calling CPU during the suspend. The CPU_OFF call will be made
only if the PSCI version is higher than v0.1 (note that the CPU_OFF
function is mandatory since PSCI v0.2).
If the PSCI CPU_OFF call to EL3 succeeds it will not return. Otherwise,
when the PSCI CPU_OFF call returns we'll raise a panic, because the
calling CPU couldn't be enabled afterwards (it stays in the WFI loop forever).
Note that if the PSCI version is higher than v0.1 the CPU_OFF will be
called regardless of the system state. This is done because scenarios
other than suspend may benefit from powering off the CPU.
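
The version check described above can be sketched in C like this. The helper
psci_can_cpu_off() is a hypothetical name for this example; the version
value is assumed to use the layout that PSCI_VERSION reports, with the major
number in the upper 16 bits and the minor number in the lower 16 bits.

```c
#include <stdbool.h>
#include <stdint.h>

/* Version layout: major in bits [31:16], minor in bits [15:0]. */
#define PSCI_VERSION(major, minor) \
    (((uint32_t)(major) << 16) | (uint32_t)(minor))

/* CPU_OFF is only attempted when the firmware is newer than v0.1. */
static bool psci_can_cpu_off(uint32_t version)
{
    return version > PSCI_VERSION(0, 1);
}
```

Because the major number sits in the high bits, a plain integer comparison
orders versions correctly.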

Signed-off-by: Mirela Simonovic <mirela.simonovic@aggios.com>
Acked-by: Julien Grall <julien.grall@arm.com>

xen/arm: Ignore write to GICD_ISACTIVERn registers (vgic-v2)

Guests attempt to write into these registers on resume (for example Linux).
Without this patch a data abort exception will be raised to the guest.
This patch handles the write access by ignoring it, but only if the value
to be written is zero. This should be fine because reading these registers
is already handled as 'read as zero'.
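
A minimal C sketch of the access policy described above. The function
isactiver_access() and its signature are hypothetical and much simpler than
a real vGIC MMIO handler; the sketch only captures "reads return zero,
writes of zero are ignored, any other write is not handled".

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified handler for one GICD_ISACTIVERn register.
 *
 * Returns true if the access was handled, false if the guest should
 * receive a data abort instead. */
static bool isactiver_access(bool is_write, uint32_t *value)
{
    if (!is_write) {
        *value = 0;          /* read as zero */
        return true;
    }

    return *value == 0;      /* ignore the write only when it writes zero */
}
```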

Signed-off-by: Mirela Simonovic <mirela.simonovic@aggios.com>
Reviewed-by: Julien Grall <julien.grall@arm.com>

xen/arm64: Added handling of the trapped access to OSLSR register

Linux/dom0 accesses the OSLSR register when saving CPU context during the
suspend procedure. Xen traps access to this register, but has no handling
for it. Consequently, Xen injects an undef exception into Linux, causing it to
crash. This patch adds handling of the trapped access to OSLSR as
read-only with a fixed value.

For convenience, introduce a new helper handle_ro_read_val() based on
handle_ro_raz() that will return a specified value on read and
re-implement handle_ro_raz() based on the new helper.
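
The relationship between the two helpers can be sketched in C as follows.
The signatures here are simplified assumptions for illustration; the real
helpers take more context (such as the trapped register state), which is
omitted.

```c
#include <stdbool.h>
#include <stdint.h>

/* Read-only register that always reads as 'val'.
 *
 * Returns true if the access was handled. A write is not handled, and the
 * caller would then inject an exception into the guest. */
static bool handle_ro_read_val(uint64_t *reg, bool is_read, uint64_t val)
{
    if (!is_read)
        return false;

    *reg = val;
    return true;
}

/* Read-as-zero is just the fixed-value helper with val == 0. */
static bool handle_ro_raz(uint64_t *reg, bool is_read)
{
    return handle_ro_read_val(reg, is_read, 0);
}
```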

Signed-off-by: Mirela Simonovic <mirela.simonovic@aggios.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Acked-by: Julien Grall <julien.grall@arm.com>
[julien: Add a word about the new helper]

arm: remove the ARM HDLCD driver

The ARM HDLCD driver is unused. The device itself can only be found on
Versatile Express boards that are for early development only. Remove the
driver.

Also remove vexpress_syscfg, now unused, and "select VIDEO" that is not
useful anymore.

Suggested-by: Julien Grall <julien.grall@arm.com>
Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Julien Grall <julien.grall@arm.com>

arm: vgic: Use find_first_bit instead of find_next_bit when possible

find_next_bit(foo, sz, 0) is equivalent to find_first_bit(foo, sz). The
latter is easier to understand. Some architectures may also have an
optimized version of find_first_bit(...). So replace the occurrence of
find_next_bit in vgic_vcpu_pending_irq().
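
The equivalence can be seen with a naive C model of the two helpers. The
_sketch functions below are hypothetical stand-ins written for this example,
not the Xen bitops; they scan one bit at a time and return 'size' when no
bit is set.

```c
#include <limits.h>
#include <stddef.h>

#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Index of the first set bit at or after 'offset', or 'size' if none. */
static size_t find_next_bit_sketch(const unsigned long *addr, size_t size,
                                   size_t offset)
{
    for (size_t i = offset; i < size; i++)
        if (addr[i / SKETCH_BITS_PER_LONG] &
            (1UL << (i % SKETCH_BITS_PER_LONG)))
            return i;

    return size;
}

/* Searching from offset 0 is, by definition, finding the first bit. */
static size_t find_first_bit_sketch(const unsigned long *addr, size_t size)
{
    return find_next_bit_sketch(addr, size, 0);
}
```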

Signed-off-by: Artem Mygaiev <artem_mygaiev@epam.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>

ocaml/xenstored: reduce use of unsafe conversions

The rationalisation of the Xs_ring interface in the xb library
allows further reducing the unsafe calls without introducing
copies. This patch also contains some further code cleanups.

Signed-off-by: Marcello Seri <marcello.seri@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>
Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

ocaml/libs/xb: Use bytes in place of strings for mutable buffers

Since OCaml 4.06.0, which made safe-string the default, the compiler is
allowed to perform optimisations on immutable strings. They should no
longer be used as mutable buffers, and bytes should be used instead.

The C stubs for Xs_ring have been updated to use bytes, and the interface
rationalised mimicking the new Unix module in the standard library (the
implementation of Unix.write_substring uses unsafe_of_string in the exact same
way, and both the write implementations are using the bytes as an immutable
payload for the write).

Signed-off-by: Marcello Seri <marcello.seri@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>
Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

x86/traps: Fix error handling of the pv %dr7 shadow state

c/s "x86/pv: Introduce and use x86emul_write_dr()" fixed a bug with IO shadow
handling, in that it remained stale and visible until %dr7.L/G got set again.

However, it neglected the -EPERM return in between these two hunks, introducing
a different bug in which a write to %dr7 which tries to set IO breakpoints
without %cr4.DE being set clobbers the IO state, rather than leaving it alone.

Instead, move the zeroing slightly later, which guarantees that the shadow
gets written exactly once, on a successful update to %dr7.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

x86/CPUID: don't override tool stack decision to hide STIBP

Other than in the feature sets, where we indeed want to offer the
feature even if not enumerated on hardware, we shouldn't dictate the
feature being available if tool stack or host admin have decided to not
expose it (for whatever [questionable?] reason). That feature set side
override is sufficient to achieve the intended guest side safety
property (in offering - by default - STIBP independent of actual
availability in hardware).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

x86: correct default_xen_spec_ctrl calculation

Even with opt_msr_sc_{pv,hvm} both false we should set up the variable
as usual, to ensure proper one-time setup during boot and CPU bringup.
This then also brings the code in line with the comment immediately
ahead of the printk() being modified saying "irrespective of guests".

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

x86: suppress sync when XPTI is disabled for a domain

Now that we have a per-domain flag we can and should control sync-ing in
a more fine grained manner: Only domains having XPTI enabled need the
sync to occur.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

libxc/x86/PV: don't hand through CPUID leaf 0x80000008 as is

Just like for HVM the feature set should be used for EBX output, while
EAX should be restricted to the low 16 bits and ECX/EDX should be zero.
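
The register handling described above can be sketched in C like this. The
struct and function names are hypothetical, and the feature-set word for EBX
is simply passed in as a parameter.

```c
#include <stdint.h>

struct cpuid_leaf_sketch {
    uint32_t a, b, c, d;     /* EAX, EBX, ECX, EDX */
};

/* Build the guest view of leaf 0x80000008 from the hardware values. */
static struct cpuid_leaf_sketch
sanitise_leaf_80000008(struct cpuid_leaf_sketch hw, uint32_t featureset_ebx)
{
    struct cpuid_leaf_sketch out;

    out.a = hw.a & 0xffffu;  /* keep only the low 16 bits of EAX */
    out.b = featureset_ebx;  /* EBX comes from the feature set */
    out.c = 0;               /* ECX is not passed through */
    out.d = 0;               /* EDX is not passed through */

    return out;
}
```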

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

tools/kdd: alternative way of muting spurious gcc warning

Older gcc does not support #pragma GCC diagnostic, so use an alternative
approach: change the variable type to uint32_t (this code handles 32-bit
requests only anyway), which apparently also avoids gcc complaining about
this (otherwise correct) code.

Fixes 437e00fea04becc91c1b6bc1c0baa636b067a5cc "tools/kdd: mute spurious
gcc warning"

Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
Acked-by: Tim Deegan <tim@xen.org>

docs/process/xen-release-management: Lesson to learn

The 4.10 release preparation was significantly more hairy than ideal.
(We seem to have a good overall outcome despite, rather than because
of, our approach.)

This is the second time (at least) that we have come close to failure
by committing to a release date before the exact code to be released
is known and has been made and tested.

Evidently our docs make it insufficiently clear not to do that.

CC: Julien Grall <julien.grall@arm.com>
Acked-by: Juergen Gross <jgross@suse.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Lars Kurth <lars.kurth@citrix.com>
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>

docs/process: Add RUBRIC

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Acked-by: Juergen Gross <jgross@suse.com>

x86/traps: Dump the instruction stream even for double faults

This helps debug #DF's which occur in alternative patches

Reported-by: George Dunlap <george.dunlap@eu.citrix.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

x86/XPTI: fix S3 resume (and CPU offlining in general)

We should index an L1 table with an L1 index.

Reported-by: Simon Gaiser <simon@invisiblethingslab.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

x86/HVM: correct mtrr_pat_not_equal()

The two vCPU-s differing in MTRR-enabled state means MTRR settings are
not equal. Both vCPU-s having MTRRs disabled means only PAT needs to be
compared. Along those lines for fixed range MTRRs. Differing variable
range counts likewise mean settings are different overall (even if
that's not a very reasonable setup to have).

Constify types and convert bool_t to bool.
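
The comparison rules listed above can be sketched in C as follows. The
struct layout, the SKETCH_ constants and the function name are assumptions
for this example and do not mirror the Xen data structures; num_var is
assumed never to exceed SKETCH_MAX_VAR.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_NR_FIXED 11   /* number of fixed-range MTRR MSRs on x86 */
#define SKETCH_MAX_VAR  8    /* hypothetical cap on variable ranges */

struct vcpu_mtrr_sketch {
    bool enabled;                        /* MTRRs enabled at all */
    bool fixed_enabled;                  /* fixed-range MTRRs enabled */
    unsigned int num_var;                /* variable range count */
    uint64_t fixed[SKETCH_NR_FIXED];
    uint64_t var_base[SKETCH_MAX_VAR];
    uint64_t var_mask[SKETCH_MAX_VAR];
    uint64_t pat;
};

/* Returns true if the two vCPUs have differing MTRR/PAT settings. */
static bool mtrr_pat_not_equal_sketch(const struct vcpu_mtrr_sketch *a,
                                      const struct vcpu_mtrr_sketch *b)
{
    if (a->enabled != b->enabled)
        return true;                     /* enabled state differs */

    if (a->enabled) {
        if (a->num_var != b->num_var)
            return true;                 /* differing range counts */

        if (a->fixed_enabled != b->fixed_enabled)
            return true;

        /* Fixed ranges only matter when both have them enabled. */
        if (a->fixed_enabled &&
            memcmp(a->fixed, b->fixed, sizeof(a->fixed)) != 0)
            return true;

        for (unsigned int i = 0; i < a->num_var; i++)
            if (a->var_base[i] != b->var_base[i] ||
                a->var_mask[i] != b->var_mask[i])
                return true;
    }

    /* With MTRRs disabled on both sides, only PAT is left to compare. */
    return a->pat != b->pat;
}
```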

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

x86: correct vCPU dirty CPU handling

Commit df8234fd2c ("replace vCPU's dirty CPU mask by numeric ID") was
too lax in two respects: First of all it didn't consider the case of a
vCPU not having a valid dirty CPU in the descriptor table TLB flush
case. This is the issue Manuel has run into with NetBSD.

Additionally reads of ->dirty_cpu for other than the current vCPU are at
risk of racing with scheduler actions, i.e. single atomic reads need to
be used there. Obviously the non-init write sites then better also use
atomic writes.

Having to touch the descriptor table TLB flush code here anyway, take
the opportunity and switch it to be at most one flush_tlb_mask()
invocation.

Reported-by: Manuel Bouyer <bouyer@antioche.eu.org>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

x86/spec-ctrl: Rename ARCH_CAPS.SSBD_NO to SSB_NO

A last-minute rename of the feature occurred, and the patch committed to
staging was unfortunately stale. Correct it.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

x86/msr: Virtualise MSR_SPEC_CTRL.SSBD for guests to use

Almost all infrastructure is already in place. Update the reserved bits
calculation in guest_wrmsr(), and offer SSBD to guests by default.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/Intel: Mitigations for GPZ SP4 - Speculative Store Bypass

To combat GPZ SP4 "Speculative Store Bypass", Intel have extended their
speculative sidechannel mitigations specification as follows:

* A feature bit to indicate that Speculative Store Bypass Disable is
   supported.
* A new bit in MSR_SPEC_CTRL which, when set, disables memory disambiguation
   in the pipeline.
* A new bit in MSR_ARCH_CAPABILITIES, which will be set in future hardware,
   indicating that the hardware is not susceptible to Speculative Store Bypass
   sidechannels.

For contemporary processors, this interface will be implemented via a
microcode update.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/AMD: Mitigations for GPZ SP4 - Speculative Store Bypass

AMD processors will execute loads and stores with the same base register in
program order, which is typically how a compiler emits code.

Therefore, by default no mitigating actions are taken, despite there being
corner cases which are vulnerable to the issue.

For performance testing, or for users with particularly sensitive workloads,
the `spec-ctrl=ssbd` command line option is available to force Xen to disable
Memory Disambiguation on applicable hardware.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

doc: correct livepatch.markdown syntax

"make -C docs all" fails due to incorrect markdown syntax in
livepatch.markdown. Correct it.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

Misc fixes:
* Insert real URLs
* Drop trailing whitespace
* Consistent alignment and indentation for code blocks and lists
* Consistent capitalisation
* Consistent use of `` blocks for command line arguments and function names
* Rearrange things not to leave < and > in the text

No change in content. The document now reads rather more consistently in HTML
and PDF form.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>

xl: show full value of cpu_khz in xl info output

The exact value of cpu_khz can only be obtained via 'xl dmesg', and
therefore can be lost after some time. 'xl info' truncates the value to
full MHz. Adjust the output to show the full khz value.
This helps the host admin to track how a host has calibrated itself. The
value of cpu_khz is used during live migration for the decision if
access to TSC should be emulated.

Commit eb5277a30e ("bitkeeper revision 1.959.1.4
(40d04a87acOb29u-5Y5OxMhHvP2x9g)") gives no hint why cpu_mhz instead of
cpu_khz was chosen.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

Config.mk: Update QEMU to include build fixes

This tag includes two build fixes:
- dump: Fix build with newer gcc
  Fix build with GCC-8
- Fix libusb-1.0.22 deprecated libusb_set_debug with libusb_set_option

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>

xen/kbdif: Add features to disable keyboard and pointer

It is currently not fully possible to control if and which virtual devices
are created by the frontend, e.g. keyboard and pointer devices
are always created and a multi-touch device is created if the
backend advertises multi-touch support. In some cases this
behavior is not desirable and better control over the frontend's
configuration is required.

Add new XenStore feature fields, so it is possible to individually
control the set of exposed virtual devices for each guest OS:
- set feature-disable-keyboard to 1 if no keyboard device needs
  to be created
- set feature-disable-pointer to 1 if no pointer device needs
  to be created

Keep old behavior by default.

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

scripts/add_maintainers.pl: New script

This provides a much better workflow when using git format-patch and
git send-email, with get_maintainer.pl.

The tool covers step 2 of the following workflow:

  Step 1: git format-patch ... -o <patchdir> ...
  Step 2: ./scripts/add_maintainers.pl -d <patchdir>
          This overwrites *.patch files in <patchdir>
  Step 3: git send-email --to xen-devel@lists.xenproject.org <patchdir>/*.patch

I manually tested all options and the most common combinations
on Mac.

Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: George Dunlap <George.Dunlap@eu.citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Julien Grall <julien.grall@arm.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Tim Deegan <tim@xen.org>
Cc: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: Lars Kurth <lars.kurth@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Acked-by: Lars Kurth <lars.kurth@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>

vpci/msi: fix unbind loop

The current unbind loop on failure in vpci_msi_enable is wrong and
will only work correctly if the initial pirq is 0. Fix this by adding
a proper bound.
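
The class of bug described above can be illustrated with a small C sketch of
a bind loop with rollback. Everything here (the bound[] table, bind_one(),
unbind_one(), enable_range() and the fail_at test hook) is hypothetical and
only models the idea that the rollback must stop at the first pirq of the
range, not at 0. The caller is assumed to keep first + count within
SKETCH_MAX_PIRQ.

```c
#include <stdbool.h>

#define SKETCH_MAX_PIRQ 64

static bool bound[SKETCH_MAX_PIRQ]; /* which pirqs are currently bound */
static int fail_at = -1;            /* pirq whose bind fails, for testing */

static int bind_one(int pirq)
{
    if (pirq == fail_at)
        return -1;
    bound[pirq] = true;
    return 0;
}

static void unbind_one(int pirq)
{
    bound[pirq] = false;
}

/* Bind 'count' consecutive pirqs starting at 'first'. On failure, undo
 * only the bindings made by this call. */
static int enable_range(int first, int count)
{
    for (int i = 0; i < count; i++) {
        if (bind_one(first + i) != 0) {
            /* Walk back from the last successful bind down to 'first'. */
            while (i-- > 0)
                unbind_one(first + i);
            return -1;
        }
    }

    return 0;
}
```

Counting the rollback in terms of the loop index, and adding 'first' back
when unbinding, keeps it correct for any starting pirq.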

Reported-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

x86/spec_ctrl: Introduce a new `spec-ctrl=` command line argument to replace `bti=`

In hindsight, the options for `bti=` aren't as flexible or useful as expected
(including several options which don't appear to behave as intended).
Changing the behaviour of an existing option is problematic for compatibility,
so introduce a new `spec-ctrl=` in the hopes that we can do better.

One common way of deploying Xen is with a single PV dom0 and all domUs being
HVM domains.  In such a setup, an administrator who has weighed up the risks
may wish to forgo protection against malicious PV domains, to reduce the
overall performance hit.  To cater for this usecase, `spec-ctrl=no-pv` will
disable all speculative protection for PV domains, while leaving all
speculative protection for HVM domains intact.

For coding clarity as much as anything else, the suboptions are grouped by
logical area; those which affect the alternatives blocks, and those which
affect Xen's in-hypervisor settings.  See the xen-command-line.markdown for
full details of the new options.

While changing the command line options, take the time to change how the data
is reported to the user.  The three DEBUG printks are upgraded to unilateral,
as they are all relevant pieces of information, and the old "mitigations:"
line is split in the two logical areas described above.

Sample output from booting with `spec-ctrl=no-pv` looks like:

  (XEN) Speculative mitigation facilities:
  (XEN)   Hardware features: IBRS/IBPB STIBP IBPB
  (XEN)   Compiled-in support: INDIRECT_THUNK
  (XEN)   Xen settings: BTI-Thunk RETPOLINE, SPEC_CTRL: IBRS-, Other: IBPB
  (XEN)   Support for VMs: PV: None, HVM: MSR_SPEC_CTRL RSB
  (XEN)   XPTI (64-bit PV only): Dom0 enabled, DomU enabled

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

x86/cpuid: Improvements to guest policies for speculative sidechannel features

If Xen isn't virtualising MSR_SPEC_CTRL for guests, IBRSB shouldn't be
advertised.  It is not currently possible to express this via the existing
command line options, but such an ability will be introduced.

Another useful option in some usecases is to offer IBPB without IBRS.  When a
guest kernel is known to be compatible (uses retpoline and knows about the AMD
IBPB feature bit), an administrator with pre-Skylake hardware may wish to hide
IBRS.  This allows the VM to have full protection, without Xen or the VM
needing to touch MSR_SPEC_CTRL, which can reduce the overhead of Spectre
mitigations.

Break the logic common to both PV and HVM CPUID calculations into a common
helper, to avoid duplication.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

x86/spec_ctrl: Explicitly set Xen's default MSR_SPEC_CTRL value

With the impending ability to disable MSR_SPEC_CTRL handling on a
per-guest-type basis, the first exit-from-guest may not have the side effect
of loading Xen's choice of value.  Explicitly set Xen's default during the BSP
and AP boot paths.

For the BSP however, delay setting a non-zero MSR_SPEC_CTRL default until
after dom0 has been constructed when safe to do so.  Oracle report that this
speeds up boots of some hardware by 50s.

"when safe to do so" is based on whether we are virtualised.  A native boot
won't have any other code running in a position to mount an attack.

Reported-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

x86/spec_ctrl: Split X86_FEATURE_SC_MSR into PV and HVM variants

In order to separately control whether MSR_SPEC_CTRL is virtualised for PV and
HVM guests, split the feature used to control runtime alternatives into two.
Xen will use MSR_SPEC_CTRL itself if either of these features is active.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

x86/spec_ctrl: Elide MSR_SPEC_CTRL handling in idle context when possible

If Xen is virtualising MSR_SPEC_CTRL handling for guests, but using 0 as its
own MSR_SPEC_CTRL value, spec_ctrl_{enter,exit}_idle() need not write to the
MSR.

Requested-by: Jan Beulich <JBeulich@suse.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

x86/spec_ctrl: Rename bits of infrastructure to avoid NATIVE and VMEXIT

In hindsight, using NATIVE and VMEXIT as naming terminology was not clever.
A future change wants to split SPEC_CTRL_EXIT_TO_GUEST into PV and HVM
specific implementations, and using VMEXIT as a term is completely wrong.

Take the opportunity to fix some stale documentation in spec_ctrl_asm.h. The
IST helpers were missing from the large comment block, and since
SPEC_CTRL_ENTRY_FROM_INTR_IST was introduced, we've gained a new piece of
functionality which currently depends on the fine grain control, which exists
in lieu of livepatching. Note this in the comment.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

x86/spec_ctrl: Fold the XEN_IBRS_{SET,CLEAR} ALTERNATIVES together

Currently, the SPEC_CTRL_{ENTRY,EXIT}_* macros encode Xen's choice of
MSR_SPEC_CTRL as an immediate constant, and chooses between IBRS or not by
doubling up the entire alternative block.

There is now a variable holding Xen's choice of value, so use that and
simplify the alternatives.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

x86/spec_ctrl: Merge bti_ist_info and use_shadow_spec_ctrl into spec_ctrl_flags

All 3 bits of information here are control flags for the entry/exit code
behaviour. Treat them as such, rather than having two different variables.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

x86/spec_ctrl: Express Xen's choice of MSR_SPEC_CTRL value as a variable

At the moment, we have two different encodings of Xen's MSR_SPEC_CTRL value,
which is a side effect of how the Spectre series developed. One encoding is
via an alias with the bottom bit of bti_ist_info, and can encode IBRS or not,
but not other configurations such as STIBP.

Break Xen's value out into a separate variable (in the top of stack block for
XPTI reasons) and use this instead of bti_ist_info in the IST path.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

x86/spec_ctrl: Read MSR_ARCH_CAPABILITIES only once

Make it available from the beginning of init_speculation_mitigations(), and
pass it into appropriate functions. Fix an RSBA typo while moving the
affected comment.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

tools/ocaml/libs/xc fix gcc-8 format-truncation warning

CC       xenctrl_stubs.o
xenctrl_stubs.c: In function 'failwith_xc':
xenctrl_stubs.c:65:17: error: 'snprintf' output may be truncated before the last format character [-Werror=format-truncation=]
      "%d: %s: %s", error->code,
                 ^
xenctrl_stubs.c:64:4: note: 'snprintf' output 6 or more bytes (assuming 1029) into a destination of size 1028
    snprintf(error_str, sizeof(error_str),
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      "%d: %s: %s", error->code,
      ~~~~~~~~~~~~~~~~~~~~~~~~~~
      xc_error_code_to_desc(error->code),
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      error->message);
      ~~~~~~~~~~~~~~~
cc1: all warnings being treated as errors
make[8]: *** [/build/xen-git/src/xen/tools/ocaml/libs/xc/../../Makefile.rules:37: xenctrl_stubs.o] Error 1

Signed-off-by: John Thomson <git@johnthomson.fastmail.com.au>
Acked-by: Christian Lindig <christian.lindig@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

xen/kbdif: Add string constants for raw pointer

Add missing string constants for {feature|request}-raw-pointer
to align with the rest of the interface file.

Fixes 7868654ff7fe ("kbdif: Define "feature-raw-pointer" and "request-raw-pointer")

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

docs/parse-support-md: Correctly process caveats in multi-status sections

When SUPPORT.md uses the syntax
Status, <some thing>: <support status>
the caveats were lost (not footnoted) because they were attached
only to <some thing>.

Caveats occur in running text, so they are necessarily part of a real
section, not an individual status line like that. So attach them to
the RealSectNode, and look there for them.

Reported-by: Lars Kurth <lars.kurth@citrix.com>
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Acked-by: Lars Kurth <Lars.kurth@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

docs/parse-support-md: Provide $sectnode->{RealSectNode}

No functional change yet.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Acked-by: Lars Kurth <Lars.kurth@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

docs/parse-support-md: Rename RealSect to RealInSect

This makes the distinction between insections and sectnodes clearer.

No functional change.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Acked-by: Lars Kurth <Lars.kurth@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

viridian: fix cpuid leaf 0x40000003

The response to viridian leaf 3 needs to split a 64-bit mask across EAX and
EBX, with the low order 32 bits in EAX and the high order 32 bits in EBX.
To facilitate this a union of two uint32_t values and the mask (type
HV_PARTITION_PRIVILEGE_MASK) is allocated on stack as follows:

union {
    HV_PARTITION_PRIVILEGE_MASK mask;
    uint32_t lo, hi;
} u;

This, of course, is incorrect as both lo and hi will alias the low order
32 bits of the mask.

This patch wraps lo and hi in an anonymous struct to achieve the desired
effect.
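
The aliasing and its fix can be reproduced in a few lines. This is a
minimal sketch rather than the Xen code: uint64_t stands in for
HV_PARTITION_PRIVILEGE_MASK, the union and function names are made up for
illustration, and a little-endian target (as on x86) is assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Broken layout: lo and hi are both direct union members, so each one
 * starts at offset 0 and aliases the low 32 bits of mask. */
union broken_split {
    uint64_t mask;          /* stand-in for HV_PARTITION_PRIVILEGE_MASK */
    uint32_t lo, hi;
};

/* Fixed layout: the anonymous struct places hi directly after lo. */
union fixed_split {
    uint64_t mask;
    struct {
        uint32_t lo, hi;
    };
};

static_assert(sizeof(union fixed_split) == sizeof(uint64_t),
              "lo/hi must exactly overlay the 64-bit mask");

/* Low half of the mask (the value destined for EAX). */
static uint32_t fixed_lo(uint64_t mask)
{
    union fixed_split u = { .mask = mask };
    return u.lo;
}

/* High half of the mask (the value destined for EBX). */
static uint32_t fixed_hi(uint64_t mask)
{
    union fixed_split u = { .mask = mask };
    return u.hi;
}

/* What the broken layout yields for "hi": the low half again. */
static uint32_t broken_hi(uint64_t mask)
{
    union broken_split u = { .mask = mask };
    return u.hi;
}
```

With the broken layout, any privilege bit in the upper 32 bits of the mask
never reaches EBX, which is why the guest's behaviour changes once the
layout is fixed.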

NOTE: Fixing this also stops Windows making the HvGetPartitionId hypercall
      which was previously considered erroneous behaviour. Thus the
      hypercall handler is also modified to stop squashing the
      'unimplemented' warning for this hypercall.

Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

libacpi: fixes for iasl >= 20180427

New versions of iasl have introduced improved C file generation, as
reported in the changelog:

iASL: Enhanced the -tc option (which creates an AML hex file in C,
suitable for import into a firmware project):
1) Create a unique name for the table, to simplify use of multiple
SSDTs.
2) Add a protection #ifdef in the file, similar to a .h header file.

The net effect of that on generated files is:

-unsigned char AmlCode[] =
+#ifndef __SSDT_S4_HEX__
+#define __SSDT_S4_HEX__
+
+unsigned char ssdt_s4_aml_code[] =

The above example is from ssdt_s4.asl.

Fix the build with newer versions of iasl by stripping the '_aml_code'
suffix from the variable name on generated files.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

x86/HVM: guard against emulator driving ioreq state in weird ways

In the case where hvm_wait_for_io() calls wait_on_xen_event_channel(),
p->state ends up being read twice in succession: once to determine that
state != p->state, and then again at the top of the loop. This gives a
compromised emulator a chance to change the state back between the two
reads, potentially keeping Xen in a loop indefinitely.

Instead:
* Read p->state once in each of the wait_on_xen_event_channel() tests,
* re-use that value the next time around,
* and insist that the states continue to transition "forward" (with the
exception of the transition to STATE_IOREQ_NONE).
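
The three rules above can be modelled in isolation. The following is a
simplified sketch, not the actual hvm_wait_for_io() code: the state values
are those of the public ioreq interface, while ioreq_state_ok(),
wait_for_io_model() and the array of samples (standing in for successive
reads of p->state) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* ioreq states, numbered as in Xen's public ioreq interface. */
enum {
    STATE_IOREQ_NONE      = 0,
    STATE_IOREQ_READY     = 1,
    STATE_IOREQ_INPROCESS = 2,
    STATE_IORESP_READY    = 3,
};

/* Hypothetical helper: a newly sampled state is acceptable if it is
 * STATE_IOREQ_NONE or if it did not move backwards. */
static bool ioreq_state_ok(uint8_t prev, uint8_t cur)
{
    return cur == STATE_IOREQ_NONE || cur >= prev;
}

/* Model of the wait loop.  samples[] plays the role of the shared page
 * that a possibly compromised emulator can rewrite at will.
 * Returns 0 once the request completes, -1 on a backward transition,
 * and 1 if the samples run out while still waiting. */
static int wait_for_io_model(const volatile uint8_t *samples, unsigned int n)
{
    uint8_t prev = STATE_IOREQ_NONE;

    assert(samples);

    for (unsigned int i = 0; i < n; i++) {
        uint8_t cur = samples[i];   /* the only read in this iteration */

        if (!ioreq_state_ok(prev, cur))
            return -1;              /* emulator misbehaved: bail out */
        if (cur == STATE_IOREQ_NONE || cur == STATE_IORESP_READY)
            return 0;               /* request finished */
        prev = cur;                 /* reuse the sampled value next time */
    }

    return 1;
}
```

Because every decision in an iteration is taken on the single sampled
value, flipping the shared state between two checks no longer has any
effect, and a backward step ends the loop instead of prolonging it.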

This is XSA-262.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>

x86/vpt: add support for IO-APIC routed interrupts

And modify the HPET code to make use of it. Currently HPET interrupts
are always treated as ISA and thus injected through the vPIC. This is
wrong because HPET interrupts when not in legacy mode should be
injected from the IO-APIC.

To make things worse, the supported interrupt routing values are set
to [20..23], which clearly falls outside of the ISA range, thus
leading to an ASSERT in debug builds or memory corruption in non-debug
builds because the interrupt injection code will write out of the
bounds of the arch.hvm_domain.vpic array.

Since the HPET interrupt source can change between ISA and IO-APIC
always destroy the timer before changing the mode, or else Xen risks
changing it while the timer is active.

Note that vpt interrupt injection is racy in the sense that the
vIO-APIC RTE entry can be written by the guest in between the call to
pt_irq_masked and hvm_ioapic_assert, or the call to pt_update_irq and
pt_intr_post. Those are not deemed to be security issues, but rather
quirks of the current implementation. In the worst case the guest
might lose interrupts or get multiple interrupt vectors injected for
the same timer source.

This is part of XSA-261.

Address actual and potential compiler warnings. Fix formatting.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

x86/traps: Fix handling of #DB exceptions in hypervisor context

The WARN_ON() can be triggered by guest activities, and emits a full stack
trace without rate limiting. Swap it out for a ratelimited printk with just
enough information to work out what is going on.

Not all #DB exceptions are traps, so blindly continuing is not a safe action
to take. We don't let PV guests select these settings in the real %dr7 to
begin with, but for added safety against unexpected situations, detect the
fault cases and crash in an obvious manner.

This is part of XSA-260 / CVE-2018-8897

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/traps: Use an Interrupt Stack Table for #DB

PV guests can use architectural corner cases to cause #DB to be raised after
transitioning into supervisor mode.

Use an interrupt stack table for #DB to prevent the exception being taken with
a guest controlled stack pointer.

This is part of XSA-260 / CVE-2018-8897

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/pv: Move exception injection into {,compat_}test_all_events()

This allows paths to jump straight to {,compat_}test_all_events() and have
injection of pending exceptions happen automatically, rather than requiring
all calling paths to handle exceptions themselves.

The normal exception path is simplified as a result, and
compat_post_handle_exception() is removed entirely.

This is part of XSA-260 / CVE-2018-8897

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/traps: Fix %dr6 handling in #DB handler

Most bits in %dr6 accumulate, rather than being set directly based on the
current source of #DB. Have the handler follow the manual's guidance, which
avoids leaking hypervisor debugging activities into guest context.

This is part of XSA-260 / CVE-2018-8897

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/domain: Drop the only-written smap_check_policy infrastructure

c/s 4c5d78a10d "x86/pagewalk: Re-implement the pagetable walker" dropped the
consumer of smap_policy. Looking at c/s 31ae587e6f which introduced the
smap_check logic, it exists only to work around a bug in guest_walk_tables()
which was resolved by the aforementioned commit.

Remove the unused variables and associated infrastructure.

Reported-by: Jason Andryuk <jandryuk@gmail.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

doc: add credit2_cap_period_ms boot parameter description

credit2_cap_period_ms isn't mentioned in xen-command-line.markdown.
Add a description.

Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

doc: add architecture qualifier to boot parameter entries

Many of the architecture specific boot parameters are not qualified
as such. Correct that. Reorder PKU to be alphabetical.

Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

x86/pv: Hide more EFER bits from PV guests

We don't advertise SVM in CPUID so a PV guest shouldn't be under the
impression that it can use SVM functionality, but despite this, it really
shouldn't see SVME set when reading EFER.

On Intel processors, 32bit PV guests don't see, and can't use SYSCALL.

Introduce EFER_KNOWN_MASK to whitelist the features Xen knows about, and use
this to clamp the guest's view.
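
The clamping can be sketched as a plain mask operation. The bit positions
below are the architectural EFER bits; the contents of EFER_KNOWN_MASK and
the helper pv_guest_efer_view() are assumptions made for illustration and
need not match the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Architectural EFER bit positions. */
#define EFER_SCE   (UINT64_C(1) << 0)    /* SYSCALL enable */
#define EFER_LME   (UINT64_C(1) << 8)    /* long mode enable */
#define EFER_LMA   (UINT64_C(1) << 10)   /* long mode active */
#define EFER_NX    (UINT64_C(1) << 11)   /* no-execute enable */
#define EFER_SVME  (UINT64_C(1) << 12)   /* SVM enable */
#define EFER_TCE   (UINT64_C(1) << 15)   /* translation cache extension */

/* Illustrative whitelist; the real mask may name further bits. */
#define EFER_KNOWN_MASK \
    (EFER_SCE | EFER_LME | EFER_LMA | EFER_NX | EFER_SVME)

static_assert(!(EFER_KNOWN_MASK & EFER_TCE),
              "TCE is defined by hardware but not known to this sketch");

/* Hypothetical helper: the EFER value a PV guest gets to read. */
static uint64_t pv_guest_efer_view(uint64_t hw_efer, bool guest_is_32bit,
                                   bool cpu_is_intel)
{
    /* Drop unknown bits, and never show SVME to a PV guest. */
    uint64_t val = hw_efer & EFER_KNOWN_MASK & ~EFER_SVME;

    /* 32bit PV guests on Intel can't use SYSCALL, so hide SCE too. */
    if (guest_is_32bit && cpu_is_intel)
        val &= ~EFER_SCE;

    return val;
}
```

A whitelist has the advantage that bits defined by future hardware stay
hidden until someone adds them to the mask deliberately.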

Take the opportunity to reuse the mask to simplify svm_vmcb_isvalid(), and
change "undefined" to "unknown" in the print message, as there is at least
EFER.TCE (Translation Cache Extension) defined but unknown to Xen.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

SVM: introduce a VM entry helper

Neither the register values copying nor the trace entry generation need
doing in assembly. The VMLOAD invocation can also be further deferred
(and centralized). Therefore replace the svm_asid_handle_vmrun()
invocation with one of the new helper.

Similarly move the VM exit side register value copying into
svm_vmexit_handler().

Now that we always make it out to guest context after VMLOAD,
svm_sync_vmcb() no longer overrides vmcb_needs_vmsave, making
svm_vmexit_handler() setting the field early unnecessary.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

SVM: re-work VMCB sync-ing

While the main problem to be addressed here is the issue of what so far
was named "vmcb_in_sync" starting out with the wrong value (should have
been true instead of false, to prevent performing a VMSAVE without ever
having VMLOADed the vCPU's state), go a step further and make the
sync-ed state a tristate: CPU and memory may be in sync or an update
may be required in either direction. Rename the field and introduce an
enum. Callers of svm_sync_vmcb() now indicate the intended new state
(with a slight "anomaly" when requesting VMLOAD: we could store
vmcb_needs_vmsave in those cases as the callers request, but the VMCB
really is in sync at that point, and hence there's no need to VMSAVE in
case we don't make it out to guest context), and all syncing goes
through that function.

With that, there's no need to VMLOAD the state perhaps multiple times;
all that's needed is loading it once before VM entry.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

docs: fix xpti command line option doc

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>

xen/x86: use PCID feature

Avoid flushing the complete TLB when switching %cr3 for mitigation of
Meltdown by using the PCID feature if available.

We are using 4 PCID values for a 64 bit pv domain subject to XPTI and
2 values for the non-XPTI case:

- guest active and in kernel mode
- guest active and in user mode
- hypervisor active and guest in user mode (XPTI only)
- hypervisor active and guest in kernel mode (XPTI only)
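
One way to picture the four cases is as two independent bits. The encoding
below is a hypothetical sketch (the constant names, their values and
pick_pcid() are not taken from the patch); it only shows how four values
arise with XPTI and two without.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical PCID encoding for a 64 bit pv domain. */
#define PCID_SKETCH_GUEST_KERNEL 0x0u
#define PCID_SKETCH_GUEST_USER   0x1u
#define PCID_SKETCH_XEN_ACTIVE   0x2u   /* ORed in, XPTI only */

static_assert((PCID_SKETCH_GUEST_USER | PCID_SKETCH_XEN_ACTIVE) < 0x1000u,
              "a PCID is a 12-bit value");

/* Select the PCID for the current combination of modes. */
static unsigned int pick_pcid(bool xpti, bool xen_active,
                              bool guest_user_mode)
{
    unsigned int pcid = guest_user_mode ? PCID_SKETCH_GUEST_USER
                                        : PCID_SKETCH_GUEST_KERNEL;

    /* Only XPTI domains get separate hypervisor address spaces. */
    if (xpti && xen_active)
        pcid |= PCID_SKETCH_XEN_ACTIVE;

    return pcid;
}
```

Without XPTI the hypervisor keeps running on the guest's page tables, so
the "hypervisor active" bit is never set and only two values are used.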

We use PCID only if PCID _and_ INVPCID are supported. With PCID in use
we disable global pages in cr4. A command line parameter controls in
which cases PCID is being used.

As the non-XPTI case has been shown not to perform better with PCID at least
on some machines the default is to use PCID only for domains subject to
XPTI.

With PCID enabled we always disable global pages. This avoids having to
either flush the complete TLB or do a cycle through all PCID values
when invalidating a single global page.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

xen/x86: add some cr3 helpers

Add some helper macros to access the address and pcid parts of cr3.

Use those helpers where appropriate.
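
As a rough illustration of what such helpers look like: with PCID enabled,
%cr3 carries the PCID in bits 0-11, the page table address in bits 12-51,
and bit 63 of a value written to %cr3 requests that TLB entries be kept.
The names below are hypothetical, and inline functions are used instead of
macros for type safety.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CR3_SKETCH_PCID_MASK UINT64_C(0x0000000000000fff)
#define CR3_SKETCH_ADDR_MASK UINT64_C(0x000ffffffffff000)
#define CR3_SKETCH_NOFLUSH   (UINT64_C(1) << 63)

static_assert((CR3_SKETCH_PCID_MASK & CR3_SKETCH_ADDR_MASK) == 0,
              "address and PCID fields must not overlap");

/* Physical address of the top-level page table. */
static inline uint64_t cr3_sketch_addr(uint64_t cr3)
{
    return cr3 & CR3_SKETCH_ADDR_MASK;
}

/* PCID part of a %cr3 value. */
static inline unsigned int cr3_sketch_pcid(uint64_t cr3)
{
    return (unsigned int)(cr3 & CR3_SKETCH_PCID_MASK);
}

/* Compose a %cr3 value from its parts. */
static inline uint64_t cr3_sketch_make(uint64_t addr, unsigned int pcid,
                                       bool noflush)
{
    return (addr & CR3_SKETCH_ADDR_MASK) |
           (pcid & CR3_SKETCH_PCID_MASK) |
           (noflush ? CR3_SKETCH_NOFLUSH : 0);
}
```

Keeping the field extraction in one place avoids open-coded masks that
silently break once the upper bit of %cr3 starts carrying meaning.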

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

xen/x86: convert pv_guest_cr4_to_real_cr4() to a function

pv_guest_cr4_to_real_cr4() is becoming more and more complex. Convert
it from a macro to an ordinary function.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

xen/x86: use flag byte for decision whether xen_cr3 is valid

Today cpu_info->xen_cr3 is either 0 to indicate %cr3 doesn't need to
be switched on entry to Xen, or negative for keeping the value while
indicating not to restore %cr3, or positive in case %cr3 is to be
restored.

Switch to use a flag byte instead of a negative xen_cr3 value in order
to allow %cr3 values with the high bit set in case we want to keep TLB
entries when using the PCID feature.

This reduces the number of branches in interrupt handling and results
in better performance (e.g. parallel make of the Xen hypervisor on my
system was using about 3% less system time).

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

xen/x86: disable global pages for domains with XPTI active

Instead of flushing the TLB from global pages when switching address
spaces with XPTI being active just disable global pages via %cr4
completely when a domain subject to XPTI is active. This avoids the
need for extra TLB flushes as loading %cr3 will remove all TLB
entries.

In order to avoid states with cr3/cr4 having inconsistent values
(e.g. global pages being activated while cr3 already specifies a XPTI
address space) move loading of the new cr4 value to write_ptbase()
(actually to switch_cr3_cr4() called by write_ptbase()).

This requires using switch_cr3_cr4() instead of write_ptbase() when
building dom0 in order to avoid setting cr4 with cr4.smap set.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

xen/x86: use invpcid for flushing the TLB

If possible use the INVPCID instruction for flushing the TLB instead of
toggling cr4.pge for that purpose.

While at it remove the dependency on cr4.pge being required for mtrr
loading, as this will be required later anyway.

Add a command line option "invpcid" for controlling the use of
INVPCID (default to true).

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

xen/x86: support per-domain flag for xpti

Instead of switching XPTI globally on or off add a per-domain flag for
that purpose. This allows modifying the xpti boot parameter to support
running dom0 without Meltdown mitigations. Using "xpti=no-dom0" as boot
parameter will achieve that.

Move the xpti boot parameter handling to xen/arch/x86/pv/domain.c as
it is pv-domain specific.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

xen/x86: add a function for modifying cr3

Instead of having multiple places with more or less identical asm
statements just have one function doing a write to cr3.

As this function should be named write_cr3() rename the current
write_cr3() function to switch_cr3().

Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/xpti: avoid copying L4 page table contents when possible

For mitigation of Meltdown the current L4 page table is copied to the
cpu local root page table each time a 64 bit pv guest is entered.

Copying can be avoided in cases where the guest L4 page table hasn't
been modified while running the hypervisor, e.g. when handling
interrupts or any hypercall not modifying the L4 page table or %cr3.

So add a per-cpu flag indicating whether the copying should be
performed and set that flag only when loading a new %cr3 or modifying
the L4 page table. This includes synchronization of the cpu local
root page table with other cpus, so add a special synchronization flag
for that case.

A simple performance check (compiling the hypervisor via "make -j 4")
in dom0 with 4 vcpus shows a significant improvement:

- real time drops from 112 seconds to 103 seconds
- system time drops from 142 seconds to 131 seconds

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86: fix return value checks of set_guest_{machinecheck,nmi}_trapbounce

Commit 0142064421 ("x86/traps: move set_guest_{machine,nmi}_trapbounce")
converted the functions' return types from int to bool without also
correcting the checks in assembly code: The ABI does not guarantee sub-
32-bit return values to be promoted to 32 bits.
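
The effect can be modelled in C by treating the return register as a
32-bit value whose upper 24 bits are unspecified. This is only an
illustration: model_eax() and the two check functions are hypothetical,
and the assembly mnemonics in the comments are analogies rather than
quotes from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static_assert(sizeof(bool) == 1, "sketch assumes a one-byte bool");

/* Model of %eax after calling a function returning bool: only the low
 * byte is defined, the upper 24 bits may hold stale data. */
static uint32_t model_eax(bool ret, uint32_t stale_bits)
{
    return (stale_bits & 0xffffff00u) | (ret ? 1u : 0u);
}

/* Wrong check: looks at all 32 bits (like "test %eax,%eax"). */
static bool check_full_register(uint32_t eax)
{
    return eax != 0;
}

/* Correct check: looks at the low byte only (like "test %al,%al"). */
static bool check_low_byte(uint32_t eax)
{
    return (uint8_t)eax != 0;
}
```

With stale upper bits, a false return value reads as true under the
full-register check, which is the class of bug being fixed here.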

Take the liberty and also adjust the number of spaces used in the compat
code, such that both code sequences end up similar (they already are in
the non-compat case).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

xen/schedule: Fix races in vcpu migration

The current sequence to initiate vcpu migration is inefficient and error-prone:

- The initiator sets VPF_migrating with the lock held, then drops the
  lock and calls vcpu_sleep_nosync(), which immediately grabs the lock
  again

- A number of places unnecessarily check for v->pause_flags in between
  those two

- Every call to vcpu_migrate() must be prefaced with
  vcpu_sleep_nosync() or introduce a race condition; this code
  duplication is error-prone

- In the event that v->is_running is true at the beginning of
  vcpu_migrate(), it's almost certain that vcpu_migrate() will end up
  being called in context_switch() as well; we might as well simply
  let it run there and save the duplicated effort (which will be
  non-negligible).

The result is that Credit1 has several races which result in runqueue
<-> v->processor invariants being violated (triggering ASSERTs in
debug builds and strange bugs in production builds).

Instead, introduce vcpu_migrate_start() to initiate the process.
vcpu_migrate_start() is called with the scheduling lock held.  It not
only sets VPF_migrating, but also calls vcpu_sleep_nosync_locked()
(which will automatically do nothing if there's nothing to do).

Rename vcpu_migrate() to vcpu_migrate_finish().  Check for v->is_running and
pause_flags & VPF_migrating at the top and return if appropriate.

Then the way to initiate migration is consistently:

* Grab lock
* vcpu_migrate_start()
* Release lock
* vcpu_migrate_finish()
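
The four steps can be sketched with a toy model. Everything here is a
simplification made up for illustration: struct vcpu_model, the boolean
"lock", the flag value and the new_cpu parameter are assumptions, and the
real vcpu_migrate_start() additionally calls vcpu_sleep_nosync_locked().

```c
#include <assert.h>
#include <stdbool.h>

#define VPF_MIGRATING_SKETCH (1u << 0)   /* hypothetical flag value */

struct vcpu_model {
    bool is_running;
    unsigned int pause_flags;
    unsigned int processor;
    bool lock_held;              /* stands in for the scheduling lock */
};

static void sched_lock(struct vcpu_model *v)
{
    assert(!v->lock_held);
    v->lock_held = true;
}

static void sched_unlock(struct vcpu_model *v)
{
    assert(v->lock_held);
    v->lock_held = false;
}

/* Must be called with the scheduling lock held. */
static void vcpu_migrate_start(struct vcpu_model *v)
{
    assert(v->lock_held);
    v->pause_flags |= VPF_MIGRATING_SKETCH;
}

/* Called after the lock is dropped; returns true if the vcpu moved. */
static bool vcpu_migrate_finish(struct vcpu_model *v, unsigned int new_cpu)
{
    assert(!v->lock_held);

    /* Still running, or nothing pending: leave it for a later call. */
    if (v->is_running || !(v->pause_flags & VPF_MIGRATING_SKETCH))
        return false;

    v->processor = new_cpu;
    v->pause_flags &= ~VPF_MIGRATING_SKETCH;
    return true;
}

/* The consistent sequence; returns the processor the vcpu ends up on. */
static unsigned int migrate_demo(bool is_running)
{
    struct vcpu_model v = { .is_running = is_running, .processor = 0 };

    sched_lock(&v);
    vcpu_migrate_start(&v);
    sched_unlock(&v);
    vcpu_migrate_finish(&v, 1);

    return v.processor;
}
```

In the model, a vcpu that is still running stays where it is with the
flag set, mirroring the idea that the migration is completed later from
context_switch().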

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Dario Faggioli <dfaggioli@suse.com>
Tested-by: Olaf Hering <olaf@aepfle.de>
Release-acked-by: Juergen Gross <jgross@suse.com>

xen: Introduce vcpu_sleep_nosync_locked()

There are a lot of places which release a lock before calling
vcpu_sleep_nosync(), which then just grabs the lock again. This is
not only a waste of time, but leads to more code duplication (since
you have to copy-and-paste recipes rather than calling a unified
function), which in turn leads to an increased chance of bugs.

Introduce vcpu_sleep_nosync_locked(), which can be called if you
already hold the schedule lock.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Dario Faggioli <dfaggioli@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

xen/schedule.c: Fix up whitespace

Delete tabs and trailing whitespace.

No functional change.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

MAINTAINERS: Add Brian Woods as Designated reviewer to AMD IOMMU and AMD SVM

This was discussed in an IRC discussion post the April x86 meeting.
On 27/4/18 Juergen gave a RAB via IRC

Cc: Lars Kurth <lars.kurth@citrix.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: George Dunlap <George.Dunlap@eu.citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Julien Grall <julien.grall@arm.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Tim Deegan <tim@xen.org>
Cc: Wei Liu <wei.liu2@citrix.com>
Cc: Brian Woods <brian.woods@amd.com>
Cc: Juergen Gross <jgross@suse.com>
Signed-off-by: Lars Kurth <lars.kurth@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
Acked-by: Brian Woods <brian.woods@amd.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>

MAINTAINERS, get_maintainer.pl: Add Designated Reviewer (R:) role

The syntax has been copied from the Linux Maintainers file. I moved the following Linux
get_maintainer.pl patches to Xen, fixing up some merge issues (and a bug).

The get_maintainer.pl changes were based on the following git commits
* https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/scripts/get_maintainer.pl?id=
* c1c3f2c906e35bcb6e4cdf5b8e077660fead14fe
* 4f07510df2e8c47fd65b8ffaaf6c5d334d59d598

I also removed code related to
P: Person (obsolete)
which is in the Linux MAINTAINER's file, but not ours. I may not have
caught all instances though.

I have tested on a number of files using mock entries in MAINTAINERS
using ./scripts/get_maintainer.pl -f ...

I also tested --nor to disable the support and it worked as expected.

Cc: Lars Kurth <lars.kurth@citrix.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: George Dunlap <George.Dunlap@eu.citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Julien Grall <julien.grall@arm.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Tim Deegan <tim@xen.org>
Cc: Wei Liu <wei.liu2@citrix.com>
Cc: Juergen Gross <jgross@suse.com>
Signed-off-by: Lars Kurth <lars.kurth@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>

x86emul: VMOVNTDQA should raise #GP(0) on mis-alignment

Commit 50b73118d5 introduced emulation of the insn without extending the
set of opcodes requiring special alignment related #GP behavior.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

tools: prepend to PKG_CONFIG_PATH when configuring qemu

A user may choose to set his/her own PKG_CONFIG_PATH, which is useful in the
case of cross-compiling. We don't want to completely override the
PKG_CONFIG_PATH, just add to it.

Signed-off-by: Stewart Hildebrand <stewart.hildebrand@dornerworks.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

docs/process/release-checklist.txt: Say to push staging branch

Preparing a real release, not just an RC, involves making commits.
Typically, those will be on staging-$x. The tag will refer to them,
and the checklist already says to push them to xenbits.

But if the *branch* is not pushed, then people who just "git fetch"
won't get the tag because it refers to commits they don't have.
(Because of the strange rules git has about tag fetching.)
Worse, the same may be true of people who "git clone".

And anyway, those commits *should* be fed to staging-$x.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>

docs/process/release-checklist.txt: New instructions for disabling debug

The old instructions were obsolete. Here are the details I used when
branching for 4.10.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>

x86/traps: Improve code generation for set_ist()

The IST field in an IDT entry is a 3 bit field, with 5 adjacent reserved bits
which we always write as zero.  By expressing this as a byte field in a union,
we turn an invocation of enable_each_ist() from

  4b 8b 14 d3                     mov    (%r11,%r10,8),%rdx
  48 b8 ff ff ff ff f8 ff ff ff   movabs $0xfffffff8ffffffff,%rax
  48 be 00 00 00 00 01 00 00 00   movabs $0x100000000,%rsi
  48 8b 8a 80 00 00 00            mov    0x80(%rdx),%rcx
  48 21 c1                        and    %rax,%rcx
  48 09 f1                        or     %rsi,%rcx
  48 be 00 00 00 00 02 00 00 00   movabs $0x200000000,%rsi
  48 89 8a 80 00 00 00            mov    %rcx,0x80(%rdx)
  48 8b 4a 20                     mov    0x20(%rdx),%rcx
  48 21 c1                        and    %rax,%rcx
  48 23 82 20 01 00 00            and    0x120(%rdx),%rax
  48 09 f1                        or     %rsi,%rcx
  48 89 4a 20                     mov    %rcx,0x20(%rdx)
  48 b9 00 00 00 00 03 00 00 00   movabs $0x300000000,%rcx
  48 09 c8                        or     %rcx,%rax
  48 89 82 20 01 00 00            mov    %rax,0x120(%rdx)

into

  4b 8b 04 d3                     mov    (%r11,%r10,8),%rax
  c6 80 84 00 00 00 01            movb   $0x1,0x84(%rax)
  c6 40 24 02                     movb   $0x2,0x24(%rax)
  c6 80 24 01 00 00 03            movb   $0x3,0x124(%rax)

which is far simpler.  As the IDT is typically live, this is more
obviously safe.
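
The difference between the two sequences can be sketched in C. This is a
minimal illustration only, assuming a little-endian host; gate_t,
set_ist_masked() and set_ist_byte() are hypothetical names and are not Xen's
definitions. The first helper is the read-modify-write form, the second is the
byte store that the union makes possible:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout, not Xen's definition: a 16-byte x86-64 IDT gate. */
typedef union {
    struct {
        uint64_t a, b;          /* raw view: two 64-bit words */
    } raw;
    struct {
        uint16_t addr0;         /* handler address bits 0..15 */
        uint16_t cs;            /* code segment selector */
        uint8_t  ist;           /* 3-bit IST index plus 5 reserved zero bits */
        uint8_t  type_attr;     /* type, DPL and present bits */
        uint16_t addr1;         /* handler address bits 16..31 */
        uint32_t addr2;         /* handler address bits 32..63 */
        uint32_t rsvd;          /* reserved */
    } f;
} gate_t;

static_assert(sizeof(gate_t) == 16, "an IDT gate is 16 bytes");

/* Read-modify-write on the 64-bit word: load, mask, or, store. */
static void set_ist_masked(gate_t *g, unsigned int ist)
{
    g->raw.a = (g->raw.a & ~(UINT64_C(7) << 32)) | ((uint64_t)ist << 32);
}

/* Single byte store; equivalent only because the reserved bits stay zero. */
static void set_ist_byte(gate_t *g, uint8_t ist)
{
    g->f.ist = ist;
}
```

Both helpers leave the gate in the same state as long as the five reserved
bits were zero beforehand, which is the invariant the message relies on.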

The net delta for this change is:

  add/remove: 0/0 grow/shrink: 0/7 up/down: 0/-334 (-334)

While making changes here, tidy up the set_ist() declaration.  Drop the
always_inline (I don't recall why I wrote it like that originally).  Also, the
ist parameter need not be unsigned long (although it will be const-propagated
in practice).

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

x86/cpuidle: don't init stats lock more than once

Osstest flight 122363, having hit an NMI watchdog timeout, shows CPU1 at

Xen call trace:
   [<ffff82d08023d3f4>] _spin_lock+0x30/0x57
   [<ffff82d0802d9346>] update_last_cx_stat+0x29/0x42
   [<ffff82d0802d96f3>] cpu_idle.c#acpi_processor_idle+0x2ff/0x596
   [<ffff82d080276713>] domain.c#idle_loop+0xa8/0xc3

and CPU0 at

Xen call trace:
   [<ffff82d08023d173>] on_selected_cpus+0xb7/0xde
   [<ffff82d0802dbe22>] powernow.c#powernow_cpufreq_target+0x110/0x1cb
   [<ffff82d080257973>] __cpufreq_driver_target+0x43/0xa6
   [<ffff82d080256b0d>] cpufreq_governor_dbs+0x324/0x37a
   [<ffff82d080257bf2>] __cpufreq_set_policy+0xfa/0x19d
   [<ffff82d080256044>] cpufreq_add_cpu+0x3a1/0x5df
   [<ffff82d0802dbab4>] cpufreq_cpu_init+0x17/0x1a
   [<ffff82d0802567a8>] set_px_pminfo+0x2b6/0x2f7
   [<ffff82d08029f1bf>] do_platform_op+0xe75/0x1977
   [<ffff82d0803712c5>] pv_hypercall+0x1f4/0x440
   [<ffff82d0803784a5>] lstar_enter+0x115/0x120

That is, Dom0's ACPI processor driver is in the process of uploading Px
and Cx data. Looking at the ticket lock state in CPU1's registers, it is
waiting for ticket 0x0000 to have its turn, while the supposed current
owner's ticket is 0x0001, which is an invalid state (and neither of the
other two CPUs holds the lock anyway). Hence I can only conclude that
cpuidle_init_cpu(1) ran on CPU 0 while some other CPU held the lock (the
unlock then put the lock in the state that CPU1 is observing).
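
The suspected interleaving can be replayed with a toy ticket lock. This is a
single-threaded sketch under stated assumptions, not Xen's spinlock code: the
type and helper names are hypothetical, and the fields are plain integers
rather than atomics because nothing runs concurrently here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical ticket lock, not Xen's spinlock_t. */
typedef struct {
    uint16_t head;   /* ticket currently being served */
    uint16_t tail;   /* next ticket to hand out */
} ticket_lock_t;

static void ticket_init(ticket_lock_t *l)
{
    l->head = 0;
    l->tail = 0;
}

/* Take a ticket; the caller owns the lock once head equals it. */
static uint16_t ticket_take(ticket_lock_t *l)
{
    return l->tail++;
}

static bool ticket_is_my_turn(const ticket_lock_t *l, uint16_t ticket)
{
    return l->head == ticket;
}

static void ticket_unlock(ticket_lock_t *l)
{
    l->head++;
}

/* Replay: init, lock, second init while held, unlock, then a new waiter.
 * Returns true if the new waiter can never get its turn. */
static bool replay_double_init(void)
{
    ticket_lock_t l;
    uint16_t waiter;

    ticket_init(&l);
    (void)ticket_take(&l);      /* owner holds ticket 0; tail is now 1 */
    ticket_init(&l);            /* second init while the lock is held */
    ticket_unlock(&l);          /* head becomes 1 while tail is 0 */
    waiter = ticket_take(&l);   /* waiter is handed ticket 0 */

    return !ticket_is_my_turn(&l, waiter);
}
```

In this model the waiter holds ticket 0 while the lock is serving ticket 1,
the same invalid state as described for CPU1 above.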

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

doc: escape underscores in xen-command-line.markdown

Some underscores are not escaped in xen-command-line.markdown.
Correct that.

Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>