dgit.raspbian.org Git

xen/build: silence make warnings about missing auto.conf*

In a clean tree, both files include/config/auto.conf{,.cmd} are
missing and older version of GNU Make complain about it:
Makefile:103: include/config/auto.conf: No such file or directory
Makefile:106: include/config/auto.conf.cmd: No such file or directory

Those warnings are harmless, make will create the files and start over. But
to avoid confusion, we'll use "-include" to silence the warning.

Those warning started to appear with commit 6c122d3984a5 ("xen/build:
include include/config/auto.conf in main Makefile").

Reported-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>

guestcopy: evaluate {,__}copy{,_field}_to_guest*() ptr argument just once

There's nothing wrong with having e.g.

copy_to_guest(uarg, ptr++, 1);

yet until now this would increment "ptr" twice.

Also drop a pair of unneeded parentheses from every instance at this
occasion.

Fixes: b7954cc59831 ("Enhance guest memory accessor macros so that source operands can be")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

guest_access: harden *copy_to_guest_offset() to prevent const dest operand

At the moment, *copy_to_guest_offset() will allow the hypervisor to copy
data to guest handle marked const.

Thankfully, no users of the helper will do that. Rather than hoping this
can be caught during review, harden copy_to_guest_offset() so the build
will fail if such users are introduced.

There is no easy way to check whether a const is NULL in C99. The
approach used is to introduce an unused variable that is non-const and
assign the handle. If the handle were const, this would fail at build
because without an explicit cast, it is not possible to assign a const
variable to a non-const variable.

Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>

Introduce a description of the Backport and Fixes tags

Create a new document under docs/process to describe our special tags.
Add a description of the Fixes tag and the new Backport tag. Also
clarify that lines with tags should not be split.

Signed-off-by: Stefano Stabellini <stefano.stabellini@xilinx.com>
Acked-by: Wei Liu <wl@xen.org>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
CC: jbeulich@suse.com
CC: george.dunlap@citrix.com
CC: julien@xen.org
CC: lars.kurth@citrix.com
CC: andrew.cooper3@citrix.com
CC: konrad.wilk@oracle.com

Update QEMU_TRADITIONAL_REVISION

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>

x86: drop cpu_has_ffxsr

It's definition is bogus when it comes to Hygon CPUs, but since we don't
use it anywhere drop it rather than correcting it.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

mem_sharing: fix sharability check during fork reset

When resetting a VM fork we ought to only remove pages that were allocated for
the fork during it's execution and the contents copied over from the parent.
This can be determined if the page is sharable as special pages used by the
fork for other purposes will not pass this test. Unfortunately during the fork
reset loop we only partially check whether that's the case. A page's type may
indicate it is sharable (pass p2m_is_sharable) but that's not a sufficient
check by itself. All checks that are normally performed before a page is
converted to the sharable type need to be performed to avoid removing pages
from the p2m that may be used for other purposes. For example, currently the
reset loop also removes the vcpu info pages from the p2m, potentially putting
the guest into infinite page-fault loops.

Signed-off-by: Tamas K Lengyel <tamas.lengyel@intel.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

xen/build: start using if_changed

This patch start to use if_changed introduced in a previous commit.

Whenever if_changed is called, the target must have FORCE as
dependency so that if_changed can check if the command line to be
run has changed, so the macro $(real-prereqs) must be used to
discover the dependencies without "FORCE".

Whenever a target isn't in obj-y, it should be added to extra-y so the
.*.cmd dependency file associated with the target can be loaded. This
is done for xsm/flask/ and both common/lib{elf,fdt}/ and
arch/x86/Makefile.

For the targets that generate .*.d dependency files, there's going to
be two dependency files (.*.d and .*.cmd) until we can merge them
together in a later patch via fixdep from Linux.

One cleanup, libelf-relocate.o doesn't exist anymore.

We import cmd_ld and cmd_objcopy from Linux v5.4.

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Julien Grall <jgrall@amazon.com>

xen/build: introduce if_changed and if_changed_rule

The if_changed macro from Linux, in addition to check if any files
needs an update, check if the command line has changed since the last
invocation. The latter will force a rebuild if any options to the
executable have changed.

if_changed_rule checks dependencies like if_changed, but execute
rule_$(1) instead of cmd_$(1) when a target needs to be rebuilt. A rule_
macro can call more than one cmd_ macro. One of the cmd_ macro in a
rule need to be call using a macro that record the command line, so
cmd_and_record is introduced. It is similar to cmd_and_fixup from
Linux but without a call to fixdep which we don't have yet. (We will
later replace cmd_and_record by cmd_and_fixup.)

Example of a rule_ macro:
define rule_cc_o_c
$(call cmd_and_record,cc_o_o)
$(call cmd,objcopy)
endef

This needs one of the call to use cmd_and_record, otherwise no .*.cmd
file will be created, and the target will keep been rebuilt.

In order for if_changed to works correctly, we need to load the .%.cmd
files that the macro generates, this is done by adding targets in to
the $(targets) variable. We use intermediate_targets to add %.init.o
dependency %.o to target since there aren't in obj-y.

We also add $(MAKECMDGOALS) to targets so that when running for
example `make common/memory.i`, make will load the associated .%.cmd
dependency file.

Beside the if_changed*, we import the machinery used for a "beautify
output". The important one is when running make with V=2 which help to
debug the makefiles by printing why a target is been rebuilt, via the
$(echo-why) macro.

if_changed and if_changed_rule aren't used yet.

Most of this code is copied from Linux v5.4, including the
documentation.

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>

build: introduce documentation for xen Makefiles

This start explainning the variables that can be used in the many
Makefiles in xen/.

Most of the document copies and modifies text from Linux v5.4 document
linux.git/Documentation/kbuild/makefiles.rst. Modification are mostly
to avoid mentioning kbuild. Thus I've added the SPDX tag which was
only in index.rst in linux.git.

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>

xen/build: have the root Makefile generates the CFLAGS

Instead of generating the CFLAGS in Rules.mk everytime we enter a new
subdirectory, we are going to generate most of them a single time, and
export the result in the environment so that Rules.mk can use it.  The
only flags left to be generated are the ones that depend on the
targets, but the variable $(c_flags) takes care of that.

Arch specific CFLAGS are generated by a new file "arch/*/arch.mk"
which is included by the root Makefile.

We export the *FLAGS via the environment variables XEN_*FLAGS because
Rules.mk still includes Config.mk and would add duplicated flags to
CFLAGS.

When running Rules.mk in the root directory (xen/), the variable
`root-make-done' is set, so `need-config' will remain undef and so the
root Makefile will not generate the cflags again.

We can't use CFLAGS in subdirectories to add flags to particular
targets, instead start to use CFLAGS-y. Idem for AFLAGS.
So there are two different CFLAGS-y, the one in xen/Makefile (and
arch.mk), and the one in subdirs that Rules.mk is going to use.
We can't add to XEN_CFLAGS because it is exported, so making change to
it might be propagated to subdirectory which isn't intended.

Some style change are introduced in this patch:
    when LDFLAGS_DIRECT is included in LDFLAGS
    use of CFLAGS-$(CONFIG_INDIRECT_THUNK) instead of ifeq().

The LTO change hasn't been tested properly, as LTO is marked as
broken.

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Julien Grall <jgrall@amazon.com>

golang/xenlight: Implement DomainCreateNew

This implements the wrapper around libxl_domain_create_new(). With
the previous changes, it's now possible to create a domain using the
golang bindings (although not yet to unpause it or harvest it after it
shuts down).

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Nick Rosbrook <rosbrookn@ainfosec.com>

golang/xenlight: Notify xenlight of SIGCHLD

libxl forks external processes and waits for them to complete; it
therefore needs to be notified when children exit.

In absence of instructions to the contrary, libxl sets up its own
SIGCHLD handlers.

Golang always unmasks and handles SIGCHLD itself.  libxl thankfully
notices this and throws an assert() rather than clobbering SIGCHLD
handlers.

Tell libxl that we'll be responsible for getting SIGCHLD notifications
to it.  Arrange for a channel in the context to receive notifications
on SIGCHLD, and set up a goroutine that will pass these on to libxl.

NB that every libxl context needs a notification; so multiple contexts
will each spin up their own goroutine when opening a context, and shut
it down on close.

libxl also wants to hold on to a const pointer to
xenlight_childproc_hooks rather than do a copy; so make a global
structure in C space.  Make it `static const`, just for extra safety;
this requires making a function in the C space to pass it to libxl.

While here, add a few comments to make the context set-up a bit easier
to follow.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Nick Rosbrook <rosbrookn@ainfosec.com>

golang/xenlight: Don't try to marshall zero-length arrays in fromC

The current fromC array code will do the "magic" casting and
martialling even when num_foo variable is 0. Go crashes when doing
the cast.

Only do array marshalling if the number of elements is non-zero;
otherwise, leave the target pointer empty (nil for Go slices, NULL for
C arrays).

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Nick Rosbrook <rosbrookn@ainfosec.com>

golang/xenlight: add DeviceUsbdevAdd/Remove wrappers

Add DeviceUsbdevAdd and DeviceUsbdevRemove as wrappers for
libxl_device_usbdev_add and libxl_device_usbdev_remove.

Signed-off-by: Nick Rosbrook <rosbrookn@ainfosec.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>

golang/xenlight: add DevicePciAdd/Remove wrappers

Add DevicePciAdd and DevicePciRemove as wrappers for
libxl_device_pci_add and libxl_device_pci remove.

Signed-off-by: Nick Rosbrook <rosbrookn@ainfosec.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>

golang/xenlight: add DeviceNicAdd/Remove wrappers

Add DeviceNicAdd and DeviceNicRemove as wrappers for
libxl_device_nic_add and libxl_device_nic_remove.

Signed-off-by: Nick Rosbrook <rosbrookn@ainfosec.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>

mem_sharing: allow forking domain with IOMMU enabled

The memory sharing subsystem by default doesn't allow a domain to share memory
if it has an IOMMU active for obvious security reasons. However, when fuzzing a
VM fork, the same security restrictions don't necessarily apply. While it makes
no sense to try to create a full fork of a VM that has an IOMMU attached as only
one domain can own the pass-through device at a time, creating a shallow fork
without a device model is still very useful for fuzzing kernel-mode drivers.

By allowing the parent VM to initialize the kernel-mode driver with a real
device that's pass-through, the driver can enter into a state more suitable for
fuzzing. Some of these initialization steps are quite complex and are easier to
perform when a real device is present. After the initialization, shallow forks
can be utilized for fuzzing code-segments in the device driver that don't
directly interact with the device.

Signed-off-by: Tamas K Lengyel <tamas.lengyel@intel.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

xen/build: use new $(c_flags) and $(a_flags) instead of $(CFLAGS)

In a later patch ("xen/build: have the root Makefile generates the
CFLAGS), we want to generate the CFLAGS in xen/Makefile, then export
it and have Rules.mk use a CFLAGS from the environment variables. That
changes the flavor of the CFLAGS and flags intended for one target
(like -D__OBJECT_FILE__ and -M%) gets propagated and duplicated. So we
start by moving such flags out of $(CFLAGS) and into $(c_flags) which
is to be modified by only Rules.mk.

__OBJECT_FILE__ is only used by arch/x86/mm/*.c files, so having it in
$(c_flags) is enough, we don't need it in $(a_flags).

For include/Makefile and as-insn we can keep using CFLAGS, but since
it doesn't have -M* flags anymore there is no need to filter them out.

The XEN_BUILD_EFI tests in arch/x86/Makefile was filtering out
CFLAGS-y, but according to dd40177c1bc8 ("x86-64/EFI: add CFLAGS to
check compile"), it was done to filter out -MF. CFLAGS doesn't
have those flags anymore, so no filtering is needed.

This is inspired by the way Kbuild generates CFLAGS for each targets.

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

xen/build: include include/config/auto.conf in main Makefile

We are going to generate the CFLAGS early from "xen/Makefile" instead
of in "Rules.mk", but we need to include "config/auto.conf", so
include it in "Makefile".

Before including "config/auto.conf" we check which make target a user
is calling, as some targets don't need "auto.conf". For targets that
needs auto.conf, make will generate it (and a default .config if
missing).

root-make-done is to avoid doing the calculation again once Rules.mk
takes over and is been executed with the root Makefile. When Rules.mk
is including xen/Makefile, `config-build' and `need-config' are
undefined so auto.conf will not be included again (it is already
included by Rules.mk) and kconfig target are out of reach of Rules.mk.

We are introducing a target %config to catch all targets for kconfig.
So we need an extra target %/.config to prevent make from trying to
regenerate $(XEN_ROOT)/.config that is included in Config.mk.

The way targets are filtered is inspired by Kbuild, with some code
imported from Linux. That's why there is PHONY variable that isn't
used yet, for example.

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

build,xsm: fix multiple call

Both script mkflask.sh and mkaccess_vector.sh generates multiple
files. Exploits the 'multi-target pattern rule' trick to call each
scripts only once.

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/mm: use cache in guest_walk_tables()

Emulation requiring device model assistance uses a form of instruction
re-execution, assuming that the second (and any further) pass takes
exactly the same path. This is a valid assumption as far as use of CPU
registers goes (as those can't change without any other instruction
executing in between [1]), but is wrong for memory accesses. In
particular it has been observed that Windows might page out buffers
underneath an instruction currently under emulation (hitting between two
passes). If the first pass translated a linear address successfully, any
subsequent pass needs to do so too, yielding the exact same translation.
To guarantee this, leverage the caching that now backs HVM insn
emulation.

[1] Other than on actual hardware, actions like
    XEN_DOMCTL_sethvmcontext, XEN_DOMCTL_setvcpucontext,
    VCPUOP_initialise, INIT, or SIPI issued against the vCPU can occur
    while the vCPU is blocked waiting for a device model to return data.
    In such cases emulation now gets canceled, though, and hence re-
    execution correctness is unaffected.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <pdurrant@amzn.com>

x86/HVM: implement memory read caching for insn emulation

Emulation requiring device model assistance uses a form of instruction
re-execution, assuming that the second (and any further) pass takes
exactly the same path. This is a valid assumption as far as use of CPU
registers goes (as those can't change without any other instruction
executing in between [1]), but is wrong for memory accesses. In
particular it has been observed that Windows might page out buffers
underneath an instruction currently under emulation (hitting between two
passes). If the first pass read a memory operand successfully, any
subsequent pass needs to get to see the exact same value.

Introduce a cache to make sure above described assumption holds. This
is a very simplistic implementation for now: Only exact matches are
satisfied (no overlaps or partial reads or anything); this is sufficient
for the immediate purpose of making re-execution an exact replay. The
cache also won't be used just yet for guest page walks; that'll be the
subject of a subsequent change.

With the cache being generally transparent to upper layers, but with it
having limited capacity yet being required for correctness, certain
users of hvm_copy_from_guest_*() need to disable caching temporarily,
without invalidating the cache. Note that the adjustments here to
hvm_hypercall() and hvm_task_switch() are benign at this point; they'll
become relevant once we start to be able to emulate respective insns
through the main emulator (and more changes will then likely be needed
to nested code).

As to the actual data page in a problamtic scenario, there are a couple
of aspects to take into consideration:
- We must be talking about an insn accessing two locations (two memory
  ones, one of which is MMIO, or a memory and an I/O one).
- If the non I/O / MMIO side is being read, the re-read (if it occurs at
  all) is having its result discarded, by taking the shortcut through
  the first switch()'s STATE_IORESP_READY case in hvmemul_do_io(). Note
  how, among all the re-issue sanity checks there, we avoid comparing
  the actual data.
- If the non I/O / MMIO side is being written, it is the OSes
  responsibility to avoid actually moving page contents to disk while
  there might still be a write access in flight - this is no different
  in behavior from bare hardware.
- Read-modify-write accesses are, as always, complicated, and while we
  deal with them better nowadays than we did in the past, we're still
  not quite there to guarantee hardware like behavior in all cases
  anyway. Nothing is getting worse by the changes made here, afaict.

In __hvm_copy() also reduce p's scope and change its type to void *.

[1] Other than on actual hardware, actions like
    XEN_DOMCTL_sethvmcontext, XEN_DOMCTL_setvcpucontext,
    VCPUOP_initialise, INIT, or SIPI issued against the vCPU can occur
    while the vCPU is blocked waiting for a device model to return data.
    In such cases emulation now gets canceled, though, and hence re-
    execution correctness is unaffected.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <pdurrant@amzn.com>

x86/HVM: cancel emulation when register state got altered

Re-execution (after having received data from a device model) relies on
the same register state still being in place as it was when the request
was first sent to the device model. Therefore vCPU state changes
effected by remote sources need to result in no attempt of re-execution.
Instead the returned data is to simply be ignored.

Note that any such asynchronous state changes happen with the vCPU at
least paused (potentially down and/or not marked ->is_initialised), so
there's no issue with fiddling with register state behind the actively
running emulator's back. Hence the new function doesn't need to
synchronize with the core emulation logic.

Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <pdurrant@amzn.com>

x86: validate VM assist value in arch_set_info_guest()

While I can't spot anything that would go wrong, just like the
respective hypercall only permits applicable bits to be set, we should
also do so when loading guest context.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

x86/HVM: expose VM assist hypercall

In preparation for the addition of VMASST_TYPE_runstate_update_flag
commit 72c538cca957 ("arm: add support for vm_assist hypercall") enabled
the hypercall for Arm. I consider it not logical that it then isn't also
exposed to x86 HVM guests (with the same single feature permitted to be
enabled as Arm has); Linux actually tries to use it afaict.

Rather than introducing yet another thin wrapper around vm_assist(),
make that function the main handler, requiring a per-arch
arch_vm_assist_valid_mask() definition instead.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

x86/mm: monitor table is HVM-only

Move the per-vCPU field to the HVM sub-structure.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>

x86/shadow: sh_update_linear_entries() is a no-op for PV

Consolidate the shadow_mode_external() in here: Check this once at the
start of the function.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>

x86/shadow: make sh_remove_write_access() helper HVM only

Despite the inline attribute at least some clang versions warn about
trace_shadow_wrmap_bf() being unused in !HVM builds. Include the helper
in the #ifdef region.

Fixes: 8b8d011ad868 ("x86/shadow: the guess_wrmap() hook is needed for HVM only")
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>

xen/arm: Avoid open-coding the relinquish state machine

In commit 0dfffe01d5 "x86: Improve the efficiency of
domain_relinquish_resources()", the x86 version of the function has been
reworked to avoid open-coding the state machine and also add more
documentation.

Bring the Arm version on par with x86 by introducing a documented
PROGRESS() macro to avoid latent bugs and make the new PROG_* states
private to domain_relinquish_resources().

Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>

xen/arm: vgic-v3: fix GICD_ISACTIVER range

The end should be GICD_ISACTIVERN not GICD_ISACTIVER.

See https://marc.info/?l=xen-devel&m=158527653730795 for a discussion on
what it would take to implement GICD_ISACTIVER/GICD_ICACTIVER properly.

We chose v1 instead of v2 of this patch to avoid spamming the console:
v2 adds a printk for every read, and reads can happen often.

Signed-off-by: Peng Fan <peng.fan@nxp.com>
[Stefano: improve commit message]
Signed-off-by: Stefano Stabellini <stefano.stabellini@xilinx.com>
Acked-by: Julien Grall <jgrall@amazon.com>

x86: Enumeration for Control-flow Enforcement Technology

The CET spec has been published and guest kernels are starting to get support.
Introduce the CPUID and MSRs, and fully block the MSRs from guest use.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Wei Liu <wl@xen.org>

x86/shadow: don't open-code shadow_blow_tables_per_domain()

Make shadow_blow_all_tables() call the designated function, and on this
occasion make the function itself use domain_vcpu().

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>

x86/shadow: the trace_emul_write_val() hook is HVM-only

Its only caller lives in HVM-only code, and the only caller of
trace_shadow_emulate() also already site in a HVM-only code section.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>

x86/mm: pagetable_dying() is HVM-only

Its only caller lives in HVM-only code.

This involves wider changes, in order to limit #ifdef-ary: Shadow's
SHOPT_FAST_EMULATION and the fields used by it get constrained to HVM
builds as well. Additionally the shadow_{init,continue}_emulation()
stubs for the !HVM case aren't needed anymore and hence get dropped.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>

x86/shadow: the guess_wrmap() hook is needed for HVM only

sh_remove_write_access() bails early for !external guests, and hence its
building and thus the need for the hook can be suppressed altogether in
!HVM configs.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>

x86/shadow: sh_remove_write_access_from_sl1p() can be static

It's only used by common.c.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>

x86/shadow: monitor table is HVM-only

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>

x86/shadow: drop a stray forward structure declaration

struct sh_emulate_ctxt is private to shadow code, and hence a
declaration for it is not needed here.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>

x86/vtd: relax EPT page table sharing check

The EPT page tables can be shared with the IOMMU as long as the page
sizes supported by EPT are also supported by the IOMMU.

Current code checks that both the IOMMU and EPT support the same page
sizes, but this is not strictly required, the IOMMU supporting more
page sizes than EPT is fine and shouldn't block page table sharing.

This is likely not a common case (IOMMU supporting more page sizes
than EPT), but should still be fixed for correctness.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>

x86emul: SYSRET must change CPL

The special AMD behavior of leaving SS mostly alone wasn't really
complete: We need to adjust CPL aka SS.DPL.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

tools/ocaml: Fix stubs build when OCaml has been compiled with -safe-string

The OCaml code has been fixed to handle properly -safe-string in Xen
4.11, however the stubs part were missed.

On OCaml newer than 4.06.1, String_Val() will return a const char *
when using -safe-string leading to build failure when this is used
in place where char * is expected.

The main use in Xen code base is when a new string is allocated. The
suggested approach by the OCaml community [1] is to use the helper
caml_alloc_initialized_string() but it was introduced by OCaml 4.06.1.

The next best approach is to cast String_val() to (char *) as the helper
would have done. So use it when we need to update the new string using
memcpy().

Take the opportunity to remove the unnecessary cast of the source as
mempcy() is expecting a void *.

[1] https://github.com/ocaml/ocaml/pull/1274

Reported-by: Dario Faggioli <dfaggioli@suse.com>
Signed-off-by: Julien Grall <jgrall@amazon.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>

tools/ocaml: libxb: Avoid to use String_val() when value is bytes

Commit ec7d54dd1a "ocaml/libs/xb: Use bytes in place of strings for
mutable buffers" switch mutable buffers from string to bytes. However
the C code were still using String_Val() to access them.

While the underlying structure is the same between string and bytes, a
string is meant to be immutable. OCaml 4.06.1 and later will enforce it.
Therefore, it will not be possible to build the OCaml libs when using
-safe-string. This is because String_val() will return a const value.

To avoid plain cast in the code, the code is now switched to use
Bytes_val(). As the macro is not defined in older OCaml version, we need
to provide a stub.

Take the opportunity to switch to const the buffer in
ml_interface_write() as it should not be modified.

Reported-by: Dario Faggioli <dfaggioli@suse.com>
Signed-off-by: Julien Grall <jgrall@amazon.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>

tools/ocaml: libxb: Harden stub_header_of_string()

stub_header_of_string() should not modify the header. So mark the
variable 'hdr' as const.

Signed-off-by: Julien Grall <jgrall@amazon.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>

tools/ocaml: libxc: Check error return in stub_xc_vcpu_context_get()

xc_vcpu_getcontext() may fail to retrieve the vcpu context. Rather than
ignoring the return value, check it and throw an error if needed.

Signed-off-by: Julien Grall <jgrall@amazon.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>

x86/pv: Delete CONFIG_PV_LDT_PAGING

... in accordance with the timeline laid out in the Kconfig message. There
has been no comment since it was disabled by default.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wl@xen.org>

sched: fix scheduler_disable() with core scheduling

In core-scheduling mode, Xen might crash when entering ACPI S5 state.
This happens in sched_slave() during is_idle_unit(next) check because
next->vcpu_list is stale and points to an already freed memory.

This situation happens shortly after scheduler_disable() is called if
some CPU is still inside sched_slave() softirq. Current logic simply
returns prev->next_task from sched_wait_rendezvous_in() which causes
the described crash because next_task->vcpu_list has become invalid.

Fix the crash by returning NULL from sched_wait_rendezvous_in() in
the case when scheduler_disable() has been called.

Signed-off-by: Sergey Dyasli <sergey.dyasli@citrix.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Dario Faggioli <dfaggioli@suse.com>

sched/core: fix bug when moving a domain between cpupools

For each UNIT, sched_set_affinity is called before unit->priv is updated
to the new cpupool private UNIT data structure. The issue is
sched_set_affinity will call the adjust_affinity method of the cpupool.
If defined, the new cpupool may use unit->priv (e.g. credit), which at
this point still references the old cpupool private UNIT data structure.

This change fixes the bug by moving the switch of unit->priv earler in
the function.

Signed-off-by: Jeff Kubascik <jeff.kubascik@dornerworks.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Acked-by: Dario Faggioli <dfaggioli@suse.com>

x86_64/mm: map and unmap page tables in destroy_m2p_mapping

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: Hongyan Xia <hongyxia@amazon.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86_64/mm: map and unmap page tables in share_hotadd_m2p_table

Fetch lYe by mapping and unmapping lXe instead of using the direct map,
which is now done via the lYe_from_lXe() helpers.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: Hongyan Xia <hongyxia@amazon.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86_64/mm: map and unmap page tables in m2p_mapped

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: Hongyan Xia <hongyxia@amazon.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/shim: map and unmap page tables in replace_va_mapping

Also, introduce lYe_from_lXe() macros which do not rely on the direct
map when walking page tables. Unfortunately, they cannot be inline
functions due to the header dependency on domain_page.h, so keep them as
macros just like map_lYt_from_lXe().

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: Hongyan Xia <hongyxia@amazon.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

hvmloader: enable MMIO and I/O decode, after all resource allocation

It was observed that PCI MMIO and/or IO BARs were programmed with
memory and I/O decodes (bits 0 and 1 of PCI COMMAND register) enabled,
during PCI setup phase. This resulted in incorrect memory mapping as
soon as the lower half of the 64 bit bar is programmed.
This displaced any RAM mappings under 4G. After the
upper half is programmed PCI memory mapping is restored to its
intended high mem location, but the RAM displaced is not restored.
The OS then continues to boot and function until it tries to access
the displaced RAM at which point it suffers a page fault and crashes.

This patch address the issue by deferring enablement of memory and
I/O decode in command register until all the resources, like interrupts
I/O and/or MMIO BARs for all the PCI device functions are programmed,
in the descending order of memory requested.

Signed-off-by: Harsha Shamsundara Havanur <havanur@amazon.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/altp2m: add missing break

Add a missing break in the HVMOP_altp2m_set_visibility case, or else
code flow will continue into the default case and trigger the assert.

Fixes: 3fd3e9303ec4b1 ('x86/altp2m: hypercall to set altp2m view visibility')
Coverity-ID: 1461759
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Wei Liu <wl@xen.org>

x86/boot: Fix early exception handling with CONFIG_PERF_COUNTERS

The PERFC_INCR() macro uses current->processor, but current is not valid
during early boot.  This causes the following crash to occur if
e.g. rdmsr_safe() has to recover from a #GP fault.

  (XEN) Early fatal page fault at e008:ffff82d0803b1a39 (cr2=0000000000000004, ec=0000)
  (XEN) ----[ Xen-4.14-unstable  x86_64  debug=y   Not tainted ]----
  (XEN) CPU:    0
  (XEN) RIP:    e008:[<ffff82d0803b1a39>] x86_64/entry.S#handle_exception_saved+0x64/0xb8
  ...
  (XEN) Xen call trace:
  (XEN)    [<ffff82d0803b1a39>] R x86_64/entry.S#handle_exception_saved+0x64/0xb8
  (XEN)    [<ffff82d0806394fe>] F __start_xen+0x2cd/0x2980
  (XEN)    [<ffff82d0802000ec>] F __high_start+0x4c/0x4e

Furthermore, the PERFC_INCR() macro is wildly inefficient.  There has been a
single caller for many releases now, so inline it and delete the macro
completely.

There is no need to reference current at all.  What is actually needed is the
per_cpu_offset which can be obtained directly from the top-of-stack block.
This simplifies the counter handling to 3 instructions and no spilling to the
stack at all.

The same breakage from above is now handled properly:

  (XEN) traps.c:1591: GPF (0000): ffff82d0806394fe [__start_xen+0x2cd/0x2980] -> ffff82d0803b3bfb

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Julien Grall <jgrall@amazon.com>

x86/svm: Don't use vmcb->tlb_control as if it is a boolean

svm_asid_handle_vmrun() treats tlb_control as if it were boolean, but this has
been superseded by new additions to the SVM spec.

Introduce an enum containing all legal values, and update
svm_asid_handle_vmrun() to use appropriate constants.

While adjusting this, take the opportunity to fix up two coding style issues,
and trim the include list.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

gnttab: fix GNTTABOP_copy continuation handling

The XSA-226 fix was flawed - the backwards transformation on rc was done
too early, causing a continuation to not get invoked when the need for
preemption was determined at the very first iteration of the request.
This in particular means that all of the status fields of the individual
operations would be left untouched, i.e. set to whatever the caller may
or may not have initialized them to.

This is part of XSA-318.

Reported-by: Pawel Wieczorkiewicz <wipawel@amazon.de>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Tested-by: Pawel Wieczorkiewicz <wipawel@amazon.de>

xen/gnttab: Fix error path in map_grant_ref()

Part of XSA-295 (c/s 863e74eb2cffb) inadvertently re-positioned the brackets,
changing the logic. If the _set_status() call fails, the grant_map hypercall
would fail with a status of 1 (rc != GNTST_okay) instead of the expected
negative GNTST_* error.

This error path can be taken due to bad guest state, and causes net/blk-back
in Linux to crash.

This is XSA-316.

Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

xen/rwlock: Add missing memory barrier in the unlock path of rwlock

The rwlock unlock paths are using atomic_sub() to release the lock.
However the implementation of atomic_sub() rightfully doesn't contain a
memory barrier. On Arm, this means a processor is allowed to re-order
the memory access with the preceeding access.

In other words, the unlock may be seen by another processor before all
the memory accesses within the "critical" section.

The rwlock paths already contains barrier indirectly, but they are not
very useful without the counterpart in the unlock paths.

The memory barriers are not necessary on x86 because loads/stores are
not re-ordered with lock instructions.

So add arch_lock_release_barrier() in the unlock paths that will only
add memory barrier on Arm.

Take the opportunity to document each lock paths explaining why a
barrier is not necessary.

This is XSA-314.

Signed-off-by: Julien Grall <jgrall@amazon.com>
Acked-by: Jan Beulich <jbeulich@suse.com>

xenoprof: limit consumption of shared buffer data

Since a shared buffer can be written to by the guest, we may only read
the head and tail pointers from there (all other fields should only ever
be written to). Furthermore, for any particular operation the two values
must be read exactly once, with both checks and consumption happening
with the thus read values. (The backtrace related xenoprof_buf_space()
use in xenoprof_log_event() is an exception: The values used there get
re-checked by every subsequent xenoprof_add_sample().)

Since that code needed touching, also fix the double increment of the
lost samples count in case the backtrace related xenoprof_add_sample()
invocation in xenoprof_log_event() fails.

Where code is being touched anyway, add const as appropriate, but take
the opportunity to entirely drop the now unused domain parameter of
xenoprof_buf_space().

This is part of XSA-313.

Reported-by: Ilja Van Sprundel <ivansprundel@ioactive.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Wei Liu <wl@xen.org>

xenoprof: clear buffer intended to be shared with guests

alloc_xenheap_pages() making use of MEMF_no_scrub is fine for Xen
internally used allocations, but buffers allocated to be shared with
(unpriviliged) guests need to be zapped of their prior content.

This is part of XSA-313.

Reported-by: Ilja Van Sprundel <ivansprundel@ioactive.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wl@xen.org>

x86/EFI: also fill boot_tsc_stamp on the xen.efi boot path

Commit e3a379c35eff ("x86/time: always count s_time from Xen boot")
introducing this missed adjusting this path as well.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Wei Liu <wl@xen.org>

x86/altp2m: hypercall to set altp2m view visibility

At this moment a guest can call vmfunc to change the altp2m view. This
should be limited in order to avoid any unwanted view switch.

The new xc_altp2m_set_visibility() solves this by making views invisible
to vmfunc.
This is done by having a separate arch.altp2m_working_eptp that is
populated and made invalid in the same places as altp2m_eptp. This is
written to EPTP_LIST_ADDR.
The views are made in/visible by marking them with INVALID_MFN or
copying them back from altp2m_eptp.
To have consistency the visibility also applies to
p2m_switch_domain_altp2m_by_id().

The usage of this hypercall is aimed at dom0 having a logic with a number of views
created and at some time there is a need to be sure that only some of the views
can be switched, saving the rest and making them visible when the time
is right.

Note: If altp2m mode is set to mixed the guest is able to change the view
visibility and then call vmfunc.

Signed-off-by: Alexandru Isaila <aisaila@bitdefender.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Wei Liu <wl@xen.org>

x86/mem_sharing: Fix build with !CONFIG_XSM

A build fails with:

  mem_sharing.c: In function ‘copy_special_pages’:
  mem_sharing.c:1649:9: error: ‘HVM_PARAM_STORE_PFN’ undeclared (first use in this function)
           HVM_PARAM_STORE_PFN,
           ^~~~~~~~~~~~~~~~~~~
  ...

This is because xsm/xsm.h includes xsm/dummy.h for the !CONFIG_XSM case, which
brings public/hvm/params.h in.

Fixes: 41548c5472a "mem_sharing: VM forking"
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Tamas K Lengyel <tamas@tklengyel.com>

xen/x86: ioapic: Simplify ioapic_init()

Since commit 9facd54a45 "x86/ioapic: Add register level checks to detect
bogus io-apic entries", Xen is able to cope with IO APICs not mapped in
the fixmap.

Therefore the whole logic to allocate a fake page for some IO APICs is
unnecessary.

With the logic removed, the code can be simplified a lot as we don't
need to go through all the IO APIC if SMP has not been detected or a
bogus zero IO-APIC address has been detected.

To avoid another level of tabulation, the simplification is now moved in
its own function.

Signed-off-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

xen/x86: ioapic: Rename init_ioapic_mappings() to ioapic_init()

The function init_ioapic_mappings() is doing more than initialization
mappings. It is also initialization the number of IRQs/GSIs supported.

So rename the function to ioapic_init().

Signed-off-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>

xen/x86: ioapic: Use true/false in bad_ioapic_register()

bad_ioapic_register() is returning a bool, so we should switch to
true/false.

Signed-off-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Wei Liu <wl@xen.org>
Acked-by: Jan Beulich <jbeulich@suse.com>

tools/xl: Remove the filelock when building VM if autoballooning is off

The presence of this filelock does not allow building several VMs at the same
time. This filelock was added to prevent other xl instances from using memory
freed for the currently building VM in autoballoon mode.

Signed-off-by: Dmitry Isaykin <isaikin-dmitry@yandex.ru>
Reviewed-by: Ian Jackson <ian.jackson@eu.citrix.com>

libxc/migration: Abort migration on precopy policy request

libxc defines XGS_POLICY_ABORT for precopy policy to signal that migration
should be aborted (eg. if the estimated pause time is too huge for the
instance). Default simple precopy policy never returns that, but it could be
overriden with a custom one.

Signed-off-by: Andrew Panyakin <apanyaki@amazon.com>
Acked-by: Wei Liu <wl@xen.org>
[wei: fix coding style issue]

x86/PoD: correct ordering of checks in p2m_pod_zero_check()

Commit 0537d246f8db ("mm: add 'is_special_page' inline function...")
moved the is_special_page() checks first in its respective changes to
PoD code. While this is fine for p2m_pod_zero_check_superpage(), the
validity of the MFN is inferred in both cases from the p2m_is_ram()
check, which therefore also needs to come first in this 2nd instance.

Take the opportunity and address latent UB here as well - transform
the MFN into struct page_info * only after having established that
this is a valid page.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

x86/HVM: __hvm_copy()'s size parameter is an unsigned quantity

There are no negative sizes. Make the function's parameter as well as
that of its derivates "unsigned int". Similarly make its local "count"
variable "unsigned int", and drop "todo" altogether. Don't use min_t()
anymore to calculate "count". Restrict its scope as well as that of
other local variables of the function.

While at it I've also noticed that {copy_{from,to},clear}_user_hvm()
have been returning "unsigned long" for no apparent reason, as their
respective "size" parameters have already been "unsigned int". Adjust
this as well as a slightly wrong comment there at the same time.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <pdurrant@amzn.com>

mem_sharing: reset a fork

Implement hypercall that allows a fork to shed all memory that got allocated
for it during its execution and re-load its vCPU context from the parent VM.
This allows the forked VM to reset into the same state the parent VM is in a
faster way then creating a new fork would be. Measurements show about a 2x
speedup during normal fuzzing operations. Performance may vary depending how
much memory got allocated for the forked VM. If it has been completely
deduplicated from the parent VM then creating a new fork would likely be more
performant.

Signed-off-by: Tamas K Lengyel <tamas.lengyel@intel.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

mem_sharing: VM forking

VM forking is the process of creating a domain with an empty memory space and a
parent domain specified from which to populate the memory when necessary. For
the new domain to be functional the VM state is copied over as part of the fork
operation (HVM params, hap allocation, etc).

Signed-off-by: Tamas K Lengyel <tamas.lengyel@intel.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Julien Grall <jgrall@amazon.com>

config: use mini-os master for unstable

We haven't used mini-os master for about 2 years now due to a stubdom
test failing [1]. Booting a guest with mini-os master used for building
stubdom didn't reveal any problem, so use master for unstable in order
to let OSStest find any problems not showing up in the local test.

[1]: https://lists.xen.org/archives/html/minios-devel/2018-04/msg00015.html

Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Wei Liu <wl@xen.org>

x86/ucode: Simplify the ops->collect_cpu_info() API

All callers pass &this_cpu(cpu_sig) for the cpu_sig parameter, and all
implementations unconditionally return 0. Simplify it to be void.

Drop the long-stale comment on the AMD side, whose counterpart in
start_update() used to be "collect_cpu_info() doesn't fail so we're fine".

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/ucode: Drop ops->free_patch()

With the newly cleaned up vendor logic, each struct microcode_patch is a
trivial object in memory with no dependent allocations.

This is unlikely to change moving forwards, and function pointers are
expensive in the days of retpoline. Move the responsibility to xfree() back
to common code. If the need does arise in the future, we can consider
reintroducing the hook.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/ucode: Don't try to cope with NULL pointers in apply_microcode()

No paths to apply_microcode() pass a NULL pointer, and other hooks don't
tolerate one in the first place. We can expect the core logic not to pass us
junk, so drop the checks.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/ucode: Drop ops->match_cpu()

It turns out there are no callers of the hook(). The only callers are the
local, which can easily be rearranged to use the appropriate internal helper.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/ucode/intel: Remove one CPUID from collect_cpu_info()

The CPUID instruction is expensive. No point executing it twice when once
will do fine.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

docs: Render .md files using pandoc

This fixes the fact that qemu-deprivilege.md, non-cooperative-migration.md and
xenstore-migration.md don't currently get rendered at all, and are therefore
missing from xenbits.xen.org/docs

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Ian Jackson <ian.jackson@eu.citrix.com>
Backport: 4.12

tools/xenstore: fix a use after free problem in xenstored

Commit 562a1c0f7ef3fb ("tools/xenstore: dont unlink connection object
twice") introduced a potential use after free problem in
domain_cleanup(): after calling talloc_unlink() for domain->conn
domain->conn is set to NULL. The problem is that domain is registered
as talloc child of domain->conn, so it might be freed by the
talloc_unlink() call.

With Xenstore being single threaded there are normally no concurrent
memory allocations running and freeing a virtual memory area normally
doesn't result in that area no longer being accessible. A problem
could occur only in case either a signal received results in some
memory allocation done in the signal handler (SIGHUP is a primary
candidate leading to reopening the log file), or in case the talloc
framework would do some internal memory allocation during freeing of
the memory (which would lead to clobbering of the freed domain
structure).

Fixes: 562a1c0f7ef3fb ("tools/xenstore: dont unlink connection object twice")
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

x86/p2m: make p2m_remove_page()'s parameters type-safe

Also add a couple of blank lines.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

x86/p2m: use available local variable in guest_physmap_add_entry()

The domain is being passed in - no need to obtain it from p2m->domain.
Also drop a pointless cast and simplify expressions while touching this
code anyway.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

x86/cpuidle: correct Cannon Lake residency MSRs

As per SDM rev 071 Cannon Lake has
- no CC3 residency MSR at 3FC,
- a CC1 residency MSR ar 660 (like various Atoms),
- a useless (always zero) CC3 residency MSR at 662.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

x86: use macro DIV_ROUND_UP

Use the DIV_ROUND_UP macro to replace open-coded divisor calculation
(((n) + (d) - 1) / (d)) to improve readability.

Signed-off-by: Simran Singhal <singhalsimran0@gmail.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

x86/p2m: drop pointless nested variable from guest_physmap_add_entry()

There's an outer scope rc already, and its use for the mem-sharing logic
does not conflict with its use elsewhere in the function.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

x86/p2m: don't assert that the passed in MFN matches for a remove

guest_physmap_remove_page() gets handed an MFN from the outside, yet
takes the necessary lock to prevent further changes to the GFN <-> MFN
mapping itself. While some callers, in particular guest_remove_page()
(by way of having called get_gfn_query()), hold the GFN lock already,
various others (most notably perhaps the 2nd instance in
xenmem_add_to_physmap_one()) don't. While it also is an option to fix
all the callers, deal with the issue in p2m_remove_page() instead:
Replace the ASSERT() by a conditional and split the loop into two, such
that all checking gets done before any modification would occur.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

x86/p2m: don't ignore p2m_remove_page()'s return value

It's not very nice to return from guest_physmap_add_entry() after
perhaps already having made some changes to the P2M, but this is pre-
existing practice in the function, and imo better than ignoring errors.

Take the liberty and replace an mfn_add() instance with a local variable
already holding the result (as proven by the check immediately ahead).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

x86emul: inherit HOSTCC when building 32-bit harness on 64-bit host

We're deliberately bringing XEN_COMPILE_ARCH and XEN_TARGET_ARCH out of
sync in this case, and hence HOSTCC won't get set from CC. Therefore
without this addition HOSTCC would not match a possible make command
line override of CC, but default to "gcc", likely causing the build to
fail for test_x86_emulator.c on systems with too old a gcc.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

x86emul: suppress "not built" warning for test harness'es run targets

The run* targets can be used to test whatever the tool chain is capable
of building, as long as at least the main harness source file builds.
Don't probe the tools chain, in particular to avoid issuing the warning,
in this case. While looking into this I also noticed the wording of the
respective comment isn't quite right, which therefore gets altered at
the same time.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

credit2: fix credit reset happening too few times

There is a bug in commit 5e4b4199667b9 ("xen: credit2: only reset
credit on reset condition"). In fact, the aim of that commit was to
make sure that we do not perform too many credit reset operations
(which are not super cheap, and in an hot-path). But the check used
to determine whether a reset is necessary was the wrong one.

In fact, knowing just that some vCPUs have been skipped, while
traversing the runqueue (in runq_candidate()), is not enough. We
need to check explicitly whether the first vCPU in the runqueue
has a negative amount of credit.

Since a trace record is changed, this patch updates xentrace format file
and xenalyze as well

This should be backported.

Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>

credit2: avoid vCPUs to ever reach lower credits than idle

There have been report of stalls of guest vCPUs, when Credit2 was used.
It seemed like these vCPUs were not getting scheduled for very long
time, even under light load conditions (e.g., during dom0 boot).

Investigations led to the discovery that --although rarely-- it can
happen that a vCPU manages to run for very long timeslices. In Credit2,
this means that, when runtime accounting happens, the vCPU will lose a
large quantity of credits. This in turn may lead to the vCPU having less
credits than the idle vCPUs (-2^30). At this point, the scheduler will
pick the idle vCPU, instead of the ready to run vCPU, for a few
"epochs", which often times is enough for the guest kernel to think the
vCPU is not responding and crashing.

An example of this situation is shown here. In fact, we can see d0v1
sitting in the runqueue while all the CPUs are idle, as it has
-1254238270 credits, which is smaller than -2^30 = −1073741824:

    (XEN) Runqueue 0:
    (XEN)   ncpus              = 28
    (XEN)   cpus               = 0-27
    (XEN)   max_weight         = 256
    (XEN)   pick_bias          = 22
    (XEN)   instload           = 1
    (XEN)   aveload            = 293391 (~111%)
    (XEN)   idlers: 00,00000000,00000000,00000000,00000000,00000000,0fffffff
    (XEN)   tickled: 00,00000000,00000000,00000000,00000000,00000000,00000000
    (XEN)   fully idle cores: 00,00000000,00000000,00000000,00000000,00000000,0fffffff
    [...]
    (XEN) Runqueue 0:
    (XEN) CPU[00] runq=0, sibling=00,..., core=00,...
    (XEN) CPU[01] runq=0, sibling=00,..., core=00,...
    [...]
    (XEN) CPU[26] runq=0, sibling=00,..., core=00,...
    (XEN) CPU[27] runq=0, sibling=00,..., core=00,...
    (XEN) RUNQ:
    (XEN)     0: [0.1] flags=0 cpu=5 credit=-1254238270 [w=256] load=262144 (~100%)

We certainly don't want, under any circumstance, this to happen.
Let's, therefore, define a minimum amount of credits a vCPU can have.
During accounting, we make sure that, for however long the vCPU has
run, it will never get to have less than such minimum amount of
credits. Then, we set the credits of the idle vCPU to an even
smaller value.

NOTE: investigations have been done about _how_ it is possible for a
vCPU to execute for so much time that its credits becomes so low. While
still not completely clear, there are evidence that:
- it only happens very rarely,
- it appears to be both machine and workload specific,
- it does not look to be a Credit2 (e.g., as it happens when
  running with Credit1 as well) issue, or a scheduler issue.

This patch makes Credit2 more robust to events like this, whatever
the cause is, and should hence be backported (as far as possible).

Reported-by: Glen <glenbarney@gmail.com>
Reported-by: Tomas Mozes <hydrapolic@gmail.com>
Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>

x86/ucode/amd: Rework parsing logic in cpu_request_microcode()

cpu_request_microcode() is still a confusing mess to follow, with sub
functions responsible for maintaining offset. Rewrite it so all container
structure handling is in this one function.

Rewrite struct mpbhdr as struct container_equiv_table to aid parsing. Drop
container_fast_forward() entirely, and shrink scan_equiv_cpu_table() to just
its searching/caching logic.

container_fast_forward() gets logically folded into the microcode blob
scanning loop, except that a skip path is inserted, which is conditional on
whether scan_equiv_cpu_table() thinks there is appropriate microcode to find.

With this change, we now scan to the end of all provided microcode containers,
and no longer give up at the first applicable one.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/ucode/amd: Fold structures together

With all the necessary cleanup now in place, fold struct microcode_header_amd
into struct microcode_patch and drop the struct microcode_amd temporary
ifdef-ary.

This removes the memory allocation of struct microcode_amd which is a single
pointer to a separately allocated object, and therefore a waste.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/ucode/amd: Remove gratuitous memory allocations from cpu_request_microcode()

Just as on the Intel side, there is no point having
get_ucode_from_buffer_amd() make $N memory allocations and free $N-1 of them.

Delete get_ucode_from_buffer_amd() and rewrite the loop in
cpu_request_microcode() to have 'saved' point into 'buf' until we finally
decide to duplicate that blob and return it to our caller.

Introduce a new struct container_microcode to simplify interpreting the
container format. Doubly indent the logic to substantially reduce the churn
in a later change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/ucode/amd: Rename bufsize to size in cpu_request_microcode()

To simplify future cleanup, rename this variable.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>

x86/ucode/amd: Alter API for microcode_fits()

Although it is logically a step in the wrong direction overall, it simplifies
the rearranging of cpu_request_microcode() substantially for microcode_fits()
to take struct microcode_header_amd directly, and not require an intermediate
struct microcode_amd pointing at it.

Make this change (taking time to rename 'mc_amd' to its eventual 'patch' to
reduce the churn in the series), and a later cleanup will make it uniformly
take a struct microcode_patch.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>

x86/ucode/amd: Move verify_patch_size() into get_ucode_from_buffer_amd()

We only stash the microcode blob size so it can be audited in
microcode_fits(). However, the patch size check depends only on the CPU
family.

Move the check earlier to when we are parsing the container, which avoids
caching bad microcode in the first place, and allows us to avoid storing the
size at all.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/ucode/amd: Overhaul the equivalent cpu table handling completely

We currently copy the entire equivalency table, and the single correct
microcode. This is not safe to heterogeneous scenarios, and as Xen doesn't
support such situations to begin with, can be used to simplify things further.

The CPUID.1.EAX => processor_rev_id mapping is fixed for an individual part.
We can cache the single appropriate entry on first discovery, and forgo
duplicating the entire table.

Alter install_equiv_cpu_table() to be scan_equiv_cpu_table() which is
responsible for checking the equivalency table and caching appropriate
details. It now has a check for finding a different mapping (which indicates
that one of the tables we've seen is definitely wrong).

A return value of -ESRCH is now used to signify "everything fine, but nothing
applicable for the current CPU", which is used to select the
container_fast_forward() path.

Drop the printk(), as each applicable error path in scan_equiv_cpu_table()
already prints diagnostics.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/ucode/amd: Collect CPUID.1.EAX in collect_cpu_info()

... rather than collecting it repeatedly in microcode_fits(). This brings the
behaviour in line with the Intel side.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>