tools/migration: unify known page type checking

Users of xc_get_pfn_type_batch may want to sanity check the data
returned by Xen. Add helpers for this purpose:

is_known_page_type verifies the type returned by Xen on the saving
side, or the incoming type for a page on the restoring side, is known
by the save/restore code.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Reviewed-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

tools/python: fix Python3.4 TypeError in format string

Using the first element of a tuple for a format specifier fails with
python3.4 as included in SLE12:
b = b"string/%x" % (i, )
TypeError: unsupported operand type(s) for %: 'bytes' and 'tuple'

It happens to work with python 2.7 and 3.6.
To support older Py3, format as strings and explicitly encode as ASCII.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
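
A minimal sketch of the workaround described in the commit above, assuming a hypothetical helper name make_key(); only the "string/%x" format comes from the commit message. The value is formatted as text and then encoded, which avoids applying % to a bytes object:

```python
def make_key(i):
    # Hypothetical helper: format as text first, then encode explicitly.
    # This sidesteps "bytes % tuple", which Python 3.4 does not support.
    return ("string/%x" % (i,)).encode("ascii")


print(make_key(255))
```

The same expression yields a byte string on Python 2.7 as well, so a single code path should cover both interpreters.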

tools/python: handle libxl__physmap_info.name properly in convert-legacy-stream

The trailing member name[] in libxl__physmap_info is written as a
cstring into the stream. The current code sanity-checks that the
last byte is zero. This check fails with python3 because name[-1]
returns an int. As a result the comparison with byte(\00) fails:

  File "/usr/lib/xen/bin/convert-legacy-stream", line 347, in read_libxl_toolstack
    raise StreamError("physmap name not NUL terminated")
  StreamError: physmap name not NUL terminated

To handle both python variants, cast to bytearray().

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
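
A minimal sketch of the bytearray() approach from the commit above. The helper name is_nul_terminated() and the sample name are made up for illustration; this is not the code in convert-legacy-stream itself:

```python
def is_nul_terminated(name):
    # Indexing bytes gives a 1-char str on Python 2 but an int on Python 3.
    # Indexing a bytearray gives an int on both, so compare against 0.
    buf = bytearray(name)
    return len(buf) > 0 and buf[-1] == 0


print(is_nul_terminated(b"example\x00"))
print(is_nul_terminated(b"example"))
```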

arm64: Change type of hsr, cpsr, spsr_el1 to uint64_t

AArch64 registers are 64bit whereas AArch32 registers
are 32bit or 64bit. MSR/MRS are expecting 64bit values thus
we should get rid of helpers READ/WRITE_SYSREG32
in favour of using READ/WRITE_SYSREG.
We should also use register_t type when reading sysregs
which can correspond to uint64_t or uint32_t.
Even though many AArch64 registers have upper 32bit reserved
it does not mean that they can't be widened in the future.

Modify type of hsr, cpsr, spsr_el1 to uint64_t.
Previously we relied on the padding after spsr_el1.
As we removed the padding, modify the union to be 64bit so we don't corrupt spsr_fiq.
No need to modify the assembly code because the accesses were based on 64bit
registers as there was a 32bit padding after spsr_el1.

Remove 32bit padding in cpu_user_regs before spsr_fiq
as it is no longer needed due to upper union being 64bit now.
Add 64bit padding in cpu_user_regs before spsr_el1
because the kernel frame should be 16-byte aligned.

Change type of cpsr to uint64_t in the public outside interface
"public/arch-arm.h" to allow ABI compatibility between 32bit and 64bit.
Increment XEN_DOMCTL_INTERFACE_VERSION.

Change type of cpsr to uint64_t in the public outside interface
"public/vm_event.h" to allow ABI compatibility between 32bit and 64bit.

Signed-off-by: Michal Orzel <michal.orzel@arm.com>
Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

xen/arm: bootfdt: Always sort memory banks

At the moment, Xen on Arm64 expects the memory banks to be ordered.
Unfortunately, there may be cases where the device tree, as updated by
firmware, contains unordered banks. This means Xen will panic
when setting xenheap mappings for the subsequent bank with start
address being less than xenheap_mfn_start (start address of
the first bank).

As there is no clear requirement regarding ordering in the device
tree, update the code to deal with this by sorting the memory
banks. There is only one heap region on Arm32, so it is fine to do
the sorting in the common code.

Suggested-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com>
Acked-by: Julien Grall <jgrall@amazon.com>
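
The idea in the commit above can be illustrated with a short sketch. It is written in Python rather than Xen's C, the (start, size) tuples and addresses are invented, and sort_banks() is a hypothetical name:

```python
def sort_banks(banks):
    # Each bank is a (start, size) tuple; order by start address so that
    # no later bank begins below the first one.
    return sorted(banks, key=lambda bank: bank[0])


# Invented example: firmware listed the higher bank first.
banks = [(0x80000000, 0x40000000), (0x40000000, 0x20000000)]
ordered = sort_banks(banks)
print([hex(start) for start, _ in ordered])
```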

tools/xenstored: Stash the correct request in lu_status->in

When Live-Updating with some load, Xenstored may hit the assert
req->in == lu_status->in in do_lu_start().

This is happening because the request is stashed when Live-Update
begins. This happens in a different request (see the call to lu_begin()
when selecting the new binary) from the one performing Live-Update.

To avoid the problem, stash the request in lu_start().

Fixes: 65f19ed62aa1 ("tools/xenstore: Don't assume conn->in points to the LU request")
Reported-by: Michael Kurth <mku@amazon.com>
Signed-off-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Luca Fancellu <luca.fancellu@arm.com>
Reviewed-by: Juergen Gross <jgross@suse.com>

libxl/arm: provide guests with random seed

Pass 128 bytes of random seed via FDT, so that guests' CRNGs are better seeded
early at boot. This is larger than the ChaCha20 key size of 32 bytes, so each
byte of CRNG state will be mixed 4 times using this seed. There does not seem
to be an advantage in a larger seed though.

Depending on its configuration Linux can use the seed as device randomness
or to just quickly initialize CRNG.
In either case this will provide extra randomness to further harden CRNG.

Signed-off-by: Sergiy Kibrik <Sergiy_Kibrik@epam.com>
Reviewed-by: Julien Grall <julien@xen.org>
Reviewed-by: Michal Orzel <michal.orzel@arm.com>

MAINTAINERS: Updating after change to tools/include/

The LIBS section doesn't mention the headers associated with the
libraries, same for LIBXENLIGHT section.

There aren't any ':' in other section names, so remove it.

Fixes: 4664034cdc72 ("tools/libs: move official headers to common directory")
Fixes: f7079d7ef69f ("MAINTAINERS: add myself as tools/libs reviewer")
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Reviewed-by: Juergen Gross <jgross@suse.com>

build: fix %.s: %.S rule

Fixes: e321576f4047 ("xen/build: start using if_changed")
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/shadow: drop callback_mask pseudo-variables

In commit 90629587e16e ("x86/shadow: replace stale literal numbers in
hash_{vcpu,domain}_foreach()") I had to work around Clang not following
gcc in certain relaxed requirements as to the expressions usable with
_Static_assert() (gcc tolerates static const variables in otherwise
integer constant expressions). Roberto suggests that we'd better not
rely on such behavior. Drop the involved static const-s, using their
"expansions" in both of the prior use sites each. This then allows
dropping the short-circuiting of the check for clang.

Requested-by: Roberto Bagnara <roberto.bagnara@bugseng.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>

tools/libxenguest: Fix migration's debug option

The code has gone through many refactors, but the first refactor was the one
which broke it by inverting the check with respect to checkpointed streams.

Fixes: 7449fb36c6c8 ("migration/save: pass checkpointed_stream from libxl to libxc")
Reported-by: Olaf Hering <olaf@aepfle.de>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Olaf Hering <olaf@aepfle.de>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

tools/libxenguest: Fix max_extd_leaf calculation for legacy restore

0x1c is lower than any value which will actually be observed in
p->extd.max_leaf, but higher than the logical 9 leaves worth of extended data
on Intel systems, causing x86_cpuid_copy_to_buffer() to fail with -ENOBUFS.

Correct the calculation.

The problem was first noticed in c/s 34990446ca9 "libxl: don't ignore the
return value from xc_cpuid_apply_policy" but introduced earlier.

Fixes: 111c8c33a8a1 ("x86/cpuid: do not expand max leaves on restore")
Reported-by: Olaf Hering <olaf@aepfle.de>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

tools: use integer division in convert-legacy-stream

A single slash gives a float, a double slash gives an int.

bitmap = unpack_exact("Q" * ((max_id/64) + 1))
TypeError: can't multiply sequence by non-int of type 'float'

Use future division to remain compatible with python 2.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
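
A small sketch of the difference described in the commit above. The helper name bitmap_format() is hypothetical and struct.calcsize() stands in for the script's unpack_exact(). The sketch uses "//" directly and omits the "from __future__ import division" line, which is a simplification:

```python
import struct


def bitmap_format(max_id):
    # "//" keeps the repeat count an int on both Python 2 and Python 3;
    # "/" would produce a float on Python 3 and break the multiplication.
    return "Q" * ((max_id // 64) + 1)


fmt = bitmap_format(130)
# "=" selects standard sizes, so each "Q" is 8 bytes on every platform.
print(fmt, struct.calcsize("=" + fmt))
```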

tools: fix comment typo in libxl__cpuid_legacy

Replace emualted with emulated.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

libxl: Fix QEMU cmdline for scsi device

Usage of 'scsi-disk' device is deprecated and removed from QEMU,
instead we need to use 'scsi-hd' for hard drives.
See QEMU 879be3af49 (hw/scsi: remove 'scsi-disk' device)

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Reviewed-by: Jason Andryuk <jandryuk@gmail.com>

libxl: Replace short-form boolean for QEMU's -vnc

f3f778c81769 forgot one boolean parameter.

Fixes: f3f778c81769 ("libxl: Replace QEMU's command line short-form boolean option")
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Reviewed-by: Jason Andryuk <jandryuk@gmail.com>

Config.mk: re-pin OVMF changeset and unpin qemu-xen

The qemu-xen tree has an osstest gate and doesn't need to be pinned.

On the other hand, OVMF's xen repository doesn't have a gate and needs
to be pinned. The "master" branch now corresponds to the tag
"edk2-stable202105", so pin to that commit.

Fixes: a04509d34d72 ("Branching: Update version files etc. for newly unstable")
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Acked-by: Ian Jackson <iwj@xenproject.org>

AMD/IOMMU: re-work locking around sending of commands

It appears unhelpful to me for flush_command_buffer() to block all
progress elsewhere for the given IOMMU by holding its lock while waiting
for command completion. There's no real need for callers of that
function or of send_iommu_command() to hold the lock. Contain all
command sending related locking to the latter function.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>

AMD/IOMMU: redo awaiting of command completion

The present abuse of the completion interrupt does not only stand in the
way of, down the road, using it for its actual purpose, but also
requires holding the IOMMU lock while waiting for command completion,
limiting parallelism and keeping interrupts off for non-negligible
periods of time. Have the IOMMU do an ordinary memory write instead of
signaling an otherwise disabled interrupt (by just updating a status
register bit).

Since IOMMU_COMP_WAIT_I_FLAG_SHIFT is now unused and
IOMMU_COMP_WAIT_[FS]_FLAG_SHIFT already were, drop all three of them
while at it.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>

x86emul: avoid using _PRE_EFLAGS() in a few cases

The macro expanding to quite a few insns, replace its use by simply
clearing the status flags when the to be executed insn doesn't depend
on their initial state, in cases where this is easily possible. (There
are more cases where the uses are hidden inside macros, and where some
of the users of the macros want guest flags put in place before running
the insn, i.e. the macros can't be updated as easily.)

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

x86/mm: pull a sanity check earlier in xenmem_add_to_physmap_one()

We should try to limit the failure reasons after we've started making
changes.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

x86/paging: deal with log-dirty stats overflow

While the precise values are unlikely to be of interest once they exceed 4
billion (allowing us to leave alone the domctl struct), we still
shouldn't wrap or truncate the actual values. It is in particular
problematic if the truncated values were zero (causing libxenguest to
skip an iteration altogether) or a very small value (leading to
premature exiting of the pre-copy phase).

Change the internal fields to unsigned long, and suitably saturate for
copying to guest context.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
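
To illustrate why the commit above saturates rather than truncates, here is a sketch that models a 32-bit guest-visible field in Python. The function names and the 32-bit width are assumptions drawn from the "4 billion" remark, not the hypervisor's actual code:

```python
UINT32_MAX = 0xFFFFFFFF


def truncate32(count):
    # What a plain narrowing copy does: keep only the low 32 bits.
    return count & UINT32_MAX


def saturate32(count):
    # What the commit asks for: clamp at the largest representable value.
    return min(count, UINT32_MAX)


wide = 1 << 32  # one more than a 32-bit field can hold
print(truncate32(wide), saturate32(wide))
```

A truncated count of zero or of a few pages is exactly the misleading input the commit message warns about; the clamped value at least stays large.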

fully replace mfn_to_gmfn()

Convert the two remaining uses as well as Arm's stub to the properly
named and type-safe mfn_to_gfn(), dropping x86's definition (where we
already have mfn_to_gfn()).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Julien Grall <julien@xen.org>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

Replace FSF street address with canonical URL (again)

As recommended in http://www.gnu.org/licenses/gpl-howto.en.html.

Exactly as per changeset 443701ef0c7ff3 - Some errors have crept back in in
the past 6 years.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Julien Grall <jgrall@amazon.com>

xen/arm: smmuv1: Set privileged attr to 'default'

Backport commit e19898077cfb642fe151ba22981e795c74d9e114
"iommu/arm-smmu: Set privileged attribute to 'default' instead of
'unprivileged'"

Original commit message:
    Currently the driver sets all the device transactions privileges
    to UNPRIVILEGED, but there are cases where the iommu masters wants
    to isolate privileged supervisor and unprivileged user.
    So don't override the privileged setting to unprivileged, instead
    set it to default as incoming and let it be controlled by the
    pagetable settings.

Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Sricharan R <sricharan@codeaurora.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Rahul Singh <rahul.singh@arm.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
Tested-by: Stefano Stabellini <sstabellini@kernel.org>

xen/arm: smmuv1: Fixed stream matching register allocation

SMR allocation should be based on the number of supported stream
matching register for each SMMU device.

Issue introduced by commit 5e08586afbb90b2e2d56c175c07db77a4afa873c
when backporting the patches from Linux to Xen to fix the stream match
conflict issue when two devices have the same stream-id.

Acked-by: Stefano Stabellini <sstabellini@kernel.org>
Tested-by: Stefano Stabellini <sstabellini@kernel.org>
Signed-off-by: Rahul Singh <rahul.singh@arm.com>

IOMMU/PCI: don't let domain cleanup continue when device de-assignment failed

Failure here could in principle mean the device may still be issuing DMA
requests, which would continue to be translated by the page tables the
device entry currently points at. With this we cannot allow the
subsequent cleanup step of freeing the page tables to occur, to prevent
use-after-free issues. We would need to accept, for the time being, that
in such a case the remaining domain resources will all be leaked, and
the domain will continue to exist as a zombie.

However, with flushes no longer timing out (and with proper timeout
detection for device I/O TLB flushing yet to be implemented), there's no
way anymore for failures to occur, except due to bugs elsewhere. Hence
the change here is merely a "just in case" one.

In order to continue the loop in spite of an error, we can't use
pci_get_pdev_by_domain() anymore. I have no idea why it was used here in
the first place, instead of the cheaper list iteration.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>

libxencall: Bump SONAME following new functionality

Fixes: bef64f2c00 ("libxencall: introduce variant of xencall2() returning long")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Ian Jackson <iwj@xenproject.org>

tools/xenstored: Correctly read the requests header from the stream

Commit c0fe360f42 ("tools/xenstored: Extend restore code to handle
multiple input buffer") extended read_buffered_state() to support
multiple input buffers. Unfortunately, the commit didn't go far
enough and still used sc->data (start of the buffers) for retrieving
the header. This would lead to reading the wrong headers for the second
and subsequent commands.

Use data in place of sc->data as the source of the memcpy()s.

Fixes: c0fe360f42 ("tools/xenstored: Extend restore code to handle multiple input buffer")
Reported-by: Raphael Ning <raphning@amazon.com>
Signed-off-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
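
The bug class fixed by the commit above can be sketched as follows. The two-field header layout and the name parse_requests() are hypothetical and much simpler than the real stream format; the point is only that each header is read at the moving cursor, not at the start of the buffer:

```python
import struct

# Hypothetical header: (type, payload length), two little-endian uint32s.
HDR = struct.Struct("<II")


def parse_requests(buf):
    requests = []
    offset = 0  # plays the role of the moving "data" cursor
    while offset + HDR.size <= len(buf):
        # Read the header at the cursor, not at the start of the buffer.
        msg_type, length = HDR.unpack_from(buf, offset)
        start = offset + HDR.size
        requests.append((msg_type, buf[start:start + length]))
        offset = start + length
    return requests


stream = HDR.pack(1, 3) + b"abc" + HDR.pack(2, 2) + b"xy"
print(parse_requests(stream))
```

Reading every header from offset 0 instead would report the first request's type and length for each later request as well.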

tools/xenstored: Remove redundant check in socket_can_process()

Commit 3adfb50315d9 ("tools/xenstored: Introduce a wrapper for
conn->funcs->can_{read, write}") consolidated the check
!conn->is_ignored in two new wrappers.

This means the check in socket_can_process() is now redundant. In
fact it should have been removed in the original commit (as was done
for the domain helpers).

Reported-by: Raphael Ning <raphning@amazon.com>
Signed-off-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Juergen Gross <jgross@suse.com>

libxc: make xc_domain_maximum_gpfn() endianness-agnostic

libxc generally uses uint32_t to represent domain IDs. This is fine as
long as addresses of such variables aren't taken, to then pass into
hypercalls: To the hypervisor, a domain ID is a 16-bit value. Introduce
a wrapper struct to deal with the issue. (On architectures with
arguments passed in registers, an intermediate variable would have been
created by the compiler already anyway, just one of the wrong type.)

The public interface change is both source and binary compatible for
the architectures we currently support.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Ian Jackson <iwj@xenproject.org>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
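
The endianness hazard described in the commit above can be demonstrated with a short sketch. It models memory with struct.pack(); the helper name read_u16_at_address() and the domain ID 5 are invented for illustration:

```python
import struct


def read_u16_at_address(value, byte_order):
    # Lay out a 32-bit variable in memory, then read only its first two
    # bytes, as a consumer expecting a 16-bit field at that address would.
    raw = struct.pack(byte_order + "I", value)
    return struct.unpack(byte_order + "H", raw[:2])[0]


print(read_u16_at_address(5, "<"))  # little-endian layout
print(read_u16_at_address(5, ">"))  # big-endian layout
```

On a little-endian layout the low half happens to sit first, which is why the mismatch goes unnoticed there.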

libxencall: drop bogus mentioning of xencall6()

There's no xencall6(), so the version script also shouldn't mention it.
If such a function would ever appear, it shouldn't land in version 1.0.

No change to the generated binary, nor abi-dumper's view of the object.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Ian Jackson <iwj@xenproject.org>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

libxc: use multicall for memory-op on Linux (and Solaris)

Some sub-functions, XENMEM_maximum_gpfn and XENMEM_maximum_ram_page in
particular, can return values requiring more than 31 bits to represent.
Hence we cannot issue the hypercall directly when the return value of
ioctl() is used to propagate this value. This is the case for Linux
and Solaris (and hence needs changing), while the BSDs avoid using the
return value for dual purposes altogether, and MiniOS already wraps all
hypercalls in a multicall.

Suggested-by: Jürgen Groß <jgross@suse.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Ian Jackson <iwj@xenproject.org>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

libxencall: introduce variant of xencall2() returning long

Some hypercalls, memory-op in particular, can return values requiring
more than 31 bits to represent. Hence the underlying layers need to make
sure they won't truncate such values.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Ian Jackson <iwj@xenproject.org>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
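
A sketch of the truncation that the commit above guards against, using fixed-width ctypes integers to stand in for C's int and a 64-bit long. That long is 64 bits wide is an assumption about the platform, and the function names are hypothetical:

```python
import ctypes


def as_int(value):
    # Model returning the value through a 32-bit signed int.
    return ctypes.c_int32(value).value


def as_long(value):
    # Model returning the value through a 64-bit signed long.
    return ctypes.c_int64(value).value


big = 1 << 31  # smallest value that needs more than 31 bits
print(as_int(big), as_long(big))
```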

libxencall: osdep_hypercall() should return long

Some hypercalls, memory-op in particular, can return values requiring
more than 31 bits to represent. Hence the underlying layers need to make
sure they won't truncate such values. (Note that for Solaris the
function also gets renamed, to match the other OSes.)

Due to them merely propagating ioctl()'s return value, this change is
benign on Linux and Solaris. IOW there's an actual effect here only for
the BSDs and MiniOS, but even then further adjustments are needed at the
xencall<N>() level.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Ian Jackson <iwj@xenproject.org>

x86/HVM: wire up multicalls

To be able to use them from, in particular, the tool stack, they need to
be supported for all guest types. Note that xc_resource_op() already
does, so would not work without this on PVH Dom0.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Begrudgingly acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Ian Jackson <iwj@xenproject.org>

VT-d: drop/move a few QI related constants

Replace uses of QINVAL_ENTRY_ORDER and QINVAL_INDEX_SHIFT, such that
the constants can be dropped. Move the remaining QINVAL_* ones to the
single source file using them.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>

VT-d: centralize mapping of QI entries

Introduce a helper function to reduce redundancy. Take the opportunity
to express the logic without using the somewhat odd QINVAL_ENTRY_ORDER.
Also take the opportunity to uniformly unmap after updating queue tail
and dropping the lock (like was done so far only by
queue_invalidate_context_sync()).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>

VT-d: don't lose errors when flushing TLBs on multiple IOMMUs

While no longer an immediate problem with flushes no longer timing out,
errors (if any) get properly reported by iommu_flush_iotlb_{dsi,psi}().
Overwriting such an error with, perhaps, a success indicator received
from another IOMMU will misguide callers. Record the first error, but
don't bail from the loop (such that further necessary invalidation gets
carried out on a best effort basis).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
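
The "record the first error, but keep going" pattern from the commit above, as a language-neutral sketch in Python. flush_all(), the fake per-IOMMU results and the error numbers are all invented; only the control flow mirrors the description:

```python
def flush_all(iommus, flush_one):
    rc = 0
    for iommu in iommus:
        ret = flush_one(iommu)
        # Keep the first error; never overwrite it with a later result,
        # and do not leave the loop, so every IOMMU still gets flushed.
        if ret and not rc:
            rc = ret
    return rc


# Invented stand-in: units 1 and 2 fail, the others succeed.
results = {1: -5, 2: -7}
seen = []


def fake_flush(iommu):
    seen.append(iommu)
    return results.get(iommu, 0)


print(flush_all([0, 1, 2, 3], fake_flush), seen)
```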

VT-d: clear_fault_bits() should clear all fault bits

If there is any way for one fault to be left set in the recording
registers, there's no reason there couldn't also be multiple ones. If
PPF is set (being the OR of all F fields), simply loop over the entire
range of fault recording registers, clearing F everywhere.

Since PPF is a r/o bit, also remove it from DMA_FSTS_FAULTS (arguably
the constant's name is ambiguous as well).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>

VT-d: adjust domid map updating when unmapping context

When an earlier error occurred, cleaning up the domid mapping data is
wrong, as references likely still exist. The only exception to this is
when the actual unmapping worked, but some flush failed (supposedly
impossible after XSA-373). The guest will get crashed in such a case
though, so add fallback cleanup to domain destruction to cover this
case. This in turn makes it desirable to silence the dprintk() in
domain_iommu_domid().

Note that no error will be returned anymore when the lookup fails - in
the common case lookup failure would already have caused
domain_context_unmap_one() to fail, yet even from a more general
perspective it doesn't look right to fail domain_context_unmap() in such
a case when this was the last device, but not when any earlier unmap was
otherwise successful.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>

VT-d: undo device mappings upon error

When
- flushes (supposedly not possible anymore after XSA-373),
- secondary mappings for legacy PCI devices behind bridges,
- secondary mappings for chipset quirks, or
- find_upstream_bridge() invocations
fail, the successfully established device mappings should not be left
around.

Further, when (parts of) unmapping fail, simply returning an error is
typically not enough. Crash the domain instead in such cases, arranging
for domain cleanup to continue in a best effort manner despite such
failures.

Finally make domain_context_unmap()'s error behavior consistent in the
legacy PCI device case: Don't bail from the function in one special
case, but always just exit the switch statement.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>

tools/xenstored: Don't crash xenstored when Live-Update is cancelled

As Live-Update is asynchronous, it is possible to receive a request to
cancel it (either on the same connection or from a different one).

Currently, this will crash xenstored because do_lu_start() assumes
lu_status will be valid. This is not the case when Live-Update has been
cancelled. This will result in dereferencing a NULL pointer and
crashing Xenstored.

Rework do_lu_start() to check if lu_status is NULL and return an
error in this case.

Fixes: af216a99fb ("tools/xenstore: add the basic framework for doing the live update")
Signed-off-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Luca Fancellu <luca.fancellu@arm.com>
Reviewed-by: Juergen Gross <jgross@suse.com>

tools/xenstored: Delay new transaction while Live-Update is pending

At the moment, Live-Update will, by default, not proceed if there are
in-flight transactions. It is possible to force it by passing -F but this
will break any connection with in-flight transactions.

There are PV drivers out there that may never terminate some transactions.
On a host running such a guest, we would need to use -F. Unfortunately, this
also risks breaking well-behaved guests (and even dom0) because
Live-Update will happen as soon as the timeout is hit.

Ideally, we would want to preserve transactions but this requires
some work and a lot of testing to be able to use it in production.

As a stop gap, we want to limit the damage of -F. This patch will delay
any transactions that are started after Live-Update has been requested.

If the request cannot be delayed, the connection will be stalled to
avoid losing requests.

If the connection already has a pending transaction before Live-Update,
then new transactions will not be delayed. This is to avoid stalling the
connection.

With this stop gap in place, domains with long running transactions will
still break when using -F, but other domains which start a transaction
in the middle of Live-Update will continue to work.

Signed-off-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Luca Fancellu <luca.fancellu@arm.com>
Reviewed-by: Juergen Gross <jgross@suse.com>

tools/xenstored: Dump delayed requests

Currently, only Live-Update requests can be delayed. In a follow-up,
we will want to delay more requests (e.g. transaction start).
Therefore we want to preserve delayed requests across Live-Update.

Delayed requests are just complete "in" buffers. So the code is
refactored to allow sharing the code that dumps an "in" buffer.

Signed-off-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Luca Fancellu <luca.fancellu@arm.com>
Reviewed-by: Juergen Gross <jgross@suse.com>

maintainers: adding new reviewer for xsm

Would like to add myself as a reviewer for XSM.

Signed-off-by: Daniel P. Smith <dpsmith@apertussolutions.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

iommu/arm: ipmmu-vmsa: Add compatible for Renesas R-Car M3-W+ SoC

The "renesas,r8a77961" string identifies M3-W+ (aka M3-W ES3.0)
instead of "renesas,r8a7796" since Linux commit:
"9c9f7891093b02eb64ca4e1c7ab776a4296c058f soc: renesas: Identify R-Car M3-W+".
Add new compatible to the Xen driver.

Signed-off-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Acked-by: Julien Grall <jgrall@amazon.com>

tools/xenstored: Extend restore code to handle multiple input buffer

Currently, the restore code assumes the stream will contain at
most one in-flight request per connection. In follow-up changes, we
will want to transfer multiple in-flight requests.

The function read_state_buffered() is now extended to restore multiple
in-flight requests. Complete requests will be queued as delayed
requests; if there is a partial request (only the last one can be) then
it will be used as the current in-flight request.

Note that we want to bypass the quota check for delayed requests as
the new Xenstore may have a lower limit.

Lastly, there is no need to change the specification as there was
no restriction on the number of in-flight requests preserved.

Signed-off-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Luca Fancellu <luca.fancellu@arm.com>
Reviewed-by: Juergen Gross <jgross@suse.com>

tools/xenstored: delay_request: don't assume conn->in == in

delay_request() is currently assuming that the request delayed is
always conn->in. This is currently correct, but it invites
a latent bug as the function allows the caller to specify any request.

To prevent any future surprise, check if the request delayed is the
current one.

Fixes: c5ca1404b4 ("tools/xenstore: add support for delaying execution of a xenstore request")
Signed-off-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Luca Fancellu <luca.fancellu@arm.com>
Reviewed-by: Juergen Gross <jgross@suse.com>

tools/xenstored: Introduce a wrapper for conn->funcs->can_{read, write}

Currently, the callbacks can_read and can_write are called directly. This
doesn't allow us to add generic checks and therefore requires duplication.

At the moment, one check that would benefit from being common is whether the
connection should be ignored. The position is slightly different between
domain and socket because for the latter we want to check the state of
the file descriptor first.

In follow-up patches, there will be more potential generic checks.

This patch provides wrappers to read/write a connection and moves
the check ->is_ignored after the callback for everyone.

This also requires replacing the direct calls to domain_can_read()
and domain_can_write() with the new wrappers. At the same time,
both functions can now be static. Note that the implementations need
to be moved earlier in the file xenstored_domain.c to avoid
declaring the prototype.

Signed-off-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Luca Fancellu <luca.fancellu@arm.com>
Reviewed-by: Juergen Gross <jgross@suse.com>

tools/xenstored: xenstored_core.h should include fcntl.h

xenstored_core.h will consider Live-Update to be unsupported if
O_CLOEXEC doesn't exist. However, the header doesn't include the one
defining O_CLOEXEC (i.e. fcntl.h). This means that depending on
the headers included, some source files will think Live-Update is not
supported.

I am not aware of any issue with the existing code. Therefore this is just
a latent bug so far.

Prevent any potential issue by including fcntl.h in xenstored_core.h.

Fixes: cd831ee438 ("tools/xenstore: handle CLOEXEC flag for local files and pipes")
Signed-off-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Luca Fancellu <luca.fancellu@arm.com>
Reviewed-by: Juergen Gross <jgross@suse.com>

tools/xenstored: Limit the number of requests a connection can delay

Currently, only Live-Update requests can be delayed. Such a request can only
be performed by a privileged connection (e.g. dom0). So it is fine to
have no limits.

In a follow-up patch we will want to delay requests for unprivileged
connections as well. So it is best to apply a limit.

For now and for simplicity, only a single request can be delayed
for a given unprivileged connection.

Take the opportunity to tweak the prototype and provide a way to
bypass the quota check. This would be useful when the function
is called from the restore code.

Signed-off-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Luca Fancellu <luca.fancellu@arm.com>
Reviewed-by: Juergen Gross <jgross@suse.com>

tools/xenstore: Don't assume conn->in points to the LU request

call_delayed() is currently assuming that conn->in is NULL when
handling a delayed request. However, the connection is not paused.
Therefore new requests can be processed and conn->in may be non-NULL
if we have only received a partial request.

Furthermore, as we overwrite conn->in, the current partial request
will not be transferred. This will corrupt the connection.

Rather than updating conn->in, stash the LU request in lu_status and
let each callback for a delayed request update conn->in when
necessary.

To keep a sane interface, the code to write the "OK" response to the
LU request is moved to xenstored_core.c.

Fixes: c5ca1404b4 ("tools/xenstore: add support for delaying execution of a xenstore request")
Fixes: ed6eebf17d ("tools/xenstore: dump the xenstore state for live update")
Signed-off-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Juergen Gross <jgross@suse.com>

tools/xenstored: Introduce lu_get_connection() and use it

At the moment, dump_state_buffered_data() takes two connections
as parameters (one is the connection to dump, the other is the
connection used to request LU). The naming (c vs conn) doesn't help to
distinguish them and this has already led to several mistakes
while modifying the function.

To remove the confusion, introduce a helper lu_get_connection() that
will return the connection used to request LU and use it
in place of the existing parameter.

Signed-off-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Luca Fancellu <luca.fancellu@arm.com>
Reviewed-by: Juergen Gross <jgross@suse.com>

MAINTAINERS: Add myself as reviewers for tools/xenstore

I would like to help reviewing Xenstored patches. It is more convenient
to find them if I am CCed.

Signed-off-by: Julien Grall <julien@xen.org>
Acked-by: Juergen Gross <jgross@suse.com>

Revert "tools/firmware/ovmf: Use OvmfXen platform file is exist"

This reverts commit aad7b5c11d51d57659978e04702ac970906894e8.

The change from OvmfX64 to OvmfXen causes a change in behaviour, whereby
OvmfXen maps its shared info page at the top of address space.  When trying to
migrate such a domain, XENMEM_maximum_gpfn returns a very large value.  This
has uncovered multiple issues:

1) The userspace hypercall wrappers truncate all return values to int on
    Linux and Solaris, even on 64bit.  This needs fixing in libxenctrl.
2) 32bit toolstacks can't migrate any domain with RAM above the 2^40 mark,
    because of virtual address constraints.  This needs fixing in OVMF.
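
As an illustration of issue 1, the following Python sketch mimics what
happens when a large return value is squeezed into a C int. It assumes a
32-bit int (as on 64-bit Linux and Solaris); the sample frame numbers are
made up for the example and are not taken from a real domain.

```python
def truncate_to_int32(value):
    """Return `value` as seen after truncation to a signed 32-bit C int."""
    low = value & 0xFFFFFFFF
    # Reinterpret the top bit as the sign bit, as two's complement does.
    return low - (1 << 32) if low & 0x80000000 else low


# Hypothetical maximum gpfn for a page mapped near the top of a 48-bit
# guest physical address space with 4 KiB pages.
max_gpfn = (1 << 36) - 1

print(truncate_to_int32(max_gpfn))       # looks like an error code
print(truncate_to_int32(0x1_0000_1234))  # silently loses the high bits
```

A caller that only sees the truncated value cannot tell a large frame
number from a failure or from a much smaller frame number.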

Fixes for both of these aren't completely trivial.  Revert the change to
unblock staging in the meantime.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Anthony PERARD <anthony.perard@citrix.com>

golang/xenlight: do not negate ret when converting to Error

Commit 871e51d2d4 changed the sign on the xenlight error types (making
the values negative, same as the C-generated constants), but failed to
remove the code changing the sign before casting to Error(). This
results in error strings like "libxl error: <x>", rather than the
correct message. Fix all occurrences of this by running:

gofmt -w -r 'Error(-ret) -> Error(ret)' xenlight.go

from tools/golang/xenlight.
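
The effect of the stray negation can be sketched in Python (the package
itself is Go). The table below is a hypothetical two-entry subset and the
message strings are illustrative only; the point is that the keys are
negative, so a negated code misses the table and hits the fallback.

```python
# Illustrative subset of error codes; values are negative, like the C constants.
ERROR_NONSPECIFIC = -1
ERROR_VERSION = -2

ERROR_STRINGS = {
    ERROR_NONSPECIFIC: "Non-specific error",
    ERROR_VERSION: "Wrong version",
}


def error_string(code):
    """Look up a message, falling back to a generic string for unknown codes."""
    return ERROR_STRINGS.get(code, "libxl error: %d" % code)


ret = ERROR_VERSION        # what a failing libxl call would return
print(error_string(-ret))  # sign flipped: falls through to the generic string
print(error_string(ret))   # sign kept: finds the specific message
```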

Signed-off-by: Nick Rosbrook <rosbrookn@ainfosec.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>

golang/xenlight: add SendTrigger wrapper

Add a wrapper around libxl_send_trigger.

Signed-off-by: Nick Rosbrook <rosbrookn@ainfosec.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>

golang/xenlight: add DomainDestroy wrapper

Add a wrapper around libxl_domain_destroy.

Signed-off-by: Nick Rosbrook <rosbrookn@ainfosec.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>

golang/xenlight: rename Ctx receivers to ctx

As a matter of style, it is strange to see capitalized receiver names,
due to the significance of capitalized symbols in Go (although there is
in fact nothing special about a capitalized receiver name). Fix this in
xenlight.go by running:

gofmt -w -r 'Ctx -> ctx' xenlight.go

from tools/golang/xenlight. There is no functional change.

Signed-off-by: Nick Rosbrook <rosbrookn@ainfosec.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>

golang/xenlight: use struct pointers in keyed union fields

Currently, when marshaling Go types with keyed union fields, we assign the
value of the struct (e.g. DomainBuildInfoTypeUnionHvm) which implements the
interface of the keyed union field (e.g. DomainBuildInfoTypeUnion).
As-is, this means that if a populated DomainBuildInfo is marshaled to
e.g. JSON, unmarshaling back to DomainBuildInfo will fail.

When encoding/json is unmarshaling data into a Go type, and
encounters a JSON object, it basically can either unmarshal the data into
an empty interface, a map, or a struct. It cannot, however, unmarshal data
into an interface with at least one method defined on it (e.g.
DomainBuildInfoTypeUnion). Before this check is done, however, the
decoder will check if the Go type is a pointer, and dereference it if
so. It will then use the type of this value as the "target" type.

This means that if the TypeUnion field is populated with a
DomainBuildInfoTypeUnion, the decoder will see a non-empty interface and
fail. If the TypeUnion field is populated with a
*DomainBuildInfoTypeUnionHvm, it dereferences the pointer and sees a
struct instead, allowing decoding to continue normally.

Since there does not appear to be a strict need for NOT using pointers
in these fields, update code generation to set keyed union fields to
pointers of their implementing structs.

Signed-off-by: Nick Rosbrook <rosbrookn@ainfosec.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>

golang/xenlight: export keyed union interface types

For structs that have a keyed union, e.g. DomainBuildInfo, the TypeUnion
field must be exported so that package users can get/set the fields
within. This means that users are aware of the existence of the
interface type used in those fields (see [1]), so it is awkward that the
interface itself is not exported. However, the single method within the
interface must remain unexported so that users cannot mistakenly "implement"
those interfaces.

Since there seems to be no reason to do otherwise, export the keyed
union interface types.

[1] https://pkg.go.dev/xenbits.xenproject.org/git-http/xen.git/tools/golang/xenlight?tab=doc#DeviceUsbdev

Signed-off-by: Nick Rosbrook <rosbrookn@ainfosec.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>

golang/xenlight: fix StringList toC conversion

The current implementation of StringList.toC does not correctly account
for how libxl_string_list is expected to be laid out in C, which is clear
when one looks at libxl_string_list_length in libxl.c. In particular,
StringList.toC does not account for the extra memory that should be
allocated for the "sentinel" entry. And, when using the "slice trick" to
create a slice that can address C memory, the unsafe.Pointer conversion
should be on a C.libxl_string_list, not *C.libxl_string_list.

Fix these problems by (1) allocating an extra slot in the slice used to
address the C memory, and explicitly setting the last entry to nil so the C
memory will be zeroed out, and (2) dereferencing csl in the
unsafe.Pointer conversion.
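
The sentinel layout can be illustrated with Python's ctypes. This is a
sketch of a NULL-terminated array of C strings, which is the layout the
message describes; the helper names are hypothetical and this is not the
Go code from xenlight.go.

```python
import ctypes


def to_c_string_list(strings):
    """Build a NULL-terminated array of C strings from Python strings."""
    # One extra slot holds the NULL sentinel that marks the end of the list.
    array = (ctypes.c_char_p * (len(strings) + 1))()
    for i, s in enumerate(strings):
        array[i] = s.encode("ascii")
    array[len(strings)] = None  # explicit sentinel entry
    return array


def c_string_list_length(array):
    """Count entries up to the sentinel, the way a C consumer would."""
    n = 0
    while array[n] is not None:
        n += 1
    return n


csl = to_c_string_list(["vif", "vbd"])
print(len(csl), c_string_list_length(csl))
```

Without the extra slot, a consumer that walks the array looking for the
terminator would read past the end of the allocation.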

Signed-off-by: Nick Rosbrook <rosbrookn@ainfosec.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>

golang/xenlight: update generated code

Re-generate code to reflect changes to libxl_types.idl from the
following commits:

0570d7f276 x86/msr: introduce an option for compatible MSR behavior selection
7e5cffcd1e viridian: allow vCPU hotplug for Windows VMs
9835246710 viridian: remove implicit limit of 64 VPs per partition

Signed-off-by: Nick Rosbrook <rosbrookn@ainfosec.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>

x86/ept: force WB cache attributes for grant and foreign maps

Force WB type for grants and foreign pages. Those are usually mapped
over unpopulated physical ranges in the p2m, and those ranges would
usually be UC in the MTRR state, which is unlikely to be the correct
cache attribute. It's also cumbersome (or even impossible) for the
guest to be setting the MTRR type for all those mappings as WB, as
MTRR ranges are finite.

Note that this is not an issue on AMD because WB cache attribute is
already set on grants and foreign mappings in the p2m and MTRR types
are ignored. Also on AMD Xen cannot force a cache attribute because of
the lack of ignore PAT equivalent, so the behavior here slightly
diverges between AMD and Intel (or EPT vs NPT/shadow).

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>

x86/mtrr: move epte_get_entry_emt to p2m-ept.c

This is an EPT specific function, so it shouldn't live in the generic
mtrr file. Such movement is also needed for future work that will
require passing a p2m_type_t parameter to epte_get_entry_emt, and
making that type visible to the mtrr users is cumbersome and
unneeded.

Moving epte_get_entry_emt out of mtrr.c requires making the helper to
get the MTRR type of an address from the mtrr state public. While
there rename the function to start with the mtrr prefix, like other
mtrr related functions.

While there fix some of the types of the function parameters.

No functional change intended.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>

tests: Introduce a TSX test

See the comment at the top of test-tsx.c for details.

This covers various complexities encountered while trying to address the
recent TSX deprecation on client parts.

A sample run on KabyLake with latest microcode and default tsx= looks like
this:

  root@host# ./test-tsx
  TSX tests
    Got 8 CPUs
  Testing MSR_TSX_FORCE_ABORT consistency
    CPU0 val 0x3
  Testing MSR_TSX_CTRL consistency
  Testing RTM behaviour
    Got Abort
  Testing PV default/max policies
    Max: RTM 1, HLE 1, TSX_FORCE_ABORT 0, RTM_ALWAYS_ABORT 0, TSX_CTRL 0
    Def: RTM 0, HLE 0, TSX_FORCE_ABORT 0, RTM_ALWAYS_ABORT 0, TSX_CTRL 0
  Testing HVM default/max policies
    Max: RTM 1, HLE 1, TSX_FORCE_ABORT 0, RTM_ALWAYS_ABORT 0, TSX_CTRL 0
    Def: RTM 0, HLE 0, TSX_FORCE_ABORT 0, RTM_ALWAYS_ABORT 0, TSX_CTRL 0
  Testing PV guest
    Created d7
    Cur: RTM 0, HLE 0, TSX_FORCE_ABORT 0, RTM_ALWAYS_ABORT 0, TSX_CTRL 0
    Cur: RTM 1, HLE 1, TSX_FORCE_ABORT 0, RTM_ALWAYS_ABORT 0, TSX_CTRL 0
  Testing HVM guest
    Created d8
    Cur: RTM 0, HLE 0, TSX_FORCE_ABORT 0, RTM_ALWAYS_ABORT 0, TSX_CTRL 0
    Cur: RTM 1, HLE 1, TSX_FORCE_ABORT 0, RTM_ALWAYS_ABORT 0, TSX_CTRL 0

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>

libs/guest: Move struct xc_cpu_policy into xg_private.h

... so tests can peek at the internals, without the structure being generally
available to users of the library.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>

x86/msr: Expose MSR_ARCH_CAPS in the raw and host policies

MSR_ARCH_CAPS is still not supported for guests yet (other than the hardware
domain), until the toolstack learns how to construct an MSR policy.

However, we want access to the host ARCH_CAPS_TSX_CTRL value in particular for
testing purposes.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/platform: Permit reading the TSX control MSRs via XENPF_resource_op

We are going to want this to write some tests with.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/platform: Improve MSR permission handling for XENPF_resource_op

The logic to disallow writes to the TSC is out-of-place, and should be in
check_resource_access() rather than in resource_access().

Split the existing allow_access_msr() into two - msr_{read,write}_allowed() -
and move all permissions checks here.

Furthermore, guard access to MSR_IA32_CMT_{EVTSEL,CTR} to prohibit their use
on hardware which is lacking the QoS Monitoring feature. Introduce
cpu_has_pqe to help with the logic.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

libs/foreignmemory: Fix osdep_xenforeignmemory_map prototype

Commit cf8c4d3d13b8 made some preparation to one day have a
variable-length-array argument, but didn't declare the array in the
function prototype the same way as in the function definition. GCC 11
now complains about it.

Fixes: cf8c4d3d13b8 ("tools/libs/foreignmemory: pull array length argument to map forward")
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86: move .altinstr_replacement past _einittext

This section's contents do not represent part of actual hypervisor text,
so shouldn't be included in what is_kernel_inittext() or (while still
booting) is_active_kernel_text() report "true" for. Keep them in
.init.text though, as there's no real reason to have a separate section
for this in the final binary.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

x86/vpt: fully init timers before putting onto list

With pt_vcpu_lock() no longer acquiring the pt_migrate lock, parties
iterating the list and acting on the timers of the list entries will no
longer be kept from entering their loops by create_periodic_time()'s
holding of that lock. Therefore at least init_timer() needs calling
ahead of list insertion, but keep this and set_timer() together.

Fixes: 8113b02f0bf8 ("x86/vpt: do not take pt_migrate rwlock in some cases")
Reported-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>

libxl: Replace QMP command "change" by "blockdev-change-media"

The "change" command has been removed in QEMU 6.0. We can use
"blockdev-change-medium" instead.

Using `id` with "blockdev-change-medium" requires a change to the QEMU
command line, introduced by:
"libxl: Use -device for cd-rom drives"

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Reviewed-by: Jason Andryuk <jandryuk@gmail.com>

libxl: Use `id` with the "eject" QMP command

The `device` parameter has been deprecated since QEMU 2.8.

This requires changes to the command line introduced by:
"libxl: Use -device for cd-rom drives"

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Reviewed-by: Jason Andryuk <jandryuk@gmail.com>

libxl: Export libxl__qmp_ev_qemu_compare_version

We are going to want to check QEMU's version in other places where we
can use libxl__ev_qmp_send.

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Reviewed-by: Jason Andryuk <jandryuk@gmail.com>

libxl: Assert qmp_ev's state in qmp_ev_qemu_compare_version

We are supposed to read the version information only when qmp_ev is in
state "Connected" (which corresponds to state==qmp_state_connected);
assert it so that the function isn't used too early.

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Reviewed-by: Jason Andryuk <jandryuk@gmail.com>

libxl: Use -device for cd-rom drives

This allows setting an `id` on the device instead of only the drive. We
are going to need the `id` with the "eject" and
"blockdev-change-medium" QMP commands as using the `device` parameter on
those is deprecated. (`device` is the `id` of the `-drive` on the
command line).

We set the same `id` on both -device and -drive as QEMU doesn't
complain and we can then either do "eject id=$id" or "eject
device=$id".

Using "-drive + -device" instead of only "-drive" has been
available since at least QEMU 0.15, and seems to be the preferred way as it
separates the host part (-drive, which describes the disk image location
and format) from the guest part (-device, which describes the emulated
device). More information in qemu.git/docs/qdev-device-use.txt .
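
A minimal sketch of such a command line, written in Python for
illustration. The function name, the id value and the image path are
hypothetical, and the exact set of properties libxl passes is not stated
in this message; the sketch only shows the same `id` being placed on both
the host part and the guest part.

```python
def cdrom_args(dev_id, image, bus="ide.1", unit=0):
    """Return QEMU arguments for a cd-rom split into -drive and -device."""
    # Host part: where the image lives and how it is opened.
    drive = "if=none,id=%s,media=cdrom,readonly=on,file=%s" % (dev_id, image)
    # Guest part: the emulated device, referring to the drive by its id.
    device = "ide-cd,id=%s,drive=%s,bus=%s,unit=%d" % (dev_id, dev_id, bus, unit)
    return ["-drive", drive, "-device", device]


# Hypothetical id and path, for illustration only.
print(" ".join(cdrom_args("ide-5632", "/tmp/cd.iso")))
```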

Changing the command line during migration for the cdrom seems fine.
The documentation about migration in QEMU explains that the device
state ID is formed from a bus name and a device address, so the
first device on the second IDE bus keeps the same ID whether it is
written "-drive if=ide,index=2" or "-device
ide-cd,bus=ide.1,unit=0".
See qemu.git/docs/devel/migration.rst .

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Reviewed-by: Jason Andryuk <jandryuk@gmail.com>

libxl: Replace deprecated "cpu-add" QMP command by "device_add"

The command "cpu-add" for CPU hotplug is deprecated and has been
removed from QEMU 6.0 (April 2021). We need to add cpus with the
command "device_add" now.

In order to find out which parameters to pass to "device_add" we first
make a call to "query-hotpluggable-cpus", which lists the CPU drivers
and properties.

The algorithm to figure out which CPU to add, and by extension if any
CPU needs to be hotplugged, is in the function that adds the cpus.
Because of that, the command "query-hotpluggable-cpus" is always
called, even when not needed.

In case we are using a version of QEMU older than 2.7 (Sept 2016),
which doesn't have "query-hotpluggable-cpus", we fall back to using
"cpu-add".

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Reviewed-by: Jason Andryuk <jandryuk@gmail.com>

libxl: Replace QEMU's command line short-form boolean option

Short-form boolean options are deprecated in QEMU 6.0.
The upstream commit that deprecates them: ccd3b3b8112b ("qemu-option: warn
for short-form boolean options").
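
For illustration, the rewrite from short form to long form can be
sketched in Python. The helper is hypothetical and is not how libxl does
it (libxl simply emits the long form directly); it assumes the first
token is an implied value such as the backend name, and that no value
contains an escaped comma.

```python
def expand_short_booleans(optstr):
    """Rewrite 'flag' as 'flag=on' and 'noflag' as 'flag=off'."""
    head, *rest = optstr.split(",")
    out = [head]  # first token is left alone (e.g. the chardev backend)
    for tok in rest:
        if "=" in tok:
            out.append(tok)               # already in key=value form
        elif tok.startswith("no"):
            out.append(tok[2:] + "=off")  # short-form negative boolean
        else:
            out.append(tok + "=on")       # short-form positive boolean
    return ",".join(out)


# Hypothetical chardev option string, for illustration only.
print(expand_short_booleans("socket,id=libxl-cmd,path=/tmp/qmp,server,nowait"))
```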

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Reviewed-by: Jason Andryuk <jandryuk@gmail.com>

libxl: Replace deprecated QMP command by "query-cpus-fast"

We use the deprecated QMP command "query-cpus", which is removed in the
QEMU 6.0 release. There's a replacement, "query-cpus-fast",
which has been available since QEMU 2.12 (April 2018).

This patch tries the new command first and, when the command isn't
available, falls back to the deprecated one so libxl still works
with older QEMU versions.
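
The try-then-fall-back pattern can be sketched as below. This is Python
for illustration, not the libxl C code: `send` is a hypothetical callable
that issues one QMP command and raises CommandNotFound when QEMU reports
that error class.

```python
class CommandNotFound(Exception):
    """Stands in for a QMP reply with error class CommandNotFound."""


def query_cpus(send):
    """Prefer "query-cpus-fast"; use "query-cpus" only if it is missing."""
    try:
        return send("query-cpus-fast")
    except CommandNotFound:
        # Older QEMU: the replacement does not exist yet.
        return send("query-cpus")


def old_qemu(command):
    """Fake QEMU older than 2.12, used only to exercise the fallback."""
    if command == "query-cpus-fast":
        raise CommandNotFound(command)
    return [{"cpu-index": 0}]


print(query_cpus(old_qemu))
```

Only the CommandNotFound case triggers the fallback in this sketch; any
other failure of the new command is propagated to the caller.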

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Reviewed-by: Jason Andryuk <jandryuk@gmail.com>

Arm: avoid .init.data to be marked as executable

This confuses disassemblers, at the very least. Move
.altinstr_replacement to .init.text. The previously redundant ALIGN()
now gets converted to page alignment, such that the hypervisor mapping
won't have this as executable (it'll instead get mapped r/w, which I'm
told is intended to be adjusted at some point).

Note that for the actual patching logic's purposes this part of
.init.text _has_ to live after _einittext (or before _sinittext), or
else branch_insn_requires_update() would produce wrong results.

Also, to have .altinstr_replacement have consistent attributes in the
object files, add "x" to the one instance where it was missing.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Julien Grall <jgrall@amazon.com>

xen/arm32: avoid .rodata to be marked as executable

The section .proc.info lives in .rodata as it doesn't contain any
executable code. However, the section is still marked as executable;
as a consequence, .rodata is also marked executable.

Xen doesn't use the ELF permissions to decide the page-table mapping
permission. However, this will confuse disassemblers.

'#execinstr' is now removed from all the pushsection directives dealing
with .proc.info

Signed-off-by: Jan Beulich <jbeulich@suse.com>
[julieng: Rework the commit message]
Acked-by: Julien Grall <jgrall@amazon.com>

MAINTAINERS: add myself as tools/libs reviewer

I have touched most of the Xen libraries in the past, and there is a
clear lack of reviewer bandwidth in the tools area.

Add myself as a tools/libs reviewer for that reason.

Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Ian Jackson <iwj@xenproject.org>

revert "tools/libs/guest: fix max_pfn setting in map_p2m()"

The reasoning for commit 7bd8989ab77b6a ("tools/libs/guest: fix max_pfn
setting in map_p2m()") was wrong.

The max_pfn field in shared_info is misnamed, it has the semantics of
num_pfns, which is hidden at least partially in Linux, as the kernel is
(wrongly) treating it like the highest used pfn in some places.

So revert the above commit.

Fixes: 7bd8989ab77b6a ("tools/libs/guest: fix max_pfn setting in map_p2m()")
Signed-off-by: Juergen Gross <jgross@suse.com>

xen/grant-table: Simplify the update to the per-vCPU maptrack freelist

Since XSA-228 (commit 02cbeeb62075 "gnttab: split maptrack lock
to make it fulfill its purpose again"), v->maptrack_head,
v->maptrack_tail and the content of the freelist are accessed with
the lock v->maptrack_freelist_lock held.

Therefore it is not necessary to update the fields using cmpxchg()
and also read them atomically.

Note that there are two cases where v->maptrack_tail is accessed without
the lock. They both happen in get_maptrack_handle() when initializing
the free list of the current vCPU. Therefore there is no possible race.

The code is now reworked to remove any use of cmpxchg() and read_atomic()
when accessing the fields v->maptrack_{head, tail} as well as the
freelist.

Take the opportunity to add a comment on top of the lock definition
and explain what it protects.

Signed-off-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

Arm32: MSR to SPSR needs qualification

The Arm ARM's description of MSR (ARM DDI 0406C.d section B9.3.12)
doesn't even allow for plain "SPSR" here, and while gas accepts this, it
takes it to mean SPSR_cf. Yet surely all of SPSR wants updating on this
path, not just the lowest and highest 8 bits.
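
The field suffixes select byte lanes of the register: c is bits [7:0],
x is [15:8], s is [23:16] and f is [31:24]. The Python sketch below only
models the resulting write mask; the function names are hypothetical.

```python
# Byte lane selected by each field letter of an MSR to a PSR.
FIELD_BITS = {
    "c": 0x000000FF,  # control field
    "x": 0x0000FF00,  # extension field
    "s": 0x00FF0000,  # status field
    "f": 0xFF000000,  # flags field
}


def psr_write_mask(fields):
    """Return the bits of the PSR that an MSR with these fields updates."""
    mask = 0
    for letter in fields:
        mask |= FIELD_BITS[letter]
    return mask


def msr_spsr(old, new, fields):
    """Model the write: only bits covered by the field mask are replaced."""
    mask = psr_write_mask(fields)
    return (old & ~mask) | (new & mask)


print(hex(psr_write_mask("cf")))    # what plain "SPSR" is taken to mean
print(hex(psr_write_mask("cxsf")))  # all of SPSR
```

With only the c and f lanes selected, the middle 16 bits of the old value
survive the write.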

Fixes: dfcffb128be4 ("xen/arm32: SPSR_hyp/SPSR")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>

tools/libs/store: cleanup libxenstore interface

There are some internals in the libxenstore interface which should be
removed.

Move those functions into xs_lib.c and the related definitions into
xs_lib.h. Remove the functions from the mapfile. Add xs_lib.o to
xenstore_client as some of the internal functions are needed there.

Bump the libxenstore version to 4.0 as the change is incompatible.
Note that the removed functions should not result in any problem as
they ought to be used by xenstored or xenstore_client only.

Avoid an enum as part of a structure as the size of an enum is
compiler implementation dependent.

Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Ian Jackson <iwj@xenproject.org>

x86: please Clang in arch_set_info_guest()

Clang 10 reports

domain.c:1328:10: error: variable 'cr3_mfn' is used uninitialized whenever 'if' condition is false [-Werror,-Wsometimes-uninitialized]
    if ( !compat )
         ^~~~~~~
domain.c:1334:34: note: uninitialized use occurs here
    cr3_page = get_page_from_mfn(cr3_mfn, d);
                                 ^~~~~~~
domain.c:1328:5: note: remove the 'if' if its condition is always true
    if ( !compat )
    ^~~~~~~~~~~~~~
domain.c:1042:18: note: initialize the variable 'cr3_mfn' to silence this warning
    mfn_t cr3_mfn;
                 ^
                  = 0
domain.c:1189:14: error: variable 'fail' is used uninitialized whenever 'if' condition is false [-Werror,-Wsometimes-uninitialized]
        if ( !compat )
             ^~~~~~~
domain.c:1211:9: note: uninitialized use occurs here
        fail |= v->arch.pv.gdt_ents != c(gdt_ents);
        ^~~~
domain.c:1189:9: note: remove the 'if' if its condition is always true
        if ( !compat )
        ^~~~~~~~~~~~~~
domain.c:1187:18: note: initialize the variable 'fail' to silence this warning
        bool fail;
                 ^
                  = false

despite this being a build with -O2 in effect, and despite "compat"
being constant "false" when CONFIG_COMPAT (and hence CONFIG_PV32) is not
defined, as it gets set at the top of the function from the result of
is_pv_32bit_domain().

Re-arrange the two "offending" if()s such that when COMPAT=n the
respective variables will be seen as unconditionally initialized. The
original aim was to have the !compat cases first, though.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

xen/arm32: SPSR_hyp/SPSR

SPSR_hyp is not meant to be accessed from Hyp mode (EL2); accesses
trigger UNPREDICTABLE behaviour. Xen should read/write SPSR instead.
See: ARM DDI 0487D.b page G8-5993.

This fixes booting Xen/arm32 on QEMU.

Signed-off-by: Stefano Stabellini <stefano.stabellini@xilinx.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Edgar E. Iglesias <edgar.iglesias@xilinx.com>
Tested-by: Edgar E. Iglesias <edgar.iglesias@xilinx.com>

x86/tsx: Cope with TSX deprecation on SKL/KBL/CFL/WHL

The June 2021 microcode is formally de-featuring TSX on the older Skylake
client CPUs. The workaround from the March 2019 microcode is being dropped,
and replaced with additions to MSR_TSX_FORCE_ABORT to hide the HLE/RTM CPUID
bits.

With this microcode in place, TSX is disabled by default on these CPUs.
Backwards compatibility is provided in the same way as for TAA - RTM force
aborts, rather than suffering #UD, and the CPUID bits can be hidden to recover
performance.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

xen: add files needed for minimal riscv build

Add arch-specific makefiles and configs needed to build for
riscv. Also add a minimal head.S that is a simple infinite loop.
head.o can be built with

$ make XEN_TARGET_ARCH=riscv64 SUBSYSTEMS=xen -C xen tiny64_defconfig
$ make XEN_TARGET_ARCH=riscv64 SUBSYSTEMS=xen -C xen TARGET=riscv64/head.o

No other TARGET is supported at the moment.

Signed-off-by: Connor Davis <connojdavis@gmail.com>
Reviewed-by: Alistair Francis <alistair.francis@wdc.com>
Acked-by: Bobby Eshleman <bobbyeshleman@gmail.com>

xen/char: default HAS_NS16550 to y only for X86 and ARM

Defaulting to yes only for X86 and ARM reduces the requirements
for a minimal build when porting new architectures.

Signed-off-by: Connor Davis <connojdavis@gmail.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Alistair Francis <alistair.francis@wdc.com>

MAINTAINERS: adjust x86/mm/shadow maintainers

Better reflect reality: Andrew and Jan are active maintainers
and I review patches. Keep myself as a reviewer so I can help
with historical context &c.

Signed-off-by: Tim Deegan <tim@xen.org>
Acked-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

AMD/IOMMU: drop command completion timeout

First and foremost - such timeouts were not signaled to callers, making
them believe they're fine to e.g. free previously unmapped pages.

Mirror VT-d's behavior: A fixed number of loop iterations is not a
suitable way to detect timeouts in an environment (CPU and bus speeds)
independent manner anyway. Furthermore, leaving an in-progress operation
pending when it appears to take too long is problematic: If a command
completed later, the signaling of its completion may instead be
understood to signal a subsequently started command's completion.

Log excessively long processing times (with a progressive threshold) to
have some indication of problems in this area. Allow callers to specify
a non-default timeout bias for this logging, using the same values as
VT-d does, which in particular means a (by default) much larger value
for device IO TLB invalidation.
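
A rough sketch of waiting without a timeout while logging with a
progressive threshold, in Python for illustration. The doubling of the
threshold is an assumption (the message does not say how the threshold
progresses), and is_done, now and log are hypothetical callables standing
in for the completion check, the clock and the logger.

```python
def wait_for_completion(is_done, now, log, threshold):
    """Poll until the command completes; never give up, only log slowness."""
    start = now()
    while not is_done():
        elapsed = now() - start
        if elapsed > threshold:
            log("completion wait has taken %d time units so far" % elapsed)
            threshold *= 2  # assumed progression: double after each message
    return now() - start
```

Because the loop only exits on completion, a late completion can never be
mistaken for that of a subsequently started command, and the growing
threshold keeps the log from being flooded.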

This is part of XSA-373 / CVE-2021-28692.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>

AMD/IOMMU: wait for command slot to be available

No caller cared about send_iommu_command() indicating unavailability of
a slot. Hence if a sufficient number of prior commands timed out, we did
blindly assume that the requested command was submitted to the IOMMU
when really it wasn't. This could mean either a hanging system (waiting
for a command to complete that was never seen by the IOMMU) or blindly
propagating success back to callers, making them believe they're fine
to e.g. free previously unmapped pages.

Fold the three involved functions into one, add spin waiting for an
available slot along the lines of VT-d's qinval_next_index(), and as a
consequence drop all error indicator return types/values.
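
The idea of spinning for a free slot instead of returning an error can be
sketched as follows. This is a simplified Python model of a command ring,
not the AMD IOMMU driver: CommandRing and read_head are hypothetical, with
read_head standing in for reading the consumer's position from hardware.

```python
class CommandRing:
    """Ring buffer where one slot is kept empty to tell full from empty."""

    def __init__(self, size, read_head):
        self.size = size
        self.slots = [None] * size
        self.head = 0               # last known consumer position
        self.tail = 0               # next slot the producer will fill
        self.read_head = read_head  # callable returning the consumer position

    def send(self, cmd):
        """Queue `cmd`, spinning until a slot is free; no error is returned."""
        next_tail = (self.tail + 1) % self.size
        while next_tail == self.head:
            # Ring is full: re-read the consumer position and try again.
            self.head = self.read_head()
        self.slots[self.tail] = cmd
        self.tail = next_tail
```

Since send() cannot fail in this model, callers no longer need to check a
return value that, per the message, none of them looked at anyway.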

This is part of XSA-373 / CVE-2021-28692.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>

x86/spec-ctrl: Mitigate TAA after S3 resume

The user chosen setting for MSR_TSX_CTRL needs restoring after S3.

All APs get the correct setting via start_secondary(), but the BSP was missed
out.

This is XSA-377 / CVE-2021-28690.

Fixes: 8c4330818f6 ("x86/spec-ctrl: Mitigate the TSX Asynchronous Abort sidechannel")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/spec-ctrl: Protect against Speculative Code Store Bypass

Modern x86 processors have far-better-than-architecturally-guaranteed self
modifying code detection.  Typically, when a write hits an instruction in
flight, a Machine Clear occurs to flush stale content in the frontend and
backend.

For self modifying code, before a write which hits an instruction in flight
retires, the frontend can speculatively decode and execute the old instruction
stream.  Speculation of this form can suffer from type confusion in registers,
and potentially leak data.

Furthermore, updates are typically byte-wise, rather than atomic.  Depending
on timing, speculation can race ahead multiple times between individual
writes, and execute the transiently-malformed instruction stream.

Xen has stubs which are used in certain cases for emulation purposes.  Inhibit
speculation between updating the stub and executing it.

This is XSA-375 / CVE-2021-0089.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

VT-d: eliminate flush related timeouts

Leaving an in-progress operation pending when it appears to take too
long is problematic: If e.g. a QI command completed later, the write to
the "poll slot" may instead be understood to signal a subsequently
started command's completion. Also our accounting of the timeout period
was actually wrong: We included the time it took for the command to
actually make it to the front of the queue, which could be heavily
affected by guests other than the one for which the flush is being
performed.

Do away with all timeout detection on all flush related code paths.
Log excessively long processing times (with a progressive threshold) to
have some indication of problems in this area.

Additionally log (once) if qinval_next_index() didn't immediately find
an available slot. Together with the earlier change sizing the queue(s)
dynamically, we should now have a guarantee that with our fully
synchronous model any demand for slots can actually be satisfied.

This is part of XSA-373 / CVE-2021-28692.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>