dgit.raspbian.org Git

d/changelog: finish 4.14.2+25-gb6a8c4f72d-2

Signed-off-by: Hans van Kranenburg <hans@knorrie.org>

Add debian/README.Debian.security

Add a note about the end of upstream security support, since the date is
already known right now.

Install it into the xen-hypervisor-common package.

Signed-off-by: Hans van Kranenburg <hans@knorrie.org>

Declare fast forward / record previous work

[git-debrebase pseudomerge: stitch]

Commit patch queue (exported by git-debrebase)

[git-debrebase make-patches: export and commit patches]

debian/changelog: finish 4.14.2+25-gb6a8c4f72d-1

docs: set date to SOURCE_DATE_EPOCH if available

Use the solution described in [1] to replace the call to the 'date'
command with a version that uses SOURCE_DATE_EPOCH if available. This
is needed for reproducible builds.

[1] https://reproducible-builds.org/docs/source-date-epoch/

Signed-off-by: Maximilian Engelhardt <maxi@daemonizer.de>
[Hans van Kranenburg]
Note: this patch is submitted upstream but not committed yet. We
expect that it gets in. Otherwise, we don't wait and already have it
here because I want to have the reproducible build work completed.

docs: use predictable ordering in generated documentation

When the seq number is equal, sort by the title to get predictable
output ordering. This is useful for reproducible builds.

Signed-off-by: Maximilian Engelhardt <maxi@daemonizer.de>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit e18dadc5b709290b8038a1cacb52bc3b3b69cf21)

x86/EFI: don't insert timestamp when SOURCE_DATE_EPOCH is defined

By default a timestamp gets added to the xen efi binary. Unfortunately
ld doesn't seem to provide a way to set a custom date, like from
SOURCE_DATE_EPOCH, so set a zero value for the timestamp (option
--no-insert-timestamp) if SOURCE_DATE_EPOCH is defined. This makes
reproducible builds possible.

This is an alternative to the patch suggested in [1]. This patch only
omits the timestamp when SOURCE_DATE_EPOCH is defined.

[1] https://lists.xenproject.org/archives/html/xen-devel/2020-10/msg02161.html

Signed-off-by: Maximilian Engelhardt <maxi@daemonizer.de>
Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit ee41b5c450032ae7f2531e18cd0a73bf5fb48803)

xen: don't have timestamp inserted in config.gz

This is for improving reproducible builds.

Signed-off-by: Frédéric Pierret (fepitre) <frederic.pierret@qubes-os.org>
Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 5816d327e44ab37ae08730f4c54a80835998f31f)

fix spelling errors

Only spelling errors; no functional changes.

In docs/misc/dump-core-format.txt there are a few more instances of
'informations'. I'll leave that up to someone who can properly determine
how those sentences should be constructed.

Signed-off-by: Diederik de Haas <didi.debian@cknow.org>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit ba6e78f0db820fbeea4df41fde4655020ca05928)

xen/arm: traps: Don't panic when receiving an unknown debug trap

Even if debug trap are only meant for debugging purpose, it is quite
harsh to crash Xen if one of the trap sent by the guest is not handled.

So switch from a panic() to a printk().

Signed-off-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 957708c2d1ae25d7375abd5e5e70c3043d64f1f1)

xen/arm: acpi: add BAD_MADT_GICC_ENTRY() macro

Imported from Linux commit b6cfb277378ef831c0fa84bcff5049307294adc6:

    The BAD_MADT_ENTRY() macro is designed to work for all of the subtables
    of the MADT.  In the ACPI 5.1 version of the spec, the struct for the
    GICC subtable (struct acpi_madt_generic_interrupt) is 76 bytes long; in
    ACPI 6.0, the struct is 80 bytes long.  But, there is only one definition
    in ACPICA for this struct -- and that is the 6.0 version.  Hence, when
    BAD_MADT_ENTRY() compares the struct size to the length in the GICC
    subtable, it fails if 5.1 structs are in use, and there are systems in
    the wild that have them.

    This patch adds the BAD_MADT_GICC_ENTRY() that checks the GICC subtable
    only, accounting for the difference in specification versions that are
    possible.  The BAD_MADT_ENTRY() will continue to work as is for all other
    MADT subtables.

    This code is being added to an arm64 header file since that is currently
    the only architecture using the GICC subtable of the MADT.  As a GIC is
    specific to ARM, it is also unlikely the subtable will be used elsewhere.

Fixes: aeb823bbacc2 ("ACPICA: ACPI 6.0: Add changes for FADT table.")
Signed-off-by: Al Stone <al.stone@linaro.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Acked-by: "Rafael J. Wysocki" <rjw@rjwysocki.net>
    [catalin.marinas@arm.com: extra brackets around macro arguments]
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Julien Grall <julien.grall@arm.com>
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Julien Grall <jgrall@amazon.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
Tested-by: Elliott Mitchell <ehem+xen@m5p.com>
(cherry picked from commit 7056f2f89f03f2f804ac7e776c7b2b000cd716cd)

xen/arm: Introduce fw_unreserved_regions() and use it

Since commit 6e3e77120378 "xen/arm: setup: Relocate the Device-Tree
later on in the boot", the device-tree will not be kept mapped when
using ACPI.

However, a few places are calling dt_unreserved_regions() which expects
a valid DT. This will lead to a crash.

As the DT should not be used for ACPI (other than for detecting the
modules), a new function fw_unreserved_regions() is introduced.

It will behave the same way on DT system. On ACPI system, it will
unreserve the whole region.

Take the opportunity to clarify that bootinfo.reserved_mem is only used
when booting using Device-Tree.

Signed-off-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 9c2bc0f24b2ba7082df408b3c33ec9a86bf20cf0)

xen/arm: Check if the platform is not using ACPI before initializing Dom0less

Dom0less requires a device-tree. However, since commit 6e3e77120378
"xen/arm: setup: Relocate the Device-Tree later on in the boot", the
device-tree will not get unflatten when using ACPI.

This will lead to a crash during boot.

Given the complexity to setup dom0less with ACPI (for instance how to
assign device?), we should skip any code related to Dom0less when using
ACPI.

Signed-off-by: Julien Grall <jgrall@amazon.com>
Tested-by: Rahul Singh <rahul.singh@arm.com>
Reviewed-by: Rahul Singh <rahul.singh@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Tested-by: Elliott Mitchell <ehem+xen@m5p.com>
(cherry picked from commit dac867bf9adc1562a4cf9db5f89726597af13ef8)

xen/arm: acpi: The fixmap area should always be cleared during failure/unmap

Commit 022387ee1ad3 "xen/arm: mm: Don't open-code Xen PT update in
{set, clear}_fixmap()" enforced that each set_fixmap() should be
paired with a clear_fixmap(). Any failure to follow the model would
result to a platform crash.

Unfortunately, the use of fixmap in the ACPI code was overlooked as it
is calling set_fixmap() but not clear_fixmap().

The function __acpi_os_map_table() is reworked so:
    - We know before the mapping whether the fixmap region is big
    enough for the mapping.
    - It will fail if the fixmap is already in use. This is not a
    change of behavior but clarifying the current expectation to avoid
    hitting a BUG().

The function __acpi_os_unmap_table() will now call clear_fixmap().

Reported-by: Wei Xu <xuwei5@hisilicon.com>
Signed-off-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 4d625ff3c3a939dc270b03654337568c30c5ab6e)

xen/acpi: Rework acpi_os_map_memory() and acpi_os_unmap_memory()

The functions acpi_os_{un,}map_memory() are meant to be arch-agnostic
while the __acpi_os_{un,}map_memory() are meant to be arch-specific.

Currently, the former are still containing x86 specific code.

To avoid this rather strange split, the generic helpers are reworked so
they are arch-agnostic. This requires the introduction of a new helper
__acpi_os_unmap_memory() that will undo any mapping done by
__acpi_os_map_memory().

Currently, the arch-helper for unmap is basically a no-op so it only
returns whether the mapping was arch specific. But this will change
in the future.

Note that the x86 version of acpi_os_map_memory() was already able to
able the 1MB region. Hence why there is no addition of new code.

Signed-off-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Rahul Singh <rahul.singh@arm.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
Tested-by: Rahul Singh <rahul.singh@arm.com>
Tested-by: Elliott Mitchell <ehem+xen@m5p.com>
(cherry picked from commit 1c4aa69ca1e1fad20b2158051eb152276d1eb973)

xen/arm: acpi: Don't fail if SPCR table is absent

Absence of a SPCR table likely means the console is a framebuffer. In
such case acpi_iomem_deny_access() should NOT fail.

Signed-off-by: Elliott Mitchell <ehem+xen@m5p.com>
Acked-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 861f0c110976fa8879b7bf63d9478b6be83d4ab6)

tools/python: Pass linker to Python build process

Unexpectedly the environment variable which needs to be passed is
$LDSHARED and not $LD. Otherwise Python may find the build `ld` instead
of the host `ld`.

Replace $(LDFLAGS) with $(SHLIB_LDFLAGS) as Python needs shared objects
it can load at runtime, not executables.

This uses $(CC) instead of $(LD) since Python distutils appends $CFLAGS
to $LDFLAGS which breaks many linkers.

Signed-off-by: Elliott Mitchell <ehem+xen@m5p.com>
Acked-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
(cherry picked from commit 17d192e0238d6c714e9f04593b59597b7090be38)

[ Hans van Kranenburg ]
Fixed cherry-pick conflict because we have LIBEXEC_LIB=$(LIBEXEC_LIB) in
between in the same lines. The line wrap mess makes it a bit hard to
follow.

xen/rpi4: implement watchdog-based reset

The preferred method to reboot RPi4 is PSCI. If it is not available,
touching the watchdog is required to be able to reboot the board.

The implementation is based on
drivers/watchdog/bcm2835_wdt.c:__bcm2835_restart in Linux v5.9-rc7.

Signed-off-by: Stefano Stabellini <stefano.stabellini@xilinx.com>
Acked-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com>
Tested-by: Roman Shaposhnik <roman@zededa.com>
CC: roman@zededa.com
(cherry picked from commit 25849c8b16f2a5b7fcd0a823e80a5f1b590291f9)

t/h/L/vif-common.sh: fix handle_iptable return value

A return statement without explicit value will return the value of the
last command executed before this line with return was encountered.

This is not what we want. return 0.

Closes: #955994
Fixes: 2e0814f971dd ("vif-common: disable handle_iptable")
Reported-by: Samuel Thibault <sthibault@debian.org>
Signed-off-by: Hans van Kranenburg <hans@knorrie.org>

tools: Partially revert "Cross-compilation fixes."

This partially reverts commit 16504669c5cbb8b195d20412aadc838da5c428f7.

Doesn't look like much of 16504669c5cbb8b195d20412aadc838da5c428f7
actually remains due to passage of time.

Of the 3, both Python and pygrub appear to mostly be building just fine
cross-compiling. The OCAML portion is being troublesome, this is going
to cause bug reports elsewhere soon. The OCAML portion though can
already be disabled by setting OCAML_TOOLS=n and shouldn't have this
extra form of disabling.

Signed-off-by: Elliott Mitchell <ehem+xen@m5p.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>
Acked-by: Wei Liu <wl@xen.org>
(cherry picked from commit 69953e2856382274749b617125cc98ce38198463)

tools: don't build/ship xenmon

It can't run with Python 3, and I'm not going to fix it.

Signed-off-by: Hans van Kranenburg <hans@knorrie.org>

tools/xl/bash-completion: also complete 'xen'

We have the `xen` alias for xl in Debian, since in the past it was a
command that could execute either xl or xm.

Now, it always does xl, so, complete the same stuff for it as we have
for xl.

Signed-off-by: Hans van Kranenburg <hans@knorrie.org>
[git-debrebase split: mixed commit: upstream part]

pygrub: Specify -rpath LIBEXEC_LIB when building fsimage.so

If LIBEXEC_LIB is not on the default linker search path, the python
fsimage.so module fails to find libfsimage.so.

Add the relevant directory to the rpath explicitly.

(This situation occurs in the Debian package, where
--with-libexec-libdir is used to put each Xen version's libraries and
utilities in their own directory, to allow them to be coinstalled.)

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

pygrub: Set sys.path

We install libfsimage in a non-standard path for Reasons.
(See debian/rules.)

This patch was originally part of `tools-pygrub-prefix.diff'
(eg commit 51657319be54) and included changes to the Makefile to
change the installation arrangements (we do that part in the rules now
since that is a lot less prone to conflicts when we update) and to
shared library rpath (which is now done in a separate patch).

(Commit message rewritten by Ian Jackson.)

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>
squash! pygrub: Set sys.path and rpath

hotplug-common: Do not adjust LD_LIBRARY_PATH

This is in the upstream script because on non-Debian systems, the
default install locations in /usr/local/lib might not be on the linker
path, and as a result the hotplug scripts would break.

A reason we might need it in Debian is our multiple version
coinstallation scheme. However, the hotplug scripts all call the
utilities via the wrappers, and the binaries are configured to load
from the right place anyway.

This setting is an annoyance because it requires libdir, which is an
arch-specific path but comes from a file we want to put in
xen-utils-common, an arch:all package.

So drop this setting.

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

sysconfig.xencommons.in: Strip and debianize

Strip all options that are for stuff we don't ship, which is 1)
xenstored as stubdom and 2) xenbackendd, which seems to be dead code
anyway. [1]

It seems useful to give the user the option to revert to xenstored
instead of the default oxenstored if they really want.

[1] https://lists.xen.org/archives/html/xen-devel/2015-07/msg04427.html

Signed-off-by: Hans van Kranenburg <hans@knorrie.org>
Acked-by: Ian Jackson <ijackson@chiark.greenend.org.uk>

vif-common: disable handle_iptable

Also see Debian bug #894013. The current attempt at providing
anti-spoofing rules results in a situation that does not have any
effect. Also note that forwarding bridged traffic to iptables is not
enabled by default, and that for openvswitch users it does not make any
sense.

So, stop cluttering the live iptables ruleset.

This functionality seems to be introduced before 2004 and since then it
has never got some additional love.

It would be nice to have a proper discussion upstream about how Xen
could provide some anti mac/ip spoofing in the dom0. It does not seem to
be a trivial thing to do, since it requires having quite some knowledge
about what the domU is allowed to do or not (e.g. a domU can be a
router...).

Fix empty fields in first hypervisor log line

Instead of:

    (XEN) Xen version 4.11.1 (Debian )
    (@)
    (gcc (Debian 8.2.0-13) 8.2.0) debug=n
    Thu Jan  3 19:08:37 UTC 2019

I'd like to see:

    (XEN) Xen version 4.11.1 (Debian 4.11.1-1~)
    (pkg-xen-devel@lists.alioth.debian.org)
    (gcc (Debian 8.2.0-13) 8.2.0) debug=n
    Thu Jan  3 22:44:00 CET 2019

The substitution was broken since the great packaging refactoring,
because the directory in which the build is done changed.

Also, use the Maintainer address from debian/control instead of the most
recent changelog entry. If someone wants to use the address to ask a
question, they will end up at the team mailing list, which is better
than an individual person.

docs/man/xen-vbd-interface.7: Provide properly-formatted NAME section

This manpage was omitted from
docs/man: Provide properly-formatted NAME sections
because I was previously building with markdown not installed.

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

shim: Provide separate install-shim target

When building on a 32-bit userland, the user wants to build 32-bit
tools and a 64-bit hypervisor.  This involves setting XEN_TARGET_ARCH
to different values for the tools build and the hypervisor build.

So the user must invoke the tools build and the hypervisor build
separately.

However, although the shim is done by the tools/firmware Makefile, its
bitness needs to be the same as the hypervisor, not the same as the
tools.  When run with XEN_TARGET_ARCH=x86_32, it it skipped, which is
wrong.

So the user must invoke the shim build separately.  This can be done
with
   make -C tools/firmware/xen-dir XEN_TARGET_ARCH=x86_64

However, tools/firmware/xen-dir has no `install' target.  The
installation of all `firmware' is done in tools/firmware/Makefile.  It
might be possible to fix this, but it is not trivial.  For example,
the definitions of INST_DIR and DEBG_DIR would need to be copied, as
would an appropriate $(INSTALL_DIR) call.

For now, provide an `install-shim' target in tools/firmware/Makefile.

This has to be called from `install' of course.  We can't make it
a dependency of `install' because it might be run before `all' has
completed.  We could make it depend on a `shim' target but such
a target is nearly impossible to write because everything is done by
the inflexible subdir-$@ machinery.

The overally result of this patch is that existing make invocations
work as before.  But additionally, the user can say
  make -C tools/firmware install-shim XEN_TARGET_ARCH=x86_64
to install the shim.  The user must have built it already.
Unlike the build rune, this install-rune is properly conditional
so it is OK to call on ARM.

What a mess.

Signed-off-by: Ian Jackson <ijackson@chiark.greenend.org.uk>

tools/firmware/Makefile: CONFIG_PV_SHIM: enable only on x86_64

Previously this was *dis*abled for x86_*32*. But if someone should
run some of this Makefile on ARM, say, it ought not to be built
either.

Signed-off-by: Ian Jackson <ijackson@chiark.greenend.org.uk>

tools/firmware/Makfile: Respect caller's CONFIG_PV_SHIM

This makes it easier to disable the shim build. (In Debian we need to
build the shim separately because it needs different compiler flags
and a different XEN_COMPILE_ARCH.

Signed-off-by: Ian Jackson <ijackson@chiark.greenend.org.uk>

Revert "pvshim: make PV shim build selectable from configure"

This reverts commit 8845155c831c59e867ee3dd31ee63e0cc6c7dcf2.

This upstream change changes stuff that breaks our very fragile mess
that builds the shim when it needs to, and doesn't when it should not.

The result is that it's missing in the end for the i386 build... :|

    dh_install: warning: Cannot find (any matches for)
    "usr/lib/debug/usr/lib/xen-*/boot/*" (tried in ., debian/tmp)

    dh_install: warning: xen-utils-4.14 missing files:
    usr/lib/debug/usr/lib/xen-*/boot/*
    dh_install: error: missing files, aborting

.gitignore: Add configure output which we always delete and regenerate

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

autoconf: Provide libexec_libdir_suffix

This is going to be used to put libfsimage.so into a path containing
the multiarch triplet.

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

tools-libfsimage-prefix.diff

\o/

Do not build the instruction emulator

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

tools/tests/x86_emulator: Pass -no-pie -fno-pic to gcc on x86_32

The current build fails with GCC6 on Debian sid i386 (unstable):

/tmp/ccqjaueF.s: Assembler messages:
/tmp/ccqjaueF.s:3713: Error: missing or invalid displacement expression `vmovd_to_reg_len@GOT'

This is due to the combination of GCC6, and Debian's decision to
enable some hardening flags by default (to try to make runtime
addresses less predictable):
  https://wiki.debian.org/Hardening/PIEByDefaultTransition

This is of no benefit for the x86 instruction emulator test, which is
a rebuild of the emulator code for testing purposes only.  So pass
options to disable this.

These options will be no-ops if they are the same as the compiler
default.

On amd64, the -fno-pic breaks the build in a different way.  So do
this only on i386.

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>
CC: Jan Beulich <jbeulich@suse.com>
CC: Andrew Cooper <andrew.cooper3@citrix.com>
Gbp-Pq: Topic misc
Gbp-Pq: Name toolstestsx86_emulator-pass--no-pie--fno.patch

Remove static solaris support from pygrub

Patch-Name: tools-pygrub-remove-static-solaris-support

Gbp-Pq: Topic misc
Gbp-Pq: Name tools-pygrub-remove-static-solaris-support

Do not ship COPYING into /usr/include

This is not wanted in Debian. COPYING ends up in
/usr/share/doc/xen-*copyright.

Patch-Name: tools-include-no-COPYING.diff

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

config-prefix.diff

Patch-Name: config-prefix.diff

Gbp-Pq: Topic prefix-abiname
Gbp-Pq: Name config-prefix.diff

version

Delete configure output

These autogenerated files are not useful in Debian; dh_autoreconf will
regenerate them.

If this patch does not apply when rebasing, you can simply delete the
files again.

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

Delete config.sub and config.guess

dh_autoreconf will provide these back.

If this patch does not apply when rebasing, you can simply delete the
files again.

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

Update changelog for new upstream 4.14.2+25-gb6a8c4f72d

[git-debrebase changelog: new upstream 4.14.2+25-gb6a8c4f72d]

Update to upstream 4.14.2+25-gb6a8c4f72d

[git-debrebase anchor: new upstream 4.14.2+25-gb6a8c4f72d, merge]

golang/xenlight: fix code generation for python 2.6

Before python 2.7, str.format() calls required that the format fields
were explicitly enumerated, e.g.:

  '{0} {1}'.format(foo, bar)

  vs.

  '{} {}'.format(foo, bar)

Currently, gengotypes.py uses the latter pattern everywhere, which means
the Go bindings do not build on python 2.6. Use the 2.6 syntax for
format() in order to support python 2.6 for now.

Signed-off-by: Nick Rosbrook <rosbrookn@ainfosec.com>
Acked-by: Wei Liu <wl@xen.org>
(cherry picked from commit 6d49fbdeab3e687a6818f809ca3d98ac7ced2c8d)

x86/tsx: Cope with TSX deprecation on SKL/KBL/CFL/WHL

The June 2021 microcode is formally de-featuring TSX on the older Skylake
client CPUs. The workaround from the March 2019 microcode is being dropped,
and replaced with additions to MSR_TSX_FORCE_ABORT to hide the HLE/RTM CPUID
bits.

With this microcode in place, TSX is disabled by default on these CPUs.
Backwards compatibility is provided in the same way as for TAA - RTM force
aborts, rather than suffering #UD, and the CPUID bits can be hidden to recover
performance.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 3e09045991cde360432bc7437103f8f8a6699359)

x86/cpuid: Fix HLE and RTM handling (again)

For reasons which are my fault, but I don't recall why, the
FDP_EXCP_ONLY/NO_FPU_SEL adjustment uses the whole special_features[] array
element, not the two relevant bits.

HLE and RTM were recently added to the list of special features, causing them
to be always set in guest view, irrespective of the toolstacks choice on the
matter.

Rewrite the logic to refer to the features specifically, rather than relying
on the contents of the special_features[] array.

Fixes: 8fe24090d9 ("x86/cpuid: Rework HLE and RTM handling")
Reported-by: Edwin Török <edvin.torok@citrix.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 60fa12dbf1d4d2c4ffe1ef34b495b24aa7e41aa0)

x86/tsx: Deprecate vpmu=rtm-abort and use tsx=<bool> instead

This reuses the rtm_disable infrastructure, so CPUID derivation works properly
when TSX is disabled in favour of working PCR3.

vpmu= is not a supported feature, and having this functionality under tsx=
centralises all TSX handling.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 9fdcf851689cb2a9501d3947cb5d767d9c7797e8)

x86/tsx: Minor cleanup and improvements

* Introduce cpu_has_arch_caps and replace boot_cpu_has(X86_FEATURE_ARCH_CAPS)
* Read CPUID data into the appropriate boot_cpu_data.x86_capability[]
element, as subsequent changes are going to need more cpu_has_* logic.
* Use the hi/lo MSR helpers, which substantially improves code generation.

No practical change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 3670abcaf0324f2aedba0c4dc7939072b27efa1d)

x86/spec-ctrl: Mitigate TAA after S3 resume

The user chosen setting for MSR_TSX_CTRL needs restoring after S3.

All APs get the correct setting via start_secondary(), but the BSP was missed
out.

This is XSA-377 / CVE-2021-28690.

Fixes: 8c4330818f6 ("x86/spec-ctrl: Mitigate the TSX Asynchronous Abort sidechannel")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 8cf276cb2e0b99b96333865873f56b0b31555ff1)

x86/spec-ctrl: Protect against Speculative Code Store Bypass

Modern x86 processors have far-better-than-architecturally-guaranteed self
modifying code detection.  Typically, when a write hits an instruction in
flight, a Machine Clear occurs to flush stale content in the frontend and
backend.

For self modifying code, before a write which hits an instruction in flight
retires, the frontend can speculatively decode and execute the old instruction
stream.  Speculation of this form can suffer from type confusion in registers,
and potentially leak data.

Furthermore, updates are typically byte-wise, rather than atomic.  Depending
on timing, speculation can race ahead multiple times between individual
writes, and execute the transiently-malformed instruction stream.

Xen has stubs which are used in certain cases for emulation purposes.  Inhibit
speculation between updating the stub and executing it.

This is XSA-375 / CVE-2021-0089.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 45f59ed8865318bb0356954bad067f329677ce9e)

AMD/IOMMU: drop command completion timeout

First and foremost - such timeouts were not signaled to callers, making
them believe they're fine to e.g. free previously unmapped pages.

Mirror VT-d's behavior: A fixed number of loop iterations is not a
suitable way to detect timeouts in an environment (CPU and bus speeds)
independent manner anyway. Furthermore, leaving an in-progress operation
pending when it appears to take too long is problematic: If a command
completed later, the signaling of its completion may instead be
understood to signal a subsequently started command's completion.

Log excessively long processing times (with a progressive threshold) to
have some indication of problems in this area. Allow callers to specify
a non-default timeout bias for this logging, using the same values as
VT-d does, which in particular means a (by default) much larger value
for device IO TLB invalidation.

This is part of XSA-373 / CVE-2021-28692.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
(cherry picked from commit e4fee66043120c954fc309bbb37813604c1c0eb7)

AMD/IOMMU: wait for command slot to be available

No caller cared about send_iommu_command() indicating unavailability of
a slot. Hence if a sufficient number prior commands timed out, we did
blindly assume that the requested command was submitted to the IOMMU
when really it wasn't. This could mean both a hanging system (waiting
for a command to complete that was never seen by the IOMMU) or blindly
propagating success back to callers, making them believe they're fine
to e.g. free previously unmapped pages.

Fold the three involved functions into one, add spin waiting for an
available slot along the lines of VT-d's qinval_next_index(), and as a
consequence drop all error indicator return types/values.

This is part of XSA-373 / CVE-2021-28692.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
(cherry picked from commit c4e44343579e2c3304d676404d81b2b9bd893c5b)

VT-d: eliminate flush related timeouts

Leaving an in-progress operation pending when it appears to take too
long is problematic: If e.g. a QI command completed later, the write to
the "poll slot" may instead be understood to signal a subsequently
started command's completion. Also our accounting of the timeout period
was actually wrong: We included the time it took for the command to
actually make it to the front of the queue, which could be heavily
affected by guests other than the one for which the flush is being
performed.

Do away with all timeout detection on all flush related code paths.
Log excessively long processing times (with a progressive threshold) to
have some indication of problems in this area.

Additionally log (once) if qinval_next_index() didn't immediately find
an available slot. Together with the earlier change sizing the queue(s)
dynamically, we should now have a guarantee that with our fully
synchronous model any demand for slots can actually be satisfied.

This is part of XSA-373 / CVE-2021-28692.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
(cherry picked from commit 7c893abc4ee7e9af62297ba6fd5f27c89d0f33f5)

AMD/IOMMU: size command buffer dynamically

With the present synchronous model, we need two slots for every
operation (the operation itself and a wait command). There can be one
such pair of commands pending per CPU. To ensure that under all normal
circumstances a slot is always available when one is requested, size the
command ring according to the number of present CPUs.

This is part of XSA-373 / CVE-2021-28692.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
(cherry picked from commit df242851ddc93ac0b0a3a20ecab34acc29e3b36b)

VT-d: size qinval queue dynamically

With the present synchronous model, we need two slots for every
operation (the operation itself and a wait descriptor). There can be
one such pair of requests pending per CPU. To ensure that under all
normal circumstances a slot is always available when one is requested,
size the queue ring according to the number of present CPUs.

This is part of XSA-373 / CVE-2021-28692.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
(cherry picked from commit cbfa62bb140c8f10a0ae96db3ce06e22b3b9d302)

xen/arm: Boot modules should always be scrubbed if bootscrub={on, idle}

The function to initialize the pages (see init_heap_pages()) will request
scrub when the admin request idle bootscrub (default) and state ==
SYS_STATE_active. When bootscrub=on, Xen will scrub any free pages in
heap_init_late().

Currently, the boot modules (e.g. kernels, initramfs) will be discarded/
freed after heap_init_late() is called and system_state switched to
SYS_STATE_active. This means the pages associated with the boot modules
will not get scrubbed before getting re-purposed.

If the memory is assigned to an untrusted domU, it may be able to
retrieve secrets from the modules.

This is part of XSA-372 / CVE-2021-28693.

Fixes: 1774e9b1df27 ("xen/arm: introduce create_domUs")
Signed-off-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Tested-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit fd5dc41ceaed9cfcfa011cdfd50f264c89277a90)

xen/arm: Create dom0less domUs earlier

In a follow-up patch we will need to unallocate the boot modules
before heap_init_late() is called.

The modules will contain the domUs kernel and initramfs. Therefore Xen
will need to create extra domUs (used by dom0less) before heap_init_late().

This has two consequences on dom0less:
    1) Domains will not be unpaused as soon as they are created but
    once all have been created. However, Xen doesn't guarantee an order
    to unpause, so this is not something one could rely on.

    2) The memory allocated for a domU will not be scrubbed anymore when an
    admin select bootscrub=on. This is not something we advertised, but if
    this is a concern we can introduce either force scrub for all domUs or
    a per-domain flag in the DT. The behavior for bootscrub=off and
    bootscrub=idle (default) has not changed.

This is part of XSA-372 / CVE-2021-28693.

Signed-off-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Tested-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 371347c5b64da699d9f5a0edda5dc496fd2b7a5c)

x86: fix build race when generating temporary object files (take 2)

The original commit wasn't quite sufficient: Emptying DEPS is helpful
only when nothing will get added to it subsequently. xen/Rules.mk will,
after including the local Makefile, amend DEPS by dependencies for
objects living in sub-directories though. For the purpose of suppressing
dependencies of the makefiles on the .*.d2 files (and thus to avoid
their re-generation) it is, however, not necessary at all to play with
DEPS. Instead we can override DEPS_INCLUDE (which generally is a late-
expansion variable).

Fixes: 761bb575ce97 ("x86: fix build race when generating temporary object files")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
master commit: 8c90dbb99907f3b471d558775777a84daec7c3f6
master date: 2021-05-28 09:12:24 +0200

x86/cpuid: Rework HLE and RTM handling

The TAA mitigation offered the option to hide the HLE and RTM CPUID bits,
which has caused some migration compatibility problems.

These two bits are special.  Annotate them with ! to emphasise this point.

Hardware Lock Elision (HLE) may or may not be visible in CPUID, but is
disabled in microcode on all CPUs, and has been removed from the architecture.
Do not advertise it to VMs by default.

Restricted Transactional Memory (RTM) may or may not be visible in CPUID, and
may or may not be configured in force-abort mode.  Have tsx_init() note
whether RTM has been configured into force-abort mode, so
guest_common_feature_adjustments() can conditionally hide it from VMs by
default.

The host policy values for HLE/RTM may or may not be set, depending on any
previous running kernel's choice of visibility, and Xen's choice.  TSX is
available on any CPU which enumerates a TSX-hiding mechanism, so instead of
doing a two-step to clobber any hiding, scan CPUID, then set the visibility,
just force visibility of the bits in the first place.

With the HLE/RTM bits now unilaterally visible in the host policy,
xc_cpuid_apply_policy() can construct a more appropriate policy out of thin
air for pre-4.13 VMs with no CPUID data in their migration stream, and
specifically one where HLE/RTM doesn't potentially disappear behind the back
of a running VM.

Fixes: 8c4330818f6 ("x86/spec-ctrl: Mitigate the TSX Asynchronous Abort sidechannel")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 8fe24090d940d760145ccd5e234290be7418b175
master date: 2021-05-27 19:34:00 +0100

x86: make hypervisor build with gcc11

Gcc 11 looks to make incorrect assumptions about valid ranges that
pointers may be used for addressing when they are derived from e.g. a
plain constant. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100680.

Utilize RELOC_HIDE() to work around the issue, which for x86 manifests
in at least
- mpparse.c:efi_check_config(),
- tboot.c:tboot_probe(),
- tboot.c:tboot_gen_frametable_integrity(),
- x86_emulate.c:x86_emulate() (at -O2 only).
The last case is particularly odd not just because it only triggers at
higher optimization levels, but also because it only affects one of at
least three similar constructs. Various "note" diagnostics claim the
valid index range to be [0, 2⁶³-1].

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Jason Andryuk <jandryuk@gmail.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 722f59d38c710a940ab05e542a83020eb5546dea
master date: 2021-05-27 14:40:29 +0200

x86emul: fix test harness build for gas 2.36

All of the sudden, besides .text and .rodata and alike, an always
present .note.gnu.property section has appeared. This section, when
converting to binary format output, gets placed according to its
linked address, causing the resulting blobs to be about 128Mb in size.
The resulting headers with a C representation of the binary blobs then
are, of course all a multiple of that size (and take accordingly long
to create). I didn't bother waiting to see what size the final
test_x86_emulator binary then would have had.

See also https://sourceware.org/bugzilla/show_bug.cgi?id=27753.

Rather than figuring out whether gas supports -mx86-used-note=, simply
remove the section while creating *.bin.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
master commit: aa803ba38a867551917d11059eaa044955556e05
master date: 2021-05-17 15:41:28 +0200

x86/vhpet: fix RTC special casing

Restore setting the virtual timer callback private data to NULL if the
timer is not level triggered. This fixes the special casing done in
pt_update_irq so that the RTC interrupt when originating from the HPET
is suspended if the interrupt source is masked.

Note the RTC special casing done in pt_update_irq should only apply to
the RTC interrupt originating from the emulated RTC device (which does
set the callback private data), as in that case the callback itself
will destroy the virtual timer if the interrupt is ignored.

While there also use RTC_IRQ instead of 8 when the HPET is configured
in LegacyReplacement Mode.

Fixes: be07023be115 ("x86/vhpet: add support for level triggered interrupts")
Reported-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 472a13988a051e5ae84b95815c6caf4378062abe
master date: 2021-05-07 10:43:29 +0200

x86/intel: insert Ice Lake-SP and Ice Lake-D model numbers

LBR, C-state MSRs should correspond to Ice Lake desktop according to
SDM rev. 74 for both models.

Ice Lake-SP is known to expose IF_PSCHANGE_MC_NO in IA32_ARCH_CAPABILITIES MSR
(as advisory tells and Whitley SDP confirms) which means the erratum is fixed
in hardware for that model and therefore it shouldn't be present in
has_if_pschange_mc list. Provisionally assume the same to be the case
for Ice Lake-D.

Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 95419adfd4b275cffe24b96edcc2f15bc4db8907
master date: 2021-04-26 10:22:48 +0200

x86/vtx: add LBR_SELECT to the list of LBR MSRs

This MSR exists since Nehalem / Silvermont and is actively used by Linux,
for instance, to improve sampling efficiency.

Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 08693c03e00ea3448adc4406c891e707f0068eb6
master date: 2021-04-26 10:22:04 +0200

VT-d: Don't assume register-based invalidation is always supported

According to Intel VT-d SPEC rev3.3 Section 6.5, Register-based Invalidation
isn't supported by Intel VT-d version 6 and beyond.

This hardware change impacts following two scenarios: admin can disable
queued invalidation via 'qinval' cmdline and use register-based interface;
VT-d switches to register-based invalidation when queued invalidation needs
to be disabled, for example, during disabling x2apic or during system
suspension or after enabling queued invalidation fails.

To deal with this hardware change, if register-based invalidation isn't
supported, queued invalidation cannot be disabled through Xen cmdline; and
if queued invalidation has to be disabled temporarily in some scenarios,
VT-d won't switch to register-based interface but use some dummy functions
to catch errors in case there is any invalidation request issued when queued
invalidation is disabled.

Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 6773b1a7584a75a486e9774541ad5bd84c9aa5ee
master date: 2021-04-26 10:16:50 +0200

update Xen version to 4.14.3-pre

x86/Intel: insert Tiger Lake model numbers

Both match prior generation processors as far as LBR and C-state MSRs
go (SDM rev 073). The if_pschange_mc erratum, according to the spec
update, is not applicable.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: e93c3712d67098453760fd61c338cbf62dd08da1
master date: 2020-12-22 09:00:03 +0100

SUPPORT.md: Document speculative attacks status of non-shim 32-bit PV

This documents, but does not fix, XSA-370.

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>

update Xen version to 4.14.2

x86/hpet: Don't enable legacy replacement mode unconditionally

Commit e1de4c196a2e ("x86/timer: Fix boot on Intel systems using ITSSPRC
static PIT clock gating") was reported to cause boot failures on certain
AMD Ryzen systems.

Refine the fix to do nothing in the default case, and only attempt to
configure legacy replacement mode if IRQ0 is found to not be working. If
legacy replacement mode doesn't help, undo it before falling back to other IRQ
routing configurations.

In addition, introduce a "hpet" command line option so this heuristic
can be overridden. Since it makes little sense to introduce just
"hpet=legacy-replacement", also allow for a boolean argument as well as
"broadcast" to replace the separate "hpetbroadcast" option.

Reported-by: Frédéric Pierret frederic.pierret@qubes-os.org
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Frédéric Pierret <frederic.pierret@qubes-os.org>
master commit: b53173e7cdafb7a318a239d557478fd73734a86a
master date: 2021-04-15 18:25:40 +0100

x86/hpet: Factor hpet_enable_legacy_replacement_mode() out of hpet_setup()

... in preparation to introduce a second caller.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Frédéric Pierret <frederic.pierret@qubes-os.org>
master commit: 238168b5bcd27fec97704f6295fa5bf7a442eb6f
master date: 2021-04-15 18:25:40 +0100

x86/vpt: do not take pt_migrate rwlock in some cases

Commit 8e76aef72820 ("x86/vpt: fix race when migrating timers between
vCPUs") addressed XSA-336 by introducing a per-domain rwlock that was
intended to protect periodic timer during VCPU migration. Since such
migration is an infrequent event no performance impact was expected.

Unfortunately this turned out not to be the case: on a fairly large
guest (92 VCPUs) we've observed as much as 40% TPCC performance
regression with some guest kernels. Further investigation pointed to
pt_migrate read lock taken in pt_update_irq() as the largest contributor
to this regression. With large number of VCPUs and large number of VMEXITs
(from where pt_update_irq() is always called) the update of an atomic in
read_lock() is thought to be the main cause.

Stephen Brennan analyzed locking pattern and classified lock users as
follows:

1. Functions which read (maybe write) all periodic_time instances attached
to a particular vCPU. These are functions which use pt_vcpu_lock() such
as pt_restore_timer(), pt_save_timer(), etc.
2. Functions which want to modify a particular periodic_time object.
These functions lock whichever vCPU the periodic_time is attached to, but
since the vCPU could be modified without holding any lock, they are
vulnerable to XSA-336. Functions in this group use pt_lock(), such as
pt_timer_fn() or destroy_periodic_time().
3. Functions which not only want to modify the periodic_time, but also
would like to modify the =vcpu= fields. These are create_periodic_time()
or pt_adjust_vcpu(). They create XSA-336 conditions for group 2, but we
can't simply hold 2 vcpu locks due to the deadlock risk.

Roger then pointed out that group 1 functions don't really need to hold
the pt_migrate rwlock and that instead groups 2 and 3 should hold per-vcpu
lock whenever they modify per-vcpu timer lists.

Suggested-by: Stephen Brennan <stephen.s.brennan@oracle.com>
Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Stephen Brennan <stephen.s.brennan@oracle.com>
master commit: 1f3d87c7512975274cc45c40097b05550eba1ac9
master date: 2021-04-09 09:21:27 +0200

fix for_each_cpu() again for NR_CPUS=1

Unfortunately aa50f45332f1 ("xen: fix for_each_cpu when NR_CPUS=1") has
caused quite a bit of fallout with gcc10, e.g. (there are at least two
more similar ones, and I didn't bother trying to find them all):

In file included from .../xen/include/xen/config.h:13,
                 from <command-line>:
core_parking.c: In function ‘core_parking_power’:
.../xen/include/asm/percpu.h:12:51: error: array subscript 1 is above array bounds of ‘long unsigned int[1]’ [-Werror=array-bounds]
   12 |     (*RELOC_HIDE(&per_cpu__##var, __per_cpu_offset[cpu]))
.../xen/include/xen/compiler.h:141:29: note: in definition of macro ‘RELOC_HIDE’
  141 |     (typeof(ptr)) (__ptr + (off)); })
      |                             ^~~
core_parking.c:133:39: note: in expansion of macro ‘per_cpu’
  133 |             core_tmp = cpumask_weight(per_cpu(cpu_core_mask, cpu));
      |                                       ^~~~~~~
In file included from .../xen/include/xen/percpu.h:4,
                 from .../xen/include/asm/msr.h:7,
                 from .../xen/include/asm/time.h:5,
                 from .../xen/include/xen/time.h:76,
                 from .../xen/include/xen/spinlock.h:4,
                 from .../xen/include/xen/cpu.h:5,
                 from core_parking.c:19:
.../xen/include/asm/percpu.h:6:22: note: while referencing ‘__per_cpu_offset’
    6 | extern unsigned long __per_cpu_offset[NR_CPUS];
      |                      ^~~~~~~~~~~~~~~~

One of the further errors even went as far as claiming that an array
index (range) of [0, 0] was outside the bounds of a [1] array, so
something fishy is pretty clearly going on there.

The compiler apparently wants to be able to see that the loop isn't
really a loop in order to avoid triggering such warnings, yet what
exactly makes it consider the loop exit condition constant and within
the [0, 1] range isn't obvious - using ((mask)->bits[0] & 1) instead of
cpumask_test_cpu() for example did _not_ help.

Re-instate a special form of for_each_cpu(), experimentally "proven" to
avoid the diagnostics.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Dario Faggioli <dfaggioli@suse.com>
master commit: 625faf9f002bd6ff4b6457a016b8ff338223b659
master date: 2021-04-07 12:24:45 +0200

VT-d: restore flush hooks when disabling qinval

Leaving the hooks untouched is at best a latent risk: There may well be
cases where some flush is needed, which then needs carrying out the
"register" way.

Switch from u<N> to uint<N>_t while needing to touch the function
headers anyway.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 9c39dba2b179c0f4c42c98e97ea0878119718530
master date: 2021-03-30 14:40:24 +0200

VT-d: re-order register restoring in vtd_resume()

For one FECTL must be written last - the interrupt shouldn't be unmasked
without first having written the data and address needed to actually
deliver it. In the common case (when dma_msi_set_affinity() doesn't end
up bailing early) this happens from init_vtd_hw(), but for this to
actually have the intended effect we shouldn't subsequently overwrite
what was written there - this is only benign when old and new settings
match. Instead we should restore the registers ahead of calling
init_vtd_hw(), just for the unlikely case of dma_msi_set_affinity()
bailing early.

In the moved code drop some stray casts as well.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 95e07f8c0e64889ee56015c7f99bdf5309e9e8ef
master date: 2021-03-30 14:39:54 +0200

VT-d: leave FECTL write to vtd_resume()

We shouldn't blindly unmask the interrupt when resuming. vtd_resume()
will restore the intended state.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 0d597e7bd1bd8a57619690d457f79769777a5834
master date: 2021-03-30 14:39:23 +0200

VT-d: correct off-by-1 in number-of-IOMMUs check

Otherwise, if we really run on a system with this many IOMMUs,
entering/leaving S3 would overrun iommu_state[].

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: b9b3082002cac68726fb303e0abd2ff0113d4657
master date: 2021-03-23 17:01:30 +0100

MAINTAINERS: Belatedly update for this being a stable branch

Signed-off-by: Ian Jackson <iwj@xenproject.org>

SUPPORT.MD: Clarify the support state for the Arm SMMUv{1, 2} drivers

SMMUv{1, 2} are both marked as security supported, so we would
technically have to issue an XSA for any IOMMU security bug.

However, at the moment, device passthrough is not security supported
on Arm and there is no plan to change that in the next few months.

Therefore, mark Arm SMMUv{1, 2} as supported but not security supported.

Signed-off-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 28804c0ce9fde36feec04ad7f57b2683875da8a0)

xen/vgic: Implement write to ISPENDR in vGICv{2, 3}

Currently, Xen will send a data abort to a guest trying to write to the
ISPENDR.

Unfortunately, recent version of Linux (at least 5.9+) will start
writing to the register if the interrupt needs to be re-triggered
(see the callback irq_retrigger). This can happen when a driver (such as
the xgbe network driver on AMD Seattle) re-enable an interrupt:

(XEN) d0v0: vGICD: unhandled word write 0x00000004000000 to ISPENDR44
[...]
[   25.635837] Unhandled fault at 0xffff80001000522c
[...]
[   25.818716]  gic_retrigger+0x2c/0x38
[   25.822361]  irq_startup+0x78/0x138
[   25.825920]  __enable_irq+0x70/0x80
[   25.829478]  enable_irq+0x50/0xa0
[   25.832864]  xgbe_one_poll+0xc8/0xd8
[   25.836509]  net_rx_action+0x110/0x3a8
[   25.840328]  __do_softirq+0x124/0x288
[   25.844061]  irq_exit+0xe0/0xf0
[   25.847272]  __handle_domain_irq+0x68/0xc0
[   25.851442]  gic_handle_irq+0xa8/0xe0
[   25.855171]  el1_irq+0xb0/0x180
[   25.858383]  arch_cpu_idle+0x18/0x28
[   25.862028]  default_idle_call+0x24/0x5c
[   25.866021]  do_idle+0x204/0x278
[   25.869319]  cpu_startup_entry+0x24/0x68
[   25.873313]  rest_init+0xd4/0xe4
[   25.876611]  arch_call_rest_init+0x10/0x1c
[   25.880777]  start_kernel+0x5b8/0x5ec

As a consequence, the OS may become unusable.

Implementing the write part of ISPENDR is somewhat easy. For
virtual interrupt, we only need to inject the interrupt again.

For physical interrupt, we need to be more careful as the de-activation
of the virtual interrupt will be propagated to the physical distributor.
For simplicity, the physical interrupt will be set pending so the
workflow will not differ from a "real" interrupt.

Longer term, we could possible directly activate the physical interrupt
and avoid taking an exception to inject the interrupt to the domain.
(This is the approach taken by the new vGIC based on KVM).

Signed-off-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Release-Acked-by: Ian Jackson <iwj@xenproject.org>
(cherry picked from commit 067935804a8e7a33ff7170a2db8ce94bb46d9a63)

xen/arm: mm: Remove ; at the end of mm_printk()

The ; at the end of mm_printk() means the following code will not build
correctly:

if ( ... )
mm_printk(...);
else
...

As we treat the macro as a function, we want to remove the ; at the end
of it.

Signed-off-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com>
(cherry picked from commit bfe67a17d4df2efbedaaf5cfbadc8a8627debf5c)

xen/arm: Add workaround for Cortex-A53 erratum #843419

On the Cortex A53, when executing in AArch64 state, a load or store instruction
which uses the result of an ADRP instruction as a base register, or which uses
a base register written by an instruction immediately after an ADRP to the
same register, might access an incorrect address.

The workaround is to enable the linker flag --fix-cortex-a53-843419
if present, to check and fix the affected sequence. Otherwise print a warning
that Xen may be susceptible to this errata

Signed-off-by: Luca Fancellu <luca.fancellu@arm.com>
Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit d81133d45d81d35a4e7445778bfd1179190cbd31)

xen/arm: Add workaround for Cortex-A55 erratum #1530923

On the Cortex A55, TLB entries can be allocated by a speculative AT
instruction. If this is happening during a guest context switch with an
inconsistent page table state in the guest, TLBs with wrong values might
be allocated.
The ARM64_WORKAROUND_AT_SPECULATE workaround is used as for erratum
1165522 on Cortex A76 or Neoverse N1.

This change is also introducing the MIDR identifier for the Cortex-A55.

Signed-off-by: Bertrand Marquis <bertrand.marquis@arm.com>
Reviewed-by: Rahul Singh <rahul.singh@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Acked-by: Julien Grall <jgrall@amazon.com>
(cherry picked from commit fd7479b9aec25885cc17d33b326b9babae59faee)

xen/arm: Add Cortex-A73 erratum 858921 workaround

CNTVCT_EL0 or CNTPCT_EL0 counter read in Cortex-A73 (all versions)
might return a wrong value when the counter crosses a 32bit boundary.

Until now, there is no case for Xen itself to access CNTVCT_EL0,
and it also should be the Guest OS's responsibility to deal with
this part.

But for CNTPCT, there exists several cases in Xen involving reading
CNTPCT, so a possible workaround is that performing the read twice,
and to return one or the other depending on whether a transition has
taken place.

Signed-off-by: Penny Zheng <penny.zheng@arm.com>
Reviewed-by: Wei Chen <Wei.Chen@arm.com>
Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com>
Acked-by: Julien Grall <jgrall@amazon.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 5505f5f8e7e805365cfe70b6a4af6115940bb749)

xen/arm: Document the erratum #853709 related to Cortex A72

The Cortex-A72 erratum #853709 is the same as the Cortex-A57
erratum #852523. As the latter is already workaround, we only
need to update the documentation.

Signed-off-by: Michal Orzel <michal.orzel@arm.com>
[julieng: Reworded the commit message]
Reviewed-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com>
(cherry picked from commit f776e5fb3ee699745f6442ec8c47d0fa647e0575)

x86/ucode/amd: Fix microcode payload size for Fam19 processors

The original limit provided wasn't accurate. Blobs are in fact rather larger.

Fixes: fe36a173d1 ("x86/amd: Initial support for Fam19h processors")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 90b014a6e6ecad036ec5846426afd19b305dedff
master date: 2021-02-10 13:23:51 +0000

tools/oxenstored: mkdir conflicts were sometimes missed

Due to how set_write_lowpath was used here it didn't detect create/delete
conflicts. When we create an entry we must mark our parent as modified
(this is what creating a new node via write does).

Otherwise we can have 2 transactions one creating, and another deleting a node
both succeeding depending on timing. Or one transaction reading an entry,
concluding it doesn't exist, do some other work based on that information and
successfully commit even if another transaction creates the node via mkdir
meanwhile.

Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>
Release-Acked-by: Ian Jackson <iwj@xenproject.org>
(cherry picked from commit 45dee7d92b493bb531e7e77a6f9c0180ab152f87)

tools/oxenstored: Reject invalid watch paths early

Watches on invalid paths were accepted, but they would never trigger. The
client also got no notification that its watch is bad and would never trigger.

Found again by the structured fuzzer, due to an error on live update reload:
the invalid watch paths would get rejected during live update and the list of
watches would be different pre/post live update.

The testcase is watch on `//`, which is an invalid path.

Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>
Release-Acked-by: Ian Jackson <iwj@xenproject.org>
(cherry picked from commit dc8caf214fb882546b0e93317b9828247a7c9da8)

tools/oxenstored: Fix quota calculation for mkdir EEXIST

We increment the domain's quota on mkdir even when the node already exists.
This results in a quota inconsistency after live update, where reconstructing
the tree from scratch results in a different quota.

Not a security issue because the domain uses up quota faster, so it will only
get a Quota error sooner than it should.

Found by the structured fuzzer.

Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>
Release-Acked-by: Ian Jackson <iwj@xenproject.org>
(cherry picked from commit c8b96708252a436da44005307f7c195d699bd7c5)

tools/oxenstored: Trim txhistory on xenbus reconnect

There is a global history, containing transactions from the past 0.05s, which
get trimmed whenever any transaction commits or aborts.  Destroying a domain
will cause xenopsd to perform some transactions deleting the tree, so that is
fine.  But I think that a domain can abuse the xenbus reconnect facility to
cause a large history to be recorded - provided that noone does any
transactions on the system inbetween, which may be difficult to achieve given
squeezed's constant pinging.

The theoretical situation is like this:
- a domain starts a transaction, creates as large a tree as it can, commits
  it. Then repeatedly:
    - start a transaction, do nothing with it, start a transaction, delete
      part of the large tree, write some new unique data there, don't commit
    - cause a xenbus reconnect (I think this can be done by writing something
      to the ring). This causes all transactions/watches for the connection to
      be cleared, but NOT the history, there were no commits, so nobody
      trimmed the history, i.e. it the history can contain transactions from
      more than just 0.05s
    - loop back and start more transactions, you can keep this up indefinitely
      without hitting quotas

Now there is a periodic History.trim running every 0.05s, so I don't think you
can do much damage with it.  But lets be safe an trim the transaction history
anyway on reconnect.

Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>
(cherry picked from commit 2a47797d1f3b14aab4f0368ab833abd311f94a70)

tools/ocaml/libs/xb: Do not crash after xenbus is unmapped

Xenmmap.unmap sets the address to MAP_FAILED in xenmmap_stubs.c. If due to a
bug there were still references to the Xenbus and we attempt to use it then we
crash. Raise an exception instead of crashing.

Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 5e317896342d553f0b55f72948bbf93a0f1147d3)

oxenstored: fix ABI breakage introduced in Xen 4.9.0

dbc84d2983969bb47d294131ed9e6bbbdc2aec49 (Xen >= 4.9.0) deleted XS_RESTRICT
from oxenstored, which caused all the following opcodes to be shifted by 1:
reset_watches became off-by-one compared to the C version of xenstored.

Looking at the C code the opcode for reset watches needs:
XS_RESET_WATCHES = XS_SET_TARGET + 2

So add the placeholder `Invalid` in the OCaml<->C mapping list.
(Note that the code here doesn't simply convert the OCaml constructor to
an integer, so we don't need to introduce a dummy constructor).

Igor says that with a suitably patched xenopsd to enable watch reset,
we now see `reset watches` during kdump of a guest in xenstored-access.log.

Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Tested-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>
(cherry picked from commit a6ed77f1e0334c26e6e216aea45f8674d9284856)

libxl: Fix domain soft reset state handling

In do_domain_soft_reset(), a `libxl__domain_suspend_state' is used
without been properly initialised and disposed of. This lead do a
abort() in libxl due to the `dsps.qmp' state been used before been
initialised:
libxl__ev_qmp_send: Assertion `ev->state == qmp_state_disconnected || ev->state == qmp_state_connected' failed.

Once initialised, `dsps' also needs to be disposed of as the `qmp'
state might still be in the `Connected' state in the callback for
libxl__domain_suspend_device_model(). So this patch adds
libxl__domain_suspend_dispose() which can be called from the two
places where we need to dispose of `dsps'.

This is XSA-368.

Reported-by: Olaf Hering <olaf@aepfle.de>
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Reviewed-by: Ian Jackson <iwj@xenproject.org>
Tested-by: Olaf Hering <olaf@aepfle.de>
master commit: dae3c3e8b257cd27d6b35a467a34bf79a6650340
master date: 2021-03-18 14:56:33 +0100

xen: fix for_each_cpu when NR_CPUS=1

When running an hypervisor build with NR_CPUS=1 for_each_cpu does not
take into account whether the bit of the CPU is set or not in the
provided mask.

This means that whatever we have in the bodies of these loops is always
done once, even if the mask was empty and it should never be done. This
is clearly a bug and was in fact causing an assert to trigger in credit2
code.

Removing the special casing of NR_CPUS == 1 makes things work again.

Reported-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: aa50f45332f17e8d6308b996d890d3e83748a1a5
master date: 2021-03-12 17:02:47 +0100

vtd: make sure QI/IR are disabled before initialisation

BIOS might pass control to Xen leaving QI and/or IR in enabled and/or
partially configured state. In case of x2APIC code path where EIM is
enabled early in boot - those are correctly disabled by Xen before any
attempt to configure. But for xAPIC that step is missing which was
proven to cause QI initialization failures on some ICX based platforms
where QI is left pre-enabled and partially configured by BIOS. That
problem becomes hard to avoid since those platforms are shipped with
x2APIC opt out being advertised by default at the same time by firmware.

Unify the behaviour between x2APIC and xAPIC code paths keeping that in
line with what Linux does.

Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 04181c6fb543db01f635227c7681ced4073109ba
master date: 2021-03-12 17:01:52 +0100

x86/shadow: suppress "fast fault path" optimization without reserved bits

When none of the physical address bits in PTEs are reserved, we can't
create any 4k (leaf) PTEs which would trigger reserved bit faults. Hence
the present SHOPT_FAST_FAULT_PATH machinery needs to be suppressed in
this case, which is most easily achieved by never creating any magic
entries.

To compensate a little, eliminate sh_write_p2m_entry_post()'s impact on
such hardware.

While at it, also avoid using an MMIO magic entry when that would
truncate the incoming GFN.

Requested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Tim Deegan <tim@xen.org>
x86/shadow: suppress "fast fault path" optimization when running virtualized

We can't make correctness of our own behavior dependent upon a
hypervisor underneath us correctly telling us the true physical address
with hardware uses. Without knowing this, we can't be certain reserved
bit faults can actually be observed. Therefore, besides evaluating the
number of address bits when deciding whether to use the optimization,
also check whether we're running virtualized ourselves. (Note that since
we may get migrated when running virtualized, the number of address bits
may also change.)

Requested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>
master commit: 9318fdf757ec234f0ee6c5cd381326b2f581d065
master date: 2021-03-05 13:29:28 +0100
master commit: 60c0444fae2148452f9ed0b7c49af1fa41f8f522
master date: 2021-03-08 10:41:50 +0100