dgit.raspbian.org Git

xen/arm: Check if the platform is not using ACPI before initializing Dom0less

Dom0less requires a device-tree. However, since commit 6e3e77120378
"xen/arm: setup: Relocate the Device-Tree later on in the boot", the
device-tree will not get unflatten when using ACPI.

This will lead to a crash during boot.

Given the complexity to setup dom0less with ACPI (for instance how to
assign device?), we should skip any code related to Dom0less when using
ACPI.

Signed-off-by: Julien Grall <jgrall@amazon.com>
Tested-by: Rahul Singh <rahul.singh@arm.com>
Reviewed-by: Rahul Singh <rahul.singh@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Tested-by: Elliott Mitchell <ehem+xen@m5p.com>
(cherry picked from commit dac867bf9adc1562a4cf9db5f89726597af13ef8)

xen/arm: acpi: The fixmap area should always be cleared during failure/unmap

Commit 022387ee1ad3 "xen/arm: mm: Don't open-code Xen PT update in
{set, clear}_fixmap()" enforced that each set_fixmap() should be
paired with a clear_fixmap(). Any failure to follow the model would
result to a platform crash.

Unfortunately, the use of fixmap in the ACPI code was overlooked as it
is calling set_fixmap() but not clear_fixmap().

The function __acpi_os_map_table() is reworked so:
    - We know before the mapping whether the fixmap region is big
    enough for the mapping.
    - It will fail if the fixmap is already in use. This is not a
    change of behavior but clarifying the current expectation to avoid
    hitting a BUG().

The function __acpi_os_unmap_table() will now call clear_fixmap().

Reported-by: Wei Xu <xuwei5@hisilicon.com>
Signed-off-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 4d625ff3c3a939dc270b03654337568c30c5ab6e)

xen/acpi: Rework acpi_os_map_memory() and acpi_os_unmap_memory()

The functions acpi_os_{un,}map_memory() are meant to be arch-agnostic
while the __acpi_os_{un,}map_memory() are meant to be arch-specific.

Currently, the former are still containing x86 specific code.

To avoid this rather strange split, the generic helpers are reworked so
they are arch-agnostic. This requires the introduction of a new helper
__acpi_os_unmap_memory() that will undo any mapping done by
__acpi_os_map_memory().

Currently, the arch-helper for unmap is basically a no-op so it only
returns whether the mapping was arch specific. But this will change
in the future.

Note that the x86 version of acpi_os_map_memory() was already able to
able the 1MB region. Hence why there is no addition of new code.

Signed-off-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Rahul Singh <rahul.singh@arm.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
Tested-by: Rahul Singh <rahul.singh@arm.com>
Tested-by: Elliott Mitchell <ehem+xen@m5p.com>
(cherry picked from commit 1c4aa69ca1e1fad20b2158051eb152276d1eb973)

xen/arm: acpi: Don't fail if SPCR table is absent

Absence of a SPCR table likely means the console is a framebuffer. In
such case acpi_iomem_deny_access() should NOT fail.

Signed-off-by: Elliott Mitchell <ehem+xen@m5p.com>
Acked-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 861f0c110976fa8879b7bf63d9478b6be83d4ab6)

tools/python: Pass linker to Python build process

Unexpectedly the environment variable which needs to be passed is
$LDSHARED and not $LD. Otherwise Python may find the build `ld` instead
of the host `ld`.

Replace $(LDFLAGS) with $(SHLIB_LDFLAGS) as Python needs shared objects
it can load at runtime, not executables.

This uses $(CC) instead of $(LD) since Python distutils appends $CFLAGS
to $LDFLAGS which breaks many linkers.

Signed-off-by: Elliott Mitchell <ehem+xen@m5p.com>
Acked-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
(cherry picked from commit 17d192e0238d6c714e9f04593b59597b7090be38)

[ Hans van Kranenburg ]
Fixed cherry-pick conflict because we have LIBEXEC_LIB=$(LIBEXEC_LIB) in
between in the same lines. The line wrap mess makes it a bit hard to
follow.

xen/rpi4: implement watchdog-based reset

The preferred method to reboot RPi4 is PSCI. If it is not available,
touching the watchdog is required to be able to reboot the board.

The implementation is based on
drivers/watchdog/bcm2835_wdt.c:__bcm2835_restart in Linux v5.9-rc7.

Signed-off-by: Stefano Stabellini <stefano.stabellini@xilinx.com>
Acked-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com>
Tested-by: Roman Shaposhnik <roman@zededa.com>
CC: roman@zededa.com
(cherry picked from commit 25849c8b16f2a5b7fcd0a823e80a5f1b590291f9)

t/h/L/vif-common.sh: fix handle_iptable return value

A return statement without explicit value will return the value of the
last command executed before this line with return was encountered.

This is not what we want. return 0.

Closes: #955994
Fixes: 2e0814f971dd ("vif-common: disable handle_iptable")
Reported-by: Samuel Thibault <sthibault@debian.org>
Signed-off-by: Hans van Kranenburg <hans@knorrie.org>

tools: Partially revert "Cross-compilation fixes."

This partially reverts commit 16504669c5cbb8b195d20412aadc838da5c428f7.

Doesn't look like much of 16504669c5cbb8b195d20412aadc838da5c428f7
actually remains due to passage of time.

Of the 3, both Python and pygrub appear to mostly be building just fine
cross-compiling. The OCAML portion is being troublesome, this is going
to cause bug reports elsewhere soon. The OCAML portion though can
already be disabled by setting OCAML_TOOLS=n and shouldn't have this
extra form of disabling.

Signed-off-by: Elliott Mitchell <ehem+xen@m5p.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>
Acked-by: Wei Liu <wl@xen.org>
(cherry picked from commit 69953e2856382274749b617125cc98ce38198463)

tools: don't build/ship xenmon

It can't run with Python 3, and I'm not going to fix it.

Signed-off-by: Hans van Kranenburg <hans@knorrie.org>

tools/xl/bash-completion: also complete 'xen'

We have the `xen` alias for xl in Debian, since in the past it was a
command that could execute either xl or xm.

Now, it always does xl, so, complete the same stuff for it as we have
for xl.

Signed-off-by: Hans van Kranenburg <hans@knorrie.org>
[git-debrebase split: mixed commit: upstream part]

pygrub: Specify -rpath LIBEXEC_LIB when building fsimage.so

If LIBEXEC_LIB is not on the default linker search path, the python
fsimage.so module fails to find libfsimage.so.

Add the relevant directory to the rpath explicitly.

(This situation occurs in the Debian package, where
--with-libexec-libdir is used to put each Xen version's libraries and
utilities in their own directory, to allow them to be coinstalled.)

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

pygrub: Set sys.path

We install libfsimage in a non-standard path for Reasons.
(See debian/rules.)

This patch was originally part of `tools-pygrub-prefix.diff'
(eg commit 51657319be54) and included changes to the Makefile to
change the installation arrangements (we do that part in the rules now
since that is a lot less prone to conflicts when we update) and to
shared library rpath (which is now done in a separate patch).

(Commit message rewritten by Ian Jackson.)

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>
squash! pygrub: Set sys.path and rpath

hotplug-common: Do not adjust LD_LIBRARY_PATH

This is in the upstream script because on non-Debian systems, the
default install locations in /usr/local/lib might not be on the linker
path, and as a result the hotplug scripts would break.

A reason we might need it in Debian is our multiple version
coinstallation scheme. However, the hotplug scripts all call the
utilities via the wrappers, and the binaries are configured to load
from the right place anyway.

This setting is an annoyance because it requires libdir, which is an
arch-specific path but comes from a file we want to put in
xen-utils-common, an arch:all package.

So drop this setting.

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

sysconfig.xencommons.in: Strip and debianize

Strip all options that are for stuff we don't ship, which is 1)
xenstored as stubdom and 2) xenbackendd, which seems to be dead code
anyway. [1]

It seems useful to give the user the option to revert to xenstored
instead of the default oxenstored if they really want.

[1] https://lists.xen.org/archives/html/xen-devel/2015-07/msg04427.html

Signed-off-by: Hans van Kranenburg <hans@knorrie.org>
Acked-by: Ian Jackson <ijackson@chiark.greenend.org.uk>

vif-common: disable handle_iptable

Also see Debian bug #894013. The current attempt at providing
anti-spoofing rules results in a situation that does not have any
effect. Also note that forwarding bridged traffic to iptables is not
enabled by default, and that for openvswitch users it does not make any
sense.

So, stop cluttering the live iptables ruleset.

This functionality seems to be introduced before 2004 and since then it
has never got some additional love.

It would be nice to have a proper discussion upstream about how Xen
could provide some anti mac/ip spoofing in the dom0. It does not seem to
be a trivial thing to do, since it requires having quite some knowledge
about what the domU is allowed to do or not (e.g. a domU can be a
router...).

Fix empty fields in first hypervisor log line

Instead of:

    (XEN) Xen version 4.11.1 (Debian )
    (@)
    (gcc (Debian 8.2.0-13) 8.2.0) debug=n
    Thu Jan  3 19:08:37 UTC 2019

I'd like to see:

    (XEN) Xen version 4.11.1 (Debian 4.11.1-1~)
    (pkg-xen-devel@lists.alioth.debian.org)
    (gcc (Debian 8.2.0-13) 8.2.0) debug=n
    Thu Jan  3 22:44:00 CET 2019

The substitution was broken since the great packaging refactoring,
because the directory in which the build is done changed.

Also, use the Maintainer address from debian/control instead of the most
recent changelog entry. If someone wants to use the address to ask a
question, they will end up at the team mailing list, which is better
than an individual person.

docs/man/xen-vbd-interface.7: Provide properly-formatted NAME section

This manpage was omitted from
docs/man: Provide properly-formatted NAME sections
because I was previously building with markdown not installed.

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

shim: Provide separate install-shim target

When building on a 32-bit userland, the user wants to build 32-bit
tools and a 64-bit hypervisor.  This involves setting XEN_TARGET_ARCH
to different values for the tools build and the hypervisor build.

So the user must invoke the tools build and the hypervisor build
separately.

However, although the shim is done by the tools/firmware Makefile, its
bitness needs to be the same as the hypervisor, not the same as the
tools.  When run with XEN_TARGET_ARCH=x86_32, it it skipped, which is
wrong.

So the user must invoke the shim build separately.  This can be done
with
   make -C tools/firmware/xen-dir XEN_TARGET_ARCH=x86_64

However, tools/firmware/xen-dir has no `install' target.  The
installation of all `firmware' is done in tools/firmware/Makefile.  It
might be possible to fix this, but it is not trivial.  For example,
the definitions of INST_DIR and DEBG_DIR would need to be copied, as
would an appropriate $(INSTALL_DIR) call.

For now, provide an `install-shim' target in tools/firmware/Makefile.

This has to be called from `install' of course.  We can't make it
a dependency of `install' because it might be run before `all' has
completed.  We could make it depend on a `shim' target but such
a target is nearly impossible to write because everything is done by
the inflexible subdir-$@ machinery.

The overally result of this patch is that existing make invocations
work as before.  But additionally, the user can say
  make -C tools/firmware install-shim XEN_TARGET_ARCH=x86_64
to install the shim.  The user must have built it already.
Unlike the build rune, this install-rune is properly conditional
so it is OK to call on ARM.

What a mess.

Signed-off-by: Ian Jackson <ijackson@chiark.greenend.org.uk>

tools/firmware/Makefile: CONFIG_PV_SHIM: enable only on x86_64

Previously this was *dis*abled for x86_*32*. But if someone should
run some of this Makefile on ARM, say, it ought not to be built
either.

Signed-off-by: Ian Jackson <ijackson@chiark.greenend.org.uk>

tools/firmware/Makfile: Respect caller's CONFIG_PV_SHIM

This makes it easier to disable the shim build. (In Debian we need to
build the shim separately because it needs different compiler flags
and a different XEN_COMPILE_ARCH.

Signed-off-by: Ian Jackson <ijackson@chiark.greenend.org.uk>

Revert "pvshim: make PV shim build selectable from configure"

This reverts commit 8845155c831c59e867ee3dd31ee63e0cc6c7dcf2.

This upstream change changes stuff that breaks our very fragile mess
that builds the shim when it needs to, and doesn't when it should not.

The result is that it's missing in the end for the i386 build... :|

    dh_install: warning: Cannot find (any matches for)
    "usr/lib/debug/usr/lib/xen-*/boot/*" (tried in ., debian/tmp)

    dh_install: warning: xen-utils-4.14 missing files:
    usr/lib/debug/usr/lib/xen-*/boot/*
    dh_install: error: missing files, aborting

.gitignore: Add configure output which we always delete and regenerate

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

autoconf: Provide libexec_libdir_suffix

This is going to be used to put libfsimage.so into a path containing
the multiarch triplet.

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

tools-libfsimage-prefix.diff

\o/

Do not build the instruction emulator

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

tools/tests/x86_emulator: Pass -no-pie -fno-pic to gcc on x86_32

The current build fails with GCC6 on Debian sid i386 (unstable):

/tmp/ccqjaueF.s: Assembler messages:
/tmp/ccqjaueF.s:3713: Error: missing or invalid displacement expression `vmovd_to_reg_len@GOT'

This is due to the combination of GCC6, and Debian's decision to
enable some hardening flags by default (to try to make runtime
addresses less predictable):
  https://wiki.debian.org/Hardening/PIEByDefaultTransition

This is of no benefit for the x86 instruction emulator test, which is
a rebuild of the emulator code for testing purposes only.  So pass
options to disable this.

These options will be no-ops if they are the same as the compiler
default.

On amd64, the -fno-pic breaks the build in a different way.  So do
this only on i386.

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>
CC: Jan Beulich <jbeulich@suse.com>
CC: Andrew Cooper <andrew.cooper3@citrix.com>
Gbp-Pq: Topic misc
Gbp-Pq: Name toolstestsx86_emulator-pass--no-pie--fno.patch

Remove static solaris support from pygrub

Patch-Name: tools-pygrub-remove-static-solaris-support

Gbp-Pq: Topic misc
Gbp-Pq: Name tools-pygrub-remove-static-solaris-support

Do not ship COPYING into /usr/include

This is not wanted in Debian. COPYING ends up in
/usr/share/doc/xen-*copyright.

Patch-Name: tools-include-no-COPYING.diff

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

config-prefix.diff

Patch-Name: config-prefix.diff

Gbp-Pq: Topic prefix-abiname
Gbp-Pq: Name config-prefix.diff

version

Delete configure output

These autogenerated files are not useful in Debian; dh_autoreconf will
regenerate them.

If this patch does not apply when rebasing, you can simply delete the
files again.

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

Delete config.sub and config.guess

dh_autoreconf will provide these back.

If this patch does not apply when rebasing, you can simply delete the
files again.

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

Update changelog for new upstream 4.14.3

[git-debrebase changelog: new upstream 4.14.3]

Update to upstream 4.14.3

[git-debrebase anchor: new upstream 4.14.3, merge]

d/changelog: finish 4.14.2+25-gb6a8c4f72d-2

Signed-off-by: Hans van Kranenburg <hans@knorrie.org>

Add debian/README.Debian.security

Add a note about the end of upstream security support, since the date is
already known right now.

Install it into the xen-hypervisor-common package.

Signed-off-by: Hans van Kranenburg <hans@knorrie.org>

debian/changelog: finish 4.14.2+25-gb6a8c4f72d-1

update Xen version to 4.14.3

gnttab: deal with status frame mapping race

Once gnttab_map_frame() drops the grant table lock, the MFN it reports
back to its caller is free to other manipulation. In particular
gnttab_unpopulate_status_frames() might free it, by a racing request on
another CPU, thus resulting in a reference to a deallocated page getting
added to a domain's P2M.

Obtain a page reference in gnttab_map_frame() to prevent freeing of the
page until xenmem_add_to_physmap_one() has actually completed its acting
on the page. Do so uniformly, even if only strictly required for v2
status pages, to avoid extra conditionals (which then would all need to
be kept in sync going forward).

This is CVE-2021-28701 / XSA-384.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>
master commit: eb6bbf7b30da5bae87932514d54d0e3c68b23757
master date: 2021-09-08 14:37:45 +0200

x86/p2m-pt: fix p2m_flags_to_access()

The initial if() was inverted, invalidating all output from this
function. Which in turn means the mirroring of P2M mappings into the
IOMMU didn't always work as intended: Mappings may have got updated when
there was no need to. There would not have been too few (un)mappings;
what saves us is that alongside the flags comparison MFNs also get
compared, with non-present entries always having an MFN of 0 or
INVALID_MFN while present entries always have MFNs different from these
two (0 in the table also meant to cover INVALID_MFN):

OLD NEW
P W access MFN P W access MFN
0 0 r 0 0 0 n 0
0 1 rw 0 0 1 n 0
1 0 n non-0 1 0 r non-0
1 1 n non-0 1 1 rw non-0

present <-> non-present transitions are fine because the MFNs differ.
present -> present transitions as well as non-present -> non-present
ones are potentially causing too many map/unmap operations, but never
too few, because in that case old (bogus) and new access differ.

Fixes: d1bb6c97c31e ("IOMMU: also pass p2m_access_t to p2m_get_iommu_flags())
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: e70a9a043a5ce6d4025420f729bc473f711bf5d1
master date: 2021-09-07 14:24:49 +0200

x86/P2M: relax guarding of MMIO entries

One of the changes comprising the fixes for XSA-378 disallows replacing
MMIO mappings by code paths not intended for this purpose. At least in
the case of PVH Dom0 hitting an RMRR covered by an E820 ACPI region,
this is too strict. Generally short-circuit requests establishing the
same kind of mapping (mfn, type), but allow permissions to differ.

While there, also add a log message to the other domain_crash()
invocation that did prevent PVH Dom0 from coming up after the XSA-378
changes.

Fixes: 753cb68e6530 ("x86/p2m: guard (in particular) identity mapping entries")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 111469cc7b3f586c2335e70205320ed3c828b89e
master date: 2021-09-07 09:39:38 +0200

x86/PVH: de-duplicate mappings for first Mb of Dom0 memory

One of the changes comprising the fixes for XSA-378 disallows replacing
MMIO mappings by code paths not intended for this purpose. This means we
need to be more careful about the mappings put in place in this range -
mappings should be created exactly once:
- iommu_hwdom_init() comes first; it should avoid the first Mb,
- pvh_populate_p2m() should insert identity mappings only into ranges
not populated as RAM,
- pvh_setup_acpi() should again avoid the first Mb, which was already
dealt with at that point.

Fixes: 753cb68e6530 ("x86/p2m: guard (in particular) identity mapping entries")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 6b4f6a31ace125d658a581e8d10809e4fccdc272
master date: 2021-08-31 17:43:36 +0200

gnttab: avoid triggering assertion in radix_tree_ulong_to_ptr()

Relevant quotes from the C11 standard:

"Except where explicitly stated otherwise, for the purposes of this
subclause unnamed members of objects of structure and union type do not
participate in initialization. Unnamed members of structure objects
have indeterminate value even after initialization."

"If there are fewer initializers in a brace-enclosed list than there are
elements or members of an aggregate, [...], the remainder of the
aggregate shall be initialized implicitly the same as objects that have
static storage duration."

"If an object that has static or thread storage duration is not
initialized explicitly, then:
[...]
— if it is an aggregate, every member is initialized (recursively)
   according to these rules, and any padding is initialized to zero
   bits;
[...]"

"A bit-field declaration with no declarator, but only a colon and a
width, indicates an unnamed bit-field." Footnote: "An unnamed bit-field
structure member is useful for padding to conform to externally imposed
layouts."

"There may be unnamed padding within a structure object, but not at its
beginning."

Which makes me conclude:
- Whether an unnamed bit-field member is an unnamed member or padding is
  unclear, and hence also whether the last quote above would render the
  big endian case of the structure declaration invalid.
- Whether the number of members of an aggregate includes unnamed ones is
  also not really clear.
- The initializer in map_grant_ref() initializes all fields of the "cnt"
  sub-structure of the union, so assuming the second quote above applies
  here (indirectly), the compiler isn't required to implicitly
  initialize the rest (i.e. in particular any padding) like would happen
  for static storage duration objects.

Gcc 7.4.1 can be observed (apparently in debug builds only) to translate
aforementioned initializer to a read-modify-write operation of a stack
variable, leaving unchanged the top two bits of whatever was previously
in that stack slot. Clearly if either of the two bits were set,
radix_tree_ulong_to_ptr()'s assertion would trigger.

Therefore, to be on the safe side, add an explicit padding field for the
non-big-endian-bitfields case and give a dummy name to both padding
fields.

Fixes: 9781b51efde2 ("gnttab: replace mapkind()")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: b6da9d0414d69c2682214ee3ecf9816fcac500d0
master date: 2021-08-27 10:54:46 +0200

tools/firmware/ovmf: Use OvmfXen platform file is exist

A platform introduced in EDK II named OvmfXen is now the one to use for
Xen instead of OvmfX64. It comes with PVH support.

Also, the Xen support in OvmfX64 is deprecated,
"deprecation notice: *dynamic* multi-VMM (QEMU vs. Xen) support in OvmfPkg"
https://edk2.groups.io/g/devel/message/75498

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Acked-by: Ian Jackson <iwj@xenproject.org>
(cherry picked from commit aad7b5c11d51d57659978e04702ac970906894e8)
(cherry picked from commit 7988ef515a5eabe74bb5468c8c692e03ee9db8bc)

AMD/IOMMU: don't leave page table mapped when unmapping ...

... an already not mapped page. With all other exit paths doing the
unmap, I have no idea how I managed to miss that aspect at the time.

Fixes: ad591454f069 ("AMD/IOMMU: don't needlessly trigger errors/crashes when unmapping a page")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
master commit: 3cfec6a6aa7a7bf68f8e19e21f450c2febe9acb4
master date: 2021-08-20 12:30:35 +0200

xen/sched: fix get_cpu_idle_time() for smt=0 suspend/resume

With smt=0 during a suspend/resume cycle of the machine the threads
which have been parked before will briefly come up again. This can
result in problems e.g. with cpufreq driver being active as this will
call into get_cpu_idle_time() for a cpu without initialized scheduler
data.

Fix that by letting get_cpu_idle_time() deal with this case. Drop a
redundant check in exchange.

Fixes: 132cbe8f35632fb2 ("sched: fix get_cpu_idle_time() with core scheduling")
Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Dario Faggioli <dfaggioli@suse.com>
master commit: 5293470a77ad980dce2af9b7e6c3f11eeebf1b64
master date: 2021-08-19 13:38:31 +0200

VT-d: Tylersburg errata apply to further steppings

While for 5500 and 5520 chipsets only B3 and C2 are mentioned in the
spec update, X58's also mentions B2, and searching the internet suggests
systems with this stepping are actually in use. Even worse, for X58
erratum #69 is marked applicable even to C2. Split the check to cover
all applicable steppings and to also report applicable errata numbers in
the log message. The splitting requires using the DMI port instead of
the System Management Registers device, but that's then in line (also
revision checking wise) with the spec updates.

Fixes: 6890cebc6a98 ("VT-d: deal with 5500/5520/X58 errata")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 517a90d1ca09ce00e50d46ac25566cc3bd2eb34d
master date: 2021-08-18 09:44:14 +0200

x86/cet: Fix shskt manipulation error with BUGFRAME_{warn,run_fn}

This was a clear oversight in the original CET work.  The BUGFRAME_run_fn and
BUGFRAME_warn paths update regs->rip without an equivalent adjustment to the
shadow stack, causing IRET to suffer #CP because of the mismatch.

One subtle, and therefore fragile, aspect of extable_shstk_fixup() was that it
required regs->rip to have its old value as a cross-check that the right word
in the shadow stack was being edited.

Rework extable_shstk_fixup() into fixup_exception_return() which takes
ownership of the update to both the regular and shadow stacks, ensuring that
the regs->rip update is ordered correctly.

Use the new fixup_exception_return() for BUGFRAME_run_fn and BUGFRAME_warn to
ensure that the shadow stack is updated too.

Fixes: 209fb9919b50 ("x86/extable: Adjust extable handling to be shadow stack compatible")
Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
x86/cet: Fix build on newer versions of GCC

Some versions of GCC complain with:

  traps.c:405:22: error: 'get_shstk_bottom' defined but not used [-Werror=unused-function]
   static unsigned long get_shstk_bottom(unsigned long sp)
                        ^~~~~~~~~~~~~~~~
  cc1: all warnings being treated as errors

Change #ifdef to if ( IS_ENABLED(...) ) to make the sole user of
get_shstk_bottom() visible to the compiler.

Fixes: 35727551c070 ("x86/cet: Fix shskt manipulation error with BUGFRAME_{warn,run_fn}")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Compile-tested-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
master commit: 35727551c0703493a2240e967cffc3063b13d49c
master date: 2021-08-16 16:03:20 +0100
master commit: 54c9736382e0d558a6acd820e44185e020131c48
master date: 2021-08-17 12:55:48 +0100

credit2: avoid picking a spurious idle unit when caps are used

Commit 07b0eb5d0ef0 ("credit2: make sure we pick a runnable unit from the
runq if there is one") did not fix completely the problem of potentially
selecting a scheduling unit that will then not be able to run.

In fact, in case caps are used and the unit we are currently looking
at, during the runqueue scan, does not have enough budget for being run,
we should continue looking instead than giving up and picking the idle
unit.

Suggested-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 0f742839ae57e10687e7a573070c37430f31068c
master date: 2021-08-10 09:29:10 +0200

xen/lib: Fix strcmp() and strncmp()

The C standard requires that each character be compared as unsigned
char. Xen's current behaviour compares as signed char, which changes
the answer when chars with a value greater than 0x7f are used.

Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jane Malalane <jane.malalane@citrix.com>
Reviewed-by: Ian Jackson <iwj@xenproject.org>
master commit: 3747a2bb67daa5a8baeff6cda57dc98a5ef79c3e
master date: 2021-07-30 10:52:46 +0100

x86/hvm: Propagate real error information up through hvm_load()

hvm_load() is currently a mix of -errno and -1 style error handling, which
aliases -EPERM.  This leads to the following confusing diagnostics:

From userspace:
  xc: info: Restoring domain
  xc: error: Unable to restore HVM context (1 = Operation not permitted): Internal error
  xc: error: Restore failed (1 = Operation not permitted): Internal error
  xc_domain_restore: [1] Restore failed (1 = Operation not permitted)

From Xen:
  (XEN) HVM10.0 restore: inconsistent xsave state (feat=0x2ff accum=0x21f xcr0=0x7 bv=0x3 err=-22)
  (XEN) HVM10 restore: failed to load entry 16/0

The actual error was a bad backport, but the -EINVAL got converted to -EPERM
on the way out of the hypercall.

The overwhelming majority of *_load() handlers already use -errno consistenty.
Fix up the rest to be consistent, and fix a few other errors noticed along the
way.

* Failures of hvm_load_entry() indicate a truncated record or other bad data
   size.  Use -ENODATA.
* Don't use {g,}dprintk().  Omitting diagnostics in release builds is rude,
   and almost everything uses unconditional printk()'s.
* Switch some errors for more appropriate ones.

Reported-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 96e5ad4c476e70688295b3cfb537847a3351d6fd
master date: 2021-07-19 14:34:38 +0100

xen/arm: Restrict the amount of memory that dom0less domU and dom0 can allocate

Currently, both dom0less domUs and dom0 can allocate an "unlimited"
amount of memory because d->max_pages is set to ~0U.

In particular, the former are meant to be unprivileged. Therefore the
memory they could allocate should be bounded. As the domain are not yet
officially aware of Xen (we don't expose advertise it in the DT, yet
the hypercalls are accessible), they should not need to allocate more
than the initial amount. So cap set d->max_pages directly the amount of
memory we are meant to allocate.

Take the opportunity to also restrict the memory for dom0 as the
domain is direct mapped (e.g. MFN == GFN) and therefore cannot
allocate outside of the pre-allocated region.

This is CVE-2021-28700 / XSA-383.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Tested-by: Stefano Stabellini <sstabellini@kernel.org>
master commit: c08d68cd2aacbc7cb56e73ada241bfe4639bbc68
master date: 2021-08-25 14:19:31 +0200

gnttab: fix array capacity check in gnttab_get_status_frames()

The number of grant frames is of no interest here; converting the passed
in op.nr_frames this way means we allow for 8 times as many GFNs to be
written as actually fit in the array. We would corrupt xlat areas of
higher vCPU-s (after having faulted many times while trying to write to
the guard pages between any two areas) for 32-bit PV guests. For HVM
guests we'd simply crash as soon as we hit the first guard page, as
accesses to the xlat area are simply memcpy() there.

This is CVE-2021-28699 / XSA-382.

Fixes: 18b1be5e324b ("gnttab: make resource limits per domain")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
master commit: ec820035b875cdbedce5e73f481ce65963ede9ed
master date: 2021-08-25 14:19:09 +0200

gnttab: replace mapkind()

mapkind() doesn't scale very well with larger maptrack entry counts,
using a brute force linear search through all entries, with the only
option of an early loop exit if a matching writable entry was found.
Introduce a radix tree alongside the main maptrack table, thus
allowing much faster MFN-based lookup. To avoid the need to actually
allocate space for the individual nodes, encode the two counters in the
node pointers themselves, thus limiting the number of permitted
simultaneous r/o and r/w mappings of the same MFN to 2³¹-1 (64-bit) /
2¹⁵-1 (32-bit) each.

To avoid enforcing an unnecessarily low bound on the number of
simultaneous mappings of a single MFN, introduce
radix_tree_{ulong_to_ptr,ptr_to_ulong} paralleling
radix_tree_{int_to_ptr,ptr_to_int}.

As a consequence locking changes are also applicable: With there no
longer being any inspection of the remote domain's active entries,
there's also no need anymore to hold the remote domain's grant table
lock. And since we're no longer iterating over the local domain's map
track table, the lock in map_grant_ref() can also be dropped before the
new maptrack entry actually gets populated.

As a nice side effect this also reduces the number of IOMMU operations
in unmap_common(): Previously we would have "established" a readable
mapping whenever we didn't find a writable entry anymore (yet, of
course, at least one readable one). But we only need to do this if we
actually dropped the last writable entry, not if there were none already
before.

This is part of CVE-2021-28698 / XSA-380.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>
master commit: 9781b51efde251efcc0291ddb1d9c7cefe2b2555
master date: 2021-08-25 14:18:39 +0200

gnttab: add preemption check to gnttab_release_mappings()

A guest may die with many grant mappings still in place, or simply with
a large maptrack table. Iterating through this may take more time than
is reasonable without intermediate preemption (to run softirqs and
perhaps the scheduler).

Move the invocation of the function to the section where other
restartable functions get invoked, and have the function itself check
for preemption every once in a while. Have it iterate the table
backwards, such that decreasing the maptrack limit is all it takes to
convey restart information.

In domain_teardown() introduce PROG_none such that inserting at the
front will be easier going forward.

This is part of CVE-2021-28698 / XSA-380.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>
master commit: b1ee10be5625b7d502cef1e6ee3818610ab0d29c
master date: 2021-08-25 14:18:18 +0200

x86/mm: widen locked region in xenmem_add_to_physmap_one()

For pages which can be made part of the P2M by the guest, but which can
also later be de-allocated (grant table v2 status pages being the
present example), it is imperative that they be mapped at no more than a
single GFN. We therefore need to make sure that of two parallel
XENMAPSPACE_grant_table requests for the same status page one completes
before the second checks at which other GFN the underlying MFN is
presently mapped.

Pull ahead the respective get_gfn() and push down the respective
put_gfn(). This leverages that gfn_lock() really aliases p2m_lock(), but
the function makes this assumption already anyway: In the
XENMAPSPACE_gmfn case lock nesting constraints for both involved GFNs
would otherwise need to be enforced to avoid ABBA deadlocks.

This is CVE-2021-28697 / XSA-379.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>
master commit: f147422bf9476fb8161b43e35f5901571ed17c35
master date: 2021-08-25 14:17:56 +0200

x86/p2m: guard (in particular) identity mapping entries

Such entries, created by set_identity_p2m_entry(), should only be
destroyed by clear_identity_p2m_entry(). However, similarly, entries
created by set_mmio_p2m_entry() should only be torn down by
clear_mmio_p2m_entry(), so the logic gets based upon p2m_mmio_direct as
the entry type (separation between "ordinary" and 1:1 mappings would
require a further indicator to tell apart the two).

As to the guest_remove_page() change, commit 48dfb297a20a ("x86/PVH:
allow guest_remove_page to remove p2m_mmio_direct pages"), which
introduced the call to clear_mmio_p2m_entry(), claimed this was done for
hwdom only without this actually having been the case. However, this
code shouldn't be there in the first place, as MMIO entries shouldn't be
dropped this way. Avoid triggering the warning again that 48dfb297a20a
silenced by an adjustment to xenmem_add_to_physmap_one() instead.

Note that guest_physmap_mark_populate_on_demand() gets tightened beyond
the immediate purpose of this change.

Note also that I didn't inspect code which isn't security supported,
e.g. sharing, paging, or altp2m.

This is CVE-2021-28694 / part of XSA-378.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
master commit: 753cb68e653002e89fdcd1c80e52905fdbfb78cb
master date: 2021-08-25 14:17:32 +0200

x86/p2m: introduce p2m_is_special()

Seeing the similarity of grant, foreign, and (subsequently) direct-MMIO
handling, introduce a new P2M type group named "special" (as in "needing
special accessors to create/destroy").

Also use -EPERM instead of other error codes on the two domain_crash()
paths touched.

This is part of XSA-378.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
master commit: 0bf755e2c856628e11e93c76c3e12974e9964638
master date: 2021-08-25 14:17:07 +0200

AMD/IOMMU: re-arrange exclusion range and unity map recording

The spec makes no provisions for OS behavior here to depend on the
amount of RAM found on the system. While the spec may not sufficiently
clearly distinguish both kinds of regions, they are surely meant to be
separate things: Only regions with ACPI_IVMD_EXCLUSION_RANGE set should
be candidates for putting in the exclusion range registers. (As there's
only a single such pair of registers per IOMMU, secondary non-adjacent
regions with the flag set already get converted to unity mapped
regions.)

First of all, drop the dependency on max_page. With commit b4f042236ae0
("AMD/IOMMU: Cease using a dynamic height for the IOMMU pagetables") the
use of it here was stale anyway; it was bogus already before, as it
didn't account for max_page getting increased later on. Simply try an
exclusion range registration first, and if it fails (for being
unsuitable or non-mergeable), register a unity mapping range.

With this various local variables become unnecessary and hence get
dropped at the same time.

With the max_page boundary dropped for using unity maps, the minimum
page table tree height now needs both recording and enforcing in
amd_iommu_domain_init(). Since we can't predict which devices may get
assigned to a domain, our only option is to uniformly force at least
that height for all domains, now that the height isn't dynamic anymore.

Further don't make use of the exclusion range unless ACPI data says so.

Note that exclusion range registration in
register_range_for_all_devices() is on a best effort basis. Hence unity
map entries also registered are redundant when the former succeeded, but
they also do no harm. Improvements in this area can be done later imo.

Also adjust types where suitable without touching extra lines.

This is part of XSA-378.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
master commit: 8ea80530cd0dbb8ffa7ac92606a3ee29663fdc93
master date: 2021-08-25 14:16:46 +0200

AMD/IOMMU: re-arrange/complete re-assignment handling

Prior to the assignment step having completed successfully, devices
should not get associated with their new owner. Hand the device to DomIO
(perhaps temporarily), until after the de-assignment step has completed.

De-assignment of a device (from other than Dom0) as well as failure of
reassign_device() during assignment should result in unity mappings
getting torn down. This in turn requires switching to a refcounted
mapping approach, as was already used by VT-d for its RMRRs, to prevent
unmapping a region used by multiple devices.

This is CVE-2021-28696 / part of XSA-378.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
master commit: 899272539cbe1acda736a850015416fff653a1b6
master date: 2021-08-25 14:16:26 +0200

IOMMU: generalize VT-d's tracking of mapped RMRR regions

In order to re-use it elsewhere, move the logic to vendor independent
code and strip it of RMRR specifics.

Note that the prior "map" parameter gets folded into the new "p2ma" one
(which AMD IOMMU code will want to make use of), assigning alternative
meaning ("unmap") to p2m_access_x. Prepare set_identity_p2m_entry() and
p2m_get_iommu_flags() for getting passed access types other than
p2m_access_rw (in the latter case just for p2m_mmio_direct requests).

Note also that, to be on the safe side, an overlap check gets added to
the main loop of iommu_identity_mapping().

This is part of XSA-378.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
master commit: c0e19d7c6c42f0bfccccd96b4f7b03b5515e10fc
master date: 2021-08-25 14:15:57 +0200

IOMMU: also pass p2m_access_t to p2m_get_iommu_flags()

A subsequent change will want to customize the IOMMU permissions based
on this.

This is part of XSA-378.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
master commit: d1bb6c97c31ef754fb29b29eb307c090414e8022
master date: 2021-08-25 14:15:32 +0200

AMD/IOMMU: correct device unity map handling

Blindly assuming all addresses between any two such ranges, specified by
firmware in the ACPI tables, should also be unity-mapped can't be right.
Nor can it be correct to merge ranges with differing permissions. Track
ranges individually; don't merge at all, but check for overlaps instead.
This requires bubbling up error indicators, such that IOMMU init can be
failed when allocation of a new tracking struct wasn't possible, or an
overlap was detected.

At this occasion also stop ignoring
amd_iommu_reserve_domain_unity_map()'s return value.

This is part of XSA-378 / CVE-2021-28695.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Paul Durrant <paul@xen.org>
master commit: 34750a3eb022462cdd1c36e8ef9049d3d73c824c
master date: 2021-08-25 14:15:11 +0200

AMD/IOMMU: correct global exclusion range extending

Besides unity mapping regions, the AMD IOMMU spec also provides for
exclusion ranges (areas of memory not to be subject to DMA translation)
to be specified by firmware in the ACPI tables. The spec does not put
any constraints on the number of such regions.

Blindly assuming all addresses between any two such ranges should also
be excluded can't be right. Since hardware has room for just a single
such range (comprised of the Exclusion Base Register and the Exclusion
Range Limit Register), combine only adjacent or overlapping regions (for
now; this may require further adjustment in case table entries aren't
sorted by address) with matching exclusion_allow_all settings. This
requires bubbling up error indicators, such that IOMMU init can be
failed when concatenation wasn't possible.

Furthermore, since the exclusion range specified in IOMMU registers
implies R/W access, reject requests asking for less permissions (this
will be brought closer to the spec by a subsequent change).

This is part of XSA-378 / CVE-2021-28695.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
master commit: b02c5c88982411be11e3413159862f255f1f39dc
master date: 2021-08-25 14:12:13 +0200

x86: work around build issue with GNU ld 2.37

I suspect it is commit 40726f16a8d7 ("ld script expression parsing")
which broke the hypervisor build, by no longer accepting section names
with a dash in them inside ADDR() (and perhaps other script directives
expecting just a section name, not an expression): .note.gnu.build-id
is such a section.

Quoting all section names passed to ADDR() via DECL_SECTION() works
around the regression.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 58ad654ebce7ccb272a3f4f3482c03aaad850d31
master date: 2021-07-27 15:03:29 +0100

libxl/x86: check return value of SHADOW_OP_SET_ALLOCATION domctl

The hypervisor may not have enough memory to satisfy the request. While
there, make the unit of the value clear by renaming the local variable.

Requested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
backport-requested-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 0be5a00af590c97ea553aadb60f1e0b3af53d8f6)
(cherry picked from commit 6bbdcefd205903b2181b3b4fdc9503709ecdb7c4)

xen/arm: bootfdt: Always sort memory banks

At the moment, Xen on Arm64 expects the memory banks to be ordered.
Unfortunately, there may be a case when updated by firmware
device tree contains unordered banks. This means Xen will panic
when setting xenheap mappings for the subsequent bank with start
address being less than xenheap_mfn_start (start address of
the first bank).

As there is no clear requirement regarding ordering in the device
tree, update code to be able to deal with by sorting memory
banks. There is only one heap region on Arm32, so the sorting
is fine to be done in the common code.

Suggested-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com>
Acked-by: Julien Grall <jgrall@amazon.com>
(cherry picked from commit 4473f3601098a2c3cf5ab89d5a29504772985e3a)

arm: Modify type of actlr to register_t

AArch64 registers are 64bit whereas AArch32 registers
are 32bit or 64bit. MSR/MRS are expecting 64bit values thus
we should get rid of helpers READ/WRITE_SYSREG32
in favour of using READ/WRITE_SYSREG.
We should also use register_t type when reading sysregs
which can correspond to uint64_t or uint32_t.
Even though many AArch64 registers have upper 32bit reserved
it does not mean that they can't be widen in the future.

ACTLR_EL1 system register bits are implementation defined
which means it is possibly a latent bug on current HW as the CPU
implementer may already have decided to use the top 32bit.

Signed-off-by: Michal Orzel <michal.orzel@arm.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>
(cherry picked from commit b80470c84553808fef3a6803000ceee8a100e63c)

Arm32: MSR to SPSR needs qualification

The Arm ARM's description of MSR (ARM DDI 0406C.d section B9.3.12)
doesn't even allow for plain "SPSR" here, and while gas accepts this, it
takes it to mean SPSR_cf. Yet surely all of SPSR wants updating on this
path, not just the lowest and highest 8 bits.

Fixes: dfcffb128be4 ("xen/arm32: SPSR_hyp/SPSR")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 93031fbe9f4c341a2e7950a088025ea550291433)

xen/arm32: SPSR_hyp/SPSR

SPSR_hyp is not meant to be accessed from Hyp mode (EL2); accesses
trigger UNPREDICTABLE behaviour. Xen should read/write SPSR instead.
See: ARM DDI 0487D.b page G8-5993.

This fixes booting Xen/arm32 on QEMU.

Signed-off-by: Stefano Stabellini <stefano.stabellini@xilinx.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Edgar E. Iglesias <edgar.iglesias@xilinx.com>
Tested-by: Edgar E. Iglesias <edgar.iglesias@xilinx.com>
(cherry picked from commit dfcffb128be46a3e413eaa941744536fe53c94b6)

tools/libxenstat: fix populating vbd.rd_sect

Fixes: 91c3e3dc91d6 ("tools/xentop: Display '-' when stats are not available.")
Signed-off-by: Richard Kojedzinszky <richard@kojedz.in>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 89d57f291e37b4769ab26db919eba46548f2e13e)

tools/python: fix Python3.4 TypeError in format string

Using the first element of a tuple for a format specifier fails with
python3.4 as included in SLE12:
b = b"string/%x" % (i, )
TypeError: unsupported operand type(s) for %: 'bytes' and 'tuple'

It happens to work with python 2.7 and 3.6.
To support older Py3, format as strings and explicitly encode as ASCII.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
(cherry picked from commit a27976a1080d537fb1f212a8f9133d60daa0025b)

tools/python: handle libxl__physmap_info.name properly in convert-legacy-stream

The trailing member name[] in libxl__physmap_info is written as a
cstring into the stream. The current code does a sanity check if the
last byte is zero. This attempt fails with python3 because name[-1]
returns a type int. As a result the comparison with byte(\00) fails:

  File "/usr/lib/xen/bin/convert-legacy-stream", line 347, in read_libxl_toolstack
    raise StreamError("physmap name not NUL terminated")
  StreamError: physmap name not NUL terminated

To handle both python variants, cast to bytearray().

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
(cherry picked from commit c8f88810db2a25d6aacf65c1c60bc4f5d848a483)

tools: use integer division in convert-legacy-stream

A single slash gives a float, a double slash gives an int.

bitmap = unpack_exact("Q" * ((max_id/64) + 1))
TypeError: can't multiply sequence by non-int of type 'float'

Use future division to remain compatible with python 2.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 74d044d51b19bb697eac5c3deafa140f6afafec8)

x86/mem-sharing: ensure consistent lock order in get_two_gfns()

While the comment validly says "Sort by domain, if same domain by gfn",
the implementation also included equal domain IDs in the first part of
the check, thus rending the second part entirely dead and leaving
deadlock potential when there's only a single domain involved.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Tamas K Lengyel <tamas@tklengyel.com>
master commit: 09af2d01a2fe6a0af08598bdfe12c9707f4d82ba
master date: 2021-07-07 12:35:12 +0200

build: fix %.s: %.S rule

Fixes: e321576f4047 ("xen/build: start using if_changed")
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: d468f9522174114ab06239894b6079c0a487e408
master date: 2021-07-05 16:47:51 +0200

IOMMU/PCI: don't let domain cleanup continue when device de-assignment failed

Failure here could in principle mean the device may still be issuing DMA
requests, which would continue to be translated by the page tables the
device entry currently points at. With this we cannot allow the
subsequent cleanup step of freeing the page tables to occur, to prevent
use-after-free issues. We would need to accept, for the time being, that
in such a case the remaining domain resources will all be leaked, and
the domain will continue to exist as a zombie.

However, with flushes no longer timing out (and with proper timeout
detection for device I/O TLB flushing yet to be implemented), there's no
way anymore for failures to occur, except due to bugs elsewhere. Hence
the change here is merely a "just in case" one.

In order to continue the loop in spite of an error, we can't use
pci_get_pdev_by_domain() anymore. I have no idea why it was used here in
the first place, instead of the cheaper list iteration.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
master commit: f591755823a7e94fc6b4b8ddce71f0421a94fa09
master date: 2021-06-25 14:06:55 +0200

VT-d: don't lose errors when flushing TLBs on multiple IOMMUs

While no longer an immediate problem with flushes no longer timing out,
errors (if any) get properly reported by iommu_flush_iotlb_{dsi,psi}().
Overwriting such an error with, perhaps, a success indicator received
from another IOMMU will misguide callers. Record the first error, but
don't bail from the loop (such that further necessary invalidation gets
carried out on a best effort basis).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: e7059776f9755b989a992d229c68c3d7778412be
master date: 2021-06-24 16:30:06 +0200

VT-d: clear_fault_bits() should clear all fault bits

If there is any way for one fault to be left set in the recording
registers, there's no reason there couldn't also be multiple ones. If
PPF set set (being the OR or all F fields), simply loop over the entire
range of fault recording registers, clearing F everywhere.

Since PPF is a r/o bit, also remove it from DMA_FSTS_FAULTS (arguably
the constant's name is ambiguous as well).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 80589800ae62fce43fd3921e8fbd362465fe5ba3
master date: 2021-06-24 16:29:42 +0200

VT-d: adjust domid map updating when unmapping context

When an earlier error occurred, cleaning up the domid mapping data is
wrong, as references likely still exist. The only exception to this is
when the actual unmapping worked, but some flush failed (supposedly
impossible after XSA-373). The guest will get crashed in such a case
though, so add fallback cleanup to domain destruction to cover this
case. This in turn makes it desirable to silence the dprintk() in
domain_iommu_domid().

Note that no error will be returned anymore when the lookup fails - in
the common case lookup failure would already have caused
domain_context_unmap_one() to fail, yet even from a more general
perspective it doesn't look right to fail domain_context_unmap() in such
a case when this was the last device, but not when any earlier unmap was
otherwise successful.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 32655880057ce2829f962d46916ea6cec60f98d3
master date: 2021-06-24 16:29:13 +0200

VT-d: undo device mappings upon error

When
- flushes (supposedly not possible anymore after XSA-373),
- secondary mappings for legacy PCI devices behind bridges,
- secondary mappings for chipset quirks, or
- find_upstream_bridge() invocations
fail, the successfully established device mappings should not be left
around.

Further, when (parts of) unmapping fail, simply returning an error is
typically not enough. Crash the domain instead in such cases, arranging
for domain cleanup to continue in a best effort manner despite such
failures.

Finally make domain_context_unmap()'s error behavior consistent in the
legacy PCI device case: Don't bail from the function in one special
case, but always just exit the switch statement.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: f3401d65d9f0dce508c3d7da55de4a093d748ae1
master date: 2021-06-24 16:28:25 +0200

libs/foreignmemory: Fix osdep_xenforeignmemory_map prototype

Commit cf8c4d3d13b8 made some preparation to have one day
variable-length-array argument, but didn't declare the array in the
function prototype the same way as in the function definition. And now
GCC 11 complains about it.

Fixes: cf8c4d3d13b8 ("tools/libs/foreignmemory: pull array length argument to map forward")
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 5d3e4ebb5c71477d74a0c503438545a0126d3863
master date: 2021-06-15 18:07:58 +0100

x86/vpt: fully init timers before putting onto list

With pt_vcpu_lock() no longer acquiring the pt_migrate lock, parties
iterating the list and acting on the timers of the list entries will no
longer be kept from entering their loops by create_periodic_time()'s
holding of that lock. Therefore at least init_timer() needs calling
ahead of list insertion, but keep this and set_timer() together.

Fixes: 8113b02f0bf8 ("x86/vpt: do not take pt_migrate rwlock in some cases")
Reported-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
master commit: 6d622f3a96bbd76ce8422c6e3805e6609417ec76
master date: 2021-06-15 15:14:20 +0200

xen: credit2: fix per-entity load tracking when continuing running

If we schedule, and the current vCPU continues to run, its statistical
load is not properly updated, resulting in something like this, even if
all the 8 vCPUs are 100% busy:

(XEN) Runqueue 0:
(XEN) [...]
(XEN)   aveload            = 2097152 (~800%)
(XEN) [...]
(XEN)   Domain: 0 w 256 c 0 v 8
(XEN)     1: [0.0] flags=2 cpu=4 credit=9996885 [w=256] load=35 (~0%)
(XEN)     2: [0.1] flags=2 cpu=2 credit=9993725 [w=256] load=796 (~0%)
(XEN)     3: [0.2] flags=2 cpu=1 credit=9995885 [w=256] load=883 (~0%)
(XEN)     4: [0.3] flags=2 cpu=5 credit=9998833 [w=256] load=487 (~0%)
(XEN)     5: [0.4] flags=2 cpu=6 credit=9998942 [w=256] load=1595 (~0%)
(XEN)     6: [0.5] flags=2 cpu=0 credit=9994669 [w=256] load=22 (~0%)
(XEN)     7: [0.6] flags=2 cpu=7 credit=9997706 [w=256] load=0 (~0%)
(XEN)     8: [0.7] flags=2 cpu=3 credit=9992440 [w=256] load=0 (~0%)

As we can see, the average load of the runqueue as a whole is, instead,
computed properly.

This issue would, in theory, potentially affect Credit2 load balancing
logic. In practice, however, the problem only manifests (at least with
these characteristics) when there is only 1 runqueue active in the
cpupool, which also means there is no need to do any load-balancing.

Hence its real impact is pretty much limited to wrong per-vCPU load
percentages, when looking at the output of the 'r' debug-key.

With this patch, the load is updated and displayed correctly:

(XEN) Runqueue 0:
(XEN) [...]
(XEN)   aveload            = 2097152 (~800%)
(XEN) [...]
(XEN) Domain info:
(XEN)   Domain: 0 w 256 c 0 v 8
(XEN)     1: [0.0] flags=2 cpu=4 credit=9995584 [w=256] load=262144 (~100%)
(XEN)     2: [0.1] flags=2 cpu=6 credit=9992992 [w=256] load=262144 (~100%)
(XEN)     3: [0.2] flags=2 cpu=3 credit=9998918 [w=256] load=262118 (~99%)
(XEN)     4: [0.3] flags=2 cpu=5 credit=9996867 [w=256] load=262144 (~100%)
(XEN)     5: [0.4] flags=2 cpu=1 credit=9998912 [w=256] load=262144 (~100%)
(XEN)     6: [0.5] flags=2 cpu=2 credit=9997842 [w=256] load=262144 (~100%)
(XEN)     7: [0.6] flags=2 cpu=7 credit=9994623 [w=256] load=262144 (~100%)
(XEN)     8: [0.7] flags=2 cpu=0 credit=9991815 [w=256] load=262144 (~100%)

Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: 89052b9fa24bf976924e40918fc9fa3b1b940e17
master date: 2021-06-07 13:17:06 +0100

credit2: make sure we pick a runnable unit from the runq if there is one

A !runnable unit (temporarily) present in the runq may cause us to
stop scanning the runq itself too early. Of course, we don't run any
non-runnable vCPUs, but we end the scan and we fallback to picking
the idle unit. In other word, this prevent us to find there and pick
the actual unit that we're meant to start running (which might be
further ahead in the runq).

Depending on the vCPU pinning configuration, this may lead to such
unit to be stuck in the runq for long time, causing malfunctioning
inside the guest.

Fix this by checking runnable/non-runnable status up-front, in the runq
scanning function.

Reported-by: Michał Leszczyński <michal.leszczynski@cert.pl>
Reported-by: Dion Kant <g.w.kant@hunenet.nl>
Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: 07b0eb5d0ef0be154606aa46b5b4c5c59b158505
master date: 2021-06-07 13:16:36 +0100

SUPPORT.md: Un-shimmed 32-bit PV guests are no longer supported

The support status of 32-bit guests doesn't seem particularly useful.

With it changed to fully unsupported outside of PV-shim, adjust the PV32
Kconfig default accordingly.

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
master commit: 1a0f2fe2297d122a08fee2b26de5de995fdeca13
master date: 2021-06-04 17:24:05 +0100

Update changelog for new upstream 4.14.2+25-gb6a8c4f72d

[git-debrebase changelog: new upstream 4.14.2+25-gb6a8c4f72d]

Update to upstream 4.14.2+25-gb6a8c4f72d

[git-debrebase anchor: new upstream 4.14.2+25-gb6a8c4f72d, merge]

golang/xenlight: fix code generation for python 2.6

Before python 2.7, str.format() calls required that the format fields
were explicitly enumerated, e.g.:

  '{0} {1}'.format(foo, bar)

  vs.

  '{} {}'.format(foo, bar)

Currently, gengotypes.py uses the latter pattern everywhere, which means
the Go bindings do not build on python 2.6. Use the 2.6 syntax for
format() in order to support python 2.6 for now.

Signed-off-by: Nick Rosbrook <rosbrookn@ainfosec.com>
Acked-by: Wei Liu <wl@xen.org>
(cherry picked from commit 6d49fbdeab3e687a6818f809ca3d98ac7ced2c8d)

x86/tsx: Cope with TSX deprecation on SKL/KBL/CFL/WHL

The June 2021 microcode is formally de-featuring TSX on the older Skylake
client CPUs. The workaround from the March 2019 microcode is being dropped,
and replaced with additions to MSR_TSX_FORCE_ABORT to hide the HLE/RTM CPUID
bits.

With this microcode in place, TSX is disabled by default on these CPUs.
Backwards compatibility is provided in the same way as for TAA - RTM force
aborts, rather than suffering #UD, and the CPUID bits can be hidden to recover
performance.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 3e09045991cde360432bc7437103f8f8a6699359)

x86/cpuid: Fix HLE and RTM handling (again)

For reasons which are my fault, but I don't recall why, the
FDP_EXCP_ONLY/NO_FPU_SEL adjustment uses the whole special_features[] array
element, not the two relevant bits.

HLE and RTM were recently added to the list of special features, causing them
to be always set in guest view, irrespective of the toolstacks choice on the
matter.

Rewrite the logic to refer to the features specifically, rather than relying
on the contents of the special_features[] array.

Fixes: 8fe24090d9 ("x86/cpuid: Rework HLE and RTM handling")
Reported-by: Edwin Török <edvin.torok@citrix.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 60fa12dbf1d4d2c4ffe1ef34b495b24aa7e41aa0)

x86/tsx: Deprecate vpmu=rtm-abort and use tsx=<bool> instead

This reuses the rtm_disable infrastructure, so CPUID derivation works properly
when TSX is disabled in favour of working PCR3.

vpmu= is not a supported feature, and having this functionality under tsx=
centralises all TSX handling.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 9fdcf851689cb2a9501d3947cb5d767d9c7797e8)

x86/tsx: Minor cleanup and improvements

* Introduce cpu_has_arch_caps and replace boot_cpu_has(X86_FEATURE_ARCH_CAPS)
* Read CPUID data into the appropriate boot_cpu_data.x86_capability[]
element, as subsequent changes are going to need more cpu_has_* logic.
* Use the hi/lo MSR helpers, which substantially improves code generation.

No practical change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 3670abcaf0324f2aedba0c4dc7939072b27efa1d)

x86/spec-ctrl: Mitigate TAA after S3 resume

The user chosen setting for MSR_TSX_CTRL needs restoring after S3.

All APs get the correct setting via start_secondary(), but the BSP was missed
out.

This is XSA-377 / CVE-2021-28690.

Fixes: 8c4330818f6 ("x86/spec-ctrl: Mitigate the TSX Asynchronous Abort sidechannel")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 8cf276cb2e0b99b96333865873f56b0b31555ff1)

x86/spec-ctrl: Protect against Speculative Code Store Bypass

Modern x86 processors have far-better-than-architecturally-guaranteed self
modifying code detection.  Typically, when a write hits an instruction in
flight, a Machine Clear occurs to flush stale content in the frontend and
backend.

For self modifying code, before a write which hits an instruction in flight
retires, the frontend can speculatively decode and execute the old instruction
stream.  Speculation of this form can suffer from type confusion in registers,
and potentially leak data.

Furthermore, updates are typically byte-wise, rather than atomic.  Depending
on timing, speculation can race ahead multiple times between individual
writes, and execute the transiently-malformed instruction stream.

Xen has stubs which are used in certain cases for emulation purposes.  Inhibit
speculation between updating the stub and executing it.

This is XSA-375 / CVE-2021-0089.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 45f59ed8865318bb0356954bad067f329677ce9e)

AMD/IOMMU: drop command completion timeout

First and foremost - such timeouts were not signaled to callers, making
them believe they're fine to e.g. free previously unmapped pages.

Mirror VT-d's behavior: A fixed number of loop iterations is not a
suitable way to detect timeouts in an environment (CPU and bus speeds)
independent manner anyway. Furthermore, leaving an in-progress operation
pending when it appears to take too long is problematic: If a command
completed later, the signaling of its completion may instead be
understood to signal a subsequently started command's completion.

Log excessively long processing times (with a progressive threshold) to
have some indication of problems in this area. Allow callers to specify
a non-default timeout bias for this logging, using the same values as
VT-d does, which in particular means a (by default) much larger value
for device IO TLB invalidation.

This is part of XSA-373 / CVE-2021-28692.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
(cherry picked from commit e4fee66043120c954fc309bbb37813604c1c0eb7)

AMD/IOMMU: wait for command slot to be available

No caller cared about send_iommu_command() indicating unavailability of
a slot. Hence if a sufficient number prior commands timed out, we did
blindly assume that the requested command was submitted to the IOMMU
when really it wasn't. This could mean both a hanging system (waiting
for a command to complete that was never seen by the IOMMU) or blindly
propagating success back to callers, making them believe they're fine
to e.g. free previously unmapped pages.

Fold the three involved functions into one, add spin waiting for an
available slot along the lines of VT-d's qinval_next_index(), and as a
consequence drop all error indicator return types/values.

This is part of XSA-373 / CVE-2021-28692.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
(cherry picked from commit c4e44343579e2c3304d676404d81b2b9bd893c5b)

VT-d: eliminate flush related timeouts

Leaving an in-progress operation pending when it appears to take too
long is problematic: If e.g. a QI command completed later, the write to
the "poll slot" may instead be understood to signal a subsequently
started command's completion. Also our accounting of the timeout period
was actually wrong: We included the time it took for the command to
actually make it to the front of the queue, which could be heavily
affected by guests other than the one for which the flush is being
performed.

Do away with all timeout detection on all flush related code paths.
Log excessively long processing times (with a progressive threshold) to
have some indication of problems in this area.

Additionally log (once) if qinval_next_index() didn't immediately find
an available slot. Together with the earlier change sizing the queue(s)
dynamically, we should now have a guarantee that with our fully
synchronous model any demand for slots can actually be satisfied.

This is part of XSA-373 / CVE-2021-28692.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
(cherry picked from commit 7c893abc4ee7e9af62297ba6fd5f27c89d0f33f5)

AMD/IOMMU: size command buffer dynamically

With the present synchronous model, we need two slots for every
operation (the operation itself and a wait command). There can be one
such pair of commands pending per CPU. To ensure that under all normal
circumstances a slot is always available when one is requested, size the
command ring according to the number of present CPUs.

This is part of XSA-373 / CVE-2021-28692.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
(cherry picked from commit df242851ddc93ac0b0a3a20ecab34acc29e3b36b)

VT-d: size qinval queue dynamically

With the present synchronous model, we need two slots for every
operation (the operation itself and a wait descriptor). There can be
one such pair of requests pending per CPU. To ensure that under all
normal circumstances a slot is always available when one is requested,
size the queue ring according to the number of present CPUs.

This is part of XSA-373 / CVE-2021-28692.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
(cherry picked from commit cbfa62bb140c8f10a0ae96db3ce06e22b3b9d302)