dgit.raspbian.org Git

autoconf: Provide libexec_libdir_suffix

This is going to be used to put libfsimage.so into a path containing
the multiarch triplet.

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

tools-libfsimage-prefix.diff

\o/

Do not build the instruction emulator

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

Remove static solaris support from pygrub

Patch-Name: tools-pygrub-remove-static-solaris-support

Gbp-Pq: Topic misc
Gbp-Pq: Name tools-pygrub-remove-static-solaris-support

Do not ship COPYING into /usr/include

This is not wanted in Debian. COPYING ends up in
/usr/share/doc/xen-*copyright.

Patch-Name: tools-include-no-COPYING.diff

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

config-prefix.diff

Patch-Name: config-prefix.diff

Gbp-Pq: Topic prefix-abiname
Gbp-Pq: Name config-prefix.diff

Display Debian package version in hypervisor log

During hypervisor boot, disable the banner and nicely display the xen
version as well as the Maintainer address from debian/control.

For this to work the SOURCE_BASE_DIR variable needs to be set by the
build system to the top directory, i.e. where the debian folder is.

Original patch by Bastian Blank <waldi@debian.org>
Modified by
Hans van Kranenburg <hans@knorrie.org>
Maximilian Engelhardt <maxi@daemonizer.de>

Delete configure output

These autogenerated files are not useful in Debian; dh_autoreconf will
regenerate them.

If this patch does not apply when rebasing, you can simply delete the
files again.

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

Delete config.sub and config.guess

dh_autoreconf will provide these back.

If this patch does not apply when rebasing, you can simply delete the
files again.

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

debian/changelog: finish 4.16.1-1

Signed-off-by: Hans van Kranenburg <hans@knorrie.org>

d/control: make qemu-system-xen a versioned dependency

Qemu version 1:7.0+dfsg-6 fixes adding the --enable-xen-pci-passthrough
option for the xen binary.

Signed-off-by: Hans van Kranenburg <hans@knorrie.org>

d/control: Improve spelling and grammar

Signed-off-by: Diederik de Haas <didi.debian@cknow.org>

d/control: Harmonize the capitalization of the 'Xen' word

The writing style of 'Xen' was inconsistent, so fix that.

Signed-off-by: Diederik de Haas <didi.debian@cknow.org>

d/control: drop obsolete paragraph about separate xen linux kernel package

It is already not needed any more to install a different kernel package (e.g.
linux-image-2.6.32-5-xen-amd64), for a long time.

Signed-off-by: Hans van Kranenburg <hans@knorrie.org>

d/xen-utils-common.NEWS: add item about qemu-system-xen

debian: switch from recommending qemu-system-x86 to qemu-system-xen

Qemu builds xen-specific package now with xen support, with qemu binary
being /usr/libexec/xen-qemu-system-i386. Stop recommending generic
qemu-system-x86 with a lot of dependencies, move to much leanier
qemu-system-xen build.

d/xen-utils-common.xen.init: it is qemu -monitor none, not -monitor /dev/null

It makes no sense to create qemu devices and shut them off to/from /dev/null.
We were supposed to _not_ create these devices instead of creating them and
getting rid of their output. We do not need these devices in the first place.

d/control: stop recommending qemu-system-x86 on arm

qemu is only being built with --with-xen on amd64, kthxbye

Signed-off-by: Hans van Kranenburg <hans@knorrie.org>

d/control: make xen-hypervisor-common arch specific

It really does not make sense at all to have this package available for
any other arch than that we provide the other packages for. So, do the
same thing like was done for xen-utils-common.

The Debian Policy Manual, in section 5.6.8. Architecture, tells us:
"Specifying a list of architectures or architecture wildcards other than
any is for the minority of cases where a program is not portable or is
not useful on some architectures." Well, in our case, Xen is not
portable to all the other architectures, so this package is not useful
on them.

Signed-off-by: Hans van Kranenburg <hans@knorrie.org>

debian: split NEWS file into package specific ones

Users who only have some libxen* package installed do not have to be
presented with scary news items that are relevant only if you have
hypervisor and/or tools installed.

Besides that, the messages are mostly really package specific. And,
there's no strict enforcement (because the user is able to have multiple
xen versions installed) to have exact the same version for all xen
packages. If you install a newer xen-hypervisor-common, you won't get
the init script fixes for example that apt-listchanges would be blaring
about.

(Closes: #962267)

Signed-off-by: Hans van Kranenburg <hans@knorrie.org>

Update changelog for new upstream 4.16.1

[git-debrebase changelog: new upstream 4.16.1]

Update to upstream 4.16.1

[git-debrebase anchor: new upstream 4.16.1, merge]

debian/changelog: finish 4.16.0+51-g0941d6cb-1

update Xen version to 4.16.1

livepatch: avoid relocations referencing ignored section symbols

Track whether symbols belong to ignored sections in order to avoid
applying relocations referencing those symbols. The address of such
symbols won't be resolved and thus the relocation will likely fail or
write garbage to the destination.

Return an error in that case, as leaving unresolved relocations would
lead to malfunctioning payload code.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Tested-by: Bjoern Doebel <doebel@amazon.de>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
master commit: 9120b5737f517fe9d2a3936c38d3a2211630323b
master date: 2022-04-08 10:27:11 +0200

livepatch: do not ignore sections with 0 size

A side effect of ignoring such sections is that symbols belonging to
them won't be resolved, and that could make relocations belonging to
other sections that reference those symbols fail.

For example it's likely to have an empty .altinstr_replacement with
symbols pointing to it, and marking the section as ignored will
prevent the symbols from being resolved, which in turn will cause any
relocations against them to fail.

In order to solve this do not ignore sections with 0 size, only ignore
sections that don't have the SHF_ALLOC flag set.

Special case such empty sections in move_payload so they are not taken
into account in order to decide whether a livepatch can be safely
re-applied after a revert.

Fixes: 98b728a7b2 ('livepatch: Disallow applying after an revert')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Tested-by: Bjoern Doebel <doebel@amazon.de>
Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
master commit: 0dc1f929e8fed681dec09ca3ea8de38202d5bf30
master date: 2022-04-08 10:24:10 +0200

vPCI: fix MSI-X PBA read/write gprintk()s

%pp wants the address of an SBDF, not that of a PCI device.

Fixes: b4f211606011 ("vpci/msix: fix PBA accesses")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: d3f61beea4255e2d86ae82303384c57a3262435e
master date: 2022-04-07 18:01:24 +0200

x86/cpuid: Clobber CPUID leaves 0x800000{1d..20} in policies

c/s 1a914256dca5 increased the AMD max leaf from 0x8000001c to 0x80000021, but
did not adjust anything in the calculate_*_policy() chain.  As a result, on
hardware supporting these leaves, we read the real hardware values into the
raw policy, then copy into host, and all the way into the PV/HVM default
policies.

All 4 of these leaves have enable bits (first two by TopoExt, next by SEV,
next by PQOS), so any software following the rules is fine and will leave them
alone.  However, leaf 0x8000001d takes a subleaf input and at least two
userspace utilities have been observed to loop indefinitely under Xen (clearly
waiting for eax to report "no more cache levels").

Such userspace is buggy, but Xen's behaviour isn't great either.

In the short term, clobber all information in these leaves.  This is a giant
bodge, but there are complexities with implementing all of these leaves
properly.

Fixes: 1a914256dca5 ("x86/cpuid: support LFENCE always serialising CPUID bit")
Link: https://github.com/QubesOS/qubes-issues/issues/7392
Reported-by: fosslinux <fosslinux@aussies.space>
Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: d4012d50082c2eae2f3cbe7770be13b9227fbc3f
master date: 2022-04-07 11:36:45 +0100

VT-d: avoid infinite recursion on domain_context_mapping_one() error path

Despite the comment there infinite recursion was still possible, by
flip-flopping between two domains. This is because prev_dom is derived
from the DID found in the context entry, which was already updated by
the time error recovery is invoked. Simply introduce yet another mode
flag to prevent rolling back an in-progress roll-back of a prior
mapping attempt.

Also drop the existing recursion prevention for having been dead anyway:
Earlier in the function we already bail when prev_dom == domain.

Fixes: 8f41e481b485 ("VT-d: re-assign devices directly")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 99d829dba1390b98a3ca07b365713e62182ee7ca
master date: 2022-04-07 12:31:16 +0200

VT-d: avoid NULL deref on domain_context_mapping_one() error paths

First there's a printk() which actually wrongly uses pdev in the first
place: We want to log the coordinates of the (perhaps fake) device
acted upon, which may not be pdev.

Then it was quite pointless for eb19326a328d ("VT-d: prepare for per-
device quarantine page tables (part I)") to add a domid_t parameter to
domain_context_unmap_one(): It's only used to pass back here via
me_wifi_quirk() -> map_me_phantom_function(). Drop the parameter again.

Finally there's the invocation of domain_context_mapping_one(), which
needs to be passed the correct domain ID. Avoid taking that path when
pdev is NULL and the quarantine state is what would need restoring to.
This means we can't security-support non-PCI-Express devices with RMRRs
(if such exist in practice) any longer; note that as of trhe 1st of the
two commits referenced below assigning them to DomU-s is unsupported
anyway.

Fixes: 8f41e481b485 ("VT-d: re-assign devices directly")
Fixes: 14dd241aad8a ("IOMMU/x86: use per-device page tables for quarantining")
Coverity ID: 1503784
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 608394b906e71587f02e6662597bc985bad33a5a
master date: 2022-04-07 12:30:19 +0200

VT-d: don't needlessly look up DID

If get_iommu_domid() in domain_context_unmap_one() fails, we better
wouldn't clear the context entry in the first place, as we're then unable
to issue the corresponding flush. However, we have no need to look up the
DID in the first place: What needs flushing is very specifically the DID
that was in the context entry before our clearing of it.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 445ab9852d69d8957467f0036098ebec75fec092
master date: 2022-04-07 12:29:03 +0200

tools/firmware: do not add a .note.gnu.property section

Prevent the assembler from creating a .note.gnu.property section on
the output objects, as it's not useful for firmware related binaries,
and breaks the resulting rombios image.

This requires modifying the cc-option Makefile macro so it can test
assembler options (by replacing the usage of the -S flag with -c) and
also stripping the -Wa, prefix if present when checking for the test
output.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Anthony PERARD <anthony.perard@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: e270af94280e6a9610705ebc1fdd1d7a9b1f8a98
master date: 2022-04-04 12:30:07 +0100

tools/firmware: force -fcf-protection=none

Do so right in firmware/Rules.mk, like it's done for other compiler
flags.

Fixes: 3667f7f8f7 ('x86: Introduce support for CET-IBT')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 7225f6e0cd3afd48b4d61c43dd8fead0f4c92193
master date: 2022-04-04 12:30:00 +0100

libxl: Re-scope qmp_proxy_spawn.ao usage

I've observed this failed assertion:
libxl_event.c:2057: libxl__ao_inprogress_gc: Assertion `ao' failed.

AFAICT, this is happening in qmp_proxy_spawn_outcome where
sdss->qmp_proxy_spawn.ao is NULL.

The out label of spawn_stub_launch_dm() calls qmp_proxy_spawn_outcome(),
but it is only in the success path that sdss->qmp_proxy_spawn.ao gets
set to the current ao.

qmp_proxy_spawn_outcome() should instead use sdss->dm.spawn.ao, which is
the already in-use ao when spawn_stub_launch_dm() is called. The same
is true for spawn_qmp_proxy().

With this, move sdss->qmp_proxy_spawn.ao initialization to
spawn_qmp_proxy() since its use is for libxl__spawn_spawn() and it can
be initialized along with the rest of sdss->qmp_proxy_spawn.

Fixes: 83c845033dc8 ("libxl: use vchan for QMP access with Linux stubdomain")
Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
master commit: d62a34423a1a98aefd7c30e22d2d82d198f077c8
master date: 2022-04-01 17:01:57 +0100

libxl: Don't segfault on soft-reset failure

If domain_soft_reset_cb can't rename the save file, it doesn't call
initiate_domain_create() and calls domcreate_complete().

Skipping initiate_domain_create() means dcs->console_wait is
uninitialized and all 0s.

We have:
  domcreate_complete()
    libxl__xswait_stop()
      libxl__ev_xswatch_deregister().

The uninitialized slotnum 0 is considered valid (-1 is the invalid
sentinel), so the NULL pointer path to passed to xs_unwatch() which
segfaults.

libxl__ev_xswatch_deregister:watch w=0x12bc250 wpath=(null) token=0/0: deregister slotnum=0

Move dcs->console_xswait initialization into the callers of
initiate_domain_create, do_domain_create() and do_domain_soft_reset(),
so it is initialized along with the other dcs state.

Fixes: c57e6ebd8c3e ("(lib)xl: soft reset support")
Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
master commit: d2ecf97f911fc00a85b34b70ca311b5d355a9756
master date: 2022-04-01 17:01:57 +0100

xl: Fix global pci options

commit babde47a3fed "introduce a 'passthrough' configuration option to
xl.cfg..." moved the pci list parsing ahead of the global pci option
parsing. This broke the global pci configuration options since they
need to be set first so that looping over the pci devices assigns their
values.

Move the global pci options ahead of the pci list to restore their
function.

Fixes: babde47a3fed ("introduce a 'passthrough' configuration option to xl.cfg...")
Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
Acked-by: Anthony PERARD <anthony.perard@citrix.com>
master commit: e45ad0b1b0bd6a43f59aaf4a6f86d88783c630e5
master date: 2022-03-31 19:48:12 +0100

tools/libs/light: set video_mem for PVH guests

The size of the video memory of PVH guests should be set to 0 in case
no value has been specified.

Doing not so will leave it to be -1, resulting in an additional 1 kB
of RAM being advertised in the memory map (here the output of a PVH
Mini-OS boot with 16 MB of RAM assigned):

Memory map:
000000000000-0000010003ff: RAM
0000feff8000-0000feffffff: Reserved
0000fc008000-0000fc00803f: ACPI
0000fc000000-0000fc000fff: ACPI
0000fc001000-0000fc007fff: ACPI

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Anthony PERARD <anthony.perard@citrix.com>
master commit: 0a20a53df158eb0724ce6dcd9de70cbdad547d6f
master date: 2021-12-09 16:26:29 +0000

IOMMU/x86: use per-device page tables for quarantining

Devices with RMRRs / unity mapped regions, due to it being unspecified
how/when these memory regions may be accessed, may not be left
disconnected from the mappings of these regions (as long as it's not
certain that the device has been fully quiesced). Hence even the page
tables used when quarantining such devices need to have mappings of
those regions. This implies installing page tables in the first place
even when not in scratch-page quarantining mode.

This is CVE-2022-26361 / part of XSA-400.

While for the purpose here it would be sufficient to have devices with
RMRRs / unity mapped regions use per-device page tables, extend this to
all devices (in scratch-page quarantining mode). This allows the leaf
pages to be mapped r/w, thus covering also memory writes (rather than
just reads) issued by non-quiescent devices.

Set up quarantine page tables as late as possible, yet early enough to
not encounter failure during de-assign. This means setup generally
happens in assign_device(), while (for now) the one in deassign_device()
is there mainly to be on the safe side.

As to the removal of QUARANTINE_SKIP() from domain_context_unmap_one():
I think this was never really needed there, as the function explicitly
deals with finding a non-present context entry. Leaving it there would
require propagating pgd_maddr into the function (like was done by "VT-d:
prepare for per-device quarantine page tables" for
domain_context_mapping_one()).

In VT-d's DID allocation function don't require the IOMMU lock to be
held anymore: All involved code paths hold pcidevs_lock, so this way we
avoid the need to acquire the IOMMU lock around the new call to
context_set_domain_id().

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 14dd241aad8af447680ac73e8579990e2c09c1e7
master date: 2022-04-05 14:24:18 +0200

AMD/IOMMU: abstract maximum number of page table levels

We will want to use the constant elsewhere.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
master commit: a038b514c1e970a8dc32229cbd31f6769ee61ad5
master date: 2022-04-05 14:20:04 +0200

IOMMU/x86: drop TLB flushes from quarantine_init() hooks

The page tables just created aren't hooked up yet anywhere, so there's
nothing that could be present in any TLB, and hence nothing to flush.
Dropping this flush is, at least on the VT-d side, a prereq to per-
device domain ID use when quarantining devices, as dom_io isn't going
to be assigned a DID anymore: The warning in get_iommu_did() would
trigger.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 54c5cef49239e2f27ec3b3fc8804bf57aa4bf46d
master date: 2022-04-05 14:19:42 +0200

IOMMU/x86: maintain a per-device pseudo domain ID

In order to subsequently enable per-device quarantine page tables, we'll
need domain-ID-like identifiers to be inserted in the respective device
(AMD) or context (Intel) table entries alongside the per-device page
table root addresses.

Make use of "real" domain IDs occupying only half of the value range
coverable by domid_t.

Note that in VT-d's iommu_alloc() I didn't want to introduce new memory
leaks in case of error, but existing ones don't get plugged - that'll be
the subject of a later change.

The VT-d changes are slightly asymmetric, but this way we can avoid
assigning pseudo domain IDs to devices which would never be mapped while
still avoiding to add a new parameter to domain_context_unmap().

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 97af062b89d52c0ecf7af254b53345c97d438e33
master date: 2022-04-05 14:19:10 +0200

VT-d: prepare for per-device quarantine page tables (part II)

Replace the passing of struct domain * by domid_t in preparation of
per-device quarantine page tables also requiring per-device pseudo
domain IDs, which aren't going to be associated with any struct domain
instances.

No functional change intended (except for slightly adjusted log message
text).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 7131163c4806e3c7de24873164d1a003d2a27dee
master date: 2022-04-05 14:18:48 +0200

VT-d: prepare for per-device quarantine page tables (part I)

Arrange for domain ID and page table root to be passed around, the latter in
particular to domain_pgd_maddr() such that taking it from the per-domain
fields can be overridden.

No functional change intended.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: eb19326a328d49a6a4dc3930391b340f3bcd8948
master date: 2022-04-05 14:18:26 +0200

AMD/IOMMU: re-assign devices directly

Devices with unity map ranges, due to it being unspecified how/when
these memory ranges may get accessed, may not be left disconnected from
their unity mappings (as long as it's not certain that the device has
been fully quiesced). Hence rather than tearing down the old root page
table pointer and then establishing the new one, re-assignment needs to
be done in a single step.

This is CVE-2022-26360 / part of XSA-400.

Reported-by: Roger Pau Monné <roger.pau@citrix.com>
Similarly quarantining scratch-page mode relies on page tables to be
continuously wired up.

To avoid complicating things more than necessary, treat all devices
mostly equally, i.e. regardless of their association with any unity map
ranges. The main difference is when it comes to updating DTEs, which need
to be atomic when there are unity mappings. Yet atomicity can only be
achieved with CMPXCHG16B, availability of which we can't take for given.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 1fa6e9aa36233fe9c29a204fcb2697e985b8345f
master date: 2022-04-05 14:18:04 +0200

VT-d: re-assign devices directly

Devices with RMRRs, due to it being unspecified how/when the specified
memory regions may get accessed, may not be left disconnected from their
respective mappings (as long as it's not certain that the device has
been fully quiesced). Hence rather than unmapping the old context and
then mapping the new one, re-assignment needs to be done in a single
step.

This is CVE-2022-26359 / part of XSA-400.

Reported-by: Roger Pau Monné <roger.pau@citrix.com>
Similarly quarantining scratch-page mode relies on page tables to be
continuously wired up.

To avoid complicating things more than necessary, treat all devices
mostly equally, i.e. regardless of their association with any RMRRs. The
main difference is when it comes to updating context entries, which need
to be atomic when there are RMRRs. Yet atomicity can only be achieved
with CMPXCHG16B, availability of which we can't take for given.

The seemingly complicated choice of non-negative return values for
domain_context_mapping_one() is to limit code churn: This way callers
passing NULL for pdev don't need fiddling with.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 8f41e481b4852173909363b88c1ab3da747d3a05
master date: 2022-04-05 14:17:42 +0200

VT-d: drop ownership checking from domain_context_mapping_one()

Despite putting in quite a bit of effort it was not possible to
establish why exactly this code exists (beyond possibly sanity
checking). Instead of a subsequent change further complicating this
logic, simply get rid of it.

Take the opportunity and move the respective unmap_vtd_domain_page() out
of the locked region.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: a680b8134b2d1828bbbf443a97feea66e8a85c75
master date: 2022-04-05 14:17:21 +0200

IOMMU/x86: tighten iommu_alloc_pgtable()'s parameter

This is to make more obvious that nothing outside of domain_iommu(d)
actually changes or is otherwise needed by the function.

No functional change intended.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: eba09b9dd78f9e8cbaa78ef0edb301b32def2c7a
master date: 2022-04-05 14:16:46 +0200

VT-d: fix add/remove ordering when RMRRs are in use

In the event that the RMRR mappings are essential for device operation,
they should be established before updating the device's context entry,
while they should be torn down only after the device's context entry was
successfully cleared.

Also switch to %pd in related log messages.

Fixes: fa88cfadf918 ("vt-d: Map RMRR in intel_iommu_add_device() if the device has RMRR")
Fixes: 8b99f4400b69 ("VT-d: fix RMRR related error handling")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 3221f270cf2eba0a22fd4f92319d664eacb92889
master date: 2022-04-05 14:16:10 +0200

VT-d: fix (de)assign ordering when RMRRs are in use

In the event that the RMRR mappings are essential for device operation,
they should be established before updating the device's context entry,
while they should be torn down only after the device's context entry was
successfully updated.

Also adjust a related log message.

This is CVE-2022-26358 / part of XSA-400.

Fixes: 8b99f4400b69 ("VT-d: fix RMRR related error handling")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 78a40f8b5dfa1a3aec43528663f39473d4429101
master date: 2022-04-05 14:15:33 +0200

VT-d: correct ordering of operations in cleanup_domid_map()

The function may be called without any locks held (leaving aside the
domctl one, which we surely don't want to depend on here), so needs to
play safe wrt other accesses to domid_map[] and domid_bitmap[]. This is
to avoid context_set_domain_id()'s writing of domid_map[] to be reset to
zero right away in the case of it racing the freeing of a DID.

For the interaction with context_set_domain_id() and did_to_domain_id()
see the code comment.

{check_,}cleanup_domid_map() are called with pcidevs_lock held or during
domain cleanup only (and pcidevs_lock is also held around
context_set_domain_id()), i.e. racing calls with the same (dom, iommu)
tuple cannot occur.

domain_iommu_domid(), besides its use by cleanup_domid_map(), has its
result used only to control flushing, and hence a stale result would
only lead to a stray extra flush.

This is CVE-2022-26357 / XSA-399.

Fixes: b9c20c78789f ("VT-d: per-iommu domain-id")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: d9eca7bb6c6636eb87bb17b08ba7de270f47ecd0
master date: 2022-04-05 14:12:27 +0200

x86/hap: do not switch on log dirty for VRAM tracking

XEN_DMOP_track_dirty_vram possibly calls into paging_log_dirty_enable
when using HAP mode, and it can interact badly with other ongoing
paging domctls, as XEN_DMOP_track_dirty_vram is not holding the domctl
lock.

This was detected as a result of the following assert triggering when
doing repeated migrations of a HAP HVM domain with a stubdom:

Assertion 'd->arch.paging.log_dirty.allocs == 0' failed at paging.c:198
----[ Xen-4.17-unstable  x86_64  debug=y  Not tainted ]----
CPU:    34
RIP:    e008:[<ffff82d040314b3b>] arch/x86/mm/paging.c#paging_free_log_dirty_bitmap+0x606/0x6
RFLAGS: 0000000000010206   CONTEXT: hypervisor (d0v23)
[...]
Xen call trace:
   [<ffff82d040314b3b>] R arch/x86/mm/paging.c#paging_free_log_dirty_bitmap+0x606/0x63a
   [<ffff82d040279f96>] S xsm/flask/hooks.c#domain_has_perm+0x5a/0x67
   [<ffff82d04031577f>] F paging_domctl+0x251/0xd41
   [<ffff82d04031640c>] F paging_domctl_continuation+0x19d/0x202
   [<ffff82d0403202fa>] F pv_hypercall+0x150/0x2a7
   [<ffff82d0403a729d>] F lstar_enter+0x12d/0x140

Such assert triggered because the stubdom used
XEN_DMOP_track_dirty_vram while dom0 was in the middle of executing
XEN_DOMCTL_SHADOW_OP_OFF, and so log dirty become enabled while
retiring the old structures, thus leading to new entries being
populated in already clear slots.

Fix this by not enabling log dirty for VRAM tracking, similar to what
is done when using shadow instead of HAP. Call
p2m_enable_hardware_log_dirty when enabling VRAM tracking in order to
get some hardware assistance if available. As a side effect the memory
pressure on the p2m pool should go down if only VRAM tracking is
enabled, as the dirty bitmap is no longer allocated.

Note that paging_log_dirty_range (used to get the dirty bitmap for
VRAM tracking) doesn't use the log dirty bitmap, and instead relies on
checking whether each gfn on the range has been switched from
p2m_ram_logdirty to p2m_ram_rw in order to account for dirty pages.

This is CVE-2022-26356 / XSA-397.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 4f4db53784d912c4f409a451c36ebfd4754e0a42
master date: 2022-04-05 14:11:30 +0200

livepatch: account for patch offset when applying NOP patch

While not triggered by the trivial xen_nop in-tree patch on
staging/master, that patch exposes a problem on the stable trees, where
all functions have ENDBR inserted. When NOP-ing out a range, we need to
account for this. Handle this right in livepatch_insn_len().

This requires livepatch_insn_len() to be called _after_ ->patch_offset
was set.

Fixes: 6974c75180f1 ("xen/x86: Livepatch: support patching CET-enhanced functions")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 8a87b9a0fb0564f9d68f0be0a0d1a17c34117b8b
master date: 2022-03-31 10:45:46 +0200

vpci/msix: fix PBA accesses

Map the PBA in order to access it from the MSI-X read and write
handlers. Note that previously the handlers would pass the physical
host address into the {read,write}{l,q} handlers, which is wrong as
those expect a linear address.

Map the PBA using ioremap when the first access is performed. Note
that 32bit arches might want to abstract the call to ioremap into a
vPCI arch handler, so they can use a fixmap range to map the PBA.

Reported-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Alex Olson <Alex.Olson@starlab.io>
master commit: b4f21160601155762a4d014db9623af921fec959
master date: 2022-03-09 16:21:01 +0100

x86/Kconfig: introduce option to select retpoline usage

Add a new Kconfig option under the "Speculative hardening" section
that allows selecting whether to enable retpoline. This depends on the
underlying compiler having retpoline support.

Requested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 95d9ab46143685f169f636cfdd7997e2fc630e86
master date: 2022-02-21 18:17:56 +0000

x86/clang: add retpoline support

Detect whether the compiler supports clang retpoline option and enable
by default if available, just like it's done for gcc.

Note clang already disables jump tables when retpoline is enabled, so
there's no need to also pass the fno-jump-tables parameter. Also clang
already passes the return address in a register always on amd64, so
there's no need for any equivalent mindirect-branch-register
parameter.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 9412486707f8f1ca2eb31c2ef330c5e39c0a2f30
master date: 2022-02-21 18:17:56 +0000

x86/retpoline: split retpoline compiler support into separate option

Keep the previous option as a way to signal generic retpoline support
regardless of the underlying compiler, while introducing a new
CC_HAS_INDIRECT_THUNK that signals whether the underlying compiler
supports retpoline.

No functional change intended.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: e245bc154300b5d0367b64e8b937c9d1da508ad3
master date: 2022-02-21 18:17:56 +0000

livepatch: resolve old address before function verification

When verifying that a livepatch can be applied, we may as well want to
inspect the target function to be patched. To do so, we need to resolve
this function's address before running the arch-specific
livepatch_verify hook.

Signed-off-by: Bjoern Doebel <doebel@amazon.de>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
(cherry picked from commit 5142dc5c25e317c208e3dc16d16b664b9f05dab5)

x86/cet: Remove XEN_SHSTK's dependency on EXPERT

CET-SS hardware is now available from multiple vendors, the feature has
downstream users, and was declared security supported in XSA-398.

Enable it by default.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
(cherry picked from commit fc90d75c2b71ae15b75128e7d0d4dbe718164ecb)

xen/x86: Livepatch: support patching CET-enhanced functions

Xen enabled CET for supporting architectures. The control flow aspect of
CET require functions that can be called indirectly (i.e., via function
pointers) to start with an ENDBR64 instruction. Otherwise a control flow
exception is raised.

This expectation breaks livepatching flows because we patch functions by
overwriting their first 5 bytes with a JMP + <offset>, thus breaking the
ENDBR64. We fix this by checking the start of a patched function for
being ENDBR64. In the positive case we move the livepatch JMP to start
behind the ENDBR64 instruction.

To avoid having to guess the ENDBR64 offset again on patch reversal
(which might race with other mechanisms adding/removing ENDBR
dynamically), use the livepatch metadata to store the computed offset
along with the saved bytes of the overwritten function.

Signed-off-by: Bjoern Doebel <doebel@amazon.de>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Tested-by: Jiamei Xie <jiamei.xie@arm.com>
(cherry picked from commit 6974c75180f1aad44e5428eabf2396b2b50fb0e4)

Note: For backports to 4.14 thru 4.16, there is no endbr-clobbering, hence no
is_endbr64_poison() logic.

x86/cet: Remove writeable mapping of the BSPs shadow stack

An unintended consequence of the BSP using cpu0_stack[] is that writeable
mappings to the BSPs shadow stacks are retained in the bss. This renders
CET-SS almost useless, as an attacker can update both return addresses and the
ret will not fault.

We specifically don't want to shatter the superpage mapping .data and .bss, so
the only way to fix this is to not have the BSP stack in the main Xen image.

Break cpu_alloc_stack() out of cpu_smpboot_alloc(), and dynamically allocate
the BSP stack as early as reasonable in __start_xen(). As a consequence,
there is no need to delay the BSP's memguard_guard_stack() call.

Copy the top of cpu info block just before switching to use the new stack.
Fix a latent bug by setting %rsp to info->guest_cpu_user_regs rather than
->es; this would be buggy if reinit_bsp_stack() called schedule() (which
rewrites the GPR block) directly, but luckily it doesn't.

Finally, move cpu0_stack[] into .init, so it can be reclaimed after boot.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 37786b23b027ab83051175cb8ce9ac86cacfc58e)

x86/cet: Clear IST supervisor token busy bits on S3 resume

Stacks are not freed across S3.  Execution just stops, leaving supervisor
token busy bits active.  Fixing this for the primary shadow stack was done
previously, but there is a (rare) risk that an IST token is left busy too, if
the platform power-off happens to intersect with an NMI/#MC arriving.  This
will manifest as #DF next time the IST vector gets used.

Introduce rdssp() and wrss() helpers in a new shstk.h, cleaning up
fixup_exception_return() and explaining the trick with the literal 1.

Then this infrastructure to rewrite the IST tokens in load_system_tables()
when all the other IST details are being set up.  In the case that an IST
token were left busy across S3, this will clear the busy bit before the stack
gets used.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit e421ed0f68488863599532bda575c03c33cde0e0)

x86/kexec: Fix kexec-reboot with CET active

The kexec_reloc() asm has an indirect jump to relocate onto the identity
trampoline.  While we clear CET in machine_crash_shutdown(), we fail to clear
CET for the non-crash path.  This in turn highlights that the same is true of
resetting the CPUID masking/faulting.

Move both pieces of logic from machine_crash_shutdown() to machine_kexec(),
the latter being common for all kexec transitions.  Adjust the condition for
CET being considered active to check in CR4, which is simpler and more robust.

Fixes: 311434bfc9d1 ("x86/setup: Rework MSR_S_CET handling for CET-IBT")
Fixes: b60ab42db2f0 ("x86/shstk: Activate Supervisor Shadow Stacks")
Fixes: 5ab9564c6fa1 ("x86/cpu: Context switch cpuid masks and faulting state in context_switch()")
Reported-by: David Vrabel <dvrabel@amazon.co.uk>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: David Vrabel <dvrabel@amazon.co.uk>
(cherry picked from commit 7f5b2448bd724f5f24426b2595a9bdceb1e5a346)

x86/spec-ctrl: Disable retpolines with CET-IBT

CET-IBT depend on executing indirect branches for protections to apply.
Extend the clobber for CET-SS to all of CET.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 6e3f36387de566b09aa4145ea0e3bfe4814d68b4)

x86/CET: Fix S3 resume with shadow stacks active

The original shadow stack support has an error on S3 resume with very bizarre
fallout.  The BSP comes back up, but APs fail with:

  (XEN) Enabling non-boot CPUs ...
  (XEN) Stuck ??
  (XEN) Error bringing CPU1 up: -5

and then later (on at least two Intel TigerLake platforms), the next HVM vCPU
to be scheduled on the BSP dies with:

  (XEN) d1v0 Unexpected vmexit: reason 3
  (XEN) domain_crash called from vmx.c:4304
  (XEN) Domain 1 (vcpu#0) crashed on cpu#0:

The VMExit reason is EXIT_REASON_INIT, which has nothing to do with the
scheduled vCPU, and will be addressed in a subsequent patch.  It is a
consequence of the APs triple faulting.

The reason the APs triple fault is because we don't tear down the stacks on
suspend.  The idle/play_dead loop is killed in the middle of running, meaning
that the supervisor token is left busy.

On resume, SETSSBSY finds busy bit set, suffers #CP and triple faults because
the IDT isn't configured this early.

Rework the AP bring-up path to (re)create the supervisor token.  This ensures
the primary stack is non-busy before use.

Note: There are potential issues with the IST shadow stacks too, but fixing
      those is more involved.

Fixes: b60ab42db2f0 ("x86/shstk: Activate Supervisor Shadow Stacks")
Link: https://github.com/QubesOS/qubes-issues/issues/7283
Reported-by: Thiner Logoer <logoerthiner1@163.com>
Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tested-by: Thiner Logoer <logoerthiner1@163.com>
Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 7d9589239ec068c944190408b9838774d5ec1f8f)

x86: Enable CET Indirect Branch Tracking

With all the pieces now in place, turn CET-IBT on when available.

MSR_S_CET, like SMEP/SMAP, controls Ring1 meaning that ENDBR_EN can't be
enabled for Xen independently of PV32 kernels. As we already disable PV32 for
CET-SS, extend this to all CET, adjusting the documentation/comments as
appropriate.

Introduce a cet=no-ibt command line option to allow the admin to disable IBT
even when everything else is configured correctly.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit cdbe2b0a1aecae946639ee080f14831429b184b6)

x86/EFI: Disable CET-IBT around Runtime Services calls

UEFI Runtime services, at the time of writing, aren't CET-IBT compatible.
Work is ongoing to address this. In the meantime, unconditionally disable IBT.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit d37a8a067e62e3b6709d224c22f740fdda9d0078)

x86/setup: Rework MSR_S_CET handling for CET-IBT

CET-SS and CET-IBT can be independently controlled, so the configuration of
MSR_S_CET can't be constant any more.

Introduce xen_msr_s_cet_value(), mostly because I don't fancy
writing/maintaining that logic in assembly. Use this in the 3 paths which
alter MSR_S_CET when both features are potentially active.

To active CET-IBT, we only need CR4.CET and MSR_S_CET.ENDBR_EN. This is
common with the CET-SS setup, so reorder the operations to set up CR4 and
MSR_S_CET for any nonzero result from xen_msr_s_cet_value(), and set up
MSR_PL0_SSP and SSP if SHSTK_EN was also set.

Adjust the crash path to disable CET-IBT too.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 311434bfc9d10615adbd340d7fb08c05cd14f4c7)

x86/entry: Make IDT entrypoints CET-IBT compatible

Each IDT vector needs to land on an endbr64 instruction. This is especially
important for the #CP handler, which will recurse indefinitely if the endbr64
is missing, eventually escalating to #DF if guard pages are active.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit e702e36d1d519f4b66086650c1c47d6bac96d4b9)

Also include the continue_pv_domain() change from c/s 954bb07fdb5fad which is
also in entry.S

x86/entry: Make syscall/sysenter entrypoints CET-IBT compatible

Each of MSR_{L,C}STAR and MSR_SYSENTER_EIP need to land on an endbr64
instruction.  For sysenter, this is easy.

Unfortunately for syscall, the stubs are already 29 byte long with a limit of
32.  endbr64 is 4 bytes.  Luckily, there is a 1 byte instruction which can
move from the stubs into the main handlers.

Move the push %rax out of the stub and into {l,c}star_entry(), allowing room
for the endbr64 instruction when appropriate.  Update the comment describing
the entry state.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 17d77ec62a299f4299883ec79ab10cacafd0b2f5)

x86/emul: Update emulation stubs to be CET-IBT compatible

All indirect branches need to land on an endbr64 instruction.

For stub_selftests(), use endbr64 unconditionally for simplicity. For ioport
and instruction emulation, add endbr64 conditionally.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 0d101568d29e8b4bfd33f20031fedec2652aa0cf)

x86: Introduce helpers/checks for endbr64 instructions

... to prevent the optimiser creating unsafe code.  See the code comment for
full details.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 4046ba97446e3974a4411db227263a9f11e0aeb4)

Note: For the backport to 4.14 thru 4.16, we don't care for embedded endbr64
      specifically, but place_endbr64() is a prerequisite for other parts of
      the series.

x86/traps: Rework write_stub_trampoline() to not hardcode the jmp

For CET-IBT, we will need to optionally insert an endbr64 instruction at the
start of the stub. Don't hardcode the jmp displacement assuming that it
starts at byte 24 of the stub.

Also add extra comments describing what is going on. The mix of %rax and %rsp
is far from trivial to follow.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 809beac3e7fdfd20000386453c64a1e2a3d93075)

x86/alternatives: Clear CR4.CET when clearing CR0.WP

This allows us to have CET active much earlier in boot.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 48cdc15a424f9fadad7f9aed00e7dc8ef16a2196)

x86/setup: Read CR4 earlier in __start_xen()

This is necessary for read_cr4() to function correctly. Move the EFER caching
at the same time.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 9851bc4939101828d2ad7634b93c0d9ccaef5b7e)

x86: Introduce support for CET-IBT

CET Indirect Branch Tracking is a hardware feature designed to provide
forward-edge control flow integrity, protecting against jump/call oriented
programming.

IBT requires the placement of endbr{32,64} instructions at the target of every
indirect call/jmp, and every entrypoint.

It is necessary to check for both compiler and assembler support, as the
notrack prefix can be emitted in certain cases.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 3667f7f8f7c471e94e58cf35a95f09a0fe5c1290)

Note: For backports to 4.14 thru 4.16, we are deliberately not using
      -mmanual-endbr as done in staging, as an intermediate approach which
      is not too invasive to backport.

x86/cet: Force -fno-jump-tables for CET-IBT

Both GCC and Clang have a (mis)feature where, even with
-fcf-protection=branch, jump tables are created using a notrack jump rather
than using endbr's in each case statement.

This is incompatible with the safety properties we want in Xen, and enforced
by not setting MSR_S_CET.NOTRACK_EN.  The consequence is a fatal #CP[endbr].

-fno-jump-tables is generally active as a side effect of
CONFIG_INDIRECT_THUNK (retpoline), but as of c/s 95d9ab461436 ("x86/Kconfig:
introduce option to select retpoline usage"), we explicitly support turning
retpoline off.

Fixes: 3667f7f8f7c4 ("x86: Introduce support for CET-IBT")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 9d4a44380d273de22d5753883cbf5581795ff24d)

arm/efi: Handle Xen bootargs from both xen.cfg and DT

Currently the Xen UEFI stub can accept Xen boot arguments from
the Xen configuration file using the "options=" keyword, but also
directly from the device tree specifying xen,xen-bootargs
property.

When the configuration file is used, device tree boot arguments
are ignored and overwritten even if the keyword "options=" is
not used.

This patch handle this case, so if the Xen configuration file is not
specifying boot arguments, the device tree boot arguments will be
used, if they are present.

Signed-off-by: Luca Fancellu <luca.fancellu@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit f3999bc2e099c571e4583bff8f494b834b2f5f76)

xen/arm: increase memory banks number define value

Currently the maximum number of memory banks (NR_MEM_BANKS define)
is fixed to 128, but on some new platforms that have a large amount
of memory, this value is not enough and prevents Xen from booting.

Increase the value to 256.

Signed-off-by: Luca Fancellu <luca.fancellu@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit f1f38e26c3669f5e4583c3756f213c167d19651a)

xen/arm64: Zero the top 32 bits of gp registers on entry...

to hypervisor when switching from AArch32 state.

According to section D1.20.2 of Arm Arm(DDI 0487A.j):
"If the general-purpose register was accessible from AArch32 state the
upper 32 bits either become zero, or hold the value that the same
architectural register held before any AArch32 execution.
The choice between these two options is IMPLEMENTATION DEFINED"

Currently Xen does not ensure that the top 32 bits are zeroed and this
needs to be fixed. The reason why is that there are places in Xen
where we assume that top 32bits are zero for AArch32 guests.
If they are not, this can lead to misinterpretation of Xen regarding
what the guest requested. For example hypercalls returning an error
encoded in a signed long like do_sched_op, do_hmv_op, do_memory_op
would return -ENOSYS if the command passed as the first argument was
clobbered.

Create a macro clobber_gp_top_halves to clobber top 32 bits of gp
registers when hyp == 0 (guest mode) and compat == 1 (AArch32 mode).
Add a compile time check to ensure that save_x0_x1 == 1 if
compat == 1.

Signed-off-by: Michal Orzel <michal.orzel@arm.com>
[julieng: Tweak the comment in clobber_gp_top_halves]
Acked-by: Julien Grall <jgrall@amazon.com>
(cherry picked from commit 32365f3476ac4655f2f26111cd7879912808cd77)

xz: validate the value before assigning it to an enum variable

This might matter, for example, if the underlying type of enum xz_check
was a signed char. In such a case the validation wouldn't have caught an
unsupported header. I don't know if this problem can occur in the kernel
on any arch but it's still good to fix it because some people might copy
the XZ code to their own projects from Linux instead of the upstream
XZ Embedded repository.

This change may increase the code size by a few bytes. An alternative
would have been to use an unsigned int instead of enum xz_check but
using an enumeration looks cleaner.

Link: https://lore.kernel.org/r/20211010213145.17462-3-xiang@kernel.org
Signed-off-by: Lasse Collin <lasse.collin@tukaani.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Origin: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 4f8d7abaa413
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Luca Fancellu <luca.fancellu@arm.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 0a21660515c24f09c4ee060ce0bb42e4b2e6b6fa
master date: 2022-03-07 09:08:54 +0100

xz: avoid overlapping memcpy() with invalid input with in-place decompression

With valid files, the safety margin described in lib/decompress_unxz.c
ensures that these buffers cannot overlap. But if the uncompressed size
of the input is larger than the caller thought, which is possible when
the input file is invalid/corrupt, the buffers can overlap. Obviously
the result will then be garbage (and usually the decoder will return
an error too) but no other harm will happen when such an over-run occurs.

This change only affects uncompressed LZMA2 chunks and so this
should have no effect on performance.

Link: https://lore.kernel.org/r/20211010213145.17462-2-xiang@kernel.org
Signed-off-by: Lasse Collin <lasse.collin@tukaani.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Origin: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 83d3c4f22a36
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Luca Fancellu <luca.fancellu@arm.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 10454f381f9157bce26d5db15e07e857b317b4af
master date: 2022-03-07 09:08:08 +0100

tools/libxl: don't allow IOMMU usage with PoD

Prevent libxl from creating guests that attempts to use PoD together
with an IOMMU, even if no devices are actually assigned.

While the hypervisor could support using PoD together with an IOMMU as
long as no devices are assigned, such usage seems doubtful. There's no
guarantee the guest has PoD no longer be active, and thus a later
assignment of a PCI device to such domain could fail.

Preventing the usage of PoD together with an IOMMU at guest creation
avoids having to add checks for active PoD entries in the device
assignment paths.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
master commit: 07449ecfa42532495156fa342af2112e3e31dd3f
master date: 2022-02-18 09:03:08 +0100

x86/console: process softirqs between warning prints

Process softirqs while printing end of boot warnings. Each warning can
be several lines long, and on slow consoles printing multiple ones
without processing softirqs can result in the watchdog triggering:

(XEN) [   22.277806] ***************************************************
(XEN) [   22.417802] WARNING: CONSOLE OUTPUT IS SYNCHRONOUS
(XEN) [   22.556029] This option is intended to aid debugging of Xen by ensuring
(XEN) [   22.696802] that all output is synchronously delivered on the serial line.
(XEN) [   22.838024] However it can introduce SIGNIFICANT latencies and affect
(XEN) [   22.978710] timekeeping. It is NOT recommended for production use!
(XEN) [   23.119066] ***************************************************
(XEN) [   23.258865] Booted on L1TF-vulnerable hardware with SMT/Hyperthreading
(XEN) [   23.399560] enabled.  Please assess your configuration and choose an
(XEN) [   23.539925] explicit 'smt=<bool>' setting.  See XSA-273.
(XEN) [   23.678860] ***************************************************
(XEN) [   23.818492] Booted on MLPDS/MFBDS-vulnerable hardware with SMT/Hyperthreading
(XEN) [   23.959811] enabled.  Mitigations will not be fully effective.  Please
(XEN) [   24.100396] choose an explicit smt=<bool> setting.  See XSA-297.
(XEN) [   24.240254] *************************************************(XEN) [   24.247302] Watchdog timer detects that CPU0 is stuck!
(XEN) [   24.386785] ----[ Xen-4.17-unstable  x86_64  debug=y  Tainted:   C    ]----
(XEN) [   24.527874] CPU:    0
(XEN) [   24.662422] RIP:    e008:[<ffff82d04025b84a>] drivers/char/ns16550.c#ns16550_tx_ready+0x3a/0x90

Fixes: ee3fd57acd ('xen: add warning infrastructure')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 6bd1b4d35c05c21a78bf00f610587ce8a75cb5c2
master date: 2022-02-18 09:02:16 +0100

VT-d: drop undue address-of from check_cleanup_domid_map()

For an unknown reason I added back the operator while backporting,
despite 4.16 having c06e3d810314 ("VT-d: per-domain IOMMU bitmap needs
to have dynamic size"). I can only assume that I mistakenly took the
4.15 backport as basis and/or reference.

Fixes: fa45f6b5560e ("VT-d: split domid map cleanup check into a function")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

x86/spec-ctrl: Cease using thunk=lfence on AMD

AMD have updated their Spectre v2 guidance, and lfence/jmp is no longer
considered safe.  AMD are recommending using retpoline everywhere.

Retpoline is incompatible with CET.  All CET-capable hardware has efficient
IBRS (specifically, not something retrofitted in microcode), so use IBRS (and
STIBP for consistency sake).

This is a logical change on AMD, but not on Intel as the default calculations
would end up with these settings anyway.  Leave behind a message if IBRS is
found to be missing.

Also update the default heuristics to never select THUNK_LFENCE.  This causes
AMD CPUs to change their default to retpoline.

Also update the printed message to include the AMD MSR_SPEC_CTRL settings, and
STIBP now that we set it for consistency sake.

This is part of XSA-398 / CVE-2021-26401.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 8d03080d2a339840d3a59e0932a94f804e45110d)

xen/arm: Allow to discover and use SMCCC_ARCH_WORKAROUND_3

Allow guest to discover whether or not SMCCC_ARCH_WORKAROUND_3 is
supported and create a fastpath in the code to handle guests request to
do the workaround.

The function SMCCC_ARCH_WORKAROUND_3 will be called by the guest for
flushing the branch history. So we want the handling to be as fast as
possible.

As the mitigation is applied on every guest exit, we can check for the
call before saving all context and return very early.

This is part of XSA-398 / CVE-2022-23960.

Signed-off-by: Bertrand Marquis <bertrand.marquis@arm.com>
Reviewed-by: Julien Grall <julien@xen.org>
(cherry picked from commit c0a56ea0fd92ecb471936b7355ddbecbaea3707c)

xen/arm: Add Spectre BHB handling

This commit is adding Spectre BHB handling to Xen on Arm.
The commit is introducing new alternative code to be executed during
exception entry:
- SMCC workaround 3 call
- loop workaround (with 8, 24 or 32 iterations)
- use of new clearbhb instruction

Cpuerrata is modified by this patch to apply the required workaround for
CPU affected by Spectre BHB when CONFIG_ARM64_HARDEN_BRANCH_PREDICTOR is
enabled.

To do this the system previously used to apply smcc workaround 1 is
reused and new alternative code to be copied in the exception handler is
introduced.

To define the type of workaround required by a processor, 4 new cpu
capabilities are introduced (for each number of loop and for smcc
workaround 3).

When a processor is affected, enable_spectre_bhb_workaround is called
and if the processor does not have CSV2 set to 3 or ECBHB feature (which
would mean that the processor is doing what is required in hardware),
the proper code is enabled at exception entry.

In the case where workaround 3 is not supported by the firmware, we
enable workaround 1 when possible as it will also mitigate Spectre BHB
on systems without CSV2.

This is part of XSA-398 / CVE-2022-23960.

Signed-off-by: Bertrand Marquis <bertrand.marquis@arm.com>
Signed-off-by: Rahul Singh <rahul.singh@arm.com>
Acked-by: Julien Grall <julien@xen.org>
(cherry picked from commit 62c91eb66a2904eefb1d1d9642e3697a1e3c3a3c)

xen/arm: Add ECBHB and CLEARBHB ID fields

Introduce ID coprocessor register ID_AA64ISAR2_EL1.
Add definitions in cpufeature and sysregs of ECBHB field in mmfr1 and
CLEARBHB in isar2 ID coprocessor registers.

This is part of XSA-398 / CVE-2022-23960.

Signed-off-by: Bertrand Marquis <bertrand.marquis@arm.com>
Acked-by: Julien Grall <julien@xen.org>
(cherry picked from commit 4b68d12d98b8790d8002fcc2c25a9d713374a4d7)

xen/arm: move errata CSV2 check earlier

CSV2 availability check is done after printing to the user that
workaround 1 will be used. Move the check before to prevent saying to the
user that workaround 1 is used when it is not because it is not needed.
This will also allow to reuse install_bp_hardening_vec function for
other use cases.

Code previously returning "true", now returns "0" to conform to
enable_smccc_arch_workaround_1 returning an int and surrounding code
doing a "return 0" if workaround is not needed.

This is part of XSA-398 / CVE-2022-23960.

Signed-off-by: Bertrand Marquis <bertrand.marquis@arm.com>
Reviewed-by: Julien Grall <julien@xen.org>
(cherry picked from commit 599616d70eb886b9ad0ef9d6b51693ce790504ba)

xen/arm: Introduce new Arm processors

Add some new processor identifiers in processor.h and sync Xen
definitions with status of Linux 5.17 (declared in
arch/arm64/include/asm/cputype.h).

This is part of XSA-398 / CVE-2022-23960.

Signed-off-by: Bertrand Marquis <bertrand.marquis@arm.com>
Acked-by: Julien Grall <julien@xen.org>
(cherry picked from commit 35d1b85a6b43483f6bd007d48757434e54743e98)

Update changelog for new upstream 4.16.0+51-g0941d6cb

[git-debrebase changelog: new upstream 4.16.0+51-g0941d6cb]

Update to upstream 4.16.0+51-g0941d6cb

[git-debrebase anchor: new upstream 4.16.0+51-g0941d6cb, merge]

x86emul: fix VPBLENDMW with mask and memory operand

Element size for this opcode depends on EVEX.W, not the low opcode bit.
Make use of AVX512BW being a prereq to AVX512_BITALG and move the case
label there, adding an AVX512BW feature check.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: eddf13b5e9401f6871dcce1ce61c80cff62079ed
master date: 2022-02-14 10:08:38 +0100

tools/libs: Fix build dependencies

Some libs' Makefile aren't loading the dependencies files *.d2.

We can load them from "libs.mk" as none of the Makefile here are
changing $(DEPS) or $(DEPS_INCLUDE) so it is fine to move the
"include" to "libs.mk".

As a little improvement, don't load the dependencies files (and thus
avoid regenerating the *.d2 files) during `make clean`.

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
master commit: e62cc29f9b6c42b67182a1362e2ea18bad75b5ff
master date: 2022-02-08 11:15:53 +0000

build: fix exported variable name CFLAGS_stack_boundary

Exporting a variable with a dash doesn't work reliably, they may be
striped from the environment when calling a sub-make or sub-shell.

CFLAGS-stack-boundary start to be removed from env in patch "build:
set ALL_OBJS in main Makefile; move prelink.o to main Makefile" when
running `make "ALL_OBJS=.."` due to the addition of the quote. At
least in my empirical tests.

Fixes: 2740d96efd ("xen/build: have the root Makefile generates the CFLAGS")
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>

libxl: force netback to wait for hotplug execution before connecting

By writing an empty "hotplug-status" xenstore node in the backend path
libxl can force Linux netback to wait for hotplug script execution
before proceeding to the 'connected' state.

This is required so that netback doesn't skip state 2 (InitWait) and
thus blocks libxl waiting for such state in order to launch the
hotplug script (see libxl__wait_device_connection).

Reported-by: James Dingwall <james-xen@dingwall.me.uk>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Tested-by: James Dingwall <james-xen@dingwall.me.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
Tested-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Wei Liu <wei.liu@kernel.org>
master commit: 0bdc43c8dec993258e930b34855853c22b917519
master date: 2022-01-27 13:51:19 +0100

tools/libs/light: don't touch nr_vcpus_out if listing vcpus and returning NULL

If we are in libxl_list_vcpu() and we are returning NULL, let's avoid
touching the output parameter *nr_vcpus_out, which the caller should
have initialized to 0.

The current behavior could be problematic if are creating a domain and,
in the meantime, an existing one is destroyed when we have already done
some steps of the loop. At which point, we'd return a NULL list of vcpus
but with something different than 0 as the number of vcpus in that list.
And this can cause troubles in the callers (e.g., nr_vcpus_on_nodes()),
when they do a libxl_vcpuinfo_list_free().

Crashes due to this are rare and difficult to reproduce, but have been
observed, with stack traces looking like this one:

#0  libxl_bitmap_dispose (map=map@entry=0x50) at libxl_utils.c:626
#1  0x00007fe72c993a32 in libxl_vcpuinfo_dispose (p=p@entry=0x38) at _libxl_types.c:692
#2  0x00007fe72c94e3c4 in libxl_vcpuinfo_list_free (list=0x0, nr=<optimized out>) at libxl_utils.c:1059
#3  0x00007fe72c9528bf in nr_vcpus_on_nodes (vcpus_on_node=0x7fe71000eb60, suitable_cpumap=0x7fe721df0d38, tinfo_elements=48, tinfo=0x7fe7101b3900, gc=0x7fe7101bbfa0) at libxl_numa.c:258
#4  libxl__get_numa_candidate (gc=gc@entry=0x7fe7100033a0, min_free_memkb=4233216, min_cpus=4, min_nodes=min_nodes@entry=0, max_nodes=max_nodes@entry=0, suitable_cpumap=suitable_cpumap@entry=0x7fe721df0d38, numa_cmpf=0x7fe72c940110 <numa_cmpf>, cndt_out=0x7fe721df0cf0, cndt_found=0x7fe721df0cb4) at libxl_numa.c:394
#5  0x00007fe72c94152b in numa_place_domain (d_config=0x7fe721df11b0, domid=975, gc=0x7fe7100033a0) at libxl_dom.c:209
#6  libxl__build_pre (gc=gc@entry=0x7fe7100033a0, domid=domid@entry=975, d_config=d_config@entry=0x7fe721df11b0, state=state@entry=0x7fe710077700) at libxl_dom.c:436
#7  0x00007fe72c92c4a5 in libxl__domain_build (gc=0x7fe7100033a0, d_config=d_config@entry=0x7fe721df11b0, domid=975, state=0x7fe710077700) at libxl_create.c:444
#8  0x00007fe72c92de8b in domcreate_bootloader_done (egc=0x7fe721df0f60, bl=0x7fe7100778c0, rc=<optimized out>) at libxl_create.c:1222
#9  0x00007fe72c980425 in libxl__bootloader_run (egc=egc@entry=0x7fe721df0f60, bl=bl@entry=0x7fe7100778c0) at libxl_bootloader.c:403
#10 0x00007fe72c92f281 in initiate_domain_create (egc=egc@entry=0x7fe721df0f60, dcs=dcs@entry=0x7fe7100771b0) at libxl_create.c:1159
#11 0x00007fe72c92f456 in do_domain_create (ctx=ctx@entry=0x7fe71001c840, d_config=d_config@entry=0x7fe721df11b0, domid=domid@entry=0x7fe721df10a8, restore_fd=restore_fd@entry=-1, send_back_fd=send_back_fd@entry=-1, params=params@entry=0x0, ao_how=0x0, aop_console_how=0x7fe721df10f0) at libxl_create.c:1856
#12 0x00007fe72c92f776 in libxl_domain_create_new (ctx=0x7fe71001c840, d_config=d_config@entry=0x7fe721df11b0, domid=domid@entry=0x7fe721df10a8, ao_how=ao_how@entry=0x0, aop_console_how=aop_console_how@entry=0x7fe721df10f0) at libxl_create.c:2075

Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
Tested-by: James Fehlig <jfehlig@suse.com>
Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
master commit: d9d3496e817ace919092d70d4730257b37c2e743
master date: 2022-01-31 10:58:07 +0100

x86/spec-ctrl: Support Intel PSFD for guests

The Feb 2022 microcode from Intel retrofits AMD's MSR_SPEC_CTRL.PSFD interface
to Sunny Cove (IceLake) and later cores.

Update the MSR_SPEC_CTRL emulation, and expose it to guests.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 52ce1c97844db213de01c5300eaaa8cf101a285f)

x86/cpuid: Infrastructure for cpuid word 7:2.edx

While in principle it would be nice to keep leaf 7 in order, that would
involve having an extra 5 words of zeros in a featureset.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit f3709b15fc86c6c6a0959cec8d97f21d0e9f9629)

tests/tsx: Extend test-tsx to check MSR_MCU_OPT_CTRL

This MSR needs to be identical across the system for TSX to have identical
behaviour everywhere. Furthermore, its CPUID bit (SRBDS_CTRL) shouldn't be
visible to guests.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 4b45c4faa8c0637eb41cb4b143ccd4e9548c4908)

x86/tsx: Cope with TSX deprecation on WHL-R/CFL-R

The February 2022 microcode is formally de-featuring TSX on the TAA-impacted
client CPUs.  The backup TAA mitigation (VERW regaining its flushing side
effect) is being dropped, meaning that `smt=0 spec-ctrl=md-clear` no longer
protects against TAA on these parts.

The new functionality enumerates itself via the RTM_ALWAYS_ABORT CPUID
bit (the same as June 2021), but has its control in MSR_MCU_OPT_CTRL as
opposed to MSR_TSX_FORCE_ABORT.

TSX now defaults to being disabled on ucode load.  Furthermore, if SGX is
enabled in the BIOS, TSX is locked and cannot be re-enabled.  In this case,
override opt_tsx to 0, so the RTM/HLE CPUID bits get hidden by default.

While updating the command line documentation, take the opportunity to add a
paragraph explaining what TSX being disabled actually means, and how migration
compatibility works.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit ad9f7c3b2e0df38ad6d54f4769d4dccf765fbcee)