dgit.raspbian.org Git

pygrub: Set sys.path

We install libfsimage in a non-standard path for Reasons.
(See debian/rules.)

This patch was originally part of `tools-pygrub-prefix.diff'
(eg commit 51657319be54) and included changes to the Makefile to
change the installation arrangements (we do that part in the rules now
since that is a lot less prone to conflicts when we update) and to
shared library rpath (which is now done in a separate patch).

(Commit message rewritten by Ian Jackson.)

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>
squash! pygrub: Set sys.path and rpath

hotplug-common: Do not adjust LD_LIBRARY_PATH

This is in the upstream script because on non-Debian systems, the
default install locations in /usr/local/lib might not be on the linker
path, and as a result the hotplug scripts would break.

A reason we might need it in Debian is our multiple version
coinstallation scheme. However, the hotplug scripts all call the
utilities via the wrappers, and the binaries are configured to load
from the right place anyway.

This setting is an annoyance because it requires libdir, which is an
arch-specific path but comes from a file we want to put in
xen-utils-common, an arch:all package.

So drop this setting.

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

sysconfig.xencommons.in: Strip and debianize

Strip all options that are for stuff we don't ship, which is 1)
xenstored as stubdom and 2) xenbackendd, which seems to be dead code
anyway. [1]

It seems useful to give the user the option to revert to xenstored
instead of the default oxenstored if they really want.

[1] https://lists.xen.org/archives/html/xen-devel/2015-07/msg04427.html

Signed-off-by: Hans van Kranenburg <hans@knorrie.org>
Acked-by: Ian Jackson <ijackson@chiark.greenend.org.uk>

vif-common: disable handle_iptable

Also see Debian bug #894013. The current attempt at providing
anti-spoofing rules results in a situation that does not have any
effect. Also note that forwarding bridged traffic to iptables is not
enabled by default, and that for openvswitch users it does not make any
sense.

So, stop cluttering the live iptables ruleset.

This functionality seems to be introduced before 2004 and since then it
has never got some additional love.

It would be nice to have a proper discussion upstream about how Xen
could provide some anti mac/ip spoofing in the dom0. It does not seem to
be a trivial thing to do, since it requires having quite some knowledge
about what the domU is allowed to do or not (e.g. a domU can be a
router...).

Fix empty fields in first hypervisor log line

Instead of:

    (XEN) Xen version 4.11.1 (Debian )
    (@)
    (gcc (Debian 8.2.0-13) 8.2.0) debug=n
    Thu Jan  3 19:08:37 UTC 2019

I'd like to see:

    (XEN) Xen version 4.11.1 (Debian 4.11.1-1~)
    (pkg-xen-devel@lists.alioth.debian.org)
    (gcc (Debian 8.2.0-13) 8.2.0) debug=n
    Thu Jan  3 22:44:00 CET 2019

The substitution was broken since the great packaging refactoring,
because the directory in which the build is done changed.

Also, use the Maintainer address from debian/control instead of the most
recent changelog entry. If someone wants to use the address to ask a
question, they will end up at the team mailing list, which is better
than an individual person.

Revert "tools-xenstore-compatibility.diff"

Following recent discussion in pkg-xen-devel and xen-devel,
https://lists.xenproject.org/archives/html/xen-devel/2018-10/msg00838.html
I am dropping this patch.

For now I revert it. When we next debrebase, we can (if we like)
throw away both the original patch, and this revert.

This reverts commit 5047884c76849b67e364bc525d1b3b55e781cf16.

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

docs/man/xen-vbd-interface.7: Provide properly-formatted NAME section

This manpage was omitted from
docs/man: Provide properly-formatted NAME sections
because I was previously building with markdown not installed.

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

tools/firmware/Makefile: CONFIG_PV_SHIM: enable only on x86_64

Previously this was *dis*abled for x86_*32*. But if someone should
run some of this Makefile on ARM, say, it ought not to be built
either.

Signed-off-by: Ian Jackson <ijackson@chiark.greenend.org.uk>

shim: Provide separate install-shim target

When building on a 32-bit userland, the user wants to build 32-bit
tools and a 64-bit hypervisor.  This involves setting XEN_TARGET_ARCH
to different values for the tools build and the hypervisor build.

So the user must invoke the tools build and the hypervisor build
separately.

However, although the shim is done by the tools/firmware Makefile, its
bitness needs to be the same as the hypervisor, not the same as the
tools.  When run with XEN_TARGET_ARCH=x86_32, it it skipped, which is
wrong.

So the user must invoke the shim build separately.  This can be done
with
   make -C tools/firmware/xen-dir XEN_TARGET_ARCH=x86_64

However, tools/firmware/xen-dir has no `install' target.  The
installation of all `firmware' is done in tools/firmware/Makefile.  It
might be possible to fix this, but it is not trivial.  For example,
the definitions of INST_DIR and DEBG_DIR would need to be copied, as
would an appropriate $(INSTALL_DIR) call.

For now, provide an `install-shim' target in tools/firmware/Makefile.

This has to be called from `install' of course.  We can't make it
a dependency of `install' because it might be run before `all' has
completed.  We could make it depend on a `shim' target but such
a target is nearly impossible to write because everything is done by
the inflexible subdir-$@ machinery.

The overally result of this patch is that existing make invocations
work as before.  But additionally, the user can say
  make -C tools/firmware install-shim XEN_TARGET_ARCH=x86_64
to install the shim.  The user must have built it already.
Unlike the build rune, this install-rune is properly conditional
so it is OK to call on ARM.

What a mess.

Signed-off-by: Ian Jackson <ijackson@chiark.greenend.org.uk>

tools/firmware/Makfile: Respect caller's CONFIG_PV_SHIM

This makes it easier to disable the shim build. (In Debian we need to
build the shim separately because it needs different compiler flags
and a different XEN_COMPILE_ARCH.

Signed-off-by: Ian Jackson <ijackson@chiark.greenend.org.uk>

.gitignore: Add configure output which we always delete and regenerate

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

autoconf: Provide libexec_libdir_suffix

This is going to be used to put libfsimage.so into a path containing
the multiarch triplet.

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

tools-libfsimage-prefix.diff

Patch-Name: tools-libfsimage-prefix.diff

Gbp-Pq: Topic prefix-abiname
Gbp-Pq: Name tools-libfsimage-prefix.diff

tools-libfsimage-abiname.diff

Patch-Name: tools-libfsimage-abiname.diff

Gbp-Pq: Topic prefix-abiname
Gbp-Pq: Name tools-libfsimage-abiname.diff

Do not build the instruction emulator

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

tools/tests/x86_emulator: Pass -no-pie -fno-pic to gcc on x86_32

The current build fails with GCC6 on Debian sid i386 (unstable):

/tmp/ccqjaueF.s: Assembler messages:
/tmp/ccqjaueF.s:3713: Error: missing or invalid displacement expression `vmovd_to_reg_len@GOT'

This is due to the combination of GCC6, and Debian's decision to
enable some hardening flags by default (to try to make runtime
addresses less predictable):
  https://wiki.debian.org/Hardening/PIEByDefaultTransition

This is of no benefit for the x86 instruction emulator test, which is
a rebuild of the emulator code for testing purposes only.  So pass
options to disable this.

These options will be no-ops if they are the same as the compiler
default.

On amd64, the -fno-pic breaks the build in a different way.  So do
this only on i386.

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>
CC: Jan Beulich <jbeulich@suse.com>
CC: Andrew Cooper <andrew.cooper3@citrix.com>
Gbp-Pq: Topic misc
Gbp-Pq: Name toolstestsx86_emulator-pass--no-pie--fno.patch

Remove static solaris support from pygrub

Patch-Name: tools-pygrub-remove-static-solaris-support

Gbp-Pq: Topic misc
Gbp-Pq: Name tools-pygrub-remove-static-solaris-support

tools-xenmon-install.diff

Patch-Name: tools-xenmon-install.diff

Gbp-Pq: Topic misc
Gbp-Pq: Name tools-xenmon-install.diff

Do not ship COPYING into /usr/include

This is not wanted in Debian. COPYING ends up in
/usr/share/doc/xen-*copyright.

Patch-Name: tools-include-no-COPYING.diff

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

config-prefix.diff

Patch-Name: config-prefix.diff

Gbp-Pq: Topic prefix-abiname
Gbp-Pq: Name config-prefix.diff

version

Patch-Name: version.diff

Gbp-Pq: Topic misc
Gbp-Pq: Name version.diff

tools/kdd: mute spurious gcc warning

gcc-8 complains:

    kdd.c:698:13: error: 'memcpy' offset [-204, -717] is out of the bounds [0, 216] of object 'ctrl' with type 'kdd_ctrl' {aka 'union <anonymous>'} [-Werror=array-bounds]
                 memcpy(buf, ((uint8_t *)&ctrl.c32) + offset, len);
                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    kdd.c: In function 'kdd_select_callback':
    kdd.c:642:14: note: 'ctrl' declared here
         kdd_ctrl ctrl;
                  ^~~~

But this is impossible - 'offset' is unsigned and correctly validated
few lines before.

Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Release-Acked-by: Juergen Gross <jgross@suse.com>
(cherry picked from commit 437e00fea04becc91c1b6bc1c0baa636b067a5cc)

libxl/arm: Fix build on arm64 + acpi w/ gcc 8.2

Add zero-padding to #defined ACPI table strings that are copied.
Provides sufficient characters to satisfy the length required to
fully populate the destination and prevent array-bounds warnings.
Add BUILD_BUG_ON sizeof checks for compile-time length checking.

Signed-off-by: Christopher Clark <christopher.clark6@baesystems.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Acked-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit b8f33431f3dd23fb43a879f4bdb4283fdc9465ad)

tools: Move ARRAY_SIZE() into xen-tools/libs.h

xen-tools/libs.h currently contains a shared BUILD_BUG_ON() implementation and
is used by some tools. Extend this to include ARRAY_SIZE and clean up all the
opencoding.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit e1b7eb92d3ec6ce3ca68cffb36a148eb59f59613)

xenpmd: make 32 bit gcc 8.1 non-debug build work

32 bit gcc 8.1 non-debug build yields:

xenpmd.c:354:23: error: '%02x' directive output may be truncated writing between 2 and 8 bytes into a region of size 3 [-Werror=format-truncation=]
     snprintf(val, 3, "%02x",
                       ^~~~
xenpmd.c:354:22: note: directive argument in the range [40, 2147483778]
     snprintf(val, 3, "%02x",
                      ^~~~~~
xenpmd.c:354:5: note: 'snprintf' output between 3 and 9 bytes into a destination of size 3
     snprintf(val, 3, "%02x",
     ^~~~~~~~~~~~~~~~~~~~~~~~
              (unsigned int)(9*4 +
              ~~~~~~~~~~~~~~~~~~~~
                             strlen(info->model_number) +
                             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                             strlen(info->serial_number) +
                             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                             strlen(info->battery_type) +
                             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                             strlen(info->oem_info) + 4));
                             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All info->* used in calculation are 32 bytes long, and the parsing
code makes sure they are null-terminated, so the end result of the
expression won't exceed 255, which should be able to be fit into 3
bytes in hexadecimal format.

Add an assertion to make gcc happy.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
(cherry picked from commit e75c9dc85fdeeeda0b98d8cd8d784e0508c3ffb8)

Delete configure output

These autogenerated files are not useful in Debian; dh_autoreconf will
regenerate them.

If this patch does not apply when rebasing, you can simply delete the
files again.

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

Delete config.sub and config.guess

dh_autoreconf will provide these back.

If this patch does not apply when rebasing, you can simply delete the
files again.

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

tools-xenstore-compatibility.diff

Patch-Name: tools-xenstore-compatibility.diff

Gbp-Pq: Topic xenstore
Gbp-Pq: Name tools-xenstore-compatibility.diff

tools-fake-xs-restrict

Gbp-Pq: Topic xenstore
Gbp-Pq: Name tools-fake-xs-restrict.patch

tools/debugger/kdd: Install as `xen-kdd', not just `kdd'

`kdd' is an unfortunate namespace landgrab.

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

xenmon: Install as xenmon, not xenmon.py

Adding the implementation language as a suffix to a program name is
poor practice.

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

pygrub fsimage.so: Honour LDFLAGS when building

This seems to have been simply omitted. Obviously this is needed when
building and not just when installing. Passing only when installing
is ineffective.

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

libfsimage: Honour general LDFLAGS

Do not reset LDFLAGS to empty. Instead, append the fsimage-special
LDFLAGS.

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

gdbsx: Honour LDFLAGS when linking

This command does the link, so it needs LDFLAGS.

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

tools/xenstat: Fix shared library version

libxenstat does not have a stable ABI. Set its version to the current
Xen release version.

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

docs/man/xen-pv-channel.pod.7: Remove a spurious blank line

No functional change.

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

docs/man: Provide properly-formatted NAME sections

A manpage `foo.7.pod' must start with

=head NAME

foo - some summary of what foo is or what this manpage is

because otherwise manpage catalogue systems cannot generate a proper
`whatis' entry.

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

INSTALL: Mention kconfig

Firstly, add a reference to the documentation for the kconfig system.

Secondly, warn the user about the XEN_CONFIG_EXPERT problem.

CC: Doug Goldstein <cardoe@cardoe.com>
CC: Wei Liu <wei.liu2@citrix.com>
CC: Jan Beulich <jbeulich@suse.com>
CC: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

tools/Rules.mk: Honour PREPEND_LDFLAGS_XEN_TOOLS

This allows the caller to provide some LDFLAGS to the Xen build
system.

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

tools/xentop : replace use of deprecated vwprintw

gcc-8.1 complains:

| xentop.c: In function 'print':
| xentop.c:304:4: error: 'vwprintw' is deprecated [-Werror=deprecated-declarations]
| vwprintw(stdscr, (curses_str_t)fmt, args);
| ^~~~~~~~

vw_printw (note the underscore) is a non-deprecated alternative.

Signed-off-by: Christopher Clark <christopher.clark6@baesystems.com>
Gbp-Pq: Topic misc
Gbp-Pq: Name tools-xentop-replace-use-of-deprecated-vwprintw.patch

Various: Fix typo `mappping'

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

Various: Fix typo `infomation'

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

tools/python/xen/lowlevel: Fix typo `sucess'

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

Various: Fix typo `reseting'

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

Various: Fix typo `occured'

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

Various: Fix typos `unkown', `retreive' (detected by lintian)

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

tools/xentrace/xenalyze: Fix typos detected by lintian

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

docs/man: Fix two typos detected by the Debian lintian tool

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

Update changelog for new upstream 4.11.3+24-g14b62ab3e5

[git-debrebase changelog: new upstream 4.11.3+24-g14b62ab3e5]

Update to upstream 4.11.3+24-g14b62ab3e5

[git-debrebase anchor: new upstream 4.11.3+24-g14b62ab3e5, merge]

Update to 4.11.1+92-g6c33308a8d-2 with MDS documentation

Following up feedback from the release team, add a NEWS file mentioning
the MDS mitigations with some instructions, so that it will be more
visible to people using apt-listchanges.

Mention the ucode option in our default documented set of "usually used
options", so that users doing a new install will get a hint about the
existence of this option, and what it does.

d/changelog: Update 4.11.1+92-g6c33308a8d-1

The +58 was never uploaded, so reuse the changelog entry.

Use urgency=high because this is a huge pile of security fixes.

lz4: fix system halt at boot kernel on x86_64

Sometimes, on x86_64, decompression fails with the following
error:

Decompressing Linux...

Decoding failed

-- System halted

This condition is not needed for a 64bit kernel(from commit d5e7caf):

if( ... ||
(op + COPYLENGTH) > oend)
goto _output_error

macro LZ4_SECURE_COPY() tests op and does not copy any data
when op exceeds the value.

added by analogy to lz4_uncompress_unknownoutputsize(...)

Signed-off-by: Krzysztof Kolasa <kkolasa@winsoft.pl>
[Linux commit 99b7e93c95c78952724a9783de6c78def8fbfc3f]

The offending commit in our case is fcc17f96c277 ("LZ4 : fix the data
abort issue").

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 5d90ff79542ab9c6eebe5c315c68c196bcf353b9
master date: 2019-12-09 14:02:35 +0100

lz4: refine commit 9143a6c55ef7 for the 64-bit case

I clearly went too far there: While the LZ4_WILDCOPY() instances indeed
need prior guarding, LZ4_SECURECOPY() needs this only in the 32-bit case
(where it simply aliases LZ4_WILDCOPY()). "cpy" can validly point
(slightly) below "op" in these cases, due to

cpy = op + length - (STEPSIZE - 4);

where length can be as low as 0 and STEPSIZE is 8. However, instead of
removing the check via "#if !LZ4_ARCH64", refine it such that it would
also properly work in the 64-bit case, aborting decompression instead
of continuing on bogus input.

Reported-by: Mark Pryor <pryorm09@gmail.com>
Reported-by: Jeremi Piotrowski <jeremi.piotrowski@gmail.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Mark Pryor <pryorm09@gmail.com>
Tested-by: Jeremi Piotrowski <jeremi.piotrowski@gmail.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 2d7572cdfa4d481c1ca246aa1ce5239ccae7eb59
master date: 2019-12-09 14:01:25 +0100

x86/tlbflush: do not toggle the PGE CR4 bit unless necessary

When PCID is not available Xen does a full tlbflush by toggling the
PGE bit in CR4. This is not necessary if PGE is not enabled, since a
flush can be performed by writing to CR3 in that case.

Change the code in do_tlb_flush to only toggle the PGE bit in CR4 if
it's already enabled, otherwise do the tlb flush by writing to CR3.
This is relevant when running virtualized, since hypervisors don't
usually trap accesses to CR3 when using hardware assisted paging, but
do trap accesses to CR4 specially on AMD hardware, which makes such
accesses much more expensive.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: b5087a31efee7a4e34c958b88671ac6669501b09
master date: 2019-12-03 14:15:35 +0100

x86: avoid HPET use on certain Intel platforms

Linux commit fc5db58539b49351e76f19817ed1102bf7c712d0 says

"Some Coffee Lake platforms have a skewed HPET timer once the SoCs entered
PC10, which in consequence marks TSC as unstable because HPET is used as
watchdog clocksource for TSC."

Follow this for Xen as well. Looking at its patch context made me notice
they have a pre-existing quirk for Bay Trail as well. The comment there,
however, points at a Cherry Trail document. Looking at the datasheets of
both, there appear to be similar issues, so go beyond Linux'es coverage
and exclude both. Also key the disable on the PCI IDs of the actual
affected devices, rather than those of 00:00.0.

Apply the workarounds only when the use of HPET was not explicitly
requested on the command line and when use of (deep) C-states was not
disabled.

Adjust a few types in touched or nearby code at the same time.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d5294a302c8441191d47888452958aea25243723
master date: 2019-12-03 14:14:44 +0100

gnttab: make sure grant map operations don't skip their IOMMU part

Two almost simultaneous mapping requests need to make sure that at the
completion of the earlier one IOMMU mappings (established explicitly
here in the PV case) have been put in place. Forever since the splitting
of the grant table lock a violation of this has been possible (using
simplified pin counts, as it doesn't matter whether we talk about read
or write mappings here):

initial state: act->pin = 0

vCPU A: progress the operation past the dropping of the locks after the
        act->pin updates (act->pin = 1, old_pin = 0, act_pin = 1)

vCPU B: progress the operation past the dropping of the locks after the
        act->pin updates (act->pin = 2, old_pin = 1, act_pin = 2)

vCPU B: (re-)acquire both gt locks, mapkind() returns 0, but both
        iommu_legacy_map() invocations get skipped due to non-zero
        old_pin

vCPU B: return to caller without IOMMU mapping

vCPU A: (re-)acquire both gt locks, mapkind() returns 0,
        iommu_legacy_map() gets invoked

With the locks dropped intermediately, whether to invoke
iommu_legacy_map() must depend on only the return value of mapkind()
and of course the kind of mapping request being processed, just like
is already the case in unmap_common().

Also fix the style of the adjacent comment, and correct a nearby one
still referring to a prior name of what is now mapkind().

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 921f1f42260c7967bf18f8a143d39511d163c421
master date: 2019-12-03 14:13:40 +0100

x86/psr: fix bug which may cause crash

During test, we found a crash on Xen with below trace.
(XEN) Xen call trace:
(XEN)    [<ffff82d0802a065a>] R psr.c#l3_cdp_write_msr+0x1e/0x22
(XEN)    [<ffff82d0802a0858>] F psr.c#do_write_psr_msrs+0x6d/0x109
(XEN)    [<ffff82d08023e000>] F smp_call_function_interrupt+0x5a/0xac
(XEN)    [<ffff82d0802a2b89>] F call_function_interrupt+0x20/0x34
(XEN)    [<ffff82d080282c64>] F do_IRQ+0x175/0x6ae
(XEN)    [<ffff82d08038b8ba>] F common_interrupt+0x10a/0x120
(XEN)    [<ffff82d0802ec616>] F cpu_idle.c#acpi_idle_do_entry+0x9d/0xb1
(XEN)    [<ffff82d0802ecc01>] F cpu_idle.c#acpi_processor_idle+0x41d/0x626
(XEN)    [<ffff82d08027353b>] F domain.c#idle_loop+0xa5/0xa7
(XEN)
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 20:
(XEN) GENERAL PROTECTION FAULT
(XEN) [error_code=0000]
(XEN) ****************************************

The bug happens when CDP and MBA co-exist and MBA COS_MAX is bigger
than CDP COS_MAX. E.g. MBA has 8 COS registers but CDP only have 6.
When setting MBA throttling value for the 7th guest, the value array
would be:
    +------------------+------------------+--------------+
    | Data default val | Code default val | MBA throttle |
    +------------------+------------------+--------------+

Then, COS id 7 will be selected for writting the values. We should
avoid writting CDP data/code valules to COS id 7 MSR because it
exceeds the CDP COS_MAX.

Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 42c8cdc039d6dc7d6aea8008bb24622eaf4b7bc8
master date: 2019-12-02 15:15:18 +0000

x86 / iommu: set up a scratch page in the quarantine domain

This patch introduces a new iommu_op to facilitate a per-implementation
quarantine set up, and then further code for x86 implementations
(amd and vtd) to set up a read-only scratch page to serve as the source
for DMA reads whilst a device is assigned to dom_io. DMA writes will
continue to fault as before.

The reason for doing this is that some hardware may continue to re-try
DMA (despite FLR) in the event of an error, or even BME being cleared, and
will fail to deal with DMA read faults gracefully. Having a scratch page
mapped will allow pending DMA reads to complete and thus such buggy
hardware will eventually be quiesced.

NOTE: These modifications are restricted to x86 implementations only as
      the buggy h/w I am aware of is only used with Xen in an x86
      environment. ARM may require similar code but, since I am not
      aware of the need, this patch does not modify any ARM implementation.

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: ea38867831da67eed0e9c61672c8941016b63dd9
master date: 2019-11-29 18:27:54 +0000

xen/x86: vpmu: Unmap per-vCPU PMU page when the domain is destroyed

A guest will setup a shared page with the hypervisor for each vCPU via
XENPMU_init. The page will then get mapped in the hypervisor and only
released when XENPMU_finish is called.

This means that if the guest fails to invoke XENPMU_finish, e.g if it is
destroyed rather than cleanly shut down, the page will stay mapped in the
hypervisor. One of the consequences is the domain can never be fully
destroyed as a page reference is still held.

As Xen should never rely on the guest to correctly clean-up any
allocation in the hypervisor, we should also unmap such pages during the
domain destruction if there are any left.

We can re-use the same logic as in pvpmu_finish(). To avoid
duplication, move the logic in a new function that can also be called
from vpmu_destroy().

NOTE: - The call to vpmu_destroy() must also be moved from
        arch_vcpu_destroy() into domain_relinquish_resources() such that
        the reference on the mapped page does not prevent domain_destroy()
        (which calls arch_vcpu_destroy()) from being called.
      - Whilst it appears that vpmu_arch_destroy() is idempotent it is
        by no means obvious. Hence make sure the VPMU_CONTEXT_ALLOCATED
        flag is cleared at the end of vpmu_arch_destroy().
      - This is not an XSA because vPMU is not security supported (see
        XSA-163).

Signed-off-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: be18e39d2f69038804b27c30026754deaeefa543
master date: 2019-11-29 18:23:24 +0000

x86/svm: Write the correct %eip into the outgoing task

The TASK_SWITCH vmexit has fault semantics, and doesn't provide any NRIPs
assistance with instruction length. As a result, any instruction-induced task
switch has the outgoing task's %eip pointing at the instruction switch caused
the switch, rather than after it.

This causes callers of task gates to livelock (repeatedly execute the call/jmp
to enter the task), and any restartable task to become a nop after its first
use (the (re)entry state points at the iret used to exit the task).

32bit Windows in particular is known to use task gates for NMI handling, and
to use NMI IPIs.

In the task switch handler, distinguish instruction-induced from
interrupt/exception-induced task switches, and decode the instruction under
%rip to calculate its length.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 1d758bc6d1a8c0f658a874470c349ee4e27aee46
master date: 2019-11-28 17:14:38 +0000

x86/svm: Always intercept ICEBP

ICEBP isn't handled well by SVM.

The VMexit state for a #DB-vectored TASK_SWITCH has %rip pointing to the
appropriate instruction boundary (fault or trap, as appropriate), except for
an ICEBP-induced #DB TASK_SWITCH, where %rip points at the ICEBP instruction
rather than after it.  As ICEBP isn't distinguished in the vectoring event
type, the state is ambiguous.

To add to the confusion, an ICEBP which occurs due to Introspection
intercepting the instruction, or from x86_emulate() will have %rip updated as
a consequence of partial emulation required to inject an ICEBP event in the
first place.

We could in principle spot the non-injected case in the TASK_SWITCH handler,
but this still results in complexity if the ICEBP instruction also has an
Instruction Breakpoint active on it (which genuinely has fault semantics).

Unconditionally intercept ICEBP.  This does have NRIPs support as it is an
instruction intercept, which allows us to move %rip forwards appropriately
before the TASK_SWITCH intercept is hit.  This makes #DB-vectored switches
have consistent behaviour however the ICEBP #DB came about, and avoids special
cases in the TASK_SWITCH intercept.

This in turn allows for the removal of the conditional
hvm_set_icebp_interception() logic used by the monitor subsystem, as ICEBP's
will now always be submitted for monitoring checks.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Alexandru Isaila <aisaila@bitdefender.com>
Reviewed-by: Petre Pircalabu <ppircalabu@bitdefender.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: e2585f8c2e0d43d350503ff2b2be252adc6b7239
master date: 2019-11-28 17:14:38 +0000

x86/vtx: Fix fault semantics for early task switch failures

The VT-x task switch handler adds inst_len to %rip before calling
hvm_task_switch(), which is problematic in two ways:

1) Early faults (i.e. ones delivered in the context of the old task) get
    delivered with trap semantics, and break restartibility.

2) The addition isn't truncated to 32 bits.  In the corner case of a task
    switch instruction crossing the 4G->0 boundary taking an early fault (with
    trap semantics), a VMEntry failure will occur due to %rip being out of
    range.

Instead, pass the instruction length into hvm_task_switch() and write it into
the outgoing TSS only, leaving %rip in its original location.

For now, pass 0 on the SVM side.  This highlights a separate preexisting bug
which will be addressed in the following patch.

While adjusting call sites, drop the unnecessary uint16_t cast.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 943c74bc0ee5044a826e428a3b2ffbdf9a43628d
master date: 2019-11-28 17:14:38 +0000

x86/vmx: always sync PIR to IRR before vmentry

When using posted interrupts on Intel hardware it's possible that the
vCPU resumes execution with a stale local APIC IRR register because
depending on the interrupts to be injected vlapic_has_pending_irq
might not be called, and thus PIR won't be synced into IRR.

Fix this by making sure PIR is always synced to IRR in
hvm_vcpu_has_pending_irq regardless of what interrupts are pending.

Reported-by: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Tested-by: Joe Jin <joe.jin@oracle.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 56348df32bbc782e63b6e3fb978b80e015ae76e7
master date: 2019-11-28 11:58:25 +0100

x86/domctl: have XEN_DOMCTL_getpageframeinfo3 preemptible

This hypercall can take a long time to finish because it attempts to
grab the `hostp2m' lock up to 1024 times. The accumulated wait for the
lock can take several seconds.

This can easily happen with a guest with 32 vcpus and plenty of RAM,
during localhost migration.

While the patch doesn't fix the problem with the lock contention and
the fact that the `hostp2m' lock is currently global (and not on a
single page), it is still an improvement to the hypercall. It will in
particular, down the road, allow dropping the arbitrary limit of 1024
entries per request.

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 48599114d3ca24157c25f6684bb9322f6dca12bb
master date: 2019-11-26 14:16:09 +0100

x86/tss: Fix clang build following c/s 7888440625

Clang-3.5 from Debian Jessie fails with:

  smpboot.c:829:29: error: statement expression not allowed at file scope
          BUILD_BUG_ON(sizeof(this_cpu(tss_page)) != PAGE_SIZE);
                              ^
  /local/xen.git/xen/include/asm/percpu.h:14:7: note: expanded from macro
          'this_cpu'
      (*RELOC_HIDE(&per_cpu__##var, get_cpu_info()->per_cpu_offset))
        ^
  /local/xen.git/xen/include/xen/compiler.h:98:3: note: expanded from macro
          'RELOC_HIDE'
    ({ unsigned long __ptr;                       \
    ^
  /local/xen.git/xen/include/xen/lib.h:26:53: note: expanded from macro
          'BUILD_BUG_ON'
  #define BUILD_BUG_ON(cond) ((void)BUILD_BUG_ON_ZERO(cond))
                                                      ^
  /local/xen.git/xen/include/xen/lib.h:25:57: note: expanded from macro
          'BUILD_BUG_ON_ZERO'
  #define BUILD_BUG_ON_ZERO(cond) sizeof(struct { int:-!!(cond); })
                                                          ^
  1 error generated.
  /local/xen.git/xen/Rules.mk:202: recipe for target 'smpboot.o' failed

This is obviously a compiler bug because the BUILD_BUG_ON() is not at file
scope.  However, it can be worked around by using a local variable.

Spotted by Gitlab CI.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wl@xen.org>
master commit: 1722da6c0c6f6b7b320bdd239c46c0cb1048f804
master date: 2019-08-14 12:04:20 +0100

x86: Don't increase ApicIdCoreSize past 7

Changeset ca2eee92df44 ("x86, hvm: Expose host core/HT topology to HVM
guests") attempted to "fake up" a topology which would induce guest
operating systems to not treat vcpus as sibling hyperthreads.  This
involved actually reporting hyperthreading as available, but giving
vcpus every other ApicId; which in turn led to doubling the ApicIds
per core by bumping the ApicIdCoreSize by one.  In particular, Ryzen
3xxx series processors, and reportedly EPYC "Rome" cpus -- have an
ApicIdCoreSize of 7; the "fake" topology increases this to 8.

Unfortunately, Windows running on modern AMD hardware -- including
Ryzen 3xxx series processors, and reportedly EPYC "Rome" cpus --
doesn't seem to cope with this value being higher than 7.  (Linux
guests have so far continued to cope.)

A "proper" fix is complicated and it's too late to fix it either for
4.13, or to backport to supported branches.  As a short-term fix,
limit this value to 7.

This does mean that a Linux guest, booted on such a system without
this change, and then migrating to a system with this change, with
more than 64 vcpus, would see an apparent topology change.  This is a
low enough risk in practice that enabling this limit unilaterally, to
allow other guests to boot without manual intervention, is worth it.

Reported-by: Steven Haigh <netwiz@crc.id.au>
Reported-by: Andreas Kinzler <hfp@posteo.de>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 8c79c129a6db2220c1089e0ce5fa49e7298b1d3e
master date: 2019-11-26 10:33:52 +0000

AMD/IOMMU: Cease using a dynamic height for the IOMMU pagetables

update_paging_mode() has multiple bugs:

1) Booting with iommu=debug will cause it to inform you that that it called
    without the pdev_list lock held.
2) When growing by more than a single level, it leaks the newly allocated
    table(s) in the case of a further error.

Furthermore, the choice of default level for a domain has issues:

1) All HVM guests grow from 2 to 3 levels during construction because of the
    position of the VRAM just below the 4G boundary, so defaulting to 2 is a
    waste of effort.
2) The limit for PV guests doesn't take memory hotplug into account, and
    isn't dynamic at runtime like HVM guests.  This means that a PV guest may
    get RAM which it can't map in the IOMMU.

The dynamic height is a property unique to AMD, and adds a substantial
quantity of complexity for what is a marginal performance improvement.  Remove
the complexity by removing the dynamic height.

PV guests now get 3 or 4 levels based on any hotplug regions in the host.
This only makes a difference for hardware which previously had all RAM below
the 512G boundary, and a hotplug region above.

HVM guests now get 4 levels (which will be sufficient until 256TB guests
become a thing), because we don't currently have the information to know when
3 would be safe to use.

The overhead of this extra level is not expected to be noticeable.  It costs
one page (4k) per domain, and one extra IO-TLB paging structure cache entry
which is very hot and less likely to be evicted.

This is XSA-311.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: b4f042236ae0bb6725b3e8dd40af5a2466a6f971
master date: 2019-12-11 14:55:32 +0100

x86/mm: relinquish_memory: Grab an extra type ref when setting PGT_partial

The PGT_partial bit in page->type_info holds both a type count and a
general ref count.  During domain tear-down, when free_page_type()
returns -ERESTART, relinquish_memory() correctly handles the general
ref count, but fails to grab an extra type count when setting
PGT_partial.  When this bit is eventually cleared, type_count underflows
and triggers the following BUG in page_alloc.c:free_domheap_pages():

    BUG_ON((pg[i].u.inuse.type_info & PGT_count_mask) != 0);

As far as we can tell, this page underflow cannot be exploited any any
other way: The page can't be used as a pagetable by the dying domain
because it's dying; it can't be used as a pagetable by any other
domain since it belongs to the dying domain; and ownership can't
transfer to any other domain without hitting the BUG_ON() in
free_domheap_pages().

(steal_page() won't work on a page in this state, since it requires
PGC_allocated to be set, and PGC_allocated will already have been
cleared.)

Fix this by grabbing an extra type ref if setting PGT_partial in
relinquish_memory.

This is part of XSA-310.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 66bdc16aeed8ddb2ae724adc5ea6bde0dea78c3d
master date: 2019-12-11 14:55:08 +0100

x86/mm: alloc/free_lN_table: Retain partial_flags on -EINTR

When validating or de-validating pages (in alloc_lN_table and
free_lN_table respectively), the `partial_flags` local variable is
used to keep track of whether the "current" PTE started the entire
operation in a "may be partial" state.

One of the patches in XSA-299 addressed the fact that it is possible
for a previously-partially-validated entry to subsequently be found to
have invalid entries (indicated by returning -EINVAL); in which case
page->partial_flags needs to be set to indicate that the current PTE
may have the partial bit set (and thus _put_page_type() should be
called with PTF_partial_set).

Unfortunately, the patches in XSA-299 assumed that once
put_page_from_lNe() returned -ERESTART on a page, it was not possible
for it to return -EINTR.  This turns out to be true for
alloc_lN_table() and free_lN_table, but not for _get_page_type() and
_put_page_type(): both can return -EINTR when called on pages with
PGT_partial set.  In these cases, the pages PGT_partial will still be
set; failing to set partial_flags appropriately may allow an attacker
to do a privilege escalation similar to those described in XSA-299.

Fix this by always copying the local partial_flags variable into
page->partial_flags when exiting early.

NB that on the "get" side, no adjustment to nr_validated_entries is
needed: whether pte[i] is partially validated or entirely
un-validated, we want nr_validated_entries = i.  On the "put" side,
however, we need to adjust nr_validated_entries appropriately: if
pte[i] is entirely validated, we want nr_validated_entries = i + 1; if
pte[i] is partially validated, we want nr_validated_entries = i.

This is part of XSA-310.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 4e70f4476c0c543559f971faecdd5f1300cddb0a
master date: 2019-12-11 14:54:43 +0100

x86/mm: Set old_guest_table when destroying vcpu pagetables

Changeset 6c4efc1eba ("x86/mm: Don't drop a type ref unless you held a
ref to begin with"), part of XSA-299, changed the calling discipline
of put_page_type() such that if put_page_type() returned -ERESTART
(indicating a partially de-validated page), subsequent calls to
put_page_type() must be called with PTF_partial_set.  If called on a
partially de-validated page but without PTF_partial_set, Xen will
BUG(), because to do otherwise would risk opening up the kind of
privilege escalation bug described in XSA-299.

One place this was missed was in vcpu_destroy_pagetables().
put_page_and_type_preemptible() is called, but on -ERESTART, the
entire operation is simply restarted, causing put_page_type() to be
called on a partially de-validated page without PTF_partial_set.  The
result was that if such an operation were interrupted, Xen would hit a
BUG().

Fix this by having vcpu_destroy_pagetables() consistently pass off
interrupted de-validations to put_old_page_type():
- Unconditionally clear references to the page, even if
  put_page_and_type failed
- Set old_guest_table and old_guest_table_partial appropriately

While here, do some refactoring:

- Move clearing of arch.cr3 to the top of the function

- Now that clearing is unconditional, move the unmap to the same
   conditional as the l4tab mapping.  This also allows us to reduce
   the scope of the l4tab variable.

- Avoid code duplication by looping to drop references on
   guest_table_user

This is part of XSA-310.

Reported-by: Sarah Newman <srn@prgmr.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: ececa12b2c4c8e4433e4f9be83f5c668ae36fe08
master date: 2019-12-11 14:54:13 +0100

x86/mm: Don't reset linear_pt_count on partial validation

"Linear pagetables" is a technique which involves either pointing a
pagetable at itself, or to another pagetable the same or higher level.
Xen has limited support for linear pagetables: A page may either point
to itself, or point to another page of the same level (i.e., L2 to L2,
L3 to L3, and so on).

XSA-240 introduced an additional restriction that limited the "depth"
of such chains by allowing pages to either *point to* other pages of
the same level, or *be pointed to* by other pages of the same level,
but not both. To implement this, we keep track of the number of
outstanding times a page points to or is pointed to another page
table, to prevent both from happening at the same time.

Unfortunately, the original commit introducing this reset this count
when resuming validation of a partially-validated pagetable, dropping
some "linear_pt_entry" counts.

On debug builds on systems where guests used this feature, this might
lead to crashes that look like this:

Assertion 'oc > 0' failed at mm.c:874

Worse, if an attacker could engineer such a situation to occur, they
might be able to make loops or other abitrary chains of linear
pagetables, leading to the denial-of-service situation outlined in
XSA-240.

This is XSA-309.

Reported-by: Manuel Bouyer <bouyer@antioche.eu.org>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 7473efd12fb7a6548f5303f1f4c5cb521543a813
master date: 2019-12-11 14:10:27 +0100

x86/vtx: Work around SingleStep + STI/MovSS VMEntry failures

See patch comment for technical details.

Concerning the timeline, this was first discovered in the aftermath of
XSA-156 which caused #DB to be intercepted unconditionally, but only in
its SingleStep + STI form which is restricted to privileged software.

After working with Intel and identifying the problematic vmentry check,
this workaround was suggested, and the patch was posted in an RFC
series. Outstanding work for that series (not breaking Introspection)
is still pending, and this fix from it (which wouldn't have been good
enough in its original form) wasn't committed.

A vmentry failure was reported to xen-devel, and debugging identified
this bug in its SingleStep + MovSS form by way of INT1, which does not
involve the use of any privileged instructions, and proving this to be a
security issue.

This is XSA-308

Reported-by: Håkon Alstadheim <hakon@alstadheim.priv.no>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
master commit: 1d3eb8259804e5bec991a3462d69ba6bd80bb40e
master date: 2019-12-11 14:09:30 +0100

x86+Arm32: make find_next_{,zero_}bit() have well defined behavior

These functions getting used with the 2nd and 3rd arguments being equal
wasn't well defined: Arm64 reliably returns the value of the 2nd
argument in this case, while on x86 for bitmaps up to 64 bits wide the
return value was undefined (due to the undefined behavior of a shift of
a value by the number of bits it's wide) when the incoming value was 64.
On Arm32 an actual out of bounds access would happen when the
size/offset value is a multiple of 32; if this access doesn't fault, the
return value would have been sufficiently correct afaict.

Make the functions consistently tolerate the last two arguments being
equal (and in fact the 3rd argument being greater or equal to the 2nd),
in favor of finding and fixing all the use sites that violate the
original more strict assumption.

This is XSA-307.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Julien Grall <julien@xen.org>
master commit: 7442006b9f0940fb36f1f8470a416ec836e0d2ce
master date: 2019-12-11 14:06:18 +0100

update Xen version to 4.11.4-pre

xen:arm: Populate arm64 image header

This patch adds image size and flags to XEN image header. It uses
those fields according to the updated Linux kernel image definition.

With this patch bootloader can now place XEN image anywhere in system
RAM at 2MB aligned address without to worry about relocation.
For instance, it fixes the XEN boot on Amlogic SoC where bootloader(U-BOOT)
always relocates the XEN image to an address range reserved for firmware data.

Signed-off-by: Amit Singh Tomar <amittomer25@gmail.com>
Reviewed-by: Andre Pryzwara <andre.przywara@arm.com>
Acked-by: Julien Grall <julien.grall@arm.com>
(cherry picked from commit 17bd254a508f4174fe0d56a9f1b9892b7649b4b9)

update Xen version to 4.11.3

IOMMU: default to always quarantining PCI devices

XSA-302 relies on the use of libxl's "assignable-add" feature to prepare
devices to be assigned to untrusted guests.

Unfortunately, this is not considered a strictly required step for
device assignment. The PCI passthrough documentation on the wiki
describes alternate ways of preparing devices for assignment, and
libvirt uses its own ways as well. Hosts where these alternate methods
are used will still leave the system in a vulnerable state after the
device comes back from a guest.

Default to always quarantining PCI devices, but provide a command line
option to revert back to prior behavior (such that people who both
sufficiently trust their guests and want to be able to use devices in
Dom0 again after they had been in use by a guest wouldn't need to
"manually" move such devices back from DomIO to Dom0).

This is XSA-306.

Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Wei Liu <wl@xen.org>
master commit: ba2ab00bbb8c74e311a252d816d68dee47c779a0
master date: 2019-11-26 14:15:01 +0100

x86/mm: Adjust linear uses / entries when a page loses validation

"Linear pagetables" is a technique which involves either pointing a
pagetable at itself, or to another pagetable the same or higher level.
Xen has limited support for linear pagetables: A page may either point
to itself, or point to another page of the same level (i.e., L2 to L2,
L3 to L3, and so on).

XSA-240 introduced an additional restriction that limited the "depth"
of such chains by allowing pages to either *point to* other pages of
the same level, or *be pointed to* by other pages of the same level,
but not both. To implement this, we keep track of the number of
outstanding times a page points to or is pointed to another page
table, to prevent both from happening at the same time.

Additionally, XSA-299 introduced a mode whereby if a page was known to
have been only partially validated, _put_page_type() would be called
with PTF_partial_set, indicating that if the page had been
de-validated by someone else, the type count should be left alone.

Unfortunately, this change did not account for the required accounting
for linear page table uses and entries; in the case that a previously
partially-devalidated pagetable was fully-devalidated by someone else,
the linear_pt_counts are not updated.

This could happen in one of two places:

1. In the case a partially-devalidated page was re-validated by
someone else

2. During domain tear-down, when pages are force-invalidated while
leaving the type count intact.

The second could be ignored, since at that point the pages can no
longer be abused; but the first requires handling. Note however that
this would not be a security issue: having the counts be too high is
overly strict (i.e., will prevent a page from being used in a way
which is perfectly safe), but shouldn't cause any other issues.

Fix this by adjusting the linear counts when a page loses validation,
regardless of whether the de-validation completed or was only partial.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 77beba7c921a286c31a2a76f26500047f353614a
master date: 2019-11-25 10:58:27 +0000

x86/vvmx: Fix livelock with XSA-304 fix

It turns out that the XSA-304 / CVE-2018-12207 fix of disabling executable
superpages doesn't work well with the nested p2m code.

Nested virt is experimental and not security supported, but is useful for
development purposes. In order to not regress the status quo, disable the
XSA-304 workaround until the nested p2m code can be improved.

Introduce a per-domain exec_sp control and set it based on the current
opt_ept_exec_sp setting. Take the oppotunity to omit a PVH hardware domain
from the performance hit, because it is already permitted to DoS the system in
such ways as issuing a reboot.

When nested virt is enabled on a domain, force it to using executable
superpages and rebuild the p2m.

Having the setting per-domain involves rearranging the internals of
parse_ept_param_runtime() but it still retains the same overall semantics -
for each applicable domain whose setting needs to change, rebuild the p2m.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
master commit: 183f354e1430087879de071f0c7122e42703916e
master date: 2019-11-23 14:06:24 +0000

x86/livepatch: Prevent patching with active waitqueues

The safety of livepatching depends on every stack having been unwound, but
there is one corner case where this is not true. The Sharing/Paging/Monitor
infrastructure may use waitqueues, which copy the stack frame sideways and
longjmp() to a different vcpu.

This case is rare, and can be worked around by pausing the offending
domain(s), waiting for their rings to drain, then performing a livepatch.

In the case that there is an active waitqueue, fail the livepatch attempt with
-EBUSY, which is preforable to the fireworks which occur from trying to unwind
the old stack frame at a later point.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
master commit: ca4cd3668237d50a0b33b48e7de7f93d9475120d
master date: 2019-11-22 17:05:43 +0000

x86/vlapic: allow setting APIC_SPIV_FOCUS_DISABLED in x2APIC mode

Current code unconditionally prevents setting APIC_SPIV_FOCUS_DISABLED
regardless of the processor model, which is not correct according to
the specification.

This issue was discovered while trying to boot a pvshim with x2APIC
enabled.

Always allow setting APIC_SPIV_FOCUS_DISABLED: the local APIC
provided to guests is emulated by Xen, and as such doesn't depend on
the features found on the hardware processor. Note for example that
Xen offers x2APIC support to guests even when the underlying hardware
doesn't have such feature.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: d7cd999faa1edf745a7597db811956cb882a5436
master date: 2019-11-22 17:52:59 +0100

xen: Add missing va_end() in hypercall_create_continuation()

The documentation requires va_start() to always be matched with a
corresponding va_end(). However, this is not the case in the path used
for bad format.

This was introduced by XSA-296.

Coverity-ID: 1488727
Fixes: 0bf9f8d3e3 ("xen/hypercall: Don't use BUG() for parameter checking in hypercall_create_continuation()")
Signed-off-by: Julien Grall <julien@xen.org>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Andrew Cooper <andrew.cooper3@citrix.com>
master commit: df7a19338a892b5cf585fd2bee8584cb15e0cace
master date: 2019-11-21 15:50:01 +0000

x86: fix race to build arch/x86/efi/relocs-dummy.o

With $(TARGET).efi depending on efi/relocs-dummy.o, arch/x86/Makefile
will attempt to build that object. This may result in a dependency file
being generated that has relocs-dummy.o depending on efi/relocs-dummy.S.

Then, when arch/x86/efi/Makefile tries to build relocs-dummy.o, well
efi/relocs-dummy.S doesn't exist.

Have only one makefile responsible for building relocs-dummy.o.

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
x86/Makefile: remove $(guard) use from $(TARGET).efi target

Following the patch 65d104984c04 ("x86: fix race to build
arch/x86/efi/relocs-dummy.o"), the error message
nm: 'efi/relocs-dummy.o': No such file"
started to appear on system which can't build the .efi target. This is
because relocs-dummy.o isn't built anymore.
The error is printed by the evaluation of VIRT_BASE and ALT_BASE which
aren't use anyway.

But, we don't need that file as we don't want to build `$(TARGET).efi'
anyway. On such system, $(guard) evaluate to the shell builtin ':',
which prevent any of the shell commands in `$(TARGET).efi' from been
executed.

Even if $(guard) is evaluated opon use, it depends on $(XEN_BUILD_PE)
which is evaluated at the assignment. So, we can replace $(guard) in
$(TARGET).efi by having two different rules depending on
$(XEN_BUILD_PE) instead.

The change with this patch is that none of the dependency of
$(TARGET).efi will be built if the linker doesn't support PE
and VIRT_BASE and ALT_BASE don't get evaluated anymore, so nm will not
complain about the missing relocs-dummy.o file anymore.

Since prelink-efi.o isn't built on system that can't build
$(TARGET).efi anymore, we can remove the $(guard) variable everywhere.

Reported-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 65d104984c04e69234f77bd3b8f8c0ef85b3f7fa
master date: 2019-11-15 14:18:16 +0100
master commit: 7059afb202ff0d82a6fa94f7ef84e4bb3139914e
master date: 2019-11-20 17:12:12 +0100

x86emul: 16-bit XBEGIN does not truncate rIP

SDM rev 071 points out this fact explicitly.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: a72c508656c0a0fa573890b290064e6035971f86
master date: 2019-11-15 14:15:31 +0100

AMD/IOMMU: don't needlessly trigger errors/crashes when unmapping a page

Unmapping a page which has never been mapped should be a no-op (note how
it already is in case there was no root page table allocated). There's
in particular no need to grow the number of page table levels in use,
and there's also no need to allocate intermediate page tables except
when needing to split a large page.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: ad591454f069647c36a7daaa9ec23384c0263f0b
master date: 2019-11-12 11:08:34 +0100

x86/ioapic: fix clear_IO_APIC_pin write of raw entries

clear_IO_APIC_pin can be called after the iommu has been enabled, and
using raw reads and writes to modify IO-APIC entries that have been
setup to use interrupt remapping can lead to issues as some of the
fields have different meaning when the IO-APIC entry is setup to point
to an interrupt remapping table entry.

The following ASSERT in AMD IOMMU code triggers afterwards as a result
of the raw changes to IO-APIC entries performed by clear_IO_APIC_pin.

(XEN) [   10.082154] ENABLING IO-APIC IRQs
(XEN) [   10.087789]  -> Using new ACK method
(XEN) [   10.093738] Assertion 'get_rte_index(rte) == offset' failed at iommu_intr.c:328

Fix this by making sure that modifications to entries are performed in
non raw mode when fields are affected which may either have changed
meaning with interrupt remapping, or which may need mirroring into
IRTEs.

Reported-by: Sergey Dyasli <sergey.dyasli@citrix.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: dedcb1087dfeae0bbd9eea465a57f25b13e40585
master date: 2019-11-12 11:07:40 +0100

x86/shim: copy back the result of EVTCHNOP_status

The event channel data was not copied back to guest memory, fix this
by doing the copy.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Wei Liu <wl@xen.org>
master commit: 0f45bbbc404e2d1257476f9caa6644c209ec2c90
master date: 2019-11-01 10:48:04 +0000

x86/vtx: Fixes to Haswell/Broadwell LBR TSX errata

Cross reference and list all errata, now that they are published.

These errata are specific to Haswell/Broadwell. They should have model and
vendor checks, as Intel isn't the only vendor to implement VT-x.

All affected models use the same MSR indicies, so these can be hard coded
rather than looking up and storing constant values.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: f51d4a19427674491eaecef85c551613450188c5
master date: 2019-10-29 19:27:40 +0000

x86/vtx: Corrections to BDF93 errata workaround

At the time of fixing c/s 20f1976b44, no obvious errata had been published,
and BDF14 looked like the most obvious candidate.  Subsequently, BDF93 has
been published and it is obviously this.

The erratum states that LER_TO_LIP is the only affected MSR.  The provisional
fix in Xen adjusted LER_FROM_LIP, but this is not correct.  The FROM MSRs are
intended to have TSX metadata, and for steppings with TSX enabled, it will
corrupt the value the guest sees, while for parts with TSX disabled, it is
redundant with FIXUP_TSX.  Drop the LER_FROM_LIP adjustment.

Replace BDF14 references with BDF93, drop the redundant 'bdw_erratum_' prefix,
and use an Intel vendor check, as other vendors implement VT-x.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 1a3b393129c1dcfec418f9b0ee92d126c2ae8141
master date: 2019-10-29 19:27:40 +0000

x86: fix off-by-one in is_xen_fixed_mfn()

__2M_rwdata_end marks the first byte after the Xen image, not its last
byte. Subtract 1 to obtain the upper bound to compare against. (Note
that instead switching from <= to < is less desirable, as in principle
__pa() might return rubbish for addresses outside of the Xen image.)

Since the & needs to be dropped from the line in question, also drop it
from the adjacent one.

Reported-by: Julien Grall <julien.grall@arm.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 9633929824204ca7a6d60d083466de79993d60f1
master date: 2019-10-25 10:38:58 +0200

x86/tsc: update vcpu time info on guest TSC adjustments

If a HVM/PVH guest writes to MSR_IA32_TSC{_ADJUST} and thus changes
the value of the time stamp counter the vcpu time info must also be
updated, or the time calculated by the guest using the Xen PV clock
interface will be skewed.

Update the vcpu time info when the guest writes to either MSR_IA32_TSC
or MSR_IA32_TSC_ADJUST. This fixes lockups seen when running the
pv-shim on AMD hardware, since the shim will aggressively try to keep
TSCs in sync by periodically writing to MSR_IA32_TSC if the TSC is not
reliable.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Wei Liu <wl@xen.org>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 7eee9c16d6405a1a1f2e8c6472923db842c90cfb
master date: 2019-10-23 17:01:56 +0100

x86/vvmx: Fix the use of RDTSCP when it is intercepted at L0

Linux has started using RDTSCP as of v5.1.  This has highlighted a bug in Xen,
where virtual vmexit simply gives up.

  (XEN) d1v1 Unhandled nested vmexit: reason 51
  (XEN) domain_crash called from vvmx.c:2671
  (XEN) Domain 1 (vcpu#1) crashed on cpu#2:

Handle RDTSCP in the virtual vmexit hander in the same was as RDTSC
intercepts.

Reported-by: Sarah Newman <srn@prgmr.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tested-by: Chris Brannon <cmb@prgmr.com>
Reviewed-by: Wei Liu <wl@xen.org>
master commit: 9257c218e56e9902b78662e5852d69329b9cc204
master date: 2019-10-23 16:43:48 +0100

x86/spec-ctrl: Mitigate the TSX Asynchronous Abort sidechannel

See patch documentation and comments.

This is part of XSA-305 / CVE-2019-11135

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/tsx: Introduce tsx= to use MSR_TSX_CTRL when available

To protect against the TSX Async Abort speculative vulnerability, Intel have
released new microcode for affected parts which introduce the MSR_TSX_CTRL
control, which allows TSX to be turned off.  This will be architectural on
future parts.

Introduce tsx= to provide a global on/off for TSX, including its enumeration
via CPUID.  Provide stub virtualisation of this MSR, as it is not exposed to
guests at the moment.

VMs may have booted before microcode is loaded, or before hosts have rebooted,
and they still want to migrate freely.  A VM which booted seeing TSX can
migrate safely to hosts with TSX disabled - TSX will start unconditionally
aborting, but still behave in a manner compatible with the ABI.

The guest-visible behaviour is equivalent to late loading the microcode and
setting the RTM_DISABLE bit in the course of live patching.

This is part of XSA-305 / CVE-2019-11135

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/vtx: Allow runtime modification of the exec-sp setting

See patch for details.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>

x86/vtx: Disable executable EPT superpages to work around CVE-2018-12207

CVE-2018-12207 covers a set of errata on various Intel processors, whereby a
machine check exception can be generated in a corner case when an executable
mapping changes size or cacheability without TLB invalidation.  HVM guest
kernels can trigger this to DoS the host.

To mitigate, in affected hardware, all EPT superpages are marked NX.  When an
instruction fetch violation is observed against the superpage, the superpage
is shattered to 4k and has execute permissions restored.  This prevents the
guest kernel from being able to create the necessary preconditions in the iTLB
to exploit the vulnerability.

This does come with a workload-dependent performance overhead, caused by
increased TLB pressure.  Performance can be restored, if guest kernels are
trusted not to mount an attack, by specifying ept=exec-sp on the command line.

This is part of XSA-304 / CVE-2018-12207

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/vtd: Hide superpage support for SandyBridge IOMMUs

Something causes SandyBridge IOMMUs to choke when sharing EPT pagetables, and
an EPT superpage gets shattered. The root cause is still under investigation,
but the end result is unusable in combination with CVE-2018-12207 protections.

This is part of XSA-304 / CVE-2018-12207

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

xen/arm64: Don't blindly unmask interrupts on trap without a change of level

Some of the traps without a change of the level (i.e. hypervisor ->
hypervisor) will unmask interrupts regardless the state of them in the
interrupted context.

One of the consequences is IRQ will be unmasked when receiving a
synchronous exception (used by WARN*()). This could result to unexpected
behavior such as deadlock (if a lock was shared with interrupts).

In a nutshell, interrupts should only be unmasked when it is safe to
do. Xen only unmask IRQ and Abort interrupts, so the logic can stay
simple:
    - hyp_error: All the interrupts are now kept masked. SError should
      be pretty rare and if ever happen then we most likely want to
      avoid any other interrupts to be generated. The potential main
      "caller" is during virtual SError synchronization on the exit
      path from the guest (see check_pending_vserror).

    - hyp_sync: The interrupts state is inherited from the interrupted
      context.

    - hyp_irq: All the interrupts but IRQ state are inherited from the
      interrupted context. IRQ is kept masked.

This is part of XSA-303.

Reported-by: Julien Grall <Julien.Grall@arm.com>
Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
master commit: 3ed885a8874003f6011460f4f46d1d130dd6b2db
master date: 2019-10-31 16:22:55 +0100

xen/arm32: Don't blindly unmask interrupts on trap without a change of level

Exception vectors will unmask interrupts regardless the state of them in
the interrupted context.

One of the consequences is IRQ will be unmasked when receiving an
undefined instruction exception (used by WARN*) from the hypervisor.
This could result to unexpected behavior such as deadlock (if a lock was
shared with interrupts).

In a nutshell, interrupts should only be unmasked when it is safe to do.
Xen only unmask IRQ and Abort interrupts, so the logic can stay simple.

As vectors exceptions may be shared between guest and hypervisor, we now
need to have a different policy for the interrupts.

On exception from hypervisor, each vector will select the list of
interrupts to inherit from the interrupted context. Any interrupts not
listed will be kept masked.

On exception from the guest, the Abort and IRQ will be unmasked
depending on the exact vector.

The interrupts will be kept unmasked when the vector cannot used by
either guest or hypervisor.

Note that each vector is not anymore preceded by ALIGN. This is fine
because the alignment is already bigger than what we need.

This is part of XSA-303.

Reported-by: Julien Grall <Julien.Grall@arm.com>
Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
master commit: 61b683571f0abd12395b1454cd055f2ad9bb3a37
master date: 2019-10-31 16:22:34 +0100