dgit.raspbian.org Git

.gitignore: Add configure output which we always delete and regenerate

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

autoconf: Provide libexec_libdir_suffix

This is going to be used to put libfsimage.so into a path containing
the multiarch triplet.

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

tools-libfsimage-prefix.diff

Patch-Name: tools-libfsimage-prefix.diff

Gbp-Pq: Topic prefix-abiname
Gbp-Pq: Name tools-libfsimage-prefix.diff

tools-libfsimage-abiname.diff

Patch-Name: tools-libfsimage-abiname.diff

Gbp-Pq: Topic prefix-abiname
Gbp-Pq: Name tools-libfsimage-abiname.diff

Do not build the instruction emulator

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

tools/tests/x86_emulator: Pass -no-pie -fno-pic to gcc on x86_32

The current build fails with GCC6 on Debian sid i386 (unstable):

/tmp/ccqjaueF.s: Assembler messages:
/tmp/ccqjaueF.s:3713: Error: missing or invalid displacement expression `vmovd_to_reg_len@GOT'

This is due to the combination of GCC6, and Debian's decision to
enable some hardening flags by default (to try to make runtime
addresses less predictable):
  https://wiki.debian.org/Hardening/PIEByDefaultTransition

This is of no benefit for the x86 instruction emulator test, which is
a rebuild of the emulator code for testing purposes only.  So pass
options to disable this.

These options will be no-ops if they are the same as the compiler
default.

On amd64, the -fno-pic breaks the build in a different way.  So do
this only on i386.

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>
CC: Jan Beulich <jbeulich@suse.com>
CC: Andrew Cooper <andrew.cooper3@citrix.com>
Gbp-Pq: Topic misc
Gbp-Pq: Name toolstestsx86_emulator-pass--no-pie--fno.patch

Remove static solaris support from pygrub

Patch-Name: tools-pygrub-remove-static-solaris-support

Gbp-Pq: Topic misc
Gbp-Pq: Name tools-pygrub-remove-static-solaris-support

tools-xenmon-install.diff

Patch-Name: tools-xenmon-install.diff

Gbp-Pq: Topic misc
Gbp-Pq: Name tools-xenmon-install.diff

Do not ship COPYING into /usr/include

This is not wanted in Debian. COPYING ends up in
/usr/share/doc/xen-*copyright.

Patch-Name: tools-include-no-COPYING.diff

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

config-prefix.diff

Patch-Name: config-prefix.diff

Gbp-Pq: Topic prefix-abiname
Gbp-Pq: Name config-prefix.diff

version

Patch-Name: version.diff

Gbp-Pq: Topic misc
Gbp-Pq: Name version.diff

tools/kdd: mute spurious gcc warning

gcc-8 complains:

    kdd.c:698:13: error: 'memcpy' offset [-204, -717] is out of the bounds [0, 216] of object 'ctrl' with type 'kdd_ctrl' {aka 'union <anonymous>'} [-Werror=array-bounds]
                 memcpy(buf, ((uint8_t *)&ctrl.c32) + offset, len);
                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    kdd.c: In function 'kdd_select_callback':
    kdd.c:642:14: note: 'ctrl' declared here
         kdd_ctrl ctrl;
                  ^~~~

But this is impossible - 'offset' is unsigned and correctly validated
few lines before.

Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Release-Acked-by: Juergen Gross <jgross@suse.com>
(cherry picked from commit 437e00fea04becc91c1b6bc1c0baa636b067a5cc)

libxl/arm: Fix build on arm64 + acpi w/ gcc 8.2

Add zero-padding to #defined ACPI table strings that are copied.
Provides sufficient characters to satisfy the length required to
fully populate the destination and prevent array-bounds warnings.
Add BUILD_BUG_ON sizeof checks for compile-time length checking.

Signed-off-by: Christopher Clark <christopher.clark6@baesystems.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Acked-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit b8f33431f3dd23fb43a879f4bdb4283fdc9465ad)

tools: Move ARRAY_SIZE() into xen-tools/libs.h

xen-tools/libs.h currently contains a shared BUILD_BUG_ON() implementation and
is used by some tools. Extend this to include ARRAY_SIZE and clean up all the
opencoding.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit e1b7eb92d3ec6ce3ca68cffb36a148eb59f59613)

xenpmd: make 32 bit gcc 8.1 non-debug build work

32 bit gcc 8.1 non-debug build yields:

xenpmd.c:354:23: error: '%02x' directive output may be truncated writing between 2 and 8 bytes into a region of size 3 [-Werror=format-truncation=]
     snprintf(val, 3, "%02x",
                       ^~~~
xenpmd.c:354:22: note: directive argument in the range [40, 2147483778]
     snprintf(val, 3, "%02x",
                      ^~~~~~
xenpmd.c:354:5: note: 'snprintf' output between 3 and 9 bytes into a destination of size 3
     snprintf(val, 3, "%02x",
     ^~~~~~~~~~~~~~~~~~~~~~~~
              (unsigned int)(9*4 +
              ~~~~~~~~~~~~~~~~~~~~
                             strlen(info->model_number) +
                             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                             strlen(info->serial_number) +
                             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                             strlen(info->battery_type) +
                             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                             strlen(info->oem_info) + 4));
                             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All info->* used in calculation are 32 bytes long, and the parsing
code makes sure they are null-terminated, so the end result of the
expression won't exceed 255, which should be able to be fit into 3
bytes in hexadecimal format.

Add an assertion to make gcc happy.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
(cherry picked from commit e75c9dc85fdeeeda0b98d8cd8d784e0508c3ffb8)

Delete configure output

These autogenerated files are not useful in Debian; dh_autoreconf will
regenerate them.

If this patch does not apply when rebasing, you can simply delete the
files again.

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

Delete config.sub and config.guess

dh_autoreconf will provide these back.

If this patch does not apply when rebasing, you can simply delete the
files again.

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

tools-xenstore-compatibility.diff

Patch-Name: tools-xenstore-compatibility.diff

Gbp-Pq: Topic xenstore
Gbp-Pq: Name tools-xenstore-compatibility.diff

tools-fake-xs-restrict

Gbp-Pq: Topic xenstore
Gbp-Pq: Name tools-fake-xs-restrict.patch

tools/debugger/kdd: Install as `xen-kdd', not just `kdd'

`kdd' is an unfortunate namespace landgrab.

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

xenmon: Install as xenmon, not xenmon.py

Adding the implementation language as a suffix to a program name is
poor practice.

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

pygrub fsimage.so: Honour LDFLAGS when building

This seems to have been simply omitted. Obviously this is needed when
building and not just when installing. Passing only when installing
is ineffective.

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

libfsimage: Honour general LDFLAGS

Do not reset LDFLAGS to empty. Instead, append the fsimage-special
LDFLAGS.

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

gdbsx: Honour LDFLAGS when linking

This command does the link, so it needs LDFLAGS.

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

tools/xenstat: Fix shared library version

libxenstat does not have a stable ABI. Set its version to the current
Xen release version.

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

docs/man/xen-pv-channel.pod.7: Remove a spurious blank line

No functional change.

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

docs/man: Provide properly-formatted NAME sections

A manpage `foo.7.pod' must start with

=head NAME

foo - some summary of what foo is or what this manpage is

because otherwise manpage catalogue systems cannot generate a proper
`whatis' entry.

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

INSTALL: Mention kconfig

Firstly, add a reference to the documentation for the kconfig system.

Secondly, warn the user about the XEN_CONFIG_EXPERT problem.

CC: Doug Goldstein <cardoe@cardoe.com>
CC: Wei Liu <wei.liu2@citrix.com>
CC: Jan Beulich <jbeulich@suse.com>
CC: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

tools/Rules.mk: Honour PREPEND_LDFLAGS_XEN_TOOLS

This allows the caller to provide some LDFLAGS to the Xen build
system.

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

tools/xentop : replace use of deprecated vwprintw

gcc-8.1 complains:

| xentop.c: In function 'print':
| xentop.c:304:4: error: 'vwprintw' is deprecated [-Werror=deprecated-declarations]
| vwprintw(stdscr, (curses_str_t)fmt, args);
| ^~~~~~~~

vw_printw (note the underscore) is a non-deprecated alternative.

Signed-off-by: Christopher Clark <christopher.clark6@baesystems.com>
Gbp-Pq: Topic misc
Gbp-Pq: Name tools-xentop-replace-use-of-deprecated-vwprintw.patch

Various: Fix typo `mappping'

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

Various: Fix typo `infomation'

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

tools/python/xen/lowlevel: Fix typo `sucess'

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

Various: Fix typo `reseting'

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

Various: Fix typo `occured'

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

Various: Fix typos `unkown', `retreive' (detected by lintian)

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

tools/xentrace/xenalyze: Fix typos detected by lintian

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

docs/man: Fix two typos detected by the Debian lintian tool

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

Update changelog for new upstream 4.11.3+24-g14b62ab3e5

[git-debrebase changelog: new upstream 4.11.3+24-g14b62ab3e5]

Update to upstream 4.11.3+24-g14b62ab3e5

[git-debrebase anchor: new upstream 4.11.3+24-g14b62ab3e5, merge]

Update to 4.11.1+92-g6c33308a8d-2 with MDS documentation

Following up feedback from the release team, add a NEWS file mentioning
the MDS mitigations with some instructions, so that it will be more
visible to people using apt-listchanges.

Mention the ucode option in our default documented set of "usually used
options", so that users doing a new install will get a hint about the
existence of this option, and what it does.

d/changelog: Update 4.11.1+92-g6c33308a8d-1

The +58 was never uploaded, so reuse the changelog entry.

Use urgency=high because this is a huge pile of security fixes.

lz4: fix system halt at boot kernel on x86_64

Sometimes, on x86_64, decompression fails with the following
error:

Decompressing Linux...

Decoding failed

-- System halted

This condition is not needed for a 64bit kernel(from commit d5e7caf):

if( ... ||
(op + COPYLENGTH) > oend)
goto _output_error

macro LZ4_SECURE_COPY() tests op and does not copy any data
when op exceeds the value.

added by analogy to lz4_uncompress_unknownoutputsize(...)

Signed-off-by: Krzysztof Kolasa <kkolasa@winsoft.pl>
[Linux commit 99b7e93c95c78952724a9783de6c78def8fbfc3f]

The offending commit in our case is fcc17f96c277 ("LZ4 : fix the data
abort issue").

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 5d90ff79542ab9c6eebe5c315c68c196bcf353b9
master date: 2019-12-09 14:02:35 +0100

lz4: refine commit 9143a6c55ef7 for the 64-bit case

I clearly went too far there: While the LZ4_WILDCOPY() instances indeed
need prior guarding, LZ4_SECURECOPY() needs this only in the 32-bit case
(where it simply aliases LZ4_WILDCOPY()). "cpy" can validly point
(slightly) below "op" in these cases, due to

cpy = op + length - (STEPSIZE - 4);

where length can be as low as 0 and STEPSIZE is 8. However, instead of
removing the check via "#if !LZ4_ARCH64", refine it such that it would
also properly work in the 64-bit case, aborting decompression instead
of continuing on bogus input.

Reported-by: Mark Pryor <pryorm09@gmail.com>
Reported-by: Jeremi Piotrowski <jeremi.piotrowski@gmail.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Mark Pryor <pryorm09@gmail.com>
Tested-by: Jeremi Piotrowski <jeremi.piotrowski@gmail.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 2d7572cdfa4d481c1ca246aa1ce5239ccae7eb59
master date: 2019-12-09 14:01:25 +0100

x86/tlbflush: do not toggle the PGE CR4 bit unless necessary

When PCID is not available Xen does a full tlbflush by toggling the
PGE bit in CR4. This is not necessary if PGE is not enabled, since a
flush can be performed by writing to CR3 in that case.

Change the code in do_tlb_flush to only toggle the PGE bit in CR4 if
it's already enabled, otherwise do the tlb flush by writing to CR3.
This is relevant when running virtualized, since hypervisors don't
usually trap accesses to CR3 when using hardware assisted paging, but
do trap accesses to CR4 specially on AMD hardware, which makes such
accesses much more expensive.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: b5087a31efee7a4e34c958b88671ac6669501b09
master date: 2019-12-03 14:15:35 +0100

x86: avoid HPET use on certain Intel platforms

Linux commit fc5db58539b49351e76f19817ed1102bf7c712d0 says

"Some Coffee Lake platforms have a skewed HPET timer once the SoCs entered
PC10, which in consequence marks TSC as unstable because HPET is used as
watchdog clocksource for TSC."

Follow this for Xen as well. Looking at its patch context made me notice
they have a pre-existing quirk for Bay Trail as well. The comment there,
however, points at a Cherry Trail document. Looking at the datasheets of
both, there appear to be similar issues, so go beyond Linux'es coverage
and exclude both. Also key the disable on the PCI IDs of the actual
affected devices, rather than those of 00:00.0.

Apply the workarounds only when the use of HPET was not explicitly
requested on the command line and when use of (deep) C-states was not
disabled.

Adjust a few types in touched or nearby code at the same time.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d5294a302c8441191d47888452958aea25243723
master date: 2019-12-03 14:14:44 +0100

gnttab: make sure grant map operations don't skip their IOMMU part

Two almost simultaneous mapping requests need to make sure that at the
completion of the earlier one IOMMU mappings (established explicitly
here in the PV case) have been put in place. Forever since the splitting
of the grant table lock a violation of this has been possible (using
simplified pin counts, as it doesn't matter whether we talk about read
or write mappings here):

initial state: act->pin = 0

vCPU A: progress the operation past the dropping of the locks after the
        act->pin updates (act->pin = 1, old_pin = 0, act_pin = 1)

vCPU B: progress the operation past the dropping of the locks after the
        act->pin updates (act->pin = 2, old_pin = 1, act_pin = 2)

vCPU B: (re-)acquire both gt locks, mapkind() returns 0, but both
        iommu_legacy_map() invocations get skipped due to non-zero
        old_pin

vCPU B: return to caller without IOMMU mapping

vCPU A: (re-)acquire both gt locks, mapkind() returns 0,
        iommu_legacy_map() gets invoked

With the locks dropped intermediately, whether to invoke
iommu_legacy_map() must depend on only the return value of mapkind()
and of course the kind of mapping request being processed, just like
is already the case in unmap_common().

Also fix the style of the adjacent comment, and correct a nearby one
still referring to a prior name of what is now mapkind().

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 921f1f42260c7967bf18f8a143d39511d163c421
master date: 2019-12-03 14:13:40 +0100

x86/psr: fix bug which may cause crash

During test, we found a crash on Xen with below trace.
(XEN) Xen call trace:
(XEN)    [<ffff82d0802a065a>] R psr.c#l3_cdp_write_msr+0x1e/0x22
(XEN)    [<ffff82d0802a0858>] F psr.c#do_write_psr_msrs+0x6d/0x109
(XEN)    [<ffff82d08023e000>] F smp_call_function_interrupt+0x5a/0xac
(XEN)    [<ffff82d0802a2b89>] F call_function_interrupt+0x20/0x34
(XEN)    [<ffff82d080282c64>] F do_IRQ+0x175/0x6ae
(XEN)    [<ffff82d08038b8ba>] F common_interrupt+0x10a/0x120
(XEN)    [<ffff82d0802ec616>] F cpu_idle.c#acpi_idle_do_entry+0x9d/0xb1
(XEN)    [<ffff82d0802ecc01>] F cpu_idle.c#acpi_processor_idle+0x41d/0x626
(XEN)    [<ffff82d08027353b>] F domain.c#idle_loop+0xa5/0xa7
(XEN)
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 20:
(XEN) GENERAL PROTECTION FAULT
(XEN) [error_code=0000]
(XEN) ****************************************

The bug happens when CDP and MBA co-exist and MBA COS_MAX is bigger
than CDP COS_MAX. E.g. MBA has 8 COS registers but CDP only have 6.
When setting MBA throttling value for the 7th guest, the value array
would be:
    +------------------+------------------+--------------+
    | Data default val | Code default val | MBA throttle |
    +------------------+------------------+--------------+

Then, COS id 7 will be selected for writting the values. We should
avoid writting CDP data/code valules to COS id 7 MSR because it
exceeds the CDP COS_MAX.

Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 42c8cdc039d6dc7d6aea8008bb24622eaf4b7bc8
master date: 2019-12-02 15:15:18 +0000

x86 / iommu: set up a scratch page in the quarantine domain

This patch introduces a new iommu_op to facilitate a per-implementation
quarantine set up, and then further code for x86 implementations
(amd and vtd) to set up a read-only scratch page to serve as the source
for DMA reads whilst a device is assigned to dom_io. DMA writes will
continue to fault as before.

The reason for doing this is that some hardware may continue to re-try
DMA (despite FLR) in the event of an error, or even BME being cleared, and
will fail to deal with DMA read faults gracefully. Having a scratch page
mapped will allow pending DMA reads to complete and thus such buggy
hardware will eventually be quiesced.

NOTE: These modifications are restricted to x86 implementations only as
      the buggy h/w I am aware of is only used with Xen in an x86
      environment. ARM may require similar code but, since I am not
      aware of the need, this patch does not modify any ARM implementation.

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: ea38867831da67eed0e9c61672c8941016b63dd9
master date: 2019-11-29 18:27:54 +0000

xen/x86: vpmu: Unmap per-vCPU PMU page when the domain is destroyed

A guest will setup a shared page with the hypervisor for each vCPU via
XENPMU_init. The page will then get mapped in the hypervisor and only
released when XENPMU_finish is called.

This means that if the guest fails to invoke XENPMU_finish, e.g if it is
destroyed rather than cleanly shut down, the page will stay mapped in the
hypervisor. One of the consequences is the domain can never be fully
destroyed as a page reference is still held.

As Xen should never rely on the guest to correctly clean-up any
allocation in the hypervisor, we should also unmap such pages during the
domain destruction if there are any left.

We can re-use the same logic as in pvpmu_finish(). To avoid
duplication, move the logic in a new function that can also be called
from vpmu_destroy().

NOTE: - The call to vpmu_destroy() must also be moved from
        arch_vcpu_destroy() into domain_relinquish_resources() such that
        the reference on the mapped page does not prevent domain_destroy()
        (which calls arch_vcpu_destroy()) from being called.
      - Whilst it appears that vpmu_arch_destroy() is idempotent it is
        by no means obvious. Hence make sure the VPMU_CONTEXT_ALLOCATED
        flag is cleared at the end of vpmu_arch_destroy().
      - This is not an XSA because vPMU is not security supported (see
        XSA-163).

Signed-off-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: be18e39d2f69038804b27c30026754deaeefa543
master date: 2019-11-29 18:23:24 +0000

x86/svm: Write the correct %eip into the outgoing task

The TASK_SWITCH vmexit has fault semantics, and doesn't provide any NRIPs
assistance with instruction length. As a result, any instruction-induced task
switch has the outgoing task's %eip pointing at the instruction switch caused
the switch, rather than after it.

This causes callers of task gates to livelock (repeatedly execute the call/jmp
to enter the task), and any restartable task to become a nop after its first
use (the (re)entry state points at the iret used to exit the task).

32bit Windows in particular is known to use task gates for NMI handling, and
to use NMI IPIs.

In the task switch handler, distinguish instruction-induced from
interrupt/exception-induced task switches, and decode the instruction under
%rip to calculate its length.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 1d758bc6d1a8c0f658a874470c349ee4e27aee46
master date: 2019-11-28 17:14:38 +0000

x86/svm: Always intercept ICEBP

ICEBP isn't handled well by SVM.

The VMexit state for a #DB-vectored TASK_SWITCH has %rip pointing to the
appropriate instruction boundary (fault or trap, as appropriate), except for
an ICEBP-induced #DB TASK_SWITCH, where %rip points at the ICEBP instruction
rather than after it.  As ICEBP isn't distinguished in the vectoring event
type, the state is ambiguous.

To add to the confusion, an ICEBP which occurs due to Introspection
intercepting the instruction, or from x86_emulate() will have %rip updated as
a consequence of partial emulation required to inject an ICEBP event in the
first place.

We could in principle spot the non-injected case in the TASK_SWITCH handler,
but this still results in complexity if the ICEBP instruction also has an
Instruction Breakpoint active on it (which genuinely has fault semantics).

Unconditionally intercept ICEBP.  This does have NRIPs support as it is an
instruction intercept, which allows us to move %rip forwards appropriately
before the TASK_SWITCH intercept is hit.  This makes #DB-vectored switches
have consistent behaviour however the ICEBP #DB came about, and avoids special
cases in the TASK_SWITCH intercept.

This in turn allows for the removal of the conditional
hvm_set_icebp_interception() logic used by the monitor subsystem, as ICEBP's
will now always be submitted for monitoring checks.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Alexandru Isaila <aisaila@bitdefender.com>
Reviewed-by: Petre Pircalabu <ppircalabu@bitdefender.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: e2585f8c2e0d43d350503ff2b2be252adc6b7239
master date: 2019-11-28 17:14:38 +0000

x86/vtx: Fix fault semantics for early task switch failures

The VT-x task switch handler adds inst_len to %rip before calling
hvm_task_switch(), which is problematic in two ways:

1) Early faults (i.e. ones delivered in the context of the old task) get
    delivered with trap semantics, and break restartibility.

2) The addition isn't truncated to 32 bits.  In the corner case of a task
    switch instruction crossing the 4G->0 boundary taking an early fault (with
    trap semantics), a VMEntry failure will occur due to %rip being out of
    range.

Instead, pass the instruction length into hvm_task_switch() and write it into
the outgoing TSS only, leaving %rip in its original location.

For now, pass 0 on the SVM side.  This highlights a separate preexisting bug
which will be addressed in the following patch.

While adjusting call sites, drop the unnecessary uint16_t cast.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 943c74bc0ee5044a826e428a3b2ffbdf9a43628d
master date: 2019-11-28 17:14:38 +0000

x86/vmx: always sync PIR to IRR before vmentry

When using posted interrupts on Intel hardware it's possible that the
vCPU resumes execution with a stale local APIC IRR register because
depending on the interrupts to be injected vlapic_has_pending_irq
might not be called, and thus PIR won't be synced into IRR.

Fix this by making sure PIR is always synced to IRR in
hvm_vcpu_has_pending_irq regardless of what interrupts are pending.

Reported-by: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Tested-by: Joe Jin <joe.jin@oracle.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 56348df32bbc782e63b6e3fb978b80e015ae76e7
master date: 2019-11-28 11:58:25 +0100

x86/domctl: have XEN_DOMCTL_getpageframeinfo3 preemptible

This hypercall can take a long time to finish because it attempts to
grab the `hostp2m' lock up to 1024 times. The accumulated wait for the
lock can take several seconds.

This can easily happen with a guest with 32 vcpus and plenty of RAM,
during localhost migration.

While the patch doesn't fix the problem with the lock contention and
the fact that the `hostp2m' lock is currently global (and not on a
single page), it is still an improvement to the hypercall. It will in
particular, down the road, allow dropping the arbitrary limit of 1024
entries per request.

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 48599114d3ca24157c25f6684bb9322f6dca12bb
master date: 2019-11-26 14:16:09 +0100

x86/tss: Fix clang build following c/s 7888440625

Clang-3.5 from Debian Jessie fails with:

  smpboot.c:829:29: error: statement expression not allowed at file scope
          BUILD_BUG_ON(sizeof(this_cpu(tss_page)) != PAGE_SIZE);
                              ^
  /local/xen.git/xen/include/asm/percpu.h:14:7: note: expanded from macro
          'this_cpu'
      (*RELOC_HIDE(&per_cpu__##var, get_cpu_info()->per_cpu_offset))
        ^
  /local/xen.git/xen/include/xen/compiler.h:98:3: note: expanded from macro
          'RELOC_HIDE'
    ({ unsigned long __ptr;                       \
    ^
  /local/xen.git/xen/include/xen/lib.h:26:53: note: expanded from macro
          'BUILD_BUG_ON'
  #define BUILD_BUG_ON(cond) ((void)BUILD_BUG_ON_ZERO(cond))
                                                      ^
  /local/xen.git/xen/include/xen/lib.h:25:57: note: expanded from macro
          'BUILD_BUG_ON_ZERO'
  #define BUILD_BUG_ON_ZERO(cond) sizeof(struct { int:-!!(cond); })
                                                          ^
  1 error generated.
  /local/xen.git/xen/Rules.mk:202: recipe for target 'smpboot.o' failed

This is obviously a compiler bug because the BUILD_BUG_ON() is not at file
scope.  However, it can be worked around by using a local variable.

Spotted by Gitlab CI.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wl@xen.org>
master commit: 1722da6c0c6f6b7b320bdd239c46c0cb1048f804
master date: 2019-08-14 12:04:20 +0100

x86: Don't increase ApicIdCoreSize past 7

Changeset ca2eee92df44 ("x86, hvm: Expose host core/HT topology to HVM
guests") attempted to "fake up" a topology which would induce guest
operating systems to not treat vcpus as sibling hyperthreads.  This
involved actually reporting hyperthreading as available, but giving
vcpus every other ApicId; which in turn led to doubling the ApicIds
per core by bumping the ApicIdCoreSize by one.  In particular, Ryzen
3xxx series processors, and reportedly EPYC "Rome" cpus -- have an
ApicIdCoreSize of 7; the "fake" topology increases this to 8.

Unfortunately, Windows running on modern AMD hardware -- including
Ryzen 3xxx series processors, and reportedly EPYC "Rome" cpus --
doesn't seem to cope with this value being higher than 7.  (Linux
guests have so far continued to cope.)

A "proper" fix is complicated and it's too late to fix it either for
4.13, or to backport to supported branches.  As a short-term fix,
limit this value to 7.

This does mean that a Linux guest, booted on such a system without
this change, and then migrating to a system with this change, with
more than 64 vcpus, would see an apparent topology change.  This is a
low enough risk in practice that enabling this limit unilaterally, to
allow other guests to boot without manual intervention, is worth it.

Reported-by: Steven Haigh <netwiz@crc.id.au>
Reported-by: Andreas Kinzler <hfp@posteo.de>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 8c79c129a6db2220c1089e0ce5fa49e7298b1d3e
master date: 2019-11-26 10:33:52 +0000

AMD/IOMMU: Cease using a dynamic height for the IOMMU pagetables

update_paging_mode() has multiple bugs:

1) Booting with iommu=debug will cause it to inform you that that it called
    without the pdev_list lock held.
2) When growing by more than a single level, it leaks the newly allocated
    table(s) in the case of a further error.

Furthermore, the choice of default level for a domain has issues:

1) All HVM guests grow from 2 to 3 levels during construction because of the
    position of the VRAM just below the 4G boundary, so defaulting to 2 is a
    waste of effort.
2) The limit for PV guests doesn't take memory hotplug into account, and
    isn't dynamic at runtime like HVM guests.  This means that a PV guest may
    get RAM which it can't map in the IOMMU.

The dynamic height is a property unique to AMD, and adds a substantial
quantity of complexity for what is a marginal performance improvement.  Remove
the complexity by removing the dynamic height.

PV guests now get 3 or 4 levels based on any hotplug regions in the host.
This only makes a difference for hardware which previously had all RAM below
the 512G boundary, and a hotplug region above.

HVM guests now get 4 levels (which will be sufficient until 256TB guests
become a thing), because we don't currently have the information to know when
3 would be safe to use.

The overhead of this extra level is not expected to be noticeable.  It costs
one page (4k) per domain, and one extra IO-TLB paging structure cache entry
which is very hot and less likely to be evicted.

This is XSA-311.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: b4f042236ae0bb6725b3e8dd40af5a2466a6f971
master date: 2019-12-11 14:55:32 +0100

x86/mm: relinquish_memory: Grab an extra type ref when setting PGT_partial

The PGT_partial bit in page->type_info holds both a type count and a
general ref count.  During domain tear-down, when free_page_type()
returns -ERESTART, relinquish_memory() correctly handles the general
ref count, but fails to grab an extra type count when setting
PGT_partial.  When this bit is eventually cleared, type_count underflows
and triggers the following BUG in page_alloc.c:free_domheap_pages():

    BUG_ON((pg[i].u.inuse.type_info & PGT_count_mask) != 0);

As far as we can tell, this page underflow cannot be exploited any any
other way: The page can't be used as a pagetable by the dying domain
because it's dying; it can't be used as a pagetable by any other
domain since it belongs to the dying domain; and ownership can't
transfer to any other domain without hitting the BUG_ON() in
free_domheap_pages().

(steal_page() won't work on a page in this state, since it requires
PGC_allocated to be set, and PGC_allocated will already have been
cleared.)

Fix this by grabbing an extra type ref if setting PGT_partial in
relinquish_memory.

This is part of XSA-310.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 66bdc16aeed8ddb2ae724adc5ea6bde0dea78c3d
master date: 2019-12-11 14:55:08 +0100

x86/mm: alloc/free_lN_table: Retain partial_flags on -EINTR

When validating or de-validating pages (in alloc_lN_table and
free_lN_table respectively), the `partial_flags` local variable is
used to keep track of whether the "current" PTE started the entire
operation in a "may be partial" state.

One of the patches in XSA-299 addressed the fact that it is possible
for a previously-partially-validated entry to subsequently be found to
have invalid entries (indicated by returning -EINVAL); in which case
page->partial_flags needs to be set to indicate that the current PTE
may have the partial bit set (and thus _put_page_type() should be
called with PTF_partial_set).

Unfortunately, the patches in XSA-299 assumed that once
put_page_from_lNe() returned -ERESTART on a page, it was not possible
for it to return -EINTR.  This turns out to be true for
alloc_lN_table() and free_lN_table, but not for _get_page_type() and
_put_page_type(): both can return -EINTR when called on pages with
PGT_partial set.  In these cases, the pages PGT_partial will still be
set; failing to set partial_flags appropriately may allow an attacker
to do a privilege escalation similar to those described in XSA-299.

Fix this by always copying the local partial_flags variable into
page->partial_flags when exiting early.

NB that on the "get" side, no adjustment to nr_validated_entries is
needed: whether pte[i] is partially validated or entirely
un-validated, we want nr_validated_entries = i.  On the "put" side,
however, we need to adjust nr_validated_entries appropriately: if
pte[i] is entirely validated, we want nr_validated_entries = i + 1; if
pte[i] is partially validated, we want nr_validated_entries = i.

This is part of XSA-310.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 4e70f4476c0c543559f971faecdd5f1300cddb0a
master date: 2019-12-11 14:54:43 +0100

x86/mm: Set old_guest_table when destroying vcpu pagetables

Changeset 6c4efc1eba ("x86/mm: Don't drop a type ref unless you held a
ref to begin with"), part of XSA-299, changed the calling discipline
of put_page_type() such that if put_page_type() returned -ERESTART
(indicating a partially de-validated page), subsequent calls to
put_page_type() must be called with PTF_partial_set.  If called on a
partially de-validated page but without PTF_partial_set, Xen will
BUG(), because to do otherwise would risk opening up the kind of
privilege escalation bug described in XSA-299.

One place this was missed was in vcpu_destroy_pagetables().
put_page_and_type_preemptible() is called, but on -ERESTART, the
entire operation is simply restarted, causing put_page_type() to be
called on a partially de-validated page without PTF_partial_set.  The
result was that if such an operation were interrupted, Xen would hit a
BUG().

Fix this by having vcpu_destroy_pagetables() consistently pass off
interrupted de-validations to put_old_page_type():
- Unconditionally clear references to the page, even if
  put_page_and_type failed
- Set old_guest_table and old_guest_table_partial appropriately

While here, do some refactoring:

- Move clearing of arch.cr3 to the top of the function

- Now that clearing is unconditional, move the unmap to the same
   conditional as the l4tab mapping.  This also allows us to reduce
   the scope of the l4tab variable.

- Avoid code duplication by looping to drop references on
   guest_table_user

This is part of XSA-310.

Reported-by: Sarah Newman <srn@prgmr.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: ececa12b2c4c8e4433e4f9be83f5c668ae36fe08
master date: 2019-12-11 14:54:13 +0100

x86/mm: Don't reset linear_pt_count on partial validation

"Linear pagetables" is a technique which involves either pointing a
pagetable at itself, or to another pagetable the same or higher level.
Xen has limited support for linear pagetables: A page may either point
to itself, or point to another page of the same level (i.e., L2 to L2,
L3 to L3, and so on).

XSA-240 introduced an additional restriction that limited the "depth"
of such chains by allowing pages to either *point to* other pages of
the same level, or *be pointed to* by other pages of the same level,
but not both. To implement this, we keep track of the number of
outstanding times a page points to or is pointed to another page
table, to prevent both from happening at the same time.

Unfortunately, the original commit introducing this reset this count
when resuming validation of a partially-validated pagetable, dropping
some "linear_pt_entry" counts.

On debug builds on systems where guests used this feature, this might
lead to crashes that look like this:

Assertion 'oc > 0' failed at mm.c:874

Worse, if an attacker could engineer such a situation to occur, they
might be able to make loops or other abitrary chains of linear
pagetables, leading to the denial-of-service situation outlined in
XSA-240.

This is XSA-309.

Reported-by: Manuel Bouyer <bouyer@antioche.eu.org>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 7473efd12fb7a6548f5303f1f4c5cb521543a813
master date: 2019-12-11 14:10:27 +0100

x86/vtx: Work around SingleStep + STI/MovSS VMEntry failures

See patch comment for technical details.

Concerning the timeline, this was first discovered in the aftermath of
XSA-156 which caused #DB to be intercepted unconditionally, but only in
its SingleStep + STI form which is restricted to privileged software.

After working with Intel and identifying the problematic vmentry check,
this workaround was suggested, and the patch was posted in an RFC
series. Outstanding work for that series (not breaking Introspection)
is still pending, and this fix from it (which wouldn't have been good
enough in its original form) wasn't committed.

A vmentry failure was reported to xen-devel, and debugging identified
this bug in its SingleStep + MovSS form by way of INT1, which does not
involve the use of any privileged instructions, and proving this to be a
security issue.

This is XSA-308

Reported-by: Håkon Alstadheim <hakon@alstadheim.priv.no>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
master commit: 1d3eb8259804e5bec991a3462d69ba6bd80bb40e
master date: 2019-12-11 14:09:30 +0100

x86+Arm32: make find_next_{,zero_}bit() have well defined behavior

These functions getting used with the 2nd and 3rd arguments being equal
wasn't well defined: Arm64 reliably returns the value of the 2nd
argument in this case, while on x86 for bitmaps up to 64 bits wide the
return value was undefined (due to the undefined behavior of a shift of
a value by the number of bits it's wide) when the incoming value was 64.
On Arm32 an actual out of bounds access would happen when the
size/offset value is a multiple of 32; if this access doesn't fault, the
return value would have been sufficiently correct afaict.

Make the functions consistently tolerate the last two arguments being
equal (and in fact the 3rd argument being greater or equal to the 2nd),
in favor of finding and fixing all the use sites that violate the
original more strict assumption.

This is XSA-307.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Julien Grall <julien@xen.org>
master commit: 7442006b9f0940fb36f1f8470a416ec836e0d2ce
master date: 2019-12-11 14:06:18 +0100

update Xen version to 4.11.4-pre

xen:arm: Populate arm64 image header

This patch adds image size and flags to XEN image header. It uses
those fields according to the updated Linux kernel image definition.

With this patch bootloader can now place XEN image anywhere in system
RAM at 2MB aligned address without to worry about relocation.
For instance, it fixes the XEN boot on Amlogic SoC where bootloader(U-BOOT)
always relocates the XEN image to an address range reserved for firmware data.

Signed-off-by: Amit Singh Tomar <amittomer25@gmail.com>
Reviewed-by: Andre Pryzwara <andre.przywara@arm.com>
Acked-by: Julien Grall <julien.grall@arm.com>
(cherry picked from commit 17bd254a508f4174fe0d56a9f1b9892b7649b4b9)

update Xen version to 4.11.3

IOMMU: default to always quarantining PCI devices

XSA-302 relies on the use of libxl's "assignable-add" feature to prepare
devices to be assigned to untrusted guests.

Unfortunately, this is not considered a strictly required step for
device assignment. The PCI passthrough documentation on the wiki
describes alternate ways of preparing devices for assignment, and
libvirt uses its own ways as well. Hosts where these alternate methods
are used will still leave the system in a vulnerable state after the
device comes back from a guest.

Default to always quarantining PCI devices, but provide a command line
option to revert back to prior behavior (such that people who both
sufficiently trust their guests and want to be able to use devices in
Dom0 again after they had been in use by a guest wouldn't need to
"manually" move such devices back from DomIO to Dom0).

This is XSA-306.

Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Wei Liu <wl@xen.org>
master commit: ba2ab00bbb8c74e311a252d816d68dee47c779a0
master date: 2019-11-26 14:15:01 +0100

x86/mm: Adjust linear uses / entries when a page loses validation

"Linear pagetables" is a technique which involves either pointing a
pagetable at itself, or to another pagetable the same or higher level.
Xen has limited support for linear pagetables: A page may either point
to itself, or point to another page of the same level (i.e., L2 to L2,
L3 to L3, and so on).

XSA-240 introduced an additional restriction that limited the "depth"
of such chains by allowing pages to either *point to* other pages of
the same level, or *be pointed to* by other pages of the same level,
but not both. To implement this, we keep track of the number of
outstanding times a page points to or is pointed to another page
table, to prevent both from happening at the same time.

Additionally, XSA-299 introduced a mode whereby if a page was known to
have been only partially validated, _put_page_type() would be called
with PTF_partial_set, indicating that if the page had been
de-validated by someone else, the type count should be left alone.

Unfortunately, this change did not account for the required accounting
for linear page table uses and entries; in the case that a previously
partially-devalidated pagetable was fully-devalidated by someone else,
the linear_pt_counts are not updated.

This could happen in one of two places:

1. In the case a partially-devalidated page was re-validated by
someone else

2. During domain tear-down, when pages are force-invalidated while
leaving the type count intact.

The second could be ignored, since at that point the pages can no
longer be abused; but the first requires handling. Note however that
this would not be a security issue: having the counts be too high is
overly strict (i.e., will prevent a page from being used in a way
which is perfectly safe), but shouldn't cause any other issues.

Fix this by adjusting the linear counts when a page loses validation,
regardless of whether the de-validation completed or was only partial.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 77beba7c921a286c31a2a76f26500047f353614a
master date: 2019-11-25 10:58:27 +0000

x86/vvmx: Fix livelock with XSA-304 fix

It turns out that the XSA-304 / CVE-2018-12207 fix of disabling executable
superpages doesn't work well with the nested p2m code.

Nested virt is experimental and not security supported, but is useful for
development purposes. In order to not regress the status quo, disable the
XSA-304 workaround until the nested p2m code can be improved.

Introduce a per-domain exec_sp control and set it based on the current
opt_ept_exec_sp setting. Take the oppotunity to omit a PVH hardware domain
from the performance hit, because it is already permitted to DoS the system in
such ways as issuing a reboot.

When nested virt is enabled on a domain, force it to using executable
superpages and rebuild the p2m.

Having the setting per-domain involves rearranging the internals of
parse_ept_param_runtime() but it still retains the same overall semantics -
for each applicable domain whose setting needs to change, rebuild the p2m.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
master commit: 183f354e1430087879de071f0c7122e42703916e
master date: 2019-11-23 14:06:24 +0000

x86/livepatch: Prevent patching with active waitqueues

The safety of livepatching depends on every stack having been unwound, but
there is one corner case where this is not true. The Sharing/Paging/Monitor
infrastructure may use waitqueues, which copy the stack frame sideways and
longjmp() to a different vcpu.

This case is rare, and can be worked around by pausing the offending
domain(s), waiting for their rings to drain, then performing a livepatch.

In the case that there is an active waitqueue, fail the livepatch attempt with
-EBUSY, which is preforable to the fireworks which occur from trying to unwind
the old stack frame at a later point.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
master commit: ca4cd3668237d50a0b33b48e7de7f93d9475120d
master date: 2019-11-22 17:05:43 +0000

x86/vlapic: allow setting APIC_SPIV_FOCUS_DISABLED in x2APIC mode

Current code unconditionally prevents setting APIC_SPIV_FOCUS_DISABLED
regardless of the processor model, which is not correct according to
the specification.

This issue was discovered while trying to boot a pvshim with x2APIC
enabled.

Always allow setting APIC_SPIV_FOCUS_DISABLED: the local APIC
provided to guests is emulated by Xen, and as such doesn't depend on
the features found on the hardware processor. Note for example that
Xen offers x2APIC support to guests even when the underlying hardware
doesn't have such feature.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: d7cd999faa1edf745a7597db811956cb882a5436
master date: 2019-11-22 17:52:59 +0100

xen: Add missing va_end() in hypercall_create_continuation()

The documentation requires va_start() to always be matched with a
corresponding va_end(). However, this is not the case in the path used
for bad format.

This was introduced by XSA-296.

Coverity-ID: 1488727
Fixes: 0bf9f8d3e3 ("xen/hypercall: Don't use BUG() for parameter checking in hypercall_create_continuation()")
Signed-off-by: Julien Grall <julien@xen.org>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Andrew Cooper <andrew.cooper3@citrix.com>
master commit: df7a19338a892b5cf585fd2bee8584cb15e0cace
master date: 2019-11-21 15:50:01 +0000

x86: fix race to build arch/x86/efi/relocs-dummy.o

With $(TARGET).efi depending on efi/relocs-dummy.o, arch/x86/Makefile
will attempt to build that object. This may result in a dependency file
being generated that has relocs-dummy.o depending on efi/relocs-dummy.S.

Then, when arch/x86/efi/Makefile tries to build relocs-dummy.o, well
efi/relocs-dummy.S doesn't exist.

Have only one makefile responsible for building relocs-dummy.o.

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
x86/Makefile: remove $(guard) use from $(TARGET).efi target

Following the patch 65d104984c04 ("x86: fix race to build
arch/x86/efi/relocs-dummy.o"), the error message
nm: 'efi/relocs-dummy.o': No such file"
started to appear on system which can't build the .efi target. This is
because relocs-dummy.o isn't built anymore.
The error is printed by the evaluation of VIRT_BASE and ALT_BASE which
aren't use anyway.

But, we don't need that file as we don't want to build `$(TARGET).efi'
anyway. On such system, $(guard) evaluate to the shell builtin ':',
which prevent any of the shell commands in `$(TARGET).efi' from been
executed.

Even if $(guard) is evaluated opon use, it depends on $(XEN_BUILD_PE)
which is evaluated at the assignment. So, we can replace $(guard) in
$(TARGET).efi by having two different rules depending on
$(XEN_BUILD_PE) instead.

The change with this patch is that none of the dependency of
$(TARGET).efi will be built if the linker doesn't support PE
and VIRT_BASE and ALT_BASE don't get evaluated anymore, so nm will not
complain about the missing relocs-dummy.o file anymore.

Since prelink-efi.o isn't built on system that can't build
$(TARGET).efi anymore, we can remove the $(guard) variable everywhere.

Reported-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 65d104984c04e69234f77bd3b8f8c0ef85b3f7fa
master date: 2019-11-15 14:18:16 +0100
master commit: 7059afb202ff0d82a6fa94f7ef84e4bb3139914e
master date: 2019-11-20 17:12:12 +0100

x86emul: 16-bit XBEGIN does not truncate rIP

SDM rev 071 points out this fact explicitly.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: a72c508656c0a0fa573890b290064e6035971f86
master date: 2019-11-15 14:15:31 +0100

AMD/IOMMU: don't needlessly trigger errors/crashes when unmapping a page

Unmapping a page which has never been mapped should be a no-op (note how
it already is in case there was no root page table allocated). There's
in particular no need to grow the number of page table levels in use,
and there's also no need to allocate intermediate page tables except
when needing to split a large page.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: ad591454f069647c36a7daaa9ec23384c0263f0b
master date: 2019-11-12 11:08:34 +0100

x86/ioapic: fix clear_IO_APIC_pin write of raw entries

clear_IO_APIC_pin can be called after the iommu has been enabled, and
using raw reads and writes to modify IO-APIC entries that have been
setup to use interrupt remapping can lead to issues as some of the
fields have different meaning when the IO-APIC entry is setup to point
to an interrupt remapping table entry.

The following ASSERT in AMD IOMMU code triggers afterwards as a result
of the raw changes to IO-APIC entries performed by clear_IO_APIC_pin.

(XEN) [   10.082154] ENABLING IO-APIC IRQs
(XEN) [   10.087789]  -> Using new ACK method
(XEN) [   10.093738] Assertion 'get_rte_index(rte) == offset' failed at iommu_intr.c:328

Fix this by making sure that modifications to entries are performed in
non raw mode when fields are affected which may either have changed
meaning with interrupt remapping, or which may need mirroring into
IRTEs.

Reported-by: Sergey Dyasli <sergey.dyasli@citrix.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: dedcb1087dfeae0bbd9eea465a57f25b13e40585
master date: 2019-11-12 11:07:40 +0100

x86/shim: copy back the result of EVTCHNOP_status

The event channel data was not copied back to guest memory, fix this
by doing the copy.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Wei Liu <wl@xen.org>
master commit: 0f45bbbc404e2d1257476f9caa6644c209ec2c90
master date: 2019-11-01 10:48:04 +0000

x86/vtx: Fixes to Haswell/Broadwell LBR TSX errata

Cross reference and list all errata, now that they are published.

These errata are specific to Haswell/Broadwell. They should have model and
vendor checks, as Intel isn't the only vendor to implement VT-x.

All affected models use the same MSR indicies, so these can be hard coded
rather than looking up and storing constant values.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: f51d4a19427674491eaecef85c551613450188c5
master date: 2019-10-29 19:27:40 +0000

x86/vtx: Corrections to BDF93 errata workaround

At the time of fixing c/s 20f1976b44, no obvious errata had been published,
and BDF14 looked like the most obvious candidate.  Subsequently, BDF93 has
been published and it is obviously this.

The erratum states that LER_TO_LIP is the only affected MSR.  The provisional
fix in Xen adjusted LER_FROM_LIP, but this is not correct.  The FROM MSRs are
intended to have TSX metadata, and for steppings with TSX enabled, it will
corrupt the value the guest sees, while for parts with TSX disabled, it is
redundant with FIXUP_TSX.  Drop the LER_FROM_LIP adjustment.

Replace BDF14 references with BDF93, drop the redundant 'bdw_erratum_' prefix,
and use an Intel vendor check, as other vendors implement VT-x.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 1a3b393129c1dcfec418f9b0ee92d126c2ae8141
master date: 2019-10-29 19:27:40 +0000

x86: fix off-by-one in is_xen_fixed_mfn()

__2M_rwdata_end marks the first byte after the Xen image, not its last
byte. Subtract 1 to obtain the upper bound to compare against. (Note
that instead switching from <= to < is less desirable, as in principle
__pa() might return rubbish for addresses outside of the Xen image.)

Since the & needs to be dropped from the line in question, also drop it
from the adjacent one.

Reported-by: Julien Grall <julien.grall@arm.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 9633929824204ca7a6d60d083466de79993d60f1
master date: 2019-10-25 10:38:58 +0200

x86/tsc: update vcpu time info on guest TSC adjustments

If a HVM/PVH guest writes to MSR_IA32_TSC{_ADJUST} and thus changes
the value of the time stamp counter the vcpu time info must also be
updated, or the time calculated by the guest using the Xen PV clock
interface will be skewed.

Update the vcpu time info when the guest writes to either MSR_IA32_TSC
or MSR_IA32_TSC_ADJUST. This fixes lockups seen when running the
pv-shim on AMD hardware, since the shim will aggressively try to keep
TSCs in sync by periodically writing to MSR_IA32_TSC if the TSC is not
reliable.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Wei Liu <wl@xen.org>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 7eee9c16d6405a1a1f2e8c6472923db842c90cfb
master date: 2019-10-23 17:01:56 +0100

x86/vvmx: Fix the use of RDTSCP when it is intercepted at L0

Linux has started using RDTSCP as of v5.1.  This has highlighted a bug in Xen,
where virtual vmexit simply gives up.

  (XEN) d1v1 Unhandled nested vmexit: reason 51
  (XEN) domain_crash called from vvmx.c:2671
  (XEN) Domain 1 (vcpu#1) crashed on cpu#2:

Handle RDTSCP in the virtual vmexit hander in the same was as RDTSC
intercepts.

Reported-by: Sarah Newman <srn@prgmr.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tested-by: Chris Brannon <cmb@prgmr.com>
Reviewed-by: Wei Liu <wl@xen.org>
master commit: 9257c218e56e9902b78662e5852d69329b9cc204
master date: 2019-10-23 16:43:48 +0100

x86/spec-ctrl: Mitigate the TSX Asynchronous Abort sidechannel

See patch documentation and comments.

This is part of XSA-305 / CVE-2019-11135

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/tsx: Introduce tsx= to use MSR_TSX_CTRL when available

To protect against the TSX Async Abort speculative vulnerability, Intel have
released new microcode for affected parts which introduce the MSR_TSX_CTRL
control, which allows TSX to be turned off.  This will be architectural on
future parts.

Introduce tsx= to provide a global on/off for TSX, including its enumeration
via CPUID.  Provide stub virtualisation of this MSR, as it is not exposed to
guests at the moment.

VMs may have booted before microcode is loaded, or before hosts have rebooted,
and they still want to migrate freely.  A VM which booted seeing TSX can
migrate safely to hosts with TSX disabled - TSX will start unconditionally
aborting, but still behave in a manner compatible with the ABI.

The guest-visible behaviour is equivalent to late loading the microcode and
setting the RTM_DISABLE bit in the course of live patching.

This is part of XSA-305 / CVE-2019-11135

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/vtx: Allow runtime modification of the exec-sp setting

See patch for details.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>

x86/vtx: Disable executable EPT superpages to work around CVE-2018-12207

CVE-2018-12207 covers a set of errata on various Intel processors, whereby a
machine check exception can be generated in a corner case when an executable
mapping changes size or cacheability without TLB invalidation.  HVM guest
kernels can trigger this to DoS the host.

To mitigate, in affected hardware, all EPT superpages are marked NX.  When an
instruction fetch violation is observed against the superpage, the superpage
is shattered to 4k and has execute permissions restored.  This prevents the
guest kernel from being able to create the necessary preconditions in the iTLB
to exploit the vulnerability.

This does come with a workload-dependent performance overhead, caused by
increased TLB pressure.  Performance can be restored, if guest kernels are
trusted not to mount an attack, by specifying ept=exec-sp on the command line.

This is part of XSA-304 / CVE-2018-12207

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/vtd: Hide superpage support for SandyBridge IOMMUs

Something causes SandyBridge IOMMUs to choke when sharing EPT pagetables, and
an EPT superpage gets shattered. The root cause is still under investigation,
but the end result is unusable in combination with CVE-2018-12207 protections.

This is part of XSA-304 / CVE-2018-12207

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

xen/arm64: Don't blindly unmask interrupts on trap without a change of level

Some of the traps without a change of the level (i.e. hypervisor ->
hypervisor) will unmask interrupts regardless the state of them in the
interrupted context.

One of the consequences is IRQ will be unmasked when receiving a
synchronous exception (used by WARN*()). This could result to unexpected
behavior such as deadlock (if a lock was shared with interrupts).

In a nutshell, interrupts should only be unmasked when it is safe to
do. Xen only unmask IRQ and Abort interrupts, so the logic can stay
simple:
    - hyp_error: All the interrupts are now kept masked. SError should
      be pretty rare and if ever happen then we most likely want to
      avoid any other interrupts to be generated. The potential main
      "caller" is during virtual SError synchronization on the exit
      path from the guest (see check_pending_vserror).

    - hyp_sync: The interrupts state is inherited from the interrupted
      context.

    - hyp_irq: All the interrupts but IRQ state are inherited from the
      interrupted context. IRQ is kept masked.

This is part of XSA-303.

Reported-by: Julien Grall <Julien.Grall@arm.com>
Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
master commit: 3ed885a8874003f6011460f4f46d1d130dd6b2db
master date: 2019-10-31 16:22:55 +0100

xen/arm32: Don't blindly unmask interrupts on trap without a change of level

Exception vectors will unmask interrupts regardless the state of them in
the interrupted context.

One of the consequences is IRQ will be unmasked when receiving an
undefined instruction exception (used by WARN*) from the hypervisor.
This could result to unexpected behavior such as deadlock (if a lock was
shared with interrupts).

In a nutshell, interrupts should only be unmasked when it is safe to do.
Xen only unmask IRQ and Abort interrupts, so the logic can stay simple.

As vectors exceptions may be shared between guest and hypervisor, we now
need to have a different policy for the interrupts.

On exception from hypervisor, each vector will select the list of
interrupts to inherit from the interrupted context. Any interrupts not
listed will be kept masked.

On exception from the guest, the Abort and IRQ will be unmasked
depending on the exact vector.

The interrupts will be kept unmasked when the vector cannot used by
either guest or hypervisor.

Note that each vector is not anymore preceded by ALIGN. This is fine
because the alignment is already bigger than what we need.

This is part of XSA-303.

Reported-by: Julien Grall <Julien.Grall@arm.com>
Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
master commit: 61b683571f0abd12395b1454cd055f2ad9bb3a37
master date: 2019-10-31 16:22:34 +0100

xen/arm32: entry: Fold the macro SAVE_ALL in the macro vector

Follow-up rework will require the macro vector to distinguish between
a trap from a guest vs while in the hypervisor.

The macro SAVE_ALL already has code to distinguish between the two and
it is only called by the vector macro. So fold the former into the
latter. This will help to avoid duplicating the check.

This is part of XSA-303.

Reported-by: Julien Grall <Julien.Grall@arm.com>
Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
master commit: a7b81b021ead23bffb5affcac05edfc0a84d129d
master date: 2019-10-31 16:21:33 +0100

xen/arm32: entry: Split __DEFINE_ENTRY_TRAP in two

The preprocessing macro __DEFINE_ENTRY_TRAP is used to generate trap
entry function. While the macro is fairly small today, follow-up patches
will increase the size signicantly.

In general, assembly macros are more readable as they allow you to name
parameters and avoid '\'. So the actual implementation of the trap is
now switched to an assembly macro.

This is part of XSA-303.

Reported-by: Julien Grall <Julien.Grall@arm.com>
Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
master commit: 6eeef7ecaeec002bb1da7e20c9cfaec5549bd940
master date: 2019-10-31 16:20:58 +0100

passthrough: quarantine PCI devices

When a PCI device is assigned to an untrusted domain, it is possible for
that domain to program the device to DMA to an arbitrary address. The
IOMMU is used to protect the host from malicious DMA by making sure that
the device addresses can only target memory assigned to the guest. However,
when the guest domain is torn down the device is assigned back to dom0,
thus allowing any in-flight DMA to potentially target critical host data.

This patch introduces a 'quarantine' for PCI devices using dom_io. When
the toolstack makes a device assignable (by binding it to pciback), it
will now also assign it to DOMID_IO and the device will only be assigned
back to dom0 when the device is made unassignable again. Whilst device is
assignable it will only ever transfer between dom_io and guest domains.
dom_io is actually only used as a sentinel domain for quarantining purposes;
it is not configured with any IOMMU mappings. Assignment to dom_io simply
means that the device's initiator (requestor) identifier is not present in
the IOMMU's device table and thus any DMA transactions issued will be
terminated with a fault condition.

In addition, a fix to assignment handling is made for VT-d.  Failure
during the assignment step should not lead to a device still being
associated with its prior owner. Hand the device to DomIO temporarily,
until the assignment step has completed successfully.  Remove the PI
hooks from the source domain then earlier as well.

Failure of the recovery reassign_device_ownership() may not go silent:
There e.g. may still be left over RMRR mappings in the domain assignment
to which has failed, and hence we can't allow that domain to continue
executing.

NOTE: This patch also includes one printk() cleanup; the
      "XEN_DOMCTL_assign_device: " tag is dropped in iommu_do_pci_domctl(),
      since similar printk()-s elsewhere also don't log such a tag.

This is XSA-302.

Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
master commit: 319f9a0ba94c7db505cd5dd9cb0b037ab1aa8e12
master date: 2019-10-31 16:20:05 +0100

xen/arm: p2m: Don't check the return of p2m_get_root_pointer() with BUG_ON()

It turns out that the BUG_ON() was actually reachable with well-crafted
hypercalls. The BUG_ON() is here to prevent catch logical error, so
crashing Xen is a bit over the top.

While all the holes should now be fixed, it would be better to downgrade
the BUG_ON() to something less fatal to prevent any more DoS.

The BUG_ON() in p2m_get_entry() is now replaced by ASSERT_UNREACHABLE()
to catch mistake in debug build and return INVALID_MFN for production
build. The interface also requires to set page_order to give an idea of
the size of "hole". So 'level' is now set so we report a hole of size of
the an entry of the root page-table. This stays inline with what happen
when the GFN is higher than p2m->max_mapped_gfn.

The BUG_ON() in p2m_resolve_translation_fault() is now replaced by
ASSERT_UNREACHABLE() to catch mistake in debug build and just report a
fault for producion build.

This is part of XSA-301.

Reported-by: Julien Grall <Julien.Grall@arm.com>
Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
master commit: 31b4f4ab6634f85163656b470dffc6d974917853
master date: 2019-10-31 16:19:14 +0100

xen/arm: p2m: Avoid off-by-one check on p2m->max_mapped_gfn

The code base is using inconsistently the field p2m->max_mapped_gfn.
Some of the useres expect that p2m->max_guest_gfn contain the highest
mapped GFN while others expect highest + 1.

p2m->max_guest_gfn is set as highest + 1, because of that the sanity
check on the GFN in p2m_resolved_translation_fault() and
p2m_get_entry() can be bypassed when GFN == p2m->max_guest_gfn.

p2m_get_root_pointer(p2m->max_guest_gfn) may return NULL if it is
outside of address range supported and therefore the BUG_ON() could be
hit.

The current value hold in p2m->max_mapped_gfn is inconsistent with the
expectation of the common code (see domain_get_maximum_gpfn()) and also
the documentation of the field.

Rather than changing the check in p2m_translation_fault() and
p2m_get_entry(), p2m->max_mapped_gfn is now containing the highest
mapped GFN and the callers assuming "highest + 1" are now adjusted.

Take the opportunity to use 1UL rather than 1 as page_order could
theoritically big enough to overflow a 32-bit integer.

Lastly, the documentation of the field max_guest_gfn to reflect how it
is computed.

This is part of XSA-301.

Reported-by: Julien Grall <Julien.Grall@arm.com>
Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
master commit: 6e8e163b46d0823526f1afbbe6f66c668fc811d1
master date: 2019-10-31 16:18:38 +0100

xen/arm: p2m: Avoid aliasing guest physical frame

The P2M helpers implementation is quite lax and will end up to ignore
the unused top bits of a guest physical frame.

This effectively means that p2m_set_entry() will create a mapping for a
different frame (it is always equal to gfn & (mask unused bits)). Yet
p2m->max_mapped_gfn will be updated using the original frame.

At the moment, p2m_get_entry() and p2m_resolve_translation_fault()
assume that p2m_get_root_pointer() will always return a non-NULL pointer
when the GFN is smaller than p2m->max_mapped_gfn.

Unfortunately, because of the aliasing described above, it would be
possible to set p2m->max_mapped_gfn high enough so it covers frame that
would lead p2m_get_root_pointer() to return NULL.

As we don't sanity check the guest physical frame provided by a guest, a
malicious guest could craft a series of hypercalls that will hit the
BUG_ON() and therefore DoS Xen.

To prevent aliasing, the function p2m_get_root_pointer() is now reworked
to return NULL If any of the unused top bits are not zero. The caller
can then decide what's the appropriate action to do. Since the two paths
(i.e. P2M_ROOT_PAGES == 1 and P2M_ROOT_PAGES != 1) are now very
similarly, take the opportunity to consolidate them making the code a
bit simpler.

With this change, p2m_get_entry() will not try to insert a mapping as
the root pointer is invalid.

Note that root_table is now switch to unsigned long as unsigned int is
not enough to hold part of a GFN.

This is part of XSA-301.

Reported-by: Julien Grall <Julien.Grall@arm.com>
Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
master commit: 88aaf40eeff771c546ad3bbb02000171648a89f7
master date: 2019-10-31 16:17:33 +0100

x86/mm: Don't drop a type ref unless you held a ref to begin with

Validation and de-validation of pagetable trees may take arbitrarily
large amounts of time, and so must be preemptible.  This is indicated
by setting the PGT_partial bit in the type_info, and setting
nr_validated_entries and partial_flags appropriately.  Specifically,
if the entry at [nr_validated_entries] is partially validated,
partial_flags should have the PGT_partial_set bit set, and the entry
should hold a general reference count.  During de-validation,
put_page_type() is called on partially validated entries.

Unfortunately, there are a number of issues with the current algorithm.

First, doing a "normal" put_page_type() is not safe when no type ref
is held: there is nothing to stop another vcpu from coming along and
picking up validation again: at which point the put_page_type may drop
the only page ref on an in-use page.  Some examples are listed in the
appendix.

The core issue is that put_page_type() is being called both to clean
up PGT_partial, and to drop a type count; and has no way of knowing
which is which; and so if in between, PGT_partial is cleared,
put_page_type() will drop the type ref erroneously.

What is needed is to distinguish between two states:
- Dropping a type ref which is held
- Cleaning up a page which has been partially de/validated

Fix this by telling put_page_type() which of the two activities you
intend.

When cleaning up a partial de/validation, take no action unless you
find a page partially validated.

If put_page_type() is called without PTF_partial_set, and finds the
page in a PGT_partial state anyway, then there's certainly been a
misaccounting somewhere, and carrying on would almost certainly cause
a security issue, so crash the host instead.

In put_page_from_lNe, pass partial_flags on to _put_page_type().

old_guest_table may be set either with a fully validated page (when
using the "deferred put" pattern), or with a partially validated page
(when a normal "de-validation" is interrupted, or when a validation
fails part-way through due to invalid entries).  Add a flag,
old_guest_table_partial, to indicate which of these it is, and use
that to pass the appropriate flag to _put_page_type().

While here, delete stray trailing whitespace.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
-----
Appendix:

Suppose page A, when interpreted as an l3 pagetable, contains all
valid entries; and suppose A[x] points to page B, which when
interpreted as an l2 pagetable, contains all valid entries.

P1: PIN_L3_TABLE
  A -> PGT_l3_table | 1 | valid
  B -> PGT_l2_table | 1 | valid

P1: UNPIN_TABLE
  > Arrange to interrupt after B has been de-validated
  B:
    type_info -> PGT_l2_table | 0
  A:
    type_info -> PGT_l3_table | 1 | partial
    nr_validated_enties -> (less than x)

P2: mod_l4_entry to point to A
  > Arrange for this to be interrupted while B is being validated
  B:
    type_info -> PGT_l2_table | 1 | partial
    (nr_validated_entires &c set as appropriate)
  A:
    type_info -> PGT_l3_table | 1 | partial
    nr_validated_entries -> x
    partial_pte = 1

P3: mod_l3_entry some other unrelated l3 to point to B:
  B:
    type_info -> PGT_l2_table | 1

P1: Restart UNPIN_TABLE

At this point, since A.nr_validate_entries == x and A.partial_pte !=
0, free_l3_table() will call put_page_from_l3e() on pl3e[x], dropping
its type count to 0 while it's still being pointed to by some other l3

A similar issue arises with old_guest_table.  Consider the following
scenario:

Suppose A is a page which, when interpreted as an l2, has valid entries
until entry x, which is invalid.

V1:  PIN_L2_TABLE(A)
  <Validate until we try to validate [x], get -EINVAL>
  A -> PGT_l2_table | 1 | PGT_partial
  V1 -> old_guest_table = A
  <delayed>

V2: PIN_L2_TABLE(A)
  <Pick up where V1 left off, try to re-validate [x], get -EINVAL>
  A -> PGT_l2_table | 1 | PGT_partial
  V2 -> old_guest_table = A
  <restart>
  put_old_guest_table()
    _put_page_type(A)
      A -> PGT_l2_table | 0

V1: <restart>
  put_old_guest_table()
    _put_page_type(A) # UNDERFLOW

Indeed, it is possible to engineer for old_guest_table for every vcpu
a guest has to point to the same page.
master commit: c40b33d72630dcfa506d6fd856532d6152cb97dc
master date: 2019-10-31 16:16:37 +0100

x86/mm: Fix nested de-validation on error

If an invalid entry is discovered when validating a page-table tree,
the entire tree which has so far been validated must be de-validated.
Since this may take a long time, alloc_l[2-4]_table() set current
vcpu's old_guest_table immediately; put_old_guest_table() will make
sure that put_page_type() will be called to finish off the
de-validation before any other MMU operations can happen on the vcpu.

The invariant for partial pages should be:

* Entries [0, nr_validated_ptes) should be completely validated;
  put_page_type() will de-validate these.

* If [nr_validated_ptes] is partially validated, partial_flags should
  set PTF_partiaL_set.  put_page_type() will be called on this page to
  finish off devalidation, and the appropriate refcount adjustments
  will be done.

alloc_l[2-3]_table() indicates partial validation to its callers by
setting current->old_guest_table.

Unfortunately, this is mishandled.

Take the case where validating lNe[x] returns an error.

First, alloc_l3_table() doesn't check old_guest_table at all; as a
result, partial_flags is not set when it should be.  nr_validated_ptes
is set to x; and since PFT_partial_set clear, de-validation resumes at
nr_validated_ptes-1.  This means that the l2 page at pl3e[x] will not
have put_page_type() called on it when de-validating the rest of the
l3: it will be stuck in the PGT_partial state until the domain is
destroyed, or until it is re-used as an l2.  (Any other page type will
fail.)

Worse, alloc_l4_table(), rather than setting PTF_partial_set as it
should, sets nr_validated_ptes to x+1.  When de-validating, since
partial is 0, this will correctly resume calling put_page_type at [x];
but, if the put_page_type() is never called, but instead
get_page_type() is called, validation will pick up at [x+1],
neglecting to validate [x].  If the rest of the validation succeeds,
the l4 will be validated even though [x] is invalid.

Fix this in both cases by setting PTF_partial_set if old_guest_table
is set.

While here, add some safety catches:
- old_guest_table must point to the page contained in
  [nr_validated_ptes].
- alloc_l1_page shouldn't set old_guest_table

If we experience one of these situations in production builds, it's
safer to avoid calling put_page_type for the pages in question.  If
they have PGT_partial set, they will be cleaned up on domain
destruction; if not, we have no idea whether a type count is safe to
drop.  Retaining an extra type ref that should have been dropped may
trigger a BUG() on the free_domain_page() path, but dropping a type
count that shouldn't be dropped may cause a privilege escalation.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 3c15a2d8cc1981f369cc9542f028054d0dfb325b
master date: 2019-10-31 16:16:13 +0100

x86/mm: Properly handle linear pagetable promotion failures

In order to allow recursive pagetable promotions and demotions to be
interrupted, Xen must keep track of the state of the sub-pages
promoted or demoted.  This is stored in two elements in the page
struct: nr_entries_validated and partial_flags.

The rule is that entries [0, nr_entries_validated) should always be
validated and hold a general reference count.  If partial_flags is
zero, then [nr_entries_validated] is not validated and no reference
count is held.  If PTF_partial_set is set, then [nr_entries_validated]
is partially validated, and a general reference count is held.

Unfortunately, in cases where an entry began with PTF_partial_set set,
and get_page_from_lNe() returns -EINVAL, the PTF_partial_set bit is
erroneously dropped.  (This scenario can be engineered mainly by the
use of interleaving of promoting and demoting a page which has "linear
pagetable" entries; see the appendix for a sketch.)  This means that
we will "leak" a general reference count on the page in question,
preventing the page from being freed.

Fix this by setting page->partial_flags to the partial_flags local
variable.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
-----
Appendix

Suppose A and B can both be promoted to L2 pages, and A[x] points to B.

V1: PIN_L2 B.
  B.type_count = 1 | PGT_validated
  B.count = 2 | PGC_allocated

V1: MOD_L3_ENTRY pointing something to A.
  In the process of validating A[x], grab an extra type / ref on B:
  B.type_count = 2 | PGT_validated
  B.count = 3 | PGC_allocated
  A.type_count = 1 | PGT_validated
  A.count = 2 | PGC_allocated

V1: UNPIN B.
  B.type_count = 1 | PGT_validate
  B.count = 2 | PGC_allocated

V1: MOD_L3_ENTRY removing the reference to A.
  De-validate A, down to A[x], which points to B.
  Drop the final type on B.  Arrange to be interrupted.
  B.type_count = 1 | PGT_partial
  B.count = 2 | PGC_allocated
  A.type_count = 1 | PGT_partial
  A.nr_validated_entries = x
  A.partial_pte = -1

V2: MOD_L3_ENTRY adds a reference to A.

At this point, get_page_from_l2e(A[x]) tries
get_page_and_type_from_mfn(), which fails because it's the wrong type;
and get_l2_linear_pagetable() also fails, because B isn't validated as
an l2 anymore.
master commit: 2f126247ef49c2ba52bae29a2ab371059ede67c0
master date: 2019-10-31 16:15:48 +0100

x86/mm: Collapse PTF_partial_set and PTF_partial_general_ref into one

...now that they are equivalent. No functional change intended.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: d28fe10c50e59569c050878226dcd95dc741810f
master date: 2019-10-31 16:15:11 +0100