dgit.raspbian.org Git

Update to upstream 4.11.1

[git-debrebase anchor: new upstream 4.11.1, merge]

d/changelog: revert closing pygrub bugs

It appears that the pygrub script itself is still broken because of
import problems with a renamed library. Make sure we're not claiming
that the bugs are solved.

d/rules: Don't exclude the actual pygrub script

We still want to have `/usr/lib/xen-4.11/bin/pygrub`.

Thanks PryMar56 for quickly pointing out the fix on IRC.

debian/changelog: mention closing #865086

Signed-off-by: Hans van Kranenburg <hans@knorrie.org>

grub.d/xen.cfg: fix default entry when using l10n

When a user uses a locale that results in translating menu item titles
into another language than English, the hardcoded "Debian GNU/Linux,
with Xen hypervisor" would not match anything.

So, use gettext to make it match the right translated entry.
Also see
- https://bugs.launchpad.net/ubuntu/+source/xen/+bug/1321144
- https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=865086

Note that (thanks Ian for the info):
* When GRUB_TERMINAL is not empty and set to anything other than
  `gfxterm', grub will not do translation at all, because grub-mkconfig
  thinks that other GRUB_TERMINAL values including `serial' preclude
  non-ASCII characters, and that causes it to set LANG=C. (I have
  GRUB_TERMINAL="serial console", which caused much confusion when
  trying to test all of this).
* Just trying the printf "$(gettext... below is not enough to test if a
  translation shows up. It needs -d grub additionally for gettext, or
  TEXTDOMAIN=grub in the environment, which is probably present when
  this file gets run by update-grub.
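
The mismatch described above can be sketched with a toy model. This is not the grub.d script itself: the catalogue, the Dutch string and the helper names are made up for illustration, and only the lookup logic is the point.

```python
# Toy model: menu titles are translated, so the default-entry lookup has to
# translate its search string the same way. The catalogue below is invented.
CATALOGUE = {
    "nl": {"%s, with Xen hypervisor": "%s, met Xen hypervisor"},
}

def translate(msgid, lang):
    """Return the translated template, falling back to the msgid (like gettext)."""
    return CATALOGUE.get(lang, {}).get(msgid, msgid)

def menu_titles(lang):
    """Titles as a localised menu generator might emit them."""
    template = translate("%s, with Xen hypervisor", lang)
    return ["Debian GNU/Linux", template % "Debian GNU/Linux"]

def find_default(titles, wanted):
    """Index of the wanted title, or None if nothing matches."""
    return titles.index(wanted) if wanted in titles else None

titles = menu_titles("nl")
hardcoded = "Debian GNU/Linux, with Xen hypervisor"
localised = translate("%s, with Xen hypervisor", "nl") % "Debian GNU/Linux"
```

With the invented catalogue, the hardcoded English title finds nothing in the translated menu, while the title built through the same translation step does.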

Signed-off-by: Hans van Kranenburg <hans@knorrie.org>

debian/changelog: start -6 entry

Signed-off-by: Hans van Kranenburg <hans@knorrie.org>

debian/control: Add Homepage, Vcs-Browser and Vcs-Git.

Signed-off-by: Hans van Kranenburg <hans@knorrie.org>

changelog: Finalise -5

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

debian/rules: Cope with xen-utils-common not being built

In a binary-indep build, xen-utils-common is not built so the files
are not installed by dh_install and the directory is missing.

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

changelog: finalise +dfsg-4.

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

debian/control: Add pandoc and markdown to b-d

Without these, some documentation is omitted.

Resulting changes to the binary packages are:
xen-doc: lots of extra html files in /usr/share/doc/xen/html/
xen-utils-common: xen-vbd-interface(7)

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

debian/rules: Do not try to move EFI binaries on armhf

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

debian/rules: Use find rather than shell glob for strip

This stops this from falling over on arches without hvmloader.

Signed-off-by: Ian Jackson <ijackson@chiark.greenend.org.uk>

xen-utils-*.install: Expect shim only on amd64 | i386

Signed-off-by: Ian Jackson <ijackson@chiark.greenend.org.uk>

debian/shuffle-boot-files: Handle boot/xen as well as boot/xen.gz

On arm64, at least, the main file is boot/xen, not boot/xen.gz.

Signed-off-by: Ian Jackson <ijackson@chiark.greenend.org.uk>

debian/rules: Install shim separately

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

debian/rules: Build shim separately

So we can control (1) the make arguments including the arch
(2) the other compile flags.

Signed-off-by: Ian Jackson <ijackson@chiark.greenend.org.uk>

debian/rules: Fix some cases of HOST/BUILD arch confusion

Signed-off-by: Ian Jackson <ijackson@chiark.greenend.org.uk>

changelog: finalise -3.

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

debian/rules: Add a -n to a gzip rune to improve reproducibility

There's still a lot of unreproducibility here, but this at least is an
easy fix.
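
Why the timestamp matters can be shown with Python's gzip module standing in for the command line tool. This is only an analogy: `mtime=0` plays the role of `-n` here (the module stores no original file name in the first place), and the sample data and timestamps are arbitrary.

```python
import gzip

data = b"xen changelog contents\n" * 4

# mtime=0 plays the role of `gzip -n`: no build-time timestamp in the header.
fixed_a = gzip.compress(data, mtime=0)
fixed_b = gzip.compress(data, mtime=0)

# Two different timestamps stand in for two builds run at different times.
build_1 = gzip.compress(data, mtime=1_500_000_000)
build_2 = gzip.compress(data, mtime=1_500_000_060)

def header_mtime(blob):
    """The MTIME field is bytes 4..7 of the gzip header, little-endian."""
    return int.from_bytes(blob[4:8], "little")
```

The two "builds" decompress to the same bytes but differ as files, which is exactly what a reproducibility check trips over.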

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

debian/control: Add missing Replaces on old xen-utils-common

Previously the xenstore utility manpages were erroneously in
xen-utils-common. We need to declare Replaces so that dpkg lets us
take them over rather than regarding it as a file conflict.

I think we can safely drop the old Conflicts/Replaces from Xen 3.1.0
days.

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

debian/control: Adding Section to source stanza

This is recommended by policy, although lintian doesn't mind its
absence.

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

hypervisor package postinst: Actually install

This source template file needs to have .vsn-in at the end of its
filename.

This fixes the bug that one needs to run update-grub by hand.

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

Redo as an upload with binaries

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

changelog: Incorporate changelog changes from Hans's pre.20180911.

The changes in Hans's version are all in my tree now: I've rebased
onto his .dfsg upstream tag, and my own tree already had the
lintian override.

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

update Xen version to 4.11.1

x86/dom0: Avoid using 1G superpages if shadowing may be necessary

The shadow code doesn't support 1G superpages, and will hand #PF[RSVD] back to
guests.

For dom0's with 512GB of RAM or more (and subject to the P2M alignment), Xen's
domain builder might use 1G superpages.

Avoid using 1G superpages (falling back to 2M superpages instead) if there is
a reasonable chance that we may have to shadow dom0. This assumes that there
are no circumstances where we will activate logdirty mode on dom0.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 96f6ee15ad7ca96472779fc5c083b4149495c584
master date: 2018-11-12 11:26:04 +0000

x86/shadow: shrink struct page_info's shadow_flags to 16 bits

This is to avoid it overlapping the linear_pt_count field needed for PV
domains. Introduce a separate, HVM-only pagetable_dying field to replace
the sole one left in the upper 16 bits.

Note that the accesses to ->shadow_flags in shadow_{pro,de}mote() get
switched to non-atomic, non-bitops operations, as {test,set,clear}_bit()
are not allowed on uint16_t fields and hence their use would have
required ugly casts. This is fine because all updates of the field ought
to occur with the paging lock held, and other updates of it use |= and
&= as well (i.e. using atomic operations here didn't really guard
against potentially racing updates elsewhere).
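
As a rough illustration of the plain |= and &= style on a 16-bit field, here is a small Python model. The function names and the idea of passing a bit number are assumptions made for the sketch, not the hypervisor's actual helpers.

```python
SHADOW_FLAGS_BITS = 16
MASK = (1 << SHADOW_FLAGS_BITS) - 1

def _check(bit):
    if not 0 <= bit < SHADOW_FLAGS_BITS:
        raise ValueError("bit does not fit in a 16-bit flags field")

def promote(flags, bit):
    """Set one flag with a plain (non-atomic) OR, as |= would."""
    _check(bit)
    return (flags | (1 << bit)) & MASK

def demote(flags, bit):
    """Clear one flag with a plain (non-atomic) AND, as &= would."""
    _check(bit)
    return flags & ~(1 << bit) & MASK
```

In this model, safety comes from the caller serialising updates (the paging lock in the text above), not from the operations themselves.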

This is part of XSA-280.

Reported-by: Prgmr.com Security <security@prgmr.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Tim Deegan <tim@xen.org>
master commit: 789589968ed90e82a832dbc60e958c76b787be7e
master date: 2018-11-20 14:59:54 +0100

x86/shadow: move OOS flag bit positions

In preparation for reducing struct page_info's shadow_flags field to 16
bits, lower the bit positions used for SHF_out_of_sync and
SHF_oos_may_write.

Instead of also adjusting the open coded use in _get_page_type(),
introduce shadow_prepare_page_type_change() to contain knowledge of the
bit positions to shadow code.

This is part of XSA-280.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Tim Deegan <tim@xen.org>
master commit: d68e1070c3e8f4af7a31040f08bdd98e6d6eac1d
master date: 2018-11-20 14:59:13 +0100

x86/mm: Don't perform flush after failing to update a guest's L1e

If the L1e update hasn't occurred, the flush cannot do anything useful. This
skips the potentially expensive vcpumask_to_pcpumask() conversion, and
broadcast TLB shootdown.

More importantly however, we might be in the error path due to a bad va
parameter from the guest, and this should not propagate into the TLB flushing
logic. The INVPCID instruction for example raises #GP for a non-canonical
address.
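
A minimal sketch of the idea, assuming 48-bit virtual addresses (4-level paging). The function names and the callback shape are hypothetical. In this toy model the canonical check is written out explicitly, whereas the real code simply sees the update fail.

```python
VA_BITS = 48  # assumption: 4-level paging, 48-bit virtual addresses

def is_canonical(va):
    """True if bits 63..47 of a 64-bit address are all equal."""
    top = (va & (2**64 - 1)) >> (VA_BITS - 1)
    return top == 0 or top == (1 << (64 - VA_BITS + 1)) - 1

def update_va_mapping(va, do_update, flush):
    """Toy model: flush only once the update has actually happened."""
    if not is_canonical(va) or not do_update(va):
        return False   # error path: no flush, so a bad va never reaches it
    flush(va)
    return True
```

A failed update returns before the flush callback is touched, so a non-canonical address never gets as far as the flushing logic.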

This is XSA-279.

Reported-by: Matthew Daley <mattd@bugfuzz.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 6c8d50288722672ecc8e19b0741a31b521d01706
master date: 2018-11-20 14:58:41 +0100

x86/mm: Put the gfn on all paths after get_gfn_query()

c/s 7867181b2 "x86/PoD: correctly handle non-order-0 decrease-reservation
requests" introduced an early exit in guest_remove_page() for unexpected p2m
types. However, get_gfn_query() internally takes the p2m lock, and must be
matched with a put_gfn() call later.
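
The get/put pairing can be modelled with a context manager, which releases on every exit path by construction. Everything here (the lock object, the dict used as a p2m, the type strings) is a stand-in invented for the sketch.

```python
import threading
from contextlib import contextmanager

p2m_lock = threading.RLock()   # hypothetical stand-in for the p2m lock
held = 0                       # number of lookups currently outstanding

@contextmanager
def gfn_query(p2m, gfn):
    """Take the lock for the lookup and always release it on exit."""
    global held
    p2m_lock.acquire()
    held += 1
    try:
        yield p2m.get(gfn, "invalid")
    finally:
        held -= 1
        p2m_lock.release()

def remove_page(p2m, gfn):
    with gfn_query(p2m, gfn) as p2m_type:
        if p2m_type != "ram":
            return False       # the early exit still releases the lock
        del p2m[gfn]
        return True
```

C has no such construct, so each early return needs its own explicit put, which is the kind of path the text above says was missed.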

Fix the erroneous comment beside the declaration of get_gfn_query().

This is XSA-277.

Reported-by: Paul Durrant <paul.durrant@citrix.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d80988cfc04ee608bee722448e7c3bc8347ec04c
master date: 2018-11-20 14:58:10 +0100

x86/hvm/ioreq: use ref-counted target-assigned shared pages

Passing MEMF_no_refcount to alloc_domheap_pages() will allocate, as
expected, a page that is assigned to the specified domain but is not
accounted for in tot_pages. Unfortunately there is no logic for tracking
such allocations and avoiding any adjustment to tot_pages when the page
is freed.

The only caller of alloc_domheap_pages() that passes MEMF_no_refcount is
hvm_alloc_ioreq_mfn() so this patch removes use of the flag from that
call-site to avoid the possibility of a domain using an ioreq server as
a means to adjust its tot_pages and hence allocate more memory than it
should be able to.

However, the reason for using the flag in the first place was to avoid
the allocation failing if the emulator domain is already at its maximum
memory limit. Hence this patch switches to allocating memory from the
target domain instead of the emulator domain. There is already an extra
memory allowance of 2MB (LIBXL_HVM_EXTRA_MEMORY) applied to HVM guests,
which is sufficient to cover the pages required by the supported
configuration of a single IOREQ server for QEMU. (Stub-domains do not,
so far, use resource mapping). It is also the case that QEMU will have
mapped the IOREQ server pages before the guest boots, hence it is not
possible for the guest to inflate its balloon to consume these pages.

Reported-by: Julien Grall <julien.grall@arm.com>
Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
master commit: e862e6ceb1fd971d755a0c57d6a0f3b8065187dc
master date: 2018-11-20 14:57:38 +0100

x86/hvm/ioreq: fix page referencing

The code does not take a page reference in hvm_alloc_ioreq_mfn(), only a
type reference. This can lead to a situation where a malicious domain with
XSM_DM_PRIV can engineer a sequence as follows:

- create IOREQ server: no pages as yet.
- acquire resource: page allocated, total 0.
- decrease reservation: -1 ref, total -1.

This will cause Xen to hit a BUG_ON() in free_domheap_pages().

This patch fixes the issue by changing the call to get_page_type() in
hvm_alloc_ioreq_mfn() to a call to get_page_and_type(). This change
in turn requires an extra put_page() in hvm_free_ioreq_mfn() in the case
that _PGC_allocated is still set (i.e. a decrease reservation has not
occurred) to avoid the page being leaked.
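
The sequence above can be replayed with a toy reference counter. The counting follows the description in this message (allocation leaves the total at 0 unless a page reference is taken); the class and function names are invented, and the exception merely stands in for BUG_ON().

```python
class Page:
    def __init__(self):
        self.refs = 0

    def get_ref(self):
        self.refs += 1

    def put_ref(self):
        if self.refs == 0:
            # stands in for the BUG_ON() in the text above
            raise RuntimeError("dropping a reference that was never taken")
        self.refs -= 1

def alloc_ioreq_page(take_page_ref):
    """Allocate a page, optionally taking a general reference as well."""
    page = Page()
    if take_page_ref:
        page.get_ref()
    return page
```

With `take_page_ref=False` a later decrease-reservation style `put_ref()` blows up; with `take_page_ref=True` the same call just brings the count back to 0.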

This is part of XSA-276.

Reported-by: Julien Grall <julien.grall@arm.com>
Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
master commit: f6b6ae78679b363ff670a9c125077c436dabd608
master date: 2018-11-20 14:57:05 +0100

AMD/IOMMU: suppress PTE merging after initial table creation

The logic is not fit for this purpose, so simply disable its use until
it can be fixed / replaced. Note that this re-enables merging for the
table creation case, which was disabled as a (perhaps unintended) side
effect of the earlier "amd/iommu: fix flush checks". It relies on no
page getting mapped more than once (with different properties) in this
process, as that would still be beyond what the merging logic can cope
with. But arch_iommu_populate_page_table() guarantees this afaict.

This is part of XSA-275.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
master commit: 937ef32565fa3a81fdb37b9dd5aa99a1b87afa75
master date: 2018-11-20 14:55:14 +0100

amd/iommu: fix flush checks

Flush checking for AMD IOMMU didn't check whether the previous entry
was present, or whether the flags (writable/readable) changed in order
to decide whether a flush should be executed.

Fix this by taking the writable/readable/next-level fields into account,
together with the present bit.
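
One way to read the rule is sketched below. The field names and the exact comparison are assumptions for illustration, not the layout of a real AMD IOMMU PTE.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Pte:
    present: bool = False
    addr: int = 0
    readable: bool = False
    writable: bool = False
    next_level: int = 0

def need_flush(old, new):
    """A stale translation can only be cached if the old entry was present."""
    if not old.present:
        return False
    return old != new   # address, permissions or next level changed
```

Under this reading, installing a mapping over a non-present entry needs no flush, while changing only the permissions of a present one does.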

Along these lines the flushing in amd_iommu_map_page() must not be
omitted for PV domains. The comment there was simply wrong: Mappings may
very well change, both their addresses and their permissions. Ultimately
this should honor iommu_dont_flush_iotlb, but to achieve this
amd_iommu_ops first needs to gain an .iotlb_flush hook.

Also make clear_iommu_pte_present() static, to demonstrate there's no
caller omitting the (subsequent) flush.

This is part of XSA-275.

Reported-by: Paul Durrant <paul.durrant@citrix.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
master commit: 1a7ffe466cd057daaef245b0a1ab6b82588e4c01
master date: 2018-11-20 14:52:12 +0100

stubdom/vtpm: fix memcmp in TPM_ChangeAuthAsymFinish

gcc8 spotted this error:
error: 'memcmp' reading 20 bytes from a region of size 8 [-Werror=stringop-overflow=]

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Reviewed-by: Samuel Thibault <samuel.thibault@ens-lyon.org>
(cherry picked from commit 22bf5be3237cb482a2ffd772ffd20ce37285eebf)

x86: work around HLE host lockup erratum

XACQUIRE prefixed accesses to the 4Mb range of memory starting at 1Gb
are liable to lock up the processor. Disallow use of this memory range.

Unfortunately the available Core Gen7 and Gen8 spec updates are pretty
old, so I can only guess that they're similarly affected when Core Gen6
is and the Xeon counterparts are, too.

This is part of XSA-282.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: cc76410d20aff2cc07b268b0713dc1d2740c6e12
master date: 2018-11-07 09:33:24 +0100

x86: extend get_platform_badpages() interface

Use a structure so along with an address (now frame number) an order can
also be specified.
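
To make the (frame number, order) pair concrete, here is a small sketch. The type and helper names are hypothetical, 4 KiB frames are assumed, and the sample range is the "4Mb starting at 1Gb" one from the HLE erratum workaround, read here as 4 MiB at 1 GiB.

```python
from typing import NamedTuple

PAGE_SHIFT = 12  # assumption: 4 KiB frames

class BadPage(NamedTuple):
    mfn: int     # first machine frame number of the range
    order: int   # the range covers 2**order frames

def bad_range(start_bytes, size_bytes):
    """Describe a naturally aligned power-of-two byte range as (mfn, order)."""
    frames = size_bytes >> PAGE_SHIFT
    order = frames.bit_length() - 1
    if frames == 0 or 1 << order != frames or start_bytes % size_bytes:
        raise ValueError("range must be a naturally aligned power of two")
    return BadPage(start_bytes >> PAGE_SHIFT, order)

def covers(bad, mfn):
    return bad.mfn <= mfn < bad.mfn + (1 << bad.order)

hle = bad_range(1 << 30, 4 << 20)
```

A single entry with an order describes the whole 1024-frame range, where an address-only interface would need one entry per frame.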

This is part of XSA-282.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 8617e69fb8307b372eeff41d55ec966dbeba36eb
master date: 2018-11-07 09:32:08 +0100

Release: add release note link to SUPPORT.md

In order to have a link to the release notes in the feature list
generated from SUPPORT.md add that link in the "Release Support"
section of that file.

The real link needs to be adapted when the version is being released.

Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>

x86/pv: Fix crash when using `xl set-parameter pcid=...`

"pcid=" is registered as a runtime parameter, which means that parse_pcid()
must not reside in .init, or the following happens when parse_params() tries
to call an unmapped function pointer.

  (XEN) ----[ Xen-4.12-unstable  x86_64  debug=y   Not tainted ]----
  (XEN) CPU:    0
  (XEN) RIP:    e008:[<ffff82d080407fb3>] ffff82d080407fb3
  (XEN) RFLAGS: 0000000000010292   CONTEXT: hypervisor (d0v1)
  (XEN) rax: ffff82d080407fb3   rbx: ffff82d0803cf270   rcx: 0000000000000000
  (XEN) rdx: ffff8300abe67fff   rsi: 000000000000000a   rdi: ffff8300abe67bfd
  (XEN) rbp: ffff8300abe67ca8   rsp: ffff8300abe67ba0   r8:  ffff83084d980000
  (XEN) r9:  0000000000000000   r10: 0000000000000000   r11: 0000000000000000
  (XEN) r12: ffff8300abe67bfd   r13: ffff82d0803cb628   r14: 0000000000000000
  (XEN) r15: ffff8300abe67bf8   cr0: 0000000080050033   cr4: 0000000000172660
  (XEN) cr3: 0000000828efd000   cr2: ffff82d080407fb3
  (XEN) fsb: 00007fb810d4b780   gsb: ffff88007ce20000   gss: 0000000000000000
  (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
  (XEN) Xen code around <ffff82d080407fb3> (ffff82d080407fb3) [fault on access]:
  (XEN)  -- -- -- -- -- -- -- -- <--> -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
  (XEN) Xen stack trace from rsp=ffff8300abe67ba0:
  (XEN)    ffff82d080217f61 ffff830826db0f09 ffff8300abe67bf8 ffff82d0803cf1e0
  (XEN)    00007cff54198409 ffff8300abe67bf0 010001d000000000 0000000000000000
  (XEN)    ffff82d0803cf288 ffff8300abe67c88 ffff82d0805a09c0 616c620064696370
  (XEN)    00000000aaaa0068 0000000000000296 ffff82d08023d60e aaaaaaaaaaaaaaaa
  (XEN)    ffff83084d9b4000 ffff8300abe67c68 ffff82d08024940e ffff83083736e000
  (XEN)    0000000000000080 000000000000007a 000000000000000a ffff82d08045e61c
  (XEN)    ffff82d080573d80 ffff8300abe67cb8 ffff82d080249805 80000007fce54067
  (XEN)    fffffffffffffff2 ffff830826db0f00 ffff8300abfa7000 ffff82d08045e61c
  (XEN)    ffff82d080573d80 ffff8300abe67cb8 ffff82d08021801e ffff8300abe67e48
  (XEN)    ffff82d08023f60a ffff83083736e000 0000000000000000 ffff8300abe67d58
  (XEN)    ffff82d080293d90 0000000000000092 ffff82d08023d60e ffff820040006ae0
  (XEN)    0000000000000000 0000000000000000 00007fb810d5c010 ffff83083736e248
  (XEN)    0000000000000286 ffff8300abe67d58 0000000000000000 ffff82e010521b00
  (XEN)    0000000000000206 0000000000000000 0000000000000000 ffff8300abe67e48
  (XEN)    ffff82d080295270 00000000ffffffff ffff83083736e000 ffff8300abe67e48
  (XEN)    ffff820040006ae0 ffff8300abe67d98 000000120000001c 00007fb810d5d010
  (XEN)    0000000000000009 0000000000000002 0000000000000001 00007fb810b53260
  (XEN)    0000000000000001 0000000000000000 0000000000638bc0 00007fb81066a748
  (XEN)    00007ffe11087881 0000000000000002 0000000000000001 00007fb810b53260
  (XEN)    0000000000638b60 0000000000000000 00007fb8100322a0 ffff82d08035d444
  (XEN) Xen call trace:
  (XEN)    [<ffff82d080217f61>] kernel.c#parse_params+0x34a/0x3eb
  (XEN)    [<ffff82d08021801e>] runtime_parse+0x1c/0x1e
  (XEN)    [<ffff82d08023f60a>] do_sysctl+0x108d/0x1241
  (XEN)    [<ffff82d0803535cb>] pv_hypercall+0x1ac/0x4c5
  (XEN)    [<ffff82d08035d4a2>] lstar_enter+0x112/0x120
  (XEN)
  (XEN) Pagetable walk from ffff82d080407fb3:
  (XEN)  L4[0x105] = 00000000abe5c063 ffffffffffffffff
  (XEN)  L3[0x142] = 00000000abe59063 ffffffffffffffff
  (XEN)  L2[0x002] = 000000084d9bf063 ffffffffffffffff
  (XEN)  L1[0x007] = 0000000000000000 ffffffffffffffff
  (XEN)
  (XEN) ****************************************
  (XEN) Panic on CPU 0:
  (XEN) FATAL PAGE FAULT
  (XEN) [error_code=0010]
  (XEN) Faulting linear address: ffff82d080407fb3
  (XEN) ****************************************

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: f993c3e90728705dacd834b49a6e5608c1360409
master date: 2018-10-30 13:26:21 +0000

tools/dombuilder: Initialise vcpu debug registers correctly

In particular, initialising %dr6 with the value 0 is buggy, because on
hardware supporting Transactional Memory, it will cause the sticky RTM bit to
be asserted, even though a debug exception from a transaction hasn't actually
been observed.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
master commit: 46029da12e5efeca6d957e5793bd34f2965fa0a1
master date: 2018-10-24 14:43:05 +0100

x86/domain: Initialise vcpu debug registers correctly

In particular, initialising %dr6 with the value 0 is buggy, because on
hardware supporting Transactional Memory, it will cause the sticky RTM bit to
be asserted, even though a debug exception from a transaction hasn't actually
been observed.

Introduce arch_vcpu_regs_init() to set various architectural defaults, and
reuse this in the hvm_vcpu_reset_state() path.

Architecturally, %edx's init state contains the processor's model information,
and 0xf looks to be a remnant of the old Intel processors. We clearly have no
software which cares, seeing as it is wrong for the last decade's worth of
Intel hardware and for all other vendors, so let's use the value 0 for
simplicity.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

x86/domain: Fix build with GCC 4.3.x

GCC 4.3.x can't initialise the user_regs structure like this.

Reported-by: Jan Beulich <JBeulich@suse.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: dfba4d2e91f63a8f40493c4fc2db03fd8287f6cb
master date: 2018-10-24 14:43:05 +0100
master commit: 0a1fa635029d100d4b6b7eddb31d49603217cab7
master date: 2018-10-30 13:26:21 +0000

x86/boot: Initialise the debug registers correctly

In particular, initialising %dr6 with the value 0 is buggy, because on
hardware supporting Transactional Memory, it will cause the sticky RTM bit to
be asserted, even though a debug exception from a transaction hasn't actually
been observed.

Move X86_DR6_DEFAULT into x86-defns.h along with the other architectural
register constants, and introduce a new X86_DR7_DEFAULT. Use the existing
write_debugreg() helper, rather than opencoded inline assembly.
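
The inverted polarity is easy to miss, so here is a tiny check. The numeric value given for X86_DR6_DEFAULT and the position of the RTM bit (bit 16) are stated from memory of the architecture manuals and should be treated as assumptions; the helper name is invented.

```python
X86_DR6_DEFAULT = 0xFFFF0FF0   # assumption: value of the constant named above
DR6_RTM = 1 << 16              # assumption: RTM bit, reads as 0 when flagged

def rtm_flagged(dr6):
    """The RTM bit has inverted polarity: a clear bit means 'event seen'."""
    return not dr6 & DR6_RTM
```

Under those assumptions an all-zero %dr6 reports an RTM debug event that never happened, while the default value does not.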

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 721da6d41a70fe08b3fcd9c31a62f6709a54c6ba
master date: 2018-10-24 14:43:05 +0100

x86/boot: enable NMIs after traps init

In certain scenarios, NMIs might be disabled during the Xen boot process.
Such a situation will cause alternative_instructions() to:

panic("Timed out waiting for alternatives self-NMI to hit\n");

This bug was originally seen when using Tboot to boot Xen 4.11.

To prevent this from happening, enable NMIs during cpu_init() and
during __start_xen() for BSP.

Signed-off-by: Sergey Dyasli <sergey.dyasli@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 072e054359a4d4a4f6c3fa09585667472c4f0f1d
master date: 2018-10-23 12:33:54 +0100

vtd: add missing check for shared EPT...

...in intel_iommu_unmap_page().

This patch also includes some non-functional modifications in
intel_iommu_map_page().

Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
master commit: e30c47cd8be8ba73cfc1ec7b1ebd036464708a24
master date: 2018-10-04 14:53:57 +0200

x86: fix "xpti=" and "pv-l1tf=" yet again

While commit 2a3b34ec47 ("x86/spec-ctrl: Yet more fixes for xpti=
parsing") indeed fixed "xpti=dom0", it broke "xpti=no-dom0", in that
this then became equivalent to "xpti=no". In particular, the presence
of "xpti=" alone on the command line means nothing as to which default
is to be overridden; "xpti=no-dom0", for example, ought to have no
effect for DomU-s, as this is distinct from both "xpti=no-dom0,domu"
and "xpti=no-dom0,no-domu".
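
The intended semantics can be sketched as a parser that keeps two separate settings and only touches the ones named. This is a simplified model, not Xen's parser: the accepted boolean spellings are an assumption, and other forms the real option may accept are left out.

```python
def parse_xpti(value, dom0, domu):
    """Return (dom0, domu) after applying one "xpti=" value to the defaults."""
    if value in ("", "1", "yes", "true", "on"):
        return True, True
    if value in ("0", "no", "false", "off"):
        return False, False
    for token in value.split(","):
        enable = not token.startswith("no-")
        name = token if enable else token[3:]
        if name == "dom0":
            dom0 = enable
        elif name == "domu":
            domu = enable
        else:
            raise ValueError(f"unknown xpti token: {token!r}")
    return dom0, domu
```

In this model, `parse_xpti("no-dom0", dom0, domu)` leaves whatever default DomU-s had untouched, which is the distinction from both "xpti=no" and the two-token forms.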

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 8743d2dea539617e237c77556a91dc357098a8af
master date: 2018-10-04 14:49:56 +0200

x86: split opt_pv_l1tf

Use separate tracking variables for the hardware domain and DomU-s.

No functional change intended, but adjust the comment in
init_speculation_mitigations() to match prior as well as resulting code.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 0b89643ef6ef14e2c2b731ca675d23e405ed69b1
master date: 2018-10-04 14:49:19 +0200

x86: split opt_xpti

Use separate tracking variables for the hardware domain and DomU-s.

No functional change intended.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 51e0cb45932d80d4eeb59994ee2c3f3c597b0212
master date: 2018-10-04 14:48:18 +0200

x86: silence false log messages for plain "xpti" / "pv-l1tf"

While commit 2a3b34ec47 ("x86/spec-ctrl: Yet more fixes for xpti=
parsing") claimed to have got rid of the 'parameter "xpti" has invalid
value "", rc=-22!' log message for "xpti" alone on the command line,
this wasn't the case (the option took effect nevertheless).

Fix this there as well as for plain "pv-l1tf".

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 2fb57e4beefeda923446b73f88b392e59b07d847
master date: 2018-09-28 17:12:14 +0200

x86/vvmx: Disallow the use of VT-x instructions when nested virt is disabled

c/s ac6a4500b "vvmx: set vmxon_region_pa of vcpu out of VMX operation to an
invalid address" was a real bugfix as described, but has a very subtle bug
which results in all VT-x instructions being usable by a guest.

The toolstack constructs a guest by issuing:

  XEN_DOMCTL_createdomain
  XEN_DOMCTL_max_vcpus

and optionally later, HVMOP_set_param to enable nested virt.

As a result, the call to nvmx_vcpu_initialise() in hvm_vcpu_initialise()
(which is what makes the above patch look correct during review) is actually
dead code.  In practice, nvmx_vcpu_initialise() first gets called when nested
virt is enabled, which is typically never.

As a result, the zeroed memory of struct vcpu causes nvmx_vcpu_in_vmx() to
return true before nested virt is enabled for the guest.
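
The zeroed-memory pitfall can be shown in a few lines. The sketch assumes the "in VMX operation" check compares vmxon_region_pa against an invalid-address sentinel, as the title of c/s ac6a4500b suggests; the class and the all-ones sentinel value are illustrative only.

```python
INVALID_PADDR = (1 << 64) - 1   # assumption: all-ones sentinel

class Nvmx:
    def __init__(self):
        self.vmxon_region_pa = 0   # what zeroed memory looks like

def nvmx_initialise(nvmx):
    nvmx.vmxon_region_pa = INVALID_PADDR

def in_vmx(nvmx):
    """Model of the check: any address other than the sentinel counts."""
    return nvmx.vmxon_region_pa != INVALID_PADDR
```

Until the initialiser has run, address 0 is "not the sentinel", so the check answers yes.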

Fixing the order of initialisation is a work in progress for other reasons,
but not viable for security backports.

A compounding factor is that the vmexit handlers for all instructions, other
than VMXON, pass 0 into vmx_inst_check_privilege()'s vmxop_check parameter,
which skips the CR4.VMXE check.  (This is one of many reasons why nested virt
isn't a supported feature yet.)

However, the overall result is that when nested virt is not enabled by the
toolstack (i.e. the default configuration for all production guests), the VT-x
instructions (other than VMXON) are actually usable, and Xen very quickly
falls over the fact that the nvmx structure is uninitialised.

In order to fail safe in the supported case, reimplement all the VT-x
instruction handling using a single function with a common prologue, covering
all the checks which should cause #UD or #GP faults.  This deliberately
doesn't use any state from the nvmx structure, in case there are other lurking
issues.

This is XSA-278.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Sergey Dyasli <sergey.dyasli@citrix.com>

stubdom/grub.patches: Drop docs changes, for licensing reasons

The patch file 00cvs is an import of a new upstream version of
grub1 from upstream CVS.

Unfortunately, in the period covered by the update, upstream changed
the documentation licence from a simple permissive licence, to the GNU
"Free Documentation Licence" with Front and Back Cover Texts.

The Debian Project is of the view that use of the Front and Back Cover
Texts feature of the GFDL makes the resulting document not Free
Software, because of the mandatory redistribution of these immutable
texts.  (Personally, I agree.)

This is awkward because Debian do not want to ship non-free content.
So the Debian maintainers need to launder the upstream source code, to
remove the troublesome files.  This is an extra step when
incorporating new upstream versions.  It's particularly annoying for
security response, which often involves rebasing onto a new upstream
release.

grub1 is obsolete and the last change to Xen's PV grub1 stubdom code
was in 2016.  Furthermore, the grub1 documentation is not built and
installed by the Xen pv-grub stubdom Makefiles.

Therefore, remove all docs changes from stubdom/grub.patches.  This
means that there are now no longer any GFDL-licenced grub docs in
xen.git.

There is no user impact, and Debian is helped.  This change would
complicate any attempts to update to a new version of upstream grub1,
but it seems unlikely that such a thing will ever happen.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
CC: Doug Goldstein <cardoe@cardoe.com>
CC: Juergen Gross <jgross@suse.com>
CC: pkg-xen-devel@lists.alioth.debian.org
Acked-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Samuel Thibault <samuel.thibault@ens-lyon.org>
(cherry picked from commit c62c53d61477dfeb63a47b0673c389082112babc)

tools/tests: fix an xs-test.c issue

The ret variable can be used uninitialised when iters is 0. Initialise
ret at the beginning to fix this issue.
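
The same pattern in Python, where the mistake surfaces as an exception rather than as an indeterminate value in C. The function names and the `step` callback are invented for the sketch.

```python
def run_buggy(iters, step):
    for _ in range(iters):
        ret = step()
    return ret           # unbound if the loop body never ran

def run_fixed(iters, step):
    ret = 0              # initialised before the loop
    for _ in range(iters):
        ret = step()
    return ret
```

Calling `run_buggy(0, ...)` raises UnboundLocalError, while `run_fixed(0, ...)` returns the initial value.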

Reported-by: Steven Haigh <netwiz@crc.id.au>
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
(cherry picked from commit 3a2b8525b883baa87fe89b3da58f5c09fa599b99)

x86/boot: Allocate one extra module slot for Xen image placement

Commit 9589927 (x86/mb2: avoid Xen image when looking for
module/crashkernel position) fixed relocation issues for
Multiboot2 protocol. Unfortunately, it failed to allocate a
module slot for Xen image placement in the early boot path.
So, let's fix it right now.

Reported-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 4c5f9dbebc0bd2afee1ecd936c74ffe65756950f
master date: 2018-09-27 11:17:47 +0100

xen: sched/Credit2: fix bug when moving CPUs between two Credit2 cpupools

Whether or not a CPU is assigned to a runqueue (and, if yes, to which
one) within a Credit2 scheduler instance must be both a per-cpu and
per-scheduler instance one.

In fact, when we move a CPU between cpupools, we first setup its per-cpu
data in the new pool, and then cleanup its per-cpu data from the old
pool. In Credit2, where there currently is no per-scheduler, per-cpu
data (as the cpu-to-runqueue map is stored on a per-cpu basis only),
this means that the cleanup of the old per-cpu data can mess with the
new per-cpu data, leading to crashes like this:

https://www.mail-archive.com/xen-devel@lists.xenproject.org/msg23306.html
https://www.mail-archive.com/xen-devel@lists.xenproject.org/msg23350.html

Basically, when csched2_deinit_pdata() is called for CPU 13, for fully
removing the CPU from Pool-0, per_cpu(13,runq_map) already contains the
id of the runqueue to which the CPU has been assigned in the scheduler
of Pool-1, which means wrong runqueue manipulations happen in Pool-0's
scheduler. Furthermore, at the end of such call, that same runq_map is
updated with -1, which is what causes the BUG_ON in csched2_schedule(),
on CPU 13, to trigger.

So, instead of reverting a2c4e5ab59d "xen: credit2: make the cpu to
runqueue map per-cpu" (as we don't want to go back to having the huge
array in struct csched2_private) add a per-cpu scheduler specific data
structure, like, for instance, Credit1 has already. That (for now) only
contains one field: the id of the runqueue the CPU is assigned to.
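
The ordering problem can be replayed with two toy scheduler classes. All names are hypothetical; the shared class attribute stands in for per-cpu data that is not per-scheduler, and the move follows the order described above (set up in the new pool, then clean up in the old one).

```python
class SharedMapScheduler:
    """Before: one cpu -> runqueue map shared by every scheduler instance."""
    runq_map = {}                      # class attribute, i.e. per-cpu only

    def init_pdata(self, cpu, rqi):
        SharedMapScheduler.runq_map[cpu] = rqi

    def deinit_pdata(self, cpu):
        SharedMapScheduler.runq_map[cpu] = -1


class PerSchedulerScheduler:
    """After: each scheduler instance keeps its own per-cpu record."""
    def __init__(self):
        self.pcpu = {}

    def init_pdata(self, cpu, rqi):
        self.pcpu[cpu] = rqi

    def deinit_pdata(self, cpu):
        self.pcpu.pop(cpu, None)


def move_cpu(cpu, old, new, new_rqi):
    new.init_pdata(cpu, new_rqi)       # set up in the new pool first...
    old.deinit_pdata(cpu)              # ...then clean up in the old pool
```

With the shared map, the old pool's cleanup overwrites the entry the new pool has just written; with per-scheduler records, the two pools no longer interfere.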

Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: 6e395f477fb854f11de83a951a070d3aacb6dc59
master date: 2018-09-18 16:50:44 +0100

x86/hvm/emulate: make sure rep I/O emulation does not cross GFN boundaries

When emulating a rep I/O operation it is possible that the ioreq will
describe a single operation that spans multiple GFNs. This is fine as long
as all those GFNs fall within an MMIO region covered by a single device
model, but unfortunately the higher levels of the emulation code do not
guarantee that. This is something that should almost certainly be fixed,
but in the meantime this patch makes sure that MMIO is truncated at GFN
boundaries and hence the appropriate device model is re-evaluated for each
target GFN.

NOTE: This patch does not deal with the case of a single MMIO operation
spanning a GFN boundary. That is more complex to deal with and is
deferred to a subsequent patch.
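
The truncation amounts to a small piece of arithmetic, sketched here under the assumptions of 4 KiB frames and a forward (ascending address) rep. The function name is invented.

```python
PAGE_SIZE = 4096   # assumption: 4 KiB guest frames

def clamp_reps(addr, size, reps):
    """Largest repeat count, up to reps, that stays within addr's frame."""
    room = PAGE_SIZE - (addr % PAGE_SIZE)
    return min(reps, room // size)
```

A result of 0 corresponds to the case in the NOTE above, where a single operation itself straddles the boundary and this clamping alone cannot help.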

Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Convert calculations to be 32-bit only.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
master commit: 7626edeaca972e3e823535dcc44338f6b2f0b21f
master date: 2018-08-16 09:27:30 +0200

x86/efi: split compiler vs linker support

So that an ELF binary with support for EFI services will be built when
the compiler supports the MS ABI, regardless of the linker support for
PE.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com>
Tested-by: Daniel Kiper <daniel.kiper@oracle.com>
master commit: 93249f7fc17c1f3a2aa8bf9ea055aa326e93a4ae
master date: 2018-07-31 10:25:06 +0200

x86/efi: move the logic to detect PE build support

So that it can be used by other components apart from the efi specific
code. Moving the detection code also avoids having to create a dummy
efi/disabled file.

This is required so that the conditional used to define the efi symbol
in the linker script can be removed and instead the definition of the
efi symbol can be guarded using the preprocessor.

The motivation behind this change is to be able to build Xen using lld
(the LLVM linker), which at least in version 6.0.0 doesn't work
properly with a DEFINED being used in a conditional expression:

ld -melf_x86_64_fbsd -T xen.lds -N prelink.o --build-id=sha1 \
/root/src/xen/xen/common/symbols-dummy.o -o /root/src/xen/xen/.xen-syms.0
ld: error: xen.lds:233: symbol not found: efi

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Daniel Kiper <daniel.kiper@oracle.com>
master commit: 18cd4997d26b9df95dda87503e41c823279a07a0
master date: 2018-07-31 10:24:22 +0200

x86/shutdown: use ACPI reboot method for Dell PowerEdge R540

When EFI booting the Dell PowerEdge R540 it consistently wanders into
the weeds and gets an invalid opcode in the EFI ResetSystem call. This
is the same bug which affects the PowerEdge R740 so fix it in the same
way: quirk this hardware to use the ACPI reboot method instead.

BIOS Information
    Vendor: Dell Inc.
    Version: 1.3.7
    Release Date: 02/09/2018
System Information
    Manufacturer: Dell Inc.
    Product Name: PowerEdge R540

Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 328ca55b7bd47e1324b75cce2a6c461308ecf93d
master date: 2018-06-28 09:29:13 +0200

Update changelog for new upstream 4.11.1~pre.20180911.5acdd26fdc+dfsg

[git-debrebase changelog: new upstream 4.11.1~pre.20180911.5acdd26fdc+dfsg]

Update to upstream 4.11.1~pre.20180911.5acdd26fdc+dfsg

[git-debrebase anchor: new upstream 4.11.1~pre.20180911.5acdd26fdc+dfsg, merge]

changelog: finalise -1 for upload to unstable

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

debian/rules: Copy config.{sub,guess} by hand

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

debian/rules: rm -v the xenstore utils from xen-utils-common

This makes the log slightly more debuggable.

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

debian/control: Use https for wiki.xen.org

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

xenstore-utils: Hardlink the various xenstore-* programs together

This is an argv[0]-using binary of which we could have only one copy.
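
For readers unfamiliar with the pattern, an argv[0]-using program picks its
behaviour from the name it was invoked under, which is why hardlinks to one
copy suffice. A rough sketch, with a hypothetical command table (the real
client is written in C and has its own set of names and handlers):

```python
import os

# Hypothetical handlers; they only echo what would be done.
COMMANDS = {
    "xenstore-read": lambda args: ("read", args),
    "xenstore-write": lambda args: ("write", args),
    "xenstore-ls": lambda args: ("ls", args),
}


def dispatch(argv):
    """Choose the sub-command from the basename of argv[0]."""
    name = os.path.basename(argv[0])
    handler = COMMANDS.get(name)
    if handler is None:
        raise ValueError(f"{name}: unknown invocation name")
    return handler(argv[1:])
```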

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

debian/: Completely rework the packaging

Abolish the old template system. Instead, the Xen version is left to
be updated by hand in debian/control and debian/changelog. Elsewhere
things are templated at package build time.

Everything that is not just `dh $@' now has a comment explaining it.

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

x86: assorted array_index_nospec() insertions

Don't chance having Spectre v1 (including BCBS) gadgets. In some of the
cases the insertions are more of a precautionary nature rather than there
provably being a gadget, but I think we should err on the safe (secure)
side here.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Acked-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 3f2002614af51dfd507168a1696658bac91155ce
master date: 2018-09-03 17:50:10 +0200

VT-d/dmar: iommu mem leak fix

Release memory allocated for drhd iommu in error path.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
master commit: fd07b6648c4c8891dca5bd0f7ef174b6831f80b2
master date: 2018-08-27 11:37:24 +0200

rangeset: make inquiry functions tolerate NULL inputs

Rather than special casing the ->iomem_caps check in x86's
get_page_from_l1e() for the dom_xen case, let's be more tolerant in
general, along the lines of rangeset_is_empty(): A never allocated
rangeset can't possibly contain or overlap any range.
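
A small model of the intended behaviour, assuming a rangeset is represented
as a list of inclusive (start, end) pairs and None stands in for a never
allocated rangeset. The function names mirror the Xen inquiry functions, but
the bodies are only a sketch:

```python
def rangeset_contains_range(r, start, end):
    """True if [start, end] lies entirely inside one range of r."""
    if r is None:  # never allocated: cannot contain anything
        return False
    return any(s <= start and end <= e for s, e in r)


def rangeset_overlaps_range(r, start, end):
    """True if [start, end] intersects any range of r."""
    if r is None:  # never allocated: cannot overlap anything
        return False
    return any(s <= end and start <= e for s, e in r)
```

With this, callers no longer need a special case for domains whose rangeset
was never set up.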

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
master commit: ad0a9f273d6d6f0545cd9b708b2d4be581a6cadd
master date: 2018-08-17 13:54:40 +0200

x86/setup: Avoid OoB E820 lookup when calculating the L1TF safe address

A number of corner cases (most obviously, no-real-mode and no Multiboot memory
map) can end up with e820_raw.nr_map being 0, at which point the L1TF
calculation will underflow.

Spotted by Coverity.
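
The underflow is easy to reproduce in a model. The sketch below assumes a
32-bit unsigned index and an E820 map given as (base, size) pairs; it does
not reproduce the actual loop in Xen, and the value returned for an empty map
is a placeholder.

```python
U32_MASK = 0xFFFFFFFF  # assumption: the index is a 32-bit unsigned int


def last_entry_index_unguarded(nr_map):
    """What 'nr_map - 1' yields in unsigned arithmetic."""
    return (nr_map - 1) & U32_MASK


def highest_end_address(e820_map):
    """End address of the last map entry, guarding the empty case."""
    if not e820_map:
        return 0  # placeholder result; nothing to look up
    base, size = e820_map[-1]
    return base + size
```

With nr_map equal to 0 the unguarded index becomes 0xFFFFFFFF, which is far
outside the array.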

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
master commit: 3e4ec07e14bce81f6ae22c31ff1302d1f297a226
master date: 2018-08-16 18:10:07 +0100

x86/hvm/ioreq: MMIO range checking completely ignores direction flag

hvm_select_ioreq_server() is used to route an ioreq to the appropriate
ioreq server. For MMIO this is done by comparing the range of the ioreq
to the ranges registered by the device models of each ioreq server.
Unfortunately the calculation of the range of the ioreq completely ignores
the direction flag and thus may calculate the wrong range for comparison.
Thus the ioreq may either be routed to the wrong server or erroneously
terminated by null_ops.

NOTE: The patch also fixes whitespace in the switch statement to make it
style compliant.
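
To illustrate why the direction flag matters for the range: with DF set, a
rep operation walks downwards from the starting address, so the range to
compare extends below it. A minimal sketch, assuming the ioreq carries addr,
size, count and df fields; the helper names are illustrative:

```python
def mmio_first_byte(addr, size, count, df):
    """Lowest byte touched by the access."""
    return addr - (count - 1) * size if df else addr


def mmio_last_byte(addr, size, count, df):
    """Highest byte touched by the access."""
    return addr + size - 1 if df else addr + count * size - 1
```

For a 4-byte access repeated 4 times at 0x1000, the forward range is
0x1000-0x100f while the backward range is 0xff4-0x1003, which may belong to a
different ioreq server.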

Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 60a56dc0064a00830663ffe48215dcd080cb9504
master date: 2018-08-15 14:14:06 +0200

x86/vlapic: Bugfixes and improvements to vlapic_{read,write}()

Firstly, there is no 'offset' boundary check on the non-32-bit write path
before the call to vlapic_read_aligned(), which allows an attacker to read
beyond the end of vlapic->regs->data[], which is only 1024 bytes long.

However, as the backing memory is a domheap page, and misaligned accesses get
chunked down to single bytes across page boundaries, I can't spot any
XSA-worthy problems which occur from the overrun.

On real hardware, bad accesses don't instantly crash the machine.  Their
behaviour is undefined, but the domain_crash() prohibits sensible testing.
Behave more like other x86 MMIO and terminate bad accesses with appropriate
defaults.

While making these changes, clean up and simplify the smaller-access
handling.  In particular, avoid pointer based mechanisms for 1/2-byte reads so
as to avoid forcing the value to be spilled to the stack.

  add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-175 (-175)
  function                                     old     new   delta
  vlapic_read                                  211     142     -69
  vlapic_write                                 304     198    -106

Finally, there are a plethora of read/write functions in the vlapic namespace,
so rename these to vlapic_mmio_{read,write}() to make their purpose more
clear.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: b6f43c14cef3af8477a9eca4efab87dd150a2885
master date: 2018-08-10 13:27:24 +0100

x86/vmx: Avoid hitting BUG_ON() after EPTP-related domain_crash()

If the EPTP pointer can't be located in the altp2m list, the domain
is (legitimately) crashed.

Under those circumstances, execution will continue and is guaranteed to hit
the BUG_ON(idx >= MAX_ALTP2M) (unfortunately, just out of context).

Return from vmx_vmexit_handler() after the domain_crash(), which also has the
side effect of reentering the scheduler more promptly.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
master commit: 48dbb2dbe9d9f92a2890a15bb48a0598c065b9f8
master date: 2018-08-02 10:10:43 +0100

Remove stubdom/grub.patches/00cvs

Lintian complains about:

E: xen source: license-problem-gfdl-invariants
stubdom/grub.patches/00cvs invariant part is: with no invariant
sections, with the front-cover texts being a gnu manual, and with the
back-cover texts as in (a) below

...and because of that the debian archive rejects our source package.

We are not using this anywhere in our packaging, so just remove the
whole file for now.

git-debrebase import: declare upstream

First breakwater merge.

[git-debrebase anchor: declare upstream]

git-debrebase convert-from-gbp: drop patches from tree

Delete debian/patches, as part of converting to git-debrebase format.

[git-debrebase convert-from-gbp: drop patches from tree]

Commit files generated by debian/rules debian/control

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

debian/.gitignore: Unignore files generated by rules control

We are going to commit to git all the files generated by
  debian/rules debian/control

This makes the git tree have a control file and therefore it is
directly buildable.  (It also avoids gbp pq producing an error message
when invoked by git-debrebase convert-from-gbp, which we are going to
use to convert the branch to git-debrebase format.)

The templating here is overkill.  Eventually, if we are lucky, we will
be able to reduce this to just debian/control.

In particular:

  * Rather than a pile of autogenerated rules in rules.gen,
    we could have suitable pattern rules, or make macros.

  * The files like xen-hypervisor-4.11-amd64.postinst could
    be generated by the rules in a hook.  Then they will
    want to be ignored again.  But they wouldn't hang off
    debian/rules debian/control.

  * The only thing that actually needs some kind of automated
    assistance, and which needs to be in the source package, is the
    binary package names, and dependencies, in debian/control.
    We could provide a script to update this in place, maybe, and do
    away with debian/templates/control.*.in entirely.

But for now we want control to be in git so it's easy to find, and so
that our source packages and git trees are identical as dgit requires.

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

Merge upstream into master

Merging commit '733450b39b', which was the upstream for
4.11.1~pre+1.733450b39b-1~exp1, into HEAD.

Signed-off-by: Ian Jackson <ijackson@chiark.greenend.org.uk>

Prepare to release xen (4.11.1~pre+1.733450b39b-1~exp1).

debian/changelog: mention the vwprintw compile fix

Update to 4.11.1-pre commit 733450b39b

libxl: start pvqemu when 9pfs is requested

PV 9pfs requires the PV backend in QEMU. Make sure that libxl knows it.

Signed-off-by: Stefano Stabellini <stefanos@xilinx.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit 47bc2c29b5a875e5f4abd36f2cb9faa594299f6c)

x86: write to correct variable in parse_pv_l1tf()

Apparently a copy-and-paste mistake.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 57c554f8a6e06894f601d977d18b3017d2a60f40
master date: 2018-08-15 14:15:30 +0200

xl.conf: Add global affinity masks

XSA-273 involves one hyperthread being able to use Spectre-like
techniques to "spy" on another thread.  The details are somewhat
complicated, but the upshot is that after all Xen-based mitigations
have been applied:

* PV guests cannot spy on sibling threads
* HVM guests can spy on sibling threads

(NB that for purposes of this vulnerability, PVH and HVM guests are
identical.  Whenever this comment refers to 'HVM', this includes PVH.)

There are many possible mitigations to this, including disabling
hyperthreading entirely.  But another solution would be:

* Specify some cores as PV-only, others as PV or HVM
* Allow HVM guests to only run on thread 0 of the "HVM-or-PV" cores
* Allow PV guests to run on the above cores, as well as any thread of the PV-only cores.

For example, suppose you had 16 threads across 8 cores (0-7).  You
could specify 0-3 as PV-only, and 4-7 as HVM-or-PV.  Then you'd set
the affinity of the HVM guests as follows (binary representation):

0000000010101010

And the affinity of the PV guests as follows:

1111111110101010

In order to make this easy, this patch introduces three "global affinity
masks", placed in xl.conf:

    vm.cpumask
    vm.hvm.cpumask
    vm.pv.cpumask

These are parsed just like the 'cpus' and 'cpus_soft' options in the
per-domain xl configuration files.  The resulting mask is AND-ed with
whatever mask results at the end of the xl configuration file.
`vm.cpumask` would be applied to all guest types, `vm.hvm.cpumask`
would be applied to HVM and PVH guest types, and `vm.pv.cpumask`
would be applied to PV guest types.

The idea would be that to implement the above mask across all your
VMs, you'd simply add the following two lines to the configuration
file:

    vm.hvm.cpumask=8,10,12,14
    vm.pv.cpumask=0-8,10,12,14

See xl.conf manpage for details.
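
The AND-ing of masks can be checked with a small model. This sketch only
understands plain CPU numbers and ranges (the real 'cpus' syntax accepts
more), and all function names are made up for illustration:

```python
def parse_cpu_list(spec):
    """Parse a list such as '0-8,10,12,14' into a set of CPU numbers."""
    cpus = set()
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def effective_affinity(domain_cpus, global_mask, type_mask):
    """AND the per-domain mask with vm.cpumask and the per-type mask."""
    return domain_cpus & global_mask & type_mask


def as_bits(cpus, ncpus=16):
    """Render a mask with CPU 0 leftmost, as in the strings above."""
    return "".join("1" if c in cpus else "0" for c in range(ncpus))
```

For instance, as_bits(parse_cpu_list("8,10,12,14")) gives 0000000010101010,
and an HVM guest configured with cpus="0-15" ends up restricted to CPUs
8, 10, 12 and 14.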

This is part of XSA-273 / CVE-2018-3646.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit aa67b97ed34279c43a43d9ca46727b5746caa92e)

x86: Make "spec-ctrl=no" a global disable of all mitigations

In order to have a simple and easy to remember means to suppress all the
more or less recent workarounds for hardware vulnerabilities, force
settings not controlled by "spec-ctrl=" also to their original defaults,
unless they've been forced to specific values already by earlier command
line options.

This is part of XSA-273.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit d8800a82c3840b06b17672eddee4878bbfdacc6d)

x86/spec-ctrl: Introduce an option to control L1D_FLUSH for HVM HAP guests

This mitigation requires up-to-date microcode, is enabled by default on
affected hardware if available, and is used for HVM guests.

The default for SMT/Hyperthreading is far more complicated to reason about,
not least because we don't know if the user is going to want to run any HVM
guests to begin with. If an explicit default isn't given, nag the user to
perform a risk assessment and choose an explicit default, and leave other
configuration to the toolstack.

This is part of XSA-273 / CVE-2018-3620.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 3bd36952dab60290f33d6791070b57920e10754b)

x86/msr: Virtualise MSR_FLUSH_CMD for guests

Guests (outside of the nested virt case, which isn't supported yet) don't need
L1D_FLUSH for their L1TF mitigations, but offering/emulating MSR_FLUSH_CMD is
easy and doesn't pose an issue for Xen.

The MSR is offered to HVM guests only. PV guests attempting to use it would
trap for emulation, and the L1D cache would fill long before the return to
guest context. As such, PV guests can't make any use of the L1D_FLUSH
functionality.

This is part of XSA-273 / CVE-2018-3646.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit fd9823faf9df057a69a9a53c2e100691d3f4267c)

x86/spec-ctrl: CPUID/MSR definitions for L1D_FLUSH

This is part of XSA-273 / CVE-2018-3646.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 3563fc2b2731a63fd7e8372ab0f5cef205bf8477)

x86/pv: Force a guest into shadow mode when it writes an L1TF-vulnerable PTE

See the comment in shadow.h for an explanation of L1TF and the safety
consideration of the PTEs.

In the case that CONFIG_SHADOW_PAGING isn't compiled in, crash the domain
instead. This allows well-behaved PV guests to function, while preventing
L1TF from being exploited. (Note: PV guest kernels which haven't been updated
with L1TF mitigations will likely be crashed as soon as they try paging a
piece of userspace out to disk.)

This is part of XSA-273 / CVE-2018-3620.

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Tim Deegan <tim@xen.org>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 06e8b622d3f3c0fa5075e91b041c6f45549ad70a)

x86/mm: Plumbing to allow any PTE update to fail with -ERESTART

Switching to shadow mode is performed in tasklet context.  To facilitate this,
we schedule the tasklet, then create a hypercall continuation to allow the
switch to take place.

As a consequence, the x86 mm code needs to cope with an L1e operation being
continuable.  do_mmu{,ext}_op() may no longer assert that a continuation
doesn't happen on the final iteration.

To handle the arguments correctly on continuation, compat_update_va_mapping*()
may no longer call into their non-compat counterparts.  Move the compat
functions into mm.c rather than exporting __do_update_va_mapping() and
{get,put}_pg_owner(), and fix an unsigned long/int inconsistency with
compat_update_va_mapping_otherdomain().

This is part of XSA-273 / CVE-2018-3620.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit c612481d1c9232c6abf91b03ec655e92f808805f)

x86/shadow: Infrastructure to force a PV guest into shadow mode

To mitigate L1TF, we cannot alter an architecturally-legitimate PTE a PV guest
chooses to write, but we can force the PV domain into shadow mode so Xen
controls the PTEs which are reachable by the CPU pagewalk.

Introduce new shadow mode, PG_SH_forced, and a tasklet to perform the
transition. Later patches will introduce the logic to enable this mode at the
appropriate time.

To simplify vcpu cleanup, make tasklet_kill() idempotent with respect to
tasklet_init(), which involves adding a helper to check for an uninitialised
list head.

This is part of XSA-273 / CVE-2018-3620.

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Tim Deegan <tim@xen.org>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit b76ec3946bf6caca2c3950b857c008bc8db6723f)

x86/spec-ctrl: Introduce an option to control L1TF mitigation for PV guests

Shadowing a PV guest is only available when shadow paging is compiled in.
When shadow paging isn't available, guests can be crashed instead as
mitigation from Xen's point of view.

Ideally, dom0 would also be potentially-shadowed-by-default, but dom0 has
never been shadowed before, and there are some stability issues under
investigation.

This is part of XSA-273 / CVE-2018-3620.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 66a4e986819a86ba66ca2fe9d925e62a4fd30114)

x86/spec-ctrl: Calculate safe PTE addresses for L1TF mitigations

Safe PTE addresses for L1TF mitigations are ones which are within the L1D
address width (may be wider than reported in CPUID), and above the highest
cacheable RAM/NVDIMM/BAR/etc.

All logic here is best-effort heuristics, which should in practice be fine for
most hardware. Future work will see about disentangling the SRAT handling
further, as well as having L0 pass this information down to lower levels when
virtualised.

This is part of XSA-273 / CVE-2018-3620.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit b03a57c9383b32181e60add6b6de12b473652aa4)

tools/oxenstored: Make evaluation order explicit

In Store.path_write(), Path.apply_modify() updates the node_created
reference and both the value of apply_modify() and node_created are
returned by path_write().

At least with OCaml 4.06.1 this leads to the value of node_created being
returned *before* it is updated by apply_modify(). This in turn leads
to the quota for a domain not being updated in Store.write(). Hence, a
guest can create an unlimited number of entries in xenstore.

The fix is to make evaluation order explicit.
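
The effect can be imitated in a language with a fixed evaluation order by
writing out the two orders by hand. In this sketch a one-element list stands
in for the OCaml reference, and the function names are illustrative rather
than taken from oxenstored:

```python
def make_store():
    node_created = [False]  # stands in for the OCaml 'ref'

    def apply_modify():
        node_created[0] = True  # side effect: a node was created
        return "new-root"

    return node_created, apply_modify


def path_write_stale():
    """Imitates the problematic order: the flag is read first."""
    node_created, apply_modify = make_store()
    created = node_created[0]
    root = apply_modify()
    return root, created


def path_write_explicit():
    """Explicit order: bind the result first, then read the flag."""
    node_created, apply_modify = make_store()
    root = apply_modify()
    return root, node_created[0]
```

Only the explicit version reports that a node was created, which is what the
quota accounting depends on.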

This is XSA-272.

Signed-off-by: Christian Lindig <christian.lindig@citrix.com>
Reviewed-by: Rob Hoes <rob.hoes@citrix.com>
(cherry picked from commit 73392c7fd14c59f8c96e0b2eeeb329e4ae9086b6)

x86/vtx: Fix the checking for unknown/invalid MSR_DEBUGCTL bits

The VPMU_MODE_OFF early-exit in vpmu_do_wrmsr() introduced by c/s
11fe998e56 bypasses all reserved bit checking in the general case. As a
result, a guest can enable BTS when it shouldn't be permitted to, and
lock up the entire host.

With vPMU active (not a security supported configuration, but useful for
debugging), the reserved bit checking is broken, caused by the original
BTS changeset 1a8aa75ed.

From a correctness standpoint, it is not possible to have two different
pieces of code responsible for different parts of value checking, if
there isn't an accumulation of bits which have been checked. A
practical upshot of this is that a guest can set any value it
wishes (usually resulting in a vmentry failure for bad guest state).

Therefore, fix this by implementing all the reserved bit checking in the
main MSR_DEBUGCTL block, and removing all handling of DEBUGCTL from the
vPMU MSR logic.
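
The idea of doing all the checking in one place amounts to accumulating a
mask of permitted bits and rejecting everything else. A sketch under loud
assumptions: the feature names and bit positions below are illustrative
placeholders, not the authoritative MSR_DEBUGCTL layout, and the helper
names are hypothetical.

```python
# Illustrative placeholders only; consult the vendor manual for real bits.
FEATURE_BITS = {
    "lbr": 1 << 0,
    "btf": 1 << 1,
    "rtm": 1 << 15,
}


def permitted_bits(enabled_features):
    """Accumulate the bits the guest is allowed to set."""
    mask = 0
    for name in enabled_features:
        mask |= FEATURE_BITS[name]
    return mask


def debugctl_write_allowed(value, enabled_features):
    """Reject the write if any bit outside the permitted mask is set."""
    reserved = ~permitted_bits(enabled_features)
    return (value & reserved) == 0
```

Because a single function owns the whole mask, no bit can slip through
between two partial checks.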

This is XSA-269.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 2a8a8e99feb950504559196521bc9fd63ed3a962)

ARM: disable grant table v2

It was never expected to work; the implementation is incomplete.

As a side effect, it also prevents guests from triggering a
"BUG_ON(page_get_owner(pg) != d)" in gnttab_unpopulate_status_frames().

This is XSA-268.

Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 9a5c16a3e75778c8a094ca87784d93b74676f46c)

VMX: fix vmx_{find,del}_msr() build

Older gcc at -O2 (and perhaps higher) does not recognize that apparently
uninitialized variables aren't really uninitialized. Pull out the
assignments used by two of the three case blocks and make them
initializers of the variables, as I think I had suggested during review.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
(cherry picked from commit 97cb0516a322ecdf0032fa9d8aa1525c03d7772f)

x86/vmx: Support load-only guest MSR list entries

Currently, the VMX_MSR_GUEST type maintains completely symmetric guest load
and save lists, by pointing VM_EXIT_MSR_STORE_ADDR and VM_ENTRY_MSR_LOAD_ADDR
at the same page, and setting VM_EXIT_MSR_STORE_COUNT and
VM_ENTRY_MSR_LOAD_COUNT to the same value.

However, for MSRs which we won't let the guest have direct access to, having
hardware save the current value on VMExit is unnecessary overhead.

To avoid this overhead, we must make the load and save lists asymmetric. By
making the entry load count greater than the exit store count, we can maintain
two adjacent lists of MSRs, the first of which is saved and restored, and the
second of which is only restored on VMEntry.

For simplicity:
* Both adjacent lists are still sorted by MSR index.
* It is undefined behaviour to insert the same MSR into both lists.
* The total size of both lists is still limited to 256 entries (one 4k page).

Split the current msr_count field into msr_{load,save}_count, and introduce a
new VMX_MSR_GUEST_LOADONLY type, and update vmx_{add,find}_msr() to calculate
which sublist to search, based on type. VMX_MSR_HOST has no logical sublist,
whereas VMX_MSR_GUEST has a sublist between 0 and the save count, while
VMX_MSR_GUEST_LOADONLY has a sublist between the save count and the load
count.

One subtle point is that inserting an MSR into the load-save list involves
moving the entire load-only list, and updating both counts.
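
The two adjacent sublists can be modelled with one array and two counts. This
is a simplified sketch, not the Xen implementation: the class and method
names are made up, the MSR indices used with it are arbitrary, and a Python
list insert stands in for the memmove of the load-only list.

```python
MAX_ENTRIES = 256  # one 4k page of entries, per the limit stated above


class GuestMsrArea:
    def __init__(self):
        self.entries = []    # (msr, value): load/save list, then load-only
        self.save_count = 0  # entries saved on VMExit
        self.load_count = 0  # entries loaded on VMEntry (both sublists)

    def add(self, msr, value, load_only=False):
        # Pick the sublist to search, based on type.
        if load_only:
            start, end = self.save_count, self.load_count
        else:
            start, end = 0, self.save_count
        pos = start
        while pos < end and self.entries[pos][0] < msr:
            pos += 1
        if pos < end and self.entries[pos][0] == msr:
            self.entries[pos] = (msr, value)  # already present: update
            return
        if self.load_count >= MAX_ENTRIES:
            raise OverflowError("MSR area full")
        # Inserting shifts every later entry, including the whole
        # load-only list when the new entry goes into the load/save list.
        self.entries.insert(pos, (msr, value))
        self.load_count += 1
        if not load_only:
            self.save_count += 1
```

Adding a load/save entry after load-only entries exist bumps both counts,
while adding a load-only entry bumps only the load count.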

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
(cherry picked from commit 1ac46b55632626aeb935726e1b0a71605ef6763a)

x86/vmx: Pass an MSR value into vmx_msr_add()

The main purpose of this change is to allow us to set a specific MSR value,
without needing to know whether there is already a load/save list slot for it.

Previously, callers wanting this property needed to call both vmx_add_*_msr()
and vmx_write_*_msr() to cover both cases, and there are no callers which want
the old behaviour of being a no-op if an entry already existed for the MSR.

As a result of this API improvement, the default value for guest MSRs need not
be 0, and the default for host MSRs need not be passed via hardware register.
In practice, this cleans up the VPMU allocation logic, and avoids an MSR read
as part of vcpu construction.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit ee7689b94ac7094b975ab4a023cfeae209da0a36)