x86/xpti: really hide almost all of Xen image

Commit 422588e885 ("x86/xpti: Hide almost all of .text and all
.data/.rodata/.bss mappings") carefully limited the Xen image cloning to
just entry code, but then overwrote the just allocated and populated L3
entry with the normal one again covering both Xen image and stubs.

Drop the respective code in favor of an explicit clone_mapping()
invocation. This in turn now requires setup_cpu_root_pgt() to run after
stub setup in all cases. Additionally, with (almost) no unintended
mappings left, the BSP's IDT now also needs to be page aligned.

The moving ahead of cleanup_cpu_root_pgt() is not strictly necessary
for functionality, but things are more logical this way, and we retain
cleanup being done in the inverse order of setup.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

x86: move and rename XSTATE_*

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

x86emul: support SWAPGS

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

tools: ARM: vGICv3: Avoid inserting optional DT properties

When creating a GICv3 devicetree node, we currently insert the
redistributor-stride and #redistributor-regions properties, with fixed
values which are actually the architected ones. Since those properties are
optional, and in the case of the stride only needed to cover for broken
platforms, we don't need to describe them if they don't differ from the
default values. This will always be the case for our constructed
DomU memory map.
So we drop those properties altogether and provide a clean and architected
GICv3 DT node for DomUs.

Signed-off-by: Andre Przywara <andre.przywara@linaro.org>
Reviewed-by: Julien Grall <julien.grall@arm.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>

Please Welcome Julien, our new Committer

In recognition of his expertise and commitment to Xen Project, please
join me in welcoming Julien among the Committers and REST Maintainers.

Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Julien Grall <julien.grall@arm.com>

x86/pv: Drop int80_bounce from struct pv_vcpu

The int80_bounce field of struct pv_vcpu is a bit of an odd special case,
because it is a simple derivation of trap_ctxt[0x80], which is also stored.

It is also the only use of {compat_,}create_bounce_frame() which isn't
referencing the plain trap_bounce field of struct pv_vcpu. (And altering this
property is the purpose of this patch.)

Remove the int80_bounce field entirely, along with init_int80_direct_trap(),
which in turn requires that the int80_direct_trap() path gain logic previously
contained in init_int80_direct_trap().

This does admittedly make the int80 fastpath slightly longer, but these few
instructions are in the noise compared to the architectural context switch
overhead, and it now matches the syscall/sysenter paths (which have far less
architectural overhead already).

No behavioural change from the guest's point of view.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/entry: Correct comparisons against boolean variables

The correct way to check a boolean is `cmpb $0` or `testb $0xff`, whereas a
lot of our entry code uses `testb $1`. This will work in principle for values
which are really C _Bool types, but won't work for other integer types which
are intended to have boolean properties.

cmp is the more logical way of thinking about the operation, so adjust all
outstanding uses of `testb $1` against boolean values. Changing test to cmp
changes the logical mnemonic of the following condition from 'zero' to
'equal', but the actual encoding remains the same.

No functional change, as all uses are real C _Bool types, and confirmed by
diffing the disassembly.
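
The difference between the two checks can be modelled in C. This is only a sketch, not the entry code itself; the function names are made up for the illustration. flag_set_test1() models `testb $1` (only bit 0 is examined) and flag_set_cmp0() models `cmpb $0` (any non-zero byte counts as true).

```c
#include <stdbool.h>
#include <stdint.h>

/* Models `testb $1, flag`: only bit 0 is examined. */
bool flag_set_test1(uint8_t flag)
{
    return (flag & 1) != 0;
}

/* Models `cmpb $0, flag`: any non-zero value counts as true. */
bool flag_set_cmp0(uint8_t flag)
{
    return flag != 0;
}

/* True when both checks give the same answer for this byte. */
bool checks_agree(uint8_t flag)
{
    return flag_set_test1(flag) == flag_set_cmp0(flag);
}
```

For a real _Bool the stored byte is 0 or 1 and the checks agree; for an integer holding e.g. 2 they do not, which is the case the patch guards against.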

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>

x86/boot: Annotate the multiboot headers with size and type information

This causes objdump not to try and disassemble the data.

While altering this area, switch to using .balign, and fill with 0xc2 to help
highlight the embedded padding (rather than having it filled with 0f 1f 40 00
which is a long nop).  Also, shorten the labels by stripping off the _start
suffix.

The end result is now:
  ffff82d080200000 <_start>:
  ffff82d080200000:       e9 af c1 1c 00          jmpq   ffff82d0803cc1b4 <__start>
  ffff82d080200005:       0f 1f 00                nopl   (%rax)

  ffff82d080200008 <multiboot1_header>:
  ffff82d080200008:       02 b0 ad 1b 03 00 00 00 fb 4f 52 e4 c2 c2 c2 c2     .........OR.....

  ffff82d080200018 <multiboot2_header>:
  ffff82d080200018:       d6 50 52 e8 00 00 00 00 88 00 00 00 a2 ae ad 17     .PR.............
  ffff82d080200028:       01 00 00 00 10 00 00 00 04 00 00 00 06 00 00 00     ................
  ffff82d080200038:       06 00 00 00 08 00 00 00 0a 00 01 00 18 00 00 00     ................
  ffff82d080200048:       00 00 20 00 ff ff ff ff 00 00 20 00 02 00 00 00     .. ....... .....
  ffff82d080200058:       04 00 01 00 0c 00 00 00 02 00 00 00 c2 c2 c2 c2     ................
  ffff82d080200068:       05 00 01 00 14 00 00 00 00 00 00 00 00 00 00 00     ................
  ffff82d080200078:       00 00 00 00 c2 c2 c2 c2 07 00 01 00 08 00 00 00     ................
  ffff82d080200088:       09 00 01 00 0c 00 00 00 5e c0 3c 00 c2 c2 c2 c2     ........^.<.....
  ffff82d080200098:       00 00 00 00 08 00 00 00                             ........

  ffff82d0802000a0 <__high_start>:
  ffff82d0802000a0:       0f 01 15 5f 8f 25 00    lgdt   0x258f5f(%rip)        # ffff82d080459006 <gdt_descr>

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86emul: support XOP insns

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

x86emul: support AVX2 gather insns

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

x86emul: support most remaining AVX2 insns

I.e. those not being equivalents of SSEn ones, but with the exception
of the various gather operations.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

x86emul: extend vbroadcasts{s,d} to AVX2

These gain register forms now.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

xen/arm: domain_builder: irq sanity check logic fix

Since commit "xen/arm: domain_build: Rework the way to allocate the
event channel interrupt", it is not possible for an irq to be both below 16
and greater than or equal to 32.

Also fix the reference to the Linux documentation while we're at it.
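
A minimal sketch of the logic problem, assuming the check is meant to reject anything outside the PPI range (GIC interrupt IDs 16..31). The function names are hypothetical and not taken from the Xen source.

```c
#include <stdbool.h>

/* GIC interrupt IDs 16..31 are Private Peripheral Interrupts (PPIs). */
bool irq_is_ppi(unsigned int irq)
{
    return irq >= 16 && irq < 32;
}

/* Never true: no value is both below 16 and at least 32. */
bool broken_out_of_range(unsigned int irq)
{
    return irq < 16 && irq >= 32;
}

/* Presumed intent: true for anything outside 16..31. */
bool fixed_out_of_range(unsigned int irq)
{
    return irq < 16 || irq >= 32;
}
```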

Signed-off-by: Stewart Hildebrand <stewart.hildebrand@dornerworks.com>
Signed-off-by: Julien Grall <julien.grall@arm.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
[Slightly rework the commit message]

xen/arm: domain_build: Rework the way to allocate the event channel interrupt

At the moment, a placeholder is created in the device-tree for the
event channel information. Later in the domain construction, the
interrupt for the event channel upcall is allocated and the device-tree
fixed up.

Looking at the code, the current split is not necessary because all the
PPIs used by the hardware domain will be registered by the time we create
the node in the device-tree.

From now on, mandate that all interrupts are registered before
acpi_prepare() and dtb_prepare(). This allows us to rework the event
channel code and remove one placeholder.

Note, this will also help to fix the BUG(...) condition in set_interrupt_ppi
which is completely wrong. See in a follow-up patch.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>

xen/arm: domain_build: Prepare DTB/ACPI tables after specific mappings

A follow-up patch will require all interrupts routed to the
hardware to be registered before calling prepare_dtb/prepare_acpi.

At the moment, it is not necessary to call platform specific mappings
(gic and platform) afterwards, so it is fine to move them.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>

tools/xenstore: try to get minimum thread stack size for watch thread

When creating a pthread in xs_watch() try to get the minimal needed
size of the thread from glibc instead of using a constant. This avoids
problems when the library is used in programs with large per-thread
memory.

Use dlsym() to get the pointer to __pthread_get_minstack() in order to
avoid linkage problems and fall back to the current constant size if
not found.
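
The lookup can be sketched as below. This is an illustration under assumptions, not the libxenstore code: FALLBACK_STACK_SIZE and watch_thread_stack_size() are hypothetical names, the signature of the glibc-internal __pthread_get_minstack() is assumed to be size_t (*)(const pthread_attr_t *), and taking the larger of the two sizes is a choice made for the sketch.

```c
#include <dlfcn.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical fallback; the constant used by the library is not given here. */
#define FALLBACK_STACK_SIZE ((size_t)16 * 1024)

/* Assumed signature of the glibc-internal helper. */
typedef size_t (*minstack_fn)(const pthread_attr_t *);

size_t watch_thread_stack_size(const pthread_attr_t *attr)
{
    size_t size = FALLBACK_STACK_SIZE;
    /* dlopen(NULL, ...) gives a handle covering the program and its libraries. */
    void *self = dlopen(NULL, RTLD_LAZY);

    if (self) {
        /* Resolve the symbol at run time, so there is no link-time dependency. */
        minstack_fn minstack = (minstack_fn)dlsym(self, "__pthread_get_minstack");

        if (minstack) {
            size_t min = minstack(attr);

            if (min > size)
                size = min;
        }
        dlclose(self);
    }

    /* Symbol not found (e.g. non-glibc libc): the fallback is returned. */
    return size;
}
```

The result would then be passed to pthread_attr_setstacksize() before creating the thread.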

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Tested-by: Jim Fehlig <jfehlig@suse.com>

x86: rename HAVE_GAS_* to HAVE_AS_*

Xen also uses clang's assembler when it is possible. Change the macro
names to not be GAS specific.

Patch produced with:

$ for f in `git grep HAVE_GAS_ | cut -d':' -f1`; \
do sed -i 's/HAVE_GAS_/HAVE_AS_/g' $f; done

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

x86: invpcid support

Provide the functions needed for different modes. Add cpu_has_invpcid.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

public: correct GNTTABOP_set_version comment

Version changes are allowed any number of times. Simply re-use the
comment XTF has (thanks Andrew).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

x86: guard more stack pages

There's no reason to keep the unused pages (of which there are actually
two; respective commentary also gets adjusted) mapped.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

x86/link: Don't merge .init.text and .init.data

c/s 1308f0170c merged .init.text and .init.data, because EFI might properly
write-protect r/o sections.

However, that change makes xen-syms unusable for disassembly analysis, in
particular for searching for indirect branches as part of the SP2/Spectre
mitigation series.

As the merging isn't necessary for ELF targets at all, make it conditional on
the EFI side of the build.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>

fuzz/x86_emulate: fix bounds for input size

The maximum input size was set to INPUT_SIZE, which is actually
the size of the data array inside the fuzz_corpus structure, and so did not
allow the user (or AFL) to fill in the whole structure. Changing it to
sizeof(struct fuzz_corpus) corrects this problem.
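
A sketch of why the two bounds differ. The structure layout and the value of INPUT_SIZE are hypothetical; only the idea that data[] is one member among others comes from the description above.

```c
#include <stddef.h>
#include <stdint.h>

#define INPUT_SIZE 4096   /* illustrative value */

/* Hypothetical layout: some header fields followed by the data array. */
struct fuzz_corpus {
    uint64_t options;
    uint64_t regs[4];
    uint8_t data[INPUT_SIZE];
};

/* Old bound: only as large as the data array. */
size_t max_input_old(void)
{
    return INPUT_SIZE;
}

/* New bound: large enough to fill the whole structure. */
size_t max_input_new(void)
{
    return sizeof(struct fuzz_corpus);
}
```

With the old bound, an input can never reach the last offsetof(struct fuzz_corpus, data) bytes of the structure.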

Signed-off-by: Paul Semel <semelpaul@gmail.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>

tools: drop stale references to curl/xml2-config

Curl and xml2 are not required anymore since 185bb58be3 ("tools: drop
libxen") removed their only user.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Acked-by: Wei Liu <wei.liu2@citrix.com>
[ wei: run autogen.sh ]

libxl: set channel devid when not provided by application

Applications like libvirt may not populate a device devid field,
delegating that to libxl. If needed, the application can later
retrieve the libxl-produced devid. Indeed most devices are handled
this way in libvirt, channel devices included.

This works well when only one channel device is defined, but more
than one results in

qemu-system-i386: -chardev socket,id=libxl-channel-1,\
path=/tmp/test-org.qemu.guest_agent.00,server,nowait:
Duplicate ID 'libxl-channel-1' for chardev

Besides the odd '-1' value in the id, multiple channels have the same
id, causing qemu to fail. A simple fix is to set an uninitialized
devid (-1) to the dev_num passed to libxl__init_console_from_channel().
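
The fix amounts to the following, shown here on a reduced, hypothetical structure rather than the real libxl types:

```c
/* Marker used here for "no devid chosen by the application". */
#define DEVID_UNSET (-1)

/* Hypothetical, reduced stand-in for a channel device description. */
struct channel {
    int devid;
};

/* Give the channel a devid when the application left it unset. */
void channel_set_devid(struct channel *ch, int dev_num)
{
    if (ch->devid == DEVID_UNSET)
        ch->devid = dev_num;
}
```

Since dev_num differs per channel, each chardev id derived from devid becomes unique, while a devid supplied by the application is left alone.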

Signed-off-by: Jim Fehlig <jfehlig@suse.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

libxl: do not fail device removal if backend domain is gone

Backend domain may be independently destroyed - there is no
synchronization of libxl structures (including /libxl tree) elsewhere.
Backend might also remove the device info from its backend xenstore
subtree on its own.

We have various cases (not comprehensive list):

- both frontend and backend operational: after setting
   be/state=XenbusStateClosing, the backend waits for frontend confirmation
   and responds with be/state=XenbusStateClosed; then libxl in dom0
   removes the frontend entries and libxl in the backend domain (which may
   be the same) removes the backend entries
- unresponsive backend/frontend: after a timeout, force=1 is used to remove
   frontend entries, instead of just setting
   be/state=XenbusStateClosing; then wait for be/state=XenbusStateClosed.
   If that times out too, remove both frontend and backend entries
- backend gone, with this patch: no place for setting/waiting on
   be/state - go directly to removing frontend entries, without waiting
   for be/state=XenbusStateClosed (this is the difference vs force=1)

Without this patch the end result is similar, both frontend and backend
entries are removed, but in case of backend gone:
- libxl waits for be/state=XenbusStateClosed (and obviously times out)
- the return value from the function signals an error, which for example
   confuses libvirt - it thinks the device removal failed, so the device is
   still there

If such situation is detected, do not fail the removal, but finish the
cleanup of the frontend side and return 0.

This is just a workaround; the real fix should watch for when the device
backend is removed (including backend domain destruction) and remove the
frontend at that time, and report such an event to higher layer code, so
that, for example, libvirt could synchronize its state.

Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>

tools/misc: Tweak reserved bit handling for xen-cpuid

Instead of printing REZ, use NULL pointers to indicate missing information,
and have dump_leaf() print out the bit which is unknown.

E.g.

....
Dynamic sets:
Raw                       178bfbff:fed8320b:2fd3fbff:35c233ff:0000000f:209c01a9:00000000:00006799:00001007:00000000
  [00] 0x00000001.edx     fpu vme de pse tsc msr pae mce cx8 apic sysenter mtrr pge  ...
  [01] 0x00000001.ecx     sse3 pclmulqdq monitor ssse3 fma cx16 sse41 sse42 movebe   ...
  [02] 0x80000001.edx     fpu vme de pse tsc msr pae mce cx8 apic syscall mtrr pge   ...
  [03] 0x80000001.ecx     lahf_lm cmp svm extapic cr8d lzcnt sse4a msse 3dnowpf osvw ...
  [04] 0x0000000d:1.eax   xsaveopt xsavec xgetbv1 xsaves
  [05] 0x00000007:0.ebx   fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha
  [06] 0x00000007:0.ecx
  [07] 0x80000007.edx     <0> <3> <4> <7> itsc <9> efro <13> <14>
  [08] 0x80000008.ebx     clzero <1> <2> ibpb
  [09] 0x00000007:0.edx
...

which is the output on an AMD EPYC system, where Xen doesn't know about, and
has therefore masked, most of the more advanced thermal/perf features.
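
The printing rule can be sketched as follows. format_leaf() is a hypothetical helper written for this illustration (the real dump_leaf() is not reproduced here); it writes into a buffer instead of stdout so that the result is easy to inspect.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Format the names of the bits set in `leaf` into `buf`.
 * A NULL entry in `names` means the bit has no known name, in which
 * case the bit number is printed as <N>.
 * Returns the number of characters written, or -1 if `buf` is too small.
 */
int format_leaf(char *buf, size_t len, uint32_t leaf,
                const char *const names[32])
{
    size_t used = 0;

    if (len == 0)
        return -1;
    buf[0] = '\0';

    for (unsigned int bit = 0; bit < 32; bit++) {
        int n;

        if (!(leaf & (UINT32_C(1) << bit)))
            continue;

        if (names[bit])
            n = snprintf(buf + used, len - used, " %s", names[bit]);
        else
            n = snprintf(buf + used, len - used, " <%u>", bit);

        if (n < 0 || (size_t)n >= len - used)
            return -1; /* output did not fit */
        used += (size_t)n;
    }

    return (int)used;
}
```

With names[0] = "clzero", names[12] = "ibpb" and a leaf value of 0x1007, this yields " clzero <1> <2> ibpb", matching the 0x80000008.ebx line above.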

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>

x86/PV: convert page table emulation code from paddr_t to intpte_t

It's dealing with PTEs after all.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

x86emul: make all FPU emulation use the stub

While this means quite some reduction of (source) code, the main
purpose is to no longer have exceptions raised from other than stubs.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

ignores: update .hgignore

To add the shim build output and build directory.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

ignores: update list of git ignored files

Add the shim build symbol file and remove the xen-shim binary (which
is no longer created).

Reported-by: Olaf Hering <olaf@aepfle.de>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

firmware/shim: better filtering of intermediate files during Xen tree setup

I have no idea what *.1 is meant to cover. Instead also exclude
preprocessed and non-source assembly files.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

firmware/shim: better filtering of dependency files during Xen tree setup

I have no idea what *.d1 is supposed to refer to - we only have .*.d
and .*.d2 files (note also the leading dot). Also switch to passing
-name instead of -path to find - that's a requirement for .*.d et al to
work, but would probably have been better from the beginning.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>

libxc: really tolerate empty PV records

Commit 119ee4d773 ("tools/libxc: Tolerate specific zero-content records
in migration v2 streams") meant to tolerate those, but failed to set rc
accordingly.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>

x86/hvm: Constify the read side of vlapic handling

This is in preparation to make hvm_x2apic_msr_read() take a const vcpu
pointer. One modification is to alter vlapic_get_tmcct() to not use current.

This in turn needs an alteration to hvm_get_guest_time_fixed(), which is safe
because the only mutable action it makes is to take the domain plt lock.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/vmx: Simplify the default cases in vmx_msr_{read,write}_intercept()

The default case of vmx_msr_write_intercept() in particular is very tangled.

First of all, fold long_mode_do_msr_{read,write}() into their callers. These
functions were split out in the past because of the 32bit build of Xen, but it
is unclear why the cases weren't simply #ifdef'd in place.

Next, invert the vmx_write_guest_msr()/is_last_branch_msr() logic to break if
the condition is satisfied, rather than nesting if it wasn't. This allows the
wrmsr_hypervisor_regs() call to be un-nested with respect to the other default
logic.

No practical difference from a guest's point of view.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

x86/hvm: fix domain crash when CR3 has the noflush bit set

In hardware, when PCID support is enabled and the NOFLUSH bit is set
when writing a CR3 value, the hardware will clear that bit and
change the CR3 without flushing the TLB. hvm_set_cr3(), however, was
ignoring this bit; the result was that post-vm_event checks detected
an invalid CR3 value and crashed the domain.

Handle NOFLUSH in hvm_set_cr3() by:
1. Clearing the bit
2. Passing a "noflush" flag to lower-level cr3 setting functions to
indicate that a flush should not be performed.

Also clear X86_CR3_NOFLUSH when reporting monitored CR3 writes.

This allows introspection to be used on VMs whose operating system uses
the NOFLUSH bit.
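
The two steps can be sketched like this. cr3_strip_noflush() is a hypothetical helper, not the hvm_set_cr3() implementation; the only hardware fact relied upon is that NOFLUSH is bit 63 of the value written to CR3.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 63 of a CR3 write requests "no TLB flush" when PCIDs are enabled. */
#define X86_CR3_NOFLUSH (UINT64_C(1) << 63)

/*
 * Strip NOFLUSH from a guest-supplied CR3 value.
 * Returns the value to load; *noflush tells the caller whether the
 * TLB flush should be skipped.
 */
uint64_t cr3_strip_noflush(uint64_t value, bool pcid_enabled, bool *noflush)
{
    *noflush = pcid_enabled && (value & X86_CR3_NOFLUSH);
    if (*noflush)
        value &= ~X86_CR3_NOFLUSH;

    return value;
}
```

The returned value is what later validity checks and monitor reports would see, so the bit no longer makes CR3 look invalid.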

Signed-off-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
Reported-by: Bitweasil <bitweasil@cryptohaze.com>
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Tamas K Lengyel <tamas@tklengyel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>

xen: sched/credit: convert scheduling parameter to s_time_t when set

Basically, instead of converting integers to s_time_t
at usage time (hot paths), do the conversion when the
values are set (cold paths).

This applies to the timeslice and the ratelimit
parameters of Credit1.

Note that, when changing the type of the fields of
struct csched_private (from unsigned to s_time_t),
ncpus is moved up a bit, for better packing.
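
A sketch of the idea, assuming s_time_t is a signed 64-bit nanosecond count and that the timeslice is given in milliseconds and the ratelimit in microseconds. struct sched_params and both functions are hypothetical, reduced stand-ins for the scheduler code.

```c
#include <stdint.h>

typedef int64_t s_time_t;                          /* nanoseconds */
#define MILLISECS(ms) ((s_time_t)(ms) * 1000000)
#define MICROSECS(us) ((s_time_t)(us) * 1000)

/* Reduced, hypothetical version of the scheduler's private data. */
struct sched_params {
    s_time_t tslice;     /* stored in ns */
    s_time_t ratelimit;  /* stored in ns */
};

/* Cold path: convert once, when the values are set. */
void sched_params_set(struct sched_params *p,
                      unsigned int tslice_ms, unsigned int ratelimit_us)
{
    p->tslice = MILLISECS(tslice_ms);
    p->ratelimit = MICROSECS(ratelimit_us);
}

/* Hot path: the stored value is used directly, with no conversion. */
s_time_t next_timeslice_end(const struct sched_params *p, s_time_t now)
{
    return now + p->tslice;
}
```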

Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>

xen/arm: Flush TLBs before turning on the MMU to avoid stale entries

We don't know the state of the TLBs when booting Xen. To avoid
stale entries, it is necessary to flush the TLBs before turning on the
MMU.

Reported-by: Iain Hunter <iain@hunterembedded.co.uk>
Signed-off-by: Julien Grall <julien.gralL@arm.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>

release-checklist.txt: Say to increment SUPPORT.md version number

CC: Andrew Cooper <andrew.cooper3@citrix.com>
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>

SUPPORT.md: increment version number

CC: Andrew Cooper <andrew.cooper3@citrix.com>
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>

common/gnttab: Introduce command line feature controls

This patch was originally released as part of XSA-226. It retains the same
command line syntax (as various downstreams are mitigating XSA-226 using this
mechanism) but the defaults have been updated due to the revised XSA-226
patches, after which transitive grants are believed to be functioning
properly.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/HVM: don't give the wrong impression of WRMSR succeeding

... for non-existent MSRs: wrmsr_hypervisor_regs()'s comment clearly
says that the function returns 0 for unrecognized MSRs, so
{svm,vmx}_msr_write_intercept() should not convert this into success. We
don't want to unconditionally fail the access though, as we can't be
certain the list of handled MSRs is complete enough for the guest types
we care about, so instead mirror what we do on the read paths and probe
the MSR to decide whether to raise #GP.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>

vmx/hap: optimize CR4 trapping

There are a bunch of bits in CR4 that should be allowed to be set directly
by the guest without requiring Xen intervention. Currently this is
already done by passing through guest writes into the CR4 used when
running in non-root mode, but taking an expensive vmexit in order to
do so.

xenalyze reports the following when running a PV guest in shim mode:

CR_ACCESS             3885950  6.41s 17.04%  3957 cyc { 2361| 3378| 7920}
   cr4  3885940  6.41s 17.04%  3957 cyc { 2361| 3378| 7920}
   cr3        1  0.00s  0.00%  3480 cyc { 3480| 3480| 3480}
     *[  0]        1  0.00s  0.00%  3480 cyc { 3480| 3480| 3480}
   cr0        7  0.00s  0.00%  7112 cyc { 3248| 5960|17480}
   clts        2  0.00s  0.00%  4588 cyc { 3456| 5720| 5720}

After this change this turns into:

CR_ACCESS                  12  0.00s  0.00%  9972 cyc { 3680|11024|24032}
   cr4        2  0.00s  0.00% 17528 cyc {11024|24032|24032}
   cr3        1  0.00s  0.00%  3680 cyc { 3680| 3680| 3680}
     *[  0]        1  0.00s  0.00%  3680 cyc { 3680| 3680| 3680}
   cr0        7  0.00s  0.00%  9209 cyc { 4184| 7848|17488}
   clts        2  0.00s  0.00%  8232 cyc { 5352|11112|11112}

Note that this optimized trapping is currently only applied to guests
running with HAP on Intel hardware. If using shadow paging more CR4
bits need to be unconditionally trapped, which makes this approach
unlikely to yield any important performance improvements.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>

x86/PV: fix off-by-one in I/O bitmap limit check

With everyone having their tags below agreeing that putting things the
other way around in the comparison makes things easier to understand, do
that rearrangement while changing the line anyway.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.apu@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

hvm/svm: implement CPUID events

At this moment the CPUID events for the AMD architecture are not
forwarded to the monitor layer.

This patch adds the CPUID event to the common capabilities and then
forwards the event to the monitor layer.

Signed-off-by: Alexandru Isaila <aisaila@bitdefender.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Acked-by: Tamas K Lengyel <tamas@tklengyel.com>

x86/hvm: Disallow the creation of HVM domains without Local APIC emulation

There are multiple problems, not necessarily limited to:

* Guests which configure event channels via hvmop_set_evtchn_upcall_vector(),
   or which hit %cr8 emulation will cause Xen to fall over a NULL vlapic->regs
   pointer.

* On Intel hardware, disabling the TPR_SHADOW execution control without
   reenabling CR8_{LOAD,STORE} interception means that the guest's %cr8
   accesses interact with the real TPR.  Amongst other things, setting the
   real TPR to 0xf blocks even IPIs from interrupting this CPU.

* On hardware which sets up the use of Interrupt Posting, including
   IOMMU-Posting, guests run without the appropriate non-root configuration,
   which at a minimum will result in dropped interrupts.

Whether no-LAPIC mode is of any use at all remains to be seen.

This is XSA-256.

Reported-by: Ian Jackson <ian.jackson@eu.citrix.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

gnttab: don't blindly free status pages upon version change

There may still be active mappings, which would trigger the respective
BUG_ON(). Split the loop into one dealing with the page attributes and
the second (when the first fully passed) freeing the pages. Return an
error if any pages still have pending references.
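
The split into two loops can be sketched as follows, on a hypothetical page descriptor rather than the real grant table structures:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a status page and its reference count. */
struct status_page {
    unsigned int refs;   /* outstanding mappings */
    bool freed;
};

/*
 * Pass 1 checks every page; pass 2 frees them only if pass 1 found no
 * outstanding references. Returns 0 on success, or -1 if any page is
 * still in use, in which case nothing is freed.
 */
int free_status_pages(struct status_page *pages, size_t nr)
{
    for (size_t i = 0; i < nr; i++)
        if (pages[i].refs != 0)
            return -1;

    for (size_t i = 0; i < nr; i++)
        pages[i].freed = true;

    return 0;
}
```

Doing the check for all pages first means a failure leaves the set of pages untouched instead of half freed.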

This is part of XSA-255.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

gnttab/ARM: don't corrupt shared GFN array

... by writing status GFNs to it. Introduce a second array instead.
Also implement gnttab_status_gmfn() properly now that the information is
suitably being tracked.

While touching it anyway, remove a misguided (but luckily benign) upper
bound check from gnttab_shared_gmfn(): We should never access beyond the
bounds of that array.

This is part of XSA-255.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

memory: don't implicitly unpin for decrease-reservation

It very likely was a mistake (copy-and-paste from domain cleanup code)
to implicitly unpin here: The caller should really unpin itself before
(or after, if they so wish) requesting the page to be removed.

This is XSA-252.

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

grant: Release domain lock on 'map' path in cache_flush

common/grant_table.c:cache_flush() grabs the rcu lock for the current
domain, but only releases it on error paths.

Note that this is not a security issue, as the preempt count is used
exclusively for assertions at the moment.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/time: Rework pv_soft_rdtsc() to aid further cleanup

Having pv_soft_rdtsc() emulate all parts of an rdtscp is awkward, and gets in
the way of some intended cleanup.

* Drop the rdtscp parameter and always make the caller responsible for ecx
   updates when appropriate.
* Switch the function from being void, and return the main timestamp in the
   return value.

The regs parameter is still needed, but only for the stats collection, once
again bringing into question their utility.  The parameter can however switch
to being const.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/pv: Avoid leaking other guests' MSR_TSC_AUX values into PV context

If the CPU pipeline supports RDTSCP or RDPID, a guest can observe the value in
MSR_TSC_AUX, irrespective of whether the relevant CPUID features are
advertised/hidden.

At the moment, paravirt_ctxt_switch_to() only writes to MSR_TSC_AUX if
TSC_MODE_PVRDTSCP mode is enabled, but this is not the default mode.
Therefore, default PV guests can read the value from a previously scheduled
HVM vcpu, or TSC_MODE_PVRDTSCP-enabled PV guest.

Alter the PV path to always write to MSR_TSC_AUX, using 0 in the common case.

To amortise overhead cost, introduce wrmsr_tsc_aux() which performs a lazy
update of the MSR, and use this function consistently across the codebase.
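
A lazy update of this kind might look like the sketch below. The MSR is modelled by plain variables so that the example runs in user space; a hypervisor would keep the cached value per CPU and issue a real WRMSR. Apart from the name wrmsr_tsc_aux(), everything here is hypothetical.

```c
#include <stdint.h>

/* Stand-ins for hardware state, for a single CPU. */
static uint32_t hw_tsc_aux;        /* models the MSR contents */
static uint32_t cached_tsc_aux;    /* last value written to the MSR */
static unsigned int msr_writes;    /* counts modelled WRMSR operations */

/* Models the (comparatively expensive) WRMSR instruction. */
static void hw_wrmsr_tsc_aux(uint32_t val)
{
    hw_tsc_aux = val;
    msr_writes++;
}

/* Lazy update: skip the write when the MSR already holds `val`. */
void wrmsr_tsc_aux(uint32_t val)
{
    if (cached_tsc_aux != val) {
        hw_wrmsr_tsc_aux(val);
        cached_tsc_aux = val;
    }
}
```

Switching between vcpus that all use 0, the common case, then costs a compare rather than an MSR write.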

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>

xen/arm: vpsci: Rework the logic to start AArch32 vCPU in Thumb mode

A 32-bit domain is able to select the instruction set (ARM vs Thumb) to use
when booting a new vCPU via CPU_ON. This is indicated via bit[0] of the
entry point address (see "T32 support" in PSCI v1.1 DEN0022D). bit[0]
must be cleared when setting the PC.

At the moment, Xen is setting CPSR.T but never clears bit[0]. Clear
it to match the specification.

At the same time, slightly rework the code to make clear Thumb is only for
32-bit domains. Lastly, take the opportunity to switch is_thumb from int
to bool.
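
The handling of bit[0] can be sketched as below. struct vcpu_start and set_aarch32_entry() are hypothetical and much reduced; the only architectural fact used is that CPSR.T is bit 5.

```c
#include <stdbool.h>
#include <stdint.h>

#define PSR_THUMB (UINT32_C(1) << 5)   /* CPSR.T */

/* Hypothetical, reduced view of the registers set up for a new vCPU. */
struct vcpu_start {
    uint32_t pc;
    uint32_t cpsr;
};

/* Apply a CPU_ON entry point for a 32-bit domain. */
void set_aarch32_entry(struct vcpu_start *ctx, uint32_t entry_point)
{
    bool is_thumb = entry_point & 1;

    /* bit[0] selects the instruction set and is not part of the address. */
    ctx->pc = entry_point & ~UINT32_C(1);

    if (is_thumb)
        ctx->cpsr |= PSR_THUMB;
    else
        ctx->cpsr &= ~PSR_THUMB;
}
```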

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Andre Przywara <andre.przywara@arm.com>

xen/arm: vpsci: Introduce and use PSCI_INVALID_ADDRESS

PSCI 1.0 added the error return PSCI_INVALID_ADDRESS. It is used to
indicate the entry point address is known to be invalid.

In Xen's case, this error could be returned when a 64-bit vCPU is using a
Thumb entry address.

For the PSCI 0.1 implementation, return PSCI_INVALID_PARAMETERS instead.
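
A sketch of the resulting check. check_entry_point() and its legacy_psci parameter are hypothetical; the numeric error values are the ones defined by the PSCI specification.

```c
#include <stdbool.h>
#include <stdint.h>

#define PSCI_SUCCESS              0
#define PSCI_INVALID_PARAMETERS  (-2)
#define PSCI_INVALID_ADDRESS     (-9)

/*
 * Validate a CPU_ON entry point. `legacy_psci` selects the PSCI 0.1
 * behaviour, which predates PSCI_INVALID_ADDRESS.
 */
int32_t check_entry_point(uint64_t entry_point, bool is_64bit_domain,
                          bool legacy_psci)
{
    bool is_thumb = entry_point & 1;

    /* Thumb only exists for AArch32, so a 64-bit vCPU cannot use it. */
    if (is_64bit_domain && is_thumb)
        return legacy_psci ? PSCI_INVALID_PARAMETERS : PSCI_INVALID_ADDRESS;

    return PSCI_SUCCESS;
}
```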

Suggested-by: mirela.simonovic@aggios.com
Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Cc: mirela.simonovic@aggios.com

xen/arm: vpsci: Update the return type for MIGRATE_INFO_TYPE

The PSCI call MIGRATE_INFO_TYPE returns int32_t. Update the function
return type to match it.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Cc: mirela.simonovic@aggios.com

xen/arm: psci: Prefix with static any functions not exported

A bunch of PSCI functions are not prefixed with static even though no one
is using them outside the file and no prototype is available in
psci.h.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>

xen/arm: psci: Consolidate PSCI version print

Xen prints the PSCI version the same way for 0.1, 0.2 and later.
The only difference is that the former is hardcoded.

Furthermore, PSCI is now used for other things than SMP bring up. So only
print the PSCI version in psci_init.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>

xen/arm: vpsci: Remove parameter 'ver' from do_common_cpu

Currently, the behavior of do_common_cpu changes slightly depending
on the PSCI version passed as a parameter. Looking at the code, the
0.2-specific behavior could be moved out of the function or adapted for 0.1:

    - x0/r0 can be updated on PSCI 0.1 because general purpose registers
    are undefined upon CPU on. This was deduced from the spec not
    mentioning the state of general purpose registers on CPU on.
    - PSCI 0.1 does not define PSCI_ALREADY_ON. However, it would be
    safer to bail out if the CPU is already on.

Based on this, the parameter 'ver' is removed and do_psci_cpu_on
(implementation for PSCI 0.1) is adapted to avoid returning
PSCI_ALREADY_ON.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Volodymyr Babchuk <volodymyr.babchuk@epam.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Andre Przywara <andre.przywara@arm.com>

xen/arm64: Kill PSCI_GET_VERSION as a variant-2 workaround

Now that we've standardised on SMCCC v1.1 to perform the branch
prediction invalidation, let's drop the previous band-aid. If vendors
haven't updated their firmware to do SMCCC 1.1, they haven't updated
PSCI either, so we don't lose anything.

This is aligned with the Linux commit 3a0a397ff5ff.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>

xen/arm64: Add ARM_SMCCC_ARCH_WORKAROUND_1 BP hardening support

Add the detection and runtime code for ARM_SMCCC_ARCH_WORKAROUND_1.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
Reviewed-by: Andre Przywara <andre.przywara@arm.com>

xen/arm: smccc: Implement SMCCC v1.1 inline primitive

One of the major improvements of SMCCC v1.1 is that it only clobbers the
first 4 registers, both on 32 and 64bit. This means that it becomes very
easy to provide an inline version of the SMC call primitive, and avoid
performing a function call to stash the registers that would otherwise
be clobbered by SMCCC v1.0.

This patch has been adapted to Xen from Linux commit f2d3b2e8759a. The
changes made are:
    - Using Xen coding style
    - Remove HVC as not used by Xen
    - Add arm_smccc_res structure

Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Tested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Julien Grall <julien.grall@arm.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>

xen/arm: psci: Detect SMCCC version

PSCI 1.0 and later allow the SMCCC version to be (indirectly) probed
via PSCI_FEATURES. If PSCI_FEATURES does not exist (PSCI 0.2 or
earlier) or the function returns an error, then we assume SMCCC 1.0
is implemented.
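
The probing order can be sketched as follows. This is an illustration
under stated assumptions, not the code from the patch: fw_call_t,
detect_smccc_version() and fake_firmware() are hypothetical, and the
function IDs and error value are quoted from memory of the PSCI and
SMCCC documents.

```c
#include <stdint.h>

#define PSCI_VERSION(major, minor)   (((uint32_t)(major) << 16) | (minor))
#define PSCI_1_0_FN32_PSCI_FEATURES  0x8400000AU  /* assumed function ID */
#define ARM_SMCCC_VERSION_FID        0x80000000U  /* assumed function ID */
#define ARM_SMCCC_VERSION_1_0        0x10000U
#define PSCI_NOT_SUPPORTED           (-1)

/* Hypothetical firmware call hook: a function ID plus one argument. */
typedef int32_t (*fw_call_t)(uint32_t fid, uint32_t arg);

static uint32_t detect_smccc_version(uint32_t psci_ver, fw_call_t fw_call)
{
    int32_t ret;

    /* PSCI_FEATURES only exists from PSCI 1.0 onwards. */
    if ( psci_ver < PSCI_VERSION(1, 0) )
        return ARM_SMCCC_VERSION_1_0;

    /* Ask whether the SMCCC_VERSION function is implemented. */
    ret = fw_call(PSCI_1_0_FN32_PSCI_FEATURES, ARM_SMCCC_VERSION_FID);
    if ( ret == PSCI_NOT_SUPPORTED )
        return ARM_SMCCC_VERSION_1_0;

    /* It is implemented, so query the version itself. */
    return (uint32_t)fw_call(ARM_SMCCC_VERSION_FID, 0);
}

/* Stand-in firmware for illustration only: it reports SMCCC 1.1. */
static int32_t fake_firmware(uint32_t fid, uint32_t arg)
{
    if ( fid == PSCI_1_0_FN32_PSCI_FEATURES )
        return arg == ARM_SMCCC_VERSION_FID ? 0 : PSCI_NOT_SUPPORTED;
    if ( fid == ARM_SMCCC_VERSION_FID )
        return 0x10001;

    return PSCI_NOT_SUPPORTED;
}
```

Calling detect_smccc_version() with a PSCI 0.2 version never touches the
firmware hook, which mirrors the fallback to SMCCC 1.0.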

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Andre Przywara <andre.przywara@arm.com>

xen/arm: smccc: Add macros SMCCC_VERSION, SMCCC_VERSION_{MINOR, MAJOR}

Add macros SMCCC_VERSION, SMCCC_VERSION_{MINOR, MAJOR} to easily convert
between a 32-bit value and a version number. The encoding is based on
2.2.2 in "Firmware interfaces for mitigating CVE-2017-5715" (ARM DEN 0070A).

Also re-use them to define ARM_SMCCC_VERSION_1_0 and ARM_SMCCC_VERSION_1_1.
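
A sketch of what such macros could look like, assuming the encoding puts
the major version in bits [30:16] and the minor version in bits [15:0]
(my reading of the referenced document). The helper names for shift and
masks are illustrative and may differ from the patch.

```c
#include <stdint.h>

/* Assumed layout: bit 31 zero, major in [30:16], minor in [15:0]. */
#define SMCCC_VERSION_MAJOR_SHIFT  16
#define SMCCC_VERSION_MINOR_MASK   ((1U << SMCCC_VERSION_MAJOR_SHIFT) - 1)
#define SMCCC_VERSION_MAJOR_MASK   (0x7fffU << SMCCC_VERSION_MAJOR_SHIFT)

/* Extract the two fields from a 32-bit version value. */
#define SMCCC_VERSION_MAJOR(ver) \
    (((ver) & SMCCC_VERSION_MAJOR_MASK) >> SMCCC_VERSION_MAJOR_SHIFT)
#define SMCCC_VERSION_MINOR(ver) ((ver) & SMCCC_VERSION_MINOR_MASK)

/* Build a 32-bit version value from its two fields. */
#define SMCCC_VERSION(major, minor) \
    (((uint32_t)(major) << SMCCC_VERSION_MAJOR_SHIFT) | (minor))

#define ARM_SMCCC_VERSION_1_0  SMCCC_VERSION(1, 0)
#define ARM_SMCCC_VERSION_1_1  SMCCC_VERSION(1, 1)
```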

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>

xen/arm64: Print a per-CPU message with the BP hardening method used

This will make it easier to know whether BP hardening has been enabled for
a CPU and which method is used.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>

xen/arm64: Implement a fast path for handling SMCCC_ARCH_WORKAROUND_1

The function SMCCC_ARCH_WORKAROUND_1 will be called by the guest for
hardening the branch predictor. So we want the handling to be as fast as
possible.

As the mitigation is applied on every guest exit, we can check for the
call before saving all the context and return very early.

For now, only provide a fast path for HVC64 calls. Because the code relies
on 2 registers, x0 and x1 are saved in advance.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Volodymyr Babchuk <volodymyr.babchuk@epam.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Andre Przywara <andre.przywara@arm.com>

xen/arm: Adapt smccc.h to be able to use it in assembly code

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Volodymyr Babchuk <volodymyr.babchuk@epam.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>

xen/arm: vsmc: Implement SMCCC_ARCH_WORKAROUND_1 BP hardening support

SMCCC 1.1 offers firmware-based CPU workarounds. In particular,
SMCCC_ARCH_WORKAROUND_1 provides BP hardening for variant 2 of XSA-254
(CVE-2017-5715).

If the hypervisor has some mitigation for this issue, report that we
deal with it using SMCCC_ARCH_WORKAROUND_1, as we apply the hypervisor
workaround on every guest exit.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Volodymyr Babchuk <volodymyr.babchuk@epam.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Andre Przywara <andre.przywara@arm.com>

xen/arm: vsmc: Implement SMCCC 1.1

The new SMC Calling Convention (v1.1) allows for a reduced overhead when
calling into the firmware, and provides a new feature discovery
mechanism. See "Firmware interfaces for mitigating CVE-2017-5715"
ARM DEN 0070A.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Volodymyr Babchuk <volodymyr.babchuk@epam.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>

xen/arm: vpsci: Add support for PSCI 1.1

At the moment, Xen provides a virtual PSCI interface compliant with 0.1
and 0.2. Since then, the specification has been updated and the latest
version is 1.1 (see ARM DEN 0022D).

From an implementation point of view, only PSCI_FEATURES is mandatory.
The rest is optional and can be left unimplemented for now.

At the same time, the compatible string of the PSCI node has been updated
to expose "arm,psci-1.0".

Signed-off-by: Julien Grall <julien.grall@arm.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: mirela.simonovic@aggios.com

xen/arm: psci: Rework the PSCI definitions

Some PSCI functions are only available in the 32-bit version. After
recent changes, Xen always needs to know whether the call was made using
a 32-bit id or a 64-bit id, so that we don't emulate reserved ones.

With the current naming scheme, it is not easy to know which call
supports 32-bit and 64-bit id. So rework the definitions to encode the
version in the name. From now the functions will be named PSCI_0_2_FNxx
where xx is 32 or 64.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Volodymyr Babchuk <volodymyr.babchuk@epam.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>

x86/hvm: Don't shadow the domain parameter in hvm_save_cpu_msrs()

c/s d2f86bf604 which introduced "struct hvm_save_descriptor *d" accidentally
ended up shadowing the "struct domain *d" function parameter. Rename the
former to desc.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>

x86/clang: allow integrated assembler usage

If the required features are present.

Modify as-option-add to add an option in case the test fails, and use
it to detect whether the required clang integrated assembler features
are present.

This patch has been tested with clang 3.5, clang 6, gcc 6.4.0 without
retpoline support and gcc 7.3.1 with retpoline support.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>

x86: add .size/.type directives to indirect thunk generation macro

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

get_maintainers.pl: Avoid THE_REST when files are added or removed

When files are added or removed, /dev/null is used as a placeholder
name in the patch for the absent file. Don't try to find a
MAINTAINER for this placeholder, as it only ever flags
and then spams THE REST. Behaviour for a real filename is
unchanged.

Signed-off-by: Alan Robinson <Alan.Robinson@ts.fujitsu.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>

build: Rename as-insn-check to as-option-add

as-insn-check mutates the passed-in flags. Rename it to as-option-add, in
line with cc-option-add, and update all callers.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

build: Help attempts to syntax highlight Config.mk

Some attempts to syntax highlight Config.mk end up thinking that most of
Config.mk is a string, due to the unbalanced squote. Provide a balancing
squote in a comment to compensate.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>

xen: append EXTRA_CFLAGS_XEN_CORE to CFLAGS

Allow a user to supply extra CFLAGS via the EXTRA_CFLAGS_XEN_CORE
environment variable for hypervisor builds. This is not a
configuration that is supported but is only aimed to help support
testing and troubleshooting when you need to make changes.

Signed-off-by: Doug Goldstein <cardoe@cardoe.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>

build: remove shim related targets

There's no need to have shim specific targets, so just use the regular
xen makefile targets in order to build the shim binary.

When the shim is built as part of the firmware directory, install the
stripped Xen binary to the firmware directory and place a binary with
symbols in the debug directory.

The objcopy step of the shim build is also removed in this patch:
since the shim is booted in PVH mode there's no need for the resulting
binary to be in elf32 format. Xen can load PVH kernels with either a
32 or 64bit elf header.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

x86/svm: enable pause filtering threshold

If available, enable the pause filtering threshold feature. See the
previous commit for more information.

Signed-off-by: Brian Woods <brian.woods@amd.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>

x86/svm: add support for pause filtering threshold

Add support for enabling the pause filtering threshold feature.  This
causes the pause filtering count to reset if there's pause filtering
threshold cycles or greater between pauses.  See AMD APM Vol 2 Section
15.14.4 for more details.

The values of the pause filtering count and threshold were found by
iterating over different values of the count and threshold while running
kernbench and a pi spigot algorithm with yields placed in it.  A
balanced setting for both variables provides:

(Using averaged elapsed time with kernbench)
old = 852.0
new = 848.8
improvement = .4%

For systems without pause filtering threshold, the change, from 3000 to
4000 for the count, should not negatively affect system performance.

Signed-off-by: Brian Woods <brian.woods@amd.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>

x86: fix indirect thunk usage of CONFIG_INDIRECT_THUNK

When indirect_thunk_asm.h is instantiated directly into assembly files
CONFIG_INDIRECT_THUNK might not be defined, and thus using .if against
it is wrong.

Add a check to define CONFIG_INDIRECT_THUNK to 0 if not defined, so
that using .if CONFIG_INDIRECT_THUNK is always correct.

This suppresses the following clang error:

<instantiation>:8:9: error: expected absolute expression
    .if CONFIG_INDIRECT_THUNK == 1
        ^
<instantiation>:1:1: note: while in macro instantiation
INDIRECT_BRANCH call %rdx
^
entry.S:589:9: note: while in macro instantiation
        INDIRECT_CALL %rdx
        ^

Note that this is a preparatory patch in order to enable clang's
integrated assembler, the integrated assembler is not yet enabled for
assembly files.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>

VT-d: use two 32-bit writes to update DMAR fault address registers

The 64-bit DMAR fault address is composed of two 32-bit registers,
DMAR_FEADDR_REG and DMAR_FEUADDR_REG. According to VT-d spec:
"Software is expected to access 32-bit registers as aligned doublewords",
a hypervisor should use two 32-bit writes to DMAR_FEADDR_REG and
DMAR_FEUADDR_REG separately in order to update a 64-bit fault address,
rather than a 64-bit write to DMAR_FEADDR_REG. Note that when x2APIC
is not enabled DMAR_FEUADDR_REG is reserved and it's not necessary to
update it.
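
A minimal sketch of the access pattern, assuming the register block is
modelled as an array of 32-bit words. The register offsets are quoted
from memory of the VT-d specification, and dmar_set_fault_addr() is a
hypothetical helper, not a function from the patch.

```c
#include <stdint.h>

/* Assumed register offsets within the DMAR register block. */
#define DMAR_FEADDR_REG   0x40  /* fault event address, low 32 bits */
#define DMAR_FEUADDR_REG  0x44  /* fault event address, upper 32 bits */

/* Write a single aligned 32-bit register. */
static void dmar_writel(volatile uint32_t *regs, unsigned int offset,
                        uint32_t val)
{
    regs[offset / sizeof(uint32_t)] = val;
}

/*
 * Program the 64-bit fault address with two 32-bit writes instead of
 * one 64-bit write. The upper half is only written when x2APIC is
 * enabled, because the register is reserved otherwise.
 */
static void dmar_set_fault_addr(volatile uint32_t *regs, uint64_t addr,
                                int x2apic_enabled)
{
    dmar_writel(regs, DMAR_FEADDR_REG, (uint32_t)addr);
    if ( x2apic_enabled )
        dmar_writel(regs, DMAR_FEUADDR_REG, (uint32_t)(addr >> 32));
}
```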

Though I haven't seen any errors caused by such one 64-bit write on
real machines, it's still better to follow the specification.

Fixes: ae05fd3912b ("VT-d: use qword MMIO access for MSI address writes")
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>

x86/svm: add EFER SVME support for VGIF/VLOAD

Only enable virtual VMLOAD/SAVE and VGIF if the guest EFER.SVME is set.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Brian Woods <brian.woods@amd.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>

sysctl: correct comment in xen_sysctl_pcitopoinfo

Refer to correct member of struct xen_sysctl_pcitopoinfo in comment.

Fixes: commit 61319fbfd9 ("sysctl: add sysctl interface for querying PCI topology")
Signed-off-by: Olaf Hering <olaf@aepfle.de>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

xen/tmem: Convert the file common/tmem_xen.c to use typesafe MFN

The file common/tmem_xen.c is now converted to use typesafe MFN. This
requires overriding the macro page_to_mfn to make it work with mfn_t.

Note that all variables converted to mfn_t have their initial value,
when set, switched from 0 to INVALID_MFN. This is fine because the initial
values were always overridden before use.

Also add a couple of missing newlines suggested by Andrew in the code.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

build: filter out command line assembler arguments

If the assembler is not used. This happens when using cc -E or cc -S
for example. GCC will just ignore the -Wa,... when the assembler is
not called, but clang will complain loudly and fail.

Also enable passing -Wa,-I$(BASEDIR)/include to clang now that it's
safe to do so.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

build: do not hardcode AFLAGS for as-insn tests

Hardcoding as-insn to use AFLAGS is not correct. For one, the test is
performed using a C file with inline assembly, and secondly the flags
used can be passed by the caller together with the CC.

Fix as-insn-check to pass the flags given as parameter to the test.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
[Fix usage comments as they are changing]
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

xen/arm: vgic: Make sure the number of SPIs is a multiple of 32

The vGIC relies on having a pending_irq available for every IRQ
described in the ranks. As each rank describes 32 interrupts, we need to
make sure the number of SPIs is a multiple of 32.
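
The rounding itself can be sketched as below. IRQS_PER_RANK and
vgic_round_nr_spis() are illustrative names and not necessarily what the
patch uses.

```c
#include <stdint.h>

/* Each vGIC rank describes 32 interrupts. */
#define IRQS_PER_RANK 32U

/* Round the requested number of SPIs up to a whole number of ranks. */
static uint32_t vgic_round_nr_spis(uint32_t nr_spis)
{
    return (nr_spis + IRQS_PER_RANK - 1) & ~(IRQS_PER_RANK - 1);
}
```

For instance, a request for 33 SPIs becomes 64, so the second rank is
fully backed by pending_irq entries.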

Reported-by: Jeff Kubascik <Jeff.Kubascik@dornerworks.com>
Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Cc: Jarvis Roach <Jarvis.Roach@dornerworks.com>

asm-x86/monitor: Add MONITOR_EVENT_INTERRUPT to common capabilities

Signed-off-by: Alexandru Isaila <aisaila@bitdefender.com>
Acked-by: Razvan Cojocaru <rcojocaru@bitdefender.com>

x86/msr: add Raw and Host domain policies

Raw policy contains the actual values from H/W MSRs. Add PLATFORM_INFO
msr to the policy during probe_cpuid_faulting().

Host policy may have certain features disabled if Xen decides not
to use them. For now, make Host policy equal to Raw policy with
cpuid_faulting availability dependent on X86_FEATURE_CPUID_FAULTING.

Finally, derive HVM/PV max domain policies from the Host policy.

Signed-off-by: Sergey Dyasli <sergey.dyasli@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

x86/nmi: start NMI watchdog on CPU0 after SMP bootstrap

We're noticing a reproducible system boot hang on certain
Skylake platforms where the BIOS is configured in legacy
boot mode with x2APIC disabled. The system stalls immediately
after writing the first SMP initialization sequence into APIC ICR.

The cause of the problem is watchdog NMI handler execution -
somewhere near the end of NMI handling (after it's already
rescheduled the next NMI) it tries to access IO port 0x61
to get the actual NMI reason on CPU0. Unfortunately, this
port is emulated by BIOS using SMIs and this emulation for
some reason takes more time than we expect during INIT-SIPI-SIPI
sequence. As a result, the system is constantly moving between
NMI and SMI handler and not making any progress.

To avoid this, initialize the watchdog after SMP bootstrap on
CPU0 and, additionally, protect the NMI handler by moving
IO port access before NMI re-scheduling. The latter should also
help in case of post boot CPU onlining. Although we're running
watchdog at much lower frequency at this point, it's nevertheless
possible we may trigger the issue anyway.

Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

shim: allow building of just the shim with build-ID-incapable linker

The ELF note the shim build inserts causes mkelf32 to choke on the
second program header. However, the output of mkelf32 isn't really
needed when building inside tools/firmware/ - an attempt to build it is
made solely because of a wrong dependency.

Further changes to the make logic will be needed to also allow building
a shim-enabled "normal" xen with such a linker (as it looks the --notes
option will need passing not just when the linker supports build ID
generation).

Also drop a stray variable setting from the x86 Makefile.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>

tools: libxenstat: fix format string overflow

With gcc 7.3.0, the build fails like this:

src/xenstat_linux.c: In function ‘getBridge’
src/xenstat_linux.c:78:34: warning: ‘%s’ directive writing up to 255 bytes into a region of size 241 [-Wformat-overflow=]
     sprintf(tmp, "/sys/class/net/%s/bridge", de->d_name);
                                  ^~
src/xenstat_linux.c:78:5: note: ‘sprintf’ output between 23 and 278 bytes into a destination of size 256
     sprintf(tmp, "/sys/class/net/%s/bridge", de->d_name);
     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Fix by making the buffer bigger.

Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>

shut down domain when last vCPU goes down

I've just had to deal with an early boot crash of Linux which occurred
so early that even "earlyprintk=xen" did not produce any useful output.
Hence the domain appeared to hang, while in fact it had brought down its
only vCPU. By translating this to a shutdown, the situation will be
better recognizable.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

x86/PV: avoid indirect call/thunk in I/O emulation

The stub is within reach from the .text section, so there's no point
using an indirect call here. This has the added benefit of there no
longer being two sufficiently different approaches, breaking one of
which people may not even notice.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

hvm/monitor: fix usage of the control register mask

Previous usage is not correct and would prevent certain updates from
being notified to the monitor client.

For example if (value ^ old) == (PGE | PSE) and mask == PGE this
update would not be notified.
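
The example can be worked through in code. This sketch assumes the mask
names the bits whose changes the monitor client wants ignored. Both
helpers are hypothetical and only contrast the two checks; they are not
copied from the patch.

```c
#include <stdbool.h>

#define X86_CR4_PSE 0x00000010UL
#define X86_CR4_PGE 0x00000080UL

/* Flawed check: drops the event as soon as any masked bit changed. */
static bool notify_old(unsigned long value, unsigned long old,
                       unsigned long mask)
{
    return !((value ^ old) & mask);
}

/* Corrected check: notify when at least one unmasked bit changed. */
static bool notify_new(unsigned long value, unsigned long old,
                       unsigned long mask)
{
    return ((value ^ old) & ~mask) != 0;
}
```

With (value ^ old) == (PGE | PSE) and mask == PGE, notify_old() returns
false even though PSE changed, while notify_new() returns true.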

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Razvan Cojocaru <rcojocaru@bitdefender.com>

x86/microcode: Propagate microcode update errors

Errors on updating the microcode in the processor were silently
dropped when invoked via the microcode_update hypercall. Also, the log
message was misleading.

Signed-off-by: Uwe Dannowski <uwed@amazon.de>
Reviewed-by: Stefan Nuernberger <snu@amazon.de>
Reviewed-by: Martin Pohlack <mpohlack@amazon.de>
Reviewed-by: Amit Shah <aams@amazon.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

x86/srat: fix end calculation in nodes_cover_memory()

Along the lines of commit 7226486767 ("x86/srat: fix the end pfn check
in valid_numa_range()") nodes_cover_memory() also doesn't consistently
use "end": It's set to an inclusive value initially, but then compared
to the exclusive "end" field of struct node and also possibly set to
nodes[j].start, making it exclusive too. Change the initialization to
make the variable consistently exclusive.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

x86/hvm/dmop: only copy what is needed to/from the guest

dm_op() fails with -EFAULT if the struct xen_dm_op given by the guest is
smaller than Xen's struct xen_dm_op. This is a problem because DMOP is
meant to be a stable ABI but it breaks whenever the size of struct
xen_dm_op changes.

To fix this, change how the copying to and from the guest is done. When
copying from the guest, first copy the header and inspect the op. Then,
only copy the correct amount needed for that op. When copying to the
guest, don't copy the header. Rather, copy only the correct amount
needed for that particular op.

So now the dm_op() will fail if the guest does not supply enough bytes
for the specific op. It will not fail if the guest supplies too many
bytes for the specific op, but Xen will not copy the extra bytes.
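
The copy-in half of this scheme can be sketched as follows. struct dm_op
is a much simplified, hypothetical stand-in for struct xen_dm_op, the
op_size table is invented for illustration, and memcpy() stands in for
the real guest copy primitives.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for struct xen_dm_op. */
struct dm_op {
    uint32_t op;
    uint32_t pad;
    union {
        uint8_t small[8];   /* payload of op 0 */
        uint8_t large[64];  /* payload of op 1 */
    } u;
};

/* Payload size needed by each op (illustrative values). */
static const size_t op_size[] = { 8, 64 };

static int copy_dm_op_from_guest(struct dm_op *dst, const void *guest_buf,
                                 size_t guest_len)
{
    const size_t hdr = offsetof(struct dm_op, u);

    /* Step 1: copy only the header and inspect the op. */
    if ( guest_len < hdr )
        return -EFAULT;
    memcpy(dst, guest_buf, hdr);

    if ( dst->op >= sizeof(op_size) / sizeof(op_size[0]) )
        return -EOPNOTSUPP;

    /* Step 2: fail if the guest supplied too few bytes for this op. */
    if ( guest_len < hdr + op_size[dst->op] )
        return -EFAULT;

    /* Step 3: copy exactly what this op needs; extra bytes are ignored. */
    memcpy(&dst->u, (const uint8_t *)guest_buf + hdr, op_size[dst->op]);

    return 0;
}
```

A guest built against a smaller structure still succeeds for op 0 as
long as it supplies the header plus 8 payload bytes.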

Remove some now unused macros and helper functions.

Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

hvm/svm: Enable CR events

The CR_INTERCEPT_CR3_WRITE intercept is out of the vmcb->_cr_intercepts
so the AMD arch can't intercept CR events.

This patch implements the CR intercept by adding the flag on a
write_ctrlreg event. The monitor write ctrlreg event is moved from the
Intel side to the common capabilities side.

We just need to enable the SVM intercept and then hvm_mov_to_cr() will
forward the event on to the monitor when appropriate.

Signed-off-by: Alexandru Isaila <aisaila@bitdefender.com>
Acked-by: Tamas K Lengyel <tamas@tklengyel.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>