dgit.raspbian.org Git

x86/vmx: Fix vmentry failure because of invalid LER on Broadwell

Occasionally, on certain Broadwell CPUs MSR_IA32_LASTINTTOIP has been
observed to have the top three bits corrupted as though the MSR is using
the LBR_FORMAT_EIP_FLAGS_TSX format. This is incorrect and causes a
vmentry failure -- the MSR should contain an offset into the current
code segment. This is assumed to be erratum BDF14. Work around the issue
by sign-extending into bits 48:63 for MSR_IA32_LASTINT{FROM,TO}IP.
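
As a rough illustration of the workaround, sign-extending a 48-bit address
into bits 48:63 could look like the sketch below. The helper name
canonicalise_lastint_ip() and the fixed 48-bit address width are assumptions
made for this example; they are not taken from the patch.

```c
#include <stdint.h>

/* Assumption for this sketch: 48 implemented virtual address bits. */
#define VADDR_BITS 48

/*
 * Hypothetical helper: copy bit 47 into bits 48:63, discarding whatever
 * the hardware left in the upper bits.
 */
uint64_t canonicalise_lastint_ip(uint64_t val)
{
    const uint64_t upper = ~UINT64_C(0) << VADDR_BITS;   /* bits 48:63 */

    if ( val & (UINT64_C(1) << (VADDR_BITS - 1)) )
        return val | upper;    /* bit 47 set: upper bits become all ones */

    return val & ~upper;       /* bit 47 clear: upper bits become zero */
}
```

With this helper, a value whose top three bits were cleared or set by the
erratum comes back as a canonical address again.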

Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>

ns16550: add support for UART parameters to be specified with name-value pairs

Add name=value parsing options for com1 and com2 to add flexibility
in setting register values for MMIO UART devices.

Maintain backward compatibility with previous positional parameter
specifications.

eg. com1=115200,8n1,0x3f8,4
eg. com1=115200,8n1,0x3f8,4,reg_width=4,reg_shift=2
eg. com1=baud=115200,parity=n,reg_width=4,reg_shift=2,irq=4
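
To make the name=value form concrete, here is a minimal parsing sketch. It is
not the ns16550 driver code: struct uart_opts, parse_uart_opts() and the
error convention are hypothetical, and only the option names shown in the
examples above are recognised.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical container for the named options shown in the examples. */
struct uart_opts {
    unsigned long baud, reg_width, reg_shift, irq;
    char parity;
};

/* True if the len-byte string at s is exactly name. */
static int name_is(const char *s, size_t len, const char *name)
{
    return strlen(name) == len && !strncmp(s, name, len);
}

/*
 * Parse a comma-separated list of name=value pairs into *o.
 * Returns 0 on success, -1 on an unknown name or an item without '='.
 */
int parse_uart_opts(const char *conf, struct uart_opts *o)
{
    while ( *conf )
    {
        const char *eq = strchr(conf, '=');
        const char *end = strchr(conf, ',');
        size_t len;

        if ( !end )
            end = conf + strlen(conf);
        if ( !eq || eq > end )
            return -1;                      /* no '=' in this item */

        len = (size_t)(eq - conf);
        if ( name_is(conf, len, "baud") )
            o->baud = strtoul(eq + 1, NULL, 0);
        else if ( name_is(conf, len, "parity") )
            o->parity = eq[1];
        else if ( name_is(conf, len, "reg_width") )
            o->reg_width = strtoul(eq + 1, NULL, 0);
        else if ( name_is(conf, len, "reg_shift") )
            o->reg_shift = strtoul(eq + 1, NULL, 0);
        else if ( name_is(conf, len, "irq") )
            o->irq = strtoul(eq + 1, NULL, 0);
        else
            return -1;                      /* unknown option name */

        conf = *end ? end + 1 : end;        /* skip the ',' if present */
    }

    return 0;
}
```

The sketch handles only the all-named form; mixing positional and named
parameters, as in the second example, would need an extra positional pass.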

Signed-off-by: Swapnil Paratey <swapnil.paratey@amd.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86: ensure invalidate_icache() definition is visible only when !__ASSEMBLY__

Commit edff605421 introduces an empty invalidate_icache() function in
page.h for x86 but mistakenly places it outside the !__ASSEMBLY__
block. This causes a build failure on x86.

Address this by moving the function definition to within the existing
!__ASSEMBLY__ block.

Fixes: edff605421 ("Avoid excess icache flushes in populate_physmap() before domain has been created")
Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
Acked-by: Jan Beulich <jbeulich@suse.com>

xen/arm: Remove unused helpers access_ok and array_access_ok

Both helpers access_ok and array_access_ok are not used on ARM. Remove
them.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>

Avoid excess icache flushes in populate_physmap() before domain has been created

populate_physmap() calls alloc_heap_pages() per requested
extent. alloc_heap_pages() invalidates the entire icache per
extent. During domain creation, the icache invalidations can be deferred
until all the extents have been allocated as there is no risk of
executing stale instructions from the icache.

Introduce a new flag "MEMF_no_icache_flush" to be used to prevent
alloc_heap_pages() from performing icache maintenance operations. Use
the flag in populate_physmap() before the domain has been unpaused and
perform the required icache maintenance at the end of the
allocation.
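
The effect of the flag can be sketched with a small simulation. Only the flag
name MEMF_no_icache_flush comes from the text above; its value here and the
*_sim functions are stand-ins invented for this example, not the real
allocator.

```c
#include <stdbool.h>

/* Flag name from the commit message; the bit value is arbitrary here. */
#define MEMF_no_icache_flush (1u << 0)

unsigned int icache_flushes;    /* counts simulated icache invalidations */

void invalidate_icache_sim(void)
{
    icache_flushes++;
}

/* Stand-in for alloc_heap_pages(): flushes unless told not to. */
void alloc_extent_sim(unsigned int memflags)
{
    if ( !(memflags & MEMF_no_icache_flush) )
        invalidate_icache_sim();
}

/* Stand-in for populate_physmap(): defer the flush during creation. */
void populate_sim(unsigned int nr_extents, bool creation_finished)
{
    unsigned int memflags = creation_finished ? 0 : MEMF_no_icache_flush;
    unsigned int i;

    for ( i = 0; i < nr_extents; i++ )
        alloc_extent_sim(memflags);

    /* One flush at the end covers every extent allocated above. */
    if ( memflags & MEMF_no_icache_flush )
        invalidate_icache_sim();
}
```

In this model, populating 8 extents costs one invalidation while the domain
is still being created and 8 once creation has finished.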

One concern is the lack of synchronisation around testing for
"creation_finished". In practice, though, the window where it is out
of sync seems small enough not to matter.

Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>

arm: p2m: Prevent redundant icache flushes

When toolstack requests flushing the caches, flush_page_to_ram() is
called for each page of the requested domain. This leads to unnecessary
icache invalidation operations.

Let's take the responsibility of performing icache operations and use
the recently introduced flag to prevent redundant icache operations by
flush_page_to_ram().

Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>

Allow control of icache invalidations when calling flush_page_to_ram()

flush_page_to_ram() unconditionally drops the icache. In certain
situations this leads to excessive icache flushes when
flush_page_to_ram() ends up being repeatedly called in a loop.

Introduce a parameter to allow callers of flush_page_to_ram() to take
responsibility for synchronising the icache. This is in preparation for
adding logic to make the callers perform the necessary icache
maintenance operations.

Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>

vif-common.sh: Have iptables wait for the xtables lock

iptables has a system-wide lock on the xtables.  Strangely though, in
the case of two concurrent invocations, the default is for the
instance not grabbing the lock to exit out rather than waiting for it.
This means that when starting a large number of guests in parallel,
many will fail out with messages like this:

  2017-05-10 11:45:40 UTC libxl: error: libxl_exec.c:118: libxl_report_child_exitstatus: /etc/xen/scripts/vif-bridge remove [18767] exited with error status 4
  2017-05-10 11:50:52 UTC libxl: error: libxl_exec.c:118: libxl_report_child_exitstatus: /etc/xen/scripts/vif-bridge offline [1554] exited with error status 4

In order to instruct iptables to wait for the lock, you have to
specify '-w'.  Unfortunately, not all versions of iptables have the
'-w' option, so on first invocation check to see if it accepts the -w
option.

Reported-by: Antony Saba <awsaba@gmail.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>

x86/HAP: don't open code clear_domain_page()

Also drop a stray initializer.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>

x86/HVM: correct notion of new CPL in task switch emulation

Commit aac1df3d03 ("x86/HVM: introduce hvm_get_cpl() and respective
hook") went too far in one aspect: When emulating a task switch we
really shouldn't be looking at what hvm_get_cpl() returns, as we're
switching all segment registers.

The issue manifests as a vmentry failure for 32bit VMs which use task
gates to service interrupts/exceptions, in situations where delivering
the event interrupts user code, and a privilege increase is required.

However, instead of reverting the relevant parts of that commit, have
the caller tell the segment loading function what the new CPL is. This
at once fixes ES being loaded before CS so far having had its checks
done against the old CPL.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>

x86/physdev: factor out the code to allocate and map a pirq

Move the code to allocate and map a domain pirq (either GSI or MSI)
into the x86 irq code base, so that it can be used outside of the
physdev ops.

This change shouldn't affect the functionality of the already existing
physdev ops.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

domctl/pt: remove hvm_domid field from bind struct

This field is unused and serves no purpose.

Reported-by: Jan Beulich <JBeulich@suse.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
[jb: bump domctl interface version]
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/vlapic: fix two flaws in emulating MSR_IA32_APICBASE

According to SDM Chapter ADVANCED PROGRAMMABLE INTERRUPT CONTROLLER (APIC)
-> Extended XAPIC (x2APIC) -> x2APIC State Transitions, the existing code to
handle guest writes to MSR_IA32_APICBASE has two flaws:
1. Transition from x2APIC Mode to Disabled Mode is allowed but wrongly
disabled currently. Fix it by removing the related check.
2. Transition from x2APIC Mode to xAPIC Mode is illegal but wrongly allowed
currently. Considering changing ENABLE bit of the MSR has been handled,
it can be fixed by only allowing transition from xAPIC Mode to x2APIC Mode
(the other two transitions: from x2APIC mode to xAPIC Mode, from disabled mode
to invalid state (EN=0, EXTD=1) are disabled).
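
The transition rules can be summarised in a small predicate. This is a
sketch rather than the vlapic code: the macro and function names are
hypothetical, while the bit positions (EXTD is bit 10, EN is bit 11) follow
the architectural layout of MSR_IA32_APICBASE.

```c
#include <stdbool.h>
#include <stdint.h>

#define APICBASE_EXTD   (UINT64_C(1) << 10)   /* x2APIC mode enable */
#define APICBASE_ENABLE (UINT64_C(1) << 11)   /* APIC global enable */

/* Hypothetical check: may the MSR go from value cur to value next? */
bool apicbase_transition_ok(uint64_t cur, uint64_t next)
{
    bool cur_en = cur & APICBASE_ENABLE, cur_extd = cur & APICBASE_EXTD;
    bool next_en = next & APICBASE_ENABLE, next_extd = next & APICBASE_EXTD;

    if ( !next_en && next_extd )
        return false;       /* EN=0, EXTD=1 is the invalid state */

    if ( cur_extd && next_en && !next_extd )
        return false;       /* x2APIC -> xAPIC is illegal */

    if ( !cur_en && next_extd )
        return false;       /* EXTD may only be set from xAPIC mode */

    return true;            /* includes x2APIC -> disabled */
}
```

The two flaws listed above map onto the second check (previously missing)
and the final return (previously rejected for x2APIC to disabled).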

Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/PoD: drop a pointless local variable

... and move another one into a more narrow scope.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>

x86/NPT: deal with fallout from 2Mb/1Gb unmapping change

Commit efa9596e9d ("x86/mm: fix incorrect unmapping of 2MB and 1GB
pages") left the NPT code untouched, as there is no explicit alignment
check matching the one in EPT code. However, the now more widespread
storing of INVALID_MFN into PTEs requires adjustments:
- calculations when shattering large pages may spill into the p2m type
  field (converting p2m_populate_on_demand to p2m_grant_map_rw) - use
  OR instead of PLUS,
- the use of plain l{2,3}e_from_pfn() in p2m_pt_set_entry() results in
  all upper (flag) bits being clobbered - introduce and use
  p2m_l{2,3}e_from_pfn(), paralleling the existing L1 variant.
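
Why OR is safe where PLUS is not can be shown with a toy entry layout. The
40-bit frame field, the position of the type field and all names below are
assumptions for illustration only; they are not the real NPT PTE format.

```c
#include <stdint.h>

/* Toy layout, not the real PTE format: 40-bit frame field, type above it. */
#define FRAME_BITS 40
#define FRAME_MASK ((UINT64_C(1) << FRAME_BITS) - 1)

uint64_t toy_entry(uint64_t frame, unsigned int type)
{
    return (frame & FRAME_MASK) | ((uint64_t)type << FRAME_BITS);
}

unsigned int toy_type(uint64_t entry)
{
    return (unsigned int)(entry >> FRAME_BITS);
}

/* Shattering with PLUS: an all-ones frame field carries into the type. */
uint64_t split_with_add(uint64_t entry, uint64_t idx)
{
    return entry + idx;
}

/* Shattering with OR: bits outside the frame field cannot change. */
uint64_t split_with_or(uint64_t entry, uint64_t idx)
{
    return entry | idx;
}
```

With an all-ones frame (the masked form of INVALID_MFN) and a type of 6,
adding 1 turns the type into 7, whereas OR-ing in 1 leaves it at 6.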

Reported-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>

xen/public: Correct the HYPERVISOR_dm_op() documentation to match reality

The number of buffers is ahead of the buffer list in the argument list.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>

x86/pagewalk: Fix pagewalk's handling of instruction fetches

Despite the claim in the comment (which was based partly on the code already
being like that, and mistaken reasoning because of Xen leaking NX into guest
context), reality differs.

Use of the SMAP feature without NX, or in a 2-level guest, demonstrates an
observable difference between reads and instruction fetches, despite
PFEC_insn_fetch not being reported in the #PF error code.  This demonstrates
that instruction fetches are distinguished from data reads even without
PFEC_insn_fetch being reported.

Alter the pagewalk logic to keep the pagewalk insn_fetch input intact, but
only conditionally report insn_fetch in the error code.  This logic is more
in line with the Intel SDM text:

* I/D flag (bit 4).
   This flag is 1 if (1) the access causing the page-fault exception was an
   instruction fetch; and (2) either (a) CR4.SMEP = 1; or (b) both (i) CR4.PAE
   = 1 (either PAE paging or 4-level paging is in use); and (ii) IA32_EFER.NXE
   = 1. Otherwise, the flag is 0. This flag describes the access causing the
   page-fault exception, not the access rights specified by paging.

and the AMD SDM text:

* I/D - Bit 4. If this bit is set to 1, it indicates that the access that
   caused the page fault was an instruction fetch. Otherwise, this bit is
   cleared to 0. This bit is only defined if no-execute feature is enabled
   (EFER.NXE=1 && CR4.PAE=1).

Curiously, the AMD manual doesn't mention SMEP despite some Fam16h processors
and all Fam17h processors supporting it.  Experimentally, it behaves as
described by Intel.
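
The Intel wording quoted above reduces to a single boolean expression. The
function below is only a restatement of that rule; its name and parameters
are hypothetical and it is not the pagewalk code.

```c
#include <stdbool.h>

/*
 * Should the I/D bit be reported in the #PF error code?
 * insn_fetch: the faulting access was an instruction fetch
 * smep:       CR4.SMEP
 * pae:        CR4.PAE (PAE or 4-level paging in use)
 * nxe:        IA32_EFER.NXE
 */
bool report_insn_fetch(bool insn_fetch, bool smep, bool pae, bool nxe)
{
    return insn_fetch && (smep || (pae && nxe));
}
```

The walk itself still needs the unmodified insn_fetch input; only the
reported error code is filtered this way.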

In addition, add some extra clarification and sanity checking around the use
of NX for the access checks, where it might be reserved.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

Revert "x86/hvm: disable pkeys for guests in non-paging mode"

This reverts commit c41e0266dd59ab50b7a153157e9bd2a3ad114b53.

When determining Access Rights, Protection Keys only take effect when CR4.PKE
is set, and 4-level paging is active. All other circumstances (notably, 32bit
PAE paging) skip the Protection Key control mechanism.

Therefore, we do not need to clear CR4.PKE behind the back of a guest which is
not using paging, as such a guest is necessarily running with EFER.LMA
disabled.

The {RD,WR}PKRU instructions are specified as being legal for use in any
operating mode, but only if CR4.PKE is set. By clearing CR4.PKE behind the
back of an unpaged guest, these instructions yield #UD despite the guest
correctly seeing PKE set if it reads CR4, and OSPKE being visible in CPUID.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Huaitong Han <huaitong.han@intel.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>

stop_machine: fill fn_result only in case of error

When stop_machine_run() is called with NR_CPUS as last argument,
fn_result member must be filled only if an error happens since it is
shared across all cpus.

Assume CPU1 detects an error and sets fn_result to -1, then CPU2 doesn't
detect an error and sets fn_result to 0. The error detected by CPU1 will
be ignored.

Note that in case multiple failures occur on different CPUs, only the
last error will be reported.
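
A minimal model of the fixed behaviour, assuming a single shared result slot
that CPUs write one after another. The names fn_result_sim, record_result()
and run_on_all() are invented for this sketch and do not appear in
stop_machine.c.

```c
int fn_result_sim;   /* shared across all simulated CPUs */

/* Store the return value only when it signals an error. */
void record_result(int ret)
{
    if ( ret < 0 )
        fn_result_sim = ret;
}

/* Feed each simulated CPU's return value through record_result(). */
int run_on_all(const int *rets, unsigned int nr_cpus)
{
    unsigned int i;

    fn_result_sim = 0;
    for ( i = 0; i < nr_cpus; i++ )
        record_result(rets[i]);

    return fn_result_sim;
}
```

A later success no longer overwrites an earlier error, and with several
failures the last one written wins, as noted above.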

Signed-off-by: Gregory Herrero <gregory.herrero@oracle.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>

x86: partially undo "fix build with gcc 7"

While f32400e90c ("x86: fix build with gcc 7")'s change to
compat_array_access_ok() is necessary, I had blindly and needlessly
also added it to array_access_ok(). There's no conditional expression
involved there, so undo it.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

smp: assert that all affected CPUs are online in on_selected_cpus()

Suggested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

hvmloader: drop pointless objcopy invocation

It doesn't alter the image in any way.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

x86/MSI: improve memory usage in struct msi_desc

There's no reason to have both a 4-byte hole and 4 bytes of tail
padding.
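
The general effect can be reproduced with two hypothetical structures (these
are not the fields of struct msi_desc): moving a 4-byte member into the hole
removes both the hole and the tail padding on a typical LP64 ABI.

```c
#include <stddef.h>
#include <stdint.h>

/* 4-byte hole after a, and 4 bytes of tail padding after b (LP64). */
struct with_hole {
    uint32_t a;
    void *p;
    uint32_t b;
};

/* Same members; b now fills what used to be the hole. */
struct tight {
    uint32_t a;
    uint32_t b;
    void *p;
};

/* Bytes saved per object by the reordering. */
size_t bytes_saved(void)
{
    return sizeof(struct with_hole) - sizeof(struct tight);
}
```

On an ABI with 8-byte pointers this saves 8 bytes per object; with 4-byte
pointers both layouts are the same size.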

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

xen/x86: Drop sync_core()

As identified in Linux c/s c198b121b1a1d "x86/asm: Rewrite sync_core() to use
IRET-to-self", sync_core() is only appropriate for two very specific usecases.

Xen doesn't have need of either of these usecases, so drop sync_core() to
avoid any misuse.

In the unlikely event that we do gain a legitimate use for sync_core(), it
should be reintroduced as a mov to %cr2 rather than cpuid, as the mov has
lower overhead.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>

xen/x86/alternatives: Do not use sync_core() to serialize I$

We use sync_core() in the alternatives code to stop speculative
execution of prefetched instructions because we are potentially changing
them and don't want to execute stale bytes.

What it does on most machines is call CPUID which is a serializing
instruction. And that's expensive.

However, the instruction cache is serialized when we're on the local CPU
and are changing the data through the same virtual address. So then, we
don't need the serializing CPUID but a simple control flow change. The
latter is accomplished with a CALL/RET which the noinline causes.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
[Linux commit 34bfab0eaf0fb5c6fb14c6b4013b06cdc7984466]

Ported to Xen.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Jan Beulich <jbeulich@suse.com>

x86/string: Clean up x86/string.h

* None of the GCC docs mention memmove() in its list of builtins even today,
   but 4.1 does have the builtin, meaning that all currently supported
   compilers have it.
* Consistently use Xen style, matching the common code, and introduce symbol
   definitions for function pointer use.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <JBeulich@suse.com>

xen/string: Use compiler __builtin_*() where possible

The use of -fno-builtin inhibits the compiler's automatic builtin
transformations. Using __builtin_*() explicitly allows constructs such as
strlen("literal") to be evaluated at compile time, and certain simple
operations to be replaced with repeated string operations.

To avoid the macro altering the function names, use the method recommended by
the C specification by enclosing the function name in brackets to avoid the
macro being expanded. This means that optimisation opportunities continue to
work in the rest of the translation unit.
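
The bracket technique looks roughly like this. demo_strlen() is a made-up
name so that the sketch does not clash with the C library; the example
assumes a compiler that provides __builtin_strlen(), such as GCC or Clang.

```c
#include <stddef.h>

size_t demo_strlen(const char *s);

/* Calls written as demo_strlen(x) expand to the compiler builtin. */
#define demo_strlen(s) __builtin_strlen(s)

/*
 * The name in brackets is not followed by '(', so the function-like macro
 * is not expanded and this defines the real out-of-line function.
 */
size_t (demo_strlen)(const char *s)
{
    size_t n = 0;

    while ( s[n] )
        n++;

    return n;
}
```

A plain call uses the builtin, `(demo_strlen)("x")` calls the function, and
taking the address with `&demo_strlen` still works because the macro only
matches when the name is followed by an opening bracket.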

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

xen/string: Clean up {xen,arm}/string.h

* Drop __kernel_size_t entirely.  It isn't a useful distinction, especially
   as it means the prototypes don't appear to match their common
   definitions.
* Introduce __HAVE_ARCH_* guards for strpbrk(), strsep() and strspn(), which
   match their implementation in common/string.c
* Apply consistent Xen style throughout.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Julien Grall <julien.grall@arm.com>

Revert "tools/libxc: Drop broken xc_{get,set}_hvm_param() functions"

This reverts commit fa4583333ddba6afb7b07ff7eb4d16e1a6a7459c.

QEMU build is broken by that patch.

xl man page cleanup and fixes

Signed-off-by: Armando Vega <armando@greenhost.nl>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
[ wei: remove trailing spaces ]
Acked-by: Wei Liu <wei.liu2@citrix.com>

tools/libxc: Drop broken xc_{get,set}_hvm_param() functions

xc_{get,set}_hvm_param() are deprecated because they truncate their value
parameter in 32bit builds of libxc, and are therefore unfit for use.

As there is only a single remaining user, switch that user over to
xc_hvm_param_get() and drop these functions completely.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>

Revert "ns16550: add support for UART parameters to be specified with name-value pairs"

This reverts commit a91252ff0d219d801f2dc947511c1755fe5b05fe,
as it breaks the build on ARM.

docs: remove PVHv1 document

The current misc/pvh.markdown document refers to PVHv1, remove it to
avoid confusion with PVHv2 since the PVHv1 code has already been
removed.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>

x86/vpmu: add cpu hot unplug notifier for vpmu

Currently, hot-unplugging a physical CPU with vpmu enabled may cause a
system hang, because a remote call is sent to an offlined pCPU. This
patch adds a CPU hot unplug notifier to save the vpmu context before
the CPU goes offline.

Consider one scenario: hot-unplugging pCPU N with vpmu enabled.
The vcpu which was running on this pCPU will be switched to another
online CPU. A remote call will then be sent to pCPU N to save the
vpmu context before loading the vpmu context on the new pCPU.
The system will hang in on_selected_cpus() because pCPU N is
offline and cannot respond.

The purpose of adding a VPMU_CONTEXT_LOADED check in vpmu_arch_destroy()
before sending a remote call to save the vpmu context is:
a. When a vpmu context has been loaded on a remote pCPU, a remote call
to save the vpmu context and stop the counters is necessary.
b. The VPMU_CONTEXT_LOADED flag will be reset if a pCPU is offlined,
so this check prevents sending a remote call to an offlined pCPU.

Signed-off-by: Luwei Kang <luwei.kang@intel.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>

acpi: enlarge NUM_FIXMAP_ACPI_PAGES to support larger scale boards

acpi_tb_verify_table()->__acpi_map_table() assumes that no ACPI table
exceeds 4 pages; the tables concerned include SRAT, APIC, ERST, etc.
Note that the DSDT is not mapped through acpi_tb_verify_table(), so its
size does not matter here even though it is usually the largest of all
the ACPI tables. The biggest table of concern is therefore the SRAT.
The size of the SRAT is affected by both the number of CPUs and the
number of memory slots: each CPU costs 24B, and each memory slot costs
40B.

Note also that even when the SRAT is smaller than 4 pages, e.g. 14128B,
__acpi_map_table() maps whole pages to reach it. Suppose the start
address is near the end of the first page:

1000B 4096B 4096B 4096B 840B
|___|_____________|______________|______________|____|

Although the table is smaller than 4 pages in total, it may in fact
span 5 pages, as shown above. Thus NUM_FIXMAP_ACPI_PAGES needs to be
much larger nowadays. If it is not, Xen wrongly concludes that no NUMA
configuration can be found, because it cannot map the SRAT.

Thus, make NUM_FIXMAP_ACPI_PAGES much larger: 64 pages (256KB). This is
derived from the theoretical largest CPU count on the main Linux
distros, about 8192, and a memory slot count within 1000, giving
24B*8192+40B*1000 = 236608B. Meanwhile, because the IOREMAP_VIRT_*
region is 16GB, extending it to 256KB should be safe enough.
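
The page arithmetic in this message can be checked with two small helpers.
The names and the 4096-byte page size constant are assumptions of this
sketch; the per-CPU and per-slot costs are the 24B and 40B figures quoted
above.

```c
#define PAGE_SIZE_SIM 4096UL   /* assumed page size for this sketch */

/* Number of pages touched by a table of size bytes starting at start. */
unsigned long pages_spanned(unsigned long start, unsigned long size)
{
    unsigned long offset = start % PAGE_SIZE_SIM;

    return (offset + size + PAGE_SIZE_SIM - 1) / PAGE_SIZE_SIM;
}

/* SRAT size estimate: 24B per CPU plus 40B per memory slot. */
unsigned long srat_estimate(unsigned long cpus, unsigned long mem_slots)
{
    return 24 * cpus + 40 * mem_slots;
}
```

A 14128B table starting 1000B before a page boundary (offset 3096) spans 5
pages, and the 236608B estimate fits within 64 pages whatever its starting
offset.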

Of course, there is much more work to do to support large scale boards
with that many (8192) CPUs and 1000 memory slots. This just makes life
easier for boards with several hundred CPUs and several TBs of memory.

Signed-off-by: Zhang Bo <oscar.zhangbo@huawei.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

ns16550: add support for UART parameters to be specified with name-value pairs

Add name=value parsing options for com1 and com2 to add flexibility
in setting register values for MMIO UART devices.

Maintain backward compatibility with previous positional parameter
specifications.

eg. com1=115200,8n1,0x3f8,4
eg. com1=115200,8n1,0x3f8,4,reg_width=4,reg_shift=2
eg. com1=baud=115200,parity=n,reg_width=4,reg_shift=2,irq=4

Signed-off-by: Swapnil Paratey <swapnil.paratey@amd.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/mmcfg: set pci_mmcfg_config_num to 0 on error path

One error path of acpi_parse_mcfg() doesn't set pci_mmcfg_config_num to
zero; fix this.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/mce: make 'found_error' and 'mce_fatal_cpus' private to mcheck_cmn_handler()

mcheck_cmn_handler() is the only user of 'found_error' and
'mce_fatal_cpus'.

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/mce: make mce barriers private to their users

Each of the current mce barriers is actually used by only one function, so
move their definitions into their users. A static mce barrier initializer
is introduced so that the initialization of these barriers can move
to their definitions.

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/pv: Replace do_guest_trap() with pv_inject_hw_exception()

do_guest_trap() is now functionally equivalent to pv_inject_hw_exception(),
but with a less useful API as it requires the error code parameter to be
passed implicitly via cpu_user_regs.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/mtrr: Improvements to control register handling

Use X86_CR0_CD rather than opencoding it (and its inversion).  Drop the
pointless cr0 variable.

Xen always uses CR4.PGE, and altering PGE is a full TLB flush.  There is no
need to call flush_tlb_local() (which itself toggles CR4.PGE rather than
writing to CR3!) as well as clearing CR4.PGE.  The static cr4 variable isn't
needed either.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/tlb: Don't use locked operations in tlbflush_filter()

All passed cpumask_t's are context-local and not at risk of concurrent
updates.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/traps: export trapstr()

It will be used in common and pv specific code. Export it in traps.h.

No functional change.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

xen/x86: Remove APIC_INTEGRATED() checks

All 64bit processors have integrated APICs. Xen has no need to attempt to
cope with external APICs.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>

x86/mm: Fix the odd indentation of the pin_page block of do_mmuext_op()

The pin_page block is missing one level of indentation, which makes the
MMUEXT_UNPIN_TABLE case label appear to be outside of the switch statement.

However, the block isn't needed at all if page is declared with switch level
scope. This allows for the removal of the identical local declarations for
MMUEXT_UNPIN_TABLE, MMUEXT_NEW_USER_BASEPTR and MMUEXT_CLEAR_PAGE.

While making this adjustment, delete one other piece of trailing whitespace.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/pv: Drop unused switch_kernel_stack()

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>

x86/mm: make free_perdomain_mappings() idempotent

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: George Dunlap <george.dunlap@eu.citrix.com>

Makefile: Mention usual targets of subdir Makefiles

Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wei.liu2@citrix.com>
CC: M A Young <m.a.young@durham.ac.uk>
CC: Andrew Cooper <andrew.cooper3@citrix.com>

Branching 4.9: Fix versions to be 4.10-unstable

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>

Branch Xen 4.9: Make staging be an unstable branch

Config.mk
  MINIOS_UPSTREAM_REVISION   } changed from tag to equivalent
  QEMU_TRADITIONAL_REVISION  }  specific commit hash
  QEMU_UPSTREAM_REVISION     now tracking master again

README, xen/Makefile
  Update version number

*/configure
  Reran autoconf; only change is version number

tools/Rules.mk, xen/Kconfig.debug
  Enable debug.
  Reverts 229ff3125b3d "Use non-debug build for Xen 4.9".

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>

Makefile: Regularise subdir targets and their dependencies

Recent changes to this Makefile have broken some build targets, and
some parallel builds.

Looking at it, I think I have identified the undocumented design
intent in the top-level Makefile.  So in this patch I document it, and
also make it true.

In detail:

* Add a comment with the new design intent
* Get rid of the ad-hoc rules for recursing into tools/include,
   and replace them with a pattern rule
* Add an appropriate dependency on TARGET-tools-public-headers from
   TARGET-tools and TARGET-stubdom (but not dist-*).
* Get rid of all the separate invocations of $(MAKE) -C tools/include
   which are now obsolete
* Un-deprecate the simple `tools' etc. targets (aliases for `dist-tools')
   which we seem not to be making any effort to get rid of

I have verified with the following shell script that after my change,
the tree produces the same results for various build targets as
3fafdc28eb98 (before the Makefile-hacking started).

My tests failed as expected for make -C tools, both before and after.

Separately, there is a bug in the Makefiles that `make distclean-tools'
fails.  I have not investigated that bug in detail.

    #!/bin/bash

    set -e
    set -o pipefail

    listings=../listings

    rm -rf $listings
    mkdir $listings

    chks () {
         reskey="C$subdir $*"
         reskey="${reskey// /_}"
         reskey="${reskey//\//:}"
         lk=$listings/$reskey
         for suffix in '' -xen -tools -stubdom -docs; do
             case "$subdir:$suffix" in
             .:*) ;;
             *:) ;;
             *) continue;;
             esac
             git clean -qxdff
             rm -rf $output
             printf '%s' "running -C$subdir suffix=$suffix "
             case "$subdir $suffix" in
             *xen*) ;;
             *) printf 'configure '; ./configure >$lk.cfg 2>&1 ;;
             esac
             fail=''
             for targ in $*; do
                 realtarg=$targ$suffix
                 printf '%s ' "$realtarg"
                 if ! make -C $subdir -j10 $realtarg >${lk}_${realtarg}.log 2>&1
                 then
                    fail=$realtarg
                    break
                 fi
             done
             if [ "$fail" ]; then
               echo fail!
               echo "$fail failed" >$lk.list
             else
               echo ok.
               (test ! -e "$output" || find $output) |sort >$lk.list
             fi
        done
    }

    subdirs='. xen docs tools'

    output=$PWD/dist
    for subdir in $subdirs; do
        chks build clean distclean
    done

    output=$PWD/dist
    subdir=.
    chks dist

    export DESTDIR=$PWD/destdir
    output=$PWD/destdir
    for subdir in $subdirs; do
        chks install
    done

And the output:

    (64)iwj@mariner:~/work/xen.git$ ~/junk/chks
    running -C. suffix= configure build clean distclean ok.
    running -C. suffix=-xen build-xen clean-xen distclean-xen ok.
    running -C. suffix=-tools configure build-tools clean-tools distclean-tools fail!
    running -C. suffix=-stubdom configure build-stubdom clean-stubdom distclean-stubdom ok.
    running -C. suffix=-docs configure build-docs clean-docs distclean-docs ok.
    running -Cxen suffix= build clean distclean ok.
    running -Cdocs suffix= configure build clean distclean ok.
    running -Ctools suffix= configure build fail!
    running -C. suffix= configure dist ok.
    running -C. suffix=-xen dist-xen ok.
    running -C. suffix=-tools configure dist-tools ok.
    running -C. suffix=-stubdom configure dist-stubdom ok.
    running -C. suffix=-docs configure dist-docs ok.
    running -C. suffix= configure install ok.
    running -C. suffix=-xen install-xen ok.
    running -C. suffix=-tools configure install-tools ok.
    running -C. suffix=-stubdom configure install-stubdom ok.
    running -C. suffix=-docs configure install-docs ok.
    running -Cxen suffix= install ok.
    running -Cdocs suffix= configure install ok.
    running -Ctools suffix= configure install fail!
    (64)iwj@mariner:~/work/xen.git$

CC: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Tested-by: M A Young <m.a.young@durham.ac.uk>
Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>

tools/include/Makefile: Support `build' target

This is the only one of the Makefiles invoked with -C from the
toplevel which lacks this target.

CC: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Tested-by: M A Young <m.a.young@durham.ac.uk>
Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>

x86/hvmloader: Don't wait for the producer to fill the ring if

The condition "if there is space in the ring, then wait for the producer
to fill the ring" also evaluates to true when the ring is full. This
leads to a deadlock where the producer is waiting for the consumer
to consume the items and the consumer is waiting for the producer to fill
the ring.

Fix the issue by checking whether the ring is full and, if so, breaking
out of the loop to consume the items from the ring.
E.g. the case prod = 1272, cons = 248.
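
The example indices can be worked through with free-running counters. The
ring size of 1024 is an assumption (it is what makes prod = 1272 and
cons = 248 a full ring), and the helper names are invented for this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE_SIM 1024u                     /* assumed ring size */
#define MASK_IDX(i)   ((i) & (RING_SIZE_SIM - 1))

/* Full: the producer is exactly one ring size ahead of the consumer. */
bool ring_full(uint32_t cons, uint32_t prod)
{
    return (prod - cons) == RING_SIZE_SIM;
}

/* Empty: both free-running indices are equal. */
bool ring_empty(uint32_t cons, uint32_t prod)
{
    return prod == cons;
}
```

For prod = 1272 and cons = 248 the masked indices are both 248, so a check
based on masked values alone cannot tell full from empty; comparing the
unmasked difference can.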

Signed-off-by: Anshul Makkar <anshul.makkar@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>

Restore HVM_OP hypercall continuation (partial revert of ae20ccf)

Commit ae20ccf removed the hypercall continuation logic from the end
of do_hvm_op(), claiming:

"This patch removes the need for handling HVMOP restarts, so that
infrastructure is removed."

That turns out to be false. The removal of HVMOP_set_mem_type removed
the need to store a start iteration value in the hypercall
continuation, but a grep through hvm.c for ERESTART turns up at least
two places where do_hvm_op() may still need a hypercall continuation:

* HVMOP_set_hvm_param can return -ERESTART when setting
HVM_PARAM_IDENT_PT in the event that it fails to acquire the domctl
lock

* HVMOP_flush_tlbs can return -ERESTART if several vcpus call it at
the same time

In both cases, a simple restart (with no stored iteration information)
is necessary.

Add a check for -ERESTART again, along with a comment at the top of
the function regarding the lack of decoding any information from the
op value.
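
The "simple restart" semantics can be modelled in plain C. This is only a toy sketch under stated assumptions: a real hypercall continuation returns to the guest and re-executes the hypercall, whereas the loop below merely re-invokes a handler. `try_op` and `run_with_restart` are hypothetical names, and the fallback `ERESTART` value is an assumption for platforms whose `errno.h` lacks it.

```c
#include <errno.h>

#ifndef ERESTART
#define ERESTART 85 /* assumed fallback value, for illustration only */
#endif

typedef long (*op_handler_t)(void *state);

/* Hypothetical handler: fails to acquire a lock a few times, then succeeds. */
static inline long try_op(void *state)
{
    int *failures_left = state;

    if ( *failures_left > 0 )
    {
        --*failures_left;
        return -ERESTART;
    }
    return 0;
}

/*
 * Re-issue the whole operation while the handler asks for a restart.
 * No iteration information is carried between attempts.
 */
static inline long run_with_restart(op_handler_t handler, void *state)
{
    long rc;

    do {
        rc = handler(state);
    } while ( rc == -ERESTART );

    return rc;
}
```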

Reported-by: Xudong Hao <xudong.hao@intel.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
Tested-by: Xudong Hao <xudong.hao@intel.com>

xen/arm: p2m: Fix incorrect mapping of superpages

The same set of functions is used to set as well as to clean P2M
entries, except that for clean operations INVALID_MFN (~0UL) is passed
as a parameter. Unfortunately, when calculating an appropriate target
order for a particular mapping, INVALID_MFN is taken into account, which
leads to a 4K page target order being set each time, even for 2MB and
1GB mappings.

As a result, the superpage is broken down into 4K mappings and empty
tables are left allocated.
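
The effect on the order calculation can be sketched as follows. This is an illustration under assumptions, not the Xen code: the function names are hypothetical, and the orders 9 and 18 assume a 4K granule.

```c
#define INVALID_MFN (~0UL)

/* Page orders assuming a 4K granule (assumption for this sketch). */
#define ORDER_4K 0u
#define ORDER_2M 9u
#define ORDER_1G 18u

/* Largest order for which every low bit of the mask is clear. */
static inline unsigned int order_from_mask(unsigned long mask)
{
    if ( !(mask & ((1UL << ORDER_1G) - 1)) )
        return ORDER_1G;
    if ( !(mask & ((1UL << ORDER_2M) - 1)) )
        return ORDER_2M;
    return ORDER_4K;
}

/* Buggy variant: an all-ones INVALID_MFN pollutes the alignment mask. */
static inline unsigned int target_order_buggy(unsigned long gfn,
                                              unsigned long mfn,
                                              unsigned long nr)
{
    return order_from_mask(gfn | mfn | nr);
}

/* Fixed variant: the MFN only contributes when it is valid. */
static inline unsigned int target_order_fixed(unsigned long gfn,
                                              unsigned long mfn,
                                              unsigned long nr)
{
    unsigned long mask = (mfn != INVALID_MFN) ? mfn : 0;

    mask |= gfn | nr;
    return order_from_mask(mask);
}
```

For a removal of 512 pages at a 2MB-aligned frame number, the buggy variant selects the 4K order, while the fixed variant keeps the 2MB order.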

This was introduced by commit 2ef3e36ec7 "xen/arm: p2m: Introduce
p2m_set_entry and __p2m_set_entry".

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>

x86/pagewalk: Fix determination of Protection Key access rights

* When fabricating gl1e's from superpages, propagate the protection key as
   well, so the protection key logic sees the real key as opposed to 0.

* Experimentally, the protection key checks are performed ahead of the other
   access rights.  In particular, accesses which fail both protection key and
   regular permission checks yield PFEC_prot_key in the resulting pagefault.

* Protection keys apply to all data accesses to user-mode addresses,
   including accesses from supervisor code.  PKRU WD applies to any data
   write, not just to mappings which are writable.  However, a supervisor
   access without CR0.WP bypasses any protection from protection keys.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Tim Deegan <tim@xen.org>
Release-acked-by: Julien Grall <julien.grall@arm.com>

hvmloader: avoid tests when they would clobber used memory

First of all limit the memory range used for testing to 4Mb: There's no
point placing page tables right above 8Mb when they can equally well
live at the bottom of the chunk at 4Mb - rep_io_test() cares about the
5Mb...7Mb range only anyway. In a subsequent patch this will then also
allow simply looking for an unused 4Mb range (instead of using a build
time determined one).

Extend the "skip tests" condition beyond the "is there enough memory"
question.

Reported-by: Charles Arnold <carnold@suse.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Gary Lin <glin@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>

Use non-debug build for Xen 4.9

Modify Config.mk and Kconfig.debug to disable debug by default in
preparation for late RCs and eventual release.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>

libxl/devd: move the device allocation/removal code

Move the device addition/removal code to the {add/remove}_device functions.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>

libxl/devd: correctly manipulate the dguest list

Current code in backend_watch_callback has two issues when manipulating the
dguest list:

1. backend_watch_callback forgets to remove a libxl__ddomain_guest from the
list of tracked domains when the related data is freed, causing dereferences
later on when the list is traversed. Make sure that a domain is always removed
from the list when freed.

2. A spurious device state change can cause a dguest to be freed, with active
devices and without being removed from the list. Fix this by always checking if
a dguest has active devices before freeing and removing it.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reported-by: Reinis Martinsons <admin@frp.lv>
Suggested-by: Ian Jackson <ian.jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>

libxl/devd: fix a race with concurrent device addition/removal

Current code can free the libxl__device inside of the libxl__ddomain_device
before the addition has finished if a removal happens while an addition is
still in process:

  backend_watch_callback
            |
            v
       add_device
            |                 backend_watch_callback
    (async operation)                   |
            |                           v
            |                     remove_device
            |                           |
            |                           V
            |                    device_complete
            |                 (free libxl__device)
            v
     device_complete
  (deref libxl__device)

Fix this by creating a temporary copy of the libxl__device, that's tracked by
the GC of the nested async operation. This ensures that the libxl__device used
by the async operations cannot be freed while being used.
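
The idea of giving the asynchronous operation its own copy can be sketched in plain C. This is a simplified illustration, not libxl code: `struct device`, `struct async_op` and `async_op_start` are hypothetical, and the copy is stored by value instead of being allocated from a libxl GC.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a device description. */
struct device {
    unsigned int domid;
    unsigned int devid;
};

/* Hypothetical async operation that owns a private copy of the device. */
struct async_op {
    struct device dev; /* copied by value, so it lives as long as the op */
};

/* Start an operation working on a snapshot of the shared device. */
static inline struct async_op *async_op_start(const struct device *shared)
{
    struct async_op *op = malloc(sizeof(*op));

    if ( op )
        op->dev = *shared; /* a later free of *shared cannot affect op->dev */

    return op;
}
```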

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reported-by: Ian Jackson <ian.jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>

build: more adjustments to top-level Makefile dependencies

In the original code, top-level dist target unconditionally invokes
dist target for tools/include, which is wrong when tools component is
not enabled.

Make dist-tools depend on dist-tools-public-headers, which depends on
build-tools-public-headers.

Discovered by Travis-CI.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>

arm: fix build with gcc 7

The compiler dislikes duplicate "const", and the ones it complains
about look like they were in fact meant to be placed differently.

Also fix array_access_ok() (just like on x86), despite the construct
being unused on ARM: -Wint-in-bool-context, enabled by default in
gcc 7, doesn't like multiplication in conditional operators. "Hide" it,
at the risk of the next compiler version becoming smarter and
recognizing even that. (The hope is that added smartness then would
also better deal with legitimate cases like the one here.) The change
could have been done in access_ok(), but I think we better keep it at
the place the compiler is actually unhappy about.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Julien Grall <julien.grall@arm.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>

x86: fix build with gcc 7

-Wint-in-bool-context, enabled by default in gcc 7, doesn't like
multiplication in conditional operators. "Hide" them, at the risk of
the next compiler version becoming smarter and recognizing even those.
(The hope is that added smartness then would also better deal with
legitimate cases like the ones here.)

The change could have been done in access_ok(), but I think we better
keep it at the places the compiler is actually unhappy about.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>

xmalloc: correct _xmalloc_array() indentation

It's been wrongly indented using tabs till now, and the stray blank
ahead of the final return statement gets in the way of using .i files
for detailed analysis of other compiler issues
(-Wmisleading-indentation kicks in due to the tab->space
transformation done in the course of pre-processing).

Also add missing spaces inside the if() at once, including the similar
case in _xzalloc_array().

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>

build: add missing dependency

Commit f745b55 missed install-tools' dependency on
build-tools-public-headers.

Discovered by Travis-CI.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>

build: stubdom and tools should depend on public header target

Build can fail if stubdom build is run before tools build because:

1. tools/include build uses relative path and depends on XEN_OS
2. stubdom needs tools/include to be built, at which time XEN_OS is
   mini-os and corresponding symlinks are created
3. libraries inside tools need tools/include to be built, at which
   time XEN_OS is the host os name, but symlinks won't be created
   because they are already there
4. libraries get the wrong headers and fail to build

Since both tools and stubdom build need the public headers, we build
tools/include before stubdom and tools. Remove runes in stubdom and
tools to avoid building tools/include more than once.

Provide a new dist target for tools/include.  Hook up the install,
clean, dist and distclean targets for tools/include.

The new arrangement ensures tools build gets the correct headers
because XEN_OS is set to host os when building tools/include. As for
stubdom, it explicitly links to the mini-os directory without relying
on XEN_OS so it should be fine.

Reported-by: Steven Haigh <netwiz@crc.id.au>
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Tested-by: Steven Haigh <netwiz@crc.id.au>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
Acked-by: Samuel Thibault <samuel.thibault@ens-lyon.org>

tools/xenconsoled: Preserve errno while rotating logfile handles

The logic to optionally exit after a poll() error relies on errno, but
handle_log_reload() does not preserve it.
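
The usual way to preserve errno across a helper is to save it on entry and restore it on exit. The sketch below assumes a POSIX environment; `reload_logs` is a hypothetical stand-in for the log rotation work, not the xenconsoled function.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical stand-in for log rotation; it clobbers errno on purpose. */
static inline void reload_logs(void)
{
    close(-1); /* fails and sets errno to EBADF */
}

/* Wrapper that leaves the caller's errno untouched. */
static inline void reload_logs_preserving_errno(void)
{
    int saved_errno = errno;

    reload_logs();
    errno = saved_errno;
}
```

A caller that inspects errno after a failed poll() then still sees the value set by poll(), even if the log handles were rotated in between.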

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>

libxl/arm: Fix ARM build.

Initialise *size in default branch to prevent certain compilers (i.e.
Linaro GCC 5.2-2015.11-2) from reporting "variable may be used uninitialized"
errors in caller function.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>

x86/ioreq_server: make p2m_finish_type_change actually work

Commit 6d774a951696 ("x86/ioreq server: synchronously reset outstanding
p2m_ioreq_server entries when an ioreq server unmaps") introduced
p2m_finish_type_change(), which was meant to synchronously finish a
previously initiated type change over a gpfn range.  It did this by
calling get_entry(), checking if it was the appropriate type, and then
calling set_entry().

Unfortunately, a previous commit (1679e0df3df6 "x86/ioreq server:
asynchronously reset outstanding p2m_ioreq_server entries") modified
get_entry() to always return the new type after the type change, meaning
that p2m_finish_type_change() never changed any entries.  As a result,
when an ioreq server was detached and then re-attached (as happens in
XenGT on reboot), the re-attach failed.

Fix this by using the existing p2m-specific recalculation logic instead
of doing a read-check-write loop.

Fix: 'commit 6d774a951696 ("x86/ioreq server: synchronously reset
      outstanding p2m_ioreq_server entries when an ioreq server unmaps")'

Signed-off-by: Xiong Zhang <xiong.y.zhang@intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>

x86/mm: fix incorrect unmapping of 2MB and 1GB pages

The same set of functions is used to set as well as to clean
P2M entries, except that for clean operations INVALID_MFN (~0UL)
is passed as a parameter. Unfortunately, when calculating an
appropriate target order for a particular mapping INVALID_MFN
is not taken into account which leads to 4K page target order
being set each time even for 2MB and 1GB mappings. This eventually
breaks down an EPT structure irreversibly into 4K mappings which
prevents consecutive high order mappings to this area.

Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>

x86/pv: Fix the handling of `int $x` for vectors which alias exceptions

The claim at the top of c/s 2e426d6eecf "x86/traps: Drop use_error_code
parameter from do_{,guest_}trap()" is only actually true for hardware
exceptions. It is not true for `int $x` instructions (which never push an
error code), irrespective of whether the vector aliases an exception or not.

Furthermore, c/s 6480cc6280e "x86/traps: Fix failed ASSERT() in
do_guest_trap()" really should have helped highlight that a regression had
been introduced.

Modify pv_inject_event() to understand event types other than
X86_EVENTTYPE_HW_EXCEPTION, and introduce pv_inject_sw_interrupt() for the
`int $x` handling code.

Add further assertions to pv_inject_event() concerning the type of events
passed in, which in turn requires that do_guest_trap() set its type
appropriately (which is now used exclusively for hardware exceptions).
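
The distinction can be sketched as a predicate over the event type and vector. This is an illustration, not the Xen implementation: the enum and function names are hypothetical, and the vector list covers the classic error-code-pushing exceptions (#DF, #TS, #NP, #SS, #GP, #PF, #AC) only.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical event types for this sketch. */
enum event_type {
    EVT_HW_EXCEPTION, /* raised by the CPU */
    EVT_SW_INTERRUPT  /* raised by an `int $x` instruction */
};

/* Vectors whose hardware exceptions push an error code. */
#define ERROR_CODE_VECTORS \
    ((1u << 8) | (1u << 10) | (1u << 11) | (1u << 12) | \
     (1u << 13) | (1u << 14) | (1u << 17))

/* Only hardware exceptions push an error code; `int $x` never does. */
static inline bool pushes_error_code(enum event_type type, unsigned int vector)
{
    return type == EVT_HW_EXCEPTION &&
           vector < 32 &&
           (ERROR_CODE_VECTORS & (1u << vector)) != 0;
}
```

For example, a real #GP (vector 13) carries an error code, whereas `int $13` from the guest aliases the same vector but must be delivered without one.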

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>

include: fix build without C++ compiler installed

The rule for headers++.chk wants to move headers++.chk.new to the
designated target, which means we have to create that file in the first
place.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>

ioemu-stubdom: don't link *-softmmu* and *-linux-user*

They are generated by ./configure. Having them linked can cause a race
between the tools build and the stubdom build.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>

tools: don't require unavailable optional libraries in pkg-config files

blktap2 is optional, so there should be no pkg-config file requiring
xenblktapctl if it isn't enabled for the build.

Add a filter mechanism to tools/Rules.mk to filter out optional
libraries.

Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>

public/elfnote: document non-alignment of relocated init-P2M

Since PV kernels can't use large pages anyway, when the init-P2M
support was added it was decided to keep the implementation simple and
not align large pages in PFN space. Document this.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>

Fix broken package config file xenlight.pc.in

The Requires line in this config file uses the wrong names for two dependencies.

The package config file for xenctrl is called 'xencontrol' and for blktapctl is
called 'xenblktapctl'. Running a command like 'pkg-config --exists xenlight' will
fail without this fix.

Signed-off-by: Charles Arnold <carnold@suse.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>

xl: don't ignore return value from libxl_device_events_handler

That function can return a whole slew of error codes. Translate them
to EXIT_FAILURE.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>

libxenforeignmemory: bump minor version number

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>

x86/pv: Align %rsp before pushing the failsafe stack frame

Architecturally, all 64bit stacks are aligned on a 16 byte boundary before an
exception frame is pushed. The failsafe frame should not be special in this
regard.
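
The alignment step amounts to clearing the low four bits of the stack pointer before the frame is pushed. A minimal C sketch of that arithmetic (the helper name is hypothetical; the real change is made in assembly):

```c
#include <stdint.h>

/* Round a stack pointer down to a 16 byte boundary. */
static inline uint64_t align_stack_16(uint64_t rsp)
{
    return rsp & ~UINT64_C(0xf);
}
```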

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>

x86/pv: Fix bugs with the handling of int80_bounce

Testing has revealed two issues:

1) Passing a NULL handle to set_trap_table() is intended to flush the entire
    table.  The 64bit guest case (and 32bit guest on 32bit Xen, when it
    existed) called init_int80_direct_trap() to reset int80_bounce, but c/s
    cda335c279 which introduced the 32bit guest on 64bit Xen support omitted
    this step.  Previously therefore, it was impossible for a 32bit guest to
    reset its registered int80_bounce details.

2) init_int80_direct_trap() doesn't honour the guest's request to have
    interrupts disabled on entry.  PVops Linux requests that interrupts are
    disabled, but Xen currently leaves them enabled when following the int80
    fastpath.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>

xen/arm: Survive unknown traps from guests

Currently we crash Xen if we see an ESR_EL2.EC value we don't recognise.
As configurable disables/enables are added to the architecture
(controlled by RES1/RES0 bits respectively), with associated synchronous
exceptions, it may be possible for a guest to trigger exceptions with
classes that we don't recognise.

While we can't service these exceptions in a manner useful to the guest,
we can avoid bringing down the host. Per ARM DDI 0487A.k_iss10775, page
D7-1937, EC values within the range 0x00 - 0x2c are reserved for future
use with synchronous exceptions, and EC within the range 0x2d - 0x3f may
be used for either synchronous or asynchronous exceptions.

The patch makes Xen handle any unknown EC by injecting an UNDEFINED
exception into the guest, with a corresponding (ratelimited) warning in
the log.
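
The resulting dispatch can be sketched as a switch on the exception class with a non-fatal default. This is a simplified illustration with hypothetical names: only three known classes are listed, and the real handler injects the exception and logs a warning instead of returning an enum.

```c
#include <stdint.h>

/* ESR_EL2.EC occupies bits [31:26] of the syndrome register. */
#define ESR_EC_SHIFT 26
#define ESR_EC_MASK  0x3fu

/* A few architecturally defined exception classes. */
#define EC_WFI_WFE             0x01u
#define EC_HVC64               0x16u
#define EC_DATA_ABORT_LOWER_EL 0x24u

/* Hypothetical outcome of classifying a guest trap. */
enum trap_action { TRAP_HANDLE, TRAP_INJECT_UNDEF };

static inline unsigned int esr_ec(uint32_t esr)
{
    return (esr >> ESR_EC_SHIFT) & ESR_EC_MASK;
}

/* Unknown classes are reflected to the guest instead of crashing the host. */
static inline enum trap_action classify_guest_trap(uint32_t esr)
{
    switch ( esr_ec(esr) )
    {
    case EC_WFI_WFE:
    case EC_HVC64:
    case EC_DATA_ABORT_LOWER_EL:
        return TRAP_HANDLE;

    default:
        return TRAP_INJECT_UNDEF;
    }
}
```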

This patch is based on Linux commit f050fe7a9164 "arm: KVM: Survive unknown
traps from the guest".

Signed-off-by: Julien Grall <julien.grall@arm.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>

xen/arm: do_trap_hypervisor: Separate hypervisor and guest traps

The function do_trap_hypervisor currently handles traps coming from
both the hypervisor and the guest. This makes it difficult to get specific
behavior when a trap is coming from either the guest or the hypervisor.

Split the function into two parts:
- do_trap_guest_sync to handle guest traps
- do_trap_hyp_sync to handle hypervisor traps

On AArch32, the Hyp Trap Exception provides the standard mechanism for
trapping Guest OS functions to the hypervisor (see B1.14.1 in ARM DDI
0406C.c). It cannot be generated when the processor is in Hyp Mode;
instead, other exceptions will be used. So it is fine to replace
the call to do_trap_hypervisor by do_trap_guest_sync.

For AArch64, there are two distinct exceptions depending on whether the
exception was taken from the current level (hypervisor) or a lower level
(guest).

Note that unknown traps from guests will still cause Xen to panic. This
is the existing behavior and is left unchanged for simplicity. A follow-up
patch will address that.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>

xen/arm: arm32: Rename the trap to the correct name

Per Table B1-3 in ARM DDI 0406C.c, the vector 0x8 for hyp is called
"Hypervisor Call".

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>

x86/mm: add temporary debugging code to get_page_from_gfn_p2m()

See the code comment.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>

libxl: u.hvm.usbdevice_list is checked for emptiness

Currently usbdevice_list is only checked for being NULL. However, the
OCaml binding converts an empty list to a pointer to NULL rather than a
NULL pointer, so the OCaml binding fails to disable USB.

Check usbdevice_list for emptiness as well. NULL is still a
valid empty list.
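
The two representations of an empty list can be handled with one predicate. The sketch below assumes the list is a NULL-terminated array of strings; `string_list_is_empty` is a hypothetical helper, not a libxl function.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * A list is empty when it is a NULL pointer, or when it points at an
 * array whose first element is the NULL terminator.
 */
static inline bool string_list_is_empty(char *const *list)
{
    return list == NULL || list[0] == NULL;
}
```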

Signed-off-by: Robin Lee <robinlee.sysu@gmail.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>

x86: correct boot time page table setup

While using alloc_domheap_pages() and assuming the allocated memory is
directly accessible is okay at boot time (as we run on the idle page
tables there), memory hotplug code too assumes it can access the
resulting page tables without using map_domain_page() or alike, and
hence we need to obtain memory suitable for ordinary page table use
here.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>

x86: correct create_bounce_frame

Commit d9b7ef209a7 ("x86: drop failsafe callback invocation from
assembly") didn't go quite far enough with the cleanup it did: The
changed maximum frame size should also have been reflected in the early
address range check (which has now been pointed out to have been wrong
anyway, using 60 instead of 0x60), and it should have updated the
comment ahead of the function.

Also adjust the lower bound - all is fine (for our purposes) if the
initial guest kernel stack pointer points right at the hypervisor base
address, as only memory _below_ that address is going to be written.

Additionally limit the number of times %rsi is being adjusted to what
is really needed.

Finally move exception fixup code into the designated .fixup section
and macroize the stores to guest stack.

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>

x86/vm_event: fix race between __context_switch() and vm_event_resume()

The introspection agent can reply to a vm_event faster than
vmx_vmexit_handler() can complete in some cases, where it is then
not safe for vm_event_set_registers() to modify v->arch.user_regs.
In the test scenario, we were stepping over an INT3 breakpoint by
setting RIP += 1. The quick reply tended to complete before the VCPU
triggering the introspection event had properly paused and been
descheduled. If the reply occurs before __context_switch() happens,
__context_switch() clobbers the reply by overwriting
v->arch.user_regs from the stack. If we don't pass through
__context_switch() (due to switching to the idle vCPU), reply data
wouldn't be picked up when switching back straight to the original
vCPU.

This patch ensures that vm_event_resume() code only sets per-VCPU
data to be used for the actual setting of registers later in
hvm_do_resume() (similar to the model used to control setting of CRs
and MSRs).

The patch additionally removes the sync_vcpu_execstate(v) call from
vm_event_resume(), which is no longer necessary, which removes the
associated broadcast TLB flush (read: performance improvement).

Signed-off-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Tamas K Lengyel <tamas@tklengyel.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>

x86/vm_event: add hvm/vm_event.{h,c}

Created arch/x86/hvm/vm_event.c and include/asm-x86/hvm/vm_event.h,
where HVM-specific vm_event-related code will live. This cleans up
hvm_do_resume() and ensures that the vm_event maintainers are
responsible for changes to that code.

Signed-off-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
Acked-by: Tamas K Lengyel <tamas@tklengyel.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>

x86/vpmu_intel: fix hypervisor crash by masking PC bit in MSR_P6_EVNTSEL

Setting Pin Control (PC) bit (19) in MSR_P6_EVNTSEL results in a General
Protection Fault and thus results in a hypervisor crash. This behavior has
been observed on two generations of Intel processors namely, Haswell and
Broadwell. Other Intel processor generations were not tested. However, it
does seem to be a possible erratum that hasn't yet been confirmed by Intel.

To fix the problem this patch masks PC bit and returns an error in
case any guest tries to write to it on any Intel processor. In addition
to the fact that setting this bit crashes the hypervisor on Haswell and
Broadwell, the PC flag bit toggles a hardware pin on the physical CPU
every time the programmed event occurs and the hardware behavior in
response to the toggle is undefined in the SDM, which makes this bit
unsafe to be used by guests and hence should be masked on all machines.
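
The masking amounts to rejecting any guest write that has bit 19 set. A minimal sketch follows; the macro and function names are hypothetical, and the use of -EINVAL as the error value is an assumption for illustration.

```c
#include <errno.h>
#include <stdint.h>

/* Pin Control (PC) flag, bit 19 of the event select register. */
#define EVNTSEL_PC (UINT64_C(1) << 19)

/* Return 0 if the guest-supplied value may be written, an error otherwise. */
static inline int check_evntsel_write(uint64_t val)
{
    return (val & EVNTSEL_PC) ? -EINVAL : 0;
}
```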

Signed-off-by: Mohit Gambhir <mohit.gambhir@oracle.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>

VMX: constrain vmx_intr_assist() debugging code to debug builds

This is because that code, added by commit 997382b771 ("x86/vmx: dump
PIR and vIRR before ASSERT()"), was meant to be removed by the time we
finalize 4.9, but the root cause of the ASSERT() wrongly(?) triggering
still wasn't found.

Take the opportunity and also correct the format specifiers, which I
had got wrong when editing said change while committing.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>

x86/public: correct register naming

Commit 897129deab ("x86: use unambiguous register names") went a little
too far: With it we also get register names like _e15 and e15 for
non-Xen consumers using a gcc compatible compiler. Correct this.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>

x86: polish __{get,put}_user_{,no}check()

The primary purpose is correcting a latent bug in __get_user_check()
(the macro has no active user at present): The access_ok() check should
be before the actual access, or else any PV guest could initiate MMIO
reads with side effects.

Clean up all four macros at once:
- all arguments evaluated exactly once
- build the "check" flavor using the "nocheck" ones, instead of open
coding them
- "int" is wide enough for error codes
- name local variables without using underscores as prefixes
- avoid pointless parentheses
- add blanks after commas separating parameters or arguments
- consistently use tabs for indentation
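
The ordering requirement (validate the address, then access it) can be sketched with ordinary functions. This is an illustration only: `guest_window`, `range_ok` and `get_u32_checked` are hypothetical stand-ins, not the Xen macros, and the "guest memory" is just a static array.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical window standing in for guest-accessible memory. */
static uint32_t guest_window[4] = { 10, 20, 30, 40 };

/* True if [p, p + size) lies entirely inside the window. */
static inline int range_ok(const void *p, size_t size)
{
    uintptr_t addr = (uintptr_t)p;
    uintptr_t base = (uintptr_t)guest_window;

    return addr >= base &&
           size <= sizeof(guest_window) &&
           addr - base <= sizeof(guest_window) - size;
}

/* Checked read: the range check happens before the access, never after. */
static inline int get_u32_checked(uint32_t *dst, const uint32_t *src)
{
    if ( !range_ok(src, sizeof(*src)) )
        return -EFAULT;

    *dst = *src; /* only reached for validated addresses */
    return 0;
}
```

Performing the read first and the check afterwards would return the same error code, but the access itself, with any side effects it has, would already have happened.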

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>

x86/asm: Clobber %r{8..15} on exit to 32bit PV guests

In the presence of bugs such as XSA-214 where a 32bit PV guest can get its
hands on a long mode segment, this change prevents register content leaking
between domains.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>

x86/asm: Fold LOAD_C_CLOBBERED into RESTORE_ALL

With its sole other user removed, fold LOAD_C_CLOBBERED into RESTORE_ALL to
reduce the cognitive load of trying to work out which registers get modified.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>

x86/traps: Lift all non-entrypoint logic in entry_int82() up into C

This is more readable, maintainable, and livepatchable.

This involves declaring check_for_unexpected_msi(), untrusted_msi and
pv_hypercall() suitably for use by C. While making these changes,
untrusted_msi is switched over to being a C99 bool.

No behavioural change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>

x86/traps: Rename compat_hypercall() to entry_int82()

This follows the Linux example of naming the entry point by how it is arrived
at, rather than its purpose.

Doing so highlights that the SAVE_VOLATILE instantiation sets up the wrong
entry_vector on the stack (although this is currently benign as we never
sysret back to a 32bit PV guest, and the iret path doesn't care).

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>

x86/mm: Further restrict permissions on some virtual mappings

As originally reported, the Linear Pagetable slot maps 512GB of RAM as RWX,
where the guest has full read access and a lot of direct or indirect control
over the written content.  It isn't hard for a PV guest to hide shellcode
here.

Therefore, increase defence in depth by auditing our current pagetable
mappings.

* The regular linear, shadow linear, and per-domain slots have no business
   being executable (but need to be written), so are updated to be NX.
* The Read Only mappings of the M2P (compat and regular) don't need to be
   writeable or executable.
* The PV GDT mappings and bits of the directmap don't need to be executable.

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Tim Deegan <tim@xen.org>
Acked-by: George Dunlap <george.dunlap@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>

x86/traps: Poison unused stack pointers in the TSS

This is for additional defence-in-depth following LDT/GDT/IDT corruption.

It causes attempted control transfers to ring 1 or 2 (via a call gate), or
attempts to use IST 3 through 7 to yield #SS, rather than executing with a
stack starting at the top of virtual address space.

Express the TSS setup in terms of structure assignment, which should be less
fragile if the IST indexes need to change, and has the useful side effect of
zeroing the reserved fields.
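
Structure assignment from a compound literal zeroes every member that is not named, which is what gives the side effect mentioned above. The sketch below is a simplified model under assumptions: `struct tss` is not the packed hardware layout, `STACK_POISON` is a hypothetical non-canonical value, and treating the first three IST slots as in use is an assumption.

```c
#include <stdint.h>

/* Hypothetical non-canonical address used to poison unused stack pointers. */
#define STACK_POISON UINT64_C(0x8600111111111111)

/* Simplified model of a 64bit TSS (not the packed hardware layout). */
struct tss {
    uint32_t rsvd0;
    uint64_t rsp0, rsp1, rsp2;
    uint64_t rsvd1;
    uint64_t ist[7];
    uint64_t rsvd2;
    uint16_t rsvd3, io_bitmap_base;
};

/* Set up the TSS in one assignment; unnamed members become zero. */
static inline void tss_init(struct tss *tss, uint64_t stack_top)
{
    *tss = (struct tss){
        .rsp0 = stack_top,
        .rsp1 = STACK_POISON,
        .rsp2 = STACK_POISON,
        .ist = {
            [0] = stack_top, /* assumed to be in use */
            [1] = stack_top, /* assumed to be in use */
            [2] = stack_top, /* assumed to be in use */
            [3] = STACK_POISON,
            [4] = STACK_POISON,
            [5] = STACK_POISON,
            [6] = STACK_POISON,
        },
    };
}
```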

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>