dgit.raspbian.org Git

x86: limit GFNs to 32 bits for shadowed superpages.

Superpage shadows store the shadowed GFN in the backpointer field,
which for non-BIGMEM builds is 32 bits wide. Shadowing a superpage
mapping of a guest-physical address above 2^44 would lead to the GFN
being truncated there, and a crash when we come to remove the shadow
from the hash table.

Track the valid width of a GFN for each guest, including reporting it
through CPUID, and enforce it in the shadow pagetables. Set the
maximum witth to 32 for guests where this truncation could occur.

This is XSA-173.

Reported-by: Ling Liu <liuling-it@360.cn>
Signed-off-by: Tim Deegan <tim@xen.org>
Signed-off-by: Jan Beulich <jbeulich@suse.com>

tools/libxc: Correct use of X86_XSS_MASK in guest xstate generation

c/s 75f9455e "tools/libxc: Calculate xstate cpuid leaf from guest information"
incorrectly inverted the shift and mask when using X86_XSS_MASK. Luckily, the
mask is currently zero, avoiding incorrect calculations.

While adjusting this, use an explcit uint32_t cast rather than masking against
0xffffffff.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>

xen: cpupools: Document XEN_SYSCTL_CPUPOOL_OP_* error codes

Requested-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>

libxc: cpupools: adjust retry loop in xc_cpupool_removecpu()

Commit 1ef6beea187b ("libxc: do some retries in xc_cpupool_removecpu()
for EBUSY case") added a retry loop in xc_cpupool_removecpu() for the
EBUSY case. As EBUSY was returned in multiple error situations the
loop would have been executed in situations where a retry would not
be successful. Additionally calling sleep(1) between the rerires is a
bad idea when being called in a daemon.

The hypervisor has been changed to return different error values now.
The retry added in above mentioned commit should be done in the
EADDRINUSE case now. As the error condition should last only for a
very short time, the sleep(1) call can be removed.

Requested-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: Alan Robinson <alan.robinson@ts.fujitsu.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>

xen: cpupools: return different error values for cpupool operations

Today there are several different situations in which moving a cpu
from or to a cpupool will return -EBUSY. This makes it hard for the
user to know what he did wrong, as the Xen tools are not capable to
print a detailed error message.

Depending on the situation return different error codes in order to
enable the tools to print useful messages.

Requested-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Acked-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>

libxl: fix old style declarations

Fix errors like:

/local/work/xen.git/dist/install/usr/local/include/libxl_uuid.h:59:1: error: 'static' is not at beginning of declaration [-Werror=old-style-declaration]
void static inline libxl_uuid_copy_0x040400(libxl_uuid *dst,
^
/local/work/xen.git/dist/install/usr/local/include/libxl_uuid.h:59:1: error: 'inline' is not at beginning of declaration [-Werror=old-style-declaration]

/local/work/xen.git/dist/install/usr/local/include/libxl.h:1233:1: error: 'static' is not at beginning of declaration [-Werror=old-style-declaration]
int static inline libxl_domain_create_restore_0x040200(
^

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>

hotplug/Linux: fix same_vm check in block script

The original same_vm check has two bugs. When stubdom is in use because
it relies on numeric domid to check if two domains are in fact the same
one. Another one is that the check would fail when two stubdoms are
checked against each other.

The first bug is fixed by using uuid to identify a domain. The second
bug is fixed by comparing the domains two stubdoms serve.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>

tools/libxl: Fix legacy migration following COLO backchannel breakage

c/s f5d947bf1b "tools/libxl: add back channel support to read stream"
made a bogus adjustment to libxl__stream_read_start(), including
removing the comment hinting at what was going on, which breaks
conversion of a legacy migration stream.

Symptoms look like:

  root@anonymi:~ # xl migrate domU host
  migration target: Ready to receive domain.
  Saving to migration stream new xl format (info 0x1/0x0/2677)
  xc: error: error polling suspend notification channel: -1: Internal error
  Loading new save file <incoming migration stream> (new xl fmt info 0x1/0x0/2677)
   Savefile contains xl domain config in JSON format
  Parsing config from <saved>
  libxl: error: libxl_stream_read.c:327:stream_header_done: Invalid ident: expected 0x4c6962786c466d74, got 0x01f00f0000000000
  libxl: error: libxl_utils.c:430:libxl_read_exactly: file/stream truncated reading ipc msg header from domain 1 save/restore helper stdout pipe

The adjustment is not required for backchannel support (as there is no
interaction between back channels and legacy conversion), and caused
stream->fd to be latched in the datacopier before legacy conversion
substitutes it for the fd which is the output of the conversion script.

This causes libxl to consume data from the legacy stream rather than the
v2 stream, and for the conversion script to encounter an error as the
legacy stream appears to skip ahead.

Undo the adjustments to libxl__stream_read_start(), and introduce a
better description of what is going on.  Introduce some extra assertions
to try and catch similar breakage in the future.

Reported-by: Olaf Hering <olaf@aepfle.de>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wen Congyang <wency@cn.fujitsu.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Tested-by: Olaf Hering <olaf@aepfle.de>

xen: change the sizes of memory fields in the HVM start info to be 64bits

At the moment the only consumer of this structure is x86, but other arches
might also use it, so make all the fields 64bits. On x86 Xen will still try
to place everything below the 4GiB boundary, but that might not be feasible
in other arches.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Requested-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>

libxl/save: set domain_suspend_state->domid in do_domain_soft_reset()

c/s d5c693d "libxl/save: Refactor libxl__domain_suspend_state" broke soft
reset as libxl__domain_suspend_device_model() now fails when domid in not set
in libxl__domain_suspend_state.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>

Config.mk: update mini-os changeset

[commits pulled in are:
  Fix time update
  Clean arch/x86/time.c
  Mini-OS: netfront: fix off-by-one error introduced in 7c8f3483
-iwj]

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>

drivers/pl011: ACPI: The interrupt should always be high level triggered

The SPCR does not specify if the interrupt is edge or level triggered.
So the configuration needs to be hardcoded in the code.

Based on the PL011 TRM (see 2.2.8 in ARM DDI 0183G), the interrupt generated
will be active high. Whilst the wording may be interpreted differently,
the SBSA (section 4.3.2 in ARM-DEN-0029 v2.3) states the PL011 is
implemented with a level triggered interrupt.

So the driver should configure the interrupt as high level triggered.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Shannon Zhao <shannon.zhao@linaro.org>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>

xen: sched: fix spinlock issue in schedule_cpu_switch().

Commit 94734ab7c3f5 ("xen: sched: close potential races
when switching scheduler to CPUs") buggily replaced a call
to pcpu_schedule_lock_irq() with just pcpu_schedule_lock(),
causing the relevant irq_safe vs. non-irq_safe ASSERT()
in check_lock() to trigger.

Fix that.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>

x86/pv: Correctly fold vIOPL back into vcpu_guest_context

c/s f71ecb6 "x86: introduce a new VMASSIST for architectural behaviour of
iopl" shifted the vcpu iopl field by 12, but didn't update the logic which
reconstructs the guests eflags for migration.

Existing guest kernels set a vIOPL of 1, to prevent them from faulting when
accessing IO ports. This bug manifests as a crash after migrate, as the vIOPL
reverts back to the default of 0, and the guest suffers an unexpected #GP
fault.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>

xen/arm: acpi: Print more error messages in acpi_map_gic_cpu_interface

It's helpful to spot any error without having to modify the hypervisor
code.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Shannon Zhao <shannon.zhao@linaro.org>

xen/arm: acpi: Remove uncessary check in acpi_map_gic_cpu_interface

This part of the code will never be executed when the entry
corresponds to the boot CPU.

Also print an error message rather when arch_cpu_init has failed.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Shannon Zhao <shannon.zhao@linaro.org>

xen/arm: acpi: Fix SMP support when booting with ACPI

The variable enabled_cpus is used to know the number of CPU enabled in
the MADT.

Currently this variable is used to check the validity of the boot CPU.
It will be considered invalid when "enabled_cpus > 1".

However, this condition also means that multiple CPUs are present on the
system. So secondary will never be brought up.

The correct way to check the validity of the boot CPU is to use the
variable bootcpu_valid.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Shannon Zhao <shannon.zhao@linaro.org>

xen/arm: acpi: The boot CPU does not always match the first entry in the MADT

Since the ACPI 6.0 errata document [1], the first entry in the MADT
does not have to correspond to the boot CPU.

Introduce a new variable to know if a MADT entry matching the boot CPU
is found. Furthermore, it's not necessary to check if the MPIDR is
duplicated for the boot CPU. So the rest of the function can be skipped.

[1] 1380 Unnecessary restrictions to FW vendors in ordering of GIC structures
in MADT

Signed-off-by: Julien Grall <julien.grall@arm.com>

x86: Alter nmi_callback_t typedef

Drop paranthesis and function pointer on nmi_callback_t typedef.

Make it more inline with how x86 maintainers want function
typedefs to be.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

tools/libxc: Calculate xstate cpuid leaf from guest information

The existing logic is broken for heterogeneous migration. By always
advertising the host maximum xstate, a migration to a less capable host always
fails as Xen cannot accomodate the xcr0_accum in the migration stream.

By calculating xstate from the feature information (which a multi-host
toolstack will have levelled appropriately), the guest will have the current
hosts maximum xstate advertised, allowing for correct migration to less
capable hosts.

In addition, some further improvements and corrections:
- don't discard the known flags in sub-leaves 2..63 ECX
- zap sub-leaves beyond 62
- zap all bits in leaf 1, EBX/ECX. No XSS features are currently supported.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>

tools/libxc: Use featuresets rather than guesswork

It is conceptually wrong to base a VM's featureset on the features visible to
the toolstack which happens to construct it.

Instead, the featureset used is either an explicit one passed by the
toolstack, or the default which Xen believes it can give to the guest.

Collect all the feature manipulation into a single function which adjusts the
featureset, and perform deep dependency removal.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>

tools/libxc: Wire a featureset through to cpuid policy logic

Later changes (Patch titled "tools/libxc: Use featuresets rather than
guesswork") will cause the cpuid generation logic to seed their
information from a featureset. This patch adds the infrastructure to
specify a featureset, and will obtain the appropriate defaults from Xen
if omitted.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>

tools: Utility for dealing with featuresets

It is able to reports the current featuresets; both the static masks and
dynamic featuresets from Xen, or to decode an arbitrary featureset into
`/proc/cpuinfo` style strings.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

tools/libxc: Expose the automatically generated cpu featuremask information

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

tools/libxc: Use public/featureset.h for cpuid policy generation

Rather than having a different local copy of some of the feature
definitions.

Modify the xc_cpuid_x86.c cpumask helpers to appropiately truncate the
new values.

As some of the feature have been renamed in the public API, similar renames
are made here.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

tools/libxc: Modify bitmap operations to take void pointers

The type of the pointer to a bitmap is not interesting; it does not affect the
representation of the block of bits being pointed to.

Make the libxc functions consistent with those in Xen, so they can work just
as well with 'unsigned int *' based bitmaps.

As part of doing so, change the implementation to be in terms of char rather
than unsigned long. This fixes alignment concerns with ARM.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen+tools: Export maximum host and guest cpu featuresets via SYSCTL

And provide stubs for toolstack use.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: David Scott <dave@recoil.org>
Acked-by: Jan Beulich <JBeulich@suse.com>
Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

x86/domctl: Update PV domain cpumasks when setting cpuid policy

This allows PV domains with different featuresets to observe different values
from a native cpuid instruction, on supporting hardware.

It is important to leak the host view of X2APIC, HTT and CMP_LEGACY through to
guests, even though they could be hidden. These flags affect how to interpret
other cpuid leaves which are not maskable.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

x86/pv: Provide custom cpumasks for PV domains

And use them in preference to cpumask_defaults on context switch. HVM domains
must not be masked (to avoid interfering with cpuid calls within the guest),
so always lazily context switch to the host default.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <JBeulich@suse.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

x86/cpu: Context switch cpuid masks and faulting state in context_switch()

A single ctxt_switch_levelling() function pointer is provided
(defaulting to an empty nop), which is overridden in the appropriate
$VENDOR_init_levelling().

set_cpuid_faulting() is made private and included within
intel_ctxt_switch_levelling().

One (attempted) functional change is that the faulting configuration should
not be special cased for dom0. It turns out that the toolstack relies on the
special case (and indeed, on being a PV domain in the first place) to
correctly build HVM domains.

For now, the control domain is left as a special case, until futher work can
be completed to remove the restriction.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <JBeulich@suse.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

x86/cpu: Rework Intel masking/faulting setup

This patch is best reviewed as its end result rather than as a diff, as it
rewrites almost all of the setup.

On the BSP, cpuid information is used to evaluate the potential available set
of masking MSRs, and they are unconditionally probed, filling in the
availability information and hardware defaults. A side effect of this is that
probe_intel_cpuid_faulting() can move to being __init.

The command line parameters are then combined with the hardware defaults to
further restrict the Xen default masking level.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <JBeulich@suse.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

x86/cpu: Rework AMD masking MSR setup

This patch is best reviewed as its end result rather than as a diff, as it
rewrites almost all of the setup.

On the BSP, cpuid information is used to evaluate the potential available set
of masking MSRs, and they are unconditionally probed, filling in the
availability information and hardware defaults.

The command line parameters are then combined with the hardware defaults to
further restrict the Xen default masking level.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <JBeulich@suse.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

x86/cpu: Sysctl and common infrastructure for levelling context switching

A toolstack needs to know how much control Xen has over the visible cpuid
values in PV guests. Provide an explicit mechanism to query what Xen is
capable of.

This interface will currently report no capabilities. This change is
scaffolding for future patches, which will introduce detection and switching
logic, after which the interface will report hardware capabilities correctly.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <JBeulich@suse.com>
Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

x86/cpu: Move set_cpumask() calls into c_early_init()

Before c/s 44e24f8567 "x86: don't call generic_identify() redundantly", the
commandline-provided masks would take effect in Xen's view of the processor
features.

As the masks got applied after the query for features, the redundant call to
generic_identify() would clobber the pre-masking feature information with the
post-masking information.

Move the set_cpumask() calls into c_early_init() so the effects of the command
line parameters take place before the main query for features in
generic_identify().

The cpuid_mask_* command line parameters now limit the entire system.
Subsequent changes will cause the mask MSRs to be context switched per-domain,
removing the need to use the command line parameters for heterogeneous
levelling purposes.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

xen/x86: Improvements to in-hypervisor cpuid sanity checks

Currently, {pv,hvm}_cpuid() has a large quantity of essentially-static logic
for modifying the features visible to a guest.  A lot of this can be subsumed
by {pv,hvm}_featuremask, which identify the features available on this
hardware which could be given to a PV or HVM guest.

This is a step in the direction of full per-domain cpuid policies, but lots
more development is needed for that.  As a result, the static checks are
simplified, but the dynamic checks need to remain for now.

As a side effect, some of the logic for special features can be improved.
OSXSAVE and OSPKE will be automatically cleared because of being absent in the
featuremask.  This allows the fast-forward logic to be more simple.

In addition, there are some corrections to the existing logic:

* Hiding PSE36 out of PAE mode is architecturally wrong.  It turns out that
   it was a bugfix for running HyperV under Xen, which wanted to see PSE36
   even after choosing to use PAE paging.  PSE36 is not supported by shadow
   paging, so is hidden from non-HAP guests, but is still visible for HAP
   guests.  It is also leaked into non-HAP guests when the guest is already
   running in PAE mode.
* Changing the visibility of RDTSCP based on host TSC stability or virtual
   TSC mode is bogus, so dropped.
* When emulating Intel to a guest, the common features in e1d should be
   cleared.
* The APIC bit in e1d (on non-Intel) is also a fast-forward from the
   APIC_BASE MSR.
* A guest with XSAVES and no xcr0|xss features should see
   XSTATE_AREA_MIN_SIZE in %ebx (bug in c/s 9d313bde "x86/xsaves: ebx may
   return wrong value using CPUID eax=0xd,ecx =1").

As a small improvement, use compiler-visible &'s and |'s, rather than
{clear,set}_bit().

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <JBeulich@suse.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/x86: Improve disabling of features which have dependencies

APIC and XSAVE have dependent features, which also need disabling if Xen
chooses to disable a feature.

Use setup_clear_cpu_cap() rather than clear_bit(), as it takes care of
dependent features as well.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <JBeulich@suse.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/x86: Clear dependent features when clearing a cpu cap

When clearing a cpu cap, clear all dependent features. This avoids having a
featureset with intermediate features disabled, but leaf features enabled.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <JBeulich@suse.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/x86: Generate deep dependencies of features

Some features depend on other features. Working out and maintaining the exact
dependency tree is complicated, so it is expressed in the automatic generation
script.

At runtime, Xen needs to be disable all features which are dependent on a
feature being disabled. Because of the flattening performed at compile time,
runtime can use a single mask to disable all eventual features.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86: introduce a new VMASSIST for architectural behaviour of iopl

The existing vIOPL interface is hard to use, and need not be.

Introduce a VMASSIST with which a guest can opt-in to having vIOPL behaviour
consistenly with native hardware.

Specifically:
- virtual iopl updated from do_iret() hypercalls.
- virtual iopl reported in bounce frames.
- guest kernels assumed to be level 0 for the purpose of iopl checks.

v->arch.pv_vcpu.iopl is altered to store IOPL shifted as it would exist
eflags, for the benefit of the assembly code.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/vMSI-X: fix qword write covering vector control field

Along with using the upper 32 bits of the written value, the address
also needs advancing, so that msix_write_completion() will use the
correct address for re-invocation of msixtbl_write().

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

mwait-idle: support for Intel Xeon Phi Processor x200 Product Family

Enables "Intel(R) Xeon Phi(TM) Processor x200 Product Family" support,
formerly code-named KNL. It is based on modified Intel Atom Silvermont
microarchitecture.

Signed-off-by: Dasaratharaman Chandramouli <dasaratharaman.chandramouli@intel.com>
[micah.barany@intel.com: adjusted values of residency and latency]
Signed-off-by: Micah Barany <micah.barany@intel.com>
[Linux commit: 281baf7a702693deaa45c98ef0c5161006b48257]
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

mwait-idle: prevent SKL-H boot failure when C8+C9+C10 enabled

Some SKL-H configurations require "max_cstate=7" to boot.
While that is an effective workaround, it disables C10.

This patch detects the problematic configuration,
and disables C8 and C9, keeping C10 enabled.

Note that enabling SGX in BIOS SETUP can also prevent this issue,
if the system BIOS provides that option.

https://bugzilla.kernel.org/show_bug.cgi?id=109081
"Freezes with Intel i7 6700HQ (Skylake), unless intel_idle.max_cstate=7"

Signed-off-by: Len Brown <len.brown@intel.com>
[Linux commit: d70e28f57e14a481977436695b0c9ba165472431]

Adjust to Xen infrastructure.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

x86: calculate maximum host and guest featuresets

All of this information will be used by the toolstack to make informed
levelling decisions for VMs, and by Xen to sanity check toolstack-provided
information.

The split between the shadow and hap HVM masks is necessary due to the lack of
a "get cpuid policy" hypercall. Multi-host toolstacks (i.e. not libxl)
dealing with hap and non-hap capable hosts need to be able to calculate that
migrating a shadow guest is safe.

Future planned development work will implement proper cpuid policy handing in
Xen, including a "get policy" hypercall, but until then, the difference is
made available for toolstack use via a non-stable interface.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86: annotate VM applicability in featureset

Use attributes to specify whether a feature is applicable to be exposed to:
1) All guests
2) HVM guests
3) HVM HAP guests
and, via absence of an attribute, to no guests.

There is no current need for other categories (e.g. PV-only features), and
such categories should not be introduced if possible.  These categories follow
from the fact that, with increased hardware support, a guest gets more
features to use.

These settings are derived from the existing code in {pv,hvm}_cpuid(), and
xc_cpuid_x86.c.  One notable exception is EXTAPIC which was previously
erroneously exposed to guests.  PV guests don't get to use the APIC and the
HVM APIC emulation doesn't support extended space.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

Revert "xen: change the sizes of fields in the HVM start info layout to be 64bits"

This reverts commit 1aee836fe2829c39775c6434bf914f752521cfa7.

This patch was nacked after commit. Jan writes:

And already applied as I see. I think rushing things in like this is
not a solution, no matter that we want to freeze the tree today.
Changes like this should be free to go in as de-facto bug fixes
after the freeze date.

libxl: remove code added to use the 'phy' backend with CDROM devices

This is a partial revert of 612f15, that allowed CDROM devices to use the
'phy' PV backend. Due to limitations in the current implementation of the
libxl_cdrom_insert function, the PV backend used in conjunction with an
emulated CDROM device must always be Qdisk at the moment. This is due to
libxl_cdrom_insert not running disk hotplug scripts on plug and unplug of PV
CDROM backends (and possibly other yet to be identified issues).

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>

libxl: only allow guests with a device model to use cd-{eject/insert}

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>

libxl: set the backend type to Qdisk for CDROM devices on DM HVM guests

This is needed because the cd-{insert/eject} functions are not prepared to
deal with blkback, which would be used by default if no backend was
specified.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>

libxl: set the device model version earlier in xenstore

So libxl doesn't have to pass the build info around just to get the device
model used by the guest. This allows to simplify
libxl__device_nic_setdefault.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>

Merge branch 'staging' of xenbits.xen.org:/home/xen/git/xen into staging

xen/arm64: correctly emulate the {w, x}zr registers

On AArch64, encoding 31 for an R<n> in the HSR is used to represent
either {w,x}sp or {w,x}zr (See C1.2.4 in ARM DDI 0486A.d) depending on
how the register field is interpreted by the instruction.

All the instructions trapped by Xen (either via a sysreg access or
data abort) interpret encoding 31 as {w,x}zr. Therefore we don't have
to worry about the possibility that a trap could refer to sp or about
decoding the instruction.

For example AArch64 LDR and STR can have zr in the source/target
register <Xt>, but never sp. sp can be present in the destination
pointer( i.e. "[sp]"), but that would be represented by the value of
FAR_EL2, not in the HSR.

For AArch32 it is possible for a LDR to target the PC, but this would
not result in a valid ISS in the HSR register. However this could only
occur if loading or storing the PC to MMIO, which we simply choose not
to support for now.

Finally, features such as xenaccess can lead to us trapping on
arbitrary instructions accessing RAM and not just for MMIO. However in
many such cases HSR.ISS is not valid and in general features such as
xenaccess do not rely on the nature of the specific instruction, they
resolve the fault (via information found elsewhere e.g. FAR_EL2)
without needing to know anything about the instruction which triggered
the trap.

The register zr represents the zero register, i.e it will always
return 0 and write to it is ignored. To properly handle this property,
2 new helpers have been introduced {get,set}_user_reg to read/write a
value from/to a register. All the calls to select_user_reg have been
replaced by these 2 helpers.

Furthermore, the code to emulate encoding 31 in select_user_reg has been
dropped because it was invalid. For Aarch64 context, the encoding is
used for sp or zr. For AArch32 context, the ISS won't be valid for data
abort from AArch32 using r15 (i.e pc) as source/destination (See D7-1881
ARM DDI 0487A.d, note the validity is more restrictive than on ARMv7).
It's also not possible to use r15 in co-processor instructions.

This patch fixes setting MMIO register and sysreg to a random value
(actually PC) instead of zero by something like:

*((volatile int*)reg) = 0;

compilers tend to generate "str wzr, [xx]" here.

[ian: added BUG_ON to select_user_reg and clarified bits of the commit message]
Reported-by: Marc Zyngier <Marc.Zyngier@arm.com>
Signed-off-by: Julien Grall <julien.grall@arm.com>
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>

xen/arm64: check XSM Magic from the second unknown module.

This patch adds a has_xsm_magic helper function for detecting XSM
from the second unknown module.

If Xen can't get the kind of module from compatible, we guess the kind of
these unknowns respectively:
    (1) The first unknown must be kernel.
    (2) Detect the XSM Magic from the 2nd unknown:
        a. If it's XSM, set the kind as XSM, and that also means we
won't load ramdisk;
b. if it's not XSM, set the kind as ramdisk.
So if user want to load ramdisk, it must be the 2nd unknown.
We also detect the XSM Magic for the following unknowns, then set its kind
according to the return value of has_xsm_magic.

By this way, arm64 behavior can be compatible to x86 and can simplify
multi-arch bootloader such as GRUB.

Signed-off-by: Fu Wei <fu.wei@linaro.org>
Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Acked-by: Julien Grall <julien.grall@arm.com>

xen: change the sizes of fields in the HVM start info layout to be 64bits

At the moment the only consumer of this structure is x86, but other arches
might also use it, so make all the fields 64bits. On x86 Xen will still try
to place everything below the 4GiB boundary, but that might not be feasible
in other arches.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Requested-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

xen/arm: map_dev_mmio_region: printk should be ratelimited

The function map_dev_mmio_region is used in a hypercall. Therefore all
printks should be ratelimited to avoid a malicious guest flooding the
console.

Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Julien Grall <julien.grall@arm.com>

flask: change default state to enforcing

The previous default of "permissive" is meant for developing or
debugging a disaggregated system.  However, this default makes it too
easy to accidentally boot a machine in this state, which does not place
any restrictions on guests.  This is not suitable for normal systems
because any guest can perform any operation (including operations like
rebooting the machine, kexec, and reading or writing another domain's
memory).

This change will cause the boot to fail if you do not specify an XSM
policy during boot; if you need to load a policy from dom0, use the
"flask=late" boot parameter.

Original patch by Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>; modified
to also change the default value of flask_enforcing so that the policy
is not still in permissive mode.  This also removes the (no longer
documented) command line argument directly changing that variable since
it has been superseded by the flask= parameter.

Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

arm: Fix asynchronous aborts (SError exceptions) due to bogus PTEs

ARMv8 architecture allows performing prefetch data/instructions
from memory locations marked as normal memory. Prefetch does not
mean that the data/instruction has to be used/executed in code
flow. All PTEs that appear to be valid to MMU must contain valid
physical address with proper attributes otherwise MMU table walk
might cause imprecise asynchronous aborts.

The way current XEN code is preparing page tables for frametable
and xenheap memory can create bogus PTEs. This patch fixes the
issue by clearing page table memory before populating EL2 L0/L1
PTEs. Without this patch XEN crashes on Qualcomm Technologies
server chips due to asynchronous aborts.

The speculative/prefetch feature explanation is scattered everywhere
in ARM specification but below two sections have useful information.

E2.8 Memory types and attributes (ver DDI0487A_h)
G4.12.6 External abort on a translation table walk (ver DDI0487A_h)

Signed-off-by: Vikram Sethi <vikrams@codeaurora.org>
Signed-off-by: Shanker Donthineni <shankerd@codeaurora.org>
Acked-by: Julien Grall <julien.grall@arm.com>

xen: sched: implement vcpu hard affinity in Credit2

as it was still missing.

Note that this patch "only" implements hard affinity,
i.e., the possibility of specifying on what pCPUs a
certain vCPU can run. Soft affinity (which express a
preference for vCPUs to run on certain pCPUs) is still
not supported by Credit2, even after this patch.

Signed-off-by: Justin Weaver <jtweaver@hawaii.edu>
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>

xen: sched: provide some scratch space for not putting cpumasks on stack

directly, from schedule.c, for any scheduler that needs
it to use it.

In fact, Credit1 and RTDS needs this already. Credit2 is
also going to need it, for supporting hard affinity
(which is, typically, what requires a lot of cpumask
manipulations, inside various functions).

Therefore, let's define the scratch space at a broader
scope, to limit code duplication in handling it.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@eu.citrix.com>

xen: sched: per-core runqueues as default in credit2

Experiments have shown that arranging the scheduing
runqueues on a per-core basis yields better results,
in most cases.

Such evaluation has been done, for the first time,
by Uma Sharma, during her participation to OPW. Some
of the results she got are summarized here:

http://lists.xen.org/archives/html/xen-devel/2015-03/msg01499.html

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Signed-off-by: Uma Sharma <uma.sharma523@gmail.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>

xen: sched: allow for choosing credit2 runqueues configuration at boot

In fact, credit2 uses CPU topology to decide how to arrange
its internal runqueues. Before this change, only 'one runqueue
per socket' was allowed. However, experiments have shown that,
for instance, having one runqueue per physical core improves
performance, especially in case hyperthreading is available.

In general, it makes sense to allow users to pick one runqueue
arrangement at boot time, so that:
- more experiments can be easily performed to even better
assess and improve performance;
- one can select the best configuration for his specific
use case and/or hardware.

This patch enables the above.

Note that, for correctly arranging runqueues to be per-core,
just checking cpu_to_core() on the host CPUs is not enough.
In fact, cores (and hyperthreads) on different sockets, can
have the same core (and thread) IDs! We, therefore, need to
check whether the full topology of two CPUs matches, for
them to be put in the same runqueue.

Note also that the default (although not functional) for
credit2, since now, has been per-socket runqueue. This patch
leaves things that way, to avoid mixing policy and technical
changes.

Finally, it would be a nice feature to be able to select
a particular runqueue arrangement, even when creating a
Credit2 cpupool. This is left as future work.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Signed-off-by: Uma Sharma <uma.sharma523@gmail.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>

xen: sched: fix per-socket runqueue creation in credit2

The credit2 scheduler tries to setup runqueues in such
a way that there is one of them per each socket. However,
that does not work. The issue is described in bug #36
"credit2 only uses one runqueue instead of one runq per
socket" (http://bugs.xenproject.org/xen/bug/36), and a
solution has been attempted by an old patch series:

http://lists.xen.org/archives/html/xen-devel/2014-08/msg02168.html

Here, we take advantage of the fact that now initialization
happens (for all schedulers) during CPU_STARTING, so we
have all the topology information available when necessary.

This is true for all the pCPUs _except_ the boot CPU. That
is not an issue, though. In fact, no runqueue exists yet
when the boot CPU is initialized, so we can just create
one and put the boot CPU in there.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>

xen: sched: on credit2, don't reprogram the timer if idle

As other schedulers are doing already: if the idle vcpu
is picked and scheduled, there is no need to reprogram the
scheduler timer to fire and invoke csched2_schedule()
again in future.

Tickling or external events will serve as pokes, when
necessary, but until we can, we should just stay idle.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reported-by: Tianyang Chen <tiche@seas.upenn.edu>
Suggested-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>

xen: sched: improve credit2 bootparams' scope, placement and signedness

and, while we are adjusting signedness of opt_load_window_shift,
make also prv->load_window_shift unsigned, as approapriate.

Signed-off-by: Uma Sharma <uma.sharma523@gmail.com>
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>

xen: sched: close potential races when switching scheduler to CPUs

In short, the point is making sure that the actual switch
of scheduler and the remapping of the scheduler's runqueue
lock occur in the same critical section, protected by the
"old" scheduler's lock (and not, e.g., in the free_pdata
hook, as it is now for Credit2 and RTDS).

Not doing  so, is (at least) racy. In fact, for instance,
if we switch cpu X from, Credit2 to Credit, we do:

schedule_cpu_switch(x, csched2 --> csched):
   //scheduler[x] is csched2
   //schedule_lock[x] is csched2_lock
   csched_alloc_pdata(x)
   csched_init_pdata(x)
   pcpu_schedule_lock(x) ----> takes csched2_lock
   scheduler[X] = csched
   pcpu_schedule_unlock(x) --> unlocks csched2_lock
   [1]
   csched2_free_pdata(x)
     pcpu_schedule_lock(x) --> takes csched2_lock
     schedule_lock[x] = csched_lock
     spin_unlock(csched2_lock)

While, if we switch cpu X from, Credit to Credit2, we do:

schedule_cpu_switch(X, csched --> csched2):
   //scheduler[x] is csched
   //schedule_lock[x] is csched_lock
   csched2_alloc_pdata(x)
   csched2_init_pdata(x)
     pcpu_schedule_lock(x) --> takes csched_lock
     schedule_lock[x] = csched2_lock
     spin_unlock(csched_lock)
   [2]
   pcpu_schedule_lock(x) ----> takes csched2_lock
   scheduler[X] = csched2
   pcpu_schedule_unlock(x) --> unlocks csched2_lock
   csched_free_pdata(x)

And if we switch cpu X from RTDS to Credit2, we do:

schedule_cpu_switch(X, RTDS --> csched2):
   //scheduler[x] is rtds
   //schedule_lock[x] is rtds_lock
   csched2_alloc_pdata(x)
   csched2_init_pdata(x)
     pcpu_schedule_lock(x) --> takes rtds_lock
     schedule_lock[x] = csched2_lock
     spin_unlock(rtds_lock)
   pcpu_schedule_lock(x) ----> takes csched2_lock
   scheduler[x] = csched2
   pcpu_schedule_unlock(x) --> unlocks csched2_lock
   rtds_free_pdata(x)
     spin_lock(rtds_lock)
     ASSERT(schedule_lock[x] == rtds_lock) [3]
     schedule_lock[x] = DEFAULT_SCHEDULE_LOCK [4]
     spin_unlock(rtds_lock)

So, the first problem is that, if anything related to
scheduling, and involving CPU, happens at [1] or [2], we:
- take csched2_lock,
- operate on Credit1 functions and data structures,
which is no good!

The second problem is that the ASSERT at [3] triggers, and
the third that at [4], we screw up the lock remapping we've
done for ourself in csched2_init_pdata()!

The first problem arises because there is a window during
which the lock is already the new one, but the scheduler is
still the old one. The other two, becase we let schedulers
mess with the lock (re)mapping done by others.

This patch, therefore, introduces a new hook in the scheduler
interface, called switch_sched, meant at being used when
switching scheduler on a CPU, and implements it for the
various schedulers, so that things are done in the proper
order and under the protection of the best suited (set of)
lock(s). It is necessary to add the hook (as compared to
keep doing things in generic code), because different
schedulers may have different locking schemes.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Robert VanVossen <robert.vanvossen@dornerworks.com>

xl: make return type of create_domain() more consistent.

create_domain() is of uint32_t return type, because on
success it returns the domid of the new domain, and
uint32_t is what we typically use for domid-s.

However, on failure, it returns ERROR_FAIL or ERROR_INVAL,
which are -3 and -6. Callers assign the return value to an
'int rc' variable and then check for '(rc < 0)'.

Although things work, and no tool (compiler, Coverity, ecc.)
is complaining, using 'int' as return type seems better.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>

xl: improve exit codes of debug related functions

by making them more consistent with other examples in xl.

Signed-off-by: Harmandeep Kaur <write.harmandeep@gmail.com>
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>

xl: improve exit codes of some of the domain handling functions

by making them more consistent with other examples in xl.

Affected functions are the ones related to console, vnc,
dump, destroy, shutdown, list, domid and domname.

Signed-off-by: Harmandeep Kaur <write.harmandeep@gmail.com>
Signed-off-by: Dario Fagggioli <dario.faggioli@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>

xl: improve exit codes of save/restore and migration functions

by making them more consistent with other examples in xl.

Macros CHK_ERRNOVAL, CHK_SYSCALL, MUST are also updated.

Signed-off-by: Harmandeep Kaur <write.harmandeep@gmail.com>
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>

xl: improve return and exit codes of memory related functions

by making them more consistent with other examples in xl.

While there, make freemem() of boolean return type, which
looks more natural, and add comment explaining why
parse_mem_size_kb() needs to diverge from the pattern.

Signed-off-by: Harmandeep Kaur <write.harmandeep@gmail.com>
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>

libxl: replace the usage of uuid_t with a char array

The internals of the uuid_t struct don't match a big endian octet stream on
*BSD systems, which means that it cannot be directly casted to a
uint8_t[16].

In order to solve that change the type to be an unsigned char[16], which
doesn't imply any other change on Linux. On *BSDs change the helpers so that
the uuid is always stored as a big endian byte stream.

NB: tested on FreeBSD and Linux only.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Discussed-with: Ian Jackson <Ian.Jackson@eu.citrix.com>
Discussed-with: Wei Liu <wei.liu2@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>

tools: handle xl migrate --debug in legacy stream

Doing a 'xl migrate --debug domU host' on xen-4.5 adds a
XC_SAVE_ID_ENABLE_VERIFY_MODE marker, which is not handled.
Since using --debug is valid usage, handle it by logging the fact
instead of aborting the migration.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>

libxl: bind_usbintf: do not reuse 'path'

To avoid confusion, use "intf_path" to indicate driver/interface path,
and "bind_path" indicate driver/bind path.

CID: 1358111

Signed-off-by: Chunyan Liu <cyliu@suse.com>
CC: Simon Cao <caobosimon@gmail.com>
CC: George Dunlap <george.dunlap@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>

libxl: Set rc on failure of usbdev_busaddr_to_busid

We must set rc before using `goto out'.

Bug introduced in bf7628f0 "libxl: add pvusb API".

CID: 1358113
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
CC: coverity@xenproject.org
CC: Simon Cao <caobosimon@gmail.com>
CC: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Chunyan Liu <cyliu@suse.com>

libxl: fix rc handling in libxl_device_usbdev_list

In testing with libvirt pvusb functionality, found a rc check
error in libxl_device_usbdev_list. Correct it. This function
is not used by xl.

Signed-off-by: Chunyan Liu <cyliu@suse.com>
CC: Simon Cao <caobosimon@gmail.com>
CC: George Dunlap <george.dunlap@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>

x86: remove the use of vm86_mode()

Xen, being 64bit only, cannot run PV guests in vm86 mode.  HVM guests however
can be running in vm86 mode, and common codepaths need to be able to cope.

The definition of vm86_mode() in x86_64/regs.h is incorrect, as the predicate
is used by non-PV codepaths.

One buggy use is in hvm/emulate.c.  An HVM guest can be in vm86 mode, and
vm86_mode() sliently omits the check.  Luckily, due to the VEX prefix decoding
logic in x86_emulate(), there is no path to the erronious use with EFLAGS_VM
set.

Another potentially problematic use is in show_guest_stack().  In principle,
show_guest_stack() is common code called for both PV and HVM vcpus.  HVM vcpus
exit early (with no reasonable way of making the code generic), making this
part a PV-only codepath.

Open-code its use in emulate.c, matching the surrounding code.  This causes
all other uses to be in PV-only codepaths, making the code to be logically
dead.  Drop it completely, to avoid future misuse.

Part of resulting cleanup removes vm86attr from read_descriptor(), although
retaining one relevant piece of information; i.e. whether we are reading a
selector for an instruction fetch, or a data fetch.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/xsaves: ebx may return wrong value using CPUID eax=0xd,ecx =1

Refer to SDM Volume 1 Extended Region of an XSAVE Area. The value returned
by ecx[1] with cpuid function 0xd and sub-function i (i>1) indicates
the alignment of the state component i when the compacted format of the
extended region of an xsave area is used.

So when hvm guest using CPUID eax=0xd, ecx=1 to get the size of area
used for compacted format, we need to take alignment into consideration.

tools side is fixed by
"tools/libxc: Calculate xstate cpuid leaf from guest information"
by Andrew Cooper

Signed-off-by: Shuai Ruan <shuai.ruan@intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/xsaves: fix two miscellaneous issues

1. get_xsave_addr() will only be called when
xsave_area_compressed(xsave) is true. So drop the
conditional expression.

2. expand_xsave_states() will memset the area when
get NULL from get_xsave_addr().

Reported-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Shuai Ruan <shuai.ruan@intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

build: fix build with Clang

c/s 607044bf9 "build: avoid putting local absolute symbols in symbol tables"
breaks the build with Clang, as the command line argument isn't understood.

Clang does not appear to have any equivielent option, and already has
outstanding issues with duplicate symbols. Excluding this option makes the
problem no worse.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>

sched: move pCPU initialization in a helper

That will turn out useful in following patches, where such
code will need to be called more than just once. Create an
helper now, and move the code there, to avoid mixing code
motion and functional changes later.

In Credit2, some style cleanup is also done.

No functional change intended.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@eu.citrix.com>

sched: implement .init_pdata in Credit, Credit2 and RTDS

In fact, if a scheduler needs per-pCPU information,
that needs to be initialized appropriately. So, we take
the code that is performing initializations from (right
now) .alloc_pdata, and use it for .init_pdata, leaving
only actualy allocations in the former, if any (which
is the case in RTDS and Credit1).

On the other hand, in Credit2, since we don't really
need any per-pCPU data allocation, everything that was
being done in .alloc_pdata, is now done in .init_pdata.
And the fact that now .alloc_pdata can be left undefined,
allows us to just get rid of it.

Still for Credit2, the fact that .init_pdata is called
during CPU_STARTING (rather than CPU_UP_PREPARE) kills
the need for the scheduler to setup a similar callback
itself, simplifying the code.

And thanks to such simplification, it is now also ok to
move some of the logic meant at double checking that a
cpu was (or was not) initialized, into ASSERTS (rather
than an if() and a BUG_ON).

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: Meng Xu <mengxu@cis.upenn.edu>
Reviewed-by: George Dunlap <george.dunlap@eu.citrix.com>

sched: make implementing .alloc_pdata optional

The .alloc_pdata scheduler hook must, before this change,
be implemented by all schedulers --even those ones that
don't need to allocate anything.

Make it possible to just use the SCHED_OP(), like for
the other hooks, by using ERR_PTR() and IS_ERR() for
error reporting. This:
- makes NULL a variant of success;
- allows for errors other than ENOMEM to be properly
communicated (if ever necessary).

This, in turn, means that schedulers not needing to
allocate any per-pCPU data, can avoid implementing the
hook. In fact, the artificial implementation of
.alloc_pdata in the ARINC653 is removed (and, while there,
nuke .free_pdata too, as it is equally useless).

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: Meng Xu <mengxu@cis.upenn.edu>
Reviewed-by: Juergen Gross <jgross@suse.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>

libxl: pvusb: Correctly check the controller type

Missing a check of controller type.

Signed-off-by: Chunyan Liu <cyliu@suse.com>
CC: Simon Cao <caobosimon@gmail.com>
CC: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Acked-by: Ian Jackson <Ian.Jackson@eu.citrix.com>

libxl: libxl_pvusb.c: Remove redundant lstat

There is no harm in calling realpath, and we can handle ENOENT then.

CID: 1358112

Signed-off-by: Chunyan Liu <cyliu@suse.com>
CC: Simon Cao <caobosimon@gmail.com>
CC: George Dunlap <george.dunlap@citrix.com>
Acked-by: Ian Jackson <Ian.Jackson@eu.citrix.com>

libxl: libxl_write_exactly: correct argument to sizeof

sizeof is wrongly used in libxl_write_exactly function, using
strlen instead.

CID: 1358110
CID: 1358109

Signed-off-by: Chunyan Liu <cyliu@suse.com>
CC: Simon Cao <caobosimon@gmail.com>
CC: George Dunlap <george.dunlap@citrix.com>
Acked-by: Ian Jackson <Ian.Jackson@eu.citrix.com>

libxc: fix uninitialized variable when changing rtds scheduling parameters

Commit 046c2b503a89d21b41e4d555a9f75d02af00dbc6 introduces a build
failure: in some cases (e.g., num_vcpus <=0),
xc_sched_rtds_vcpu_get/set returns an uninitialized variable.

Fix it.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Chong Li <chong.li@wustl.edu>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

xl: enable per-VCPU parameter for RTDS

Change main_sched_rtds and related output functions to support
per-VCPU settings.

Signed-off-by: Chong Li <chong.li@wustl.edu>
Signed-off-by: Meng Xu <mengxu@cis.upenn.edu>
Signed-off-by: Sisu Xi <xisisu@gmail.com>
Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>

libxl: enable per-VCPU parameter for RTDS

Add libxl_vcpu_sched_params_get/set and sched_rtds_vcpu_get/set
functions to support per-VCPU settings.

Signed-off-by: Chong Li <chong.li@wustl.edu>
Signed-off-by: Meng Xu <mengxu@cis.upenn.edu>
Signed-off-by: Sisu Xi <xisisu@gmail.com>
Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>

libxc: enable per-VCPU parameter for RTDS

Add xc_sched_rtds_vcpu_get/set functions to interact with
Xen to get/set a domain's per-VCPU parameters.

Signed-off-by: Chong Li <chong.li@wustl.edu>
Signed-off-by: Meng Xu <mengxu@cis.upenn.edu>
Signed-off-by: Sisu Xi <xisisu@gmail.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com>

hotplug/FreeBSD: document disk hotplug interface

Add the FreeBSD disk hotplug interface details to the block-scripts.txt
document.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>

libxl: fix error message in local_device_attach_cb

The fields that are printed might not be set in the case of a failure, which
generates a segmentation fault.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>

libxl: add a FreeBSD implementation of libxl__devid_to_localdev

This code is extracted from the FreeBSD blkfront implementation.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>

libxl: properly use vdev vs local device

The current code in libxl assumed that vdev is equal to local device, but
this is only true for Linux systems. In other OSes the local device can use
a nomenclature completely different from the virtual device one.

Move the current libxl__devid_to_localdev Linux implementation out of the
OS-specific file and rename it to libxl__devid_to_vdev, and then make sure
local_device_attach_cb return the local device in the diskpath field.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>

libxl: add support for disk hotplug scripts on FreeBSD

Allow FreeBSD to execute hotplug scripts when attaching disk devices.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

libxl: refactor the FreeBSD hotplug script code

This factors out the nic hotplug specific code from the common code path in
order to make it easier to add support for disk hotplug scripts. It
shouldn't include any functional change.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>

hotplug/FreeBSD: add block hotplug script

This is the default hotplug script for block devices. Its only job is to
copy the "params" blkback xenstore node to "physical-device".

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>

blkif: document how FreeBSD uses the physical-device backend node

FreeBSD blkback uses the physical-device-path xenstore node in order to
fetch the path to the underlying backing storage (either a block device or
raw image). This node is set by the hotplug scripts. Also clarify the usage
of the physical-device node.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>

blktap2: Invalid logic detecting unaligned buffers in vhd_write_block

It seems the logic is meant to detect sector unaligned buffers for block
writes. The NOTing of the logic instead masks off any unaligned bits and
also would cause the function to always fail. It seems the function is not
used in any of the tools so that is probably why the problem is not seen.
In the vhd_read_block function it is correct.

Signed-off-by: Ross Philipson <ross.philipson@ainfosec.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>

libxl: colo: make it depend on availability of libnl

Netlink is required when initialising COLO, so make sure only to compile
COLO only when netlink is available. Change the inclusion of
linux/netlink.h to netlink/netlink.h so that it doesn't use Linux kernel
header directly.

Provide necessary stub functions in case COLO is disabled. This should
fix build on FreeBSD because there is no netlink there. It would also
make libxl build properly when netlink is not present on a Linux
system.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>

build: rename CONFIG_REMUS_NETBUF to CONFIG_LIBNL

COLO and Remus net buffer support both depend on the availability of
libnl. Use a generic name.

No functional changes.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>

libxl: colo: move netlink related stuff to libxl_colo_proxy.c

They are only used there, no need to expose them to other parts of
libxl.

This is necessary to make libxl build on FreeBSD again because FreeBSD
doesn't have netlink.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>