dgit.raspbian.org Git

libxl/xl: add support for Xen 9pfs

Add functions to libxl to setup a Xen 9pfs frontend/backend connection.
Add support to xl to parse a 9pfs option in the VM config file, in the
following format:

p9=["tag=share_dir,security_model=none,path=/root/share_dir"]

where tag identifies the 9pfs share and it is required to mount it on
the guest side, path is the path of the filesystem to share and the only
security_model supported is "none" which means that files are stored
using the same credentials as they are created on the guest (no user
ownership squash or remap).
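
As an illustration of the option format only, the following Python sketch
splits one p9 entry into its keys. The function name parse_p9_spec and the
validation rules are assumptions made for this example; the real parsing
is done in xl, in C, and may differ in detail.

```python
def parse_p9_spec(spec):
    """Split one p9 entry such as 'tag=x,security_model=none,path=/p'."""
    fields = {}
    for item in spec.split(","):
        key, sep, value = item.partition("=")
        if not sep or not value:
            raise ValueError("malformed p9 item: %r" % item)
        fields[key.strip()] = value.strip()
    # Assumption for this sketch: all three keys must be present.
    for required in ("tag", "path", "security_model"):
        if required not in fields:
            raise ValueError("missing p9 key: %s" % required)
    # The commit message states that "none" is the only supported model.
    if fields["security_model"] != "none":
        raise ValueError("unsupported security_model")
    return fields
```

Note that this naive split would mishandle a path containing a comma.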

Signed-off-by: Stefano Stabellini <stefano@aporeto.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>

tools/firmware: add ACPI device for Windows laptop/slate mode switch

Microsoft have defined an ACPI device to support switching Windows 10
between laptop/desktop mode and slate/tablet mode [1].

This patch adds an SSDT containing such a device. The presence of the
device is controlled by a new 'acpi_laptop_slate' boolean in xl.cfg.
The new device will not be present by default.

[1] https://msdn.microsoft.com/en-us/windows/hardware/commercialize/design/device-experiences/continuum

Signed-off-by: Owen Smith <owen.smith@citrix.com>
Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>

x86/emul: Add feature check for clzero

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/emul: Correct the decoding of vlddqu

vlddqu is encoded with 0xf2 which causes it to fall into the Scalar general
case in x86_decode_twobyte(). However, it really does have just two operands,
so must remain TwoOp.

AFL discovered that the instruction c5 5b f0 3c e5 95 0a cd 63 was considered
valid despite it being a two operand instruction and VEX.vvvv having the value
11. The resulting use in a stub yielded #UD.
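
The byte sequence above can be picked apart with a short Python sketch. In
the two-byte VEX form, the byte after 0xc5 holds, from the top bit down,
the inverted R bit, the four-bit vvvv field, L, and the two-bit pp field.
The helper name decode_vex2 is hypothetical, and reading "11" as the raw
vvvv field value (binary 1011) is an assumption of this example.

```python
def decode_vex2(byte1):
    """Decode the payload byte of a two-byte (0xc5) VEX prefix."""
    return {
        "r_inverted": (byte1 >> 7) & 0x1,
        "vvvv_raw": (byte1 >> 3) & 0xF,  # raw field, stored inverted
        "l": (byte1 >> 2) & 0x1,
        "pp": byte1 & 0x3,               # 0 = none, 1 = 66, 2 = f3, 3 = f2
    }

insn = bytes.fromhex("c55bf03ce5950acd63")
fields = decode_vex2(insn[1])
opcode = insn[2]
```

Here fields["pp"] selects the 0xf2 prefix and opcode is 0xf0. An
instruction with no vvvv operand requires the raw field to be 1111b, which
this encoding does not satisfy.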

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>

Merge XSA 206 branch

oxenstored transaction conflicts: improve logging

For information related to transaction conflicts, potentially frequent
logging at "info" priority has been changed to "debug" priority, and
once per two minutes there is an "info" priority summary.

Additional detailed logging has been added at "debug" priority.

Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com>

oxenstored: don't wake to issue no conflict-credit

In the main loop, when choosing the timeout for the select function
call, we were setting it so as to wake up to issue conflict-credit to
any domains that could accept it. When xenstore is idle, this would
mean waking up every 50ms (by default) to do no work. With this
commit, we check whether any domain is below its cap, and if not then
we set the timeout for longer (the same timeout as before the
conflict-protection feature was added).
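
A minimal Python sketch of the timeout choice, for illustration only
(oxenstored itself is written in OCaml). The function name, the dict
layout and the idle timeout value are assumptions; only the 50ms default
comes from the text above.

```python
def select_timeout(domains, credit_interval, idle_timeout):
    """Pick the main-loop timeout, in seconds."""
    # Wake early only if at least one domain could accept more credit.
    needs_credit = any(d["credit"] < d["cap"] for d in domains)
    return credit_interval if needs_credit else idle_timeout
```

With every domain at its cap, the loop sleeps for idle_timeout instead of
waking every credit_interval to do no work.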

Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com>
Reviewed-by: Jonathan Davies <jonathan.davies@citrix.com>

oxenstored: do not commit read-only transactions

The packet telling us to end the transaction has always carried an
argument telling us whether to commit.

If the transaction made no modifications to the tree, now we ignore
that argument and do not commit: it is just a waste of effort.

This makes read-only transactions immune to conflicts, and means that
we do not need to store any of their details in the history that is
used for assigning blame for conflicts.

We count a transaction as a read-only transaction only if it contains
no operations that modified the tree.

This means that (for example) a transaction that creates a new node
then deletes it would NOT count as read-only, even though it makes no
change overall. A more sophisticated algorithm could judge the
transaction based on comparison of its initial and final states, but
this would add complexity and computational cost.
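
The rule can be sketched in Python as follows. The class and function
names are hypothetical and the tree itself is not modelled; the sketch
only shows that any modifying operation marks the transaction, regardless
of the net effect.

```python
class Transaction:
    def __init__(self):
        self.ops = []
        self.modified = False  # set by any tree-modifying operation

    def read(self, path):
        self.ops.append(("read", path))

    def write(self, path, value):
        self.ops.append(("write", path, value))
        self.modified = True

    def rm(self, path):
        self.ops.append(("rm", path))
        self.modified = True


def end_transaction(txn, commit_requested):
    """Return True if a commit is actually attempted."""
    if not txn.modified:
        return False  # read-only: the commit argument is ignored
    return commit_requested
```

A transaction that writes a node and then removes it still has modified
set, so it is committed when asked, exactly as described above.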

Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com>
Reviewed-by: Jonathan Davies <jonathan.davies@citrix.com>

oxenstored: allow self-conflicts

We already avoid inter-domain conflicts but now allow intra-domain
conflicts. Although there are no known practical examples of a domain
that might perform operations that conflict with its own transactions,
this is conceivable, so here we avoid changing those semantics
unnecessarily.

When a transaction commit fails with a conflict and we look through
the history of commits to see which connection(s) to blame, ignore
historical commits that were made by the same connection as the
failing commit.

Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com>
Reviewed-by: Jonathan Davies <jonathan.davies@citrix.com>

oxenstored: blame the connection that caused a transaction conflict

Blame each connection found to have made a commit that would cause this
transaction to fail. Each blamed connection is penalised by having its
conflict-credit decremented.

Note the change in semantics for the replay function: we no longer stop after
finding the first operation that can't be replayed. This allows us to identify
all operations that conflicted with this transaction, not just the one that
conflicted first.
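
The "no early stop" behaviour can be sketched in Python. This is a
simplified model, not the OCaml implementation: a conflict is assumed here
to be any overlap between sets of paths, whereas oxenstored replays the
operations. All names are hypothetical. Skipping the failing connection's
own commits follows the "allow self-conflicts" change in this log, and
penalising each connection once per failed commit is an assumption.

```python
def blame_conflicts(failed_conn, failed_paths, history, credit):
    """Penalise every connection whose commit overlaps failed_paths."""
    blamed = []
    for conn, paths in history:  # no early exit: visit every commit
        if conn == failed_conn:
            continue             # a connection is not blamed for itself
        if paths & failed_paths and conn not in blamed:
            credit[conn] -= 1    # penalty: one point of conflict-credit
            blamed.append(conn)
    return blamed
```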

Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com>
v1 Reviewed-by: Christian Lindig <christian.lindig@citrix.com>

Changes since v1:
* use correct log levels for informational messages
Changes since v2:
* fix the blame algorithm and improve logging
(fix was reviewed by Jonathan Davies)

Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com>

oxenstored: track commit history

Since the list of historic activity cannot grow without bound, it is safe to use
this to track commits.

Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
Reviewed-by: Thomas Sanders <thomas.sanders@citrix.com>

oxenstored: discard old commit-history on txn end

The history of commits is to be used for working out which historical
commit(s) (including atomic writes) caused conflicts with a
currently-failing commit of a transaction. Any commit that was made
before the current transaction started cannot be relevant. Therefore
we never need to keep history from before the start of the
longest-running transaction that is open at any given time: whenever a
transaction ends (with or without a commit) then if it was the
longest-running open transaction we can delete history up until the start
of the next-longest-running open transaction.

Some transactions might stay open for a very long time, so if any
transaction exceeds conflict_max_history_seconds then we remove it
from consideration in this context, and will not guarantee to keep
remembering about historical commits made during such a transaction.

We implement this by keeping a list of all open transactions that have
not been open too long. When a transaction ends, we remove it from the
list, along with any that have been open longer than the maximum; then
we delete any history from before the start of the longest-running
transaction remaining in the list.
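
The bookkeeping described above can be sketched in Python. The data
layout (a dict of start times and a list of timestamped commits) and the
function name are assumptions for illustration; discarding all history
when no transaction remains open is also an assumption.

```python
def trim_history_on_end(txn_id, now, open_txns, history, max_age):
    """open_txns: dict id -> start time; history: list of (time, commit)."""
    # The ending transaction leaves the list.
    open_txns.pop(txn_id, None)
    # So does any transaction that has been open longer than the maximum.
    stale = [t for t, start in open_txns.items() if now - start > max_age]
    for t in stale:
        del open_txns[t]
    # Keep only history from the start of the longest-running survivor.
    cutoff = min(open_txns.values()) if open_txns else now
    history[:] = [(t, c) for (t, c) in history if t >= cutoff]
```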

Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com>
Reviewed-by: Jonathan Davies <jonathan.davies@citrix.com>
Reviewed-by: Christian Lindig <christian.lindig@citrix.com>

oxenstored: only record operations with side-effects in history

There is no need to record "read" operations as they will never cause another
transaction to fail.

Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
Reviewed-by: Thomas Sanders <thomas.sanders@citrix.com>
Forward port to xen-unstable:
* Remove Xenbus.Xb.Op.Restrict

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>

oxenstored: support commit history tracking

Add ability to track xenstore tree operations -- either non-transactional
operations or committed transactions.

For now, the call to actually retain commits is commented out because history
can grow without bound.

For now, we call record_commit for all non-transactional operations. A
subsequent patch will make it retain only the ones with side-effects.

Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com>
Reviewed-by: Christian Lindig <christian.lindig@citrix.com>

oxenstored: add transaction info relevant to history-tracking

Specifically:
* retain the original store (not just the root) in full transactions
* store commit count at the time of the start of the transaction

Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
Reviewed-by: Thomas Sanders <thomas.sanders@citrix.com>
Reviewed-by: Ian Jackson <ian.jackson@eu.citrix.com>
Reviewed-by: Christian Lindig <christian.lindig@citrix.com>

oxenstored: ignore domains with no conflict-credit

When processing connections, skip those from domains with no remaining
conflict-credit.

Also, issue a point of conflict-credit at regular intervals, the
period being set by the configuration option
"conflict-max-history-seconds". When issuing conflict-credit, we give
a point either to every domain at once (one each) or only to the
single domain at the front of the queue, depending on the
configuration option
"conflict-rate-limit-is-aggregate".

Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com>
Reviewed-by: Jonathan Davies <jonathan.davies@citrix.com>
Reviewed-by: Christian Lindig <christian.lindig@citrix.com>

oxenstored: handling of domain conflict-credit

This commit gives each domain a conflict-credit variable, which will
later be used for limiting how often a domain can cause other domains'
transaction-commits to fail.

This commit also provides functions and data for manipulating domains
and their conflict-credit, and checking whether they have credit.

Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com>
Reviewed-by: Jonathan Davies <jonathan.davies@citrix.com>
Reviewed-by: Christian Lindig <christian.lindig@citrix.com>

oxenstored: comments explaining some variables

It took a while of reading and reasoning to work out what these are
for, so here are comments to make life easier for everyone reading
this code in future.

Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com>
Reviewed-by: Jonathan Davies <jonathan.davies@citrix.com>
Reviewed-by: Ian Jackson <ian.jackson@eu.citrix.com>
Reviewed-by: Christian Lindig <christian.lindig@citrix.com>

xenstored: Log when the write transaction rate limit bites

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>

xenstored: apply a write transaction rate limit

This avoids a rogue client being able to stall another client (e.g. the
toolstack) indefinitely.

This is XSA-206.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>

x86: clarify shadow paging Dom0 support

Classic PV shadow paging Dom0 has been broken for years, and can't
possibly be configured after 4045953.

PVH shadow paging Dom0 should still be possible.

Change the code and documentation to clarify that.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper@citrix.com>

xen: sched: don't call hooks of the wrong scheduler via VCPU2OP

Within context_saved(), we call the context_saved hook,
and we use VCPU2OP() to determine from what scheduler.
VCPU2OP uses DOM2OP, which uses d->cpupool, which is
NULL when d is the idle domain. And in that case,
DOM2OP just returns ops, the scheduler of cpupool0.

Therefore, if:
- cpupool0's scheduler defines context_saved (like
  Credit2 and RTDS do),
- we are not in cpupool0 (i.e., our scheduler is
  not ops),
- we are context switching from idle,

we call VCPU2OP(idle_vcpu), which means
DOM2OP(idle->cpupool), which is ops.

Therefore, we both:
- check if context_saved is defined in the wrong
  scheduler;
- if yes, call the wrong one.

When using Credit2 at boot, and also Credit2 in
the other cpupool, this is wrong but innocuous,
because it only involves the idle vcpus.

When using Credit2 at boot, and Credit1 in the
other cpupool, this is *totally* wrong, and
it's by chance it does not explode!

When using Credit2 and other schedulers I'm
developing, I hit the following assert (in
sched_credit2.c, on a CPU inside a cpupool that
does not use Credit2):

csched2_context_saved()
{
...
ASSERT(!vcpu_on_runq(svc));
...
}

Fix this by dealing explicitly, in VCPU2OP, with
idle vcpus, returning the scheduler of the pCPU
they (always) run on.
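
The lookup problem and the fix can be modelled in a few lines of Python.
This is only a model of the logic described above; the real VCPU2OP is a
C macro, and the dict fields and function names here are hypothetical.

```python
def vcpu2op_old(vcpu, default_ops):
    """Old behaviour: a NULL cpupool falls back to cpupool0's scheduler."""
    pool = vcpu["domain"]["cpupool"]
    return default_ops if pool is None else pool["sched"]


def vcpu2op_fixed(vcpu, default_ops, per_cpu_sched):
    """Fixed behaviour: idle vcpus use the scheduler of their pCPU."""
    if vcpu["is_idle"]:
        return per_cpu_sched[vcpu["processor"]]
    return vcpu2op_old(vcpu, default_ops)


# An idle vcpu on pCPU 3, which sits in a Credit1 cpupool, while
# cpupool0 (the boot scheduler) uses Credit2.
idle_vcpu = {"is_idle": True, "processor": 3, "domain": {"cpupool": None}}
per_cpu_sched = {0: "credit2", 3: "credit1"}
```

With these inputs the old lookup yields "credit2" for the idle vcpu, while
the fixed lookup yields "credit1", the scheduler actually running pCPU 3.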

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>

tracing: xenalyze: kill spurious ", " in Credit1 traces.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>

tools/libxenforeignmemory: bind restrict operation to new version

Commit 5823d6eb "add a call to restrict the handle" added a new function
to the foreignmemory API. This API is considered stable and so the new
function should be bound to a new version.

This patch creates version 1.1 of the API, dependent on version 1.0, and
binds the restrict call to version 1.1. Thus version 1.0 is as it was
before the new function was added.

Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>

x86/pagewalk: non-functional cleanup

* Drop trailing whitespace
* Consistently apply Xen style
* Introduce a local variable block

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Tim Deegan <tim@xen.org>

x86/pagewalk: Improve the logic behind setting access and dirty bits

The boolean pse2M is misnamed, because it might refer to a 4M superpage.

Switch the logic to be in terms of the level of the leaf entry, and rearrange
the calls to set_ad_bits() to be a fallthrough switch statement, to make it
easier to follow.

Alter set_ad_bits() to take properly typed pointers and booleans rather than
integers.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Tim Deegan <tim@xen.org>

x86/shadow: Use the pagewalk reserved bits helpers

The shadow logic should not create a valid/present shadow of a guest PTE which
contains reserved bits from the guest's point of view. It is not guaranteed
that the hardware pagewalk will come to the same conclusion, and raise a
pagefault.

Shadows created on demand from the pagefault handler are fine because the
pagewalk over the guest tables will have injected the fault into the guest
rather than creating a shadow.

However, shadows created by sh_resync_l1() and sh_prefetch() haven't undergone
a pagewalk and need to account for reserved bits before creating the shadow.

In practice, this means a 3-level guest could previously cause PTEs with bits
63:52 set to be shadowed (and discarded). This PTE should cause #PF[RSVD]
when encountered by hardware, but the installed shadow is valid and hardware
doesn't fault.

Reuse the pagewalk reserved bits helpers, and assert in
l?e_propagate_from_guest() that shadows are not attempted to be created with
reserved bits set.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Tim Deegan <tim@xen.org>

x86/pagewalk: Re-implement the pagetable walker

The existing pagetable walker has complicated return semantics, which squeeze
multiple pieces of information into a single integer.  This would be fine if the
information didn't overlap, but it does.

Specifically, _PAGE_INVALID_BITS for 3-level guests alias _PAGE_PAGED and
_PAGE_SHARED.  A guest which constructs a PTE with bits 52 or 53 set (the
start of the upper software-available range) will create a virtual address
which, when walked by Xen, tricks Xen into believing the frame is paged or
shared.  This behaviour was introduced by XSA-173 (c/s 8b17648).

It is also complicated to turn rc back into a normal pagefault error code.
Instead, change the calling semantics to return a boolean indicating success,
and have the function accumulate a real pagefault error code as it goes
(including synthetic error codes, which do not alias hardware ones).  This
requires an equivalent adjustment to map_domain_gfn().

Issues fixed:
* 2-level PSE36 superpages now return the correct translation.
* 2-level L2 superpages without CR0.PSE now return the correct translation.
* SMEP now inhibits a user instruction fetch even if NX isn't active.
* Supervisor writes without CR0.WP now set the leaf dirty bit.
* L4e._PAGE_GLOBAL is strictly reserved on AMD.
* 3-level l3 entries have all reserved bits checked.
* 3-level entries can no longer alias Xen's idea of paged or shared.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Tim Deegan <tim@xen.org>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/pagewalk: Helpers for reserved bit handling

Some bits are unconditionally reserved in pagetable entries, or reserved
because of alignment restrictions. Other bits are reserved because of control
register configuration.

Introduce helpers which take an individual vcpu and guest pagetable entry, and
calculate whether any reserved bits are set.

While here, add a couple of newlines to aid readability.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Tim Deegan <tim@xen.org>

x86/pagewalk: Clean up guest_supports_* predicates

Switch them to returning bool, and taking const parameters.

Rename guest_supports_superpages() to guest_can_use_l2_superpages() to
indicate which level of pagetables it is actually referring to as well as
indicating that it is more complicated than just control register settings,
and rename guest_supports_1G_superpages() to guest_can_use_l3_superpages() for
consistency.

guest_can_use_l3_superpages() is a static property of the domain, rather than
control register settings, so is switched to take a domain pointer.
hvm_pse1gb_supported() is inlined into its sole user because it isn't strictly
hvm-specific (it is hap-specific) and really should be beside a comment
explaining why the cpuid policy is ignored.

guest_supports_nx() on the other hand refers simply to a control register bit,
and is renamed to guest_nx_enabled().

While cleaning up part of the file, clean up all trailing whitespace, and fix
one comment which accidentally referred to PG living in CR4 rather than CR0.

Requested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Tim Deegan <tim@xen.org>

x86: support larger memory map from EFI

Use a larger e820 map buffer for non-BIOS memory map sources. This
requires different defines for the maximum number of E820 map
entries for the raw BIOS buffer and the later used struct e820map.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86: use trampoline e820 buffer for BIOS interface only

Instead of using the E820 raw buffer for BIOS, EFI and multiboot based
memory map information use it for the BIOS interface only. This will
enable us to support more E820 entries than the limited trampoline
located buffer can.

Add a new raw e820 table for common purpose and copy the BIOS buffer
to it. Doing the copying in assembly avoids the need to export the
symbols for the BIOS E820 buffer and number of entries.

Signed-off-by: Juergen Gross <jgross@suse.com>
[jb: eliminate an unneeded local variable]
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86: split boot trampoline into permanent and temporary part

The hypervisor needs a trampoline in low memory for early boot and
later for bringing up cpus and during wakeup from suspend. Today this
trampoline is kept completely even if most of it isn't needed later.

Split the trampoline into a permanent part and a temporary part needed
at early boot only. Introduce a new entry at the boundary.

Reduce the stack for wakeup code in order for the permanent
trampoline to fit in a single page. 4k of stack seems excessive, about
3k should be more than enough.

Add an ASSERT() to the linker script to ensure the wakeup stack is
always at least 3k.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/mm: fix the check in get_pg_owner

A PVH guest (both v1 and v2) is actually a translated guest. It should be
able to manipulate page tables for other domains when acting as Dom0.

The removal of PVHv1 deleted the special case for PVH guests but didn't
add a check for HVM guests.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

docs: update HVM emulated unplug protocol to cover NVMe disks

Recent discussions on xen-devel have highlighted that properly
supporting the displacement of emulated NVMe disks with PV equivalents
will need updates to PV frontends. Therefore it is important that, if an emulated
NVMe disk is exposed to a guest with an existing PV storage frontend,
that frontend does not inadvertently cause unplug of that emulated
disk when unplugging IDE or SCSI disks.

This patch defines a new bit in the mask used to instruct QEMU to unplug
emulated devices which will instruct QEMU to unplug NVMe disks and limits
the semantics of the existing 'all' disk-unplug bit to only IDE and/or SCSI
disks.

Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>

Config.mk: update OVMF changeset

This new changeset contains a fix to build with GCC 6.3.

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>

xen/Makefile: remove all temporary files for every architecture

Execute the clean target for both the arm and x86 architectures.

When trying to build Xen for a different architecture in the same
tree, the command make clean will only remove temporary files for
the host architecture.
This will lead to a compilation error when trying to build ARM64 and
ARM32 Xen in the same tree.
(See also: https://lists.xenproject.org/archives/html/xen-devel/2016-11/msg02176.html)

Signed-off-by: Luca Miccio <lucmiccio@gmail.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>

tools/libxenforeignmemory: add a call to restrict the handle

Commit 8ef5f344d061 "tools/libxendevicemodel: add a call to restrict the
handle" added a function to the devicemodel interface to restrict
operations through the API to a specific domain, where a capable
underlying privcmd driver exists.

This patch adds similar functionality to the xenforeignmemory API. This
will be necessary (as much as xendevicemodel restriction) for limiting
the scope of device models to specific domains.

NOTE: My patch to the linux kernel [1] added the appropriate checks to
the foreign memory ioctls.

[1] https://git.kernel.org/cgit/linux/kernel/git/ostr/linux.git/commit/?id=4610d240

Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>

arm/mm: remove unused p2m_refcount in page_info

The code which used that field has been deleted. Found by code
inspection.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>

x86/shadow: Drop VALID_GFN()

There is only a single user of VALID_GFN(). Inline the macro to remove the
added layer of indirection in sh_gva_to_gfn().

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>

x86/pagewalk: Use pointer syntax for pfec parameter

It is a pointer, not an array.

No functional change.

Requested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>

x86/cpuid: Sort cpu_has_* predicates by feature number

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>

x86/viridian: annotate intentional fallthrough

This stops Coverity complaining.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

arch: drop ALIGN_STR

... as being unused and having been unusable: It was clearly intended
for use in asm(), yet was placed inside __ASSEMBLY__ conditionals.

Also drop __ALIGN{,_STR} - there's no need to have a second flavor of
these constructs with no difference in behavior.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Julien Grall <julien.grall@arm.com>

x86/viridian: implement the crash MSRs

Section 2.4.4 of the Hypervisor Top Level Functional Specification states
that enabling bit 10 in EDX of CPUID leaf 3 advertises to Windows a set
of MSRs into which it can write crash information.

This patch advertises that bit and implements the MSRs such that Xen can
log the information if a Windows guest crashes.

Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>

x86/viridian: make the threshold for HvNotifyLongSpinWait tunable

The current threshold before the guest issues the hypercall is, and always
has been, hard-coded to 2047. It is not clear where this number came
from so, to at least allow for ease of experimentation, this patch makes
the threshold tunable via the Xen command line.

Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/viridian: don't put Xen version information in CPUID leaf 2

The Hypervisor Top Level Functional Specification v5.0a states in section
2.5:

"The hypervisor version information is encoded in leaf 0x40000002. Two
version numbers are provided: the main version and the service version.
The main version includes a major and minor version number and a build
number. These correspond to Microsoft Windows release numbers."

It also goes on to advise clients (i.e. guest versions of Windows) to use
the following algorithm to determine compatibility with the hypervisor
enlightenments:

if <your-main-version> greater than <hypervisor-main-version>
{
your version is compatible
}
else if <your-main-version> equal to <hypervisor-main-version> and
<your-service-version> greater than or equal to <hypervisor-service-version>
{
your version is compatible
}
else
{
your version is NOT compatible
}

So, clearly putting Xen hypervisor version information in that leaf is
spurious, but we probably get away with it because Xen's major version
is lower than the major version of Windows in which Hyper-V first
appeared (Server 2008).
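
The quoted algorithm translates directly into runnable Python. Treating
the main version as a (major, minor, build) tuple compared element by
element is an assumption of this sketch, and the function name is
hypothetical.

```python
def guest_is_compatible(guest_main, guest_service, hv_main, hv_service):
    """main versions are (major, minor, build) tuples; service is an int."""
    if guest_main > hv_main:
        return True
    if guest_main == hv_main and guest_service >= hv_service:
        return True
    return False
```

For example, a guest reporting a main version starting with major 6 is
judged compatible with a leaf advertising major 4, whatever the remaining
fields hold, which is the case described in the paragraph above.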

This patch changes the leaf to use the kernel major and minor
versions, and build number from Windows Server 2008 (64-bit) by default.
These default values can be overridden from the Xen command line using the new
'viridian-version' parameter.

Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

libxl: use libxl__xs_read_checked() instead of raw xs_read() in do_domain_soft_reset()

Replace raw xs_read() calls with libxl__xs_read_checked() and bail on error.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>

libxl: preserve console tty across soft reset

On soft reset we remove the domain from xenstore and introduce it back to
have everything reconnected. Console, however, stays attached (as xenconsoled
checks if the domain is dying and our domain is not) but we lose the
information about tty:

before soft reset:
   console = ""
    ...
    type = "xenconsoled"
    output = "pty"
    tty = "/dev/pts/1"
    ...

after:
   console = ""
    ...
    type = "xenconsoled"
    output = "pty"
    tty = ""
    ...

The issue applies to both HVM and PVH but for HVM guests serial console
through QEMU is usually in use and for PVH we don't have it.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>

COLO-proxy: Fix argument check error

Here, we should check the 'outdev' before use.

Signed-off-by: Zhang Chen <zhangchen.fnst@cn.fujitsu.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>

ARM: arm64: activate atomic 64-bit accessors

For some reason (probably because there was no user before) the 64-bit
atomic access wrappers were commented out so far.
As we will need them in the next patch, activate (and fix) them now.

Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>

libxl: don't try to rename dm save file for PVH

Guests with LIBXL_DEVICE_MODEL_VERSION_NONE don't have a device model
running so there is no save file to rename.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>

tools/libxc: Drop dombuilder support for PV autotranslate guests

c/s 4045953 "x86/paging: Enforce PG_external == PG_translate == PG_refcounts"
in the hypervisor finally prevented the construction of PV autotranslate
guests.

Remove support for such guests in the domain builder, bailing out with an
obvious "no longer supported" message, rather than a more obscure
"SHADOW_OP_ENABLED failed".

As a piece of cleanup, rename xc_dom_feature_translated() to
xc_dom_translated() to match its actual semantics.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>

xenstore: add missing checks for allocation failure

Add missing allocation failure checks.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>

xenstore: set correct error code when violating quota

When the number of permitted xenstore entries for a domain is being
exceeded the operation trying to create a new entry is denied.
Unfortunately errno isn't being set in this case so the error code
returned to the client is undefined.

Set errno to ENOSPC in this case.

Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>

x86/viridian: add warnings for unimplemented hypercalls and MSRs

These warnings can be useful when Microsoft updates Windows.

In the past there have been several cases when Windows erroneously uses
hypercalls and MSRs that should be gated on CPUID flags that Xen does
not set. The usual symptom is a guest crash with little or no information
in the hypervisor log. Adding these warnings at least gives a clue as to
what might be happening in such cases.

Some versions of Windows do currently issue hypercalls that they should
not, so this patch whitelists those to avoid the warnings as the lack
of implementation is clearly proved not to be a problem to the guest.

The warnings are rate limited so a malicious guest cannot use them
as a DoS.

NOTE: Because the MSR warnings need to be gated on range checking the
MSR address this patch imports the up-to-date definitions of all
the viridian MSRs from the specification.

Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/viridian: get rid of the magic numbers in CPUID leaves 1 and 2

The numbers correspond to ASCII characters so just use appropriate
character strings directly.
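
To show how such magic numbers relate to ASCII text, the Python sketch
below packs a string into little-endian 32-bit register values. The
strings "Microsoft Hv" and "Hv#1" are the Hyper-V signature values and are
used here only as illustrative inputs; the commit message does not list
the values it replaces, and the helper name is hypothetical.

```python
import struct


def regs_from_text(text):
    """Pack ASCII text into little-endian 32-bit register values."""
    data = text.encode("ascii")
    if len(data) % 4:
        raise ValueError("length must be a multiple of 4")
    return list(struct.unpack("<%dI" % (len(data) // 4), data))


hv_interface = regs_from_text("Hv#1")
hv_vendor = regs_from_text("Microsoft Hv")
```

Written as numbers, these come out as 0x31237648 and as 0x7263694d,
0x666f736f, 0x76482074, which is far harder to read than the strings.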

Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/viridian: fix xen-hvmcrash when vp_assist page is present

Currently use of xen-hvmcrash will cause an immediate domain_crash() in
initialize_vp_assist() because it is called from viridian_load_vcpu_ctxt()
without having first cleared any previous mapping.

This patch adds a check into viridian_load_vcpu_ctxt() to avoid
re-initialization and turns the domain_crash() in initialize_vp_assist()
into an ASSERT() since neither codepath into that function should allow
it to be hit.

Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/mm: use statically defined locking order

Instead of using a locking order based on line numbers, which interacts
poorly with trying to create a live patch, statically define the locking
order.
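
A minimal sketch of the idea, assuming a simplified single-CPU model: each
lock carries an explicit level constant rather than a value derived from
__LINE__, so inserting or removing source lines (as a live patch may do)
cannot change the ordering. All names and level values below are
hypothetical and are not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Explicit levels, with gaps so new locks can be slotted in later. */
enum lock_order {
    LOCK_ORDER_FIRST  = 8,
    LOCK_ORDER_SECOND = 16,
    LOCK_ORDER_THIRD  = 24,
};

struct ordered_lock {
    int level;         /* static position in the locking order */
    int unlock_level;  /* level to restore when this lock is dropped */
    bool held;
};

/* Highest level currently held; per-CPU in a real system, global here. */
static int current_level;

/* Take the lock, or return false if doing so would violate the order. */
static bool ordered_lock_acquire(struct ordered_lock *l)
{
    if (current_level > l->level)
        return false;

    l->unlock_level = current_level;
    current_level = l->level;
    l->held = true;
    return true;
}

static void ordered_lock_release(struct ordered_lock *l)
{
    assert(l->held);
    current_level = l->unlock_level;
    l->held = false;
}
```

Taking a LOCK_ORDER_FIRST lock and then a LOCK_ORDER_SECOND lock succeeds,
while the reverse sequence is rejected by ordered_lock_acquire().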

Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>

QEMU_TAG update

misc/branching-checklist: Call mg-branch-setup in Cambridge too

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>

misc/release-checklist: Split out branching-checklist.txt

This is almost all just motion. There is one new paragraph in
branching-checklist.txt:

+ Update both new branches according to release-checklist.txt section re
+ README etc.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>

misc/release-checklist: Remove pre-4.3 tarball target instructions

4.2 is well out of support and we will never need to make a release of
it again. Delete all the stuff for making combined tarballs "by hand".

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>

misc/release-checklist: Remove clearly-obsolete stuff

Remove:

- Head comment saying not to edit here.  This came from the
   now-no-longer-master xenbits copy which I have deleted.

- Many old (commented-out) instructions related to hg

- Many old (commented-out) instructions related to pre-unified
   qemu trees.

- Many old (commented-out) instructions related to ancient
   locations within Citrix.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>

common: allow a default compiled-in command line using Kconfig

This allows downstreams to set their defaults without modifying the source code
all over the place. Also probably useful for the embedded space.
(See Also: https://xenproject.atlassian.net/browse/XEN-41)

If CMDLINE is set, it will be parsed prior to the bootloader command line.
This order of parsing implies that if any non-cumulative options are set in
both CMDLINE and the bootloader command line, only the ones in the latter will
take effect. Furthermore, if CMDLINE_OVERRIDE is set to y, the whole
bootloader command line will be ignored, which will be useful to work around
broken bootloaders. A wrapper to the original common/kernel.c:cmdline_parse()
was introduced to complete this task.
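
The parsing order described above can be sketched as follows. This is an
illustration of the "built-in first, bootloader second, later value wins"
rule for a non-cumulative key=value option, not the code from
common/kernel.c; scan_cmdline() and lookup_option() are hypothetical
helpers, and the option names in the usage note are examples only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Scan one space-separated command line for "key=value" tokens. The last
 * match wins. On a match the value is copied (truncated if needed) to out.
 */
static bool scan_cmdline(const char *cmdline, const char *key,
                         char *out, size_t outlen)
{
    size_t klen = strlen(key);
    bool found = false;
    const char *p = cmdline;

    assert(outlen > 0);

    while (*p) {
        const char *end;

        while (*p == ' ')
            p++;
        for (end = p; *end && *end != ' '; end++)
            ;

        if ((size_t)(end - p) > klen && !strncmp(p, key, klen) &&
            p[klen] == '=') {
            size_t vlen = (size_t)(end - p) - klen - 1;

            if (vlen >= outlen)
                vlen = outlen - 1;
            memcpy(out, p + klen + 1, vlen);
            out[vlen] = '\0';
            found = true;
        }
        p = end;
    }

    return found;
}

/*
 * Parse the built-in line first, then the bootloader line, so that the
 * bootloader value replaces the built-in one. With `override` set, the
 * bootloader line is ignored entirely.
 */
static bool lookup_option(const char *builtin, const char *bootloader,
                          bool override, const char *key,
                          char *out, size_t outlen)
{
    bool found = scan_cmdline(builtin, key, out, outlen);

    if (!override && scan_cmdline(bootloader, key, out, outlen))
        found = true;

    return found;
}
```

For a built-in line "loglvl=all console=vga" and a bootloader line
"console=com1", looking up "console" gives "com1" normally and "vga" when
the override flag is set.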

Signed-off-by: Zhongze Liu <blackskygg@gmail.com>
[jb: fix non-EXPERT build]
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>

x86emul: correct FPU code/data pointers and opcode handling

Prevent leaking the hypervisor ones (stored by hardware during stub
execution), at once making sure the guest sees correct values there.
This piggybacks on the backout logic used to deal with write faults of
FPU insns.

Deliberately ignore the NO_FPU_SEL feature here: Honoring it would
merely mean extra code with no benefit (once we XRSTOR state, the
selector values will simply be lost anyway).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com> [hvm/emulate.c]
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

x86emul: correct handling of FPU insns faulting on memory write

When an FPU instruction with a memory destination fails during the
memory write, it should not affect FPU register state. Due to the way
we emulate FPU (and SIMD) instructions, we can only guarantee this by
- backing out changes to the FPU register state in such a case or
- doing a descriptor read and/or page walk up front, perhaps with the
stubs accessing the actual memory location then.
The latter would require a significant change in how the emulator does
its guest memory accessing, so for now the former variant is being
chosen.
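
The "backing out" variant can be sketched as a snapshot-and-restore around
the memory write. The model below is deliberately tiny and is not the
emulator's code: struct fpu_regs, the write callbacks and store_and_pop()
are hypothetical stand-ins for the real register state and guest memory
access hooks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy FPU state: two stack slots and a top-of-stack counter. */
struct fpu_regs {
    double st[2];
    unsigned int top;
};

typedef bool (*mem_write_fn)(void *dst, const void *src, size_t len);

/* Stand-in for a guest memory write that succeeds. */
static bool write_ok(void *dst, const void *src, size_t len)
{
    memcpy(dst, src, len);
    return true;
}

/* Stand-in for a guest memory write that faults. */
static bool write_fault(void *dst, const void *src, size_t len)
{
    (void)dst;
    (void)src;
    (void)len;
    return false;
}

/*
 * Emulate a "store top of stack and pop" operation. The register update
 * happens before the memory write, so a failed write must restore the
 * snapshot taken up front.
 */
static bool store_and_pop(struct fpu_regs *fpu, mem_write_fn do_write,
                          void *dst)
{
    struct fpu_regs saved;
    double val;

    assert(fpu != NULL && do_write != NULL);

    saved = *fpu;                 /* snapshot register state */
    val = fpu->st[fpu->top & 1];
    fpu->top++;                   /* register side effect */

    if (!do_write(dst, &val, sizeof(val))) {
        *fpu = saved;             /* back out on write fault */
        return false;
    }

    return true;
}
```

After a faulting write the caller sees the register state exactly as it
was before the instruction, which is the property the patch is after.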

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com> [hvm/emulate.c]
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>

x86emul: centralize put_fpu() invocations

..., splitting parts of it into check_*() macros. This is in
preparation of making ->put_fpu() do further adjustments to register
state. (Some of the check_xmm() invocations could be avoided, as in
some of the cases no insns handled there can actually raise #XM, but I
think we're better off keeping them to avoid later additions of further
insn patterns rendering the lack of the check a bug.)

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

tools/insn-fuzz: Support AFL's afl-clang-fast mode

AFL has an alternative LLVM-based instrumentation mode, which has much lower
overhead than the traditional afl-gcc.

One extra ability is to choose exactly where the master process gets
initialised to, before being forked for testing. This point is chosen after
the call to LLVMFuzzerInitialize(), so the stack isn't being remapped
executable for every test.

Another extra ability is to feed multiple inputs into a single test process,
to reduce the number of fork() calls required overall. Two caveats are that if
stdin is used for data, it must be unbuffered, and if input is passed via a
command line parameter, the underlying file must be opened and closed on each
iteration.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>

tools/insn-fuzz: Make use of LLVMFuzzerInitialize()

libFuzzer can perform one-time initialisation by calling LLVMFuzzerInitialize().
Move emul_test_init() into this, to avoid repeating it on every
LLVMFuzzerTestOneInput() call.
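
A minimal harness skeleton showing the split between the two entry points.
The function signatures are those of the fuzzing interface; everything
else (expensive_setup() as a stand-in for emul_test_init(), and the two
counters) is hypothetical and exists only to make the call pattern
visible.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static unsigned int init_calls;   /* how often one-time setup has run */
static unsigned int inputs_seen;  /* how many test inputs were processed */

/* Hypothetical stand-in for costly setup such as emul_test_init(). */
static void expensive_setup(void)
{
    init_calls++;
}

/* Called once by the fuzzing driver before any input is fed in. */
int LLVMFuzzerInitialize(int *argc, char ***argv)
{
    (void)argc;
    (void)argv;
    expensive_setup();
    return 0;
}

/* Called once per input; relies on setup having happened exactly once. */
int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
{
    assert(init_calls == 1);
    (void)data;
    (void)size;
    inputs_seen++;
    return 0;
}
```

However many inputs are processed, the setup cost is paid a single time.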

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>

tools/insn-fuzz: Accept fuzzing input on stdin

This is rather faster for afl-fuzz to arrange than using an explicit file
parameter. Also update the README to recommend using a tmpfs for findings_dir
which reduces disk load and is more performant.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>

tools/insn-fuzz: Use getopt() for parsing the command line

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>

tools/insn-fuzz: Use shorter filenames

Amongst other things, these tab complete more easily.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>

tools/fuzz: Include LLVMFuzzerTestOneInput() in the generated .a

Otherwise they are not suitable for use with libFuzzer.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>

tools/fuzz: Use $(CC) for linking the harnesses

This is necessary to make use of compiler features such as UBSAN.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>

tools/fuzz: Remove .d files in clean

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>

AMD-Vi: allocate root table on demand

This was my originally intended fix for the AMD side of XSA-207:
There's no need to unconditionally allocate the root table, and with
that there's then also no way to leak it when a guest has no devices
assigned.
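
The on-demand pattern can be sketched as below. It is a generic
illustration of lazy allocation, not the AMD-Vi code: struct iommu_domain,
assign_device(), domain_destroy() and the table size are all assumptions
made for the example.

```c
#include <assert.h>
#include <stdlib.h>

#define ROOT_TABLE_BYTES 4096u  /* assumed table size for this sketch */

struct iommu_domain {
    void *root_table;         /* NULL until the first device is assigned */
    unsigned int nr_devices;
};

/* Allocate the root table only when a device actually needs it. */
static int assign_device(struct iommu_domain *d)
{
    assert(d != NULL);

    if (d->root_table == NULL) {
        d->root_table = calloc(1, ROOT_TABLE_BYTES);
        if (d->root_table == NULL)
            return -1;
    }

    d->nr_devices++;
    return 0;
}

/* Safe for domains that never had a device: free(NULL) is a no-op. */
static void domain_destroy(struct iommu_domain *d)
{
    free(d->root_table);
    d->root_table = NULL;
    d->nr_devices = 0;
}
```

A domain that never has a device assigned never owns a table, so there is
nothing that could be leaked on its teardown path.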

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>

x86/viridian: update to version 5.0a of the specification

The Hypervisor Top Level Functional Specification v5.0a has many differences
from previous versions and introduces whole new sections.

This patch:

- Updates the URL at the top of the source.
- Fixes up section references accordingly.
- Modifies the MSR naming convention in the code to match the specification.
- Renames the apic_assist page to the vp_assist page to reflect the change
  in the specification.
  (The APIC assist feature itself is inconsistently named in the
  specification so stick with the current feature name).
- Updates the handling of CPUID leaf 3.

There is one functional change in this patch: The vp_assist page is
mapped (and completely zeroed) regardless of whether the APIC assist
feature is enabled. This reflects its new wider remit and simplifies the
code slightly.

Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/mce: add blank lines between non-fall-through switch case blocks

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Acked-by: Jan Beulich <jbeulich@suse.com>

x86/mce_intel: refine messages of MCA capabilities

... to only print available ones.

Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/mce: switch bool_t/1/0 to bool/true/false

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Acked-by: Jan Beulich <jbeulich@suse.com>

x86: remove stale PVHv1 comment from PV domain builder

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

sched.h: remove stale PVHv1 comment

With the removal of PVHv1 this comment is wrong. Just remove it.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

docs: update dmop.markdown

... to match the code after the removal of PVHv1.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

x86: clean up header files in dom0_build.c

Remove the ones that are no longer needed and sort them.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>

x86: split PVH dom0 builder to hvm/dom0_build.c

Long term we want to be able to disentangle PV and HVM code. Move
the PVH domain builder to a dedicated file.

Lift function declarations to dom0_build.h and rename them when
necessary.

No functional change.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

x86: split PV dom0 builder to pv/dom0_builder.c

Long term we want to be able to disentangle PV and HVM code. Move the PV
domain builder to a dedicated file.

This in turn requires exposing a few functions and variables via a new
header dom0_build.h. These functions and variables are now prefixed with
"dom0_" if they weren't already so.

No functional change.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

x86: modify setup_dom0_vcpu to use dom0_cpus internally

We will later move dom0 builders to different directories. To avoid the
need of making dom0_cpus visible outside dom0_builder.c, modify
setup_dom0_vcpus to cycle through dom0_cpus internally instead of
relying on the callers to do that.

No functional change.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86: rename domain_build.c to dom0_build.c

To reflect the true nature of this file. No functional change.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

libxl/libxl_qmp.c: Update COLO query replication status API

The QEMU community has asked us to change QMP command
xen-get-replication-error to query-xen-replication-status. Modify Xen
side to use the new name.

Signed-off-by: Zhang Chen <zhangchen.fnst@cn.fujitsu.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>

libxl/libxl_qmp.c: Update COLO do checkpoint API

The QEMU community has asked us to change the QMP command from
xen-do-checkpoint to xen-colo-do-checkpoint. Modify Xen side to use
the new name.

Signed-off-by: Zhang Chen <zhangchen.fnst@cn.fujitsu.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>

x86/time: don't use virtual TSC if host and guest frequencies are equal

Commit 82713ec8d2 ("x86: use native RDTSC(P) execution when guest and
host frequencies are the same") left out the optimization for PV guests
when host and guest run at the same frequency.

In such a case we should be able to avoid using virtual TSC regardless
of whether we are running before or after a migration (i.e. regardless
of incarnation value).

Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
[jb: retain parts of the original comment]
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/EFI: avoid Xen image when looking for module/kexec position

When booting straight from EFI, we don't further try to relocate Xen.
As a result, so far we also didn't avoid the area Xen uses when looking
for a location to put modules or the kexec area. Introduce a fake
module slot to deal with that without having to fiddle with a lot of
code.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

x86/EFI: avoid IOMMU faults on [_end,__2M_rwdata_end)

Commit c9a4a1c419 ("x86/layout: Correct Xen's idea of its own memory
layout") didn't go far enough with the conversion, causing IOMMU faults
when memory from that range was handed to a domain. We must not make
this memory available for allocation (the change is benign to xen.gz at
this point in time).

Note that the change to tboot_shutdown() is fixing another issue at
once: As it looks, the function so far skipped all memory below the Xen
image.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

x86/EFI: avoid overrunning mb_modules[]

Commit 436fb462ab ("x86/microcode: enable boot time (pre-Dom0)
loading") added a 4th module without providing an array slot for it.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

x86emul: parallelize SIMD test code building

In anticipation of further flavors (AVX, AVX-512) going to be added
(which would make the current situation even worse), facilitate
reduction of build time (and hence latency to availability of test
results) via use of make's -j option.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

x86emul: correct DECLARE_ALIGNED()

Stop creating an excessively large array on the stack, by properly
taking into account the array element size when establishing its
element count (and of course also when calculating the pointer to
be actually used to access the memory).
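
The kind of sizing mistake described above can be illustrated as follows.
This is a sketch of the general problem (a byte count used as an element
count for an array of wider elements), not a copy of the macro in the
tree; struct vec256, the macros and aligned_slot_ok() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct vec256 {
    unsigned char b[32];
};

#define VEC_ALIGN 32u

/* Bytes needed to fit one aligned object at any start address. */
#define RAW_BYTES (sizeof(struct vec256) + VEC_ALIGN - 1)

/* Oversized: the byte count is used directly as a count of longs. */
#define COUNT_OVERSIZED RAW_BYTES

/* Scaled: convert bytes to a count of longs, rounding up. */
#define COUNT_SCALED ((RAW_BYTES + sizeof(long) - 1) / sizeof(long))

/*
 * Reserve backing storage with the scaled count and check that an aligned
 * object fits inside it. The storage is only addressed here, never
 * dereferenced.
 */
static int aligned_slot_ok(void)
{
    long raw[COUNT_SCALED];
    uintptr_t start = (uintptr_t)raw;
    uintptr_t p = (start + VEC_ALIGN - 1) & ~(uintptr_t)(VEC_ALIGN - 1);

    assert(sizeof(raw) >= RAW_BYTES);

    return p % VEC_ALIGN == 0 && p >= start &&
           p + sizeof(struct vec256) <= start + sizeof(raw);
}
```

With the oversized count the array occupies sizeof(long) times more stack
than needed; the scaled count still leaves room for the aligned object.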

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

credit2: remove undefined declaration of __dump_execstate()

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>

Revert "x86/vvmx: correct nested shadow VMCS handling"

This reverts commit dc05c0ceeb8609b6d60f6a117a0192e9160946b8,
which caused a regression.

Revert "x86/vvmx: add a shadow vmcs check to vmlaunch"

This reverts commit b22ee98c4ecc4e7c827451dee01181529df4d26c,
which caused a regression.