Dario Faggioli [Mon, 23 May 2016 12:39:51 +0000 (14:39 +0200)]
sched: avoid races on time values read from NOW()
or (even in cases where there is no race, e.g., outside
of Credit2) avoid using a time sample which may be rather
old, and hence stale.
In fact, we should only sample NOW() from _inside_
the critical region within which the value we read is
used. If we don't, in case we have to spin for a while
before entering the region, when actually using it:
1) we will use something that, at the very least, is
not really "now", because of the spinning,
2) if someone else sampled NOW() during a critical
region protected by the lock we are spinning on,
and if we compare the two samples when we get
inside our region, our one will be 'earlier',
even if we actually arrived later, which is a
race.
In Credit2, we see an instance of 2), in runq_tickle(),
when it is called by csched2_context_saved() as it samples
NOW() before acquiring the runq lock. This makes things
look like the time went backwards, and it confuses the
algorithm (there's even a d2printk() about it, which would
trigger all the time, if enabled).
In RTDS, something similar happens in repl_timer_handler(),
and there's another instance in schedule() (in generic code),
so fix these cases too.
While there, improve csched2_vcpu_wake() and rt_vcpu_wake()
a little as well (removing a pointless initialization, and
moving the sampling a bit closer to its use). These two hunks
entail no further functional changes.
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Meng Xu <mengxu@cis.upenn.edu>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
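Case 2) from the commit message above, replayed as a deterministic single-threaded simulation. This is an added illustration, not code from the Xen tree: the logical clock, `struct runq` and every function name are constructed for the example, and the timestamps are arbitrary.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Constructed logical clock standing in for NOW(); values are nanoseconds. */
static int64_t fake_clock;
static int64_t now_ns(void) { return fake_clock; }

/* Hypothetical per-runqueue state, protected by the runqueue lock. */
struct runq {
    int64_t last_sample;   /* last timestamp used inside the critical region */
    int backwards;         /* times a lock holder saw time "go backwards" */
};

/* Work done while holding the lock: compare our sample to the previous one. */
static int64_t use_sample_locked(struct runq *rq, int64_t t)
{
    int64_t delta = t - rq->last_sample;

    if (delta < 0)
        rq->backwards++;          /* the confusing case from point 2) */
    else
        rq->last_sample = t;
    return delta;
}

/*
 * Replay one fixed interleaving of two CPUs contending for the lock:
 *   t=100  CPU B holds the lock; CPU A arrives and starts spinning
 *   t=120  CPU B samples the clock inside the region and uses it
 *   t=150  CPU B unlocks; CPU A enters the region
 * Returns the delta CPU A computes against the last recorded sample.
 */
static int64_t replay(bool sample_inside_lock, struct runq *rq)
{
    int64_t a_sample;

    rq->last_sample = 0;
    rq->backwards = 0;

    fake_clock = 100;
    a_sample = now_ns();              /* A's early sample, taken before locking */

    fake_clock = 120;
    use_sample_locked(rq, now_ns());  /* B, inside its critical region */

    fake_clock = 150;                 /* B unlocks, A finally gets the lock */
    if (sample_inside_lock)
        a_sample = now_ns();          /* fixed ordering: sample after locking */

    return use_sample_locked(rq, a_sample);
}

int main(void)
{
    struct runq rq;

    printf("sample before lock: delta=%lld\n", (long long)replay(false, &rq));
    printf("sample inside lock: delta=%lld\n", (long long)replay(true, &rq));
    return 0;
}
```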
Andrew Cooper [Tue, 26 Apr 2016 17:34:34 +0000 (18:34 +0100)]
docs/feature: Tweaks to the feature document template
During review of the migration feature doc, some changes were made which
should have been reflected in the template.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Andrew Cooper [Sun, 15 May 2016 13:09:30 +0000 (14:09 +0100)]
docs/xsplice: Fix syntax when compiling to pdf with pandoc
Pandoc (version 1.12.4.2 from Debian Jessie) complains at the embedded \n in
the signature checking paragraph.
/usr/bin/pandoc --number-sections --toc --standalone misc/xsplice.markdown
--output pdf/misc/xsplice.pdf
! Undefined control sequence.
l.1085 appended\textasciitilde{}\n
Surround the string in backticks to make it verbatim text.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Andrew Cooper [Mon, 25 Apr 2016 12:55:12 +0000 (12:55 +0000)]
docs/build: Avoid using multi-target pattern rules
Multi-target non-pattern rules and Multi-target pattern rules behave rather
differently. From `Pattern Intro':
Pattern rules may have more than one target. Unlike normal rules, this does
not act as many different rules with the same prerequisites and commands.
If a pattern rule has multiple targets, `make' knows that the rule's
commands are responsible for making all of the targets. The commands are
executed only once to make all the targets.
The intended use of the multi-target pattern rules was to avoid repeating the
identical recipe multiple times. The issue can be demonstrated with the
generation of documentation from pandoc source.
./xen.git$ touch docs/features/template.pandoc
./xen.git$ make -C docs/
# Regenerates html/features/template.html
./xen.git$ make -C docs/
# Regenerates txt/features/template.txt
./xen.git$ make -C docs/
# Regenerates pdf/features/template.pdf
To work around this, there need to be three distinct rules, so the execution
of one recipe doesn't short circuit the others. To avoid copy&paste
duplication, introduce a metarule, and evaluate it for each document target.
As $(PANDOC) is used to generate documentation from different source types,
the metarule can be extended to also encompass the rule to create pdfs from
markdown.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Wei Liu [Mon, 23 May 2016 10:11:39 +0000 (11:11 +0100)]
Revert "Config.mk: update ovmf changeset"
This reverts commit 1542efcea893df874b13b1ea78101e1ff6a55c41. It fails
consistently on our Debian HVM test when the VM has more than 4G of
memory.
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Julien Grall [Fri, 20 May 2016 13:37:42 +0000 (14:37 +0100)]
xen/arm: p2m: Release the p2m lock before undoing the mappings
Since commit 4b25423a "arch/arm: unmap partially-mapped memory regions",
Xen has been undoing the P2M mappings when an error occurred during
insertion or memory allocation.
This is done by calling recursively apply_p2m_changes, however the
second call is done with the p2m lock taken which will result in a
deadlock for the current processor.
The p2m lock is here to protect 2 threads modifying concurrently the
page tables. However, it does not guarantee the ordering of the
changes. I.e., if 2 threads request changes on regions that overlap,
then the result is undefined.
Therefore it is fine to move the recursive call to undo the changes
after the lock is released.
Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Wei Chen <Wei.Chen@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Tested-by: Wei Chen <Wei.Chen@arm.com>
Julien Grall [Fri, 20 May 2016 13:37:41 +0000 (14:37 +0100)]
xen/arm: p2m: apply_p2m_changes: Do not undo more than necessary
Since commit 4b25423a "arch/arm: unmap partially-mapped memory regions",
Xen has been undoing the P2M mappings when an error occurred during
insertion or memory allocation.
The function apply_p2m_changes can work with region not-aligned to a
block size (2MB, 1G) or page size (4K). The mapping will be done by
splitting the region in a set of regions aligned to the size supported
by the page table.
The mapping of a region could fail when it is not possible to allocate
memory for an intermediate table (i.e a new or when shattering a block).
When the mapping is undone, the end of the region is computed using the
base address of the current region and the size of the failing level.
However the failing level may not be the leaf one, therefore unrelated
entries will be removed.
Fix it by removing the mapping from the start address up to the last
region that has been successfully mapped.
Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Wei Chen <Wei.Chen@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Wei Liu [Wed, 18 May 2016 15:19:45 +0000 (16:19 +0100)]
libxl: consolidate casting to xc psr type to a function
In commit 31268fea (libxl: fix passing the type argument to xc_psr_*),
casting to xc psr type was done at each call site.
This patch consolidates casting into a function to avoid casting at each
conversion point. Each call site remains more type safe.
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Dario Faggioli [Thu, 19 May 2016 04:04:55 +0000 (06:04 +0200)]
xenalyze: fix a spurious newline
in dump mode, when tracing context switches.
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Jan Beulich [Thu, 19 May 2016 10:06:33 +0000 (12:06 +0200)]
x86emul: suppress writeback upon unsuccessful MMX/SSE/AVX insn emulation
This in particular prevents updating guest IP when handling the retry
needed to forward the memory access to qemu.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Andrew Cooper [Tue, 12 Apr 2016 18:57:35 +0000 (19:57 +0100)]
xen/nested_p2m: Don't walk EPT tables with a regular PT walker
hostmode->p2m_ga_to_gfn() is a plain PT walker, and is not appropriate for a
general L1 p2m walk. It is fine for AMD as NPT share the same format as
normal pagetables. For Intel EPT however, it is wrong.
The translation ends up correct (as the formats are sufficiently similar), but
the control bits in lower 12 bits differ in meaning. A plain PT walker sets
A/D bits (bits 5 and 6) as it walks, but in EPT tables, these are the IPAT and
top bit of EMT (caching type). This in turn causes problem when the EPT
tables are subsequently used.
Replace hostmode->p2m_ga_to_gfn() with nestedhap_walk_L1_p2m() in
paging_gva_to_gfn(), which is the correct function for the task. This
involves making nestedhap_walk_L1_p2m() non-static, and adding
vmx_vmcs_enter/exit() pairs to nvmx_hap_walk_L1_p2m() as it is now reachable
from contexts other than v == current.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Ian Jackson [Wed, 18 May 2016 14:46:13 +0000 (15:46 +0100)]
QEMU_TAG update
Anthony PERARD [Wed, 18 May 2016 11:35:22 +0000 (12:35 +0100)]
Config.mk: update qemu-xen tag
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Wei Liu [Wed, 18 May 2016 10:48:25 +0000 (11:48 +0100)]
Config.mk: update ovmf changeset
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Edgar E. Iglesias [Tue, 17 May 2016 12:15:50 +0000 (14:15 +0200)]
xen/device-tree: Do not remap IRQs for secondary IRQ controllers
Do not remap IRQs connected to secondary interrupt controllers.
These IRQs have no meaning to us until they connect to the
primary controller.
Secondary IRQ controllers will at some point connect to the
primary controller (possibly via other IRQ controllers). We
map the IRQs at that last connection point.
Reviewed-by: Julien Grall <julien.grall@linaro.org>
Signed-off-by: Edgar E. Iglesias <edgar.iglesias@xilinx.com>
Andrew Cooper [Fri, 13 May 2016 18:38:41 +0000 (19:38 +0100)]
x86/cpuid: Avoid unconditionally clobbering ITSC for guests
In general, Invariant TSC is not a feature which can be advertised to guests,
because it cannot be guaranteed across migrate. domain_cpuid() goes so far as
to deliberately clobber the feature flag under a number of circumstances.
Because ITSC is absent from the static {pv,hvm}_featureset masks, c/s b648feff
"xen/x86: Improvements to in-hypervisor cpuid sanity checks" caused ITSC to be
unconditionally masked out.
As an interim solution, include the host's idea of ITSC along with the static
{pv,hvm}_featureset when restricting the guest's view of features. This causes
the hardware domain, and VMs explicitly configured with ITSC and no-migrate to
be offered ITSC (subject to hardware availability).
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <JBeulich@suse.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Jan Beulich [Tue, 17 May 2016 14:42:15 +0000 (16:42 +0200)]
x86: make SMEP/SMAP suppression tolerate NMI/MCE at the "wrong" time
There is one instruction boundary where any kind of interruption would
break the assumptions cr4_pv32_restore's debug mode checking makes on
the correlation between the CR4 register value and its in-memory cache.
Correct this (see the code comment) even in non-debug mode, or else
a subsequent cr4_pv32_restore would also be misguided into thinking the
features are enabled when they really aren't.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Jan Beulich [Tue, 17 May 2016 14:41:35 +0000 (16:41 +0200)]
x86: refine debugging of SMEP/SMAP fix
Instead of just latching cr4_pv32_mask into %rdx, correct the found
wrong value in %cr4 (to avoid triggering another BUG). The value left
in %rdx should be sufficient for deducing cr4_pv32_mask from the
register dump.
Also there is one more place for XEN_CR4_PV32_BITS to be used.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Jan Beulich [Tue, 17 May 2016 12:41:14 +0000 (14:41 +0200)]
x86/mm: fully honor PS bits in guest page table walks
In L4 entries it is currently unconditionally reserved (and hence
should, when set, always result in a reserved bit page fault), and is
reserved on hardware not supporting 1Gb pages (and hence should, when
set, similarly cause a reserved bit page fault on such hardware).
This is CVE-2016-4480 / XSA-176.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
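The rule stated in the commit message above, modelled as a stand-alone check over a four-level walk. This is an added illustration, not Xen's guest page-table walker: the enum, function names and sample entries are constructed, and it assumes the second clause of the message refers to L3 entries, which is where 1GB superpages are mapped.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PTE_PRESENT (1ULL << 0)
#define PTE_PS      (1ULL << 7)   /* page-size bit in L2/L3/L4 entries */

enum walk_step {
    STEP_NEXT_LEVEL,   /* entry points to the next-level table */
    STEP_LEAF,         /* entry maps memory (4K page or superpage) */
    STEP_NOT_PRESENT,  /* ordinary not-present fault */
    STEP_RSVD_FAULT    /* reserved-bit page fault */
};

/* Decide what a present/absent entry at the given level means for the walk. */
static enum walk_step classify_entry(unsigned int level, uint64_t e, bool has_1g)
{
    if (!(e & PTE_PRESENT))
        return STEP_NOT_PRESENT;
    if (level == 1)
        return STEP_LEAF;            /* bit 7 is PAT in L1 entries, not PS */
    if (!(e & PTE_PS))
        return STEP_NEXT_LEVEL;
    if (level == 4)
        return STEP_RSVD_FAULT;      /* PS is always reserved in L4 entries */
    if (level == 3 && !has_1g)
        return STEP_RSVD_FAULT;      /* PS reserved without 1GB page support */
    return STEP_LEAF;                /* 2MB (L2) or 1GB (L3) superpage */
}

/*
 * Walk entries[0..3] = L4, L3, L2, L1 entries on the path of one address.
 * Returns the level at which the walk stops and stores the outcome in *out.
 */
static unsigned int walk(const uint64_t entries[4], bool has_1g,
                         enum walk_step *out)
{
    unsigned int level;

    for (level = 4; level >= 1; level--) {
        *out = classify_entry(level, entries[4 - level], has_1g);
        if (*out != STEP_NEXT_LEVEL)
            break;                   /* level 1 never returns NEXT_LEVEL */
    }
    return level;
}

/* Constructed sample paths (flag bits only; address bits are irrelevant here). */
static const uint64_t sample_l4_ps[4] = { PTE_PRESENT | PTE_PS, 0, 0, 0 };
static const uint64_t sample_l3_ps[4] = { PTE_PRESENT, PTE_PRESENT | PTE_PS, 0, 0 };
static const uint64_t sample_l2_ps[4] = { PTE_PRESENT, PTE_PRESENT,
                                          PTE_PRESENT | PTE_PS, 0 };
static const uint64_t sample_4k[4]    = { PTE_PRESENT, PTE_PRESENT,
                                          PTE_PRESENT, PTE_PRESENT };

int main(void)
{
    enum walk_step s;
    unsigned int lvl;

    lvl = walk(sample_l4_ps, true, &s);
    printf("L4 PS set:            stop at L%u, result %d\n", lvl, s);
    lvl = walk(sample_l3_ps, false, &s);
    printf("L3 PS set, no 1GB:    stop at L%u, result %d\n", lvl, s);
    lvl = walk(sample_l3_ps, true, &s);
    printf("L3 PS set, with 1GB:  stop at L%u, result %d\n", lvl, s);
    lvl = walk(sample_l2_ps, false, &s);
    printf("L2 PS set:            stop at L%u, result %d\n", lvl, s);
    lvl = walk(sample_4k, false, &s);
    printf("4K mapping:           stop at L%u, result %d\n", lvl, s);
    return 0;
}
```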
Peng Fan [Thu, 12 May 2016 13:03:08 +0000 (21:03 +0800)]
xen/arm: mm: fix nr_second calculation in setup_frametable_mappings
On ARM64, "frametable_size >> SECOND_SHIFT" computes the number
of second level entries, not the number of second level pages.
"ROUNDUP(frametable_size, FIRST_SIZE) >> FIRST_SHIFT" which computes
the number of the first level entries (the number of second level pages),
is the correct one that should be used.
Signed-off-by: Peng Fan <van.freenix@gmail.com>
Reviewed-by: Julien Grall <julien.grall@arm.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
Andrew Cooper [Mon, 16 May 2016 10:48:52 +0000 (11:48 +0100)]
x86/compat: Cleanup and further debugging of SMAP/SMEP fixup
* Abstract (X86_CR4_SMEP | X86_CR4_SMAP) behind XEN_CR4_PV32_BITS to avoid
opencoding the individual bits which are fixed up behind a 32bit PV guest's
back.
* In the debug case, perform the AND and CMP on 64bit values rather than
32bit values, to match the logic in the non-debug case.
* Show cr4_pv32_mask in the BUG register dump
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Wei Liu [Sun, 15 May 2016 15:20:02 +0000 (16:20 +0100)]
Config.mk: update mini-os changeset
There is one commit pulled in:
lib/sys.c: enclose file_types in define guards
This is required to fix stubdom build on Arch Linux.
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Jim Fehlig [Thu, 28 Apr 2016 21:20:46 +0000 (15:20 -0600)]
libxl: don't add cache mode for qdisk cdrom drives
qemu commit 91a097e7 forbids specifying cache mode for empty
drives. Attempting to create a domain with an empty qdisk cdrom
drive results in
qemu-system-x86_64: -drive if=ide,index=1,readonly=on,media=cdrom,
cache=writeback,id=ide-832: Must specify either driver or file
libxl only allows an empty 'target=' for cdroms. By default, cdroms
are readonly (see the 'access' parameter in xl-disk-configuration.txt)
and forced to readonly by any tools (e.g. xl) using libxlutil's
xlu_disk_parse() function. With cdroms always marked readonly,
explicitly specifying the cache mode for cdrom drives can be dropped.
The drive's 'readonly=on' option can also be set unconditionally.
Signed-off-by: Jim Fehlig <jfehlig@suse.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Jan Beulich [Fri, 13 May 2016 17:15:34 +0000 (18:15 +0100)]
x86: reduce code size of struct cpu_info member accesses
Instead of addressing these fields via the base of the stack (which
uniformly requires 4-byte displacements), address them from the end
(which for everything other than guest_cpu_user_regs requires just
1-byte ones). This yields a code size reduction somewhere between 8k
and 12k in my builds.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Jan Beulich [Fri, 13 May 2016 17:14:58 +0000 (18:14 +0100)]
x86: use 32-bit loads for 32-bit PV guest state reload
This is slightly more efficient than loading 64-bit quantities.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Jan Beulich [Fri, 13 May 2016 17:13:54 +0000 (18:13 +0100)]
x86: use optimal NOPs to fill the SMEP/SMAP placeholders
Alternatives patching code picks the most suitable NOPs for the
running system, so simply use it to replace the pre-populated ones.
Use an arbitrary, always available feature to key off from, but
hide this behind the new X86_FEATURE_ALWAYS.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Jan Beulich [Fri, 13 May 2016 17:12:22 +0000 (18:12 +0100)]
x86: suppress SMEP and SMAP while running 32-bit PV guest code
Since such guests' kernel code runs in ring 1, their memory accesses,
at the paging layer, are supervisor mode ones, and hence subject to
SMAP/SMEP checks. Such guests cannot be expected to be aware of those
two features though (and so far we also don't expose the respective
feature flags), and hence may suffer page faults they cannot deal with.
While the placement of the re-enabling slightly weakens the intended
protection, it was selected such that 64-bit paths would remain
unaffected where possible. At the expense of a further performance hit
the re-enabling could be put right next to the CLACs.
Note that this introduces a number of extra TLB flushes - CR4.SMEP
transitioning from 0 to 1 always causes a flush, and it transitioning
from 1 to 0 may also do.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Doug Goldstein [Thu, 12 May 2016 15:29:29 +0000 (10:29 -0500)]
xendriverdomain: use POSIX sh and not bash
The script doesn't use any bash-isms and works fine with BusyBox's ash.
Signed-off-by: Doug Goldstein <cardoe@cardoe.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Jan Beulich [Thu, 12 May 2016 16:02:21 +0000 (18:02 +0200)]
x86/PoD: skip eager reclaim when possible
Reclaiming pages is pointless when the cache can already satisfy all
outstanding PoD entries, and doing reclaims in that case can be very
harmful to performance when that memory gets used by the guest, but
only to store zeroes there.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Jan Beulich [Thu, 12 May 2016 12:24:39 +0000 (14:24 +0200)]
Revert "blktap2: Use RING_COPY_REQUEST"
This reverts commit 19f6c522a6a9599317ee1d8c4a155d1400d04c89. It
did wrongly get associated with XSA-155, and was (rightfully) never
backported to any of the stable trees. See also
http://lists.xenproject.org/archives/html/xen-devel/2016-03/msg00571.html.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Konrad Rzeszutek Wilk [Wed, 11 May 2016 13:59:08 +0000 (09:59 -0400)]
xsplice: Unmask (aka reinstall NMI handler) if we need to abort.
If we have to abort in xsplice_spin() we end up following
the goto abort. But unfortunately we neglected to unmask.
This patch fixes that.
Reported-by: Martin Pohlack <mpohlack@amazon.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
George Dunlap [Wed, 11 May 2016 11:14:45 +0000 (12:14 +0100)]
tools/xendomains: Create lockfile on start unconditionally
At the moment, the xendomains init script will only create a lockfile
if when started, it actually does something -- either tries to restore
a previously saved domain as a result of XENDOMAINS_RESTORE, or tries
to create a domain as a result of XENDOMAINS_AUTO.
RedHat-based SYSV init systems try to only call "${SERVICE} shutdown"
on systems which actually have an actively running component; and they
use the existence of /var/lock/subsys/${SERVICE} to determine which
systems are running.
This means that at the moment, on RedHat-based SYSV systems (such as
CentOS 6), if you enable xendomains, and have XENDOMAINS_RESTORE set
to "true", but don't happen to start a VM, then your running VMs will
not be suspended on shutdown.
Since the lockfile doesn't really have any other effect than to
prevent duplicate starting, just create it unconditionally every time
we start the xendomains script.
The other option would have been to touch the lockfile if
XENDOMAINS_RESTORE was true regardless of whether there were any
domains to be restored. But this would mean that if you started with
the xendomains script active but XENDOMAINS_RESTORE set to "false",
and then changed it to "true", then xendomains would still not run the
next time you shut down. This seems to me to violate the principle of
least surprise.
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Olaf Hering <olaf@aepfle.de>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
George Dunlap [Wed, 11 May 2016 11:14:44 +0000 (12:14 +0100)]
hotplug: Fix xendomains lock path for RHEL-based systems
Commit c996572 changed the LOCKFILE path from a check between two
hardcoded paths (/var/lock/subsys/ or /var/lock) to using the
XEN_LOCK_DIR variable designated at configure time. Since
XEN_LOCK_DIR doesn't (and shouldn't) have the 'subsys' postfix, this
effectively moves all the lock files by default to /var/lock instead.
Unfortunately, this breaks xendomains on RedHat-based SYSV init
systems. RedHat-based SYSV init systems try to only call "${SERVICE}
shutdown" on systems which actually have an actively running
component; and they use the existence of /var/lock/subsys/${SERVICE}
to determine which systems are running.
Changing XEN_LOCK_DIR to /var/lock/subsys is not suitable, as only
system services like xendomains should create lockfiles there; other
locks (such as the console locks) should be created in /var/lock
instead.
Instead, re-instate the check for the subsys/ subdirectory of the lock
directory in the xendomains script.
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Olaf Hering <olaf@aepfle.de>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Paul Durrant [Mon, 9 May 2016 16:43:14 +0000 (17:43 +0100)]
tools: configure correct trace backend for QEMU
Newer versions of the QEMU source have replaced the 'stderr' trace
backend with 'log'. This patch adjusts the tools Makefile to test for
the 'log' backend and specify it if it is available.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Jan Beulich [Wed, 11 May 2016 07:47:21 +0000 (09:47 +0200)]
x86: correct remaining extended CPUID level checks
We should consistently check the upper 16 bits to be equal 0x8000 and
only then the full value to be >= the desired level.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Jan Beulich [Wed, 11 May 2016 07:46:43 +0000 (09:46 +0200)]
x86: cap address bits CPUID output
Don't use more or report more to guests than we are capable of
handling.
At once
- correct the involved extended CPUID level checks,
- simplify the code in hvm_cpuid() and mtrr_top_of_ram().
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Jan Beulich [Wed, 11 May 2016 07:46:02 +0000 (09:46 +0200)]
XSA-77: widen scope again
As discussed on the hackathon, avoid us having to issue security
advisories for issues affecting only heavily disaggregated tool stack
setups, which no-one appears to use (or else they should step up to get
things into shape).
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Ian Jackson [Tue, 10 May 2016 18:14:34 +0000 (19:14 +0100)]
QEMU_TAG update
Ian Jackson [Tue, 10 May 2016 17:56:53 +0000 (18:56 +0100)]
QEMU_TAG update
Ross Lagerwall [Tue, 10 May 2016 09:10:02 +0000 (10:10 +0100)]
xsplice: Prevent new symbols duplicating core symbols
When loading patches, the code prevents loading a patch containing a new
symbol that duplicates a symbol from another loaded patch. However, the
check should also prevent loading a new symbol that duplicates a symbol
from the core hypervisor.
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Andrew Cooper [Mon, 9 May 2016 13:13:57 +0000 (13:13 +0000)]
x86/hvm: Fix invalidation for emulated invlpg instructions
hap_invlpg() is reachable from the instruction emulator, which means
introspection and tests using hvm_fep can end up here. As such, crashing the
domain is not an appropriate action to take.
Fixing this involves rearranging the callgraph.
paging_invlpg() is now the central entry point. It first checks for the
non-canonical NOP case, and calls into the paging subsystem. If a real flush
is needed, it will call the appropriate handler for the vcpu. This allows the
PV callsites of paging_invlpg() to be simplified.
The sole user of hvm_funcs.invlpg_intercept() is altered to use
paging_invlpg() instead, allowing the .invlpg_intercept() hook to be removed.
For both VMX and SVM, the existing $VENDOR_invlpg_intercept() is split in
half. $VENDOR_invlpg_intercept() stays as the intercept handler only (which
just calls paging_invlpg()), and new $VENDOR_invlpg() functions do the
ASID/VPID management. These latter functions are made available in hvm_funcs
for paging_invlpg() to use.
As a result, correct ASID/VPID management occurs for the hvmemul path, even if
it did not originate from a real hardware intercept.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Tim Deegan <tim@xen.org>
Acked-by: George Dunlap <george.dunlap@citrix.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Andrew Cooper [Mon, 9 May 2016 17:09:38 +0000 (18:09 +0100)]
x86/svm: Don't unconditionally use a new ASID in svm_invlpg_intercept()
paging_invlpg() already returns a boolean indicating whether an invalidation
is necessary or not. A return value of 0 indicates that the specified virtual
address wasn't shadowed (or has already been flushed), cannot currently be
cached in the TLB.
This is a performance optimisation.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Tim Deegan <tim@xen.org>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Andrew Cooper [Fri, 22 Apr 2016 08:44:53 +0000 (09:44 +0100)]
x86/hvm: Correct the emulated interaction of invlpg with segments
The `invlpg` instruction is documented to take a memory address, and is not
documented to suffer faults from segmentation violations. It is also
explicitly documented to be a NOP when issued on a non-canonical address.
Experimentally, and subsequently confirmed by both Intel and AMD, the
instruction does take into account segment bases, but will happily invalidate
a TLB entry for a mapping beyond the segment limit.
The emulation logic will currently raise #GP/#SS faults for segment limit
violations, or non-canonical addresses, which doesn't match hardware's
behaviour. Instead, squash exceptions generated by
hvmemul_virtual_to_linear() and proceed with invalidation.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Andrew Cooper [Sat, 7 May 2016 12:41:05 +0000 (13:41 +0100)]
x86/hvm: Raise #SS faults for %ss-based segmentation violations
Raising #GP under such circumstances is architecturally wrong.
Refer to the Intel or AMD manuals describing faults, and the conditions
under which #SS is raised.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Tim Deegan <tim@xen.org>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Andrew Cooper [Wed, 4 May 2016 13:52:24 +0000 (14:52 +0100)]
x86/hvm: Always return the linear address from hvm_virtual_to_linear_addr()
Some callers need the linear address (with appropriate segment base), whether
or not the limit or canonical check succeeds.
While modifying the function, change the return type to bool_t to match its
semantics.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Andrew Cooper [Tue, 10 May 2016 13:37:00 +0000 (14:37 +0100)]
sched/rt: Fix memory leak in rt_init()
c/s 2656bc7b0 "xen: adopt .deinit_pdata and improve timer handling"
introduced an error path into rt_init() which leaked prv if the
allocation of prv->repl_timer failed.
Introduce an error cleanup path.
Spotted by Coverity.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Meng Xu <mengxu@cis.upenn.edu>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
---
CC: George Dunlap <george.dunlap@eu.citrix.com>
CC: Dario Faggioli <dario.faggioli@citrix.com>
Dario Faggioli [Mon, 9 May 2016 14:41:00 +0000 (15:41 +0100)]
xen: adopt .deinit_pdata and improve timer handling
The scheduling hooks API is now used properly, and no
initialization or de-initialization happen in
alloc/free_pdata any longer.
In fact, just like it is for Credit2, there is no real
need for implementing alloc_pdata and free_pdata.
This also made it possible to improve the replenishment
timer handling logic, such that now the timer is always
kept on one of the pCPU of the scheduler it's servicing.
Before this commit, in fact, even if the pCPU where the
timer happened to be initialized at creation time was
moved to another cpupool, the timer stayed there,
potentially interfering with the new scheduler of the
pCPU itself.
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-and-Tested-by: Meng Xu <mengxu@cis.upenn.edu>
Acked-by: George Dunlap <george.dunlap@citrix.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Dario Faggioli [Tue, 26 Apr 2016 16:56:56 +0000 (18:56 +0200)]
xen: sched: avoid spuriously re-enabling IRQs in csched2_switch_sched()
interrupts are already disabled when calling the hook
(from schedule_cpu_switch()), so we must use spin_lock()
and spin_unlock().
Add an ASSERT(), so we will notice if this code and its
caller get out of sync with respect to disabling interrupts
(and add one at the same exact occurrence of this pattern
in Credit1 too)
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Juergen Gross [Mon, 9 May 2016 08:29:26 +0000 (10:29 +0200)]
doc: document that Domain-0 can't be migrated across cpupools
Domain-0 is always a member of Pool-0 (or, to be precise: of the cpupool
with cpupool-id 0). Document this in the xl man page.
Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Jan Beulich [Mon, 9 May 2016 11:21:38 +0000 (13:21 +0200)]
MAINTAINERS: put docs/man/ under tool stack
Right now there's only tool stack related documentation there.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Jan Beulich [Mon, 9 May 2016 11:21:06 +0000 (13:21 +0200)]
IOMMU: don't BUG() on exotic hardware
On x86, iommu_get_ops() BUG()s when running on non-Intel, non-AMD
hardware. While, with our current code, that's a correct prerequisite
assumption for IOMMU presence, this is wrong on systems without IOMMU.
Hence iommu_enabled (and alike) checks should be done prior to calling
that function, not after.
Also move iommu_suspend() next to iommu_resume() - it escapes me why
iommu_do_domctl() had got put between the two.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Roger Pau Monne [Tue, 3 May 2016 10:55:09 +0000 (12:55 +0200)]
xen/xsplice: add ELFOSABI_FREEBSD as a supported OSABI for payloads
The calling convention used by the FreeBSD ELF OSABI is exactly the same as
the one defined by System V, so payloads with a FreeBSD OSABI should be
accepted by the xsplice machinery.
Specifically "the FreeBSD ELF OSABI only has a meaning for userspace
applications, it's used by FreeBSD in order to detect if an application
is native or if it needs to be run in the linuxator (the Linux emulator,
or any other emulator that is available and matches the ELF OSABI specified
in the binary FWIW).
The only difference from SYSV to FreeBSD OSABI is the sysentvec that's
selected inside of the FreeBSD kernel (the ABI between the kernel and the
user-space application), but of course this doesn't apply to kernel code,
which is what Xen and the xsplice payloads are. Sadly this is not written
anywhere. " And since the ELF tools on FreeBSD by default build with
this - they would stick this OSABI entry.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Wei Liu [Fri, 29 Apr 2016 15:11:17 +0000 (16:11 +0100)]
blktap2: initialise buf in qcow2raw.c:main
Gcc complains:
qcow2raw.c: In function ‘main’:
qcow2raw.c:387:17: error: ‘buf’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
treq.buf = buf;
^
But buf is a valid buffer allocated by posix_memalign at that point.
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Doug Goldstein <cardoe@cardoe.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Wei Liu [Fri, 29 Apr 2016 15:11:16 +0000 (16:11 +0100)]
blktap2: initialise buf to NULL in img2qcow.c:main
Gcc complains:
qcow2raw.c: In function ‘main’:
qcow2raw.c:387:17: error: ‘buf’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
treq.buf = buf;
^
But at the point of that assignment, buf is a valid buffer allocated by
posix_memalign and filled in by read.
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Doug Goldstein <cardoe@cardoe.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Wei Liu [Fri, 29 Apr 2016 15:11:15 +0000 (16:11 +0100)]
blktap2: initialise buf in vhd_util_check_footer
Gcc complains:
vhd-util-check.c: In function ‘vhd_util_check_footer’:
vhd-util-check.c:413:2: error: ‘buf’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
memcpy(&backup, buf, sizeof(backup));
In fact buf is initialised a few lines above.
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Doug Goldstein <cardoe@cardoe.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Wei Liu [Fri, 29 Apr 2016 15:11:14 +0000 (16:11 +0100)]
rombios/tcgbios: initialise logdataptr in HashLogEvent32
Gcc complains:
tcgbios.c: In function ‘HashLogEvent32’:
tcgbios.c:1131:10: error: ‘logdataptr’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
entry = tcpa_extend_acpi_log(logdataptr);
It fails to figure out when logdataptr is used it is always initialised
in an if block a few lines above.
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Doug Goldstein <cardoe@cardoe.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Wei Liu [Fri, 29 Apr 2016 15:11:13 +0000 (16:11 +0100)]
rombios/tcgbios: initialise entry in HashLogEvent32
Gcc complains:
tcgbios.c:1142:22: error: ‘entry’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
hleo->eventnumber = entry;
It fails to figure out if entry is used it is always initialised in
previous if block.
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Doug Goldstein <cardoe@cardoe.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Wei Liu [Fri, 29 Apr 2016 15:11:12 +0000 (16:11 +0100)]
rombios/tcgbios: initialise size in tcpa_extend_acpi_log
Gcc complains:
tcgbios.c:362:3: error: ‘size’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
memcpy((char *)lasa_last, (char *)entry_ptr, size);
It fails to figure out if size is used in memcpy it is always initialised.
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Doug Goldstein <cardoe@cardoe.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Doug Goldstein [Thu, 5 May 2016 20:18:09 +0000 (15:18 -0500)]
init: shebang should be the first line
The shebang was not on the first line in the init script and it should
be.
Signed-off-by: Doug Goldstein <cardoe@cardoe.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Doug Goldstein [Thu, 5 May 2016 20:18:08 +0000 (15:18 -0500)]
init: drop GNU-isms for sleep command
Most implementations of the sleep command only take integers. GNU
coreutils has a GNU extension to allow any floating point number to be
passed but we shouldn't depend on that.
Signed-off-by: Doug Goldstein <cardoe@cardoe.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Daniel De Graaf [Fri, 6 May 2016 10:03:28 +0000 (12:03 +0200)]
flask/policy: don't audit commandline / build_id queries
Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Signed-off-by: Doug Goldstein <cardoe@cardoe.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Juergen Gross [Fri, 6 May 2016 10:03:13 +0000 (12:03 +0200)]
pvusb: add missing definition to usbif.h
The pvusb request structure contains the transfer_flags member which
is missing definitions of its semantics.
Add the definition of the USBIF_SHORT_NOT_OK flag.
Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Tamas K Lengyel [Fri, 6 May 2016 10:02:58 +0000 (12:02 +0200)]
MAINTAINERS: update monitor/vm_event covered code
Add headers to the covered list.
Signed-off-by: Tamas K Lengyel <tamas@tklengyel.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
Paul Lai [Wed, 4 May 2016 15:54:07 +0000 (08:54 -0700)]
build: Honor '--enable-githttp' in toplevel Makefile generation
During the make world, git mini-os.git didn't honor the 'configure
--enable-githttp' option. The 'enable-githttp' was only honored in
the tools subdirectory.
Signed-off-by: Paul Lai <paul.c.lai@intel.com>
[ wei: add prefix "build:" to title ]
Acked-by: Wei Liu <wei.liu2@citrix.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Dario Faggioli [Tue, 3 May 2016 21:46:50 +0000 (23:46 +0200)]
xen: credit2: fix 2 (minor) issues in load tracking logic
All calculations that involve load_last_update use quantities
shifted by LOADAVG_GRANULARITY_SHIFT, so make sure that this
is true even when the field is assigned a value for the first
time, during vcpu allocation.
Also, during migration, while the loads of both the source and
destination runqueues certainly need changing, the vcpu being
moved does not change its running/non-running status, and its
calculated load should hence not be affected.
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Dario Faggioli [Tue, 3 May 2016 21:46:42 +0000 (23:46 +0200)]
xen: sched: fix killing an uninitialized timer in free_pdata.
commit 64269d9365 "sched: implement .init_pdata in Credit,
Credit2 and RTDS" helped fixing Credit2 runqueues, and
the races we had in switching scheduler for pCPUs, but
introduced another issue. In fact, if CPU bringup fails
during __cpu_up() (and, more precisely, after CPU_UP_PREPARE,
but before CPU_STARTING) the CPU_UP_CANCELED notifier
would be executed, which calls the free_pdata hook.
Such hook does, right now, two things: (1) undo the
initialization done inside the init_pdata hook and (2)
free the memory allocated by the alloc_pdata hook.
However, in the failure path just described, it is possible
that only alloc_pdata were called, and this is potentially
an issue (depending on how actually free_pdata does).
In fact, for Credit1 (the only scheduler that actually
allocates per-pCPU data), this results in calling kill_timer()
on a timer that had not yet been initialized, which causes
the following:
(XEN) Xen call trace:
(XEN) [<000000000022e304>] timer.c#active_timer+0x8/0x24 (PC)
(XEN) [<000000000022f624>] kill_timer+0x108/0x2e0 (LR)
(XEN) [<00000000002208c0>] sched_credit.c#csched_free_pdata+0xd8/0x114
(XEN) [<0000000000227a18>] schedule.c#cpu_schedule_callback+0xc0/0x12c
(XEN) [<0000000000219944>] notifier_call_chain+0x78/0x9c
(XEN) [<00000000002015fc>] cpu_up+0x104/0x130
(XEN) [<000000000028f7c0>] start_xen+0xaf8/0xce0
(XEN) [<00000000810021d8>] 00000000810021d8
(XEN)
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) Assertion 'timer->status >= TIMER_STATUS_inactive' failed at timer.c:279
(XEN) ****************************************
Solve this by making the scheduler hooks API symmetric again,
i.e., by adding a deinit_pdata hook and making it responsible
for undoing what init_pdata did, rather than asking free_pdata
to do everything.
This is cleaner and, in the case at hand, makes it possible to
only call free_pdata (which is the right thing to do) as only
allocation and no initialization was performed.
Reported-by: Julien Grall <julien.grall@arm.com>
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
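The failure path described in the commit message above, as an added C sketch that is not part of the commit and not Xen code. The toy_* names, the three-state timer and the return code standing in for the ASSERT are assumptions made for illustration; only the hook names and the CPU notifier stages come from the message.

```c
#include <stdlib.h>

/* Toy timer: only the ordering invalid < inactive mirrors the assertion
 * quoted in the commit message. */
enum toy_status { TOY_invalid = 0, TOY_inactive, TOY_killed };
struct toy_timer { enum toy_status status; };
struct toy_pcpu { struct toy_timer ticker; };

/* Returns -1 where the hypervisor's kill_timer() would trip its ASSERT. */
static int toy_kill_timer(struct toy_timer *t)
{
    if (t->status < TOY_inactive)
        return -1;
    t->status = TOY_killed;
    return 0;
}

/* Symmetric hook pairs: alloc/free handle memory, init/deinit handle state. */
static struct toy_pcpu *toy_alloc_pdata(void)
{
    return calloc(1, sizeof(struct toy_pcpu));   /* timer left TOY_invalid */
}
static void toy_init_pdata(struct toy_pcpu *p) { p->ticker.status = TOY_inactive; }
static int toy_deinit_pdata(struct toy_pcpu *p) { return toy_kill_timer(&p->ticker); }
static void toy_free_pdata(struct toy_pcpu *p) { free(p); }

/* Pre-fix shape: free_pdata also undoes init_pdata, whether or not it ran. */
static int toy_free_pdata_old(struct toy_pcpu *p)
{
    int rc = toy_kill_timer(&p->ticker);

    free(p);
    return rc;
}

/* One bring-up/tear-down cycle. fail_before_init models a failure after
 * CPU_UP_PREPARE but before CPU_STARTING; symmetric selects the fixed API.
 * Returns 0 on clean teardown, -1 if the timer assertion would have fired. */
int toy_cpu_cycle(int fail_before_init, int symmetric)
{
    struct toy_pcpu *p = toy_alloc_pdata();      /* CPU_UP_PREPARE */
    int initialised = 0, rc = 0;

    if (!p)
        return -2;
    if (!fail_before_init) {                     /* bring-up did not fail */
        toy_init_pdata(p);
        initialised = 1;
    }
    /* Teardown: CPU_UP_CANCELED after a failure, normal removal otherwise. */
    if (!symmetric)
        return toy_free_pdata_old(p);
    if (initialised)
        rc = toy_deinit_pdata(p);
    toy_free_pdata(p);
    return rc;
}

int main(void)
{
    /* Exit status 0 when the old API fails and the symmetric one does not. */
    return !(toy_cpu_cycle(1, 0) == -1 && toy_cpu_cycle(1, 1) == 0);
}
```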
Anthony PERARD [Tue, 3 May 2016 15:59:49 +0000 (16:59 +0100)]
configure: Fix when no libsystemd compat lib are available
From systemd change log, since version 209, libsystemd.so contains
everything, including libsystemd-daemon.so. Distro may, or may not provide
the compatibility libraries which libsystemd-daemon is part of.
So, if libsystemd-daemon is not available, check for the presence of
a recent enough libsystemd.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
[ wei: run autogen.sh ]
Acked-by: Wei Liu <wei.liu2@citrix.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Roger Pau Monne [Tue, 3 May 2016 10:55:10 +0000 (12:55 +0200)]
libxl: fix usage of XEN_EOPNOTSUPP
The errno values returned by libxc are already translated into the
underlying OS error space, so it's wrong to compare them against Xen error
codes.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Roger Pau Monne [Tue, 3 May 2016 10:55:07 +0000 (12:55 +0200)]
tools/xsplice: fix mixing system errno values with Xen ones.
Avoid using system errno values when comparing with Xen errno values.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Roger Pau Monne [Tue, 3 May 2016 10:55:06 +0000 (12:55 +0200)]
tools/xsplice: correctly use errno
Some error paths incorrectly used rc instead of errno.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Roger Pau Monne [Tue, 3 May 2016 10:55:05 +0000 (12:55 +0200)]
libxl: add a define for equivalent ENODATA errno on FreeBSD
Currently FreeBSD lacks the ENODATA errno value, so the privcmd driver
always translates ENODATA to ENOENT, add a define to libxl in order to
correctly match ENODATA with ENOENT on FreeBSD.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Wei Liu [Wed, 4 May 2016 09:37:59 +0000 (10:37 +0100)]
Config.mk: update mini-os revision
This is only one commit:
build: change MINI-OS_ROOT to MINIOS_ROOT
This change is required to fix stubdom build on Ubuntu 16.04.
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Kyle J. Temkin [Thu, 28 Apr 2016 17:14:07 +0000 (13:14 -0400)]
xen/arm64: ensure that the correct SP is used for exceptions
The ARMv8 architecture has a SPSel ("stack pointer selection") machine
register that allows us to determine which exception level's stack
pointer is loaded when an exception occurs. As we don't want to
use the non-privileged SP_EL0 stack pointer -- or even assume that SP_EL0
points to a valid address in the hypervisor context-- we'll need to ensure
that our EL2 code sets the SPSel to SP_ELn mode, so exceptions that trap
to EL2 use the EL2 stack pointer.
This corrects an issue that can manifest as a hang-on-IRQ on some
arm64 cores if the firmware/bootloader has previously initialized SPSel
to 0; in which case Xen's exceptions will incorrectly use an invalid SP_EL0,
and will endlessly spin on the synchronous abort handler.
Signed-off-by: Kyle Temkin <temkink@ainfosec.com>
Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Julien Grall <julien.grall@arm.com>
Roger Pau Monné [Wed, 4 May 2016 07:46:57 +0000 (09:46 +0200)]
xsplice: check against ELFOSABI_NONE instead of ELFOSABI_SYSV
They are equivalent, but using ELFOSABI_NONE is more correct in this
context.
Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Wed, 4 May 2016 07:44:32 +0000 (09:44 +0200)]
IOMMU/x86: per-domain control structure is not HVM-specific
... and hence should not live in the HVM part of the PV/HVM union. In
fact it's not even architecture specific (there already is a per-arch
extension type to it), so it gets moved out right to common struct
domain.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Jan Beulich [Wed, 4 May 2016 07:43:37 +0000 (09:43 +0200)]
x86/p2m: also tear down altp2m
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
Razvan Cojocaru [Wed, 4 May 2016 07:42:06 +0000 (09:42 +0200)]
x86/monitor: disallow setting mem_access_emulate_each_rep when vm_event is NULL
It is meaningless (and potentially dangerous - see hvmemul_virtual_to_linear())
to set mem_access_emulate_each_rep before xc_monitor_enable() (which allocates
vcpu->arch.vm_event) has been called, so return an error from the
XEN_DOMCTL_MONITOR_OP_EMULATE_EACH_REP hypercall when that is the case.
Signed-off-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citirx.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Tamas K Lengyel <tamas@tklengyel.com>
David Vrabel [Tue, 3 May 2016 16:15:38 +0000 (17:15 +0100)]
x86: show correct code in CPU state
When showing the CPU state (e.g., after a crash) the dump of code
around RIP is incorrect.
Incorrect:
Xen code around <ffff82d0801113cf> (...):
00 c6 c1 ee 08 48 c1 e0 <04> 03 04 f1 8b ...
^^ Uninitialized ^^ Missing 0x48
Correct:
Xen code around <ffff82d0801113cf> (...):
c6 c1 ee 08 48 c1 e0 04 <48> 03 04 f1 8b ...
When copying the bytes before RIP, the destination was off-by-one.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Ian Jackson [Tue, 3 May 2016 11:43:35 +0000 (12:43 +0100)]
Final touches for Xen 4.7.0-rc1
* Update README and xen/Makefile to "Xen 4.7-rc"
* Fix QEMU_UPSTREAM_REVISION to a tag, so we get a specific commit
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Konrad Rzeszutek Wilk [Mon, 2 May 2016 12:59:43 +0000 (08:59 -0400)]
xsplice: Missing if ( sec )
Add the missing conditional.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reported-by: Jan Beulich <JBeulich@suse.com>
Acked-by: Jan Beulich <JBeulich@suse.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Wei Liu [Sun, 1 May 2016 18:21:45 +0000 (19:21 +0100)]
tools: xen-xsplice.c: fix length parameter of memset in list_func
The length expression should be the same one used in malloc.
CID: 1358947
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Konrad Rzeszutek Wilk [Fri, 29 Apr 2016 06:38:33 +0000 (02:38 -0400)]
ocaml/xc_get_cpu_featureset/arm: Return not implemented on ARM
... as it is not implemented on it.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Jan Beulich [Mon, 2 May 2016 07:20:17 +0000 (09:20 +0200)]
x86/shadow: account for ioreq server pages before complaining about not found mapping
prepare_ring_for_helper(), just like share_xen_page_with_guest(),
takes a write reference on the page, and hence should similarly be
accounted for when determining whether to log a complaint.
This requires using recursive locking for the ioreq server lock, as the
offending invocation of sh_remove_all_mappings() is down the call stack
from hvm_set_ioreq_server_state(). (While not strictly needed to be
done in all other instances too, convert all of them for consistency.)
At once improve the usefulness of the shadow error message: Log all
values involved in triggering it as well as the GFN (to aid
understanding which guest page it is that there is a problem with - in
cases like the one here the GFN is invariant across invocations, while
the MFN obviously can [and will] vary).
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Wei Liu [Fri, 29 Apr 2016 17:25:31 +0000 (18:25 +0100)]
mkelf32: fix compilation on 32 bit build host
When cross-compiling xen on a 32 bit build host:
boot/mkelf32.c: In function 'main':
boot/mkelf32.c:360:21: error: format '%ld' expects argument of type 'long int', but argument 3 has type 'Elf64_Off' [-Werror=format]
cc1: all warnings being treated as errors
Fix that by using PRId64 in format string.
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
George Dunlap [Wed, 27 Apr 2016 16:00:37 +0000 (17:00 +0100)]
MAINTAINERS: Clarify the meaning of nested maintainership
Clarify the meaning of nested maintainership.
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
---
We had a discussion about the meaning of nested maintainership at the
recent Xen Hackathon. The notes of that meeting can be found on this
list [1]. No decision is official until discussed on this list, so
consider this patch the official proposal for this change, and object
or ask for clarification accordingly.
Compared to v1, there is one change that is worth pointing out: The
claim that THE REST consists of all committers. This is the case at
the moment, but this change would codify that this is an invariant we
intend to keep going forward.
The advantage of this is that the dispute resolution mentioned in this
patch for maintainers who can't agree lines up directly with the
fall-back for broader community issues upon which we can't reach
consensus.
[1] marc.info/?i=<EDB48431-C3EF-4461-B2D2-3AB95EA6C392@gmail.com>
Changes in v2:
- fixed spelling of "maintainer"
- fixed path of multi.c
- clarified that the resolution by REST would be by *majority* vote
- Asserted that The REST consists of all committers
CC: Ian Jackson <ian.jackson@eu.citrix.com>
CC: Jan Beulich <jbeulich@suse.com>
CC: Keir Fraser <keir@xen.org>
CC: Tim Deegan <tim@xen.org>
CC: Wei Liu <wei.liu2@citrix.com>
CC: Konrad Wilk <konrad.wilk@oracle.com>
CC: Andrew Cooper <andrew.cooper3@citrix.com>
CC: Lars Kurth <lars.kurth@citrix.com>
Jan Beulich [Fri, 29 Apr 2016 16:30:22 +0000 (18:30 +0200)]
x86: fix domain cleanup
Free d->arch.cpuids on both the creation error path and during
destruction.
Don't bypass iommu_domain_destroy().
Move psr_domain_init() up so that hvm_domain_initialise() again is the
last thing which can fail, and hence doesn't require undoing later on.
Move psr_domain_free() up on the creation error path, so that cleanup
once again gets done in reverse order of setup.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Jan Beulich [Fri, 29 Apr 2016 16:28:41 +0000 (18:28 +0200)]
x86/vMSI-X: also snoop REP MOVS
... as at least certain versions of Windows use such to update the
MSI-X table. However, to not overly complicate the logic for now
- only EFLAGS.DF=0 is being handled,
- only updates not crossing MSI-X table entry boundaries are handled.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Konrad Rzeszutek Wilk [Thu, 28 Apr 2016 17:54:14 +0000 (13:54 -0400)]
xsplice: Don't perform multiple operations on same payload once work is scheduled.
Currently it is possible to:
1) xc_xsplice_apply()
\-> xsplice_action
spin_lock(payload_lock)
\- schedule_work()
spin_unlock(payload_lock);
2) xc_xsplice_unload()
\-> xsplice_action
spin_lock(payload_lock)
free_payload(data);
spin_unlock(payload_lock);
.. all CPUs are quiesced.
3) check_for_xsplice_work()
\-> apply_payload
\-> arch_xsplice_apply_jmp
BOOM
The reason is that state is in 'CHECKED' which changes to 'APPLIED'
once check_for_xsplice_work finishes. So we have a race between 1) -> 3)
where one can manipulate the payload.
To guard against this we add a check in xsplice_action to not allow
any actions if schedule_work has been called for this specific payload.
The function 'is_work_scheduled' checks xsplice_work which is safe as:
- The ->do_work changes to 1 under the payload_lock (which we also hold).
- The ->do_work changes to 0 when all CPUs are quiesced and IRQs have
been disabled.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reported-and-Tested-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Konrad Rzeszutek Wilk [Mon, 15 Feb 2016 21:24:58 +0000 (16:24 -0500)]
MAINTAINERS/xsplice: Add myself and Ross as the maintainers.
If you have a patch for xSplice send it our way!
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Ross Lagerwall [Wed, 6 Apr 2016 19:15:01 +0000 (15:15 -0400)]
xsplice: Prevent duplicate payloads from being loaded.
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Konrad Rzeszutek Wilk [Thu, 28 Apr 2016 14:11:46 +0000 (10:11 -0400)]
xsplice/xen_replace_world: Test-case for XSPLICE_ACTION_REPLACE
With this third payload one can do:
-bash-4.1# xen-xsplice load xen_hello_world.xsplice
Uploading xen_hello_world.xsplice (10148 bytes)
Performing check: completed
Performing apply:. completed
[xen_hello_world depends on hypervisor build-id]
-bash-4.1# xen-xsplice load xen_bye_world.xsplice
Uploading xen_bye_world.xsplice (7076 bytes)
Performing check: completed
Performing apply:. completed
[xen_bye_world depends on xen_hello_world build-id]
-bash-4.1# xen-xsplice upload xen_replace_world xen_replace_world.xsplice
Uploading xen_replace_world.xsplice (7148 bytes)
-bash-4.1# xen-xsplice list
ID | status
----------------------------------------+------------
xen_hello_world | APPLIED
xen_bye_world | APPLIED
xen_replace_world | CHECKED
-bash-4.1# xen-xsplice replace xen_replace_world
Performing replace:. completed
-bash-4.1# xl info | grep extra
xen_extra : Hello Again World!
-bash-4.1# xen-xsplice list
ID | status
----------------------------------------+------------
xen_hello_world | CHECKED
xen_bye_world | CHECKED
xen_replace_world | APPLIED
and revert both of the previous payloads and apply
the xen_replace_world.
All the magic of this is in the Makefile - we extract
the build-id from the hypervisor (xen-syms) and jam it
in the xen_replace_world as .xsplice.depends.
We also make .old_addr be zero, forcing the hypervisor
to lookup the xen_extra_version.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Konrad Rzeszutek Wilk [Thu, 28 Apr 2016 14:09:11 +0000 (10:09 -0400)]
xsplice: Stacking build-id dependency checking.
We now expect that the ELF payloads be built with the
--build-id.
Also the .xsplice.deps section has to have the contents
of the hypervisor (or a preceding payload) build-id.
We already have the code to verify the Elf_Note build-id
so export parts of it.
This dependency means the hypervisor MUST be compiled with
--build-id - so we gate the build of xSplice on the availability
of said functionality.
This does not impact the ordering of how the payloads can
be loaded, but it does enforce a STRICT ordering when the
payloads are applied. Also the REPLACE is special - we need
to check that its dependency against the hypervisor - not
the last applied patch.
To make this easier to test we also add an extra test-case
to be used - which can only be applied on top of the
xen_hello_world payload.
As in, one can apply xen_hello_world and then xen_bye_world
on top of that. Not the other way.
We also print the dependency and payloads build_id in the keyhandler.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Konrad Rzeszutek Wilk [Fri, 15 Jan 2016 02:38:24 +0000 (21:38 -0500)]
libxl: info: Display build_id of the hypervisor.
If the hypervisor is built with we will display it.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Konrad Rzeszutek Wilk [Wed, 13 Apr 2016 17:14:36 +0000 (13:14 -0400)]
XENVER_build_id/libxc: Provide ld-embedded build-id
If the hypervisor was built with build-ids we can expose the
build-id value to the toolstack (if it is not built with
it will just return -ENODATA). This is a privileged operation
so only the controlling stack is able to request this.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Acked-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Konrad Rzeszutek Wilk [Wed, 20 Apr 2016 20:20:37 +0000 (16:20 -0400)]
xsplice: Print build_id in keyhandler and on bootup.
As it should be a useful debug mechanism.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Konrad Rzeszutek Wilk [Thu, 7 Apr 2016 02:05:06 +0000 (22:05 -0400)]
build_id: Provide ld-embedded build-ids
This patch enables the Elf to be built with the build-id
and provide in the Xen hypervisor the code to extract it.
The man-page for ld --build-id says it is:
"Request the creation of a ".note.gnu.build-id" ELF note
section or a ".build-id" COFF section. The contents of the
note are unique bits identifying this linked file. style can be
"uuid" to use 128 random bits, "sha1" to use a 160-bit SHA1 hash
on the normative parts of the output contents, ..."
One can also retrieve the value of the build-id by doing
'readelf -n xen-syms'.
For EFI builds we re-use the same build-id that the xen-syms
was built with.
The version of ld that first implemented --build-id is v2.18.
We check to see if the linker supports the --build-id
parameter and if so use it.
For x86 we have two binaries - the xen-syms and the xen - a
smaller version with lots of sections removed. To make it possible
for readelf -n xen we also modify mkelf32 and xen.lds.S to include
the PT_NOTE ELF section.
The EFI binary is more complicated. We only build one type of
binary and expanding the amount of sections the EFI binary has to
include an .note one is pointless - as there is no concept of
PT_NOTE. The best we can do is move this .note in the .rodata section.
Further development wise should move it to .buildid section
so that DataDirectory debug data nor CodeView can view it.
(The author has no clue what those are).
Note that in earlier patches the linker script had:
__note_gnu_build_id_start = .;
*(.rodata.note.gnu.build-id)
__note_gnu_build_id_end = .;
*(.note)
*(.note.*)
Which meant you could have different ELF notes _outside_ the
__note_gnu_build_id_end. However for EFI builds we take the whole
.note* section and jam it in the EFI to be between
__note_gnu_build_id_start and __note_gnu_build_id_end.
To not make this happen we make on the ELF build the section
be called .note.gnu.build-id (instead of just .note).
If there is a need for a different type of note other folks
can add it as a different section name.
Note that we do call --binary-id=sha1 on all linker invocations.
We have to do to enforce that the symbol offsets don't change
(the side effect is that we would have multiple binary ids -
except that the last one is the final one).
Without this working the symbol table embedded in Xen ends
up incorrect - some of the values it contains would be offset by the
size of the included build id.
This obviously causes problems when resolving symbols.
We also define the NT_GNU_BUILD_ID in the elfstructs.h as we
need to use it in various places.
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Martin Pohlack <mpohlack@amazon.de>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
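The build-id note layout from the commit message above, together with the apply-ordering rule from the earlier "Stacking build-id dependency checking" message, as an added C sketch that is not part of either commit and not Xen code. Every name except NT_GNU_BUILD_ID is hypothetical, and the note buffer and 20-byte ids are constructed sample data in host byte order.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NT_GNU_BUILD_ID 3u   /* note type the commit adds to elfstructs.h */

struct build_id { const uint8_t *p; size_t len; };
struct payload { struct build_id id, dep; };  /* dep: dependency section */
struct patch_stack { struct build_id hypervisor; const struct payload *top; };
enum action { ACT_APPLY, ACT_REPLACE };

static size_t align4(size_t x) { return (x + 3) & ~(size_t)3; }

/* Find the GNU build-id in a buffer of ELF notes (namesz, descsz, type, then
 * name and desc, each padded to 4 bytes). -1 if absent or out of bounds. */
static int find_build_id(const uint8_t *buf, size_t len, struct build_id *out)
{
    size_t off = 0;

    while (off <= len && len - off >= 12) {
        uint32_t namesz, descsz, type;
        size_t name_off = off + 12, desc_off;

        memcpy(&namesz, buf + off, 4);
        memcpy(&descsz, buf + off + 4, 4);
        memcpy(&type, buf + off + 8, 4);
        /* Sizes come from the file: bounds-check before using them. */
        if (namesz > len - name_off)
            return -1;
        desc_off = name_off + align4(namesz);
        if (desc_off > len || descsz > len - desc_off)
            return -1;
        if (type == NT_GNU_BUILD_ID && namesz == 4 && descsz > 0 &&
            memcmp(buf + name_off, "GNU", 4) == 0) {
            out->p = buf + desc_off;
            out->len = descsz;
            return 0;
        }
        off = desc_off + align4(descsz);
    }
    return -1;
}

static int build_id_eq(struct build_id a, struct build_id b)
{
    return a.len == b.len && memcmp(a.p, b.p, a.len) == 0;
}

/* APPLY must depend on the most recently applied payload (the hypervisor if
 * none is applied); REPLACE is always checked against the hypervisor. */
static int apply_payload(struct patch_stack *s, const struct payload *p,
                         enum action act)
{
    struct build_id want = s->hypervisor;

    if (act == ACT_APPLY && s->top)
        want = s->top->id;
    if (!build_id_eq(p->dep, want))
        return -1;
    s->top = p;
    return 0;
}

/* Constructed sample input: an unrelated note, then a build-id note. */
static const struct {
    uint32_t n1, d1, t1; char name1[8]; uint8_t desc1[4];
    uint32_t n2, d2, t2; char name2[4]; uint8_t desc2[20];
} SAMPLE = { 6, 4, 1, "Other", { 1, 2, 3, 4 },
             4, 20, NT_GNU_BUILD_ID, "GNU", { 0xa1, 0x5e } };

/* Parse SAMPLE with `cut` bytes chopped off its end; returns the build-id
 * length, or -1 if the parser rejects the buffer. */
int demo_parse(size_t cut)
{
    struct build_id id;

    if (find_build_id((const uint8_t *)&SAMPLE, sizeof(SAMPLE) - cut, &id))
        return -1;
    return memcmp(id.p, SAMPLE.desc2, id.len) ? -1 : (int)id.len;
}

/* One bit per expected outcome of the hello/bye/replace sequence. */
int demo_order(void)
{
    static const uint8_t h_id[20] = { 0xb2 }, b_id[20] = { 0xc3 }, r_id[20] = { 0xd4 };
    struct build_id hv = { SAMPLE.desc2, 20 };
    struct payload hello = { { h_id, 20 }, hv };
    struct payload bye = { { b_id, 20 }, { h_id, 20 } };
    struct payload repl = { { r_id, 20 }, hv };
    struct patch_stack s = { hv, NULL };
    int r = 0;

    r |= (apply_payload(&s, &bye, ACT_APPLY) == -1) << 0;   /* bye needs hello */
    r |= (apply_payload(&s, &hello, ACT_APPLY) == 0) << 1;
    r |= (apply_payload(&s, &bye, ACT_APPLY) == 0) << 2;
    r |= (apply_payload(&s, &repl, ACT_APPLY) == -1) << 3;  /* top is bye */
    r |= (apply_payload(&s, &repl, ACT_REPLACE) == 0) << 4;
    return r;
}

int main(void)
{
    return !(demo_parse(0) == 20 && demo_parse(8) == -1 && demo_order() == 0x1f);
}
```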
Ross Lagerwall [Sat, 9 Apr 2016 14:07:21 +0000 (10:07 -0400)]
xsplice: Add support for alternatives
Add support for applying alternative sections within xsplice payload.
At payload load time, apply any alternative sections that are found.
Also we add a test-case exercising a rather useless alternative
(patching a NOP with a NOP) - but it does exercise the code-path.
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Ross Lagerwall [Wed, 20 Apr 2016 20:20:26 +0000 (16:20 -0400)]
xsplice: Add support for exception tables.
Add support for exception tables contained within xSplice payloads. If an
exception occurs search either the main exception table or a particular
active payload's exception table depending on the instruction pointer.
Also we add a test-case to make sure we have an exception that
is handled.
To not grow the code-base if xSplice is not compiled in we add
certain #define to help in determining if code needs to be __init
or not.
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Ross Lagerwall [Wed, 27 Apr 2016 15:30:54 +0000 (11:30 -0400)]
xsplice: Add support for bug frames.
Add support for handling bug frames contained within xsplice modules. If a
trap occurs search either the kernel bug table or an applied payload's
bug table depending on the instruction pointer.
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
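Added example, not code from the Xen tree or from these commits: a minimal sketch of the lookup that the exception-table and bug-frame messages above describe, where the instruction pointer first selects the owning text region and only that region's table is searched. The struct layouts, the name lookup_fixup and all addresses are assumptions constructed for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: faulting instruction address -> fixup address. */
struct ex_entry { uintptr_t addr, fixup; };

/* Hypothetical text region [start, end) with its own table, sorted by addr. */
struct region {
    uintptr_t start, end;
    const struct ex_entry *ex;
    size_t n_ex;
};

/* Constructed sample data: core text plus one applied payload. */
static const struct ex_entry core_ex[] = { {0x1010, 0x1f00}, {0x1200, 0x1f10} };
static const struct ex_entry payload_ex[] = { {0x9020, 0x90f0} };
static const struct region regions[] = {
    { 0x1000, 0x2000, core_ex, 2 }, { 0x9000, 0x9100, payload_ex, 1 },
};

/* Binary search of one sorted table; returns the fixup, or 0 if none. */
static uintptr_t search_one(const struct ex_entry *t, size_t n, uintptr_t ip)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (t[mid].addr == ip)
            return t[mid].fixup;
        if (t[mid].addr < ip)
            lo = mid + 1;
        else
            hi = mid;
    }
    return 0;
}

/* Select the region that contains ip, then search only that region's table. */
uintptr_t lookup_fixup(const struct region *r, size_t n, uintptr_t ip)
{
    for (size_t i = 0; i < n; i++)
        if (ip >= r[i].start && ip < r[i].end)
            return search_one(r[i].ex, r[i].n_ex, ip);
    return 0; /* ip lies outside every registered region: unhandled */
}

int main(void)
{
    return lookup_fixup(regions, 2, 0x9020) == 0x90f0 ? 0 : 1;
}
```

A fault inside the payload range with no matching entry returns 0 without ever consulting the core table, so an unhandled payload fault cannot pick up a core fixup by accident.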
Ross Lagerwall [Wed, 27 Apr 2016 15:30:25 +0000 (11:30 -0400)]
x86, xsplice: Print payload's symbol name and payload name in backtraces
Naturally the backtrace is presented when an instruction
hits a bug_frame or %p is used.
The payloads do not support bug_frames yet - however the functions
the payloads call could hit a BUG() or WARN().
The traps.c has logic to scan for this - and eventually it will
find the correct bug_frame and then walk the stack using %p to print
the backtrace. For %p and symbols to print a string - the
'is_active_kernel_text' is consulted which uses a 'struct virtual_region'.
Therefore we register our start->end addresses so that
'is_active_kernel_text' will include our payload address.
We also register our symbol lookup table function so that it can
scan the list of payloads and retrieve the correct name.
Lastly we change vsprintf to take into account s and namebuf.
For core code they are the same, but for payloads they are different.
This gets us:
Xen call trace:
   [<ffff82d080a00041>] revert_hook+0x31/0x35 [xen_hello_world]
   [<ffff82d0801431bd>] xsplice.c#revert_payload+0x86/0xc6
   [<ffff82d080143502>] check_for_xsplice_work+0x233/0x3cd
   [<ffff82d08017a0b2>] domain.c#continue_idle_domain+0x9/0x1f
Which is great if payloads have similar or same symbol names.
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>