xen.git
12 years agoarch/x86: use XSM hooks for get_pg_owner access checks
Daniel De Graaf [Fri, 11 Jan 2013 10:39:58 +0000 (10:39 +0000)]
arch/x86: use XSM hooks for get_pg_owner access checks

There are three callers of get_pg_owner:
 * do_mmuext_op, which does not have XSM hooks on all subfunctions
 * do_mmu_update, which has hooks that are inefficient
 * do_update_va_mapping_otherdomain, which has a simple XSM hook

In order to preserve return values for the do_mmuext_op hypercall, an
additional XSM hook is required to check the operation even for those
subfunctions that do not use the pg_owner field. This also covers the
MMUEXT_UNPIN_TABLE operation which did not previously have an XSM hook.

The XSM hooks in do_mmu_update were capable of replacing the checks in
get_pg_owner; however, the hooks are buried in the inner loop of the
function - not very good for performance when XSM is enabled and these
turn into indirect function calls. This patch removes the PTE from
the hooks and replaces it with a bitfield describing what accesses are
being requested. The XSM hook can then be called only when additional
bits are set instead of once per iteration of the loop.
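
A minimal sketch of the hoisting idea described above, not the code from
the patch. The flag names, check_access() and process_requests() are
hypothetical stand-ins; the point is only that the hook runs when the
set of requested access bits grows, not once per loop iteration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical access bits; the real flag names in Xen may differ. */
#define ACC_MAP_READ   (1u << 0)
#define ACC_MAP_WRITE  (1u << 1)
#define ACC_UPDATE_PT  (1u << 2)

/* Counts how often the (stand-in) hook was invoked. */
static unsigned int hook_calls;

/* Stand-in for an XSM hook: records the call and allows everything. */
static int check_access(uint32_t flags)
{
    (void)flags;
    hook_calls++;
    return 0;
}

/*
 * Walk n requests. Each request contributes the access bits it needs;
 * the hook is called only when the accumulated set differs from the
 * set that has already been checked.
 */
static int process_requests(const uint32_t *req_flags, size_t n)
{
    uint32_t needed = 0, checked = 0;

    for (size_t i = 0; i < n; i++) {
        needed |= req_flags[i];
        if (needed != checked) {
            int rc = check_access(needed);

            if (rc)
                return rc;
            checked = needed;
        }
        /* ... the actual update for request i would happen here ... */
    }
    return 0;
}
```

With four requests that need read, read, read+write and write access,
the hook in this sketch runs twice instead of four times.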

This patch results in a change in the FLASK permissions used for
mapping an MMIO page: the target for the permission check on the
memory mapping is no longer resolved to the device-specific type, and
is instead either the domain's own type or domio_t (depending on
whether the domain uses DOMID_SELF or DOMID_IO in the map command).
Device-specific access is still controlled via the "resource use"
permission checked at domain creation (or device hotplug).

Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Acked-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Tim Deegan <tim@xen.org>
Committed-by: Keir Fraser <keir@xen.org>
12 years agoarch/x86: Add missing mem_sharing XSM hooks
Daniel De Graaf [Fri, 11 Jan 2013 10:39:20 +0000 (10:39 +0000)]
arch/x86: Add missing mem_sharing XSM hooks

This patch splits up the mem_sharing and mem_event XSM hooks to
better cover what the code is doing. It also changes the utility
function get_mem_event_op_target to rcu_lock_live_remote_domain_by_id
because there is no mm-specific logic in there.

Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Acked-by: Tim Deegan <tim@xen.org>
Acked-by: Jan Beulich <jbeulich@suse.com>
Committed-by: Keir Fraser <keir@xen.org>
12 years agoxsm/flask: add distinct SIDs for self/target access
Daniel De Graaf [Fri, 11 Jan 2013 10:38:39 +0000 (10:38 +0000)]
xsm/flask: add distinct SIDs for self/target access

Because the FLASK XSM module no longer checks IS_PRIV for remote
domain accesses covered by XSM permissions, domains now have the
ability to perform memory management and other functions on all
domains that have the same type. While it is possible to prevent this
by only creating one domain per type, this solution significantly
limits the flexibility of the type system.

This patch introduces a domain type transition to represent a domain
that is operating on itself. In the example policy, this is
demonstrated by creating a type with _self appended when declaring a
domain type which will be used for reflexive operations. AVCs for a
domain of type domU_t will look like the following:

scontext=system_u:system_r:domU_t
tcontext=system_u:system_r:domU_t_self

This change also allows policy to distinguish between event channels a
domain creates to itself and event channels created between domains of
the same type.

The IS_PRIV_FOR check used for device model domains is also no longer
checked by FLASK; a similar transition is performed when the target is
set and used when the device model accesses its target domain.
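
The selection of the target SID described above could be sketched
roughly as follows. The structure and field names are illustrative
assumptions, not necessarily those used by the FLASK module: a domain
acting on itself is checked against its "_self" type, a device model
acting on its target is checked against a dedicated target type, and
every other access uses the destination domain's ordinary type.

```c
#include <stdint.h>

/* Hypothetical per-domain security state; field names are illustrative. */
struct dom_sec {
    uint32_t sid;                 /* type seen by unrelated domains */
    uint32_t self_sid;            /* "_self" type for reflexive operations */
    uint32_t target_sid;          /* type used for this domain's target */
    const struct dom_sec *target; /* device model target, or NULL */
};

/* Pick the SID that the access check should use as its target context. */
static uint32_t target_sid_for(const struct dom_sec *src,
                               const struct dom_sec *dst)
{
    if (src == dst)
        return src->self_sid;     /* domain operating on itself */
    if (src->target == dst)
        return src->target_sid;   /* device model operating on its target */
    return dst->sid;              /* ordinary cross-domain access */
}
```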

Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Committed-by: Keir Fraser <keir@xen.org>
12 years agoxsm/flask: add missing hooks
Daniel De Graaf [Fri, 11 Jan 2013 10:37:47 +0000 (10:37 +0000)]
xsm/flask: add missing hooks

The FLASK module was missing implementations of some hooks and did not
have access vectors defined for 10 domctls; define these now.

Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Committed-by: Keir Fraser <keir@xen.org>
12 years agoxsm/flask: Add checks on the domain performing the set_target operation
Daniel De Graaf [Fri, 11 Jan 2013 10:37:10 +0000 (10:37 +0000)]
xsm/flask: Add checks on the domain performing the set_target operation

The existing domain__set_target check only verifies that the source
and target domains can be associated. We also need to check that the
privileged domain making this association is allowed to do so.

Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Committed-by: Keir Fraser <keir@xen.org>
12 years agoxsm: Move flask policy files into hypervisor (missed from earlier commit).
Keir Fraser [Fri, 11 Jan 2013 10:36:06 +0000 (10:36 +0000)]
xsm: Move flask policy files into hypervisor (missed from earlier commit).

Signed-off-by: Keir Fraser <keir@xen.org>
--HG--
rename : tools/flask/policy/policy/flask/access_vectors => xen/xsm/flask/policy/access_vectors
rename : tools/flask/policy/policy/flask/initial_sids => xen/xsm/flask/policy/initial_sids
rename : tools/flask/policy/policy/flask/mkaccess_vector.sh => xen/xsm/flask/policy/mkaccess_vector.sh
rename : tools/flask/policy/policy/flask/mkflask.sh => xen/xsm/flask/policy/mkflask.sh
rename : tools/flask/policy/policy/flask/security_classes => xen/xsm/flask/policy/security_classes

12 years agoarch/x86: convert platform_hypercall to use XSM
Daniel De Graaf [Fri, 11 Jan 2013 10:11:02 +0000 (10:11 +0000)]
arch/x86: convert platform_hypercall to use XSM

The newly introduced xsm_platform_op hook addresses new sub-ops, while
most ops already have their own XSM hooks.

Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Acked-by: Jan Beulich <jbeulich@suse.com>
Committed-by: Keir Fraser <keir@xen.org>
12 years agoxen: convert do_sysctl to use XSM
Daniel De Graaf [Fri, 11 Jan 2013 10:10:21 +0000 (10:10 +0000)]
xen: convert do_sysctl to use XSM

The xsm_sysctl hook now covers every sysctl, in addition to the more
fine-grained XSM hooks in most sub-functions.

Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Committed-by: Keir Fraser <keir@xen.org>
12 years agoxen: convert do_domctl to use XSM
Daniel De Graaf [Fri, 11 Jan 2013 10:09:45 +0000 (10:09 +0000)]
xen: convert do_domctl to use XSM

The xsm_domctl hook now covers every domctl, in addition to the more
fine-grained XSM hooks in most sub-functions. This also removes the
need to special-case XEN_DOMCTL_getdomaininfo.

Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Committed-by: Keir Fraser <keir@xen.org>
12 years agoxen: avoid calling rcu_lock_*target_domain when an XSM hook exists
Daniel De Graaf [Fri, 11 Jan 2013 10:07:19 +0000 (10:07 +0000)]
xen: avoid calling rcu_lock_*target_domain when an XSM hook exists

The rcu_lock_{,remote_}target_domain_by_id functions are wrappers
around an IS_PRIV_FOR check for the current domain. This is now
redundant with XSM hooks, so replace these calls with
rcu_lock_domain_by_any_id or rcu_lock_remote_domain_by_id to remove
the duplicate permission checks.

When XSM_ENABLE is not defined or when the dummy XSM module is used,
this patch should not change any functionality. Because the locations
of privilege checks have sometimes moved below argument validation,
error returns of some functions may change from EPERM to EINVAL when
called with invalid arguments and from a domain without permission to
perform the operation.
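
To illustrate the change in error returns, here is a small sketch with
two hypothetical operations, op_old() and op_new(), that differ only in
the order of the privilege check and the argument validation. They
agree in every case except an unprivileged caller passing an invalid
argument.

```c
#include <errno.h>
#include <stdbool.h>

/* Old ordering: privilege is checked before the argument is validated. */
static int op_old(bool privileged, int arg)
{
    if (!privileged)
        return -EPERM;
    if (arg < 0)
        return -EINVAL;
    return 0;
}

/* New ordering: the argument is validated first, then the hook runs. */
static int op_new(bool privileged, int arg)
{
    if (arg < 0)
        return -EINVAL;
    if (!privileged)
        return -EPERM;
    return 0;
}
```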

Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Acked-by: Jan Beulich <jbeulich@suse.com>
Committed-by: Keir Fraser <keir@xen.org>
12 years agoxen: use XSM instead of IS_PRIV where duplicated
Daniel De Graaf [Fri, 11 Jan 2013 10:06:43 +0000 (10:06 +0000)]
xen: use XSM instead of IS_PRIV where duplicated

The Xen hypervisor has two basic access control function calls:
IS_PRIV and the xsm_* functions. Most privileged operations currently
require that both checks succeed, and many times the checks are at
different locations in the code. This patch eliminates the explicit
and implicit IS_PRIV checks that are duplicated in XSM hooks.

When XSM_ENABLE is not defined or when the dummy XSM module is used,
this patch should not change any functionality. Because the locations
of privilege checks have sometimes moved below argument validation,
error returns of some functions may change from EPERM to EINVAL or
ESRCH if called with invalid arguments and from a domain without
permission to perform the operation.

Some checks are removed due to non-obvious duplicates in their
callers:

 * acpi_enter_sleep is checked in XENPF_enter_acpi_sleep
 * map_domain_pirq has IS_PRIV_FOR checked in its callers:
   * physdev_map_pirq checks when acquiring the RCU lock
   * ioapic_guest_write is checked in PHYSDEVOP_apic_write
 * PHYSDEVOP_{manage_pci_add,manage_pci_add_ext,pci_device_add} are
   checked by xsm_resource_plug_pci in pci_add_device
 * PHYSDEVOP_manage_pci_remove is checked by xsm_resource_unplug_pci
   in pci_remove_device
 * PHYSDEVOP_{restore_msi,restore_msi_ext} are checked by
   xsm_resource_setup_pci in pci_restore_msi_state
 * do_console_io has changed to IS_PRIV from an explicit domid==0

Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Acked-by: Jan Beulich <jbeulich@suse.com>
Committed-by: Keir Fraser <keir@xen.org>
12 years agoarch/x86: add distinct XSM hooks for map/unmap
Daniel De Graaf [Thu, 10 Jan 2013 17:32:10 +0000 (17:32 +0000)]
arch/x86: add distinct XSM hooks for map/unmap

The xsm_iomem_permission and xsm_ioport_permission hooks are intended
to be called by the domain builder, while the calls in
arch/x86/domctl.c which control mapping are also performed by the
device model.  Because these operations require distinct access
control policies, they cannot use the same XSM hooks.

This also adds a missing XSM hook in the unbind IRQ domctl.

Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Acked-by: Jan Beulich <jbeulich@suse.com>
Committed-by: Keir Fraser <keir@xen.org>
12 years agoflask: move policy headers into hypervisor
Daniel De Graaf [Thu, 10 Jan 2013 17:30:47 +0000 (17:30 +0000)]
flask: move policy headers into hypervisor

Rather than keeping around headers that are autogenerated in order to
avoid adding build dependencies from xen/ to files in tools/, move the
relevant parts of the FLASK policy into the hypervisor tree and
generate the headers as part of the hypervisor's build.

Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Committed-by: Keir Fraser <keir@xen.org>
12 years agoxsm: Use the dummy XSM module if XSM is disabled
Daniel De Graaf [Thu, 10 Jan 2013 17:27:58 +0000 (17:27 +0000)]
xsm: Use the dummy XSM module if XSM is disabled

This patch moves the implementation of the dummy XSM module to a
header file that provides inline functions when XSM_ENABLE is not
defined. This reduces duplication between the dummy module and callers
when the implementation of the dummy return is not just "return 0",
and also provides better compile-time checking for completeness of the
XSM implementations in the dummy module.
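
A rough sketch of the header layout this describes, using a single
made-up hook (example_op) and a simplified struct domain. The default
policy is written once as an inline function; without XSM_ENABLE the
xsm_* wrapper calls it directly, and with XSM_ENABLE the same function
serves as the dummy module's entry in the operations table.

```c
#include <errno.h>
#include <stdbool.h>

/* Simplified stand-in for the hypervisor's struct domain. */
struct domain {
    bool is_privileged;
};

/* Default policy, shared by both build configurations. */
static inline int dummy_example_op(const struct domain *d)
{
    return d->is_privileged ? 0 : -EPERM;
}

#ifdef XSM_ENABLE
/* With XSM: hooks are dispatched through a table of function pointers. */
struct xsm_operations {
    int (*example_op)(const struct domain *d);
};

static struct xsm_operations dummy_ops = {
    .example_op = dummy_example_op,
};
static struct xsm_operations *xsm_ops = &dummy_ops;

static inline int xsm_example_op(const struct domain *d)
{
    return xsm_ops->example_op(d);
}
#else
/* Without XSM: the wrapper collapses to the inline default. */
static inline int xsm_example_op(const struct domain *d)
{
    return dummy_example_op(d);
}
#endif
```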

Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Committed-by: Keir Fraser <keir@xen.org>
12 years agohvmloader: Allocate 3 pages for Intel GPU OpRegion passthrough.
Keir Fraser [Thu, 10 Jan 2013 17:26:24 +0000 (17:26 +0000)]
hvmloader: Allocate 3 pages for Intel GPU OpRegion passthrough.

The 8kB region may not be page aligned, hence requiring 3 pages to
be mapped through.
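
The page count follows from simple arithmetic, sketched here assuming
4kB pages and a non-zero length; pages_spanned() is a helper written
for this illustration, not a function from hvmloader.

```c
#include <stdint.h>

#define PAGE_SHIFT 12

/* Number of 4kB pages touched by the byte range [start, start + len). */
static uint64_t pages_spanned(uint64_t start, uint64_t len)
{
    uint64_t first = start >> PAGE_SHIFT;
    uint64_t last = (start + len - 1) >> PAGE_SHIFT;

    return last - first + 1;
}
```

An aligned 8kB region covers two pages, while any unaligned start makes
the same 8kB straddle three.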

Signed-off-by: Keir Fraser <keir@xen.org>
12 years agoHVM firmware passthrough ACPI processing
Ross Philipson [Thu, 10 Jan 2013 17:18:43 +0000 (17:18 +0000)]
HVM firmware passthrough ACPI processing

ACPI table passthrough support allowing additional static tables and
SSDTs (AML code) to be loaded. These additional tables are added at
the end of the secondary table list in the RSDT/XSDT tables.

Signed-off-by: Ross Philipson <ross.philipson@citrix.com>
Committed-by: Keir Fraser <keir@xen.org>
12 years agoHVM firmware passthrough SMBIOS processing
Ross Philipson [Thu, 10 Jan 2013 17:18:10 +0000 (17:18 +0000)]
HVM firmware passthrough SMBIOS processing

Passthrough support for the SMBIOS structures including three new DMTF
defined types and support for OEM defined tables. Passed in SMBIOS
types override the default internal values. Default values can be
enabled for the new type 22 portable battery using a xenstore
flag. All other new DMTF defined and OEM structures will only be added
to the SMBIOS table if passthrough values are present.

Signed-off-by: Ross Philipson <ross.philipson@citrix.com>
Committed-by: Keir Fraser <keir@xen.org>
12 years agoHVM firmware passthrough control tools support
Ross Philipson [Thu, 10 Jan 2013 17:17:21 +0000 (17:17 +0000)]
HVM firmware passthrough control tools support

Xen control tools support for loading the firmware passthrough blocks
during domain construction. SMBIOS and ACPI blocks are passed in using
the new xc_hvm_build_args structure. Each block is read and loaded
into the new domain address space behind the HVMLOADER image. The base
address for the two blocks is returned as an out parameter to the
caller via the args structure.

Signed-off-by: Ross Philipson <ross.philipson@citrix.com>
Committed-by: Keir Fraser <keir@xen.org>
12 years agoHVM xenstore strings and firmware passthrough header
Ross Philipson [Thu, 10 Jan 2013 17:16:28 +0000 (17:16 +0000)]
HVM xenstore strings and firmware passthrough header

Add public HVM definitions header for xenstore strings used in
HVMLOADER. In addition this header describes the use of the firmware
passthrough values set using xenstore.

Signed-off-by: Ross Philipson <ross.philipson@citrix.com>
Committed-by: Keir Fraser <keir@xen.org>
12 years agoVT-d: fix interrupt remapping source validation for devices behind legacy bridges
Jan Beulich [Wed, 9 Jan 2013 16:13:26 +0000 (17:13 +0100)]
VT-d: fix interrupt remapping source validation for devices behind legacy bridges

Using SVT_VERIFY_BUS here doesn't make sense; native Linux also
uses SVT_VERIFY_SID_SQ here instead.

This is XSA-33 / CVE-2012-5634.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
12 years agox86/hvm: Bind device-model event-channels to registered device-model
Keir Fraser [Wed, 9 Jan 2013 15:00:58 +0000 (15:00 +0000)]
x86/hvm: Bind device-model event-channels to registered device-model
domid during vcpu initialisation.

Signed-off-by: Keir Fraser <keir@xen.org>
12 years agomini-os: Fix test application link when various fronts are not enabled.
Samuel Thibault [Wed, 9 Jan 2013 08:44:17 +0000 (08:44 +0000)]
mini-os: Fix test application link when various fronts are not enabled.

When fronts are not enabled, the test application needs to disable the
corresponding tests.

Signed-off-by: Samuel Thibault <samuel.thibault@ens-lyon.org>
Committed-by: Keir Fraser <keir@xen.org>
12 years agomini-os: Notify shutdown through weak function call instead of wake queue
Samuel Thibault [Wed, 9 Jan 2013 08:43:53 +0000 (08:43 +0000)]
mini-os: Notify shutdown through weak function call instead of wake queue

To allow for more flexibility, this notifies domain shutdown through a
function rather than a wake queue, to let the application use a wake
queue only if it wishes.

Signed-off-by: Samuel Thibault <samuel.thibault@ens-lyon.org>
Committed-by: Keir Fraser <keir@xen.org>
12 years agonested vmx: synchronize page fault error code match and mask
Dongxiao Xu [Tue, 8 Jan 2013 09:43:35 +0000 (10:43 +0100)]
nested vmx: synchronize page fault error code match and mask

Page fault is specially handled not only with exception bitmaps,
but also with consideration of page fault error code mask/match
values. Therefore in nested virtualization case, the two values
need to be synchronized from virtual VMCS to shadow VMCS.
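
For reference, the rule the hardware applies to page faults can be
sketched as below, following the Intel SDM description of the exception
bitmap and the page-fault error-code mask/match fields. The function
name is made up for this illustration. Because the outcome depends on
all three values together, copying only the exception bitmap to the
shadow VMCS is not enough.

```c
#include <stdbool.h>
#include <stdint.h>

#define PF_VECTOR 14

/*
 * Decide whether a page fault with error code pfec causes a VM exit.
 * If (pfec & mask) == match, bit 14 of the exception bitmap is followed;
 * otherwise the meaning of that bit is inverted.
 */
static bool page_fault_causes_vmexit(uint32_t exception_bitmap,
                                     uint32_t pfec,
                                     uint32_t pfec_mask,
                                     uint32_t pfec_match)
{
    bool bit14 = (exception_bitmap >> PF_VECTOR) & 1;
    bool equal = (pfec & pfec_mask) == pfec_match;

    return equal ? bit14 : !bit14;
}
```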

Signed-off-by: Dongxiao Xu <dongxiao.xu@intel.com>
Committed-by: Jan Beulich <jbeulich@suse.com>
12 years agonested vmx: emulate IA32_VMX_MISC MSR
Dongxiao Xu [Tue, 8 Jan 2013 09:42:19 +0000 (10:42 +0100)]
nested vmx: emulate IA32_VMX_MISC MSR

Use the host value to emulate IA32_VMX_MISC MSR for L1 VMM.
For CR3-target value, we don't support this feature currently and
set the number to zero.
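
As a sketch of what this amounts to: per the Intel SDM, bits 24:16 of
IA32_VMX_MISC report the number of supported CR3-target values, so the
emulated value can be derived from the host value by clearing that
field. The macro and function names are assumptions made for this
example.

```c
#include <stdint.h>

/* Bits 24:16 of IA32_VMX_MISC: number of CR3-target values supported. */
#define VMX_MISC_CR3_TARGET_SHIFT 16
#define VMX_MISC_CR3_TARGET_MASK  (0x1ffULL << VMX_MISC_CR3_TARGET_SHIFT)

/* Value presented to the L1 VMM: host value with zero CR3 targets. */
static uint64_t emulate_vmx_misc(uint64_t host_value)
{
    return host_value & ~VMX_MISC_CR3_TARGET_MASK;
}

/* Extract the advertised CR3-target count from an IA32_VMX_MISC value. */
static unsigned int cr3_target_count(uint64_t misc)
{
    return (unsigned int)((misc & VMX_MISC_CR3_TARGET_MASK)
                          >> VMX_MISC_CR3_TARGET_SHIFT);
}
```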

Signed-off-by: Dongxiao Xu <dongxiao.xu@intel.com>
Committed-by: Jan Beulich <jbeulich@suse.com>
12 years agox86/hvm: Bind xen-created event channels to building domain
Daniel De Graaf [Tue, 8 Jan 2013 09:37:38 +0000 (10:37 +0100)]
x86/hvm: Bind xen-created event channels to building domain

Instead of using a hardcoded domain 0 as the endpoint for the event
channels created in hvm_vcpu_initialise, use the domain ID of the
building domain so that a domain builder in a domain other than dom0 has
the expected access to the event channels.

Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Committed-by: Jan Beulich <jbeulich@suse.com>
12 years agox86: fix assertion in get_page_type()
Jan Beulich [Mon, 7 Jan 2013 13:20:26 +0000 (14:20 +0100)]
x86: fix assertion in get_page_type()

c/s 22998:e9fab50d7b61 (and immediately following ones) made it
possible that __get_page_type() returns other than -EINVAL, in
particular -EBUSY. Consequently, the assertion in get_page_type()
should check for only the return values we absolutely don't expect to
see there.

This is XSA-37 / CVE-2013-0154.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
12 years agox86: compat_show_guest_stack() should not truncate MFN
Jan Beulich [Mon, 7 Jan 2013 12:28:29 +0000 (13:28 +0100)]
x86: compat_show_guest_stack() should not truncate MFN

Re-using "addr" here was a mistake, as it is a 32-bit quantity.
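
A tiny illustration of the problem class, with a hypothetical helper
rather than the code from compat_show_guest_stack(): a frame number
wider than 32 bits silently loses its upper bits when it is parked in a
32-bit variable.

```c
#include <stdint.h>

/* Store a 64-bit frame number in a 32-bit variable and read it back. */
static uint64_t roundtrip_through_u32(uint64_t mfn)
{
    uint32_t narrow = (uint32_t)mfn; /* upper 32 bits are discarded here */

    return narrow;
}
```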

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Keir Fraser <keir@xen.org>
12 years agoIOMMU: add option to specify devices behaving like ones using phantom functions
Jan Beulich [Mon, 7 Jan 2013 11:58:09 +0000 (12:58 +0100)]
IOMMU: add option to specify devices behaving like ones using phantom functions

At least certain Marvell SATA controllers are known to issue bus master
requests with a non-zero function as origin, despite themselves being
single function devices.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: "Zhang, Xiantao" <xiantao.zhang@intel.com>
12 years agoVT-d: relax source qualifier for MSI of phantom functions
Jan Beulich [Mon, 7 Jan 2013 11:56:52 +0000 (12:56 +0100)]
VT-d: relax source qualifier for MSI of phantom functions

With ordinary requests allowed to come from phantom functions, the
remapping tables ought to be set up to allow for MSI triggers to come
from other than the "real" device too.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: "Zhang, Xiantao" <xiantao.zhang@intel.com>
12 years agoIOMMU: add phantom function support
Jan Beulich [Mon, 7 Jan 2013 11:55:42 +0000 (12:55 +0100)]
IOMMU: add phantom function support

Apart from generating device context entries for the base function,
all phantom functions also need context entries to be generated for
them.

In order to distinguish different use cases, a variant of
pci_get_pdev() is being introduced that, even when passed a phantom
function number, would return the underlying actual device.
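
One way to picture the extra context entries is the enumeration below.
It assumes phantom functions are described by a stride added to the
base device's devfn and that they stay within the same slot; the helper
name and the stride parameter are illustrative, not taken from the
patch.

```c
#include <stddef.h>
#include <stdint.h>

#define PCI_SLOT(devfn) (((devfn) >> 3) & 0x1f)
#define PCI_FUNC(devfn) ((devfn) & 0x07)

/*
 * Fill out[] with the base devfn followed by its phantom devfns.
 * A stride of 0 means the device has no phantom functions.
 * Returns the number of entries written (at most 8).
 */
static size_t collect_devfns(uint8_t base, uint8_t stride, uint8_t out[8])
{
    size_t n = 0;
    unsigned int devfn = base;

    out[n++] = base;
    if (stride == 0)
        return n;

    for (;;) {
        devfn += stride;
        /* Stop once the function number would leave the base slot. */
        if (devfn > 0xff || PCI_SLOT(devfn) != PCI_SLOT(base))
            break;
        out[n++] = (uint8_t)devfn;
    }
    return n;
}
```

Each devfn returned would get its own context entry, while lookups on
any of them would resolve to the device at the base devfn.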

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: "Zhang, Xiantao" <xiantao.zhang@intel.com>
12 years agoIOMMU/PCI: consolidate pdev_type() and cache its result for a given device
Jan Beulich [Mon, 7 Jan 2013 11:54:39 +0000 (12:54 +0100)]
IOMMU/PCI: consolidate pdev_type() and cache its result for a given device

Add an "unknown" device type as well as one for PCI-to-PCIe bridges
(the latter of which other IOMMU code with or without this patch
doesn't appear to handle properly).

Make sure we don't mistake a device whose config space we can't
access for a legacy PCI device (after all we in fact don't know
how to deal with such a device, and hence shouldn't try to).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: "Zhang, Xiantao" <xiantao.zhang@intel.com>
12 years agoAMD IOMMU: adjust flush function parameters
Jan Beulich [Mon, 7 Jan 2013 11:53:19 +0000 (12:53 +0100)]
AMD IOMMU: adjust flush function parameters

... to use a (struct pci_dev *, devfn) pair.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: "Zhang, Xiantao" <xiantao.zhang@intel.com>
12 years agoVT-d: adjust context map/unmap parameters
Jan Beulich [Mon, 7 Jan 2013 11:52:29 +0000 (12:52 +0100)]
VT-d: adjust context map/unmap parameters

... to use a (struct pci_dev *, devfn) pair.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: "Zhang, Xiantao" <xiantao.zhang@intel.com>
12 years agoIOMMU: adjust add/remove operation parameters
Jan Beulich [Mon, 7 Jan 2013 11:51:22 +0000 (12:51 +0100)]
IOMMU: adjust add/remove operation parameters

... to use a (struct pci_dev *, devfn) pair.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: "Zhang, Xiantao" <xiantao.zhang@intel.com>
12 years agoIOMMU: adjust (re)assign operation parameters
Jan Beulich [Mon, 7 Jan 2013 11:49:24 +0000 (12:49 +0100)]
IOMMU: adjust (re)assign operation parameters

... to use a (struct pci_dev *, devfn) pair.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: "Zhang, Xiantao" <xiantao.zhang@intel.com>
12 years agomerge
Ian Campbell [Fri, 4 Jan 2013 15:58:37 +0000 (15:58 +0000)]
merge

12 years agopassthrough/domctl: use correct struct in union
Andrew Cooper [Fri, 4 Jan 2013 09:06:47 +0000 (10:06 +0100)]
passthrough/domctl: use correct struct in union

This appears to be a copy-and-paste error from c/s 23861:ec7c81fbe0de.

It is safe, functionally speaking, as both the xen_domctl_assign_device
and xen_domctl_get_device_group structures start with a 'uint32_t
machine_sbdf'.  We should however use the correct union structure.
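
Why the wrong member still worked can be shown with simplified
stand-ins for the two structures (the real definitions carry more
fields): both begin with the same uint32_t, so a value written through
one union member is read back unchanged through the other.

```c
#include <stdint.h>

/* Simplified stand-ins; only the shared first member matters here. */
struct assign_device_sketch {
    uint32_t machine_sbdf;
};

struct get_device_group_sketch {
    uint32_t machine_sbdf;
    uint32_t max_sdevs;
    uint32_t num_sdevs;
};

struct domctl_sketch {
    union {
        struct assign_device_sketch assign_device;
        struct get_device_group_sketch get_device_group;
    } u;
};

/* Reads the field via the "wrong" member, as the old code did. */
static uint32_t sbdf_via_wrong_member(const struct domctl_sketch *op)
{
    return op->u.get_device_group.machine_sbdf;
}
```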

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Committed-by: Jan Beulich <jbeulich@suse.com>
12 years agotools/tests: Restrict some tests to x86 only
Ian Campbell [Fri, 21 Dec 2012 17:05:38 +0000 (17:05 +0000)]
tools/tests: Restrict some tests to x86 only

MCE injection and x86_emulator are clearly x86 specific.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Committed-by: Ian Jackson <ian.jackson@eu.citrix.com>
13 years agoxen: arm: fix guest register access.
Ian Campbell [Thu, 20 Dec 2012 11:53:08 +0000 (11:53 +0000)]
xen: arm: fix guest register access.

We weren't taking the guest mode (CPSR) into account and would always
access the user version of the registers.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>
Committed-by: Ian Campbell <ian.campbell@citrix.com>
13 years agoarm: trim pagetable flag definitions to fit in 80 characters
Tim Deegan [Thu, 20 Dec 2012 11:53:07 +0000 (11:53 +0000)]
arm: trim pagetable flag definitions to fit in 80 characters

Signed-off-by: Tim Deegan <tim@xen.org>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Committed-by: Ian Campbell <ian.campbell@citrix.com>
13 years agox86: also print CRn register values upon double fault
Jan Beulich [Thu, 20 Dec 2012 10:00:32 +0000 (11:00 +0100)]
x86: also print CRn register values upon double fault

Do so by simply re-using _show_registers().

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Keir Fraser <keir@xen.org>
13 years agoxen: arm: remove now empty dummy.S
Ian Campbell [Wed, 19 Dec 2012 16:04:50 +0000 (16:04 +0000)]
xen: arm: remove now empty dummy.S

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Committed-by: Ian Campbell <ian.campbell@citrix.com>
13 years agoxen: remove nr_irqs_gsi from generic code
Ian Campbell [Wed, 19 Dec 2012 16:04:49 +0000 (16:04 +0000)]
xen: remove nr_irqs_gsi from generic code

The concept is X86 specific.

AFAICT the generic concept here is the number of static physical IRQs
which the current hardware has, so call this nr_static_irqs.

Also using "defined NR_IRQS" as a stand-in for x86 might have made
sense at one point, but it's just cleaner to push the necessary
definitions into asm/irq.h.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Keir Fraser <keir@xen.org>
Acked-by: Jan Beulich <jbeulich@suse.com>
Committed-by: Ian Campbell <ian.campbell@citrix.com>
13 years agolibxl: move definition of libxl_domain_config into the IDL
Ian Campbell [Wed, 19 Dec 2012 14:33:24 +0000 (14:33 +0000)]
libxl: move definition of libxl_domain_config into the IDL

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Committed-by: Ian Campbell <ian.campbell@citrix.com>
13 years agoxen: arm: mark early_panic as a noreturn function
Ian Campbell [Wed, 19 Dec 2012 14:16:30 +0000 (14:16 +0000)]
xen: arm: mark early_panic as a noreturn function

Otherwise gcc complains about variables being used when not
initialised when in fact that point is never reached.

There aren't any instances of this in tree right now; I noticed this
while developing another patch.
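
A small sketch of the kind of code that triggers the warning, using the
GCC/Clang noreturn attribute. Both early_panic_sketch() and
pick_value() are made up for this example. Without the attribute the
compiler assumes the else branch can fall through to the return with
"result" unset.

```c
#include <stdio.h>
#include <stdlib.h>

/* Declared noreturn so the compiler knows control never comes back. */
static void early_panic_sketch(const char *msg) __attribute__((noreturn));

static void early_panic_sketch(const char *msg)
{
    fprintf(stderr, "panic: %s\n", msg);
    abort();
}

/* "result" is only unset on the path that ends in the panic call. */
static int pick_value(int have_value, int value)
{
    int result;

    if (have_value)
        result = value;
    else
        early_panic_sketch("no value supplied");

    return result;
}
```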

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>
Committed-by: Ian Campbell <ian.campbell@citrix.com>
13 years agoxen: arm: introduce arm32 as a subarch of arm.
Ian Campbell [Wed, 19 Dec 2012 14:16:30 +0000 (14:16 +0000)]
xen: arm: introduce arm32 as a subarch of arm.

- move 32-bit specific files into subarch specific arm32 subdirectory.
- move gic.h to xen/include/asm-arm (it is needed from both subarch
  and generic code).
- make the appropriate build and config file changes to support
  XEN_TARGET_ARCH=arm32.

This prepares us for an eventual 64-bit subarch.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Committed-by: Ian Campbell <ian.campbell@citrix.com>
13 years agoxen: arm: reorder registers in struct cpu_user_regs.
Ian Campbell [Wed, 19 Dec 2012 14:16:29 +0000 (14:16 +0000)]
xen: arm: reorder registers in struct cpu_user_regs.

Primarily this is so that they are ordered in the same way as the
mapping from arm64 x0..x31 registers to the arm32 registers, which is
just less confusing for everyone going forward.

It also makes the implementation of select_user_regs in the next patch
slightly simpler.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Committed-by: Ian Campbell <ian.campbell@citrix.com>
13 years agoxen: arm: remove hard tabs from asm code.
Ian Campbell [Wed, 19 Dec 2012 14:16:28 +0000 (14:16 +0000)]
xen: arm: remove hard tabs from asm code.

Run expand(1) over xen/arch/arm/.../*.S

Add emacs local vars block.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
[ijc -- stripped trailing whitespace caught by git apply]
Committed-by: Ian Campbell <ian.campbell@citrix.com>
13 years agoxen: arm: fix long lines in entry.S
Ian Campbell [Wed, 19 Dec 2012 14:16:27 +0000 (14:16 +0000)]
xen: arm: fix long lines in entry.S

This is a purely whitespace change.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Committed-by: Ian Campbell <ian.campbell@citrix.com>
13 years agoxen: arm: implement share_xen_page_with_privileged_guests
Ian Campbell [Wed, 19 Dec 2012 14:16:27 +0000 (14:16 +0000)]
xen: arm: implement share_xen_page_with_privileged_guests

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Committed-by: Ian Campbell <ian.campbell@citrix.com>
13 years agoxen: arm: implement send_timer_event.
Ian Campbell [Wed, 19 Dec 2012 14:16:26 +0000 (14:16 +0000)]
xen: arm: implement send_timer_event.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Committed-by: Ian Campbell <ian.campbell@citrix.com>
13 years agoxen: arm: initialise dom_{xen,io,cow}
Ian Campbell [Wed, 19 Dec 2012 14:16:25 +0000 (14:16 +0000)]
xen: arm: initialise dom_{xen,io,cow}

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Committed-by: Ian Campbell <ian.campbell@citrix.com>
13 years agoxen: arm: stub domain_relinquish_resources.
Ian Campbell [Wed, 19 Dec 2012 14:16:25 +0000 (14:16 +0000)]
xen: arm: stub domain_relinquish_resources.

Currently unimplemented. Domain teardown in general needs looking at.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Committed-by: Ian Campbell <ian.campbell@citrix.com>
13 years agoxen: arm: stub out domain_get_maximum_gpfn
Ian Campbell [Wed, 19 Dec 2012 14:16:24 +0000 (14:16 +0000)]
xen: arm: stub out domain_get_maximum_gpfn

It currently has no callers, so return ENOSYS until such a time as one
arrives.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Committed-by: Ian Campbell <ian.campbell@citrix.com>
13 years agoxen: arm: stub page_is_ram_type.
Ian Campbell [Wed, 19 Dec 2012 14:16:23 +0000 (14:16 +0000)]
xen: arm: stub page_is_ram_type.

Callers are VT-d (so x86 specific) and various bits of page offlining
support, which although it looks generic (and is in xen/common) does
things like diving into page_info->count_info which is not generic.

In any case this is only reachable via XEN_SYSCTL_page_offline_op,
which clearly shouldn't be called on ARM just yet.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Committed-by: Ian Campbell <ian.campbell@citrix.com>
13 years agoxen: arm: stub out steal_page.
Ian Campbell [Wed, 19 Dec 2012 14:16:23 +0000 (14:16 +0000)]
xen: arm: stub out steal_page.

Callers handle the failure gracefully. It can be called by
GNTTABOP_transfer, XENMEM_exchange or tmem.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Committed-by: Ian Campbell <ian.campbell@citrix.com>
13 years agoxen: arm: stub out wallclock time.
Ian Campbell [Wed, 19 Dec 2012 14:16:22 +0000 (14:16 +0000)]
xen: arm: stub out wallclock time.

We don't currently have much concept of wallclock time on ARM (for
either the hypervisor, dom0 or guests). For now just stub everything
out. Specifically domain_set_time_offset, update_vcpu_system_time and
wallclock_time.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Committed-by: Ian Campbell <ian.campbell@citrix.com>
13 years agoxen: arm: stub out pirq related functions.
Ian Campbell [Wed, 19 Dec 2012 14:16:21 +0000 (14:16 +0000)]
xen: arm: stub out pirq related functions.

On ARM we use GIC functionality to inject virtualised real interrupts
for h/w devices rather than evtchn-pirqs.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Committed-by: Ian Campbell <ian.campbell@citrix.com>
13 years agoxen: arm: implement arch_vcpu_reset.
Ian Campbell [Wed, 19 Dec 2012 14:16:21 +0000 (14:16 +0000)]
xen: arm: implement arch_vcpu_reset.

Untested.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Committed-by: Ian Campbell <ian.campbell@citrix.com>
13 years agoxen: arm: implement arch_get_info_guest
Ian Campbell [Wed, 19 Dec 2012 14:16:20 +0000 (14:16 +0000)]
xen: arm: implement arch_get_info_guest

Untested, but basically the inverse of arch_set_info_guest.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Committed-by: Ian Campbell <ian.campbell@citrix.com>
13 years agoxen: arm: make smp_send_state_dump a real function
Ian Campbell [Wed, 19 Dec 2012 14:16:19 +0000 (14:16 +0000)]
xen: arm: make smp_send_state_dump a real function

It still doesn't do anything useful, but at least it isn't in dummy.S!

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Committed-by: Ian Campbell <ian.campbell@citrix.com>
13 years agoxen: arm: define node_online_map.
Ian Campbell [Wed, 19 Dec 2012 14:16:19 +0000 (14:16 +0000)]
xen: arm: define node_online_map.

For now just initialise it as a single online node, which is what
asm-arm/numa.h assumes anyway.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Committed-by: Ian Campbell <ian.campbell@citrix.com>
13 years agoxen: arm: Call init_xen_time earlier
Ian Campbell [Wed, 19 Dec 2012 14:16:18 +0000 (14:16 +0000)]
xen: arm: Call init_xen_time earlier

If we panic before calling init_xen_time then the "Rebooting in 5
seconds" delay ends up calling udelay, which uses cntfrq before it has
been initialised, resulting in a divide by zero.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Committed-by: Ian Campbell <ian.campbell@citrix.com>
13 years agoxen/arm: do not map vGIC twice for dom0
Stefano Stabellini [Wed, 19 Dec 2012 14:16:17 +0000 (14:16 +0000)]
xen/arm: do not map vGIC twice for dom0

We don't need to manually set the P2M for the vGIC in construct_dom0,
because we have already done it generally for every guest in gicv_setup.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Committed-by: Ian Campbell <ian.campbell@citrix.com>
13 years agox86, amd: Disable way access filter on Piledriver CPUs
Andre Przywara [Wed, 19 Dec 2012 10:42:09 +0000 (11:42 +0100)]
x86, amd: Disable way access filter on Piledriver CPUs

The Way Access Filter in recent AMD CPUs may hurt the performance of
some workloads, caused by aliasing issues in the L1 cache.
This patch disables it on the affected CPUs.

The issue is similar to the one from last year:
http://lkml.indiana.edu/hypermail/linux/kernel/1107.3/00041.html
This new patch does not replace the old one; we just need another
quirk for newer CPUs.

The performance penalty without the patch depends on the
circumstances, but is a bit less than last year's 3%.

The workloads affected would be those that access code from the same
physical page under different virtual addresses, so different
processes using the same libraries with ASLR or multiple instances of
PIE-binaries. The code needs to be accessed simultaneously from both
cores of the same compute unit.

More details can be found here:
http://developer.amd.com/Assets/SharedL1InstructionCacheonAMD15hCPU.pdf

CPUs affected are anything with the core known as Piledriver.
That includes the new parts of the AMD A-Series (aka Trinity) and the
just released new CPUs of the FX-Series (aka Vishera).
The model numbering is a bit odd here: FX CPUs have model 2,
A-Series has model 10h, with possible extensions to 1Fh. Hence the
range of model ids.
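
The model range described above can be sketched as a simple predicate. This is only an illustration, not the patch itself: the name `is_piledriver` is hypothetical, and family 15h is an assumption taken from the title of the AMD document linked above.

```c
#include <stdbool.h>

/*
 * Hypothetical helper: true for the models named in the commit message
 * (model 2 for FX, 10h-1Fh for A-Series), treated as one range.
 * Family 0x15 is an assumption based on the linked AMD document.
 */
static bool is_piledriver(unsigned int family, unsigned int model)
{
    return family == 0x15 && model >= 0x02 && model <= 0x1f;
}
```

Treating the two known models as one contiguous range keeps the check short, at the cost of also matching the unassigned models in between.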

Signed-off-by: Andre Przywara <osp@andrep.de>

Add and use MSR_AMD64_IC_CFG. Update the value whenever it is found to
not have all bits set, rather than just when it's zero.
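
The difference between the two update conditions can be sketched as follows. Both function names are hypothetical, and the mask is left as a parameter because the commit message does not state which bits are involved.

```c
#include <stdbool.h>
#include <stdint.h>

/* Earlier condition: act only when none of the wanted bits are set. */
static bool needs_update_if_zero(uint64_t value, uint64_t mask)
{
    return (value & mask) == 0;
}

/* Condition described in the note: act whenever any wanted bit is missing. */
static bool needs_update_if_incomplete(uint64_t value, uint64_t mask)
{
    return (value & mask) != mask;
}
```

A partially set value is the case where the two conditions disagree: the first leaves it alone, the second rewrites it.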

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Keir Fraser <keir@xen.org>
Committed-by: Jan Beulich <jbeulich@suse.com>
13 years agoxen/arch/*: add struct domain parameter to arch_do_domctl
Daniel De Graaf [Tue, 18 Dec 2012 18:16:52 +0000 (18:16 +0000)]
xen/arch/*: add struct domain parameter to arch_do_domctl

Since the arch-independent do_domctl function now RCU locks the domain
specified by op->domain, pass the struct domain to the arch-specific
domctl function and remove the duplicate per-subfunction locking.

This also removes two get_domain/put_domain call pairs (in
XEN_DOMCTL_assign_device and XEN_DOMCTL_deassign_device), replacing
them with RCU locking.

Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Acked-by: Tim Deegan <tim@xen.org>
Acked-by: Jan Beulich <jbeulich@suse.com>
Committed-by: Keir Fraser <keir@xen.org>
13 years agoxen: lock target domain in do_domctl common code
Daniel De Graaf [Tue, 18 Dec 2012 18:16:13 +0000 (18:16 +0000)]
xen: lock target domain in do_domctl common code

Because almost all domctls need to lock the target domain, do this by
default instead of repeating it in each domctl. This is not currently
extended to the arch-specific domctls, but RCU locks are safe to take
recursively so this only causes duplicate but correct locking.

Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Acked-by: Jan Beulich <jbeulich@suse.com>
Committed-by: Keir Fraser <keir@xen.org>
13 years agonested vmx: nested TPR shadow/threshold emulation
Dongxiao Xu [Tue, 18 Dec 2012 18:14:45 +0000 (18:14 +0000)]
nested vmx: nested TPR shadow/threshold emulation

The TPR shadow/threshold feature is important for speeding up the boot
time of Windows guests. It is also a required feature for certain VMMs.

We map the virtual APIC page address and TPR threshold from the L1
VMCS, and sync them into the shadow VMCS on virtual vmentry.
If a TPR_BELOW_THRESHOLD VM exit is triggered by the L2 guest, we
inject it into the L1 VMM for handling.

This commit also fixes an issue with the APIC access page: if the L1
VMM did not enable this feature, we need to write zero into the
shadow VMCS.

Signed-off-by: Dongxiao Xu <dongxiao.xu@intel.com>
Committed-by: Keir Fraser <keir@xen.org>
13 years agoxen: sched_credit: add some tracing
Dario Faggioli [Tue, 18 Dec 2012 18:12:00 +0000 (18:12 +0000)]
xen: sched_credit: add some tracing

Add tracing for tickling and PCPU selection.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
Committed-by: Keir Fraser <keir@xen.org>
13 years agoxen: tracing: introduce per-scheduler trace event IDs
Dario Faggioli [Tue, 18 Dec 2012 18:11:33 +0000 (18:11 +0000)]
xen: tracing: introduce per-scheduler trace event IDs

This makes it possible to create scheduler-specific trace records
within each scheduler without worrying about overlapping IDs, while
still being able to identify them unambiguously. The latter is useful
because more than one scheduler can be running at the same time,
thanks to cpupools.

The event ID is 12 bits, and this change uses the upper 3 of them for
the 'scheduler ID'. This means we're limited to 8 schedulers and to
512 scheduler specific tracing events. Both seem reasonable
limitations as of now.
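
The bit split described above can be sketched like this. The macro and function names are hypothetical; only the 12-bit width and the 3/9 split come from the commit message.

```c
#include <assert.h>
#include <stdint.h>

#define SCHED_ID_BITS   3u   /* upper bits: scheduler ID (8 schedulers) */
#define SCHED_EVT_BITS  9u   /* lower bits: per-scheduler event (512 events) */

/* Pack a scheduler ID and a scheduler-local event number into 12 bits. */
static uint16_t sched_trace_event(unsigned int sched_id, unsigned int evt)
{
    assert(sched_id < (1u << SCHED_ID_BITS));
    assert(evt < (1u << SCHED_EVT_BITS));
    return (uint16_t)((sched_id << SCHED_EVT_BITS) | evt);
}

/* Recover the scheduler ID from a packed event ID. */
static unsigned int sched_trace_id(uint16_t event)
{
    return (event >> SCHED_EVT_BITS) & ((1u << SCHED_ID_BITS) - 1);
}

/* Recover the scheduler-local event number from a packed event ID. */
static unsigned int sched_trace_evt(uint16_t event)
{
    return event & ((1u << SCHED_EVT_BITS) - 1);
}
```

Because the scheduler ID occupies its own bits, two schedulers can reuse the same local event number without their records colliding.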

This also converts the existing credit2 tracing (the only scheduler
generating tracing events up to now) to the new system.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
Committed-by: Keir Fraser <keir@xen.org>
13 years agoxen: sched_credit: improve tickling of idle CPUs
Dario Faggioli [Tue, 18 Dec 2012 18:10:57 +0000 (18:10 +0000)]
xen: sched_credit: improve tickling of idle CPUs

Right now, when a VCPU wakes-up, we check whether it should preempt
what is running on the PCPU, and whether or not the waking VCPU can
be migrated (by tickling some idlers). However, this can result in
suboptimal or even wrong behaviour, as explained here:

 http://lists.xen.org/archives/html/xen-devel/2012-10/msg01732.html

This change, instead, when deciding which PCPU(s) to tickle upon
VCPU wake-up, considers both what is likely to happen on the PCPU
where the wakeup occurs, and whether or not there are idlers where
the woken-up VCPU can run. In fact, if there are, we can avoid
interrupting the running VCPU. Only if there are no such PCPUs
are preemption and migration the way to go.
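
A much simplified sketch of that decision follows. It is not the credit scheduler's code: CPU sets are modelled as 32-bit masks, `pick_tickle_mask` and its parameters are hypothetical, and the priority comparison is reduced to a single boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns the set of PCPUs to tickle for a VCPU waking up on `cpu`.
 *   idlers            - mask of currently idle PCPUs
 *   affinity          - mask of PCPUs the woken VCPU may run on
 *   new_beats_current - whether the woken VCPU should preempt what is
 *                       currently running on `cpu`
 */
static uint32_t pick_tickle_mask(uint32_t idlers, uint32_t affinity,
                                 unsigned int cpu, bool new_beats_current)
{
    uint32_t usable_idlers = idlers & affinity;

    assert(cpu < 32);

    if (usable_idlers)
        return usable_idlers;      /* an idler can pick the VCPU up */
    if (new_beats_current)
        return UINT32_C(1) << cpu; /* no idler: preempt on this PCPU */
    return 0;                      /* nothing to tickle */
}
```

The point of the ordering is that a usable idler is always preferred over interrupting a running VCPU.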

This has been tested (on top of the previous change) by running
the following benchmarks inside 2, 6 and 10 VMs, concurrently, on
a shared host, each with 2 VCPUs and 960 MB of memory (host had 16
ways and 12 GB RAM).

1) All VMs had 'cpus="all"' in their config file.

$ sysbench --test=cpu ... (time, lower is better)
 | VMs | w/o this change         | w/ this change          |
 | 2   | 50.078467 +/- 1.6676162 | 49.673667 +/- 0.0094321 |
 | 6   | 63.259472 +/- 0.1137586 | 61.680011 +/- 1.0208723 |
 | 10  | 91.246797 +/- 0.1154008 | 90.396720 +/- 1.5900423 |
$ sysbench --test=memory ... (throughput, higher is better)
 | VMs | w/o this change         | w/ this change          |
 | 2   | 485.56333 +/- 6.0527356 | 487.83167 +/- 0.7602850 |
 | 6   | 401.36278 +/- 1.9745916 | 409.96778 +/- 3.6761092 |
 | 10  | 294.43933 +/- 0.8064945 | 302.49033 +/- 0.2343978 |
$ specjbb2005 ... (throughput, higher is better)
 | VMs | w/o this change         | w/ this change          |
 | 2   | 43150.63 +/- 1359.5616  | 43275.427 +/- 606.28185 |
 | 6   | 29274.29 +/- 1024.4042  | 29716.189 +/- 1290.1878 |
 | 10  | 19061.28 +/- 512.88561  | 19192.599 +/- 605.66058 |

2) All VMs had their VCPUs statically pinned to the host's PCPUs.

$ sysbench --test=cpu ... (time, lower is better)
 | VMs | w/o this change         | w/ this change          |
 | 2   | 47.8211   +/- 0.0215504 | 47.826900 +/- 0.0077872 |
 | 6   | 62.689122 +/- 0.0877173 | 62.764539 +/- 0.3882493 |
 | 10  | 90.321097 +/- 1.4803867 | 89.974570 +/- 1.1437566 |
$ sysbench --test=memory ... (throughput, higher is better)
 | VMs | w/o this change         | w/ this change          |
 | 2   | 550.97667 +/- 2.3512355 | 550.87000 +/- 0.8140792 |
 | 6   | 443.15000 +/- 5.7471797 | 454.01056 +/- 8.4373466 |
 | 10  | 313.89233 +/- 1.3237493 | 321.81167 +/- 0.3528418 |
$ specjbb2005 ... (throughput, higher is better)
 | VMs | w/o this change         | w/ this change          |
 | 2   | 49591.057 +/- 952.93384 | 49594.195 +/- 799.57976 |
 | 6   | 33538.247 +/- 1089.2115 | 33671.758 +/- 1077.6806 |
 | 10  | 21927.870 +/- 831.88742 | 21891.131 +/- 563.37929 |

The numbers show that the change has either no or very limited impact
(specjbb2005 case) or, when it does have some impact, it is a
real improvement in performance (sysbench-memory case).

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
Committed-by: Keir Fraser <keir@xen.org>
13 years agoxen: sched_credit: improve picking up the idle CPU for a VCPU
Dario Faggioli [Tue, 18 Dec 2012 18:10:18 +0000 (18:10 +0000)]
xen: sched_credit: improve picking up the idle CPU for a VCPU

In _csched_cpu_pick() we try to select the best possible CPU for
running a VCPU, considering the characteristics of the underlying
hardware (i.e., how many threads, core, sockets, and how busy they
are). What we want is "the idle execution vehicle with the most
idling neighbours in its grouping".

In order to achieve it, we select a CPU from the VCPU's affinity,
giving preference to its current processor if possible, as the basis
for the comparison with all the other CPUs. Problem is, to discount
the VCPU itself when computing this "idleness" (in an attempt to be
fair wrt its current processor), we arbitrarily and unconditionally
consider that selected CPU as idle, even when it is not the case,
for instance:
 1. If the CPU is not the one where the VCPU is running (perhaps due
    to the affinity being changed);
 2. The CPU is where the VCPU is running, but it has other VCPUs in
    its runq, so it won't go idle even if the VCPU in question goes.

This is exemplified in the trace below:

]  3.466115364 x|------|------| d10v1   22005(2:2:5) 3 [ a 1 8 ]
   ... ... ...
   3.466122856 x|------|------| d10v1 runstate_change d10v1
   running->offline
   3.466123046 x|------|------| d?v? runstate_change d32767v0
   runnable->running
   ... ... ...
]  3.466126887 x|------|------| d32767v0   28004(2:8:4) 3 [ a 1 8 ]

The 22005(...) line (the first line) means _csched_cpu_pick() was
called on VCPU 1 of domain 10, while it is running on CPU 0, and it
chose CPU 8, which is busy ('|'), even though there are plenty of idle
CPUs. That is because, as a consequence of changing the VCPU affinity,
CPU 8 was chosen as the basis for the comparison, and therefore
considered idle (its bit gets unconditionally set in the bitmask
representing the idle CPUs). The 28004(...) line means the VCPU is
woken up and queued on CPU 8's runq, where it waits for a context
switch or a migration, in order to be able to execute.

This change fixes things by only considering the "guessed" CPU idle if
the VCPU in question is both running there and is its only runnable
VCPU.
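
The condition in the previous paragraph can be sketched as a predicate. The struct and function names are hypothetical stand-ins, not the scheduler's own data structures.

```c
#include <stdbool.h>

/* Hypothetical per-CPU summary used only for this sketch. */
struct cpu_state {
    int running_vcpu;          /* ID of the VCPU running here, -1 if none */
    unsigned int nr_runnable;  /* runnable VCPUs here, including the running one */
};

/*
 * The "guessed" CPU may be counted as idle only if the VCPU being placed
 * is the one running there and nothing else is waiting in its runq.
 */
static bool treat_as_idle(const struct cpu_state *cpu, int vcpu)
{
    return cpu->running_vcpu == vcpu && cpu->nr_runnable == 1;
}
```

Both of the problem cases listed earlier fail this test: a CPU running some other VCPU, and a CPU with more than one runnable VCPU.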

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
Committed-by: Keir Fraser <keir@xen.org>
13 years agoxen: sched_credit: define and use curr_on_cpu(cpu)
Dario Faggioli [Tue, 18 Dec 2012 13:17:27 +0000 (13:17 +0000)]
xen: sched_credit: define and use curr_on_cpu(cpu)

To fetch `per_cpu(schedule_data,cpu).curr' in a more readable
way. It's in sched-if.h as that is where `struct schedule_data'
is declared.
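
A self-contained sketch of such an accessor is shown below. The `per_cpu` stand-in, `NR_CPUS` value and struct layouts are simplified assumptions so the fragment compiles on its own; they are not Xen's definitions.

```c
#include <stddef.h>

#define NR_CPUS 4   /* hypothetical, for this sketch only */

struct vcpu { int vcpu_id; };
struct schedule_data { struct vcpu *curr; };

/* Stand-in for per-CPU storage: a plain array indexed by CPU number. */
static struct schedule_data schedule_data[NR_CPUS];
#define per_cpu(var, cpu) (var[(cpu)])

/* Readable shorthand for the VCPU currently running on a given CPU. */
#define curr_on_cpu(cpu) (per_cpu(schedule_data, cpu).curr)
```

Call sites can then write `curr_on_cpu(cpu)` wherever they previously spelled out the full per-CPU expression.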

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
Committed-by: Keir Fraser <keir@xen.org>
13 years agousbif: drop bogus definition
Jan Beulich [Tue, 18 Dec 2012 13:08:55 +0000 (14:08 +0100)]
usbif: drop bogus definition

Just like recently done for vSCSI, remove a backend implementation
detail from the interface header.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Keir Fraser <keir@xen.org>
13 years agoadd maintainers entry for vendor-independent IOMMU code
Jan Beulich [Tue, 18 Dec 2012 08:48:57 +0000 (09:48 +0100)]
add maintainers entry for vendor-independent IOMMU code

As agreed to last week.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Keir Fraser <keir@xen.org>
13 years agolibxenstore: filter watch events in libxenstore when we unwatch
Julien Grall [Mon, 17 Dec 2012 18:04:54 +0000 (18:04 +0000)]
libxenstore: filter watch events in libxenstore when we unwatch

XenStore queues watch events via a thread and notifies the user.
Sometimes xs_unwatch is called before all related messages have been
read. The use case is non-threaded libevent, where we have two events,
A and B:
    - Event A will destroy something and call xs_unwatch;
    - Event B is used to notify that a node has changed in XenStore.
As events are handled one by one, event A can be handled before event B.
So on the next xs_watch_read the user could retrieve an unwatched token,
and a segfault occurs if the token stores a pointer to the destroyed
structure (e.g. "backend:0xcafe").

To avoid problems with existing applications using libXenStore, this
behaviour will only be enabled if XS_UNWATCH_FILTER is given to xs_open.
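
The filtering idea can be sketched independently of the library. The queue layout and `filter_unwatched` are hypothetical and match on the token only; this is not libxenstore's implementation.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical queued watch event: the changed path and the caller's token. */
struct watch_event {
    const char *path;
    const char *token;
};

/*
 * Drop every queued event carrying `token`, compacting the array in
 * place. Returns the number of events that remain queued.
 */
static size_t filter_unwatched(struct watch_event *queue, size_t n,
                               const char *token)
{
    size_t kept = 0;

    for (size_t i = 0; i < n; i++)
        if (strcmp(queue[i].token, token) != 0)
            queue[kept++] = queue[i];
    return kept;
}
```

Running this at unwatch time means a later read can no longer return a token whose backing structure has already been destroyed.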

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Signed-off-by: Julien Grall <julien.grall@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Committed-by: Ian Jackson <ian.jackson@eu.citrix.com>
13 years agox86/kexec: Change NMI and MCE handling on kexec path
Andrew Cooper [Thu, 13 Dec 2012 14:39:31 +0000 (14:39 +0000)]
x86/kexec: Change NMI and MCE handling on kexec path

Experimentally, certain crash kernels will triple fault very early
after starting if started with NMIs disabled.  This was discovered
when experimenting with a debug keyhandler which deliberately created
a reentrant NMI, causing stack corruption.

Because of this bug, and because future changes to the NMI handling
will make the kexec path more fragile, take the time now to
bullet-proof the kexec behaviour to be safer in more circumstances.

This patch adds three new low level routines:
 * nmi_crash
    This is a special NMI handler for use during a kexec crash.
 * enable_nmis
    This function enables NMIs by executing an iret-to-self, to
    disengage the hardware NMI latch.
 * trap_nop
    This is a no-op handler which irets immediately.  It is not
    declared with ENTRY() to avoid the extra alignment overhead.

And adds three new IDT entry helper routines:
 * _write_gate_lower
    This is a substitute for using cmpxchg16b to update a 128bit
    structure at once.  It assumes that the top 64 bits are unchanged
    (and ASSERT()s the fact) and performs a regular write on the lower
    64 bits.
 * _set_gate_lower
    This is functionally equivalent to the already present
    _set_gate(), except it uses _write_gate_lower rather than updating
    both 64bit values.
 * _update_gate_addr_lower
    This is designed to update an IDT entry handler only, without
    altering any other settings in the entry.  It also uses
    _write_gate_lower.
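
The lower-half write can be sketched as below. The two-field struct and the function name are simplified stand-ins for this illustration, not the definitions added by the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified 128-bit gate: `a` is the lower 64 bits, `b` the upper 64. */
struct gate {
    uint64_t a, b;
};

/*
 * Update only the lower half of a gate. The caller promises that the
 * upper half is identical, which is checked rather than rewritten, so
 * the update is one plain 64-bit store.
 */
static void write_gate_lower(volatile struct gate *gate,
                             const struct gate *new_gate)
{
    assert(gate->b == new_gate->b);
    gate->a = new_gate->a;
}
```

Restricting the change to one 64-bit store is what lets it substitute for a 128-bit atomic update in this case.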

The IDT entry helpers are required because:
  * It is unsafe to attempt a disable/update/re-enable cycle on the
    NMI or MCE IDT entries.
  * We need to be able to update NMI handlers without changing the IST
    entry.

As a result, the new behaviour of the kexec_crash path is:

nmi_shootdown_cpus() will:

 * Disable the crashing cpu's NMI/MCE interrupt stack tables.
    Disabling the stack tables removes race conditions which would
    lead to corrupt exception frames and infinite loops.  As this pcpu
    is never planning to execute a sysret back to a pv vcpu, the
    update is safe from a security point of view.

 * Swap the NMI trap handlers.
    The crashing pcpu gets the nop handler, to prevent it getting
    stuck in an NMI context, causing a hang instead of crash.  The
    non-crashing pcpus all get the nmi_crash handler which is designed
    never to return.

do_nmi_crash() will:

 * Save the crash notes and shut the pcpu down.
    There is now an extra per-cpu variable to prevent us from
    executing this multiple times.  In the case where we reenter
    midway through, attempt the whole operation again in preference to
    not completing it in the first place.

 * Set up another NMI at the LAPIC.
    Even when the LAPIC has been disabled, the ID and command
    registers are still usable.  As a result, we can deliberately
    queue up a new NMI to re-interrupt us later if NMIs get unlatched.
    Because of the call to __stop_this_cpu(), we have to hand craft
    self_nmi() to be safe from General Protection Faults.

 * Fall into infinite loop.

machine_kexec() will:

  * Swap the MCE handlers to be a nop.
     We cannot prevent MCEs from being delivered when we pass off to
     the crash kernel, and the less Xen context is being touched the
     better.

  * Explicitly enable NMIs.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Tim Deegan <tim@xen.org>

Minor style changes.

Signed-off-by: Keir Fraser <keir@xen.org>
Committed-by: Keir Fraser <keir@xen.org>
13 years agox86/mm/hap: Adjust vram tracking to play nicely with log-dirty.
Robert Phillips [Thu, 13 Dec 2012 12:10:14 +0000 (12:10 +0000)]
x86/mm/hap: Adjust vram tracking to play nicely with log-dirty.

The previous code assumed the guest would be in one of three mutually exclusive
modes for bookkeeping dirty pages: (1) shadow, (2) hap utilizing the log dirty
bitmap to support functionality such as live migrate, (3) hap utilizing the
log dirty bitmap to track dirty vram pages.
Races arose when a guest attempted to track dirty vram while performing live
migrate.  (The dispatch table managed by paging_log_dirty_init() might change
in the middle of a log dirty or a vram tracking function.)

This change allows hap log dirty and hap vram tracking to be concurrent.
Vram tracking no longer uses the log dirty bitmap.  Instead it detects
dirty vram pages by examining their p2m type.  The log dirty bitmap is only
used by the log dirty code.  Because the two operations use different
mechanisms, they are no longer mutually exclusive.

Signed-Off-By: Robert Phillips <robert.phillips@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>

Minor whitespace changes to conform with coding style.
Signed-off-by: Tim Deegan <tim@xen.org>
Committed-by: Tim Deegan <tim@xen.org>
13 years agolibxl: introduce XSM relabel on build
Daniel De Graaf [Thu, 13 Dec 2012 11:44:02 +0000 (11:44 +0000)]
libxl: introduce XSM relabel on build

Allow a domain to be built under one security label and run using a
different label.  This can be used to prevent the domain builder or
control domain from having the ability to access a guest domain's memory
via map_foreign_range except during the build process where this is
required.

Example domain configuration snippet:
  seclabel='customer_1:vm_r:nomigrate_t'
  init_seclabel='customer_1:vm_r:nomigrate_t_building'

Note: this does not provide complete protection from a malicious dom0;
mappings created during the build process may persist after the relabel,
and could be used to indirectly access the guest's memory. However, if
dom0 correctly unmaps the domain upon building, the domU is protected
against dom0 becoming malicious in the future.

Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Committed-by: Ian Campbell <ian.campbell@citrix.com>
13 years agolibxl: qemu trad logdirty: Tolerate ENOENT on ret path
Ian Jackson [Thu, 13 Dec 2012 11:44:01 +0000 (11:44 +0000)]
libxl: qemu trad logdirty: Tolerate ENOENT on ret path

It can happen in error conditions that lds->ret_path doesn't exist,
and libxl__xs_read_checked signals this by setting got_ret=NULL.  If
this happens, fail without crashing.

Reported-by: Alex Bligh <alex@alex.org.uk>
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Committed-by: Ian Campbell <ian.campbell@citrix.com>
13 years agoxen/arm: use strcmp in device_tree_type_matches
Stefano Stabellini [Thu, 13 Dec 2012 11:44:01 +0000 (11:44 +0000)]
xen/arm: use strcmp in device_tree_type_matches

We want to match the exact string rather than just a leading substring.
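
The difference can be seen in a small sketch. Both helper names are hypothetical, and the prefix variant is an assumption about what a length-limited comparison would look like, not a copy of the previous code.

```c
#include <stdbool.h>
#include <string.h>

/* Prefix comparison: also accepts longer strings that start with `match`. */
static bool type_matches_prefix(const char *type, const char *match)
{
    return strncmp(type, match, strlen(match)) == 0;
}

/* Exact comparison: the whole string must be equal. */
static bool type_matches_exact(const char *type, const char *match)
{
    return strcmp(type, match) == 0;
}
```

With a type such as "memory-controller" and a match string of "memory", only the prefix variant reports a match.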

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Committed-by: Ian Campbell <ian.campbell@citrix.com>
13 years agoxen: get GIC addresses from DT
Stefano Stabellini [Thu, 13 Dec 2012 11:44:00 +0000 (11:44 +0000)]
xen: get GIC addresses from DT

Get the addresses of the GIC distributor, cpu, virtual and virtual cpu
interface registers from the device tree.

Note: I couldn't completely get rid of GIC_BASE_ADDRESS, GIC_DR_OFFSET
and friends because we are using them from mode_switch.S, that is
executed before device tree has been parsed. But at least mode_switch.S
is known to contain vexpress specific code anyway.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Committed-by: Ian Campbell <ian.campbell@citrix.com>
13 years agovscsiif: allow larger segments-per-request values
Jan Beulich [Thu, 13 Dec 2012 10:22:54 +0000 (11:22 +0100)]
vscsiif: allow larger segments-per-request values

At least certain tape devices require fixed size blocks to be operated
upon, i.e. breaking up of I/O requests is not permitted. Consequently
we need an interface extension that (leaving aside implementation
limitations) doesn't impose a limit on the number of segments that can
be associated with an individual request.

This, in turn, excludes the blkif extension FreeBSD folks implemented,
as that still imposes an upper limit (the actual I/O request still
specifies the full number of segments - as an 8-bit quantity -, and
subsequent ring slots get used to carry the excess segment
descriptors).

The alternative therefore is to allow the frontend to pre-set segment
descriptors _before_ actually issuing the I/O request. I/O will then
be done by the backend for the accumulated set of segments.

To properly associate segment preset operations with the main request,
the rqid-s between them should match (originally I had hoped to use
this to avoid producing individual responses for the pre-set
operations, but that turned out to violate the underlying shared ring
implementation).

Negotiation of the maximum number of segments a particular backend
implementation supports happens through a new "segs-per-req" xenstore
node.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
13 years agoVMX: intr.c: remove i386 related code
Dongxiao Xu [Wed, 12 Dec 2012 09:47:18 +0000 (10:47 +0100)]
VMX: intr.c: remove i386 related code

The i386 arch is no longer supported by Xen; remove the related code.

Signed-off-by: Dongxiao Xu <dongxiao.xu@intel.com>
Committed-by: Jan Beulich <jbeulich@suse.com>
13 years agox86/IST: Create set_ist() helper function
Andrew Cooper [Tue, 11 Dec 2012 16:49:19 +0000 (17:49 +0100)]
x86/IST: Create set_ist() helper function

... to save using open-coded bitwise operations, and update all IST
manipulation sites to use the function.
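
A helper of this kind might look like the sketch below. The struct is a simplified stand-in, and the body is an illustration rather than the patch's code; it relies on the x86-64 gate descriptor layout, where the IST index is a 3-bit field starting at bit 32 of the lower quadword.

```c
#include <assert.h>
#include <stdint.h>

#define IST_SHIFT 32
#define IST_MASK  (UINT64_C(7) << IST_SHIFT)

/* Simplified IDT entry: `a` is the lower quadword, `b` the upper. */
struct idt_entry {
    uint64_t a, b;
};

/* Set the IST index of an IDT entry, leaving all other bits untouched. */
static void set_ist(struct idt_entry *idt, uint64_t ist)
{
    assert(ist <= 7);   /* the field is only 3 bits wide */
    idt->a = (idt->a & ~IST_MASK) | (ist << IST_SHIFT);
}
```

Keeping the shift and mask in one place avoids repeating the same bit manipulation at every call site.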

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Committed-by: Jan Beulich <jbeulich@suse.com>
13 years agox86/ucode: Improve error handling and container file processing on AMD
Boris Ostrovsky [Tue, 11 Dec 2012 16:46:57 +0000 (17:46 +0100)]
x86/ucode: Improve error handling and container file processing on AMD

Do not report an error when a patch is not applicable to the current
processor; simply skip it and move on to the next patch in the
container file.

Process container file to the end instead of stopping at the first
applicable patch.
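
The resulting loop shape can be sketched as follows. The `patch` struct, the function name and the applicability rule (matching CPU ID and a newer revision) are assumptions made for the illustration, not the AMD microcode loader's logic.

```c
#include <stddef.h>

/* Hypothetical container entry: target CPU ID and patch revision. */
struct patch {
    unsigned int cpu_id;
    unsigned int rev;
};

/*
 * Walk the whole container. Entries that do not apply are skipped
 * without raising an error, and the walk continues after a successful
 * application. Returns the revision in effect afterwards.
 */
static unsigned int process_container(const struct patch *patches, size_t n,
                                      unsigned int cpu_id,
                                      unsigned int cur_rev,
                                      unsigned int *applied)
{
    *applied = 0;
    for (size_t i = 0; i < n; i++) {
        if (patches[i].cpu_id != cpu_id || patches[i].rev <= cur_rev)
            continue;              /* not applicable: skip, no error */
        cur_rev = patches[i].rev;  /* "apply" the patch */
        (*applied)++;
    }
    return cur_rev;
}
```

Walking to the end means a later, newer patch for the same processor is not missed just because an earlier one already applied.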

Log the fact that a patch has been applied at KERN_WARNING level, modify
debug messages.

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@amd.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Committed-by: Jan Beulich <jbeulich@suse.com>
13 years agox86/EFI: work around CFLAGS being passed in through environment
Charles Arnold [Tue, 11 Dec 2012 12:49:39 +0000 (13:49 +0100)]
x86/EFI: work around CFLAGS being passed in through environment

Short of a solution to the problem described in
http://lists.xen.org/archives/html/xen-devel/2012-12/msg00648.html,
deal with the bad effect this together with c/s 25751:02b4d5fedb7b has
on the EFI build by filtering out the problematic command line items.

Signed-off-by: Charles Arnold <carnold@suse.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Committed-by: Jan Beulich <jbeulich@suse.com>
13 years agox86: frame table related improvements
Jan Beulich [Tue, 11 Dec 2012 12:47:53 +0000 (13:47 +0100)]
x86: frame table related improvements

- fix super page frame table setup for memory hotplug case (should
  create full table, or else the hotplug code would need to do the
  necessary table population)
- simplify super page frame table setup (can re-use frame table setup
  code)
- slightly streamline frame table setup code
- fix (tighten) a BUG_ON() and an ASSERT() condition
- fix spage <-> pdx conversion macros (they had no users so far, and
  hence no-one noticed how broken they were)

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Keir Fraser <keir@xen.org>
13 years agoxen: reserve next two XENMEM_ op numbers for future/out-of-tree use
Dan Magenheimer [Mon, 10 Dec 2012 11:16:17 +0000 (11:16 +0000)]
xen: reserve next two XENMEM_ op numbers for future/out-of-tree use

Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Committed-by: Keir Fraser <keir@xen.org>
13 years agoxen: centralize accounting for domain tot_pages
Dan Magenheimer [Mon, 10 Dec 2012 11:15:53 +0000 (11:15 +0000)]
xen: centralize accounting for domain tot_pages

Provide and use a common function for all adjustments to a
domain's tot_pages counter in anticipation of future and/or
out-of-tree patches that must adjust related counters
atomically.
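
The idea of a single adjustment point can be sketched like this. The struct is reduced to the one counter and the function name is hypothetical; real callers would also need to hold whatever lock protects the counter.

```c
/* Reduced stand-in for the domain structure: only the page counter. */
struct domain {
    unsigned long tot_pages;
};

/*
 * Single place where tot_pages changes. `pages` may be negative.
 * Returns the new total so callers can act on the updated value.
 */
static unsigned long adjust_tot_pages(struct domain *d, long pages)
{
    d->tot_pages += pages;
    return d->tot_pages;
}
```

With every change funnelled through one function, a later patch that must keep a related counter in step only has to touch this one spot.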

Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Committed-by: Keir Fraser <keir@xen.org>
13 years agostreamline guest copy operations
Jan Beulich [Mon, 10 Dec 2012 10:18:25 +0000 (11:18 +0100)]
streamline guest copy operations

- use the variants not validating the VA range when writing back
  structures/fields to the same space that they were previously read
  from
- when only a single field of a structure actually changed, copy back
  just that field where possible
- consolidate copying back results in a few places

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Keir Fraser <keir@xen.org>
13 years agox86/oprofile: adjust CPU specific initialization
Jan Beulich [Mon, 10 Dec 2012 10:16:37 +0000 (11:16 +0100)]
x86/oprofile: adjust CPU specific initialization

Drop support for 32-bit only CPU models as well as those that can be
dealt with by the arch_perfmon bits. Models 14 and 15 remain as
questionable (I'm not 100% positive that these don't support 64-bit
mode).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Keir Fraser <keir@xen.org>
13 years agoscheduler: fix rate limit range checking
Jan Beulich [Mon, 10 Dec 2012 10:14:27 +0000 (11:14 +0100)]
scheduler: fix rate limit range checking

For one, neither of the two checks permitted the documented value
of zero (disabling the functionality altogether).

Second, the range checking of the command line parameter was done by
the credit scheduler's initialization code, despite it being a generic
scheduler option.
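
A range check that also accepts zero might look like the sketch below. The function name and both bounds are hypothetical values chosen for the illustration; the commit message does not state the actual limits.

```c
#include <stdbool.h>

/* Hypothetical bounds in microseconds, for this sketch only. */
#define RATELIMIT_MIN_US 100u
#define RATELIMIT_MAX_US 500000u

/* Zero disables rate limiting; any other value must be within range. */
static bool ratelimit_valid(unsigned int us)
{
    return us == 0 ||
           (us >= RATELIMIT_MIN_US && us <= RATELIMIT_MAX_US);
}
```

Placing one such check in generic scheduler code means the command line path and the runtime path cannot drift apart.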

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Keir Fraser <keir@xen.org>
13 years agoQEMU_TAG update
Ian Jackson [Fri, 7 Dec 2012 16:19:11 +0000 (16:19 +0000)]
QEMU_TAG update

13 years agox86: mark certain items static
Jan Beulich [Fri, 7 Dec 2012 12:47:48 +0000 (13:47 +0100)]
x86: mark certain items static

..., and at once constify the data items among them.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Keir Fraser <keir@xen.org>
13 years agox86/HVM: add missing assert to stdvga's mmio_move()
Jan Beulich [Fri, 7 Dec 2012 12:45:57 +0000 (13:45 +0100)]
x86/HVM: add missing assert to stdvga's mmio_move()

... to match the IOREQ_READ path.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Keir Fraser <keir@xen.org>
13 years agox86/EFI: add code interfacing with the secure boot shim
Jan Beulich [Fri, 7 Dec 2012 12:44:32 +0000 (13:44 +0100)]
x86/EFI: add code interfacing with the secure boot shim

... to validate the kernel image (which is required to be in PE
format, as is e.g. the case for the Linux bzImage when built with
CONFIG_EFI_STUB).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Keir Fraser <keir@xen.org>
13 years agox86/p2m: drop redundant macro definitions
Jan Beulich [Fri, 7 Dec 2012 12:43:17 +0000 (13:43 +0100)]
x86/p2m: drop redundant macro definitions

Also, add log level indicator to P2M_ERROR(), and drop pointless
underscores from all related macros' parameter names.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Keir Fraser <keir@xen.org>
13 years agox86: properly fail mmuext ops when get_page_from_gfn() fails
Jan Beulich [Fri, 7 Dec 2012 12:40:46 +0000 (13:40 +0100)]
x86: properly fail mmuext ops when get_page_from_gfn() fails

I noticed this inconsistency while analyzing the code for XSA-32.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Keir Fraser <keir@xen.org>