dgit.raspbian.org Git

tools: fix install of lomount

$(BIN) went away in 23124:e3d4c34b14a3.

Also there are no *.so, *.a or *.rpm built in here

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>

x86: ucode-amd: Don't warn when no ucode is available for a CPU revision

This patch originally comes from the Linus mainline kernel (2.6.33),
find below the patch details:

From: Andreas Herrmann <herrmann.der.user@googlemail.com>

There is no point in warning when there is no ucode available
for a specific CPU revision. Currently the container-file, which
provides the AMD ucode patches for OS load, contains only a few
ucode patches.

It's already clearly indicated by the printed patch_level
whenever new ucode was available and an update happened. So the
warning message is of no help but rather annoying on systems
with many CPUs.

Signed-off-by: Thomas Renninger <trenn@suse.de>
Signed-off-by: Jan Beulich <jbeulich@suse.com>

XZ: Fix incorrect XZ_BUF_ERROR

From: Lasse Collin <lasse.collin@tukaani.org>

xz_dec_run() could incorrectly return XZ_BUF_ERROR if all of the
following was true:

- The caller knows how many bytes of output to expect and only
   provides
   that much output space.

- When the last output bytes are decoded, the caller-provided input
   buffer ends right before the LZMA2 end of payload marker.  So LZMA2
   won't provide more output anymore, but it won't know it yet and
   thus
   won't return XZ_STREAM_END yet.

- A BCJ filter is in use and it hasn't left any unfiltered bytes in
   the
   temp buffer.  This can happen with any BCJ filter, but in practice
   it's more likely with filters other than the x86 BCJ.

This fixes <https://bugzilla.redhat.com/show_bug.cgi?id=3D735408>
where Squashfs thinks that a valid file system is corrupt.

This also fixes a similar bug in single-call mode where the
uncompressed size of a block using BCJ + LZMA2 was 0 bytes and caller
provided no output space.  Many empty .xz files don't contain any
blocks and thus don't trigger this bug.

This also tweaks a closely related detail: xz_dec_bcj_run() could call
xz_dec_lzma2_run() to decode into temp buffer when it was known to be
useless.  This was harmless although it wasted a minuscule number of
CPU cycles.

Signed-off-by: Lasse Collin <lasse.collin@tukaani.org>
Signed-off-by: Jan Beulich <jbeulich@suse.com>

XZ decompressor: Fix decoding of empty LZMA2 streams

From: Lasse Collin <lasse.collin@tukaani.org>

The old code considered valid empty LZMA2 streams to be corrupt.
Note that a typical empty .xz file has no LZMA2 data at all,
and thus most .xz files having no uncompressed data are handled
correctly even without this fix.

Signed-off-by: Lasse Collin <lasse.collin@tukaani.org>
Signed-off-by: Jan Beulich <jbeulich@suse.com>

VT-d: fix off-by-one error in RMRR validation

(base_addr,end_addr) is an inclusive range, and hence there shouldn't
be a subtraction of 1 in the second invocation of page_is_ram_type().
For RMRRs covering a single page that actually resulted in the
immediately preceding page to get checked (which could have resulted
in a false warning).

Signed-off-by: Jan Beulich <jbeulich@suse.com>

VT-d: eliminate a mis-use of pcidevs_lock

dma_pte_clear_one() shouldn't acquire this global lock for the purpose
of processing a per-domain list. Furthermore the function a few lines
earlier has a comment stating that acquiring pcidevs_lock isn't
necessary here (whether that's really correct is another question).

Use the domain's mappin_lock instead to protect the mapped_rmrrs list.
Fold domain_rmrr_mapped() into its sole caller so that the otherwise
implicit dependency on pcidevs_lock there becomes more obvious (see
the comment there).

Signed-off-by: Jan Beulich <jbeulich@suse.com>

x86: IO-APIC code has no dependency on PCI

The IRQ handling code requires pcidevs_lock to be held only for MSI
interrupts.

As the handling of which was now fully moved into msi.c (i.e. while
applying fine without, the patch needs to be applied after the one
titled "x86: split MSI IRQ chip"), io_apic.c now also doesn't need to
include PCI headers anymore.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

PCI multi-seg: config space accessor adjustments

Signed-off-by: Jan Beulich <jbeulich@suse.com>

PCI multi-seg: Pass-through adjustments

Signed-off-by: Jan Beulich <jbeulich@suse.com>

PCI multi-seg: AMD-IOMMU specific adjustments

There are two places here where it is entirely unclear to me where the
necessary PCI segment number should be taken from (as IVMD descriptors
don't have such, only IVHD ones do). AMD confirmed that for the time
being it is acceptable to imply that only segment 0 exists.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

PCI multi-seg: VT-d specific adjustments

Signed-off-by: Jan Beulich <jbeulich@suse.com>

PCI multi-seg: adjust domctl interface

Again, a couple of directly related functions at once get adjusted to
account for the segment number.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

x86: split MSI IRQ chip

With the .end() accessor having become optional and noting that
several of the accessors' behavior really depends on the result of
msi_maskable_irq(), the splits the MSI IRQ chip type into two - one
for the maskable ones, and the other for the (MSI only) non-maskable
ones.

At once the implementation of those methods gets moved from io_apic.c
to msi.c.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

pass struct irq_desc * to all other IRQ accessors

This is again because the descriptor is generally more useful (with
the IRQ number being accessible in it if necessary) and going forward
will hopefully allow to remove all direct accesses to the IRQ
descriptor array, in turn making it possible to make this some other,
more efficient data structure.

This additionally makes the .end() accessor optional, noting that in a
number of cases the functions were empty.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

pass struct irq_desc * to set_affinity() IRQ accessors

This is because the descriptor is generally more useful (with the IRQ
number being accessible in it if necessary) and going forward will
hopefully allow to remove all direct accesses to the IRQ descriptor
array, in turn making it possible to make this some other, more
efficient data structure.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

convert more literal uses of cpumask_t to pointers

This is particularly relevant as the number of CPUs to be supported
increases (as recently happened for the default thereof).

Signed-off-by: Jan Beulich <jbeulich@suse.com>

MAINTAINERS: Add Jan Beulich for EFI, x86, ACPI.

Signed-off-by: Keir Fraser <keir@xen.org>

PCI multi-seg: add new physdevop-s

The new PHYSDEVOP_pci_device_add is intended to be extensible, with a
first extension (to pass the proximity domain of a device) added right
away.

A couple of directly related functions at once get adjusted to account
for the segment number.

Should we deprecate the PHYSDEVOP_manage_pci_* sub-hypercalls?

Signed-off-by: Jan Beulich <jbeulich@suse.com>

PCI multi-seg: introduce notion of PCI segments

Signed-off-by: Jan Beulich <jbeulich@suse.com>

Fix PV CPUID virtualization of XSave

The patch will fix XSave CPUID virtualization for PV guests. The XSave
area size returned by CPUID leaf D is changed dynamically depending on
the XCR0. Tools/libxc only assigns a static value. The fix will adjust
xsave area size during runtime.

Note: This fix is already in HVM cpuid virtualization. And Dom0 is not
affected, either.

Signed-off-by: Shan Haitao <haitao.shan@intel.com>

Clear IRQ_GUEST in irq_desc->status when setting action to NULL.

Looking more closely at usage of action field with relation to
IRQ_GUEST flag. It appears that set IRQ_GUEST implies that action
is not NULL. As result it is not safe to set action to NULL and
leave IRQ_GUEST set.

Hence IRQ_GUEST should be cleared in dynamic_irq_cleanup where
action is set to NULL.

An addition remove BUGON at __pirq_guest_unbind that appears to be
bogus and not needed anymore.

Thanks Paolo Bonzini for NACKing previous patch, and pointing at the
correct solution.

Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Reinstate the BUG_ON, but after the action==NULL check. Since we then
go and start interpreting action as an irq_guest_action_t, the BUG_ON
is relevant here.

More generally, the brute-force nature of dynamic_irq_cleanup() looks
a bit worrying. Possibly there should be more integratioin with
pirq_guest_unbind() logic, for cleaning up un-acked EOIs and the like.

Signed-off-by: Keir Fraser <keir@xen.org>

x86/time: verify_tsc_reliability() can be run as a generic initcall.

Signed-off-by: Keir Fraser <keir@xen.org>

x86-64/EFI: 2.0 hypercall extensions

Flesh out the interface to EFI 2.0 runtime calls and implement what
can reasonably be without actually having active call paths getting
there (i.e. without actual debugging possible: The capsule interfaces
certainly require an environment where an initial implementation can
actually be tested).

Signed-off-by: Jan Beulich <jbeulich@suse.com>

x86-64/EFI: 2.0 header extensions

Updates from gnu-efi 3.0m. UEFI 2.0 runtime services additions taken
from EDK 1.06.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

x86/vmx: don't call __vmxoff() blindly

If vmx_vcpu_up() failed, __vmxon() would generally not have got
(successfully) executed, and in that case __vmxoff() will #UD.

Additionally, any panic() during early resume (namely the tboot
related one) would cause vmx_cpu_down() to get executed without
vmx_cpu_up() having run before.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

x86/tboot: make resume error messages visible

With tboot_s3_resume() running before console_resume(), the error
messages so far printed by it are mostly guaranteed to go into
nirwana. Latch MACs into a static variable instead, and issue the
messages right before calling panic().

Signed-off-by: Jan Beulich <jbeulich@suse.com>

xen: Move tsc reliability check until after CPUs have booted

AMD CPUs by default enable X86_FEATURE_TSC_RELIABLE, and depend upon a
later check to disable this feature if TSC drift is detected.
Unfortunately, this check is done in time.c:init_xen_time(), which is
done before any secondary CPUs are brought up, and is thus guaranteed
to succed.

This patch moves the check into its own function, and calls it after
cpus are brought up.

Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>

x86/hvm: Tidy up the viridian code a little and flesh out the APIC
assist MSR handling code.

We don't say we that handle that MSR but Windows assumes it. In
Windows 7 it just wrote to the MSR and we used to handle that
ok. Windows 8 also reads from the MSR so we need to keep a record of
the contents.

Signed-off-by: Paul Durrant <paul.durrant@citrix.com>

xen/xsm: Compile error due to naming clash between XSM and EFI runtime

The problem is that efi_runtime_call is the name of both a function in
xen/arch/x86/efi/runtime.c and a member of the xsm_operations struct
in xen/include/xsm/xsm.h. This causes the macro "#define
efi_runtime_call(x) efi_compat_runtime_call(x)" on line 15 of
xen/arch/x86/x86_64/platform_hypercall.c to cause the above compile
error.

Renaming the XSM struct member fixes the problem.

Signed-off-by: James Carter <jwcart2@tycho.nsa.gov>
Acked-by: Jan Beulich <jbeulich@suse.com>

Avoid race in schedule() when switching schedulers

Selecting the scheduler to call must be done under lock. Otherwise a
race might occur when switching schedulers in a cpupool

Signed-off-by: Juergen Gross <juergen.gross@ts.fujitsu.com>
Acked-by: George Dunlap <george.dunlap@eu.citrix.com>

mem_event: use different ringbuffers for share, paging and access

Up to now a single ring buffer was used for mem_share, xenpaging and
xen-access. Each helper would have to cooperate and pull only its own
requests from the ring. Unfortunately this was not implemented. And
even if it was, it would make the whole concept fragile because a crash
or early exit of one helper would stall the others.

What happend up to now is that active xenpaging + memory_sharing would
push memsharing requests in the buffer. xenpaging is not prepared for
such requests.

This patch creates an independet ring buffer for mem_share, xenpaging
and xen-access and adds also new functions to enable xenpaging and
xen-access. The xc_mem_event_enable/xc_mem_event_disable functions will
be removed. The various XEN_DOMCTL_MEM_EVENT_* macros were cleaned up.
Due to the removal the API changed, so the SONAME will be changed too.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Acked-by: Tim Deegan <tim@xen.org>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Committed-by: Tim Deegan <tim@xen.org>

mem_event: pass mem_event_domain pointer to mem_event functions

Pass a struct mem_event_domain pointer to the various mem_event
functions. This will be used in a subsequent patch which creates
different ring buffers for the memshare, xenpaging and memaccess
functionality.

Remove the struct domain argument from some functions.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Acked-by: Tim Deegan <tim@xen.org>
Committed-by: Tim Deegan <tim@xen.org>

libxc: Enable cpuid performance counter leaf for HVM

In HVM domains the usable performance counters can be checked
automatically only, if cpuid leaf 0x0000000a is accessible.

Signed-off-by: Juergen Gross <juergen.gross@ts.fujitsu.com>
Committed-by: Ian Jackson <ian.jackson@eu.citrix.com>

xenstored: allow guest to shutdown all its watches/transactions

During kexec all old watches have to be removed, otherwise the new
kernel will receive unexpected events. Allow a guest to reset itself
and cleanup all of its watches and transactions.

Add a new XS_RESET_WATCHES command to do the reset on behalf of the
guest.

(Changes by iwj: specify the argument to be a single nul byte. Permit
read-only clients to use the new command.)

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Committed-by: Ian Jackson <ian.jackson@eu.citrix.com>

tools: Revert seabios and upstream qemu build changes

These have broken the build and it seems to be difficult to fix. So
we will revert the whole lot for now, and await corrected patch(es).

Revert "fix the build when CONFIG_QEMU is specified by the user"
Revert "tools: fix permissions of git-checkout.sh"
Revert "scripts/git-checkout.sh: Is not bash specific. Invoke with /bin/sh."
Revert "Clone and build Seabios by default"
Revert "Clone and build upstream Qemu by default"
Revert "Rename ioemu-dir as qemu-xen-traditional-dir"
Revert "Move the ioemu-dir-find shell script to an external file"

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>

fix the build when CONFIG_QEMU is specified by the user

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Committed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>

tools: fix permissions of git-checkout.sh

23828:0d21b68f528b introduced a new scripts/git-checkout.sh, but it
had the wrong permissions. chmod +x it, and add a blank line at the
end to make sure it actually gets updated.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>

scripts/git-checkout.sh: Is not bash specific. Invoke with /bin/sh.

Signed-off-by: Keir Fraser <keir@xen.org>

xen,credit1: Add variable timeslice

Add a xen command-line parameter, sched_credit_tslice_ms,
to set the timeslice of the credit1 scheduler.

Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>

IRQ: IO-APIC support End Of Interrupt for older IO-APICs

The old io_apic_eoi() function using the EOI register only works for
IO-APICs with a version of 0x20. Older IO-APICs do not have an EOI
register so line level interrupts have to be EOI'd by flipping the
mode to edge and back, which clears the IRR and Delivery Status bits.

This patch replaces the current io_apic_eoi() function with one which
takes into account the version of the IO-APIC and EOI's
appropriately.

v2: make recursive call to __io_apic_eoi() to reduce code size.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

xen: if mapping GSIs we run out of pirq < nr_irqs_gsi, use the others

PV on HVM guests can have more GSIs than the host, in that case we
could run out of pirq < nr_irqs_gsi. When that happens use pirq >=
nr_irqs_gsi rather than returning an error.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Tested-by: Benjamin Schweikert <b.schweikert@googlemail.com>

Clone and build Seabios by default

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>

Clone and build upstream Qemu by default

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>

Rename ioemu-dir as qemu-xen-traditional-dir

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>

Move the ioemu-dir-find shell script to an external file

Add support for configuring upstream qemu and rename ioemu-remote
ioemu-dir-remote.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>

xenpaging: use batch of pages during final page-in

Map up to RING_SIZE pages in exit path to fill the ring instead of
populating one page at a time.

Signed-off-by: Olaf Hering <olaf@aepfle.de>

hvmloader: don't clear acpi_info after filling in some fields

In particular the madt_lapic0_addr and madt_csum_addr fields are
filled in while building the tables.

This fixes a bluescreen on shutdown with certain versions of Windows.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Reported-by: Christoph Egger <Christoph.Egger@amd.com>
Tested-and-acked-by: Christoph Egger <Christoph.Egger@amd.com>

x86/mm: use new page-order interfaces in nested HAP code
to make 2M and 1G mappings in the nested p2m tables.

Signed-off-by: Christoph Egger <Christoph.Egger@amd.com>
Signed-off-by: Tim Deegan <tim@xen.org>
Committed-by: Tim Deegan <tim@xen.org>

x86/mm: adjust paging interface to return superpage sizes
to the caller of paging_ga_to_gfn_cr3()

Signed-off-by: Christoph Egger <Christoph.Egger@amd.com>
Signed-off-by: Tim Deegan <tim@xen.org>
Committed-by: Tim Deegan <tim@xen.org>

x86/mm: adjust p2m interface to return superpage sizes

Signed-off-by: Christoph Egger <Christoph.Egger@amd.com>
Signed-off-by: Tim Deegan <tim@xen.org>
Committed-by: Tim Deegan <tim@xen.org>

p2m-ept: remove map_domain_page check

map_domain_page() can not fail, remove ASSERT in ept_set_entry().

Signed-off-by: Olaf Hering <olaf@aepfle.de>

x86: remove unnecessary indirection from irq_complete_move()'s sole parameter

Signed-off-by: Jan Beulich <jbeulich@suse.com>

bitmap_scnlistprintf() should always zero-terminate its output buffer

... as long as it has non-zero size. So far this would not happen if
the passed in CPU mask was empty.

Also fix the comment describing the return value to actually match
reality.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

docs: Fix 'make docs'

Signed-off-by: Keir Fraser <keir@xen.org>

mem_event: use mem_event_mark_and_pause() in mem_event_check_ring()

Signed-off-by: Olaf Hering <olaf@aepfle.de>

mem_event: add ref counting for free requestslots

If mem_event_check_ring() is called by many vcpus at the same time
before any of them called also mem_event_put_request(), all of the
callers must assume there are enough free slots available in the ring.

Record the number of request producers in mem_event_check_ring() to
keep track of available free slots.

Add a new mem_event_put_req_producers() function to release a request
attempt made in mem_event_check_ring(). Its required for
p2m_mem_paging_populate() because that function can only modify the
p2m type if there are free request slots. But in some cases
p2m_mem_paging_populate() does not actually have to produce another
request when it is known that the same request was already made
earlier by a different vcpu.

mem_event_check_ring() can not return a reference to a free request
slot because there could be multiple references for different vcpus
and the order of mem_event_put_request() calls is not known. As a
result, incomplete requests could be consumed by the ring user.

Signed-off-by: Olaf Hering <olaf@aepfle.de>

IRQ: Introduce old_vector to irq_cfg

Introduce old_vector to irq_cfg with the same principle as
old_cpu_mask. This removes a brute force loop from
__clear_irq_vector(), and paves the way to correct bitrotten logic
elsewhere in the irq code.

Signed-off-by Andrew Cooper <andrew.cooper3@citrix.com>

IRQ: Fold irq_status into irq_cfg

irq_status is an int for each of nr_irqs which represents a single
boolean variable. Fold it into the bitfield in irq_cfg, which saves
768 bytes per CPU with per-cpu IDTs in use.

Signed-off-by Andrew Cooper <andrew.cooper3@citrix.com>

IRQ: Remove bit-rotten code

irq_desc.depth is a write only variable.
LEGACY_IRQ_FROM_VECTOR(vec) is never referenced.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: George Dunlap <george.dunlap@eu.citrix.com>

xen, vtd: Fix device check for devices behind PCIe-to-PCI bridges

On some systems, requests devices behind a PCIe-to-PCI bridge all
appear to the IOMMU as though they come from from slot 0, function 0
on that device; so the mapping code much punch a hole for X:0.0 in the
IOMMU for such devices. When punching the hole, if that device has
already been mapped once, we simply need to check ownership to make
sure it's legal. To do so, domain_context_mapping_one() will look up
the device for the mapping with pci_get_pdev() and look for the owner.

However, if there is no device in X:0.0, this look up will fail.

Rather than returning -ENODEV in this situation (causing a failure in
mapping the device), try to get the domain ownership from the iommu
context mapping itself.

Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>

xen: Add global irq_vector_map option, set if using AMD global intremap tables

As mentioned in previous changesets, AMD IOMMU interrupt
remapping tables only look at the vector, not the destination
id of an interrupt. This means that all IRQs going through
the same interrupt remapping table need to *not* share vectors.

The irq "vector map" functionality was originally introduced
after a patch which disabled global AMD IOMMUs entirely. That
patch has since been reverted, meaning that AMD intremap tables
can either be per-device or global.

This patch therefore introduces a global irq vector map option,
and enables it if we're using an AMD IOMMU with a global
interrupt remapping table.

This patch removes the "irq-perdev-vector-map" boolean
command-line optino and replaces it with "irq_vector_map",
which can have one of three values: none, global, or per-device.

Setting the irq_vector_map to any value will override the
default that the AMD code sets.

Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>

ns16550: Simplify UART and UART-interrupt probing logic.

1. No need to check for UART existence in the polling routine. We
already check for UART existence during boot-time initialisation (see
check_existence() function).

2. No obvious need to send a dummy character. The poll routine will
run until a character is eventually sent, but for the most common use
of serial ports (console logging) that will happen almost immediately.

Signed-off-by: Keir Fraser <keir@xen.org>

xen/x86: only support >128 CPUs on x86_64

32 bit cannot cope with 256 cpus and hits:

    /* At least half the ioremap space should be available to us. */
    BUILD_BUG_ON(IOREMAP_VIRT_START + (IOREMAP_MBYTES << 19) >=
    FIXADDR_START);

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>

x86/mm: use defines for page sizes rather hardcoding them.

Signed-off-by: Christoph Egger <Christoph.Egger@amd.com>
Acked-by: Tim Deegan <tim@xen.org>
Committed-by: Tim Deegan <tim@xen.org>

xen: get_free_pirq: make sure that the returned pirq is allocated

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>

xen: __hvm_pci_intx_assert should check for gsis remapped onto pirqs

If the isa irq corresponding to a particular gsi is disabled while the
gsi is enabled, __hvm_pci_intx_assert will always inject the gsi
through the violapic, even if the gsi has been remapped onto a pirq.
This patch makes sure that even in this case we inject the
notification appropriately.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>

xen: fix hvm_domain_use_pirq's behavior

hvm_domain_use_pirq should return true when the guest is using a
certain pirq, no matter if the corresponding event channel is
currently enabled or disabled. As an additional complication, qemu is
going to request pirqs for passthrough devices even for Xen unaware
HVM guests, so we need to wait for an event channel to be connected
before considering the pirq of a passthrough device as "in use".

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>

IRQ: manually EOI migrating line interrupts

When migrating IO-APIC line level interrupts between PCPUs, the
migration code rewrites the IO-APIC entry to point to the new
CPU/Vector before EOI'ing it.

The EOI process says that EOI'ing the Local APIC will cause a
broadcast with the vector number, which the IO-APIC must listen to to
clear the IRR and Status bits.

In the case of migrating, the IO-APIC has already been
reprogrammed so the EOI broadcast with the old vector fails to match
the new vector, leaving the IO-APIC with an outstanding vector,
preventing any more use of that line interrupt.  This causes a lockup
especially when your root device is using PCI INTA (megaraid_sas
driver *ehem*)

However, the problem is mostly hidden because send_cleanup_vector()
causes a cleanup of all moving vectors on the current PCPU in such a
way which does not cause the problem, and if the problem has occured,
the writes it makes to the IO-APIC clears the IRR and Status bits
which unlocks the problem.

This fix is distinctly a temporary hack, waiting on a cleanup of the
irq code.  It checks for the edge case where we have moved the irq,
and manually EOI's the old vector with the IO-APIC which correctly
clears the IRR and Status bits.  Also, it protects the code which
updates irq_cfg by disabling interrupts.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

x86: add irq count for IPIs

such count is useful to assist decision make in cpuidle governor,
while w/o this patch only device interrupts through do_IRQ is
currently counted.

Signed-off-by: Kevin Tian <kevin.tian@intel.com>

vpmu: Add processors Westmere E7-8837 and SandyBridge i5-2500 to the vpmu list

Signed-off-by: Dietmar Hahn <dietmar.hahn@ts.fujitsu.com>

x86: Increase the default NR_CPUS to 256

Changeset 21012:ef845a385014 bumped the default to 128 about one and a
half years ago. Increase it now to 256, as systems with eg. 160
logical CPUs are becoming (have become) common.

Signed-off-by: Laszlo Ersek <lersek@redhat.com>

nestedsvm: VMRUN doesn't use nextrip

VMRUN does not use nextrip. So remove pointless assignment.

Signed-off-by: Christoph Egger <Christoph.Egger@amd.com>

x86-64: Fix off-by-one error in __addr_ok() macro

Signed-off-by: Laszlo Ersek <lersek@redhat.com>
Signed-off-by: Keir Fraser <keir@xen.org>

Merge

Config.mk: Include optional .config file *first* rather than *last*

Allows the core of Config.mk to correctly respond to any configuration
overrides specified in the .config file.

Signed-off-by: Keir Fraser <keir@xen.org>

x86: drop unused parameter from msi_compose_msg() and setup_msi_irq()

This particularly eliminates the bogus passing of NULL by hpet.c.

Signed-off-by: Jan Beulich <jbeulich@novell.com>

x86: work around certain Intel BIOSes causing (transient) hangs during boot

They apparently leave the USB legacy emulation bits set in ICH10's
SMI Control and Enable register, but fail to handle the resulting SMIs
gracefully. The hangs can apparently extend indefinitely, but are
commonly observed to last between a few seconds and a minute.

This assumes that only ICH10-based systems on Intel main boards with
Intel BIOS may be affected. Until Intel comes up with a more precise
identification of affected BIOSes, all Intel ones on Intel boards
will get this workaround applied.

Signed-off-by: Jan Beulich <jbeulich@novell.com>

x86-64: allow mapping mmcfg space for high numbered PCI segments

Rather than using the segment number directly when determining the
virtual address for a particular mmconfig block, use the array index
instead. Thus a system with (perhaps significantly) less than 2048 PCI
segments, but with some having numbers beyond 2047 can actually have
all its mmconfig blocks mapped.

Signed-off-by: Jan Beulich <jbeulich@novell.com>

Add missing 'break' statement.

Without the 'break', assigning a pci device to a PV guest results in an abort,
since the code always falls through to the default abort case in the switch
statement.

Signed-off-by: Kaushik Kumar Ram <kaushik@rice.edu>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Committed-by: Ian Jackson <ian.jackson@eu.citrix.com>

Update my email address in MAINTAINERS

Signed-off-by: Tim Deegan <tim@xen.org>

x86/mm/p2m: use defines for page sizes

Use defines for page sizes instead of hardcoding the value.

Signed-off-by: Christoph Egger <Christoph.Egger@amd.com>
Acked-by: Tim Deegan <tim@xen.org>
Committed-by: Tim Deegan <tim@xen.org>

passthrough: Turn on IOMMU/HAP pagetable sharing by default.

Signed-off-by: Tim Deegan <tim@xen.org>

x86: don't limit dom0's maximum reservation by the available memory

Set dom0's initial maximum reservation using the max value supplied in
the dom0_mem command line option without limiting it by the available
memory.

This allows dom0 to make use of any hotplugged memory without having
to also adjust the maximum reservation.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Acked-by: Jan Beulich <jbeulich@novell.com>

Passthrough: fix iommu_use_hap_pt() to use hap_enabled()

In line with 22924:86000076dcee, paging_mode_hap(d) shouldn't be
used in HAP internals that are called during HAP setup.

Signed-off-by: Tim Deegan <tim@xen.org>

IOMMU: only try to share IOMMU and HAP tables for domains with P2M.
This makes the check more precise, and brings VTd in line with AMD code.

Signed-off-by: Tim Deegan <tim@xen.org>

VT-d: Explicitly test EPT capabilities during IOMMU init
because the cached version isn't set up until the EPT init happens.

Signed-off-by: Tim Deegan <tim@xen.org>

x86: Fix up irq vector map logic

We need to make sure that cfg->used_vector is only cleared once;
otherwise there may be a race condition that allows the same vector to
be assigned twice, defeating the whole purpose of the map.

This makes two changes:
* __clear_irq_vector() only clears the vector if the irq is not being
moved
* smp_iqr_move_cleanup_interrupt() only clears used_vector if this
is the last place it's being used (move_cleanup_count==0 after
decrement).

Also make use of asserts more consistent, to catch this kind of logic
bug in the future.

Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>

Adjust non-debug ASSERT() definition to avoid unused-variable warnings.

Signed-off-by: Keir Fraser <keir@xen.org>

nested-p2m: suppress np2m flushes during p2m setup

There is no need to send IPIs within p2m_alloc_table() via
set_p2m_entry().

Signed-off-by: Christoph Egger <Christoph.Egger@amd.com>
Committed-by: Tim Deegan <tim@xen.org>

ACPI: add _PDC input override mechanism

In order to have Dom0 call _PDC with input fully representing Xen's
capabilities, and in order to avoid building knowledge of Xen
implementation details into Dom0, this provides a mechanism by which
the Dom0 kernel can, once it filled the _PDC input buffer according to
its own knowledge, present the buffer to Xen to apply overrides for
the parts of the C-, P-, and T-state management that it controls. This
is particularly to address the dependency of Xen using MWAIT to enter
certain C-states on the availability of the break-on-interrupt
extension (which the Dom0 kernel should have no need to know about).

Signed-off-by: Jan Beulich <jbeulich@novell.com>

x86/IO-APIC: clear remoteIRR in clear_IO_APIC_pin()

It was found that in a crash scenario, the remoteIRR bit in an IO-APIC
RTE could be left set, causing problems when bringing up a kdump
kernel. While this generally is most important to be taken care of in
the new kernel (which usually would be a native one), it still seems
desirable to also address this problem in Xen so that (a) the problem
doesn't bite Xen when used as a secondary emergency kernel and (b) an
attempt is being made to save un-fixed secondary kernels from running
into said problem.

Based on a Linux patch from suresh.b.siddha@intel.com.

Signed-off-by: Jan Beulich <jbeulich@novell.com>

pm: don't truncate processors' ACPI IDs to 8 bits

This is just another adjustment to allow systems with very many CPUs
(or unusual ACPI IDs) to be properly power-managed.

Signed-off-by: Jan Beulich <jbeulich@novell.com>

AMD IOMMU: remove iommu tlb flush for non-present entries

Fixes dom0 boot on some systems.

Signed-off-by: Wei Wang <wei.wang2@amd.com>

x86: use 'dom0_mem' to limit the number of pages for dom0

Use the 'dom0_mem' command line option to set the maximum number of
pages for dom0. dom0 can use then use the XENMEM_maximum_reservation
memory op to automatically find this limit and reduce the size of any
page tables etc.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>

nestedhvm: avoid endless loop of nested page faults

Stop sending IPIs to flush the nested-on-nested pagetable
after write operations. Instead flush the TLB only.
This fixes an endless loop of nested page faults after
adding an entry to the nested-on-nested pagetable.

Signed-off-by: Christoph Egger <Christoph.Egger@amd.com>
Committed-by: Tim Deegan <tim@xen.org>

nestedhvm: do not send IPIs twice

In p2m_get_nestedp2m() there is no need to send IPIs via
nestedhvm_vmcx_flushtlb() since p2m_flush_table() already
did that.

Signed-off-by: Christoph Egger <Christoph.Egger@amd.com>
Committed-by: Tim Deegan <tim@xen.org>

x86/KEXEC: disable hpet legacy broadcasts earlier

On x2apic machines which booted in xapic mode,
hpet_disable_legacy_broadcast() sends an event check IPI to all online
processors. This leads to a protection fault as the genapic blindly
pokes x2apic MSRs while the local apic is in xapic mode.

One option is to change genapic when we shut down the local apic, but
there are still problems with trying to IPI processors in the online
processor map which are actually sitting in NMI loops

Another option is to have each CPU take itself out of the online CPU
map during the NMI shootdown.

Realistically however, disabling hpet legacy broadcasts earlier in the
kexec path is the easiest fix to the problem.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

mini-os: work around ld bug causing stupid CTOR count

I'm seeing pvgrub crashing when running CTORs.  It appears its because
the magic in the linker script is generating junk.  If I get ld to
output a map, I see:

.ctors          0x0000000000097000       0x18
                0x0000000000097000                __CTOR_LIST__ = .
                0x0000000000097000        0x4 LONG 0x25c04
                (((__CTOR_END__ - __CTOR_LIST__) / 0x4) - 0x2)
*(.ctors)
.ctors         0x0000000000097004       0x10
                /home/jeremy/hg/xen/unstable/stubdom/mini-os-x86_32-grub/mini-os.o
                0x0000000000097014        0x4 LONG 0x0
                0x0000000000097018                __CTOR_END__ = .

In other words, somehow ((0x97018-0x97000) / 4) - 2 = 0x25c04

The specific crash is that the ctor loop tries to call the NULL
sentinel.  I'm seeing the same with the DTOR list.

Avoid this by terminating the loop with the NULL sentinel, and get rid
of the CTOR count entirely.

From: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Keir Fraser <keir@xen.org>

x86-64/EFI: construct EDD data from device path protocol information

In the absence of a BIOS to handle INT13 requests, this information
must be constructed artificially instead when booted from EFI.

Signed-off-by: Jan Beulich <jbeulich@novell.com>

x86: trampoline cleanup

To make future changes less error prone, and to slightly simplify a
possible future conversion to a relocatable trampoline even for the
multiboot path (pretty desirable given that we had to change the
trampoline base a number of times to escape collisions with firmware
placed data),
- remove final uses of bootsym_phys() from trampoline.S, allowing the
symbol to be undefined before including this file (to make sure no
new references get added)
- replace two easy to deal with uses of bootsym_phys() in head.S
- remove an easy to replace reference to BOOT_TRAMPOLINE

Signed-off-by: Jan Beulich <jbeulich@novell.com>