dgit.raspbian.org Git

tools/ocaml/xenstored: Fix path length validation

Currently, oxenstored checks the length of paths against 1024, then
prepends "/local/domain/$DOMID/" to relative paths.  This allows a domU
to create paths which can't subsequently be read by anyone, even dom0.
This also interferes with listing directories, etc.

Define a new oxenstored.conf entry: quota-path-max, defaulting to 1024
as before.  For paths that begin with "/local/domain/$DOMID/" check the
relative path length against this quota. For all other paths check the
entire path length.

This ensures that if the domid changes (and thus the length of a prefix
changes) a path that used to be valid stays valid (e.g. after a
live-migration).  It also ensures that regardless how the client tries
to access a path (domid-relative or absolute) it will get consistent
results, since the limit is always applied on the final canonicalized
path.

Delete the unused Domain.get_path to avoid it being confused with
Connection.get_path (which differs by a trailing slash only).

Rewrite Util.path_validate to apply the appropriate length restriction
based on whether the path is relative or not.  Remove the check for
connection_path being absolute, because it is not guest controlled data.

This is part of XSA-323.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>

tools/ocaml/xenstored: clean up permissions for dead domains

domain ids are prone to wrapping (15-bits), and with sufficient number
of VMs in a reboot loop it is possible to trigger it.  Xenstore entries
may linger after a domain dies, until a toolstack cleans it up. During
this time there is a window where a wrapped domid could access these
xenstore keys (that belonged to another VM).

To prevent this do a cleanup when a domain dies:
* walk the entire xenstore tree and update permissions for all nodes
   * if the dead domain had an ACL entry: remove it
   * if the dead domain was the owner: change the owner to Dom0

This is done without quota checks or a transaction. Quota checks would
be a no-op (either the domain is dead, or it is Dom0 where they are not
enforced).  Transactions are not needed, because this is all done
atomically by oxenstored's single thread.

The xenstore entries owned by the dead domain are not deleted, because
that could confuse a toolstack / backends that are still bound to it
(or generate unexpected watch events). It is the responsibility of a
toolstack to remove the xenstore entries themselves.

This is part of XSA-322.

Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>

tools/xenstore: revoke access rights for removed domains

Access rights of Xenstore nodes are per domid. Unfortunately existing
granted access rights are not removed when a domain is being destroyed.
This means that a new domain created with the same domid will inherit
the access rights to Xenstore nodes from the previous domain(s) with
the same domid.

This can be avoided by adding a generation counter to each domain.
The generation counter of the domain is set to the global generation
counter when a domain structure is being allocated. When reading or
writing a node all permissions of domains which are younger than the
node itself are dropped. This is done by flagging the related entry
as invalid in order to avoid modifying permissions in a way the user
could detect.

A special case has to be considered: for a new domain the first
Xenstore entries are already written before the domain is officially
introduced in Xenstore. In order not to drop the permissions for the
new domain a domain struct is allocated even before introduction if
the hypervisor is aware of the domain. This requires adding another
bool "introduced" to struct domain in xenstored. In order to avoid
additional padding holes convert the shutdown flag to bool, too.

As verifying permissions has its price regarding runtime add a new
quota for limiting the number of permissions an unprivileged domain
can set for a node. The default for that new quota is 5.

This is part of XSA-322.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Acked-by: Julien Grall <julien@amazon.com>

tools/ocaml/xenstored: add xenstored.conf flag to turn off watch permission checks

There are flags to turn off quotas and the permission system, so add one
that turns off the newly introduced watch permission checks as well.

This is part of XSA-115.

Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

tools/ocaml/xenstored: avoid watch events for nodes without access

Today watch events are sent regardless of the access rights of the
node the event is sent for. This enables any guest to e.g. setup a
watch for "/" in order to have a detailed record of all Xenstore
modifications.

Modify that by sending only watch events for nodes that the watcher
has a chance to see otherwise (either via direct reads or by querying
the children of a node). This includes cases where the visibility of
a node for a watcher is changing (permissions being removed).

Permissions for nodes are looked up either in the old (pre
transaction/command) or current trees (post transaction).  If
permissions are changed multiple times in a transaction only the final
version is checked, because considering a transaction atomic the
individual permission changes would not be noticable to an outside
observer.

Two trees are only needed for set_perms: here we can either notice the
node disappearing (if we loose permission), appearing
(if we gain permission), or changing (if we preserve permission).

RM needs to only look at the old tree: in the new tree the node would be
gone, or could have different permissions if it was recreated (the
recreation would get its own watch fired).

Inside a tree we lookup the watch path's parent, and then the watch path
child itself.  This gets us 4 sets of permissions in worst case, and if
either of these allows a watch, then we permit it to fire.  The
permission lookups are done without logging the failures, otherwise we'd
get confusing errors about permission denied for some paths, but a watch
still firing. The actual result is logged in xenstored-access log:

  'w event ...' as usual if watch was fired
  'w notfired...' if the watch was not fired, together with path and
  permission set to help in troubleshooting

Adding a watch bypasses permission checks and always fires the watch
once immediately. This is consistent with the specification, and no
information is gained (the watch is fired both if the path exists or
doesn't, and both if you have or don't have access, i.e. it reflects the
path a domain gave it back to that domain).

There are some semantic changes here:

  * Write+rm in a single transaction of the same path is unobservable
    now via watches: both before and after a transaction the path
    doesn't exist, thus both tree lookups come up with the empty
    permission set, and noone, not even Dom0 can see this. This is
    consistent with transaction atomicity though.
  * Similar to above if we temporarily grant and then revoke permission
    on a path any watches fired inbetween are ignored as well
  * There is a new log event (w notfired) which shows the permission set
    of the path, and the path.
  * Watches on paths that a domain doesn't have access to are now not
    seen, which is the purpose of the security fix.

This is part of XSA-115.

Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

tools/ocaml/xenstored: introduce permissions for special watches

The special watches "@introduceDomain" and "@releaseDomain" should be
allowed for privileged callers only, as they allow to gain information
about presence of other guests on the host. So send watch events for
those watches via privileged connections only.

Start to address this by treating the special watches as regular nodes
in the tree, which gives them normal semantics for permissions. A later
change will restrict the handling, so that they can't be listed, etc.

This is part of XSA-115.

Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

tools/ocaml/xenstored: unify watch firing

This will make it easier insert additional checks in a follow-up patch.
All watches are now fired from a single function.

This is part of XSA-115.

Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

tools/ocaml/xenstored: check privilege for XS_IS_DOMAIN_INTRODUCED

The Xenstore command XS_IS_DOMAIN_INTRODUCED should be possible for privileged
domains only (the only user in the tree is the xenpaging daemon).

This is part of XSA-115.

Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

tools/ocaml/xenstored: ignore transaction id for [un]watch

Instead of ignoring the transaction id for XS_WATCH and XS_UNWATCH
commands as it is documented in docs/misc/xenstore.txt, it is tested
for validity today.

Really ignore the transaction id for XS_WATCH and XS_UNWATCH.

This is part of XSA-115.

Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

tools/xenstore: avoid watch events for nodes without access

Today watch events are sent regardless of the access rights of the
node the event is sent for. This enables any guest to e.g. setup a
watch for "/" in order to have a detailed record of all Xenstore
modifications.

Modify that by sending only watch events for nodes that the watcher
has a chance to see otherwise (either via direct reads or by querying
the children of a node). This includes cases where the visibility of
a node for a watcher is changing (permissions being removed).

This is part of XSA-115.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Paul Durrant <paul@xen.org>

tools/xenstore: allow special watches for privileged callers only

The special watches "@introduceDomain" and "@releaseDomain" should be
allowed for privileged callers only, as they allow to gain information
about presence of other guests on the host. So send watch events for
those watches via privileged connections only.

In order to allow for disaggregated setups where e.g. driver domains
need to make use of those special watches add support for calling
"set permissions" for those special nodes, too.

This is part of XSA-115.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Paul Durrant <paul@xen.org>

tools/xenstore: introduce node_perms structure

There are several places in xenstored using a permission array and the
size of that array. Introduce a new struct node_perms containing both.

This is part of XSA-115.

Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Paul Durrant <paul@xen.org>

tools/xenstore: fire watches only when removing a specific node

Instead of firing all watches for removing a subtree in one go, do so
only when the related node is being removed.

The watches for the top-most node being removed include all watches
including that node, while watches for nodes below that are only fired
if they are matching exactly. This avoids firing any watch more than
once when removing a subtree.

This is part of XSA-115.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Paul Durrant <paul@xen.org>

tools/xenstore: rework node removal

Today a Xenstore node is being removed by deleting it from the parent
first and then deleting itself and all its children. This results in
stale entries remaining in the data base in case e.g. a memory
allocation is failing during processing. This would result in the
rather strange behavior to be able to read a node (as its still in the
data base) while not being visible in the tree view of Xenstore.

Fix that by deleting the nodes from the leaf side instead of starting
at the root.

As fire_watches() is now called from _rm() the ctx parameter needs a
const attribute.

This is part of XSA-115.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Paul Durrant <paul@xen.org>

tools/xenstore: check privilege for XS_IS_DOMAIN_INTRODUCED

The Xenstore command XS_IS_DOMAIN_INTRODUCED should be possible for
privileged domains only (the only user in the tree is the xenpaging
daemon).

Instead of having the privilege test for each command introduce a
per-command flag for that purpose.

This is part of XSA-115.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Paul Durrant <paul@xen.org>

tools/xenstore: simplify and rename check_event_node()

There is no path which allows to call check_event_node() without a
event name. So don't let the result depend on the name being NULL and
add an assert() covering that case.

Rename the function to check_special_event() to better match the
semantics.

This is part of XSA-115.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Paul Durrant <paul@xen.org>

tools/xenstore: fix node accounting after failed node creation

When a node creation fails the number of nodes of the domain should be
the same as before the failed node creation. In case of failure when
trying to create a node requiring to create one or more intermediate
nodes as well (e.g. when /a/b/c/d is to be created, but /a/b isn't
existing yet) it might happen that the number of nodes of the creating
domain is not reset to the value it had before.

So move the quota accounting out of construct_node() and into the node
write loop in create_node() in order to be able to undo the accounting
in case of an error in the intermediate node destructor.

This is part of XSA-115.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Acked-by: Julien Grall <jgrall@amazon.com>

tools/xenstore: ignore transaction id for [un]watch

Instead of ignoring the transaction id for XS_WATCH and XS_UNWATCH
commands as it is documented in docs/misc/xenstore.txt, it is tested
for validity today.

Really ignore the transaction id for XS_WATCH and XS_UNWATCH.

This is part of XSA-115.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Paul Durrant <paul@xen.org>

tools/xenstore: allow removing child of a node exceeding quota

An unprivileged user of Xenstore is not allowed to write nodes with a
size exceeding a global quota, while privileged users like dom0 are
allowed to write such nodes. The size of a node is the needed space
to store all node specific data, this includes the names of all
children of the node.

When deleting a node its parent has to be modified by removing the
name of the to be deleted child from it.

This results in the strange situation that an unprivileged owner of a
node might not succeed in deleting that node in case its parent is
exceeding the quota of that unprivileged user (it might have been
written by dom0), as the user is not allowed to write the updated
parent node.

Fix that by not checking the quota when writing a node for the
purpose of removing a child's name only.

The same applies to transaction handling: a node being read during a
transaction is written to the transaction specific area and it should
not be tested for exceeding the quota, as it might not be owned by
the reader and presumably the original write would have failed if the
node is owned by the reader.

This is part of XSA-115.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Paul Durrant <paul@xen.org>

tools/ocaml/xenstored: do permission checks on xenstore root

This was lacking in a disappointing number of places.

The xenstore root node is treated differently from all other nodes, because it
doesn't have a parent, and mutation requires changing the parent.

Unfortunately this lead to open-coding the special case for root into every
single xenstore operation, and out of all the xenstore operations only read
did a permission check when handling the root node.

This means that an unprivileged guest can:

* xenstore-chmod / to its liking and subsequently write new arbitrary nodes
   there (subject to quota)
* xenstore-rm -r / deletes almost the entire xenstore tree (xenopsd quickly
   refills some, but you are left with a broken system)
* DIRECTORY on / lists all children when called through python
   bindings (xenstore-ls stops at /local because it tries to list recursively)
* get-perms on / works too, but that is just a minor information leak

Add the missing permission checks, but this should really be refactored to do
the root handling and permission checks on the node only once from a single
function, instead of getting it wrong nearly everywhere.

This is XSA-353.

Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

x86/vioapic: fix usage of index in place of GSI in vioapic_write_redirent

The usage of idx instead of the GSI in vioapic_write_redirent when
accessing gsi_assert_count can cause a PVH dom0 with multiple
vIO-APICs to lose interrupts in case a pin of a IO-APIC different than
the first one is unmasked with pending interrupts.

Switch to use gsi instead to fix the issue.

Fixes: 9f44b08f7d0e4 ('x86/vioapic: introduce support for multiple vIO APICS')
Reported-by: Manuel Bouyer <bouyer@antioche.eu.org>
Signed-off-by: Roger Pau Monné <roge.rpau@citrix.com>
Tested-by: Manuel Bouyer <bouyer@antioche.eu.org>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 3ae469af8e680df31eecd0a2ac6a83b58ad7ce53
master date: 2020-11-30 14:06:38 +0100

xen/events: rework fifo queue locking

Two cpus entering evtchn_fifo_set_pending() for the same event channel
can race in case the first one gets interrupted after setting
EVTCHN_FIFO_PENDING and when the other one manages to set
EVTCHN_FIFO_LINKED before the first one is testing that bit. This can
lead to evtchn_check_pollers() being called before the event is put
properly into the queue, resulting eventually in the guest not seeing
the event pending and thus blocking forever afterwards.

Note that commit 5f2df45ead7c1195 ("xen/evtchn: rework per event channel
lock") made the race just more obvious, while the fifo event channel
implementation had this race forever since the introduction and use of
per-channel locks, when an unmask operation was running in parallel with
an event channel send operation.

Using a spinlock for the per event channel lock had turned out
problematic due to some paths needing to take the lock are called with
interrupts off, so the lock would need to disable interrupts, which in
turn broke some use cases related to vm events.

For avoiding this race the queue locking in evtchn_fifo_set_pending()
needs to be reworked to cover the test of EVTCHN_FIFO_PENDING,
EVTCHN_FIFO_MASKED and EVTCHN_FIFO_LINKED, too. Additionally when an
event channel needs to change queues both queues need to be locked
initially, in order to avoid having a window with no lock held at all.

Reported-by: Jan Beulich <jbeulich@suse.com>
Fixes: 5f2df45ead7c1195 ("xen/evtchn: rework per event channel lock")
Fixes: de6acb78bf0e137c ("evtchn: use a per-event channel lock for sending events")
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 71ac522909e9302350a88bc378be99affa87067c
master date: 2020-11-30 14:05:39 +0100

x86/DMI: fix SMBIOS pointer range check

Forever since its introduction this has been using an inverted relation
operator.

Fixes: 54057a28f22b ("x86: support SMBIOS v3")
Signed-off-by: Jan Beulich <JBeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 6befe598706218673b14710d90d00ce90763b372
master date: 2020-11-24 11:25:29 +0100

xen/events: access last_priority and last_vcpu_id together

The queue for a fifo event is depending on the vcpu_id and the
priority of the event. When sending an event it might happen the
event needs to change queues and the old queue needs to be kept for
keeping the links between queue elements intact. For this purpose
the event channel contains last_priority and last_vcpu_id values
elements for being able to identify the old queue.

In order to avoid races always access last_priority and last_vcpu_id
with a single atomic operation avoiding any inconsistencies.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>
master commit: 1277cb9dc5e966f1faf665bcded02b7533e38078
master date: 2020-11-24 11:23:42 +0100

x86/vpt: fix build with old gcc

I believe it was the XSA-336 fix (42fcdd42328f "x86/vpt: fix race when
migrating timers between vCPUs") which has unmasked a bogus
uninitialized variable warning. This is observable with gcc 4.3.4, but
only on 4.13 and older; it's hidden on newer versions apparently due to
the addition to _read_unlock() done by 12509bbeb9e3 ("rwlocks: call
preempt_disable() when taking a rwlock").

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: f2c620aa062767b318267d678ae249dcb637b870
master date: 2020-11-18 12:38:01 +0100

xen/evtchn: revert 52e1fc47abc3a0123

With the event channel lock no longer disabling interrupts commit
52e1fc47abc3a0123 ("evtchn/Flask: pre-allocate node on send path") can
be reverted again.

Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: b5ad37f8e9284cc147218f7a5193d739ae7b956f
master date: 2020-11-10 14:37:15 +0100

xen/evtchn: rework per event channel lock

Currently the lock for a single event channel needs to be taken with
interrupts off, which causes deadlocks in some cases.

Rework the per event channel lock to be non-blocking for the case of
sending an event and removing the need for disabling interrupts for
taking the lock.

The lock is needed for avoiding races between event channel state
changes (creation, closing, binding) against normal operations (set
pending, [un]masking, priority changes).

Use a rwlock, but with some restrictions:

- Changing the state of an event channel (creation, closing, binding)
  needs to use write_lock(), with ASSERT()ing that the lock is taken as
  writer only when the state of the event channel is either before or
  after the locked region appropriate (either free or unbound).

- Sending an event needs to use read_trylock() mostly, in case of not
  obtaining the lock the operation is omitted. This is needed as
  sending an event can happen with interrupts off (at least in some
  cases).

- Dumping the event channel state for debug purposes is using
  read_trylock(), too, in order to avoid blocking in case the lock is
  taken as writer for a long time.

- All other cases can use read_lock().

Fixes: e045199c7c9c54 ("evtchn: address races with evtchn_reset()")
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Julien Grall <jgrall@amazon.com>
xen/events: fix build

Commit 5f2df45ead7c1195 ("xen/evtchn: rework per event channel lock")
introduced a build failure for NDEBUG builds.

Fixes: 5f2df45ead7c1195 ("xen/evtchn: rework per event channel lock")
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
master commit: 5f2df45ead7c1195142f68b7923047a1e9479d54
master date: 2020-11-10 14:36:15 +0100
master commit: 53bacb86f496fdb11560d9e3b361bca7de60d268
master date: 2020-11-11 08:56:21 +0100

memory: fix off-by-one in XSA-346 change

The comparison against ARRAY_SIZE() needs to be >= in order to avoid
overrunning the pages[] array.

This is XSA-355.

Fixes: 5777a3742d88 ("IOMMU: hold page ref until after deferred TLB flush")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>
master commit: 9b156bcc3ffcc7949edd4460b718a241e87ae302
master date: 2020-11-24 14:01:31 +0100

x86/msr: Disallow guest access to the RAPL MSRs

Researchers have demonstrated using the RAPL interface to perform a
differential power analysis attack to recover AES keys used by other cores in
the system.

Furthermore, even privileged guests cannot use this interface correctly, due
to MSR scope and vcpu scheduling issues. The interface would want to be
paravirtualised to be used sensibly.

Disallow access to the RAPL MSRs completely, as well as other MSRs which
potentially access fine grain power information.

This is part of XSA-351.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/msr: fix handling of MSR_IA32_PERF_{STATUS/CTL}

Currently a PV hardware domain can also be given control over the CPU
frequency, and such guest is allowed to write to MSR_IA32_PERF_CTL.
However since commit 322ec7c89f6 the default behavior has been changed
to reject accesses to not explicitly handled MSRs, preventing PV
guests that manage CPU frequency from reading
MSR_IA32_PERF_{STATUS/CTL}.

Additionally some HVM guests (Windows at least) will attempt to read
MSR_IA32_PERF_CTL and will panic if given back a #GP fault:

vmx.c:3035:d8v0 RDMSR 0x00000199 unimplemented
d8v0 VIRIDIAN CRASH: 3b c0000096 fffff806871c1651 ffffda0253683720 0

Move the handling of MSR_IA32_PERF_{STATUS/CTL} to the common MSR
handling shared between HVM and PV guests, and add an explicit case
for reads to MSR_IA32_PERF_{STATUS/CTL}.

Restore previous behavior and allow PV guests with the required
permissions to read the contents of the mentioned MSRs. Non privileged
guests will get 0 when trying to read those registers, as writes to
MSR_IA32_PERF_CTL by such guest will already be silently dropped.

Fixes: 322ec7c89f6 ('x86/pv: disallow access to unknown MSRs')
Fixes: 84e848fd7a1 ('x86/hvm: disallow access to unknown MSRs')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 3059178798a23ba870ff86ff54d442a07e6651fc)

xen/arm: Always trap AMU system registers

The Activity Monitors Unit (AMU) has been introduced by ARMv8.4. It is
considered to be unsafe to be expose to guests as they might expose
information about code executed by other guests or the host.

Arm provided a way to trap all the AMU system registers by setting
CPTR_EL2.TAM to 1.

Unfortunately, on older revision of the specification, the bit 30 (now
CPTR_EL1.TAM) was RES0. Because of that, Xen is setting it to 0 and
therefore the system registers would be exposed to the guest when it is
run on processors with AMU.

As the bit is mark as UNKNOWN at boot in Armv8.4, the only safe solution
for us is to always set CPTR_EL1.TAM to 1.

Guest trying to access the AMU system registers will now receive an
undefined instruction. Unfortunately, this means that even well-behaved
guest may fail to boot because we don't sanitize the ID registers.

This is a known issues with other Armv8.0+ features (e.g. SVE, Pointer
Auth). This will taken care separately.

This is part of XSA-351 (or XSA-93 re-born).

Signed-off-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com>
(cherry picked from commit 628e1becb6fb121475a6ce68e3f1cb4499851255)

tools/libs/stat: use memcpy instead of strncpy in getBridge

Use memcpy in getBridge to prevent gcc warnings about truncated
strings. We know that we might truncate it, so the gcc warning
here is wrong.
Revert previous change changing buffer sizes as bigger buffers
are not needed.

Signed-off-by: Bertrand Marquis <bertrand.marquis@arm.com>
Acked-by: Wei Liu <wl@xen.org>
(cherry picked from commit 40fe714ca4245a9716264fcb3ee8df42c3650cf6)

tool/libs/light: Fix libxenlight gcc warning

Fix gcc10 compilation warning about uninitialized variable by setting
it to 0.

Signed-off-by: Bertrand Marquis <bertrand.marquis@arm.com>
Acked-by: Wei Liu <wl@xen.org>
(cherry picked from commit 0241809bf838875615797f52af34222e5ab8e98f)

tools/libxc: report malloc errors in writev_exact

The caller of writev_exact should be notified about malloc errors
when dealing with partial writes.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Wei Liu <wl@xen.org>
(cherry picked from commit 0d8d289af7a679c028462c4ed5d98586f9ef9648)

tools/libs/stat: fix broken build

Making getBridge() static triggered a build error with some gcc versions:

error: 'strncpy' output may be truncated copying 15 bytes from a string of
length 255 [-Werror=stringop-truncation]

Fix that by using a buffer with 256 bytes instead.

Fixes: 6d0ec053907794 ("tools: split libxenstat into new tools/libs/stat directory")
Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Wei Liu <wl@xen.org>
(cherry picked from commit c8099e48c3dbb8c7399551a265756ecf354eece2)

tools/xenstore: Do not abort xenstore-ls if a node disappears while iterating

The do_ls() function has somewhat inconsistent handling of errors.

If reading the node's contents with xs_read() fails, then do_ls() will
just quietly not display the contents.

If reading the node's permissions with xs_get_permissions() fails, then
do_ls() will print a warning, continue, and ultimately won't exit with
an error code (unless another error happens).

If recursing into the node with xs_directory() fails, then do_ls() will
abort immediately, not printing any further nodes.

For persistent failure modes — such as ENOENT because a node has been
removed, or EACCES because it has had its permisions changed since the
xs_directory() on the parent directory returned its name — it's
obviously quite likely that if either of the first two errors occur for
a given node, then so will the third and thus xenstore-ls will abort.

The ENOENT one is actually a fairly common case, and has caused tools to
fail to clean up a network device because it *apparently* already
doesn't exist in xenstore.

There is a school of thought that says, "Well, xenstore-ls returned an
error. So the tools should not trust its output."

The natural corollary of this would surely be that the tools must re-run
xenstore-ls as many times as is necessary until its manages to exit
without hitting the race condition. I am not keen on that conclusion.

For the specific case of ENOENT it seems reasonable to declare that,
but for the timing, we might as well just not have seen that node at
all when calling xs_directory() for the parent. By ignoring the error,
we give acceptable output.

The issue can be reproduced as follows:

(dom0) # for a in `seq 1 1000` ; do
              xenstore-write /local/domain/2/foo/$a $a ;
         done

Now simultaneously:

(dom0) # for a in `seq 1 999` ; do
              xenstore-rm /local/domain/2/foo/$a ;
         done
(dom2) # while true ; do
              ./xenstore-ls -p /local/domain/2/foo | grep -c 1000 ;
         done

We should expect to see node 1000 in the output, every time.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Ian Jackson <ian.jackson@eu.citrix.com>
(cherry picked from commit beb105af19cc3e58e2ed49224cfe190a736e5fa0)

tools/xenpmd: Fix gcc10 snprintf warning

Add a check for snprintf return code and ignore the entry if we get an
error. This should in fact never happen and is more a trick to make gcc
happy and prevent compilation errors.

This is solving the following gcc warning when compiling for arm32 host
platforms with optimization activated:
xenpmd.c:92:37: error: '%s' directive output may be truncated writing
between 4 and 2147483645 bytes into a region of size 271
[-Werror=format-truncation=]

This is also solving the following Debian bug:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=970802

Signed-off-by: Bertrand Marquis <bertrand.marquis@arm.com>
Acked-by: Wei Liu <wl@xen.org>
(cherry picked from commit 0dfddb2116e3757f77a691a3fe335173088d69dc)

libxl: fix -Werror=stringop-truncation in libxl__prepare_sockaddr_un

In file included from /usr/include/string.h:495,
                 from libxl_internal.h:38,
                 from libxl_utils.c:20:
In function 'strncpy',
    inlined from 'libxl__prepare_sockaddr_un' at libxl_utils.c:1262:5:
/usr/include/bits/string_fortified.h:106:10: error: '__builtin_strncpy' specified bound 108 equals destination size [-Werror=stringop-truncation]
  106 |   return __builtin___strncpy_chk (__dest, __src, __len, __bos (__dest));
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
cc1: all warnings being treated as errors

Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Reviewed-by: Ian Jackson <ian.jackson@eu.citrix.com>
(cherry picked from commit fff1b7f50e75ad9535c86f3fcf425b4945c50a1c)

libxl: workaround gcc 10.2 maybe-uninitialized warning

It seems xlu_pci_parse_bdf has a state machine that is too complex for
gcc to understand. The build fails with:

    libxlu_pci.c: In function 'xlu_pci_parse_bdf':
    libxlu_pci.c:32:18: error: 'func' may be used uninitialized in this function [-Werror=maybe-uninitialized]
       32 |     pcidev->func = func;
          |     ~~~~~~~~~~~~~^~~~~~
    libxlu_pci.c:51:29: note: 'func' was declared here
       51 |     unsigned dom, bus, dev, func, vslot = 0;
          |                             ^~~~
    libxlu_pci.c:31:17: error: 'dev' may be used uninitialized in this function [-Werror=maybe-uninitialized]
       31 |     pcidev->dev = dev;
          |     ~~~~~~~~~~~~^~~~~
    libxlu_pci.c:51:24: note: 'dev' was declared here
       51 |     unsigned dom, bus, dev, func, vslot = 0;
          |                        ^~~
    libxlu_pci.c:30:17: error: 'bus' may be used uninitialized in this function [-Werror=maybe-uninitialized]
       30 |     pcidev->bus = bus;
          |     ~~~~~~~~~~~~^~~~~
    libxlu_pci.c:51:19: note: 'bus' was declared here
       51 |     unsigned dom, bus, dev, func, vslot = 0;
          |                   ^~~
    libxlu_pci.c:29:20: error: 'dom' may be used uninitialized in this function [-Werror=maybe-uninitialized]
       29 |     pcidev->domain = domain;
          |     ~~~~~~~~~~~~~~~^~~~~~~~
    libxlu_pci.c:51:14: note: 'dom' was declared here
       51 |     unsigned dom, bus, dev, func, vslot = 0;
          |              ^~~
    cc1: all warnings being treated as errors

Workaround it by setting the initial value to invalid value (0xffffffff)
and then assert on each value being set. This way we mute the gcc
warning, while still detecting bugs in the parse code.

Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Reviewed-by: Ian Jackson <ian.jackson@eu.citrix.com>
(cherry picked from commit d25cc3ec93ebda030349045d2c7fa14ffde07ed7)

SUPPORT: Add linux device model stubdom to Toolstack

Add qemu-xen linux device model stubdomain to the Toolstack section as a
Tech Preview.

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
master commit: 4ddd6499d999a7d08cabfda5b0262e473dd5beed
master date: 2020-10-23 17:14:08 +0100

arm,smmu: match start level of page table walk with P2M

Don't hardcode the lookup start level of the page table walk to 1
and instead match the one used in P2M. This should fix scenarios
involving SMMU where the start level is different than 1.
In order for the SMMU driver to also compile on arm32 move the
P2M_ROOT_LEVEL in the p2m header file (while at it, for
consistency also P2M_ROOT_ORDER) and use the macro in the smmu
driver.

Signed-off-by: Laurentiu Tudor <laurentiu.tudor@nxp.com>
Acked-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 9ae1197582798b394d696cff94c4d742319bdbbf)

xen/arm: sched: Ensure the vCPU context is seen before vcpu_pause() returns

Some callers of vcpu_pause() will expect to access the latest vcpu
context when the function returns (see XENDOMCTL_{set,get}vcpucontext}.

However, the latest vCPU context can only be observed after
v->is_running has been observed to be false.

As there is no memory barrier instruction generated, a processor could
try to speculatively access the vCPU context before it was observed.

To prevent the corruption of the vCPU context, we need to insert a
memory barrier instruction after v->is_running is observed and before
the context is accessed. This barrier is added in sync_vcpu_execstate()
as it seems to be the place where we expect the synchronization to
happen.

Signed-off-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit f6790389613cd54775ece6575013a679572b46b3)

xen/arm: bootfdt: Ignore empty memory bank

At the moment, Xen will stop processing the Device Tree if a memory
bank is empty (size == 0).

Unfortunately, some of the Device Tree (such as on Colibri imx8qxp)
may contain such a bank. This means Xen will not be able to boot
properly.

Relax the check to just ignore the banks. FWIW this also seems to be the
behavior adopted by Linux.

Reported-by: Daniel Wagner <Daniel.Wagner2@itk-engineering.de>
Signed-off-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 5a37207df52066efefe419c677b089a654d37afc)

xen/arm64: force gcc 10+ to always inline generic atomics helpers

Recent versions of gcc (at least 10.x) will not inline generic atomics
helpers by default. Instead they will expect the software to either link
with libatomic.so or implement the helpers, which would result in

undefined reference to `__aarch64_ldadd4_acq_rel'

for us (not having any local implementation).

To keep the previous behavior, force gcc to always inline the generic
atomics helpers.

Long term we probably want to avoid relying on gcc atomics helpers as
this doesn't allow us to switch between LSE and LL/SC atomics.

Suggested-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>
(cherry picked from commit 5d45ecabe3c0b2097df623ab7b471f8915cfdde6)

x86emul: fix PINSRW and adjust other {,V}PINSR*

The use of simd_packed_int together with no further update to op_bytes
has lead to wrong signaling of #GP(0) for PINSRW without a 16-byte
aligned memory operand. Use simd_none instead and override it after
general decoding with simd_other, like is done for the B/D/Q siblings.

While benign, for consistency also use DstImplicit instead of DstReg
in x86_decode_twobyte().

PINSR{B,D,Q} also had a stray (redundant) get_fpu() invocation, which
gets dropped.

For further consistency also
- use src.bytes instead of op_bytes in relevant memcpy() invocations,
- avoid the pointless updating of op_bytes (all we care about later is
that the value be less than 16).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 06f0598b41f23c9e4cf7d8c5a05b282de92f3a35
master date: 2020-10-23 18:03:18 +0200

pci: cleanup MSI interrupts before removing device from IOMMU

Doing the MSI cleanup after removing the device from the IOMMU leads
to the following panic on AMD hardware:

Assertion 'table.ptr && (index < intremap_table_entries(table.ptr, iommu))' failed at iommu_intr.c:172
----[ Xen-4.13.1-10.0.3-d  x86_64  debug=y   Not tainted ]----
CPU:    3
RIP:    e008:[<ffff82d08026ae3c>] drivers/passthrough/amd/iommu_intr.c#get_intremap_entry+0x52/0x7b
[...]
Xen call trace:
   [<ffff82d08026ae3c>] R drivers/passthrough/amd/iommu_intr.c#get_intremap_entry+0x52/0x7b
   [<ffff82d08026af25>] F drivers/passthrough/amd/iommu_intr.c#update_intremap_entry_from_msi_msg+0xc0/0x342
   [<ffff82d08026ba65>] F amd_iommu_msi_msg_update_ire+0x98/0x129
   [<ffff82d08025dd36>] F iommu_update_ire_from_msi+0x1e/0x21
   [<ffff82d080286862>] F msi_free_irq+0x55/0x1a0
   [<ffff82d080286f25>] F pci_cleanup_msi+0x8c/0xb0
   [<ffff82d08025cf52>] F pci_remove_device+0x1af/0x2da
   [<ffff82d0802a42d1>] F do_physdev_op+0xd18/0x1187
   [<ffff82d080383925>] F pv_hypercall+0x1f5/0x567
   [<ffff82d08038a432>] F lstar_enter+0x112/0x120

That's because the call to iommu_remove_device on AMD hardware will
remove the per-device interrupt remapping table, and hence the call to
pci_cleanup_msi done afterwards will find a null intremap table and
crash.

Reorder the calls so that MSI interrupts are torn down before removing
the device from the IOMMU.

Fixes: d7cfeb7c13ed ("AMD/IOMMU: don't blindly allocate interrupt remapping tables")
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 710f62cc826bb8c7ead99f9d6b6b269e39ff3e98
master date: 2020-10-23 10:13:14 +0200

build: use if_changed more consistently (and correctly) for prelink*.o

Switch to $(call if_changed,ld) where possible; presumably not doing so
in e321576f4047 ("xen/build: start using if_changed") right away was an
oversight, as it did for Arm in (just) one case. It failed to add
prelink.o to $(targets), though, causing - judging from the observed
behavior on x86 - undue rebuilds of the final binary (because of
prelink.o getting rebuild for $(cmd_prelink.o) being empty, in turn
because of .prelink.o.cmd not getting read) during "make install-xen".

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Julien Grall <jgrall@amazon.com>
master commit: dd2cfba88c3d0e144ffec07c6b5b86e54a9d98a9
master date: 2020-09-22 10:19:38 +0200

SUPPORT.md: Desupport qemu trad except stub dm

While investigating XSA-335 we discovered that many upstream security
fixes were missing. It is not practical to backport them. There is
no good reason to be running this very ancient version of qemu, except
that it is the only way to run a stub dm which is currently supported
by upstream.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
master commit: 8587160b3e2951b722d395a0346bb17c3c22152f
master date: 2020-11-04 09:22:37 +0100

x86/pv: Flush TLB in response to paging structure changes

With MMU_UPDATE, a PV guest can make changes to higher level pagetables.  This
is safe from Xen's point of view (as the update only affects guest mappings),
and the guest is required to flush (if necessary) after making updates.

However, Xen's use of linear pagetables (UPDATE_VA_MAPPING, GNTTABOP_map,
writeable pagetables, etc.) is an implementation detail outside of the
API/ABI.

Changes in the paging structure require invalidations in the linear pagetable
range for subsequent accesses into the linear pagetables to access non-stale
mappings.  Xen must provide suitable flushing to prevent intermixed guest
actions from accidentally accessing/modifying the wrong pagetable.

For all L2 and higher modifications, flush the TLB.  PV guests cannot create
L2 or higher entries with the Global bit set, so no mappings established in
the linear range can be global.  (This could in principle be an order 39 flush
starting at LINEAR_PT_VIRT_START, but no such mechanism exists in practice.)

Express the necessary flushes as a set of booleans which accumulate across the
operation.  Comment the flushing logic extensively.

This is XSA-286.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 16a20963b3209788f2c0d3a3eebb7d92f03f5883)

x86/pv: Drop FLUSH_TLB_GLOBAL in do_mmu_update() for XPTI

c/s 9d1d31ad9498 "x86: slightly reduce Meltdown band-aid overhead" removed the
use of Global TLB flushes on the Xen entry path, but added a FLUSH_TLB_GLOBAL
to the L4 path in do_mmu_update().

However, this was unnecessary.

It is the guests responsibility to perform appropriate TLB flushing if the L4
modification altered an established mapping in a flush-relevant way.  In this
case, an MMUEXT_OP hypercall will follow.  The case which Xen needs to cover
is when new mappings are created, and the resync on the exit-to-guest path
covers this correctly.

There is a corner case with multiple vCPUs in hypercalls at the same time,
which 9d1d31ad9498 changed, and this patch changes back to its original XPTI
behaviour.

Architecturally, established TLB entries can continue to be used until the
broadcast flush has completed.  Therefore, even with concurrent hypercalls,
the guest cannot depend on older mappings not being used until an MMUEXT_OP
hypercall completes.  Xen's implementation of guest-initiated flushes will
take correct effect on top of an in-progress hypercall, picking up new mapping
setting before the other vCPU's MMUEXT_OP completes.

Note: The correctness of this change is not impacted by whether XPTI uses
global mappings or not.  Correctness there depends on the behaviour of Xen on
the entry/exit paths when switching two/from the XPTI "shadow" pagetables.

This is (not really) XSA-286 (but necessary to simplify the logic).

Fixes: 9d1d31ad9498 ("x86: slightly reduce Meltdown band-aid overhead")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 055e1c3a3d95b1e753148369fbc4ba48782dd602)

hvmloader: flip "ACPI data" to "ACPI NVS" type for ACPI table region

ACPI specification contains statements describing memory marked with regular
"ACPI data" type as reclaimable by the guest. Although the guest shouldn't
really do it if it wants kexec or similar functionality to work, there
could still be ambiguities in treating these regions as potentially regular
RAM.

One such example is SeaBIOS which currently reports "ACPI data" regions as
RAM to the guest in its e801 call. Which it might have the right to do as any
user of this is expected to be ACPI unaware. But a QEMU bootloader later seems
to ignore that fact and is instead using e801 to find a place for initrd which
causes the tables to be erased. While arguably QEMU bootloader or SeaBIOS need
to be fixed / improved here, that is just one example of the potential problems
from using a reclaimable memory type.

Flip the type to "ACPI NVS" which doesn't have this ambiguity in it and is
described by the spec as non-reclaimable (so cannot ever be treated like RAM).

Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: de6d188a519f9e3b7a1acc7784adf4c243865f9a
master date: 2020-10-20 08:54:23 +0200

x86/smpboot: Don't unconditionally call memguard_guard_stack() in cpu_smpboot_alloc()

cpu_smpboot_alloc() is designed to be idempotent with respect to partially
initialised state. This occurs for S3 and CPU parking, where enough state to
handle NMIs/#MCs needs to remain valid for the entire lifetime of Xen, even
when we otherwise want to offline the CPU.

For simplicity between various configuration, Xen always uses shadow stack
mappings (Read-only + Dirty) for the guard page, irrespective of whether
CET-SS is enabled.

Unfortunately, the CET-SS changes in memguard_guard_stack() broke idempotency
by first writing out the supervisor shadow stack tokens with plain writes,
then changing the mapping to being read-only.

This ordering is strictly necessary to configure the BSP, which cannot have
the supervisor tokens be written with WRSS.

Instead of calling memguard_guard_stack() unconditionally, call it only when
actually allocating a new stack. Xenheap allocates are guaranteed to be
writeable, and the net result is idempotency WRT configuring stack_base[].

Fixes: 91d26ed304f ("x86/shstk: Create shadow stacks")
Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: a7952a320c1e202a218702bfdb14f75132f04894
master date: 2020-10-16 11:56:59 +0100

x86/traps: 'Fix' safety of read_registers() in #DF path

All interrupts and exceptions pass a struct cpu_user_regs up into C.  This
contains the legacy vm86 fields from 32bit days, which are beyond the
hardware-pushed frame.

Accessing these fields is generally illegal, as they are logically out of
bounds for anything other than an interrupt/exception hitting ring1/3 code.

show_registers() unconditionally reads these fields, but the content is
discarded before use.  This is benign right now, as all parts of the stack are
readable, including the guard pages.

However, read_registers() in the #DF handler writes to these fields as part of
preparing the state dump, and being IST, hits the adjacent stack frame.

This has been broken forever, but c/s 6001660473 "x86/shstk: Rework the stack
layout to support shadow stacks" repositioned the #DF stack to be adjacent to
the guard page, which turns this OoB write into a fatal pagefault:

  (XEN) *** DOUBLE FAULT ***
  (XEN) ----[ Xen-4.15-unstable  x86_64  debug=y   Tainted:  C   ]----
  (XEN) ----[ Xen-4.15-unstable  x86_64  debug=y   Tainted:  C   ]----
  (XEN) CPU:    4
  (XEN) RIP:    e008:[<ffff82d04031fd4f>] traps.c#read_registers+0x29/0xc1
  (XEN) RFLAGS: 0000000000050086   CONTEXT: hypervisor (d1v0)
  ...
  (XEN) Xen call trace:
  (XEN)    [<ffff82d04031fd4f>] R traps.c#read_registers+0x29/0xc1
  (XEN)    [<ffff82d0403207b3>] F do_double_fault+0x3d/0x7e
  (XEN)    [<ffff82d04039acd7>] F double_fault+0x107/0x110
  (XEN)
  (XEN) Pagetable walk from ffff830236f6d008:
  (XEN)  L4[0x106] = 80000000bfa9b063 ffffffffffffffff
  (XEN)  L3[0x008] = 0000000236ffd063 ffffffffffffffff
  (XEN)  L2[0x1b7] = 0000000236ffc063 ffffffffffffffff
  (XEN)  L1[0x16d] = 8000000236f6d161 ffffffffffffffff
  (XEN)
  (XEN) ****************************************
  (XEN) Panic on CPU 4:
  (XEN) FATAL PAGE FAULT
  (XEN) [error_code=0003]
  (XEN) Faulting linear address: ffff830236f6d008
  (XEN) ****************************************
  (XEN)

and rendering the main #DF analysis broken.

The proper fix is to delete cpu_user_regs.es and later, so no
interrupt/exception path can access OoB, but this needs disentangling from the
PV ABI first.

Not-really-fixes: 6001660473 ("x86/shstk: Rework the stack layout to support shadow stacks")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 6065a05adf152a556fb9f11a5218c89e41b62893
master date: 2020-10-16 11:55:33 +0100

x86/mwait-idle: customize IceLake server support

On ICX platform, the C1E auto-promotion is enabled by default.
As a result, the CPU might fall into C1E more offen than previous
platforms. So disable C1E auto-promotion and expose C1E as a separate
idle state.

Beside C1 and C1E, the exit latency of C6 was measured
by a dedicated tool. However the exit latency(41us) exposed
by _CST is much smaller than the one we measured(128us). This
is probably due to the _CST uses the exit latency when woken
up from PC0+C6, rather than PC6+C6 when C6 was measured. Choose
the latter as we need the longest latency in theory.

Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
[Linux commit a472ad2bcea479ba068880125d7273fc95c14b70]
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 44ac57af81ff8097e228895738b911ca819bda19
master date: 2020-10-15 12:29:11 +0200

x86: fix resource leaks on arch_vcpu_create() error path

{hvm,pv}_vcpu_initialise() have always kind of been meant to be the
final possible source of errors in arch_vcpu_create(), hence not
requiring any unrolling of what they've done on the error path. (Of
course this may change once the various involved paths all have become
idempotent.)

But even beyond this aspect I think it is more logical to do policy
initialization ahead of the calling of these two functions, as they may
in principle want to access it.

Fixes: 4187f79dc718 ("x86/msr: introduce struct msr_vcpu_policy")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 6a34e67c118408ebdd62bfa7be76598ca040f170
master date: 2020-10-14 14:03:38 +0200

x86/vLAPIC: don't leak regs page from vlapic_init() upon error

Fixes: 8a981e0bf25e ("Make map_domain_page_global fail")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 8a62dee9ceff3056c7e0bd9632bac39bee2a51b3
master date: 2020-10-09 17:20:11 +0100

x86/S3: Restore CR4 earlier during resume

c/s 4304ff420e5 "x86/S3: Drop {save,restore}_rest_processor_state()
completely" moved CR4 restoration up into C, to account for the fact that MCE
was explicitly handled later.

However, time_resume() ends up making an EFI Runtime Service call, and EFI
explodes without OSFXSR, presumably when trying to spill %xmm registers onto
the stack.

Given this codepath, and the potential for other issues of a similar kind (TLB
flushing vs INVPCID, HVM logic vs VMXE, etc), restore CR4 in asm before
entering C.

Ignore the previous MCE special case, because its not actually necessary. The
handler is already suitably configured from before suspend.

Fixes: 4304ff420e5 ("x86/S3: Drop {save,restore}_rest_processor_state() completely")
Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
master commit: 7f66c0dc41ae5f770c614e516810eb1f336e2470
master date: 2020-10-06 12:28:37 +0100

xen/domain: check IOMMU options doesn't contain unknown bits set

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 59b27f360e3d9dc0378c1288e67a91fa41a77158
master date: 2020-10-02 08:38:50 +0200

evtchn/fifo: use stable fields when recording "last queue" information

Both evtchn->priority and evtchn->notify_vcpu_id could change behind the
back of evtchn_fifo_set_pending(), as for it - in the case of
interdomain channels - only the remote side's per-channel lock is held.
Neither the queue's priority nor the vCPU's vcpu_id fields have similar
properties, so they seem better suited for the purpose. In particular
they reflect the respective evtchn fields' values at the time they were
used to determine queue and vCPU.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Paul Durrant <paul@xen.org>
master commit: 6f6f07b64cbe90e54f8e62b4d6f2404cf5306536
master date: 2020-10-02 08:37:35 +0200

x86/S3: fix shadow stack resume path

Fix the resume path to load the shadow stack pointer from saved_ssp (not
saved_rsp), to match what suspend path does.

Fixes: 633ecc4a7cb2 ("x86/S3: Save and restore Shadow Stack configuration")
Backport: 4.14
Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 4bdbf746ac9152e70f264f87db4472707da805ce
master date: 2020-09-28 10:43:10 +0200

x86/pv: Don't deliver #GP for a SYSENTER with NT set

It is a matter of guest kernel policy what to do with offending userspace, and
terminating said userspace may not be the action chosen.

Linux explicitly tolerates this case.

Reported-by: Andy Lutomirski <luto@kernel.org>
Fixes: fdac951560 ("x86: clear EFLAGS.NT in SYSENTER entry path")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 61d4a04349895edc5a5868274b906ba61ef24f47
master date: 2020-09-24 21:02:35 +0100

x86/pv: Don't clobber NT on return-to-guest

A 64bit IRET can restore NT - the faulting case is when NT is set in the live
flags. This change had an unintended consequence of causing the NT flag to
spontaneously disappear from guest context whenever a interrupt/exception
occurred.

In combination with a SYSENTER which sets both TF and NT, Xen's handling of
the #DB exceptions clears NT before it is even recorded suitably in the guest
kernel's view of what userspace was doing.

Reported-by: Andy Lutomirski <luto@kernel.org>
Fixes: 0e47f92b0 ("x86: force EFLAGS.IF on when exiting to PV guests")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 5bcac985498ed83d89666959175ca9c9ed561ae1
master date: 2020-09-24 21:02:35 +0100

AMD/IOMMU: ensure suitable ordering of DTE modifications

DMA and interrupt translation should be enabled only after other
applicable DTE fields have been written. Similarly when disabling
translation or when moving a device between domains, translation should
first be disabled, before other entry fields get modified. Note however
that the "moving" aspect doesn't apply to the interrupt remapping side,
as domain specifics are maintained in the IRTEs here, not the DTE. We
also never disable interrupt remapping once it got enabled for a device
(the respective argument passed is always the immutable iommu_intremap).

This is part of XSA-347.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
master commit: 0514a3a25fb9ebff5d75cc8f00a9229385300858
master date: 2020-10-20 14:23:12 +0200

AMD/IOMMU: update live PTEs atomically

Updating a live PTE bitfield by bitfield risks the compiler re-ordering
the individual updates as well as splitting individual updates into
multiple memory writes. Construct the new entry fully in a local
variable, do the check to determine the flushing needs on the thus
established new entry, and then write the new entry by a single insn.

Similarly using memset() to clear a PTE is unsafe, as the order of
writes the function does is, at least in principle, undefined.

This is part of XSA-347.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
master commit: 3b055121c5410e2c3105d6d06aa24ca0d58868cd
master date: 2020-10-20 14:22:52 +0200

AMD/IOMMU: convert amd_iommu_pte from struct to union

This is to add a "raw" counterpart to the bitfield equivalent. Take the
opportunity and
- convert fields to bool / unsigned int,
- drop the naming of the reserved field,
- shorten the names of the ignored ones.

This is part of XSA-347.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Paul Durrant <paul@xen.org>
master commit: 73f62c7380edf07469581a3049aba98abd63b275
master date: 2020-10-20 14:22:26 +0200

IOMMU: hold page ref until after deferred TLB flush

When moving around a page via XENMAPSPACE_gmfn_range, deferring the TLB
flush for the "from" GFN range requires that the page remains allocated
to the guest until the TLB flush has actually occurred. Otherwise a
parallel hypercall to remove the page would only flush the TLB for the
GFN it has been moved to, but not the one is was mapped at originally.

This is part of XSA-346.

Fixes: cf95b2a9fd5a ("iommu: Introduce per cpu flag (iommu_dont_flush_iotlb) to avoid unnecessary iotlb... ")
Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Julien Grall <jgrall@amazon.com>
master commit: 5777a3742d88ff1c0ebc626ceb4fd47f9b3dc6d5
master date: 2020-10-20 14:21:32 +0200

IOMMU: suppress "iommu_dont_flush_iotlb" when about to free a page

Deferring flushes to a single, wide range one - as is done when
handling XENMAPSPACE_gmfn_range - is okay only as long as
pages don't get freed ahead of the eventual flush. While the only
function setting the flag (xenmem_add_to_physmap()) suggests by its name
that it's only mapping new entries, in reality the way
xenmem_add_to_physmap_one() works means an unmap would happen not only
for the page being moved (but not freed) but, if the destination GFN is
populated, also for the page being displaced from that GFN. Collapsing
the two flushes for this GFN into just one (end even more so deferring
it to a batched invocation) is not correct.

This is part of XSA-346.

Fixes: cf95b2a9fd5a ("iommu: Introduce per cpu flag (iommu_dont_flush_iotlb) to avoid unnecessary iotlb... ")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Acked-by: Julien Grall <jgrall@amazon.com>
master commit: dea460d86957bf1425a8a1572626099ac3f165a8
master date: 2020-10-20 14:21:09 +0200

x86/mm: Prevent some races in hypervisor mapping updates

map_pages_to_xen will attempt to coalesce mappings into 2MiB and 1GiB
superpages if possible, to maximize TLB efficiency.  This means both
replacing superpage entries with smaller entries, and replacing
smaller entries with superpages.

Unfortunately, while some potential races are handled correctly,
others are not.  These include:

1. When one processor modifies a sub-superpage mapping while another
processor replaces the entire range with a superpage.

Take the following example:

Suppose L3[N] points to L2.  And suppose we have two processors, A and
B.

* A walks the pagetables, get a pointer to L2.
* B replaces L3[N] with a 1GiB mapping.
* B Frees L2
* A writes L2[M] #

This is race exacerbated by the fact that virt_to_xen_l[21]e doesn't
handle higher-level superpages properly: If you call virt_xen_to_l2e
on a virtual address within an L3 superpage, you'll either hit a BUG()
(most likely), or get a pointer into the middle of a data page; same
with virt_xen_to_l1 on a virtual address within either an L3 or L2
superpage.

So take the following example:

* A reads pl3e and discovers it to point to an L2.
* B replaces L3[N] with a 1GiB mapping
* A calls virt_to_xen_l2e() and hits the BUG_ON() #

2. When two processors simultaneously try to replace a sub-superpage
mapping with a superpage mapping.

Take the following example:

Suppose L3[N] points to L2.  And suppose we have two processors, A and B,
both trying to replace L3[N] with a superpage.

* A walks the pagetables, get a pointer to pl3e, and takes a copy ol3e pointing to L2.
* B walks the pagetables, gets a pointre to pl3e, and takes a copy ol3e pointing to L2.
* A writes the new value into L3[N]
* B writes the new value into L3[N]
* A recursively frees all the L1's under L2, then frees L2
* B recursively double-frees all the L1's under L2, then double-frees L2 #

Fix this by grabbing a lock for the entirety of the mapping update
operation.

Rather than grabbing map_pgdir_lock for the entire operation, however,
repurpose the PGT_locked bit from L3's page->type_info as a lock.
This means that rather than locking the entire address space, we
"only" lock a single 512GiB chunk of hypervisor address space at a
time.

There was a proposal for a lock-and-reverify approach, where we walk
the pagetables to the point where we decide what to do; then grab the
map_pgdir_lock, re-verify the information we collected without the
lock, and finally make the change (starting over again if anything had
changed).  Without being able to guarantee that the L2 table wasn't
freed, however, that means every read would need to be considered
potentially unsafe.  Thinking carefully about that is probably
something that wants to be done on public, not under time pressure.

This is part of XSA-345.

Reported-by: Hongyan Xia <hongyxia@amazon.com>
Signed-off-by: Hongyan Xia <hongyxia@amazon.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 1ce75e99d75907aaffae05fcf658a833802bce49
master date: 2020-10-20 14:20:19 +0200

x86/mm: Refactor modify_xen_mappings to have one exit path

We will soon need to perform clean-ups before returning.

No functional change.

This is part of XSA-345.

Reported-by: Hongyan Xia <hongyxia@amazon.com>
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: Hongyan Xia <hongyxia@amazon.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: b733f8a8b8db83f2d438cab3adb38b387cecfce0
master date: 2020-10-20 14:19:55 +0200

x86/mm: Refactor map_pages_to_xen to have only a single exit path

We will soon need to perform clean-ups before returning.

No functional change.

This is part of XSA-345.

Reported-by: Hongyan Xia <hongyxia@amazon.com>
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: Hongyan Xia <hongyxia@amazon.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 08e6c6f80b018878476adc2c4e5679d2ce5cb4b1
master date: 2020-10-20 14:19:31 +0200

evtchn/Flask: pre-allocate node on send path

xmalloc() & Co may not be called with IRQs off, or else check_lock()
will have its assertion trigger about locks getting acquired
inconsistently. Re-arranging the locking in evtchn_send() doesn't seem
very reasonable, especially since the per-channel lock was introduced to
avoid acquiring the per-domain event lock on the send paths. Issue a
second call to xsm_evtchn_send() instead, before acquiring the lock, to
give XSM / Flask a chance to pre-allocate whatever it may need.

As these nodes are used merely for caching earlier decisions' results,
allocate just one node in AVC code despite two potentially being needed.
Things will merely be not as performant if a second allocation was
wanted, just like when the pre-allocation fails.

Fixes: c0ddc8634845 ("evtchn: convert per-channel lock to be IRQ-safe")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Jason Andryuk <jandryuk@gmail.com>
Acked-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Jason Andryuk <jandryuk@gmail.com>
master commit: 52e1fc47abc3a0123d2b5bb7e9172e84fd571851
master date: 2020-10-02 08:36:21 +0200

x86/HVM: more consistently set I/O completion

Doing this just in hvm_emulate_one_insn() is not enough.
hvm_ud_intercept() and hvm_emulate_one_vm_event() can get invoked for
insns requiring one or more continuations, and at least in principle
hvm_emulate_one_mmio() could, too. Without proper setting of the field,
handle_hvm_io_completion() will do nothing completion-wise, and in
particular the missing re-invocation of the insn emulation paths will
lead to emulation caching not getting disabled in due course, causing
the ASSERT() in {svm,vmx}_vmenter_helper() to trigger.

Reported-by: Don Slutz <don.slutz@gmail.com>
Similar considerations go for the clearing of vio->mmio_access, which
gets moved as well.

Additionally all updating of vio->mmio_* now gets done dependent upon
the new completion value, rather than hvm_ioreq_needs_completion()'s
return value. This is because it is the completion chosen which controls
what path will be taken when handling the completion, not the simple
boolean return value. In particular, PIO completion doesn't involve
going through the insn emulator, and hence emulator state ought to get
cleared early (or it won't get cleared at all).

The new logic, besides allowing for a caller override for the
continuation type to be set (for VMX real mode emulation), will also
avoid setting an MMIO completion when a simpler PIO one will do. This
is a minor optimization only as a side effect - the behavior is strictly
needed at least for hvm_ud_intercept(), as only memory accesses can
successfully complete through handle_mmio(). Care of course needs to be
taken to correctly deal with "mixed" insns (doing both MMIO and PIO at
the same time, i.e. INS/OUTS). For this, hvmemul_validate() now latches
whether the insn being emulated is a memory access, as this information
is no longer easily available at the point where we want to consume it.

Note that the presence of non-NULL .validate fields in the two ops
structures in hvm_emulate_one_mmio() was really necessary even before
the changes here: Without this, passing non-NULL as middle argument to
hvm_emulate_init_once() is meaningless.

The restrictions on when the #UD intercept gets actually enabled are why
it was decided that this is not a security issue:
- the "hvm_fep" option to enable its use is a debugging option only,
- for the cross-vendor case is considered experimental, even if
unfortunately SUPPORT.md doesn't have an explicit statement about
this.
The other two affected functions are
- hvm_emulate_one_vm_event(), used for introspection,
- hvm_emulate_one_mmio(), used for Dom0 only,
which aren't qualifying this as needing an XSA either.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Don Slutz <don.slutz@gmail.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: b807cfe954b8d0d8852398b4c8a586d95d69a342
master date: 2020-09-15 10:19:33 +0200

xen/hypfs: fix writing of custom parameter

Today the maximum allowed data length for writing a hypfs node is
tested in the generic hypfs_write() function. For custom runtime
parameters this might be wrong, as the maximum allowed size is derived
from the buffer holding the current setting, while there might be ways
to set the parameter needing more characters than the minimal
representation of that value.

One example for this is the "ept" parameter. Its value buffer is sized
to be able to hold the string "exec-sp=0" or "exec-sp=1", while it is
allowed to use e.g. "no-exec-sp" or "exec-sp=yes" for setting it.

Fix that by moving the length check one level down to the type
specific write function.

In order to avoid allocation of arbitrary sized buffers use a new
MAX_PARAM_SIZE macro as an upper limit for custom writes. The value
of MAX_PARAM_SIZE is the same as the limit in parse_params() for a
single parameter.

Fixes: a659d7cab9af ("xen: add runtime parameter access support to hypfs")
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: b4e41b1750d550bf2b1ccf97ee46f4f682bdbb62
master date: 2020-09-11 14:20:10 +0200

hvmloader: indicate ACPI tables with "ACPI data" type in e820

Guest kernel does need to know in some cases where the tables are located
to treat these regions properly. One example is kexec process where
the first kernel needs to pass ACPI region locations to the second
kernel which is now a requirement in Linux after 02a3e3cdb7f12 ("x86/boot:
Parse SRAT table and count immovable memory regions") in order for kexec
transition to actually work.

That commit introduced accesses to XSDT and SRAT while the second kernel
is still using kexec transition tables. The transition tables do not have
e820 "reserved" regions mapped where those tables are located currently
in a Xen guest. Instead "ACPI data" regions are mapped with the transition
tables that was introduced by the following commit 6bbeb276b7 ("x86/kexec:
Add the EFI system tables and ACPI tables to the ident map").

Reserve 1MB (out of 16MB currently available) right after ACPI info page for
ACPI tables exclusively but populate this region on demand and only indicate
populated memory as "ACPI data" since according to ACPI spec that memory is
reclaimable by the guest if necessary. That is close to how we treat
the same ACPI data in PVH guests. 1MB should be enough for now but could be
later extended if required.

Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 8efa46516c5f4cf185c8df179812c185d3c27eb6
master date: 2020-09-09 17:56:13 +0200

evtchn: arrange for preemption in evtchn_reset()

Like for evtchn_destroy() looping over all possible event channels to
close them can take a significant amount of time. Unlike done there, we
can't alter domain properties (i.e. d->valid_evtchns) here. Borrow, in a
lightweight form, the paging domctl continuation concept, redirecting
the continuations to different sub-ops. Just like there this is to be
able to allow for predictable overall results of the involved sub-ops:
Racing requests should either complete or be refused.

Note that a domain can't interfere with an already started (by a remote
domain) reset, due to being paused. It can prevent a remote reset from
happening by leaving a reset unfinished, but that's only going to affect
itself.

This is part of XSA-344.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>

evtchn: arrange for preemption in evtchn_destroy()

Especially closing of fully established interdomain channels can take
quite some time, due to the locking involved. Therefore we shouldn't
assume we can clean up still active ports all in one go. Besides adding
the necessary preemption check, also avoid pointlessly starting from
(or now really ending at) 0; 1 is the lowest numbered port which may
need closing.

Since we're now reducing ->valid_evtchns, free_xen_event_channel(),
and (at least to be on the safe side) notify_via_xen_event_channel()
need to cope with attempts to close / unbind from / send through already
closed (and no longer valid, as per port_is_valid()) ports.

This is part of XSA-344.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>

evtchn: address races with evtchn_reset()

Neither d->evtchn_port_ops nor max_evtchns(d) may be used in an entirely
lock-less manner, as both may change by a racing evtchn_reset(). In the
common case, at least one of the domain's event lock or the per-channel
lock needs to be held. In the specific case of the inter-domain sending
by evtchn_send() and notify_via_xen_event_channel() holding the other
side's per-channel lock is sufficient, as the channel can't change state
without both per-channel locks held. Without such a channel changing
state, evtchn_reset() can't complete successfully.

Lock-free accesses continue to be permitted for the shim (calling some
otherwise internal event channel functions), as this happens while the
domain is in effectively single-threaded mode. Special care also needs
taking for the shim's marking of in-use ports as ECS_RESERVED (allowing
use of such ports in the shim case is okay because switching into and
hence also out of FIFO mode is impossihble there).

As a side effect, certain operations on Xen bound event channels which
were mistakenly permitted so far (e.g. unmask or poll) will be refused
now.

This is part of XSA-343.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Julien Grall <jgrall@amazon.com>

evtchn: convert per-channel lock to be IRQ-safe

... in order for send_guest_{global,vcpu}_virq() to be able to make use
of it.

This is part of XSA-343.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Julien Grall <jgrall@amazon.com>

evtchn: evtchn_reset() shouldn't succeed with still-open ports

While the function closes all ports, it does so without holding any
lock, and hence racing requests may be issued causing new ports to get
opened. This would have been problematic in particular if such a newly
opened port had a port number above the new implementation limit (i.e.
when switching from FIFO to 2-level) after the reset, as prior to
"evtchn: relax port_is_valid()" this could have led to e.g.
evtchn_close()'s "BUG_ON(!port_is_valid(d2, port2))" to trigger.

Introduce a counter of active ports and check that it's (still) no
larger then the number of Xen internally used ones after obtaining the
necessary lock in evtchn_reset().

As to the access model of the new {active,xen}_evtchns fields - while
all writes get done using write_atomic(), reads ought to use
read_atomic() only when outside of a suitably locked region.

Note that as of now evtchn_bind_virq() and evtchn_bind_ipi() don't have
a need to call check_free_port().

This is part of XSA-343.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Julien Grall <jgrall@amazon.com>

evtchn/x86: enforce correct upper limit for 32-bit guests

The recording of d->max_evtchns in evtchn_2l_init(), in particular with
the limited set of callers of the function, is insufficient. Neither for
PV nor for HVM guests the bitness is known at domain_create() time, yet
the upper bound in 2-level mode depends upon guest bitness. Recording
too high a limit "allows" x86 32-bit domains to open not properly usable
event channels, management of which (inside Xen) would then result in
corruption of the shared info and vCPU info structures.

Keep the upper limit dynamic for the 2-level case, introducing a helper
function to retrieve the effective limit. This helper is now supposed to
be private to the event channel code. The used in do_poll() and
domain_dump_evtchn_info() weren't consistent with port uses elsewhere
and hence get switched to port_is_valid().

Furthermore FIFO mode's setup_ports() gets adjusted to loop only up to
the prior ABI limit, rather than all the way up to the new one.

Finally a word on the change to do_poll(): Accessing ->max_evtchns
without holding a suitable lock was never safe, as it as well as
->evtchn_port_ops may change behind do_poll()'s back. Using
port_is_valid() instead widens some the window for potential abuse,
until we've dealt with the race altogether (see XSA-343).

This is XSA-342.

Reported-by: Julien Grall <jgrall@amazon.com>
Fixes: 48974e6ce52e ("evtchn: use a per-domain variable for the max number of event channels")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Julien Grall <jgrall@amazon.com>

xen/evtchn: Add missing barriers when accessing/allocating an event channel

While the allocation of a bucket is always performed with the per-domain
lock, the bucket may be accessed without the lock taken (for instance, see
evtchn_send()).

Instead such sites relies on port_is_valid() to return a non-zero value
when the port has a struct evtchn associated to it. The function will
mostly check whether the port is less than d->valid_evtchns as all the
buckets/event channels should be allocated up to that point.

Unfortunately a compiler is free to re-order the assignment in
evtchn_allocate_port() so it would be possible to have d->valid_evtchns
updated before the new bucket has finish to allocate.

Additionally on Arm, even if this was compiled "correctly", the
processor can still re-order the memory access.

Add a write memory barrier in the allocation side and a read memory
barrier when the port is valid to prevent any re-ordering issue.

This is XSA-340.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>

x86/pv: Avoid double exception injection

There is at least one path (SYSENTER with NT set, Xen converts to #GP) which
ends up injecting the #GP fault twice, first in compat_sysenter(), and then a
second time in compat_test_all_events(), due to the stale TBF_EXCEPTION left
in TRAPBOUNCE_flags.

The guest kernel sees the second fault first, which is a kernel level #GP
pointing at the head of the #GP handler, and is therefore a userspace
trigger-able DoS.

This particular bug has bitten us several times before, so rearrange
{compat_,}create_bounce_frame() to clobber TRAPBOUNCE on success, rather than
leaving this task to one area of code which isn't used uniformly.

Other scenarios which might result in a double injection (e.g. two calls
directly to compat_create_bounce_frame) will now crash the guest, which is far
more obvious than letting the kernel run with corrupt state.

This is XSA-339

Fixes: fdac9515607b ("x86: clear EFLAGS.NT in SYSENTER entry path")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

evtchn: relax port_is_valid()

To avoid ports potentially becoming invalid behind the back of certain
other functions (due to ->max_evtchn shrinking) because of
- a guest invoking evtchn_reset() and from a 2nd vCPU opening new
channels in parallel (see also XSA-343),
- alloc_unbound_xen_event_channel() produced channels living above the
2-level range (see also XSA-342),
drop the max_evtchns check from port_is_valid(). For a port for which
the function once returned "true", the returned value may not turn into
"false" later on. The function's result may only depend on bounds which
can only ever grow (which is the case for d->valid_evtchns).

This also eliminates a false sense of safety, utilized by some of the
users (see again XSA-343): Without a suitable lock held, d->max_evtchns
may change at any time, and hence deducing that certain other operations
are safe when port_is_valid() returned true is not legitimate. The
opportunities to abuse this may get widened by the change here
(depending on guest and host configuration), but will be taken care of
by the other XSA.

This is XSA-338.

Fixes: 48974e6ce52e ("evtchn: use a per-domain variable for the max number of event channels")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Julien Grall <jgrall@amazon.com>

x86/MSI-X: restrict reading of table/PBA bases from BARs

When assigned to less trusted or un-trusted guests, devices may change
state behind our backs (they may e.g. get reset by means we may not know
about). Therefore we should avoid reading BARs from hardware once a
device is no longer owned by Dom0. Furthermore when we can't read a BAR,
or when we read zero, we shouldn't instead use the caller provided
address unless that caller can be trusted.

Re-arrange the logic in msix_capability_init() such that only Dom0 (and
only if the device isn't DomU-owned yet) or calls through
PHYSDEVOP_prepare_msix will actually result in the reading of the
respective BAR register(s). Additionally do so only as long as in-use
table entries are known (note that invocation of PHYSDEVOP_prepare_msix
counts as a "pseudo" entry). In all other uses the value already
recorded will get used instead.

Clear the recorded values in _pci_cleanup_msix() as well as on the one
affected error path. (Adjust this error path to also avoid blindly
disabling MSI-X when it was enabled on entry to the function.)

While moving around variable declarations (in many cases to reduce their
scopes), also adjust some of their types.

This is part of XSA-337.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

x86/msi: get rid of read_msi_msg

It's safer and faster to just use the cached last written
(untranslated) MSI message stored in msi_desc for the single user that
calls read_msi_msg.

This also prevents relying on the data read from the device MSI
registers in order to figure out the index into the IOMMU interrupt
remapping table, which is not safe.

This is part of XSA-337.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Requested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/vpt: fix race when migrating timers between vCPUs

The current vPT code will migrate the emulated timers between vCPUs
(change the pt->vcpu field) while just holding the destination lock,
either from create_periodic_time or pt_adjust_global_vcpu_target if
the global target is adjusted. Changing the periodic_timer vCPU field
in this way creates a race where a third party could grab the lock in
the unlocked region of pt_adjust_global_vcpu_target (or before
create_periodic_time performs the vcpu change) and then release the
lock from a different vCPU, creating a locking imbalance.

Introduce a per-domain rwlock in order to protect periodic_time
migration between vCPU lists. Taking the lock in read mode prevents
any timer from being migrated to a different vCPU, while taking it in
write mode allows performing migration of timers across vCPUs. The
per-vcpu locks are still used to protect all the other fields from the
periodic_timer struct.

Note that such migration shouldn't happen frequently, and hence
there's no performance drop as a result of such locking.

This is XSA-336.

Reported-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Tested-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

xen/memory: Don't skip the RCU unlock path in acquire_resource()

In the case that an HVM Stubdomain makes an XENMEM_acquire_resource hypercall,
the FIXME path will bypass rcu_unlock_domain() on the way out of the function.

Move the check to the start of the function. This does change the behaviour
of the get-size path for HVM Stubdomains, but that functionality is currently
broken and unused anyway, as well as being quite useless to entities which
can't actually map the resource anyway.

This is XSA-334.

Fixes: 83fa6552ce ("common: add a new mappable resource type: XENMEM_resource_grant_table")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/pv: Handle the Intel-specific MSR_MISC_ENABLE correctly

This MSR doesn't exist on AMD hardware, and switching away from the safe
functions in the common MSR path was an erroneous change.

Partially revert the change.

This is XSA-333.

Fixes: 4fdc932b3cc ("x86/Intel: drop another 32-bit leftover")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Wei Liu <wl@xen.org>

xen/arm: cmpxchg: Add missing memory barriers in __cmpxchg_mb_timeout()

The function __cmpxchg_mb_timeout() was intended to have the same
semantics as __cmpxchg_mb(). Unfortunately, the memory barriers were
not added when first implemented.

There is no known issue with the existing callers, but the barriers are
added given this is the expected semantics in Xen.

The issue was introduced by XSA-295.

Backport: 4.8+
Fixes: 86b0bc958373 ("xen/arm: cmpxchg: Provide a new helper that can timeout")
Signed-off-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com>
(cherry picked from commit d501ef90ae7f2a79130ea89acb3d6d1792972934)

xen/arm: Missing N1/A76/A75 FP registers in vCPU context switch

Xen has cpu_has_fp/cpu_has_simd to detect whether the CPU supports
FP/SIMD or not. But currently, these two MACROs only consider value 0
of ID_AA64PFR0_EL1.FP/SIMD as FP/SIMD features enabled. But for CPUs
that support FP/SIMD and half-precision floating-point arithmetic, the
ID_AA64PFR0_EL1.FP/SIMD are 1 (see Arm ARM DDI0487F.b, D13.2.64).
For these CPUs, xen will treat them as no FP/SIMD support, the
vfp_save/restore_state will not take effect.

From the TRM documents of Cortex-A75/A76/N1, we know these CPUs support
basic Advanced SIMD/FP and half-precision floating-point arithmetic. In
this case, on N1/A76/A75 platforms, Xen will always miss the floating
pointer registers save/restore. If different vCPUs are running on the
same pCPU, the floating pointer registers will be corrupted randomly.

This patch fixes Xen on these new cores.

Signed-off-by: Wei Chen <wei.chen@arm.com>
Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>
(cherry picked from commit 968bb86d04913f52d7678a842474f2a674a8b23e)

xen/arm: Update silicon-errata.txt with the Neovers AT erratum

Commit 858c0be8c2fa "xen/arm: Enable CPU Erratum 1165522 for Neoverse"
added a new erratum but forgot to update silicon-errata.txt.

Update the file accordingly to keep track of errata workaround in Xen.

Signed-off-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 1814a626fb5811184eda64fe22f0055df4600211)

xen/arm: Enable CPU Erratum 1165522 for Neoverse

Enable CPU erratum of Speculative AT on the Neoverse N1 processor
versions r0p0 to r2p0.
Also Fix Cortex A76 Erratum string which had a wrong errata number.

Signed-off-by: Bertrand Marquis <bertrand.marquis@arm.com>
Acked-by: Julien Grall <jgrall@amazon.com>
(cherry picked from commit 858c0be8c2fa4125a0fa0acaa03ae730e5c7cb3c)

arm: Add Neoverse N1 processor identification

Add MIDR and CPU part numbers for Neoverse N1

Signed-off-by: Bertrand Marquis <bertrand.marquis@arm.com>
Acked-by: Julien Grall <jgrall@amazon.com>
(cherry picked from commit 3b418b33265402aab0cb1bf2b745a25724bae2d8)

x86/pv: Rewrite segment context switching from scratch

There are multiple bugs with the existing implementation.

On AMD CPUs prior to Zen2, loading a NUL segment selector doesn't clear the
segment base, which is a problem for 64bit code which typically expects to use
a NUL %fs/%gs selector.

On a context switch from any PV vcpu, to a 64bit PV vcpu with an %fs/%gs
selector which faults, the fixup logic loads NUL, and the guest is entered at
the failsafe callback with the stale base.

Alternatively, a PV context switch sequence of 64 (NUL, non-zero base) =>
32 (NUL) => 64 (NUL, zero base) will similarly cause Xen to enter the guest
with a stale base.

Both of these corner cases manifest as state corruption in the final vcpu.
However, damage is limited to to 64bit code expecting to use Thread Local
Storage with a base pointer of 0, which doesn't occur by default.

The context switch logic is extremely complicated, and is attempting to
optimise away loading a NUL selector (which is fast), or writing a 64bit base
of 0 (which is rare).  Furthermore, it fails to respect Linux's ABI with
userspace, which manifests as userspace state corruption as far as Linux is
concerned.

Always restore all selector and base state, in all cases.

Leave a large comment explaining hardware behaviour, and the new ABI
expectations.  Update the comments in the public headers.

Drop all "segment preloading" to handle the AMD corner case.  It was never
anything but a waste of time for %ds/%es, and isn't needed now that %fs/%gs
bases are unconditionally written for 64bit PV guests.  In load_segments(),
store the result of is_pv_32bit_vcpu() as it is an expensive predicate now,
and not used in a way which impacts speculative safety.

Reported-by: Andy Lutomirski <luto@kernel.org>
Reported-by: Sarah Newman <srn@prgmr.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
x86/pv: Fix assertions in svm_load_segs()

OSSTest has shown an assertion failure:
http://logs.test-lab.xenproject.org/osstest/logs/153906/test-xtf-amd64-amd64-1/serial-rimava1.log

This is because we pass a non-NUL selector into svm_load_segs(), which is
something we must not do, as this path does not load the attributes/limits
from the GDT/LDT.

Drop the {fs,gs}_sel parameters from svm_load_segs() and use 0 instead.  This
is acceptable even for non-zero NUL segments, as it is how the IRET
instruction behaves in all CPUs.

Only use the svm_load_segs() path when both FS and GS are NUL, which is the
common case when scheduling a 64bit vcpu with 64bit userspace in context.

Fixes: ad0fd291c5 ("x86/pv: Rewrite segment context switching from scratch")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: ad0fd291c5e79191c2e3c70e43dded569f11a450
master date: 2020-09-07 11:32:34 +0100
master commit: 1e2d3be2e516e6f415ca6029f651b76a8563a27c
master date: 2020-09-08 16:46:31 +0100

x86/pv: Fix consistency of 64bit segment bases

The comments in save_segments(), _toggle_guest_pt() and write_cr() are false.
The %fs and %gs bases can be updated at any time by the guest.

As a consequence, Xen's fs_base/etc tracking state is always stale when the
vcpu is in context, and must not be used to complete MSR_{FS,GS}_BASE reads, etc.

In particular, a sequence such as:

  wrmsr(MSR_FS_BASE, 0x1ull << 32);
  write_fs(__USER_DS);
  base = rdmsr(MSR_FS_BASE);

will return the stale base, not the new base.  This may cause guest a guest
kernel's context switching of userspace to malfunction.

Therefore:
* Update save_segments(), _toggle_guest_pt() and read_msr() to always read
   the segment bases from hardware.
* Update write_cr(), write_msr() and do_set_segment_base() to not not waste
   time caching data which is instantly going to become stale again.
* Provide comments explaining when the tracking state is and isn't stale.

This bug has been present for 14 years, but several bugfixes since have built
on and extended the original flawed logic.

Fixes: ba9adb737ba ("Apply stricter checking to RDMSR/WRMSR emulations.")
Fixes: c42494acb2f ("x86: fix FS/GS base handling when using the fsgsbase feature")
Fixed: eccc170053e ("x86/pv: Don't have %cr4.fsgsbase active behind a guest kernels back")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: a5eaac9245f4f382a3cd0e9710e9d1cba7db20e4
master date: 2020-09-07 11:32:34 +0100

x86/pv: Fix multiple bugs with SEGBASE_GS_USER_SEL

The logic takes the segment selector unmodified from guest context.  This
allowed the guest to load DPL0 descriptors into %gs.  Fix up the RPL for
non-NUL selectors to be 3.

Xen's context switch logic skips saving the inactive %gs base, as it cannot be
modified by the guest behind Xen's back.  This depends on Xen caching updates
to the inactive base, which is was missing from this path.

The consequence is that, following SEGBASE_GS_USER_SEL, the next context
switch will restore the stale inactive %gs base, and corrupt vcpu state.

Rework the hypercall to update the cached idea of gs_base_user, and fix the
behaviour in the case of the AMD NUL selector bug to always zero the segment
base.

Reported-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: afe018e041ec112d90a8b4e6ed607d22aa06f280
master date: 2020-08-31 14:21:46 +0100

x86/intel: Expose MSR_ARCH_CAPS to dom0

The overhead of (the lack of) MDS_NO alone has been measured at 30% on some
workloads. While we're not in a position yet to offer MSR_ARCH_CAPS generally
to guests, dom0 doesn't migrate, so we can pass a subset of hardware values
straight through.

This will cause PVH dom0's not to use KPTI by default, and all dom0's not to
use VERW flushing by default, and to use eIBRS in preference to retpoline on
recent Intel CPUs.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: e46474278a0e87e2b32ad5dd5fc20e8d2cb0688b
master date: 2020-08-31 13:43:26 +0100

x86: Begin to introduce support for MSR_ARCH_CAPS

... including serialisation/deserialisation logic and unit tests.

There is no current way to configure this MSR correctly for guests.
The toolstack side this logic needs building, which is far easier to
do with it in place.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: e32605b07ef2e01c9d05da9b2d5d7b8f9a5c7c1b
master date: 2020-08-27 12:48:46 +0100

x86: use constant flags for section .init.rodata

LLVM 11 complains with:

<instantiation>:1:1: error: changed section flags for .init.rodata, expected: 0x2
.pushsection .init.rodata
^
<instantiation>:30:9: note: while in macro instantiation
        entrypoint 0
        ^
entry.S:979:9: note: while in macro instantiation
        .rept 256
        ^

And:

entry.S:1015:9: error: changed section flags for .init.rodata, expected: 0x2
        .section .init.rodata
        ^

Fix it by explicitly using the same flags and type in all the
instances.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: d2770047a277ccdc7924fb99d1b051eeb0d5a90f
master date: 2020-08-27 09:53:46 +0200

build: work around bash issue

Older bash (observed with 3.2.57(2)) fails to honor "set -e" for certain
built-in commands ("while" here), despite the command's status correctly
being non-zero. The subsequent objcopy invocation now being separated by
a semicolon results in no failure. Insert an explicit "exit" (replacing
; by && ought to be another possible workaround).

Fixes: e321576f4047 ("xen/build: start using if_changed")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 5132a0a37190b73c99dbbecf48dc4fb214feaf14
master date: 2020-08-07 13:12:00 +0200