xen.git
5 years ago docs/xl: fix typo in xl.cfg
Roger Pau Monne [Mon, 3 Feb 2020 10:31:12 +0000 (11:31 +0100)]
docs/xl: fix typo in xl.cfg

The name of the option is nographic.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Wei Liu <wl@xen.org>
5 years ago docs/misc: xen-command-line: fix parameters in sample serial configuration
Sarah Newman [Mon, 3 Feb 2020 12:09:13 +0000 (13:09 +0100)]
docs/misc: xen-command-line: fix parameters in sample serial configuration

The names of the serial parameters use hyphens, not underscores.

Signed-off-by: Sarah Newman <srn@prgmr.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
5 years ago x86 / vmx: move teardown from domain_destroy()...
Paul Durrant [Mon, 3 Feb 2020 12:08:44 +0000 (13:08 +0100)]
x86 / vmx: move teardown from domain_destroy()...

... to domain_relinquish_resources().

The teardown code frees the APICv page. This does not need to be done late
so do it in domain_relinquish_resources() rather than domain_destroy().

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
5 years ago x86/EPT: drop redundant ept_p2m_type_to_flags() parameters
Jan Beulich [Mon, 3 Feb 2020 12:08:06 +0000 (13:08 +0100)]
x86/EPT: drop redundant ept_p2m_type_to_flags() parameters

All callers set the respective fields in the entry being updated before
the call.

Take the opportunity and also constify the first parameter as well as
make a few style adjustments.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
5 years ago x86/EPT: do away with hidden GUEST_TABLE_MAP_FAILED == 0 assumptions
Jan Beulich [Mon, 3 Feb 2020 12:07:19 +0000 (13:07 +0100)]
x86/EPT: do away with hidden GUEST_TABLE_MAP_FAILED == 0 assumptions

The code is quite a bit easier to read and to reason about this way,
I think.

In ept_set_entry() additionally change the function's return value in
the MAP_FAILED case to -ENOMEM; -ENOENT would be applicable only when
ept_next_entry() was invoked with "read_only" set to true.

In two cases, where ept_next_level() follows an ept_split_superpage()
invocation, actually tighten the loop exit condition from
"== MAP_FAILED" to "!= NORMAL_PAGE". Continuing these loops for other
than NORMAL_PAGE is invalid, and there are ASSERT()s in place after
these loops.

Also reduce the scope of "ret" variables where possible, in particular
to better distinguish them from "rc" often used in the same function.

Finally drop pointless "else" in a few areas touched anyway.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
5 years ago x86/tlb: fix NEED_FLUSH return type
Roger Pau Monné [Mon, 3 Feb 2020 12:06:19 +0000 (13:06 +0100)]
x86/tlb: fix NEED_FLUSH return type

The returned type wants to be bool instead of int.

No functional change intended.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Wei Liu <wl@xen.org>
5 years ago xen: split parameter related definitions in own header file
Juergen Gross [Mon, 3 Feb 2020 12:04:30 +0000 (13:04 +0100)]
xen: split parameter related definitions in own header file

Move the parameter related definitions from init.h into a new header
file param.h. This will avoid include hell when new dependencies are
added to parameter definitions.

Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Julien Grall <julien@xen.org>
Acked-by: Dario Faggioli <dfaggioli@suse.com>
Acked-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
5 years ago xen/x86: domctl: Don't leak data via XEN_DOMCTL_gethvmcontext
Julien Grall [Mon, 27 Jan 2020 13:34:12 +0000 (13:34 +0000)]
xen/x86: domctl: Don't leak data via XEN_DOMCTL_gethvmcontext

The HVM context may not fill up the full buffer passed by the caller.
While we report correctly the size of the context, we will still be
copying back the full size of the buffer.

As the buffer is allocated through xmalloc(), we will be copying some
bits from the previous allocation.

Only copy back the part of the buffer used by the HVM context to prevent
any leak.

Note that per XSA-72, this is not a security issue.

Signed-off-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
5 years ago xen/x86: domain: Remove specific case when allocating struct domain
Julien Grall [Mon, 20 Jan 2020 14:10:57 +0000 (14:10 +0000)]
xen/x86: domain: Remove specific case when allocating struct domain

Commit 8916fcf4577 "x86/domain: compile with lock_profile=y enabled"
allowed the struct domain to use more than a PAGE_SIZE (i.e 4096).
However, the function free_domheap_struct() will only free the first
page.

We could modify the free part to free the correct number of pages, but
the structure has been fitting in a page (even with lock profile
enabled) since commit 428607a410 "x86: shrink 'struct domain', was
already PAGE_SIZE" (part of Xen 4.7).

Therefore, the specific case for lock profile is now removed.

This is not a security issue because struct domain can only be bigger
than a page size for lock profiling. The feature can only be selected
in DEBUG and EXPERT mode.

Fixes: 8916fcf4577 ("x86/domain: compile with lock_profile=y enabled")
Reported-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years ago x86: make paddr_bits available earlier
Wei Liu [Wed, 29 Jan 2020 14:09:54 +0000 (14:09 +0000)]
x86: make paddr_bits available earlier

Move early_cpu_init before init_e820, such that paddr_bits can be used
by e820 code.

This will reduce code repetition and prepare for further adjustment when
L0 hypervisor comes into play.

Signed-off-by: Wei Liu <liuwe@microsoft.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
5 years ago tools/xenstore: don't apply write limiting for privileged domain
Juergen Gross [Fri, 31 Jan 2020 14:25:57 +0000 (15:25 +0100)]
tools/xenstore: don't apply write limiting for privileged domain

Xenstore write limiting should not be applied to dom0. Unfortunately
write limiting is disabled only for connections via sockets. When
running in a stubdom Xenstore will apply write limiting to dom0, too.
Change that by testing for the domain to be privileged as well.

Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Wei Liu <wl@xen.org>
5 years ago tools/xenstore: add newline for printing of stubdom console messages
Juergen Gross [Fri, 31 Jan 2020 14:25:09 +0000 (15:25 +0100)]
tools/xenstore: add newline for printing of stubdom console messages

There are several places in xenstore-stubdom where newlines at the end
of messages on the console are missing. Add them.

Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Wei Liu <wl@xen.org>
5 years ago libxl: generalise libxl__domain_userdata_lock()
Paul Durrant [Fri, 31 Jan 2020 15:01:45 +0000 (15:01 +0000)]
libxl: generalise libxl__domain_userdata_lock()

This function implements a file-based lock with a file name generated
from a domid.

This patch splits it into two, generalising the core of the locking code
into a new libxl__lock_file() function which operates on a specified file,
leaving just the file name generation in libxl__domain_userdata_lock().

This patch also generalises libxl__unlock_domain_userdata() to
libxl__unlock_file() and modifies all call-sites.

Suggested-by: Ian Jackson <ian.jackson@eu.citrix.com>
Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
5 years ago libxl_create: make 'soft reset' explicit
Paul Durrant [Fri, 31 Jan 2020 15:01:44 +0000 (15:01 +0000)]
libxl_create: make 'soft reset' explicit

The 'soft reset' code path in libxl__domain_make() is currently taken if a
valid domid is passed into the function. A subsequent patch will enable
higher levels of the toolstack to determine the domid of newly created or
restored domains and therefore this criteria for choosing 'soft reset'
will no longer be usable.

This patch adds an extra boolean option to libxl__domain_make() to specify
whether it is being invoked in soft reset context and appropriately
modifies callers to choose the right value. To facilitate this, a new
'soft_reset' boolean field is added to struct libxl__domain_create_state
and the 'domid_soft_reset' field is renamed to 'domid' in anticipation of
its wider remit. For the moment do_domain_create() will always set
domid to INVALID_DOMID and hence we can add an assertion into
libxl__domain_create() that, if it is not called in soft reset context,
the passed in domid is exactly that value.

Whilst in the neighbourhood, some checks of 'restore_fd > -1' have been
replaced by 'restore_fd >= 0' to be more conventional and consistent with
checks of 'restore_fd < 0'.

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
5 years ago libxl: add definition of INVALID_DOMID to the API
Paul Durrant [Fri, 31 Jan 2020 15:01:43 +0000 (15:01 +0000)]
libxl: add definition of INVALID_DOMID to the API

Currently both xl and libxl have internal definitions of INVALID_DOMID
which happen to be identical. However, for the purposes of describing the
behaviour of libxl_domain_create_new/restore() it is useful to have a
specified invalid value for a domain id.

This patch therefore moves the libxl definition from libxl_internal.h to
libxl.h and removes the internal definition from xl_utils.h. The hardcoded
'-1' passed back via domcreate_complete() is then updated to INVALID_DOMID
and comment above libxl_domain_create_new/restore() is accordingly
modified.

NOTE: The value of INVALID_DOMID (~0) is distinct from the hypervisor's
      DOMID_INVALID. This patch preserves that value.

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
5 years ago x86/HVM: avoid truncation of PM timer I/O port range version
Jan Beulich [Fri, 31 Jan 2020 15:48:25 +0000 (16:48 +0100)]
x86/HVM: avoid truncation of PM timer I/O port range version

Don't silently ignore the upper 32 bits.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Wei Liu <wl@xen.org>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years ago x86/HVM: relinquish resources also from hvm_domain_destroy()
Jan Beulich [Fri, 31 Jan 2020 15:47:29 +0000 (16:47 +0100)]
x86/HVM: relinquish resources also from hvm_domain_destroy()

Domain creation failure paths don't call domain_relinquish_resources(),
yet allocations and alike done from hvm_domain_initialize() need to be
undone nevertheless. Call the function also from hvm_domain_destroy(),
after making sure all descendants are idempotent.

Note that while viridian_{domain,vcpu}_deinit() were already used in
ways suggesting they're idempotent, viridian_time_vcpu_deinit() actually
wasn't: One can't kill a timer that was never initialized.

For hvm_destroy_all_ioreq_servers()'s purposes make
relocate_portio_handler() return whether the to be relocated port range
was actually found. This seems cheaper than introducing a flag into
struct hvm_domain's ioreq_server sub-structure.

In hvm_domain_initialise() additionally
- use XFREE() also to replace adjacent xfree(),
- use hvm_domain_relinquish_resources() as being idempotent now.
There as well as in hvm_domain_destroy() the explicit call to
rtc_deinit() isn't needed anymore.

In hvm_domain_relinquish_resources() additionally drop a no longer
relevant if().

Fixes: e7a9b5e72f26 ("viridian: separately allocate domain and vcpu structures")
Fixes: 26fba3c85571 ("viridian: add implementation of synthetic timers")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Paul Durrant <pdurrant@amazon.com>
5 years ago MAINTAINERS: put Hyper-V code under Viridian maintainership
Wei Liu [Wed, 29 Jan 2020 20:20:23 +0000 (20:20 +0000)]
MAINTAINERS: put Hyper-V code under Viridian maintainership

And add myself as a maintainer.

Sort the list alphabetically while at it.

Signed-off-by: Wei Liu <liuwe@microsoft.com>
Signed-off-by: Wei Liu <wl@xen.org>
Reviewed-by: Paul Durrant <pdurrant@amazon.com>
5 years ago x86: fold linker script pre-processing rules
Jan Beulich [Thu, 30 Jan 2020 16:19:46 +0000 (17:19 +0100)]
x86: fold linker script pre-processing rules

There's no need to have twice almost the same rule. Simply add the extra
-DEFI to AFLAGS for the EFI variant, and specify both targets for the
then single rule.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years ago x86: undo part of "refine link time stub area related assertion"
Jan Beulich [Thu, 30 Jan 2020 16:18:12 +0000 (17:18 +0100)]
x86: undo part of "refine link time stub area related assertion"

The original check was not too strict: While we don't use one page of
memory per CPU, we do use one page of VA space per CPU. It is the
latter which matters here.

Undo that part of the change, but leave everything else in place.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years ago xen: Move GCC_HAS_VISIBILITY_ATTRIBUTE to Kconfig and common
Anthony PERARD [Wed, 11 Dec 2019 16:38:57 +0000 (16:38 +0000)]
xen: Move GCC_HAS_VISIBILITY_ATTRIBUTE to Kconfig and common

The check for $(CC) -fvisibility=hidden is done by both arm and x86,
so the patch also moves the check to the common area.

The check doesn't check if $(CC) is gcc, and clang can accept that
option as well, so s/GCC/CC/ is done to the define name.

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years ago xen: Use $(CONFIG_CC_IS_CLANG) instead of $(clang) in Makefile
Anthony PERARD [Wed, 11 Dec 2019 15:27:33 +0000 (15:27 +0000)]
xen: Use $(CONFIG_CC_IS_CLANG) instead of $(clang) in Makefile

Kconfig can check if $(CC) is clang or not, if it is
CONFIG_CC_IS_CLANG will be set.

With that patch, the hypervisor can be built using clang by running
`make CC=clang CXX=clang++` without needing to provide an extra clang
parameter.

`make clang=y` still works as Config.mk will set CC and CXX.

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years ago xen: Move CONFIG_INDIRECT_THUNK to Kconfig
Anthony PERARD [Wed, 11 Dec 2019 13:55:06 +0000 (13:55 +0000)]
xen: Move CONFIG_INDIRECT_THUNK to Kconfig

Now that Kconfig has the capability to run shell command when
generating CONFIG_* we can use it in some cases to test CFLAGS.

CONFIG_INDIRECT_THUNK is a good example that wants to exist both in
Makefile and as a C macro, which Kconfig does. So use Kconfig to
generate CONFIG_INDIRECT_THUNK and have the CFLAGS depend on that.

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years ago xen: Import cc-ifversion from Kbuild
Anthony PERARD [Wed, 4 Dec 2019 17:13:51 +0000 (17:13 +0000)]
xen: Import cc-ifversion from Kbuild

This is in preparation of importing Kbuild to build Xen. We won't be
able to include Config.mk so we will need a replacement for the macro
`cc-ifversion'.

This patch imports parts of "scripts/Kbuild.include" from Linux v5.4,
the macro cc-ifversion. It makes use of CONFIG_GCC_VERSION that
Kconfig now provides.

Since there are no other uses of Xen's `cc-ifversion' macro, we can
remove it.

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years ago xen: Have Kconfig check $(CC)'s version
Anthony PERARD [Wed, 4 Dec 2019 16:33:23 +0000 (16:33 +0000)]
xen: Have Kconfig check $(CC)'s version

This imports several files from Linux v5.3
 - scripts/Kconfig.include
 - scripts/clang-version.sh
 - scripts/gcc-version.sh
 and several config values from Linux's init/Kconfig file.
But gcc-version.sh has been modified to return "0" when $CC isn't
GCC, like clang-version.sh does.

Files are copied into scripts/ directory because that's where the files
are found in Linux tree, and also because we are going to import more
of Kbuild from Linux which is located in scripts/.

CONFIG_GCC_VERSION and CONFIG_CC_IS_CLANG are going to be used in
follow-up patches.

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years ago xen: Update Kconfig to Linux v5.4
Anthony PERARD [Tue, 17 Sep 2019 13:13:50 +0000 (14:13 +0100)]
xen: Update Kconfig to Linux v5.4

This patch updates Kconfig to a more recent version of Kconfig, found
in Linux v5.4.0, 219d54332a09 ("Linux 5.4").

With the updated version of Kconfig, other changes are necessary to
avoid breaking the build.

Kconfig files:
- fix Kconfig files that were using option env=*:
  Since Linux commit 104daea149c4 ("kconfig: reference environment
  variables directly and remove 'option env='"), we can access the
  environment directly via $() and "option env=" has been removed.
- CONFIG_EXPERT='y' will now appear in .config file if
  XEN_CONFIG_EXPERT=y in the environment. The alternative is to change
  "EXPERT" to "$(XEN_CONFIG_EXPERT)" in all Kconfig files.

Makefile:
- silentoldconfig target has been removed from Kconfig. To update
  include/generated/autoconf.h, we need to use syncconfig target
  instead.

Makefile.kconfig:
- Import newer needed code from Linux's Makefile.lib and
  Kbuild.include and Makefile.build.
- Set Q to empty, Xen build system doesn't silence commands. Having Q
  empty means we can import stuff from Linux without having to remove the
  leading $(Q) from build commands. And quiet='' means commands will be
  echoed.
- Add $(PHONY) to .PHONY. Like it is intended by Kbuild.

Makefile.host is also updated and copied from Linux.

Dependency change:
- Now depends on flex/bison, maybe we could _shipped those files like
  before. Linux doesn't do that anymore.

The .gitignore in kconfig/ has more entries, compared to upstream, for
files generated by Makefile.host.

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years ago x86/mem_access: use __get_gfn_type_access in set_mem_access
Tamas K Lengyel [Wed, 29 Jan 2020 14:06:50 +0000 (15:06 +0100)]
x86/mem_access: use __get_gfn_type_access in set_mem_access

Use __get_gfn_type_access instead of p2m->get_entry to trigger page-forking
when the mem_access permission is being set on a page that has not yet been
copied over from the parent.

Signed-off-by: Tamas K Lengyel <tamas.lengyel@intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
5 years ago x86/suspend: disable watchdog before calling console_start_sync()
Igor Druzhinin [Wed, 29 Jan 2020 14:06:10 +0000 (15:06 +0100)]
x86/suspend: disable watchdog before calling console_start_sync()

... and enable it after exiting S-state. Otherwise accumulated
output in serial buffer might easily trigger the watchdog if it's
still enabled after entering sync transmission mode.

The issue observed on machines which, unfortunately, generate non-0
output in CPU offline callbacks.

Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
5 years ago x86/domctl: fix typo in comment
Olaf Hering [Wed, 29 Jan 2020 13:48:54 +0000 (14:48 +0100)]
x86/domctl: fix typo in comment

The array is named msr_policy.

Fixes commit 60529dfeca1

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Acked-by: Wei Liu <wl@xen.org>
5 years ago x86/mem_sharing: replace MEM_SHARING_DEBUG with gdprintk
Tamas K Lengyel [Wed, 29 Jan 2020 13:48:15 +0000 (14:48 +0100)]
x86/mem_sharing: replace MEM_SHARING_DEBUG with gdprintk

Using XENLOG_ERR level since this is only used in debug paths (ie. it's
expected the user already has loglvl=all set). Also use %pd to print the domain
ids.

Signed-off-by: Tamas K Lengyel <tamas.lengyel@intel.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
5 years ago x86/apic: fix disabling LVT0 in disconnect_bsp_APIC
Roger Pau Monné [Wed, 29 Jan 2020 13:47:00 +0000 (14:47 +0100)]
x86/apic: fix disabling LVT0 in disconnect_bsp_APIC

The Intel SDM states:

"When an illegal vector value (0 to 15) is written to a LVT entry and
the delivery mode is Fixed (bits 8-11 equal 0), the APIC may signal an
illegal vector error, without regard to whether the mask bit is set or
whether an interrupt is actually seen on the input."

And that's exactly what's currently done in disconnect_bsp_APIC when
virt_wire_setup is true and LVT LINT0 is being masked. By writing only
APIC_LVT_MASKED Xen is actually setting the vector to 0 and the
delivery mode to Fixed (0), and hence it triggers an APIC error even
when the LVT entry is masked.

This would usually manifest when Xen is being shut down, as that's
where disconnect_bsp_APIC is called:

(XEN) APIC error on CPU0: 40(00)

Fix this by calling clear_local_APIC prior to setting the LVT LINT
registers which already clear LVT LINT0, and hence the troublesome
write can be avoided as the register is already cleared.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
5 years ago docs: document CONTROL command of xenstore protocol
Juergen Gross [Tue, 28 Jan 2020 06:21:07 +0000 (06:21 +0000)]
docs: document CONTROL command of xenstore protocol

The CONTROL command (former DEBUG command) isn't specified in the
xenstore protocol doc. Add it.

Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Backport: 4.9+

5 years ago docs: fix StudlyCaps in libxl-migration-stream.pandoc
Wei Liu [Tue, 28 Jan 2020 12:40:31 +0000 (12:40 +0000)]
docs: fix StudlyCaps in libxl-migration-stream.pandoc

Note that "LibxlFmt" in the stream should remain unchanged.

Signed-off-by: Wei Liu <wl@xen.org>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years ago docs: Fix StudlyCaps in libxc-migration-stream.pandoc and xl.1.pod
Ian Jackson [Mon, 27 Jan 2020 16:45:47 +0000 (16:45 +0000)]
docs: Fix StudlyCaps in libxc-migration-stream.pandoc and xl.1.pod

$ git-grep libxenctrl | wc -l
99
$ git-grep libxc | wc -l
206
$ git-grep libxenlight | wc -l
48
$ git-grep libxl | wc -l
13433
$ git-grep LibXen | wc -l
2
$

Reported-by: Paul Durrant <pdurrant@amazon.com>
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Wei Liu <wl@xen.org>
5 years ago docs: add DIRECTORY_PART specification do xenstore protocol doc
Juergen Gross [Mon, 27 Jan 2020 16:50:50 +0000 (17:50 +0100)]
docs: add DIRECTORY_PART specification do xenstore protocol doc

DIRECTORY_PART was missing in docs/misc/xenstore.txt. Add it.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Paul Durrant <pdurrant@amazon.com>
Acked-by: Wei Liu <wl@xen.org>
Backport: 4.9+

5 years ago libxl: event: Move poller pipe emptying to the end of afterpoll
Ian Jackson [Fri, 10 Jan 2020 13:19:36 +0000 (13:19 +0000)]
libxl: event: Move poller pipe emptying to the end of afterpoll

This seems neater.  It doesn't have any significant effect because:

The poller fd wouldn't be emptied by time_occurs.  It would only be
woken by time_occurs as a result of an ao completing, or by
libxl__egc_ao_cleanup_1_baton.  But ...1_baton won't be called in
between (for one thing, this would violate the rule of not still
having the active caller when ...1_baton is called).

While discussing this patch, I noticed that there is a possibility (in
libxl in general) that poller_put might be called on a woken poller.
It would probably be sensible at some point to make poller_get empty
the pipe, at least if the pipe_nonempty flag is set.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Tested-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
---
v3: Completely revised commit message; now we think this is just
    cleanup.

5 years ago libxl: event: Fix possible hang with libxl_osevent_beforepoll
Ian Jackson [Fri, 10 Jan 2020 13:05:42 +0000 (13:05 +0000)]
libxl: event: Fix possible hang with libxl_osevent_beforepoll

If the application uses libxl_osevent_beforepoll, a similar hang is
possible to the one described and fixed in
   libxl: event: Fix hang when mixing blocking and eventy calls
Application behaviour would have to be fairly unusual, but it
doesn't seem sensible to just leave this latent bug.

We fix the latent bug by waking up the "poller_app" pipe every time we
add osevents.  If the application does not ever call beforepoll, we
write one byte to the pipe and set pipe_nonempty and then we ignore
it.  We only write another byte if beforepoll is called again.

Normally in an eventy program there would only be one thread calling
libxl_osevent_beforepoll.  The effect in such a program is to
sometimes needlessly go round the poll loop again if a timeout
callback becomes interested in a new osevent.  We'll fix that in a
moment.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Tested-by: George Dunlap <george.dunlap@citrix.com>
---
v2: New addition to correctness arguments in libxl_event.c comment.

5 years ago libxl: event: Break out baton_wake
Ian Jackson [Fri, 10 Jan 2020 13:11:07 +0000 (13:11 +0000)]
libxl: event: Break out baton_wake

No functional change.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Tested-by: George Dunlap <george.dunlap@citrix.com>
---
v2: Now it takes a gc, not an egc.

5 years ago libxl: event: poller pipe optimisation
Ian Jackson [Fri, 10 Jan 2020 13:11:46 +0000 (13:11 +0000)]
libxl: event: poller pipe optimisation

Track in userland whether the poller pipe is nonempty.  This saves us
writing many many bytes to the pipe if nothing ever reads them.

This is going to be relevant in a moment, where we are going to create
a situation where this will happen quite a lot.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Tested-by: George Dunlap <george.dunlap@citrix.com>
5 years ago libxl: event: Fix hang when mixing blocking and eventy calls
Ian Jackson [Fri, 10 Jan 2020 12:37:43 +0000 (12:37 +0000)]
libxl: event: Fix hang when mixing blocking and eventy calls

If the application calls libxl with ao_how==0 and also makes calls
like _occurred, libxl will sometimes get stuck.

The bug happens as follows (for example):

  Thread A
       libxl_do_thing(,ao_how==0)
       libxl_do_thing starts, sets up some callbacks
       libxl_do_thing exit path calls AO_INPROGRESS
       libxl__ao_inprogress goes into event loop
       eventloop_iteration sleeps on:
          - do_thing's current fd set
          - sigchld pipe if applicable
          - its poller

  Thread B
       libxl_something_occurred
       the something is to do with do_thing, above
       do_thing_next_callback does some more work
       do_thing_next_callback becomes interested in fd N
       thread B returns to application

Note that nothing wakes up thread A.  A is not listening on fd N.  So
do_thing_* will not spot when fd N signals.  do_thing will not make
further timely progress.  If there is no timeout thread A will never
wake up.

The problem here occurs because thread A is waiting on an out of date
osevent set.

There is also the possibility that a thread might block waiting for
libxl osevents but outside libxl, eg if the application used
libxl_osevent_beforepoll.  We will deal with that in a moment.

See the big comment in libxl_event.c for a fairly formal correctness
argument.

This depends on libxl__egc_ao_cleanup_1_baton being called everywhere
an egc or ao is disposed of.  Firstly egcs: in this patch we rename
libxl__egc_cleanup, which means we catch all the disposal sites.
Secondly aos: these are disposed of by (i) AO_CREATE_FAIL
(ii) ao__inprogress and (iii) an event which completes the ao later.
(i) and (ii) we handle by adding the call to _baton.  In the case of
(iii) any such function must be an event-generating function so it has
an egc too, so it will pass on the baton when the egc is disposed.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Tested-by: George Dunlap <george.dunlap@citrix.com>
---
v2: Call libxl__egc_ao_cleanup_1_baton (renamed from __egc_cleanup) on
    all exits from ao_inprogress, even requests for async processing.
    Fixes a remaining instance of this bug (!)
    This involves disposing of ao->poller somewhat earlier.

v2: New correctness arguments in libxl_event.c comment and
    in commit message.

5 years ago libxl: event: Make libxl__poller_wakeup take a gc, not an egc
Ian Jackson [Mon, 13 Jan 2020 15:56:28 +0000 (15:56 +0000)]
libxl: event: Make libxl__poller_wakeup take a gc, not an egc

We are going to want to call this in the following situation:

 * We have just set up an ao, which is to call back - so a
   non-synchronous one.  It ought not to call the application
   back right away, so no egc.

 * There is a libxl thread blocking somewhere but it is using
   using an out of date fd or timeout set, which does not take into
   account the ao we have just started.

 * We try to wake that thread up, but libxl__poller_wakeup fails.

In more detail:

The idea before was that these two functions take an egc, not so much
because it actually uses the egc, but to make sure it's only called in a
restricted set of conditions; and now we're relaxing those conditions.

Specifically, we need to make one exception, relating to ao's.

In the situation described above, there is no egc, but we need to call
libxl__poller_wakeup.  Introducing an egc is wrong because that would
imply that this situation might result in application callbacks, but
it shouldn't (and not having an egc prevents that).

libxl__poller_wakeup and LIBXL__EVENT_DISASTER only take an egc for
form's sake; they don't use any part of it other than the gc.  The
"form's sake" is to stop them being called from libxl entrypoints that
are not involved in event generation.

Before this patch this is enforced by the types: you can't call it in
the wrong place because it wants an egc which you don't have.

After this patch this is no longer enforced.  But the mistake
(principally, calling _DISASTER) seems unlikely.  The type enforcement
I mention above was done because it was possible and easy, not because
it was important.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Tested-by: George Dunlap <george.dunlap@citrix.com>
---
v3: Significantly expanded commit message based on irc comments
v2: New patch

5 years ago libxl: event: Make LIBXL__EVENT_DISASTER take a gc, not an egc
Ian Jackson [Mon, 13 Jan 2020 15:53:39 +0000 (15:53 +0000)]
libxl: event: Make LIBXL__EVENT_DISASTER take a gc, not an egc

We are going to want to change libxl__poller_wakeup to take a gc.

In theory there is a risk here that it would be called inappropriately
in a future patch but this seems unlikely.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Tested-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
---
v2: New patch

5 years ago libxl: event: Introduce CTX_UNLOCK_EGC_FREE
Ian Jackson [Thu, 9 Jan 2020 18:54:19 +0000 (18:54 +0000)]
libxl: event: Introduce CTX_UNLOCK_EGC_FREE

This is a very common exit pattern.  We are going to want to change
this pattern.  So we should make it into a macro of its own.

No functional change.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Tested-by: George Dunlap <george.dunlap@citrix.com>
5 years ago libxl: event: Rename ctx.pollers_fd_changed to .pollers_active
Ian Jackson [Thu, 9 Jan 2020 18:20:24 +0000 (18:20 +0000)]
libxl: event: Rename ctx.pollers_fd_changed to .pollers_active

We are going to use this a bit more widely.  Make the name more
general.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Tested-by: George Dunlap <george.dunlap@citrix.com>
5 years ago libxl: event: Rename poller.fds_changed to .fds_deregistered
Ian Jackson [Thu, 9 Jan 2020 18:06:54 +0000 (18:06 +0000)]
libxl: event: Rename poller.fds_changed to .fds_deregistered

This is only for deregistration.  We are going to add another variable
for new events, with different semantics, and this overly-general name
will become confusing.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Tested-by: George Dunlap <george.dunlap@citrix.com>
5 years ago Revert "docs: retrospectively add XS_DIRECTORY_PART to the xenstore protocol..."
Ian Jackson [Mon, 27 Jan 2020 15:46:39 +0000 (15:46 +0000)]
Revert "docs: retrospectively add XS_DIRECTORY_PART to the xenstore protocol..."

Jürgen Groß <jgross@suse.com> points out that this is entirely wrong.

Adding the "Backport" tag so we find this revert too.

This reverts commit d34dc88098c974acbd4fe774dcdb2b8b631bc386.

Backport: 4.9+
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
5 years ago docs: retrospectively add XS_DIRECTORY_PART to the xenstore protocol...
Paul Durrant [Mon, 27 Jan 2020 15:19:07 +0000 (15:19 +0000)]
docs: retrospectively add XS_DIRECTORY_PART to the xenstore protocol...

... specification.

This was added by commit 0ca64ed8 "xenstore: add support for reading
directory with many children" but not added to the specification at that
point. A version of xenstored supporting the command was first released
in Xen 4.9.

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Backport: 4.9+

5 years ago MAINTAINERS: Make tools/xl part of LIBXENLIGHT stanza
Ian Jackson [Thu, 16 Jan 2020 18:43:55 +0000 (18:43 +0000)]
MAINTAINERS: Make tools/xl part of LIBXENLIGHT stanza

xl is maintained in practice by the libxl maintainers.  The effect is
simply to grant maintainership to Anthony.

CC: Wei Liu <wl@xen.org>
Acked-by: Anthony PERARD <anthony.perard@citrix.com>
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
5 years ago automation: updating container to have python3-config binary
Anthony PERARD [Mon, 20 Jan 2020 11:50:52 +0000 (11:50 +0000)]
automation: updating container to have python3-config binary

Those containers have already been updated in GitLab:
- debian/stretch
- debian/stretch-i386
- debian/unstable
- debian/unstable-i386
- fedora/29
- suse/opensuse-leap
- ubuntu/bionic
- ubuntu/trusty
- ubuntu/xenial

The container debian:unstable-arm64v8 hasn't been changed.

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Acked-by: Doug Goldstein <cardoe@cardoe.com>
5 years ago automation: Only build QEMU if Python >= 3.5
Anthony PERARD [Mon, 20 Jan 2020 11:50:51 +0000 (11:50 +0000)]
automation: Only build QEMU if Python >= 3.5

Recent versions of QEMU will not build anymore if Python < 3.5.
That is, QEMU 4.3 not released yet.

That check would also prevent the GitLab CI from building QEMU if
python3 binary isn't present.

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Acked-by: Doug Goldstein <cardoe@cardoe.com>
5 years ago xen/arm: Sign extend TimerValue when computing the CompareValue
Jeff Kubascik [Tue, 21 Jan 2020 15:07:04 +0000 (10:07 -0500)]
xen/arm: Sign extend TimerValue when computing the CompareValue

Xen will only store the CompareValue as it can be derived from the
TimerValue (ARM DDI 0487E.a section D11.2.4):

  CompareValue = (Counter[63:0] + SignExtend(TimerValue))[63:0]

While the TimerValue is a 32-bit signed value, our implementation
assumed it is a 32-bit unsigned value.

Signed-off-by: Jeff Kubascik <jeff.kubascik@dornerworks.com>
Acked-by: Julien Grall <julien@xen.org>
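
The effect of the missing sign extension described in the commit message above, as an added C illustration (not code from the Xen tree; all function names are hypothetical). It models only the architectural relations CompareValue = (Counter + SignExtend(TimerValue))[63:0], TimerValue = (CompareValue - Counter)[31:0], and a timer condition that holds once Counter - CompareValue is non-negative.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Portable sign extension of a 32-bit register value to 64 bits. */
static uint64_t sign_extend32(uint32_t v)
{
    return (v & 0x80000000u) ? ((uint64_t)v | 0xffffffff00000000ull)
                             : (uint64_t)v;
}

/* Behaviour before the fix: TimerValue treated as unsigned (zero-extended). */
static uint64_t cval_zero_extend(uint64_t counter, uint32_t tval)
{
    return counter + (uint64_t)tval;
}

/* Behaviour after the fix: TimerValue treated as a signed 32-bit value. */
static uint64_t cval_sign_extend(uint64_t counter, uint32_t tval)
{
    return counter + sign_extend32(tval);
}

/* Reading TimerValue back: the low 32 bits of (CompareValue - Counter). */
static uint32_t tval_read(uint64_t counter, uint64_t cval)
{
    return (uint32_t)(cval - counter);
}

/*
 * Timer condition: (Counter - CompareValue) >= 0 as a signed 64-bit value,
 * i.e. the top bit of the unsigned difference is clear.
 */
static bool timer_condition_met(uint64_t counter, uint64_t cval)
{
    return (counter - cval) < (1ull << 63);
}

int main(void)
{
    /* Constructed sample: guest writes TimerValue = -1 ("already expired"). */
    uint64_t counter = 1000000;
    uint32_t tval = 0xFFFFFFFFu;

    uint64_t bad = cval_zero_extend(counter, tval);
    uint64_t good = cval_sign_extend(counter, tval);

    printf("zero-extended cval=%llu fires now: %d\n",
           (unsigned long long)bad, timer_condition_met(counter, bad));
    printf("sign-extended cval=%llu fires now: %d\n",
           (unsigned long long)good, timer_condition_met(counter, good));
    printf("tval read back: 0x%08x\n", (unsigned)tval_read(counter, good));
    return 0;
}
```

With the zero-extended form a negative TimerValue arms the timer about 2^32 ticks in the future instead of expiring it immediately.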
5 years ago xen/arm: remove physical timer offset
Jeff Kubascik [Tue, 21 Jan 2020 15:07:03 +0000 (10:07 -0500)]
xen/arm: remove physical timer offset

The physical timer traps apply an offset so that time starts at 0 for
the guest. However, this offset is not currently applied to the physical
counter. Per the ARMv8 Reference Manual (ARM DDI 0487E.a), section
D11.2.4 Timers, the "Offset" between the counter and timer should be
zero for a physical timer. This removes the offset to make the timer and
counter consistent.

This also cleans up the physical timer implementation to better match
the virtual timer - both cval's now hold the hardware value.

In the case the guest sets cval to a time before Xen started, the correct
behavior is to expire the timer immediately. To do this, we set the expires
argument of set_timer to zero.

Signed-off-by: Jeff Kubascik <jeff.kubascik@dornerworks.com>
Acked-by: Julien Grall <julien@xen.org>
5 years ago xen/mm: remove donate_page()
Paul Durrant [Fri, 24 Jan 2020 15:31:03 +0000 (15:31 +0000)]
xen/mm: remove donate_page()

This function was only ever used by TMEM, so had its sole caller dropped by
c/s c492e19fdd "xen: remove tmem from hypervisor".

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Julien Grall <julien@xen.org>
5 years ago x86/hvm: make domain_destroy() method optional
Paul Durrant [Fri, 24 Jan 2020 15:30:59 +0000 (15:30 +0000)]
x86/hvm: make domain_destroy() method optional

This method is currently empty for SVM so make it optional and, while in
the neighbourhood, make it an alternative_vcall().

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
5 years ago x86/hvm: add domain_relinquish_resources() method
Paul Durrant [Fri, 24 Jan 2020 15:30:58 +0000 (15:30 +0000)]
x86/hvm: add domain_relinquish_resources() method

There are two functions in hvm.c to deal with tear-down and a domain:
hvm_domain_relinquish_resources() and hvm_domain_destroy(). However, only
the latter has an associated method in 'hvm_funcs'. This patch adds
a method for the former.

A subsequent patch will define a VMX implementation.

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
5 years ago x86/vmx: make apic_access_mfn type-safe
Paul Durrant [Fri, 24 Jan 2020 15:30:57 +0000 (15:30 +0000)]
x86/vmx: make apic_access_mfn type-safe

Use mfn_t rather than unsigned long.  Fix vmx_free_vlapic_mapping() to be
fully idempotent by avoiding a double free, but the sentinel needs to remain
as _mfn(0) to be safe even in the case that vmx_alloc_vlapic_mapping() hasn't
been called.

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
5 years ago tools/libxl: Reposition build_pre() logic between architectures
Andrew Cooper [Fri, 20 Dec 2019 17:13:41 +0000 (17:13 +0000)]
tools/libxl: Reposition build_pre() logic between architectures

The call to xc_domain_disable_migrate() is made only from x86, while its
handling in Xen is common.  Move it to the libxl__build_pre().

hvm_set_conf_params(), hvm_set_viridian_features(),
hvm_set_mca_capabilities(), and the altp2m logic is all in common code (parts
ifdef'd) but despite this, is all actually x86 specific, at least as currently
implemented in Xen.  Some concepts (nested virt, altp2m) are common in
principle, but need their interface changing to be part of domain_create, and
are not expecting to survive in their current HVM_PARAM form.

Move it all into x86 specific code, and fold all of the xc_hvm_param_set()
calls together into hvm_set_conf_params() in a far more coherent way.

Finally - ensure that all hypercalls have their return values checked.

No practical change in constructed domains.  Fewer useless hypercalls now to
construct an ARM guest.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Wei Liu <wl@xen.org>
5 years ago xen/list: Remove prefetching
Andrew Cooper [Tue, 14 Jan 2020 19:54:04 +0000 (19:54 +0000)]
xen/list: Remove prefetching

Xen inherited its list infrastructure from Linux.  One area where it has fallen
behind is that of prefetching, which as it turns out is a performance penalty
in most cases.

Prefetch of NULL on x86 is now widely measured to have glacial performance
properties, and will unconditionally hit on every hlist use due to the
termination condition.

Cross-port the following Linux patches:

  75d65a425c (2011) "hlist: remove software prefetching in hlist iterators"
  e66eed651f (2011) "list: remove prefetching from regular list iterators"
  c0d15cc7ee (2013) "linked-list: Remove __list_for_each"

to Xen, which results in the following net diffstat on x86:

  add/remove: 0/1 grow/shrink: 27/83 up/down: 576/-1648 (-1072)

(The code additions come from a few now-inlined functions, and slightly
different basic block padding.)

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Julien Grall <julien@xen.org>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
5 years ago libxl: Fix comment about dcs.sdss
Anthony PERARD [Thu, 23 Jan 2020 16:56:46 +0000 (16:56 +0000)]
libxl: Fix comment about dcs.sdss

The field 'sdss' was named 'dmss' before, commit 3148bebbf0ab did the
rename but didn't update the comment.

Fixes: 3148bebbf0ab ("libxl: rename a field in libxl__domain_create_state")
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years ago xen/test/livepatch: remove include of Config.mk
Anthony PERARD [Fri, 17 Jan 2020 10:53:52 +0000 (10:53 +0000)]
xen/test/livepatch: remove include of Config.mk

livepatch/Makefile seems to only be used via Rules.mk, which already
includes Config.mk, avoid the second include.

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
5 years ago xen/build: Remove left over -DMAX_PHYS_IRQS
Anthony PERARD [Fri, 17 Jan 2020 10:53:47 +0000 (10:53 +0000)]
xen/build: Remove left over -DMAX_PHYS_IRQS

The use of MAX_PHYS_IRQS has been removed in cf5e6f2d3441 ("x86:
eliminate hard-coded NR_IRQS"), so remove the left over CFLAGS.

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
5 years ago xen: make CONFIG_DEBUG_LOCKS usable without CONFIG_DEBUG
Juergen Gross [Tue, 21 Jan 2020 10:13:01 +0000 (11:13 +0100)]
xen: make CONFIG_DEBUG_LOCKS usable without CONFIG_DEBUG

In expert mode it is possible to enable CONFIG_DEBUG_LOCKS without
having enabled CONFIG_DEBUG. The coding is depending on CONFIG_DEBUG
as it is using ASSERT(), however.

Fix that by using BUG_ON() instead of ASSERT() in rel_lock().

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
5 years ago tools/libxl: Code-gen improvements for libxl_save_msgs_gen.pl
Andrew Cooper [Fri, 20 Dec 2019 12:42:47 +0000 (12:42 +0000)]
tools/libxl: Code-gen improvements for libxl_save_msgs_gen.pl

our @msgs() is an array of $msginfo's where the first element is a
unique number.  The $msgnum_used check ensures they are unique.  Instead
of specifying them explicitly, generate msgnum locally.  This reduces
the diff necessary to edit the middle of the @msgs() array.

All other hunks are adjusting formatting in the generated C, to make it
easier to follow.

No change in behaviour of the generated C.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
5 years ago x86/mem_access: move _ve functions to x86 header
Tamas K Lengyel [Fri, 24 Jan 2020 13:56:21 +0000 (06:56 -0700)]
x86/mem_access: move _ve functions to x86 header

These functions don't belong in the common mem_access header as there is no #VE
equivalent on ARM.

Signed-off-by: Tamas K Lengyel <tamas@tklengyel.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
5 years ago Revert "tools/libxl: Plumb domain_create_state down into libxl__build_pre()"
Andrew Cooper [Fri, 24 Jan 2020 14:53:09 +0000 (14:53 +0000)]
Revert "tools/libxl: Plumb domain_create_state down into libxl__build_pre()"

This reverts commit aacc143006429de46932aabae17c13846c71fa45.

OSSTest reports that it breaks stubdoms.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years ago Arm/p2m: fix build after ea22bcd030da and 2aa977eb6baa
Jan Beulich [Fri, 24 Jan 2020 12:48:13 +0000 (13:48 +0100)]
Arm/p2m: fix build after ea22bcd030da and 2aa977eb6baa

Each of these commits introduced a function prototype referencing a
structure which hadn't at least been forward declared. Add such
declarations.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
5 years ago x86/microcode: use const qualifier for microcode buffer
Eslam Elnikety [Fri, 24 Jan 2020 09:31:55 +0000 (10:31 +0100)]
x86/microcode: use const qualifier for microcode buffer

The buffer holding the microcode bits should be marked as const.

Signed-off-by: Eslam Elnikety <elnikety@amazon.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
5 years ago x86/microcode: avoid unnecessary xmalloc/memcpy of ucode data
Eslam Elnikety [Fri, 24 Jan 2020 09:31:21 +0000 (10:31 +0100)]
x86/microcode: avoid unnecessary xmalloc/memcpy of ucode data

When using `ucode=scan` and if a matching module is found, the microcode
payload is maintained in an xmalloc()'d region. This is unnecessary since
the bootmap would just do. Remove the xmalloc and xfree on the microcode
module scan path.

This commit also does away with the restriction on the microcode module
size limit. The concern that a large microcode module would consume too
much memory preventing guests launch is misplaced since this is all the
init path. While having such safeguards is valuable, this should apply
across the board for all early/late microcode loading. Having it just on
the `scan` path is confusing.

Looking forward, we are a bit closer (i.e., one xmalloc down) to pulling
the early microcode loading of the BSP a bit earlier in the early boot
process. This commit is the low hanging fruit. There is still a sizable
amount of work to get there as there are still a handful of xmalloc in
microcode_{amd,intel}.c.

First, there are xmallocs on the path of finding a matching microcode
update. Similar to the commit at hand, searching through the microcode
blob can be done on the already present buffer with no need to xmalloc
any further. Even better, do the filtering in microcode.c before
requesting the microcode update on all CPUs. The latter requires careful
restructuring and exposing the arch-specific logic for iterating over
patches and declaring a match.

Second, there are xmallocs for the microcode cache. Here, we would need
to ensure that the cache corresponding to the BSP gets xmalloc()'d and
populated after the fact.

Signed-off-by: Eslam Elnikety <elnikety@amazon.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
5 years ago x86/microcode: improve documentation for ucode=
Eslam Elnikety [Fri, 24 Jan 2020 09:30:54 +0000 (10:30 +0100)]
x86/microcode: improve documentation for ucode=

Specify applicability and the default value. Also state that, in case of
EFI, the microcode update blob specified in the EFI cfg takes precedence
over `ucode=scan`, if the latter is specified on Xen command line.

No functional changes.

Signed-off-by: Eslam Elnikety <elnikety@amazon.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
5 years ago sched: avoid cpumasks on stack in sched/core.c
Juergen Gross [Fri, 24 Jan 2020 09:30:05 +0000 (10:30 +0100)]
sched: avoid cpumasks on stack in sched/core.c

There are still several instances of cpumask_t on the stack in
scheduling code. Avoid them as far as possible.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Dario Faggioli <dfaggioli@suse.com>
5 years ago x86/mem_sharing: Skip xen heap pages in memshr nominate
Tamas K Lengyel [Fri, 24 Jan 2020 09:28:56 +0000 (10:28 +0100)]
x86/mem_sharing: Skip xen heap pages in memshr nominate

Trying to share these would fail anyway, better to skip them early.

Signed-off-by: Tamas K Lengyel <tamas.lengyel@intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
5 years ago x86/mem_sharing: enable mem_sharing on first memop
Tamas K Lengyel [Fri, 24 Jan 2020 09:28:22 +0000 (10:28 +0100)]
x86/mem_sharing: enable mem_sharing on first memop

It is wasteful to require separate hypercalls to enable sharing on both the
parent and the client domain during VM forking. To speed things up we enable
sharing on the first memop in case it wasn't already enabled.

Signed-off-by: Tamas K Lengyel <tamas.lengyel@intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
5 years ago x86/mem_sharing: convert MEM_SHARING_DESTROY_GFN to a bool
Tamas K Lengyel [Fri, 24 Jan 2020 09:27:35 +0000 (10:27 +0100)]
x86/mem_sharing: convert MEM_SHARING_DESTROY_GFN to a bool

MEM_SHARING_DESTROY_GFN is used on the 'flags' bitfield during unsharing.
However, the bitfield is not used for anything else, so just convert it to a
bool instead.

Signed-off-by: Tamas K Lengyel <tamas.lengyel@intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
5 years ago x86/mem_sharing: make add_to_physmap static and shorten name
Tamas K Lengyel [Fri, 24 Jan 2020 09:25:47 +0000 (10:25 +0100)]
x86/mem_sharing: make add_to_physmap static and shorten name

It's not being called from outside mem_sharing.c

Signed-off-by: Tamas K Lengyel <tamas.lengyel@intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
5 years ago x86/mem_sharing: use INVALID_MFN and p2m_is_shared in relinquish_shared_pages
Tamas K Lengyel [Fri, 24 Jan 2020 09:25:12 +0000 (10:25 +0100)]
x86/mem_sharing: use INVALID_MFN and p2m_is_shared in relinquish_shared_pages

While using _mfn(0) is of no consequence during teardown, INVALID_MFN is the
correct value that should be used.

Signed-off-by: Tamas K Lengyel <tamas.lengyel@intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
5 years ago x86/mem_sharing: define mem_sharing_domain to hold some scattered variables
Tamas K Lengyel [Fri, 24 Jan 2020 09:24:18 +0000 (10:24 +0100)]
x86/mem_sharing: define mem_sharing_domain to hold some scattered variables

Create struct mem_sharing_domain under hvm_domain and move mem sharing
variables into it from p2m_domain and hvm_domain.

Expose the mem_sharing_enabled macro to be used consistently across Xen.

Remove some duplicate calls to mem_sharing_enabled in mem_sharing.c

Signed-off-by: Tamas K Lengyel <tamas.lengyel@intel.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
5 years ago x86/mem_sharing: don't try to unshare twice during page fault
Tamas K Lengyel [Fri, 24 Jan 2020 09:21:16 +0000 (10:21 +0100)]
x86/mem_sharing: don't try to unshare twice during page fault

The page was already tried to be unshared in get_gfn_type_access. If that
didn't work, then trying again is pointless. Don't try to send vm_event again
either, simply check if there is a ring or not.

Signed-off-by: Tamas K Lengyel <tamas.lengyel@intel.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
5 years ago x86/mem_sharing: drop flags from mem_sharing_unshare_page
Tamas K Lengyel [Fri, 24 Jan 2020 09:19:42 +0000 (10:19 +0100)]
x86/mem_sharing: drop flags from mem_sharing_unshare_page

All callers pass 0 in.

Signed-off-by: Tamas K Lengyel <tamas.lengyel@intel.com>
Reviewed-by: Wei Liu <wl@xen.org>
Acked-by: George Dunlap <george.dunlap@citrix.com>
5 years ago x86/mem_sharing: make get_two_gfns take locks conditionally
Tamas K Lengyel [Fri, 24 Jan 2020 09:18:10 +0000 (10:18 +0100)]
x86/mem_sharing: make get_two_gfns take locks conditionally

During VM forking the client lock will already be taken.

Signed-off-by: Tamas K Lengyel <tamas.lengyel@intel.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
5 years ago x86/mm: Make use of the default access param from xc_altp2m_create_view
Alexandru Stefan ISAILA [Fri, 17 Jan 2020 13:31:33 +0000 (13:31 +0000)]
x86/mm: Make use of the default access param from xc_altp2m_create_view

At this moment the default_access param from xc_altp2m_create_view is
not used.

This patch assigns default_access to p2m->default_access at the time of
initializing a new altp2m view.

Signed-off-by: Alexandru Isaila <aisaila@bitdefender.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Tamas K Lengyel <tamas@tklengyel.com>
Reviewed-by: Petre Pircalabu <ppircalabu@bitdefender.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
5 years ago x86/mm: Pull vendor-independent altp2m code out of p2m-ept.c and into p2m.c
Alexandru Stefan ISAILA [Fri, 17 Jan 2020 13:31:31 +0000 (13:31 +0000)]
x86/mm: Pull vendor-independent altp2m code out of p2m-ept.c and into p2m.c

No functional changes.

Requested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Alexandru Isaila <aisaila@bitdefender.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Petre Pircalabu <ppircalabu@bitdefender.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
5 years ago x86/altp2m: Add hypercall to set a range of sve bits
Alexandru Stefan ISAILA [Fri, 17 Jan 2020 13:31:30 +0000 (13:31 +0000)]
x86/altp2m: Add hypercall to set a range of sve bits

By default the sve bits are not set.
This patch adds a new hypercall, xc_altp2m_set_supress_ve_multi(),
to set a range of sve bits.
The core function, p2m_set_suppress_ve_multi(), does not break in case
of an error and it is doing a best effort for setting the bits in the
given range. A check for continuation is made in order to have
preemption on large ranges.
The gfn of the first error is stored in
xen_hvm_altp2m_suppress_ve_multi.first_error_gfn and the error code is
stored in xen_hvm_altp2m_suppress_ve_multi.first_error.
If no error occurred the values will be 0.

Signed-off-by: Alexandru Isaila <aisaila@bitdefender.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Petre Pircalabu <ppircalabu@bitdefender.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
5 years ago x86/mm: Add array_index_nospec to guest provided index values
Alexandru Stefan ISAILA [Fri, 17 Jan 2020 13:31:26 +0000 (13:31 +0000)]
x86/mm: Add array_index_nospec to guest provided index values

This patch aims to sanitize indexes, potentially guest provided
values, for altp2m_eptp[] and altp2m_p2m[] arrays.

Requested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Alexandru Isaila <aisaila@bitdefender.com>
Acked-by: Tamas K Lengyel <tamas@tklengyel.com>
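
The kind of index sanitisation the commit above refers to, as an added C sketch (not Xen's array_index_nospec() implementation; MAX_VIEWS, view_table and every function name here are hypothetical). It shows the generic branchless mask: after the architectural bounds check, the index is ANDed with a mask that is all ones when in range and zero otherwise, so a mispredicted branch cannot drive a load at an out-of-bounds offset.

```c
#include <limits.h>
#include <stdio.h>

#define MAX_VIEWS 10UL  /* constructed bound, stands in for the array size */

/* Constructed sample table standing in for a per-domain array. */
static const int view_table[MAX_VIEWS] = {
    100, 101, 102, 103, 104, 105, 106, 107, 108, 109
};

/*
 * Returns ~0UL when index < size and 0 when index >= size, without a
 * conditional branch. Valid for size <= LONG_MAX: for an in-range index
 * both 'index' and 'size - 1 - index' have a clear top bit; for an
 * out-of-range index at least one of them has the top bit set.
 */
static unsigned long index_mask_nospec(unsigned long index, unsigned long size)
{
    unsigned long oob = (index | (size - 1UL - index))
                        >> (sizeof(unsigned long) * CHAR_BIT - 1);
    return oob - 1UL;   /* oob == 0 -> all ones, oob == 1 -> zero */
}

/* Clamp an untrusted index: in-range values pass through, others become 0. */
static unsigned long clamp_index(unsigned long index, unsigned long size)
{
    return index & index_mask_nospec(index, size);
}

/*
 * Lookup with an untrusted (e.g. guest-provided) index. The 'if' is the
 * architectural check; the clamp covers the speculative window in which
 * the CPU may run the load before the branch resolves.
 * Production code must also keep the compiler from folding the mask back
 * into the branch (assumption: done with an optimisation barrier or asm).
 */
static int lookup_view(unsigned long idx)
{
    if (idx >= MAX_VIEWS)
        return -1;

    idx = clamp_index(idx, MAX_VIEWS);
    return view_table[idx];
}

int main(void)
{
    unsigned long samples[] = { 0, 9, 10, ULONG_MAX };  /* constructed */

    for (unsigned i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
        printf("idx=%lu clamp=%lu lookup=%d\n", samples[i],
               clamp_index(samples[i], MAX_VIEWS), lookup_view(samples[i]));
    return 0;
}
```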
5 years ago x86/boot: Drop sym_fs()
Andrew Cooper [Thu, 9 Jan 2020 14:06:38 +0000 (14:06 +0000)]
x86/boot: Drop sym_fs()

All remaining users of sym_fs() can trivially be switched to using sym_esi()
instead.  This is shorter to encode and faster to execute.

This removes the final uses of %fs during boot, which allows us to drop
BOOT_FS from the trampoline GDT, which drops a 16M arbitrary limit on Xen's
compiled size.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
5 years ago x86/boot: Simplify pagetable manipulation loops
Andrew Cooper [Fri, 10 Jan 2020 01:04:28 +0000 (01:04 +0000)]
x86/boot: Simplify pagetable manipulation loops

For __page_tables_{start,end} and L3 bootmap initialisation, the logic is
unnecessarily complicated owing to its attempt to use the LOOP instruction,
which results in an off-by-8 memory address owing to LOOP's termination
condition.

Rewrite both loops for improved clarity and speed.

Misc notes:
 * TEST $IMM, MEM can't macrofuse.  The loop has 0x1200 iterations, so pull
   the $_PAGE_PRESENT constant out into a spare register to turn the TEST into
   its %REG, MEM form, which can macrofuse.
 * Avoid the use of %fs-relative references.  %esi-relative is the more common
   form in the code, and doesn't suffer an address generation overhead.
 * Avoid LOOP.  CMP/JB isn't microcoded and faster to execute in all cases.
 * For a 4 iteration trivial loop, even compilers unroll these.  The
   generated code size is a fraction larger, but this is init and the asm is
   far easier to follow.
 * Reposition the l2=>l1 bootmap construction so the asm reads in pagetable
   level order.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
5 years ago x86/boot: Drop explicit %fs uses
Andrew Cooper [Thu, 9 Jan 2020 14:06:08 +0000 (14:06 +0000)]
x86/boot: Drop explicit %fs uses

The trampoline relocation code uses %fs for accessing Xen, and this comes with
an arbitrary 16M limitation.  We could adjust the limit, but the boot code is
a confusing mix of %ds/%esi-based and %fs-based accesses, and the use of %fs
is longer to encode, and incurs an address generation overhead.

Rewrite the logic to use %ds, for better consistency with the surrounding
code, and a marginal performance improvement.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
5 years ago x86/boot: Size the boot/directmap mappings dynamically
Andrew Cooper [Fri, 10 Jan 2020 14:05:29 +0000 (14:05 +0000)]
x86/boot: Size the boot/directmap mappings dynamically

... rather than presuming that 16M will do.  On the EFI side, use
l2e_add_flags() to reduce the code-generation overhead of using
l2e_from_paddr() twice.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
5 years ago x86/boot: Create the l2_xenmap[] mappings dynamically
Andrew Cooper [Fri, 10 Jan 2020 16:35:14 +0000 (16:35 +0000)]
x86/boot: Create the l2_xenmap[] mappings dynamically

The build-time construction of l2_xenmap[] imposes an arbitrary limit of 16M
total, which is a limit looking to be lifted.

Adjust both the BIOS and EFI paths to fill it in dynamically, based on the
final linked size of Xen.  l2_xenmap[] stays between __page_tables_{start,end}
(rather than move into .bss.page_aligned) as it is expected to gain a
different pagetable reference shortly.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
5 years ago xen/sched: add const qualifier where appropriate
Juergen Gross [Fri, 8 Nov 2019 16:15:35 +0000 (17:15 +0100)]
xen/sched: add const qualifier where appropriate

Make use of the const qualifier more often in scheduling code.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Dario Faggioli <dfaggioli@suse.com>
Acked-by: Meng Xu <mengxu@cis.upenn.edu>
5 years ago xen/sched: eliminate sched_tick_suspend() and sched_tick_resume()
Juergen Gross [Fri, 8 Nov 2019 15:33:32 +0000 (16:33 +0100)]
xen/sched: eliminate sched_tick_suspend() and sched_tick_resume()

sched_tick_suspend() and sched_tick_resume() only call rcu related
functions, so eliminate them and do the rcu_idle_timer*() calling in
rcu_idle_[enter|exit]().

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Dario Faggioli <dfaggioli@suse.com>
Acked-by: Julien Grall <julien@xen.org>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years ago xen/sched: switch scheduling to bool where appropriate
Juergen Gross [Fri, 8 Nov 2019 11:50:58 +0000 (12:50 +0100)]
xen/sched: switch scheduling to bool where appropriate

Scheduling code has several places using int or bool_t instead of bool.
Switch those.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Meng Xu <mengxu@cis.upenn.edu>
Reviewed-by: Dario Faggioli <dfaggioli@suse.com>
5 years ago xen/sched: replace null scheduler percpu-variable with pdata hook
Juergen Gross [Fri, 8 Nov 2019 11:16:10 +0000 (12:16 +0100)]
xen/sched: replace null scheduler percpu-variable with pdata hook

Instead of having an own percpu-variable for private data per cpu the
generic scheduler interface for that purpose should be used.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Dario Faggioli <dfaggioli@suse.com>
5 years ago xen/sched: use scratch cpumask instead of allocating it on the stack
Juergen Gross [Fri, 8 Nov 2019 08:15:04 +0000 (09:15 +0100)]
xen/sched: use scratch cpumask instead of allocating it on the stack

In rt scheduler there are three instances of cpumasks allocated on the
stack. Replace them by using cpumask_scratch.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Meng Xu <mengxu@cis.upenn.edu>
5 years ago xen/sched: remove special cases for free cpus in schedulers
Juergen Gross [Fri, 8 Nov 2019 07:02:53 +0000 (08:02 +0100)]
xen/sched: remove special cases for free cpus in schedulers

With the idle scheduler now taking care of all cpus not in any cpupool
the special cases in the other schedulers for no cpupool associated
can be removed.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Dario Faggioli <dfaggioli@suse.com>
5 years ago xen/sched: cleanup sched.h
Juergen Gross [Fri, 8 Nov 2019 09:56:42 +0000 (10:56 +0100)]
xen/sched: cleanup sched.h

There are some items in include/xen/sched.h which can be moved to
private.h as they are scheduler private.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Dario Faggioli <dfaggioli@suse.com>
5 years ago xen/sched: make sched-if.h really scheduler private
Juergen Gross [Thu, 7 Nov 2019 14:34:37 +0000 (15:34 +0100)]
xen/sched: make sched-if.h really scheduler private

include/xen/sched-if.h should be private to scheduler code, so move it
to common/sched/private.h and move the remaining use cases to
cpupool.c and core.c.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Dario Faggioli <dfaggioli@suse.com>
5 years ago xen/sched: move schedulers and cpupool coding to dedicated directory
Juergen Gross [Wed, 22 Jan 2020 14:06:43 +0000 (15:06 +0100)]
xen/sched: move schedulers and cpupool coding to dedicated directory

Move sched*c and cpupool.c to a new directory common/sched.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Dario Faggioli <dfaggioli@suse.com>
5 years ago VT-d: don't pass bridge devices to domain_context_mapping_one()
Jan Beulich [Wed, 22 Jan 2020 15:39:58 +0000 (16:39 +0100)]
VT-d: don't pass bridge devices to domain_context_mapping_one()

When passed a non-NULL pdev, the function does an owner check when it
finds an already existing context mapping. Bridges, however, don't get
passed through to guests, and hence their owner is always going to be
Dom0, leading to the assignment of all but one of the functions of multi-
function PCI devices behind bridges to fail.

Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
5 years ago x86/smp: use APIC ALLBUT destination shorthand when possible
Roger Pau Monné [Wed, 22 Jan 2020 15:38:39 +0000 (16:38 +0100)]
x86/smp: use APIC ALLBUT destination shorthand when possible

If the IPI destination mask matches the mask of online CPUs use the
APIC ALLBUT destination shorthand in order to send an IPI to all CPUs
on the system except the current one. This can only be safely used
when no CPU hotplug or unplug operations are taking place, no
offline CPUs or those have been onlined and parked, all CPUs in the
system have been accounted for (ie: the number of CPUs doesn't exceed
NR_CPUS and APIC IDs are below MAX_APICS) and there's no possibility
of CPU hotplug (ie: no disabled CPUs have been reported by the
firmware tables).

This is specially beneficial when using the PV shim, since using the
shorthand avoids performing an APIC register write (or multiple ones
if using xAPIC mode) for each destination when doing a global TLB
flush.

The lock time of flush_lock on a 32 vCPU guest using the shim in
x2APIC mode without the shorthand is:

Global lock flush_lock: addr=ffff82d0804b21c0, lockval=f602f602, not locked
  lock:228455938(79406065573135), block:205908580(556416605761539)

Average lock time: 347577ns

While the same guest using the shorthand:

Global lock flush_lock: addr=ffff82d0804b41c0, lockval=d9c4d9bc, cpu=12
  lock:1890775(416719148054), block:1663958(2500161282949)

Average lock time: 220395ns

Approximately a 1/3 improvement in the lock time.

Note that this requires locking the CPU maps (get_cpu_maps) which uses
a trylock. This is currently safe as all users of cpu_add_remove_lock
do a trylock, but will need reevaluating if non-trylock users appear.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
5 years ago xen/arm: gic: Remove pointless assertion against enum gic_sgi
Julien Grall [Sat, 18 Jan 2020 15:39:24 +0000 (15:39 +0000)]
xen/arm: gic: Remove pointless assertion against enum gic_sgi

The Arm Compiler will complain that the assertions ASSERT(sgi < 16) are
always true. This is because sgi is an item of the enum gic_sgi and
should always contain less than 16 SGIs.

Rather than using ASSERTs, introduce a new item in the enum that could
be checked against a build time.

Take the opportunity to remove the specific assigned values for each
item. This is fine because enum always starts at zero and values will be
assigned by increment of one. None of our code also rely on hardcoded
value.

[stefano: grammar fixes in commit message]

Signed-off-by: Julien Grall <julien@xen.org>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
CC: Andrii Anisov <andrii_anisov@epam.com>