New upstream version 0.15.0.1+ds

author Steffen Moeller <moeller@debian.org>

Sat, 28 Jul 2018 22:50:40 +0000 (00:50 +0200)

committer Steffen Moeller <moeller@debian.org>

Sat, 28 Jul 2018 22:50:40 +0000 (00:50 +0200)
diff --git a/.travis.yml b/.travis.yml
index f874a90fef82019016cd4e94051da9bdd0202551..beccc4d155d61d1fc90ed57b484cd19dd1a5df63 100644
--- a/.travis.yml
+++ b/.travis.yml
@@ -21,7 +21,7 @@ addons:
      - g++
  
  script:
-  - ./run_tests_travis.sh
+  - ./devtools/run_tests_travis.sh
  
  notifications:
    email:
diff --git a/NEWS b/NEWS
index 0e2cb71c790c3d68e8acbb8125fbd46ea83ac16b..61b925427c2fe80178b7c0e2e91158be78682553 100644
--- a/NEWS
+++ b/NEWS
@@ -5,6 +5,17 @@ http://pysam.readthedocs.io/en/latest/release.html
  Release notes
  =============
  
+Release 0.15.0
+==============
+
+This release wraps htslib/samtools/bcftools version 1.9.0.
+
+* [#673] permit dash in chromosome name of region string
+* [#656] Support `text` when opening a SAM file for writing
+* [#658] return None in get_forward_sequence if sequence not in record
+* [#683] allow lower case bases in MD tags
+* Ensure that = and X CIGAR ops are treated the same as M
+
  Release 0.14.1
  ==============
  
diff --git a/README.rst b/README.rst
index eb065f0f810f17f2488963d1727d7318e75227b7..2005eb6d0db0857de99e213684f0628cd2212ce0 100644
--- a/README.rst
+++ b/README.rst
@@ -15,7 +15,8 @@ includes an interface for tabix_.
  If you are using the conda packaging manager (e.g. miniconda or anaconda),
  you can install pysam from the `bioconda channel <https://bioconda.github.io/>`_::
  
-   conda config --add channels r
+   conda config --add channels defaults
+   conda config --add channels conda-forge
     conda config --add channels bioconda
     conda install pysam
  
@@ -29,7 +30,7 @@ Pysam is available through `pypi
  
     pip install pysam
  
-Pysam documentation is available through https://readthedocs.org/ from
+Pysam documentation is available
  `here <http://pysam.readthedocs.org/en/latest/>`_
  
  Questions and comments are very welcome and should be sent to the
diff --git a/bcftools/LICENSE b/bcftools/LICENSE
new file mode 100644
index 0000000..26fbc68
--- /dev/null
+++ b/bcftools/LICENSE
@@ -0,0 +1,725 @@
+This software is available to you under a choice of one of two licenses. You
+may chose to be licensed under the terms of the MIT/Expat license or the GNU
+General Public License (GPL), both included below. If compiled with the GNU
+Scientific Library (which is optional and disabled by default as explained in
+the INSTALL document), the use of this software is governed by the GPL license.
+
+
+-----------------------------------------------------------------------------
+
+The MIT/Expat License
+
+Copyright (C) 2012-2014 Genome Research Ltd.
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE.
+
+
+[The use of a range of years within a copyright notice in this distribution
+should be interpreted as being equivalent to a list of years including the
+first and last year specified and all consecutive years between them.
+
+For example, a copyright notice that reads "Copyright (C) 2005, 2007-2009,
+2011-2012" should be interpreted as being identical to a notice that reads
+"Copyright (C) 2005, 2007, 2008, 2009, 2011, 2012" and a copyright notice
+that reads "Copyright (C) 2005-2012" should be interpreted as being identical
+to a notice that reads "Copyright (C) 2005, 2006, 2007, 2008, 2009, 2010,
+2011, 2012".]
+
+
+-----------------------------------------------------------------------------
+
+
+GNU GENERAL PUBLIC LICENSE
+Version 3, 29 June 2007
+
+ Copyright (C) 2007 Free Software Foundation, Inc. <http://fsf.org/>
+ Everyone is permitted to copy and distribute verbatim copies
+ of this license document, but changing it is not allowed.
+
+                            Preamble
+
+  The GNU General Public License is a free, copyleft license for
+software and other kinds of works.
+
+  The licenses for most software and other practical works are designed
+to take away your freedom to share and change the works.  By contrast,
+the GNU General Public License is intended to guarantee your freedom to
+share and change all versions of a program--to make sure it remains free
+software for all its users.  We, the Free Software Foundation, use the
+GNU General Public License for most of our software; it applies also to
+any other work released this way by its authors.  You can apply it to
+your programs, too.
+
+  When we speak of free software, we are referring to freedom, not
+price.  Our General Public Licenses are designed to make sure that you
+have the freedom to distribute copies of free software (and charge for
+them if you wish), that you receive source code or can get it if you
+want it, that you can change the software or use pieces of it in new
+free programs, and that you know you can do these things.
+
+  To protect your rights, we need to prevent others from denying you
+these rights or asking you to surrender the rights.  Therefore, you have
+certain responsibilities if you distribute copies of the software, or if
+you modify it: responsibilities to respect the freedom of others.
+
+  For example, if you distribute copies of such a program, whether
+gratis or for a fee, you must pass on to the recipients the same
+freedoms that you received.  You must make sure that they, too, receive
+or can get the source code.  And you must show them these terms so they
+know their rights.
+
+  Developers that use the GNU GPL protect your rights with two steps:
+(1) assert copyright on the software, and (2) offer you this License
+giving you legal permission to copy, distribute and/or modify it.
+
+  For the developers' and authors' protection, the GPL clearly explains
+that there is no warranty for this free software.  For both users' and
+authors' sake, the GPL requires that modified versions be marked as
+changed, so that their problems will not be attributed erroneously to
+authors of previous versions.
+
+  Some devices are designed to deny users access to install or run
+modified versions of the software inside them, although the manufacturer
+can do so.  This is fundamentally incompatible with the aim of
+protecting users' freedom to change the software.  The systematic
+pattern of such abuse occurs in the area of products for individuals to
+use, which is precisely where it is most unacceptable.  Therefore, we
+have designed this version of the GPL to prohibit the practice for those
+products.  If such problems arise substantially in other domains, we
+stand ready to extend this provision to those domains in future versions
+of the GPL, as needed to protect the freedom of users.
+
+  Finally, every program is threatened constantly by software patents.
+States should not allow patents to restrict development and use of
+software on general-purpose computers, but in those that do, we wish to
+avoid the special danger that patents applied to a free program could
+make it effectively proprietary.  To prevent this, the GPL assures that
+patents cannot be used to render the program non-free.
+
+  The precise terms and conditions for copying, distribution and
+modification follow.
+
+                       TERMS AND CONDITIONS
+
+  0. Definitions.
+
+  "This License" refers to version 3 of the GNU General Public License.
+
+  "Copyright" also means copyright-like laws that apply to other kinds of
+works, such as semiconductor masks.
+
+  "The Program" refers to any copyrightable work licensed under this
+License.  Each licensee is addressed as "you".  "Licensees" and
+"recipients" may be individuals or organizations.
+
+  To "modify" a work means to copy from or adapt all or part of the work
+in a fashion requiring copyright permission, other than the making of an
+exact copy.  The resulting work is called a "modified version" of the
+earlier work or a work "based on" the earlier work.
+
+  A "covered work" means either the unmodified Program or a work based
+on the Program.
+
+  To "propagate" a work means to do anything with it that, without
+permission, would make you directly or secondarily liable for
+infringement under applicable copyright law, except executing it on a
+computer or modifying a private copy.  Propagation includes copying,
+distribution (with or without modification), making available to the
+public, and in some countries other activities as well.
+
+  To "convey" a work means any kind of propagation that enables other
+parties to make or receive copies.  Mere interaction with a user through
+a computer network, with no transfer of a copy, is not conveying.
+
+  An interactive user interface displays "Appropriate Legal Notices"
+to the extent that it includes a convenient and prominently visible
+feature that (1) displays an appropriate copyright notice, and (2)
+tells the user that there is no warranty for the work (except to the
+extent that warranties are provided), that licensees may convey the
+work under this License, and how to view a copy of this License.  If
+the interface presents a list of user commands or options, such as a
+menu, a prominent item in the list meets this criterion.
+
+  1. Source Code.
+
+  The "source code" for a work means the preferred form of the work
+for making modifications to it.  "Object code" means any non-source
+form of a work.
+
+  A "Standard Interface" means an interface that either is an official
+standard defined by a recognized standards body, or, in the case of
+interfaces specified for a particular programming language, one that
+is widely used among developers working in that language.
+
+  The "System Libraries" of an executable work include anything, other
+than the work as a whole, that (a) is included in the normal form of
+packaging a Major Component, but which is not part of that Major
+Component, and (b) serves only to enable use of the work with that
+Major Component, or to implement a Standard Interface for which an
+implementation is available to the public in source code form.  A
+"Major Component", in this context, means a major essential component
+(kernel, window system, and so on) of the specific operating system
+(if any) on which the executable work runs, or a compiler used to
+produce the work, or an object code interpreter used to run it.
+
+  The "Corresponding Source" for a work in object code form means all
+the source code needed to generate, install, and (for an executable
+work) run the object code and to modify the work, including scripts to
+control those activities.  However, it does not include the work's
+System Libraries, or general-purpose tools or generally available free
+programs which are used unmodified in performing those activities but
+which are not part of the work.  For example, Corresponding Source
+includes interface definition files associated with source files for
+the work, and the source code for shared libraries and dynamically
+linked subprograms that the work is specifically designed to require,
+such as by intimate data communication or control flow between those
+subprograms and other parts of the work.
+
+  The Corresponding Source need not include anything that users
+can regenerate automatically from other parts of the Corresponding
+Source.
+
+  The Corresponding Source for a work in source code form is that
+same work.
+
+  2. Basic Permissions.
+
+  All rights granted under this License are granted for the term of
+copyright on the Program, and are irrevocable provided the stated
+conditions are met.  This License explicitly affirms your unlimited
+permission to run the unmodified Program.  The output from running a
+covered work is covered by this License only if the output, given its
+content, constitutes a covered work.  This License acknowledges your
+rights of fair use or other equivalent, as provided by copyright law.
+
+  You may make, run and propagate covered works that you do not
+convey, without conditions so long as your license otherwise remains
+in force.  You may convey covered works to others for the sole purpose
+of having them make modifications exclusively for you, or provide you
+with facilities for running those works, provided that you comply with
+the terms of this License in conveying all material for which you do
+not control copyright.  Those thus making or running the covered works
+for you must do so exclusively on your behalf, under your direction
+and control, on terms that prohibit them from making any copies of
+your copyrighted material outside their relationship with you.
+
+  Conveying under any other circumstances is permitted solely under
+the conditions stated below.  Sublicensing is not allowed; section 10
+makes it unnecessary.
+
+  3. Protecting Users' Legal Rights From Anti-Circumvention Law.
+
+  No covered work shall be deemed part of an effective technological
+measure under any applicable law fulfilling obligations under article
+11 of the WIPO copyright treaty adopted on 20 December 1996, or
+similar laws prohibiting or restricting circumvention of such
+measures.
+
+  When you convey a covered work, you waive any legal power to forbid
+circumvention of technological measures to the extent such circumvention
+is effected by exercising rights under this License with respect to
+the covered work, and you disclaim any intention to limit operation or
+modification of the work as a means of enforcing, against the work's
+users, your or third parties' legal rights to forbid circumvention of
+technological measures.
+
+  4. Conveying Verbatim Copies.
+
+  You may convey verbatim copies of the Program's source code as you
+receive it, in any medium, provided that you conspicuously and
+appropriately publish on each copy an appropriate copyright notice;
+keep intact all notices stating that this License and any
+non-permissive terms added in accord with section 7 apply to the code;
+keep intact all notices of the absence of any warranty; and give all
+recipients a copy of this License along with the Program.
+
+  You may charge any price or no price for each copy that you convey,
+and you may offer support or warranty protection for a fee.
+
+  5. Conveying Modified Source Versions.
+
+  You may convey a work based on the Program, or the modifications to
+produce it from the Program, in the form of source code under the
+terms of section 4, provided that you also meet all of these conditions:
+
+    a) The work must carry prominent notices stating that you modified
+    it, and giving a relevant date.
+
+    b) The work must carry prominent notices stating that it is
+    released under this License and any conditions added under section
+    7.  This requirement modifies the requirement in section 4 to
+    "keep intact all notices".
+
+    c) You must license the entire work, as a whole, under this
+    License to anyone who comes into possession of a copy.  This
+    License will therefore apply, along with any applicable section 7
+    additional terms, to the whole of the work, and all its parts,
+    regardless of how they are packaged.  This License gives no
+    permission to license the work in any other way, but it does not
+    invalidate such permission if you have separately received it.
+
+    d) If the work has interactive user interfaces, each must display
+    Appropriate Legal Notices; however, if the Program has interactive
+    interfaces that do not display Appropriate Legal Notices, your
+    work need not make them do so.
+
+  A compilation of a covered work with other separate and independent
+works, which are not by their nature extensions of the covered work,
+and which are not combined with it such as to form a larger program,
+in or on a volume of a storage or distribution medium, is called an
+"aggregate" if the compilation and its resulting copyright are not
+used to limit the access or legal rights of the compilation's users
+beyond what the individual works permit.  Inclusion of a covered work
+in an aggregate does not cause this License to apply to the other
+parts of the aggregate.
+
+  6. Conveying Non-Source Forms.
+
+  You may convey a covered work in object code form under the terms
+of sections 4 and 5, provided that you also convey the
+machine-readable Corresponding Source under the terms of this License,
+in one of these ways:
+
+    a) Convey the object code in, or embodied in, a physical product
+    (including a physical distribution medium), accompanied by the
+    Corresponding Source fixed on a durable physical medium
+    customarily used for software interchange.
+
+    b) Convey the object code in, or embodied in, a physical product
+    (including a physical distribution medium), accompanied by a
+    written offer, valid for at least three years and valid for as
+    long as you offer spare parts or customer support for that product
+    model, to give anyone who possesses the object code either (1) a
+    copy of the Corresponding Source for all the software in the
+    product that is covered by this License, on a durable physical
+    medium customarily used for software interchange, for a price no
+    more than your reasonable cost of physically performing this
+    conveying of source, or (2) access to copy the
+    Corresponding Source from a network server at no charge.
+
+    c) Convey individual copies of the object code with a copy of the
+    written offer to provide the Corresponding Source.  This
+    alternative is allowed only occasionally and noncommercially, and
+    only if you received the object code with such an offer, in accord
+    with subsection 6b.
+
+    d) Convey the object code by offering access from a designated
+    place (gratis or for a charge), and offer equivalent access to the
+    Corresponding Source in the same way through the same place at no
+    further charge.  You need not require recipients to copy the
+    Corresponding Source along with the object code.  If the place to
+    copy the object code is a network server, the Corresponding Source
+    may be on a different server (operated by you or a third party)
+    that supports equivalent copying facilities, provided you maintain
+    clear directions next to the object code saying where to find the
+    Corresponding Source.  Regardless of what server hosts the
+    Corresponding Source, you remain obligated to ensure that it is
+    available for as long as needed to satisfy these requirements.
+
+    e) Convey the object code using peer-to-peer transmission, provided
+    you inform other peers where the object code and Corresponding
+    Source of the work are being offered to the general public at no
+    charge under subsection 6d.
+
+  A separable portion of the object code, whose source code is excluded
+from the Corresponding Source as a System Library, need not be
+included in conveying the object code work.
+
+  A "User Product" is either (1) a "consumer product", which means any
+tangible personal property which is normally used for personal, family,
+or household purposes, or (2) anything designed or sold for incorporation
+into a dwelling.  In determining whether a product is a consumer product,
+doubtful cases shall be resolved in favor of coverage.  For a particular
+product received by a particular user, "normally used" refers to a
+typical or common use of that class of product, regardless of the status
+of the particular user or of the way in which the particular user
+actually uses, or expects or is expected to use, the product.  A product
+is a consumer product regardless of whether the product has substantial
+commercial, industrial or non-consumer uses, unless such uses represent
+the only significant mode of use of the product.
+
+  "Installation Information" for a User Product means any methods,
+procedures, authorization keys, or other information required to install
+and execute modified versions of a covered work in that User Product from
+a modified version of its Corresponding Source.  The information must
+suffice to ensure that the continued functioning of the modified object
+code is in no case prevented or interfered with solely because
+modification has been made.
+
+  If you convey an object code work under this section in, or with, or
+specifically for use in, a User Product, and the conveying occurs as
+part of a transaction in which the right of possession and use of the
+User Product is transferred to the recipient in perpetuity or for a
+fixed term (regardless of how the transaction is characterized), the
+Corresponding Source conveyed under this section must be accompanied
+by the Installation Information.  But this requirement does not apply
+if neither you nor any third party retains the ability to install
+modified object code on the User Product (for example, the work has
+been installed in ROM).
+
+  The requirement to provide Installation Information does not include a
+requirement to continue to provide support service, warranty, or updates
+for a work that has been modified or installed by the recipient, or for
+the User Product in which it has been modified or installed.  Access to a
+network may be denied when the modification itself materially and
+adversely affects the operation of the network or violates the rules and
+protocols for communication across the network.
+
+  Corresponding Source conveyed, and Installation Information provided,
+in accord with this section must be in a format that is publicly
+documented (and with an implementation available to the public in
+source code form), and must require no special password or key for
+unpacking, reading or copying.
+
+  7. Additional Terms.
+
+  "Additional permissions" are terms that supplement the terms of this
+License by making exceptions from one or more of its conditions.
+Additional permissions that are applicable to the entire Program shall
+be treated as though they were included in this License, to the extent
+that they are valid under applicable law.  If additional permissions
+apply only to part of the Program, that part may be used separately
+under those permissions, but the entire Program remains governed by
+this License without regard to the additional permissions.
+
+  When you convey a copy of a covered work, you may at your option
+remove any additional permissions from that copy, or from any part of
+it.  (Additional permissions may be written to require their own
+removal in certain cases when you modify the work.)  You may place
+additional permissions on material, added by you to a covered work,
+for which you have or can give appropriate copyright permission.
+
+  Notwithstanding any other provision of this License, for material you
+add to a covered work, you may (if authorized by the copyright holders of
+that material) supplement the terms of this License with terms:
+
+    a) Disclaiming warranty or limiting liability differently from the
+    terms of sections 15 and 16 of this License; or
+
+    b) Requiring preservation of specified reasonable legal notices or
+    author attributions in that material or in the Appropriate Legal
+    Notices displayed by works containing it; or
+
+    c) Prohibiting misrepresentation of the origin of that material, or
+    requiring that modified versions of such material be marked in
+    reasonable ways as different from the original version; or
+
+    d) Limiting the use for publicity purposes of names of licensors or
+    authors of the material; or
+
+    e) Declining to grant rights under trademark law for use of some
+    trade names, trademarks, or service marks; or
+
+    f) Requiring indemnification of licensors and authors of that
+    material by anyone who conveys the material (or modified versions of
+    it) with contractual assumptions of liability to the recipient, for
+    any liability that these contractual assumptions directly impose on
+    those licensors and authors.
+
+  All other non-permissive additional terms are considered "further
+restrictions" within the meaning of section 10.  If the Program as you
+received it, or any part of it, contains a notice stating that it is
+governed by this License along with a term that is a further
+restriction, you may remove that term.  If a license document contains
+a further restriction but permits relicensing or conveying under this
+License, you may add to a covered work material governed by the terms
+of that license document, provided that the further restriction does
+not survive such relicensing or conveying.
+
+  If you add terms to a covered work in accord with this section, you
+must place, in the relevant source files, a statement of the
+additional terms that apply to those files, or a notice indicating
+where to find the applicable terms.
+
+  Additional terms, permissive or non-permissive, may be stated in the
+form of a separately written license, or stated as exceptions;
+the above requirements apply either way.
+
+  8. Termination.
+
+  You may not propagate or modify a covered work except as expressly
+provided under this License.  Any attempt otherwise to propagate or
+modify it is void, and will automatically terminate your rights under
+this License (including any patent licenses granted under the third
+paragraph of section 11).
+
+  However, if you cease all violation of this License, then your
+license from a particular copyright holder is reinstated (a)
+provisionally, unless and until the copyright holder explicitly and
+finally terminates your license, and (b) permanently, if the copyright
+holder fails to notify you of the violation by some reasonable means
+prior to 60 days after the cessation.
+
+  Moreover, your license from a particular copyright holder is
+reinstated permanently if the copyright holder notifies you of the
+violation by some reasonable means, this is the first time you have
+received notice of violation of this License (for any work) from that
+copyright holder, and you cure the violation prior to 30 days after
+your receipt of the notice.
+
+  Termination of your rights under this section does not terminate the
+licenses of parties who have received copies or rights from you under
+this License.  If your rights have been terminated and not permanently
+reinstated, you do not qualify to receive new licenses for the same
+material under section 10.
+
+  9. Acceptance Not Required for Having Copies.
+
+  You are not required to accept this License in order to receive or
+run a copy of the Program.  Ancillary propagation of a covered work
+occurring solely as a consequence of using peer-to-peer transmission
+to receive a copy likewise does not require acceptance.  However,
+nothing other than this License grants you permission to propagate or
+modify any covered work.  These actions infringe copyright if you do
+not accept this License.  Therefore, by modifying or propagating a
+covered work, you indicate your acceptance of this License to do so.
+
+  10. Automatic Licensing of Downstream Recipients.
+
+  Each time you convey a covered work, the recipient automatically
+receives a license from the original licensors, to run, modify and
+propagate that work, subject to this License.  You are not responsible
+for enforcing compliance by third parties with this License.
+
+  An "entity transaction" is a transaction transferring control of an
+organization, or substantially all assets of one, or subdividing an
+organization, or merging organizations.  If propagation of a covered
+work results from an entity transaction, each party to that
+transaction who receives a copy of the work also receives whatever
+licenses to the work the party's predecessor in interest had or could
+give under the previous paragraph, plus a right to possession of the
+Corresponding Source of the work from the predecessor in interest, if
+the predecessor has it or can get it with reasonable efforts.
+
+  You may not impose any further restrictions on the exercise of the
+rights granted or affirmed under this License.  For example, you may
+not impose a license fee, royalty, or other charge for exercise of
+rights granted under this License, and you may not initiate litigation
+(including a cross-claim or counterclaim in a lawsuit) alleging that
+any patent claim is infringed by making, using, selling, offering for
+sale, or importing the Program or any portion of it.
+
+  11. Patents.
+
+  A "contributor" is a copyright holder who authorizes use under this
+License of the Program or a work on which the Program is based.  The
+work thus licensed is called the contributor's "contributor version".
+
+  A contributor's "essential patent claims" are all patent claims
+owned or controlled by the contributor, whether already acquired or
+hereafter acquired, that would be infringed by some manner, permitted
+by this License, of making, using, or selling its contributor version,
+but do not include claims that would be infringed only as a
+consequence of further modification of the contributor version.  For
+purposes of this definition, "control" includes the right to grant
+patent sublicenses in a manner consistent with the requirements of
+this License.
+
+  Each contributor grants you a non-exclusive, worldwide, royalty-free
+patent license under the contributor's essential patent claims, to
+make, use, sell, offer for sale, import and otherwise run, modify and
+propagate the contents of its contributor version.
+
+  In the following three paragraphs, a "patent license" is any express
+agreement or commitment, however denominated, not to enforce a patent
+(such as an express permission to practice a patent or covenant not to
+sue for patent infringement).  To "grant" such a patent license to a
+party means to make such an agreement or commitment not to enforce a
+patent against the party.
+
+  If you convey a covered work, knowingly relying on a patent license,
+and the Corresponding Source of the work is not available for anyone
+to copy, free of charge and under the terms of this License, through a
+publicly available network server or other readily accessible means,
+then you must either (1) cause the Corresponding Source to be so
+available, or (2) arrange to deprive yourself of the benefit of the
+patent license for this particular work, or (3) arrange, in a manner
+consistent with the requirements of this License, to extend the patent
+license to downstream recipients.  "Knowingly relying" means you have
+actual knowledge that, but for the patent license, your conveying the
+covered work in a country, or your recipient's use of the covered work
+in a country, would infringe one or more identifiable patents in that
+country that you have reason to believe are valid.
+
+  If, pursuant to or in connection with a single transaction or
+arrangement, you convey, or propagate by procuring conveyance of, a
+covered work, and grant a patent license to some of the parties
+receiving the covered work authorizing them to use, propagate, modify
+or convey a specific copy of the covered work, then the patent license
+you grant is automatically extended to all recipients of the covered
+work and works based on it.
+
+  A patent license is "discriminatory" if it does not include within
+the scope of its coverage, prohibits the exercise of, or is
+conditioned on the non-exercise of one or more of the rights that are
+specifically granted under this License.  You may not convey a covered
+work if you are a party to an arrangement with a third party that is
+in the business of distributing software, under which you make payment
+to the third party based on the extent of your activity of conveying
+the work, and under which the third party grants, to any of the
+parties who would receive the covered work from you, a discriminatory
+patent license (a) in connection with copies of the covered work
+conveyed by you (or copies made from those copies), or (b) primarily
+for and in connection with specific products or compilations that
+contain the covered work, unless you entered into that arrangement,
+or that patent license was granted, prior to 28 March 2007.
+
+  Nothing in this License shall be construed as excluding or limiting
+any implied license or other defenses to infringement that may
+otherwise be available to you under applicable patent law.
+
+  12. No Surrender of Others' Freedom.
+
+  If conditions are imposed on you (whether by court order, agreement or
+otherwise) that contradict the conditions of this License, they do not
+excuse you from the conditions of this License.  If you cannot convey a
+covered work so as to satisfy simultaneously your obligations under this
+License and any other pertinent obligations, then as a consequence you may
+not convey it at all.  For example, if you agree to terms that obligate you
+to collect a royalty for further conveying from those to whom you convey
+the Program, the only way you could satisfy both those terms and this
+License would be to refrain entirely from conveying the Program.
+
+  13. Use with the GNU Affero General Public License.
+
+  Notwithstanding any other provision of this License, you have
+permission to link or combine any covered work with a work licensed
+under version 3 of the GNU Affero General Public License into a single
+combined work, and to convey the resulting work.  The terms of this
+License will continue to apply to the part which is the covered work,
+but the special requirements of the GNU Affero General Public License,
+section 13, concerning interaction through a network will apply to the
+combination as such.
+
+  14. Revised Versions of this License.
+
+  The Free Software Foundation may publish revised and/or new versions of
+the GNU General Public License from time to time.  Such new versions will
+be similar in spirit to the present version, but may differ in detail to
+address new problems or concerns.
+
+  Each version is given a distinguishing version number.  If the
+Program specifies that a certain numbered version of the GNU General
+Public License "or any later version" applies to it, you have the
+option of following the terms and conditions either of that numbered
+version or of any later version published by the Free Software
+Foundation.  If the Program does not specify a version number of the
+GNU General Public License, you may choose any version ever published
+by the Free Software Foundation.
+
+  If the Program specifies that a proxy can decide which future
+versions of the GNU General Public License can be used, that proxy's
+public statement of acceptance of a version permanently authorizes you
+to choose that version for the Program.
+
+  Later license versions may give you additional or different
+permissions.  However, no additional obligations are imposed on any
+author or copyright holder as a result of your choosing to follow a
+later version.
+
+  15. Disclaimer of Warranty.
+
+  THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY
+APPLICABLE LAW.  EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT
+HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY
+OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO,
+THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+PURPOSE.  THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM
+IS WITH YOU.  SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF
+ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
+
+  16. Limitation of Liability.
+
+  IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
+WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS
+THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY
+GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE
+USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF
+DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD
+PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS),
+EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF
+SUCH DAMAGES.
+
+  17. Interpretation of Sections 15 and 16.
+
+  If the disclaimer of warranty and limitation of liability provided
+above cannot be given local legal effect according to their terms,
+reviewing courts shall apply local law that most closely approximates
+an absolute waiver of all civil liability in connection with the
+Program, unless a warranty or assumption of liability accompanies a
+copy of the Program in return for a fee.
+
+                     END OF TERMS AND CONDITIONS
+
+            How to Apply These Terms to Your New Programs
+
+  If you develop a new program, and you want it to be of the greatest
+possible use to the public, the best way to achieve this is to make it
+free software which everyone can redistribute and change under these terms.
+
+  To do so, attach the following notices to the program.  It is safest
+to attach them to the start of each source file to most effectively
+state the exclusion of warranty; and each file should have at least
+the "copyright" line and a pointer to where the full notice is found.
+
+    <one line to give the program's name and a brief idea of what it does.>
+    Copyright (C) <year>  <name of author>
+
+    This program is free software: you can redistribute it and/or modify
+    it under the terms of the GNU General Public License as published by
+    the Free Software Foundation, either version 3 of the License, or
+    (at your option) any later version.
+
+    This program is distributed in the hope that it will be useful,
+    but WITHOUT ANY WARRANTY; without even the implied warranty of
+    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+    GNU General Public License for more details.
+
+    You should have received a copy of the GNU General Public License
+    along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+Also add information on how to contact you by electronic and paper mail.
+
+  If the program does terminal interaction, make it output a short
+notice like this when it starts in an interactive mode:
+
+    <program>  Copyright (C) <year>  <name of author>
+    This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
+    This is free software, and you are welcome to redistribute it
+    under certain conditions; type `show c' for details.
+
+The hypothetical commands `show w' and `show c' should show the appropriate
+parts of the General Public License.  Of course, your program's commands
+might be different; for a GUI interface, you would use an "about box".
+
+  You should also get your employer (if you work as a programmer) or school,
+if any, to sign a "copyright disclaimer" for the program, if necessary.
+For more information on this, and how to apply and follow the GNU GPL, see
+<http://www.gnu.org/licenses/>.
+
+  The GNU General Public License does not permit incorporating your program
+into proprietary programs.  If your program is a subroutine library, you
+may consider it more useful to permit linking proprietary applications with
+the library.  If this is what you want to do, use the GNU Lesser General
+Public License instead of this License.  But first, please read
+<http://www.gnu.org/philosophy/why-not-lgpl.html>.
+
+
+-----------------------------------------------------------------------------
+
diff --git a/bcftools/README b/bcftools/README

new file mode 100644 (file)

index 0000000..5cb1bbd
--- /dev/null
+++ b/bcftools/README
@@ -0,0 +1,5 @@
+BCFtools implements utilities for variant calling (in conjunction with
+SAMtools) and manipulating VCF and BCF files.  The program is intended
+to replace the Perl-based tools from vcftools.
+
+See INSTALL for building and installation instructions.
diff --git a/bcftools/bam_sample.c b/bcftools/bam_sample.c

index 66f572948a07bc5b4494a380bfd20e49c42f3500..a6da9432f8bf5df82f743bd1703f863b9bdb2519 100644 (file)
--- a/bcftools/bam_sample.c
+++ b/bcftools/bam_sample.c
@@ -1,7 +1,7 @@
  /*  bam_sample.c -- group data by sample.
  
      Copyright (C) 2010, 2011 Broad Institute.
-    Copyright (C) 2013, 2016 Genome Research Ltd.
+    Copyright (C) 2013, 2016-2018 Genome Research Ltd.
  
      Author: Heng Li <lh3@sanger.ac.uk>, Petr Danecek <pd3@sanger.ac.uk>
  
@@ -167,10 +167,14 @@ int bam_smpl_add_bam(bam_smpl_t *bsmpl, char *bam_hdr, const char *fname)
      void *bam_smpls = khash_str2int_init();
      int first_smpl = -1, nskipped = 0;
      const char *p = bam_hdr, *q, *r;
-    while ((q = strstr(p, "@RG")) != 0) 
+    while (p != NULL && (q = strstr(p, "@RG")) != 0)
      {
+        char *eol = strchr(q + 3, '\n');
+        if (q > bam_hdr && *(q - 1) != '\n') { // @RG must be at start of line
+            p = eol;
+            continue;
+        }
          p = q + 3;
-        r = q = 0;
          if ((q = strstr(p, "\tID:")) != 0) q += 4;
          if ((r = strstr(p, "\tSM:")) != 0) r += 4;
          if (r && q)
@@ -220,7 +224,7 @@ int bam_smpl_add_bam(bam_smpl_t *bsmpl, char *bam_hdr, const char *fname)
          }
          else
              break;
-        p = q > r ? q : r;
+        p = eol;
      }
      int nsmpls = khash_str2int_size(bam_smpls);
      khash_str2int_destroy_free(bam_smpls);
@@ -234,6 +238,7 @@ int bam_smpl_add_bam(bam_smpl_t *bsmpl, char *bam_hdr, const char *fname)
      {
          // no suitable read group is available in this bam: ignore the whole file.
          free(file->fname);
+        if ( file->rg2idx ) khash_str2int_destroy_free(file->rg2idx);
          bsmpl->nfiles--;
          return -1;
      }
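
Reviewer note on the bam_sample.c hunks above: the patched loop only accepts an "@RG" match at the start of a header line, and it resumes scanning from the end of the current line instead of from the ID/SM pointers. The following is a minimal sketch of that scanning pattern on its own. The helper name `count_rg_lines` is hypothetical and not part of bcftools, and the sketch only counts lines rather than extracting ID and SM.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, for illustration only: count the header lines
 * that begin with "@RG". Matches in the middle of a line (for example
 * inside a @CO comment) are skipped. */
static int count_rg_lines(const char *hdr)
{
    assert(hdr != NULL);
    int n = 0;
    const char *p = hdr, *q;
    while (p != NULL && (q = strstr(p, "@RG")) != NULL)
    {
        /* End of the line holding this match; NULL on the last line. */
        const char *eol = strchr(q + 3, '\n');
        /* Accept the match only at the very start or right after '\n'. */
        if (q == hdr || q[-1] == '\n') n++;
        /* Resume at the line end; a NULL value terminates the loop. */
        p = eol;
    }
    return n;
}
```

With a header such as "@HD\tVN:1.5\n@RG\tID:a\tSM:s1\n@CO\tsee @RG docs\n@RG\tID:b\tSM:s2\n" the sketch counts the two real read-group lines and ignores the mention inside the comment line. The `p != NULL` guard matters because `strchr` returns NULL when the final line has no trailing newline.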
diff --git a/bcftools/bam_sample.c.pysam.c b/bcftools/bam_sample.c.pysam.c

index c25358fe4a3f1f721d4bc311b7c409312bc51e8c..565cfc130e2921f3667db5fab4eb4ae0be2e4a37 100644 (file)
--- a/bcftools/bam_sample.c.pysam.c
+++ b/bcftools/bam_sample.c.pysam.c
@@ -3,7 +3,7 @@
  /*  bam_sample.c -- group data by sample.
  
      Copyright (C) 2010, 2011 Broad Institute.
-    Copyright (C) 2013, 2016 Genome Research Ltd.
+    Copyright (C) 2013, 2016-2018 Genome Research Ltd.
  
      Author: Heng Li <lh3@sanger.ac.uk>, Petr Danecek <pd3@sanger.ac.uk>
  
@@ -169,10 +169,14 @@ int bam_smpl_add_bam(bam_smpl_t *bsmpl, char *bam_hdr, const char *fname)
      void *bam_smpls = khash_str2int_init();
      int first_smpl = -1, nskipped = 0;
      const char *p = bam_hdr, *q, *r;
-    while ((q = strstr(p, "@RG")) != 0) 
+    while (p != NULL && (q = strstr(p, "@RG")) != 0)
      {
+        char *eol = strchr(q + 3, '\n');
+        if (q > bam_hdr && *(q - 1) != '\n') { // @RG must be at start of line
+            p = eol;
+            continue;
+        }
          p = q + 3;
-        r = q = 0;
          if ((q = strstr(p, "\tID:")) != 0) q += 4;
          if ((r = strstr(p, "\tSM:")) != 0) r += 4;
          if (r && q)
@@ -222,7 +226,7 @@ int bam_smpl_add_bam(bam_smpl_t *bsmpl, char *bam_hdr, const char *fname)
          }
          else
              break;
-        p = q > r ? q : r;
+        p = eol;
      }
      int nsmpls = khash_str2int_size(bam_smpls);
      khash_str2int_destroy_free(bam_smpls);
@@ -236,6 +240,7 @@ int bam_smpl_add_bam(bam_smpl_t *bsmpl, char *bam_hdr, const char *fname)
      {
          // no suitable read group is available in this bam: ignore the whole file.
          free(file->fname);
+        if ( file->rg2idx ) khash_str2int_destroy_free(file->rg2idx);
          bsmpl->nfiles--;
          return -1;
      }
diff --git a/bcftools/bcftools.h b/bcftools/bcftools.h

index d3ba814197826231d28173e63731a3f0809625f4..4e23eff124ca26631008c092d12cbcecda9590e4 100644 (file)
--- a/bcftools/bcftools.h
+++ b/bcftools/bcftools.h
@@ -39,7 +39,7 @@ THE SOFTWARE.  */
  #define FT_STDIN (1<<3)
  
  char *bcftools_version(void);
-void error(const char *format, ...) HTS_NORETURN;
+void error(const char *format, ...) HTS_NORETURN HTS_FORMAT(HTS_PRINTF_FMT, 1, 2);
  void bcf_hdr_append_version(bcf_hdr_t *hdr, int argc, char **argv, const char *cmd);
  const char *hts_bcf_wmode(int file_type);
  
diff --git a/bcftools/bcftools.pysam.c b/bcftools/bcftools.pysam.c

index 63dfed5bed00e000f4604c4ad5f4f32186e11b7d..e3e251ef499b3849988617cb8748937fba4a6b39 100644 (file)
--- a/bcftools/bcftools.pysam.c
+++ b/bcftools/bcftools.pysam.c
@@ -54,6 +54,12 @@ void bcftools_unset_stdout(void)
    bcftools_stdout_fileno = STDOUT_FILENO;
  }
  
+int bcftools_puts(const char *s)
+{
+  if (fputs(s, bcftools_stdout) == EOF) return EOF;
+  return putc('\n', bcftools_stdout);
+}
+
  void bcftools_set_optind(int val)
  {
    // setting this in cython via 
diff --git a/bcftools/bcftools.pysam.h b/bcftools/bcftools.pysam.h

index 4c3806ccf0bf12340ad2a6adae957ffd84a5407f..92ab370a2e30aeb4d5bb9711301eec805e8d21df 100644 (file)
--- a/bcftools/bcftools.pysam.h
+++ b/bcftools/bcftools.pysam.h
@@ -1,5 +1,5 @@
-#ifndef BCFTOOLS_PYSAM_H
-#define BCFTOOLS_PYSAM_H
+#ifndef PYSAM_H
+#define PYSAM_H
  
  #include "stdio.h"
  
@@ -38,6 +38,8 @@ void bcftools_unset_stderr(void);
   */
  void bcftools_unset_stdout(void);
  
+int bcftools_puts(const char *s);
+
  int bcftools_dispatch(int argc, char *argv[]);
  
  void bcftools_set_optind(int);
diff --git a/bcftools/bin.c b/bcftools/bin.c

index b558b208a3be14997335494b0dc957dec9ac5bcb..95a2be156b33f25d6ae4a5603ec9f0a777376593 100644 (file)
--- a/bcftools/bin.c
+++ b/bcftools/bin.c
@@ -48,9 +48,9 @@ bin_t *bin_init(const char *list_def, float min, float max)
      {
          char *tmp;
          bin->bins[i] = strtod(list[i],&tmp);
-        if ( !tmp ) error("Could not parse %s: %s\n", list_def, list[i]);
+        if ( *tmp ) error("Could not parse %s: %s\n", list_def, list[i]);
          if ( min!=max && (bin->bins[i]<min || bin->bins[i]>max) )
-            error("Expected values from the interval [%f,%f], found %s\n", list[i]);
+            error("Expected values from the interval [%f,%f], found %s\n", min, max, list[i]);
          free(list[i]); 
      }
      free(list);
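
Reviewer note on the bin.c hunk above: `strtod` always stores a valid pointer in its end-pointer argument, so the old `!tmp` test could never fire; the patch checks the character the pointer refers to instead. Here is a small sketch of the same validation as a standalone function. The name `parse_double` is hypothetical, and the extra check for a token with no digits at all is an addition of this sketch, not something the patch does.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper, for illustration only: parse an entire token as
 * a double. Returns 0 on success and -1 if the token is not a number. */
static int parse_double(const char *token, double *value)
{
    assert(token != NULL && value != NULL);
    char *end;
    *value = strtod(token, &end);
    if (end == token) return -1;  /* nothing was converted (extra check) */
    if (*end != '\0') return -1;  /* trailing characters, as in the patch */
    return 0;
}
```

For example, "0.5" parses successfully, while "0.5x", "abc" and the empty string are rejected.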
diff --git a/bcftools/bin.c.pysam.c b/bcftools/bin.c.pysam.c

index 4880231b6c39c2605b53177a3037195d314f5a81..426ef455515765e7dd40f5cad17bf5b17194722d 100644 (file)
--- a/bcftools/bin.c.pysam.c
+++ b/bcftools/bin.c.pysam.c
@@ -50,9 +50,9 @@ bin_t *bin_init(const char *list_def, float min, float max)
      {
          char *tmp;
          bin->bins[i] = strtod(list[i],&tmp);
-        if ( !tmp ) error("Could not parse %s: %s\n", list_def, list[i]);
+        if ( *tmp ) error("Could not parse %s: %s\n", list_def, list[i]);
          if ( min!=max && (bin->bins[i]<min || bin->bins[i]>max) )
-            error("Expected values from the interval [%f,%f], found %s\n", list[i]);
+            error("Expected values from the interval [%f,%f], found %s\n", min, max, list[i]);
          free(list[i]); 
      }
      free(list);
diff --git a/bcftools/consensus.c b/bcftools/consensus.c

index d67a052197e6c69c4cbb59ac948a2dfee5d48be9..2c4a9040e8362f03ea46a1a1099a6c5a1acea091 100644 (file)
--- a/bcftools/consensus.c
+++ b/bcftools/consensus.c
@@ -36,6 +36,7 @@
  #include <htslib/kstring.h>
  #include <htslib/synced_bcf_reader.h>
  #include <htslib/kseq.h>
+#include <htslib/bgzf.h>
  #include "regidx.h"
  #include "bcftools.h"
  #include "rbuf.h"
@@ -73,6 +74,8 @@ typedef struct
      int fa_length;      // region's length in the original sequence (in case end_pos not provided in the FASTA header)
      int fa_case;        // output upper case or lower case?
      int fa_src_pos;     // last genomic coordinate read from the input fasta (0-based)
+    char prev_base;     // this is only to validate the REF allele in the VCF - the modified fa_buf cannot be used for inserts following deletions, see 600#issuecomment-383186778
+    int prev_base_pos;  // the position of prev_base
  
      rbuf_t vcf_rbuf;
      bcf1_t **vcf_buf;
@@ -96,7 +99,7 @@ typedef struct
      FILE *fp_chain;
      char **argv;
      int argc, output_iupac, haplotype, allele, isample;
-    char *fname, *ref_fname, *sample, *output_fname, *mask_fname, *chain_fname;
+    char *fname, *ref_fname, *sample, *output_fname, *mask_fname, *chain_fname, missing_allele;
  }
  args_t;
  
@@ -237,7 +240,7 @@ static void init_data(args_t *args)
          if ( ! args->fp_out ) error("Failed to create %s: %s\n", args->output_fname, strerror(errno));
      }
      else args->fp_out = stdout;
-    if ( args->isample<0 ) fprintf(stderr,"Note: the --sample option not given, applying all records\n");
+    if ( args->isample<0 ) fprintf(stderr,"Note: the --sample option not given, applying all records regardless of the genotype\n");
      if ( args->filter_str )
          args->filter = filter_init(args->hdr, args->filter_str);
  }
@@ -264,7 +267,7 @@ static void init_region(args_t *args, char *line)
      char *ss, *se = line;
      while ( *se && !isspace(*se) && *se!=':' ) se++;
      int from = 0, to = 0;
-    char tmp, *tmp_ptr = NULL;
+    char tmp = 0, *tmp_ptr = NULL;
      if ( *se )
      {
          tmp = *se; *se = 0; tmp_ptr = se;
@@ -280,10 +283,12 @@ static void init_region(args_t *args, char *line)
              else to--;
          }
      }
+    free(args->chr);
      args->chr = strdup(line);
      args->rid = bcf_hdr_name2id(args->hdr,line);
      if ( args->rid<0 ) fprintf(stderr,"Warning: Sequence \"%s\" not in %s\n", line,args->fname);
-    args->fa_buf.l = 0;
+    args->prev_base_pos = -1;
+    args->fa_buf.l  = 0;
      args->fa_length = 0;
      args->fa_end_pos = to;
      args->fa_ori_pos = from;
@@ -370,13 +375,10 @@ static void flush_fa_buffer(args_t *args, int len)
  }
  static void apply_variant(args_t *args, bcf1_t *rec)
  {
-    if ( rec->n_allele==1 ) return;
+    static int warned_haplotype = 0;
+
+    if ( rec->n_allele==1 && !args->missing_allele ) return;
  
-    if ( rec->pos <= args->fa_frz_pos )
-    {
-        fprintf(stderr,"The site %s:%d overlaps with another variant, skipping...\n", bcf_seqname(args->hdr,rec),rec->pos+1);
-        return;
-    }
      if ( args->mask )
      {
          char *chr = (char*)bcf_hdr_id2name(args->hdr,args->rid);
@@ -395,35 +397,73 @@ static void apply_variant(args_t *args, bcf1_t *rec)
          if ( fmt->type!=BCF_BT_INT8 )
              error("Todo: GT field represented with BCF_BT_INT8, too many alleles at %s:%d?\n",bcf_seqname(args->hdr,rec),rec->pos+1);
          uint8_t *ptr = fmt->p + fmt->size*args->isample;
-
          if ( args->haplotype )
          {
-            if ( args->haplotype > fmt->n ) error("Can't apply %d-th haplotype at %s:%d\n", args->haplotype,bcf_seqname(args->hdr,rec),rec->pos+1);
-            ialt = ptr[args->haplotype-1];
-            if ( bcf_gt_is_missing(ialt) || ialt==bcf_int32_vector_end ) return;
-            ialt = bcf_gt_allele(ialt);
+            if ( args->haplotype > fmt->n )
+            {
+                if ( bcf_gt_is_missing(ptr[fmt->n-1]) || bcf_gt_is_missing(ptr[0]) )
+                {
+                    if ( !args->missing_allele ) return;
+                    ialt = -1;
+                }
+                else 
+                {
+                    if ( !warned_haplotype )
+                    {
+                        fprintf(stderr, "Can't apply %d-th haplotype at %s:%d. (This warning is printed only once.)\n", args->haplotype,bcf_seqname(args->hdr,rec),rec->pos+1);
+                        warned_haplotype = 1;
+                    }
+                    return;
+                }
+            }
+            else
+            {
+                ialt = (int8_t)ptr[args->haplotype-1];
+                if ( bcf_gt_is_missing(ialt) || ialt==bcf_int8_vector_end )
+                {
+                    if ( !args->missing_allele ) return;
+                    ialt = -1;
+                }
+                else 
+                    ialt = bcf_gt_allele(ialt);
+            }
          }
          else if ( args->output_iupac ) 
          {
              ialt = ptr[0];
-            if ( bcf_gt_is_missing(ialt) || ialt==bcf_int32_vector_end ) return;
-            ialt = bcf_gt_allele(ialt);
+            if ( bcf_gt_is_missing(ialt) || ialt==bcf_int32_vector_end )
+            {
+                if ( !args->missing_allele ) return;
+                ialt = -1;
+            }
+            else
+                ialt = bcf_gt_allele(ialt);
  
              int jalt;
              if ( fmt->n>1 )
              {
                  jalt = ptr[1];
-                if ( bcf_gt_is_missing(jalt) || jalt==bcf_int32_vector_end ) jalt = ialt;
-                else jalt = bcf_gt_allele(jalt);
+                if ( bcf_gt_is_missing(jalt) )
+                {
+                    if ( !args->missing_allele ) return;
+                    ialt = -1;
+                }
+                else if ( jalt==bcf_int32_vector_end ) jalt = ialt;
+                else
+                    jalt = bcf_gt_allele(jalt);
              }
              else jalt = ialt;
-            if ( rec->n_allele <= ialt || rec->n_allele <= jalt ) error("Invalid VCF, too few ALT alleles at %s:%d\n", bcf_seqname(args->hdr,rec),rec->pos+1);
-            if ( ialt!=jalt && !rec->d.allele[ialt][1] && !rec->d.allele[jalt][1] ) // is this a het snp?
+
+            if ( ialt>=0 )
              {
-                char ial = rec->d.allele[ialt][0];
-                char jal = rec->d.allele[jalt][0];
-                if ( !ialt ) ialt = jalt;   // only ialt is used, make sure 0/1 is not ignored
-                rec->d.allele[ialt][0] = gt2iupac(ial,jal);
+                if ( rec->n_allele <= ialt || rec->n_allele <= jalt ) error("Invalid VCF, too few ALT alleles at %s:%d\n", bcf_seqname(args->hdr,rec),rec->pos+1);
+                if ( ialt!=jalt && !rec->d.allele[ialt][1] && !rec->d.allele[jalt][1] ) // is this a het snp?
+                {
+                    char ial = rec->d.allele[ialt][0];
+                    char jal = rec->d.allele[jalt][0];
+                    if ( !ialt ) ialt = jalt;   // only ialt is used, make sure 0/1 is not ignored
+                    rec->d.allele[ialt][0] = gt2iupac(ial,jal);
+                }
              }
          }
          else
@@ -431,8 +471,13 @@ static void apply_variant(args_t *args, bcf1_t *rec)
              int is_hom = 1;
              for (i=0; i<fmt->n; i++)
              {
-                if ( bcf_gt_is_missing(ptr[i]) ) return;  // ignore missing or half-missing genotypes
-                if ( ptr[i]==bcf_int32_vector_end ) break;
+                if ( bcf_gt_is_missing(ptr[i]) )
+                {
+                    if ( !args->missing_allele ) return;  // ignore missing or half-missing genotypes
+                    ialt = -1;
+                    break;
+                }
+                if ( ptr[i]==(uint8_t)bcf_int8_vector_end ) break;
                  ialt = bcf_gt_allele(ptr[i]);
                  if ( i>0 && ialt!=bcf_gt_allele(ptr[i-1]) ) { is_hom = 0; break; }
              }
@@ -441,7 +486,7 @@ static void apply_variant(args_t *args, bcf1_t *rec)
                  int prev_len = 0, jalt;
                  for (i=0; i<fmt->n; i++)
                  {
-                    if ( ptr[i]==bcf_int32_vector_end ) break;
+                    if ( ptr[i]==(uint8_t)bcf_int8_vector_end ) break;
                      jalt = bcf_gt_allele(ptr[i]);
                      if ( rec->n_allele <= jalt ) error("Broken VCF, too few alts at %s:%d\n", bcf_seqname(args->hdr,rec),rec->pos+1);
                      if ( args->allele & (PICK_LONG|PICK_SHORT) )
@@ -474,6 +519,25 @@ static void apply_variant(args_t *args, bcf1_t *rec)
          rec->d.allele[1][0] = gt2iupac(ial,jal);
      }
  
+    if ( rec->n_allele==1 && ialt!=-1 ) return; // non-missing reference
+    if ( ialt==-1 )
+    {
+        char alleles[4];
+        alleles[0] = rec->d.allele[0][0];
+        alleles[1] = ',';
+        alleles[2] = args->missing_allele;
+        alleles[3] = 0;
+        bcf_update_alleles_str(args->hdr, rec, alleles);
+        ialt = 1;
+    }
+
+    // Overlapping variant? Can be still OK iff this is an insertion
+    if ( rec->pos <= args->fa_frz_pos && (rec->pos!=args->fa_frz_pos || rec->d.allele[0][0]!=rec->d.allele[ialt][0]) )
+    {
+        fprintf(stderr,"The site %s:%d overlaps with another variant, skipping...\n", bcf_seqname(args->hdr,rec),rec->pos+1);
+        return;
+    }
+
      int len_diff = 0, alen = 0;
      int idx = rec->pos - args->fa_ori_pos + args->fa_mod_off;
      if ( idx<0 )
@@ -492,7 +556,7 @@ static void apply_variant(args_t *args, bcf1_t *rec)
          }
      }
      if ( idx>=args->fa_buf.l ) 
-        error("FIXME: %s:%d .. idx=%d, ori_pos=%d, len=%d, off=%d\n",bcf_seqname(args->hdr,rec),rec->pos+1,idx,args->fa_ori_pos,args->fa_buf.l,args->fa_mod_off);
+        error("FIXME: %s:%d .. idx=%d, ori_pos=%d, len=%"PRIu64", off=%d\n",bcf_seqname(args->hdr,rec),rec->pos+1,idx,args->fa_ori_pos,(uint64_t)args->fa_buf.l,args->fa_mod_off);
  
      // sanity check the reference base
      if ( rec->d.allele[ialt][0]=='<' )
@@ -506,21 +570,37 @@ static void apply_variant(args_t *args, bcf1_t *rec)
      }
      else if ( strncasecmp(rec->d.allele[0],args->fa_buf.s+idx,rec->rlen) )
      {
-        // fprintf(stderr,"%d .. [%s], idx=%d ori=%d off=%d\n",args->fa_ori_pos,args->fa_buf.s,idx,args->fa_ori_pos,args->fa_mod_off);
-        char tmp = 0;
-        if ( args->fa_buf.l - idx > rec->rlen ) 
-        { 
-            tmp = args->fa_buf.s[idx+rec->rlen];
-            args->fa_buf.s[idx+rec->rlen] = 0;
+        // This is hacky, handle a special case: if insert follows a deletion (AAC>A, C>CAA),
+        // the reference base in fa_buf is lost and the check fails. We do not keep a buffer
+        // with the original sequence as it should not be necessary, we should encounter max
+        // one base overlap
+
+        int fail = 1;
+        if ( args->prev_base_pos==rec->pos && toupper(rec->d.allele[0][0])==toupper(args->prev_base) )
+        {
+            if ( rec->rlen==1 ) fail = 0;
+            else if ( !strncasecmp(rec->d.allele[0]+1,args->fa_buf.s+idx+1,rec->rlen-1) ) fail = 0;
+        }
+
+        if ( fail )
+        {
+            char tmp = 0;
+            if ( args->fa_buf.l - idx > rec->rlen ) 
+            { 
+                tmp = args->fa_buf.s[idx+rec->rlen];
+                args->fa_buf.s[idx+rec->rlen] = 0;
+            }
+            error(
+                    "The fasta sequence does not match the REF allele at %s:%d:\n"
+                    "   .vcf: [%s]\n" 
+                    "   .vcf: [%s] <- (ALT)\n" 
+                    "   .fa:  [%s]%c%s\n",
+                    bcf_seqname(args->hdr,rec),rec->pos+1, rec->d.allele[0], rec->d.allele[ialt], args->fa_buf.s+idx, 
+                    tmp?tmp:' ',tmp?args->fa_buf.s+idx+rec->rlen+1:""
+                 );
          }
-        error(
-            "The fasta sequence does not match the REF allele at %s:%d:\n"
-            "   .vcf: [%s]\n" 
-            "   .vcf: [%s] <- (ALT)\n" 
-            "   .fa:  [%s]%c%s\n",
-            bcf_seqname(args->hdr,rec),rec->pos+1, rec->d.allele[0], rec->d.allele[ialt], args->fa_buf.s+idx, 
-            tmp?tmp:' ',tmp?args->fa_buf.s+idx+rec->rlen+1:""
-            );
+        alen = strlen(rec->d.allele[ialt]);
+        len_diff = alen - rec->rlen;
      }
      else
      {
@@ -539,7 +619,11 @@ static void apply_variant(args_t *args, bcf1_t *rec)
          for (i=0; i<alen; i++)
              args->fa_buf.s[idx+i] = rec->d.allele[ialt][i];
          if ( len_diff )
+        {
+            args->prev_base = rec->d.allele[0][rec->rlen - 1];
+            args->prev_base_pos = rec->pos + rec->rlen - 1;
              memmove(args->fa_buf.s+idx+alen,args->fa_buf.s+idx+rec->rlen,args->fa_buf.l-idx-rec->rlen);
+        }
      }
      else
      {
@@ -589,10 +673,10 @@ static void mask_region(args_t *args, char *seq, int len)
  
  static void consensus(args_t *args)
  {
-    htsFile *fasta = hts_open(args->ref_fname, "rb");
+    BGZF *fasta = bgzf_open(args->ref_fname, "r");
      if ( !fasta ) error("Error reading %s\n", args->ref_fname);
      kstring_t str = {0,0,0};
-    while ( hts_getline(fasta, KS_SEP_LINE, &str) > 0 )
+    while ( bgzf_getline(fasta, '\n', &str) > 0 )
      {
          if ( str.s[0]=='>' )
          {
@@ -669,7 +753,7 @@ static void consensus(args_t *args)
          destroy_chain(args);
      }
      flush_fa_buffer(args, 0);
-    hts_close(fasta);
+    bgzf_close(fasta);
      free(str.s);
  }
  
@@ -681,7 +765,7 @@ static void usage(args_t *args)
      fprintf(stderr, "       --sample (and, optionally, --haplotype) option will apply genotype\n");
      fprintf(stderr, "       (or haplotype) calls from FORMAT/GT. The program ignores allelic depth\n");
      fprintf(stderr, "       information, such as INFO/AD or FORMAT/AD.\n");
-    fprintf(stderr, "Usage:   bcftools consensus [OPTIONS] <file.vcf>\n");
+    fprintf(stderr, "Usage:   bcftools consensus [OPTIONS] <file.vcf.gz>\n");
      fprintf(stderr, "Options:\n");
      fprintf(stderr, "    -c, --chain <file>         write a chain file for liftover\n");
      fprintf(stderr, "    -e, --exclude <expr>       exclude sites for which the expression is true (see man page for details)\n");
@@ -697,6 +781,7 @@ static void usage(args_t *args)
      fprintf(stderr, "    -i, --include <expr>       select sites for which the expression is true (see man page for details)\n");
      fprintf(stderr, "    -I, --iupac-codes          output variants in the form of IUPAC ambiguity codes\n");
      fprintf(stderr, "    -m, --mask <file>          replace regions with N\n");
+    fprintf(stderr, "    -M, --missing <char>       output <char> instead of skipping the missing genotypes\n");
      fprintf(stderr, "    -o, --output <file>        write output to a file [standard output]\n");
      fprintf(stderr, "    -s, --sample <name>        apply variants of the given sample\n");
      fprintf(stderr, "Examples:\n");
@@ -722,11 +807,12 @@ int main_consensus(int argc, char *argv[])
          {"output",1,0,'o'},
          {"fasta-ref",1,0,'f'},
          {"mask",1,0,'m'},
+        {"missing",1,0,'M'},
          {"chain",1,0,'c'},
          {0,0,0,0}
      };
      int c;
-    while ((c = getopt_long(argc, argv, "h?s:1Ii:e:H:f:o:m:c:",loptions,NULL)) >= 0) 
+    while ((c = getopt_long(argc, argv, "h?s:1Ii:e:H:f:o:m:c:M:",loptions,NULL)) >= 0) 
      {
          switch (c) 
          {
@@ -737,6 +823,10 @@ int main_consensus(int argc, char *argv[])
              case 'i': args->filter_str = optarg; args->filter_logic |= FLT_INCLUDE; break;
              case 'f': args->ref_fname = optarg; break;
              case 'm': args->mask_fname = optarg; break;
+            case 'M': 
+                args->missing_allele = optarg[0]; 
+                if ( optarg[1]!=0 ) error("Expected single character with -M, got \"%s\"\n", optarg);
+                break;
              case 'c': args->chain_fname = optarg; break;
              case 'H': 
                  if ( !strcasecmp(optarg,"R") ) args->allele |= PICK_REF;
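Note on the -M handling above: the new case stores optarg[0] and then inspects optarg[1], so an empty argument (-M "") would be read one byte past its terminator. The following is a minimal sketch of a stricter check, not code from bcftools; the helper name parse_missing_char() is hypothetical and only illustrates the idea of rejecting anything that is not exactly one character.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of bcftools: accept exactly one character.
 * Returns 0 and stores the character in *out on success, -1 otherwise. */
static int parse_missing_char(const char *arg, char *out)
{
    assert(out != NULL);
    if ( arg == NULL || arg[0] == '\0' ) return -1;  /* empty or absent argument */
    if ( arg[1] != '\0' ) return -1;                  /* more than one character */
    *out = arg[0];
    return 0;
}
```

A caller could report the error only when the helper returns -1, for example parse_missing_char(optarg, &args->missing_allele) in place of the two statements in the case body.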
diff --git a/bcftools/consensus.c.pysam.c b/bcftools/consensus.c.pysam.c
index 901560181adda490c01a7454d281c922a1131b34..625345e9946f01609aa9660dd82995293ed2296f 100644
--- a/bcftools/consensus.c.pysam.c
+++ b/bcftools/consensus.c.pysam.c
@@ -38,6 +38,7 @@
  #include <htslib/kstring.h>
  #include <htslib/synced_bcf_reader.h>
  #include <htslib/kseq.h>
+#include <htslib/bgzf.h>
  #include "regidx.h"
  #include "bcftools.h"
  #include "rbuf.h"
@@ -75,6 +76,8 @@ typedef struct
      int fa_length;      // region's length in the original sequence (in case end_pos not provided in the FASTA header)
      int fa_case;        // output upper case or lower case?
      int fa_src_pos;     // last genomic coordinate read from the input fasta (0-based)
+    char prev_base;     // this is only to validate the REF allele in the VCF - the modified fa_buf cannot be used for inserts following deletions, see 600#issuecomment-383186778
+    int prev_base_pos;  // the position of prev_base
  
      rbuf_t vcf_rbuf;
      bcf1_t **vcf_buf;
@@ -98,7 +101,7 @@ typedef struct
      FILE *fp_chain;
      char **argv;
      int argc, output_iupac, haplotype, allele, isample;
-    char *fname, *ref_fname, *sample, *output_fname, *mask_fname, *chain_fname;
+    char *fname, *ref_fname, *sample, *output_fname, *mask_fname, *chain_fname, missing_allele;
  }
  args_t;
  
@@ -239,7 +242,7 @@ static void init_data(args_t *args)
          if ( ! args->fp_out ) error("Failed to create %s: %s\n", args->output_fname, strerror(errno));
      }
      else args->fp_out = bcftools_stdout;
-    if ( args->isample<0 ) fprintf(bcftools_stderr,"Note: the --sample option not given, applying all records\n");
+    if ( args->isample<0 ) fprintf(bcftools_stderr,"Note: the --sample option not given, applying all records regardless of the genotype\n");
      if ( args->filter_str )
          args->filter = filter_init(args->hdr, args->filter_str);
  }
@@ -266,7 +269,7 @@ static void init_region(args_t *args, char *line)
      char *ss, *se = line;
      while ( *se && !isspace(*se) && *se!=':' ) se++;
      int from = 0, to = 0;
-    char tmp, *tmp_ptr = NULL;
+    char tmp = 0, *tmp_ptr = NULL;
      if ( *se )
      {
          tmp = *se; *se = 0; tmp_ptr = se;
@@ -282,10 +285,12 @@ static void init_region(args_t *args, char *line)
              else to--;
          }
      }
+    free(args->chr);
      args->chr = strdup(line);
      args->rid = bcf_hdr_name2id(args->hdr,line);
      if ( args->rid<0 ) fprintf(bcftools_stderr,"Warning: Sequence \"%s\" not in %s\n", line,args->fname);
-    args->fa_buf.l = 0;
+    args->prev_base_pos = -1;
+    args->fa_buf.l  = 0;
      args->fa_length = 0;
      args->fa_end_pos = to;
      args->fa_ori_pos = from;
@@ -372,13 +377,10 @@ static void flush_fa_buffer(args_t *args, int len)
  }
  static void apply_variant(args_t *args, bcf1_t *rec)
  {
-    if ( rec->n_allele==1 ) return;
+    static int warned_haplotype = 0;
+
+    if ( rec->n_allele==1 && !args->missing_allele ) return;
  
-    if ( rec->pos <= args->fa_frz_pos )
-    {
-        fprintf(bcftools_stderr,"The site %s:%d overlaps with another variant, skipping...\n", bcf_seqname(args->hdr,rec),rec->pos+1);
-        return;
-    }
      if ( args->mask )
      {
          char *chr = (char*)bcf_hdr_id2name(args->hdr,args->rid);
@@ -397,35 +399,73 @@ static void apply_variant(args_t *args, bcf1_t *rec)
          if ( fmt->type!=BCF_BT_INT8 )
              error("Todo: GT field represented with BCF_BT_INT8, too many alleles at %s:%d?\n",bcf_seqname(args->hdr,rec),rec->pos+1);
          uint8_t *ptr = fmt->p + fmt->size*args->isample;
-
          if ( args->haplotype )
          {
-            if ( args->haplotype > fmt->n ) error("Can't apply %d-th haplotype at %s:%d\n", args->haplotype,bcf_seqname(args->hdr,rec),rec->pos+1);
-            ialt = ptr[args->haplotype-1];
-            if ( bcf_gt_is_missing(ialt) || ialt==bcf_int32_vector_end ) return;
-            ialt = bcf_gt_allele(ialt);
+            if ( args->haplotype > fmt->n )
+            {
+                if ( bcf_gt_is_missing(ptr[fmt->n-1]) || bcf_gt_is_missing(ptr[0]) )
+                {
+                    if ( !args->missing_allele ) return;
+                    ialt = -1;
+                }
+                else 
+                {
+                    if ( !warned_haplotype )
+                    {
+                        fprintf(bcftools_stderr, "Can't apply %d-th haplotype at %s:%d. (This warning is printed only once.)\n", args->haplotype,bcf_seqname(args->hdr,rec),rec->pos+1);
+                        warned_haplotype = 1;
+                    }
+                    return;
+                }
+            }
+            else
+            {
+                ialt = (int8_t)ptr[args->haplotype-1];
+                if ( bcf_gt_is_missing(ialt) || ialt==bcf_int8_vector_end )
+                {
+                    if ( !args->missing_allele ) return;
+                    ialt = -1;
+                }
+                else 
+                    ialt = bcf_gt_allele(ialt);
+            }
          }
          else if ( args->output_iupac ) 
          {
              ialt = ptr[0];
-            if ( bcf_gt_is_missing(ialt) || ialt==bcf_int32_vector_end ) return;
-            ialt = bcf_gt_allele(ialt);
+            if ( bcf_gt_is_missing(ialt) || ialt==bcf_int32_vector_end )
+            {
+                if ( !args->missing_allele ) return;
+                ialt = -1;
+            }
+            else
+                ialt = bcf_gt_allele(ialt);
  
              int jalt;
              if ( fmt->n>1 )
              {
                  jalt = ptr[1];
-                if ( bcf_gt_is_missing(jalt) || jalt==bcf_int32_vector_end ) jalt = ialt;
-                else jalt = bcf_gt_allele(jalt);
+                if ( bcf_gt_is_missing(jalt) )
+                {
+                    if ( !args->missing_allele ) return;
+                    ialt = -1;
+                }
+                else if ( jalt==bcf_int32_vector_end ) jalt = ialt;
+                else
+                    jalt = bcf_gt_allele(jalt);
              }
              else jalt = ialt;
-            if ( rec->n_allele <= ialt || rec->n_allele <= jalt ) error("Invalid VCF, too few ALT alleles at %s:%d\n", bcf_seqname(args->hdr,rec),rec->pos+1);
-            if ( ialt!=jalt && !rec->d.allele[ialt][1] && !rec->d.allele[jalt][1] ) // is this a het snp?
+
+            if ( ialt>=0 )
              {
-                char ial = rec->d.allele[ialt][0];
-                char jal = rec->d.allele[jalt][0];
-                if ( !ialt ) ialt = jalt;   // only ialt is used, make sure 0/1 is not ignored
-                rec->d.allele[ialt][0] = gt2iupac(ial,jal);
+                if ( rec->n_allele <= ialt || rec->n_allele <= jalt ) error("Invalid VCF, too few ALT alleles at %s:%d\n", bcf_seqname(args->hdr,rec),rec->pos+1);
+                if ( ialt!=jalt && !rec->d.allele[ialt][1] && !rec->d.allele[jalt][1] ) // is this a het snp?
+                {
+                    char ial = rec->d.allele[ialt][0];
+                    char jal = rec->d.allele[jalt][0];
+                    if ( !ialt ) ialt = jalt;   // only ialt is used, make sure 0/1 is not ignored
+                    rec->d.allele[ialt][0] = gt2iupac(ial,jal);
+                }
              }
          }
          else
@@ -433,8 +473,13 @@ static void apply_variant(args_t *args, bcf1_t *rec)
              int is_hom = 1;
              for (i=0; i<fmt->n; i++)
              {
-                if ( bcf_gt_is_missing(ptr[i]) ) return;  // ignore missing or half-missing genotypes
-                if ( ptr[i]==bcf_int32_vector_end ) break;
+                if ( bcf_gt_is_missing(ptr[i]) )
+                {
+                    if ( !args->missing_allele ) return;  // ignore missing or half-missing genotypes
+                    ialt = -1;
+                    break;
+                }
+                if ( ptr[i]==(uint8_t)bcf_int8_vector_end ) break;
                  ialt = bcf_gt_allele(ptr[i]);
                  if ( i>0 && ialt!=bcf_gt_allele(ptr[i-1]) ) { is_hom = 0; break; }
              }
@@ -443,7 +488,7 @@ static void apply_variant(args_t *args, bcf1_t *rec)
                  int prev_len = 0, jalt;
                  for (i=0; i<fmt->n; i++)
                  {
-                    if ( ptr[i]==bcf_int32_vector_end ) break;
+                    if ( ptr[i]==(uint8_t)bcf_int8_vector_end ) break;
                      jalt = bcf_gt_allele(ptr[i]);
                      if ( rec->n_allele <= jalt ) error("Broken VCF, too few alts at %s:%d\n", bcf_seqname(args->hdr,rec),rec->pos+1);
                      if ( args->allele & (PICK_LONG|PICK_SHORT) )
@@ -476,6 +521,25 @@ static void apply_variant(args_t *args, bcf1_t *rec)
          rec->d.allele[1][0] = gt2iupac(ial,jal);
      }
  
+    if ( rec->n_allele==1 && ialt!=-1 ) return; // non-missing reference
+    if ( ialt==-1 )
+    {
+        char alleles[4];
+        alleles[0] = rec->d.allele[0][0];
+        alleles[1] = ',';
+        alleles[2] = args->missing_allele;
+        alleles[3] = 0;
+        bcf_update_alleles_str(args->hdr, rec, alleles);
+        ialt = 1;
+    }
+
+    // Overlapping variant? Can be still OK iff this is an insertion
+    if ( rec->pos <= args->fa_frz_pos && (rec->pos!=args->fa_frz_pos || rec->d.allele[0][0]!=rec->d.allele[ialt][0]) )
+    {
+        fprintf(bcftools_stderr,"The site %s:%d overlaps with another variant, skipping...\n", bcf_seqname(args->hdr,rec),rec->pos+1);
+        return;
+    }
+
      int len_diff = 0, alen = 0;
      int idx = rec->pos - args->fa_ori_pos + args->fa_mod_off;
      if ( idx<0 )
@@ -494,7 +558,7 @@ static void apply_variant(args_t *args, bcf1_t *rec)
          }
      }
      if ( idx>=args->fa_buf.l ) 
-        error("FIXME: %s:%d .. idx=%d, ori_pos=%d, len=%d, off=%d\n",bcf_seqname(args->hdr,rec),rec->pos+1,idx,args->fa_ori_pos,args->fa_buf.l,args->fa_mod_off);
+        error("FIXME: %s:%d .. idx=%d, ori_pos=%d, len=%"PRIu64", off=%d\n",bcf_seqname(args->hdr,rec),rec->pos+1,idx,args->fa_ori_pos,(uint64_t)args->fa_buf.l,args->fa_mod_off);
  
      // sanity check the reference base
      if ( rec->d.allele[ialt][0]=='<' )
@@ -508,21 +572,37 @@ static void apply_variant(args_t *args, bcf1_t *rec)
      }
      else if ( strncasecmp(rec->d.allele[0],args->fa_buf.s+idx,rec->rlen) )
      {
-        // fprintf(bcftools_stderr,"%d .. [%s], idx=%d ori=%d off=%d\n",args->fa_ori_pos,args->fa_buf.s,idx,args->fa_ori_pos,args->fa_mod_off);
-        char tmp = 0;
-        if ( args->fa_buf.l - idx > rec->rlen ) 
-        { 
-            tmp = args->fa_buf.s[idx+rec->rlen];
-            args->fa_buf.s[idx+rec->rlen] = 0;
+        // This is hacky, handle a special case: if insert follows a deletion (AAC>A, C>CAA),
+        // the reference base in fa_buf is lost and the check fails. We do not keep a buffer
+        // with the original sequence as it should not be necessary, we should encounter max
+        // one base overlap
+
+        int fail = 1;
+        if ( args->prev_base_pos==rec->pos && toupper(rec->d.allele[0][0])==toupper(args->prev_base) )
+        {
+            if ( rec->rlen==1 ) fail = 0;
+            else if ( !strncasecmp(rec->d.allele[0]+1,args->fa_buf.s+idx+1,rec->rlen-1) ) fail = 0;
+        }
+
+        if ( fail )
+        {
+            char tmp = 0;
+            if ( args->fa_buf.l - idx > rec->rlen ) 
+            { 
+                tmp = args->fa_buf.s[idx+rec->rlen];
+                args->fa_buf.s[idx+rec->rlen] = 0;
+            }
+            error(
+                    "The fasta sequence does not match the REF allele at %s:%d:\n"
+                    "   .vcf: [%s]\n" 
+                    "   .vcf: [%s] <- (ALT)\n" 
+                    "   .fa:  [%s]%c%s\n",
+                    bcf_seqname(args->hdr,rec),rec->pos+1, rec->d.allele[0], rec->d.allele[ialt], args->fa_buf.s+idx, 
+                    tmp?tmp:' ',tmp?args->fa_buf.s+idx+rec->rlen+1:""
+                 );
          }
-        error(
-            "The fasta sequence does not match the REF allele at %s:%d:\n"
-            "   .vcf: [%s]\n" 
-            "   .vcf: [%s] <- (ALT)\n" 
-            "   .fa:  [%s]%c%s\n",
-            bcf_seqname(args->hdr,rec),rec->pos+1, rec->d.allele[0], rec->d.allele[ialt], args->fa_buf.s+idx, 
-            tmp?tmp:' ',tmp?args->fa_buf.s+idx+rec->rlen+1:""
-            );
+        alen = strlen(rec->d.allele[ialt]);
+        len_diff = alen - rec->rlen;
      }
      else
      {
@@ -541,7 +621,11 @@ static void apply_variant(args_t *args, bcf1_t *rec)
          for (i=0; i<alen; i++)
              args->fa_buf.s[idx+i] = rec->d.allele[ialt][i];
          if ( len_diff )
+        {
+            args->prev_base = rec->d.allele[0][rec->rlen - 1];
+            args->prev_base_pos = rec->pos + rec->rlen - 1;
              memmove(args->fa_buf.s+idx+alen,args->fa_buf.s+idx+rec->rlen,args->fa_buf.l-idx-rec->rlen);
+        }
      }
      else
      {
@@ -591,10 +675,10 @@ static void mask_region(args_t *args, char *seq, int len)
  
  static void consensus(args_t *args)
  {
-    htsFile *fasta = hts_open(args->ref_fname, "rb");
+    BGZF *fasta = bgzf_open(args->ref_fname, "r");
      if ( !fasta ) error("Error reading %s\n", args->ref_fname);
      kstring_t str = {0,0,0};
-    while ( hts_getline(fasta, KS_SEP_LINE, &str) > 0 )
+    while ( bgzf_getline(fasta, '\n', &str) > 0 )
      {
          if ( str.s[0]=='>' )
          {
@@ -671,7 +755,7 @@ static void consensus(args_t *args)
          destroy_chain(args);
      }
      flush_fa_buffer(args, 0);
-    hts_close(fasta);
+    bgzf_close(fasta);
      free(str.s);
  }
  
@@ -683,7 +767,7 @@ static void usage(args_t *args)
      fprintf(bcftools_stderr, "       --sample (and, optionally, --haplotype) option will apply genotype\n");
      fprintf(bcftools_stderr, "       (or haplotype) calls from FORMAT/GT. The program ignores allelic depth\n");
      fprintf(bcftools_stderr, "       information, such as INFO/AD or FORMAT/AD.\n");
-    fprintf(bcftools_stderr, "Usage:   bcftools consensus [OPTIONS] <file.vcf>\n");
+    fprintf(bcftools_stderr, "Usage:   bcftools consensus [OPTIONS] <file.vcf.gz>\n");
      fprintf(bcftools_stderr, "Options:\n");
      fprintf(bcftools_stderr, "    -c, --chain <file>         write a chain file for liftover\n");
      fprintf(bcftools_stderr, "    -e, --exclude <expr>       exclude sites for which the expression is true (see man page for details)\n");
@@ -699,6 +783,7 @@ static void usage(args_t *args)
      fprintf(bcftools_stderr, "    -i, --include <expr>       select sites for which the expression is true (see man page for details)\n");
      fprintf(bcftools_stderr, "    -I, --iupac-codes          output variants in the form of IUPAC ambiguity codes\n");
      fprintf(bcftools_stderr, "    -m, --mask <file>          replace regions with N\n");
+    fprintf(bcftools_stderr, "    -M, --missing <char>       output <char> instead of skipping the missing genotypes\n");
      fprintf(bcftools_stderr, "    -o, --output <file>        write output to a file [standard output]\n");
      fprintf(bcftools_stderr, "    -s, --sample <name>        apply variants of the given sample\n");
      fprintf(bcftools_stderr, "Examples:\n");
@@ -724,11 +809,12 @@ int main_consensus(int argc, char *argv[])
          {"output",1,0,'o'},
          {"fasta-ref",1,0,'f'},
          {"mask",1,0,'m'},
+        {"missing",1,0,'M'},
          {"chain",1,0,'c'},
          {0,0,0,0}
      };
      int c;
-    while ((c = getopt_long(argc, argv, "h?s:1Ii:e:H:f:o:m:c:",loptions,NULL)) >= 0) 
+    while ((c = getopt_long(argc, argv, "h?s:1Ii:e:H:f:o:m:c:M:",loptions,NULL)) >= 0) 
      {
          switch (c) 
          {
@@ -739,6 +825,10 @@ int main_consensus(int argc, char *argv[])
              case 'i': args->filter_str = optarg; args->filter_logic |= FLT_INCLUDE; break;
              case 'f': args->ref_fname = optarg; break;
              case 'm': args->mask_fname = optarg; break;
+            case 'M': 
+                args->missing_allele = optarg[0]; 
+                if ( optarg[1]!=0 ) error("Expected single character with -M, got \"%s\"\n", optarg);
+                break;
              case 'c': args->chain_fname = optarg; break;
              case 'H': 
                  if ( !strcasecmp(optarg,"R") ) args->allele |= PICK_REF;
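Note on the relocated overlap test in apply_variant(): a record at or before the protected position (fa_frz_pos in the patch) is now skipped unless it sits exactly on that position and its ALT keeps the first REF base, which is the shape of an insertion following a deletion. The condition can be sketched in isolation as below; the function name overlaps_previous() and its plain-int parameters are assumptions made for illustration, not part of the patch.

```c
#include <assert.h>

/* Hypothetical helper mirroring the condition in the patch: returns 1 when the
 * record should be skipped as overlapping, 0 when it can still be applied.
 * pos and frz_pos are 0-based; ref0/alt0 are the first bases of REF and ALT. */
static int overlaps_previous(int pos, int frz_pos, char ref0, char alt0)
{
    assert(ref0 != '\0' && alt0 != '\0');
    if ( pos > frz_pos ) return 0;      /* strictly downstream of the protected position */
    if ( pos < frz_pos ) return 1;      /* before the protected position: always skipped */
    return ref0 != alt0;                /* same position: OK only if the first base is kept */
}
```

With frz_pos = 10, a record at position 10 with REF C and ALT CAA is accepted, while C>T at the same position is still reported as overlapping.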
diff --git a/bcftools/convert.c b/bcftools/convert.c
index d7ed0abc10bfc63685903d4f20e5d8b6af4753ca..811a364c236fdacc57dc71aad39a228204cb84f6 100644
--- a/bcftools/convert.c
+++ b/bcftools/convert.c
@@ -30,6 +30,7 @@ THE SOFTWARE.  */
  #include <errno.h>
  #include <sys/stat.h>
  #include <sys/types.h>
+#include <inttypes.h>
  #include <math.h>
  #include <htslib/vcf.h>
  #include <htslib/synced_bcf_reader.h>
@@ -756,16 +757,37 @@ static void process_gt_to_hap(convert_t *convert, bcf1_t *line, fmt_t *fmt, int
      if ( line->n_allele > 100 )
          error("Too many alleles (%d) at %s:%d\n", line->n_allele, bcf_seqname(convert->header, line), line->pos+1);
      if ( ks_resize(str, str->l+convert->nsamples*8) != 0 )
-        error("Could not alloc %d bytes\n", str->l + convert->nsamples*8);
+        error("Could not alloc %"PRIu64" bytes\n", (uint64_t)(str->l + convert->nsamples*8));
  
      if ( fmt_gt->type!=BCF_BT_INT8 )    // todo: use BRANCH_INT if the VCF is valid
          error("Uh, too many alleles (%d) or redundant BCF representation at %s:%d\n", line->n_allele, bcf_seqname(convert->header, line), line->pos+1);
+    if ( fmt_gt->n!=1 && fmt_gt->n!=2 )
+        error("Uh, ploidy of %d not supported, see %s:%d\n", fmt_gt->n, bcf_seqname(convert->header, line), line->pos+1);
  
      int8_t *ptr = ((int8_t*) fmt_gt->p) - fmt_gt->n;
      for (i=0; i<convert->nsamples; i++)
      {
          ptr += fmt_gt->n;
-        if ( ptr[0]==2 )
+        if ( fmt_gt->n==1 ) // haploid genotypes
+        {
+            if ( ptr[0]==2 ) /* 0 */
+            {
+                str->s[str->l++] = '0'; str->s[str->l++] = ' '; str->s[str->l++] = '-'; str->s[str->l++] = ' ';
+            }
+            else if ( ptr[0]==bcf_int8_missing )   /* . */
+            {
+                str->s[str->l++] = '?'; str->s[str->l++] = ' '; str->s[str->l++] = '?'; str->s[str->l++] = ' ';
+            }
+            else if ( ptr[0]==4 ) /* 1 */
+            {
+                str->s[str->l++] = '1'; str->s[str->l++] = ' '; str->s[str->l++] = '-'; str->s[str->l++] = ' ';
+            }
+            else
+            {
+                kputw(bcf_gt_allele(ptr[0]),str); str->s[str->l++] = ' '; str->s[str->l++] = '-'; str->s[str->l++] = ' ';
+            }
+        }
+        else if ( ptr[0]==2 )
          {
              if ( ptr[1]==3 ) /* 0|0 */
              {
@@ -889,7 +911,7 @@ static void process_gt_to_hap2(convert_t *convert, bcf1_t *line, fmt_t *fmt, int
      if ( line->n_allele > 100 )
          error("Too many alleles (%d) at %s:%d\n", line->n_allele, bcf_seqname(convert->header, line), line->pos+1);
      if ( ks_resize(str, str->l+convert->nsamples*8) != 0 )
-        error("Could not alloc %d bytes\n", str->l + convert->nsamples*8);
+        error("Could not alloc %"PRIu64" bytes\n", (uint64_t)(str->l + convert->nsamples*8));
  
      if ( fmt_gt->type!=BCF_BT_INT8 )    // todo: use BRANCH_INT if the VCF is valid
          error("Uh, too many alleles (%d) or redundant BCF representation at %s:%d\n", line->n_allele, bcf_seqname(convert->header, line), line->pos+1);
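Note on the literal values 2 and 4 in the new haploid branch of process_gt_to_hap(): BCF stores each GT allele as ((allele + 1) << 1) | phased, so 2 is an unphased allele 0 and 4 an unphased allele 1. The sketch below decodes that encoding with local stand-ins for the htslib macros and formats a haploid call the way the patch does ("<allele> - "). The helper names are hypothetical, and treating the encoded value 0 as missing is an assumption of this sketch rather than what the patch tests.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Local stand-ins for the htslib genotype macros, for non-negative values only. */
static int gt_is_missing(int8_t v) { return (v >> 1) == 0; }
static int gt_allele(int8_t v)     { return (v >> 1) - 1; }
static int gt_is_phased(int8_t v)  { return v & 1; }

/* Hypothetical formatter: writes "<allele> - " for a haploid call and
 * "? ? " for a missing one. Returns the number of characters written. */
static int format_haploid(int8_t v, char *buf, size_t n)
{
    assert(buf != NULL && v >= 0);
    if ( gt_is_missing(v) ) return snprintf(buf, n, "? ? ");
    return snprintf(buf, n, "%d - ", gt_allele(v));
}
```

The same arithmetic explains the diploid constants in the unchanged code: 3 is a phased allele 0, so the pair 2,3 is 0|0.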
diff --git a/bcftools/convert.c.pysam.c b/bcftools/convert.c.pysam.c
index 66373bcda447b2b7b45dd8201eac331ddfbf651e..5496abea9ee59ec5a27c179ff3365bfb3eb820ef 100644
--- a/bcftools/convert.c.pysam.c
+++ b/bcftools/convert.c.pysam.c
@@ -32,6 +32,7 @@ THE SOFTWARE.  */
  #include <errno.h>
  #include <sys/stat.h>
  #include <sys/types.h>
+#include <inttypes.h>
  #include <math.h>
  #include <htslib/vcf.h>
  #include <htslib/synced_bcf_reader.h>
@@ -758,16 +759,37 @@ static void process_gt_to_hap(convert_t *convert, bcf1_t *line, fmt_t *fmt, int
      if ( line->n_allele > 100 )
          error("Too many alleles (%d) at %s:%d\n", line->n_allele, bcf_seqname(convert->header, line), line->pos+1);
      if ( ks_resize(str, str->l+convert->nsamples*8) != 0 )
-        error("Could not alloc %d bytes\n", str->l + convert->nsamples*8);
+        error("Could not alloc %"PRIu64" bytes\n", (uint64_t)(str->l + convert->nsamples*8));
  
      if ( fmt_gt->type!=BCF_BT_INT8 )    // todo: use BRANCH_INT if the VCF is valid
          error("Uh, too many alleles (%d) or redundant BCF representation at %s:%d\n", line->n_allele, bcf_seqname(convert->header, line), line->pos+1);
+    if ( fmt_gt->n!=1 && fmt_gt->n!=2 )
+        error("Uh, ploidy of %d not supported, see %s:%d\n", fmt_gt->n, bcf_seqname(convert->header, line), line->pos+1);
  
      int8_t *ptr = ((int8_t*) fmt_gt->p) - fmt_gt->n;
      for (i=0; i<convert->nsamples; i++)
      {
          ptr += fmt_gt->n;
-        if ( ptr[0]==2 )
+        if ( fmt_gt->n==1 ) // haploid genotypes
+        {
+            if ( ptr[0]==2 ) /* 0 */
+            {
+                str->s[str->l++] = '0'; str->s[str->l++] = ' '; str->s[str->l++] = '-'; str->s[str->l++] = ' ';
+            }
+            else if ( ptr[0]==bcf_int8_missing )   /* . */
+            {
+                str->s[str->l++] = '?'; str->s[str->l++] = ' '; str->s[str->l++] = '?'; str->s[str->l++] = ' ';
+            }
+            else if ( ptr[0]==4 ) /* 1 */
+            {
+                str->s[str->l++] = '1'; str->s[str->l++] = ' '; str->s[str->l++] = '-'; str->s[str->l++] = ' ';
+            }
+            else
+            {
+                kputw(bcf_gt_allele(ptr[0]),str); str->s[str->l++] = ' '; str->s[str->l++] = '-'; str->s[str->l++] = ' ';
+            }
+        }
+        else if ( ptr[0]==2 )
          {
              if ( ptr[1]==3 ) /* 0|0 */
              {
@@ -891,7 +913,7 @@ static void process_gt_to_hap2(convert_t *convert, bcf1_t *line, fmt_t *fmt, int
      if ( line->n_allele > 100 )
          error("Too many alleles (%d) at %s:%d\n", line->n_allele, bcf_seqname(convert->header, line), line->pos+1);
      if ( ks_resize(str, str->l+convert->nsamples*8) != 0 )
-        error("Could not alloc %d bytes\n", str->l + convert->nsamples*8);
+        error("Could not alloc %"PRIu64" bytes\n", (uint64_t)(str->l + convert->nsamples*8));
  
      if ( fmt_gt->type!=BCF_BT_INT8 )    // todo: use BRANCH_INT if the VCF is valid
          error("Uh, too many alleles (%d) or redundant BCF representation at %s:%d\n", line->n_allele, bcf_seqname(convert->header, line), line->pos+1);
diff --git a/bcftools/csq.c b/bcftools/csq.c
index b36ebc32d98ba13c953f1d0b4fd813b00931d645..dc24ae6ebe08db00d8a59c58751f4f45da21f37d 100644
--- a/bcftools/csq.c
+++ b/bcftools/csq.c
@@ -71,7 +71,7 @@
              A .. gene line with a supported biotype
                      A.ID=~/^gene:/
  
-            B .. transcript line referencing A
+            B .. transcript line referencing A with supported biotype
                      B.ID=~/^transcript:/ && B.Parent=~/^gene:A.ID/
  
              C .. corresponding CDS, exon, and UTR lines:
@@ -595,6 +595,7 @@ typedef struct _args_t
      csq_t *csq_buf;             // pool of csq not managed by hap_node_t, i.e. non-CDS csqs
      int ncsq_buf, mcsq_buf;
      id_tbl_t tscript_ids;       // mapping between transcript id (eg. Zm00001d027245_T001) and a numeric idx
+    int force;                  // force run under various conditions. Currently only to skip out-of-phase transcripts
  
      faidx_t *fai;
      kstring_t str, str2;
@@ -1111,15 +1112,26 @@ void tscript_init_cds(args_t *args)
              tr->cds[0]->len -= tr->cds[0]->phase;
              tr->cds[0]->phase = 0;
  
-            // sanity check phase
+            // sanity check phase; the phase number in gff tells us how many bases to skip in this
+            // feature to reach the first base of the next codon
+            int tscript_ok = 1;
              for (i=0; i<tr->ncds; i++)
              {
                  int phase = tr->cds[i]->phase ? 3 - tr->cds[i]->phase : 0;
                  if ( phase!=len%3)
-                    error("GFF3 assumption failed for transcript %s, CDS=%d: phase!=len%%3 (phase=%d, len=%d)\n",args->tscript_ids.str[tr->id],tr->cds[i]->beg+1,phase,len);
-                assert( phase == len%3 );
+                {
+                    if ( args->force )
+                    {
+                        if ( args->quiet < 2 )
+                            fprintf(stderr,"Warning: GFF3 assumption failed for transcript %s, CDS=%d: phase!=len%%3 (phase=%d, len=%d)\n",args->tscript_ids.str[tr->id],tr->cds[i]->beg+1,phase,len);
+                        tscript_ok = 0;
+                        break;
+                    }
+                    error("Error: GFF3 assumption failed for transcript %s, CDS=%d: phase!=len%%3 (phase=%d, len=%d)\n",args->tscript_ids.str[tr->id],tr->cds[i]->beg+1,phase,len);
+                }
                  len += tr->cds[i]->len; 
              }
+            if ( !tscript_ok ) continue;    // skip this transcript
          }
          else
          {
@@ -1140,13 +1152,24 @@ void tscript_init_cds(args_t *args)
              tr->cds[i]->phase = 0;
  
              // sanity check phase
+            int tscript_ok = 1;
              for (i=tr->ncds-1; i>=0; i--)
              {
                  int phase = tr->cds[i]->phase ? 3 - tr->cds[i]->phase : 0;
                  if ( phase!=len%3)
-                    error("GFF3 assumption failed for transcript %s, CDS=%d: phase!=len%%3 (phase=%d, len=%d)\n",args->tscript_ids.str[tr->id],tr->cds[i]->beg+1,phase,len);
+                {
+                    if ( args->force )
+                    {
+                        if ( args->quiet < 2 )
+                            fprintf(stderr,"Warning: GFF3 assumption failed for transcript %s, CDS=%d: phase!=len%%3 (phase=%d, len=%d)\n",args->tscript_ids.str[tr->id],tr->cds[i]->beg+1,phase,len);
+                        tscript_ok = 0;
+                        break;
+                    }
+                    error("Error: GFF3 assumption failed for transcript %s, CDS=%d: phase!=len%%3 (phase=%d, len=%d)\n",args->tscript_ids.str[tr->id],tr->cds[i]->beg+1,phase,len);
+                }
                  len += tr->cds[i]->len;
              }
+            if ( !tscript_ok ) continue;    // skip this transcript
          }
  
          // set len. At the same check that CDS within a transcript do not overlap
@@ -1876,7 +1899,7 @@ fprintf(stderr,"del: %s>%s .. ex=%d,%d  beg,end=%d,%d  tbeg,tend=%d,%d  check_ut
          splice->kalt.l = 0; kputsn(splice->vcf.alt + splice->tbeg, splice->vcf.alen, &splice->kalt); 
          if ( (splice->ref_beg+1 < ex_beg && splice->ref_end >= ex_beg) || (splice->ref_beg+1 < ex_end && splice->ref_end >= ex_end) ) // ouch, ugly ENST00000409523/long-overlapping-del.vcf
          {
-            splice->csq |= (splice->ref_end - splice->ref_beg + 1)%3 ? CSQ_FRAMESHIFT_VARIANT : CSQ_INFRAME_DELETION;
+            splice->csq |= (splice->ref_end - splice->ref_beg)%3 ? CSQ_FRAMESHIFT_VARIANT : CSQ_INFRAME_DELETION;
              return SPLICE_OVERLAP;
          }
      }
@@ -2074,7 +2097,6 @@ fprintf(stderr,"cds splice_csq: %d [%s][%s] .. beg,end=%d %d, ret=%d, csq=%d\n\n
          child->var  = str.s;
          child->type = HAP_SSS;
          child->csq  = splice.csq;
-        child->prev = parent->type==HAP_SSS ? parent->prev : parent;
          child->rec  = rec;
          return 0;
      }
@@ -2092,7 +2114,7 @@ fprintf(stderr,"cds splice_csq: %d [%s][%s] .. beg,end=%d %d, ret=%d, csq=%d\n\n
          assert( dbeg <= splice.kalt.l );
      }
  
-    if ( parent->type==HAP_SSS ) parent = parent->prev;
+    assert( parent->type!=HAP_SSS );
      if ( parent->type==HAP_CDS )    
      {
          i = parent->icds;
@@ -2402,12 +2424,12 @@ fprintf(stderr,"csq_push: %d .. %d\n",rec->pos+1,csq->type.type);
  #endif
      khint_t k = kh_get(pos2vbuf, args->pos2vbuf, (int)csq->pos);
      vbuf_t *vbuf = (k == kh_end(args->pos2vbuf)) ? NULL : kh_val(args->pos2vbuf, k);
-    if ( !vbuf ) error("This should not happen. %s:%d  %s\n",bcf_seqname(args->hdr,rec),csq->pos+1,csq->type.vstr);
+    if ( !vbuf ) error("This should not happen. %s:%d  %s\n",bcf_seqname(args->hdr,rec),csq->pos+1,csq->type.vstr.s);
  
      int i;
      for (i=0; i<vbuf->n; i++)
          if ( vbuf->vrec[i]->line==rec ) break;
-    if ( i==vbuf->n ) error("This should not happen.. %s:%d  %s\n", bcf_seqname(args->hdr,rec),csq->pos+1,csq->type.vstr);
+    if ( i==vbuf->n ) error("This should not happen.. %s:%d  %s\n", bcf_seqname(args->hdr,rec),csq->pos+1,csq->type.vstr.s);
      vrec_t *vrec = vbuf->vrec[i];
  
      // if the variant overlaps donor/acceptor and also splice region, report only donor/acceptor
@@ -2769,26 +2791,20 @@ void hap_finalize(args_t *args, hap_t *hap)
          hap->upstream_stop = 0;
  
          int i = 1, dlen = 0, ibeg, indel = 0;
-        while ( i<istack && hap->stack[i].node->type == HAP_SSS ) i++;
          hap->sbeg = hap->stack[i].node->sbeg;
-
+        assert( hap->stack[istack].node->type != HAP_SSS );
          if ( tr->strand==STRAND_FWD )
          {
              i = 0, ibeg = -1;
              while ( ++i <= istack )
              {
-                if ( hap->stack[i].node->type == HAP_SSS )
-                {
-                    // start/stop/splice site overlap: don't know how to build the haplotypes correctly, skipping
-                    hap_add_csq(args,hap,node,0,i,i,0,0);
-                    continue;
-                }
+                assert( hap->stack[i].node->type != HAP_SSS );
+
                  dlen += hap->stack[i].node->dlen;
                  if ( hap->stack[i].node->dlen ) indel = 1;
  
-                // This condition extends compound variants. Note that s/s/s sites are forced out to always break
-                // a compound block. See ENST00000271583/splice-acceptor.vcf for motivation.
-                if ( i<istack && hap->stack[i+1].node->type != HAP_SSS )
+                // This condition extends compound variants.
+                if ( i<istack )
                  {
                      if ( dlen%3 )   // frameshift
                      {
@@ -2839,14 +2855,10 @@ void hap_finalize(args_t *args, hap_t *hap)
              i = istack + 1, ibeg = -1;
              while ( --i > 0 )
              {
-                if ( hap->stack[i].node->type == HAP_SSS )
-                {
-                    hap_add_csq(args,hap,node,0,i,i,0,0);
-                    continue;
-                }
+                assert ( hap->stack[i].node->type != HAP_SSS );
                  dlen += hap->stack[i].node->dlen;
                  if ( hap->stack[i].node->dlen ) indel = 1;
-                if ( i>1 && hap->stack[i-1].node->type != HAP_SSS )
+                if ( i>1 )
                  {
                      if ( dlen%3 )
                      {
@@ -3352,7 +3364,8 @@ int test_cds(args_t *args, bcf1_t *rec)
              if ( rec->d.allele[1][0]=='<' || rec->d.allele[1][0]=='*' ) { continue; }
              hap_node_t *parent = tr->hap[0] ? tr->hap[0] : tr->root;
              hap_node_t *child  = (hap_node_t*)calloc(1,sizeof(hap_node_t));
-            if ( (hap_ret=hap_init(args, parent, child, cds, rec, 1))!=0 )
+            hap_ret = hap_init(args, parent, child, cds, rec, 1);
+            if ( hap_ret!=0 )
              {
                  // overlapping or intron variant, cannot apply
                  if ( hap_ret==1 )
@@ -3363,7 +3376,22 @@ int test_cds(args_t *args, bcf1_t *rec)
                          fprintf(args->out,"LOG\tWarning: Skipping overlapping variants at %s:%d\t%s>%s\n", chr,rec->pos+1,rec->d.allele[0],rec->d.allele[1]);
                  }
                  else ret = 1;   // prevent reporting as intron in test_tscript
-                free(child);
+                hap_destroy(child);
+                continue;
+            }
+            if ( child->type==HAP_SSS )
+            {
+                csq_t csq; 
+                memset(&csq, 0, sizeof(csq_t));
+                csq.pos          = rec->pos;
+                csq.type.biotype = tr->type;
+                csq.type.strand  = tr->strand;
+                csq.type.trid    = tr->id;
+                csq.type.gene    = tr->gene->name;
+                csq.type.type = child->csq;
+                csq_stage(args, &csq, rec);
+                hap_destroy(child);
+                ret = 1;
                  continue;
              }
              parent->nend--;
@@ -3434,7 +3462,8 @@ int test_cds(args_t *args, bcf1_t *rec)
                  }
  
                  hap_node_t *child = (hap_node_t*)calloc(1,sizeof(hap_node_t));
-                if ( (hap_ret=hap_init(args, parent, child, cds, rec, ial))!=0 )
+                hap_ret = hap_init(args, parent, child, cds, rec, ial);
+                if ( hap_ret!=0 )
                  {
                      // overlapping or intron variant, cannot apply
                      if ( hap_ret==1 )
@@ -3446,10 +3475,23 @@ int test_cds(args_t *args, bcf1_t *rec)
                              fprintf(args->out,"LOG\tWarning: Skipping overlapping variants at %s:%d, sample %s\t%s>%s\n",
                                      chr,rec->pos+1,args->hdr->samples[args->smpl->idx[ismpl]],rec->d.allele[0],rec->d.allele[ial]);
                      }
-                    free(child);
+                    hap_destroy(child);
+                    continue;
+                }
+                if ( child->type==HAP_SSS )
+                {
+                    csq_t csq; 
+                    memset(&csq, 0, sizeof(csq_t));
+                    csq.pos          = rec->pos;
+                    csq.type.biotype = tr->type;
+                    csq.type.strand  = tr->strand;
+                    csq.type.trid    = tr->id;
+                    csq.type.gene    = tr->gene->name;
+                    csq.type.type = child->csq;
+                    csq_stage(args, &csq, rec);
+                    hap_destroy(child);
                      continue;
                  }
-
                  if ( parent->cur_rec!=rec )
                  {
                      hts_expand(int,rec->n_allele,parent->mcur_child,parent->cur_child);
@@ -3708,6 +3750,7 @@ static const char *usage(void)
          "                                     s: skip unphased hets\n"
          "Options:\n"
          "   -e, --exclude <expr>            exclude sites for which the expression is true\n"
+        "       --force                     run even if some sanity checks fail\n"
          "   -i, --include <expr>            select sites for which the expression is true\n"
          "   -o, --output <file>             write output to a file [standard output]\n"
          "   -O, --output-type <b|u|z|v|t>   b: compressed BCF, u: uncompressed BCF, z: compressed VCF\n"
@@ -3739,6 +3782,7 @@ int main_csq(int argc, char *argv[])
  
      static struct option loptions[] =
      {
+        {"force",0,0,1},
          {"help",0,0,'h'},
          {"ncsq",1,0,'n'},
          {"custom-tag",1,0,'c'},
@@ -3765,6 +3809,7 @@ int main_csq(int argc, char *argv[])
      {
          switch (c) 
          {
+            case  1 : args->force = 1; break;
              case 'l': args->local_csq = 1; break;
              case 'c': args->bcsq_tag = optarg; break;
              case 'q': args->quiet++; break;
diff --git a/bcftools/csq.c.pysam.c b/bcftools/csq.c.pysam.c
index e4afa8ea3e66ebc898b2f72210cd27d76fe6f775..cfe34a63fb66256e83ab3a3844c37c5ddbef5ab1 100644
--- a/bcftools/csq.c.pysam.c
+++ b/bcftools/csq.c.pysam.c
@@ -73,7 +73,7 @@
              A .. gene line with a supported biotype
                      A.ID=~/^gene:/
  
-            B .. transcript line referencing A
+            B .. transcript line referencing A with supported biotype
                      B.ID=~/^transcript:/ && B.Parent=~/^gene:A.ID/
  
              C .. corresponding CDS, exon, and UTR lines:
@@ -597,6 +597,7 @@ typedef struct _args_t
      csq_t *csq_buf;             // pool of csq not managed by hap_node_t, i.e. non-CDS csqs
      int ncsq_buf, mcsq_buf;
      id_tbl_t tscript_ids;       // mapping between transcript id (eg. Zm00001d027245_T001) and a numeric idx
+    int force;                  // force run under various conditions. Currently only to skip out-of-phase transcripts
  
      faidx_t *fai;
      kstring_t str, str2;
@@ -1113,15 +1114,26 @@ void tscript_init_cds(args_t *args)
              tr->cds[0]->len -= tr->cds[0]->phase;
              tr->cds[0]->phase = 0;
  
-            // sanity check phase
+            // sanity check phase; the phase number in gff tells us how many bases to skip in this
+            // feature to reach the first base of the next codon
+            int tscript_ok = 1;
              for (i=0; i<tr->ncds; i++)
              {
                  int phase = tr->cds[i]->phase ? 3 - tr->cds[i]->phase : 0;
                  if ( phase!=len%3)
-                    error("GFF3 assumption failed for transcript %s, CDS=%d: phase!=len%%3 (phase=%d, len=%d)\n",args->tscript_ids.str[tr->id],tr->cds[i]->beg+1,phase,len);
-                assert( phase == len%3 );
+                {
+                    if ( args->force )
+                    {
+                        if ( args->quiet < 2 )
+                            fprintf(bcftools_stderr,"Warning: GFF3 assumption failed for transcript %s, CDS=%d: phase!=len%%3 (phase=%d, len=%d)\n",args->tscript_ids.str[tr->id],tr->cds[i]->beg+1,phase,len);
+                        tscript_ok = 0;
+                        break;
+                    }
+                    error("Error: GFF3 assumption failed for transcript %s, CDS=%d: phase!=len%%3 (phase=%d, len=%d)\n",args->tscript_ids.str[tr->id],tr->cds[i]->beg+1,phase,len);
+                }
                  len += tr->cds[i]->len; 
              }
+            if ( !tscript_ok ) continue;    // skip this transcript
          }
          else
          {
@@ -1142,13 +1154,24 @@ void tscript_init_cds(args_t *args)
              tr->cds[i]->phase = 0;
  
              // sanity check phase
+            int tscript_ok = 1;
              for (i=tr->ncds-1; i>=0; i--)
              {
                  int phase = tr->cds[i]->phase ? 3 - tr->cds[i]->phase : 0;
                  if ( phase!=len%3)
-                    error("GFF3 assumption failed for transcript %s, CDS=%d: phase!=len%%3 (phase=%d, len=%d)\n",args->tscript_ids.str[tr->id],tr->cds[i]->beg+1,phase,len);
+                {
+                    if ( args->force )
+                    {
+                        if ( args->quiet < 2 )
+                            fprintf(bcftools_stderr,"Warning: GFF3 assumption failed for transcript %s, CDS=%d: phase!=len%%3 (phase=%d, len=%d)\n",args->tscript_ids.str[tr->id],tr->cds[i]->beg+1,phase,len);
+                        tscript_ok = 0;
+                        break;
+                    }
+                    error("Error: GFF3 assumption failed for transcript %s, CDS=%d: phase!=len%%3 (phase=%d, len=%d)\n",args->tscript_ids.str[tr->id],tr->cds[i]->beg+1,phase,len);
+                }
                  len += tr->cds[i]->len;
              }
+            if ( !tscript_ok ) continue;    // skip this transcript
          }
  
          // set len. At the same check that CDS within a transcript do not overlap
@@ -1878,7 +1901,7 @@ fprintf(bcftools_stderr,"del: %s>%s .. ex=%d,%d  beg,end=%d,%d  tbeg,tend=%d,%d
          splice->kalt.l = 0; kputsn(splice->vcf.alt + splice->tbeg, splice->vcf.alen, &splice->kalt); 
          if ( (splice->ref_beg+1 < ex_beg && splice->ref_end >= ex_beg) || (splice->ref_beg+1 < ex_end && splice->ref_end >= ex_end) ) // ouch, ugly ENST00000409523/long-overlapping-del.vcf
          {
-            splice->csq |= (splice->ref_end - splice->ref_beg + 1)%3 ? CSQ_FRAMESHIFT_VARIANT : CSQ_INFRAME_DELETION;
+            splice->csq |= (splice->ref_end - splice->ref_beg)%3 ? CSQ_FRAMESHIFT_VARIANT : CSQ_INFRAME_DELETION;
              return SPLICE_OVERLAP;
          }
      }
@@ -2076,7 +2099,6 @@ fprintf(bcftools_stderr,"cds splice_csq: %d [%s][%s] .. beg,end=%d %d, ret=%d, c
          child->var  = str.s;
          child->type = HAP_SSS;
          child->csq  = splice.csq;
-        child->prev = parent->type==HAP_SSS ? parent->prev : parent;
          child->rec  = rec;
          return 0;
      }
@@ -2094,7 +2116,7 @@ fprintf(bcftools_stderr,"cds splice_csq: %d [%s][%s] .. beg,end=%d %d, ret=%d, c
          assert( dbeg <= splice.kalt.l );
      }
  
-    if ( parent->type==HAP_SSS ) parent = parent->prev;
+    assert( parent->type!=HAP_SSS );
      if ( parent->type==HAP_CDS )    
      {
          i = parent->icds;
@@ -2404,12 +2426,12 @@ fprintf(bcftools_stderr,"csq_push: %d .. %d\n",rec->pos+1,csq->type.type);
  #endif
      khint_t k = kh_get(pos2vbuf, args->pos2vbuf, (int)csq->pos);
      vbuf_t *vbuf = (k == kh_end(args->pos2vbuf)) ? NULL : kh_val(args->pos2vbuf, k);
-    if ( !vbuf ) error("This should not happen. %s:%d  %s\n",bcf_seqname(args->hdr,rec),csq->pos+1,csq->type.vstr);
+    if ( !vbuf ) error("This should not happen. %s:%d  %s\n",bcf_seqname(args->hdr,rec),csq->pos+1,csq->type.vstr.s);
  
      int i;
      for (i=0; i<vbuf->n; i++)
          if ( vbuf->vrec[i]->line==rec ) break;
-    if ( i==vbuf->n ) error("This should not happen.. %s:%d  %s\n", bcf_seqname(args->hdr,rec),csq->pos+1,csq->type.vstr);
+    if ( i==vbuf->n ) error("This should not happen.. %s:%d  %s\n", bcf_seqname(args->hdr,rec),csq->pos+1,csq->type.vstr.s);
      vrec_t *vrec = vbuf->vrec[i];
  
      // if the variant overlaps donor/acceptor and also splice region, report only donor/acceptor
@@ -2771,26 +2793,20 @@ void hap_finalize(args_t *args, hap_t *hap)
          hap->upstream_stop = 0;
  
          int i = 1, dlen = 0, ibeg, indel = 0;
-        while ( i<istack && hap->stack[i].node->type == HAP_SSS ) i++;
          hap->sbeg = hap->stack[i].node->sbeg;
-
+        assert( hap->stack[istack].node->type != HAP_SSS );
          if ( tr->strand==STRAND_FWD )
          {
              i = 0, ibeg = -1;
              while ( ++i <= istack )
              {
-                if ( hap->stack[i].node->type == HAP_SSS )
-                {
-                    // start/stop/splice site overlap: don't know how to build the haplotypes correctly, skipping
-                    hap_add_csq(args,hap,node,0,i,i,0,0);
-                    continue;
-                }
+                assert( hap->stack[i].node->type != HAP_SSS );
+
                  dlen += hap->stack[i].node->dlen;
                  if ( hap->stack[i].node->dlen ) indel = 1;
  
-                // This condition extends compound variants. Note that s/s/s sites are forced out to always break
-                // a compound block. See ENST00000271583/splice-acceptor.vcf for motivation.
-                if ( i<istack && hap->stack[i+1].node->type != HAP_SSS )
+                // This condition extends compound variants.
+                if ( i<istack )
                  {
                      if ( dlen%3 )   // frameshift
                      {
@@ -2841,14 +2857,10 @@ void hap_finalize(args_t *args, hap_t *hap)
              i = istack + 1, ibeg = -1;
              while ( --i > 0 )
              {
-                if ( hap->stack[i].node->type == HAP_SSS )
-                {
-                    hap_add_csq(args,hap,node,0,i,i,0,0);
-                    continue;
-                }
+                assert ( hap->stack[i].node->type != HAP_SSS );
                  dlen += hap->stack[i].node->dlen;
                  if ( hap->stack[i].node->dlen ) indel = 1;
-                if ( i>1 && hap->stack[i-1].node->type != HAP_SSS )
+                if ( i>1 )
                  {
                      if ( dlen%3 )
                      {
@@ -3354,7 +3366,8 @@ int test_cds(args_t *args, bcf1_t *rec)
              if ( rec->d.allele[1][0]=='<' || rec->d.allele[1][0]=='*' ) { continue; }
              hap_node_t *parent = tr->hap[0] ? tr->hap[0] : tr->root;
              hap_node_t *child  = (hap_node_t*)calloc(1,sizeof(hap_node_t));
-            if ( (hap_ret=hap_init(args, parent, child, cds, rec, 1))!=0 )
+            hap_ret = hap_init(args, parent, child, cds, rec, 1);
+            if ( hap_ret!=0 )
              {
                  // overlapping or intron variant, cannot apply
                  if ( hap_ret==1 )
@@ -3365,7 +3378,22 @@ int test_cds(args_t *args, bcf1_t *rec)
                          fprintf(args->out,"LOG\tWarning: Skipping overlapping variants at %s:%d\t%s>%s\n", chr,rec->pos+1,rec->d.allele[0],rec->d.allele[1]);
                  }
                  else ret = 1;   // prevent reporting as intron in test_tscript
-                free(child);
+                hap_destroy(child);
+                continue;
+            }
+            if ( child->type==HAP_SSS )
+            {
+                csq_t csq; 
+                memset(&csq, 0, sizeof(csq_t));
+                csq.pos          = rec->pos;
+                csq.type.biotype = tr->type;
+                csq.type.strand  = tr->strand;
+                csq.type.trid    = tr->id;
+                csq.type.gene    = tr->gene->name;
+                csq.type.type = child->csq;
+                csq_stage(args, &csq, rec);
+                hap_destroy(child);
+                ret = 1;
                  continue;
              }
              parent->nend--;
@@ -3436,7 +3464,8 @@ int test_cds(args_t *args, bcf1_t *rec)
                  }
  
                  hap_node_t *child = (hap_node_t*)calloc(1,sizeof(hap_node_t));
-                if ( (hap_ret=hap_init(args, parent, child, cds, rec, ial))!=0 )
+                hap_ret = hap_init(args, parent, child, cds, rec, ial);
+                if ( hap_ret!=0 )
                  {
                      // overlapping or intron variant, cannot apply
                      if ( hap_ret==1 )
@@ -3448,10 +3477,23 @@ int test_cds(args_t *args, bcf1_t *rec)
                              fprintf(args->out,"LOG\tWarning: Skipping overlapping variants at %s:%d, sample %s\t%s>%s\n",
                                      chr,rec->pos+1,args->hdr->samples[args->smpl->idx[ismpl]],rec->d.allele[0],rec->d.allele[ial]);
                      }
-                    free(child);
+                    hap_destroy(child);
+                    continue;
+                }
+                if ( child->type==HAP_SSS )
+                {
+                    csq_t csq; 
+                    memset(&csq, 0, sizeof(csq_t));
+                    csq.pos          = rec->pos;
+                    csq.type.biotype = tr->type;
+                    csq.type.strand  = tr->strand;
+                    csq.type.trid    = tr->id;
+                    csq.type.gene    = tr->gene->name;
+                    csq.type.type = child->csq;
+                    csq_stage(args, &csq, rec);
+                    hap_destroy(child);
                      continue;
                  }
-
                  if ( parent->cur_rec!=rec )
                  {
                      hts_expand(int,rec->n_allele,parent->mcur_child,parent->cur_child);
@@ -3710,6 +3752,7 @@ static const char *usage(void)
          "                                     s: skip unphased hets\n"
          "Options:\n"
          "   -e, --exclude <expr>            exclude sites for which the expression is true\n"
+        "       --force                     run even if some sanity checks fail\n"
          "   -i, --include <expr>            select sites for which the expression is true\n"
          "   -o, --output <file>             write output to a file [standard output]\n"
          "   -O, --output-type <b|u|z|v|t>   b: compressed BCF, u: uncompressed BCF, z: compressed VCF\n"
@@ -3741,6 +3784,7 @@ int main_csq(int argc, char *argv[])
  
      static struct option loptions[] =
      {
+        {"force",0,0,1},
          {"help",0,0,'h'},
          {"ncsq",1,0,'n'},
          {"custom-tag",1,0,'c'},
@@ -3767,6 +3811,7 @@ int main_csq(int argc, char *argv[])
      {
          switch (c) 
          {
+            case  1 : args->force = 1; break;
              case 'l': args->local_csq = 1; break;
              case 'c': args->bcsq_tag = optarg; break;
              case 'q': args->quiet++; break;
diff --git a/bcftools/filter.c b/bcftools/filter.c
index 53e8dc7747678de89381e6a6364f49b5b85bac07..02868c986965e1c92670d6f9acc05fd2dcf7a4ba 100644
--- a/bcftools/filter.c
+++ b/bcftools/filter.c
@@ -27,13 +27,27 @@ THE SOFTWARE.  */
  #include <strings.h>
  #include <errno.h>
  #include <math.h>
-#include <wordexp.h>
+#include <sys/types.h>
+#include <pwd.h>
  #include <regex.h>
  #include <htslib/khash_str2int.h>
-#include "filter.h"
-#include "bcftools.h"
  #include <htslib/hts_defs.h>
  #include <htslib/vcfutils.h>
+#include <htslib/kfunc.h>
+#include "config.h"
+#include "filter.h"
+#include "bcftools.h"
+
+#if ENABLE_PERL_FILTERS
+#  define filter_t perl_filter_t
+#  include <EXTERN.h>
+#  include <perl.h>
+#  undef filter_t
+#  define my_perl perl
+
+static int filter_ninit = 0;
+#endif
+
  
  #ifndef __FUNCTION__
  #  define __FUNCTION__ __func__
@@ -63,9 +77,11 @@ typedef struct _token_t
  {
      // read-only values, same for all VCF lines
      int tok_type;       // one of the TOK_* keys below
+    int nargs;          // with TOK_PERLSUB the first argument is the name of the subroutine
      char *key;          // set only for string constants, otherwise NULL
      char *tag;          // for debugging and printout only, VCF tag name
      double threshold;   // filtering threshold
+    int is_constant;    // the threshold is set
      int hdr_id, type;   // BCF header lookup ID and one of BCF_HT_* types
      int idx;            // 0-based index to VCF vectors,
                          //  -2: list (e.g. [0,1,2] or [1..3] or [1..] or any field[*], which is equivalent to [0..])
@@ -100,6 +116,9 @@ struct _filter_t
      float   *tmpf;
      kstring_t tmps;
      int max_unpack, mtmpi, mtmpf, nsamples;
+#if ENABLE_PERL_FILTERS
+    PerlInterpreter *perl;
+#endif
  };
  
  
@@ -130,12 +149,15 @@ struct _filter_t
  #define TOK_LEN     24
  #define TOK_FUNC    25
  #define TOK_CNT     26
+#define TOK_PERLSUB 27
+#define TOK_BINOM   28
  
-//                      0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
-//                        ( ) [ < = > ] ! | &  +  -  *  /  M  m  a  A  O  ~  ^  S  .  l  
-static int op_prec[] = {0,1,1,5,5,5,5,5,5,2,3, 6, 6, 7, 7, 8, 8, 8, 3, 2, 5, 5, 8, 8, 8, 8, 8};
-#define TOKEN_STRING "x()[<=>]!|&+-*/MmaAO~^f"
+//                      0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
+//                        ( ) [ < = > ] ! | &  +  -  *  /  M  m  a  A  O  ~  ^  S  .  l  f  c  p
+static int op_prec[] = {0,1,1,5,5,5,5,5,5,2,3, 6, 6, 7, 7, 8, 8, 8, 3, 2, 5, 5, 8, 8, 8, 8, 8, 8};
+#define TOKEN_STRING "x()[<=>]!|&+-*/MmaAO~^S.lfcp"
  
+// Return negative values if it is a function with variable number of arguments
  static int filters_next_token(char **str, int *len)
  {
      char *tmp = *str;
@@ -162,6 +184,7 @@ static int filters_next_token(char **str, int *len)
      if ( !strncasecmp(tmp,"ABS(",4) ) { (*str) += 3; return TOK_ABS; }
      if ( !strncasecmp(tmp,"COUNT(",4) ) { (*str) += 5; return TOK_CNT; }
      if ( !strncasecmp(tmp,"STRLEN(",7) ) { (*str) += 6; return TOK_LEN; }
+    if ( !strncasecmp(tmp,"BINOM(",6) ) { (*str) += 5; return -TOK_BINOM; }
      if ( !strncasecmp(tmp,"%MAX(",5) ) { (*str) += 4; return TOK_MAX; } // for backward compatibility
      if ( !strncasecmp(tmp,"%MIN(",5) ) { (*str) += 4; return TOK_MIN; } // for backward compatibility
      if ( !strncasecmp(tmp,"%AVG(",5) ) { (*str) += 4; return TOK_AVG; } // for backward compatibility
@@ -169,6 +192,9 @@ static int filters_next_token(char **str, int *len)
      if ( !strncasecmp(tmp,"INFO/",5) ) tmp += 5;
      if ( !strncasecmp(tmp,"FORMAT/",7) ) tmp += 7;
      if ( !strncasecmp(tmp,"FMT/",4) ) tmp += 4;
+    if ( !strncasecmp(tmp,"PERL.",5) ) { (*str) += 5; return -TOK_PERLSUB; }
+    if ( !strncasecmp(tmp,"N_PASS(",7) ) { *len = 6; (*str) += 6; return -TOK_FUNC; }
+    if ( !strncasecmp(tmp,"F_PASS(",7) ) { *len = 6; (*str) += 6; return -TOK_FUNC; }
  
      if ( tmp[0]=='@' )  // file name
      {
@@ -180,22 +206,25 @@ static int filters_next_token(char **str, int *len)
      int square_brackets = 0;
      while ( tmp[0] )
      {
-        if ( tmp[0]=='"' ) break;
-        if ( tmp[0]=='\'' ) break;
-        if ( isspace(tmp[0]) ) break;
-        if ( tmp[0]=='<' ) break;
-        if ( tmp[0]=='>' ) break;
-        if ( tmp[0]=='=' ) break;
-        if ( tmp[0]=='!' ) break;
-        if ( tmp[0]=='&' ) break;
-        if ( tmp[0]=='|' ) break;
-        if ( tmp[0]=='(' ) break;
-        if ( tmp[0]==')' ) break;
-        if ( tmp[0]=='+' ) break;
-        if ( tmp[0]=='*' && !square_brackets ) break;
-        if ( tmp[0]=='-' && !square_brackets ) break;
-        if ( tmp[0]=='/' ) break;
-        if ( tmp[0]=='~' ) break;
+        if ( !square_brackets )
+        {
+            if ( tmp[0]=='"' ) break;
+            if ( tmp[0]=='\'' ) break;
+            if ( isspace(tmp[0]) ) break;
+            if ( tmp[0]=='<' ) break;
+            if ( tmp[0]=='>' ) break;
+            if ( tmp[0]=='=' ) break;
+            if ( tmp[0]=='!' ) break;
+            if ( tmp[0]=='&' ) break;
+            if ( tmp[0]=='|' ) break;
+            if ( tmp[0]=='(' ) break;
+            if ( tmp[0]==')' ) break;
+            if ( tmp[0]=='+' ) break;
+            if ( tmp[0]=='*' ) break;
+            if ( tmp[0]=='-' ) break;
+            if ( tmp[0]=='/' ) break;
+            if ( tmp[0]=='~' ) break;
+        }
          if ( tmp[0]==']' ) { if (square_brackets) tmp++; break; }
          if ( tmp[0]=='[' ) square_brackets++; 
          tmp++;
@@ -250,6 +279,52 @@ static int filters_next_token(char **str, int *len)
      return TOK_VAL;
  }
  
+
+/* 
+    Simple path expansion, expands ~/, ~user, $var. The result must be freed by the caller.
+    
+    Based on jkb's staden code with some adjustements.
+    https://sourceforge.net/p/staden/code/HEAD/tree/staden/trunk/src/Misc/getfile.c#l123
+*/
+char *expand_path(char *path)
+{
+#ifdef _WIN32
+    return strdup(path);    // windows expansion: todo
+#endif
+
+    kstring_t str = {0,0,0};
+
+    if ( path[0] == '~' )
+    {
+        if ( !path[1] || path[1] == '/' )
+        {
+            // ~ or ~/path
+            kputs(getenv("HOME"), &str);
+            if ( path[1] ) kputs(path+1, &str);
+        }
+        else
+        {
+            // user name: ~pd3/path
+            char *end = path;
+            while ( *end && *end!='/' ) end++;
+            kputsn(path+1, end-path-1, &str);
+            struct passwd *pwentry = getpwnam(str.s);
+            str.l = 0;
+
+            if ( !pwentry ) kputsn(path, end-path, &str);
+            else kputs(pwentry->pw_dir, &str);
+            kputs(end, &str);
+        }
+        return str.s;
+    }
+    if ( path[0] == '$' )
+    {
+        char *var = getenv(path+1);
+        if ( var ) path = var;
+    }
+    return strdup(path);
+}
+
  static void filters_set_qual(filter_t *flt, bcf1_t *line, token_t *tok)
  {
      float *ptr = &line->qual;
@@ -856,7 +931,6 @@ static void filters_set_alt_string(filter_t *flt, bcf1_t *line, token_t *tok)
              kputs(line->d.allele[tok->idx + 1], &tok->str_value);
          else
              kputc('.', &tok->str_value);
-        tok->idx = 0;
      }
      else if ( tok->idx==-2 )
      {
@@ -917,6 +991,36 @@ static void filters_set_nmissing(filter_t *flt, bcf1_t *line, token_t *tok)
      tok->nvalues = 1;
      tok->values[0] = tok->tag[0]=='N' ? nmissing : (double)nmissing / line->n_sample;
  }
+static int func_npass(filter_t *flt, bcf1_t *line, token_t *rtok, token_t **stack, int nstack)
+{
+    if ( nstack==0 ) error("Error parsing the expresion\n");
+    token_t *tok = stack[nstack - 1];
+    if ( !tok->nsamples ) error("The function %s works with FORMAT fields\n", rtok->tag);
+
+    rtok->nsamples = tok->nsamples;
+    memcpy(rtok->pass_samples, tok->pass_samples, rtok->nsamples*sizeof(*rtok->pass_samples));
+
+    assert(tok->usmpl);
+    if ( !rtok->usmpl )
+    {
+        rtok->usmpl = (uint8_t*) malloc(tok->nsamples*sizeof(*rtok->usmpl));
+        memcpy(rtok->usmpl, tok->usmpl, tok->nsamples*sizeof(*rtok->usmpl));
+    }
+
+    int i, npass = 0;
+    for (i=0; i<rtok->nsamples; i++)
+    {
+        if ( !rtok->usmpl[i] ) continue;
+        if ( rtok->pass_samples[i] ) npass++;
+    }
+
+    assert( rtok->values );
+    rtok->nvalues = 1;
+    rtok->values[0] = rtok->tag[0]=='N' ? npass : (line->n_sample ? 1.0*npass/line->n_sample : 0);
+    rtok->nsamples = 0;
+
+    return 1;
+}
  static void filters_set_nalt(filter_t *flt, bcf1_t *line, token_t *tok)
  {
      tok->nvalues = 1;
@@ -1118,17 +1222,171 @@ static int func_strlen(filter_t *flt, bcf1_t *line, token_t *rtok, token_t **sta
      }
      else
      {
-        rtok->values[0] = strlen(tok->str_value.s);
+        if ( !tok->str_value.s[1] && tok->str_value.s[0]=='.' )
+            rtok->values[0] = 0;
+        else
+            rtok->values[0] = strlen(tok->str_value.s);
          rtok->nvalues = 1;
      }
      return 1;
  }
+static inline double calc_binom(int na, int nb)
+{
+    if ( na==0 && nb==0 ) return -1;
+    if ( na==nb ) return 1;
+
+    // kfunc.h implements kf_betai, which is the regularized beta function  P(X<=k/N;p) = I_{1-p}(N-k,k+1)
+
+    double pval = na < nb ? kf_betai(nb, na + 1, 0.5) : kf_betai(na, nb + 1, 0.5);
+    pval *= 2;
+    assert( pval-1 < 1e-10 );
+    if ( pval>1 ) pval = 1;     // this can happen, machine precision error, eg. kf_betai(1,0,0.5)
+
+    return pval;
+}
+static int func_binom(filter_t *flt, bcf1_t *line, token_t *rtok, token_t **stack, int nstack)
+{
+    int i, istack = nstack - rtok->nargs;
+
+    if ( rtok->nargs!=2 && rtok->nargs!=1 ) error("Error: binom() takes one or two arguments\n");
+    assert( istack>=0 );
+
+    // The expected mean is 0.5. Should we support also prob!=0.5?
+    //
+    //  double prob = 0.5;
+    //  if ( nstack==3 )
+    //  {
+    //      // three parameters, the first one must be a scalar:  binom(0.25,AD[0],AD[1])
+    //      if ( !stack[istack]->is_constant )
+    //          error("The first argument of binom() must be a numeric constant if three parameters are given\n");
+    //      prob = stack[istack]->threshold;
+    //      istack++;
+    //  }
+    //  else if ( nstack==2 && stack[istack]->is_constant )
+    //  {
+    //      // two parameters, the first can be a scalar:  binom(0.25,AD) or binom(AD[0],AD[1])
+    //      prob = stack[istack]->threshold;
+    //      istack++;
+    //  }
+
+    token_t *tok = stack[istack];
+    if ( tok->nsamples )
+    {
+        // working with a FORMAT tag
+        rtok->nval1    = 1;
+        rtok->nvalues  = tok->nsamples;
+        rtok->nsamples = tok->nsamples;
+        hts_expand(double, rtok->nvalues, rtok->mvalues, rtok->values);
+        assert(tok->usmpl);
+        if ( !rtok->usmpl ) rtok->usmpl = (uint8_t*) malloc(tok->nsamples);
+        memcpy(rtok->usmpl, tok->usmpl, tok->nsamples);
+
+        if ( istack+1==nstack )
+        {
+            // determine the index from the GT field: binom(AD)
+            int ngt = bcf_get_genotypes(flt->hdr, line, &flt->tmpi, &flt->mtmpi);
+            int max_ploidy = ngt/line->n_sample;
+            if ( ngt <= 0 || max_ploidy < 2 ) // GT not present or not diploid, cannot set
+            {
+                for (i=0; i<rtok->nsamples; i++)
+                    if ( rtok->usmpl[i] ) bcf_double_set_missing(rtok->values[i]);
+                return rtok->nargs;
+            }
+            for (i=0; i<rtok->nsamples; i++)
+            {
+                if ( !rtok->usmpl[i] ) continue;
+                int32_t *ptr = flt->tmpi + i*max_ploidy;
+                if ( bcf_gt_is_missing(ptr[0]) || bcf_gt_is_missing(ptr[1]) || ptr[1]==bcf_int32_vector_end )
+                {
+                    bcf_double_set_missing(rtok->values[i]);
+                    continue;
+                }
+                int idx1 = bcf_gt_allele(ptr[0]);
+                int idx2 = bcf_gt_allele(ptr[1]);
+                if ( idx1>=line->n_allele ) error("Incorrect allele index at %s:%d, sample %s\n", bcf_seqname(flt->hdr,line),line->pos+1,flt->hdr->samples[i]);
+                if ( idx2>=line->n_allele ) error("Incorrect allele index at %s:%d, sample %s\n", bcf_seqname(flt->hdr,line),line->pos+1,flt->hdr->samples[i]);
+                double *vals = tok->values + tok->nval1*i;
+                if ( bcf_double_is_missing(vals[idx1]) || bcf_double_is_missing(vals[idx2]) )
+                {
+                    bcf_double_set_missing(rtok->values[i]);
+                    continue;
+                }
+                rtok->values[i] = calc_binom(vals[idx1],vals[idx2]);
+                if ( rtok->values[i] < 0 )
+                {
+                    bcf_double_set_missing(rtok->values[i]);
+                    continue;
+                }
+            }
+        }
+        else
+        {
+            // the fields given explicitly: binom(AD[:0],AD[:1])
+            token_t *tok2 = stack[istack+1];
+            if ( tok->nval1!=1 || tok2->nval1!=1 )
+                error("Expected one value per binom() argument, found %d and %d at %s:%d\n",tok->nval1,tok2->nval1, bcf_seqname(flt->hdr,line),line->pos+1);
+            for (i=0; i<rtok->nsamples; i++)
+            {
+                if ( !rtok->usmpl[i] ) continue;
+                double *ptr1 = tok->values + tok->nval1*i;
+                double *ptr2 = tok2->values + tok2->nval1*i;
+                if ( bcf_double_is_missing(ptr1[0]) || bcf_double_is_missing(ptr2[0]) )
+                {
+                    bcf_double_set_missing(rtok->values[i]);
+                    continue;
+                }
+                rtok->values[i] = calc_binom(ptr1[0],ptr2[0]);
+                if ( rtok->values[i] < 0 )
+                {
+                    bcf_double_set_missing(rtok->values[i]);
+                    continue;
+                }
+            }
+        }
+    }
+    else
+    {
+        // working with an INFO tag
+        rtok->nvalues  = 1;
+        hts_expand(double, rtok->nvalues, rtok->mvalues, rtok->values);
+
+        double *ptr1 = NULL, *ptr2 = NULL;
+        if ( istack+1==nstack )
+        {
+            // only one tag, expecting two values: binom(INFO/AD)
+            if ( tok->nvalues==2 )
+            {
+                ptr1 = &tok->values[0];
+                ptr2 = &tok->values[1];
+            }
+        }
+        else
+        {
+            // two tags, expecting one value in each: binom(INFO/AD[0],INFO/AD[1])
+            token_t *tok2 = stack[istack+1];
+            if ( tok->nvalues==1 && tok2->nvalues==1 )
+            {
+                ptr1 = &tok->values[0];
+                ptr2 = &tok2->values[0];
+            }
+        }
+        if ( !ptr1 || !ptr2 || bcf_double_is_missing(ptr1[0]) || bcf_double_is_missing(ptr2[0]) )
+            bcf_double_set_missing(rtok->values[0]);
+        else
+        {
+            rtok->values[0] = calc_binom(ptr1[0],ptr2[0]);
+            if ( rtok->values[0] < 0 )
+                bcf_double_set_missing(rtok->values[0]);
+        }
+    }
+    return rtok->nargs;
+}
  inline static void tok_init_values(token_t *atok, token_t *btok, token_t *rtok)
  {
      token_t *tok = atok->nvalues > btok->nvalues ? atok : btok;
      rtok->nvalues = tok->nvalues;
      rtok->nval1   = tok->nval1;
-    hts_expand(double*, rtok->nvalues, rtok->mvalues, rtok->values);
+    hts_expand(double, rtok->nvalues, rtok->mvalues, rtok->values);
  }
  inline static void tok_init_samples(token_t *atok, token_t *btok, token_t *rtok)
  {
@@ -1190,6 +1448,8 @@ inline static void tok_init_samples(token_t *atok, token_t *btok, token_t *rtok)
  
  static int vector_logic_or(filter_t *filter, bcf1_t *line, token_t *rtok, token_t **stack, int nstack)
  {
+    if ( nstack < 2 ) error("Error occurred while processing the filter \"%s\"\n", filter->str);
+
      token_t *atok = stack[nstack-2];
      token_t *btok = stack[nstack-1];
      tok_init_samples(atok, btok, rtok);
@@ -1250,6 +1510,8 @@ static int vector_logic_or(filter_t *filter, bcf1_t *line, token_t *rtok, token_
  }
  static int vector_logic_and(filter_t *filter, bcf1_t *line, token_t *rtok, token_t **stack, int nstack)
  {
+    if ( nstack < 2 ) error("Error occurred while processing the filter \"%s\". (nstack=%d)\n", filter->str,nstack);
+
      token_t *atok = stack[nstack-2];
      token_t *btok = stack[nstack-1];
      tok_init_samples(atok, btok, rtok);
@@ -1612,9 +1874,46 @@ static void parse_tag_idx(bcf_hdr_t *hdr, int is_fmt, char *tag, char *tag_idx,
      int *idxs2 = NULL, nidxs2 = 0, idx2 = 0;
  
      int set_samples = 0;
-    char *colon = index(tag_idx, ':');
-    if ( colon )
+    char *colon = rindex(tag_idx, ':');
+    if ( tag_idx[0]=='@' )     // file list with sample names
+    {
+        if ( !is_fmt ) error("Could not parse \"%s\". (Not a FORMAT tag yet a sample list provided.)\n", ori);
+        char *fname = expand_path(tag_idx+1);
+        int nsmpl;
+        char **list = hts_readlist(fname, 1, &nsmpl);
+        if ( !list && colon )
+        {
+            if ( parse_idxs(colon+1, &idxs2, &nidxs2, &idx2) != 0 ) error("Could not parse the index: %s\n", ori);
+            tok->idxs  = idxs2;
+            tok->nidxs = nidxs2;
+            tok->idx   = idx2;
+            colon = rindex(fname, ':');
+            *colon = 0;
+            list = hts_readlist(fname, 1, &nsmpl);
+        }
+        if ( !list ) error("Could not read: %s\n", fname);
+        free(fname);
+        tok->nsamples = bcf_hdr_nsamples(hdr);
+        tok->usmpl = (uint8_t*) calloc(tok->nsamples,1); 
+        for (i=0; i<nsmpl; i++)
+        {
+            int ismpl = bcf_hdr_id2int(hdr,BCF_DT_SAMPLE,list[i]);
+            if ( ismpl<0 ) error("No such sample in the VCF: \"%s\"\n", list[i]);
+            free(list[i]);
+            tok->usmpl[ismpl] = 1;
+        }
+        free(list);
+        if ( !colon )
+        {
+            tok->idxs = (int*) malloc(sizeof(int));
+            tok->idxs[0] = -1;
+            tok->nidxs   = 1;
+            tok->idx     = -2;
+        }
+    }
+    else if ( colon )
      {
+        if ( !is_fmt ) error("Could not parse the index \"%s\". (Not a FORMAT tag yet sample index implied.)\n", ori);
          *colon = 0;
          if ( parse_idxs(tag_idx, &idxs1, &nidxs1, &idx1) != 0 ) error("Could not parse the index: %s\n", ori);
          if ( parse_idxs(colon+1, &idxs2, &nidxs2, &idx2) != 0 ) error("Could not parse the index: %s\n", ori);
@@ -1682,6 +1981,18 @@ static void parse_tag_idx(bcf_hdr_t *hdr, int is_fmt, char *tag, char *tag_idx,
          for (i=0; i<tok->nidxs; i++) if ( tok->idxs[i] ) tok->nuidxs++;
      }
  }
+static int max_ac_an_unpack(bcf_hdr_t *hdr)
+{
+    int hdr_id = bcf_hdr_id2int(hdr,BCF_DT_ID,"AC");
+    if ( hdr_id<0 ) return BCF_UN_FMT;
+    if ( !bcf_hdr_idinfo_exists(hdr,BCF_HL_INFO,hdr_id) ) return BCF_UN_FMT;
+
+    hdr_id = bcf_hdr_id2int(hdr,BCF_DT_ID,"AN");
+    if ( hdr_id<0 ) return BCF_UN_FMT;
+    if ( !bcf_hdr_idinfo_exists(hdr,BCF_HL_INFO,hdr_id) ) return BCF_UN_FMT;
+
+    return BCF_UN_INFO;
+}
  static int filters_init1(filter_t *filter, char *str, int len, token_t *tok)
  {
      tok->tok_type  = TOK_VAL;
@@ -1711,13 +2022,11 @@ static int filters_init1(filter_t *filter, char *str, int len, token_t *tok)
          tok->tag = (char*) calloc(len+1,sizeof(char));
          memcpy(tok->tag,str,len);
          tok->tag[len] = 0;
-        wordexp_t wexp;
-        wordexp(tok->tag+1, &wexp, 0);
-        if ( !wexp.we_wordc ) error("No such file: %s\n", tok->tag+1);
+        char *fname = expand_path(tok->tag+1);
          int i, n;
-        char **list = hts_readlist(wexp.we_wordv[0], 1, &n);
-        if ( !list ) error("Could not read: %s\n", wexp.we_wordv[0]);
-        wordfree(&wexp);
+        char **list = hts_readlist(fname, 1, &n);
+        if ( !list ) error("Could not read: %s\n", fname);
+        free(fname);
          tok->hash = khash_str2int_init();
          for (i=0; i<n; i++)
          {
@@ -1829,8 +2138,9 @@ static int filters_init1(filter_t *filter, char *str, int len, token_t *tok)
      {
          if ( tok->hdr_id >=0 )
          {
-            if ( bcf_hdr_idinfo_exists(filter->hdr,BCF_HL_INFO,tok->hdr_id) ) is_fmt = 0;
-            else if ( bcf_hdr_idinfo_exists(filter->hdr,BCF_HL_FMT,tok->hdr_id) ) is_fmt = 1;
+            int is_info = bcf_hdr_idinfo_exists(filter->hdr,BCF_HL_INFO,tok->hdr_id) ? 1 : 0;
+            is_fmt = bcf_hdr_idinfo_exists(filter->hdr,BCF_HL_FMT,tok->hdr_id) ? 1 : 0;
+            if ( is_info && is_fmt ) error("Both INFO/%s and FORMAT/%s exist, which one do you want?\n", tmp.s,tmp.s);
          }
          if ( is_fmt==-1 ) is_fmt = 0;
      }
@@ -1916,6 +2226,7 @@ static int filters_init1(filter_t *filter, char *str, int len, token_t *tok)
      }
      else if ( !strcasecmp(tmp.s,"AN") )
      {
+        filter->max_unpack |= BCF_UN_FMT;
          tok->setter = &filters_set_an;
          tok->tag = strdup("AN");
          free(tmp.s);
@@ -1923,6 +2234,7 @@ static int filters_init1(filter_t *filter, char *str, int len, token_t *tok)
      }
      else if ( !strcasecmp(tmp.s,"AC") )
      {
+        filter->max_unpack |= BCF_UN_FMT;
          tok->setter = &filters_set_ac;
          tok->tag = strdup("AC");
          free(tmp.s);
@@ -1930,6 +2242,7 @@ static int filters_init1(filter_t *filter, char *str, int len, token_t *tok)
      }
      else if ( !strcasecmp(tmp.s,"MAC") )
      {
+        filter->max_unpack |= max_ac_an_unpack(filter->hdr);
          tok->setter = &filters_set_mac;
          tok->tag = strdup("MAC");
          free(tmp.s);
@@ -1937,6 +2250,7 @@ static int filters_init1(filter_t *filter, char *str, int len, token_t *tok)
      }
      else if ( !strcasecmp(tmp.s,"AF") )
      {
+        filter->max_unpack |= max_ac_an_unpack(filter->hdr);
          tok->setter = &filters_set_af;
          tok->tag = strdup("AF");
          free(tmp.s);
@@ -1944,6 +2258,7 @@ static int filters_init1(filter_t *filter, char *str, int len, token_t *tok)
      }
      else if ( !strcasecmp(tmp.s,"MAF") )
      {
+        filter->max_unpack |= max_ac_an_unpack(filter->hdr);
          tok->setter = &filters_set_maf;
          tok->tag = strdup("MAF");
          free(tmp.s);
@@ -1962,6 +2277,7 @@ static int filters_init1(filter_t *filter, char *str, int len, token_t *tok)
          tok->threshold = strtof(tmp.s, &end);   // float?
          if ( errno!=0 || end!=tmp.s+len ) error("[%s:%d %s] Error: the tag \"%s\" is not defined in the VCF header\n", __FILE__,__LINE__,__FUNCTION__,tmp.s);
      }
+    tok->is_constant = 1;
  
      if ( tmp.s ) free(tmp.s);
      return 0;
@@ -1994,6 +2310,126 @@ static void str_to_lower(char *str)
  {
      while ( *str ) { *str = tolower(*str); str++; }
  }
+static int perl_exec(filter_t *flt, bcf1_t *line, token_t *rtok, token_t **stack, int nstack)
+{
+#if ENABLE_PERL_FILTERS
+
+    PerlInterpreter *perl = flt->perl;
+    if ( !perl ) error("Error: perl expression without a perl script name\n");
+
+    dSP;
+    ENTER;
+    SAVETMPS;
+
+    PUSHMARK(SP);
+    int i,j, istack = nstack - rtok->nargs;
+    for (i=istack+1; i<nstack; i++)
+    {
+        token_t *tok = stack[i];
+        if ( tok->is_str )
+            XPUSHs(sv_2mortal(newSVpvn(tok->str_value.s,tok->str_value.l)));
+        else if ( tok->nvalues==1 )
+            XPUSHs(sv_2mortal(newSVnv(tok->values[0])));
+        else if ( tok->nvalues>1 )
+        {
+            AV *av = newAV();
+            for (j=0; j<tok->nvalues; j++) av_push(av, newSVnv(tok->values[j]));
+            SV *rv = newRV_inc((SV*)av);
+            XPUSHs(rv);
+        }
+        else
+        {
+            bcf_double_set_missing(tok->values[0]);
+            XPUSHs(sv_2mortal(newSVnv(tok->values[0])));
+        }
+    }
+    PUTBACK;
+
+    // A possible future todo: provide a means to select samples and indexes,
+    // expressions like this don't work yet
+    //          perl.filter(FMT/AD)[1:0]
+
+    int nret = call_pv(stack[istack]->str_value.s, G_ARRAY);
+
+    SPAGAIN;
+
+    rtok->nvalues = nret;
+    hts_expand(double, rtok->nvalues, rtok->mvalues, rtok->values);
+    for (i=nret; i>0; i--)
+    {
+        rtok->values[i-1] = (double) POPn;
+        if ( isnan(rtok->values[i-1]) ) bcf_double_set_missing(rtok->values[i-1]);
+    }
+
+    PUTBACK;
+    FREETMPS;
+    LEAVE;
+
+#else
+    error("\nPerl filtering requires running `configure --enable-perl-filters` at compile time.\n\n");
+#endif
+    return rtok->nargs;
+}
+static void perl_init(filter_t *filter, char **str)
+{
+    char *beg = *str;
+    while ( *beg && isspace(*beg) ) beg++;
+    if ( !*beg ) return;
+    if ( strncasecmp("perl:", beg, 5) ) return;
+#if ENABLE_PERL_FILTERS
+    beg += 5;
+    char *end = beg;
+    while ( *end && *end!=';' ) end++;  // for now not escaping semicolons
+    *str = end+1;
+
+    if ( ++filter_ninit == 1 )
+    {
+        // must be executed only once, even for multiple filters; first time here
+        int argc = 0;
+        char **argv = NULL;
+        char **env  = NULL;
+        PERL_SYS_INIT3(&argc, &argv, &env);
+    }
+    
+    filter->perl = perl_alloc();
+    PerlInterpreter *perl = filter->perl;
+
+    if ( !perl ) error("perl_alloc failed\n");
+    perl_construct(perl);
+
+    // name of the perl script to run
+    char *rmme = (char*) calloc(end - beg + 1,1);
+    memcpy(rmme, beg, end - beg);
+    char *argv[] = { "", "" };
+    argv[1] = expand_path(rmme);
+    free(rmme);
+
+    PL_origalen = 1;    // don't allow $0 change
+    int ret = perl_parse(filter->perl, NULL, 2, argv, NULL);
+    PL_exit_flags |= PERL_EXIT_DESTRUCT_END;
+    if ( ret ) error("Failed to parse: %s\n", argv[1]);
+    free(argv[1]);
+
+    perl_run(perl);
+#else
+    error("\nPerl filtering requires running `configure --enable-perl-filters` at compile time.\n\n");
+#endif
+}
+static void perl_destroy(filter_t *filter)
+{
+#if ENABLE_PERL_FILTERS
+    if ( !filter->perl ) return;
+
+    PerlInterpreter *perl = filter->perl;
+    perl_destruct(perl);
+    perl_free(perl);
+    if ( --filter_ninit <= 0  )
+    {
+        // similarly to PERL_SYS_INIT3, this must be executed only once? todo: test
+        PERL_SYS_TERM();
+    }
+#endif
+}
  
  
  // Parse filter expression and convert to reverse polish notation. Dijkstra's shunting-yard algorithm
@@ -2004,10 +2440,13 @@ filter_t *filter_init(bcf_hdr_t *hdr, const char *str)
      filter->hdr = hdr;
      filter->max_unpack |= BCF_UN_STR;
  
-    int nops = 0, mops = 0, *ops = NULL;    // operators stack
-    int nout = 0, mout = 0;                 // filter tokens, RPN
+    int nops = 0, mops = 0;    // operators stack
+    int nout = 0, mout = 0;    // filter tokens, RPN
      token_t *out = NULL;
+    token_t *ops = NULL;
      char *tmp = filter->str;
+    perl_init(filter, &tmp);
+
      int last_op = -1;
      while ( *tmp )
      {
@@ -2016,24 +2455,26 @@ filter_t *filter_init(bcf_hdr_t *hdr, const char *str)
          if ( ret==-1 ) error("Missing quotes in: %s\n", str);
  
          // fprintf(stderr,"token=[%c] .. [%s] %d\n", TOKEN_STRING[ret], tmp, len);
-        // int i; for (i=0; i<nops; i++) fprintf(stderr," .%c.", TOKEN_STRING[ops[i]]); fprintf(stderr,"\n");
+        // int i; for (i=0; i<nops; i++) fprintf(stderr," .%c", TOKEN_STRING[ops[i]]); fprintf(stderr,"\n");
  
          if ( ret==TOK_LFT )         // left bracket
          {
              nops++;
-            hts_expand(int, nops, mops, ops);
-            ops[nops-1] = ret;
+            hts_expand0(token_t, nops, mops, ops);
+            ops[nops-1].tok_type = ret;
          }
          else if ( ret==TOK_RGT )    // right bracket
          {
-            while ( nops>0 && ops[nops-1]!=TOK_LFT )
+            while ( nops>0 && ops[nops-1].tok_type!=TOK_LFT )
              {
                  nout++;
                  hts_expand0(token_t, nout, mout, out);
-                out[nout-1].tok_type = ops[nops-1];
+                out[nout-1] = ops[nops-1];
+                memset(&ops[nops-1],0,sizeof(token_t));
                  nops--;
              }
              if ( nops<=0 ) error("Could not parse: %s\n", str);
+            memset(&ops[nops-1],0,sizeof(token_t));
              nops--;
          }
          else if ( ret!=TOK_VAL )    // one of the operators
@@ -2050,19 +2491,90 @@ filter_t *filter_init(bcf_hdr_t *hdr, const char *str)
                  tok->threshold = -1.0;
                  ret = TOK_MULT;
              }
+            else if ( ret == -TOK_FUNC )
+            {
+                // this is different from TOK_PERLSUB,TOK_BINOM in that the expression inside the
+                // brackets gets evaluated as normal expression
+                nops++;
+                hts_expand0(token_t, nops, mops, ops);
+                token_t *tok = &ops[nops-1];
+                tok->tok_type  = -ret;
+                tok->hdr_id    = -1;
+                tok->pass_site = -1;
+                tok->threshold = -1.0;
+                if ( !strncasecmp(tmp-len,"N_PASS",6) ) { tok->func = func_npass; tok->tag = strdup("N_PASS"); }
+                else if ( !strncasecmp(tmp-len,"F_PASS",6) ) { tok->func = func_npass; tok->tag = strdup("F_PASS"); }
+                else error("The function \"%s\" is not supported\n", tmp-len);
+                continue;
+            }
+            else if ( ret < 0 )     // variable number of arguments: TOK_PERLSUB,TOK_BINOM
+            {
+                ret = -ret;
+
+                tmp += len;
+                char *beg = tmp;
+                kstring_t rmme = {0,0,0};
+                int i, margs, nargs = 0;
+
+                if ( ret == TOK_PERLSUB )
+                {
+                    while ( *beg && ((isalnum(*beg) && !ispunct(*beg)) || *beg=='_') ) beg++;
+                    if ( *beg!='(' ) error("Could not parse the expression: %s\n", str);
+
+                    // the subroutine name
+                    kputc('"', &rmme);
+                    kputsn(tmp, beg-tmp, &rmme);
+                    kputc('"', &rmme);
+                    nout++;
+                    hts_expand0(token_t, nout, mout, out);
+                    filters_init1(filter, rmme.s, rmme.l, &out[nout-1]);
+                    nargs++;
+                }
+                char *end = beg;
+                while ( *end && *end!=')' ) end++;
+                if ( !*end ) error("Could not parse the expression: %s\n", str);
+
+                // subroutine arguments
+                rmme.l = 0;
+                kputsn(beg+1, end-beg-1, &rmme);
+                char **rmme_list = hts_readlist(rmme.s, 0, &margs);
+                for (i=0; i<margs; i++)
+                {
+                    nargs++;
+                    nout++;
+                    hts_expand0(token_t, nout, mout, out);
+                    filters_init1(filter, rmme_list[i], strlen(rmme_list[i]), &out[nout-1]);
+                    free(rmme_list[i]);
+                }
+                free(rmme_list);
+                free(rmme.s);
+
+                nout++;
+                hts_expand0(token_t, nout, mout, out);
+                token_t *tok = &out[nout-1];
+                tok->tok_type  = ret;
+                tok->nargs     = nargs;
+                tok->hdr_id    = -1;
+                tok->pass_site = -1;
+                tok->threshold = -1.0;
+
+                tmp = end + 1;
+                continue;
+            }
              else
              {
-                while ( nops>0 && op_prec[ret] < op_prec[ops[nops-1]] )
+                while ( nops>0 && op_prec[ret] < op_prec[ops[nops-1].tok_type] )
                  {
                      nout++;
                      hts_expand0(token_t, nout, mout, out);
-                    out[nout-1].tok_type = ops[nops-1];
+                    out[nout-1] = ops[nops-1];
+                    memset(&ops[nops-1],0,sizeof(token_t));
                      nops--;
                  }
              }
              nops++;
-            hts_expand(int, nops, mops, ops);
-            ops[nops-1] = ret;
+            hts_expand0(token_t, nops, mops, ops);
+            ops[nops-1].tok_type = ret;
          }
          else if ( !len )
          {
@@ -2073,17 +2585,21 @@ filter_t *filter_init(bcf_hdr_t *hdr, const char *str)
          {
              nout++;
              hts_expand0(token_t, nout, mout, out);
-            filters_init1(filter, tmp, len, &out[nout-1]);
+            if ( tmp[len-1]==',' )
+                filters_init1(filter, tmp, len-1, &out[nout-1]);
+            else
+                filters_init1(filter, tmp, len, &out[nout-1]);
              tmp += len;
          }
          last_op = ret;
      }
      while ( nops>0 )
      {
-        if ( ops[nops-1]==TOK_LFT || ops[nops-1]==TOK_RGT ) error("Could not parse the expression: [%s]\n", filter->str);
+        if ( ops[nops-1].tok_type==TOK_LFT || ops[nops-1].tok_type==TOK_RGT ) error("Could not parse the expression: [%s]\n", filter->str);
          nout++;
          hts_expand0(token_t, nout, mout, out);
-        out[nout-1].tok_type = ops[nops-1];
+        out[nout-1] = ops[nops-1];
+        memset(&ops[nops-1],0,sizeof(token_t));
          nops--;
      }
  
@@ -2096,6 +2612,9 @@ filter_t *filter_init(bcf_hdr_t *hdr, const char *str)
      int i;
      for (i=0; i<nout; i++)
      {
+        if ( i+1<nout && (out[i].tok_type==TOK_LT || out[i].tok_type==TOK_BT) && out[i+1].tok_type==TOK_EQ )
+            error("Error parsing the expression: \"%s\"\n", filter->str);
+
          if ( out[i].tok_type==TOK_OR || out[i].tok_type==TOK_OR_VEC )
              out[i].func = vector_logic_or;
          if ( out[i].tok_type==TOK_AND || out[i].tok_type==TOK_AND_VEC )
@@ -2213,13 +2732,15 @@ filter_t *filter_init(bcf_hdr_t *hdr, const char *str)
      filter->nsamples = filter->max_unpack&BCF_UN_FMT ? bcf_hdr_nsamples(filter->hdr) : 0;
      for (i=0; i<nout; i++)
      {
-        if ( out[i].tok_type==TOK_MAX )      { out[i].func = func_max; out[i].tok_type = TOK_FUNC; }
-        else if ( out[i].tok_type==TOK_MIN ) { out[i].func = func_min; out[i].tok_type = TOK_FUNC; }
-        else if ( out[i].tok_type==TOK_AVG ) { out[i].func = func_avg; out[i].tok_type = TOK_FUNC; }
-        else if ( out[i].tok_type==TOK_SUM ) { out[i].func = func_sum; out[i].tok_type = TOK_FUNC; }
-        else if ( out[i].tok_type==TOK_ABS ) { out[i].func = func_abs; out[i].tok_type = TOK_FUNC; }
-        else if ( out[i].tok_type==TOK_CNT ) { out[i].func = func_count; out[i].tok_type = TOK_FUNC; }
-        else if ( out[i].tok_type==TOK_LEN ) { out[i].func = func_strlen; out[i].tok_type = TOK_FUNC; }
+        if ( out[i].tok_type==TOK_MAX )      { out[i].func = func_max; out[i].tok_type = TOK_FUNC; out[i].tok_type = 1; }
+        else if ( out[i].tok_type==TOK_MIN ) { out[i].func = func_min; out[i].tok_type = TOK_FUNC; out[i].tok_type = 1; }
+        else if ( out[i].tok_type==TOK_AVG ) { out[i].func = func_avg; out[i].tok_type = TOK_FUNC; out[i].tok_type = 1; }
+        else if ( out[i].tok_type==TOK_SUM ) { out[i].func = func_sum; out[i].tok_type = TOK_FUNC; out[i].tok_type = 1; }
+        else if ( out[i].tok_type==TOK_ABS ) { out[i].func = func_abs; out[i].tok_type = TOK_FUNC; out[i].tok_type = 1; }
+        else if ( out[i].tok_type==TOK_CNT ) { out[i].func = func_count; out[i].tok_type = TOK_FUNC; out[i].tok_type = 1; }
+        else if ( out[i].tok_type==TOK_LEN ) { out[i].func = func_strlen; out[i].tok_type = TOK_FUNC; out[i].tok_type = 1; }
+        else if ( out[i].tok_type==TOK_BINOM ) { out[i].func = func_binom; out[i].tok_type = TOK_FUNC; }
+        else if ( out[i].tok_type==TOK_PERLSUB ) { out[i].func = perl_exec; out[i].tok_type = TOK_FUNC; }
          hts_expand0(double,1,out[i].mvalues,out[i].values);
          if ( filter->nsamples )
          {
@@ -2240,6 +2761,7 @@ filter_t *filter_init(bcf_hdr_t *hdr, const char *str)
  
  void filter_destroy(filter_t *filter)
  {
+    perl_destroy(filter);
      int i;
      for (i=0; i<filter->nfilters; i++)
      {
@@ -2274,7 +2796,6 @@ int filter_test(filter_t *filter, bcf1_t *line, const uint8_t **samples)
      for (i=0; i<filter->nfilters; i++)
      {
          filter->filters[i].pass_site = 0;
-
          if ( filter->filters[i].tok_type == TOK_VAL )
          {
              if ( filter->filters[i].setter )    // variable, query the VCF line
diff --git a/bcftools/filter.c.pysam.c b/bcftools/filter.c.pysam.c

index 83f49e0f69932bece89c7632eca48cf9f2fbe41d..82c2b85977abb43e59f20882222dcfa2c35b47b1 100644 (file)
--- a/bcftools/filter.c.pysam.c
+++ b/bcftools/filter.c.pysam.c
@@ -29,13 +29,27 @@ THE SOFTWARE.  */
  #include <strings.h>
  #include <errno.h>
  #include <math.h>
-#include <wordexp.h>
+#include <sys/types.h>
+#include <pwd.h>
  #include <regex.h>
  #include <htslib/khash_str2int.h>
-#include "filter.h"
-#include "bcftools.h"
  #include <htslib/hts_defs.h>
  #include <htslib/vcfutils.h>
+#include <htslib/kfunc.h>
+#include "config.h"
+#include "filter.h"
+#include "bcftools.h"
+
+#if ENABLE_PERL_FILTERS
+#  define filter_t perl_filter_t
+#  include <EXTERN.h>
+#  include <perl.h>
+#  undef filter_t
+#  define my_perl perl
+
+static int filter_ninit = 0;
+#endif
+
  
  #ifndef __FUNCTION__
  #  define __FUNCTION__ __func__
@@ -65,9 +79,11 @@ typedef struct _token_t
  {
      // read-only values, same for all VCF lines
      int tok_type;       // one of the TOK_* keys below
+    int nargs;          // with TOK_PERLSUB the first argument is the name of the subroutine
      char *key;          // set only for string constants, otherwise NULL
      char *tag;          // for debugging and printout only, VCF tag name
      double threshold;   // filtering threshold
+    int is_constant;    // the threshold is set
      int hdr_id, type;   // BCF header lookup ID and one of BCF_HT_* types
      int idx;            // 0-based index to VCF vectors,
                          //  -2: list (e.g. [0,1,2] or [1..3] or [1..] or any field[*], which is equivalent to [0..])
@@ -102,6 +118,9 @@ struct _filter_t
      float   *tmpf;
      kstring_t tmps;
      int max_unpack, mtmpi, mtmpf, nsamples;
+#if ENABLE_PERL_FILTERS
+    PerlInterpreter *perl;
+#endif
  };
  
  
@@ -132,12 +151,15 @@ struct _filter_t
  #define TOK_LEN     24
  #define TOK_FUNC    25
  #define TOK_CNT     26
+#define TOK_PERLSUB 27
+#define TOK_BINOM   28
  
-//                      0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
-//                        ( ) [ < = > ] ! | &  +  -  *  /  M  m  a  A  O  ~  ^  S  .  l  
-static int op_prec[] = {0,1,1,5,5,5,5,5,5,2,3, 6, 6, 7, 7, 8, 8, 8, 3, 2, 5, 5, 8, 8, 8, 8, 8};
-#define TOKEN_STRING "x()[<=>]!|&+-*/MmaAO~^f"
+//                      0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
+//                        ( ) [ < = > ] ! | &  +  -  *  /  M  m  a  A  O  ~  ^  S  .  l  f  c  p
+static int op_prec[] = {0,1,1,5,5,5,5,5,5,2,3, 6, 6, 7, 7, 8, 8, 8, 3, 2, 5, 5, 8, 8, 8, 8, 8, 8};
+#define TOKEN_STRING "x()[<=>]!|&+-*/MmaAO~^S.lfcp"
  
+// Return negative values if it is a function with variable number of arguments
  static int filters_next_token(char **str, int *len)
  {
      char *tmp = *str;
@@ -164,6 +186,7 @@ static int filters_next_token(char **str, int *len)
      if ( !strncasecmp(tmp,"ABS(",4) ) { (*str) += 3; return TOK_ABS; }
      if ( !strncasecmp(tmp,"COUNT(",4) ) { (*str) += 5; return TOK_CNT; }
      if ( !strncasecmp(tmp,"STRLEN(",7) ) { (*str) += 6; return TOK_LEN; }
+    if ( !strncasecmp(tmp,"BINOM(",6) ) { (*str) += 5; return -TOK_BINOM; }
      if ( !strncasecmp(tmp,"%MAX(",5) ) { (*str) += 4; return TOK_MAX; } // for backward compatibility
      if ( !strncasecmp(tmp,"%MIN(",5) ) { (*str) += 4; return TOK_MIN; } // for backward compatibility
      if ( !strncasecmp(tmp,"%AVG(",5) ) { (*str) += 4; return TOK_AVG; } // for backward compatibility
@@ -171,6 +194,9 @@ static int filters_next_token(char **str, int *len)
      if ( !strncasecmp(tmp,"INFO/",5) ) tmp += 5;
      if ( !strncasecmp(tmp,"FORMAT/",7) ) tmp += 7;
      if ( !strncasecmp(tmp,"FMT/",4) ) tmp += 4;
+    if ( !strncasecmp(tmp,"PERL.",5) ) { (*str) += 5; return -TOK_PERLSUB; }
+    if ( !strncasecmp(tmp,"N_PASS(",7) ) { *len = 6; (*str) += 6; return -TOK_FUNC; }
+    if ( !strncasecmp(tmp,"F_PASS(",7) ) { *len = 6; (*str) += 6; return -TOK_FUNC; }
  
      if ( tmp[0]=='@' )  // file name
      {
@@ -182,22 +208,25 @@ static int filters_next_token(char **str, int *len)
      int square_brackets = 0;
      while ( tmp[0] )
      {
-        if ( tmp[0]=='"' ) break;
-        if ( tmp[0]=='\'' ) break;
-        if ( isspace(tmp[0]) ) break;
-        if ( tmp[0]=='<' ) break;
-        if ( tmp[0]=='>' ) break;
-        if ( tmp[0]=='=' ) break;
-        if ( tmp[0]=='!' ) break;
-        if ( tmp[0]=='&' ) break;
-        if ( tmp[0]=='|' ) break;
-        if ( tmp[0]=='(' ) break;
-        if ( tmp[0]==')' ) break;
-        if ( tmp[0]=='+' ) break;
-        if ( tmp[0]=='*' && !square_brackets ) break;
-        if ( tmp[0]=='-' && !square_brackets ) break;
-        if ( tmp[0]=='/' ) break;
-        if ( tmp[0]=='~' ) break;
+        if ( !square_brackets )
+        {
+            if ( tmp[0]=='"' ) break;
+            if ( tmp[0]=='\'' ) break;
+            if ( isspace(tmp[0]) ) break;
+            if ( tmp[0]=='<' ) break;
+            if ( tmp[0]=='>' ) break;
+            if ( tmp[0]=='=' ) break;
+            if ( tmp[0]=='!' ) break;
+            if ( tmp[0]=='&' ) break;
+            if ( tmp[0]=='|' ) break;
+            if ( tmp[0]=='(' ) break;
+            if ( tmp[0]==')' ) break;
+            if ( tmp[0]=='+' ) break;
+            if ( tmp[0]=='*' ) break;
+            if ( tmp[0]=='-' ) break;
+            if ( tmp[0]=='/' ) break;
+            if ( tmp[0]=='~' ) break;
+        }
          if ( tmp[0]==']' ) { if (square_brackets) tmp++; break; }
          if ( tmp[0]=='[' ) square_brackets++; 
          tmp++;
@@ -252,6 +281,52 @@ static int filters_next_token(char **str, int *len)
      return TOK_VAL;
  }
  
+
+/* 
+    Simple path expansion, expands ~/, ~user, $var. The result must be freed by the caller.
+    
+    Based on jkb's staden code with some adjustments.
+    https://sourceforge.net/p/staden/code/HEAD/tree/staden/trunk/src/Misc/getfile.c#l123
+*/
+char *expand_path(char *path)
+{
+#ifdef _WIN32
+    return strdup(path);    // windows expansion: todo
+#endif
+
+    kstring_t str = {0,0,0};
+
+    if ( path[0] == '~' )
+    {
+        if ( !path[1] || path[1] == '/' )
+        {
+            // ~ or ~/path
+            kputs(getenv("HOME"), &str);
+            if ( path[1] ) kputs(path+1, &str);
+        }
+        else
+        {
+            // user name: ~pd3/path
+            char *end = path;
+            while ( *end && *end!='/' ) end++;
+            kputsn(path+1, end-path-1, &str);
+            struct passwd *pwentry = getpwnam(str.s);
+            str.l = 0;
+
+            if ( !pwentry ) kputsn(path, end-path, &str);
+            else kputs(pwentry->pw_dir, &str);
+            kputs(end, &str);
+        }
+        return str.s;
+    }
+    if ( path[0] == '$' )
+    {
+        char *var = getenv(path+1);
+        if ( var ) path = var;
+    }
+    return strdup(path);
+}
+
  static void filters_set_qual(filter_t *flt, bcf1_t *line, token_t *tok)
  {
      float *ptr = &line->qual;
@@ -858,7 +933,6 @@ static void filters_set_alt_string(filter_t *flt, bcf1_t *line, token_t *tok)
              kputs(line->d.allele[tok->idx + 1], &tok->str_value);
          else
              kputc('.', &tok->str_value);
-        tok->idx = 0;
      }
      else if ( tok->idx==-2 )
      {
@@ -919,6 +993,36 @@ static void filters_set_nmissing(filter_t *flt, bcf1_t *line, token_t *tok)
      tok->nvalues = 1;
      tok->values[0] = tok->tag[0]=='N' ? nmissing : (double)nmissing / line->n_sample;
  }
+static int func_npass(filter_t *flt, bcf1_t *line, token_t *rtok, token_t **stack, int nstack)
+{
+    if ( nstack==0 ) error("Error parsing the expression\n");
+    token_t *tok = stack[nstack - 1];
+    if ( !tok->nsamples ) error("The function %s works with FORMAT fields\n", rtok->tag);
+
+    rtok->nsamples = tok->nsamples;
+    memcpy(rtok->pass_samples, tok->pass_samples, rtok->nsamples*sizeof(*rtok->pass_samples));
+
+    assert(tok->usmpl);
+    if ( !rtok->usmpl )
+    {
+        rtok->usmpl = (uint8_t*) malloc(tok->nsamples*sizeof(*rtok->usmpl));
+        memcpy(rtok->usmpl, tok->usmpl, tok->nsamples*sizeof(*rtok->usmpl));
+    }
+
+    int i, npass = 0;
+    for (i=0; i<rtok->nsamples; i++)
+    {
+        if ( !rtok->usmpl[i] ) continue;
+        if ( rtok->pass_samples[i] ) npass++;
+    }
+
+    assert( rtok->values );
+    rtok->nvalues = 1;
+    rtok->values[0] = rtok->tag[0]=='N' ? npass : (line->n_sample ? 1.0*npass/line->n_sample : 0);
+    rtok->nsamples = 0;
+
+    return 1;
+}
  static void filters_set_nalt(filter_t *flt, bcf1_t *line, token_t *tok)
  {
      tok->nvalues = 1;
@@ -1120,17 +1224,171 @@ static int func_strlen(filter_t *flt, bcf1_t *line, token_t *rtok, token_t **sta
      }
      else
      {
-        rtok->values[0] = strlen(tok->str_value.s);
+        if ( !tok->str_value.s[1] && tok->str_value.s[0]=='.' )
+            rtok->values[0] = 0;
+        else
+            rtok->values[0] = strlen(tok->str_value.s);
          rtok->nvalues = 1;
      }
      return 1;
  }
+static inline double calc_binom(int na, int nb)
+{
+    if ( na==0 && nb==0 ) return -1;
+    if ( na==nb ) return 1;
+
+    // kfunc.h implements kf_betai, which is the regularized beta function  P(X<=k/N;p) = I_{1-p}(N-k,k+1)
+
+    double pval = na < nb ? kf_betai(nb, na + 1, 0.5) : kf_betai(na, nb + 1, 0.5);
+    pval *= 2;
+    assert( pval-1 < 1e-10 );
+    if ( pval>1 ) pval = 1;     // this can happen, machine precision error, eg. kf_betai(1,0,0.5)
+
+    return pval;
+}
+static int func_binom(filter_t *flt, bcf1_t *line, token_t *rtok, token_t **stack, int nstack)
+{
+    int i, istack = nstack - rtok->nargs;
+
+    if ( rtok->nargs!=2 && rtok->nargs!=1 ) error("Error: binom() takes one or two arguments\n");
+    assert( istack>=0 );
+
+    // The expected mean is 0.5. Should we support also prob!=0.5?
+    //
+    //  double prob = 0.5;
+    //  if ( nstack==3 )
+    //  {
+    //      // three parameters, the first one must be a scalar:  binom(0.25,AD[0],AD[1])
+    //      if ( !stack[istack]->is_constant )
+    //          error("The first argument of binom() must be a numeric constant if three parameters are given\n");
+    //      prob = stack[istack]->threshold;
+    //      istack++;
+    //  }
+    //  else if ( nstack==2 && stack[istack]->is_constant )
+    //  {
+    //      // two parameters, the first can be a scalar:  binom(0.25,AD) or binom(AD[0],AD[1])
+    //      prob = stack[istack]->threshold;
+    //      istack++;
+    //  }
+
+    token_t *tok = stack[istack];
+    if ( tok->nsamples )
+    {
+        // working with a FORMAT tag
+        rtok->nval1    = 1;
+        rtok->nvalues  = tok->nsamples;
+        rtok->nsamples = tok->nsamples;
+        hts_expand(double, rtok->nvalues, rtok->mvalues, rtok->values);
+        assert(tok->usmpl);
+        if ( !rtok->usmpl ) rtok->usmpl = (uint8_t*) malloc(tok->nsamples);
+        memcpy(rtok->usmpl, tok->usmpl, tok->nsamples);
+
+        if ( istack+1==nstack )
+        {
+            // determine the index from the GT field: binom(AD)
+            int ngt = bcf_get_genotypes(flt->hdr, line, &flt->tmpi, &flt->mtmpi);
+            int max_ploidy = ngt/line->n_sample;
+            if ( ngt <= 0 || max_ploidy < 2 ) // GT not present or not diploid, cannot set
+            {
+                for (i=0; i<rtok->nsamples; i++)
+                    if ( rtok->usmpl[i] ) bcf_double_set_missing(rtok->values[i]);
+                return rtok->nargs;
+            }
+            for (i=0; i<rtok->nsamples; i++)
+            {
+                if ( !rtok->usmpl[i] ) continue;
+                int32_t *ptr = flt->tmpi + i*max_ploidy;
+                if ( bcf_gt_is_missing(ptr[0]) || bcf_gt_is_missing(ptr[1]) || ptr[1]==bcf_int32_vector_end )
+                {
+                    bcf_double_set_missing(rtok->values[i]);
+                    continue;
+                }
+                int idx1 = bcf_gt_allele(ptr[0]);
+                int idx2 = bcf_gt_allele(ptr[1]);
+                if ( idx1>=line->n_allele ) error("Incorrect allele index at %s:%d, sample %s\n", bcf_seqname(flt->hdr,line),line->pos+1,flt->hdr->samples[i]);
+                if ( idx2>=line->n_allele ) error("Incorrect allele index at %s:%d, sample %s\n", bcf_seqname(flt->hdr,line),line->pos+1,flt->hdr->samples[i]);
+                double *vals = tok->values + tok->nval1*i;
+                if ( bcf_double_is_missing(vals[idx1]) || bcf_double_is_missing(vals[idx2]) )
+                {
+                    bcf_double_set_missing(rtok->values[i]);
+                    continue;
+                }
+                rtok->values[i] = calc_binom(vals[idx1],vals[idx2]);
+                if ( rtok->values[i] < 0 )
+                {
+                    bcf_double_set_missing(rtok->values[i]);
+                    continue;
+                }
+            }
+        }
+        else
+        {
+            // the fields given explicitly: binom(AD[:0],AD[:1])
+            token_t *tok2 = stack[istack+1];
+            if ( tok->nval1!=1 || tok2->nval1!=1 )
+                error("Expected one value per binom() argument, found %d and %d at %s:%d\n",tok->nval1,tok2->nval1, bcf_seqname(flt->hdr,line),line->pos+1);
+            for (i=0; i<rtok->nsamples; i++)
+            {
+                if ( !rtok->usmpl[i] ) continue;
+                double *ptr1 = tok->values + tok->nval1*i;
+                double *ptr2 = tok2->values + tok2->nval1*i;
+                if ( bcf_double_is_missing(ptr1[0]) || bcf_double_is_missing(ptr2[0]) )
+                {
+                    bcf_double_set_missing(rtok->values[i]);
+                    continue;
+                }
+                rtok->values[i] = calc_binom(ptr1[0],ptr2[0]);
+                if ( rtok->values[i] < 0 )
+                {
+                    bcf_double_set_missing(rtok->values[i]);
+                    continue;
+                }
+            }
+        }
+    }
+    else
+    {
+        // working with an INFO tag
+        rtok->nvalues  = 1;
+        hts_expand(double, rtok->nvalues, rtok->mvalues, rtok->values);
+
+        double *ptr1 = NULL, *ptr2 = NULL;
+        if ( istack+1==nstack )
+        {
+            // only one tag, expecting two values: binom(INFO/AD)
+            if ( tok->nvalues==2 )
+            {
+                ptr1 = &tok->values[0];
+                ptr2 = &tok->values[1];
+            }
+        }
+        else
+        {
+            // two tags, expecting one value in each: binom(INFO/AD[0],INFO/AD[1])
+            token_t *tok2 = stack[istack+1];
+            if ( tok->nvalues==1 && tok2->nvalues==1 )
+            {
+                ptr1 = &tok->values[0];
+                ptr2 = &tok2->values[0];
+            }
+        }
+        if ( !ptr1 || !ptr2 || bcf_double_is_missing(ptr1[0]) || bcf_double_is_missing(ptr2[0]) )
+            bcf_double_set_missing(rtok->values[0]);
+        else
+        {
+            rtok->values[0] = calc_binom(ptr1[0],ptr2[0]);
+            if ( rtok->values[0] < 0 )
+                bcf_double_set_missing(rtok->values[0]);
+        }
+    }
+    return rtok->nargs;
+}
  inline static void tok_init_values(token_t *atok, token_t *btok, token_t *rtok)
  {
      token_t *tok = atok->nvalues > btok->nvalues ? atok : btok;
      rtok->nvalues = tok->nvalues;
      rtok->nval1   = tok->nval1;
-    hts_expand(double*, rtok->nvalues, rtok->mvalues, rtok->values);
+    hts_expand(double, rtok->nvalues, rtok->mvalues, rtok->values);
  }
  inline static void tok_init_samples(token_t *atok, token_t *btok, token_t *rtok)
  {
@@ -1192,6 +1450,8 @@ inline static void tok_init_samples(token_t *atok, token_t *btok, token_t *rtok)
  
  static int vector_logic_or(filter_t *filter, bcf1_t *line, token_t *rtok, token_t **stack, int nstack)
  {
+    if ( nstack < 2 ) error("Error occurred while processing the filter \"%s\"\n", filter->str);
+
      token_t *atok = stack[nstack-2];
      token_t *btok = stack[nstack-1];
      tok_init_samples(atok, btok, rtok);
@@ -1252,6 +1512,8 @@ static int vector_logic_or(filter_t *filter, bcf1_t *line, token_t *rtok, token_
  }
  static int vector_logic_and(filter_t *filter, bcf1_t *line, token_t *rtok, token_t **stack, int nstack)
  {
+    if ( nstack < 2 ) error("Error occurred while processing the filter \"%s\". (nstack=%d)\n", filter->str,nstack);
+
      token_t *atok = stack[nstack-2];
      token_t *btok = stack[nstack-1];
      tok_init_samples(atok, btok, rtok);
@@ -1614,9 +1876,46 @@ static void parse_tag_idx(bcf_hdr_t *hdr, int is_fmt, char *tag, char *tag_idx,
      int *idxs2 = NULL, nidxs2 = 0, idx2 = 0;
  
      int set_samples = 0;
-    char *colon = index(tag_idx, ':');
-    if ( colon )
+    char *colon = rindex(tag_idx, ':');
+    if ( tag_idx[0]=='@' )     // file list with sample names
+    {
+        if ( !is_fmt ) error("Could not parse \"%s\". (Not a FORMAT tag yet a sample list provided.)\n", ori);
+        char *fname = expand_path(tag_idx+1);
+        int nsmpl;
+        char **list = hts_readlist(fname, 1, &nsmpl);
+        if ( !list && colon )
+        {
+            if ( parse_idxs(colon+1, &idxs2, &nidxs2, &idx2) != 0 ) error("Could not parse the index: %s\n", ori);
+            tok->idxs  = idxs2;
+            tok->nidxs = nidxs2;
+            tok->idx   = idx2;
+            colon = rindex(fname, ':');
+            *colon = 0;
+            list = hts_readlist(fname, 1, &nsmpl);
+        }
+        if ( !list ) error("Could not read: %s\n", fname);
+        free(fname);
+        tok->nsamples = bcf_hdr_nsamples(hdr);
+        tok->usmpl = (uint8_t*) calloc(tok->nsamples,1); 
+        for (i=0; i<nsmpl; i++)
+        {
+            int ismpl = bcf_hdr_id2int(hdr,BCF_DT_SAMPLE,list[i]);
+            if ( ismpl<0 ) error("No such sample in the VCF: \"%s\"\n", list[i]);
+            free(list[i]);
+            tok->usmpl[ismpl] = 1;
+        }
+        free(list);
+        if ( !colon )
+        {
+            tok->idxs = (int*) malloc(sizeof(int));
+            tok->idxs[0] = -1;
+            tok->nidxs   = 1;
+            tok->idx     = -2;
+        }
+    }
+    else if ( colon )
      {
+        if ( !is_fmt ) error("Could not parse the index \"%s\". (Not a FORMAT tag yet sample index implied.)\n", ori);
          *colon = 0;
          if ( parse_idxs(tag_idx, &idxs1, &nidxs1, &idx1) != 0 ) error("Could not parse the index: %s\n", ori);
          if ( parse_idxs(colon+1, &idxs2, &nidxs2, &idx2) != 0 ) error("Could not parse the index: %s\n", ori);
@@ -1684,6 +1983,18 @@ static void parse_tag_idx(bcf_hdr_t *hdr, int is_fmt, char *tag, char *tag_idx,
          for (i=0; i<tok->nidxs; i++) if ( tok->idxs[i] ) tok->nuidxs++;
      }
  }
+static int max_ac_an_unpack(bcf_hdr_t *hdr)
+{
+    int hdr_id = bcf_hdr_id2int(hdr,BCF_DT_ID,"AC");
+    if ( hdr_id<0 ) return BCF_UN_FMT;
+    if ( !bcf_hdr_idinfo_exists(hdr,BCF_HL_INFO,hdr_id) ) return BCF_UN_FMT;
+
+    hdr_id = bcf_hdr_id2int(hdr,BCF_DT_ID,"AN");
+    if ( hdr_id<0 ) return BCF_UN_FMT;
+    if ( !bcf_hdr_idinfo_exists(hdr,BCF_HL_INFO,hdr_id) ) return BCF_UN_FMT;
+
+    return BCF_UN_INFO;
+}
  static int filters_init1(filter_t *filter, char *str, int len, token_t *tok)
  {
      tok->tok_type  = TOK_VAL;
@@ -1713,13 +2024,11 @@ static int filters_init1(filter_t *filter, char *str, int len, token_t *tok)
          tok->tag = (char*) calloc(len+1,sizeof(char));
          memcpy(tok->tag,str,len);
          tok->tag[len] = 0;
-        wordexp_t wexp;
-        wordexp(tok->tag+1, &wexp, 0);
-        if ( !wexp.we_wordc ) error("No such file: %s\n", tok->tag+1);
+        char *fname = expand_path(tok->tag+1);
          int i, n;
-        char **list = hts_readlist(wexp.we_wordv[0], 1, &n);
-        if ( !list ) error("Could not read: %s\n", wexp.we_wordv[0]);
-        wordfree(&wexp);
+        char **list = hts_readlist(fname, 1, &n);
+        if ( !list ) error("Could not read: %s\n", fname);
+        free(fname);
          tok->hash = khash_str2int_init();
          for (i=0; i<n; i++)
          {
@@ -1831,8 +2140,9 @@ static int filters_init1(filter_t *filter, char *str, int len, token_t *tok)
      {
          if ( tok->hdr_id >=0 )
          {
-            if ( bcf_hdr_idinfo_exists(filter->hdr,BCF_HL_INFO,tok->hdr_id) ) is_fmt = 0;
-            else if ( bcf_hdr_idinfo_exists(filter->hdr,BCF_HL_FMT,tok->hdr_id) ) is_fmt = 1;
+            int is_info = bcf_hdr_idinfo_exists(filter->hdr,BCF_HL_INFO,tok->hdr_id) ? 1 : 0;
+            is_fmt = bcf_hdr_idinfo_exists(filter->hdr,BCF_HL_FMT,tok->hdr_id) ? 1 : 0;
+            if ( is_info && is_fmt ) error("Both INFO/%s and FORMAT/%s exist, which one do you want?\n", tmp.s,tmp.s);
          }
          if ( is_fmt==-1 ) is_fmt = 0;
      }
@@ -1918,6 +2228,7 @@ static int filters_init1(filter_t *filter, char *str, int len, token_t *tok)
      }
      else if ( !strcasecmp(tmp.s,"AN") )
      {
+        filter->max_unpack |= BCF_UN_FMT;
          tok->setter = &filters_set_an;
          tok->tag = strdup("AN");
          free(tmp.s);
@@ -1925,6 +2236,7 @@ static int filters_init1(filter_t *filter, char *str, int len, token_t *tok)
      }
      else if ( !strcasecmp(tmp.s,"AC") )
      {
+        filter->max_unpack |= BCF_UN_FMT;
          tok->setter = &filters_set_ac;
          tok->tag = strdup("AC");
          free(tmp.s);
@@ -1932,6 +2244,7 @@ static int filters_init1(filter_t *filter, char *str, int len, token_t *tok)
      }
      else if ( !strcasecmp(tmp.s,"MAC") )
      {
+        filter->max_unpack |= max_ac_an_unpack(filter->hdr);
          tok->setter = &filters_set_mac;
          tok->tag = strdup("MAC");
          free(tmp.s);
@@ -1939,6 +2252,7 @@ static int filters_init1(filter_t *filter, char *str, int len, token_t *tok)
      }
      else if ( !strcasecmp(tmp.s,"AF") )
      {
+        filter->max_unpack |= max_ac_an_unpack(filter->hdr);
          tok->setter = &filters_set_af;
          tok->tag = strdup("AF");
          free(tmp.s);
@@ -1946,6 +2260,7 @@ static int filters_init1(filter_t *filter, char *str, int len, token_t *tok)
      }
      else if ( !strcasecmp(tmp.s,"MAF") )
      {
+        filter->max_unpack |= max_ac_an_unpack(filter->hdr);
          tok->setter = &filters_set_maf;
          tok->tag = strdup("MAF");
          free(tmp.s);
@@ -1964,6 +2279,7 @@ static int filters_init1(filter_t *filter, char *str, int len, token_t *tok)
          tok->threshold = strtof(tmp.s, &end);   // float?
          if ( errno!=0 || end!=tmp.s+len ) error("[%s:%d %s] Error: the tag \"%s\" is not defined in the VCF header\n", __FILE__,__LINE__,__FUNCTION__,tmp.s);
      }
+    tok->is_constant = 1;
  
      if ( tmp.s ) free(tmp.s);
      return 0;
@@ -1996,6 +2312,126 @@ static void str_to_lower(char *str)
  {
      while ( *str ) { *str = tolower(*str); str++; }
  }
+static int perl_exec(filter_t *flt, bcf1_t *line, token_t *rtok, token_t **stack, int nstack)
+{
+#if ENABLE_PERL_FILTERS
+
+    PerlInterpreter *perl = flt->perl;
+    if ( !perl ) error("Error: perl expression without a perl script name\n");
+
+    dSP;
+    ENTER;
+    SAVETMPS;
+
+    PUSHMARK(SP);
+    int i,j, istack = nstack - rtok->nargs;
+    for (i=istack+1; i<nstack; i++)
+    {
+        token_t *tok = stack[i];
+        if ( tok->is_str )
+            XPUSHs(sv_2mortal(newSVpvn(tok->str_value.s,tok->str_value.l)));
+        else if ( tok->nvalues==1 )
+            XPUSHs(sv_2mortal(newSVnv(tok->values[0])));
+        else if ( tok->nvalues>1 )
+        {
+            AV *av = newAV();
+            for (j=0; j<tok->nvalues; j++) av_push(av, newSVnv(tok->values[j]));
+            SV *rv = newRV_inc((SV*)av);
+            XPUSHs(rv);
+        }
+        else
+        {
+            bcf_double_set_missing(tok->values[0]);
+            XPUSHs(sv_2mortal(newSVnv(tok->values[0])));
+        }
+    }
+    PUTBACK;
+
+    // A possible future todo: provide a means to select samples and indexes,
+    // expressions like this don't work yet
+    //          perl.filter(FMT/AD)[1:0]
+
+    int nret = call_pv(stack[istack]->str_value.s, G_ARRAY);
+
+    SPAGAIN;
+
+    rtok->nvalues = nret;
+    hts_expand(double, rtok->nvalues, rtok->mvalues, rtok->values);
+    for (i=nret; i>0; i--)
+    {
+        rtok->values[i-1] = (double) POPn;
+        if ( isnan(rtok->values[i-1]) ) bcf_double_set_missing(rtok->values[i-1]);
+    }
+
+    PUTBACK;
+    FREETMPS;
+    LEAVE;
+
+#else
+    error("\nPerl filtering requires running `configure --enable-perl-filters` at compile time.\n\n");
+#endif
+    return rtok->nargs;
+}
+static void perl_init(filter_t *filter, char **str)
+{
+    char *beg = *str;
+    while ( *beg && isspace(*beg) ) beg++;
+    if ( !*beg ) return;
+    if ( strncasecmp("perl:", beg, 5) ) return;
+#if ENABLE_PERL_FILTERS
+    beg += 5;
+    char *end = beg;
+    while ( *end && *end!=';' ) end++;  // for now not escaping semicolons
+    *str = end+1;
+
+    if ( ++filter_ninit == 1 )
+    {
+        // must be executed only once, even for multiple filters; first time here
+        int argc = 0;
+        char **argv = NULL;
+        char **env  = NULL;
+        PERL_SYS_INIT3(&argc, &argv, &env);
+    }
+    
+    filter->perl = perl_alloc();
+    PerlInterpreter *perl = filter->perl;
+
+    if ( !perl ) error("perl_alloc failed\n");
+    perl_construct(perl);
+
+    // name of the perl script to run
+    char *rmme = (char*) calloc(end - beg + 1,1);
+    memcpy(rmme, beg, end - beg);
+    char *argv[] = { "", "" };
+    argv[1] = expand_path(rmme);
+    free(rmme);
+
+    PL_origalen = 1;    // don't allow $0 change
+    int ret = perl_parse(filter->perl, NULL, 2, argv, NULL);
+    PL_exit_flags |= PERL_EXIT_DESTRUCT_END;
+    if ( ret ) error("Failed to parse: %s\n", argv[1]);
+    free(argv[1]);
+
+    perl_run(perl);
+#else
+    error("\nPerl filtering requires running `configure --enable-perl-filters` at compile time.\n\n");
+#endif
+}
+static void perl_destroy(filter_t *filter)
+{
+#if ENABLE_PERL_FILTERS
+    if ( !filter->perl ) return;
+
+    PerlInterpreter *perl = filter->perl;
+    perl_destruct(perl);
+    perl_free(perl);
+    if ( --filter_ninit <= 0  )
+    {
+        // similarly to PERL_SYS_INIT3, this must be executed only once? todo: test
+        PERL_SYS_TERM();
+    }
+#endif
+}
  
  
  // Parse filter expression and convert to reverse polish notation. Dijkstra's shunting-yard algorithm
@@ -2006,10 +2442,13 @@ filter_t *filter_init(bcf_hdr_t *hdr, const char *str)
      filter->hdr = hdr;
      filter->max_unpack |= BCF_UN_STR;
  
-    int nops = 0, mops = 0, *ops = NULL;    // operators stack
-    int nout = 0, mout = 0;                 // filter tokens, RPN
+    int nops = 0, mops = 0;    // operators stack
+    int nout = 0, mout = 0;    // filter tokens, RPN
      token_t *out = NULL;
+    token_t *ops = NULL;
      char *tmp = filter->str;
+    perl_init(filter, &tmp);
+
      int last_op = -1;
      while ( *tmp )
      {
@@ -2018,24 +2457,26 @@ filter_t *filter_init(bcf_hdr_t *hdr, const char *str)
          if ( ret==-1 ) error("Missing quotes in: %s\n", str);
  
          // fprintf(bcftools_stderr,"token=[%c] .. [%s] %d\n", TOKEN_STRING[ret], tmp, len);
-        // int i; for (i=0; i<nops; i++) fprintf(bcftools_stderr," .%c.", TOKEN_STRING[ops[i]]); fprintf(bcftools_stderr,"\n");
+        // int i; for (i=0; i<nops; i++) fprintf(bcftools_stderr," .%c", TOKEN_STRING[ops[i]]); fprintf(bcftools_stderr,"\n");
  
          if ( ret==TOK_LFT )         // left bracket
          {
              nops++;
-            hts_expand(int, nops, mops, ops);
-            ops[nops-1] = ret;
+            hts_expand0(token_t, nops, mops, ops);
+            ops[nops-1].tok_type = ret;
          }
          else if ( ret==TOK_RGT )    // right bracket
          {
-            while ( nops>0 && ops[nops-1]!=TOK_LFT )
+            while ( nops>0 && ops[nops-1].tok_type!=TOK_LFT )
              {
                  nout++;
                  hts_expand0(token_t, nout, mout, out);
-                out[nout-1].tok_type = ops[nops-1];
+                out[nout-1] = ops[nops-1];
+                memset(&ops[nops-1],0,sizeof(token_t));
                  nops--;
              }
              if ( nops<=0 ) error("Could not parse: %s\n", str);
+            memset(&ops[nops-1],0,sizeof(token_t));
              nops--;
          }
          else if ( ret!=TOK_VAL )    // one of the operators
@@ -2052,19 +2493,90 @@ filter_t *filter_init(bcf_hdr_t *hdr, const char *str)
                  tok->threshold = -1.0;
                  ret = TOK_MULT;
              }
+            else if ( ret == -TOK_FUNC )
+            {
+                // this is different from TOK_PERLSUB,TOK_BINOM in that the expression inside the
+                // brackets gets evaluated as normal expression
+                nops++;
+                hts_expand0(token_t, nops, mops, ops);
+                token_t *tok = &ops[nops-1];
+                tok->tok_type  = -ret;
+                tok->hdr_id    = -1;
+                tok->pass_site = -1;
+                tok->threshold = -1.0;
+                if ( !strncasecmp(tmp-len,"N_PASS",6) ) { tok->func = func_npass; tok->tag = strdup("N_PASS"); }
+                else if ( !strncasecmp(tmp-len,"F_PASS",6) ) { tok->func = func_npass; tok->tag = strdup("F_PASS"); }
+                else error("The function \"%s\" is not supported\n", tmp-len);
+                continue;
+            }
+            else if ( ret < 0 )     // variable number of arguments: TOK_PERLSUB,TOK_BINOM
+            {
+                ret = -ret;
+
+                tmp += len;
+                char *beg = tmp;
+                kstring_t rmme = {0,0,0};
+                int i, margs, nargs = 0;
+
+                if ( ret == TOK_PERLSUB )
+                {
+                    while ( *beg && ((isalnum(*beg) && !ispunct(*beg)) || *beg=='_') ) beg++;
+                    if ( *beg!='(' ) error("Could not parse the expression: %s\n", str);
+
+                    // the subroutine name
+                    kputc('"', &rmme);
+                    kputsn(tmp, beg-tmp, &rmme);
+                    kputc('"', &rmme);
+                    nout++;
+                    hts_expand0(token_t, nout, mout, out);
+                    filters_init1(filter, rmme.s, rmme.l, &out[nout-1]);
+                    nargs++;
+                }
+                char *end = beg;
+                while ( *end && *end!=')' ) end++;
+                if ( !*end ) error("Could not parse the expression: %s\n", str);
+
+                // subroutine arguments
+                rmme.l = 0;
+                kputsn(beg+1, end-beg-1, &rmme);
+                char **rmme_list = hts_readlist(rmme.s, 0, &margs);
+                for (i=0; i<margs; i++)
+                {
+                    nargs++;
+                    nout++;
+                    hts_expand0(token_t, nout, mout, out);
+                    filters_init1(filter, rmme_list[i], strlen(rmme_list[i]), &out[nout-1]);
+                    free(rmme_list[i]);
+                }
+                free(rmme_list);
+                free(rmme.s);
+
+                nout++;
+                hts_expand0(token_t, nout, mout, out);
+                token_t *tok = &out[nout-1];
+                tok->tok_type  = ret;
+                tok->nargs     = nargs;
+                tok->hdr_id    = -1;
+                tok->pass_site = -1;
+                tok->threshold = -1.0;
+
+                tmp = end + 1;
+                continue;
+            }
              else
              {
-                while ( nops>0 && op_prec[ret] < op_prec[ops[nops-1]] )
+                while ( nops>0 && op_prec[ret] < op_prec[ops[nops-1].tok_type] )
                  {
                      nout++;
                      hts_expand0(token_t, nout, mout, out);
-                    out[nout-1].tok_type = ops[nops-1];
+                    out[nout-1] = ops[nops-1];
+                    memset(&ops[nops-1],0,sizeof(token_t));
                      nops--;
                  }
              }
              nops++;
-            hts_expand(int, nops, mops, ops);
-            ops[nops-1] = ret;
+            hts_expand0(token_t, nops, mops, ops);
+            ops[nops-1].tok_type = ret;
          }
          else if ( !len )
          {
@@ -2075,17 +2587,21 @@ filter_t *filter_init(bcf_hdr_t *hdr, const char *str)
          {
              nout++;
              hts_expand0(token_t, nout, mout, out);
-            filters_init1(filter, tmp, len, &out[nout-1]);
+            if ( tmp[len-1]==',' )
+                filters_init1(filter, tmp, len-1, &out[nout-1]);
+            else
+                filters_init1(filter, tmp, len, &out[nout-1]);
              tmp += len;
          }
          last_op = ret;
      }
      while ( nops>0 )
      {
-        if ( ops[nops-1]==TOK_LFT || ops[nops-1]==TOK_RGT ) error("Could not parse the expression: [%s]\n", filter->str);
+        if ( ops[nops-1].tok_type==TOK_LFT || ops[nops-1].tok_type==TOK_RGT ) error("Could not parse the expression: [%s]\n", filter->str);
          nout++;
          hts_expand0(token_t, nout, mout, out);
-        out[nout-1].tok_type = ops[nops-1];
+        out[nout-1] = ops[nops-1];
+        memset(&ops[nops-1],0,sizeof(token_t));
          nops--;
      }
  
@@ -2098,6 +2614,9 @@ filter_t *filter_init(bcf_hdr_t *hdr, const char *str)
      int i;
      for (i=0; i<nout; i++)
      {
+        if ( i+1<nout && (out[i].tok_type==TOK_LT || out[i].tok_type==TOK_BT) && out[i+1].tok_type==TOK_EQ )
+            error("Error parsing the expression: \"%s\"\n", filter->str);
+
          if ( out[i].tok_type==TOK_OR || out[i].tok_type==TOK_OR_VEC )
              out[i].func = vector_logic_or;
          if ( out[i].tok_type==TOK_AND || out[i].tok_type==TOK_AND_VEC )
@@ -2215,13 +2734,15 @@ filter_t *filter_init(bcf_hdr_t *hdr, const char *str)
      filter->nsamples = filter->max_unpack&BCF_UN_FMT ? bcf_hdr_nsamples(filter->hdr) : 0;
      for (i=0; i<nout; i++)
      {
-        if ( out[i].tok_type==TOK_MAX )      { out[i].func = func_max; out[i].tok_type = TOK_FUNC; }
-        else if ( out[i].tok_type==TOK_MIN ) { out[i].func = func_min; out[i].tok_type = TOK_FUNC; }
-        else if ( out[i].tok_type==TOK_AVG ) { out[i].func = func_avg; out[i].tok_type = TOK_FUNC; }
-        else if ( out[i].tok_type==TOK_SUM ) { out[i].func = func_sum; out[i].tok_type = TOK_FUNC; }
-        else if ( out[i].tok_type==TOK_ABS ) { out[i].func = func_abs; out[i].tok_type = TOK_FUNC; }
-        else if ( out[i].tok_type==TOK_CNT ) { out[i].func = func_count; out[i].tok_type = TOK_FUNC; }
-        else if ( out[i].tok_type==TOK_LEN ) { out[i].func = func_strlen; out[i].tok_type = TOK_FUNC; }
+        if ( out[i].tok_type==TOK_MAX )      { out[i].func = func_max; out[i].tok_type = TOK_FUNC; out[i].tok_type = 1; }
+        else if ( out[i].tok_type==TOK_MIN ) { out[i].func = func_min; out[i].tok_type = TOK_FUNC; out[i].tok_type = 1; }
+        else if ( out[i].tok_type==TOK_AVG ) { out[i].func = func_avg; out[i].tok_type = TOK_FUNC; out[i].tok_type = 1; }
+        else if ( out[i].tok_type==TOK_SUM ) { out[i].func = func_sum; out[i].tok_type = TOK_FUNC; out[i].tok_type = 1; }
+        else if ( out[i].tok_type==TOK_ABS ) { out[i].func = func_abs; out[i].tok_type = TOK_FUNC; out[i].tok_type = 1; }
+        else if ( out[i].tok_type==TOK_CNT ) { out[i].func = func_count; out[i].tok_type = TOK_FUNC; out[i].tok_type = 1; }
+        else if ( out[i].tok_type==TOK_LEN ) { out[i].func = func_strlen; out[i].tok_type = TOK_FUNC; out[i].tok_type = 1; }
+        else if ( out[i].tok_type==TOK_BINOM ) { out[i].func = func_binom; out[i].tok_type = TOK_FUNC; }
+        else if ( out[i].tok_type==TOK_PERLSUB ) { out[i].func = perl_exec; out[i].tok_type = TOK_FUNC; }
          hts_expand0(double,1,out[i].mvalues,out[i].values);
          if ( filter->nsamples )
          {
@@ -2242,6 +2763,7 @@ filter_t *filter_init(bcf_hdr_t *hdr, const char *str)
  
  void filter_destroy(filter_t *filter)
  {
+    perl_destroy(filter);
      int i;
      for (i=0; i<filter->nfilters; i++)
      {
@@ -2276,7 +2798,6 @@ int filter_test(filter_t *filter, bcf1_t *line, const uint8_t **samples)
      for (i=0; i<filter->nfilters; i++)
      {
          filter->filters[i].pass_site = 0;
-
          if ( filter->filters[i].tok_type == TOK_VAL )
          {
              if ( filter->filters[i].setter )    // variable, query the VCF line
diff --git a/bcftools/htslib-1.9/LICENSE b/bcftools/htslib-1.9/LICENSE
new file mode 100644
index 0000000..86f782b
--- /dev/null
+++ b/bcftools/htslib-1.9/LICENSE
@@ -0,0 +1,69 @@
+[Files in this distribution outwith the cram/ subdirectory are distributed
+according to the terms of the following MIT/Expat license.]
+
+The MIT/Expat License
+
+Copyright (C) 2012-2018 Genome Research Ltd.
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE.
+
+
+[Files within the cram/ subdirectory in this distribution are distributed
+according to the terms of the following Modified 3-Clause BSD license.]
+
+The Modified-BSD License
+
+Copyright (C) 2012-2018 Genome Research Ltd.
+
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions are met:
+
+1. Redistributions of source code must retain the above copyright notice,
+   this list of conditions and the following disclaimer.
+
+2. Redistributions in binary form must reproduce the above copyright notice,
+   this list of conditions and the following disclaimer in the documentation
+   and/or other materials provided with the distribution.
+
+3. Neither the names Genome Research Ltd and Wellcome Trust Sanger Institute
+   nor the names of its contributors may be used to endorse or promote products
+   derived from this software without specific prior written permission.
+
+THIS SOFTWARE IS PROVIDED BY GENOME RESEARCH LTD AND CONTRIBUTORS "AS IS"
+AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+DISCLAIMED. IN NO EVENT SHALL GENOME RESEARCH LTD OR ITS CONTRIBUTORS BE LIABLE
+FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+
+[The use of a range of years within a copyright notice in this distribution
+should be interpreted as being equivalent to a list of years including the
+first and last year specified and all consecutive years between them.
+
+For example, a copyright notice that reads "Copyright (C) 2005, 2007-2009,
+2011-2012" should be interpreted as being identical to a notice that reads
+"Copyright (C) 2005, 2007, 2008, 2009, 2011, 2012" and a copyright notice
+that reads "Copyright (C) 2005-2012" should be interpreted as being identical
+to a notice that reads "Copyright (C) 2005, 2006, 2007, 2008, 2009, 2010,
+2011, 2012".]
diff --git a/bcftools/htslib-1.9/README b/bcftools/htslib-1.9/README
new file mode 100644
index 0000000..4225bec
--- /dev/null
+++ b/bcftools/htslib-1.9/README
@@ -0,0 +1,5 @@
+HTSlib is an implementation of a unified C library for accessing common file
+formats, such as SAM, CRAM, VCF, and BCF, used for high-throughput sequencing
+data.  It is the core library used by samtools and bcftools.
+
+See INSTALL for building and installation instructions.
diff --git a/bcftools/main.c b/bcftools/main.c
index 03fa6a7c1b4ab243ed4658b339b25f3e76ebb15b..17f0b4b7f92595b15ebc6591529a9ab3d299a260 100644
--- a/bcftools/main.c
+++ b/bcftools/main.c
@@ -1,6 +1,6 @@
  /*  main.c -- main bcftools command front-end.
  
-    Copyright (C) 2012-2016 Genome Research Ltd.
+    Copyright (C) 2012-2018 Genome Research Ltd.
  
      Author: Petr Danecek <pd3@sanger.ac.uk>
  
@@ -240,7 +240,7 @@ int main(int argc, char *argv[])
      if (argc < 2) { usage(stderr); return 1; }
  
      if (strcmp(argv[1], "version") == 0 || strcmp(argv[1], "--version") == 0 || strcmp(argv[1], "-v") == 0) {
-        printf("bcftools %s\nUsing htslib %s\nCopyright (C) 2016 Genome Research Ltd.\n", bcftools_version(), hts_version());
+        printf("bcftools %s\nUsing htslib %s\nCopyright (C) 2018 Genome Research Ltd.\n", bcftools_version(), hts_version());
  #if USE_GPL
          printf("License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>\n");
  #else
diff --git a/bcftools/main.c.pysam.c b/bcftools/main.c.pysam.c
index d6b51a04c71c142e102da45785a4257029d6940e..5081316593ab8bfdb0097ea33069e50a42357e97 100644
--- a/bcftools/main.c.pysam.c
+++ b/bcftools/main.c.pysam.c
@@ -2,7 +2,7 @@
  
  /*  main.c -- main bcftools command front-end.
  
-    Copyright (C) 2012-2016 Genome Research Ltd.
+    Copyright (C) 2012-2018 Genome Research Ltd.
  
      Author: Petr Danecek <pd3@sanger.ac.uk>
  
@@ -242,7 +242,7 @@ int bcftools_main(int argc, char *argv[])
      if (argc < 2) { usage(bcftools_stderr); return 1; }
  
      if (strcmp(argv[1], "version") == 0 || strcmp(argv[1], "--version") == 0 || strcmp(argv[1], "-v") == 0) {
-        fprintf(bcftools_stdout, "bcftools %s\nUsing htslib %s\nCopyright (C) 2016 Genome Research Ltd.\n", bcftools_version(), hts_version());
+        fprintf(bcftools_stdout, "bcftools %s\nUsing htslib %s\nCopyright (C) 2018 Genome Research Ltd.\n", bcftools_version(), hts_version());
  #if USE_GPL
          fprintf(bcftools_stdout, "License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>\n");
  #else
diff --git a/bcftools/mpileup.c b/bcftools/mpileup.c

index 9b6c6ebb64d13ba112cb26653cffec5a80189d81..29cbf65565d0ca2ff26dc8cf96685f576f3f02bb 100644 (file)
--- a/bcftools/mpileup.c
+++ b/bcftools/mpileup.c
@@ -1012,7 +1012,7 @@ int bam_mpileup(int argc, char *argv[])
                  case 'u': mplp.output_type = FT_BCF; break;
                  case 'z': mplp.output_type = FT_VCF_GZ; break;
                  case 'v': mplp.output_type = FT_VCF; break;
-                default: error("[error] The option \"-O\" changed meaning when mpileup moved to bcftools. Did you mean: \"bcftools mpileup --output-type\" or \"samtools mpileup --output-BP\"?\n", optarg); 
+                default: error("[error] The option \"-O\" changed meaning when mpileup moved to bcftools. Did you mean: \"bcftools mpileup --output-type\" or \"samtools mpileup --output-BP\"?\n"); 
              }
              break;
          case 'C': mplp.capQ_thres = atoi(optarg); break;
diff --git a/bcftools/mpileup.c.pysam.c b/bcftools/mpileup.c.pysam.c

index 47c684c19f8efef6013fcd7e6c7a536dd9b46e6b..20defaa1211e9c4ee96beba71b9969d32f025e0a 100644 (file)
--- a/bcftools/mpileup.c.pysam.c
+++ b/bcftools/mpileup.c.pysam.c
@@ -1014,7 +1014,7 @@ int bam_mpileup(int argc, char *argv[])
                  case 'u': mplp.output_type = FT_BCF; break;
                  case 'z': mplp.output_type = FT_VCF_GZ; break;
                  case 'v': mplp.output_type = FT_VCF; break;
-                default: error("[error] The option \"-O\" changed meaning when mpileup moved to bcftools. Did you mean: \"bcftools mpileup --output-type\" or \"samtools mpileup --output-BP\"?\n", optarg); 
+                default: error("[error] The option \"-O\" changed meaning when mpileup moved to bcftools. Did you mean: \"bcftools mpileup --output-type\" or \"samtools mpileup --output-BP\"?\n"); 
              }
              break;
          case 'C': mplp.capQ_thres = atoi(optarg); break;
diff --git a/bcftools/plugins/GTisec.c b/bcftools/plugins/GTisec.c

new file mode 100644 (file)

index 0000000..eaa0b15
--- /dev/null
+++ b/bcftools/plugins/GTisec.c
@@ -0,0 +1,470 @@
+/*  plugins/GTisec.c -- collect genotype intersection counts of all possible
+                   subsets of the present samples and output in banker's
+                   sequence order (in this sequence, the number of contained
+                   samples increases monotonically, a property that is e.g.
+                   useful for programmatically creating plotting files for the
+                   R package VennDiagram or the plotting tool circos from the
+                   counts, as in the command line tools bankers2VennDiagram and
+                   bankers2circos at https://github.com/dlaehnemann/bankers2)
+
+    Copyright (C) 2016 Computational Biology of Infection Research,
+                       Helmholtz Centre for Infection Research, Braunschweig,
+                       Germany
+
+    Author: David Laehnemann <david.laehnemann@helmholtz-hzi.de>
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE.  */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <getopt.h>
+#include <math.h>
+#include <unistd.h>
+#include <inttypes.h>
+
+#include <htslib/vcf.h>
+#include <htslib/synced_bcf_reader.h>
+#include <htslib/khash.h>
+KHASH_MAP_INIT_INT(gts2smps, uint32_t)
+
+#include "bcftools.h"
+
+/*!
+ * Flag definitions for args.flag
+ */
+#define MISSING        (1<<0)
+#define VERBOSE        (1<<1)
+#define SMPORDER       (1<<2)
+
+typedef struct _args_t
+{
+    bcf_srs_t *file;    /*! multi-sample VCF file */
+    bcf_hdr_t *hdr;     /*! VCF file header */
+    FILE *out;          /*! output file pointer */
+    int nsmp; /*! number of samples, can be determined from header but is needed in multiple contexts */
+    int nsmpp2; /*! 2^(nsmp) (is needed multiple times) */
+    int *gt_arr; /*! temporary array, to store GTs of current line/record */
+    int ngt_arr; /*! hold the number of current GT array entries */
+    uint32_t *bankers; /*! array to store banker's sequence for all possible sample subsets for
+                                programmatic indexing into smp_is for output printing, e.g. for three
+                                samples A, B and C this would be the following order:
+                                [   C,   B,   A,  CB,  CA,  BA, CBA ]
+                                [ 100, 010, 001, 110, 101, 011, 111 ]
+                                */
+    uint64_t *quick; /*! array to store n choose k lookup table of choose() function */
+    uint8_t flag; /*! several flags, for positions see above*/
+    uint64_t *missing_gts; /*! array to count missing genotypes of each sample */
+    uint64_t *smp_is; /*! array to track all possible intersections between
+                 samples, with each bit in the index integer belonging to one
+                 sample. E.g. for three samples A, B and C, count would be in
+                 the following order:
+                 [   A,   B,  AB,   C,  AC,  BC, ABC ]
+                 [ 001, 010, 011, 100, 101, 110, 111 ]
+                 */
+}
+args_t;
+
+static args_t args;
+uint32_t compute_bankers(unsigned long a);
+
+const char *about(void)
+{
+    return "Count genotype intersections across all possible sample subsets in a vcf file.\n";
+}
+
+
+const char *usage(void)
+{
+    return
+        "\n"
+        "About:   Count genotype intersections across all possible sample subsets in a vcf file.\n"
+        "Usage:   bcftools +GTisec <multisample.bcf/.vcf.gz> [General Options] -- [Plugin Options] \n"
+        "\n"
+        "Options:\n"
+        "   run \"bcftools plugin\" for a list of common options\n"
+        "\n"
+        "Plugin options:\n"
+        "   -m, --missing                   if set, include count of missing genotypes per sample in output\n"
+        "   -v, --verbose                   if set, annotate count rows with corresponding sample subset lists\n"
+        "   -H, --human-readable            if set, create human readable output; i.e. sort output by sample and\n"
+        "                                   print each subset's intersection count once for each sample contained\n"
+        "                                   in the subset; implies verbose output (-v)\n"
+        "\n"
+        "Example:\n"
+        "   bcftools +GTisec in.vcf -- -v # for verbose output\n"
+        "   bcftools +GTisec in.vcf -- -H # for human readable output\n"
+        "\n";
+}
+
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+    memset(&args,0,sizeof(args_t));
+    args.flag = 0;
+
+    static struct option loptions[] =
+    {
+        {"help",            no_argument,      0,'h'},
+        {"missing",         no_argument,      0,'m'},
+        {"verbose",         no_argument,      0,'v'},
+        {"human-readable",  no_argument,      0,'H'},
+        {0,0,0,0}
+    };
+
+    int c;
+    while ((c = getopt_long(argc, argv, "?mvHh",loptions,NULL)) >= 0)
+    {
+        switch (c)
+        {
+            case 'm': args.flag |= MISSING; break;
+            case 'v': args.flag |= VERBOSE; break;
+            case 'H': args.flag |= ( SMPORDER | VERBOSE ); break;
+            case 'h': error("%s", usage()); break;
+            case '?':
+            default: error("%s", usage()); break;
+        }
+    }
+    if ( optind != argc )  error("%s", usage());  // too many arguments given
+
+
+    args.hdr = in;
+
+    if ( !bcf_hdr_nsamples(args.hdr) )
+    {
+        error("No samples in input file.\n");
+    }
+
+    args.nsmp = bcf_hdr_nsamples(args.hdr);
+    if ( args.nsmp > 32 ) error("Too many samples. A maximum of 32 is supported.\n");
+    args.nsmpp2 = pow( 2, args.nsmp);
+    args.bankers = (uint32_t*) calloc( args.nsmpp2, sizeof(uint32_t) );
+    args.quick = (uint64_t*) calloc((args.nsmp * (args.nsmp + 1)) / 4, sizeof(uint64_t));
+    if ( args.flag & MISSING ) args.missing_gts = (uint64_t*) calloc( args.nsmp, sizeof(uint64_t));
+    args.smp_is = (uint64_t*) calloc( args.nsmpp2, sizeof(uint64_t));
+    if ( bcf_hdr_id2int(args.hdr, BCF_DT_ID, "GT")<0 ) error("[E::%s] GT not present in the header\n", __func__);
+
+    args.gt_arr = NULL;
+    args.ngt_arr = 0;
+
+    args.out = stdout;
+
+    /*! Header printing */
+    FILE *fp = args.out;
+    fprintf(fp, "# This file was produced by bcftools +GTisec (%s+htslib-%s)\n", bcftools_version(), hts_version());
+    fprintf(fp, "# The command line was:\tbcftools +GTisec %s ", argv[0]);
+    int i;
+    for (i=1; i < argc; i++)
+    {
+        fprintf(fp, " %s", argv[i]);
+    }
+    fprintf(fp,"\n");
+    fprintf(fp,"# This file can be used as input to the subset plotting tools at:\n"
+               "#   https://github.com/dlaehnemann/bankers2\n");
+    fprintf(fp,"# Genotype intersections across samples:\n");
+    fprintf(fp,"@SMPS");
+    for (i = args.nsmp-1; i >= 0; i--)
+    {
+        fprintf(fp," %s", bcf_hdr_int2id(args.hdr, BCF_DT_SAMPLE, i));
+    }
+    fprintf(fp,"\n");
+    if ( args.flag & MISSING )
+    {
+        if ( args.flag & SMPORDER )
+        {
+            fprintf(fp, "# The first line of each sample contains its count of missing genotypes, with a '-' appended\n"
+                        "#   to the sample name.\n");
+        }
+        else
+        {
+            fprintf(fp, "# The first %i lines contain the counts for missing values of each sample in the order provided\n"
+                        "#   in the SMPS-line above. Intersection counts only start afterwards.\n", args.nsmp);
+        }
+    }
+    if ( args.flag & SMPORDER )
+    {
+        fprintf(fp, "# Human readable output (-H) was requested. Subset intersection counts are therefore sorted by\n"
+                    "#   sample and repeated for each contained sample. For each sample, counts are in banker's \n"
+                    "#   sequence order regarding all other samples.\n");
+    }
+    else
+    {
+        fprintf(fp, "# Subset intersection counts are in global banker's sequence order.\n");
+        if ( args.nsmp > 2 )
+        {
+            fprintf(fp, "#   After exclusive sample counts in order of the SMPS-line, banker's sequence continues with:\n"
+                        "#   %s,%s   %s,%s   ...\n",
+                            bcf_hdr_int2id(in, BCF_DT_SAMPLE, args.nsmp-1 ),
+                            bcf_hdr_int2id(in, BCF_DT_SAMPLE, args.nsmp-2 ),
+                            bcf_hdr_int2id(in, BCF_DT_SAMPLE, args.nsmp-1 ),
+                            bcf_hdr_int2id(in, BCF_DT_SAMPLE, args.nsmp-3 )
+                            );
+        }
+    }
+    if (args.flag & VERBOSE )
+    {
+        fprintf(fp,"# [1] Number of shared non-ref genotypes \t[2] Samples sharing non-ref genotype (GT)\n");
+    }
+    else
+    {
+        fprintf(fp,"# [1] Number of shared non-ref genotypes\n");
+    }
+
+    /* Compute banker's sequence for following printing by sample and
+     * with increasing subset size.
+     */
+    uint32_t j;
+    for ( j = 0; j < args.nsmpp2; j++ )
+    {
+        args.bankers[j] = compute_bankers(j);
+    }
+
+    return 1;
+}
+
+
+/* ADAPTED CODE FROM CORIN LAWSON (START)
+ * https://github.com/au-phiware/bankers/blob/master/c/bankers.c
+ * who implemented ideas of Eric Burnett:
+ * http://www.thelowlyprogrammer.com/2010/04/indexing-and-enumerating-subsets-of.html
+ */
+
+/*
+ * Compute the binomial coefficient of `n choose k'.
+ * Use the fact that binom(n, k) = binom(n, n - k).
+ * Use a lookup table (triangle, actually) for speed.
+ * Otherwise it's dumb (heart) recursion.
+ * Added relative to Corin Lawson:
+ * * Passing in of sample number through pointer to args struct
+ * * Make quick lookup table external to keep it persistent with clean allocation
+ *   and freeing
+ */
+uint64_t choose(unsigned int n, unsigned int k) {
+    if (n == 0)
+        return 0;
+    if (n == k || k == 0)
+        return 1;
+    if (k > n / 2)
+        k = n - k;
+
+    unsigned int i = (n * (n - 1)) / 4 + k - 1;
+    if (args.quick[i] == 0)
+        args.quick[i] = choose(n - 1, k - 1) + choose(n - 1, k);
+
+    return args.quick[i];
+}
+
+/*
+ * Returns the Banker's number at the specified position, a.
+ * Derived from the recursive bit flip method.
+ * Added relative to Corin Lawson:
+ * * Uses same lookup table solution as choose function, just
+ *   maintained externally to persist across separate function calls.
+ * * Uses bitwise symmetry of banker's sequence to use bitwise inversion
+ *   instead of recursive bit flip for second half of sequence.
+ */
+uint32_t compute_bankers(unsigned long a)
+{
+    if (a == 0)
+        return 0;
+
+    if ( args.bankers[a] == 0 )
+    {
+        if ( a >= (args.nsmpp2 / 2) )
+            return args.bankers[a] = ( compute_bankers(args.nsmpp2 - (a+1)) ^ (args.nsmpp2 - 1) ); // use bitwise symmetry of bankers sequence
+        unsigned int c = 0;
+        uint32_t n = args.nsmp;
+        uint64_t e = a, binom;
+        binom = choose(n, c);
+        do {
+            e -= binom;
+        } while ((binom = choose(n, ++c)) <= e);
+
+        do {
+            if (e == 0 || (binom = choose(n - 1, c - 1)) > e)
+                c--, args.bankers[a] |= 1;
+            else
+                e -= binom;
+        } while (--n && c && ((args.bankers[a] <<= 1) || 1));
+        args.bankers[a] <<= n;
+    }
+
+    return args.bankers[a];
+}
+
+// ADAPTED CODE FROM CORIN LAWSON END
+
+
+/*
+ * GT field (genotype) comparison function.
+ */
+bcf1_t *process(bcf1_t *rec)
+{
+    uint64_t i;
+    bcf_unpack(rec, BCF_UN_FMT); // unpack the Format fields, including the GT field
+    int gte_smp = 0; // number of GT array entries per sample (should be 2, one entry per allele)
+    if ( (gte_smp = bcf_get_genotypes(args.hdr, rec, &(args.gt_arr), &(args.ngt_arr) ) ) <= 0 )
+    {
+        error("GT not present at %s: %d\n", args.hdr->id[BCF_DT_CTG][rec->rid].key, rec->pos+1);
+    }
+
+    gte_smp /= args.nsmp; // divide total number of genotypes array entries (= args.ngt_arr) by number of samples
+    int ret;
+
+    // stick all genotypes in a hash as keys and store up to 32 samples in a corresponding flag as its value
+    khiter_t bucket;
+    khash_t(gts2smps) *gts = kh_init(gts2smps); // create hash
+    for ( i = 0; i < args.nsmp; i++ )
+    {
+        int *gt_ptr = args.gt_arr + gte_smp * i;
+
+        if (bcf_gt_is_missing(gt_ptr[0]) || ( gte_smp == 2 && bcf_gt_is_missing(gt_ptr[1]) ) )
+        {
+            if ( args.flag & MISSING ) args.missing_gts[i]++; // count missing genotypes, if requested
+            continue; // don't do anything else for missing genotypes, their "sharing" gives no info...
+        }
+
+        int a = bcf_gt_allele(gt_ptr[0]);
+        int b;
+        if ( gte_smp == 2 ) // two entries available per sample, padded with missing values for haploid genotypes
+        {
+            b = bcf_gt_allele(gt_ptr[1]);
+        }
+        else if (gte_smp == 1 ) // use missing value for second entry in hash key generation below, if only one is available
+        {
+            b = bcf_gt_allele(bcf_int32_vector_end);
+        }
+        else
+        {
+            error("gtisec does not support ploidy higher than 2.\n");
+        }
+
+        int idx = bcf_alleles2gt(a,b); // generate genotype specific hash key
+
+        bucket = kh_get(gts2smps, gts, idx); // get the genotype's hash bucket
+
+        if ( bucket == kh_end(gts) ) { // means that key does not exist
+            bucket = kh_put(gts2smps, gts, idx, &ret); // create bucket with genotype index as key and return its iterator
+            kh_val(gts, bucket) = 0; // initialize the bucket with all sample bits unset
+        }
+        kh_value(gts, bucket) |= ((uint32_t)1<<i); // set the sample's bit to 1 in this genotype's bucket
+    }
+
+    // iterate over genotypes and for each genotype increment the appropriate smp_is entry
+    for ( bucket = kh_begin(gts); bucket != kh_end(gts); ++bucket ) // iterate over all genotypes at this position
+    {
+        if ( kh_exist(gts, bucket) ) // for existing genotype buckets
+        {
+            uint32_t s = kh_val(gts, bucket); // get the 32 bit flag
+            args.smp_is[s]++; // add to the corresponding subset
+        }
+    }
+    kh_destroy(gts2smps, gts); // destroy hash
+
+    return NULL;
+}
+
+void destroy(void)
+{
+    int32_t i;
+    int s;
+
+    FILE *fp = args.out;
+
+    /* Printing to File */
+    if ( args.flag & SMPORDER )
+    {
+        /* Iterate over samples, printing out all subsets including
+         * the current sample, with the current sample first. This
+         * includes multiple printouts of the same sample but makes
+         * output more readable and is also needed for circos files
+         * printing.
+         */
+        for ( s = args.nsmp-1; s >= 0; s--)
+        {
+            if ( args.flag & MISSING ) // if missing genotype counts are requested, print them to standard output
+            {
+                fprintf(fp, "%"PRIu64"\t%s-\n", args.missing_gts[s], bcf_hdr_int2id(args.hdr, BCF_DT_SAMPLE, s));
+            }
+            for ( i = 1; i < args.nsmpp2; i++ )
+            {
+                if ( (args.bankers[i]>>s) & 1 )
+                {
+                    fprintf(fp, "%"PRIu64"\t", args.smp_is[ args.bankers[i] ]); // print out count of genotypes shared by samples in current banker's sequence position
+                    int j;
+                    /* Print sample list */
+                    fprintf(fp, "%s", bcf_hdr_int2id(args.hdr, BCF_DT_SAMPLE, s)); // print current sample first
+                    for ( j = args.nsmp-1; j >= 0; j-- )
+                    {
+                        if ( (args.bankers[i] ^ (1u<<s)) & (1u<<j) ) // exclude current sample from printing again
+                        {
+                            fprintf(fp, ",%s", bcf_hdr_int2id(args.hdr, BCF_DT_SAMPLE, j ) ); // print out sample list, starting with our current major sample
+                        }
+                    }
+                    fprintf(fp, "\n" );
+                }
+            }
+        }
+    }
+    else if ( args.flag & VERBOSE )
+    {
+        if ( args.flag & MISSING ) // if missing genotype counts are requested, print them to standard output
+        {
+            for ( s = args.nsmp-1; s >= 0; s--)
+            {
+                fprintf(fp, "%"PRIu64"\t%s-\n", args.missing_gts[s], bcf_hdr_int2id(args.hdr, BCF_DT_SAMPLE, s));
+            }
+        }
+        for ( i = 1; i < args.nsmpp2; i++ )
+        {
+            fprintf(fp, "%"PRIu64"\t", args.smp_is[ args.bankers[i] ]); // print out count of genotypes shared by samples in current banker's sequence position
+            int j = 0;
+            for ( s = args.nsmp-1; s >= 0; s--)
+            {
+               if ( (args.bankers[i]>>s) & 1 )
+               {
+                   fprintf(fp, "%s%s", j ? "," : "", bcf_hdr_int2id(args.hdr, BCF_DT_SAMPLE, s) ); // samples in specified order
+                   j = 1;
+               }
+            }
+            fprintf(fp, "\n" );
+        }
+    }
+    else
+    {
+        if ( args.flag & MISSING ) // if missing genotype counts are requested, print them to standard output
+        {
+            for ( s = args.nsmp-1; s >= 0; s--)
+            {
+                fprintf(fp, "%"PRIu64"\n", args.missing_gts[s]);
+            }
+        }
+        for ( i = 1; i < args.nsmpp2; i++ )
+        {
+            fprintf(fp, "%"PRIu64"\n", args.smp_is[ args.bankers[i] ]); // print out count of genotypes shared by samples in current banker's sequence position
+        }
+    }
+    fclose(fp);
+
+    /* freeing up args */
+    free(args.gt_arr);
+    free(args.bankers);
+    free(args.quick);
+    if (args.flag & MISSING) free(args.missing_gts);
+    free(args.smp_is);
+}
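
Note on the output order used by GTisec.c above: the comments in args_t describe banker's sequence order as "subset size first", and for three samples give [ 100, 010, 001, 110, 101, 011, 111 ]. The sketch below reproduces that order by brute force, as an illustration only. It is not the plugin's method (the plugin uses the choose()/compute_bankers() lookup approach adapted from Corin Lawson), and the helper names popcount32 and bankers_order_bruteforce are hypothetical. It assumes that within one subset size the values are listed in descending numeric order, which matches the three-sample example in the source comment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count the set bits in v (portable, no compiler builtins). */
static unsigned popcount32(uint32_t v)
{
    unsigned c = 0;
    while (v) { v &= v - 1; c++; }   /* clear the lowest set bit each round */
    return c;
}

/* Hypothetical helper, for illustration only.
 * Fills out[0 .. 2^n - 1] with every n-bit subset mask, ordered by
 * subset size (number of set bits) and, within one size, by descending
 * numeric value. Costs O(n * 2^n), so it only suits small n. */
static void bankers_order_bruteforce(unsigned n, uint32_t *out)
{
    assert(n > 0 && n < 32);          /* keep the shift below well defined */
    uint32_t total = (uint32_t)1 << n;
    size_t pos = 0;
    for (unsigned k = 0; k <= n; k++)            /* subset size */
        for (uint32_t v = total; v-- > 0; )      /* total-1 down to 0 */
            if (popcount32(v) == k)
                out[pos++] = v;
}
```

Calling bankers_order_bruteforce(3, out) with an 8-element array gives the empty set at index 0 followed by the seven masks listed in the args_t comment. The result also shows the bitwise symmetry that compute_bankers() relies on: the entry at position a is the bitwise complement (within n bits) of the entry at position 2^n - 1 - a.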
diff --git a/bcftools/plugins/GTisec.c.pysam.c b/bcftools/plugins/GTisec.c.pysam.c

new file mode 100644 (file)

index 0000000..5f898ba
--- /dev/null
+++ b/bcftools/plugins/GTisec.c.pysam.c
@@ -0,0 +1,472 @@
+#include "bcftools.pysam.h"
+
+/*  plugins/GTisec.c -- collect genotype intersection counts of all possible
+                   subsets of the present samples and output in banker's
+                   sequence order (in this sequence, the number of contained
+                   samples increases monotonically, a property that is e.g.
+                   useful for programmatically creating plotting files for the
+                   R package VennDiagram or the plotting tool circos from the
+                   counts, as in the command line tools bankers2VennDiagram and
+                   bankers2circos at https://github.com/dlaehnemann/bankers2)
+
+    Copyright (C) 2016 Computational Biology of Infection Research,
+                       Helmholtz Centre for Infection Research, Braunschweig,
+                       Germany
+
+    Author: David Laehnemann <david.laehnemann@helmholtz-hzi.de>
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE.  */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <getopt.h>
+#include <math.h>
+#include <unistd.h>
+#include <inttypes.h>
+
+#include <htslib/vcf.h>
+#include <htslib/synced_bcf_reader.h>
+#include <htslib/khash.h>
+KHASH_MAP_INIT_INT(gts2smps, uint32_t)
+
+#include "bcftools.h"
+
+/*!
+ * Flag definitions for args.flag
+ */
+#define MISSING        (1<<0)
+#define VERBOSE        (1<<1)
+#define SMPORDER       (1<<2)
+
+typedef struct _args_t
+{
+    bcf_srs_t *file;    /*! multi-sample VCF file */
+    bcf_hdr_t *hdr;     /*! VCF file header */
+    FILE *out;          /*! output file pointer */
+    int nsmp; /*! number of samples, can be determined from header but is needed in multiple contexts */
+    int nsmpp2; /*! 2^(nsmp) (is needed multiple times) */
+    int *gt_arr; /*! temporary array, to store GTs of current line/record */
+    int ngt_arr; /*! hold the number of current GT array entries */
+    uint32_t *bankers; /*! array to store banker's sequence for all possible sample subsets for
+                                programmatic indexing into smp_is for output printing, e.g. for three
+                                samples A, B and C this would be the following order:
+                                [   C,   B,   A,  CB,  CA,  BA, CBA ]
+                                [ 100, 010, 001, 110, 101, 011, 111 ]
+                                */
+    uint64_t *quick; /*! array to store n choose k lookup table of choose() function */
+    uint8_t flag; /*! several flags, for positions see above*/
+    uint64_t *missing_gts; /*! array to count missing genotypes of each sample */
+    uint64_t *smp_is; /*! array to track all possible intersections between
+                 samples, with each bit in the index integer belonging to one
+                 sample. E.g. for three samples A, B and C, count would be in
+                 the following order:
+                 [   A,   B,  AB,   C,  AC,  BC, ABC ]
+                 [ 001, 010, 011, 100, 101, 110, 111 ]
+                 */
+}
+args_t;
+
+static args_t args;
+uint32_t compute_bankers(unsigned long a);
+
+const char *about(void)
+{
+    return "Count genotype intersections across all possible sample subsets in a vcf file.\n";
+}
+
+
+const char *usage(void)
+{
+    return
+        "\n"
+        "About:   Count genotype intersections across all possible sample subsets in a vcf file.\n"
+        "Usage:   bcftools +GTisec <multisample.bcf/.vcf.gz> [General Options] -- [Plugin Options] \n"
+        "\n"
+        "Options:\n"
+        "   run \"bcftools plugin\" for a list of common options\n"
+        "\n"
+        "Plugin options:\n"
+        "   -m, --missing                   if set, include count of missing genotypes per sample in output\n"
+        "   -v, --verbose                   if set, annotate count rows with corresponding sample subset lists\n"
+        "   -H, --human-readable            if set, create human readable output; i.e. sort output by sample and\n"
+        "                                   print each subset's intersection count once for each sample contained\n"
+        "                                   in the subset; implies verbose output (-v)\n"
+        "\n"
+        "Example:\n"
+        "   bcftools +GTisec in.vcf -- -v # for verbose output\n"
+        "   bcftools +GTisec in.vcf -- -H # for human readable output\n"
+        "\n";
+}
+
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+    memset(&args,0,sizeof(args_t));
+    args.flag = 0;
+
+    static struct option loptions[] =
+    {
+        {"help",            no_argument,      0,'h'},
+        {"missing",         no_argument,      0,'m'},
+        {"verbose",         no_argument,      0,'v'},
+        {"human-readable",  no_argument,      0,'H'},
+        {0,0,0,0}
+    };
+
+    int c;
+    while ((c = getopt_long(argc, argv, "?mvHh",loptions,NULL)) >= 0)
+    {
+        switch (c)
+        {
+            case 'm': args.flag |= MISSING; break;
+            case 'v': args.flag |= VERBOSE; break;
+            case 'H': args.flag |= ( SMPORDER | VERBOSE ); break;
+            case 'h': error("%s", usage()); break;
+            case '?':
+            default: error("%s", usage()); break;
+        }
+    }
+    if ( optind != argc )  error("%s", usage());  // too many arguments given
+
+
+    args.hdr = in;
+
+    if ( !bcf_hdr_nsamples(args.hdr) )
+    {
+        error("No samples in input file.\n");
+    }
+
+    args.nsmp = bcf_hdr_nsamples(args.hdr);
+    if ( args.nsmp > 32 ) error("Too many samples. A maximum of 32 is supported.\n");
+    args.nsmpp2 = pow( 2, args.nsmp);
+    args.bankers = (uint32_t*) calloc( args.nsmpp2, sizeof(uint32_t) );
+    args.quick = (uint64_t*) calloc((args.nsmp * (args.nsmp + 1)) / 4, sizeof(uint64_t));
+    if ( args.flag & MISSING ) args.missing_gts = (uint64_t*) calloc( args.nsmp, sizeof(uint64_t));
+    args.smp_is = (uint64_t*) calloc( args.nsmpp2, sizeof(uint64_t));
+    if ( bcf_hdr_id2int(args.hdr, BCF_DT_ID, "GT")<0 ) error("[E::%s] GT not present in the header\n", __func__);
+
+    args.gt_arr = NULL;
+    args.ngt_arr = 0;
+
+    args.out = bcftools_stdout;
+
+    /*! Header printing */
+    FILE *fp = args.out;
+    fprintf(fp, "# This file was produced by bcftools +GTisec (%s+htslib-%s)\n", bcftools_version(), hts_version());
+    fprintf(fp, "# The command line was:\tbcftools +GTisec %s ", argv[0]);
+    int i;
+    for (i=1; i < argc; i++)
+    {
+        fprintf(fp, " %s", argv[i]);
+    }
+    fprintf(fp,"\n");
+    fprintf(fp,"# This file can be used as input to the subset plotting tools at:\n"
+               "#   https://github.com/dlaehnemann/bankers2\n");
+    fprintf(fp,"# Genotype intersections across samples:\n");
+    fprintf(fp,"@SMPS");
+    for (i = args.nsmp-1; i >= 0; i--)
+    {
+        fprintf(fp," %s", bcf_hdr_int2id(args.hdr, BCF_DT_SAMPLE, i));
+    }
+    fprintf(fp,"\n");
+    if ( args.flag & MISSING )
+    {
+        if ( args.flag & SMPORDER )
+        {
+            fprintf(fp, "# The first line of each sample contains its count of missing genotypes, with a '-' appended\n"
+                        "#   to the sample name.\n");
+        }
+        else
+        {
+            fprintf(fp, "# The first %i lines contain the counts for missing values of each sample in the order provided\n"
+                        "#   in the SMPS-line above. Intersection counts only start afterwards.\n", args.nsmp);
+        }
+    }
+    if ( args.flag & SMPORDER )
+    {
+        fprintf(fp, "# Human readable output (-H) was requested. Subset intersection counts are therefore sorted by\n"
+                    "#   sample and repeated for each contained sample. For each sample, counts are in banker's \n"
+                    "#   sequence order regarding all other samples.\n");
+    }
+    else
+    {
+        fprintf(fp, "# Subset intersection counts are in global banker's sequence order.\n");
+        if ( args.nsmp > 2 )
+        {
+            fprintf(fp, "#   After exclusive sample counts in order of the SMPS-line, banker's sequence continues with:\n"
+                        "#   %s,%s   %s,%s   ...\n",
+                            bcf_hdr_int2id(in, BCF_DT_SAMPLE, args.nsmp-1 ),
+                            bcf_hdr_int2id(in, BCF_DT_SAMPLE, args.nsmp-2 ),
+                            bcf_hdr_int2id(in, BCF_DT_SAMPLE, args.nsmp-1 ),
+                            bcf_hdr_int2id(in, BCF_DT_SAMPLE, args.nsmp-3 )
+                            );
+        }
+    }
+    if (args.flag & VERBOSE )
+    {
+        fprintf(fp,"# [1] Number of shared non-ref genotypes \t[2] Samples sharing non-ref genotype (GT)\n");
+    }
+    else
+    {
+        fprintf(fp,"# [1] Number of shared non-ref genotypes\n");
+    }
+
+    /* Compute banker's sequence for following printing by sample and
+     * with increasing subset size.
+     */
+    uint32_t j;
+    for ( j = 0; j < args.nsmpp2; j++ )
+    {
+        args.bankers[j] = compute_bankers(j);
+    }
+
+    return 1;
+}
+
+
+/* ADAPTED CODE FROM CORIN LAWSON (START)
+ * https://github.com/au-phiware/bankers/blob/master/c/bankers.c
+ * who implemented ideas of Eric Burnett:
+ * http://www.thelowlyprogrammer.com/2010/04/indexing-and-enumerating-subsets-of.html
+ */
+
+/*
+ * Compute the binomial coefficient of `n choose k'.
+ * Use the fact that binom(n, k) = binom(n, n - k).
+ * Use a lookup table (triangle, actually) for speed.
+ * Otherwise it's dumb (heart) recursion.
+ * Added relative to Corin Lawson:
+ * * Passing in of sample number through pointer to args struct
+ * * Make quick lookup table external to keep it persistent with clean allocation
+ *   and freeing
+ */
+uint64_t choose(unsigned int n, unsigned int k) {
+    if (n == 0)
+        return 0;
+    if (n == k || k == 0)
+        return 1;
+    if (k > n / 2)
+        k = n - k;
+
+    unsigned int i = (n * (n - 1)) / 4 + k - 1;
+    if (args.quick[i] == 0)
+        args.quick[i] = choose(n - 1, k - 1) + choose(n - 1, k);
+
+    return args.quick[i];
+}
+
+/*
+ * Returns the Banker's number at the specified position, a.
+ * Derived from the recursive bit flip method.
+ * Added relative to Corin Lawson:
+ * * Uses same lookup table solution as choose function, just
+ *   maintained externally to persist across separate function calls.
+ * * Uses bitwise symmetry of banker's sequence to use bitwise inversion
+ *   instead of recursive bit flip for second half of sequence.
+ */
+uint32_t compute_bankers(unsigned long a)
+{
+    if (a == 0)
+        return 0;
+
+    if ( args.bankers[a] == 0 )
+    {
+        if ( a >= (args.nsmpp2 / 2) )
+            return args.bankers[a] = ( compute_bankers(args.nsmpp2 - (a+1)) ^ (args.nsmpp2 - 1) ); // use bitwise symmetry of bankers sequence
+        unsigned int c = 0;
+        uint32_t n = args.nsmp;
+        uint64_t e = a, binom;
+        binom = choose(n, c);
+        do {
+            e -= binom;
+        } while ((binom = choose(n, ++c)) <= e);
+
+        do {
+            if (e == 0 || (binom = choose(n - 1, c - 1)) > e)
+                c--, args.bankers[a] |= 1;
+            else
+                e -= binom;
+        } while (--n && c && ((args.bankers[a] <<= 1) || 1));
+        args.bankers[a] <<= n;
+    }
+
+    return args.bankers[a];
+}
+
+// ADAPTED CODE FROM CORIN LAWSON END
+
+
+/*
+ * GT field (genotype) comparison function.
+ */
+bcf1_t *process(bcf1_t *rec)
+{
+    uint64_t i;
+    bcf_unpack(rec, BCF_UN_FMT); // unpack the Format fields, including the GT field
+    int gte_smp = 0; // number of GT array entries per sample (should be 2, one entry per allele)
+    if ( (gte_smp = bcf_get_genotypes(args.hdr, rec, &(args.gt_arr), &(args.ngt_arr) ) ) <= 0 )
+    {
+        error("GT not present at %s: %d\n", args.hdr->id[BCF_DT_CTG][rec->rid].key, rec->pos+1);
+    }
+
+    gte_smp /= args.nsmp; // divide total number of genotypes array entries (= args.ngt_arr) by number of samples
+    int ret;
+
+    // stick all genotypes in a hash as keys and store up to 32 samples in a corresponding flag as its value
+    khiter_t bucket;
+    khash_t(gts2smps) *gts = kh_init(gts2smps); // create hash
+    for ( i = 0; i < args.nsmp; i++ )
+    {
+        int *gt_ptr = args.gt_arr + gte_smp * i;
+
+        if (bcf_gt_is_missing(gt_ptr[0]) || ( gte_smp == 2 && bcf_gt_is_missing(gt_ptr[1]) ) )
+        {
+            if ( args.flag & MISSING ) args.missing_gts[i]++; // count missing genotypes, if requested
+            continue; // don't do anything else for missing genotypes, their "sharing" gives no info...
+        }
+
+        int a = bcf_gt_allele(gt_ptr[0]);
+        int b;
+        if ( gte_smp == 2 ) // two entries available per sample, padded with missing values for haploid genotypes
+        {
+            b = bcf_gt_allele(gt_ptr[1]);
+        }
+        else if (gte_smp == 1 ) // use missing value for second entry in hash key generation below, if only one is available
+        {
+            b = bcf_gt_allele(bcf_int32_vector_end);
+        }
+        else
+        {
+            error("gtisec does not support ploidy higher than 2.\n");
+        }
+
+        int idx = bcf_alleles2gt(a,b); // generate genotype specific hash key
+
+        bucket = kh_get(gts2smps, gts, idx); // get the genotype's hash bucket
+
+        if ( bucket == kh_end(gts) ) { // means that key does not exist
+            bucket = kh_put(gts2smps, gts, idx, &ret); // create bucket with genotype index as key and return its iterator
+            kh_val(gts, bucket) = 0; // initialize the bucket with all sample bits unset
+        }
+        kh_value(gts, bucket) |= (1<<i); // set the sample's bit to 1 in this genotype's bucket
+    }
+
+    // iterate over genotypes and for each genotype increment the appropriate smp_is entry
+    for ( bucket = kh_begin(gts); bucket != kh_end(gts); ++bucket ) // iterate over all genotypes at this position
+    {
+        if ( kh_exist(gts, bucket) ) // for existing genotype buckets
+        {
+            uint32_t s = kh_val(gts, bucket); // get the 32 bit flag
+            args.smp_is[s]++; // add to the corresponding subset
+        }
+    }
+    kh_destroy(gts2smps, gts); // destroy hash
+
+    return NULL;
+}
+
+void destroy(void)
+{
+    int32_t i;
+    int s;
+
+    FILE *fp = args.out;
+
+    /* Printing to File */
+    if ( args.flag & SMPORDER )
+    {
+        /* Iterate over samples, printing out all subsets including
+         * the current sample, with the current sample first. This
+         * includes multiple printouts of the same sample but makes
+         * output more readable and is also needed for circos files
+         * printing.
+         */
+        for ( s = args.nsmp-1; s >= 0; s--)
+        {
+            if ( args.flag & MISSING ) // if missing genotype counts are requested, print them to the output stream
+            {
+                fprintf(fp, "%"PRIu64"\t%s-\n", args.missing_gts[s], bcf_hdr_int2id(args.hdr, BCF_DT_SAMPLE, s));
+            }
+            for ( i = 1; i < args.nsmpp2; i++ )
+            {
+                if ( (args.bankers[i]>>s) & 1 )
+                {
+                    fprintf(fp, "%"PRIu64"\t", args.smp_is[ args.bankers[i] ]); // print out count of genotypes shared by samples in current banker's sequence position
+                    int j;
+                    /* Print sample list */
+                    fprintf(fp, "%s", bcf_hdr_int2id(args.hdr, BCF_DT_SAMPLE, s)); // print current sample first
+                    for ( j = args.nsmp-1; j >= 0; j-- )
+                    {
+                        if ( (args.bankers[i] ^ (1<<s)) & (1<<j) ) // exclude current sample from printing again
+                        {
+                            fprintf(fp, ",%s", bcf_hdr_int2id(args.hdr, BCF_DT_SAMPLE, j ) ); // print out sample list, starting with our current major sample
+                        }
+                    }
+                    fprintf(fp, "\n" );
+                }
+            }
+        }
+    }
+    else if ( args.flag & VERBOSE )
+    {
+        if ( args.flag & MISSING ) // if missing genotype counts are requested, print them to the output stream
+        {
+            for ( s = args.nsmp-1; s >= 0; s--)
+            {
+                fprintf(fp, "%"PRIu64"\t%s-\n", args.missing_gts[s], bcf_hdr_int2id(args.hdr, BCF_DT_SAMPLE, s));
+            }
+        }
+        for ( i = 1; i < args.nsmpp2; i++ )
+        {
+            fprintf(fp, "%"PRIu64"\t", args.smp_is[ args.bankers[i] ]); // print out count of genotypes shared by samples in current banker's sequence position
+            int j = 0;
+            for ( s = args.nsmp-1; s >= 0; s--)
+            {
+               if ( (args.bankers[i]>>s) & 1 )
+               {
+                   fprintf(fp, "%s%s", j ? "," : "", bcf_hdr_int2id(args.hdr, BCF_DT_SAMPLE, s) ); // samples in specified order
+                   j = 1;
+               }
+            }
+            fprintf(fp, "\n" );
+        }
+    }
+    else
+    {
+        if ( args.flag & MISSING ) // if missing genotype counts are requested, print them to the output stream
+        {
+            for ( s = args.nsmp-1; s >= 0; s--)
+            {
+                fprintf(fp, "%"PRIu64"\n", args.missing_gts[s]);
+            }
+        }
+        for ( i = 1; i < args.nsmpp2; i++ )
+        {
+            fprintf(fp, "%"PRIu64"\n", args.smp_is[ args.bankers[i] ]); // print out count of genotypes shared by samples in current banker's sequence position
+        }
+    }
+    fclose(fp);
+
+    /* freeing up args */
+    free(args.gt_arr);
+    free(args.bankers);
+    free(args.quick);
+    if (args.flag & MISSING) free(args.missing_gts);
+    free(args.smp_is);
+}
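The subset ordering that the GTisec code above relies on can be illustrated in isolation. What follows is a minimal sketch under stated assumptions, not the plugin's implementation: `binom` and `subsets_by_size` are hypothetical names, the binomial coefficient is computed iteratively rather than through the plugin's lookup triangle, and Gosper's hack stands in for the recursive bit-flip method. The result is still a sequence of sample bitmasks ordered by increasing subset size, but the order within one size class may differ from the plugin's banker's sequence.

```c
#include <stddef.h>
#include <stdint.h>

/* Binomial coefficient using the symmetry binom(n,k) == binom(n,n-k).
 * After step i the accumulator holds binom(n-k+i, i), so every division
 * is exact. Illustrative only; no lookup table is kept. */
static uint64_t binom(unsigned n, unsigned k)
{
    if (k > n) return 0;
    if (k > n - k) k = n - k;
    uint64_t r = 1;
    for (unsigned i = 1; i <= k; i++)
        r = r * (n - k + i) / i;
    return r;
}

/* Write every n-bit mask into out[], ordered by number of set bits
 * (empty set first, full set last). Assumes 1 <= n <= 31 and that
 * out has room for 2^n entries. Returns the number of masks written. */
static size_t subsets_by_size(unsigned n, uint32_t *out)
{
    size_t pos = 0;
    uint64_t limit = (uint64_t)1 << n;

    out[pos++] = 0;                          /* the empty subset */
    for (unsigned k = 1; k <= n; k++) {
        uint64_t v = ((uint64_t)1 << k) - 1; /* smallest mask with k bits set */
        while (v < limit) {
            out[pos++] = (uint32_t)v;
            /* Gosper's hack: next larger integer with the same popcount */
            uint64_t c = v & (~v + 1);       /* lowest set bit */
            uint64_t r = v + c;
            v = (((r ^ v) >> 2) / c) | r;
        }
    }
    return pos;
}
```

For n samples, the masks of size k occupy a run of `binom(n, k)` consecutive entries, which is the property that lets the output be grouped by subset size.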
diff --git a/bcftools/plugins/GTsubset.c b/bcftools/plugins/GTsubset.c
new file mode 100644
index 0000000..69e7dbe
--- /dev/null
+++ b/bcftools/plugins/GTsubset.c
@@ -0,0 +1,266 @@
+/*  plugins/GTsubset.c -- output only positions where the selected samples exclusively
+                  share a genotype, i.e. all selected samples must have the same
+                  genotype (including both alleles) and none of the unselected
+                  samples can have the same genotype
+
+    Copyright (C) 2016 Computational Biology of Infection Research,
+                       Helmholtz Centre for Infection Research, Braunschweig,
+                       Germany
+
+    Author: David Laehnemann <david.laehnemann@hhu.de>
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE.  */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <getopt.h>
+#include <math.h>
+#include <unistd.h>
+#include <inttypes.h>
+
+#include <htslib/vcf.h>
+#include <htslib/synced_bcf_reader.h>
+
+#include "bcftools.h"
+
+typedef struct _args_t
+{
+    bcf_hdr_t *hdr;     /*! VCF file header */
+    int *gt_arr;        /*! temporary array, to store GTs of current line/record */
+    int ngt_arr;        /*! hold the number of current GT array entries */
+    int nsmp;           /*! number of samples, can be determined from header but is needed in multiple contexts */
+    int n_sel_smps;     /*! number of selected samples who should exclusively share genotypes */
+    int *selected_smps; /*! pointer to start of array containing 1 at indices corresponding to selected samples in header dict and 0 at others*/
+}
+args_t;
+
+static args_t args;
+
+const char *about(void)
+{
+    return "Output only sites where the requested samples all exclusively share a genotype (GT).\n";
+}
+
+
+const char *usage(void)
+{
+        return
+        "\n"
+        "About:   Output only sites where the requested samples all exclusively share a genotype (GT), i.e.\n"
+        "         all selected samples must have the same GT, while none of the others can have it.\n"
+        "Usage:   bcftools +GTsubset <multisample.bcf/.vcf.gz> [General Options] -- [Plugin Options] \n"
+        "\n"
+        "Options:\n"
+        "   run \"bcftools plugin\" for a list of common options\n"
+        "\n"
+        "Plugin options:\n"
+        "  -s,--sample-list     comma-separated list of samples; only those sites where all of these\n"
+        "                       samples exclusively share their genotype are given as output\n"
+        "\n"
+        "Example:\n"
+        "   bcftools +GTsubset in.vcf -- -s SMP1,SMP2 \n"
+        "\n";
+}
+
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+    memset(&args,0,sizeof(args_t));
+
+    int i;
+
+    static struct option loptions[] =
+    {
+        {"help",            no_argument,       0,'h'},
+        {"sample-list",     required_argument, 0,'s'},
+        {0,0,0,0}
+    };
+
+    char **smps_strs = NULL;
+
+    int c;
+    while ((c = getopt_long(argc, argv, "?s:h",loptions,NULL)) >= 0)
+    {
+        switch (c)
+        {
+            case 's': smps_strs = hts_readlist(optarg,0,&(args.n_sel_smps));
+                      if ( args.n_sel_smps == 0 )
+                      {
+                          fprintf(stderr, "Sample specification not valid.\n");
+                          error("%s", usage());
+                      }
+                      break;
+            case 'h': usage(); break;
+            case '?':
+            default: error("%s", usage()); break;
+        }
+    }
+    if ( optind != argc )  usage();  // too many files given
+
+    args.hdr = bcf_hdr_dup(in);
+
+    // Samples parsing from header and input option
+    if ( !bcf_hdr_nsamples(args.hdr) )
+    {
+        error("No samples in input file.\n");
+    }
+    args.nsmp = bcf_hdr_nsamples(args.hdr);
+    args.selected_smps = (int*) calloc(args.nsmp,sizeof(int));
+    for ( i = 0; i < args.n_sel_smps; i++ )
+    {
+        int ind = bcf_hdr_id2int(args.hdr, BCF_DT_SAMPLE, smps_strs[i]);
+        if ( ind == -1 )
+        {
+            error("Sample '%s' not in input vcf file.\n", smps_strs[i]);
+        } else {
+            args.selected_smps[ind] = 1;
+        }
+        free(smps_strs[i]);
+    }
+    free(smps_strs);
+
+    /*
+    fprintf(stderr, "Selected samples array:[");
+    for (i=0;i<args.nsmp;i++)
+    {
+        fprintf(stderr, " %i", args.selected_smps[i]);
+    }
+    fprintf(stderr, " ]\n");
+    */
+
+    if ( bcf_hdr_id2int(args.hdr, BCF_DT_ID, "GT")<0 ) error("[E::%s] GT not present in the header\n", __func__);
+
+    args.gt_arr = NULL;
+
+    return 0;
+}
+
+
+/*
+ * GT field (genotype) comparison function.
+ */
+bcf1_t *process(bcf1_t *rec)
+{
+    uint64_t i;
+    bcf_unpack(rec, BCF_UN_FMT); // unpack the Format fields, including the GT field
+    int gte_smp = 0; // number of GT array entries per sample (should be 2, one entry per allele)
+    args.ngt_arr = 0;        /*! hold the number of current GT array entries */
+    if ( (gte_smp = bcf_get_genotypes(args.hdr, rec, &(args.gt_arr), &(args.ngt_arr) ) ) <= 0 )
+    {
+        error("GT not present at %s: %d\n", args.hdr->id[BCF_DT_CTG][rec->rid].key, rec->pos+1);
+    }
+
+    gte_smp /= args.nsmp; // divide total number of genotypes array entries (= args.ngt_arr) by number of samples
+
+    // initialize with missing genotype
+    int a1 = 0;
+    int a2 = 0;
+
+    // initialize with first selected sample genotype that is not missing
+    int gt = -1;
+    while ( (a1 == 0) || (a2 == 0) )
+    {
+       gt++;
+       if (gt == args.nsmp) break;
+       if (args.selected_smps[gt] == 0) continue;
+       a1 = (args.gt_arr + gte_smp * gt)[0];
+       if ( gte_smp == 2 ) a2 = (args.gt_arr + gte_smp * gt)[1];
+       else if ( gte_smp == 1 ) a2 = bcf_int32_vector_end;
+       else error("GTsubset does not support ploidy higher than 2.\n");
+    }
+//    fprintf(stderr, "a1: %i  a2: %i\n", a1, a2);
+
+    // check all genotypes if they match (for included samples) or disagree (for samples not included)
+    gt = 0;
+    for ( i = 0; i < args.nsmp; i++ )
+    {
+        int *gt_ptr = args.gt_arr + gte_smp * i;
+
+        int b1 = gt_ptr[0];
+        int b2;
+        if ( gte_smp == 2 ) // two entries available per sample, padded with missing values for haploid genotypes
+        {
+            b2 = gt_ptr[1];
+        }
+        else if (gte_smp == 1 ) // use vector end value for second entry, if only one is available
+        {
+            b2 = bcf_int32_vector_end;
+        }
+        else
+        {
+            error("GTsubset does not support ploidy higher than 2.\n");
+        }
+
+ //      fprintf(stderr, "b1: %i  b2: %i\n", b1, b2);
+        /* missing genotypes are counted as always passing, as they neither
+         * mismatch the initial selected genotype for a selected sample, nor
+         * do they match the initial selected genotype for an excluded sample's
+         * genotype */
+        if ( (b1 == 0) || (b2 == 0) )
+        {
+            gt++;
+//            fprintf(stderr, "missing => pass\n");
+            continue;
+        }
+        else if ( args.selected_smps[i] == 1 )
+        {
+            if ( (b1 == a1) && (b2 == a2) )
+            {
+                gt++;
+//                fprintf(stderr, "match => pass\n");
+                continue;
+            }
+            else
+            {
+//                fprintf(stderr, "no match => fail\n");
+                break;
+            }
+        }
+        else if ( args.selected_smps[i] == 0 )
+        {
+            if ( (b1 != a1 ) || (b2 != a2) )
+            {
+                gt++;
+ //               fprintf(stderr, "no match => pass\n");
+                continue;
+            }
+            else
+            {
+//                fprintf(stderr, "match => fail\n");
+                break;
+            }
+        }
+    }
+    if ( gt == args.nsmp )
+    {
+        return rec;
+    }
+    else
+    {
+        return NULL;
+    }
+}
+
+void destroy(void)
+{
+    /* freeing up args */
+    bcf_hdr_destroy(args.hdr);
+    free(args.gt_arr);
+    free(args.selected_smps);
+}
diff --git a/bcftools/plugins/GTsubset.c.pysam.c b/bcftools/plugins/GTsubset.c.pysam.c
new file mode 100644
index 0000000..d4f8658
--- /dev/null
+++ b/bcftools/plugins/GTsubset.c.pysam.c
@@ -0,0 +1,268 @@
+#include "bcftools.pysam.h"
+
+/*  plugins/GTsubset.c -- output only positions where the selected samples exclusively
+                  share a genotype, i.e. all selected samples must have the same
+                  genotype (including both alleles) and none of the unselected
+                  samples can have the same genotype
+
+    Copyright (C) 2016 Computational Biology of Infection Research,
+                       Helmholtz Centre for Infection Research, Braunschweig,
+                       Germany
+
+    Author: David Laehnemann <david.laehnemann@hhu.de>
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE.  */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <getopt.h>
+#include <math.h>
+#include <unistd.h>
+#include <inttypes.h>
+
+#include <htslib/vcf.h>
+#include <htslib/synced_bcf_reader.h>
+
+#include "bcftools.h"
+
+typedef struct _args_t
+{
+    bcf_hdr_t *hdr;     /*! VCF file header */
+    int *gt_arr;        /*! temporary array, to store GTs of current line/record */
+    int ngt_arr;        /*! hold the number of current GT array entries */
+    int nsmp;           /*! number of samples, can be determined from header but is needed in multiple contexts */
+    int n_sel_smps;     /*! number of selected samples who should exclusively share genotypes */
+    int *selected_smps; /*! pointer to start of array containing 1 at indices corresponding to selected samples in header dict and 0 at others*/
+}
+args_t;
+
+static args_t args;
+
+const char *about(void)
+{
+    return "Output only sites where the requested samples all exclusively share a genotype (GT).\n";
+}
+
+
+const char *usage(void)
+{
+        return
+        "\n"
+        "About:   Output only sites where the requested samples all exclusively share a genotype (GT), i.e.\n"
+        "         all selected samples must have the same GT, while none of the others can have it.\n"
+        "Usage:   bcftools +GTsubset <multisample.bcf/.vcf.gz> [General Options] -- [Plugin Options] \n"
+        "\n"
+        "Options:\n"
+        "   run \"bcftools plugin\" for a list of common options\n"
+        "\n"
+        "Plugin options:\n"
+        "  -s,--sample-list     comma-separated list of samples; only those sites where all of these\n"
+        "                       samples exclusively share their genotype are given as output\n"
+        "\n"
+        "Example:\n"
+        "   bcftools +GTsubset in.vcf -- -s SMP1,SMP2 \n"
+        "\n";
+}
+
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+    memset(&args,0,sizeof(args_t));
+
+    int i;
+
+    static struct option loptions[] =
+    {
+        {"help",            no_argument,       0,'h'},
+        {"sample-list",     required_argument, 0,'s'},
+        {0,0,0,0}
+    };
+
+    char **smps_strs = NULL;
+
+    int c;
+    while ((c = getopt_long(argc, argv, "?s:h",loptions,NULL)) >= 0)
+    {
+        switch (c)
+        {
+            case 's': smps_strs = hts_readlist(optarg,0,&(args.n_sel_smps));
+                      if ( args.n_sel_smps == 0 )
+                      {
+                          fprintf(bcftools_stderr, "Sample specification not valid.\n");
+                          error("%s", usage());
+                      }
+                      break;
+            case 'h': usage(); break;
+            case '?':
+            default: error("%s", usage()); break;
+        }
+    }
+    if ( optind != argc )  usage();  // too many files given
+
+    args.hdr = bcf_hdr_dup(in);
+
+    // Samples parsing from header and input option
+    if ( !bcf_hdr_nsamples(args.hdr) )
+    {
+        error("No samples in input file.\n");
+    }
+    args.nsmp = bcf_hdr_nsamples(args.hdr);
+    args.selected_smps = (int*) calloc(args.nsmp,sizeof(int));
+    for ( i = 0; i < args.n_sel_smps; i++ )
+    {
+        int ind = bcf_hdr_id2int(args.hdr, BCF_DT_SAMPLE, smps_strs[i]);
+        if ( ind == -1 )
+        {
+            error("Sample '%s' not in input vcf file.\n", smps_strs[i]);
+        } else {
+            args.selected_smps[ind] = 1;
+        }
+        free(smps_strs[i]);
+    }
+    free(smps_strs);
+
+    /*
+    fprintf(bcftools_stderr, "Selected samples array:[");
+    for (i=0;i<args.nsmp;i++)
+    {
+        fprintf(bcftools_stderr, " %i", args.selected_smps[i]);
+    }
+    fprintf(bcftools_stderr, " ]\n");
+    */
+
+    if ( bcf_hdr_id2int(args.hdr, BCF_DT_ID, "GT")<0 ) error("[E::%s] GT not present in the header\n", __func__);
+
+    args.gt_arr = NULL;
+
+    return 0;
+}
+
+
+/*
+ * GT field (genotype) comparison function.
+ */
+bcf1_t *process(bcf1_t *rec)
+{
+    uint64_t i;
+    bcf_unpack(rec, BCF_UN_FMT); // unpack the Format fields, including the GT field
+    int gte_smp = 0; // number of GT array entries per sample (should be 2, one entry per allele)
+    args.ngt_arr = 0;        /*! hold the number of current GT array entries */
+    if ( (gte_smp = bcf_get_genotypes(args.hdr, rec, &(args.gt_arr), &(args.ngt_arr) ) ) <= 0 )
+    {
+        error("GT not present at %s: %d\n", args.hdr->id[BCF_DT_CTG][rec->rid].key, rec->pos+1);
+    }
+
+    gte_smp /= args.nsmp; // divide total number of genotypes array entries (= args.ngt_arr) by number of samples
+
+    // initialize with missing genotype
+    int a1 = 0;
+    int a2 = 0;
+
+    // initialize with first selected sample genotype that is not missing
+    int gt = -1;
+    while ( (a1 == 0) || (a2 == 0) )
+    {
+       gt++;
+       if (gt == args.nsmp) break;
+       if (args.selected_smps[gt] == 0) continue;
+       a1 = (args.gt_arr + gte_smp * gt)[0];
+       if ( gte_smp == 2 ) a2 = (args.gt_arr + gte_smp * gt)[1];
+       else if ( gte_smp == 1 ) a2 = bcf_int32_vector_end;
+       else error("GTsubset does not support ploidy higher than 2.\n");
+    }
+//    fprintf(bcftools_stderr, "a1: %i  a2: %i\n", a1, a2);
+
+    // check all genotypes if they match (for included samples) or disagree (for samples not included)
+    gt = 0;
+    for ( i = 0; i < args.nsmp; i++ )
+    {
+        int *gt_ptr = args.gt_arr + gte_smp * i;
+
+        int b1 = gt_ptr[0];
+        int b2;
+        if ( gte_smp == 2 ) // two entries available per sample, padded with missing values for haploid genotypes
+        {
+            b2 = gt_ptr[1];
+        }
+        else if (gte_smp == 1 ) // use vector end value for second entry, if only one is available
+        {
+            b2 = bcf_int32_vector_end;
+        }
+        else
+        {
+            error("GTsubset does not support ploidy higher than 2.\n");
+        }
+
+ //      fprintf(bcftools_stderr, "b1: %i  b2: %i\n", b1, b2);
+        /* missing genotypes are counted as always passing, as they neither
+         * mismatch the initial selected genotype for a selected sample, nor
+         * do they match the initial selected genotype for an excluded sample's
+         * genotype */
+        if ( (b1 == 0) || (b2 == 0) )
+        {
+            gt++;
+//            fprintf(bcftools_stderr, "missing => pass\n");
+            continue;
+        }
+        else if ( args.selected_smps[i] == 1 )
+        {
+            if ( (b1 == a1) && (b2 == a2) )
+            {
+                gt++;
+//                fprintf(bcftools_stderr, "match => pass\n");
+                continue;
+            }
+            else
+            {
+//                fprintf(bcftools_stderr, "no match => fail\n");
+                break;
+            }
+        }
+        else if ( args.selected_smps[i] == 0 )
+        {
+            if ( (b1 != a1 ) || (b2 != a2) )
+            {
+                gt++;
+ //               fprintf(bcftools_stderr, "no match => pass\n");
+                continue;
+            }
+            else
+            {
+//                fprintf(bcftools_stderr, "match => fail\n");
+                break;
+            }
+        }
+    }
+    if ( gt == args.nsmp )
+    {
+        return rec;
+    }
+    else
+    {
+        return NULL;
+    }
+}
+
+void destroy(void)
+{
+    /* freeing up args */
+    bcf_hdr_destroy(args.hdr);
+    free(args.gt_arr);
+    free(args.selected_smps);
+}
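The decision rule of the GTsubset `process()` function above can be reduced to a small standalone predicate. This is a simplified sketch, not the plugin's code: `exclusively_shared` and `MISSING_GT` are hypothetical names, and each sample's genotype is assumed to be collapsed into a single opaque integer code instead of the two htslib allele entries the plugin compares. As in the plugin, missing genotypes always pass, selected samples must match the first non-missing selected genotype, and unselected samples must differ from it.

```c
#include <stddef.h>

/* Hypothetical marker for a missing genotype call in this sketch. */
enum { MISSING_GT = -1 };

/* gt[i]       : opaque genotype code of sample i, or MISSING_GT
 * selected[i] : 1 if sample i was requested, 0 otherwise
 * Returns 1 if the site would be kept, 0 if it would be dropped. */
static int exclusively_shared(const int *gt, const int *selected, size_t n)
{
    int ref = MISSING_GT;
    size_t i;

    /* reference genotype: first selected sample with a non-missing call */
    for (i = 0; i < n; i++) {
        if (selected[i] && gt[i] != MISSING_GT) {
            ref = gt[i];
            break;
        }
    }

    for (i = 0; i < n; i++) {
        if (gt[i] == MISSING_GT)
            continue;                 /* missing calls neither match nor mismatch */
        int same = (gt[i] == ref);
        if (selected[i] ? !same : same)
            return 0;                 /* selected mismatch or unselected match */
    }
    return 1;
}
```

With three samples of which the first two are selected, codes `{5, 5, 7}` keep the site, while `{5, 5, 5}` and `{5, 6, 7}` drop it.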
diff --git a/bcftools/plugins/ad-bias.c b/bcftools/plugins/ad-bias.c
new file mode 100644
index 0000000..ee6a07e
--- /dev/null
+++ b/bcftools/plugins/ad-bias.c
@@ -0,0 +1,226 @@
+/* The MIT License
+
+   Copyright (c) 2016 Genome Research Ltd.
+
+   Author: Petr Danecek <pd3@sanger.ac.uk>
+   
+   Permission is hereby granted, free of charge, to any person obtaining a copy
+   of this software and associated documentation files (the "Software"), to deal
+   in the Software without restriction, including without limitation the rights
+   to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+   copies of the Software, and to permit persons to whom the Software is
+   furnished to do so, subject to the following conditions:
+   
+   The above copyright notice and this permission notice shall be included in
+   all copies or substantial portions of the Software.
+   
+   THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+   IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+   FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+   AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+   LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+   OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+   THE SOFTWARE.
+
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <getopt.h>
+#include <math.h>
+#include <htslib/hts.h>
+#include <htslib/vcf.h>
+#include <htslib/kstring.h>
+#include <htslib/kseq.h>
+#include <htslib/kfunc.h>
+#include <inttypes.h>
+#include "bcftools.h"
+#include "convert.h"
+
+typedef struct
+{
+    int smpl,ctrl;      // VCF sample index
+    const char *smpl_name, *ctrl_name;
+}
+pair_t;
+
+typedef struct
+{
+    bcf_hdr_t *hdr;
+    pair_t *pair;
+    int npair, mpair, min_dp, min_alt_dp;
+    int32_t *ad_arr;
+    int mad_arr;
+    double th;
+    convert_t *convert;
+    kstring_t str;
+    uint64_t nsite,ncmp;
+}
+args_t;
+
+args_t args;
+
+const char *about(void)
+{
+    return "Find positions with wildly varying ALT allele frequency (Fisher test on FMT/AD).\n";
+}
+
+const char *usage(void)
+{
+    return 
+        "\n"
+        "About: Find positions with wildly varying ALT allele frequency (Fisher test on FMT/AD).\n"
+        "Usage: bcftools +ad-bias [General Options] -- [Plugin Options]\n"
+        "Options:\n"
+        "   run \"bcftools plugin\" for a list of common options\n"
+        "\n"
+        "Plugin options:\n"
+        "   -a, --min-alt-dp <int>      Minimum required alternate allele depth [1]\n"
+        "   -d, --min-dp <int>          Minimum required depth [0]\n"
+        "   -f, --format <string>       Optional tags to append to output (`bcftools query` style of format)\n"
+        "   -s, --samples <file>        List of sample pairs, one tab-delimited pair per line\n"
+        "   -t, --threshold <float>     Output only hits with p-value smaller than <float> [1e-3]\n"
+        "\n"
+        "Example:\n"
+        "   bcftools +ad-bias file.bcf -- -t 1e-3 -s samples.txt\n"
+        "\n";
+}
+
+void parse_samples(args_t *args, char *fname)
+{
+    htsFile *fp = hts_open(fname, "r");
+    if ( !fp ) error("Could not read: %s\n", fname);
+
+    kstring_t str = {0,0,0};
+    if ( hts_getline(fp, KS_SEP_LINE, &str) <= 0 ) error("Empty file: %s\n", fname);
+
+    int moff = 0, *off = NULL;
+    do
+    {
+        // HPSI0513i-veqz_6    HPSI0513pf-veqz
+        int ncols = ksplit_core(str.s,'\t',&moff,&off);
+        if ( ncols<2 ) error("Could not parse the sample file: %s\n", str.s);
+
+        int smpl = bcf_hdr_id2int(args->hdr,BCF_DT_SAMPLE,&str.s[off[0]]);
+        if ( smpl<0 ) continue;
+        int ctrl = bcf_hdr_id2int(args->hdr,BCF_DT_SAMPLE,&str.s[off[1]]);
+        if ( ctrl<0 ) continue;
+
+        args->npair++;
+        hts_expand0(pair_t,args->npair,args->mpair,args->pair);
+        pair_t *pair = &args->pair[args->npair-1];
+        pair->ctrl = ctrl;
+        pair->smpl = smpl;
+        pair->smpl_name = bcf_hdr_int2id(args->hdr,BCF_DT_SAMPLE,pair->smpl);
+        pair->ctrl_name = bcf_hdr_int2id(args->hdr,BCF_DT_SAMPLE,pair->ctrl);
+    } while ( hts_getline(fp, KS_SEP_LINE, &str)>=0 );
+
+    free(str.s);
+    free(off);
+    hts_close(fp);
+}
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+    memset(&args,0,sizeof(args_t));
+    args.hdr = in;
+    args.th  = 1e-3;
+    args.min_alt_dp = 1;
+    char *fname = NULL, *format = NULL;
+    static struct option loptions[] =
+    {
+        {"min-dp",required_argument,NULL,'d'},
+        {"min-alt-dp",required_argument,NULL,'a'},
+        {"format",required_argument,NULL,'f'},
+        {"samples",required_argument,NULL,'s'},
+        {"threshold",required_argument,NULL,'t'},
+        {NULL,0,NULL,0}
+    };
+    int c;
+    char *tmp;
+    while ((c = getopt_long(argc, argv, "?hs:t:f:d:a:",loptions,NULL)) >= 0)
+    {
+        switch (c) 
+        {
+            case 'a':
+                args.min_alt_dp = strtol(optarg,&tmp,10);
+                if ( *tmp ) error("Could not parse: -a %s\n", optarg);
+                break;
+            case 'd':
+                args.min_dp = strtol(optarg,&tmp,10);
+                if ( *tmp ) error("Could not parse: -d %s\n", optarg);
+                break;
+            case 't':
+                args.th = strtod(optarg,&tmp);
+                if ( *tmp ) error("Could not parse: -t %s\n", optarg);
+                break;
+            case 's': fname = optarg; break;
+            case 'f': format = optarg; break;
+            case 'h':
+            case '?':
+            default: error("%s", usage()); break;
+        }
+    }
+    if ( !fname ) error("Expected the -s option\n");
+    parse_samples(&args, fname);
+    if ( format ) args.convert = convert_init(args.hdr, NULL, 0, format);
+    printf("# This file was produced by: bcftools +ad-bias(%s+htslib-%s)\n", bcftools_version(),hts_version());
+    printf("# The command line was:\tbcftools +ad-bias %s", argv[0]);
+    for (c=1; c<argc; c++) printf(" %s",argv[c]);
+    printf("\n#\n");
+    printf("# FT, Fisher Test\t[2]Sample\t[3]Control\t[4]Chrom\t[5]Pos\t[6]smpl.nREF\t[7]smpl.nALT\t[8]ctrl.nREF\t[9]ctrl.nALT\t[10]P-value");
+    if ( format ) printf("\t[11-]User data: %s", format);
+    printf("\n");
+    return 1;
+}
+
+bcf1_t *process(bcf1_t *rec)
+{
+    int nad = bcf_get_format_int32(args.hdr, rec, "AD", &args.ad_arr, &args.mad_arr);
+    if ( nad<0 ) return NULL;
+    nad /= bcf_hdr_nsamples(args.hdr);
+    
+    if ( args.convert ) convert_line(args.convert, rec, &args.str);
+    args.nsite++;
+
+    int i;
+    for (i=0; i<args.npair; i++)
+    {
+        pair_t *pair = &args.pair[i];
+        int32_t *aptr = args.ad_arr + nad*pair->smpl;
+        int32_t *bptr = args.ad_arr + nad*pair->ctrl;
+
+        if ( aptr[0]==bcf_int32_missing ) continue;
+        if ( bptr[0]==bcf_int32_missing ) continue;
+        if ( aptr[0]+aptr[1] < args.min_dp ) continue;
+        if ( bptr[0]+bptr[1] < args.min_dp ) continue;
+        if ( aptr[1] < args.min_alt_dp && bptr[1] < args.min_alt_dp ) continue;
+
+        args.ncmp++;
+
+        int n11 = aptr[0], n12 = aptr[1];
+        int n21 = bptr[0], n22 = bptr[1];
+        double left, right, fisher;
+        kt_fisher_exact(n11,n12,n21,n22, &left,&right,&fisher);
+        if ( fisher >= args.th ) continue;
+
+        printf("FT\t%s\t%s\t%s\t%d\t%d\t%d\t%d\t%d\t%e",
+            pair->smpl_name,pair->ctrl_name,
+            bcf_hdr_id2name(args.hdr,rec->rid), rec->pos+1,
+            n11,n12,n21,n22, fisher
+            );
+        if ( args.convert ) printf("\t%s", args.str.s);
+        printf("\n");
+    }
+    return NULL;
+}
+
+void destroy(void)
+{
+    printf("# SN, Summary Numbers\t[2]Number of Pairs\t[3]Number of Sites\t[4]Number of comparisons\t[5]P-value output threshold\n");
+    printf("SN\t%d\t%"PRIu64"\t%"PRIu64"\t%e\n",args.npair,args.nsite,args.ncmp,args.th);
+    if (args.convert) convert_destroy(args.convert);
+    free(args.str.s);
+    free(args.pair);
+    free(args.ad_arr);
+}
diff --git a/bcftools/plugins/ad-bias.c.pysam.c b/bcftools/plugins/ad-bias.c.pysam.c
new file mode 100644
index 0000000..894dd4c
--- /dev/null
+++ b/bcftools/plugins/ad-bias.c.pysam.c
@@ -0,0 +1,228 @@
+#include "bcftools.pysam.h"
+
+/* The MIT License
+
+   Copyright (c) 2016 Genome Research Ltd.
+
+   Author: Petr Danecek <pd3@sanger.ac.uk>
+   
+   Permission is hereby granted, free of charge, to any person obtaining a copy
+   of this software and associated documentation files (the "Software"), to deal
+   in the Software without restriction, including without limitation the rights
+   to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+   copies of the Software, and to permit persons to whom the Software is
+   furnished to do so, subject to the following conditions:
+   
+   The above copyright notice and this permission notice shall be included in
+   all copies or substantial portions of the Software.
+   
+   THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+   IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+   FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+   AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+   LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+   OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+   THE SOFTWARE.
+
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <getopt.h>
+#include <math.h>
+#include <htslib/hts.h>
+#include <htslib/vcf.h>
+#include <htslib/kstring.h>
+#include <htslib/kseq.h>
+#include <htslib/kfunc.h>
+#include <inttypes.h>
+#include "bcftools.h"
+#include "convert.h"
+
+typedef struct
+{
+    int smpl,ctrl;      // VCF sample index
+    const char *smpl_name, *ctrl_name;
+}
+pair_t;
+
+typedef struct
+{
+    bcf_hdr_t *hdr;
+    pair_t *pair;
+    int npair, mpair, min_dp, min_alt_dp;
+    int32_t *ad_arr;
+    int mad_arr;
+    double th;
+    convert_t *convert;
+    kstring_t str;
+    uint64_t nsite,ncmp;
+}
+args_t;
+
+args_t args;
+
+const char *about(void)
+{
+    return "Find positions with wildly varying ALT allele frequency (Fisher test on FMT/AD).\n";
+}
+
+const char *usage(void)
+{
+    return 
+        "\n"
+        "About: Find positions with wildly varying ALT allele frequency (Fisher test on FMT/AD).\n"
+        "Usage: bcftools +ad-bias [General Options] -- [Plugin Options]\n"
+        "Options:\n"
+        "   run \"bcftools plugin\" for a list of common options\n"
+        "\n"
+        "Plugin options:\n"
+        "   -a, --min-alt-dp <int>      Minimum required alternate allele depth [1]\n"
+        "   -d, --min-dp <int>          Minimum required depth [0]\n"
+        "   -f, --format <string>       Optional tags to append to output (`bcftools query` style of format)\n"
+        "   -s, --samples <file>        List of sample pairs, one tab-delimited pair per line\n"
+        "   -t, --threshold <float>     Output only hits with p-value smaller than <float> [1e-3]\n"
+        "\n"
+        "Example:\n"
+        "   bcftools +ad-bias file.bcf -- -t 1e-3 -s samples.txt\n"
+        "\n";
+}
+
+void parse_samples(args_t *args, char *fname)
+{
+    htsFile *fp = hts_open(fname, "r");
+    if ( !fp ) error("Could not read: %s\n", fname);
+
+    kstring_t str = {0,0,0};
+    if ( hts_getline(fp, KS_SEP_LINE, &str) <= 0 ) error("Empty file: %s\n", fname);
+
+    int moff = 0, *off = NULL;
+    do
+    {
+        // HPSI0513i-veqz_6    HPSI0513pf-veqz
+        int ncols = ksplit_core(str.s,'\t',&moff,&off);
+        if ( ncols<2 ) error("Could not parse the sample file: %s\n", str.s);
+
+        int smpl = bcf_hdr_id2int(args->hdr,BCF_DT_SAMPLE,&str.s[off[0]]);
+        if ( smpl<0 ) continue;
+        int ctrl = bcf_hdr_id2int(args->hdr,BCF_DT_SAMPLE,&str.s[off[1]]);
+        if ( ctrl<0 ) continue;
+
+        args->npair++;
+        hts_expand0(pair_t,args->npair,args->mpair,args->pair);
+        pair_t *pair = &args->pair[args->npair-1];
+        pair->ctrl = ctrl;
+        pair->smpl = smpl;
+        pair->smpl_name = bcf_hdr_int2id(args->hdr,BCF_DT_SAMPLE,pair->smpl);
+        pair->ctrl_name = bcf_hdr_int2id(args->hdr,BCF_DT_SAMPLE,pair->ctrl);
+    } while ( hts_getline(fp, KS_SEP_LINE, &str)>=0 );
+
+    free(str.s);
+    free(off);
+    hts_close(fp);
+}
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+    memset(&args,0,sizeof(args_t));
+    args.hdr = in;
+    args.th  = 1e-3;
+    args.min_alt_dp = 1;
+    char *fname = NULL, *format = NULL;
+    static struct option loptions[] =
+    {
+        {"min-dp",required_argument,NULL,'d'},
+        {"min-alt-dp",required_argument,NULL,'a'},
+        {"format",required_argument,NULL,'f'},
+        {"samples",required_argument,NULL,'s'},
+        {"threshold",required_argument,NULL,'t'},
+        {NULL,0,NULL,0}
+    };
+    int c;
+    char *tmp;
+    while ((c = getopt_long(argc, argv, "?hs:t:f:d:a:",loptions,NULL)) >= 0)
+    {
+        switch (c) 
+        {
+            case 'a':
+                args.min_alt_dp = strtol(optarg,&tmp,10);
+                if ( *tmp ) error("Could not parse: -a %s\n", optarg);
+                break;
+            case 'd':
+                args.min_dp = strtol(optarg,&tmp,10);
+                if ( *tmp ) error("Could not parse: -d %s\n", optarg);
+                break;
+            case 't':
+                args.th = strtod(optarg,&tmp);
+                if ( *tmp ) error("Could not parse: -t %s\n", optarg);
+                break;
+            case 's': fname = optarg; break;
+            case 'f': format = optarg; break;
+            case 'h':
+            case '?':
+            default: error("%s", usage()); break;
+        }
+    }
+    if ( !fname ) error("Expected the -s option\n");
+    parse_samples(&args, fname);
+    if ( format ) args.convert = convert_init(args.hdr, NULL, 0, format);
+    fprintf(bcftools_stdout, "# This file was produced by: bcftools +ad-bias(%s+htslib-%s)\n", bcftools_version(),hts_version());
+    fprintf(bcftools_stdout, "# The command line was:\tbcftools +ad-bias %s", argv[0]);
+    for (c=1; c<argc; c++) fprintf(bcftools_stdout, " %s",argv[c]);
+    fprintf(bcftools_stdout, "\n#\n");
+    fprintf(bcftools_stdout, "# FT, Fisher Test\t[2]Sample\t[3]Control\t[4]Chrom\t[5]Pos\t[6]smpl.nREF\t[7]smpl.nALT\t[8]ctrl.nREF\t[9]ctrl.nALT\t[10]P-value");
+    if ( format ) fprintf(bcftools_stdout, "\t[11-]User data: %s", format);
+    fprintf(bcftools_stdout, "\n");
+    return 1;
+}
+
+bcf1_t *process(bcf1_t *rec)
+{
+    int nad = bcf_get_format_int32(args.hdr, rec, "AD", &args.ad_arr, &args.mad_arr);
+    if ( nad<0 ) return NULL;
+    nad /= bcf_hdr_nsamples(args.hdr);
+    
+    if ( args.convert ) convert_line(args.convert, rec, &args.str);
+    args.nsite++;
+
+    int i;
+    for (i=0; i<args.npair; i++)
+    {
+        pair_t *pair = &args.pair[i];
+        int32_t *aptr = args.ad_arr + nad*pair->smpl;
+        int32_t *bptr = args.ad_arr + nad*pair->ctrl;
+
+        if ( aptr[0]==bcf_int32_missing ) continue;
+        if ( bptr[0]==bcf_int32_missing ) continue;
+        if ( aptr[0]+aptr[1] < args.min_dp ) continue;
+        if ( bptr[0]+bptr[1] < args.min_dp ) continue;
+        if ( aptr[1] < args.min_alt_dp && bptr[1] < args.min_alt_dp ) continue;
+
+        args.ncmp++;
+
+        int n11 = aptr[0], n12 = aptr[1];
+        int n21 = bptr[0], n22 = bptr[1];
+        double left, right, fisher;
+        kt_fisher_exact(n11,n12,n21,n22, &left,&right,&fisher);
+        if ( fisher >= args.th ) continue;
+
+        fprintf(bcftools_stdout, "FT\t%s\t%s\t%s\t%d\t%d\t%d\t%d\t%d\t%e",
+            pair->smpl_name,pair->ctrl_name,
+            bcf_hdr_id2name(args.hdr,rec->rid), rec->pos+1,
+            n11,n12,n21,n22, fisher
+            );
+        if ( args.convert ) fprintf(bcftools_stdout, "\t%s", args.str.s);
+        fprintf(bcftools_stdout, "\n");
+    }
+    return NULL;
+}
+
+void destroy(void)
+{
+    fprintf(bcftools_stdout, "# SN, Summary Numbers\t[2]Number of Pairs\t[3]Number of Sites\t[4]Number of comparisons\t[5]P-value output threshold\n");
+    fprintf(bcftools_stdout, "SN\t%d\t%"PRIu64"\t%"PRIu64"\t%e\n",args.npair,args.nsite,args.ncmp,args.th);
+    if (args.convert) convert_destroy(args.convert);
+    free(args.str.s);
+    free(args.pair);
+    free(args.ad_arr);
+}
diff --git a/bcftools/plugins/af-dist.c b/bcftools/plugins/af-dist.c
new file mode 100644
index 0000000..819adc9
--- /dev/null
+++ b/bcftools/plugins/af-dist.c
@@ -0,0 +1,220 @@
+/* The MIT License
+
+   Copyright (c) 2016 Genome Research Ltd.
+
+   Author: Petr Danecek <pd3@sanger.ac.uk>
+   
+   Permission is hereby granted, free of charge, to any person obtaining a copy
+   of this software and associated documentation files (the "Software"), to deal
+   in the Software without restriction, including without limitation the rights
+   to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+   copies of the Software, and to permit persons to whom the Software is
+   furnished to do so, subject to the following conditions:
+   
+   The above copyright notice and this permission notice shall be included in
+   all copies or substantial portions of the Software.
+   
+   THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+   IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+   FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+   AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+   LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+   OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+   THE SOFTWARE.
+
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <getopt.h>
+#include <inttypes.h>
+#include <htslib/hts.h>
+#include <htslib/vcf.h>
+#include "bcftools.h"
+#include "bin.h"
+
+typedef struct
+{
+    char *af_tag;
+    bcf_hdr_t *hdr;
+    int32_t *gt, ngt, naf;
+    float *af, list_min, list_max;
+    bin_t *dev_bins, *prob_bins;
+    uint64_t *dev_dist, *prob_dist;
+}
+args_t;
+
+args_t *args;
+
+const char *about(void)
+{
+    return "AF and GT probability distribution stats.\n";
+}
+
+const char *usage(void)
+{
+    return 
+        "\n"
+        "About: Collect AF deviation stats and GT probability distribution\n"
+        "       given AF and assuming HWE\n"
+        "Usage: bcftools +af-dist [General Options] -- [Plugin Options]\n"
+        "Options:\n"
+        "   run \"bcftools plugin\" for a list of common options\n"
+        "\n"
+        "Plugin options:\n"
+        "   -d, --dev-bins <list>       AF deviation bins\n"
+        "   -l, --list <min,max>        list genotypes from the given bin (for debugging)\n"
+        "   -p, --prob-bins <list>      probability distribution bins\n"
+        "   -t, --af-tag <tag>          VCF INFO tag to use [AF]\n"
+        "\n"
+        "Default binning:\n"
+        "   -d: 0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1\n"
+        "   -p: 0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1\n"
+        "Example:\n"
+        "   bcftools +af-dist file.bcf -- -t EUR_AF -p bins.txt\n"
+        "\n";
+}
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+    args = (args_t*) calloc(1,sizeof(args_t));
+    char *dev_bins  = "0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1";
+    char *prob_bins = "0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1";
+    args->hdr = in;
+    args->af_tag = "AF";
+    args->list_min = -1;
+    static struct option loptions[] =
+    {
+        {"list",required_argument,NULL,'l'},
+        {"dev-bins",required_argument,NULL,'d'},
+        {"prob-bins",required_argument,NULL,'p'},
+        {"af-tag",required_argument,NULL,'t'},
+        {NULL,0,NULL,0}
+    };
+    int c;
+    while ((c = getopt_long(argc, argv, "?ht:d:p:l:",loptions,NULL)) >= 0)
+    {
+        switch (c) 
+        {
+            case 'l': 
+            {
+                char *a,*b;
+                args->list_min = strtod(optarg,&a);
+                if ( a==optarg || *a!=',' ) error("Could not parse: --list %s\n", optarg);
+                args->list_max = strtod(a+1,&b);
+                if ( a+1==b || *b ) error("Could not parse: --list %s\n", optarg);
+                break;
+            }
+            case 'd': dev_bins = optarg; break;
+            case 'p': prob_bins = optarg; break;
+            case 't': args->af_tag = optarg; break;
+            case 'h':
+            case '?':
+            default: error("%s", usage()); break;
+        }
+    }
+
+    args->dev_bins = bin_init(dev_bins,0,1);
+    int nbins = bin_get_size(args->dev_bins);
+    args->dev_dist = (uint64_t*)calloc(nbins,sizeof(*args->dev_dist));
+
+    args->prob_bins = bin_init(prob_bins,0,1);
+    nbins = bin_get_size(args->prob_bins);
+    args->prob_dist = (uint64_t*)calloc(nbins,sizeof(*args->prob_dist));
+
+    printf("# This file was produced by: bcftools +af-dist(%s+htslib-%s)\n", bcftools_version(),hts_version());
+    printf("# The command line was:\tbcftools +af-dist %s", argv[0]);
+    for (c=1; c<argc; c++) printf(" %s",argv[c]);
+    printf("\n#\n");
+
+    if ( args->list_min!=-1 )
+        printf("# GT, genotypes with P(AF) in [%f,%f]; [2]Chromosome\t[3]Position\t[4]Sample\t[5]Genotype\t[6]AF-based probability\n",args->list_min,args->list_max);
+
+    return 1;
+}
+
+bcf1_t *process(bcf1_t *rec)
+{
+    int naf = bcf_get_info_float(args->hdr,rec,args->af_tag,&args->af,&args->naf);
+    if ( naf<=0 ) return NULL;
+    float af = args->af[0];
+
+    float pRA = 2*af*(1-af);
+    float pAA = af*af;
+    int iRA = bin_get_idx(args->prob_bins,pRA);
+    int iAA = bin_get_idx(args->prob_bins,pAA);
+
+    int list_RA = args->list_min==-1 || pRA < args->list_min || pRA > args->list_max ? 0 : 1;
+    int list_AA = args->list_min==-1 || pAA < args->list_min || pAA > args->list_max ? 0 : 1;
+    const char *chr = bcf_seqname(args->hdr,rec);
+
+    int ngt = bcf_get_genotypes(args->hdr, rec, &args->gt, &args->ngt);
+    int i, j, nsmpl = bcf_hdr_nsamples(args->hdr);
+    int nals = 0, nalt = 0;
+    ngt /= nsmpl;
+    for (i=0; i<nsmpl; i++)
+    {
+        int32_t *ptr = args->gt + i*ngt;
+        int dosage = 0;
+        for (j=0; j<ngt; j++)
+        {
+            if ( bcf_gt_is_missing(ptr[j]) ) break;
+            if ( ptr[j]==bcf_int32_vector_end ) break;
+            if ( bcf_gt_allele(ptr[j])==1 ) dosage++;
+        }
+        if ( j!=ngt ) continue;
+
+        nals += j;
+        nalt += dosage;
+
+        if ( dosage==1 )
+        {
+            args->prob_dist[iRA]++;
+            if ( list_RA ) printf("GT\t%s\t%d\t%s\t1\t%f\n",chr,rec->pos+1,args->hdr->samples[i],pRA);
+        }
+        else if ( dosage==2 )
+        {
+            args->prob_dist[iAA]++;
+            if ( list_AA ) printf("GT\t%s\t%d\t%s\t2\t%f\n",chr,rec->pos+1,args->hdr->samples[i],pAA);
+        }
+    }
+
+    if ( nals && (nalt || af) )
+    {
+        float af_dev = fabs(af - (float)nalt/nals);
+        int iAF = bin_get_idx(args->dev_bins,af_dev);
+        args->dev_dist[iAF]++;
+    }
+
+    return NULL;
+}
+
+void destroy(void)
+{
+    printf("# PROB_DIST, genotype probability distribution, assumes HWE\n");
+    int i, n;
+    n = bin_get_size(args->prob_bins);
+    for (i=0; i<n-1; i++)
+    {
+        float min = bin_get_value(args->prob_bins,i);
+        float max = bin_get_value(args->prob_bins,i+1);
+        printf("PROB_DIST\t%f\t%f\t%"PRIu64"\n", min,max,args->prob_dist[i]);
+    }
+    printf("# DEV_DIST, distribution of AF deviation, based on %s and INFO/AN, AC calculated on the fly\n", args->af_tag);
+    n = bin_get_size(args->dev_bins);
+    for (i=0; i<n-1; i++)
+    {
+        float min = bin_get_value(args->dev_bins,i);
+        float max = bin_get_value(args->dev_bins,i+1);
+        printf("DEV_DIST\t%f\t%f\t%"PRIu64"\n", min,max,args->dev_dist[i]);
+    }
+    bin_destroy(args->dev_bins);
+    bin_destroy(args->prob_bins);
+    free(args->dev_dist);
+    free(args->prob_dist);
+    free(args->gt);
+    free(args->af);
+    free(args);
+}
+
+
diff --git a/bcftools/plugins/af-dist.c.pysam.c b/bcftools/plugins/af-dist.c.pysam.c
new file mode 100644
index 0000000..5dc67b3
--- /dev/null
+++ b/bcftools/plugins/af-dist.c.pysam.c
@@ -0,0 +1,222 @@
+#include "bcftools.pysam.h"
+
+/* The MIT License
+
+   Copyright (c) 2016 Genome Research Ltd.
+
+   Author: Petr Danecek <pd3@sanger.ac.uk>
+   
+   Permission is hereby granted, free of charge, to any person obtaining a copy
+   of this software and associated documentation files (the "Software"), to deal
+   in the Software without restriction, including without limitation the rights
+   to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+   copies of the Software, and to permit persons to whom the Software is
+   furnished to do so, subject to the following conditions:
+   
+   The above copyright notice and this permission notice shall be included in
+   all copies or substantial portions of the Software.
+   
+   THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+   IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+   FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+   AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+   LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+   OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+   THE SOFTWARE.
+
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <getopt.h>
+#include <inttypes.h>
+#include <htslib/hts.h>
+#include <htslib/vcf.h>
+#include "bcftools.h"
+#include "bin.h"
+
+typedef struct
+{
+    char *af_tag;
+    bcf_hdr_t *hdr;
+    int32_t *gt, ngt, naf;
+    float *af, list_min, list_max;
+    bin_t *dev_bins, *prob_bins;
+    uint64_t *dev_dist, *prob_dist;
+}
+args_t;
+
+args_t *args;
+
+const char *about(void)
+{
+    return "AF and GT probability distribution stats.\n";
+}
+
+const char *usage(void)
+{
+    return 
+        "\n"
+        "About: Collect AF deviation stats and GT probability distribution\n"
+        "       given AF and assuming HWE\n"
+        "Usage: bcftools +af-dist [General Options] -- [Plugin Options]\n"
+        "Options:\n"
+        "   run \"bcftools plugin\" for a list of common options\n"
+        "\n"
+        "Plugin options:\n"
+        "   -d, --dev-bins <list>       AF deviation bins\n"
+        "   -l, --list <min,max>        list genotypes from the given bin (for debugging)\n"
+        "   -p, --prob-bins <list>      probability distribution bins\n"
+        "   -t, --af-tag <tag>          VCF INFO tag to use [AF]\n"
+        "\n"
+        "Default binning:\n"
+        "   -d: 0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1\n"
+        "   -p: 0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1\n"
+        "Example:\n"
+        "   bcftools +af-dist file.bcf -- -t EUR_AF -p bins.txt\n"
+        "\n";
+}
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+    args = (args_t*) calloc(1,sizeof(args_t));
+    char *dev_bins  = "0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1";
+    char *prob_bins = "0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1";
+    args->hdr = in;
+    args->af_tag = "AF";
+    args->list_min = -1;
+    static struct option loptions[] =
+    {
+        {"list",required_argument,NULL,'l'},
+        {"dev-bins",required_argument,NULL,'d'},
+        {"prob-bins",required_argument,NULL,'p'},
+        {"af-tag",required_argument,NULL,'t'},
+        {NULL,0,NULL,0}
+    };
+    int c;
+    while ((c = getopt_long(argc, argv, "?ht:d:p:l:",loptions,NULL)) >= 0)
+    {
+        switch (c) 
+        {
+            case 'l': 
+            {
+                char *a,*b;
+                args->list_min = strtod(optarg,&a);
+                if ( a==optarg || *a!=',' ) error("Could not parse: --list %s\n", optarg);
+                args->list_max = strtod(a+1,&b);
+                if ( a+1==b || *b ) error("Could not parse: --list %s\n", optarg);
+                break;
+            }
+            case 'd': dev_bins = optarg; break;
+            case 'p': prob_bins = optarg; break;
+            case 't': args->af_tag = optarg; break;
+            case 'h':
+            case '?':
+            default: error("%s", usage()); break;
+        }
+    }
+
+    args->dev_bins = bin_init(dev_bins,0,1);
+    int nbins = bin_get_size(args->dev_bins);
+    args->dev_dist = (uint64_t*)calloc(nbins,sizeof(*args->dev_dist));
+
+    args->prob_bins = bin_init(prob_bins,0,1);
+    nbins = bin_get_size(args->prob_bins);
+    args->prob_dist = (uint64_t*)calloc(nbins,sizeof(*args->prob_dist));
+
+    fprintf(bcftools_stdout, "# This file was produced by: bcftools +af-dist(%s+htslib-%s)\n", bcftools_version(),hts_version());
+    fprintf(bcftools_stdout, "# The command line was:\tbcftools +af-dist %s", argv[0]);
+    for (c=1; c<argc; c++) fprintf(bcftools_stdout, " %s",argv[c]);
+    fprintf(bcftools_stdout, "\n#\n");
+
+    if ( args->list_min!=-1 )
+        fprintf(bcftools_stdout, "# GT, genotypes with P(AF) in [%f,%f]; [2]Chromosome\t[3]Position\t[4]Sample\t[5]Genotype\t[6]AF-based probability\n",args->list_min,args->list_max);
+
+    return 1;
+}
+
+bcf1_t *process(bcf1_t *rec)
+{
+    int naf = bcf_get_info_float(args->hdr,rec,args->af_tag,&args->af,&args->naf);
+    if ( naf<=0 ) return NULL;
+    float af = args->af[0];
+
+    float pRA = 2*af*(1-af);
+    float pAA = af*af;
+    int iRA = bin_get_idx(args->prob_bins,pRA);
+    int iAA = bin_get_idx(args->prob_bins,pAA);
+
+    int list_RA = args->list_min==-1 || pRA < args->list_min || pRA > args->list_max ? 0 : 1;
+    int list_AA = args->list_min==-1 || pAA < args->list_min || pAA > args->list_max ? 0 : 1;
+    const char *chr = bcf_seqname(args->hdr,rec);
+
+    int ngt = bcf_get_genotypes(args->hdr, rec, &args->gt, &args->ngt);
+    int i, j, nsmpl = bcf_hdr_nsamples(args->hdr);
+    int nals = 0, nalt = 0;
+    ngt /= nsmpl;
+    for (i=0; i<nsmpl; i++)
+    {
+        int32_t *ptr = args->gt + i*ngt;
+        int dosage = 0;
+        for (j=0; j<ngt; j++)
+        {
+            if ( bcf_gt_is_missing(ptr[j]) ) break;
+            if ( ptr[j]==bcf_int32_vector_end ) break;
+            if ( bcf_gt_allele(ptr[j])==1 ) dosage++;
+        }
+        if ( j!=ngt ) continue;
+
+        nals += j;
+        nalt += dosage;
+
+        if ( dosage==1 )
+        {
+            args->prob_dist[iRA]++;
+            if ( list_RA ) fprintf(bcftools_stdout, "GT\t%s\t%d\t%s\t1\t%f\n",chr,rec->pos+1,args->hdr->samples[i],pRA);
+        }
+        else if ( dosage==2 )
+        {
+            args->prob_dist[iAA]++;
+            if ( list_AA ) fprintf(bcftools_stdout, "GT\t%s\t%d\t%s\t2\t%f\n",chr,rec->pos+1,args->hdr->samples[i],pAA);
+        }
+    }
+
+    if ( nals && (nalt || af) )
+    {
+        float af_dev = fabs(af - (float)nalt/nals);
+        int iAF = bin_get_idx(args->dev_bins,af_dev);
+        args->dev_dist[iAF]++;
+    }
+
+    return NULL;
+}
+
+void destroy(void)
+{
+    fprintf(bcftools_stdout, "# PROB_DIST, genotype probability distribution, assumes HWE\n");
+    int i, n;
+    n = bin_get_size(args->prob_bins);
+    for (i=0; i<n-1; i++)
+    {
+        float min = bin_get_value(args->prob_bins,i);
+        float max = bin_get_value(args->prob_bins,i+1);
+        fprintf(bcftools_stdout, "PROB_DIST\t%f\t%f\t%"PRIu64"\n", min,max,args->prob_dist[i]);
+    }
+    fprintf(bcftools_stdout, "# DEV_DIST, distribution of AF deviation, based on %s and INFO/AN, AC calculated on the fly\n", args->af_tag);
+    n = bin_get_size(args->dev_bins);
+    for (i=0; i<n-1; i++)
+    {
+        float min = bin_get_value(args->dev_bins,i);
+        float max = bin_get_value(args->dev_bins,i+1);
+        fprintf(bcftools_stdout, "DEV_DIST\t%f\t%f\t%"PRIu64"\n", min,max,args->dev_dist[i]);
+    }
+    bin_destroy(args->dev_bins);
+    bin_destroy(args->prob_bins);
+    free(args->dev_dist);
+    free(args->prob_dist);
+    free(args->gt);
+    free(args->af);
+    free(args);
+}
+
+
diff --git a/bcftools/plugins/check-ploidy.c b/bcftools/plugins/check-ploidy.c
new file mode 100644
index 0000000..51c4a3b
--- /dev/null
+++ b/bcftools/plugins/check-ploidy.c
@@ -0,0 +1,165 @@
+/* 
+    Copyright (C) 2017 Genome Research Ltd.
+
+    Author: Petr Danecek <pd3@sanger.ac.uk>
+
+    Permission is hereby granted, free of charge, to any person obtaining a copy
+    of this software and associated documentation files (the "Software"), to deal
+    in the Software without restriction, including without limitation the rights
+    to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+    copies of the Software, and to permit persons to whom the Software is
+    furnished to do so, subject to the following conditions:
+    
+    The above copyright notice and this permission notice shall be included in
+    all copies or substantial portions of the Software.
+    
+    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+    IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+    FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+    AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+    LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+    OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+    THE SOFTWARE.
+*/
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <strings.h>
+#include <getopt.h>
+#include <stdarg.h>
+#include <stdint.h>
+#include <htslib/vcf.h>
+#include <htslib/synced_bcf_reader.h>
+#include <htslib/vcfutils.h>
+#include <htslib/kseq.h>
+#include <inttypes.h>
+#include <unistd.h>
+#include "bcftools.h"
+
+typedef struct
+{
+    char *sample;
+    int beg,end,ploidy;
+}
+dat_t;
+
+typedef struct
+{
+    int argc;
+    char **argv;
+    int rid, gt_id, ndat;
+    dat_t *dat;
+    bcf_hdr_t *hdr;
+}
+args_t;
+
+static args_t *args;
+
+const char *about(void)
+{
+    return "Check if ploidy of samples is consistent for all sites\n";
+}
+
+const char *usage(void)
+{
+    return 
+        "\n"
+        "About: Check if ploidy of samples is consistent for all sites.\n"
+        "Usage: bcftools +check-ploidy [General Options] -- [Plugin Options]\n"
+        "Options:\n"
+        "   run \"bcftools plugin\" for a list of common options\n"
+        "\n"
+        "Example:\n"
+        "   bcftools +check-ploidy file.bcf\n"
+        "\n";
+}
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+    args = (args_t*) calloc(1,sizeof(args_t));
+    args->argc   = argc; args->argv = argv;
+    args->hdr = in;
+    args->ndat = bcf_hdr_nsamples(args->hdr);
+    args->dat  = (dat_t*) calloc(args->ndat,sizeof(dat_t));
+    int i;
+    for (i=0; i<args->ndat; i++) args->dat[i].sample = args->hdr->samples[i];
+    args->rid = -1;
+    args->gt_id = bcf_hdr_id2int(args->hdr,BCF_DT_ID,"GT");
+    if ( args->gt_id<0 ) error("Error: GT field is not present\n");
+    printf("# [1]Sample\t[2]Chromosome\t[3]Region Start\t[4]Region End\t[5]Ploidy\n");
+    return 1;
+}
+
+bcf1_t *process(bcf1_t *rec)
+{
+    int i;
+
+    bcf_unpack(rec, BCF_UN_FMT);
+    bcf_fmt_t *fmt_gt = NULL;
+    for (i=0; i<rec->n_fmt; i++)
+        if ( rec->d.fmt[i].id==args->gt_id ) { fmt_gt = &rec->d.fmt[i]; break; }
+    if ( !fmt_gt ) return NULL;    // no GT tag
+
+    if ( args->ndat != rec->n_sample ) 
+        error("Incorrect number of samples at %s:%d .. found %d, expected %d\n",bcf_seqname(args->hdr,rec),rec->pos+1,rec->n_sample,args->ndat);
+
+    if ( args->rid!=rec->rid && args->rid!=-1 )
+    {
+        for (i=0; i<args->ndat; i++)
+        {
+            dat_t *dat = &args->dat[i];
+            if ( dat->ploidy!=0 ) printf("%s\t%s\t%d\t%d\t%d\n", dat->sample,bcf_hdr_id2name(args->hdr,args->rid),dat->beg+1,dat->end+1,dat->ploidy); /* region belongs to the previous chromosome */
+            dat->ploidy = 0;
+        }
+    }
+    args->rid = rec->rid;
+
+    #define BRANCH_INT(type_t,vector_end) \
+    { \
+        for (i=0; i<rec->n_sample; i++) \
+        { \
+            type_t *p = (type_t*) (fmt_gt->p + i*fmt_gt->size); \
+            int nal, missing = 0; \
+            for (nal=0; nal<fmt_gt->n; nal++) \
+            { \
+                if ( p[nal]==vector_end ) break; /* smaller ploidy */ \
+                if ( bcf_gt_is_missing(p[nal]) ) { missing=1; break; } /* missing allele */ \
+            } \
+            if ( !nal || missing ) continue; /* missing genotype */ \
+            dat_t *dat = &args->dat[i]; \
+            if ( dat->ploidy==nal ) \
+            { \
+                dat->end = rec->pos; \
+                continue; \
+            } \
+            if ( dat->ploidy!=0 ) \
+                printf("%s\t%s\t%d\t%d\t%d\n", dat->sample,bcf_seqname(args->hdr,rec),dat->beg+1,dat->end+1,dat->ploidy); \
+            dat->ploidy = nal; \
+            dat->beg = rec->pos; \
+            dat->end = rec->pos; \
+        } \
+    }
+    switch (fmt_gt->type) {
+        case BCF_BT_INT8:  BRANCH_INT(int8_t,  bcf_int8_vector_end); break;
+        case BCF_BT_INT16: BRANCH_INT(int16_t, bcf_int16_vector_end); break;
+        case BCF_BT_INT32: BRANCH_INT(int32_t, bcf_int32_vector_end); break;
+        default: error("The GT type is not recognised: %d at %s:%d\n",fmt_gt->type, bcf_seqname(args->hdr,rec),rec->pos+1); break;
+    }
+    #undef BRANCH_INT
+
+    return NULL;
+}
+
+void destroy(void)
+{
+    int i;
+    for (i=0; i<args->ndat; i++)
+    {
+        dat_t *dat = &args->dat[i];
+        if ( dat->ploidy!=0 ) printf("%s\t%s\t%d\t%d\t%d\n", dat->sample,bcf_hdr_id2name(args->hdr,args->rid),dat->beg+1,dat->end+1,dat->ploidy); 
+        dat->ploidy = 0;
+    }
+    free(args->dat);
+    free(args);
+}
+
diff --git a/bcftools/plugins/check-ploidy.c.pysam.c b/bcftools/plugins/check-ploidy.c.pysam.c
new file mode 100644
index 0000000..a40b2f1
--- /dev/null
+++ b/bcftools/plugins/check-ploidy.c.pysam.c
@@ -0,0 +1,167 @@
+#include "bcftools.pysam.h"
+
+/* 
+    Copyright (C) 2017 Genome Research Ltd.
+
+    Author: Petr Danecek <pd3@sanger.ac.uk>
+
+    Permission is hereby granted, free of charge, to any person obtaining a copy
+    of this software and associated documentation files (the "Software"), to deal
+    in the Software without restriction, including without limitation the rights
+    to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+    copies of the Software, and to permit persons to whom the Software is
+    furnished to do so, subject to the following conditions:
+    
+    The above copyright notice and this permission notice shall be included in
+    all copies or substantial portions of the Software.
+    
+    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+    IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+    FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+    AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+    LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+    OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+    THE SOFTWARE.
+*/
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <strings.h>
+#include <getopt.h>
+#include <stdarg.h>
+#include <stdint.h>
+#include <htslib/vcf.h>
+#include <htslib/synced_bcf_reader.h>
+#include <htslib/vcfutils.h>
+#include <htslib/kseq.h>
+#include <inttypes.h>
+#include <unistd.h>
+#include "bcftools.h"
+
+typedef struct
+{
+    char *sample;
+    int beg,end,ploidy;
+}
+dat_t;
+
+typedef struct
+{
+    int argc;
+    char **argv;
+    int rid, gt_id, ndat;
+    dat_t *dat;
+    bcf_hdr_t *hdr;
+}
+args_t;
+
+static args_t *args;
+
+const char *about(void)
+{
+    return "Check if ploidy of samples is consistent for all sites\n";
+}
+
+const char *usage(void)
+{
+    return 
+        "\n"
+        "About: Check if ploidy of samples is consistent for all sites.\n"
+        "Usage: bcftools +check-ploidy [General Options] -- [Plugin Options]\n"
+        "Options:\n"
+        "   run \"bcftools plugin\" for a list of common options\n"
+        "\n"
+        "Example:\n"
+        "   bcftools +check-ploidy file.bcf\n"
+        "\n";
+}
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+    args = (args_t*) calloc(1,sizeof(args_t));
+    args->argc   = argc; args->argv = argv;
+    args->hdr = in;
+    args->ndat = bcf_hdr_nsamples(args->hdr);
+    args->dat  = (dat_t*) calloc(args->ndat,sizeof(dat_t));
+    int i;
+    for (i=0; i<args->ndat; i++) args->dat[i].sample = args->hdr->samples[i];
+    args->rid = -1;
+    args->gt_id = bcf_hdr_id2int(args->hdr,BCF_DT_ID,"GT");
+    if ( args->gt_id<0 ) error("Error: GT field is not present\n");
+    fprintf(bcftools_stdout, "# [1]Sample\t[2]Chromosome\t[3]Region Start\t[4]Region End\t[5]Ploidy\n");
+    return 1;
+}
+
+bcf1_t *process(bcf1_t *rec)
+{
+    int i;
+
+    bcf_unpack(rec, BCF_UN_FMT);
+    bcf_fmt_t *fmt_gt = NULL;
+    for (i=0; i<rec->n_fmt; i++)
+        if ( rec->d.fmt[i].id==args->gt_id ) { fmt_gt = &rec->d.fmt[i]; break; }
+    if ( !fmt_gt ) return NULL;    // no GT tag
+
+    if ( args->ndat != rec->n_sample ) 
+        error("Incorrect number of samples at %s:%d .. found %d, expected %d\n",bcf_seqname(args->hdr,rec),rec->pos+1,rec->n_sample,args->ndat);
+
+    if ( args->rid!=rec->rid && args->rid!=-1 )
+    {
+        for (i=0; i<args->ndat; i++)
+        {
+            dat_t *dat = &args->dat[i];
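+            if ( dat->ploidy!=0 ) fprintf(bcftools_stdout, "%s\t%s\t%d\t%d\t%d\n", dat->sample,bcf_hdr_id2name(args->hdr,args->rid),dat->beg+1,dat->end+1,dat->ploidy); /* flush under the previous chromosome's name; args->rid is updated only after this block */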
+            dat->ploidy = 0;
+        }
+    }
+    args->rid = rec->rid;
+
+    #define BRANCH_INT(type_t,vector_end) \
+    { \
+        for (i=0; i<rec->n_sample; i++) \
+        { \
+            type_t *p = (type_t*) (fmt_gt->p + i*fmt_gt->size); \
+            int nal, missing = 0; \
+            for (nal=0; nal<fmt_gt->n; nal++) \
+            { \
+                if ( p[nal]==vector_end ) break; /* smaller ploidy */ \
+                if ( bcf_gt_is_missing(p[nal]) ) { missing=1; break; } /* missing allele */ \
+            } \
+            if ( !nal || missing ) continue; /* missing genotype */ \
+            dat_t *dat = &args->dat[i]; \
+            if ( dat->ploidy==nal ) \
+            { \
+                dat->end = rec->pos; \
+                continue; \
+            } \
+            if ( dat->ploidy!=0 ) \
+                fprintf(bcftools_stdout, "%s\t%s\t%d\t%d\t%d\n", dat->sample,bcf_seqname(args->hdr,rec),dat->beg+1,dat->end+1,dat->ploidy); \
+            dat->ploidy = nal; \
+            dat->beg = rec->pos; \
+            dat->end = rec->pos; \
+        } \
+    }
+    switch (fmt_gt->type) {
+        case BCF_BT_INT8:  BRANCH_INT(int8_t,  bcf_int8_vector_end); break;
+        case BCF_BT_INT16: BRANCH_INT(int16_t, bcf_int16_vector_end); break;
+        case BCF_BT_INT32: BRANCH_INT(int32_t, bcf_int32_vector_end); break;
+        default: error("The GT type is not recognised: %d at %s:%d\n",fmt_gt->type, bcf_seqname(args->hdr,rec),rec->pos+1); break;
+    }
+    #undef BRANCH_INT
+
+    return NULL;
+}
+
+void destroy(void)
+{
+    int i;
+    for (i=0; i<args->ndat; i++)
+    {
+        dat_t *dat = &args->dat[i];
+        if ( dat->ploidy!=0 ) fprintf(bcftools_stdout, "%s\t%s\t%d\t%d\t%d\n", dat->sample,bcf_hdr_id2name(args->hdr,args->rid),dat->beg+1,dat->end+1,dat->ploidy); 
+        dat->ploidy = 0;
+    }
+    free(args->dat);
+    free(args);
+}
+
diff --git a/bcftools/plugins/check-sparsity.c b/bcftools/plugins/check-sparsity.c
new file mode 100644
index 0000000..2c09f3d
--- /dev/null
+++ b/bcftools/plugins/check-sparsity.c
@@ -0,0 +1,273 @@
+/* 
+    Copyright (C) 2017 Genome Research Ltd.
+
+    Author: Petr Danecek <pd3@sanger.ac.uk>
+
+    Permission is hereby granted, free of charge, to any person obtaining a copy
+    of this software and associated documentation files (the "Software"), to deal
+    in the Software without restriction, including without limitation the rights
+    to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+    copies of the Software, and to permit persons to whom the Software is
+    furnished to do so, subject to the following conditions:
+    
+    The above copyright notice and this permission notice shall be included in
+    all copies or substantial portions of the Software.
+    
+    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+    IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+    FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+    AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+    LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+    OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+    THE SOFTWARE.
+*/
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <strings.h>
+#include <getopt.h>
+#include <stdarg.h>
+#include <stdint.h>
+#include <htslib/vcf.h>
+#include <htslib/synced_bcf_reader.h>
+#include <htslib/vcfutils.h>
+#include <htslib/kseq.h>
+#include <inttypes.h>
+#include <unistd.h>
+#include "bcftools.h"
+
+typedef struct
+{
+    int argc;
+    char **argv, *fname, *region, **regs;
+    int region_is_file, nregs, regs_free;
+    int *smpl, nsmpl, *nsites, min_sites, gt_id;
+    kstring_t tmps;
+    bcf1_t *rec;
+    tbx_t *tbx;
+    hts_idx_t *idx;
+    hts_itr_t *itr;
+    htsFile *fp;
+    bcf_hdr_t *hdr;
+}
+args_t;
+
+const char *about(void)
+{
+    return "Print samples without genotypes in a region or chromosome\n";
+}
+
+static const char *usage_text(void)
+{
+    return
+        "\n"
+        "About: Print samples without genotypes in a region (-r/-R) or chromosome (the default)\n"
+        "\n"
+        "Usage: bcftools +check-sparsity <file.vcf.gz> [Plugin Options]\n"
+        "Plugin options:\n"
+        "   -n, --n-markers <int>           minimum number of required markers [1]\n"
+        "   -r, --regions <chr:beg-end>     restrict to comma-separated list of regions\n"
+        "   -R, --regions-file <file>       restrict to regions listed in a file\n"
+        "\n";
+}
+
+static void init_data(args_t *args)
+{
+    args->fp = hts_open(args->fname,"r");
+    if ( !args->fp ) error("Could not read %s\n", args->fname);
+    args->hdr = bcf_hdr_read(args->fp);
+    if ( !args->hdr ) error("Could not read the header: %s\n", args->fname);
+
+    args->rec = bcf_init1();
+    args->gt_id = bcf_hdr_id2int(args->hdr,BCF_DT_ID,"GT");
+    if ( args->gt_id<0 ) error("Error: GT field is not present\n");
+
+    int i;
+    args->nsmpl  = bcf_hdr_nsamples(args->hdr);
+    args->nsites = (int*) calloc(args->nsmpl, sizeof(int));
+    args->smpl   = (int*) malloc(sizeof(int)*args->nsmpl);
+    for (i=0; i<args->nsmpl; i++) args->smpl[i] = i;
+
+    if ( strcmp("-",args->fname) )  // not reading from stdin
+    {
+        if ( hts_get_format(args->fp)->format==vcf )
+        {
+            args->tbx = tbx_index_load(args->fname);
+            if ( !args->tbx && args->region ) error("Could not load the VCF index, please drop the -r/-R option\n");
+        }
+        else if ( hts_get_format(args->fp)->format==bcf )
+        {
+            args->idx = bcf_index_load(args->fname);
+            if ( !args->idx && args->region ) error("Could not load the BCF index, please drop the -r/-R option\n");
+        }
+    }
+    else if ( args->region ) error("Cannot use index with this file, please drop the -r/-R option\n");
+
+    if ( args->tbx || args->idx )
+    {
+        if ( args->region )
+        {
+            args->regs = hts_readlist(args->region, args->region_is_file, &args->nregs);
+            if ( !args->regs ) error("Could not parse regions: %s\n", args->region);
+            args->regs_free = 1;
+        }
+        else
+            args->regs = (char**) (args->tbx ? tbx_seqnames(args->tbx, &args->nregs) : bcf_index_seqnames(args->idx, args->hdr, &args->nregs));
+    }
+}
+static void destroy_data(args_t *args)
+{
+    int i;
+    if ( args->regs_free )
+        for (i=0; i<args->nregs; i++) free(args->regs[i]);
+    free(args->regs);
+    bcf_hdr_destroy(args->hdr);
+    bcf_destroy(args->rec);
+    free(args->tmps.s);
+    free(args->smpl);
+    free(args->nsites);
+    if ( args->itr ) hts_itr_destroy(args->itr);
+    if ( args->tbx ) tbx_destroy(args->tbx);
+    if ( args->idx ) hts_idx_destroy(args->idx);
+    hts_close(args->fp);
+}
+
+static void report(args_t *args, const char *reg)
+{
+    int i;
+    for (i=0; i<args->nsmpl; i++)
+        printf("%s\t%s\n", reg, args->hdr->samples[args->smpl[i]]);
+    args->nsmpl = bcf_hdr_nsamples(args->hdr);
+    for (i=0; i<args->nsmpl; i++) args->smpl[i] = i;
+    memset(args->nsites, 0, sizeof(int)*args->nsmpl);
+}
+static void test_region(args_t *args, char *reg)
+{
+    if ( args->tbx )
+    {
+        args->itr = tbx_itr_querys(args->tbx,reg);
+        if ( !args->itr ) return;
+    }
+    else if ( args->idx )
+    {
+        args->itr = bcf_itr_querys(args->idx,args->hdr,reg);
+        if ( !args->itr ) return;
+    }
+
+    int ret,i, rid = -1, nread = 0;
+    while (1)
+    {
+        if ( args->tbx )
+        {
+            if ( (ret=tbx_itr_next(args->fp, args->tbx, args->itr, &args->tmps)) < 0 ) break;  // no more lines
+            ret = vcf_parse1(&args->tmps, args->hdr, args->rec);
+            if ( ret<0 ) error("Could not parse the line: %s\n", args->tmps.s);
+        }
+        else if ( args->idx )
+        {
+            ret = bcf_itr_next(args->fp, args->itr, args->rec);
+            if ( ret < -1 ) error("Could not parse a line from %s\n", reg);
+            if ( ret < 0 ) break; // no more lines or an error
+        }
+        else
+        {
+            if ( args->fp->format.format==vcf )
+            {
+                if ( (ret=hts_getline(args->fp, KS_SEP_LINE, &args->tmps)) < 0 ) break;   // no more lines
+                ret = vcf_parse1(&args->tmps, args->hdr, args->rec);
+                if ( ret<0 ) error("Could not parse the line: %s\n", args->tmps.s);
+            }
+            else if ( args->fp->format.format==bcf )
+            {
+                ret = bcf_read1(args->fp, args->hdr, args->rec);
+                if ( ret < -1 ) error("Could not parse %s\n", args->fname);
+                if ( ret < 0 ) break; // no more lines or an error
+            }
+            if ( rid!=-1 && rid!=args->rec->rid )
+            {
+                report(args, bcf_hdr_id2name(args->hdr,rid));
+                nread = 0;
+            }
+            rid = args->rec->rid;
+        }
+
+        bcf_unpack(args->rec, BCF_UN_FMT);
+        bcf_fmt_t *fmt_gt = NULL;
+        for (i=0; i<args->rec->n_fmt; i++)
+            if ( args->rec->d.fmt[i].id==args->gt_id ) { fmt_gt = &args->rec->d.fmt[i]; break; }
+        if ( !fmt_gt ) continue;        // no GT tag
+        if ( fmt_gt->n==0 ) continue;   // empty?!
+        if ( fmt_gt->type!=BCF_BT_INT8 ) error("TODO: the GT fmt_type is not int8!\n");
+
+        // update the array of missing samples
+        for (i=0; i<args->nsmpl; i++)
+        {
+            int8_t *ptr = (int8_t*) (fmt_gt->p + args->smpl[i]*fmt_gt->size);
+            int ial = 0;
+            for (ial=0; ial<fmt_gt->n; ial++)
+                if ( ptr[ial]==bcf_gt_missing || ptr[ial]==bcf_int8_vector_end ) break;
+            if ( ial==0 ) continue;     // missing
+            if ( ++args->nsites[i] < args->min_sites ) continue;
+            if ( i+1<args->nsmpl )
+            {
+                memmove(args->smpl+i, args->smpl+i+1, sizeof(int)*(args->nsmpl-i-1));
+                memmove(args->nsites+i, args->nsites+i+1, sizeof(int)*(args->nsmpl-i-1));
+            }
+            args->nsmpl--;
+            i--;
+        }
+        nread = 1;
+        if ( !args->nsmpl ) break;
+    }
+    if ( nread ) report(args, rid==-1 ? reg : bcf_hdr_id2name(args->hdr,rid));
+
+    tbx_itr_destroy(args->itr);
+    args->itr = NULL;
+}
+
+int run(int argc, char **argv)
+{
+    args_t *args = (args_t*) calloc(1,sizeof(args_t));
+    args->argc   = argc; args->argv = argv;
+    args->min_sites = 1;
+    static struct option loptions[] =
+    {
+        {"n-markers",required_argument,NULL,'n'},
+        {"regions",required_argument,NULL,'r'},
+        {"regions-file",required_argument,NULL,'R'},
+        {NULL,0,NULL,0}
+    };
+    int c,i;
+    char *tmp;
+    while ((c = getopt_long(argc, argv, "vr:R:n:",loptions,NULL)) >= 0)
+    {
+        switch (c) 
+        {
+            case 'n': 
+                args->min_sites = strtol(optarg,&tmp,10);
+                if ( *tmp ) error("Could not parse: -n %s\n", optarg);
+                break;
+            case 'R': args->region_is_file = 1; 
+            case 'r': args->region = optarg; break; 
+            case 'h':
+            case '?':
+            default: error("%s", usage_text()); break;
+        }
+    }
+
+    if ( optind>=argc )
+    {
+        if ( !isatty(fileno((FILE *)stdin)) ) args->fname = "-";  // reading from stdin
+        else error("%s",usage_text());
+    }
+    else args->fname = argv[optind];
+    init_data(args);
+
+    for (i=0; i<args->nregs; i++) test_region(args, args->regs[i]);
+    if ( !args->nregs ) test_region(args, NULL);
+
+    destroy_data(args);
+    free(args);
+    return 0;
+}
+
diff --git a/bcftools/plugins/check-sparsity.c.pysam.c b/bcftools/plugins/check-sparsity.c.pysam.c
new file mode 100644
index 0000000..8964d18
--- /dev/null
+++ b/bcftools/plugins/check-sparsity.c.pysam.c
@@ -0,0 +1,275 @@
+#include "bcftools.pysam.h"
+
+/* 
+    Copyright (C) 2017 Genome Research Ltd.
+
+    Author: Petr Danecek <pd3@sanger.ac.uk>
+
+    Permission is hereby granted, free of charge, to any person obtaining a copy
+    of this software and associated documentation files (the "Software"), to deal
+    in the Software without restriction, including without limitation the rights
+    to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+    copies of the Software, and to permit persons to whom the Software is
+    furnished to do so, subject to the following conditions:
+    
+    The above copyright notice and this permission notice shall be included in
+    all copies or substantial portions of the Software.
+    
+    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+    IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+    FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+    AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+    LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+    OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+    THE SOFTWARE.
+*/
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <strings.h>
+#include <getopt.h>
+#include <stdarg.h>
+#include <stdint.h>
+#include <htslib/vcf.h>
+#include <htslib/synced_bcf_reader.h>
+#include <htslib/vcfutils.h>
+#include <htslib/kseq.h>
+#include <inttypes.h>
+#include <unistd.h>
+#include "bcftools.h"
+
+typedef struct
+{
+    int argc;
+    char **argv, *fname, *region, **regs;
+    int region_is_file, nregs, regs_free;
+    int *smpl, nsmpl, *nsites, min_sites, gt_id;
+    kstring_t tmps;
+    bcf1_t *rec;
+    tbx_t *tbx;
+    hts_idx_t *idx;
+    hts_itr_t *itr;
+    htsFile *fp;
+    bcf_hdr_t *hdr;
+}
+args_t;
+
+const char *about(void)
+{
+    return "Print samples without genotypes in a region or chromosome\n";
+}
+
+static const char *usage_text(void)
+{
+    return
+        "\n"
+        "About: Print samples without genotypes in a region (-r/-R) or chromosome (the default)\n"
+        "\n"
+        "Usage: bcftools +check-sparsity <file.vcf.gz> [Plugin Options]\n"
+        "Plugin options:\n"
+        "   -n, --n-markers <int>           minimum number of required markers [1]\n"
+        "   -r, --regions <chr:beg-end>     restrict to comma-separated list of regions\n"
+        "   -R, --regions-file <file>       restrict to regions listed in a file\n"
+        "\n";
+}
+
+static void init_data(args_t *args)
+{
+    args->fp = hts_open(args->fname,"r");
+    if ( !args->fp ) error("Could not read %s\n", args->fname);
+    args->hdr = bcf_hdr_read(args->fp);
+    if ( !args->hdr ) error("Could not read the header: %s\n", args->fname);
+
+    args->rec = bcf_init1();
+    args->gt_id = bcf_hdr_id2int(args->hdr,BCF_DT_ID,"GT");
+    if ( args->gt_id<0 ) error("Error: GT field is not present\n");
+
+    int i;
+    args->nsmpl  = bcf_hdr_nsamples(args->hdr);
+    args->nsites = (int*) calloc(args->nsmpl, sizeof(int));
+    args->smpl   = (int*) malloc(sizeof(int)*args->nsmpl);
+    for (i=0; i<args->nsmpl; i++) args->smpl[i] = i;
+
+    if ( strcmp("-",args->fname) )  // not reading from stdin
+    {
+        if ( hts_get_format(args->fp)->format==vcf )
+        {
+            args->tbx = tbx_index_load(args->fname);
+            if ( !args->tbx && args->region ) error("Could not load the VCF index, please drop the -r/-R option\n");
+        }
+        else if ( hts_get_format(args->fp)->format==bcf )
+        {
+            args->idx = bcf_index_load(args->fname);
+            if ( !args->idx && args->region ) error("Could not load the BCF index, please drop the -r/-R option\n");
+        }
+    }
+    else if ( args->region ) error("Cannot use index with this file, please drop the -r/-R option\n");
+
+    if ( args->tbx || args->idx )
+    {
+        if ( args->region )
+        {
+            args->regs = hts_readlist(args->region, args->region_is_file, &args->nregs);
+            if ( !args->regs ) error("Could not parse regions: %s\n", args->region);
+            args->regs_free = 1;
+        }
+        else
+            args->regs = (char**) (args->tbx ? tbx_seqnames(args->tbx, &args->nregs) : bcf_index_seqnames(args->idx, args->hdr, &args->nregs));
+    }
+}
+static void destroy_data(args_t *args)
+{
+    int i;
+    if ( args->regs_free )
+        for (i=0; i<args->nregs; i++) free(args->regs[i]);
+    free(args->regs);
+    bcf_hdr_destroy(args->hdr);
+    bcf_destroy(args->rec);
+    free(args->tmps.s);
+    free(args->smpl);
+    free(args->nsites);
+    if ( args->itr ) hts_itr_destroy(args->itr);
+    if ( args->tbx ) tbx_destroy(args->tbx);
+    if ( args->idx ) hts_idx_destroy(args->idx);
+    hts_close(args->fp);
+}
+
+static void report(args_t *args, const char *reg)
+{
+    int i;
+    for (i=0; i<args->nsmpl; i++)
+        fprintf(bcftools_stdout, "%s\t%s\n", reg, args->hdr->samples[args->smpl[i]]);
+    args->nsmpl = bcf_hdr_nsamples(args->hdr);
+    for (i=0; i<args->nsmpl; i++) args->smpl[i] = i;
+    memset(args->nsites, 0, sizeof(int)*args->nsmpl);
+}
+static void test_region(args_t *args, char *reg)
+{
+    if ( args->tbx )
+    {
+        args->itr = tbx_itr_querys(args->tbx,reg);
+        if ( !args->itr ) return;
+    }
+    else if ( args->idx )
+    {
+        args->itr = bcf_itr_querys(args->idx,args->hdr,reg);
+        if ( !args->itr ) return;
+    }
+
+    int ret,i, rid = -1, nread = 0;
+    while (1)
+    {
+        if ( args->tbx )
+        {
+            if ( (ret=tbx_itr_next(args->fp, args->tbx, args->itr, &args->tmps)) < 0 ) break;  // no more lines
+            ret = vcf_parse1(&args->tmps, args->hdr, args->rec);
+            if ( ret<0 ) error("Could not parse the line: %s\n", args->tmps.s);
+        }
+        else if ( args->idx )
+        {
+            ret = bcf_itr_next(args->fp, args->itr, args->rec);
+            if ( ret < -1 ) error("Could not parse a line from %s\n", reg);
+            if ( ret < 0 ) break; // no more lines or an error
+        }
+        else
+        {
+            if ( args->fp->format.format==vcf )
+            {
+                if ( (ret=hts_getline(args->fp, KS_SEP_LINE, &args->tmps)) < 0 ) break;   // no more lines
+                ret = vcf_parse1(&args->tmps, args->hdr, args->rec);
+                if ( ret<0 ) error("Could not parse the line: %s\n", args->tmps.s);
+            }
+            else if ( args->fp->format.format==bcf )
+            {
+                ret = bcf_read1(args->fp, args->hdr, args->rec);
+                if ( ret < -1 ) error("Could not parse %s\n", args->fname);
+                if ( ret < 0 ) break; // no more lines or an error
+            }
+            if ( rid!=-1 && rid!=args->rec->rid )
+            {
+                report(args, bcf_hdr_id2name(args->hdr,rid));
+                nread = 0;
+            }
+            rid = args->rec->rid;
+        }
+
+        bcf_unpack(args->rec, BCF_UN_FMT);
+        bcf_fmt_t *fmt_gt = NULL;
+        for (i=0; i<args->rec->n_fmt; i++)
+            if ( args->rec->d.fmt[i].id==args->gt_id ) { fmt_gt = &args->rec->d.fmt[i]; break; }
+        if ( !fmt_gt ) continue;        // no GT tag
+        if ( fmt_gt->n==0 ) continue;   // empty?!
+        if ( fmt_gt->type!=BCF_BT_INT8 ) error("TODO: the GT fmt_type is not int8!\n");
+
+        // update the array of missing samples
+        for (i=0; i<args->nsmpl; i++)
+        {
+            int8_t *ptr = (int8_t*) (fmt_gt->p + args->smpl[i]*fmt_gt->size);
+            int ial = 0;
+            for (ial=0; ial<fmt_gt->n; ial++)
+                if ( ptr[ial]==bcf_gt_missing || ptr[ial]==bcf_int8_vector_end ) break;
+            if ( ial==0 ) continue;     // missing
+            if ( ++args->nsites[i] < args->min_sites ) continue;
+            if ( i+1<args->nsmpl )
+            {
+                memmove(args->smpl+i, args->smpl+i+1, sizeof(int)*(args->nsmpl-i-1));
+                memmove(args->nsites+i, args->nsites+i+1, sizeof(int)*(args->nsmpl-i-1));
+            }
+            args->nsmpl--;
+            i--;
+        }
+        nread = 1;
+        if ( !args->nsmpl ) break;
+    }
+    if ( nread ) report(args, rid==-1 ? reg : bcf_hdr_id2name(args->hdr,rid));
+
+    tbx_itr_destroy(args->itr);
+    args->itr = NULL;
+}
+
+int run(int argc, char **argv)
+{
+    args_t *args = (args_t*) calloc(1,sizeof(args_t));
+    args->argc   = argc; args->argv = argv;
+    args->min_sites = 1;
+    static struct option loptions[] =
+    {
+        {"n-markers",required_argument,NULL,'n'},
+        {"regions",required_argument,NULL,'r'},
+        {"regions-file",required_argument,NULL,'R'},
+        {NULL,0,NULL,0}
+    };
+    int c,i;
+    char *tmp;
+    while ((c = getopt_long(argc, argv, "vr:R:n:",loptions,NULL)) >= 0)
+    {
+        switch (c) 
+        {
+            case 'n': 
+                args->min_sites = strtol(optarg,&tmp,10);
+                if ( *tmp ) error("Could not parse: -n %s\n", optarg);
+                break;
+            case 'R': args->region_is_file = 1; 
+            case 'r': args->region = optarg; break; 
+            case 'h':
+            case '?':
+            default: error("%s", usage_text()); break;
+        }
+    }
+
+    if ( optind>=argc )
+    {
+        if ( !isatty(fileno((FILE *)stdin)) ) args->fname = "-";  // reading from stdin
+        else error("%s",usage_text());
+    }
+    else args->fname = argv[optind];
+    init_data(args);
+
+    for (i=0; i<args->nregs; i++) test_region(args, args->regs[i]);
+    if ( !args->nregs ) test_region(args, NULL);
+
+    destroy_data(args);
+    free(args);
+    return 0;
+}
+
diff --git a/bcftools/plugins/color-chrs.c b/bcftools/plugins/color-chrs.c
new file mode 100644
index 0000000..dc80c84
--- /dev/null
+++ b/bcftools/plugins/color-chrs.c
@@ -0,0 +1,561 @@
+/* The MIT License
+
+   Copyright (c) 2015 Genome Research Ltd.
+
+   Author: Petr Danecek <pd3@sanger.ac.uk>
+   
+   Permission is hereby granted, free of charge, to any person obtaining a copy
+   of this software and associated documentation files (the "Software"), to deal
+   in the Software without restriction, including without limitation the rights
+   to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+   copies of the Software, and to permit persons to whom the Software is
+   furnished to do so, subject to the following conditions:
+   
+   The above copyright notice and this permission notice shall be included in
+   all copies or substantial portions of the Software.
+   
+   THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+   IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+   FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+   AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+   LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+   OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+   THE SOFTWARE.
+
+ */
+
+/*
+    Trio haplotypes: mother (A,B), father (C,D), child (E,F)
+    Modeling the following states:
+        01|23|02 
+        01|23|03
+        01|23|12
+        01|23|13
+        01|23|20
+        01|23|30
+        01|23|21
+        01|23|31
+    with the likelihoods of two haplotypes A,B segments sharing an allele:
+        P(01|A==B)  .. e (P of error)
+        P(00|A==B)  .. 1-e
+    and
+        P(ab,cd,ef|E=A,F=C) = P(ea|E=A)*P(fc|F=C)
+
+
+    Unrelated samples: (A,B) and (C,D)
+    Modeling the states:
+        xxxx .. A!=C,A!=D,B!=C,B!=D
+        0x0x .. A=C,B!=D
+        0xx0 .. A=D,B!=C
+        x00x .. B=C,A!=D
+        x0x0 .. B=D,A!=C
+        0101 .. A=C,B=D
+        0110 .. A=D,B=C
+    with the likelihoods
+        P(01|A!=B)  .. f*(1-f)
+        P(00|A!=B)  .. (1-f)*(1-f)
+        P(11|A!=B)  .. f*f
+
+    Assuming 2x30 crossovers, P=2e-8.
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <getopt.h>
+#include <math.h>
+#include <htslib/hts.h>
+#include <htslib/vcf.h>
+#include <errno.h>
+#include "bcftools.h"
+#include "HMM.h"
+
+#define C_TRIO 1
+#define C_UNRL 2
+
+// states for unrelated samples
+#define UNRL_xxxx  0
+#define UNRL_0x0x  1
+#define UNRL_0xx0  2
+#define UNRL_x00x  3
+#define UNRL_x0x0  4
+#define UNRL_0101  5
+#define UNRL_0110  6
+
+// trio states
+#define TRIO_AC 0
+#define TRIO_AD 1
+#define TRIO_BC 2
+#define TRIO_BD 3
+#define TRIO_CA 4
+#define TRIO_DA 5
+#define TRIO_CB 6
+#define TRIO_DB 7
+
+typedef struct _args_t
+{
+    bcf_hdr_t *hdr;
+    hmm_t *hmm;
+    double *eprob, *tprob, pij, pgt_err;
+    uint32_t *sites;
+    int32_t *gt_arr;
+    int nsites, msites, ngt_arr, prev_rid;
+    int mode, nstates, nhet_father, nhet_mother;
+    int imother,ifather,ichild, isample,jsample;
+    void (*set_observed_prob) (bcf1_t *rec);
+    char *prefix;
+    FILE *fp;
+}
+args_t;
+
+static args_t args;
+
+#define SW_MOTHER 1
+#define SW_FATHER 2
+static int hap_switch[8][8];
+
+static void set_observed_prob_trio(bcf1_t *rec);
+static void set_observed_prob_unrelated(bcf1_t *rec);
+static void init_hmm_trio(args_t *args);
+static void init_hmm_unrelated(args_t *args);
+
+
+const char *about(void)
+{
+    return "Color shared chromosomal segments, requires phased GTs.\n";
+}
+
+const char *usage(void)
+{
+    return 
+        "\n"
+        "About: Color shared chromosomal segments, requires phased GTs. The output\n"
+        "       can be visualized using the color-chrs.pl script.\n"
+        "Usage: bcftools +color-chrs [General Options] -- [Plugin Options]\n"
+        "Options:\n"
+        "   run \"bcftools plugin\" for a list of common options\n"
+        "\n"
+        "Plugin options:\n"
+        "   -p, --prefix <path>     output files prefix\n"
+        "   -t, --trio <m,f,c>      names of mother, father and the child\n"
+        "   -u, --unrelated <a,b>   names of two unrelated samples\n"
+        "\n"
+        "Example:\n"
+        "   bcftools +color-chrs in.vcf --\n"
+        "\n";
+}
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+    char *trio_samples = NULL, *unrelated_samples = NULL;
+    memset(&args,0,sizeof(args_t));
+    args.prev_rid = -1;
+    args.hdr = in;
+    args.pij = 2e-8;
+    args.pgt_err = 1e-9;
+
+    static struct option loptions[] =
+    {
+        {"prefix",1,0,'p'},
+        {"trio",1,0,'t'},
+        {"unrelated",1,0,'u'},
+        {0,0,0,0}
+    };
+    int c;
+    while ((c = getopt_long(argc, argv, "?ht:u:p:",loptions,NULL)) >= 0)
+    {
+        switch (c) 
+        {
+            case 'p': args.prefix = optarg; break;
+            case 't': trio_samples = optarg; break;
+            case 'u': unrelated_samples = optarg; break;
+            case 'h':
+            case '?':
+            default: error("%s", usage()); break;
+        }
+    }
+    if ( optind != argc ) error("%s",usage());
+    if ( trio_samples && unrelated_samples ) error("Expected only one of the -t/-u options\n");
+    if ( !trio_samples && !unrelated_samples ) error("Expected one of the -t/-u options\n");
+    if ( !args.prefix ) error("Expected the -p option\n");
+
+    int ret = bcf_hdr_set_samples(args.hdr, trio_samples ? trio_samples : unrelated_samples, 0);
+    if ( ret<0 ) error("Could not parse samples: %s\n", trio_samples ? trio_samples : unrelated_samples);
+    else if ( ret>0 ) error("%d-th sample not found: %s\n", ret,trio_samples ? trio_samples : unrelated_samples);
+
+    if ( trio_samples )
+    {
+        int i,n = 0;
+        char **list = hts_readlist(trio_samples, 0, &n);
+        if ( n!=3 ) error("Expected three sample names with -t\n");
+        args.imother = bcf_hdr_id2int(args.hdr, BCF_DT_SAMPLE, list[0]);
+        args.ifather = bcf_hdr_id2int(args.hdr, BCF_DT_SAMPLE, list[1]);
+        args.ichild  = bcf_hdr_id2int(args.hdr, BCF_DT_SAMPLE, list[2]);
+        for (i=0; i<n; i++) free(list[i]);
+        free(list);
+        args.set_observed_prob = set_observed_prob_trio;
+        args.mode = C_TRIO;
+        init_hmm_trio(&args);
+    }
+    else
+    {
+        int i,n = 0;
+        char **list = hts_readlist(unrelated_samples, 0, &n);
+        if ( n!=2 ) error("Expected two sample names with -u\n");
+        args.isample = bcf_hdr_id2int(args.hdr, BCF_DT_SAMPLE, list[0]);
+        args.jsample = bcf_hdr_id2int(args.hdr, BCF_DT_SAMPLE, list[1]);
+        for (i=0; i<n; i++) free(list[i]);
+        free(list);
+        args.set_observed_prob = set_observed_prob_unrelated;
+        args.mode = C_UNRL;
+        init_hmm_unrelated(&args);
+    }
+    return 1;
+}
+
+static void init_hmm_trio(args_t *args)
+{
+    int i,j;
+    args->nstates = 8;
+    args->tprob   = (double*) malloc(sizeof(double)*args->nstates*args->nstates);
+
+    for (i=0; i<args->nstates; i++)
+        for (j=0; j<args->nstates; j++) hap_switch[i][j] = 0;
+
+    hap_switch[TRIO_AD][TRIO_AC] = SW_FATHER;
+    hap_switch[TRIO_BC][TRIO_AC] = SW_MOTHER;
+    hap_switch[TRIO_BD][TRIO_AC] = SW_MOTHER | SW_FATHER;
+    hap_switch[TRIO_AC][TRIO_AD] = SW_FATHER;
+    hap_switch[TRIO_BC][TRIO_AD] = SW_MOTHER | SW_FATHER;
+    hap_switch[TRIO_BD][TRIO_AD] = SW_MOTHER;
+    hap_switch[TRIO_AC][TRIO_BC] = SW_MOTHER;
+    hap_switch[TRIO_AD][TRIO_BC] = SW_MOTHER | SW_FATHER;
+    hap_switch[TRIO_BD][TRIO_BC] = SW_FATHER;
+    hap_switch[TRIO_AC][TRIO_BD] = SW_MOTHER | SW_FATHER;
+    hap_switch[TRIO_AD][TRIO_BD] = SW_MOTHER;
+    hap_switch[TRIO_BC][TRIO_BD] = SW_FATHER;
+
+    hap_switch[TRIO_DA][TRIO_CA] = SW_FATHER;
+    hap_switch[TRIO_CB][TRIO_CA] = SW_MOTHER;
+    hap_switch[TRIO_DB][TRIO_CA] = SW_MOTHER | SW_FATHER;
+    hap_switch[TRIO_CA][TRIO_DA] = SW_FATHER;
+    hap_switch[TRIO_CB][TRIO_DA] = SW_MOTHER | SW_FATHER;
+    hap_switch[TRIO_DB][TRIO_DA] = SW_MOTHER;
+    hap_switch[TRIO_CA][TRIO_CB] = SW_MOTHER;
+    hap_switch[TRIO_DA][TRIO_CB] = SW_MOTHER | SW_FATHER;
+    hap_switch[TRIO_DB][TRIO_CB] = SW_FATHER;
+    hap_switch[TRIO_CA][TRIO_DB] = SW_MOTHER | SW_FATHER;
+    hap_switch[TRIO_DA][TRIO_DB] = SW_MOTHER;
+    hap_switch[TRIO_CB][TRIO_DB] = SW_FATHER;
+
+    for (i=0; i<args->nstates; i++)
+    {
+        for (j=0; j<args->nstates; j++)
+        {
+            if ( !hap_switch[i][j] ) MAT(args->tprob,args->nstates,i,j) = 0;
+            else
+            {
+                MAT(args->tprob,args->nstates,i,j) = 1;
+                if ( hap_switch[i][j] & SW_MOTHER ) MAT(args->tprob,args->nstates,i,j) *= args->pij;
+                if ( hap_switch[i][j] & SW_FATHER ) MAT(args->tprob,args->nstates,i,j) *= args->pij;
+            }
+        }
+    }
+    for (i=0; i<args->nstates; i++)
+    {
+        double sum = 0;
+        for (j=0; j<args->nstates; j++)
+        {
+            if ( i!=j ) sum += MAT(args->tprob,args->nstates,i,j);
+        }
+        MAT(args->tprob,args->nstates,i,i) = 1 - sum;
+    }
+
+    #if 0
+    for (i=0; i<args->nstates; i++)
+    {
+        for (j=0; j<args->nstates; j++)
+            fprintf(stderr,"\t%d",hap_switch[j][i]);
+        fprintf(stderr,"\n");
+    }
+    for (i=0; i<args->nstates; i++)
+    {
+        for (j=0; j<args->nstates; j++)
+            fprintf(stderr,"\t%e",MAT(args->tprob,args->nstates,j,i));
+        fprintf(stderr,"\n");
+    }
+    #endif
+
+    args->hmm = hmm_init(args->nstates, args->tprob, 10000);
+}
+static void init_hmm_unrelated(args_t *args)
+{
+    int i,j;
+    args->nstates = 7;
+    args->tprob   = (double*) malloc(sizeof(double)*args->nstates*args->nstates);
+
+    for (i=0; i<args->nstates; i++)
+    {
+        for (j=0; j<args->nstates; j++)
+            MAT(args->tprob,args->nstates,i,j) = args->pij;
+    }
+    MAT(args->tprob,args->nstates,UNRL_0101,UNRL_xxxx) = args->pij*args->pij;
+    MAT(args->tprob,args->nstates,UNRL_0110,UNRL_xxxx) = args->pij*args->pij;
+    MAT(args->tprob,args->nstates,UNRL_x0x0,UNRL_0x0x) = args->pij*args->pij;
+    MAT(args->tprob,args->nstates,UNRL_0110,UNRL_0x0x) = args->pij*args->pij;
+    MAT(args->tprob,args->nstates,UNRL_x00x,UNRL_0xx0) = args->pij*args->pij;
+    MAT(args->tprob,args->nstates,UNRL_0101,UNRL_0xx0) = args->pij*args->pij;
+    MAT(args->tprob,args->nstates,UNRL_0101,UNRL_x00x) = args->pij*args->pij;
+    MAT(args->tprob,args->nstates,UNRL_0110,UNRL_x0x0) = args->pij*args->pij;
+    MAT(args->tprob,args->nstates,UNRL_0110,UNRL_0101) = args->pij*args->pij;
+
+    for (i=0; i<args->nstates; i++)
+    {
+        for (j=i+1; j<args->nstates; j++)
+            MAT(args->tprob,args->nstates,i,j) = MAT(args->tprob,args->nstates,j,i);
+    }
+    for (i=0; i<args->nstates; i++)
+    {
+        double sum = 0;
+        for (j=0; j<args->nstates; j++)
+            if ( i!=j ) sum += MAT(args->tprob,args->nstates,i,j);
+        MAT(args->tprob,args->nstates,i,i) = 1 - sum;
+    }
+
+    #if 0
+    for (i=0; i<args->nstates; i++)
+    {
+        for (j=0; j<args->nstates; j++)
+            fprintf(stderr,"\t%e",MAT(args->tprob,args->nstates,j,i));
+        fprintf(stderr,"\n");
+    }
+    #endif
+
+    args->hmm = hmm_init(args->nstates, args->tprob, 10000);
+}
+static inline double prob_shared(float af, int a, int b)
+{
+    return a==b ? 1-args.pgt_err : args.pgt_err;
+}
+static inline double prob_not_shared(float af, int a, int b)
+{
+    if ( a!=b ) return af*(1-af);
+    else if ( a==0 ) return (1-af)*(1-af);
+    else return af*af;
+}
+static void set_observed_prob_unrelated(bcf1_t *rec)
+{
+    float af = 0.5;  // alternate allele frequency
+
+    int ngt = bcf_get_genotypes(args.hdr, rec, &args.gt_arr, &args.ngt_arr);
+    if ( ngt<0 ) return;
+    if ( ngt!=4 ) return;   // chrX
+
+    int32_t a,b,c,d;
+    a = args.gt_arr[2*args.isample];
+    b = args.gt_arr[2*args.isample+1];
+    c = args.gt_arr[2*args.jsample];
+    d = args.gt_arr[2*args.jsample+1];
+    if ( bcf_gt_is_missing(a) || bcf_gt_is_missing(b) ) return;
+    if ( bcf_gt_is_missing(c) || bcf_gt_is_missing(d) ) return;
+    if ( !bcf_gt_is_phased(a) && !bcf_gt_is_phased(b) ) return; // only the second allele should be set when phased
+    if ( !bcf_gt_is_phased(c) && !bcf_gt_is_phased(d) ) return;
+    a = bcf_gt_allele(a);
+    b = bcf_gt_allele(b);
+    c = bcf_gt_allele(c);
+    d = bcf_gt_allele(d);
+
+    int m = args.msites;
+    args.nsites++;
+    hts_expand(uint32_t,args.nsites,args.msites,args.sites);
+    if ( m!=args.msites ) args.eprob = (double*) realloc(args.eprob, sizeof(double)*args.msites*args.nstates);
+
+    args.sites[args.nsites-1] = rec->pos;
+    double *prob = args.eprob + args.nstates*(args.nsites-1);
+    prob[UNRL_xxxx] = prob_not_shared(af,a,c) * prob_not_shared(af,a,d) * prob_not_shared(af,b,c) * prob_not_shared(af,b,d);
+    prob[UNRL_0x0x] = prob_shared(af,a,c) * prob_not_shared(af,b,d);
+    prob[UNRL_0xx0] = prob_shared(af,a,d) * prob_not_shared(af,b,c);
+    prob[UNRL_x00x] = prob_shared(af,b,c) * prob_not_shared(af,a,d);
+    prob[UNRL_x0x0] = prob_shared(af,b,d) * prob_not_shared(af,a,c);
+    prob[UNRL_0101] = prob_shared(af,a,c) * prob_shared(af,b,d);
+    prob[UNRL_0110] = prob_shared(af,a,d) * prob_shared(af,b,c);
+
+#if 0
+    static int x = 0;
+    if ( !x++)
+    {
+        printf("p(0==0) .. %f\n", prob_shared(af,0,0));
+        printf("p(0!=0) .. %f\n", prob_not_shared(af,0,0));
+        printf("p(0==1) .. %f\n", prob_shared(af,0,1));
+        printf("p(0!=1) .. %f\n", prob_not_shared(af,0,1));
+    }
+    printf("%d|%d %d|%d  x:%f 11:%f 12:%f 21:%f 22:%f 11,22:%f 12,21:%f  %d\n", a,b,c,d,
+            prob[UNRL_xxxx], prob[UNRL_0x0x], prob[UNRL_0xx0], prob[UNRL_x00x], prob[UNRL_x0x0], prob[UNRL_0101], prob[UNRL_0110], rec->pos+1);
+#endif
+}
+static void set_observed_prob_trio(bcf1_t *rec)
+{
+    int ngt = bcf_get_genotypes(args.hdr, rec, &args.gt_arr, &args.ngt_arr);
+    if ( ngt<0 ) return;
+    if ( ngt!=6 ) return;   // chrX
+
+    int32_t a,b,c,d,e,f;
+    a = args.gt_arr[2*args.imother];
+    b = args.gt_arr[2*args.imother+1];
+    c = args.gt_arr[2*args.ifather];
+    d = args.gt_arr[2*args.ifather+1];
+    e = args.gt_arr[2*args.ichild];
+    f = args.gt_arr[2*args.ichild+1];
+    if ( bcf_gt_is_missing(a) || bcf_gt_is_missing(b) ) return;
+    if ( bcf_gt_is_missing(c) || bcf_gt_is_missing(d) ) return;
+    if ( bcf_gt_is_missing(e) || bcf_gt_is_missing(f) ) return;
+    if ( !bcf_gt_is_phased(a) && !bcf_gt_is_phased(b) ) return; // only the second allele should be set when phased
+    if ( !bcf_gt_is_phased(c) && !bcf_gt_is_phased(d) ) return;
+    if ( !bcf_gt_is_phased(e) && !bcf_gt_is_phased(f) ) return;
+    a = bcf_gt_allele(a);
+    b = bcf_gt_allele(b);
+    c = bcf_gt_allele(c);
+    d = bcf_gt_allele(d);
+    e = bcf_gt_allele(e);
+    f = bcf_gt_allele(f);
+
+    int mother = (1<<a) | (1<<b);
+    int father = (1<<c) | (1<<d);
+    int child  = (1<<e) | (1<<f);
+    if ( !(mother&child) || !(father&child) )  return;      // Mendelian-inconsistent site, skip
+
+    if ( a!=b ) args.nhet_mother++;
+    if ( c!=d ) args.nhet_father++;
+
+    int m = args.msites;
+    args.nsites++;
+    hts_expand(uint32_t,args.nsites,args.msites,args.sites);
+    if ( m!=args.msites ) args.eprob = (double*) realloc(args.eprob, sizeof(double)*args.msites*args.nstates);
+
+    args.sites[args.nsites-1] = rec->pos;
+    double *prob = args.eprob + args.nstates*(args.nsites-1);
+    prob[TRIO_AC] = prob_shared(0,e,a) * prob_shared(0,f,c);
+    prob[TRIO_AD] = prob_shared(0,e,a) * prob_shared(0,f,d);
+    prob[TRIO_BC] = prob_shared(0,e,b) * prob_shared(0,f,c);
+    prob[TRIO_BD] = prob_shared(0,e,b) * prob_shared(0,f,d);
+    prob[TRIO_CA] = prob_shared(0,e,c) * prob_shared(0,f,a);
+    prob[TRIO_DA] = prob_shared(0,e,d) * prob_shared(0,f,a);
+    prob[TRIO_CB] = prob_shared(0,e,c) * prob_shared(0,f,b);
+    prob[TRIO_DB] = prob_shared(0,e,d) * prob_shared(0,f,b);
+}
+
+void flush_viterbi(args_t *args)
+{
+    const char *s1, *s2, *s3 = NULL;
+    if ( args->mode==C_UNRL )
+    {
+        s1 = bcf_hdr_int2id(args->hdr,BCF_DT_SAMPLE,args->isample);
+        s2 = bcf_hdr_int2id(args->hdr,BCF_DT_SAMPLE,args->jsample);
+    }
+    else if ( args->mode==C_TRIO )
+    {
+        s1 = bcf_hdr_int2id(args->hdr,BCF_DT_SAMPLE,args->imother);
+        s3 = bcf_hdr_int2id(args->hdr,BCF_DT_SAMPLE,args->ifather);
+        s2 = bcf_hdr_int2id(args->hdr,BCF_DT_SAMPLE,args->ichild);
+    }
+    else abort();
+
+    if ( !args->fp )
+    {
+        kstring_t str = {0,0,0};
+        kputs(args->prefix, &str);
+        kputs(".dat", &str);
+        args->fp = fopen(str.s,"w");
+        if ( !args->fp ) error("%s: %s\n", str.s,strerror(errno));
+        free(str.s);
+        fprintf(args->fp,"# SG, shared segment\t[2]Chromosome\t[3]Start\t[4]End\t[5]%s:1\t[6]%s:2\n",s2,s2);
+        fprintf(args->fp,"# SW, number of switches\t[2]Sample\t[3]Chromosome\t[4]nHets\t[5]nSwitches\t[6]switch rate\n");
+    }
+
+    hmm_run_viterbi(args->hmm,args->nsites,args->eprob,args->sites);
+    uint8_t *vpath = hmm_get_viterbi_path(args->hmm);
+    int i, iprev = -1, prev_state = -1, nstates = hmm_get_nstates(args->hmm);
+    int nswitch_mother = 0, nswitch_father = 0;
+    for (i=0; i<args->nsites; i++)
+    {
+        int state = vpath[i*nstates];
+        if ( state!=prev_state || i+1==args->nsites )
+        {
+            uint32_t start = iprev>=0 ? args->sites[iprev]+1 : 1, end = i>0 ? args->sites[i-1] : 1;
+            const char *chr = bcf_hdr_id2name(args->hdr,args->prev_rid);
+            if ( args->mode==C_UNRL )
+            {
+                switch (prev_state)
+                {
+                    case UNRL_0x0x:
+                        fprintf(args->fp,"SG\t%s\t%d\t%d\t%s:1\t-\n", chr,start,end,s1); break;
+                    case UNRL_0xx0:
+                        fprintf(args->fp,"SG\t%s\t%d\t%d\t-\t%s:1\n", chr,start,end,s1); break;
+                    case UNRL_x00x:
+                        fprintf(args->fp,"SG\t%s\t%d\t%d\t%s:2\t-\n", chr,start,end,s1); break;
+                    case UNRL_x0x0:
+                        fprintf(args->fp,"SG\t%s\t%d\t%d\t-\t%s:2\n", chr,start,end,s1); break;
+                    case UNRL_0101:
+                        fprintf(args->fp,"SG\t%s\t%d\t%d\t%s:1\t%s:2\n", chr,start,end,s1,s1); break;
+                    case UNRL_0110:
+                        fprintf(args->fp,"SG\t%s\t%d\t%d\t%s:2\t%s:1\n", chr,start,end,s1,s1); break;
+                }
+            }
+            else if ( args->mode==C_TRIO )
+            {
+                switch (prev_state)
+                {
+                    case TRIO_AC:
+                        fprintf(args->fp,"SG\t%s\t%d\t%d\t%s:1\t%s:1\n", chr,start,end,s1,s3); break;
+                    case TRIO_AD:
+                        fprintf(args->fp,"SG\t%s\t%d\t%d\t%s:1\t%s:2\n", chr,start,end,s1,s3); break;
+                    case TRIO_BC:
+                        fprintf(args->fp,"SG\t%s\t%d\t%d\t%s:2\t%s:1\n", chr,start,end,s1,s3); break;
+                    case TRIO_BD:
+                        fprintf(args->fp,"SG\t%s\t%d\t%d\t%s:2\t%s:2\n", chr,start,end,s1,s3); break;
+                    case TRIO_CA:
+                        fprintf(args->fp,"SG\t%s\t%d\t%d\t%s:1\t%s:1\n", chr,start,end,s3,s1); break;
+                    case TRIO_DA:
+                        fprintf(args->fp,"SG\t%s\t%d\t%d\t%s:2\t%s:1\n", chr,start,end,s3,s1); break;
+                    case TRIO_CB:
+                        fprintf(args->fp,"SG\t%s\t%d\t%d\t%s:1\t%s:2\n", chr,start,end,s3,s1); break;
+                    case TRIO_DB:
+                        fprintf(args->fp,"SG\t%s\t%d\t%d\t%s:2\t%s:2\n", chr,start,end,s3,s1); break;
+                }
+                if ( hap_switch[state][prev_state] & SW_MOTHER ) nswitch_mother++;
+                if ( hap_switch[state][prev_state] & SW_FATHER ) nswitch_father++;
+            }
+            iprev = i-1;
+        }
+        prev_state = state;
+    }
+    float mrate = args->nhet_mother>1 ? (float)nswitch_mother/(args->nhet_mother-1) : 0;
+    float frate = args->nhet_father>1 ? (float)nswitch_father/(args->nhet_father-1) : 0;
+    fprintf(args->fp,"SW\t%s\t%s\t%d\t%d\t%f\n", s1,bcf_hdr_id2name(args->hdr,args->prev_rid),args->nhet_mother,nswitch_mother,mrate);
+    fprintf(args->fp,"SW\t%s\t%s\t%d\t%d\t%f\n", s3,bcf_hdr_id2name(args->hdr,args->prev_rid),args->nhet_father,nswitch_father,frate);
+    args->nsites = 0;
+    args->nhet_father = args->nhet_mother = 0;
+}
+    
+bcf1_t *process(bcf1_t *rec)
+{
+    if ( args.prev_rid==-1 ) args.prev_rid = rec->rid;
+    if ( args.prev_rid!=rec->rid ) flush_viterbi(&args);
+    args.prev_rid = rec->rid;
+    args.set_observed_prob(rec);
+    return NULL;
+}
+
+void destroy(void)
+{
+    flush_viterbi(&args);
+    fclose(args.fp);
+
+    free(args.gt_arr);
+    free(args.tprob);
+    free(args.sites);
+    free(args.eprob);
+    hmm_destroy(args.hmm);
+}
+
+
+
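Note on the trio transition matrix built by `init_hmm_trio` in the file above: the explicit `hap_switch` table can be reproduced compactly. The sketch below is an illustration, not upstream code. It assumes a row-major 8x8 layout (`tprob[8*i+j]`), and the names `build_trio_tprob` and `TRIO_NSTATES` are hypothetical. It relies on the plugin's state numbering (TRIO_AC=0 through TRIO_DB=7): within each group of four states, bit 0 selects the father's haplotype and bit 1 the mother's, so the XOR of two state indices gives the set of haplotypes that switch.

```c
#include <assert.h>

enum { TRIO_NSTATES = 8 };

/* Hypothetical helper, not part of bcftools: fills a row-major 8x8
   transition matrix with the same values the hap_switch table yields. */
static void build_trio_tprob(double pij, double *tprob)
{
    int i, j;
    assert(pij >= 0 && pij < 0.25);   /* sufficient to keep diagonals positive */
    for (i = 0; i < TRIO_NSTATES; i++)
    {
        double sum = 0;
        for (j = 0; j < TRIO_NSTATES; j++)
        {
            double p = 0;
            /* Transitions are allowed only within the same group of four
               states (AC,AD,BC,BD or CA,DA,CB,DB). */
            if (i != j && (i >> 2) == (j >> 2))
            {
                int diff = i ^ j;
                p = 1;
                if (diff & 1) p *= pij;   /* father's haplotype switches */
                if (diff & 2) p *= pij;   /* mother's haplotype switches */
            }
            tprob[TRIO_NSTATES*i + j] = p;
            sum += p;
        }
        /* Staying in the same state takes the remaining probability mass. */
        tprob[TRIO_NSTATES*i + i] = 1 - sum;
    }
}
```

With `pij = 2e-8`, a single-parent switch gets probability `pij`, a simultaneous switch in both parents gets `pij*pij`, and moves between the two groups of four states get zero, as in the table-driven version.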
diff --git a/bcftools/plugins/color-chrs.c.pysam.c b/bcftools/plugins/color-chrs.c.pysam.c
new file mode 100644
index 0000000..0b62b58
--- /dev/null
+++ b/bcftools/plugins/color-chrs.c.pysam.c
@@ -0,0 +1,563 @@
+#include "bcftools.pysam.h"
+
+/* The MIT License
+
+   Copyright (c) 2015 Genome Research Ltd.
+
+   Author: Petr Danecek <pd3@sanger.ac.uk>
+   
+   Permission is hereby granted, free of charge, to any person obtaining a copy
+   of this software and associated documentation files (the "Software"), to deal
+   in the Software without restriction, including without limitation the rights
+   to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+   copies of the Software, and to permit persons to whom the Software is
+   furnished to do so, subject to the following conditions:
+   
+   The above copyright notice and this permission notice shall be included in
+   all copies or substantial portions of the Software.
+   
+   THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+   IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+   FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+   AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+   LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+   OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+   THE SOFTWARE.
+
+ */
+
+/*
+    Trio haplotypes: mother (A,B), father (C,D), child (E,F)
+    Modeling the following states:
+        01|23|02 
+        01|23|03
+        01|23|12
+        01|23|13
+        01|23|20
+        01|23|30
+        01|23|21
+        01|23|31
+    with the likelihoods of two haplotype segments A,B sharing an allele:
+        P(01|A==B)  .. e (P of error)
+        P(00|A==B)  .. 1-e
+    and
+        P(ab,cd,ef|E=A,F=C) = P(ea|E=A)*P(fc|F=C)
+
+
+    Unrelated samples: (A,B) and (C,D)
+    Modeling the states:
+        xxxx .. A!=C,A!=D,B!=C,B!=D
+        0x0x .. A=C,B!=D
+        0xx0 .. A=D,B!=C
+        x00x .. B=C,A!=D
+        x0x0 .. B=D,A!=C
+        0101 .. A=C,B=D
+        0110 .. A=D,B=C
+    with the likelihoods
+        P(01|A!=B)  .. f*(1-f)
+        P(00|A!=B)  .. (1-f)*(1-f)
+        P(11|A!=B)  .. f*f
+
+    Assuming 2x30 crossovers, P=2e-8.
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <getopt.h>
+#include <math.h>
+#include <htslib/hts.h>
+#include <htslib/vcf.h>
+#include <errno.h>
+#include "bcftools.h"
+#include "HMM.h"
+
+#define C_TRIO 1
+#define C_UNRL 2
+
+// states for unrelated samples
+#define UNRL_xxxx  0
+#define UNRL_0x0x  1
+#define UNRL_0xx0  2
+#define UNRL_x00x  3
+#define UNRL_x0x0  4
+#define UNRL_0101  5
+#define UNRL_0110  6
+
+// trio states
+#define TRIO_AC 0
+#define TRIO_AD 1
+#define TRIO_BC 2
+#define TRIO_BD 3
+#define TRIO_CA 4
+#define TRIO_DA 5
+#define TRIO_CB 6
+#define TRIO_DB 7
+
+typedef struct _args_t
+{
+    bcf_hdr_t *hdr;
+    hmm_t *hmm;
+    double *eprob, *tprob, pij, pgt_err;
+    uint32_t *sites;
+    int32_t *gt_arr;
+    int nsites, msites, ngt_arr, prev_rid;
+    int mode, nstates, nhet_father, nhet_mother;
+    int imother,ifather,ichild, isample,jsample;
+    void (*set_observed_prob) (bcf1_t *rec);
+    char *prefix;
+    FILE *fp;
+}
+args_t;
+
+static args_t args;
+
+#define SW_MOTHER 1
+#define SW_FATHER 2
+static int hap_switch[8][8];
+
+static void set_observed_prob_trio(bcf1_t *rec);
+static void set_observed_prob_unrelated(bcf1_t *rec);
+static void init_hmm_trio(args_t *args);
+static void init_hmm_unrelated(args_t *args);
+
+
+const char *about(void)
+{
+    return "Color shared chromosomal segments, requires phased GTs.\n";
+}
+
+const char *usage(void)
+{
+    return 
+        "\n"
+        "About: Color shared chromosomal segments, requires phased GTs. The output\n"
+        "       can be visualized using the color-chrs.pl script.\n"
+        "Usage: bcftools +color-chrs [General Options] -- [Plugin Options]\n"
+        "Options:\n"
+        "   run \"bcftools plugin\" for a list of common options\n"
+        "\n"
+        "Plugin options:\n"
+        "   -p, --prefix <path>     output files prefix\n"
+        "   -t, --trio <m,f,c>      names of mother, father and the child\n"
+        "   -u, --unrelated <a,b>   names of two unrelated samples\n"
+        "\n"
+        "Example:\n"
+        "   bcftools +color-chrs in.vcf -- -p prefix -t mother,father,child\n"
+        "\n";
+}
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+    char *trio_samples = NULL, *unrelated_samples = NULL;
+    memset(&args,0,sizeof(args_t));
+    args.prev_rid = -1;
+    args.hdr = in;
+    args.pij = 2e-8;
+    args.pgt_err = 1e-9;
+
+    static struct option loptions[] =
+    {
+        {"prefix",1,0,'p'},
+        {"trio",1,0,'t'},
+        {"unrelated",1,0,'u'},
+        {0,0,0,0}
+    };
+    int c;
+    while ((c = getopt_long(argc, argv, "?ht:u:p:",loptions,NULL)) >= 0)
+    {
+        switch (c) 
+        {
+            case 'p': args.prefix = optarg; break;
+            case 't': trio_samples = optarg; break;
+            case 'u': unrelated_samples = optarg; break;
+            case 'h':
+            case '?':
+            default: error("%s", usage()); break;
+        }
+    }
+    if ( optind != argc ) error("%s",usage());
+    if ( trio_samples && unrelated_samples ) error("Expected only one of the -t/-u options\n");
+    if ( !trio_samples && !unrelated_samples ) error("Expected one of the -t/-u options\n");
+    if ( !args.prefix ) error("Expected the -p option\n");
+
+    int ret = bcf_hdr_set_samples(args.hdr, trio_samples ? trio_samples : unrelated_samples, 0);
+    if ( ret<0 ) error("Could not parse samples: %s\n", trio_samples ? trio_samples : unrelated_samples);
+    else if ( ret>0 ) error("%d-th sample not found: %s\n", ret,trio_samples ? trio_samples : unrelated_samples);
+
+    if ( trio_samples )
+    {
+        int i,n = 0;
+        char **list = hts_readlist(trio_samples, 0, &n);
+        if ( n!=3 ) error("Expected three sample names with -t\n");
+        args.imother = bcf_hdr_id2int(args.hdr, BCF_DT_SAMPLE, list[0]);
+        args.ifather = bcf_hdr_id2int(args.hdr, BCF_DT_SAMPLE, list[1]);
+        args.ichild  = bcf_hdr_id2int(args.hdr, BCF_DT_SAMPLE, list[2]);
+        for (i=0; i<n; i++) free(list[i]);
+        free(list);
+        args.set_observed_prob = set_observed_prob_trio;
+        args.mode = C_TRIO;
+        init_hmm_trio(&args);
+    }
+    else
+    {
+        int i,n = 0;
+        char **list = hts_readlist(unrelated_samples, 0, &n);
+        if ( n!=2 ) error("Expected two sample names with -u\n");
+        args.isample = bcf_hdr_id2int(args.hdr, BCF_DT_SAMPLE, list[0]);
+        args.jsample = bcf_hdr_id2int(args.hdr, BCF_DT_SAMPLE, list[1]);
+        for (i=0; i<n; i++) free(list[i]);
+        free(list);
+        args.set_observed_prob = set_observed_prob_unrelated;
+        args.mode = C_UNRL;
+        init_hmm_unrelated(&args);
+    }
+    return 1;
+}
+
+static void init_hmm_trio(args_t *args)
+{
+    int i,j;
+    args->nstates = 8;
+    args->tprob   = (double*) malloc(sizeof(double)*args->nstates*args->nstates);
+
+    for (i=0; i<args->nstates; i++)
+        for (j=0; j<args->nstates; j++) hap_switch[i][j] = 0;
+
+    hap_switch[TRIO_AD][TRIO_AC] = SW_FATHER;
+    hap_switch[TRIO_BC][TRIO_AC] = SW_MOTHER;
+    hap_switch[TRIO_BD][TRIO_AC] = SW_MOTHER | SW_FATHER;
+    hap_switch[TRIO_AC][TRIO_AD] = SW_FATHER;
+    hap_switch[TRIO_BC][TRIO_AD] = SW_MOTHER | SW_FATHER;
+    hap_switch[TRIO_BD][TRIO_AD] = SW_MOTHER;
+    hap_switch[TRIO_AC][TRIO_BC] = SW_MOTHER;
+    hap_switch[TRIO_AD][TRIO_BC] = SW_MOTHER | SW_FATHER;
+    hap_switch[TRIO_BD][TRIO_BC] = SW_FATHER;
+    hap_switch[TRIO_AC][TRIO_BD] = SW_MOTHER | SW_FATHER;
+    hap_switch[TRIO_AD][TRIO_BD] = SW_MOTHER;
+    hap_switch[TRIO_BC][TRIO_BD] = SW_FATHER;
+
+    hap_switch[TRIO_DA][TRIO_CA] = SW_FATHER;
+    hap_switch[TRIO_CB][TRIO_CA] = SW_MOTHER;
+    hap_switch[TRIO_DB][TRIO_CA] = SW_MOTHER | SW_FATHER;
+    hap_switch[TRIO_CA][TRIO_DA] = SW_FATHER;
+    hap_switch[TRIO_CB][TRIO_DA] = SW_MOTHER | SW_FATHER;
+    hap_switch[TRIO_DB][TRIO_DA] = SW_MOTHER;
+    hap_switch[TRIO_CA][TRIO_CB] = SW_MOTHER;
+    hap_switch[TRIO_DA][TRIO_CB] = SW_MOTHER | SW_FATHER;
+    hap_switch[TRIO_DB][TRIO_CB] = SW_FATHER;
+    hap_switch[TRIO_CA][TRIO_DB] = SW_MOTHER | SW_FATHER;
+    hap_switch[TRIO_DA][TRIO_DB] = SW_MOTHER;
+    hap_switch[TRIO_CB][TRIO_DB] = SW_FATHER;
+
+    for (i=0; i<args->nstates; i++)
+    {
+        for (j=0; j<args->nstates; j++)
+        {
+            if ( !hap_switch[i][j] ) MAT(args->tprob,args->nstates,i,j) = 0;
+            else
+            {
+                MAT(args->tprob,args->nstates,i,j) = 1;
+                if ( hap_switch[i][j] & SW_MOTHER ) MAT(args->tprob,args->nstates,i,j) *= args->pij;
+                if ( hap_switch[i][j] & SW_FATHER ) MAT(args->tprob,args->nstates,i,j) *= args->pij;
+            }
+        }
+    }
+    for (i=0; i<args->nstates; i++)
+    {
+        double sum = 0;
+        for (j=0; j<args->nstates; j++)
+        {
+            if ( i!=j ) sum += MAT(args->tprob,args->nstates,i,j);
+        }
+        MAT(args->tprob,args->nstates,i,i) = 1 - sum;
+    }
+
+    #if 0
+    for (i=0; i<args->nstates; i++)
+    {
+        for (j=0; j<args->nstates; j++)
+            fprintf(bcftools_stderr,"\t%d",hap_switch[j][i]);
+        fprintf(bcftools_stderr,"\n");
+    }
+    for (i=0; i<args->nstates; i++)
+    {
+        for (j=0; j<args->nstates; j++)
+            fprintf(bcftools_stderr,"\t%e",MAT(args->tprob,args->nstates,j,i));
+        fprintf(bcftools_stderr,"\n");
+    }
+    #endif
+
+    args->hmm = hmm_init(args->nstates, args->tprob, 10000);
+}
+static void init_hmm_unrelated(args_t *args)
+{
+    int i,j;
+    args->nstates = 7;
+    args->tprob   = (double*) malloc(sizeof(double)*args->nstates*args->nstates);
+
+    for (i=0; i<args->nstates; i++)
+    {
+        for (j=0; j<args->nstates; j++)
+            MAT(args->tprob,args->nstates,i,j) = args->pij;
+    }
+    MAT(args->tprob,args->nstates,UNRL_0101,UNRL_xxxx) = args->pij*args->pij;
+    MAT(args->tprob,args->nstates,UNRL_0110,UNRL_xxxx) = args->pij*args->pij;
+    MAT(args->tprob,args->nstates,UNRL_x0x0,UNRL_0x0x) = args->pij*args->pij;
+    MAT(args->tprob,args->nstates,UNRL_0110,UNRL_0x0x) = args->pij*args->pij;
+    MAT(args->tprob,args->nstates,UNRL_x00x,UNRL_0xx0) = args->pij*args->pij;
+    MAT(args->tprob,args->nstates,UNRL_0101,UNRL_0xx0) = args->pij*args->pij;
+    MAT(args->tprob,args->nstates,UNRL_0101,UNRL_x00x) = args->pij*args->pij;
+    MAT(args->tprob,args->nstates,UNRL_0110,UNRL_x0x0) = args->pij*args->pij;
+    MAT(args->tprob,args->nstates,UNRL_0110,UNRL_0101) = args->pij*args->pij;
+
+    for (i=0; i<args->nstates; i++)
+    {
+        for (j=i+1; j<args->nstates; j++)
+            MAT(args->tprob,args->nstates,i,j) = MAT(args->tprob,args->nstates,j,i);
+    }
+    for (i=0; i<args->nstates; i++)
+    {
+        double sum = 0;
+        for (j=0; j<args->nstates; j++)
+            if ( i!=j ) sum += MAT(args->tprob,args->nstates,i,j);
+        MAT(args->tprob,args->nstates,i,i) = 1 - sum;
+    }
+
+    #if 0
+    for (i=0; i<args->nstates; i++)
+    {
+        for (j=0; j<args->nstates; j++)
+            fprintf(bcftools_stderr,"\t%e",MAT(args->tprob,args->nstates,j,i));
+        fprintf(bcftools_stderr,"\n");
+    }
+    #endif
+
+    args->hmm = hmm_init(args->nstates, args->tprob, 10000);
+}
+static inline double prob_shared(float af, int a, int b)
+{
+    return a==b ? 1-args.pgt_err : args.pgt_err;
+}
+static inline double prob_not_shared(float af, int a, int b)
+{
+    if ( a!=b ) return af*(1-af);
+    else if ( a==0 ) return (1-af)*(1-af);
+    else return af*af;
+}
+static void set_observed_prob_unrelated(bcf1_t *rec)
+{
+    float af = 0.5;  // alternate allele frequency
+
+    int ngt = bcf_get_genotypes(args.hdr, rec, &args.gt_arr, &args.ngt_arr);
+    if ( ngt<0 ) return;
+    if ( ngt!=4 ) return;   // chrX
+
+    int32_t a,b,c,d;
+    a = args.gt_arr[2*args.isample];
+    b = args.gt_arr[2*args.isample+1];
+    c = args.gt_arr[2*args.jsample];
+    d = args.gt_arr[2*args.jsample+1];
+    if ( bcf_gt_is_missing(a) || bcf_gt_is_missing(b) ) return;
+    if ( bcf_gt_is_missing(c) || bcf_gt_is_missing(d) ) return;
+    if ( !bcf_gt_is_phased(a) && !bcf_gt_is_phased(b) ) return; // only the second allele should be set when phased
+    if ( !bcf_gt_is_phased(c) && !bcf_gt_is_phased(d) ) return;
+    a = bcf_gt_allele(a);
+    b = bcf_gt_allele(b);
+    c = bcf_gt_allele(c);
+    d = bcf_gt_allele(d);
+
+    int m = args.msites;
+    args.nsites++;
+    hts_expand(uint32_t,args.nsites,args.msites,args.sites);
+    if ( m!=args.msites ) args.eprob = (double*) realloc(args.eprob, sizeof(double)*args.msites*args.nstates);
+
+    args.sites[args.nsites-1] = rec->pos;
+    double *prob = args.eprob + args.nstates*(args.nsites-1);
+    prob[UNRL_xxxx] = prob_not_shared(af,a,c) * prob_not_shared(af,a,d) * prob_not_shared(af,b,c) * prob_not_shared(af,b,d);
+    prob[UNRL_0x0x] = prob_shared(af,a,c) * prob_not_shared(af,b,d);
+    prob[UNRL_0xx0] = prob_shared(af,a,d) * prob_not_shared(af,b,c);
+    prob[UNRL_x00x] = prob_shared(af,b,c) * prob_not_shared(af,a,d);
+    prob[UNRL_x0x0] = prob_shared(af,b,d) * prob_not_shared(af,a,c);
+    prob[UNRL_0101] = prob_shared(af,a,c) * prob_shared(af,b,d);
+    prob[UNRL_0110] = prob_shared(af,a,d) * prob_shared(af,b,c);
+
+#if 0
+    static int x = 0;
+    if ( !x++)
+    {
+        fprintf(bcftools_stdout, "p(0==0) .. %f\n", prob_shared(af,0,0));
+        fprintf(bcftools_stdout, "p(0!=0) .. %f\n", prob_not_shared(af,0,0));
+        fprintf(bcftools_stdout, "p(0==1) .. %f\n", prob_shared(af,0,1));
+        fprintf(bcftools_stdout, "p(0!=1) .. %f\n", prob_not_shared(af,0,1));
+    }
+    fprintf(bcftools_stdout, "%d|%d %d|%d  x:%f 11:%f 12:%f 21:%f 22:%f 11,22:%f 12,21:%f  %d\n", a,b,c,d,
+            prob[UNRL_xxxx], prob[UNRL_0x0x], prob[UNRL_0xx0], prob[UNRL_x00x], prob[UNRL_x0x0], prob[UNRL_0101], prob[UNRL_0110], rec->pos+1);
+#endif
+}
+static void set_observed_prob_trio(bcf1_t *rec)
+{
+    int ngt = bcf_get_genotypes(args.hdr, rec, &args.gt_arr, &args.ngt_arr);
+    if ( ngt<0 ) return;
+    if ( ngt!=6 ) return;   // chrX
+
+    int32_t a,b,c,d,e,f;
+    a = args.gt_arr[2*args.imother];
+    b = args.gt_arr[2*args.imother+1];
+    c = args.gt_arr[2*args.ifather];
+    d = args.gt_arr[2*args.ifather+1];
+    e = args.gt_arr[2*args.ichild];
+    f = args.gt_arr[2*args.ichild+1];
+    if ( bcf_gt_is_missing(a) || bcf_gt_is_missing(b) ) return;
+    if ( bcf_gt_is_missing(c) || bcf_gt_is_missing(d) ) return;
+    if ( bcf_gt_is_missing(e) || bcf_gt_is_missing(f) ) return;
+    if ( !bcf_gt_is_phased(a) && !bcf_gt_is_phased(b) ) return; // only the second allele should be set when phased
+    if ( !bcf_gt_is_phased(c) && !bcf_gt_is_phased(d) ) return;
+    if ( !bcf_gt_is_phased(e) && !bcf_gt_is_phased(f) ) return;
+    a = bcf_gt_allele(a);
+    b = bcf_gt_allele(b);
+    c = bcf_gt_allele(c);
+    d = bcf_gt_allele(d);
+    e = bcf_gt_allele(e);
+    f = bcf_gt_allele(f);
+
+    int mother = (1<<a) | (1<<b);
+    int father = (1<<c) | (1<<d);
+    int child  = (1<<e) | (1<<f);
+    if ( !(mother&child) || !(father&child) )  return;      // Mendelian-inconsistent site, skip
+
+    if ( a!=b ) args.nhet_mother++;
+    if ( c!=d ) args.nhet_father++;
+
+    int m = args.msites;
+    args.nsites++;
+    hts_expand(uint32_t,args.nsites,args.msites,args.sites);
+    if ( m!=args.msites ) args.eprob = (double*) realloc(args.eprob, sizeof(double)*args.msites*args.nstates);
+
+    args.sites[args.nsites-1] = rec->pos;
+    double *prob = args.eprob + args.nstates*(args.nsites-1);
+    prob[TRIO_AC] = prob_shared(0,e,a) * prob_shared(0,f,c);
+    prob[TRIO_AD] = prob_shared(0,e,a) * prob_shared(0,f,d);
+    prob[TRIO_BC] = prob_shared(0,e,b) * prob_shared(0,f,c);
+    prob[TRIO_BD] = prob_shared(0,e,b) * prob_shared(0,f,d);
+    prob[TRIO_CA] = prob_shared(0,e,c) * prob_shared(0,f,a);
+    prob[TRIO_DA] = prob_shared(0,e,d) * prob_shared(0,f,a);
+    prob[TRIO_CB] = prob_shared(0,e,c) * prob_shared(0,f,b);
+    prob[TRIO_DB] = prob_shared(0,e,d) * prob_shared(0,f,b);
+}
+
+void flush_viterbi(args_t *args)
+{
+    const char *s1, *s2, *s3 = NULL;
+    if ( args->mode==C_UNRL )
+    {
+        s1 = bcf_hdr_int2id(args->hdr,BCF_DT_SAMPLE,args->isample);
+        s2 = bcf_hdr_int2id(args->hdr,BCF_DT_SAMPLE,args->jsample);
+    }
+    else if ( args->mode==C_TRIO )
+    {
+        s1 = bcf_hdr_int2id(args->hdr,BCF_DT_SAMPLE,args->imother);
+        s3 = bcf_hdr_int2id(args->hdr,BCF_DT_SAMPLE,args->ifather);
+        s2 = bcf_hdr_int2id(args->hdr,BCF_DT_SAMPLE,args->ichild);
+    }
+    else abort();
+
+    if ( !args->fp )
+    {
+        kstring_t str = {0,0,0};
+        kputs(args->prefix, &str);
+        kputs(".dat", &str);
+        args->fp = fopen(str.s,"w");
+        if ( !args->fp ) error("%s: %s\n", str.s,strerror(errno));
+        free(str.s);
+        fprintf(args->fp,"# SG, shared segment\t[2]Chromosome\t[3]Start\t[4]End\t[5]%s:1\t[6]%s:2\n",s2,s2);
+        fprintf(args->fp,"# SW, number of switches\t[2]Sample\t[3]Chromosome\t[4]nHets\t[5]nSwitches\t[6]switch rate\n");
+    }
+
+    hmm_run_viterbi(args->hmm,args->nsites,args->eprob,args->sites);
+    uint8_t *vpath = hmm_get_viterbi_path(args->hmm);
+    int i, iprev = -1, prev_state = -1, nstates = hmm_get_nstates(args->hmm);
+    int nswitch_mother = 0, nswitch_father = 0;
+    for (i=0; i<args->nsites; i++)
+    {
+        int state = vpath[i*nstates];
+        if ( state!=prev_state || i+1==args->nsites )
+        {
+            uint32_t start = iprev>=0 ? args->sites[iprev]+1 : 1, end = i>0 ? args->sites[i-1] : 1;
+            const char *chr = bcf_hdr_id2name(args->hdr,args->prev_rid);
+            if ( args->mode==C_UNRL )
+            {
+                switch (prev_state)
+                {
+                    case UNRL_0x0x:
+                        fprintf(args->fp,"SG\t%s\t%d\t%d\t%s:1\t-\n", chr,start,end,s1); break;
+                    case UNRL_0xx0:
+                        fprintf(args->fp,"SG\t%s\t%d\t%d\t-\t%s:1\n", chr,start,end,s1); break;
+                    case UNRL_x00x:
+                        fprintf(args->fp,"SG\t%s\t%d\t%d\t%s:2\t-\n", chr,start,end,s1); break;
+                    case UNRL_x0x0:
+                        fprintf(args->fp,"SG\t%s\t%d\t%d\t-\t%s:2\n", chr,start,end,s1); break;
+                    case UNRL_0101:
+                        fprintf(args->fp,"SG\t%s\t%d\t%d\t%s:1\t%s:2\n", chr,start,end,s1,s1); break;
+                    case UNRL_0110:
+                        fprintf(args->fp,"SG\t%s\t%d\t%d\t%s:2\t%s:1\n", chr,start,end,s1,s1); break;
+                }
+            }
+            else if ( args->mode==C_TRIO )
+            {
+                switch (prev_state)
+                {
+                    case TRIO_AC:
+                        fprintf(args->fp,"SG\t%s\t%d\t%d\t%s:1\t%s:1\n", chr,start,end,s1,s3); break;
+                    case TRIO_AD:
+                        fprintf(args->fp,"SG\t%s\t%d\t%d\t%s:1\t%s:2\n", chr,start,end,s1,s3); break;
+                    case TRIO_BC:
+                        fprintf(args->fp,"SG\t%s\t%d\t%d\t%s:2\t%s:1\n", chr,start,end,s1,s3); break;
+                    case TRIO_BD:
+                        fprintf(args->fp,"SG\t%s\t%d\t%d\t%s:2\t%s:2\n", chr,start,end,s1,s3); break;
+                    case TRIO_CA:
+                        fprintf(args->fp,"SG\t%s\t%d\t%d\t%s:1\t%s:1\n", chr,start,end,s3,s1); break;
+                    case TRIO_DA:
+                        fprintf(args->fp,"SG\t%s\t%d\t%d\t%s:2\t%s:1\n", chr,start,end,s3,s1); break;
+                    case TRIO_CB:
+                        fprintf(args->fp,"SG\t%s\t%d\t%d\t%s:1\t%s:2\n", chr,start,end,s3,s1); break;
+                    case TRIO_DB:
+                        fprintf(args->fp,"SG\t%s\t%d\t%d\t%s:2\t%s:2\n", chr,start,end,s3,s1); break;
+                }
+                if ( hap_switch[state][prev_state] & SW_MOTHER ) nswitch_mother++;
+                if ( hap_switch[state][prev_state] & SW_FATHER ) nswitch_father++;
+            }
+            iprev = i-1;
+        }
+        prev_state = state;
+    }
+    float mrate = args->nhet_mother>1 ? (float)nswitch_mother/(args->nhet_mother-1) : 0;
+    float frate = args->nhet_father>1 ? (float)nswitch_father/(args->nhet_father-1) : 0;
+    fprintf(args->fp,"SW\t%s\t%s\t%d\t%d\t%f\n", s1,bcf_hdr_id2name(args->hdr,args->prev_rid),args->nhet_mother,nswitch_mother,mrate);
+    fprintf(args->fp,"SW\t%s\t%s\t%d\t%d\t%f\n", s3,bcf_hdr_id2name(args->hdr,args->prev_rid),args->nhet_father,nswitch_father,frate);
+    args->nsites = 0;
+    args->nhet_father = args->nhet_mother = 0;
+}
+    
+bcf1_t *process(bcf1_t *rec)
+{
+    if ( args.prev_rid==-1 ) args.prev_rid = rec->rid;
+    if ( args.prev_rid!=rec->rid ) flush_viterbi(&args);
+    args.prev_rid = rec->rid;
+    args.set_observed_prob(rec);
+    return NULL;
+}
+
+void destroy(void)
+{
+    flush_viterbi(&args);
+    fclose(args.fp);
+
+    free(args.gt_arr);
+    free(args.tprob);
+    free(args.sites);
+    free(args.eprob);
+    hmm_destroy(args.hmm);
+}
+
+
+
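Note on the trio check in the hunk above: each genotype is folded into a bitmask with one bit per allele index, and a site is kept only when the child shares at least one allele with each parent. The following is a minimal standalone sketch of that test, assuming diploid genotypes given as plain allele indices below 32. The helper names `gt_mask` and `trio_consistent` are hypothetical and not part of bcftools.

```c
#include <assert.h>

/* Hypothetical helper: one bit per allele index of a diploid genotype.
   Unsigned shifts keep allele index 31 well defined. */
static unsigned gt_mask(int a, int b)
{
    assert(a >= 0 && a < 32 && b >= 0 && b < 32);
    return (1u << a) | (1u << b);
}

/* Returns 1 when the child shares at least one allele with each parent,
   0 otherwise. This mirrors the (mother&child), (father&child) test. */
static int trio_consistent(unsigned mother, unsigned father, unsigned child)
{
    return (mother & child) && (father & child);
}
```

The test is a necessary condition only. For example, a 0/1 child of two 0/0 parents passes it, because allele 0 is shared with both parents even though allele 1 is unexplained.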
diff --git a/bcftools/plugins/contrast.c b/bcftools/plugins/contrast.c
new file mode 100644
index 0000000..f5bafb1
--- /dev/null
+++ b/bcftools/plugins/contrast.c
@@ -0,0 +1,364 @@
+/* The MIT License
+
+   Copyright (c) 2018 Genome Research Ltd.
+
+   Author: Petr Danecek <pd3@sanger.ac.uk>
+   
+   Permission is hereby granted, free of charge, to any person obtaining a copy
+   of this software and associated documentation files (the "Software"), to deal
+   in the Software without restriction, including without limitation the rights
+   to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+   copies of the Software, and to permit persons to whom the Software is
+   furnished to do so, subject to the following conditions:
+   
+   The above copyright notice and this permission notice shall be included in
+   all copies or substantial portions of the Software.
+   
+   THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+   IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+   FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+   AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+   LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+   OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+   THE SOFTWARE.
+
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <getopt.h>
+#include <errno.h>
+#include <unistd.h>     // for isatty
+#include <htslib/hts.h>
+#include <htslib/vcf.h>
+#include <htslib/kstring.h>
+#include <htslib/kseq.h>
+#include <htslib/synced_bcf_reader.h>
+#include "bcftools.h"
+#include "filter.h"
+
+
+// Logic of the filters: include or exclude sites which match the filters?
+#define FLT_INCLUDE 1
+#define FLT_EXCLUDE 2
+
+typedef struct
+{
+    int argc, filter_logic, regions_is_file, targets_is_file, output_type;
+    char **argv, *output_fname, *fname, *regions, *targets, *filter_str;
+    char *bg_samples_str, *novel_samples_str;
+    int *bg_smpl, *novel_smpl, nbg_smpl, nnovel_smpl;
+    filter_t *filter;
+    bcf_srs_t *sr;
+    bcf_hdr_t *hdr, *hdr_out;
+    htsFile *out_fh;
+    int32_t *gts;
+    int mgts;
+    uint32_t *bg_gts;
+    int nbg_gts, mbg_gts, ntotal, nskipped, ntested, nnovel_al, nnovel_gt;
+    kstring_t novel_als_smpl, novel_gts_smpl;
+}
+args_t;
+
+args_t args;
+
+const char *about(void)
+{
+    return "Find novel alleles and genotypes in two groups of samples.\n";
+}
+
+static const char *usage_text(void)
+{
+    return 
+        "\n"
+        "About: Finds novel alleles and genotypes in two groups of samples. Adds\n"
+        "       an annotation which lists samples with a novel allele (INFO/NOVELAL)\n"
+        "       or a novel genotype (INFO/NOVELGT)\n"
+        "Usage: bcftools +contrast [Plugin Options]\n"
+        "Plugin options:\n"
+        "   -0, --bg-samples <list>     list of background samples\n"
+        "   -1, --novel-samples <list>  list of samples where novel allele or genotype are expected\n"
+        "   -e, --exclude EXPR          exclude sites and samples for which the expression is true\n"
+        "   -i, --include EXPR          include sites and samples for which the expression is true\n"
+        "   -o, --output FILE           output file name [stdout]\n"
+        "   -O, --output-type <b|u|z|v> b: compressed BCF, u: uncompressed BCF, z: compressed VCF, v: uncompressed VCF [v]\n"
+        "   -r, --regions REG           restrict to comma-separated list of regions\n"
+        "   -R, --regions-file FILE     restrict to regions listed in a file\n"
+        "   -t, --targets REG           similar to -r but streams rather than index-jumps\n"
+        "   -T, --targets-file FILE     similar to -R but streams rather than index-jumps\n"
+        "\n"
+        "Example:\n"
+        "   # Test if any of the samples a,b is different from the samples c,d,e\n"
+        "   bcftools +contrast -0 c,d,e -1 a,b file.bcf\n"
+        "\n";
+}
+
+static void init_data(args_t *args)
+{
+    args->sr = bcf_sr_init();
+    if ( args->regions )
+    {
+        args->sr->require_index = 1;
+        if ( bcf_sr_set_regions(args->sr, args->regions, args->regions_is_file)<0 ) error("Failed to read the regions: %s\n",args->regions);
+    }
+    if ( args->targets && bcf_sr_set_targets(args->sr, args->targets, args->targets_is_file, 0)<0 ) error("Failed to read the targets: %s\n",args->targets);
+    if ( !bcf_sr_add_reader(args->sr,args->fname) ) error("Error: %s\n", bcf_sr_strerror(args->sr->errnum));
+    args->hdr = bcf_sr_get_header(args->sr,0);
+    args->hdr_out = bcf_hdr_dup(args->hdr);
+    bcf_hdr_append(args->hdr_out, "##INFO=<ID=NOVELAL,Number=.,Type=String,Description=\"List of samples with novel alleles\">");
+    bcf_hdr_append(args->hdr_out, "##INFO=<ID=NOVELGT,Number=.,Type=String,Description=\"List of samples with novel genotypes. Note that only samples w/o a novel allele are listed.\">");
+
+    if ( args->filter_str )
+        args->filter = filter_init(args->hdr, args->filter_str);
+
+    int i;
+    char **smpl = hts_readlist(args->bg_samples_str, 0, &args->nbg_smpl);
+    args->bg_smpl = (int*) malloc(sizeof(int)*args->nbg_smpl);
+    for (i=0; i<args->nbg_smpl; i++)
+    {
+        args->bg_smpl[i] = bcf_hdr_id2int(args->hdr, BCF_DT_SAMPLE, smpl[i]);
+        if ( args->bg_smpl[i]<0 ) error("The sample not present in the VCF: \"%s\"\n", smpl[i]);
+        free(smpl[i]);
+    }
+    free(smpl);
+
+    smpl = hts_readlist(args->novel_samples_str, 0, &args->nnovel_smpl);
+    args->novel_smpl = (int*) malloc(sizeof(int)*args->nnovel_smpl);
+    for (i=0; i<args->nnovel_smpl; i++)
+    {
+        args->novel_smpl[i] = bcf_hdr_id2int(args->hdr, BCF_DT_SAMPLE, smpl[i]);
+        if ( args->novel_smpl[i]<0 ) error("The sample not present in the VCF: \"%s\"\n", smpl[i]);
+        free(smpl[i]);
+    }
+    free(smpl);
+
+    args->out_fh = hts_open(args->output_fname,hts_bcf_wmode(args->output_type));
+    if ( args->out_fh == NULL ) error("Can't write to \"%s\": %s\n", args->output_fname, strerror(errno));
+    bcf_hdr_write(args->out_fh, args->hdr_out);
+}
+static void destroy_data(args_t *args)
+{
+    bcf_hdr_destroy(args->hdr_out);
+    hts_close(args->out_fh);
+    free(args->novel_als_smpl.s);
+    free(args->novel_gts_smpl.s);
+    free(args->gts);
+    free(args->bg_gts);
+    free(args->bg_smpl);
+    free(args->novel_smpl);
+    if ( args->filter ) filter_destroy(args->filter);
+    bcf_sr_destroy(args->sr);
+    free(args);
+}
+static inline int binary_search(uint32_t val, uint32_t *dat, int ndat)
+{
+    int i = -1, imin = 0, imax = ndat - 1;
+    while ( imin<=imax )
+    {
+        i = (imin+imax)/2;
+        if ( dat[i] < val ) imin = i + 1;
+        else if ( dat[i] > val ) imax = i - 1;
+        else return 1;
+    }
+    return 0;
+}
+static inline void binary_insert(uint32_t val, uint32_t **dat, int *ndat, int *mdat)
+{
+    int i = -1, imin = 0, imax = *ndat - 1;
+    while ( imin<=imax )
+    {
+        i = (imin+imax)/2;
+        if ( (*dat)[i] < val ) imin = i + 1;
+        else if ( (*dat)[i] > val ) imax = i - 1;
+        else return;
+    }
+    while ( i>=0 && (*dat)[i]>val ) i--;
+
+    (*ndat)++;
+    hts_expand(uint32_t, (*ndat), (*mdat), (*dat));
+
+    if ( *ndat - i - 2 > 0 )    // shift the tail [i+1,old ndat) up by one slot; never reads dat[-1]
+        memmove(*dat + i + 2, *dat + i + 1, sizeof(uint32_t)*(*ndat - i - 2));
+
+    (*dat)[i+1] = val;
+}
+static int process_record(args_t *args, bcf1_t *rec)
+{
+    args->ntotal++;
+
+    static int warned = 0;
+    int ngts = bcf_get_genotypes(args->hdr, rec, &args->gts, &args->mgts);
+    ngts /= rec->n_sample;
+    if ( ngts>2 ) error("todo: ploidy=%d\n", ngts);
+
+    args->nbg_gts = 0;
+    uint32_t bg_als = 0;
+    int i,j;
+    for (i=0; i<args->nbg_smpl; i++)
+    {
+        uint32_t gt  = 0;
+        int32_t *ptr = args->gts + args->bg_smpl[i]*ngts;
+        for (j=0; j<ngts; j++)
+        {
+            if ( ptr[j]==bcf_int32_vector_end ) break;
+            if ( bcf_gt_is_missing(ptr[j]) ) continue; 
+            int ial = bcf_gt_allele(ptr[j]);
+            if ( ial > 31 )
+            {
+                if ( !warned )
+                {
+                    fprintf(stderr,"Too many alleles (>32) at %s:%d, skipping. (todo?)\n", bcf_seqname(args->hdr,rec),rec->pos+1);
+                    warned = 1;
+                }
+                args->nskipped++;
+                return -1;
+            }
+            bg_als |= 1u<<ial;
+            gt |= 1u<<ial;
+        }
+        binary_insert(gt, &args->bg_gts, &args->nbg_gts, &args->mbg_gts);
+    }
+    if ( !bg_als )
+    {
+        // all are missing
+        args->nskipped++;
+        return -1;
+    }
+
+    args->novel_als_smpl.l = 0;
+    args->novel_gts_smpl.l = 0;
+
+    int has_gt = 0;
+    for (i=0; i<args->nnovel_smpl; i++)
+    {
+        int novel_al = 0;
+        uint32_t gt  = 0;
+        int32_t *ptr = args->gts + args->novel_smpl[i]*ngts;
+        for (j=0; j<ngts; j++)
+        {
+            if ( ptr[j]==bcf_int32_vector_end ) break;
+            if ( bcf_gt_is_missing(ptr[j]) ) continue; 
+            int ial = bcf_gt_allele(ptr[j]);
+            if ( ial > 31 )
+            {
+                if ( !warned )
+                {
+                    fprintf(stderr,"Too many alleles (>32) at %s:%d, skipping. (todo?)\n", bcf_seqname(args->hdr,rec),rec->pos+1);
+                    warned = 1;
+                }
+                args->nskipped++;
+                return -1;
+            }
+            if ( !(bg_als & (1u<<ial)) ) novel_al = 1;
+            gt |= 1u<<ial;
+        }
+        if ( !gt ) continue;
+        has_gt = 1;
+
+        char *smpl = args->hdr->samples[ args->novel_smpl[i] ];
+        if ( novel_al )
+        {
+            if ( args->novel_als_smpl.l ) kputc(',', &args->novel_als_smpl);
+            kputs(smpl, &args->novel_als_smpl);
+        }
+        else if ( !binary_search(gt, args->bg_gts, args->nbg_gts) )
+        {
+            if ( args->novel_gts_smpl.l ) kputc(',', &args->novel_gts_smpl);
+            kputs(smpl, &args->novel_gts_smpl);
+        }
+    }
+    if ( !has_gt )
+    {
+        // all are missing
+        args->nskipped++;
+        return -1;
+    }
+    if ( args->novel_als_smpl.l ) 
+    {
+        bcf_update_info_string(args->hdr_out, rec, "NOVELAL", args->novel_als_smpl.s);
+        args->nnovel_al++;
+    }
+    if ( args->novel_gts_smpl.l ) 
+    {
+        bcf_update_info_string(args->hdr_out, rec, "NOVELGT", args->novel_gts_smpl.s);
+        args->nnovel_gt++;
+    }
+    args->ntested++;
+    return 0;
+}
+
+int run(int argc, char **argv)
+{
+    args_t *args = (args_t*) calloc(1,sizeof(args_t));
+    args->argc   = argc; args->argv = argv;
+    args->output_fname = "-";
+    static struct option loptions[] =
+    {
+        {"bg-samples",required_argument,0,'0'},
+        {"novel-samples",required_argument,0,'1'},
+        {"include",required_argument,0,'i'},
+        {"exclude",required_argument,0,'e'},
+        {"output",required_argument,NULL,'o'},
+        {"output-type",required_argument,NULL,'O'},
+        {"regions",1,0,'r'},
+        {"regions-file",1,0,'R'},
+        {"targets",1,0,'t'},
+        {"targets-file",1,0,'T'},
+        {NULL,0,NULL,0}
+    };
+    int c;
+    while ((c = getopt_long(argc, argv, "O:o:i:e:r:R:t:T:0:1:",loptions,NULL)) >= 0)
+    {
+        switch (c) 
+        {
+            case '0': args->bg_samples_str = optarg; break;
+            case '1': args->novel_samples_str = optarg; break;
+            case 'e': args->filter_str = optarg; args->filter_logic |= FLT_EXCLUDE; break;
+            case 'i': args->filter_str = optarg; args->filter_logic |= FLT_INCLUDE; break;
+            case 't': args->targets = optarg; break;
+            case 'T': args->targets = optarg; args->targets_is_file = 1; break;
+            case 'r': args->regions = optarg; break;
+            case 'R': args->regions = optarg; args->regions_is_file = 1; break;
+            case 'o': args->output_fname = optarg; break;
+            case 'O':
+                      switch (optarg[0]) {
+                          case 'b': args->output_type = FT_BCF_GZ; break;
+                          case 'u': args->output_type = FT_BCF; break;
+                          case 'z': args->output_type = FT_VCF_GZ; break;
+                          case 'v': args->output_type = FT_VCF; break;
+                          default: error("The output type \"%s\" not recognised\n", optarg);
+                      };
+                      break;
+            case 'h':
+            case '?':
+            default: error("%s", usage_text()); break;
+        }
+    }
+    if ( optind==argc )
+    {
+        if ( !isatty(fileno((FILE *)stdin)) ) args->fname = "-";  // reading from stdin
+        else { error("%s",usage_text()); }
+    }
+    else if ( optind+1!=argc ) error("%s",usage_text());
+    else args->fname = argv[optind];
+
+    init_data(args);
+
+    while ( bcf_sr_next_line(args->sr) )
+    {
+        bcf1_t *rec = bcf_sr_get_line(args->sr,0);
+        if ( args->filter )
+        {
+            int pass = filter_test(args->filter, rec, NULL);
+            if ( args->filter_logic & FLT_EXCLUDE ) pass = pass ? 0 : 1;
+            if ( !pass ) continue;
+        }
+        process_record(args, rec);
+        bcf_write(args->out_fh, args->hdr_out, rec);
+    }
+
+    fprintf(stderr,"Total/processed/skipped/novel_allele/novel_gt:\t%d\t%d\t%d\t%d\t%d\n", args->ntotal, args->ntested, args->nskipped, args->nnovel_al, args->nnovel_gt);
+    destroy_data(args);
+
+    return 0;
+}
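Note on process_record in contrast.c above: a sample is reported under NOVELAL when it carries an allele that no background sample has, and under NOVELGT when all its alleles are known but its genotype bitmask does not occur among the background genotypes. The following sketch illustrates that decision in isolation. It assumes diploid genotypes already encoded as bitmasks, and it scans the background linearly instead of using the plugin's sorted array. The names `gt_mask`, `classify` and the enum values are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum { KNOWN = 0, NOVEL_GT = 1, NOVEL_AL = 2 };

/* Hypothetical helper: one bit per allele index of a diploid genotype. */
static uint32_t gt_mask(int a, int b)
{
    assert(a >= 0 && a < 32 && b >= 0 && b < 32);
    return (UINT32_C(1) << a) | (UINT32_C(1) << b);
}

/* Classify one genotype mask against nbg background genotype masks. */
static int classify(uint32_t gt, const uint32_t *bg, int nbg)
{
    uint32_t bg_als = 0;    /* union of all background alleles */
    int i, seen = 0;
    for (i = 0; i < nbg; i++)
    {
        bg_als |= bg[i];
        if ( bg[i] == gt ) seen = 1;
    }
    if ( gt & ~bg_als ) return NOVEL_AL;    /* an allele absent from the background */
    return seen ? KNOWN : NOVEL_GT;
}
```

With background genotypes 0/0 and 0/1, a 1/1 sample is a novel genotype (allele 1 is known, the combination is not), while a 0/2 sample is a novel allele. Because the mask records only which alleles are present, 0/1 and 1/0 are treated as the same genotype.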
diff --git a/bcftools/plugins/contrast.c.pysam.c b/bcftools/plugins/contrast.c.pysam.c
new file mode 100644
index 0000000..6f7e3ea
--- /dev/null
+++ b/bcftools/plugins/contrast.c.pysam.c
@@ -0,0 +1,366 @@
+#include "bcftools.pysam.h"
+
+/* The MIT License
+
+   Copyright (c) 2018 Genome Research Ltd.
+
+   Author: Petr Danecek <pd3@sanger.ac.uk>
+   
+   Permission is hereby granted, free of charge, to any person obtaining a copy
+   of this software and associated documentation files (the "Software"), to deal
+   in the Software without restriction, including without limitation the rights
+   to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+   copies of the Software, and to permit persons to whom the Software is
+   furnished to do so, subject to the following conditions:
+   
+   The above copyright notice and this permission notice shall be included in
+   all copies or substantial portions of the Software.
+   
+   THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+   IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+   FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+   AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+   LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+   OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+   THE SOFTWARE.
+
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <getopt.h>
+#include <errno.h>
+#include <unistd.h>     // for isatty
+#include <htslib/hts.h>
+#include <htslib/vcf.h>
+#include <htslib/kstring.h>
+#include <htslib/kseq.h>
+#include <htslib/synced_bcf_reader.h>
+#include "bcftools.h"
+#include "filter.h"
+
+
+// Logic of the filters: include or exclude sites which match the filters?
+#define FLT_INCLUDE 1
+#define FLT_EXCLUDE 2
+
+typedef struct
+{
+    int argc, filter_logic, regions_is_file, targets_is_file, output_type;
+    char **argv, *output_fname, *fname, *regions, *targets, *filter_str;
+    char *bg_samples_str, *novel_samples_str;
+    int *bg_smpl, *novel_smpl, nbg_smpl, nnovel_smpl;
+    filter_t *filter;
+    bcf_srs_t *sr;
+    bcf_hdr_t *hdr, *hdr_out;
+    htsFile *out_fh;
+    int32_t *gts;
+    int mgts;
+    uint32_t *bg_gts;
+    int nbg_gts, mbg_gts, ntotal, nskipped, ntested, nnovel_al, nnovel_gt;
+    kstring_t novel_als_smpl, novel_gts_smpl;
+}
+args_t;
+
+args_t args;
+
+const char *about(void)
+{
+    return "Find novel alleles and genotypes in two groups of samples.\n";
+}
+
+static const char *usage_text(void)
+{
+    return 
+        "\n"
+        "About: Finds novel alleles and genotypes in two groups of samples. Adds\n"
+        "       an annotation which lists samples with a novel allele (INFO/NOVELAL)\n"
+        "       or a novel genotype (INFO/NOVELGT)\n"
+        "Usage: bcftools +contrast [Plugin Options]\n"
+        "Plugin options:\n"
+        "   -0, --bg-samples <list>     list of background samples\n"
+        "   -1, --novel-samples <list>  list of samples where novel allele or genotype are expected\n"
+        "   -e, --exclude EXPR          exclude sites and samples for which the expression is true\n"
+        "   -i, --include EXPR          include sites and samples for which the expression is true\n"
+        "   -o, --output FILE           output file name [bcftools_stdout]\n"
+        "   -O, --output-type <b|u|z|v> b: compressed BCF, u: uncompressed BCF, z: compressed VCF, v: uncompressed VCF [v]\n"
+        "   -r, --regions REG           restrict to comma-separated list of regions\n"
+        "   -R, --regions-file FILE     restrict to regions listed in a file\n"
+        "   -t, --targets REG           similar to -r but streams rather than index-jumps\n"
+        "   -T, --targets-file FILE     similar to -R but streams rather than index-jumps\n"
+        "\n"
+        "Example:\n"
+        "   # Test if any of the samples a,b is different from the samples c,d,e\n"
+        "   bcftools +contrast -0 c,d,e -1 a,b file.bcf\n"
+        "\n";
+}
+
+static void init_data(args_t *args)
+{
+    args->sr = bcf_sr_init();
+    if ( args->regions )
+    {
+        args->sr->require_index = 1;
+        if ( bcf_sr_set_regions(args->sr, args->regions, args->regions_is_file)<0 ) error("Failed to read the regions: %s\n",args->regions);
+    }
+    if ( args->targets && bcf_sr_set_targets(args->sr, args->targets, args->targets_is_file, 0)<0 ) error("Failed to read the targets: %s\n",args->targets);
+    if ( !bcf_sr_add_reader(args->sr,args->fname) ) error("Error: %s\n", bcf_sr_strerror(args->sr->errnum));
+    args->hdr = bcf_sr_get_header(args->sr,0);
+    args->hdr_out = bcf_hdr_dup(args->hdr);
+    bcf_hdr_append(args->hdr_out, "##INFO=<ID=NOVELAL,Number=.,Type=String,Description=\"List of samples with novel alleles\">");
+    bcf_hdr_append(args->hdr_out, "##INFO=<ID=NOVELGT,Number=.,Type=String,Description=\"List of samples with novel genotypes. Note that only samples w/o a novel allele are listed.\">");
+
+    if ( args->filter_str )
+        args->filter = filter_init(args->hdr, args->filter_str);
+
+    int i;
+    char **smpl = hts_readlist(args->bg_samples_str, 0, &args->nbg_smpl);
+    args->bg_smpl = (int*) malloc(sizeof(int)*args->nbg_smpl);
+    for (i=0; i<args->nbg_smpl; i++)
+    {
+        args->bg_smpl[i] = bcf_hdr_id2int(args->hdr, BCF_DT_SAMPLE, smpl[i]);
+        if ( args->bg_smpl[i]<0 ) error("The sample not present in the VCF: \"%s\"\n", smpl[i]);
+        free(smpl[i]);
+    }
+    free(smpl);
+
+    smpl = hts_readlist(args->novel_samples_str, 0, &args->nnovel_smpl);
+    args->novel_smpl = (int*) malloc(sizeof(int)*args->nnovel_smpl);
+    for (i=0; i<args->nnovel_smpl; i++)
+    {
+        args->novel_smpl[i] = bcf_hdr_id2int(args->hdr, BCF_DT_SAMPLE, smpl[i]);
+        if ( args->novel_smpl[i]<0 ) error("The sample not present in the VCF: \"%s\"\n", smpl[i]);
+        free(smpl[i]);
+    }
+    free(smpl);
+
+    args->out_fh = hts_open(args->output_fname,hts_bcf_wmode(args->output_type));
+    if ( args->out_fh == NULL ) error("Can't write to \"%s\": %s\n", args->output_fname, strerror(errno));
+    bcf_hdr_write(args->out_fh, args->hdr_out);
+}
+static void destroy_data(args_t *args)
+{
+    bcf_hdr_destroy(args->hdr_out);
+    hts_close(args->out_fh);
+    free(args->novel_als_smpl.s);
+    free(args->novel_gts_smpl.s);
+    free(args->gts);
+    free(args->bg_gts);
+    free(args->bg_smpl);
+    free(args->novel_smpl);
+    if ( args->filter ) filter_destroy(args->filter);
+    bcf_sr_destroy(args->sr);
+    free(args);
+}
+static inline int binary_search(uint32_t val, uint32_t *dat, int ndat)
+{
+    int i = -1, imin = 0, imax = ndat - 1;
+    while ( imin<=imax )
+    {
+        i = (imin+imax)/2;
+        if ( dat[i] < val ) imin = i + 1;
+        else if ( dat[i] > val ) imax = i - 1;
+        else return 1;
+    }
+    return 0;
+}
+static inline void binary_insert(uint32_t val, uint32_t **dat, int *ndat, int *mdat)
+{
+    int i = -1, imin = 0, imax = *ndat - 1;
+    while ( imin<=imax )
+    {
+        i = (imin+imax)/2;
+        if ( (*dat)[i] < val ) imin = i + 1;
+        else if ( (*dat)[i] > val ) imax = i - 1;
+        else return;
+    }
+    while ( i>=0 && (*dat)[i]>val ) i--;
+
+    (*ndat)++;
+    hts_expand(uint32_t, (*ndat), (*mdat), (*dat));
+
+    if ( *ndat - i - 2 > 0 )    // shift the tail [i+1,old ndat) up by one slot; never reads dat[-1]
+        memmove(*dat + i + 2, *dat + i + 1, sizeof(uint32_t)*(*ndat - i - 2));
+
+    (*dat)[i+1] = val;
+}
+static int process_record(args_t *args, bcf1_t *rec)
+{
+    args->ntotal++;
+
+    static int warned = 0;
+    int ngts = bcf_get_genotypes(args->hdr, rec, &args->gts, &args->mgts);
+    ngts /= rec->n_sample;
+    if ( ngts>2 ) error("todo: ploidy=%d\n", ngts);
+
+    args->nbg_gts = 0;
+    uint32_t bg_als = 0;
+    int i,j;
+    for (i=0; i<args->nbg_smpl; i++)
+    {
+        uint32_t gt  = 0;
+        int32_t *ptr = args->gts + args->bg_smpl[i]*ngts;
+        for (j=0; j<ngts; j++)
+        {
+            if ( ptr[j]==bcf_int32_vector_end ) break;
+            if ( bcf_gt_is_missing(ptr[j]) ) continue; 
+            int ial = bcf_gt_allele(ptr[j]);
+            if ( ial > 31 )
+            {
+                if ( !warned )
+                {
+                    fprintf(bcftools_stderr,"Too many alleles (>32) at %s:%d, skipping. (todo?)\n", bcf_seqname(args->hdr,rec),rec->pos+1);
+                    warned = 1;
+                }
+                args->nskipped++;
+                return -1;
+            }
+            bg_als |= 1u<<ial;
+            gt |= 1u<<ial;
+        }
+        binary_insert(gt, &args->bg_gts, &args->nbg_gts, &args->mbg_gts);
+    }
+    if ( !bg_als )
+    {
+        // all are missing
+        args->nskipped++;
+        return -1;
+    }
+
+    args->novel_als_smpl.l = 0;
+    args->novel_gts_smpl.l = 0;
+
+    int has_gt = 0;
+    for (i=0; i<args->nnovel_smpl; i++)
+    {
+        int novel_al = 0;
+        uint32_t gt  = 0;
+        int32_t *ptr = args->gts + args->novel_smpl[i]*ngts;
+        for (j=0; j<ngts; j++)
+        {
+            if ( ptr[j]==bcf_int32_vector_end ) break;
+            if ( bcf_gt_is_missing(ptr[j]) ) continue; 
+            int ial = bcf_gt_allele(ptr[j]);
+            if ( ial > 31 )
+            {
+                if ( !warned )
+                {
+                    fprintf(bcftools_stderr,"Too many alleles (>32) at %s:%d, skipping. (todo?)\n", bcf_seqname(args->hdr,rec),rec->pos+1);
+                    warned = 1;
+                }
+                args->nskipped++;
+                return -1;
+            }
+            if ( !(bg_als & (1u<<ial)) ) novel_al = 1;
+            gt |= 1u<<ial;
+        }
+        if ( !gt ) continue;
+        has_gt = 1;
+
+        char *smpl = args->hdr->samples[ args->novel_smpl[i] ];
+        if ( novel_al )
+        {
+            if ( args->novel_als_smpl.l ) kputc(',', &args->novel_als_smpl);
+            kputs(smpl, &args->novel_als_smpl);
+        }
+        else if ( !binary_search(gt, args->bg_gts, args->nbg_gts) )
+        {
+            if ( args->novel_gts_smpl.l ) kputc(',', &args->novel_gts_smpl);
+            kputs(smpl, &args->novel_gts_smpl);
+        }
+    }
+    if ( !has_gt )
+    {
+        // all are missing
+        args->nskipped++;
+        return -1;
+    }
+    if ( args->novel_als_smpl.l ) 
+    {
+        bcf_update_info_string(args->hdr_out, rec, "NOVELAL", args->novel_als_smpl.s);
+        args->nnovel_al++;
+    }
+    if ( args->novel_gts_smpl.l ) 
+    {
+        bcf_update_info_string(args->hdr_out, rec, "NOVELGT", args->novel_gts_smpl.s);
+        args->nnovel_gt++;
+    }
+    args->ntested++;
+    return 0;
+}
+
+int run(int argc, char **argv)
+{
+    args_t *args = (args_t*) calloc(1,sizeof(args_t));
+    args->argc   = argc; args->argv = argv;
+    args->output_fname = "-";
+    static struct option loptions[] =
+    {
+        {"bg-samples",required_argument,0,'0'},
+        {"novel-samples",required_argument,0,'1'},
+        {"include",required_argument,0,'i'},
+        {"exclude",required_argument,0,'e'},
+        {"output",required_argument,NULL,'o'},
+        {"output-type",required_argument,NULL,'O'},
+        {"regions",1,0,'r'},
+        {"regions-file",1,0,'R'},
+        {"targets",1,0,'t'},
+        {"targets-file",1,0,'T'},
+        {NULL,0,NULL,0}
+    };
+    int c;
+    while ((c = getopt_long(argc, argv, "O:o:i:e:r:R:t:T:0:1:",loptions,NULL)) >= 0)
+    {
+        switch (c) 
+        {
+            case '0': args->bg_samples_str = optarg; break;
+            case '1': args->novel_samples_str = optarg; break;
+            case 'e': args->filter_str = optarg; args->filter_logic |= FLT_EXCLUDE; break;
+            case 'i': args->filter_str = optarg; args->filter_logic |= FLT_INCLUDE; break;
+            case 't': args->targets = optarg; break;
+            case 'T': args->targets = optarg; args->targets_is_file = 1; break;
+            case 'r': args->regions = optarg; break;
+            case 'R': args->regions = optarg; args->regions_is_file = 1; break;
+            case 'o': args->output_fname = optarg; break;
+            case 'O':
+                      switch (optarg[0]) {
+                          case 'b': args->output_type = FT_BCF_GZ; break;
+                          case 'u': args->output_type = FT_BCF; break;
+                          case 'z': args->output_type = FT_VCF_GZ; break;
+                          case 'v': args->output_type = FT_VCF; break;
+                          default: error("The output type \"%s\" not recognised\n", optarg);
+                      };
+                      break;
+            case 'h':
+            case '?':
+            default: error("%s", usage_text()); break;
+        }
+    }
+    if ( optind==argc )
+    {
+        if ( !isatty(fileno((FILE *)stdin)) ) args->fname = "-";  // reading from stdin
+        else { error("%s",usage_text()); }
+    }
+    else if ( optind+1!=argc ) error("%s",usage_text());
+    else args->fname = argv[optind];
+
+    init_data(args);
+
+    while ( bcf_sr_next_line(args->sr) )
+    {
+        bcf1_t *rec = bcf_sr_get_line(args->sr,0);
+        if ( args->filter )
+        {
+            int pass = filter_test(args->filter, rec, NULL);
+            if ( args->filter_logic & FLT_EXCLUDE ) pass = pass ? 0 : 1;
+            if ( !pass ) continue;
+        }
+        process_record(args, rec);
+        bcf_write(args->out_fh, args->hdr_out, rec);
+    }
+
+    fprintf(bcftools_stderr,"Total/processed/skipped/novel_allele/novel_gt:\t%d\t%d\t%d\t%d\t%d\n", args->ntotal, args->ntested, args->nskipped, args->nnovel_al, args->nnovel_gt);
+    destroy_data(args);
+
+    return 0;
+}
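Note on binary_insert in the two contrast.c variants above: the background genotypes are kept as a sorted array without duplicates, so that binary_search can test membership. The following is a standalone sketch of such an insert, under stated assumptions. It uses a lower-bound search to compute the insertion point directly, so the shift never touches memory before the array, and it grows the buffer with plain realloc rather than htslib's hts_expand. The name `sorted_insert` and its return codes are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Insert val into the sorted, duplicate-free array *dat.
   Returns 1 if inserted, 0 if already present, -1 if allocation failed. */
static int sorted_insert(uint32_t val, uint32_t **dat, int *ndat, int *mdat)
{
    int lo = 0, hi = *ndat;             /* search the half-open range [lo,hi) */
    while ( lo < hi )
    {
        int mid = lo + (hi - lo)/2;
        if ( (*dat)[mid] < val ) lo = mid + 1;
        else hi = mid;
    }
    /* lo is now the first index whose element is >= val */
    if ( lo < *ndat && (*dat)[lo] == val ) return 0;

    if ( *ndat == *mdat )               /* buffer full, grow it */
    {
        int m = *mdat ? 2 * *mdat : 4;
        uint32_t *tmp = realloc(*dat, sizeof(uint32_t)*m);
        if ( !tmp ) return -1;
        *dat  = tmp;
        *mdat = m;
    }
    /* shift the tail [lo,ndat) up by one slot and place val at lo */
    memmove(*dat + lo + 1, *dat + lo, sizeof(uint32_t)*(*ndat - lo));
    (*dat)[lo] = val;
    (*ndat)++;
    return 1;
}
```

Starting from an empty array (`NULL` pointer, zero length and capacity), inserting 5, 1, 3, 3, 9, 0 leaves the five values 0, 1, 3, 5, 9 in order, and the second 3 is reported as already present.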
diff --git a/bcftools/plugins/counts.c b/bcftools/plugins/counts.c
new file mode 100644
index 0000000..2d57ba4
--- /dev/null
+++ b/bcftools/plugins/counts.c
@@ -0,0 +1,82 @@
+/*  plugins/counts.c -- counts SNPs, Indels, and total number of sites.
+
+    Copyright (C) 2013, 2014 Genome Research Ltd.
+
+    Author: Petr Danecek <pd3@sanger.ac.uk>
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE.  */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <htslib/vcf.h>
+
+int nsamples, nsnps, nindels, nmnps, nothers, nsites;
+
+/*
+    This short description is used to generate the output of `bcftools plugin -l`.
+*/
+const char *about(void)
+{
+    return
+        "A minimal plugin which counts number of samples, SNPs,\n"
+        "INDELs, MNPs and total number of sites.\n";
+}
+
+/*
+    Called once at startup; use it to initialize local variables.
+    Return 1 to suppress VCF/BCF header from printing, 0 otherwise.
+*/
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+    nsamples = bcf_hdr_nsamples(in);
+    nsnps = nindels = nmnps = nothers = nsites = 0;
+    return 1;
+}
+
+
+/*
+    Called for each VCF record. Return rec to output the line or NULL
+    to suppress output.
+*/
+bcf1_t *process(bcf1_t *rec)
+{
+    int type = bcf_get_variant_types(rec);
+    if ( type & VCF_SNP ) nsnps++;
+    if ( type & VCF_INDEL ) nindels++;
+    if ( type & VCF_MNP ) nmnps++;
+    if ( type & VCF_OTHER ) nothers++;
+    nsites++;
+    return NULL;
+}
+
+
+/*
+    Clean up.
+*/
+void destroy(void)
+{
+    printf("Number of samples: %d\n", nsamples);
+    printf("Number of SNPs:    %d\n", nsnps);
+    printf("Number of INDELs:  %d\n", nindels);
+    printf("Number of MNPs:    %d\n", nmnps);
+    printf("Number of others:  %d\n", nothers);
+    printf("Number of sites:   %d\n", nsites);
+}
+
+
diff --git a/bcftools/plugins/counts.c.pysam.c b/bcftools/plugins/counts.c.pysam.c
new file mode 100644
index 0000000..062808c
--- /dev/null
+++ b/bcftools/plugins/counts.c.pysam.c
@@ -0,0 +1,84 @@
+#include "bcftools.pysam.h"
+
+/*  plugins/counts.c -- counts SNPs, Indels, and total number of sites.
+
+    Copyright (C) 2013, 2014 Genome Research Ltd.
+
+    Author: Petr Danecek <pd3@sanger.ac.uk>
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE.  */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <htslib/vcf.h>
+
+int nsamples, nsnps, nindels, nmnps, nothers, nsites;
+
+/*
+    This short description is used to generate the output of `bcftools plugin -l`.
+*/
+const char *about(void)
+{
+    return
+        "A minimal plugin which counts number of samples, SNPs,\n"
+        "INDELs, MNPs and total number of sites.\n";
+}
+
+/*
+    Called once at startup; use it to initialize local variables.
+    Return 1 to suppress VCF/BCF header from printing, 0 otherwise.
+*/
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+    nsamples = bcf_hdr_nsamples(in);
+    nsnps = nindels = nmnps = nothers = nsites = 0;
+    return 1;
+}
+
+
+/*
+    Called for each VCF record. Return rec to output the line or NULL
+    to suppress output.
+*/
+bcf1_t *process(bcf1_t *rec)
+{
+    int type = bcf_get_variant_types(rec);
+    if ( type & VCF_SNP ) nsnps++;
+    if ( type & VCF_INDEL ) nindels++;
+    if ( type & VCF_MNP ) nmnps++;
+    if ( type & VCF_OTHER ) nothers++;
+    nsites++;
+    return NULL;
+}
+
+
+/*
+    Clean up.
+*/
+void destroy(void)
+{
+    fprintf(bcftools_stdout, "Number of samples: %d\n", nsamples);
+    fprintf(bcftools_stdout, "Number of SNPs:    %d\n", nsnps);
+    fprintf(bcftools_stdout, "Number of INDELs:  %d\n", nindels);
+    fprintf(bcftools_stdout, "Number of MNPs:    %d\n", nmnps);
+    fprintf(bcftools_stdout, "Number of others:  %d\n", nothers);
+    fprintf(bcftools_stdout, "Number of sites:   %d\n", nsites);
+}
+
+
diff --git a/bcftools/plugins/dosage.c b/bcftools/plugins/dosage.c
new file mode 100644
index 0000000..fdf17d2
--- /dev/null
+++ b/bcftools/plugins/dosage.c
@@ -0,0 +1,337 @@
+/*  plugins/dosage.c -- prints genotype dosage.
+
+    Copyright (C) 2014 Genome Research Ltd.
+
+    Author: Petr Danecek <pd3@sanger.ac.uk>
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE.  */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <htslib/vcf.h>
+#include <math.h>
+#include <getopt.h>
+#include "bcftools.h"
+
+
+/*
+    This short description is used to generate the output of `bcftools plugin -l`.
+*/
+const char *about(void)
+{
+    return "Prints genotype dosage determined from tags requested by the user.\n";
+}
+
+const char *usage(void)
+{
+    return 
+        "\n"
+        "About: Print genotype dosage\n"
+        "Usage: bcftools +dosage [General Options] -- [Plugin Options]\n"
+        "Options:\n"
+        "   run \"bcftools plugin\" for a list of common options\n"
+        "\n"
+        "Plugin options:\n"
+        "   -t, --tags <list>   VCF tags to determine the dosage from [PL,GL,GT]\n"
+        "\n"
+        "Example:\n"
+        "   bcftools +dosage in.vcf -- -t GT\n"
+        "\n";
+}
+
+bcf_hdr_t *in_hdr = NULL;
+int pl_type = 0, gl_type = 0;
+uint8_t *buf = NULL;
+int nbuf = 0;   // NB: number of elements, not bytes
+char **tags = NULL;
+int ntags = 0;
+float *vals = NULL, *dsg = NULL;
+int mvals, mdsg;
+
+typedef int (*dosage_f) (bcf1_t *);
+dosage_f *handlers = NULL;
+int nhandlers = 0;
+
+
+int calc_dosage_PL(bcf1_t *rec)
+{
+    int i, j, nret = bcf_get_format_values(in_hdr,rec,"PL",(void**)&buf,&nbuf,pl_type);
+    if ( nret<0 ) return -1;
+
+    nret /= rec->n_sample;
+    if ( nret != rec->n_allele*(rec->n_allele+1)/2 ) return -1;     // not diploid
+    hts_expand(float, nret, mvals, vals);
+    hts_expand(float, rec->n_allele, mdsg, dsg);
+    #define BRANCH(type_t,is_missing,is_vector_end) \
+    { \
+        type_t *ptr = (type_t*) buf; \
+        for (i=0; i<rec->n_sample; i++) \
+        { \
+            float sum = 0; \
+            for (j=0; j<nret; j++) \
+            { \
+                if ( is_missing || is_vector_end ) break; \
+                vals[j] = exp(-0.1*ptr[j]); \
+                sum += vals[j]; \
+            } \
+            if ( j<nret ) \
+                for (j=0; j<rec->n_allele; j++) dsg[j] = -1; \
+            else \
+            { \
+                if ( sum ) for (j=0; j<nret; j++) vals[j] /= sum; \
+                memset(dsg, 0, sizeof(float)*rec->n_allele); \
+                int k, l = 0; \
+                for (j=0; j<rec->n_allele; j++) \
+                { \
+                    for (k=0; k<=j; k++) \
+                    { \
+                        dsg[j] += vals[l]; \
+                        dsg[k] += vals[l++];    /* advance to the next genotype */ \
+                    } \
+                } \
+            } \
+            for (j=1; j<rec->n_allele; j++) \
+                printf("%c%.1f",j==1?'\t':',',dsg[j]); \
+            ptr += nret; \
+        } \
+    }
+    switch (pl_type)
+    {
+        case BCF_HT_INT:  BRANCH(int32_t,ptr[j]==bcf_int32_missing,ptr[j]==bcf_int32_vector_end); break;
+        case BCF_HT_REAL: BRANCH(float,bcf_float_is_missing(ptr[j]),bcf_float_is_vector_end(ptr[j])); break;
+    }
+    #undef BRANCH
+    return 0;
+}
+
+int calc_dosage_GL(bcf1_t *rec)
+{
+    int i, j, nret = bcf_get_format_values(in_hdr,rec,"GL",(void**)&buf,&nbuf,gl_type);
+    if ( nret<0 ) return -1;
+
+    nret /= rec->n_sample;
+    if ( nret != rec->n_allele*(rec->n_allele+1)/2 ) return -1;     // not diploid
+    hts_expand(float, nret, mvals, vals);
+    hts_expand(float, rec->n_allele, mdsg, dsg);
+    #define BRANCH(type_t,is_missing,is_vector_end) \
+    { \
+        type_t *ptr = (type_t*) buf; \
+        for (i=0; i<rec->n_sample; i++) \
+        { \
+            float sum = 0; \
+            for (j=0; j<nret; j++) \
+            { \
+                if ( is_missing || is_vector_end ) break; \
+                vals[j] = exp(ptr[j]); \
+                sum += vals[j]; \
+            } \
+            if ( j<nret ) \
+                for (j=0; j<rec->n_allele; j++) dsg[j] = -1; \
+            else \
+            { \
+                for (; j<nret; j++) vals[j] = 0; \
+                if ( sum ) for (j=0; j<nret; j++) vals[j] /= sum; \
+                memset(dsg, 0, sizeof(float)*rec->n_allele); \
+                int k, l = 0; \
+                for (j=0; j<rec->n_allele; j++) \
+                { \
+                    for (k=0; k<=j; k++) \
+                    { \
+                        dsg[j] += vals[l]; \
+                        dsg[k] += vals[l++];    /* advance to the next genotype */ \
+                    } \
+                } \
+            } \
+            for (j=1; j<rec->n_allele; j++) \
+                printf("%c%.1f",j==1?'\t':',',dsg[j]); \
+            ptr += nret; \
+        } \
+    }
+    switch (gl_type)
+    {
+        case BCF_HT_INT:  BRANCH(int32_t,ptr[j]==bcf_int32_missing,ptr[j]==bcf_int32_vector_end); break;
+        case BCF_HT_REAL: BRANCH(float,bcf_float_is_missing(ptr[j]),bcf_float_is_vector_end(ptr[j])); break;
+    }
+    #undef BRANCH
+    return 0;
+}
+
+int calc_dosage_GT(bcf1_t *rec)
+{
+    int i, j, nret = bcf_get_genotypes(in_hdr,rec,(void**)&buf,&nbuf);
+    if ( nret<0 ) return -1;
+
+    nret /= rec->n_sample;
+    int32_t *ptr = (int32_t*) buf;
+    hts_expand(float, rec->n_allele, mdsg, dsg);
+    for (i=0; i<rec->n_sample; i++)
+    {
+        memset(dsg, 0, sizeof(float)*rec->n_allele);
+        for (j=0; j<nret; j++)
+        {
+            if ( ptr[j]==bcf_int32_vector_end || bcf_gt_is_missing(ptr[j]) ) break;
+            int idx = bcf_gt_allele(ptr[j]);
+            if ( idx >= rec->n_allele ) error("The allele index is out of range at %s:%d\n", bcf_seqname(in_hdr,rec),rec->pos+1);
+            dsg[idx] += 1;
+        }
+        if ( !j )
+            for (j=0; j<rec->n_allele; j++) dsg[j] = -1;
+        for (j=1; j<rec->n_allele; j++)
+            printf("%c%.1f",j==1?'\t':',',dsg[j]);
+        ptr += nret;
+    }
+    return 0;
+}
+
+
+char **split_list(char *str, int *nitems)
+{
+    int n = 0, done = 0;
+    char *ss = strdup(str), **out = NULL;
+    while ( !done && *ss )
+    {
+        char *se = ss;
+        while ( *se && *se!=',' ) se++;
+        if ( !*se ) done = 1;
+        *se = 0;
+        n++;
+        out = (char**) realloc(out,sizeof(char*)*n);
+        out[n-1] = ss;
+        ss = se+1;
+    }
+    *nitems = n;
+    return out;
+}
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+    int i, id, c;
+    char *tags_str = "PL,GL,GT";
+
+    static struct option loptions[] =
+    {
+        {"tags",1,0,'t'},
+        {0,0,0,0}
+    };
+    while ((c = getopt_long(argc, argv, "t:?h",loptions,NULL)) >= 0)
+    {
+        switch (c) 
+        {
+            case 't': tags_str = optarg; break;
+            case 'h':
+            case '?':
+            default: fprintf(stderr,"%s", usage()); exit(1); break;
+        }
+    }
+    tags = split_list(tags_str, &ntags);
+
+    in_hdr = in;
+    for (i=0; i<ntags; i++)
+    {
+        if ( !strcmp("PL",tags[i]) )
+        {
+            id = bcf_hdr_id2int(in_hdr,BCF_DT_ID,"PL");
+            if ( bcf_hdr_idinfo_exists(in_hdr,BCF_HL_FMT,id) )
+            {
+                pl_type = bcf_hdr_id2type(in_hdr,BCF_HL_FMT,id);
+                if ( pl_type!=BCF_HT_INT && pl_type!=BCF_HT_REAL )
+                {
+                    fprintf(stderr,"Expected numeric type of FORMAT/PL\n");
+                    return -1;
+                }
+                handlers = (dosage_f*) realloc(handlers,(nhandlers+1)*sizeof(*handlers));
+                handlers[nhandlers++] = calc_dosage_PL;
+            }
+        }
+        else if ( !strcmp("GL",tags[i]) )
+        {
+            id = bcf_hdr_id2int(in_hdr,BCF_DT_ID,"GL");
+            if ( bcf_hdr_idinfo_exists(in_hdr,BCF_HL_FMT,id) )
+            {
+                gl_type = bcf_hdr_id2type(in_hdr,BCF_HL_FMT,id);
+                if ( gl_type!=BCF_HT_INT && gl_type!=BCF_HT_REAL )
+                {
+                    fprintf(stderr,"Expected numeric type of FORMAT/GL\n");
+                    return -1;
+                }
+                handlers = (dosage_f*) realloc(handlers,(nhandlers+1)*sizeof(*handlers));
+                handlers[nhandlers++] = calc_dosage_GL;
+            }
+        }
+        else if ( !strcmp("GT",tags[i]) )
+        {
+            handlers = (dosage_f*) realloc(handlers,(nhandlers+1)*sizeof(*handlers));
+            handlers[nhandlers++] = calc_dosage_GT;
+        }
+        else
+        {
+            fprintf(stderr,"No handler for tag \"%s\"\n", tags[i]);
+            return -1;
+        }
+    }
+    free(tags[0]);
+    free(tags);
+
+    printf("#[1]CHROM\t[2]POS\t[3]REF\t[4]ALT");
+    for (i=0; i<bcf_hdr_nsamples(in_hdr); i++) printf("\t[%d]%s", i+5,in_hdr->samples[i]);
+    printf("\n");
+
+    return 1;
+}
+
+
+bcf1_t *process(bcf1_t *rec)
+{
+    int i,j, ret;
+
+    printf("%s\t%d\t%s", bcf_seqname(in_hdr,rec),rec->pos+1,rec->d.allele[0]);
+    if ( rec->n_allele == 1 ) printf("\t.");
+    else for (i=1; i<rec->n_allele; i++) printf("%c%s", i==1?'\t':',', rec->d.allele[i]);
+    if ( rec->n_allele==1 )
+    {
+        for (j=0; j<rec->n_sample; j++) printf("\t0.0");
+    }
+    else
+    {
+        for (i=0; i<nhandlers; i++)
+        {
+            ret = handlers[i](rec);
+            if ( !ret ) break;  // successfully printed
+        }
+        if ( i==nhandlers )
+        {
+            // none of the annotations present
+            for (i=0; i<rec->n_sample; i++) printf("\t-1.0");
+        }
+    }
+    printf("\n");
+
+    return NULL;
+}
+
+
+void destroy(void)
+{
+    free(vals);
+    free(dsg);
+    free(handlers);
+    free(buf);
+}
+
+
diff --git a/bcftools/plugins/dosage.c.pysam.c b/bcftools/plugins/dosage.c.pysam.c
new file mode 100644
index 0000000..cd6b437
--- /dev/null
+++ b/bcftools/plugins/dosage.c.pysam.c
@@ -0,0 +1,339 @@
+#include "bcftools.pysam.h"
+
+/*  plugins/dosage.c -- prints genotype dosage.
+
+    Copyright (C) 2014 Genome Research Ltd.
+
+    Author: Petr Danecek <pd3@sanger.ac.uk>
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE.  */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <htslib/vcf.h>
+#include <math.h>
+#include <getopt.h>
+#include "bcftools.h"
+
+
+/*
+    This short description is used to generate the output of `bcftools plugin -l`.
+*/
+const char *about(void)
+{
+    return "Prints genotype dosage determined from tags requested by the user.\n";
+}
+
+const char *usage(void)
+{
+    return 
+        "\n"
+        "About: Print genotype dosage\n"
+        "Usage: bcftools +dosage [General Options] -- [Plugin Options]\n"
+        "Options:\n"
+        "   run \"bcftools plugin\" for a list of common options\n"
+        "\n"
+        "Plugin options:\n"
+        "   -t, --tags <list>   VCF tags to determine the dosage from [PL,GL,GT]\n"
+        "\n"
+        "Example:\n"
+        "   bcftools +dosage in.vcf -- -t GT\n"
+        "\n";
+}
+
+bcf_hdr_t *in_hdr = NULL;
+int pl_type = 0, gl_type = 0;
+uint8_t *buf = NULL;
+int nbuf = 0;   // NB: number of elements, not bytes
+char **tags = NULL;
+int ntags = 0;
+float *vals = NULL, *dsg = NULL;
+int mvals, mdsg;
+
+typedef int (*dosage_f) (bcf1_t *);
+dosage_f *handlers = NULL;
+int nhandlers = 0;
+
+
+int calc_dosage_PL(bcf1_t *rec)
+{
+    int i, j, nret = bcf_get_format_values(in_hdr,rec,"PL",(void**)&buf,&nbuf,pl_type);
+    if ( nret<0 ) return -1;
+
+    nret /= rec->n_sample;
+    if ( nret != rec->n_allele*(rec->n_allele+1)/2 ) return -1;     // not diploid
+    hts_expand(float, nret, mvals, vals);
+    hts_expand(float, rec->n_allele, mdsg, dsg);
+    #define BRANCH(type_t,is_missing,is_vector_end) \
+    { \
+        type_t *ptr = (type_t*) buf; \
+        for (i=0; i<rec->n_sample; i++) \
+        { \
+            float sum = 0; \
+            for (j=0; j<nret; j++) \
+            { \
+                if ( is_missing || is_vector_end ) break; \
+                vals[j] = exp(-0.1*ptr[j]); \
+                sum += vals[j]; \
+            } \
+            if ( j<nret ) \
+                for (j=0; j<rec->n_allele; j++) dsg[j] = -1; \
+            else \
+            { \
+                if ( sum ) for (j=0; j<nret; j++) vals[j] /= sum; \
+                memset(dsg, 0, sizeof(float)*rec->n_allele); \
+                int k, l = 0; \
+                for (j=0; j<rec->n_allele; j++) \
+                { \
+                    for (k=0; k<=j; k++) \
+                    { \
+                        dsg[j] += vals[l]; \
+                        dsg[k] += vals[l++];    /* advance to the next genotype */ \
+                    } \
+                } \
+            } \
+            for (j=1; j<rec->n_allele; j++) \
+                fprintf(bcftools_stdout, "%c%.1f",j==1?'\t':',',dsg[j]); \
+            ptr += nret; \
+        } \
+    }
+    switch (pl_type)
+    {
+        case BCF_HT_INT:  BRANCH(int32_t,ptr[j]==bcf_int32_missing,ptr[j]==bcf_int32_vector_end); break;
+        case BCF_HT_REAL: BRANCH(float,bcf_float_is_missing(ptr[j]),bcf_float_is_vector_end(ptr[j])); break;
+    }
+    #undef BRANCH
+    return 0;
+}
+
+int calc_dosage_GL(bcf1_t *rec)
+{
+    int i, j, nret = bcf_get_format_values(in_hdr,rec,"GL",(void**)&buf,&nbuf,gl_type);
+    if ( nret<0 ) return -1;
+
+    nret /= rec->n_sample;
+    if ( nret != rec->n_allele*(rec->n_allele+1)/2 ) return -1;     // not diploid
+    hts_expand(float, nret, mvals, vals);
+    hts_expand(float, rec->n_allele, mdsg, dsg);
+    #define BRANCH(type_t,is_missing,is_vector_end) \
+    { \
+        type_t *ptr = (type_t*) buf; \
+        for (i=0; i<rec->n_sample; i++) \
+        { \
+            float sum = 0; \
+            for (j=0; j<nret; j++) \
+            { \
+                if ( is_missing || is_vector_end ) break; \
+                vals[j] = exp(ptr[j]); \
+                sum += vals[j]; \
+            } \
+            if ( j<nret ) \
+                for (j=0; j<rec->n_allele; j++) dsg[j] = -1; \
+            else \
+            { \
+                for (; j<nret; j++) vals[j] = 0; \
+                if ( sum ) for (j=0; j<nret; j++) vals[j] /= sum; \
+                memset(dsg, 0, sizeof(float)*rec->n_allele); \
+                int k, l = 0; \
+                for (j=0; j<rec->n_allele; j++) \
+                { \
+                    for (k=0; k<=j; k++) \
+                    { \
+                        dsg[j] += vals[l]; \
+                        dsg[k] += vals[l++];    /* advance to the next genotype */ \
+                    } \
+                } \
+            } \
+            for (j=1; j<rec->n_allele; j++) \
+                fprintf(bcftools_stdout, "%c%.1f",j==1?'\t':',',dsg[j]); \
+            ptr += nret; \
+        } \
+    }
+    switch (gl_type)
+    {
+        case BCF_HT_INT:  BRANCH(int32_t,ptr[j]==bcf_int32_missing,ptr[j]==bcf_int32_vector_end); break;
+        case BCF_HT_REAL: BRANCH(float,bcf_float_is_missing(ptr[j]),bcf_float_is_vector_end(ptr[j])); break;
+    }
+    #undef BRANCH
+    return 0;
+}
+
+int calc_dosage_GT(bcf1_t *rec)
+{
+    int i, j, nret = bcf_get_genotypes(in_hdr,rec,(void**)&buf,&nbuf);
+    if ( nret<0 ) return -1;
+
+    nret /= rec->n_sample;
+    int32_t *ptr = (int32_t*) buf;
+    hts_expand(float, rec->n_allele, mdsg, dsg);
+    for (i=0; i<rec->n_sample; i++)
+    {
+        memset(dsg, 0, sizeof(float)*rec->n_allele);
+        for (j=0; j<nret; j++)
+        {
+            if ( ptr[j]==bcf_int32_vector_end || bcf_gt_is_missing(ptr[j]) ) break;
+            int idx = bcf_gt_allele(ptr[j]);
+            if ( idx >= rec->n_allele ) error("The allele index is out of range at %s:%d\n", bcf_seqname(in_hdr,rec),rec->pos+1);
+            dsg[idx] += 1;
+        }
+        if ( !j )
+            for (j=0; j<rec->n_allele; j++) dsg[j] = -1;
+        for (j=1; j<rec->n_allele; j++)
+            fprintf(bcftools_stdout, "%c%.1f",j==1?'\t':',',dsg[j]);
+        ptr += nret;
+    }
+    return 0;
+}
+
+
+char **split_list(char *str, int *nitems)
+{
+    int n = 0, done = 0;
+    char *ss = strdup(str), **out = NULL;
+    while ( !done && *ss )
+    {
+        char *se = ss;
+        while ( *se && *se!=',' ) se++;
+        if ( !*se ) done = 1;
+        *se = 0;
+        n++;
+        out = (char**) realloc(out,sizeof(char*)*n);
+        out[n-1] = ss;
+        ss = se+1;
+    }
+    *nitems = n;
+    return out;
+}
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+    int i, id, c;
+    char *tags_str = "PL,GL,GT";
+
+    static struct option loptions[] =
+    {
+        {"tags",1,0,'t'},
+        {0,0,0,0}
+    };
+    while ((c = getopt_long(argc, argv, "t:?h",loptions,NULL)) >= 0)
+    {
+        switch (c) 
+        {
+            case 't': tags_str = optarg; break;
+            case 'h':
+            case '?':
+            default: fprintf(bcftools_stderr,"%s", usage()); exit(1); break;
+        }
+    }
+    tags = split_list(tags_str, &ntags);
+
+    in_hdr = in;
+    for (i=0; i<ntags; i++)
+    {
+        if ( !strcmp("PL",tags[i]) )
+        {
+            id = bcf_hdr_id2int(in_hdr,BCF_DT_ID,"PL");
+            if ( bcf_hdr_idinfo_exists(in_hdr,BCF_HL_FMT,id) )
+            {
+                pl_type = bcf_hdr_id2type(in_hdr,BCF_HL_FMT,id);
+                if ( pl_type!=BCF_HT_INT && pl_type!=BCF_HT_REAL )
+                {
+                    fprintf(bcftools_stderr,"Expected numeric type of FORMAT/PL\n");
+                    return -1;
+                }
+                handlers = (dosage_f*) realloc(handlers,(nhandlers+1)*sizeof(*handlers));
+                handlers[nhandlers++] = calc_dosage_PL;
+            }
+        }
+        else if ( !strcmp("GL",tags[i]) )
+        {
+            id = bcf_hdr_id2int(in_hdr,BCF_DT_ID,"GL");
+            if ( bcf_hdr_idinfo_exists(in_hdr,BCF_HL_FMT,id) )
+            {
+                gl_type = bcf_hdr_id2type(in_hdr,BCF_HL_FMT,id);
+                if ( gl_type!=BCF_HT_INT && gl_type!=BCF_HT_REAL )
+                {
+                    fprintf(bcftools_stderr,"Expected numeric type of FORMAT/GL\n");
+                    return -1;
+                }
+                handlers = (dosage_f*) realloc(handlers,(nhandlers+1)*sizeof(*handlers));
+                handlers[nhandlers++] = calc_dosage_GL;
+            }
+        }
+        else if ( !strcmp("GT",tags[i]) )
+        {
+            handlers = (dosage_f*) realloc(handlers,(nhandlers+1)*sizeof(*handlers));
+            handlers[nhandlers++] = calc_dosage_GT;
+        }
+        else
+        {
+            fprintf(bcftools_stderr,"No handler for tag \"%s\"\n", tags[i]);
+            return -1;
+        }
+    }
+    free(tags[0]);
+    free(tags);
+
+    fprintf(bcftools_stdout, "#[1]CHROM\t[2]POS\t[3]REF\t[4]ALT");
+    for (i=0; i<bcf_hdr_nsamples(in_hdr); i++) fprintf(bcftools_stdout, "\t[%d]%s", i+5,in_hdr->samples[i]);
+    fprintf(bcftools_stdout, "\n");
+
+    return 1;
+}
+
+
+bcf1_t *process(bcf1_t *rec)
+{
+    int i,j, ret;
+
+    fprintf(bcftools_stdout, "%s\t%d\t%s", bcf_seqname(in_hdr,rec),rec->pos+1,rec->d.allele[0]);
+    if ( rec->n_allele == 1 ) fprintf(bcftools_stdout, "\t.");
+    else for (i=1; i<rec->n_allele; i++) fprintf(bcftools_stdout, "%c%s", i==1?'\t':',', rec->d.allele[i]);
+    if ( rec->n_allele==1 )
+    {
+        for (j=0; j<rec->n_sample; j++) fprintf(bcftools_stdout, "\t0.0");
+    }
+    else
+    {
+        for (i=0; i<nhandlers; i++)
+        {
+            ret = handlers[i](rec);
+            if ( !ret ) break;  // successfully printed
+        }
+        if ( i==nhandlers )
+        {
+            // none of the annotations present
+            for (i=0; i<rec->n_sample; i++) fprintf(bcftools_stdout, "\t-1.0");
+        }
+    }
+    fprintf(bcftools_stdout, "\n");
+
+    return NULL;
+}
+
+
+void destroy(void)
+{
+    free(vals);
+    free(dsg);
+    free(handlers);
+    free(buf);
+}
+
+
diff --git a/bcftools/plugins/fill-AN-AC.c b/bcftools/plugins/fill-AN-AC.c
new file mode 100644
index 0000000..7d6c6f4
--- /dev/null
+++ b/bcftools/plugins/fill-AN-AC.c
@@ -0,0 +1,67 @@
+/*  plugins/fill-AN-AC.c -- fills AN and AC INFO fields.
+
+    Copyright (C) 2014 Genome Research Ltd.
+
+    Author: Petr Danecek <pd3@sanger.ac.uk>
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE.  */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <htslib/hts.h>
+#include <htslib/vcf.h>
+#include <htslib/vcfutils.h>
+
+bcf_hdr_t *in_hdr, *out_hdr;
+int *arr = NULL, marr = 0;
+
+const char *about(void)
+{
+    return "Fill INFO fields AN and AC.\n";
+}
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+    in_hdr  = in;
+    out_hdr = out;
+    bcf_hdr_append(out_hdr, "##INFO=<ID=AC,Number=A,Type=Integer,Description=\"Allele count in genotypes\">");
+    bcf_hdr_append(out_hdr, "##INFO=<ID=AN,Number=1,Type=Integer,Description=\"Total number of alleles in called genotypes\">");
+    return 0;
+}
+
+bcf1_t *process(bcf1_t *rec)
+{
+    hts_expand(int,rec->n_allele,marr,arr);
+    int ret = bcf_calc_ac(in_hdr,rec,arr,BCF_UN_FMT);
+    if ( ret>0 )
+    {
+        int i, an = 0;
+        for (i=0; i<rec->n_allele; i++) an += arr[i];
+        bcf_update_info_int32(out_hdr, rec, "AN", &an, 1);
+        bcf_update_info_int32(out_hdr, rec, "AC", arr+1, rec->n_allele-1);
+    }
+    return rec;
+}
+
+void destroy(void)
+{
+    free(arr);
+}
+
+
diff --git a/bcftools/plugins/fill-AN-AC.c.pysam.c b/bcftools/plugins/fill-AN-AC.c.pysam.c
new file mode 100644
index 0000000..88690d5
--- /dev/null
+++ b/bcftools/plugins/fill-AN-AC.c.pysam.c
@@ -0,0 +1,69 @@
+#include "bcftools.pysam.h"
+
+/*  plugins/fill-AN-AC.c -- fills AN and AC INFO fields.
+
+    Copyright (C) 2014 Genome Research Ltd.
+
+    Author: Petr Danecek <pd3@sanger.ac.uk>
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE.  */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <htslib/hts.h>
+#include <htslib/vcf.h>
+#include <htslib/vcfutils.h>
+
+bcf_hdr_t *in_hdr, *out_hdr;
+int *arr = NULL, marr = 0;
+
+const char *about(void)
+{
+    return "Fill INFO fields AN and AC.\n";
+}
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+    in_hdr  = in;
+    out_hdr = out;
+    bcf_hdr_append(out_hdr, "##INFO=<ID=AC,Number=A,Type=Integer,Description=\"Allele count in genotypes\">");
+    bcf_hdr_append(out_hdr, "##INFO=<ID=AN,Number=1,Type=Integer,Description=\"Total number of alleles in called genotypes\">");
+    return 0;
+}
+
+bcf1_t *process(bcf1_t *rec)
+{
+    hts_expand(int,rec->n_allele,marr,arr);
+    int ret = bcf_calc_ac(in_hdr,rec,arr,BCF_UN_FMT);
+    if ( ret>0 )
+    {
+        int i, an = 0;
+        for (i=0; i<rec->n_allele; i++) an += arr[i];
+        bcf_update_info_int32(out_hdr, rec, "AN", &an, 1);
+        bcf_update_info_int32(out_hdr, rec, "AC", arr+1, rec->n_allele-1);
+    }
+    return rec;
+}
+
+void destroy(void)
+{
+    free(arr);
+}
+
+
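The `process()` callback in fill-AN-AC.c derives AN as the sum of all per-allele counts and AC as the counts of the ALT alleles only (`arr+1`). The following is a minimal standalone sketch of that arithmetic without htslib. The name `an_ac_from_counts` and its parameters are hypothetical, introduced here for illustration, and are not part of bcftools.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a bcftools function).
 * counts[0] is the REF allele count, counts[1..n_allele-1] are ALT counts.
 * Returns AN (sum over all alleles); *ac is set to point at the ALT counts
 * and *n_ac to their number, mirroring the plugin's use of arr+1. */
static int an_ac_from_counts(const int *counts, int n_allele,
                             const int **ac, int *n_ac)
{
    assert(counts != NULL && n_allele >= 1);

    int an = 0;
    for (int i = 0; i < n_allele; i++)
        an += counts[i];

    *ac   = counts + 1;      /* skip the REF count */
    *n_ac = n_allele - 1;    /* one AC value per ALT allele (Number=A) */
    return an;
}
```

For counts `{5, 2, 1}` (REF, ALT1, ALT2) this yields AN = 8 and AC = 2,1.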
diff --git a/bcftools/plugins/fill-from-fasta.c b/bcftools/plugins/fill-from-fasta.c
new file mode 100644
index 0000000..80e7a8d
--- /dev/null
+++ b/bcftools/plugins/fill-from-fasta.c
@@ -0,0 +1,206 @@
+/*  plugin/fill-from-fasta.c -- fill-from-fasta plugin.
+
+    Copyright (C) 2016 Genome Research Ltd.
+
+    Author: Shane McCarthy <sm15@sanger.ac.uk>
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE.  */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <strings.h>
+#include <getopt.h>
+#include <htslib/vcf.h>
+#include <htslib/faidx.h>
+#include <htslib/kseq.h>
+#include "filter.h"
+#include "bcftools.h"
+
+const char *about(void)
+{
+    return "Fill INFO or REF field based on values in a fasta file\n";
+}
+
+const char *usage(void)
+{
+    return 
+"\n"
+"About:   Fill INFO or REF field based on values in a fasta file.\n"
+"         The fasta file must be indexed with samtools faidx.\n"
+"Usage:   bcftools +fill-from-fasta [General Options] -- [Plugin Options]\n"
+"\n"
+"General options:\n"
+"   run \"bcftools plugin\" for a list of common options\n"
+"\n"
+"Plugin options:\n"
+"   -c, --column <str>          REF or INFO tag, e.g. AA for ancestral allele\n"
+"   -f, --fasta <file>          fasta file\n"
+"   -h, --header-lines <file>   optional file containing header lines to append\n"
+"   -i, --include <expr>        annotate only records passing filter expression\n"
+"   -e, --exclude <expr>        annotate only records failing filter expression\n"
+
+"\n"
+"Examples:\n"
+"   # fill ancestral allele as INFO/AA for SNP records\n"
+"   echo '##INFO=<ID=AA,Number=1,Type=String,Description=\"Ancestral allele\">' > aa.hdr\n"
+"   bcftools +fill-from-fasta in.vcf -- -c AA -f aa.fasta -h aa.hdr -i 'TYPE=\"snp\"'\n"
+"\n"
+"   # fix the REF allele in VCFs where REF=N or other\n"
+"   bcftools +fill-from-fasta in.vcf -- -c REF -f reference.fasta\n"
+"\n"
+"   # select sites marked as P (PASS) in the 1000G Phase3 mask\n"
+"   echo '##INFO=<ID=P3_MASK,Number=1,Type=String,Description=\"1000G Phase 3 mask\">' > mask.hdr\n"
+"   bcftools +fill-from-fasta in.vcf -Ou -- -c P3_MASK -f 1000G_mask.fasta -h mask.hdr | bcftools view -i 'P3_MASK=\"P\"'\n"
+"\n";
+}
+
+bcf_hdr_t *in_hdr = NULL, *out_hdr = NULL;
+faidx_t *faidx;
+int anno = 0;
+char *column = NULL;
+
+#define ANNO_REF 1
+#define ANNO_STRING 2
+#define ANNO_INT 3
+
+filter_t *filter;
+char *filter_str;
+int filter_logic;   // one of FLT_INCLUDE/FLT_EXCLUDE (-i or -e)
+
+#define FLT_INCLUDE 1
+#define FLT_EXCLUDE 2
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+    int c;
+    char *ref_fname = NULL, *header_fname = NULL;
+    static struct option loptions[] =
+    {
+        {"exclude",required_argument,NULL,'e'},
+        {"include",required_argument,NULL,'i'},
+        {"column",required_argument,NULL,'c'},
+        {"fasta",required_argument,NULL,'f'},
+        {"header-lines",required_argument,NULL,'h'},
+        {NULL,0,NULL,0}
+    };
+    while ((c = getopt_long(argc, argv, "c:f:?h:i:e:",loptions,NULL)) >= 0)
+    {
+        switch (c) 
+        {
+            case 'e': filter_str = optarg; filter_logic |= FLT_EXCLUDE; break;
+            case 'i': filter_str = optarg; filter_logic |= FLT_INCLUDE; break;
+            case 'c': column = optarg; break;
+            case 'f': ref_fname = optarg; break;
+            case 'h': header_fname = optarg; break;
+            case '?':
+            default: fprintf(stderr,"%s", usage()); exit(1); break;
+        }
+    }
+    in_hdr  = in;
+    out_hdr = out;
+    if ( filter_logic == (FLT_EXCLUDE|FLT_INCLUDE) ) { fprintf(stderr,"Only one of -i or -e can be given.\n"); return -1; }
+
+    if ( !column )
+    {
+        fprintf(stderr,"--column option is required.\n");
+        return -1;
+    }
+    if (header_fname)
+    {
+        htsFile *file = hts_open(header_fname, "rb");
+        if ( !file ) { fprintf(stderr,"Error reading %s\n", header_fname); return -1; }
+        kstring_t str = {0,0,0};
+        while ( hts_getline(file, KS_SEP_LINE, &str) > 0 )
+        {
+            if ( bcf_hdr_append(out_hdr, str.s) ) { fprintf(stderr,"Could not parse %s: %s\n", header_fname, str.s); return -1; }
+        }
+        hts_close(file);
+        free(str.s);
+        bcf_hdr_sync(out_hdr);
+    }
+    if (!strcasecmp("REF", column)) anno = ANNO_REF;
+    else {
+        if ( !strncasecmp(column,"INFO/",5) ) column += 5;
+        int hdr_id = bcf_hdr_id2int(out_hdr, BCF_DT_ID, column);
+        if (hdr_id<0) { fprintf(stderr,"No header ID found for %s. Header lines can be added with the --header-lines option\n", column); return -1; }
+        switch ( bcf_hdr_id2type(out_hdr,BCF_HL_INFO,hdr_id) )
+        {
+            case BCF_HT_INT:
+                anno=ANNO_INT;
+                break;
+            case BCF_HT_STR:
+                anno=ANNO_STRING;
+                break;
+            default:
+                fprintf(stderr,"The type of %s not recognised (%d)\n", column, bcf_hdr_id2type(out_hdr,BCF_HL_INFO,hdr_id));
+                return -1;
+        }
+    }
+    if ( !ref_fname )
+    {
+        fprintf(stderr,"No fasta given.\n");
+        return -1;
+    }
+    faidx = fai_load(ref_fname);
+    if ( filter_str )
+        filter = filter_init(in, filter_str);
+    return 0;
+}
+
+bcf1_t *process(bcf1_t *rec)
+{
+    // the filter (-i/-e) determines whether the record is annotated;
+    // records it rules out are returned unchanged
+    if ( filter )
+    {
+        int ret = filter_test(filter, rec, NULL);
+        if ( filter_logic==FLT_INCLUDE ) { if ( !ret ) return rec; }
+        else if ( ret ) return rec;
+    }
+
+    int i;
+    char *ref = rec->d.allele[0];
+    int ref_len = strlen(ref);
+    int fa_len;
+    // could be sped up here by fetching the whole chromosome? could assume
+    // sorted, but revert to this when non-sorted records found?
+    char *fa = faidx_fetch_seq(faidx, bcf_seqname(in_hdr,rec), rec->pos, rec->pos+ref_len-1, &fa_len);
+    if ( !fa ) error("faidx_fetch_seq failed at %s:%d\n", bcf_hdr_id2name(in_hdr,rec->rid), rec->pos+1);
+    for (i=0; i<fa_len; i++)
+        if ( (int)fa[i]>96 ) fa[i] -= 32;
+
+    assert(ref_len == fa_len);
+    if (anno==ANNO_REF)
+        strncpy(rec->d.allele[0], fa, fa_len);
+    else if (anno==ANNO_STRING)
+        bcf_update_info_string(out_hdr, rec, column, fa);
+    else if (anno==ANNO_INT && ref_len==1)
+    {
+        int val = atoi(&fa[0]);
+        bcf_update_info_int32(out_hdr, rec, column, &val, 1);
+    }
+    free(fa);
+    return rec;
+}
+
+void destroy(void)
+{
+    fai_destroy(faidx);
+    if (filter) filter_destroy(filter);
+}
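`process()` in fill-from-fasta.c upper-cases the fetched sequence by subtracting 32 from every byte above 96, which also alters non-letter bytes such as `{` or `~`. One possible alternative, shown only as a sketch, is `toupper` from `<ctype.h>`. The helper `upcase_seq` is a hypothetical name, not a bcftools function, and the sketch assumes the default "C" locale.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper (not part of bcftools): upper-case a sequence buffer
 * in place. Only letters are changed; other bytes are left untouched.
 * The cast to unsigned char avoids undefined behaviour for negative chars. */
static void upcase_seq(char *seq, size_t len)
{
    assert(seq != NULL);
    for (size_t i = 0; i < len; i++)
        seq[i] = (char) toupper((unsigned char) seq[i]);
}
```

Applied to the buffer `"acgTn~"`, this leaves `"ACGTN~"`, whereas the subtraction in the plugin would turn the trailing `~` into `^`.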
diff --git a/bcftools/plugins/fill-from-fasta.c.pysam.c b/bcftools/plugins/fill-from-fasta.c.pysam.c
new file mode 100644
index 0000000..c3b6da2
--- /dev/null
+++ b/bcftools/plugins/fill-from-fasta.c.pysam.c
@@ -0,0 +1,208 @@
+#include "bcftools.pysam.h"
+
+/*  plugin/fill-from-fasta.c -- fill-from-fasta plugin.
+
+    Copyright (C) 2016 Genome Research Ltd.
+
+    Author: Shane McCarthy <sm15@sanger.ac.uk>
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE.  */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <strings.h>
+#include <getopt.h>
+#include <htslib/vcf.h>
+#include <htslib/faidx.h>
+#include <htslib/kseq.h>
+#include "filter.h"
+#include "bcftools.h"
+
+const char *about(void)
+{
+    return "Fill INFO or REF field based on values in a fasta file\n";
+}
+
+const char *usage(void)
+{
+    return 
+"\n"
+"About:   Fill INFO or REF field based on values in a fasta file.\n"
+"         The fasta file must be indexed with samtools faidx.\n"
+"Usage:   bcftools +fill-from-fasta [General Options] -- [Plugin Options]\n"
+"\n"
+"General options:\n"
+"   run \"bcftools plugin\" for a list of common options\n"
+"\n"
+"Plugin options:\n"
+"   -c, --column <str>          REF or INFO tag, e.g. AA for ancestral allele\n"
+"   -f, --fasta <file>          fasta file\n"
+"   -h, --header-lines <file>   optional file containing header lines to append\n"
+"   -i, --include <expr>        annotate only records passing filter expression\n"
+"   -e, --exclude <expr>        annotate only records failing filter expression\n"
+
+"\n"
+"Examples:\n"
+"   # fill ancestral allele as INFO/AA for SNP records\n"
+"   echo '##INFO=<ID=AA,Number=1,Type=String,Description=\"Ancestral allele\">' > aa.hdr\n"
+"   bcftools +fill-from-fasta in.vcf -- -c AA -f aa.fasta -h aa.hdr -i 'TYPE=\"snp\"'\n"
+"\n"
+"   # fix the REF allele in VCFs where REF=N or other\n"
+"   bcftools +fill-from-fasta in.vcf -- -c REF -f reference.fasta\n"
+"\n"
+"   # select sites marked as P (PASS) in the 1000G Phase3 mask\n"
+"   echo '##INFO=<ID=P3_MASK,Number=1,Type=String,Description=\"1000G Phase 3 mask\">' > mask.hdr\n"
+"   bcftools +fill-from-fasta in.vcf -Ou -- -c P3_MASK -f 1000G_mask.fasta -h mask.hdr | bcftools view -i 'P3_MASK=\"P\"'\n"
+"\n";
+}
+
+bcf_hdr_t *in_hdr = NULL, *out_hdr = NULL;
+faidx_t *faidx;
+int anno = 0;
+char *column = NULL;
+
+#define ANNO_REF 1
+#define ANNO_STRING 2
+#define ANNO_INT 3
+
+filter_t *filter;
+char *filter_str;
+int filter_logic;   // one of FLT_INCLUDE/FLT_EXCLUDE (-i or -e)
+
+#define FLT_INCLUDE 1
+#define FLT_EXCLUDE 2
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+    int c;
+    char *ref_fname = NULL, *header_fname = NULL;
+    static struct option loptions[] =
+    {
+        {"exclude",required_argument,NULL,'e'},
+        {"include",required_argument,NULL,'i'},
+        {"column",required_argument,NULL,'c'},
+        {"fasta",required_argument,NULL,'f'},
+        {"header-lines",required_argument,NULL,'h'},
+        {NULL,0,NULL,0}
+    };
+    while ((c = getopt_long(argc, argv, "c:f:?h:i:e:",loptions,NULL)) >= 0)
+    {
+        switch (c) 
+        {
+            case 'e': filter_str = optarg; filter_logic |= FLT_EXCLUDE; break;
+            case 'i': filter_str = optarg; filter_logic |= FLT_INCLUDE; break;
+            case 'c': column = optarg; break;
+            case 'f': ref_fname = optarg; break;
+            case 'h': header_fname = optarg; break;
+            case '?':
+            default: fprintf(bcftools_stderr,"%s", usage()); exit(1); break;
+        }
+    }
+    in_hdr  = in;
+    out_hdr = out;
+    if ( filter_logic == (FLT_EXCLUDE|FLT_INCLUDE) ) { fprintf(bcftools_stderr,"Only one of -i or -e can be given.\n"); return -1; }
+
+    if ( !column )
+    {
+        fprintf(bcftools_stderr,"--column option is required.\n");
+        return -1;
+    }
+    if (header_fname)
+    {
+        htsFile *file = hts_open(header_fname, "rb");
+        if ( !file ) { fprintf(bcftools_stderr,"Error reading %s\n", header_fname); return -1; }
+        kstring_t str = {0,0,0};
+        while ( hts_getline(file, KS_SEP_LINE, &str) > 0 )
+        {
+            if ( bcf_hdr_append(out_hdr, str.s) ) { fprintf(bcftools_stderr,"Could not parse %s: %s\n", header_fname, str.s); return -1; }
+        }
+        hts_close(file);
+        free(str.s);
+        bcf_hdr_sync(out_hdr);
+    }
+    if (!strcasecmp("REF", column)) anno = ANNO_REF;
+    else {
+        if ( !strncasecmp(column,"INFO/",5) ) column += 5;
+        int hdr_id = bcf_hdr_id2int(out_hdr, BCF_DT_ID, column);
+        if (hdr_id<0) { fprintf(bcftools_stderr,"No header ID found for %s. Header lines can be added with the --header-lines option\n", column); return -1; }
+        switch ( bcf_hdr_id2type(out_hdr,BCF_HL_INFO,hdr_id) )
+        {
+            case BCF_HT_INT:
+                anno=ANNO_INT;
+                break;
+            case BCF_HT_STR:
+                anno=ANNO_STRING;
+                break;
+            default:
+                fprintf(bcftools_stderr,"The type of %s not recognised (%d)\n", column, bcf_hdr_id2type(out_hdr,BCF_HL_INFO,hdr_id));
+                return -1;
+        }
+    }
+    if ( !ref_fname )
+    {
+        fprintf(bcftools_stderr,"No fasta given.\n");
+        return -1;
+    }
+    faidx = fai_load(ref_fname);
+    if ( filter_str )
+        filter = filter_init(in, filter_str);
+    return 0;
+}
+
+bcf1_t *process(bcf1_t *rec)
+{
+    // the filter (-i/-e) determines whether the record is annotated;
+    // records it rules out are returned unchanged
+    if ( filter )
+    {
+        int ret = filter_test(filter, rec, NULL);
+        if ( filter_logic==FLT_INCLUDE ) { if ( !ret ) return rec; }
+        else if ( ret ) return rec;
+    }
+
+    int i;
+    char *ref = rec->d.allele[0];
+    int ref_len = strlen(ref);
+    int fa_len;
+    // could be sped up here by fetching the whole chromosome? could assume
+    // sorted, but revert to this when non-sorted records found?
+    char *fa = faidx_fetch_seq(faidx, bcf_seqname(in_hdr,rec), rec->pos, rec->pos+ref_len-1, &fa_len);
+    if ( !fa ) error("faidx_fetch_seq failed at %s:%d\n", bcf_hdr_id2name(in_hdr,rec->rid), rec->pos+1);
+    for (i=0; i<fa_len; i++)
+        if ( (int)fa[i]>96 ) fa[i] -= 32;
+
+    assert(ref_len == fa_len);
+    if (anno==ANNO_REF)
+        strncpy(rec->d.allele[0], fa, fa_len);
+    else if (anno==ANNO_STRING)
+        bcf_update_info_string(out_hdr, rec, column, fa);
+    else if (anno==ANNO_INT && ref_len==1)
+    {
+        int val = atoi(&fa[0]);
+        bcf_update_info_int32(out_hdr, rec, column, &val, 1);
+    }
+    free(fa);
+    return rec;
+}
+
+void destroy(void)
+{
+    fai_destroy(faidx);
+    if (filter) filter_destroy(filter);
+}
diff --git a/bcftools/plugins/fill-tags.c b/bcftools/plugins/fill-tags.c
new file mode 100644
index 0000000..a99d373
--- /dev/null
+++ b/bcftools/plugins/fill-tags.c
@@ -0,0 +1,663 @@
+/* The MIT License
+
+   Copyright (c) 2015 Genome Research Ltd.
+
+   Author: Petr Danecek <pd3@sanger.ac.uk>
+   
+   Permission is hereby granted, free of charge, to any person obtaining a copy
+   of this software and associated documentation files (the "Software"), to deal
+   in the Software without restriction, including without limitation the rights
+   to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+   copies of the Software, and to permit persons to whom the Software is
+   furnished to do so, subject to the following conditions:
+   
+   The above copyright notice and this permission notice shall be included in
+   all copies or substantial portions of the Software.
+   
+   THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+   IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+   FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+   AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+   LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+   OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+   THE SOFTWARE.
+
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <strings.h>
+#include <getopt.h>
+#include <math.h>
+#include <htslib/hts.h>
+#include <htslib/kseq.h>
+#include <htslib/vcf.h>
+#include <htslib/khash_str2int.h>
+#include "bcftools.h"
+
+#define SET_AN      (1<<0)
+#define SET_AC      (1<<1)
+#define SET_AC_Hom  (1<<2)
+#define SET_AC_Het  (1<<3)
+#define SET_AC_Hemi (1<<4)
+#define SET_AF      (1<<5)
+#define SET_NS      (1<<6)
+#define SET_MAF     (1<<7)
+#define SET_HWE     (1<<8)
+#define SET_ExcHet  (1<<9)
+
+typedef struct
+{
+    int nhom, nhet, nhemi, nac;
+}
+counts_t;
+
+typedef struct
+{
+    int ns;
+    int ncounts, mcounts;
+    counts_t *counts;
+    char *name, *suffix;
+    int nsmpl, *smpl;
+}
+pop_t;
+
+typedef struct
+{
+    bcf_hdr_t *in_hdr, *out_hdr;
+    int npop, tags, drop_missing, gt_id;
+    pop_t *pop, **smpl2pop;
+    float *farr;
+    int32_t *iarr, niarr, miarr, nfarr, mfarr;
+    double *hwe_probs;
+    int mhwe_probs;
+    kstring_t str;
+}
+args_t;
+
+static args_t *args;
+
+const char *about(void)
+{
+    return "Set INFO tags AF, AC, AC_Hemi, AC_Hom, AC_Het, AN, ExcHet, HWE, MAF, NS.\n";
+}
+
+const char *usage(void)
+{
+    return 
+        "\n"
+        "About: Set INFO tags AF, AC, AC_Hemi, AC_Hom, AC_Het, AN, ExcHet, HWE, MAF, NS.\n"
+        "Usage: bcftools +fill-tags [General Options] -- [Plugin Options]\n"
+        "Options:\n"
+        "   run \"bcftools plugin\" for a list of common options\n"
+        "\n"
+        "Plugin options:\n"
+        "   -d, --drop-missing          do not count half-missing genotypes \"./1\" as hemizygous\n"
+        "   -l, --list-tags             list available tags with description\n"
+        "   -t, --tags LIST             list of output tags. By default, all tags are filled.\n"
+        "   -S, --samples-file FILE     list of samples (first column) and comma-separated list of populations (second column)\n"
+        "\n"
+        "Example:\n"
+        "   bcftools +fill-tags in.bcf -Ob -o out.bcf\n"
+        "   bcftools +fill-tags in.bcf -Ob -o out.bcf -- -t AN,AC\n"
+        "   bcftools +fill-tags in.bcf -Ob -o out.bcf -- -d\n"
+        "   bcftools +fill-tags in.bcf -Ob -o out.bcf -- -S sample-group.txt -t HWE\n"
+        "\n";
+}
+
+void parse_samples(args_t *args, char *fname)
+{
+    htsFile *fp = hts_open(fname, "r");
+    if ( !fp ) error("Could not read: %s\n", fname);
+
+    void *pop2i = khash_str2int_init();
+    void *smpli = khash_str2int_init();
+    kstring_t str = {0,0,0};
+
+    int moff = 0, *off = NULL, nsmpl = 0;
+    while ( hts_getline(fp, KS_SEP_LINE, &str)>=0 )
+    {
+        // NA12400 GRP1
+        // NA18507 GRP1,GRP2
+        char *pop_names = str.s + str.l - 1;
+        while ( pop_names >= str.s && isspace(*pop_names) ) pop_names--;
+        if ( pop_names <= str.s ) error("Could not parse the file: %s\n", str.s);
+        pop_names[1] = 0;   // strip trailing whitespace
+        while ( pop_names >= str.s && !isspace(*pop_names) ) pop_names--;
+        if ( pop_names <= str.s ) error("Could not parse the file: %s\n", str.s);
+
+        char *smpl = pop_names++;
+        while ( smpl >= str.s && isspace(*smpl) ) smpl--;
+        if ( smpl <= str.s+1 ) error("Could not parse the file: %s\n", str.s);
+        smpl[1] = 0;
+        smpl = str.s;
+
+        int ismpl = bcf_hdr_id2int(args->in_hdr,BCF_DT_SAMPLE,smpl);
+        if ( ismpl<0 ) 
+        {
+            fprintf(stderr,"Warning: The sample not present in the VCF: %s\n",smpl);
+            continue;
+        }
+        if ( khash_str2int_has_key(smpli,smpl) )
+        {
+            fprintf(stderr,"Warning: The sample is listed twice in %s: %s\n",fname,smpl);
+            continue;
+        }
+        khash_str2int_inc(smpli,strdup(smpl));
+
+        int i,npops = ksplit_core(pop_names,',',&moff,&off);
+        for (i=0; i<npops; i++)
+        {
+            char *pop_name = &pop_names[off[i]];
+            if ( !khash_str2int_has_key(pop2i,pop_name) )
+            {
+                pop_name = strdup(pop_name);
+                khash_str2int_set(pop2i,pop_name,args->npop);
+                args->npop++;
+                args->pop = (pop_t*) realloc(args->pop,args->npop*sizeof(*args->pop));
+                memset(args->pop+args->npop-1,0,sizeof(*args->pop));
+                args->pop[args->npop-1].name = pop_name;
+                args->pop[args->npop-1].suffix = (char*)malloc(strlen(pop_name)+2);
+                memcpy(args->pop[args->npop-1].suffix+1,pop_name,strlen(pop_name)+1);
+                args->pop[args->npop-1].suffix[0] = '_';
+            }
+            int ipop = 0;
+            khash_str2int_get(pop2i,pop_name,&ipop);
+            pop_t *pop = &args->pop[ipop];
+            pop->nsmpl++;
+            pop->smpl = (int*) realloc(pop->smpl,pop->nsmpl*sizeof(*pop->smpl));
+            pop->smpl[pop->nsmpl-1] = ismpl;
+        }
+        nsmpl++;
+    }
+
+    if ( nsmpl != bcf_hdr_nsamples(args->in_hdr) )
+        fprintf(stderr,"Warning: %d samples in the list, %d samples in the VCF.\n", nsmpl,bcf_hdr_nsamples(args->in_hdr));
+
+    if ( !args->npop ) error("No populations given?\n");
+
+    khash_str2int_destroy(pop2i);
+    khash_str2int_destroy_free(smpli);
+    free(str.s);
+    free(off);
+    hts_close(fp);
+}
+
+void init_pops(args_t *args)
+{
+    int i,j, nsmpl;
+
+    // add the population "ALL", which is a summary population for all samples
+    args->npop++;
+    args->pop = (pop_t*) realloc(args->pop,args->npop*sizeof(*args->pop));
+    memset(args->pop+args->npop-1,0,sizeof(*args->pop));
+    args->pop[args->npop-1].name   = strdup("");
+    args->pop[args->npop-1].suffix = strdup("");
+
+    nsmpl = bcf_hdr_nsamples(args->in_hdr);
+    args->smpl2pop = (pop_t**) calloc(nsmpl*(args->npop+1),sizeof(pop_t*));
+    for (i=0; i<nsmpl; i++)
+        args->smpl2pop[i*(args->npop+1)] = &args->pop[args->npop-1];
+
+    for (i=0; i<args->npop; i++)
+    {
+        for (j=0; j<args->pop[i].nsmpl; j++)
+        {
+            int ismpl = args->pop[i].smpl[j];
+            pop_t **smpl2pop = &args->smpl2pop[ismpl*(args->npop+1)];
+            while (*smpl2pop) smpl2pop++;
+            *smpl2pop = &args->pop[i];
+        }
+    }
+}
+
+int parse_tags(args_t *args, const char *str)
+{
+    int i, flag = 0, n_tags;
+    char **tags = hts_readlist(str, 0, &n_tags);
+    for(i=0; i<n_tags; i++)
+    {
+        if ( !strcasecmp(tags[i],"AN") ) flag |= SET_AN;
+        else if ( !strcasecmp(tags[i],"AC") ) flag |= SET_AC;
+        else if ( !strcasecmp(tags[i],"NS") ) flag |= SET_NS;
+        else if ( !strcasecmp(tags[i],"AC_Hom") ) flag |= SET_AC_Hom;
+        else if ( !strcasecmp(tags[i],"AC_Het") ) flag |= SET_AC_Het;
+        else if ( !strcasecmp(tags[i],"AC_Hemi") ) flag |= SET_AC_Hemi;
+        else if ( !strcasecmp(tags[i],"AF") ) flag |= SET_AF;
+        else if ( !strcasecmp(tags[i],"MAF") ) flag |= SET_MAF;
+        else if ( !strcasecmp(tags[i],"HWE") ) flag |= SET_HWE;
+        else if ( !strcasecmp(tags[i],"ExcHet") ) flag |= SET_ExcHet;
+        else
+        {
+            fprintf(stderr,"Error parsing \"--tags %s\": the tag \"%s\" is not supported\n", str,tags[i]);
+            exit(1);
+        }
+        free(tags[i]);
+    }
+    if (n_tags) free(tags);
+    return flag;
+}
+
+void hdr_append(args_t *args, char *fmt)
+{
+    int i;
+    for (i=0; i<args->npop; i++)
+        bcf_hdr_printf(args->out_hdr, fmt, args->pop[i].suffix,*args->pop[i].name ? " in " : "",args->pop[i].name);
+}
+
+void list_tags(void)
+{
+    error(
+        "INFO/AN       Number:1  Type:Integer  ..  Total number of alleles in called genotypes\n"
+        "INFO/AC       Number:A  Type:Integer  ..  Allele count in genotypes\n"
+        "INFO/NS       Number:1  Type:Integer  ..  Number of samples with data\n"
+        "INFO/AC_Hom   Number:A  Type:Integer  ..  Allele counts in homozygous genotypes\n"
+        "INFO/AC_Het   Number:A  Type:Integer  ..  Allele counts in heterozygous genotypes\n"
+        "INFO/AC_Hemi  Number:A  Type:Integer  ..  Allele counts in hemizygous genotypes\n"
+        "INFO/AF       Number:A  Type:Float    ..  Allele frequency\n"
+        "INFO/MAF      Number:A  Type:Float    ..  Minor Allele frequency\n"
+        "INFO/HWE      Number:A  Type:Float    ..  HWE test (PMID:15789306)\n"
+        "INFO/ExcHet   Number:A  Type:Float    ..  Probability of excess heterozygosity\n"
+        );
+}
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+    args = (args_t*) calloc(1,sizeof(args_t));
+    args->in_hdr  = in;
+    args->out_hdr = out;
+    char *samples_fname = NULL;
+    static struct option loptions[] =
+    {
+        {"list-tags",0,0,'l'},
+        {"drop-missing",0,0,'d'},
+        {"tags",1,0,'t'},
+        {"samples-file",1,0,'S'},
+        {0,0,0,0}
+    };
+    int c;
+    while ((c = getopt_long(argc, argv, "?ht:dS:l",loptions,NULL)) >= 0)
+    {
+        switch (c) 
+        {
+            case 'l': list_tags(); break;
+            case 'd': args->drop_missing = 1; break;
+            case 't': args->tags |= parse_tags(args,optarg); break;
+            case 'S': samples_fname = optarg; break;
+            case 'h':
+            case '?':
+            default: error("%s", usage()); break;
+        }
+    }
+
+    if ( optind != argc ) error("%s",usage());
+
+    args->gt_id = bcf_hdr_id2int(args->in_hdr,BCF_DT_ID,"GT");
+    if ( args->gt_id<0 ) error("Error: GT field is not present\n");
+
+    if ( !args->tags )
+        for (c=0; c<=9; c++) args->tags |= 1<<c;    // by default all tags will be filled
+
+    if ( samples_fname ) parse_samples(args, samples_fname);
+    init_pops(args);
+
+    if ( args->tags & SET_AN ) hdr_append(args, "##INFO=<ID=AN%s,Number=1,Type=Integer,Description=\"Total number of alleles in called genotypes%s%s\">");
+    if ( args->tags & SET_AC ) hdr_append(args, "##INFO=<ID=AC%s,Number=A,Type=Integer,Description=\"Allele count in genotypes%s%s\">");
+    if ( args->tags & SET_NS ) hdr_append(args, "##INFO=<ID=NS%s,Number=1,Type=Integer,Description=\"Number of samples with data%s%s\">");
+    if ( args->tags & SET_AC_Hom ) hdr_append(args, "##INFO=<ID=AC_Hom%s,Number=A,Type=Integer,Description=\"Allele counts in homozygous genotypes%s%s\">");
+    if ( args->tags & SET_AC_Het ) hdr_append(args, "##INFO=<ID=AC_Het%s,Number=A,Type=Integer,Description=\"Allele counts in heterozygous genotypes%s%s\">");
+    if ( args->tags & SET_AC_Hemi ) hdr_append(args, "##INFO=<ID=AC_Hemi%s,Number=A,Type=Integer,Description=\"Allele counts in hemizygous genotypes%s%s\">");
+    if ( args->tags & SET_AF ) hdr_append(args, "##INFO=<ID=AF%s,Number=A,Type=Float,Description=\"Allele frequency%s%s\">");
+    if ( args->tags & SET_MAF ) hdr_append(args, "##INFO=<ID=MAF%s,Number=A,Type=Float,Description=\"Minor Allele frequency%s%s\">");
+    if ( args->tags & SET_HWE ) hdr_append(args, "##INFO=<ID=HWE%s,Number=A,Type=Float,Description=\"HWE test%s%s (PMID:15789306)\">");
+    if ( args->tags & SET_ExcHet ) hdr_append(args, "##INFO=<ID=ExcHet%s,Number=A,Type=Float,Description=\"Probability of excess heterozygosity%s%s\">");
+
+    return 0;
+}
+
+/* 
+    Wigginton 2005, PMID: 15789306 
+
+    nref .. number of reference alleles
+    nalt .. number of alt alleles
+    nhet .. number of het genotypes, assuming number of genotypes = (nref+nalt)/2
+
+*/
+
+void calc_hwe(args_t *args, int nref, int nalt, int nhet, float *p_hwe, float *p_exc_het)
+{
+    int ngt   = (nref+nalt) / 2;
+    int nrare = nref < nalt ? nref : nalt;
+
+    // sanity check: the number of rare alleles and the number of hets must have the same parity
+    if ( (nrare & 1) ^ (nhet & 1) ) error("nrare/nhet should be both odd or even: nrare=%d nref=%d nalt=%d nhet=%d\n",nrare,nref,nalt,nhet);
+    if ( nrare < nhet ) error("Fewer rare alleles than hets? nrare=%d nref=%d nalt=%d nhet=%d\n",nrare,nref,nalt,nhet);
+    if ( (nref+nalt) & 1 ) error("Expected diploid genotypes: nref=%d nalt=%d\n",nref,nalt);
+
+    // initialize het probs
+    hts_expand(double,nrare+1,args->mhwe_probs,args->hwe_probs);
+    memset(args->hwe_probs, 0, sizeof(*args->hwe_probs)*(nrare+1));
+    double *probs = args->hwe_probs;
+
+    // start at midpoint
+    int mid = nrare * (nref + nalt - nrare) / (nref + nalt);
+
+    // check to ensure that midpoint and rare alleles have same parity
+    if ( (nrare & 1) ^ (mid & 1) ) mid++;
+
+    int het = mid;
+    int hom_r  = (nrare - mid) / 2;
+    int hom_c  = ngt - het - hom_r;
+    double sum = probs[mid] = 1.0;
+
+    for (het = mid; het > 1; het -= 2)
+    {
+        probs[het - 2] = probs[het] * het * (het - 1.0) / (4.0 * (hom_r + 1.0) * (hom_c + 1.0));
+        sum += probs[het - 2];
+
+        // 2 fewer heterozygotes for next iteration -> add one rare, one common homozygote
+        hom_r++;
+        hom_c++;
+    }
+
+    het = mid;
+    hom_r = (nrare - mid) / 2;
+    hom_c = ngt - het - hom_r;
+    for (het = mid; het <= nrare - 2; het += 2)
+    {
+        probs[het + 2] = probs[het] * 4.0 * hom_r * hom_c / ((het + 2.0) * (het + 1.0));
+        sum += probs[het + 2];
+
+        // add 2 heterozygotes for next iteration -> subtract one rare, one common homozygote
+        hom_r--;
+        hom_c--;
+    }
+
+    for (het=0; het<nrare+1; het++) probs[het] /= sum;
+
+    double prob = probs[nhet];
+    for (het = nhet + 1; het <= nrare; het++) prob += probs[het];
+    *p_exc_het = prob;
+
+    prob = 0;
+    for (het=0; het <= nrare; het++)
+    {
+        if ( probs[het] > probs[nhet]) continue;
+        prob += probs[het];
+    }
+    if ( prob > 1 ) prob = 1;
+    *p_hwe = prob;
+}
+
+static inline void set_counts(pop_t *pop, int is_half, int is_hom, int is_hemi, int als)
+{
+    int ial;
+    for (ial=0; als; ial++)
+    {
+        if ( als&1 )
+        { 
+            if ( is_half ) pop->counts[ial].nac++;
+            else if ( !is_hom ) pop->counts[ial].nhet++;
+            else if ( !is_hemi ) pop->counts[ial].nhom += 2;
+            else pop->counts[ial].nhemi++;
+        }
+        als >>= 1;
+    }
+    pop->ns++;
+}
+static void clean_counts(pop_t *pop, int nals)
+{
+    pop->ns = 0;
+    memset(pop->counts,0,sizeof(counts_t)*nals);
+}
+
+bcf1_t *process(bcf1_t *rec)
+{
+    int i,j, nsmpl = bcf_hdr_nsamples(args->in_hdr);
+
+    bcf_unpack(rec, BCF_UN_FMT);
+    bcf_fmt_t *fmt_gt = NULL;
+    for (i=0; i<rec->n_fmt; i++)
+        if ( rec->d.fmt[i].id==args->gt_id ) { fmt_gt = &rec->d.fmt[i]; break; }
+    if ( !fmt_gt ) return rec;    // no GT tag
+
+    hts_expand(int32_t,rec->n_allele, args->miarr, args->iarr);
+    hts_expand(float,rec->n_allele*2, args->mfarr, args->farr);
+    for (i=0; i<args->npop; i++)
+        hts_expand(counts_t,rec->n_allele,args->pop[i].mcounts, args->pop[i].counts);
+
+    for (i=0; i<args->npop; i++)
+        clean_counts(&args->pop[i], rec->n_allele);
+
+    assert( rec->n_allele < 8*sizeof(int) );
+
+    #define BRANCH_INT(type_t,vector_end) \
+    { \
+        for (i=0; i<nsmpl; i++) \
+        { \
+            type_t *p = (type_t*) (fmt_gt->p + i*fmt_gt->size); \
+            int ial, als = 0, nals = 0, is_half, is_hom, is_hemi; \
+            for (ial=0; ial<fmt_gt->n; ial++) \
+            { \
+                if ( p[ial]==vector_end ) break; /* smaller ploidy */ \
+                if ( bcf_gt_is_missing(p[ial]) ) continue; /* missing allele */ \
+                int idx = bcf_gt_allele(p[ial]); \
+                nals++; \
+                \
+                if ( idx >= rec->n_allele ) \
+                    error("Incorrect allele (\"%d\") in %s at %s:%d\n",idx,args->in_hdr->samples[i],bcf_seqname(args->in_hdr,rec),rec->pos+1); \
+                als |= (1<<idx);  /* this breaks with too many alleles */ \
+            } \
+            if ( nals==0 ) continue; /* missing genotype */ \
+            is_hom = als && !(als & (als-1)); /* only one bit is set */ \
+            if ( nals!=ial ) \
+            { \
+                if ( args->drop_missing ) is_hemi = 0, is_half = 1; \
+                else is_hemi = 1, is_half = 0; \
+            } \
+            else if ( nals==1 ) is_hemi = 1, is_half = 0; \
+            else is_hemi = 0, is_half = 0; \
+            pop_t **pop = &args->smpl2pop[i*(args->npop+1)]; \
+            while ( *pop ) { set_counts(*pop,is_half,is_hom,is_hemi,als); pop++; }\
+        } \
+    }
+    switch (fmt_gt->type) {
+        case BCF_BT_INT8:  BRANCH_INT(int8_t,  bcf_int8_vector_end); break;
+        case BCF_BT_INT16: BRANCH_INT(int16_t, bcf_int16_vector_end); break;
+        case BCF_BT_INT32: BRANCH_INT(int32_t, bcf_int32_vector_end); break;
+        default: error("The GT type is not recognised: %d at %s:%d\n",fmt_gt->type, bcf_seqname(args->in_hdr,rec),rec->pos+1); break;
+    }
+    #undef BRANCH_INT
+
+    if ( args->tags & SET_NS )
+    {
+        for (i=0; i<args->npop; i++)
+        {
+            args->str.l = 0;
+            ksprintf(&args->str, "NS%s", args->pop[i].suffix);
+            if ( bcf_update_info_int32(args->out_hdr,rec,args->str.s,&args->pop[i].ns,1)!=0 )
+                error("Error occurred while updating %s at %s:%d\n", args->str.s,bcf_seqname(args->in_hdr,rec),rec->pos+1);
+        }
+    }
+    if ( args->tags & SET_AN )
+    {
+        for (i=0; i<args->npop; i++)
+        {
+            pop_t *pop = &args->pop[i];
+            int32_t an = 0;
+            for (j=0; j<rec->n_allele; j++) 
+                an += pop->counts[j].nhet + pop->counts[j].nhom + pop->counts[j].nhemi + pop->counts[j].nac;
+
+            args->str.l = 0;
+            ksprintf(&args->str, "AN%s", args->pop[i].suffix);
+            if ( bcf_update_info_int32(args->out_hdr,rec,args->str.s,&an,1)!=0 )
+                error("Error occurred while updating %s at %s:%d\n", args->str.s,bcf_seqname(args->in_hdr,rec),rec->pos+1);
+        }
+    }
+    if ( args->tags & (SET_AF | SET_MAF) )
+    {
+        for (i=0; i<args->npop; i++)
+        {
+            int32_t an = 0;
+            if ( rec->n_allele > 1 )
+            {
+                pop_t *pop = &args->pop[i];
+                memset(args->farr, 0, sizeof(*args->farr)*(rec->n_allele-1));
+                for (j=1; j<rec->n_allele; j++) 
+                    args->farr[j-1] += pop->counts[j].nhet + pop->counts[j].nhom + pop->counts[j].nhemi + pop->counts[j].nac;
+                an = pop->counts[0].nhet + pop->counts[0].nhom + pop->counts[0].nhemi + pop->counts[0].nac;
+                for (j=1; j<rec->n_allele; j++) an += args->farr[j-1];
+                if ( !an ) continue;
+                for (j=1; j<rec->n_allele; j++) args->farr[j-1] /= an;
+            }
+            if ( args->tags & SET_AF )
+            {
+                args->str.l = 0;
+                ksprintf(&args->str, "AF%s", args->pop[i].suffix);
+                if ( bcf_update_info_float(args->out_hdr,rec,args->str.s,args->farr,rec->n_allele-1)!=0 )
+                    error("Error occurred while updating %s at %s:%d\n", args->str.s,bcf_seqname(args->in_hdr,rec),rec->pos+1);
+            }
+            if ( args->tags & SET_MAF )
+            {
+                if ( !an ) continue;
+                for (j=1; j<rec->n_allele; j++)
+                    if ( args->farr[j-1] > 0.5 ) args->farr[j-1] = 1 - args->farr[j-1];     // todo: this is incorrect for multiallelic sites
+                args->str.l = 0;
+                ksprintf(&args->str, "MAF%s", args->pop[i].suffix);
+                if ( bcf_update_info_float(args->out_hdr,rec,args->str.s,args->farr,rec->n_allele-1)!=0 )
+                    error("Error occurred while updating %s at %s:%d\n", args->str.s,bcf_seqname(args->in_hdr,rec),rec->pos+1);
+            }
+        }
+    }
+    if ( args->tags & SET_AC )
+    {
+        for (i=0; i<args->npop; i++)
+        {
+            if ( rec->n_allele > 1 )
+            {
+                pop_t *pop = &args->pop[i];
+                memset(args->iarr, 0, sizeof(*args->iarr)*(rec->n_allele-1));
+                for (j=1; j<rec->n_allele; j++) 
+                    args->iarr[j-1] += pop->counts[j].nhet + pop->counts[j].nhom + pop->counts[j].nhemi + pop->counts[j].nac;
+            }
+            args->str.l = 0;
+            ksprintf(&args->str, "AC%s", args->pop[i].suffix);
+            if ( bcf_update_info_int32(args->out_hdr,rec,args->str.s,args->iarr,rec->n_allele-1)!=0 )
+                error("Error occurred while updating %s at %s:%d\n", args->str.s,bcf_seqname(args->in_hdr,rec),rec->pos+1);
+        }
+    }
+    if ( args->tags & SET_AC_Het )
+    {
+        for (i=0; i<args->npop; i++)
+        {
+            if ( rec->n_allele > 1 )
+            {
+                pop_t *pop = &args->pop[i];
+                memset(args->iarr, 0, sizeof(*args->iarr)*(rec->n_allele-1));
+                for (j=1; j<rec->n_allele; j++) 
+                    args->iarr[j-1] += pop->counts[j].nhet;
+            }
+            args->str.l = 0;
+            ksprintf(&args->str, "AC_Het%s", args->pop[i].suffix);
+            if ( bcf_update_info_int32(args->out_hdr,rec,args->str.s,args->iarr,rec->n_allele-1)!=0 )
+                error("Error occurred while updating %s at %s:%d\n", args->str.s,bcf_seqname(args->in_hdr,rec),rec->pos+1);
+        }
+    }
+    if ( args->tags & SET_AC_Hom )
+    {
+        for (i=0; i<args->npop; i++)
+        {
+            if ( rec->n_allele > 1 )
+            {
+                pop_t *pop = &args->pop[i];
+                memset(args->iarr, 0, sizeof(*args->iarr)*(rec->n_allele-1));
+                for (j=1; j<rec->n_allele; j++) 
+                    args->iarr[j-1] += pop->counts[j].nhom;
+            }
+            args->str.l = 0;
+            ksprintf(&args->str, "AC_Hom%s", args->pop[i].suffix);
+            if ( bcf_update_info_int32(args->out_hdr,rec,args->str.s,args->iarr,rec->n_allele-1)!=0 )
+                error("Error occurred while updating %s at %s:%d\n", args->str.s,bcf_seqname(args->in_hdr,rec),rec->pos+1);
+        }
+    }
+    if ( args->tags & SET_AC_Hemi && rec->n_allele > 1 )
+    {
+        for (i=0; i<args->npop; i++)
+        {
+            if ( rec->n_allele > 1 )
+            {
+                pop_t *pop = &args->pop[i];
+                memset(args->iarr, 0, sizeof(*args->iarr)*(rec->n_allele-1));
+                for (j=1; j<rec->n_allele; j++) 
+                    args->iarr[j-1] += pop->counts[j].nhemi;
+            }
+            args->str.l = 0;
+            ksprintf(&args->str, "AC_Hemi%s", args->pop[i].suffix);
+            if ( bcf_update_info_int32(args->out_hdr,rec,args->str.s,args->iarr,rec->n_allele-1)!=0 )
+                error("Error occurred while updating %s at %s:%d\n", args->str.s,bcf_seqname(args->in_hdr,rec),rec->pos+1);
+        }
+    }
+    if ( args->tags & (SET_HWE|SET_ExcHet) )
+    {
+        for (i=0; i<args->npop; i++)
+        {
+            float *fhwe = args->farr;
+            float *fexc_het = args->farr + rec->n_allele;
+            if ( rec->n_allele > 1 )
+            {
+                pop_t *pop = &args->pop[i];
+                memset(args->farr,  0, sizeof(*args->farr)*(2*rec->n_allele));
+                int nref_tot = pop->counts[0].nhom;
+                for (j=0; j<rec->n_allele; j++) nref_tot += pop->counts[j].nhet;   // NB this neglects multiallelic genotypes
+                for (j=1; j<rec->n_allele; j++) 
+                {
+                    int nref = nref_tot - pop->counts[j].nhet;
+                    int nalt = pop->counts[j].nhet + pop->counts[j].nhom;
+                    int nhet = pop->counts[j].nhet;
+                    if ( nref>0 && nalt>0 )
+                        calc_hwe(args, nref, nalt, nhet, &fhwe[j-1], &fexc_het[j-1]);
+                    else
+                        fhwe[j-1] = fexc_het[j-1] = 1;
+                }
+            }
+            if ( args->tags & SET_HWE )
+            {
+                args->str.l = 0;
+                ksprintf(&args->str, "HWE%s", args->pop[i].suffix);
+                if ( bcf_update_info_float(args->out_hdr,rec,args->str.s,fhwe,rec->n_allele-1)!=0 )
+                    error("Error occurred while updating %s at %s:%d\n", args->str.s,bcf_seqname(args->in_hdr,rec),rec->pos+1);
+            }
+            if ( args->tags & SET_ExcHet )
+            {
+                args->str.l = 0;
+                ksprintf(&args->str, "ExcHet%s", args->pop[i].suffix);
+                if ( bcf_update_info_float(args->out_hdr,rec,args->str.s,fexc_het,rec->n_allele-1)!=0 )
+                    error("Error occurred while updating %s at %s:%d\n", args->str.s,bcf_seqname(args->in_hdr,rec),rec->pos+1);
+            }
+        }
+    }
+
+    return rec;
+}
+
+void destroy(void)
+{
+    int i; 
+    for (i=0; i<args->npop; i++)
+    {
+        free(args->pop[i].name);
+        free(args->pop[i].suffix);
+        free(args->pop[i].smpl);
+        free(args->pop[i].counts);
+    }
+    free(args->str.s);
+    free(args->pop);
+    free(args->smpl2pop);
+    free(args->iarr);
+    free(args->farr);
+    free(args->hwe_probs);
+    free(args);
+}
+
+
+
diff --git a/bcftools/plugins/fill-tags.c.pysam.c b/bcftools/plugins/fill-tags.c.pysam.c
new file mode 100644
index 0000000..4c3b06b
--- /dev/null
+++ b/bcftools/plugins/fill-tags.c.pysam.c
@@ -0,0 +1,665 @@
+#include "bcftools.pysam.h"
+
+/* The MIT License
+
+   Copyright (c) 2015 Genome Research Ltd.
+
+   Author: Petr Danecek <pd3@sanger.ac.uk>
+   
+   Permission is hereby granted, free of charge, to any person obtaining a copy
+   of this software and associated documentation files (the "Software"), to deal
+   in the Software without restriction, including without limitation the rights
+   to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+   copies of the Software, and to permit persons to whom the Software is
+   furnished to do so, subject to the following conditions:
+   
+   The above copyright notice and this permission notice shall be included in
+   all copies or substantial portions of the Software.
+   
+   THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+   IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+   FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+   AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+   LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+   OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+   THE SOFTWARE.
+
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <strings.h>
+#include <getopt.h>
+#include <math.h>
+#include <htslib/hts.h>
+#include <htslib/kseq.h>
+#include <htslib/vcf.h>
+#include <htslib/khash_str2int.h>
+#include "bcftools.h"
+
+#define SET_AN      (1<<0)
+#define SET_AC      (1<<1)
+#define SET_AC_Hom  (1<<2)
+#define SET_AC_Het  (1<<3)
+#define SET_AC_Hemi (1<<4)
+#define SET_AF      (1<<5)
+#define SET_NS      (1<<6)
+#define SET_MAF     (1<<7)
+#define SET_HWE     (1<<8)
+#define SET_ExcHet  (1<<9)
+
+typedef struct
+{
+    int nhom, nhet, nhemi, nac;
+}
+counts_t;
+
+typedef struct
+{
+    int ns;
+    int ncounts, mcounts;
+    counts_t *counts;
+    char *name, *suffix;
+    int nsmpl, *smpl;
+}
+pop_t;
+
+typedef struct
+{
+    bcf_hdr_t *in_hdr, *out_hdr;
+    int npop, tags, drop_missing, gt_id;
+    pop_t *pop, **smpl2pop;
+    float *farr;
+    int32_t *iarr, niarr, miarr, nfarr, mfarr;
+    double *hwe_probs;
+    int mhwe_probs;
+    kstring_t str;
+}
+args_t;
+
+static args_t *args;
+
+const char *about(void)
+{
+    return "Set INFO tags AF, AC, AC_Hemi, AC_Hom, AC_Het, AN, ExcHet, HWE, MAF, NS.\n";
+}
+
+const char *usage(void)
+{
+    return 
+        "\n"
+        "About: Set INFO tags AF, AC, AC_Hemi, AC_Hom, AC_Het, AN, ExcHet, HWE, MAF, NS.\n"
+        "Usage: bcftools +fill-tags [General Options] -- [Plugin Options]\n"
+        "Options:\n"
+        "   run \"bcftools plugin\" for a list of common options\n"
+        "\n"
+        "Plugin options:\n"
+        "   -d, --drop-missing          do not count half-missing genotypes \"./1\" as hemizygous\n"
+        "   -l, --list-tags             list available tags with description\n"
+        "   -t, --tags LIST             list of output tags. By default, all tags are filled.\n"
+        "   -S, --samples-file FILE     list of samples (first column) and comma-separated list of populations (second column)\n"
+        "\n"
+        "Example:\n"
+        "   bcftools +fill-tags in.bcf -Ob -o out.bcf\n"
+        "   bcftools +fill-tags in.bcf -Ob -o out.bcf -- -t AN,AC\n"
+        "   bcftools +fill-tags in.bcf -Ob -o out.bcf -- -d\n"
+        "   bcftools +fill-tags in.bcf -Ob -o out.bcf -- -S sample-group.txt -t HWE\n"
+        "\n";
+}
+
+void parse_samples(args_t *args, char *fname)
+{
+    htsFile *fp = hts_open(fname, "r");
+    if ( !fp ) error("Could not read: %s\n", fname);
+
+    void *pop2i = khash_str2int_init();
+    void *smpli = khash_str2int_init();
+    kstring_t str = {0,0,0};
+
+    int moff = 0, *off = NULL, nsmpl = 0;
+    while ( hts_getline(fp, KS_SEP_LINE, &str)>=0 )
+    {
+        // NA12400 GRP1
+        // NA18507 GRP1,GRP2
+        char *pop_names = str.s + str.l - 1;
+        while ( pop_names >= str.s && isspace(*pop_names) ) pop_names--;
+        if ( pop_names <= str.s ) error("Could not parse the file: %s\n", str.s);
+        pop_names[1] = 0;   // trailing spaces
+        while ( pop_names >= str.s && !isspace(*pop_names) ) pop_names--;
+        if ( pop_names <= str.s ) error("Could not parse the file: %s\n", str.s);
+
+        char *smpl = pop_names++;
+        while ( smpl >= str.s && isspace(*smpl) ) smpl--;
+        if ( smpl <= str.s+1 ) error("Could not parse the file: %s\n", str.s);
+        smpl[1] = 0;
+        smpl = str.s;
+
+        int ismpl = bcf_hdr_id2int(args->in_hdr,BCF_DT_SAMPLE,smpl);
+        if ( ismpl<0 ) 
+        {
+            fprintf(bcftools_stderr,"Warning: The sample is not present in the VCF: %s\n",smpl);
+            continue;
+        }
+        if ( khash_str2int_has_key(smpli,smpl) )
+        {
+            fprintf(bcftools_stderr,"Warning: The sample is listed twice in %s: %s\n",fname,smpl);
+            continue;
+        }
+        khash_str2int_inc(smpli,strdup(smpl));
+
+        int i,npops = ksplit_core(pop_names,',',&moff,&off);
+        for (i=0; i<npops; i++)
+        {
+            char *pop_name = &pop_names[off[i]];
+            if ( !khash_str2int_has_key(pop2i,pop_name) )
+            {
+                pop_name = strdup(pop_name);
+                khash_str2int_set(pop2i,pop_name,args->npop);
+                args->npop++;
+                args->pop = (pop_t*) realloc(args->pop,args->npop*sizeof(*args->pop));
+                memset(args->pop+args->npop-1,0,sizeof(*args->pop));
+                args->pop[args->npop-1].name = pop_name;
+                args->pop[args->npop-1].suffix = (char*)malloc(strlen(pop_name)+2);
+                memcpy(args->pop[args->npop-1].suffix+1,pop_name,strlen(pop_name)+1);
+                args->pop[args->npop-1].suffix[0] = '_';
+            }
+            int ipop = 0;
+            khash_str2int_get(pop2i,pop_name,&ipop);
+            pop_t *pop = &args->pop[ipop];
+            pop->nsmpl++;
+            pop->smpl = (int*) realloc(pop->smpl,pop->nsmpl*sizeof(*pop->smpl));
+            pop->smpl[pop->nsmpl-1] = ismpl;
+        }
+        nsmpl++;
+    }
+
+    if ( nsmpl != bcf_hdr_nsamples(args->in_hdr) )
+        fprintf(bcftools_stderr,"Warning: %d samples in the list, %d samples in the VCF.\n", nsmpl,bcf_hdr_nsamples(args->in_hdr));
+
+    if ( !args->npop ) error("No populations given?\n");
+
+    khash_str2int_destroy(pop2i);
+    khash_str2int_destroy_free(smpli);
+    free(str.s);
+    free(off);
+    hts_close(fp);
+}
+
+void init_pops(args_t *args)
+{
+    int i,j, nsmpl;
+
+    // add the summary population covering all samples; its name and suffix are empty, so its tags carry no suffix
+    args->npop++;
+    args->pop = (pop_t*) realloc(args->pop,args->npop*sizeof(*args->pop));
+    memset(args->pop+args->npop-1,0,sizeof(*args->pop));
+    args->pop[args->npop-1].name   = strdup("");
+    args->pop[args->npop-1].suffix = strdup("");
+
+    nsmpl = bcf_hdr_nsamples(args->in_hdr);
+    args->smpl2pop = (pop_t**) calloc(nsmpl*(args->npop+1),sizeof(pop_t*));
+    for (i=0; i<nsmpl; i++)
+        args->smpl2pop[i*(args->npop+1)] = &args->pop[args->npop-1];
+
+    for (i=0; i<args->npop; i++)
+    {
+        for (j=0; j<args->pop[i].nsmpl; j++)
+        {
+            int ismpl = args->pop[i].smpl[j];
+            pop_t **smpl2pop = &args->smpl2pop[ismpl*(args->npop+1)];
+            while (*smpl2pop) smpl2pop++;
+            *smpl2pop = &args->pop[i];
+        }
+    }
+}
+
+int parse_tags(args_t *args, const char *str)
+{
+    int i, flag = 0, n_tags;
+    char **tags = hts_readlist(str, 0, &n_tags);
+    for(i=0; i<n_tags; i++)
+    {
+        if ( !strcasecmp(tags[i],"AN") ) flag |= SET_AN;
+        else if ( !strcasecmp(tags[i],"AC") ) flag |= SET_AC;
+        else if ( !strcasecmp(tags[i],"NS") ) flag |= SET_NS;
+        else if ( !strcasecmp(tags[i],"AC_Hom") ) flag |= SET_AC_Hom;
+        else if ( !strcasecmp(tags[i],"AC_Het") ) flag |= SET_AC_Het;
+        else if ( !strcasecmp(tags[i],"AC_Hemi") ) flag |= SET_AC_Hemi;
+        else if ( !strcasecmp(tags[i],"AF") ) flag |= SET_AF;
+        else if ( !strcasecmp(tags[i],"MAF") ) flag |= SET_MAF;
+        else if ( !strcasecmp(tags[i],"HWE") ) flag |= SET_HWE;
+        else if ( !strcasecmp(tags[i],"ExcHet") ) flag |= SET_ExcHet;
+        else
+        {
+            fprintf(bcftools_stderr,"Error parsing \"--tags %s\": the tag \"%s\" is not supported\n", str,tags[i]);
+            exit(1);
+        }
+        free(tags[i]);
+    }
+    if (n_tags) free(tags);
+    return flag;
+}
+
+void hdr_append(args_t *args, char *fmt)
+{
+    int i;
+    for (i=0; i<args->npop; i++)
+        bcf_hdr_printf(args->out_hdr, fmt, args->pop[i].suffix,*args->pop[i].name ? " in " : "",args->pop[i].name);
+}
+
+void list_tags(void)
+{
+    error(
+        "INFO/AN       Number:1  Type:Integer  ..  Total number of alleles in called genotypes\n"
+        "INFO/AC       Number:A  Type:Integer  ..  Allele count in genotypes\n"
+        "INFO/NS       Number:1  Type:Integer  ..  Number of samples with data\n"
+        "INFO/AC_Hom   Number:A  Type:Integer  ..  Allele counts in homozygous genotypes\n"
+        "INFO/AC_Het   Number:A  Type:Integer  ..  Allele counts in heterozygous genotypes\n"
+        "INFO/AC_Hemi  Number:A  Type:Integer  ..  Allele counts in hemizygous genotypes\n"
+        "INFO/AF       Number:A  Type:Float    ..  Allele frequency\n"
+        "INFO/MAF      Number:A  Type:Float    ..  Minor Allele frequency\n"
+        "INFO/HWE      Number:A  Type:Float    ..  HWE test (PMID:15789306)\n"
+        "INFO/ExcHet   Number:A  Type:Float    ..  Probability of excess heterozygosity\n"
+        );
+}
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+    args = (args_t*) calloc(1,sizeof(args_t));
+    args->in_hdr  = in;
+    args->out_hdr = out;
+    char *samples_fname = NULL;
+    static struct option loptions[] =
+    {
+        {"list-tags",0,0,'l'},
+        {"drop-missing",0,0,'d'},
+        {"tags",1,0,'t'},
+        {"samples-file",1,0,'S'},
+        {0,0,0,0}
+    };
+    int c;
+    while ((c = getopt_long(argc, argv, "?ht:dS:l",loptions,NULL)) >= 0)
+    {
+        switch (c) 
+        {
+            case 'l': list_tags(); break;
+            case 'd': args->drop_missing = 1; break;
+            case 't': args->tags |= parse_tags(args,optarg); break;
+            case 'S': samples_fname = optarg; break;
+            case 'h':
+            case '?':
+            default: error("%s", usage()); break;
+        }
+    }
+
+    if ( optind != argc ) error("%s",usage());
+
+    args->gt_id = bcf_hdr_id2int(args->in_hdr,BCF_DT_ID,"GT");
+    if ( args->gt_id<0 ) error("Error: GT field is not present\n");
+
+    if ( !args->tags )
+        for (c=0; c<=9; c++) args->tags |= 1<<c;    // by default all tags will be filled
+
+    if ( samples_fname ) parse_samples(args, samples_fname);
+    init_pops(args);
+
+    if ( args->tags & SET_AN ) hdr_append(args, "##INFO=<ID=AN%s,Number=1,Type=Integer,Description=\"Total number of alleles in called genotypes%s%s\">");
+    if ( args->tags & SET_AC ) hdr_append(args, "##INFO=<ID=AC%s,Number=A,Type=Integer,Description=\"Allele count in genotypes%s%s\">");
+    if ( args->tags & SET_NS ) hdr_append(args, "##INFO=<ID=NS%s,Number=1,Type=Integer,Description=\"Number of samples with data%s%s\">");
+    if ( args->tags & SET_AC_Hom ) hdr_append(args, "##INFO=<ID=AC_Hom%s,Number=A,Type=Integer,Description=\"Allele counts in homozygous genotypes%s%s\">");
+    if ( args->tags & SET_AC_Het ) hdr_append(args, "##INFO=<ID=AC_Het%s,Number=A,Type=Integer,Description=\"Allele counts in heterozygous genotypes%s%s\">");
+    if ( args->tags & SET_AC_Hemi ) hdr_append(args, "##INFO=<ID=AC_Hemi%s,Number=A,Type=Integer,Description=\"Allele counts in hemizygous genotypes%s%s\">");
+    if ( args->tags & SET_AF ) hdr_append(args, "##INFO=<ID=AF%s,Number=A,Type=Float,Description=\"Allele frequency%s%s\">");
+    if ( args->tags & SET_MAF ) hdr_append(args, "##INFO=<ID=MAF%s,Number=A,Type=Float,Description=\"Minor Allele frequency%s%s\">");
+    if ( args->tags & SET_HWE ) hdr_append(args, "##INFO=<ID=HWE%s,Number=A,Type=Float,Description=\"HWE test%s%s (PMID:15789306)\">");
+    if ( args->tags & SET_ExcHet ) hdr_append(args, "##INFO=<ID=ExcHet%s,Number=A,Type=Float,Description=\"Probability of excess heterozygosity%s%s\">");
+
+    return 0;
+}
+
+/* 
+    Wigginton 2005, PMID: 15789306 
+
+    nref .. number of reference alleles
+    nalt .. number of alt alleles
+    nhet .. number of het genotypes, assuming number of genotypes = (nref+nalt)/2
+
+*/
+
+void calc_hwe(args_t *args, int nref, int nalt, int nhet, float *p_hwe, float *p_exc_het)
+{
+    int ngt   = (nref+nalt) / 2;
+    int nrare = nref < nalt ? nref : nalt;
+
+    // sanity check: the number of rare alleles and the number of hets must have the same parity
+    if ( (nrare & 1) ^ (nhet & 1) ) error("nrare/nhet should be both odd or even: nrare=%d nref=%d nalt=%d nhet=%d\n",nrare,nref,nalt,nhet);
+    if ( nrare < nhet ) error("Fewer rare alleles than hets? nrare=%d nref=%d nalt=%d nhet=%d\n",nrare,nref,nalt,nhet);
+    if ( (nref+nalt) & 1 ) error("Expected diploid genotypes: nref=%d nalt=%d\n",nref,nalt);
+
+    // initialize het probs
+    hts_expand(double,nrare+1,args->mhwe_probs,args->hwe_probs);
+    memset(args->hwe_probs, 0, sizeof(*args->hwe_probs)*(nrare+1));
+    double *probs = args->hwe_probs;
+
+    // start at midpoint
+    int mid = nrare * (nref + nalt - nrare) / (nref + nalt);
+
+    // check to ensure that midpoint and rare alleles have same parity
+    if ( (nrare & 1) ^ (mid & 1) ) mid++;
+
+    int het = mid;
+    int hom_r  = (nrare - mid) / 2;
+    int hom_c  = ngt - het - hom_r;
+    double sum = probs[mid] = 1.0;
+
+    for (het = mid; het > 1; het -= 2)
+    {
+        probs[het - 2] = probs[het] * het * (het - 1.0) / (4.0 * (hom_r + 1.0) * (hom_c + 1.0));
+        sum += probs[het - 2];
+
+        // 2 fewer heterozygotes for next iteration -> add one rare, one common homozygote
+        hom_r++;
+        hom_c++;
+    }
+
+    het = mid;
+    hom_r = (nrare - mid) / 2;
+    hom_c = ngt - het - hom_r;
+    for (het = mid; het <= nrare - 2; het += 2)
+    {
+        probs[het + 2] = probs[het] * 4.0 * hom_r * hom_c / ((het + 2.0) * (het + 1.0));
+        sum += probs[het + 2];
+
+        // add 2 heterozygotes for next iteration -> subtract one rare, one common homozygote
+        hom_r--;
+        hom_c--;
+    }
+
+    for (het=0; het<nrare+1; het++) probs[het] /= sum;
+
+    double prob = probs[nhet];
+    for (het = nhet + 1; het <= nrare; het++) prob += probs[het];
+    *p_exc_het = prob;
+
+    prob = 0;
+    for (het=0; het <= nrare; het++)
+    {
+        if ( probs[het] > probs[nhet]) continue;
+        prob += probs[het];
+    }
+    if ( prob > 1 ) prob = 1;
+    *p_hwe = prob;
+}
+
+static inline void set_counts(pop_t *pop, int is_half, int is_hom, int is_hemi, int als)
+{
+    int ial;
+    for (ial=0; als; ial++)
+    {
+        if ( als&1 )
+        { 
+            if ( is_half ) pop->counts[ial].nac++;
+            else if ( !is_hom ) pop->counts[ial].nhet++;
+            else if ( !is_hemi ) pop->counts[ial].nhom += 2;
+            else pop->counts[ial].nhemi++;
+        }
+        als >>= 1;
+    }
+    pop->ns++;
+}
+static void clean_counts(pop_t *pop, int nals)
+{
+    pop->ns = 0;
+    memset(pop->counts,0,sizeof(counts_t)*nals);
+}
+
+bcf1_t *process(bcf1_t *rec)
+{
+    int i,j, nsmpl = bcf_hdr_nsamples(args->in_hdr);
+
+    bcf_unpack(rec, BCF_UN_FMT);
+    bcf_fmt_t *fmt_gt = NULL;
+    for (i=0; i<rec->n_fmt; i++)
+        if ( rec->d.fmt[i].id==args->gt_id ) { fmt_gt = &rec->d.fmt[i]; break; }
+    if ( !fmt_gt ) return rec;    // no GT tag
+
+    hts_expand(int32_t,rec->n_allele, args->miarr, args->iarr);
+    hts_expand(float,rec->n_allele*2, args->mfarr, args->farr);
+    for (i=0; i<args->npop; i++)
+        hts_expand(counts_t,rec->n_allele,args->pop[i].mcounts, args->pop[i].counts);
+
+    for (i=0; i<args->npop; i++)
+        clean_counts(&args->pop[i], rec->n_allele);
+
+    assert( rec->n_allele < 8*sizeof(int) );
+
+    #define BRANCH_INT(type_t,vector_end) \
+    { \
+        for (i=0; i<nsmpl; i++) \
+        { \
+            type_t *p = (type_t*) (fmt_gt->p + i*fmt_gt->size); \
+            int ial, als = 0, nals = 0, is_half, is_hom, is_hemi; \
+            for (ial=0; ial<fmt_gt->n; ial++) \
+            { \
+                if ( p[ial]==vector_end ) break; /* smaller ploidy */ \
+                if ( bcf_gt_is_missing(p[ial]) ) continue; /* missing allele */ \
+                int idx = bcf_gt_allele(p[ial]); \
+                nals++; \
+                \
+                if ( idx >= rec->n_allele ) \
+                    error("Incorrect allele (\"%d\") in %s at %s:%d\n",idx,args->in_hdr->samples[i],bcf_seqname(args->in_hdr,rec),rec->pos+1); \
+                als |= (1<<idx);  /* this breaks with too many alleles */ \
+            } \
+            if ( nals==0 ) continue; /* missing genotype */ \
+            is_hom = als && !(als & (als-1)); /* only one bit is set */ \
+            if ( nals!=ial ) \
+            { \
+                if ( args->drop_missing ) is_hemi = 0, is_half = 1; \
+                else is_hemi = 1, is_half = 0; \
+            } \
+            else if ( nals==1 ) is_hemi = 1, is_half = 0; \
+            else is_hemi = 0, is_half = 0; \
+            pop_t **pop = &args->smpl2pop[i*(args->npop+1)]; \
+            while ( *pop ) { set_counts(*pop,is_half,is_hom,is_hemi,als); pop++; }\
+        } \
+    }
+    switch (fmt_gt->type) {
+        case BCF_BT_INT8:  BRANCH_INT(int8_t,  bcf_int8_vector_end); break;
+        case BCF_BT_INT16: BRANCH_INT(int16_t, bcf_int16_vector_end); break;
+        case BCF_BT_INT32: BRANCH_INT(int32_t, bcf_int32_vector_end); break;
+        default: error("The GT type is not recognised: %d at %s:%d\n",fmt_gt->type, bcf_seqname(args->in_hdr,rec),rec->pos+1); break;
+    }
+    #undef BRANCH_INT
+
+    if ( args->tags & SET_NS )
+    {
+        for (i=0; i<args->npop; i++)
+        {
+            args->str.l = 0;
+            ksprintf(&args->str, "NS%s", args->pop[i].suffix);
+            if ( bcf_update_info_int32(args->out_hdr,rec,args->str.s,&args->pop[i].ns,1)!=0 )
+                error("Error occurred while updating %s at %s:%d\n", args->str.s,bcf_seqname(args->in_hdr,rec),rec->pos+1);
+        }
+    }
+    if ( args->tags & SET_AN )
+    {
+        for (i=0; i<args->npop; i++)
+        {
+            pop_t *pop = &args->pop[i];
+            int32_t an = 0;
+            for (j=0; j<rec->n_allele; j++) 
+                an += pop->counts[j].nhet + pop->counts[j].nhom + pop->counts[j].nhemi + pop->counts[j].nac;
+
+            args->str.l = 0;
+            ksprintf(&args->str, "AN%s", args->pop[i].suffix);
+            if ( bcf_update_info_int32(args->out_hdr,rec,args->str.s,&an,1)!=0 )
+                error("Error occurred while updating %s at %s:%d\n", args->str.s,bcf_seqname(args->in_hdr,rec),rec->pos+1);
+        }
+    }
+    if ( args->tags & (SET_AF | SET_MAF) )
+    {
+        for (i=0; i<args->npop; i++)
+        {
+            int32_t an = 0;
+            if ( rec->n_allele > 1 )
+            {
+                pop_t *pop = &args->pop[i];
+                memset(args->farr, 0, sizeof(*args->farr)*(rec->n_allele-1));
+                for (j=1; j<rec->n_allele; j++) 
+                    args->farr[j-1] += pop->counts[j].nhet + pop->counts[j].nhom + pop->counts[j].nhemi + pop->counts[j].nac;
+                an = pop->counts[0].nhet + pop->counts[0].nhom + pop->counts[0].nhemi + pop->counts[0].nac;
+                for (j=1; j<rec->n_allele; j++) an += args->farr[j-1];
+                if ( !an ) continue;
+                for (j=1; j<rec->n_allele; j++) args->farr[j-1] /= an;
+            }
+            if ( args->tags & SET_AF )
+            {
+                args->str.l = 0;
+                ksprintf(&args->str, "AF%s", args->pop[i].suffix);
+                if ( bcf_update_info_float(args->out_hdr,rec,args->str.s,args->farr,rec->n_allele-1)!=0 )
+                    error("Error occurred while updating %s at %s:%d\n", args->str.s,bcf_seqname(args->in_hdr,rec),rec->pos+1);
+            }
+            if ( args->tags & SET_MAF )
+            {
+                if ( !an ) continue;
+                for (j=1; j<rec->n_allele; j++)
+                    if ( args->farr[j-1] > 0.5 ) args->farr[j-1] = 1 - args->farr[j-1];     // todo: this is incorrect for multiallelic sites
+                args->str.l = 0;
+                ksprintf(&args->str, "MAF%s", args->pop[i].suffix);
+                if ( bcf_update_info_float(args->out_hdr,rec,args->str.s,args->farr,rec->n_allele-1)!=0 )
+                    error("Error occurred while updating %s at %s:%d\n", args->str.s,bcf_seqname(args->in_hdr,rec),rec->pos+1);
+            }
+        }
+    }
+    if ( args->tags & SET_AC )
+    {
+        for (i=0; i<args->npop; i++)
+        {
+            if ( rec->n_allele > 1 )
+            {
+                pop_t *pop = &args->pop[i];
+                memset(args->iarr, 0, sizeof(*args->iarr)*(rec->n_allele-1));
+                for (j=1; j<rec->n_allele; j++) 
+                    args->iarr[j-1] += pop->counts[j].nhet + pop->counts[j].nhom + pop->counts[j].nhemi + pop->counts[j].nac;
+            }
+            args->str.l = 0;
+            ksprintf(&args->str, "AC%s", args->pop[i].suffix);
+            if ( bcf_update_info_int32(args->out_hdr,rec,args->str.s,args->iarr,rec->n_allele-1)!=0 )
+                error("Error occurred while updating %s at %s:%d\n", args->str.s,bcf_seqname(args->in_hdr,rec),rec->pos+1);
+        }
+    }
+    if ( args->tags & SET_AC_Het )
+    {
+        for (i=0; i<args->npop; i++)
+        {
+            if ( rec->n_allele > 1 )
+            {
+                pop_t *pop = &args->pop[i];
+                memset(args->iarr, 0, sizeof(*args->iarr)*(rec->n_allele-1));
+                for (j=1; j<rec->n_allele; j++) 
+                    args->iarr[j-1] += pop->counts[j].nhet;
+            }
+            args->str.l = 0;
+            ksprintf(&args->str, "AC_Het%s", args->pop[i].suffix);
+            if ( bcf_update_info_int32(args->out_hdr,rec,args->str.s,args->iarr,rec->n_allele-1)!=0 )
+                error("Error occurred while updating %s at %s:%d\n", args->str.s,bcf_seqname(args->in_hdr,rec),rec->pos+1);
+        }
+    }
+    if ( args->tags & SET_AC_Hom )
+    {
+        for (i=0; i<args->npop; i++)
+        {
+            if ( rec->n_allele > 1 )
+            {
+                pop_t *pop = &args->pop[i];
+                memset(args->iarr, 0, sizeof(*args->iarr)*(rec->n_allele-1));
+                for (j=1; j<rec->n_allele; j++) 
+                    args->iarr[j-1] += pop->counts[j].nhom;
+            }
+            args->str.l = 0;
+            ksprintf(&args->str, "AC_Hom%s", args->pop[i].suffix);
+            if ( bcf_update_info_int32(args->out_hdr,rec,args->str.s,args->iarr,rec->n_allele-1)!=0 )
+                error("Error occurred while updating %s at %s:%d\n", args->str.s,bcf_seqname(args->in_hdr,rec),rec->pos+1);
+        }
+    }
+    if ( args->tags & SET_AC_Hemi && rec->n_allele > 1 )
+    {
+        for (i=0; i<args->npop; i++)
+        {
+            if ( rec->n_allele > 1 )
+            {
+                pop_t *pop = &args->pop[i];
+                memset(args->iarr, 0, sizeof(*args->iarr)*(rec->n_allele-1));
+                for (j=1; j<rec->n_allele; j++) 
+                    args->iarr[j-1] += pop->counts[j].nhemi;
+            }
+            args->str.l = 0;
+            ksprintf(&args->str, "AC_Hemi%s", args->pop[i].suffix);
+            if ( bcf_update_info_int32(args->out_hdr,rec,args->str.s,args->iarr,rec->n_allele-1)!=0 )
+                error("Error occurred while updating %s at %s:%d\n", args->str.s,bcf_seqname(args->in_hdr,rec),rec->pos+1);
+        }
+    }
+    if ( args->tags & (SET_HWE|SET_ExcHet) )
+    {
+        for (i=0; i<args->npop; i++)
+        {
+            float *fhwe = args->farr;
+            float *fexc_het = args->farr + rec->n_allele;
+            if ( rec->n_allele > 1 )
+            {
+                pop_t *pop = &args->pop[i];
+                memset(args->farr,  0, sizeof(*args->farr)*(2*rec->n_allele));
+                int nref_tot = pop->counts[0].nhom;
+                for (j=0; j<rec->n_allele; j++) nref_tot += pop->counts[j].nhet;   // NB this neglects multiallelic genotypes
+                for (j=1; j<rec->n_allele; j++) 
+                {
+                    int nref = nref_tot - pop->counts[j].nhet;
+                    int nalt = pop->counts[j].nhet + pop->counts[j].nhom;
+                    int nhet = pop->counts[j].nhet;
+                    if ( nref>0 && nalt>0 )
+                        calc_hwe(args, nref, nalt, nhet, &fhwe[j-1], &fexc_het[j-1]);
+                    else
+                        fhwe[j-1] = fexc_het[j-1] = 1;
+                }
+            }
+            if ( args->tags & SET_HWE )
+            {
+                args->str.l = 0;
+                ksprintf(&args->str, "HWE%s", args->pop[i].suffix);
+                if ( bcf_update_info_float(args->out_hdr,rec,args->str.s,fhwe,rec->n_allele-1)!=0 )
+                    error("Error occurred while updating %s at %s:%d\n", args->str.s,bcf_seqname(args->in_hdr,rec),rec->pos+1);
+            }
+            if ( args->tags & SET_ExcHet )
+            {
+                args->str.l = 0;
+                ksprintf(&args->str, "ExcHet%s", args->pop[i].suffix);
+                if ( bcf_update_info_float(args->out_hdr,rec,args->str.s,fexc_het,rec->n_allele-1)!=0 )
+                    error("Error occurred while updating %s at %s:%d\n", args->str.s,bcf_seqname(args->in_hdr,rec),rec->pos+1);
+            }
+        }
+    }
+
+    return rec;
+}
+
+void destroy(void)
+{
+    int i; 
+    for (i=0; i<args->npop; i++)
+    {
+        free(args->pop[i].name);
+        free(args->pop[i].suffix);
+        free(args->pop[i].smpl);
+        free(args->pop[i].counts);
+    }
+    free(args->str.s);
+    free(args->pop);
+    free(args->smpl2pop);
+    free(args->iarr);
+    free(args->farr);
+    free(args->hwe_probs);
+    free(args);
+}
+
+
+
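The allele-frequency arithmetic in the hunk above can be sketched in isolation. The helpers below are hypothetical and not part of bcftools: they assume the per-allele counts (nhet + nhom + nhemi + nac in the plugin) have already been summed into one integer per allele, with index 0 holding the REF allele.

```c
#include <assert.h>

/* Hypothetical helper, not part of bcftools: turn per-allele counts
   (index 0 = REF, 1..n_allele-1 = ALT) into ALT allele frequencies.
   Returns the total allele number AN; af[] is left untouched when AN is 0,
   mirroring the "if ( !an ) continue;" guard in the plugin. */
static int allele_freqs(const int *counts, int n_allele, float *af)
{
    assert(counts && af && n_allele >= 1);
    int j, an = 0;
    for (j = 0; j < n_allele; j++) an += counts[j];
    if (an == 0) return 0;
    for (j = 1; j < n_allele; j++) af[j-1] = (float)counts[j] / an;
    return an;
}

/* Fold a frequency onto [0, 0.5] the way the MAF branch does. As the
   plugin's own comment notes, this is only meaningful for biallelic sites. */
static float fold_maf(float af)
{
    return af > 0.5f ? 1.0f - af : af;
}
```

With counts {6, 3, 1} the sketch yields AN = 10 and frequencies of roughly 0.3 and 0.1 for the two ALT alleles.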
diff --git a/bcftools/plugins/fixploidy.c b/bcftools/plugins/fixploidy.c
new file mode 100644
index 0000000..039d0f4
--- /dev/null
+++ b/bcftools/plugins/fixploidy.c
@@ -0,0 +1,250 @@
+/* 
+    Copyright (C) 2014 Genome Research Ltd.
+
+    Author: Petr Danecek <pd3@sanger.ac.uk>
+
+    Permission is hereby granted, free of charge, to any person obtaining a copy
+    of this software and associated documentation files (the "Software"), to deal
+    in the Software without restriction, including without limitation the rights
+    to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+    copies of the Software, and to permit persons to whom the Software is
+    furnished to do so, subject to the following conditions:
+    
+    The above copyright notice and this permission notice shall be included in
+    all copies or substantial portions of the Software.
+    
+    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+    IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+    FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+    AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+    LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+    OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+    THE SOFTWARE.
+*/
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <strings.h>
+#include <getopt.h>
+#include <stdarg.h>
+#include <ctype.h>
+#include <htslib/vcf.h>
+#include <htslib/kseq.h>
+#include "bcftools.h"
+#include "ploidy.h"
+
+static bcf_hdr_t *in_hdr = NULL, *out_hdr = NULL;
+static int *sample2sex = NULL;
+static int n_sample = 0, nsex = 0, *sex2ploidy = NULL;
+static int32_t ngt_arr = 0, *gt_arr = NULL, *gt_arr2 = NULL, ngt_arr2 = 0;
+static ploidy_t *ploidy = NULL;
+static int force_ploidy = -1;
+
+const char *about(void)
+{
+    return "Fix ploidy.\n";
+}
+
+const char *usage(void)
+{
+    return 
+        "\n"
+        "About: Fix ploidy\n"
+        "Usage: bcftools +fixploidy [General Options] -- [Plugin Options]\n"
+        "Options:\n"
+        "   run \"bcftools plugin\" for a list of common options\n"
+        "\n"
+        "Plugin options:\n"
+        "   -d, --default-ploidy <int>  default ploidy for regions unlisted in -p [2]\n"
+        "   -f, --force-ploidy <int>    ignore -p, set the same ploidy for all genotypes\n"
+        "   -p, --ploidy <file>         space/tab-delimited list of CHROM,FROM,TO,SEX,PLOIDY\n"
+        "   -s, --sex <file>            list of samples, \"NAME SEX\"\n"
+        "   -t, --tags <list>           VCF tags to fix [GT]\n"
+        "\n"
+        "Example:\n"
+        "   # Default ploidy, if -p not given. Unlisted regions have ploidy 2\n"
+        "   X 1 60000 M 1\n"
+        "   X 2699521 154931043 M 1\n"
+        "   Y 1 59373566 M 1\n"
+        "   Y 1 59373566 F 0\n"
+        "   MT 1 16569 M 1\n"
+        "   MT 1 16569 F 1\n"
+        "   \n"
+        "   # Example of -s file, sex of unlisted samples is \"F\"\n"
+        "   sampleName1 M\n"
+        "   \n"
+        "   bcftools +fixploidy in.vcf -- -s samples.txt\n"
+        "\n";
+}
+
+void set_samples(char *fname, bcf_hdr_t *hdr, ploidy_t *ploidy, int *sample2sex)
+{
+    kstring_t tmp = {0,0,0};
+
+    htsFile *fp = hts_open(fname, "r");
+    if ( !fp ) error("Could not read: %s\n", fname);
+    while ( hts_getline(fp, KS_SEP_LINE, &tmp) > 0 )
+    {
+        char *ss = tmp.s;
+        while ( *ss && isspace(*ss) ) ss++;
+        if ( !*ss ) error("Could not parse: %s\n", tmp.s);
+        if ( *ss=='#' ) continue;
+        char *se = ss;
+        while ( *se && !isspace(*se) ) se++;
+        char x = *se; *se = 0;
+
+        int ismpl = bcf_hdr_id2int(hdr, BCF_DT_SAMPLE, ss);
+        if ( ismpl < 0 ) { fprintf(stderr,"Warning: No such sample in the VCF: %s\n",ss); continue; }
+
+        *se = x;
+        ss = x ? se+1 : se;     // do not step past the terminating NUL when the SEX column is missing
+        while ( *ss && isspace(*ss) ) ss++;
+        if ( !*ss )  error("Could not parse: %s\n", tmp.s);
+        se = ss;
+        while ( *se && !isspace(*se) ) se++;
+        if ( se==ss ) error("Could not parse: %s\n", tmp.s);
+
+        sample2sex[ismpl] = ploidy_add_sex(ploidy, ss);
+    }
+    if ( hts_close(fp) ) error("Close failed: %s\n", fname);
+    free(tmp.s);
+}
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+    int c, default_ploidy = 2;
+    char *tags_str = "GT";
+    char *ploidy_fname = NULL, *sex_fname = NULL;
+
+    static struct option loptions[] =
+    {
+        {"default-ploidy",1,0,'d'},
+        {"force-ploidy",1,0,'f'},
+        {"ploidy",1,0,'p'},
+        {"sex",1,0,'s'},
+        {"tags",1,0,'t'},
+        {0,0,0,0}
+    };
+    char *tmp;
+    while ((c = getopt_long(argc, argv, "?ht:s:p:d:f:",loptions,NULL)) >= 0)
+    {
+        switch (c) 
+        {
+            case 'd': 
+                default_ploidy = strtol(optarg,&tmp,10);
+                if ( *tmp ) error("Could not parse: -d %s\n", optarg);
+                break;
+            case 'f': 
+                force_ploidy = strtol(optarg,&tmp,10);
+                if ( *tmp ) error("Could not parse: -f %s\n", optarg);
+                break;
+            case 'p': ploidy_fname = optarg; break;
+            case 's': sex_fname = optarg; break;
+            case 't': tags_str = optarg; break;
+            case 'h':
+            case '?':
+            default: error("%s", usage()); break;
+        }
+    }
+    if ( strcasecmp("GT",tags_str) ) error("Only -t GT is currently supported, sorry\n");
+
+    n_sample   = bcf_hdr_nsamples(in);
+    sample2sex = (int*) calloc(n_sample,sizeof(int));
+    in_hdr     = in;
+    out_hdr    = out;
+
+    if ( ploidy_fname )
+        ploidy = ploidy_init(ploidy_fname, default_ploidy);
+    else if ( force_ploidy==-1 )
+    {
+        ploidy = ploidy_init_string(
+                "X 1 60000 M 1\n"
+                "X 2699521 154931043 M 1\n"
+                "Y 1 59373566 M 1\n"
+                "Y 1 59373566 F 0\n"
+                "MT 1 16569 M 1\n"
+                "MT 1 16569 F 1\n", 2);
+    }
+    if ( force_ploidy==-1 )
+    {
+        if ( !ploidy ) return -1;
+
+        // add default sex in case it was not included
+        int i, dflt_sex_id = ploidy_add_sex(ploidy, "F");
+        for (i=0; i<n_sample; i++) sample2sex[i] = dflt_sex_id; // by default all are F
+        if ( sex_fname ) set_samples(sex_fname, in, ploidy, sample2sex);
+        nsex = ploidy_nsex(ploidy);
+        sex2ploidy = (int*) malloc(sizeof(int)*nsex);
+    }
+
+    return 0;
+}
+
+
+bcf1_t *process(bcf1_t *rec)
+{
+    int i,j, max_ploidy;
+
+    int ngts = bcf_get_genotypes(in_hdr, rec, &gt_arr, &ngt_arr);
+    if ( ngts<0 )
+        return rec;     // GT field not present
+
+    if ( ngts % n_sample )
+        error("Error at %s:%d: wrong number of GT fields\n",bcf_seqname(in_hdr,rec),rec->pos+1);
+
+    if ( force_ploidy==-1 )
+        ploidy_query(ploidy, (char*)bcf_seqname(in_hdr,rec), rec->pos, sex2ploidy,NULL,&max_ploidy);
+    else
+        max_ploidy = force_ploidy;
+
+    ngts /= n_sample;
+    if ( ngts < max_ploidy )
+    {
+        hts_expand(int32_t,max_ploidy*n_sample,ngt_arr2,gt_arr2);
+        for (i=0; i<n_sample; i++)
+        {
+            int ploidy = force_ploidy!=-1 ? force_ploidy : sex2ploidy[ sample2sex[i] ];
+            int32_t *src = &gt_arr[i*ngts];
+            int32_t *dst = &gt_arr2[i*max_ploidy];
+            j = 0;
+            if ( !ploidy ) { dst[j] = bcf_gt_missing; j++; }
+            else
+                while ( j<ngts && j<ploidy && src[j]!=bcf_int32_vector_end ) { dst[j] = src[j]; j++; }
+            assert( j );
+            while ( j<ploidy ) { dst[j] = dst[j-1]; j++; } // expand "." to "./." and "0" to "0/0"
+            while ( j<max_ploidy ) { dst[j] = bcf_int32_vector_end; j++; }
+        }
+        if ( bcf_update_genotypes(out_hdr,rec,gt_arr2,n_sample*max_ploidy) )
+            error("Could not update GT field at %s:%d\n", bcf_seqname(in_hdr,rec),rec->pos+1);
+    }
+    else if ( ngts!=1 || max_ploidy!=1 )
+    {
+        for (i=0; i<n_sample; i++)
+        {
+            int ploidy = force_ploidy!=-1 ? force_ploidy : sex2ploidy[ sample2sex[i] ];
+            int32_t *gts = &gt_arr[i*ngts];
+            j = 0;
+            if ( !ploidy ) { gts[j] = bcf_gt_missing; j++; }
+            else 
+                while ( j<ngts && j<ploidy && gts[j]!=bcf_int32_vector_end ) j++;
+            assert( j );
+            while ( j<ploidy ) { gts[j] = gts[j-1]; j++; } // expand "." to "./." and "0" to "0/0"
+            while ( j<ngts ) { gts[j] = bcf_int32_vector_end; j++; }
+        }
+        if ( bcf_update_genotypes(out_hdr,rec,gt_arr,n_sample*ngts) )
+            error("Could not update GT field at %s:%d\n", bcf_seqname(in_hdr,rec),rec->pos+1);
+    }
+    return rec;
+}
+
+
+void destroy(void)
+{
+    free(gt_arr);
+    free(gt_arr2);
+    free(sample2sex);
+    free(sex2ploidy);
+    if ( ploidy ) ploidy_destroy(ploidy);
+}
+
+
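The per-sample resizing that process() performs above can be shown on plain integer arrays. This is a minimal sketch under stated assumptions: fit_ploidy is a hypothetical helper, and GT_MISSING and GT_END are arbitrary placeholders standing in for htslib's bcf_gt_missing and bcf_int32_vector_end, not their real values.

```c
#include <assert.h>

/* Placeholders for bcf_gt_missing and bcf_int32_vector_end; the numeric
   values are chosen for this sketch only. */
enum { GT_MISSING = -1, GT_END = -2 };

/* Hypothetical helper: copy one sample's genotype from src (n_src slots)
   into dst (n_dst slots, n_dst >= ploidy and n_dst >= 1), resizing it to
   exactly `ploidy` alleles. */
static void fit_ploidy(const int *src, int n_src, int ploidy, int *dst, int n_dst)
{
    int j = 0;
    if (ploidy == 0) dst[j++] = GT_MISSING;         /* ploidy 0: emit a missing allele */
    else
        while (j < n_src && j < ploidy && src[j] != GT_END) { dst[j] = src[j]; j++; }
    assert(j > 0);                                  /* input must hold at least one allele */
    while (j < ploidy) { dst[j] = dst[j-1]; j++; }  /* "0" becomes "0/0", "." becomes "./." */
    while (j < n_dst)  { dst[j] = GT_END; j++; }    /* mark the unused slots */
}
```

A haploid call of allele 1 expanded to ploidy 2 becomes the pair 1, 1, while a diploid 0, 1 reduced to ploidy 1 keeps only the first allele and pads the second slot with the end marker.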
diff --git a/bcftools/plugins/fixploidy.c.pysam.c b/bcftools/plugins/fixploidy.c.pysam.c
new file mode 100644
index 0000000..e282c19
--- /dev/null
+++ b/bcftools/plugins/fixploidy.c.pysam.c
@@ -0,0 +1,252 @@
+#include "bcftools.pysam.h"
+
+/* 
+    Copyright (C) 2014 Genome Research Ltd.
+
+    Author: Petr Danecek <pd3@sanger.ac.uk>
+
+    Permission is hereby granted, free of charge, to any person obtaining a copy
+    of this software and associated documentation files (the "Software"), to deal
+    in the Software without restriction, including without limitation the rights
+    to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+    copies of the Software, and to permit persons to whom the Software is
+    furnished to do so, subject to the following conditions:
+    
+    The above copyright notice and this permission notice shall be included in
+    all copies or substantial portions of the Software.
+    
+    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+    IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+    FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+    AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+    LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+    OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+    THE SOFTWARE.
+*/
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <strings.h>
+#include <getopt.h>
+#include <stdarg.h>
+#include <ctype.h>
+#include <htslib/vcf.h>
+#include <htslib/kseq.h>
+#include "bcftools.h"
+#include "ploidy.h"
+
+static bcf_hdr_t *in_hdr = NULL, *out_hdr = NULL;
+static int *sample2sex = NULL;
+static int n_sample = 0, nsex = 0, *sex2ploidy = NULL;
+static int32_t ngt_arr = 0, *gt_arr = NULL, *gt_arr2 = NULL, ngt_arr2 = 0;
+static ploidy_t *ploidy = NULL;
+static int force_ploidy = -1;
+
+const char *about(void)
+{
+    return "Fix ploidy.\n";
+}
+
+const char *usage(void)
+{
+    return 
+        "\n"
+        "About: Fix ploidy\n"
+        "Usage: bcftools +fixploidy [General Options] -- [Plugin Options]\n"
+        "Options:\n"
+        "   run \"bcftools plugin\" for a list of common options\n"
+        "\n"
+        "Plugin options:\n"
+        "   -d, --default-ploidy <int>  default ploidy for regions unlisted in -p [2]\n"
+        "   -f, --force-ploidy <int>    ignore -p, set the same ploidy for all genotypes\n"
+        "   -p, --ploidy <file>         space/tab-delimited list of CHROM,FROM,TO,SEX,PLOIDY\n"
+        "   -s, --sex <file>            list of samples, \"NAME SEX\"\n"
+        "   -t, --tags <list>           VCF tags to fix [GT]\n"
+        "\n"
+        "Example:\n"
+        "   # Default ploidy, if -p not given. Unlisted regions have ploidy 2\n"
+        "   X 1 60000 M 1\n"
+        "   X 2699521 154931043 M 1\n"
+        "   Y 1 59373566 M 1\n"
+        "   Y 1 59373566 F 0\n"
+        "   MT 1 16569 M 1\n"
+        "   MT 1 16569 F 1\n"
+        "   \n"
+        "   # Example of -s file, sex of unlisted samples is \"F\"\n"
+        "   sampleName1 M\n"
+        "   \n"
+        "   bcftools +fixploidy in.vcf -- -s samples.txt\n"
+        "\n";
+}
+
+void set_samples(char *fname, bcf_hdr_t *hdr, ploidy_t *ploidy, int *sample2sex)
+{
+    kstring_t tmp = {0,0,0};
+
+    htsFile *fp = hts_open(fname, "r");
+    if ( !fp ) error("Could not read: %s\n", fname);
+    while ( hts_getline(fp, KS_SEP_LINE, &tmp) > 0 )
+    {
+        char *ss = tmp.s;
+        while ( *ss && isspace(*ss) ) ss++;
+        if ( !*ss ) error("Could not parse: %s\n", tmp.s);
+        if ( *ss=='#' ) continue;
+        char *se = ss;
+        while ( *se && !isspace(*se) ) se++;
+        char x = *se; *se = 0;
+
+        int ismpl = bcf_hdr_id2int(hdr, BCF_DT_SAMPLE, ss);
+        if ( ismpl < 0 ) { fprintf(bcftools_stderr,"Warning: No such sample in the VCF: %s\n",ss); continue; }
+
+        *se = x;
+        ss = x ? se+1 : se;     // do not step past the terminating NUL when the SEX column is missing
+        while ( *ss && isspace(*ss) ) ss++;
+        if ( !*ss )  error("Could not parse: %s\n", tmp.s);
+        se = ss;
+        while ( *se && !isspace(*se) ) se++;
+        if ( se==ss ) error("Could not parse: %s\n", tmp.s);
+
+        sample2sex[ismpl] = ploidy_add_sex(ploidy, ss);
+    }
+    if ( hts_close(fp) ) error("Close failed: %s\n", fname);
+    free(tmp.s);
+}
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+    int c, default_ploidy = 2;
+    char *tags_str = "GT";
+    char *ploidy_fname = NULL, *sex_fname = NULL;
+
+    static struct option loptions[] =
+    {
+        {"default-ploidy",1,0,'d'},
+        {"force-ploidy",1,0,'f'},
+        {"ploidy",1,0,'p'},
+        {"sex",1,0,'s'},
+        {"tags",1,0,'t'},
+        {0,0,0,0}
+    };
+    char *tmp;
+    while ((c = getopt_long(argc, argv, "?ht:s:p:d:f:",loptions,NULL)) >= 0)
+    {
+        switch (c) 
+        {
+            case 'd': 
+                default_ploidy = strtol(optarg,&tmp,10);
+                if ( *tmp ) error("Could not parse: -d %s\n", optarg);
+                break;
+            case 'f': 
+                force_ploidy = strtol(optarg,&tmp,10);
+                if ( *tmp ) error("Could not parse: -f %s\n", optarg);
+                break;
+            case 'p': ploidy_fname = optarg; break;
+            case 's': sex_fname = optarg; break;
+            case 't': tags_str = optarg; break;
+            case 'h':
+            case '?':
+            default: error("%s", usage()); break;
+        }
+    }
+    if ( strcasecmp("GT",tags_str) ) error("Only -t GT is currently supported, sorry\n");
+
+    n_sample   = bcf_hdr_nsamples(in);
+    sample2sex = (int*) calloc(n_sample,sizeof(int));
+    in_hdr     = in;
+    out_hdr    = out;
+
+    if ( ploidy_fname )
+        ploidy = ploidy_init(ploidy_fname, default_ploidy);
+    else if ( force_ploidy==-1 )
+    {
+        ploidy = ploidy_init_string(
+                "X 1 60000 M 1\n"
+                "X 2699521 154931043 M 1\n"
+                "Y 1 59373566 M 1\n"
+                "Y 1 59373566 F 0\n"
+                "MT 1 16569 M 1\n"
+                "MT 1 16569 F 1\n", 2);
+    }
+    if ( force_ploidy==-1 )
+    {
+        if ( !ploidy ) return -1;
+
+        // add default sex in case it was not included
+        int i, dflt_sex_id = ploidy_add_sex(ploidy, "F");
+        for (i=0; i<n_sample; i++) sample2sex[i] = dflt_sex_id; // by default all are F
+        if ( sex_fname ) set_samples(sex_fname, in, ploidy, sample2sex);
+        nsex = ploidy_nsex(ploidy);
+        sex2ploidy = (int*) malloc(sizeof(int)*nsex);
+    }
+
+    return 0;
+}
+
+
+bcf1_t *process(bcf1_t *rec)
+{
+    int i,j, max_ploidy;
+
+    int ngts = bcf_get_genotypes(in_hdr, rec, &gt_arr, &ngt_arr);
+    if ( ngts<0 )
+        return rec;     // GT field not present
+
+    if ( ngts % n_sample )
+        error("Error at %s:%d: wrong number of GT fields\n",bcf_seqname(in_hdr,rec),rec->pos+1);
+
+    if ( force_ploidy==-1 )
+        ploidy_query(ploidy, (char*)bcf_seqname(in_hdr,rec), rec->pos, sex2ploidy,NULL,&max_ploidy);
+    else
+        max_ploidy = force_ploidy;
+
+    ngts /= n_sample;
+    if ( ngts < max_ploidy )
+    {
+        hts_expand(int32_t,max_ploidy*n_sample,ngt_arr2,gt_arr2);
+        for (i=0; i<n_sample; i++)
+        {
+            int ploidy = force_ploidy!=-1 ? force_ploidy : sex2ploidy[ sample2sex[i] ];
+            int32_t *src = &gt_arr[i*ngts];
+            int32_t *dst = &gt_arr2[i*max_ploidy];
+            j = 0;
+            if ( !ploidy ) { dst[j] = bcf_gt_missing; j++; }
+            else
+                while ( j<ngts && j<ploidy && src[j]!=bcf_int32_vector_end ) { dst[j] = src[j]; j++; }
+            assert( j );
+            while ( j<ploidy ) { dst[j] = dst[j-1]; j++; } // expand "." to "./." and "0" to "0/0"
+            while ( j<max_ploidy ) { dst[j] = bcf_int32_vector_end; j++; }
+        }
+        if ( bcf_update_genotypes(out_hdr,rec,gt_arr2,n_sample*max_ploidy) )
+            error("Could not update GT field at %s:%d\n", bcf_seqname(in_hdr,rec),rec->pos+1);
+    }
+    else if ( ngts!=1 || max_ploidy!=1 )
+    {
+        for (i=0; i<n_sample; i++)
+        {
+            int ploidy = force_ploidy!=-1 ? force_ploidy : sex2ploidy[ sample2sex[i] ];
+            int32_t *gts = &gt_arr[i*ngts];
+            j = 0;
+            if ( !ploidy ) { gts[j] = bcf_gt_missing; j++; }
+            else 
+                while ( j<ngts && j<ploidy && gts[j]!=bcf_int32_vector_end ) j++;
+            assert( j );
+            while ( j<ploidy ) { gts[j] = gts[j-1]; j++; } // expand "." to "./." and "0" to "0/0"
+            while ( j<ngts ) { gts[j] = bcf_int32_vector_end; j++; }
+        }
+        if ( bcf_update_genotypes(out_hdr,rec,gt_arr,n_sample*ngts) )
+            error("Could not update GT field at %s:%d\n", bcf_seqname(in_hdr,rec),rec->pos+1);
+    }
+    return rec;
+}
+
+
+void destroy(void)
+{
+    free(gt_arr);
+    free(gt_arr2);
+    free(sample2sex);
+    free(sex2ploidy);
+    if ( ploidy ) ploidy_destroy(ploidy);
+}
+
+
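The "NAME SEX" line format that set_samples() reads from the -s file can be parsed without any htslib dependency. The function below is a hypothetical sketch, not the plugin's code: it splits a line in place and reports a missing SEX column as a parse error instead of reading past the end of the string.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: split "NAME SEX" in place.
   Returns 1 on success, 0 for a '#' comment line, -1 on a parse error. */
static int split_name_sex(char *line, char **name, char **sex)
{
    assert(line && name && sex);
    char *ss = line;
    while (*ss && isspace((unsigned char)*ss)) ss++;    /* skip leading blanks */
    if (!*ss) return -1;                                /* empty line */
    if (*ss == '#') return 0;                           /* comment */
    char *se = ss;
    while (*se && !isspace((unsigned char)*se)) se++;   /* end of NAME */
    if (!*se) return -1;                                /* no SEX column */
    *se = 0;
    *name = ss;
    ss = se + 1;
    while (*ss && isspace((unsigned char)*ss)) ss++;    /* blanks before SEX */
    if (!*ss) return -1;
    se = ss;
    while (*se && !isspace((unsigned char)*se)) se++;   /* end of SEX */
    *se = 0;
    *sex = ss;
    return 1;
}
```

Casting to unsigned char before calling isspace avoids undefined behaviour for bytes with the high bit set, which matters when sample names are not plain ASCII.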
diff --git a/bcftools/plugins/fixref.c b/bcftools/plugins/fixref.c
new file mode 100644
index 0000000..04c19fe
--- /dev/null
+++ b/bcftools/plugins/fixref.c
@@ -0,0 +1,572 @@
+/* The MIT License
+
+   Copyright (c) 2016 Genome Research Ltd.
+
+   Author: Petr Danecek <pd3@sanger.ac.uk>
+   
+   Permission is hereby granted, free of charge, to any person obtaining a copy
+   of this software and associated documentation files (the "Software"), to deal
+   in the Software without restriction, including without limitation the rights
+   to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+   copies of the Software, and to permit persons to whom the Software is
+   furnished to do so, subject to the following conditions:
+   
+   The above copyright notice and this permission notice shall be included in
+   all copies or substantial portions of the Software.
+   
+   THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+   IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+   FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+   AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+   LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+   OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+   THE SOFTWARE.
+
+ */
+
+/*
+    Illumina TOP/BOT strand convention causes a lot of pain. This tool
+    attempts to determine the strand convention and convert it to the
+    forward reference strand.
+
+    On TOP strand, we can encounter
+        unambiguous SNPs:
+            A/G
+            A/C
+        ambiguous (context-dependent) SNPs:
+            C/G
+            A/T
+
+    On BOT strand:
+        unambiguous SNPs:
+            T/G
+            T/C
+        ambiguous (context-dependent) SNPs:
+            T/A
+            G/C
+
+
+    For unambiguous pairs (A/C, A/G, T/C, T/G), knowing the reference base
+    at the SNP position is enough to determine the strand:
+
+         TOP      REF   ->  ALLELES   TOP_ON_STRAND
+         -------------------------------------------
+         A/C     A or C      A/C         1
+          "      T or G      T/G        -1
+         A/G     A or G      A/G         1
+          "      T or C      T/C        -1
+
+
+    For ambiguous pairs (A/T, C/G), the flanking sequence must be walked
+    (simultaneously upstream and downstream) until the first unambiguous pair
+    is encountered. The 5' base determines the strand:
+
+         TOP    5'REF_BASE   ->  ALLELES   TOP_ON_STRAND
+         ------------------------------------------------
+         A/T    A or T             A/T          1
+          "     C or G             T/A         -1
+         C/G    A or T             C/G          1
+          "     C or G             G/C         -1
+
+ */
+
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <strings.h>
+#include <getopt.h>
+#include <math.h>
+#include <htslib/hts.h>
+#include <htslib/vcf.h>
+#include <htslib/kstring.h>
+#include <htslib/kseq.h>
+#include <htslib/kfunc.h>
+#include <htslib/faidx.h>
+#include <htslib/khash.h>
+#include <htslib/synced_bcf_reader.h>
+#include "bcftools.h"
+
+#define MODE_STATS    1
+#define MODE_TOP2FWD  2
+#define MODE_FLIP2FWD 3
+#define MODE_USE_ID   4
+
+typedef struct
+{
+    uint32_t pos;
+    uint8_t ref;
+}
+marker_t;
+
+KHASH_MAP_INIT_INT(i2m, marker_t)
+typedef khash_t(i2m) i2m_t;
+
+typedef struct
+{
+    char *dbsnp_fname;
+    int mode, discard;
+    bcf_hdr_t *hdr;
+    faidx_t *fai;
+    int rid, skip_rid;
+    i2m_t *i2m;
+    int32_t *gts, ngts, pos;
+    uint32_t nsite,nok,nflip,nunresolved,nswap,nflip_swap,nonSNP,nonACGT,nonbiallelic;
+    uint32_t count[4][4], npos_err, unsorted;
+}
+args_t;
+
+args_t args;
+
+const char *about(void)
+{
+    return "Fix reference strand orientation, e.g. from Illumina/TOP to fwd.\n";
+}
+
+const char *usage(void)
+{
+    return 
+        "\n"
+        "About: This tool helps to determine and fix strand orientation.\n"
+        "       Currently the following modes are recognised:\n"
+        "           flip  .. flips non-ambiguous SNPs and ignores the rest\n"
+        "           id    .. swap REF/ALT and GTs using the ID column to determine the REF allele\n"
+        "           stats .. collect and print stats\n"
+        "           top   .. converts from Illumina TOP strand to fwd\n"
+        "\n"
+        "       WARNING: Do not use the program blindly, make an effort to\n"
+        "       understand what strand convention your data uses! Make sure\n"
+        "       the reason for mismatching REF alleles is not a different\n"
+        "       reference build!!\n"
+        "\n"
+        "Usage: bcftools +fixref [General Options] -- [Plugin Options]\n"
+        "Options:\n"
+        "   run \"bcftools plugin\" for a list of common options\n"
+        "\n"
+        "Plugin options:\n"
+        "   -d, --discard               Discard sites which could not be resolved\n"
+        "   -f, --fasta-ref <file.fa>   Reference sequence\n"
+        "   -i, --use-id <file.vcf>     Swap REF/ALT using the ID column to determine the REF allele, implies -m id.\n"
+        "                               Download the dbSNP file from\n"
+        "                                   https://www.ncbi.nlm.nih.gov/variation/docs/human_variation_vcf\n"
+        "   -m, --mode <string>         Collect stats (\"stats\") or convert (\"flip\", \"id\", \"top\") [stats]\n"
+        "\n"
+        "Examples:\n"
+        "   # run stats\n"
+        "   bcftools +fixref file.bcf -- -f ref.fa\n"
+        "\n"
+        "   # convert from TOP to fwd\n"
+        "   bcftools +fixref file.bcf -Ob -o out.bcf -- -f ref.fa -m top\n"
+        "\n"
+        "   # match the REF/ALT alleles based on the ID column, discard unknown sites\n"
+        "   bcftools +fixref file.bcf -Ob -o out.bcf -- -d -f ref.fa -i All_20151104.vcf.gz\n"
+        "\n"
+        "   # assuming the reference build is correct, just flip to fwd, discarding the rest\n"
+        "   bcftools +fixref file.bcf -Ob -o out.bcf -- -d -f ref.fa -m flip\n"
+        "\n";
+}
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+    memset(&args,0,sizeof(args_t));
+    args.skip_rid = -1;
+    args.hdr = in;
+    args.mode = MODE_STATS;
+    char *ref_fname = NULL;
+    static struct option loptions[] =
+    {
+        {"mode",required_argument,NULL,'m'},
+        {"discard",no_argument,NULL,'d'},
+        {"fasta-ref",required_argument,NULL,'f'},
+        {"use-id",required_argument,NULL,'i'},
+        {NULL,0,NULL,0}
+    };
+    int c;
+    while ((c = getopt_long(argc, argv, "?hf:m:di:",loptions,NULL)) >= 0)
+    {
+        switch (c) 
+        {
+            case 'm': 
+                if ( !strcasecmp(optarg,"top") ) args.mode = MODE_TOP2FWD; 
+                else if ( !strcasecmp(optarg,"flip") ) args.mode = MODE_FLIP2FWD; 
+                else if ( !strcasecmp(optarg,"id") ) args.mode = MODE_USE_ID; 
+                else if ( !strcasecmp(optarg,"stats") ) args.mode = MODE_STATS; 
+                else error("The source strand convention not recognised: %s\n", optarg);
+                break;
+            case 'i': args.dbsnp_fname = optarg; args.mode = MODE_USE_ID; break;
+            case 'd': args.discard = 1; break;
+            case 'f': ref_fname = optarg; break;
+            case 'h':
+            case '?':
+            default: error("%s", usage()); break;
+        }
+    }
+    if ( !ref_fname ) error("Expected the -f option\n");
+    args.fai = fai_load(ref_fname);
+    if ( !args.fai ) error("Failed to load the fai index: %s\n", ref_fname);
+
+    if ( args.mode==MODE_STATS ) return 1;
+    return 0;
+}
+
+static bcf1_t *set_ref_alt(args_t *args, bcf1_t *rec, const char ref, const char alt, int swap)
+{
+    rec->d.allele[0][0] = ref;
+    rec->d.allele[1][0] = alt;
+    rec->d.shared_dirty |= BCF1_DIRTY_ALS;
+
+    if ( !swap ) return rec;    // only fix the alleles, leaving GTs unchanged
+
+    int ngts = bcf_get_genotypes(args->hdr, rec, &args->gts, &args->ngts);
+    int i, j, nsmpl = bcf_hdr_nsamples(args->hdr);
+    ngts /= nsmpl;
+    for (i=0; i<nsmpl; i++)
+    {
+        int32_t *ptr = args->gts + i*ngts;
+        for (j=0; j<ngts; j++)
+        {
+            if ( ptr[j]==bcf_gt_unphased(0) ) ptr[j] = bcf_gt_unphased(1);
+            else if ( ptr[j]==bcf_gt_phased(0) ) ptr[j] = bcf_gt_phased(1);
+            else if ( ptr[j]==bcf_gt_unphased(1) ) ptr[j] = bcf_gt_unphased(0);
+            else if ( ptr[j]==bcf_gt_phased(1) ) ptr[j] = bcf_gt_phased(0);
+        }
+    }
+    if ( ngts>0 ) bcf_update_genotypes(args->hdr,rec,args->gts,ngts*nsmpl);   // args->ngts is the buffer capacity, not the value count
+    
+    return rec;
+}
+
+static inline int nt2int(char nt)
+{
+    nt = toupper(nt);
+    if ( nt=='A' ) return 0;
+    if ( nt=='C' ) return 1;
+    if ( nt=='G' ) return 2;
+    if ( nt=='T' ) return 3;
+    return -1;
+}
+#define int2nt(x) "ACGT"[x]
+#define revint(x) ("3210"[x]-'0')
+
+static inline uint32_t parse_rsid(char *name)
+{
+    if ( name[0]!='r' || name[1]!='s' ) 
+    {
+        name = strstr(name, "rs");
+        if ( !name ) return 0;
+    }
+    char *tmp;
+    name += 2;
+    uint64_t id = strtol(name, &tmp, 10);
+    if ( tmp==name || *tmp ) return 0;
+    if ( id > UINT32_MAX ) error("FIXME: the ID is too big for uint32_t: %s\n", name-2);
+    return id;
+}
+
+static int fetch_ref(args_t *args, bcf1_t *rec)
+{
+    // Get the reference allele
+    int len;
+    char *ref = faidx_fetch_seq(args->fai, (char*)bcf_seqname(args->hdr,rec), rec->pos, rec->pos, &len);
+    if ( !ref )
+    {
+        if ( faidx_has_seq(args->fai, bcf_seqname(args->hdr,rec))==0 )
+        {
+            fprintf(stderr,"Ignoring sequence \"%s\"\n", bcf_seqname(args->hdr,rec));
+            args->skip_rid = rec->rid;
+            return -2;
+        }
+        error("faidx_fetch_seq failed at %s:%d\n", bcf_seqname(args->hdr,rec),rec->pos+1);
+    }
+    int ir = nt2int(*ref);
+    free(ref);
+    return ir;
+}
+
+static void dbsnp_init(args_t *args, const char *chr)
+{
+    if ( args->i2m ) kh_destroy(i2m, args->i2m);
+    args->i2m = kh_init(i2m);
+    bcf_srs_t *sr = bcf_sr_init();
+    if ( bcf_sr_set_regions(sr, chr, 0) != 0 ) goto done;
+    if ( !bcf_sr_add_reader(sr,args->dbsnp_fname) ) error("Failed to open %s: %s\n", args->dbsnp_fname,bcf_sr_strerror(sr->errnum));
+    while ( bcf_sr_next_line(sr) )
+    {
+        bcf1_t *rec = bcf_sr_get_line(sr, 0);
+        if ( rec->d.allele[0][1]!=0 || rec->d.allele[1][1]!=0 ) continue;   // skip non-snps
+
+        int ref = nt2int(rec->d.allele[0][0]);
+        if ( ref<0 ) continue;     // non-[ACGT] base
+
+        uint32_t id = parse_rsid(rec->d.id);
+        if ( !id ) continue;
+
+        int ret, k;
+        k = kh_put(i2m, args->i2m, id, &ret);
+        if ( ret<0 ) error("An error occurred while inserting the key %u\n", id);
+        if ( ret==0 ) continue; // skip ambiguous id
+        kh_val(args->i2m, k).pos = (uint32_t)rec->pos;
+        kh_val(args->i2m, k).ref = ref;
+    }
+done:
+    bcf_sr_destroy(sr);
+}
+
+static bcf1_t *dbsnp_check(args_t *args, bcf1_t *rec, int ir, int ia, int ib)
+{
+    int k, ref,pos;
+    uint32_t id = parse_rsid(rec->d.id);
+    if ( !id ) goto no_info;
+
+    k = kh_get(i2m, args->i2m, id);
+    if ( k==kh_end(args->i2m) ) goto no_info;
+
+    pos = (int)kh_val(args->i2m, k).pos;
+    if ( pos != rec->pos ) 
+    {
+        rec->pos = pos;
+        ir = fetch_ref(args, rec);
+        args->npos_err++;
+    }
+
+    ref = kh_val(args->i2m, k).ref;
+    if ( ref!=ir )
+        error("Reference base mismatch at %s:%d .. %c vs %c\n",bcf_seqname(args->hdr,rec),rec->pos+1,int2nt(ref),int2nt(ir));
+
+    if ( ia==ref ) return rec;
+    if ( ib==ref ) { args->nswap++; return set_ref_alt(args,rec,int2nt(ib),int2nt(ia),1); }
+
+no_info:
+    args->nunresolved++;
+    return args->discard ? NULL : rec;
+}
+
+bcf1_t *process(bcf1_t *rec)
+{
+    if ( rec->rid == args.skip_rid ) return NULL;
+
+    bcf1_t *ret = args.mode==MODE_STATS ? NULL : rec;
+    args.nsite++;
+
+    // Skip non-SNPs
+    if ( bcf_get_variant_types(rec)!=VCF_SNP )
+    {
+        args.nonSNP++;
+        return args.discard ? NULL : ret;
+    }
+
+    // Get the reference allele
+    int ir = fetch_ref(&args, rec);
+    if ( ir==-2 ) return NULL;
+    if ( ir==-1 )
+    {
+        args.nonACGT++;
+        return args.discard ? NULL : ret;     // not A,C,G,T
+    }
+
+    if ( rec->n_allele!=2 )
+    {
+        // not a biallelic site
+        args.nonbiallelic++;
+        return args.discard ? NULL : ret;
+    }
+
+    int ia = nt2int(rec->d.allele[0][0]);
+    if ( ia<0 )
+    {
+        // not A,C,G,T
+        args.nonACGT++;
+        return args.discard ? NULL : ret;
+    }
+
+    int ib = nt2int(rec->d.allele[1][0]);
+    if ( ib<0 )
+    {
+        // not A,C,G,T
+        args.nonACGT++;
+        return args.discard ? NULL : ret;
+    }
+
+    if ( ia==ib )
+    {
+        // should not happen in well-formed VCF
+        args.nonSNP++;
+        return args.discard ? NULL : ret;
+    }
+    args.count[ia][ib]++;
+
+    if ( ir==ia ) args.nok++;
+
+    if ( args.mode==MODE_USE_ID )
+    {
+        if ( !args.i2m || args.rid!=rec->rid )
+        {
+            args.pos = 0;
+            args.rid = rec->rid;
+            dbsnp_init(&args,bcf_seqname(args.hdr,rec));
+        }
+        ret = dbsnp_check(&args, rec, ir,ia,ib);
+        if ( !args.unsorted && args.pos > rec->pos )
+        {
+            fprintf(stderr,
+                "Warning: corrected position(s) results in unsorted VCF, for example %s:%d comes after %s:%d\n"
+                "         The standard unix `sort` or `vcf-sort` from vcftools can be used to fix the order.\n",
+                bcf_seqname(args.hdr,rec),rec->pos+1,bcf_seqname(args.hdr,rec),args.pos);
+            args.unsorted = 1;
+        }
+        args.pos = rec->pos;
+        return ret;
+    }
+    else if ( args.mode==MODE_FLIP2FWD )
+    {
+        int pair = 1 << ia | 1 << ib;
+        if ( pair==0x9 || pair==0x6 )   // skip ambiguous pairs: A/T or C/G
+        {
+            args.nunresolved++;
+            return args.discard ? NULL : ret;
+        }
+        if ( ir==ia ) return ret;
+        if ( ir==ib ) { args.nswap++; return set_ref_alt(&args,rec,int2nt(ib),int2nt(ia),1); }
+        if ( ir==revint(ia) ) { args.nflip++; return set_ref_alt(&args,rec,int2nt(revint(ia)),int2nt(revint(ib)),0); }
+        if ( ir==revint(ib) ) { args.nflip_swap++; return set_ref_alt(&args,rec,int2nt(revint(ib)),int2nt(revint(ia)),1); }
+        error("FIXME: this should not happen %s:%d\n", bcf_seqname(args.hdr,rec),rec->pos+1);
+    }
+    else if ( args.mode==MODE_TOP2FWD )
+    {
+        int pair = 1 << ia | 1 << ib;
+        if ( pair != 0x9 && pair != 0x6 )    // unambiguous pair: A/C or A/G
+        {
+            if ( ir==ia ) return ret;
+
+            int ia_rev = revint(ia);
+            if ( ir==ia_rev )               // vcfref is A, faref is T, flip
+            {
+                args.nflip++;
+                return set_ref_alt(&args,rec,int2nt(ia_rev),int2nt(revint(ib)),0);
+            }
+            if ( ir==ib )                   // vcfalt is faref, swap
+            {
+                args.nswap++;
+                return set_ref_alt(&args,rec,int2nt(ib),int2nt(ia),1);
+            }
+            assert( ib==revint(ir) );
+
+            args.nflip_swap++;
+            return set_ref_alt(&args,rec,int2nt(revint(ib)),int2nt(revint(ia)),1);
+        }
+        else    // ambiguous pair, sequence walking must be performed
+        {
+            int len, win = rec->pos > 100 ? 100 : rec->pos, beg = rec->pos - win, end = rec->pos + win;
+            char *ref = faidx_fetch_seq(args.fai, (char*)bcf_seqname(args.hdr,rec), beg,end, &len);
+            if ( !ref ) error("faidx_fetch_seq failed at %s:%d\n", bcf_seqname(args.hdr,rec),rec->pos+1);
+            if ( end - beg + 1 != len ) error("FIXME: check win=%d,len=%d at %s:%d  (%d %d)\n", win,len, bcf_seqname(args.hdr,rec),rec->pos+1, end,beg);
+
+            int i, mid = rec->pos - beg, strand = 0;
+            for (i=1; i<=win; i++)
+            {
+                int ra = nt2int(ref[mid-i]);
+                int rb = nt2int(ref[mid+i]);
+                if ( ra<0 || rb<0 || ra==rb ) continue;     // skip N's and non-informative pairs: A/A, C/C, G/G, T/T
+                pair = 1 << ra | 1 << rb;
+                if ( pair==0x9 || pair==0x6 ) continue;     // skip ambiguous pairs: A/T or C/G
+                strand = 1 << ra & 0x9 ? 1 : -1;
+                break;
+            }
+            free(ref);
+            
+            if ( strand==1 )
+            {
+                if ( ir==ia ) return ret;
+                if ( ir==ib )
+                {
+                    args.nswap++;
+                    return set_ref_alt(&args,rec,int2nt(ib),int2nt(ia),1);
+                }
+            }
+            else if ( strand==-1 )
+            {
+                int ia_rev = revint(ia);
+                int ib_rev = revint(ib);
+                if ( ir==ia_rev )
+                {
+                    args.nflip++;
+                    return set_ref_alt(&args,rec,int2nt(ia_rev),int2nt(ib_rev),0);
+                }
+                if ( ir==ib_rev )
+                {
+                    args.nflip_swap++;
+                    return set_ref_alt(&args,rec,int2nt(ib_rev),int2nt(ia_rev),1);
+                }
+            }
+
+            args.nunresolved++;
+            return args.discard ? NULL : ret;
+        }
+    }
+    return ret;
+}
+
+int top_mask[4][4] = 
+{ 
+    {0,1,1,1},
+    {0,0,1,0},
+    {0,0,0,0},
+    {0,0,0,0},
+};
+int bot_mask[4][4] = 
+{ 
+    {0,0,0,0},
+    {0,0,0,0},
+    {0,1,0,0},
+    {1,1,1,0},
+};
+
+void destroy(void)
+{
+    uint32_t i,j,tot = 0;
+    uint32_t top_err = 0, bot_err = 0;
+    for (i=0; i<4; i++)
+    {
+        for (j=0; j<4; j++)
+        {
+            tot += args.count[i][j];
+            if ( !top_mask[i][j] && args.count[i][j] ) top_err++;
+            if ( !bot_mask[i][j] && args.count[i][j] ) bot_err++;
+        }
+    }
+    uint32_t nskip = args.nonACGT+args.nonSNP+args.nonbiallelic;
+    uint32_t ncmp  = args.nsite - nskip;
+
+    fprintf(stderr,"# SC, guessed strand convention\n");
+    fprintf(stderr,"SC\tTOP-compatible\t%d\n",top_err?0:1);
+    fprintf(stderr,"SC\tBOT-compatible\t%d\n",bot_err?0:1);
+
+    fprintf(stderr,"# ST, substitution types\n");
+    for (i=0; i<4; i++)
+    {
+        for (j=0; j<4; j++)
+        {
+            if ( i==j ) continue;
+            fprintf(stderr,"ST\t%c>%c\t%u\t%.1f%%\n", int2nt(i),int2nt(j),args.count[i][j], args.count[i][j]*100./tot);
+        }
+    }
+    fprintf(stderr,"# NS, Number of sites:\n");
+    fprintf(stderr,"NS\ttotal        \t%u\n", args.nsite);
+    fprintf(stderr,"NS\tref match    \t%u\t%.1f%%\n", args.nok,100.*args.nok/ncmp);
+    fprintf(stderr,"NS\tref mismatch \t%u\t%.1f%%\n", ncmp-args.nok,100.*(ncmp-args.nok)/ncmp);
+    if ( args.mode!=MODE_STATS )
+    {
+        fprintf(stderr,"NS\tflipped      \t%u\t%.1f%%\n", args.nflip,100.*args.nflip/(args.nsite-nskip));
+        fprintf(stderr,"NS\tswapped      \t%u\t%.1f%%\n", args.nswap,100.*args.nswap/(args.nsite-nskip));
+        fprintf(stderr,"NS\tflip+swap    \t%u\t%.1f%%\n", args.nflip_swap,100.*args.nflip_swap/(args.nsite-nskip));
+        fprintf(stderr,"NS\tunresolved   \t%u\t%.1f%%\n", args.nunresolved,100.*args.nunresolved/(args.nsite-nskip));
+        fprintf(stderr,"NS\tfixed pos    \t%u\t%.1f%%\n", args.npos_err,100.*args.npos_err/(args.nsite-nskip));
+    }
+    fprintf(stderr,"NS\tskipped      \t%u\n", nskip);
+    fprintf(stderr,"NS\tnon-ACGT     \t%u\n", args.nonACGT);
+    fprintf(stderr,"NS\tnon-SNP      \t%u\n", args.nonSNP);
+    fprintf(stderr,"NS\tnon-biallelic\t%u\n", args.nonbiallelic);
+
+    free(args.gts);
+    if ( args.fai ) fai_destroy(args.fai);
+    if ( args.i2m ) kh_destroy(i2m, args.i2m);
+}
diff --git a/bcftools/plugins/fixref.c.pysam.c b/bcftools/plugins/fixref.c.pysam.c
new file mode 100644
index 0000000..6d3aa05
--- /dev/null
+++ b/bcftools/plugins/fixref.c.pysam.c
@@ -0,0 +1,574 @@
+#include "bcftools.pysam.h"
+
+/* The MIT License
+
+   Copyright (c) 2016 Genome Research Ltd.
+
+   Author: Petr Danecek <pd3@sanger.ac.uk>
+   
+   Permission is hereby granted, free of charge, to any person obtaining a copy
+   of this software and associated documentation files (the "Software"), to deal
+   in the Software without restriction, including without limitation the rights
+   to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+   copies of the Software, and to permit persons to whom the Software is
+   furnished to do so, subject to the following conditions:
+   
+   The above copyright notice and this permission notice shall be included in
+   all copies or substantial portions of the Software.
+   
+   THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+   IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+   FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+   AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+   LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+   OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+   THE SOFTWARE.
+
+ */
+
+/*
+    Illumina TOP/BOT strand convention causes a lot of pain. This tool
+    attempts to determine the strand convention and convert it to the
+    forward reference strand.
+
+    On TOP strand, we can encounter
+        unambiguous SNPs:
+            A/G
+            A/C
+        ambiguous (context-dependent) SNPs:
+            C/G
+            A/T
+
+    On BOT strand:
+        unambiguous SNPs:
+            T/G
+            T/C
+        ambiguous (context-dependent) SNPs:
+            T/A
+            G/C
+
+
+    For unambiguous pairs (A/C, A/G, T/C, T/G), the knowledge of reference base
+    at the SNP position is enough to determine the strand:
+
+         TOP      REF   ->  ALLELES   TOP_ON_STRAND
+         -------------------------------------------
+         A/C     A or C      A/C         1
+          "      T or G      T/G        -1
+         A/G     A or G      A/G         1
+          "      T or C      T/C        -1
+
+
+    For ambiguous pairs (A/T, C/G), a sequence walking must be performed
+    (simultaneously upstream and downstream) until the first unambiguous pair
+    is encountered. The 5' base determines the strand:
+
+         TOP    5'REF_BASE   ->  ALLELES   TOP_ON_STRAND
+         ------------------------------------------------
+         A/T    A or T             A/T          1
+          "     C or G             T/A         -1
+         C/G    A or T             C/G          1
+          "     C or G             G/C         -1
+
+ */
+
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <strings.h>
+#include <getopt.h>
+#include <math.h>
+#include <htslib/hts.h>
+#include <htslib/vcf.h>
+#include <htslib/kstring.h>
+#include <htslib/kseq.h>
+#include <htslib/kfunc.h>
+#include <htslib/faidx.h>
+#include <htslib/khash.h>
+#include <htslib/synced_bcf_reader.h>
+#include "bcftools.h"
+
+#define MODE_STATS    1
+#define MODE_TOP2FWD  2
+#define MODE_FLIP2FWD 3
+#define MODE_USE_ID   4
+
+typedef struct
+{
+    uint32_t pos;
+    uint8_t ref;
+}
+marker_t;
+
+KHASH_MAP_INIT_INT(i2m, marker_t)
+typedef khash_t(i2m) i2m_t;
+
+typedef struct
+{
+    char *dbsnp_fname;
+    int mode, discard;
+    bcf_hdr_t *hdr;
+    faidx_t *fai;
+    int rid, skip_rid;
+    i2m_t *i2m;
+    int32_t *gts, ngts, pos;
+    uint32_t nsite,nok,nflip,nunresolved,nswap,nflip_swap,nonSNP,nonACGT,nonbiallelic;
+    uint32_t count[4][4], npos_err, unsorted;
+}
+args_t;
+
+args_t args;
+
+const char *about(void)
+{
+    return "Fix reference strand orientation, e.g. from Illumina/TOP to fwd.\n";
+}
+
+const char *usage(void)
+{
+    return 
+        "\n"
+        "About: This tool helps to determine and fix strand orientation.\n"
+        "       Currently the following modes are recognised:\n"
+        "           flip  .. flips non-ambiguous SNPs and ignores the rest\n"
+        "           id    .. swap REF/ALT and GTs using the ID column to determine the REF allele\n"
+        "           stats .. collect and print stats\n"
+        "           top   .. converts from Illumina TOP strand to fwd\n"
+        "\n"
+        "       WARNING: Do not use the program blindly, make an effort to\n"
+        "       understand what strand convention your data uses! Make sure\n"
+        "       the reason for mismatching REF alleles is not a different\n"
+        "       reference build!!\n"
+        "\n"
+        "Usage: bcftools +fixref [General Options] -- [Plugin Options]\n"
+        "Options:\n"
+        "   run \"bcftools plugin\" for a list of common options\n"
+        "\n"
+        "Plugin options:\n"
+        "   -d, --discard               Discard sites which could not be resolved\n"
+        "   -f, --fasta-ref <file.fa>   Reference sequence\n"
+        "   -i, --use-id <file.vcf>     Swap REF/ALT using the ID column to determine the REF allele, implies -m id.\n"
+        "                               Download the dbSNP file from\n"
+        "                                   https://www.ncbi.nlm.nih.gov/variation/docs/human_variation_vcf\n"
+        "   -m, --mode <string>         Collect stats (\"stats\") or convert (\"flip\", \"id\", \"top\") [stats]\n"
+        "\n"
+        "Examples:\n"
+        "   # run stats\n"
+        "   bcftools +fixref file.bcf -- -f ref.fa\n"
+        "\n"
+        "   # convert from TOP to fwd\n"
+        "   bcftools +fixref file.bcf -Ob -o out.bcf -- -f ref.fa -m top\n"
+        "\n"
+        "   # match the REF/ALT alleles based on the ID column, discard unknown sites\n"
+        "   bcftools +fixref file.bcf -Ob -o out.bcf -- -d -f ref.fa -i All_20151104.vcf.gz\n"
+        "\n"
+        "   # assuming the reference build is correct, just flip to fwd, discarding the rest\n"
+        "   bcftools +fixref file.bcf -Ob -o out.bcf -- -d -f ref.fa -m flip\n"
+        "\n";
+}
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+    memset(&args,0,sizeof(args_t));
+    args.skip_rid = -1;
+    args.hdr = in;
+    args.mode = MODE_STATS;
+    char *ref_fname = NULL;
+    static struct option loptions[] =
+    {
+        {"mode",required_argument,NULL,'m'},
+        {"discard",no_argument,NULL,'d'},
+        {"fasta-ref",required_argument,NULL,'f'},
+        {"use-id",required_argument,NULL,'i'},
+        {NULL,0,NULL,0}
+    };
+    int c;
+    while ((c = getopt_long(argc, argv, "?hf:m:di:",loptions,NULL)) >= 0)
+    {
+        switch (c) 
+        {
+            case 'm': 
+                if ( !strcasecmp(optarg,"top") ) args.mode = MODE_TOP2FWD; 
+                else if ( !strcasecmp(optarg,"flip") ) args.mode = MODE_FLIP2FWD; 
+                else if ( !strcasecmp(optarg,"id") ) args.mode = MODE_USE_ID; 
+                else if ( !strcasecmp(optarg,"stats") ) args.mode = MODE_STATS; 
+                else error("The source strand convention not recognised: %s\n", optarg);
+                break;
+            case 'i': args.dbsnp_fname = optarg; args.mode = MODE_USE_ID; break;
+            case 'd': args.discard = 1; break;
+            case 'f': ref_fname = optarg; break;
+            case 'h':
+            case '?':
+            default: error("%s", usage()); break;
+        }
+    }
+    if ( !ref_fname ) error("Expected the -f option\n");
+    args.fai = fai_load(ref_fname);
+    if ( !args.fai ) error("Failed to load the fai index: %s\n", ref_fname);
+
+    if ( args.mode==MODE_STATS ) return 1;
+    return 0;
+}
+
+static bcf1_t *set_ref_alt(args_t *args, bcf1_t *rec, const char ref, const char alt, int swap)
+{
+    rec->d.allele[0][0] = ref;
+    rec->d.allele[1][0] = alt;
+    rec->d.shared_dirty |= BCF1_DIRTY_ALS;
+
+    if ( !swap ) return rec;    // only fix the alleles, leaving GTs unchanged
+
+    int ngts = bcf_get_genotypes(args->hdr, rec, &args->gts, &args->ngts);
+    int i, j, nsmpl = bcf_hdr_nsamples(args->hdr);
+    ngts /= nsmpl;
+    for (i=0; i<nsmpl; i++)
+    {
+        int32_t *ptr = args->gts + i*ngts;
+        for (j=0; j<ngts; j++)
+        {
+            if ( ptr[j]==bcf_gt_unphased(0) ) ptr[j] = bcf_gt_unphased(1);
+            else if ( ptr[j]==bcf_gt_phased(0) ) ptr[j] = bcf_gt_phased(1);
+            else if ( ptr[j]==bcf_gt_unphased(1) ) ptr[j] = bcf_gt_unphased(0);
+            else if ( ptr[j]==bcf_gt_phased(1) ) ptr[j] = bcf_gt_phased(0);
+        }
+    }
+    bcf_update_genotypes(args->hdr,rec,args->gts,args->ngts);
+    
+    return rec;
+}
+
+static inline int nt2int(char nt)
+{
+    nt = toupper(nt);
+    if ( nt=='A' ) return 0;
+    if ( nt=='C' ) return 1;
+    if ( nt=='G' ) return 2;
+    if ( nt=='T' ) return 3;
+    return -1;
+}
+#define int2nt(x) "ACGT"[x]
+#define revint(x) ("3210"[x]-'0')
+
+static inline uint32_t parse_rsid(char *name)
+{
+    if ( name[0]!='r' || name[1]!='s' ) 
+    {
+        name = strstr(name, "rs");
+        if ( !name ) return 0;
+    }
+    char *tmp;
+    name += 2;
+    uint64_t id = strtol(name, &tmp, 10);
+    if ( tmp==name || *tmp ) return 0;
+    if ( id > UINT32_MAX ) error("FIXME: the ID is too big for uint32_t: %s\n", name-2);
+    return id;
+}
+
+static int fetch_ref(args_t *args, bcf1_t *rec)
+{
+    // Get the reference allele
+    int len;
+    char *ref = faidx_fetch_seq(args->fai, (char*)bcf_seqname(args->hdr,rec), rec->pos, rec->pos, &len);
+    if ( !ref )
+    {
+        if ( faidx_has_seq(args->fai, bcf_seqname(args->hdr,rec))==0 )
+        {
+            fprintf(bcftools_stderr,"Ignoring sequence \"%s\"\n", bcf_seqname(args->hdr,rec));
+            args->skip_rid = rec->rid;
+            return -2;
+        }
+        error("faidx_fetch_seq failed at %s:%d\n", bcf_seqname(args->hdr,rec),rec->pos+1);
+    }
+    int ir = nt2int(*ref);
+    free(ref);
+    return ir;
+}
+
+static void dbsnp_init(args_t *args, const char *chr)
+{
+    if ( args->i2m ) kh_destroy(i2m, args->i2m);
+    args->i2m = kh_init(i2m);
+    bcf_srs_t *sr = bcf_sr_init();
+    if ( bcf_sr_set_regions(sr, chr, 0) != 0 ) goto done;
+    if ( !bcf_sr_add_reader(sr,args->dbsnp_fname) ) error("Failed to open %s: %s\n", args->dbsnp_fname,bcf_sr_strerror(sr->errnum));
+    while ( bcf_sr_next_line(sr) )
+    {
+        bcf1_t *rec = bcf_sr_get_line(sr, 0);
+        if ( rec->d.allele[0][1]!=0 || rec->d.allele[1][1]!=0 ) continue;   // skip non-snps
+
+        int ref = nt2int(rec->d.allele[0][0]);
+        if ( ref<0 ) continue;     // non-[ACGT] base
+
+        uint32_t id = parse_rsid(rec->d.id);
+        if ( !id ) continue;
+
+        int ret, k;
+        k = kh_put(i2m, args->i2m, id, &ret);
+        if ( ret<0 ) error("An error occurred while inserting the key %u\n", id);
+        if ( ret==0 ) continue; // skip ambiguous id
+        kh_val(args->i2m, k).pos = (uint32_t)rec->pos;
+        kh_val(args->i2m, k).ref = ref;
+    }
+done:
+    bcf_sr_destroy(sr);
+}
+
+static bcf1_t *dbsnp_check(args_t *args, bcf1_t *rec, int ir, int ia, int ib)
+{
+    int k, ref,pos;
+    uint32_t id = parse_rsid(rec->d.id);
+    if ( !id ) goto no_info;
+
+    k = kh_get(i2m, args->i2m, id);
+    if ( k==kh_end(args->i2m) ) goto no_info;
+
+    pos = (int)kh_val(args->i2m, k).pos;
+    if ( pos != rec->pos ) 
+    {
+        rec->pos = pos;
+        ir = fetch_ref(args, rec);
+        args->npos_err++;
+    }
+
+    ref = kh_val(args->i2m, k).ref;
+    if ( ref!=ir )
+        error("Reference base mismatch at %s:%d .. %c vs %c\n",bcf_seqname(args->hdr,rec),rec->pos+1,int2nt(ref),int2nt(ir));
+
+    if ( ia==ref ) return rec;
+    if ( ib==ref ) { args->nswap++; return set_ref_alt(args,rec,int2nt(ib),int2nt(ia),1); }
+
+no_info:
+    args->nunresolved++;
+    return args->discard ? NULL : rec;
+}
+
+bcf1_t *process(bcf1_t *rec)
+{
+    if ( rec->rid == args.skip_rid ) return NULL;
+
+    bcf1_t *ret = args.mode==MODE_STATS ? NULL : rec;
+    args.nsite++;
+
+    // Skip non-SNPs
+    if ( bcf_get_variant_types(rec)!=VCF_SNP )
+    {
+        args.nonSNP++;
+        return args.discard ? NULL : ret;
+    }
+
+    // Get the reference allele
+    int ir = fetch_ref(&args, rec);
+    if ( ir==-2 ) return NULL;
+    if ( ir==-1 )
+    {
+        args.nonACGT++;
+        return args.discard ? NULL : ret;     // not A,C,G,T
+    }
+
+    if ( rec->n_allele!=2 )
+    {
+        // not a biallelic site
+        args.nonbiallelic++;
+        return args.discard ? NULL : ret;
+    }
+
+    int ia = nt2int(rec->d.allele[0][0]);
+    if ( ia<0 )
+    {
+        // not A,C,G,T
+        args.nonACGT++;
+        return args.discard ? NULL : ret;
+    }
+
+    int ib = nt2int(rec->d.allele[1][0]);
+    if ( ib<0 )
+    {
+        // not A,C,G,T
+        args.nonACGT++;
+        return args.discard ? NULL : ret;
+    }
+
+    if ( ia==ib )
+    {
+        // should not happen in well-formed VCF
+        args.nonSNP++;
+        return args.discard ? NULL : ret;
+    }
+    args.count[ia][ib]++;
+
+    if ( ir==ia ) args.nok++;
+
+    if ( args.mode==MODE_USE_ID )
+    {
+        if ( !args.i2m || args.rid!=rec->rid )
+        {
+            args.pos = 0;
+            args.rid = rec->rid;
+            dbsnp_init(&args,bcf_seqname(args.hdr,rec));
+        }
+        ret = dbsnp_check(&args, rec, ir,ia,ib);
+        if ( !args.unsorted && args.pos > rec->pos )
+        {
+            fprintf(bcftools_stderr,
+                "Warning: corrected position(s) results in unsorted VCF, for example %s:%d comes after %s:%d\n"
+                "         The standard unix `sort` or `vcf-sort` from vcftools can be used to fix the order.\n",
+                bcf_seqname(args.hdr,rec),rec->pos+1,bcf_seqname(args.hdr,rec),args.pos);
+            args.unsorted = 1;
+        }
+        args.pos = rec->pos;
+        return ret;
+    }
+    else if ( args.mode==MODE_FLIP2FWD )
+    {
+        int pair = 1 << ia | 1 << ib;
+        if ( pair==0x9 || pair==0x6 )   // skip ambiguous pairs: A/T or C/G
+        {
+            args.nunresolved++;
+            return args.discard ? NULL : ret;
+        }
+        if ( ir==ia ) return ret;
+        if ( ir==ib ) { args.nswap++; return set_ref_alt(&args,rec,int2nt(ib),int2nt(ia),1); }
+        if ( ir==revint(ia) ) { args.nflip++; return set_ref_alt(&args,rec,int2nt(revint(ia)),int2nt(revint(ib)),0); }
+        if ( ir==revint(ib) ) { args.nflip_swap++; return set_ref_alt(&args,rec,int2nt(revint(ib)),int2nt(revint(ia)),1); }
+        error("FIXME: this should not happen %s:%d\n", bcf_seqname(args.hdr,rec),rec->pos+1);
+    }
+    else if ( args.mode==MODE_TOP2FWD )
+    {
+        int pair = 1 << ia | 1 << ib;
+        if ( pair != 0x9 && pair != 0x6 )    // unambiguous pair: A/C or A/G
+        {
+            if ( ir==ia ) return ret;
+
+            int ia_rev = revint(ia);
+            if ( ir==ia_rev )               // vcfref is A, faref is T, flip
+            {
+                args.nflip++;
+                return set_ref_alt(&args,rec,int2nt(ia_rev),int2nt(revint(ib)),0);
+            }
+            if ( ir==ib )                   // vcfalt is faref, swap
+            {
+                args.nswap++;
+                return set_ref_alt(&args,rec,int2nt(ib),int2nt(ia),1);
+            }
+            assert( ib==revint(ir) );
+
+            args.nflip_swap++;
+            return set_ref_alt(&args,rec,int2nt(revint(ib)),int2nt(revint(ia)),1);
+        }
+        else    // ambiguous pair, sequence walking must be performed
+        {
+            int len, win = rec->pos > 100 ? 100 : rec->pos, beg = rec->pos - win, end = rec->pos + win;
+            char *ref = faidx_fetch_seq(args.fai, (char*)bcf_seqname(args.hdr,rec), beg,end, &len);
+            if ( !ref ) error("faidx_fetch_seq failed at %s:%d\n", bcf_seqname(args.hdr,rec),rec->pos+1);
+            if ( end - beg + 1 != len ) error("FIXME: check win=%d,len=%d at %s:%d  (%d %d)\n", win,len, bcf_seqname(args.hdr,rec),rec->pos+1, end,beg);
+
+            int i, mid = rec->pos - beg, strand = 0;
+            for (i=1; i<=win; i++)
+            {
+                int ra = nt2int(ref[mid-i]);
+                int rb = nt2int(ref[mid+i]);
+                if ( ra<0 || rb<0 || ra==rb ) continue;     // skip N's and non-informative pairs: A/A, C/C, G/G, T/T
+                pair = 1 << ra | 1 << rb;
+                if ( pair==0x9 || pair==0x6 ) continue;     // skip ambiguous pairs: A/T or C/G
+                strand = 1 << ra & 0x9 ? 1 : -1;
+                break;
+            }
+            free(ref);
+            
+            if ( strand==1 )
+            {
+                if ( ir==ia ) return ret;
+                if ( ir==ib )
+                {
+                    args.nswap++;
+                    return set_ref_alt(&args,rec,int2nt(ib),int2nt(ia),1);
+                }
+            }
+            else if ( strand==-1 )
+            {
+                int ia_rev = revint(ia);
+                int ib_rev = revint(ib);
+                if ( ir==ia_rev )
+                {
+                    args.nflip++;
+                    return set_ref_alt(&args,rec,int2nt(ia_rev),int2nt(ib_rev),0);
+                }
+                if ( ir==ib_rev )
+                {
+                    args.nflip_swap++;
+                    return set_ref_alt(&args,rec,int2nt(ib_rev),int2nt(ia_rev),1);
+                }
+            }
+
+            args.nunresolved++;
+            return args.discard ? NULL : ret;
+        }
+    }
+    return ret;
+}
+
+int top_mask[4][4] = 
+{ 
+    {0,1,1,1},
+    {0,0,1,0},
+    {0,0,0,0},
+    {0,0,0,0},
+};
+int bot_mask[4][4] = 
+{ 
+    {0,0,0,0},
+    {0,0,0,0},
+    {0,1,0,0},
+    {1,1,1,0},
+};
+
+void destroy(void)
+{
+    uint32_t i,j,tot = 0;
+    uint32_t top_err = 0, bot_err = 0;
+    for (i=0; i<4; i++)
+    {
+        for (j=0; j<4; j++)
+        {
+            tot += args.count[i][j];
+            if ( !top_mask[i][j] && args.count[i][j] ) top_err++;
+            if ( !bot_mask[i][j] && args.count[i][j] ) bot_err++;
+        }
+    }
+    uint32_t nskip = args.nonACGT+args.nonSNP+args.nonbiallelic;
+    uint32_t ncmp  = args.nsite - nskip;
+
+    fprintf(bcftools_stderr,"# SC, guessed strand convention\n");
+    fprintf(bcftools_stderr,"SC\tTOP-compatible\t%d\n",top_err?0:1);
+    fprintf(bcftools_stderr,"SC\tBOT-compatible\t%d\n",bot_err?0:1);
+
+    fprintf(bcftools_stderr,"# ST, substitution types\n");
+    for (i=0; i<4; i++)
+    {
+        for (j=0; j<4; j++)
+        {
+            if ( i==j ) continue;
+            fprintf(bcftools_stderr,"ST\t%c>%c\t%u\t%.1f%%\n", int2nt(i),int2nt(j),args.count[i][j], args.count[i][j]*100./tot);
+        }
+    }
+    fprintf(bcftools_stderr,"# NS, Number of sites:\n");
+    fprintf(bcftools_stderr,"NS\ttotal        \t%u\n", args.nsite);
+    fprintf(bcftools_stderr,"NS\tref match    \t%u\t%.1f%%\n", args.nok,100.*args.nok/ncmp);
+    fprintf(bcftools_stderr,"NS\tref mismatch \t%u\t%.1f%%\n", ncmp-args.nok,100.*(ncmp-args.nok)/ncmp);
+    if ( args.mode!=MODE_STATS )
+    {
+        fprintf(bcftools_stderr,"NS\tflipped      \t%u\t%.1f%%\n", args.nflip,100.*args.nflip/(args.nsite-nskip));
+        fprintf(bcftools_stderr,"NS\tswapped      \t%u\t%.1f%%\n", args.nswap,100.*args.nswap/(args.nsite-nskip));
+        fprintf(bcftools_stderr,"NS\tflip+swap    \t%u\t%.1f%%\n", args.nflip_swap,100.*args.nflip_swap/(args.nsite-nskip));
+        fprintf(bcftools_stderr,"NS\tunresolved   \t%u\t%.1f%%\n", args.nunresolved,100.*args.nunresolved/(args.nsite-nskip));
+        fprintf(bcftools_stderr,"NS\tfixed pos    \t%u\t%.1f%%\n", args.npos_err,100.*args.npos_err/(args.nsite-nskip));
+    }
+    fprintf(bcftools_stderr,"NS\tskipped      \t%u\n", nskip);
+    fprintf(bcftools_stderr,"NS\tnon-ACGT     \t%u\n", args.nonACGT);
+    fprintf(bcftools_stderr,"NS\tnon-SNP      \t%u\n", args.nonSNP);
+    fprintf(bcftools_stderr,"NS\tnon-biallelic\t%u\n", args.nonbiallelic);
+
+    free(args.gts);
+    if ( args.fai ) fai_destroy(args.fai);
+    if ( args.i2m ) kh_destroy(i2m, args.i2m);
+}
diff --git a/bcftools/plugins/frameshifts.c b/bcftools/plugins/frameshifts.c
new file mode 100644
index 0000000..5c0e924
--- /dev/null
+++ b/bcftools/plugins/frameshifts.c
@@ -0,0 +1,157 @@
+/*  plugins/frameshifts.c -- annotates frameshift indels.
+
+    Copyright (C) 2014 Genome Research Ltd.
+
+    Author: Petr Danecek <pd3@sanger.ac.uk>
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE.  */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <htslib/vcf.h>
+#include <htslib/synced_bcf_reader.h>
+#include <getopt.h>
+
+bcf_hdr_t *in_hdr, *out_hdr;
+bcf_sr_regions_t *exons;
+int32_t *frm = NULL, nfrm = 0;
+
+const char *about(void)
+{
+    return
+        "Annotate frameshift indels.\n";
+}
+
+const char *usage(void)
+{
+    return 
+        "\n"
+        "About: Annotate frameshift indels\n"
+        "Usage: bcftools +frameshifts [General Options] -- [Plugin Options]\n"
+        "Options:\n"
+        "   run \"bcftools plugin\" for a list of common options\n"
+        "\n"
+        "Plugin options:\n"
+        "   -e, --exons <file>      list of exons, see \"--targets-file\" man page entry for details\n"
+        "\n"
+        "Example:\n"
+        "   bcftools +frameshifts in.vcf -- -e exons.bed.gz\n"
+        "\n";
+}
+
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+    int c;
+    char *fname = NULL;
+
+    static struct option loptions[] =
+    {
+        {"exons",1,0,'e'},
+        {0,0,0,0}
+    };
+    while ((c = getopt_long(argc, argv, "e:?h",loptions,NULL)) >= 0)
+    {
+        switch (c) 
+        {
+            case 'e': fname = optarg; break;
+            case 'h':
+            case '?':
+            default: fprintf(stderr,"%s", usage()); exit(1); break;
+        }
+    }
+    if ( !fname )
+    {
+        fprintf(stderr,"Missing the -e option.\n");
+        return -1;
+    }
+
+    in_hdr = in;
+    out_hdr = out;
+
+    int ret = bcf_hdr_append(out_hdr,"##INFO=<ID=OOF,Number=A,Type=Integer,Description=\"Frameshift Indels: out-of-frame (1), in-frame (0), not-applicable (-1 or missing)\">");
+    if ( ret!=0 )
+    {
+        fprintf(stderr,"Error updating the header\n");
+        return -1;
+    }
+
+    exons = bcf_sr_regions_init(fname,1,0,1,2);
+    if ( !exons )
+    {
+        fprintf(stderr,"Error occurred while reading (was the file compressed with bgzip?): %s\n", fname);
+        return -1;
+    }
+
+    return 0;
+}
+
+bcf1_t *process(bcf1_t *rec)
+{
+    if ( rec->n_allele<2 ) return rec;    // not a variant
+
+    int type = bcf_get_variant_types(rec);
+    if ( !(type&VCF_INDEL) ) return rec;  // not an indel
+
+    int i, len = 0;
+    for (i=1; i<rec->n_allele; i++)
+        if ( len > rec->d.var[i].n ) len = rec->d.var[i].n;
+
+    int pos_to = len!=0 ? rec->pos - len : rec->pos;    // len is negative for deletions
+    if ( bcf_sr_regions_overlap(exons, bcf_seqname(in_hdr,rec),rec->pos,pos_to) ) return rec;  // no overlap
+
+    hts_expand(int32_t,rec->n_allele-1,nfrm,frm);
+    for (i=1; i<rec->n_allele; i++)
+    {
+        if ( rec->d.var[i].type!=VCF_INDEL ) { frm[i-1] = -1; continue; }
+
+        int len = rec->d.var[i].n, tlen = 0;
+        if ( len>0 )
+        {
+            // insertion
+            if ( exons->start <= rec->pos && exons->end > rec->pos ) tlen = abs(len);
+        }
+        else if ( exons->start <= rec->pos + abs(len) )
+        {
+            // deletion
+            tlen = abs(len);
+            if ( rec->pos < exons->start )              // trim the beginning
+                tlen -= exons->start - rec->pos + 1;
+            if ( exons->end < rec->pos + abs(len) )     // trim the end
+                tlen -= rec->pos + abs(len) - exons->end;
+        }
+        if ( tlen )     // there are some deleted/inserted bases in the exon
+        {
+            if ( tlen%3 ) frm[i-1] = 1; // out-of-frame
+            else frm[i-1] = 0;          // in-frame
+        }
+        else frm[i-1] = -1;             // not applicable (is outside)
+    }
+
+    if ( bcf_update_info_int32(out_hdr,rec,"OOF",frm,rec->n_allele-1)<0 ) { fprintf(stderr, "Could not annotate OOF :-/\n"); exit(1); }
+    return rec;
+}
+
+
+void destroy(void)
+{
+    bcf_sr_regions_destroy(exons);
+}
+
+
diff --git a/bcftools/plugins/frameshifts.c.pysam.c b/bcftools/plugins/frameshifts.c.pysam.c
new file mode 100644
index 0000000..0587937
--- /dev/null
+++ b/bcftools/plugins/frameshifts.c.pysam.c
@@ -0,0 +1,159 @@
+#include "bcftools.pysam.h"
+
+/*  plugins/frameshifts.c -- annotates frameshift indels.
+
+    Copyright (C) 2014 Genome Research Ltd.
+
+    Author: Petr Danecek <pd3@sanger.ac.uk>
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE.  */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <htslib/vcf.h>
+#include <htslib/synced_bcf_reader.h>
+#include <getopt.h>
+
+bcf_hdr_t *in_hdr, *out_hdr;
+bcf_sr_regions_t *exons;
+int32_t *frm = NULL, nfrm = 0;
+
+const char *about(void)
+{
+    return
+        "Annotate frameshift indels.\n";
+}
+
+const char *usage(void)
+{
+    return 
+        "\n"
+        "About: Annotate frameshift indels\n"
+        "Usage: bcftools +frameshifts [General Options] -- [Plugin Options]\n"
+        "Options:\n"
+        "   run \"bcftools plugin\" for a list of common options\n"
+        "\n"
+        "Plugin options:\n"
+        "   -e, --exons <file>      list of exons, see \"--targets-file\" man page entry for details\n"
+        "\n"
+        "Example:\n"
+        "   bcftools +frameshifts in.vcf -- -e exons.bed.gz\n"
+        "\n";
+}
+
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+    int c;
+    char *fname = NULL;
+
+    static struct option loptions[] =
+    {
+        {"exons",1,0,'e'},
+        {0,0,0,0}
+    };
+    while ((c = getopt_long(argc, argv, "e:?h",loptions,NULL)) >= 0)
+    {
+        switch (c) 
+        {
+            case 'e': fname = optarg; break;
+            case 'h':
+            case '?':
+            default: fprintf(bcftools_stderr,"%s", usage()); exit(1); break;
+        }
+    }
+    if ( !fname )
+    {
+        fprintf(bcftools_stderr,"Missing the -e option.\n");
+        return -1;
+    }
+
+    in_hdr = in;
+    out_hdr = out;
+
+    int ret = bcf_hdr_append(out_hdr,"##INFO=<ID=OOF,Number=A,Type=Integer,Description=\"Frameshift Indels: out-of-frame (1), in-frame (0), not-applicable (-1 or missing)\">");
+    if ( ret!=0 )
+    {
+        fprintf(bcftools_stderr,"Error updating the header\n");
+        return -1;
+    }
+
+    exons = bcf_sr_regions_init(fname,1,0,1,2);
+    if ( !exons )
+    {
+        fprintf(bcftools_stderr,"Error occurred while reading (was the file compressed with bgzip?): %s\n", fname);
+        return -1;
+    }
+
+    return 0;
+}
+
+bcf1_t *process(bcf1_t *rec)
+{
+    if ( rec->n_allele<2 ) return rec;    // not a variant
+
+    int type = bcf_get_variant_types(rec);
+    if ( !(type&VCF_INDEL) ) return rec;  // not an indel
+
+    int i, len = 0;
+    for (i=1; i<rec->n_allele; i++)
+        if ( len > rec->d.var[i].n ) len = rec->d.var[i].n;
+
+    int pos_to = len!=0 ? rec->pos - len : rec->pos;    // len is negative for deletions
+    if ( bcf_sr_regions_overlap(exons, bcf_seqname(in_hdr,rec),rec->pos,pos_to) ) return rec;  // no overlap
+
+    hts_expand(int32_t,rec->n_allele-1,nfrm,frm);
+    for (i=1; i<rec->n_allele; i++)
+    {
+        if ( rec->d.var[i].type!=VCF_INDEL ) { frm[i-1] = -1; continue; }
+
+        int len = rec->d.var[i].n, tlen = 0;
+        if ( len>0 )
+        {
+            // insertion
+            if ( exons->start <= rec->pos && exons->end > rec->pos ) tlen = abs(len);
+        }
+        else if ( exons->start <= rec->pos + abs(len) )
+        {
+            // deletion
+            tlen = abs(len);
+            if ( rec->pos < exons->start )              // trim the beginning
+                tlen -= exons->start - rec->pos + 1;
+            if ( exons->end < rec->pos + abs(len) )     // trim the end
+                tlen -= rec->pos + abs(len) - exons->end;
+        }
+        if ( tlen )     // there are some deleted/inserted bases in the exon
+        {
+            if ( tlen%3 ) frm[i-1] = 1; // out-of-frame
+            else frm[i-1] = 0;          // in-frame
+        }
+        else frm[i-1] = -1;             // not applicable (is outside)
+    }
+
+    if ( bcf_update_info_int32(out_hdr,rec,"OOF",frm,rec->n_allele-1)<0 ) { fprintf(bcftools_stderr, "Could not annotate OOF :-/\n"); exit(1); }
+    return rec;
+}
+
+
+void destroy(void)
+{
+    bcf_sr_regions_destroy(exons);
+}
+
+
diff --git a/bcftools/plugins/guess-ploidy.c b/bcftools/plugins/guess-ploidy.c
new file mode 100644
index 0000000..eee060f
--- /dev/null
+++ b/bcftools/plugins/guess-ploidy.c
@@ -0,0 +1,568 @@
+/* 
+    Copyright (C) 2016 Genome Research Ltd.
+
+    Author: Petr Danecek <pd3@sanger.ac.uk>
+
+    Permission is hereby granted, free of charge, to any person obtaining a copy
+    of this software and associated documentation files (the "Software"), to deal
+    in the Software without restriction, including without limitation the rights
+    to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+    copies of the Software, and to permit persons to whom the Software is
+    furnished to do so, subject to the following conditions:
+    
+    The above copyright notice and this permission notice shall be included in
+    all copies or substantial portions of the Software.
+    
+    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+    IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+    FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+    AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+    LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+    OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+    THE SOFTWARE.
+*/
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <strings.h>
+#include <getopt.h>
+#include <stdarg.h>
+#include <stdint.h>
+#include <htslib/vcf.h>
+#include <htslib/synced_bcf_reader.h>
+#include <htslib/vcfutils.h>
+#include <inttypes.h>
+#include <unistd.h>
+#include "bcftools.h"
+#include "filter.h"
+
+// Logic of the filters: include or exclude sites which match the filters?
+#define FLT_INCLUDE 1
+#define FLT_EXCLUDE 2
+
+#define GUESS_GT 1
+#define GUESS_PL 2
+#define GUESS_GL 4
+
+typedef struct
+{
+    uint64_t ncount;
+    double phap, pdip;
+}
+count_t;
+
+typedef struct
+{
+    char *chr;
+    uint32_t start, end;
+    count_t *counts;    // per-sample counts: counts[isample]
+}
+stats_t;
+
+typedef struct
+{
+    int argc;
+    char **argv, *af_tag;
+    double af_dflt;
+    stats_t stats;
+    filter_t *filter;
+    char *filter_str;
+    int filter_logic;   // include or exclude sites which match the filters? One of FLT_INCLUDE/FLT_EXCLUDE
+    const uint8_t *smpl_pass;
+    int nsample, verbose, tag, include_indels;
+    int *counts, ncounts;       // number of observed GTs with given ploidy, used when -g is not given
+    double *tmpf, *pl2p, gt_err_prob;
+    float *af;
+    int maf;
+    int32_t *arr, narr, nfarr;
+    float *farr;
+    bcf_srs_t *sr;
+    bcf_hdr_t *hdr;
+}
+args_t;
+
+const char *about(void)
+{
+    return "Determine sample sex by checking genotype likelihoods in haploid regions.\n";
+}
+
+static const char *usage_text(void)
+{
+    return 
+        "\n"
+        "About: Determine sample sex by checking genotype likelihoods (GL,PL) or genotypes (GT)\n"
+        "       in the non-PAR region of chrX. The HWE is assumed, so given the alternate allele\n" 
+        "       frequency fA and the genotype likelihoods pRR,pRA,pAA, the probabilities are\n"
+        "       calculated as\n"
+        "           P(dip) = pRR*(1-fA)^2 + pAA*fA^2 + 2*pRA*(1-fA)*fA\n"
+        "           P(hap) = pRR*(1-fA) + pAA*fA\n"
+        "       When genotype likelihoods are not available, the -e option is used to account\n"
+        "       for genotyping errors with -t GT. The alternate allele frequency fA is estimated\n"
+        "       directly from the data (the default) or can be provided by an INFO tag.\n"
+        "       The results can be visualized using the accompanying guess-ploidy.py script.\n"
+        "       Note that this plugin is intended to replace the former vcf2sex plugin.\n"
+        "\n"
+        "Usage: bcftools +guess-ploidy <file.vcf.gz> [Plugin Options]\n"
+        "Plugin options:\n"
+        "       --AF-dflt <float>           the default alternate allele frequency [0.5]\n"
+        "       --AF-tag <TAG>              use TAG for allele frequency\n"
+        "   -e, --error-rate <float>        probability of GT being wrong (with -t GT) [1e-3]\n"
+        "       --exclude <expr>            exclude sites for which the expression is true\n"
+        "   -i, --include-indels            do not skip indel sites\n"
+        "       --include <expr>            include only sites for which the expression is true\n"
+        "   -g, --genome <str>              shortcut to select nonPAR region for common genomes b37|hg19|b38|hg38\n"
+        "   -r, --regions <chr:beg-end>     restrict to comma-separated list of regions\n"
+        "   -R, --regions-file <file>       restrict to regions listed in a file\n"
+        "   -t, --tag <tag>                 genotype or genotype likelihoods: GT, PL, GL [PL]\n"
+        "   -v, --verbose                   verbose output (specify twice to increase verbosity)\n"
+        "\n"
+        "Region shortcuts:\n"
+        "   b37  .. -r X:2699521-154931043      # GRCh37 no-chr prefix\n"
+        "   b38  .. -r X:2781480-155701381      # GRCh38 no-chr prefix\n"
+        "   hg19 .. -r chrX:2699521-154931043   # GRCh37 chr prefix\n"
+        "   hg38 .. -r chrX:2781480-155701381   # GRCh38 chr prefix\n"
+        "\n"
+        "Examples:\n"
+        "   bcftools +guess-ploidy -g b37 in.vcf.gz\n"
+        "   bcftools +guess-ploidy in.vcf.gz -t GL -r chrX:2699521-154931043\n"
+        "   bcftools view file.vcf.gz -r chrX:2699521-154931043 | bcftools +guess-ploidy\n"
+        "   bcftools +guess-ploidy in.bcf -v > ploidy.txt && guess-ploidy.py ploidy.txt img\n"
+        "\n";
+}
+
+static inline int smpl_pass(args_t *args, int ismpl)
+{
+    if ( !args->smpl_pass ) return 1;
+    int pass = args->smpl_pass[ismpl];
+    if ( args->filter_logic & FLT_EXCLUDE ) pass = pass ? 0 : 1;
+    if ( pass ) return 1;
+    return 0;
+}
+
+void process_region_guess(args_t *args)
+{
+    while ( bcf_sr_next_line(args->sr) )
+    {
+        bcf1_t *rec = bcf_sr_get_line(args->sr,0);
+        if ( rec->n_allele==1 ) continue;
+        if ( !args->include_indels && !(bcf_get_variant_types(rec)&VCF_SNP) ) continue;
+
+        if ( args->filter )
+        {
+            int pass = filter_test(args->filter, rec, &args->smpl_pass);
+            if ( args->filter_logic & FLT_EXCLUDE ) pass = pass ? 0 : 1;
+            if ( !args->smpl_pass && !pass ) continue;     // site-level filtering, not per-sample filtering
+        }
+
+        double freq[2] = {0,0}, sum;
+        int ismpl,i;
+        if ( args->tag & GUESS_GT )   // use GTs to guess the ploidy, considering only one ALT
+        {
+            int ngt = bcf_get_genotypes(args->hdr,rec,&args->arr,&args->narr);
+            if ( ngt<=0 ) continue;
+            ngt /= args->nsample;
+            for (ismpl=0; ismpl<args->nsample; ismpl++)
+            {
+                if ( !smpl_pass(args,ismpl) ) continue;
+                int32_t *ptr = args->arr + ismpl*ngt;
+                double *tmp = args->tmpf + ismpl*3;
+
+                if ( ptr[0]==bcf_gt_missing ) 
+                {
+                    tmp[0] = -1;
+                    continue;
+                }
+                if ( ptr[1]==bcf_int32_vector_end )
+                {
+                    if ( bcf_gt_allele(ptr[0])==0 ) // haploid R
+                    {
+                        tmp[0] = 1 - 2*args->gt_err_prob;
+                        tmp[1] = tmp[2] = args->gt_err_prob;
+                    }
+                    else    // haploid A
+                    {
+                        tmp[0] = tmp[1] = args->gt_err_prob;
+                        tmp[2] = 1 - 2*args->gt_err_prob;
+                    }
+                    continue;
+                }
+                if ( bcf_gt_allele(ptr[0])==0 && bcf_gt_allele(ptr[1])==0 ) // RR
+                {
+                    tmp[0] = 1 - 2*args->gt_err_prob;
+                    tmp[1] = tmp[2] = args->gt_err_prob;
+                }
+                else if ( bcf_gt_allele(ptr[0])==bcf_gt_allele(ptr[1]) ) // AA
+                {
+                    tmp[0] = tmp[1] = args->gt_err_prob;
+                    tmp[2] = 1 - 2*args->gt_err_prob;
+                }
+                else  // RA or hetAA, treating as RA
+                {
+                    tmp[1] = 1 - 2*args->gt_err_prob;
+                    tmp[0] = tmp[2] = args->gt_err_prob;
+                }
+                freq[0] += 2*tmp[0]+tmp[1];
+                freq[1] += tmp[1]+2*tmp[2];
+            }
+        }
+        else if ( args->tag & GUESS_PL )    // use PL to guess the ploidy, restrict to first ALT allele
+        {
+            int npl = bcf_get_format_int32(args->hdr,rec,"PL",&args->arr,&args->narr);
+            if ( npl<=0 ) continue;
+            npl /= args->nsample;
+            int ndip_gt = rec->n_allele*(rec->n_allele+1)/2;
+            if ( npl==ndip_gt )             // diploid
+            {
+                for (ismpl=0; ismpl<args->nsample; ismpl++)
+                {
+                    if ( !smpl_pass(args,ismpl) ) continue;
+                    int32_t *ptr = args->arr + ismpl*npl;
+                    double *tmp = args->tmpf + ismpl*3;
+
+                    // restrict to first ALT
+                    if ( ptr[0]==bcf_int32_missing || ptr[1]==bcf_int32_missing || ptr[2]==bcf_int32_missing ) 
+                    {
+                        tmp[0] = -1;
+                        continue;
+                    }
+                    if ( ptr[0]==ptr[1] && ptr[0]==ptr[2] ) // non-informative
+                    {
+                        tmp[0] = -1;
+                        continue;
+                    }
+                    if ( ptr[2]==bcf_int32_vector_end )
+                    {
+                        tmp[0] = (ptr[0]<0 || ptr[0]>=256) ? args->pl2p[255] : args->pl2p[ptr[0]];
+                        tmp[1] = args->pl2p[255];
+                        tmp[2] = (ptr[1]<0 || ptr[1]>=256) ? args->pl2p[255] : args->pl2p[ptr[1]];
+                    }
+                    else
+                        for (i=0; i<3; i++)
+                            tmp[i] = (ptr[i]<0 || ptr[i]>=256) ? args->pl2p[255] : args->pl2p[ptr[i]];
+
+                    sum = 0;
+                    for (i=0; i<3; i++) sum += tmp[i];
+                    for (i=0; i<3; i++) tmp[i] /= sum;
+
+                    if ( ptr[2]==bcf_int32_vector_end )
+                    {
+                        freq[0] += tmp[0];
+                        freq[1] += tmp[2];
+                    }
+                    else
+                    {
+                        freq[0] += 2*tmp[0]+tmp[1];
+                        freq[1] += tmp[1]+2*tmp[2];
+                    }
+                }
+            }
+            else if ( npl==rec->n_allele )  // all samples haploid
+            {
+                for (ismpl=0; ismpl<args->nsample; ismpl++)
+                {
+                    if ( !smpl_pass(args,ismpl) ) continue;
+                    int32_t *ptr = args->arr + ismpl*npl;
+                    double *tmp = args->tmpf + ismpl*3;
+
+                    // restrict to first ALT
+                    if ( ptr[0]==bcf_int32_missing || ptr[1]==bcf_int32_missing ) 
+                    {
+                        tmp[0] = -1;
+                        continue;
+                    }
+                    tmp[0] = (ptr[0]<0 || ptr[0]>=256) ? args->pl2p[255] : args->pl2p[ptr[0]];
+                    tmp[1] = args->pl2p[255];
+                    tmp[2] = (ptr[1]<0 || ptr[1]>=256) ? args->pl2p[255] : args->pl2p[ptr[1]];
+
+                    sum = 0;
+                    for (i=0; i<3; i++) sum += tmp[i];
+                    for (i=0; i<3; i++) tmp[i] /= sum;
+
+                    freq[0] += tmp[0];
+                    freq[1] += tmp[2];
+                }
+            }
+            else
+                continue;   // neither diploid nor haploid
+        }
+        else    // use GL
+        {
+            int ngl = bcf_get_format_float(args->hdr,rec,"GL",&args->farr,&args->nfarr);
+            if ( ngl<=0 ) continue;
+            ngl /= args->nsample;
+            int ndip_gt = rec->n_allele*(rec->n_allele+1)/2;
+            if ( ngl==ndip_gt )             // diploid
+            {
+                for (ismpl=0; ismpl<args->nsample; ismpl++)
+                {
+                    if ( !smpl_pass(args,ismpl) ) continue;
+                    float *ptr = args->farr + ismpl*ngl;
+                    double *tmp = args->tmpf + ismpl*3;
+
+                    // restrict to first ALT
+                    if ( bcf_float_is_missing(ptr[0]) || bcf_float_is_missing(ptr[1]) || bcf_float_is_missing(ptr[2]) ) 
+                    {
+                        tmp[0] = -1;
+                        continue;
+                    }
+                    if ( ptr[0]==ptr[1] && ptr[0]==ptr[2] ) // non-informative
+                    {
+                        tmp[0] = -1;
+                        continue;
+                    }
+                    if ( bcf_float_is_vector_end(ptr[2]) )
+                    {
+                        tmp[0] = pow(10.,ptr[0]);
+                        tmp[1] = 1e-26;             // arbitrary small value for a het
+                        tmp[2] = pow(10.,ptr[1]);
+                    }
+                    else
+                        for (i=0; i<3; i++)
+                            tmp[i] = pow(10.,ptr[i]);
+
+                    sum = 0;
+                    for (i=0; i<3; i++) sum += tmp[i];
+                    for (i=0; i<3; i++) tmp[i] /= sum;
+
+                    if ( bcf_float_is_vector_end(ptr[2]) )
+                    {
+                        freq[0] += tmp[0];
+                        freq[1] += tmp[2];
+                    }
+                    else
+                    {
+                        freq[0] += 2*tmp[0]+tmp[1];
+                        freq[1] += tmp[1]+2*tmp[2];
+                    }
+                }
+            }
+            else if ( ngl==rec->n_allele )  // all samples haploid
+            {
+                for (ismpl=0; ismpl<args->nsample; ismpl++)
+                {
+                    if ( !smpl_pass(args,ismpl) ) continue;
+                    float *ptr = args->farr + ismpl*ngl;
+                    double *tmp = args->tmpf + ismpl*3;
+
+                    // restrict to first ALT
+                    if ( bcf_float_is_missing(ptr[0]) || bcf_float_is_missing(ptr[1]) ) 
+                    {
+                        tmp[0] = -1;
+                        continue;
+                    }
+                    tmp[0] = pow(10.,ptr[0]);
+                    tmp[1] = 1e-26;
+                    tmp[2] = pow(10.,ptr[1]);
+
+                    sum = 0;
+                    for (i=0; i<3; i++) sum += tmp[i];
+                    for (i=0; i<3; i++) tmp[i] /= sum;
+
+                    freq[0] += tmp[0];
+                    freq[1] += tmp[2];
+                }
+            }
+            else
+                continue;   // neither diploid nor haploid
+        }
+        if ( args->af_tag )
+        {
+            int ret = bcf_get_info_float(args->hdr,rec,args->af_tag,&args->af, &args->maf);
+            if ( ret>0 ) { freq[0] = 1 - args->af[0]; freq[1] = args->af[0]; }
+        }
+
+        if ( !freq[0] && !freq[1] ) { freq[0] = 1 - args->af_dflt; freq[1] = args->af_dflt; }
+        sum = freq[0] + freq[1];
+        freq[0] /= sum;
+        freq[1] /= sum;
+        for (ismpl=0; ismpl<args->nsample; ismpl++)
+        {
+            if ( !smpl_pass(args,ismpl) ) continue;
+            count_t *counts = &args->stats.counts[ismpl];
+            double *tmp = args->tmpf + ismpl*3;
+            if ( tmp[0] < 0 ) continue;
+            double phap = freq[0]*tmp[0] + freq[1]*tmp[2];
+            double pdip = freq[0]*freq[0]*tmp[0] + 2*freq[0]*freq[1]*tmp[1] + freq[1]*freq[1]*tmp[2];
+            counts->phap += log(phap);
+            counts->pdip += log(pdip);
+            counts->ncount++;
+            if ( args->verbose>1 )
+                printf("DBG\t%s\t%d\t%s\t%e\t%e\t%e\t%e\t%e\t%e\n", bcf_seqname(args->hdr,rec),rec->pos+1,bcf_hdr_int2id(args->hdr,BCF_DT_SAMPLE,ismpl),
+                    freq[1],tmp[0],tmp[1],tmp[2],phap,pdip);
+        }
+    }
+}
+
+int run(int argc, char **argv)
+{
+    args_t *args = (args_t*) calloc(1,sizeof(args_t));
+    args->tag    = GUESS_PL;
+    args->argc   = argc; args->argv = argv;
+    args->gt_err_prob = 1e-3;
+    args->af_dflt = 0.5;
+    char *region  = NULL;
+    int region_is_file = 0;
+    static struct option loptions[] =
+    {
+        {"AF-tag",required_argument,NULL,0},
+        {"AF-dflt",required_argument,NULL,1},
+        {"exclude",required_argument,NULL,2},
+        {"include",required_argument,NULL,3},
+        {"verbose",no_argument,NULL,'v'},
+        {"include-indels",no_argument,NULL,'i'},
+        {"error-rate",required_argument,NULL,'e'},
+        {"tag",required_argument,NULL,'t'},
+        {"genome",required_argument,NULL,'g'},
+        {"regions",required_argument,NULL,'r'},
+        {"regions-file",required_argument,NULL,'R'},
+        {"background",required_argument,NULL,'b'},
+        {NULL,0,NULL,0}
+    };
+    int c;
+    char *tmp;
+    while ((c = getopt_long(argc, argv, "vr:R:t:e:ig:",loptions,NULL)) >= 0)
+    {
+        switch (c) 
+        {
+            case 0: args->af_tag = optarg; break;
+            case 1: 
+                    args->af_dflt = strtod(optarg,&tmp);
+                    if ( *tmp ) error("Could not parse: --AF-dflt %s\n", optarg);
+                    break;
+            case 2: args->filter_str = optarg; args->filter_logic |= FLT_EXCLUDE; break;
+            case 3: args->filter_str = optarg; args->filter_logic |= FLT_INCLUDE; break;
+            case 'i': args->include_indels = 1; break;
+            case 'e':
+                args->gt_err_prob = strtod(optarg,&tmp);
+                if ( *tmp ) error("Could not parse: -e %s\n", optarg);
+                if ( args->gt_err_prob<0 || args->gt_err_prob>1 ) error("Expected value from the interval [0,1]: -e %s\n", optarg);
+                break;
+            case 'g':
+                if ( !strcasecmp(optarg,"b37") ) region = "X:2699521-154931043";
+                else if ( !strcasecmp(optarg,"b38") ) region = "X:2781480-155701381";
+                else if ( !strcasecmp(optarg,"hg19") ) region = "chrX:2699521-154931043";
+                else if ( !strcasecmp(optarg,"hg38") ) region = "chrX:2781480-155701381";
+                else error("The argument not recognised, expected --genome b37, b38, hg19 or hg38: %s\n", optarg);
+                break;
+            case 'R': region_is_file = 1;   // fall through: the file name is stored like -r
+            case 'r': region = optarg; break; 
+            case 'v': args->verbose++; break; 
+            case 't':
+                if ( !strcasecmp(optarg,"GT") ) args->tag = GUESS_GT;
+                else if ( !strcasecmp(optarg,"PL") ) args->tag = GUESS_PL;
+                else if ( !strcasecmp(optarg,"GL") ) args->tag = GUESS_GL;
+                else error("The argument not recognised, expected --tag GT, PL or GL: %s\n", optarg);
+                break;
+            case 'h':
+            case '?':
+            default: error("%s", usage_text()); break;
+        }
+    }
+    if ( args->filter_logic == (FLT_EXCLUDE|FLT_INCLUDE) ) error("Only one of --include or --exclude can be given.\n");
+
+    char *fname = NULL;
+    if ( optind==argc )
+    {
+        if ( !isatty(fileno((FILE *)stdin)) ) fname = "-";  // reading from stdin
+        else { error("%s",usage_text()); }
+    }
+    else if ( optind+1!=argc ) error("%s",usage_text());
+    else fname = argv[optind];
+
+    args->sr = bcf_sr_init();
+    if ( strcmp("-",fname) )
+    {
+        if ( region )
+        {
+            args->sr->require_index = 1;
+            if ( bcf_sr_set_regions(args->sr, region, region_is_file)<0 )
+                error("Failed to read the regions: %s\n",region);
+        }
+    }
+    else
+    {
+        if ( region )
+        {
+            if ( bcf_sr_set_targets(args->sr, region, region_is_file, 0)<0 )
+                error("Failed to read the targets: %s\n",region);
+        }
+    }
+    if ( !bcf_sr_add_reader(args->sr,fname) ) error("Error: %s\n", bcf_sr_strerror(args->sr->errnum));
+    args->hdr = args->sr->readers[0].header;
+    args->nsample = bcf_hdr_nsamples(args->hdr);
+    args->stats.counts = (count_t*) calloc(args->nsample,sizeof(count_t));
+
+    if ( args->filter_str )
+        args->filter = filter_init(args->hdr, args->filter_str);
+
+    if ( args->af_tag && !bcf_hdr_idinfo_exists(args->hdr,BCF_HL_INFO,bcf_hdr_id2int(args->hdr,BCF_DT_ID,args->af_tag)) )
+        error("No such INFO tag: %s\n", args->af_tag);
+
+    if ( args->tag&GUESS_PL && bcf_hdr_id2int(args->hdr, BCF_DT_ID, "PL")<0 )
+    {
+        fprintf(stderr, "Warning: PL tag not found in header, switching to GL\n");
+        args->tag = GUESS_GL;
+    }
+
+    if ( args->tag&GUESS_GL && bcf_hdr_id2int(args->hdr, BCF_DT_ID, "GL")<0 )
+    {
+        fprintf(stderr, "Warning: GL tag not found in header, switching to GT\n");
+        args->tag = GUESS_GT;
+    }
+
+    if ( args->tag&GUESS_GT && bcf_hdr_id2int(args->hdr, BCF_DT_ID, "GT")<0 )
+        error("Error: GT tag not found in header\n");
+
+    int i;
+    if ( args->tag&GUESS_PL )
+    {
+        args->pl2p = (double*) calloc(256,sizeof(double));
+        for (i=0; i<256; i++) args->pl2p[i] = pow(10., -i/10.);
+    }
+    if ( args->tag&GUESS_PL || args->tag&GUESS_GL || args->tag&GUESS_GT )
+        args->tmpf = (double*) malloc(sizeof(*args->tmpf)*3*args->nsample);
+
+    if ( args->verbose )
+    {
+        printf("# This file was produced by: bcftools +guess-ploidy(%s+htslib-%s)\n", bcftools_version(),hts_version());
+        printf("# The command line was:\tbcftools +%s", args->argv[0]);
+        for (i=1; i<args->argc; i++)
+            printf(" %s",args->argv[i]);
+        printf("\n");
+        printf("# [1]SEX\t[2]Sample\t[3]Predicted sex\t[4]log P(Haploid)/nSites\t[5]log P(Diploid)/nSites\t[6]nSites\t[7]Score: F < 0 < M ($4-$5)\n");
+        if ( args->verbose>1 )
+            printf("# [1]DBG\t[2]Chr\t[3]Pos\t[4]Sample\t[5]AF\t[6]pRR\t[7]pRA\t[8]pAA\t[9]P(Haploid)\t[10]P(Diploid)\n");
+    }
+
+    process_region_guess(args);
+
+    for (i=0; i<args->nsample; i++)
+    {
+        double phap = args->stats.counts[i].ncount ? args->stats.counts[i].phap / args->stats.counts[i].ncount : 0.5;
+        double pdip = args->stats.counts[i].ncount ? args->stats.counts[i].pdip / args->stats.counts[i].ncount : 0.5;
+        char predicted_sex = 'U';
+        if (phap>pdip) predicted_sex = 'M';
+        else if (phap<pdip) predicted_sex = 'F';
+        if ( args->verbose )
+        {
+            printf("SEX\t%s\t%c\t%f\t%f\t%"PRIu64"\t%f\n", args->hdr->samples[i],predicted_sex,
+                    phap,pdip,args->stats.counts[i].ncount,phap-pdip);
+        }
+        else
+            printf("%s\t%c\n", args->hdr->samples[i],predicted_sex);
+    }
+   
+    if ( args->filter )
+        filter_destroy(args->filter);
+
+    bcf_sr_destroy(args->sr);
+    free(args->pl2p);
+    free(args->tmpf);
+    free(args->counts);
+    free(args->stats.counts);
+    free(args->arr);
+    free(args->farr);
+    free(args->af);
+    free(args);
+    return 0;
+}
diff --git a/bcftools/plugins/guess-ploidy.c.pysam.c b/bcftools/plugins/guess-ploidy.c.pysam.c
new file mode 100644
index 0000000..a08a11a
--- /dev/null
+++ b/bcftools/plugins/guess-ploidy.c.pysam.c
@@ -0,0 +1,570 @@
+#include "bcftools.pysam.h"
+
+/* 
+    Copyright (C) 2016 Genome Research Ltd.
+
+    Author: Petr Danecek <pd3@sanger.ac.uk>
+
+    Permission is hereby granted, free of charge, to any person obtaining a copy
+    of this software and associated documentation files (the "Software"), to deal
+    in the Software without restriction, including without limitation the rights
+    to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+    copies of the Software, and to permit persons to whom the Software is
+    furnished to do so, subject to the following conditions:
+    
+    The above copyright notice and this permission notice shall be included in
+    all copies or substantial portions of the Software.
+    
+    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+    IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+    FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+    AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+    LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+    OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+    THE SOFTWARE.
+*/
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <strings.h>
+#include <getopt.h>
+#include <stdarg.h>
+#include <stdint.h>
+#include <htslib/vcf.h>
+#include <htslib/synced_bcf_reader.h>
+#include <htslib/vcfutils.h>
+#include <inttypes.h>
+#include <unistd.h>
+#include "bcftools.h"
+#include "filter.h"
+
+// Logic of the filters: include or exclude sites which match the filters?
+#define FLT_INCLUDE 1
+#define FLT_EXCLUDE 2
+
+#define GUESS_GT 1
+#define GUESS_PL 2
+#define GUESS_GL 4
+
+typedef struct
+{
+    uint64_t ncount;
+    double phap, pdip;
+}
+count_t;
+
+typedef struct
+{
+    char *chr;
+    uint32_t start, end;
+    count_t *counts;    // per-sample counts: counts[isample]
+}
+stats_t;
+
+typedef struct
+{
+    int argc;
+    char **argv, *af_tag;
+    double af_dflt;
+    stats_t stats;
+    filter_t *filter;
+    char *filter_str;
+    int filter_logic;   // include or exclude sites which match the filters? One of FLT_INCLUDE/FLT_EXCLUDE
+    const uint8_t *smpl_pass;
+    int nsample, verbose, tag, include_indels;
+    int *counts, ncounts;       // number of observed GTs with given ploidy, used when -g is not given
+    double *tmpf, *pl2p, gt_err_prob;
+    float *af;
+    int maf;
+    int32_t *arr, narr, nfarr;
+    float *farr;
+    bcf_srs_t *sr;
+    bcf_hdr_t *hdr;
+}
+args_t;
+
+const char *about(void)
+{
+    return "Determine sample sex by checking genotype likelihoods in haploid regions.\n";
+}
+
+static const char *usage_text(void)
+{
+    return 
+        "\n"
+        "About: Determine sample sex by checking genotype likelihoods (GL,PL) or genotypes (GT)\n"
+        "       in the non-PAR region of chrX. The HWE is assumed, so given the alternate allele\n" 
+        "       frequency fA and the genotype likelihoods pRR,pRA,pAA, the probabilities are\n"
+        "       calculated as\n"
+        "           P(dip) = pRR*(1-fA)^2 + pAA*fA^2 + 2*pRA*(1-fA)*fA\n"
+        "           P(hap) = pRR*(1-fA) + pAA*fA\n"
+        "       When genotype likelihoods are not available, the -e option is used to account\n"
+        "       for genotyping errors with -t GT. The alternate allele frequency fA is estimated\n"
+        "       directly from the data (the default) or can be provided by an INFO tag.\n"
+        "       The results can be visualized using the accompanying guess-ploidy.py script.\n"
+        "       Note that this plugin is intended to replace the former vcf2sex plugin.\n"
+        "\n"
+        "Usage: bcftools +guess-ploidy <file.vcf.gz> [Plugin Options]\n"
+        "Plugin options:\n"
+        "       --AF-dflt <float>           the default alternate allele frequency [0.5]\n"
+        "       --AF-tag <TAG>              use TAG for allele frequency\n"
+        "   -e, --error-rate <float>        probability of GT being wrong (with -t GT) [1e-3]\n"
+        "       --exclude <expr>            exclude sites for which the expression is true\n"
+        "   -i, --include-indels            do not skip indel sites\n"
+        "       --include <expr>            include only sites for which the expression is true\n"
+        "   -g, --genome <str>              shortcut to select nonPAR region for common genomes b37|hg19|b38|hg38\n"
+        "   -r, --regions <chr:beg-end>     restrict to comma-separated list of regions\n"
+        "   -R, --regions-file <file>       restrict to regions listed in a file\n"
+        "   -t, --tag <tag>                 genotype or genotype likelihoods: GT, PL, GL [PL]\n"
+        "   -v, --verbose                   verbose output (specify twice to increase verbosity)\n"
+        "\n"
+        "Region shortcuts:\n"
+        "   b37  .. -r X:2699521-154931043      # GRCh37 no-chr prefix\n"
+        "   b38  .. -r X:2781480-155701381      # GRCh38 no-chr prefix\n"
+        "   hg19 .. -r chrX:2699521-154931043   # GRCh37 chr prefix\n"
+        "   hg38 .. -r chrX:2781480-155701381   # GRCh38 chr prefix\n"
+        "\n"
+        "Examples:\n"
+        "   bcftools +guess-ploidy -g b37 in.vcf.gz\n"
+        "   bcftools +guess-ploidy in.vcf.gz -t GL -r chrX:2699521-154931043\n"
+        "   bcftools view file.vcf.gz -r chrX:2699521-154931043 | bcftools +guess-ploidy\n"
+        "   bcftools +guess-ploidy in.bcf -v > ploidy.txt && guess-ploidy.py ploidy.txt img\n"
+        "\n";
+}
+
+static inline int smpl_pass(args_t *args, int ismpl)
+{
+    if ( !args->smpl_pass ) return 1;
+    int pass = args->smpl_pass[ismpl];
+    if ( args->filter_logic & FLT_EXCLUDE ) pass = pass ? 0 : 1;
+    if ( pass ) return 1;
+    return 0;
+}
+
+void process_region_guess(args_t *args)
+{
+    while ( bcf_sr_next_line(args->sr) )
+    {
+        bcf1_t *rec = bcf_sr_get_line(args->sr,0);
+        if ( rec->n_allele==1 ) continue;
+        if ( !args->include_indels && !(bcf_get_variant_types(rec)&VCF_SNP) ) continue;
+
+        if ( args->filter )
+        {
+            int pass = filter_test(args->filter, rec, &args->smpl_pass);
+            if ( args->filter_logic & FLT_EXCLUDE ) pass = pass ? 0 : 1;
+            if ( !args->smpl_pass && !pass ) continue;     // site-level filtering, not per-sample filtering
+        }
+
+        double freq[2] = {0,0}, sum;
+        int ismpl,i;
+        if ( args->tag & GUESS_GT )   // use GTs to guess the ploidy, considering only one ALT
+        {
+            int ngt = bcf_get_genotypes(args->hdr,rec,&args->arr,&args->narr);
+            if ( ngt<=0 ) continue;
+            ngt /= args->nsample;
+            for (ismpl=0; ismpl<args->nsample; ismpl++)
+            {
+                if ( !smpl_pass(args,ismpl) ) continue;
+                int32_t *ptr = args->arr + ismpl*ngt;
+                double *tmp = args->tmpf + ismpl*3;
+
+                if ( ptr[0]==bcf_gt_missing ) 
+                {
+                    tmp[0] = -1;
+                    continue;
+                }
+                if ( ptr[1]==bcf_int32_vector_end )
+                {
+                    if ( bcf_gt_allele(ptr[0])==0 ) // haploid R
+                    {
+                        tmp[0] = 1 - 2*args->gt_err_prob;
+                        tmp[1] = tmp[2] = args->gt_err_prob;
+                    }
+                    else    // haploid A
+                    {
+                        tmp[0] = tmp[1] = args->gt_err_prob;
+                        tmp[2] = 1 - 2*args->gt_err_prob;
+                    }
+                    continue;
+                }
+                if ( bcf_gt_allele(ptr[0])==0 && bcf_gt_allele(ptr[1])==0 ) // RR
+                {
+                    tmp[0] = 1 - 2*args->gt_err_prob;
+                    tmp[1] = tmp[2] = args->gt_err_prob;
+                }
+                else if ( bcf_gt_allele(ptr[0])==bcf_gt_allele(ptr[1]) ) // AA
+                {
+                    tmp[0] = tmp[1] = args->gt_err_prob;
+                    tmp[2] = 1 - 2*args->gt_err_prob;
+                }
+                else  // RA or hetAA, treating as RA
+                {
+                    tmp[1] = 1 - 2*args->gt_err_prob;
+                    tmp[0] = tmp[2] = args->gt_err_prob;
+                }
+                freq[0] += 2*tmp[0]+tmp[1];
+                freq[1] += tmp[1]+2*tmp[2];
+            }
+        }
+        else if ( args->tag & GUESS_PL )    // use PL to guess the ploidy, restrict to first ALT allele
+        {
+            int npl = bcf_get_format_int32(args->hdr,rec,"PL",&args->arr,&args->narr);
+            if ( npl<=0 ) continue;
+            npl /= args->nsample;
+            int ndip_gt = rec->n_allele*(rec->n_allele+1)/2;
+            if ( npl==ndip_gt )             // diploid
+            {
+                for (ismpl=0; ismpl<args->nsample; ismpl++)
+                {
+                    if ( !smpl_pass(args,ismpl) ) continue;
+                    int32_t *ptr = args->arr + ismpl*npl;
+                    double *tmp = args->tmpf + ismpl*3;
+
+                    // restrict to first ALT
+                    if ( ptr[0]==bcf_int32_missing || ptr[1]==bcf_int32_missing || ptr[2]==bcf_int32_missing ) 
+                    {
+                        tmp[0] = -1;
+                        continue;
+                    }
+                    if ( ptr[0]==ptr[1] && ptr[0]==ptr[2] ) // non-informative
+                    {
+                        tmp[0] = -1;
+                        continue;
+                    }
+                    if ( ptr[2]==bcf_int32_vector_end )
+                    {
+                        tmp[0] = (ptr[0]<0 || ptr[0]>=256) ? args->pl2p[255] : args->pl2p[ptr[0]];
+                        tmp[1] = args->pl2p[255];
+                        tmp[2] = (ptr[1]<0 || ptr[1]>=256) ? args->pl2p[255] : args->pl2p[ptr[1]];
+                    }
+                    else
+                        for (i=0; i<3; i++)
+                            tmp[i] = (ptr[i]<0 || ptr[i]>=256) ? args->pl2p[255] : args->pl2p[ptr[i]];
+
+                    sum = 0;
+                    for (i=0; i<3; i++) sum += tmp[i];
+                    for (i=0; i<3; i++) tmp[i] /= sum;
+
+                    if ( ptr[2]==bcf_int32_vector_end )
+                    {
+                        freq[0] += tmp[0];
+                        freq[1] += tmp[2];
+                    }
+                    else
+                    {
+                        freq[0] += 2*tmp[0]+tmp[1];
+                        freq[1] += tmp[1]+2*tmp[2];
+                    }
+                }
+            }
+            else if ( npl==rec->n_allele )  // all samples haploid
+            {
+                for (ismpl=0; ismpl<args->nsample; ismpl++)
+                {
+                    if ( !smpl_pass(args,ismpl) ) continue;
+                    int32_t *ptr = args->arr + ismpl*npl;
+                    double *tmp = args->tmpf + ismpl*3;
+
+                    // restrict to first ALT
+                    if ( ptr[0]==bcf_int32_missing || ptr[1]==bcf_int32_missing ) 
+                    {
+                        tmp[0] = -1;
+                        continue;
+                    }
+                    tmp[0] = (ptr[0]<0 || ptr[0]>=256) ? args->pl2p[255] : args->pl2p[ptr[0]];
+                    tmp[1] = args->pl2p[255];
+                    tmp[2] = (ptr[1]<0 || ptr[1]>=256) ? args->pl2p[255] : args->pl2p[ptr[1]];
+
+                    sum = 0;
+                    for (i=0; i<3; i++) sum += tmp[i];
+                    for (i=0; i<3; i++) tmp[i] /= sum;
+
+                    freq[0] += tmp[0];
+                    freq[1] += tmp[2];
+                }
+            }
+            else
+                continue;   // neither diploid nor haploid
+        }
+        else    // use GL
+        {
+            int ngl = bcf_get_format_float(args->hdr,rec,"GL",&args->farr,&args->nfarr);
+            if ( ngl<=0 ) continue;
+            ngl /= args->nsample;
+            int ndip_gt = rec->n_allele*(rec->n_allele+1)/2;
+            if ( ngl==ndip_gt )             // diploid
+            {
+                for (ismpl=0; ismpl<args->nsample; ismpl++)
+                {
+                    if ( !smpl_pass(args,ismpl) ) continue;
+                    float *ptr = args->farr + ismpl*ngl;
+                    double *tmp = args->tmpf + ismpl*3;
+
+                    // restrict to first ALT
+                    if ( bcf_float_is_missing(ptr[0]) || bcf_float_is_missing(ptr[1]) || bcf_float_is_missing(ptr[2]) ) 
+                    {
+                        tmp[0] = -1;
+                        continue;
+                    }
+                    if ( ptr[0]==ptr[1] && ptr[0]==ptr[2] ) // non-informative
+                    {
+                        tmp[0] = -1;
+                        continue;
+                    }
+                    if ( bcf_float_is_vector_end(ptr[2]) )
+                    {
+                        tmp[0] = pow(10.,ptr[0]);
+                        tmp[1] = 1e-26;             // arbitrary small value for a het
+                        tmp[2] = pow(10.,ptr[1]);
+                    }
+                    else
+                        for (i=0; i<3; i++)
+                            tmp[i] = pow(10.,ptr[i]);
+
+                    sum = 0;
+                    for (i=0; i<3; i++) sum += tmp[i];
+                    for (i=0; i<3; i++) tmp[i] /= sum;
+
+                    if ( bcf_float_is_vector_end(ptr[2]) )
+                    {
+                        freq[0] += tmp[0];
+                        freq[1] += tmp[2];
+                    }
+                    else
+                    {
+                        freq[0] += 2*tmp[0]+tmp[1];
+                        freq[1] += tmp[1]+2*tmp[2];
+                    }
+                }
+            }
+            else if ( ngl==rec->n_allele )  // all samples haploid
+            {
+                for (ismpl=0; ismpl<args->nsample; ismpl++)
+                {
+                    if ( !smpl_pass(args,ismpl) ) continue;
+                    float *ptr = args->farr + ismpl*ngl;
+                    double *tmp = args->tmpf + ismpl*3;
+
+                    // restrict to first ALT
+                    if ( bcf_float_is_missing(ptr[0]) || bcf_float_is_missing(ptr[1]) ) 
+                    {
+                        tmp[0] = -1;
+                        continue;
+                    }
+                    tmp[0] = pow(10.,ptr[0]);
+                    tmp[1] = 1e-26;
+                    tmp[2] = pow(10.,ptr[1]);
+
+                    sum = 0;
+                    for (i=0; i<3; i++) sum += tmp[i];
+                    for (i=0; i<3; i++) tmp[i] /= sum;
+
+                    freq[0] += tmp[0];
+                    freq[1] += tmp[2];
+                }
+            }
+            else
+                continue;   // neither diploid nor haploid
+        }
+        if ( args->af_tag )
+        {
+            int ret = bcf_get_info_float(args->hdr,rec,args->af_tag,&args->af, &args->maf);
+            if ( ret>0 ) { freq[0] = 1 - args->af[0]; freq[1] = args->af[0]; }
+        }
+
+        if ( !freq[0] && !freq[1] ) { freq[0] = 1 - args->af_dflt; freq[1] = args->af_dflt; }
+        sum = freq[0] + freq[1];
+        freq[0] /= sum;
+        freq[1] /= sum;
+        for (ismpl=0; ismpl<args->nsample; ismpl++)
+        {
+            if ( !smpl_pass(args,ismpl) ) continue;
+            count_t *counts = &args->stats.counts[ismpl];
+            double *tmp = args->tmpf + ismpl*3;
+            if ( tmp[0] < 0 ) continue;
+            double phap = freq[0]*tmp[0] + freq[1]*tmp[2];
+            double pdip = freq[0]*freq[0]*tmp[0] + 2*freq[0]*freq[1]*tmp[1] + freq[1]*freq[1]*tmp[2];
+            counts->phap += log(phap);
+            counts->pdip += log(pdip);
+            counts->ncount++;
+            if ( args->verbose>1 )
+                fprintf(bcftools_stdout, "DBG\t%s\t%d\t%s\t%e\t%e\t%e\t%e\t%e\t%e\n", bcf_seqname(args->hdr,rec),rec->pos+1,bcf_hdr_int2id(args->hdr,BCF_DT_SAMPLE,ismpl),
+                    freq[1],tmp[0],tmp[1],tmp[2],phap,pdip);
+        }
+    }
+}
+
+int run(int argc, char **argv)
+{
+    args_t *args = (args_t*) calloc(1,sizeof(args_t));
+    args->tag    = GUESS_PL;
+    args->argc   = argc; args->argv = argv;
+    args->gt_err_prob = 1e-3;
+    args->af_dflt = 0.5;
+    char *region  = NULL;
+    int region_is_file = 0;
+    static struct option loptions[] =
+    {
+        {"AF-tag",required_argument,NULL,0},
+        {"AF-dflt",required_argument,NULL,1},
+        {"exclude",required_argument,NULL,2},
+        {"include",required_argument,NULL,3},
+        {"verbose",no_argument,NULL,'v'},
+        {"include-indels",no_argument,NULL,'i'},
+        {"error-rate",required_argument,NULL,'e'},
+        {"tag",required_argument,NULL,'t'},
+        {"genome",required_argument,NULL,'g'},
+        {"regions",required_argument,NULL,'r'},
+        {"regions-file",required_argument,NULL,'R'},
+        {"background",required_argument,NULL,'b'},
+        {NULL,0,NULL,0}
+    };
+    int c;
+    char *tmp;
+    while ((c = getopt_long(argc, argv, "vr:R:t:e:ig:",loptions,NULL)) >= 0)
+    {
+        switch (c) 
+        {
+            case 0: args->af_tag = optarg; break;
+            case 1: 
+                    args->af_dflt = strtod(optarg,&tmp);
+                    if ( *tmp ) error("Could not parse: --AF-dflt %s\n", optarg);
+                    break;
+            case 2: args->filter_str = optarg; args->filter_logic |= FLT_EXCLUDE; break;
+            case 3: args->filter_str = optarg; args->filter_logic |= FLT_INCLUDE; break;
+            case 'i': args->include_indels = 1; break;
+            case 'e':
+                args->gt_err_prob = strtod(optarg,&tmp);
+                if ( *tmp ) error("Could not parse: -e %s\n", optarg);
+                if ( args->gt_err_prob<0 || args->gt_err_prob>1 ) error("Expected value from the interval [0,1]: -e %s\n", optarg);
+                break;
+            case 'g':
+                if ( !strcasecmp(optarg,"b37") ) region = "X:2699521-154931043";
+                else if ( !strcasecmp(optarg,"b38") ) region = "X:2781480-155701381";
+                else if ( !strcasecmp(optarg,"hg19") ) region = "chrX:2699521-154931043";
+                else if ( !strcasecmp(optarg,"hg38") ) region = "chrX:2781480-155701381";
+                else error("The argument not recognised, expected --genome b37, b38, hg19 or hg38: %s\n", optarg);
+                break;
+            case 'R': region_is_file = 1;  // fall through: -R also sets the region below
+            case 'r': region = optarg; break; 
+            case 'v': args->verbose++; break; 
+            case 't':
+                if ( !strcasecmp(optarg,"GT") ) args->tag = GUESS_GT;
+                else if ( !strcasecmp(optarg,"PL") ) args->tag = GUESS_PL;
+                else if ( !strcasecmp(optarg,"GL") ) args->tag = GUESS_GL;
+                else error("The argument not recognised, expected --tag GT, PL or GL: %s\n", optarg);
+                break;
+            case 'h':
+            case '?':
+            default: error("%s", usage_text()); break;
+        }
+    }
+    if ( args->filter_logic == (FLT_EXCLUDE|FLT_INCLUDE) ) error("Only one of --include or --exclude can be given.\n");
+
+    char *fname = NULL;
+    if ( optind==argc )
+    {
+        if ( !isatty(fileno((FILE *)stdin)) ) fname = "-";  // reading from stdin
+        else { error("%s",usage_text()); }
+    }
+    else if ( optind+1!=argc ) error("%s",usage_text());
+    else fname = argv[optind];
+
+    args->sr = bcf_sr_init();
+    if ( strcmp("-",fname) )
+    {
+        if ( region )
+        {
+            args->sr->require_index = 1;
+            if ( bcf_sr_set_regions(args->sr, region, region_is_file)<0 )
+                error("Failed to read the regions: %s\n",region);
+        }
+    }
+    else
+    {
+        if ( region )
+        {
+            if ( bcf_sr_set_targets(args->sr, region, region_is_file, 0)<0 )
+                error("Failed to read the targets: %s\n",region);
+        }
+    }
+    if ( !bcf_sr_add_reader(args->sr,fname) ) error("Error: %s\n", bcf_sr_strerror(args->sr->errnum));
+    args->hdr = args->sr->readers[0].header;
+    args->nsample = bcf_hdr_nsamples(args->hdr);
+    args->stats.counts = (count_t*) calloc(args->nsample,sizeof(count_t));
+
+    if ( args->filter_str )
+        args->filter = filter_init(args->hdr, args->filter_str);
+
+    if ( args->af_tag && !bcf_hdr_idinfo_exists(args->hdr,BCF_HL_INFO,bcf_hdr_id2int(args->hdr,BCF_DT_ID,args->af_tag)) )
+        error("No such INFO tag: %s\n", args->af_tag);
+
+    if ( args->tag&GUESS_PL && bcf_hdr_id2int(args->hdr, BCF_DT_ID, "PL")<0 )
+    {
+        fprintf(bcftools_stderr, "Warning: PL tag not found in header, switching to GL\n");
+        args->tag = GUESS_GL;
+    }
+
+    if ( args->tag&GUESS_GL && bcf_hdr_id2int(args->hdr, BCF_DT_ID, "GL")<0 )
+    {
+        fprintf(bcftools_stderr, "Warning: GL tag not found in header, switching to GT\n");
+        args->tag = GUESS_GT;
+    }
+
+    if ( args->tag&GUESS_GT && bcf_hdr_id2int(args->hdr, BCF_DT_ID, "GT")<0 )
+        error("Error: GT tag not found in header\n");
+
+    int i;
+    if ( args->tag&GUESS_PL )
+    {
+        args->pl2p = (double*) calloc(256,sizeof(double));
+        for (i=0; i<256; i++) args->pl2p[i] = pow(10., -i/10.);
+    }
+    if ( args->tag&GUESS_PL || args->tag&GUESS_GL || args->tag&GUESS_GT )
+        args->tmpf = (double*) malloc(sizeof(*args->tmpf)*3*args->nsample);
+
+    if ( args->verbose )
+    {
+        fprintf(bcftools_stdout, "# This file was produced by: bcftools +guess-ploidy(%s+htslib-%s)\n", bcftools_version(),hts_version());
+        fprintf(bcftools_stdout, "# The command line was:\tbcftools +%s", args->argv[0]);
+        for (i=1; i<args->argc; i++)
+            fprintf(bcftools_stdout, " %s",args->argv[i]);
+        fprintf(bcftools_stdout, "\n");
+        fprintf(bcftools_stdout, "# [1]SEX\t[2]Sample\t[3]Predicted sex\t[4]log P(Haploid)/nSites\t[5]log P(Diploid)/nSites\t[6]nSites\t[7]Score: F < 0 < M ($4-$5)\n");
+        if ( args->verbose>1 )
+            fprintf(bcftools_stdout, "# [1]DBG\t[2]Chr\t[3]Pos\t[4]Sample\t[5]AF\t[6]pRR\t[7]pRA\t[8]pAA\t[9]P(Haploid)\t[10]P(Diploid)\n");
+    }
+
+    process_region_guess(args);
+
+    for (i=0; i<args->nsample; i++)
+    {
+        double phap = args->stats.counts[i].ncount ? args->stats.counts[i].phap / args->stats.counts[i].ncount : 0.5;
+        double pdip = args->stats.counts[i].ncount ? args->stats.counts[i].pdip / args->stats.counts[i].ncount : 0.5;
+        char predicted_sex = 'U';
+        if (phap>pdip) predicted_sex = 'M';
+        else if (phap<pdip) predicted_sex = 'F';
+        if ( args->verbose )
+        {
+            fprintf(bcftools_stdout, "SEX\t%s\t%c\t%f\t%f\t%"PRIu64"\t%f\n", args->hdr->samples[i],predicted_sex,
+                    phap,pdip,args->stats.counts[i].ncount,phap-pdip);
+        }
+        else
+            fprintf(bcftools_stdout, "%s\t%c\n", args->hdr->samples[i],predicted_sex);
+    }
+   
+    if ( args->filter )
+        filter_destroy(args->filter);
+
+    bcf_sr_destroy(args->sr);
+    free(args->pl2p);
+    free(args->tmpf);
+    free(args->counts);
+    free(args->stats.counts);
+    free(args->arr);
+    free(args->farr);
+    free(args->af);
+    free(args);
+    return 0;
+}
diff --git a/bcftools/plugins/impute-info.c b/bcftools/plugins/impute-info.c
new file mode 100644
index 0000000..42d4eb4
--- /dev/null
+++ b/bcftools/plugins/impute-info.c
@@ -0,0 +1,165 @@
+/*  plugins/impute-info.c -- adds info metrics to a VCF file.
+
+    Copyright (C) 2015-2016 Genome Research Ltd.
+
+    Author: Shane McCarthy <sm15@sanger.ac.uk>
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE.  */
+
+
+/*
+
+Marchini & Howie, Nature Reviews Genetics, 11 (July 2010)
+
+Let G_ij in {0,1,2} be the genotype of the ith individual at the
+jth SNP in a study cohort of N samples. Let
+
+    p_ijk = P(G_ij = k | H,G)
+
+be the probability that the genotype at the jth SNP of the ith
+individual is k.
+
+Let the expected allele dosage for the genotype at the jth SNP
+of the ith individual be
+    
+    e_ij = p_ij1 + 2 * p_ij2
+
+and define
+
+    f_ij = p_ij1 + 4 * p_ij2
+
+Let theta_j denote the (unknown) population allele frequency of the jth SNP
+with:
+
+    theta_j = (SUM[i=1..N] e_ij) / (2 * N)
+
+The IMPUTE2 information measure is then:
+
+    if theta_j in (0,1):
+        I(theta_j) = 1 - (SUM[i=1..N] (f_ij - e_ij^2)) / (2 * N * theta_j * (1 - theta_j))
+    else:
+        I(theta_j) = 1
+
+*/
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <htslib/vcf.h>
+#include <math.h>
+#include <getopt.h>
+
+const char *about(void)
+{
+    return "Add imputation information metrics to the INFO field based on selected FORMAT tags.\n";
+}
+
+const char *usage(void)
+{
+    return 
+        "\n"
+        "About: Add imputation information metrics to the INFO field based\n"
+        "       on selected FORMAT tags. Only the IMPUTE2 INFO metric from\n"
+        "       FORMAT/GP tags is currently available.\n"
+        "Usage: bcftools +impute-info [General Options] -- [Plugin Options]\n"
+        "Options:\n"
+        "   run \"bcftools plugin\" for a list of common options\n"
+        "\n"
+        // "Plugin options:\n"
+        // "   -i, --info <list>   information metrics to add [INFO]\n" // [BEAGLE_R2,MACH_R2]
+        // "   -t, --tags <tag>    VCF tags to determine the information from [GP]\n"
+        // "\n"
+        "Example:\n"
+        "   bcftools +impute-info in.vcf\n"
+        "\n";
+}
+
+bcf_hdr_t *in_hdr = NULL, *out_hdr = NULL;
+int gp_type = BCF_HT_REAL;
+uint8_t *buf = NULL;
+int nbuf = 0;   // NB: number of elements, not bytes
+int nrec = 0, nskip_gp = 0, nskip_dip = 0;
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+    in_hdr  = in;
+    out_hdr = out;
+    bcf_hdr_append(out_hdr,"##INFO=<ID=INFO,Number=1,Type=Float,Description=\"IMPUTE2 info score\">");
+    return 0;
+}
+
+bcf1_t *process(bcf1_t *rec)
+{
+    int nval = 0, i, j, nret = bcf_get_format_values(in_hdr,rec,"GP",(void**)&buf,&nbuf,gp_type);
+    if ( nret<0 )
+    {
+        if (!nskip_gp) fprintf(stderr, "[impute-info.c] Warning: info tag not added to sites without GP tag\n");
+        nskip_gp++;
+        return rec; // require FORMAT/GP tag, return site unchanged
+    }
+
+    nret /= rec->n_sample;
+    if ( nret != 3 )
+    {
+        if (!nskip_dip) fprintf(stderr, "[impute-info.c] Warning: info tag not added to sites that are not biallelic diploid\n");
+        nskip_dip++;
+        return rec; // require biallelic diploid, return site unchanged
+    }
+
+    double esum = 0, e2sum = 0, fsum = 0;
+    #define BRANCH(type_t,is_missing,is_vector_end) \
+    { \
+        type_t *ptr = (type_t*) buf; \
+        for (i=0; i<rec->n_sample; i++) \
+        { \
+            double vals[3] = {0,0,0}; \
+            for (j=0; j<nret; j++) \
+            { \
+                if ( is_missing || is_vector_end ) break; \
+                vals[j] = ptr[j]; \
+            } \
+            double norm = vals[0]+vals[1]+vals[2]; \
+            if ( norm ) for (j=0; j<3; j++) vals[j] /= norm; \
+            esum  += vals[1] + 2*vals[2]; \
+            e2sum += (vals[1] + 2*vals[2]) * (vals[1] + 2*vals[2]); \
+            fsum  += vals[1] + 4*vals[2]; \
+            ptr   += nret; \
+            nval++; \
+        } \
+    }
+    switch (gp_type)
+    {
+        case BCF_HT_INT:  BRANCH(int32_t,ptr[j]==bcf_int32_missing,ptr[j]==bcf_int32_vector_end); break;
+        case BCF_HT_REAL: BRANCH(float,bcf_float_is_missing(ptr[j]),bcf_float_is_vector_end(ptr[j])); break;
+    }
+    #undef BRANCH
+
+    double theta = esum / (2 * (double)nval);
+    float info  = (theta>0 && theta<1) ? (float)(1 - (fsum - e2sum) / (2 * (double)nval * theta * (1.0 - theta))) : 1;
+
+    bcf_update_info_float(out_hdr, rec, "INFO", &info, 1);
+    nrec++;
+    return rec;
+}
+
+
+void destroy(void)
+{
+    fprintf(stderr,"Lines total/info-added/unchanged-no-tag/unchanged-not-biallelic-diploid:\t%d/%d/%d/%d\n", nrec+nskip_gp+nskip_dip, nrec, nskip_gp, nskip_dip);
+    free(buf);
+}
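
Note on the metric computed in `process`: the info score can be checked by hand on small inputs. The sketch below computes the same quantity from a flat array of GP triplets, without htslib. The function name `impute2_info` and the array layout are assumptions made for illustration, and missing values are not handled.

```c
#include <stddef.h>

/* Hypothetical standalone function: gp holds n triplets
   P(RR), P(RA), P(AA), one per sample. Returns the IMPUTE2-style
   info score, or 1 when theta falls outside the open interval (0,1). */
static double impute2_info(const double *gp, size_t n)
{
    double esum = 0, e2sum = 0, fsum = 0, theta;
    size_t i;
    if (!n) return 1.0;
    for (i = 0; i < n; i++)
    {
        const double *p = gp + 3*i;
        double norm = p[0] + p[1] + p[2];
        double p1 = norm ? p[1]/norm : 0;
        double p2 = norm ? p[2]/norm : 0;
        double e  = p1 + 2*p2;          /* expected allele dosage e_ij */
        esum  += e;
        e2sum += e*e;
        fsum  += p1 + 4*p2;             /* f_ij */
    }
    theta = esum / (2.0 * n);
    if (theta <= 0 || theta >= 1) return 1.0;
    return 1.0 - (fsum - e2sum) / (2.0 * n * theta * (1.0 - theta));
}
```

With fully certain genotypes the numerator vanishes and the score is 1. If every sample carries the HWE prior (0.25, 0.5, 0.25), the score drops to 0.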
diff --git a/bcftools/plugins/impute-info.c.pysam.c b/bcftools/plugins/impute-info.c.pysam.c
new file mode 100644
index 0000000..67545e5
--- /dev/null
+++ b/bcftools/plugins/impute-info.c.pysam.c
@@ -0,0 +1,167 @@
+#include "bcftools.pysam.h"
+
+/*  plugins/impute-info.c -- adds info metrics to a VCF file.
+
+    Copyright (C) 2015-2016 Genome Research Ltd.
+
+    Author: Shane McCarthy <sm15@sanger.ac.uk>
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE.  */
+
+
+/*
+
+Marchini & Howie, Nature Reviews Genetics, 11 (July 2010)
+
+Let G_ij in {0,1,2} be the genotype of the ith individual at the
+jth SNP in a study cohort of N samples. Let
+
+    p_ijk = P(G_ij = k | H,G)
+
+be the probability that the genotype at the jth SNP of the ith
+individual is k.
+
+Let the expected allele dosage for the genotype at the jth SNP
+of the ith individual be
+    
+    e_ij = p_ij1 + 2 * p_ij2
+
+and define
+
+    f_ij = p_ij1 + 4 * p_ij2
+
+Let theta_j denote the (unknown) population allele frequency of the jth SNP
+with:
+
+    theta_j = SUM[i=1..N] e_ij / (2 * N)
+
+The IMPUTE2 information measure is then:
+
+    if theta_j in (0,1):
+        I(theta_j) = 1 - SUM[i=1..N](f_ij - e_ij^2) / (2 * N * theta_j * (1 - theta_j))
+    else:
+        I(theta_j) = 1
+
+*/
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <htslib/vcf.h>
+#include <math.h>
+#include <getopt.h>
+
+const char *about(void)
+{
+    return "Add imputation information metrics to the INFO field based on selected FORMAT tags.\n";
+}
+
+const char *usage(void)
+{
+    return 
+        "\n"
+        "About: Add imputation information metrics to the INFO field based\n"
+        "       on selected FORMAT tags. Only the IMPUTE2 INFO metric from\n"
+        "       FORMAT/GP tags is currently available.\n"
+        "Usage: bcftools +impute-info [General Options] -- [Plugin Options]\n"
+        "Options:\n"
+        "   run \"bcftools plugin\" for a list of common options\n"
+        "\n"
+        // "Plugin options:\n"
+        // "   -i, --info <list>   information metrics to add [INFO]\n" // [BEAGLE_R2,MACH_R2]
+        // "   -t, --tags <tag>    VCF tags to determine the information from [GP]\n"
+        // "\n"
+        "Example:\n"
+        "   bcftools +impute-info in.vcf\n"
+        "\n";
+}
+
+bcf_hdr_t *in_hdr = NULL, *out_hdr = NULL;
+int gp_type = BCF_HT_REAL;
+uint8_t *buf = NULL;
+int nbuf = 0;   // NB: number of elements, not bytes
+int nrec = 0, nskip_gp = 0, nskip_dip = 0;
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+    in_hdr  = in;
+    out_hdr = out;
+    bcf_hdr_append(out_hdr,"##INFO=<ID=INFO,Number=1,Type=Float,Description=\"IMPUTE2 info score\">");
+    return 0;
+}
+
+bcf1_t *process(bcf1_t *rec)
+{
+    int nval = 0, i, j, nret = bcf_get_format_values(in_hdr,rec,"GP",(void**)&buf,&nbuf,gp_type);
+    if ( nret<0 )
+    {
+        if (!nskip_gp) fprintf(bcftools_stderr, "[impute-info.c] Warning: info tag not added to sites without GP tag\n");
+        nskip_gp++;
+        return rec; // require FORMAT/GP tag, return site unchanged
+    }
+
+    nret /= rec->n_sample;
+    if ( nret != 3 )
+    {
+        if (!nskip_dip) fprintf(bcftools_stderr, "[impute-info.c] Warning: info tag not added to sites that are not biallelic diploid\n");
+        nskip_dip++;
+        return rec; // require biallelic diploid, return site unchanged
+    }
+
+    double esum = 0, e2sum = 0, fsum = 0;
+    #define BRANCH(type_t,is_missing,is_vector_end) \
+    { \
+        type_t *ptr = (type_t*) buf; \
+        for (i=0; i<rec->n_sample; i++) \
+        { \
+            double vals[3] = {0,0,0}; \
+            for (j=0; j<nret; j++) \
+            { \
+                if ( is_missing || is_vector_end ) break; \
+                vals[j] = ptr[j]; \
+            } \
+            double norm = vals[0]+vals[1]+vals[2]; \
+            if ( norm ) for (j=0; j<3; j++) vals[j] /= norm; \
+            esum  += vals[1] + 2*vals[2]; \
+            e2sum += (vals[1] + 2*vals[2]) * (vals[1] + 2*vals[2]); \
+            fsum  += vals[1] + 4*vals[2]; \
+            ptr   += nret; \
+            nval++; \
+        } \
+    }
+    switch (gp_type)
+    {
+        case BCF_HT_INT:  BRANCH(int32_t,ptr[j]==bcf_int32_missing,ptr[j]==bcf_int32_vector_end); break;
+        case BCF_HT_REAL: BRANCH(float,bcf_float_is_missing(ptr[j]),bcf_float_is_vector_end(ptr[j])); break;
+    }
+    #undef BRANCH
+
+    double theta = esum / (2 * (double)nval);
+    float info  = (theta>0 && theta<1) ? (float)(1 - (fsum - e2sum) / (2 * (double)nval * theta * (1.0 - theta))) : 1;
+
+    bcf_update_info_float(out_hdr, rec, "INFO", &info, 1);
+    nrec++;
+    return rec;
+}
+
+
+void destroy(void)
+{
+    fprintf(bcftools_stderr,"Lines total/info-added/unchanged-no-tag/unchanged-not-biallelic-diploid:\t%d/%d/%d/%d\n", nrec+nskip_gp+nskip_dip, nrec, nskip_gp, nskip_dip);
+    free(buf);
+}
diff --git a/bcftools/plugins/isecGT.c b/bcftools/plugins/isecGT.c

new file mode 100644 (file)

index 0000000..38b2708
--- /dev/null
+++ b/bcftools/plugins/isecGT.c
@@ -0,0 +1,177 @@
+/* 
+    Copyright (C) 2016 Genome Research Ltd.
+
+    Author: Petr Danecek <pd3@sanger.ac.uk>
+
+    Permission is hereby granted, free of charge, to any person obtaining a copy
+    of this software and associated documentation files (the "Software"), to deal
+    in the Software without restriction, including without limitation the rights
+    to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+    copies of the Software, and to permit persons to whom the Software is
+    furnished to do so, subject to the following conditions:
+    
+    The above copyright notice and this permission notice shall be included in
+    all copies or substantial portions of the Software.
+    
+    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+    IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+    FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+    AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+    LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+    OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+    THE SOFTWARE.
+*/
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <getopt.h>
+#include <stdarg.h>
+#include <stdint.h>
+#include <htslib/vcf.h>
+#include <htslib/synced_bcf_reader.h>
+#include <htslib/vcfutils.h>
+#include <inttypes.h>
+#include <unistd.h>
+#include <errno.h>
+#include "bcftools.h"
+#include "smpl_ilist.h"
+
+typedef struct
+{
+    int argc, output_type, regions_is_file, targets_is_file;
+    char **argv, *output_fname, *regions_list, *targets_list;
+    int32_t *arr_a, narr_a, *arr_b, narr_b;
+    bcf_srs_t *sr;
+    bcf_hdr_t *hdr_a, *hdr_b;
+    htsFile *out_fh;
+}
+args_t;
+
+const char *about(void)
+{
+    return "Compare two files and set non-identical genotypes to missing.\n";
+}
+
+static const char *usage_text(void)
+{
+    return 
+        "\n"
+        "About: Compare two files and set non-identical genotypes in the first file to missing.\n"
+        "\n"
+        "Usage: bcftools +isecGT <A.bcf> <B.bcf> [Plugin Options]\n"
+        "Plugin options:\n"
+        "   -o, --output <file>             write output to a file [standard output]\n"
+        "   -O, --output-type <b|u|z|v>     'b' compressed BCF; 'u' uncompressed BCF; 'z' compressed VCF; 'v' uncompressed VCF [v]\n"
+        "   -r, --regions <region>          restrict to comma-separated list of regions\n"
+        "   -R, --regions-file <file>       restrict to regions listed in a file\n"
+        "   -t, --targets <region>          similar to -r but streams rather than index-jumps\n"
+        "   -T, --targets-file <file>       similar to -R but streams rather than index-jumps\n"
+        "\n";
+}
+
+int run(int argc, char **argv)
+{
+    args_t *args = (args_t*) calloc(1,sizeof(args_t));
+    args->output_fname = "-";
+    args->output_type = FT_VCF;
+    static struct option loptions[] =
+    {
+        {"regions",required_argument,NULL,'r'},
+        {"regions-file",required_argument,NULL,'R'},
+        {"targets",required_argument,NULL,'t'},
+        {"targets-file",required_argument,NULL,'T'},
+        {"output",required_argument,NULL,'o'},
+        {"output-type",required_argument,NULL,'O'},
+        {NULL,0,NULL,0}
+    };
+    int c;
+    while ((c = getopt_long(argc, argv, "o:O:r:R:t:T:",loptions,NULL)) >= 0)
+    {
+        switch (c) 
+        {
+            case 'o': args->output_fname = optarg; break;
+            case 'O':
+                      switch (optarg[0]) {
+                          case 'b': args->output_type = FT_BCF_GZ; break;
+                          case 'u': args->output_type = FT_BCF; break;
+                          case 'z': args->output_type = FT_VCF_GZ; break;
+                          case 'v': args->output_type = FT_VCF; break;
+                          default: error("The output type \"%s\" not recognised\n", optarg);
+                      }
+                      break;
+            case 'r': args->regions_list = optarg; break;
+            case 'R': args->regions_list = optarg; args->regions_is_file = 1; break;
+            case 't': args->targets_list = optarg; break;
+            case 'T': args->targets_list = optarg; args->targets_is_file = 1; break;
+            case 'h':
+            case '?':
+            default: error("%s", usage_text()); break;
+        }
+    }
+
+    if ( optind+2!=argc ) error("%s",usage_text());
+
+    args->sr = bcf_sr_init();
+    args->sr->require_index = 1;
+    if ( args->regions_list )
+    {
+        if ( bcf_sr_set_regions(args->sr, args->regions_list, args->regions_is_file)<0 )
+            error("Failed to read the regions: %s\n", args->regions_list);
+    }
+    if ( args->targets_list )
+    {
+        if ( bcf_sr_set_targets(args->sr, args->targets_list, args->targets_is_file, 0)<0 )
+            error("Failed to read the targets: %s\n", args->targets_list);
+        args->sr->collapse |= COLLAPSE_BOTH;
+    }
+    if ( !bcf_sr_add_reader(args->sr,argv[optind]) ) error("Error opening %s: %s\n", argv[optind],bcf_sr_strerror(args->sr->errnum));
+    if ( !bcf_sr_add_reader(args->sr,argv[optind+1]) ) error("Error opening %s: %s\n", argv[optind+1],bcf_sr_strerror(args->sr->errnum));
+    args->hdr_a = bcf_sr_get_header(args->sr,0);
+    args->hdr_b = bcf_sr_get_header(args->sr,1);
+    smpl_ilist_t *smpl = smpl_ilist_map(args->hdr_a, args->hdr_b, SMPL_STRICT);
+    args->out_fh = hts_open(args->output_fname, hts_bcf_wmode(args->output_type));
+    if ( args->out_fh == NULL ) error("Can't write to \"%s\": %s\n", args->output_fname, strerror(errno));
+    bcf_hdr_write(args->out_fh, args->hdr_a);
+    
+    while ( bcf_sr_next_line(args->sr) )
+    {
+        if ( !bcf_sr_has_line(args->sr,0) ) continue;
+        if ( !bcf_sr_has_line(args->sr,1) )
+        {
+            bcf_write(args->out_fh, args->hdr_a, bcf_sr_get_line(args->sr,0));
+            continue;
+        }
+
+        bcf1_t *line_a = bcf_sr_get_line(args->sr,0);
+        bcf1_t *line_b = bcf_sr_get_line(args->sr,1);
+        int ngt_a = bcf_get_genotypes(args->hdr_a, line_a, &args->arr_a, &args->narr_a);
+        int ngt_b = bcf_get_genotypes(args->hdr_b, line_b, &args->arr_b, &args->narr_b);
+        assert( ngt_a==ngt_b );     // todo
+        ngt_a /= smpl->n;
+        ngt_b /= smpl->n;
+        int i, j, dirty = 0;
+        for (i=0; i<smpl->n; i++)
+        {
+            int32_t *a = args->arr_a + i*ngt_a;
+            int32_t *b = args->arr_b + smpl->idx[i]*ngt_b;
+            for (j=0; j<ngt_a; j++)
+                if ( a[j]!=b[j] ) break;
+            if ( j<ngt_a )
+            {
+                dirty = 1;
+                for (j=0; j<ngt_a; j++) a[j] = bcf_gt_missing;
+            }
+        }
+        if ( dirty ) bcf_update_genotypes(args->hdr_a, line_a, args->arr_a, ngt_a*smpl->n);
+        bcf_write(args->out_fh, args->hdr_a, line_a);
+    }
+
+    if ( hts_close(args->out_fh)!=0 ) error("Close failed: %s\n",args->output_fname);
+    smpl_ilist_destroy(smpl);
+    bcf_sr_destroy(args->sr);
+    free(args->arr_a);
+    free(args->arr_b);
+    free(args);
+    return 0;
+}
+
diff --git a/bcftools/plugins/isecGT.c.pysam.c b/bcftools/plugins/isecGT.c.pysam.c

new file mode 100644 (file)

index 0000000..54efcf6
--- /dev/null
+++ b/bcftools/plugins/isecGT.c.pysam.c
@@ -0,0 +1,179 @@
+#include "bcftools.pysam.h"
+
+/* 
+    Copyright (C) 2016 Genome Research Ltd.
+
+    Author: Petr Danecek <pd3@sanger.ac.uk>
+
+    Permission is hereby granted, free of charge, to any person obtaining a copy
+    of this software and associated documentation files (the "Software"), to deal
+    in the Software without restriction, including without limitation the rights
+    to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+    copies of the Software, and to permit persons to whom the Software is
+    furnished to do so, subject to the following conditions:
+    
+    The above copyright notice and this permission notice shall be included in
+    all copies or substantial portions of the Software.
+    
+    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+    IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+    FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+    AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+    LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+    OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+    THE SOFTWARE.
+*/
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <getopt.h>
+#include <stdarg.h>
+#include <stdint.h>
+#include <htslib/vcf.h>
+#include <htslib/synced_bcf_reader.h>
+#include <htslib/vcfutils.h>
+#include <inttypes.h>
+#include <unistd.h>
+#include <errno.h>
+#include "bcftools.h"
+#include "smpl_ilist.h"
+
+typedef struct
+{
+    int argc, output_type, regions_is_file, targets_is_file;
+    char **argv, *output_fname, *regions_list, *targets_list;
+    int32_t *arr_a, narr_a, *arr_b, narr_b;
+    bcf_srs_t *sr;
+    bcf_hdr_t *hdr_a, *hdr_b;
+    htsFile *out_fh;
+}
+args_t;
+
+const char *about(void)
+{
+    return "Compare two files and set non-identical genotypes to missing.\n";
+}
+
+static const char *usage_text(void)
+{
+    return 
+        "\n"
+        "About: Compare two files and set non-identical genotypes in the first file to missing.\n"
+        "\n"
+        "Usage: bcftools +isecGT <A.bcf> <B.bcf> [Plugin Options]\n"
+        "Plugin options:\n"
+        "   -o, --output <file>             write output to a file [standard output]\n"
+        "   -O, --output-type <b|u|z|v>     'b' compressed BCF; 'u' uncompressed BCF; 'z' compressed VCF; 'v' uncompressed VCF [v]\n"
+        "   -r, --regions <region>          restrict to comma-separated list of regions\n"
+        "   -R, --regions-file <file>       restrict to regions listed in a file\n"
+        "   -t, --targets <region>          similar to -r but streams rather than index-jumps\n"
+        "   -T, --targets-file <file>       similar to -R but streams rather than index-jumps\n"
+        "\n";
+}
+
+int run(int argc, char **argv)
+{
+    args_t *args = (args_t*) calloc(1,sizeof(args_t));
+    args->output_fname = "-";
+    args->output_type = FT_VCF;
+    static struct option loptions[] =
+    {
+        {"regions",required_argument,NULL,'r'},
+        {"regions-file",required_argument,NULL,'R'},
+        {"targets",required_argument,NULL,'t'},
+        {"targets-file",required_argument,NULL,'T'},
+        {"output",required_argument,NULL,'o'},
+        {"output-type",required_argument,NULL,'O'},
+        {NULL,0,NULL,0}
+    };
+    int c;
+    while ((c = getopt_long(argc, argv, "o:O:r:R:t:T:",loptions,NULL)) >= 0)
+    {
+        switch (c) 
+        {
+            case 'o': args->output_fname = optarg; break;
+            case 'O':
+                      switch (optarg[0]) {
+                          case 'b': args->output_type = FT_BCF_GZ; break;
+                          case 'u': args->output_type = FT_BCF; break;
+                          case 'z': args->output_type = FT_VCF_GZ; break;
+                          case 'v': args->output_type = FT_VCF; break;
+                          default: error("The output type \"%s\" not recognised\n", optarg);
+                      }
+                      break;
+            case 'r': args->regions_list = optarg; break;
+            case 'R': args->regions_list = optarg; args->regions_is_file = 1; break;
+            case 't': args->targets_list = optarg; break;
+            case 'T': args->targets_list = optarg; args->targets_is_file = 1; break;
+            case 'h':
+            case '?':
+            default: error("%s", usage_text()); break;
+        }
+    }
+
+    if ( optind+2!=argc ) error("%s",usage_text());
+
+    args->sr = bcf_sr_init();
+    args->sr->require_index = 1;
+    if ( args->regions_list )
+    {
+        if ( bcf_sr_set_regions(args->sr, args->regions_list, args->regions_is_file)<0 )
+            error("Failed to read the regions: %s\n", args->regions_list);
+    }
+    if ( args->targets_list )
+    {
+        if ( bcf_sr_set_targets(args->sr, args->targets_list, args->targets_is_file, 0)<0 )
+            error("Failed to read the targets: %s\n", args->targets_list);
+        args->sr->collapse |= COLLAPSE_BOTH;
+    }
+    if ( !bcf_sr_add_reader(args->sr,argv[optind]) ) error("Error opening %s: %s\n", argv[optind],bcf_sr_strerror(args->sr->errnum));
+    if ( !bcf_sr_add_reader(args->sr,argv[optind+1]) ) error("Error opening %s: %s\n", argv[optind+1],bcf_sr_strerror(args->sr->errnum));
+    args->hdr_a = bcf_sr_get_header(args->sr,0);
+    args->hdr_b = bcf_sr_get_header(args->sr,1);
+    smpl_ilist_t *smpl = smpl_ilist_map(args->hdr_a, args->hdr_b, SMPL_STRICT);
+    args->out_fh = hts_open(args->output_fname, hts_bcf_wmode(args->output_type));
+    if ( args->out_fh == NULL ) error("Can't write to \"%s\": %s\n", args->output_fname, strerror(errno));
+    bcf_hdr_write(args->out_fh, args->hdr_a);
+    
+    while ( bcf_sr_next_line(args->sr) )
+    {
+        if ( !bcf_sr_has_line(args->sr,0) ) continue;
+        if ( !bcf_sr_has_line(args->sr,1) )
+        {
+            bcf_write(args->out_fh, args->hdr_a, bcf_sr_get_line(args->sr,0));
+            continue;
+        }
+
+        bcf1_t *line_a = bcf_sr_get_line(args->sr,0);
+        bcf1_t *line_b = bcf_sr_get_line(args->sr,1);
+        int ngt_a = bcf_get_genotypes(args->hdr_a, line_a, &args->arr_a, &args->narr_a);
+        int ngt_b = bcf_get_genotypes(args->hdr_b, line_b, &args->arr_b, &args->narr_b);
+        assert( ngt_a==ngt_b );     // todo
+        ngt_a /= smpl->n;
+        ngt_b /= smpl->n;
+        int i, j, dirty = 0;
+        for (i=0; i<smpl->n; i++)
+        {
+            int32_t *a = args->arr_a + i*ngt_a;
+            int32_t *b = args->arr_b + smpl->idx[i]*ngt_b;
+            for (j=0; j<ngt_a; j++)
+                if ( a[j]!=b[j] ) break;
+            if ( j<ngt_a )
+            {
+                dirty = 1;
+                for (j=0; j<ngt_a; j++) a[j] = bcf_gt_missing;
+            }
+        }
+        if ( dirty ) bcf_update_genotypes(args->hdr_a, line_a, args->arr_a, ngt_a*smpl->n);
+        bcf_write(args->out_fh, args->hdr_a, line_a);
+    }
+
+    if ( hts_close(args->out_fh)!=0 ) error("Close failed: %s\n",args->output_fname);
+    smpl_ilist_destroy(smpl);
+    bcf_sr_destroy(args->sr);
+    free(args->arr_a);
+    free(args->arr_b);
+    free(args);
+    return 0;
+}
+
diff --git a/bcftools/plugins/mendelian.c b/bcftools/plugins/mendelian.c

new file mode 100644 (file)

index 0000000..f3cda54
--- /dev/null
+++ b/bcftools/plugins/mendelian.c
@@ -0,0 +1,566 @@
+/* The MIT License
+
+   Copyright (c) 2015 Genome Research Ltd.
+
+   Author: Petr Danecek <pd3@sanger.ac.uk>
+   
+   Permission is hereby granted, free of charge, to any person obtaining a copy
+   of this software and associated documentation files (the "Software"), to deal
+   in the Software without restriction, including without limitation the rights
+   to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+   copies of the Software, and to permit persons to whom the Software is
+   furnished to do so, subject to the following conditions:
+   
+   The above copyright notice and this permission notice shall be included in
+   all copies or substantial portions of the Software.
+   
+   THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+   IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+   FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+   AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+   LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+   OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+   THE SOFTWARE.
+
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <getopt.h>
+#include <math.h>
+#include <htslib/hts.h>
+#include <htslib/vcf.h>
+#include <htslib/synced_bcf_reader.h>
+#include <errno.h>
+#include <ctype.h>
+#include <unistd.h>     // for isatty
+#include "bcftools.h"
+#include "regidx.h"
+
+#define MODE_COUNT     1
+#define MODE_LIST_GOOD 2
+#define MODE_LIST_BAD  4
+#define MODE_DELETE    8
+
+typedef struct
+{
+    int nok, nbad;
+    int imother,ifather,ichild;
+}
+trio_t;
+
+typedef struct
+{
+    int mpl, fpl, cpl;  // ploidies - mother, father, child
+    int mal, fal;       // expect an allele from mother and father
+}
+rule_t;
+
+typedef struct _args_t
+{
+    regidx_t *rules;
+    regitr_t *itr, *itr_ori;
+    bcf_hdr_t *hdr;
+    htsFile *out_fh;
+    int32_t *gt_arr;
+    int mode;
+    int ngt_arr, nrec;
+    trio_t *trios;
+    int ntrios;
+    int output_type;
+    char *output_fname;
+    bcf_srs_t *sr;
+}
+args_t;
+
+static args_t args;
+static int parse_rules(const char *line, char **chr_beg, char **chr_end, uint32_t *beg, uint32_t *end, void *payload, void *usr);
+static bcf1_t *process(bcf1_t *rec);
+
+const char *about(void)
+{
+    return "Count Mendelian consistent / inconsistent genotypes.\n";
+}
+
+typedef struct
+{
+    const char *alias, *about, *rules;
+}
+rules_predef_t;
+
+static rules_predef_t rules_predefs[] =
+{
+    { .alias = "GRCh37",
+      .about = "Human Genome reference assembly GRCh37 / hg19, both chr naming conventions",
+      .rules =
+            "   X:1-60000               M/M + F > M\n"
+            "   X:1-60000               M/M + F > M/F\n"
+            "   X:2699521-154931043     M/M + F > M\n"
+            "   X:2699521-154931043     M/M + F > M/F\n"
+            "   Y:1-59373566            .   + F > F\n"
+            "   MT:1-16569              M   + F > M\n"
+            "\n"
+            "   chrX:1-60000            M/M + F > M\n"
+            "   chrX:1-60000            M/M + F > M/F\n"
+            "   chrX:2699521-154931043  M/M + F > M\n"
+            "   chrX:2699521-154931043  M/M + F > M/F\n"
+            "   chrY:1-59373566         .   + F > F\n"
+            "   chrM:1-16569            M   + F > M\n"
+    },
+    { .alias = "GRCh38",
+      .about = "Human Genome reference assembly GRCh38 / hg38, both chr naming conventions",
+      .rules =
+            "   X:1-9999                M/M + F > M\n"
+            "   X:1-9999                M/M + F > M/F\n"
+            "   X:2781480-155701381     M/M + F > M\n"
+            "   X:2781480-155701381     M/M + F > M/F\n"
+            "   Y:1-57227415            .   + F > F\n"
+            "   MT:1-16569              M   + F > M\n"
+            "\n"
+            "   chrX:1-9999             M/M + F > M\n"
+            "   chrX:1-9999             M/M + F > M/F\n"
+            "   chrX:2781480-155701381  M/M + F > M\n"
+            "   chrX:2781480-155701381  M/M + F > M/F\n"
+            "   chrY:1-57227415         .   + F > F\n"
+            "   chrM:1-16569            M   + F > M\n"
+    },
+    {
+        .alias = NULL,
+        .about = NULL,
+        .rules = NULL,
+    }
+};
+
+
+const char *usage(void)
+{
+    return 
+        "\n"
+        "About: Count Mendelian consistent / inconsistent genotypes.\n"
+        "Usage: bcftools +mendelian [Options]\n"
+        "Options:\n"
+        "   -c, --count                 count the number of consistent sites\n"
+        "   -d, --delete                delete inconsistent genotypes (set to \"./.\")\n"
+        "   -l, --list [+x]             list consistent (+) or inconsistent (x) sites\n"
+        "   -o, --output <file>         write output to a file [standard output]\n"
+        "   -O, --output-type <type>    'b' compressed BCF; 'u' uncompressed BCF; 'z' compressed VCF; 'v' uncompressed VCF [v]\n"
+        "   -r, --rules <assembly>[?]   predefined rules, 'list' to print available settings, append '?' for details\n"
+        "   -R, --rules-file <file>     inheritance rules, see example below\n"
+        "   -t, --trio <m,f,c>          names of mother, father and the child\n"
+        "   -T, --trio-file <file>      list of trios, one per line\n"
+        "\n"
+        "Example:\n"
+        "   # Default inheritance patterns, override with -r\n"
+        "   #   region  maternal_ploidy + paternal > offspring\n"
+        "   X:1-60000            M/M + F > M\n"
+        "   X:1-60000            M/M + F > M/F\n"
+        "   X:2699521-154931043  M/M + F > M\n"
+        "   X:2699521-154931043  M/M + F > M/F\n"
+        "   Y:1-59373566         .   + F > F\n"
+        "   MT:1-16569           M   + F > M\n"
+        "\n"
+        "   bcftools +mendelian in.vcf -t Mother,Father,Child -c\n"
+        "\n";
+}
+
+regidx_t *init_rules(args_t *args, char *alias)
+{
+    const rules_predef_t *rules = rules_predefs;
+    if ( !alias ) alias = "GRCh37";
+
+    int detailed = 0, len = strlen(alias);
+    if ( alias[len-1]=='?' ) { detailed = 1; alias[len-1] = 0; }
+
+    while ( rules->alias && strcasecmp(alias,rules->alias) ) rules++;
+
+    if ( !rules->alias )
+    {
+        fprintf(stderr,"\nPRE-DEFINED INHERITANCE RULES\n\n");
+        fprintf(stderr," * Columns are: CHROM:BEG-END MATERNAL_PLOIDY + PATERNAL_PLOIDY > OFFSPRING\n");
+        fprintf(stderr," * Coordinates are 1-based inclusive.\n\n");
+        rules = rules_predefs;
+        while ( rules->alias )
+        {
+            fprintf(stderr,"%s\n   .. %s\n\n", rules->alias,rules->about);
+            if ( detailed )
+                fprintf(stderr,"%s\n", rules->rules);
+            rules++;
+        }
+        fprintf(stderr,"Run as --rules <alias> (e.g. --rules GRCh37).\n");
+        fprintf(stderr,"To see the detailed ploidy definition, append a question mark (e.g. --rules GRCh37?).\n");
+        fprintf(stderr,"\n");
+        exit(-1);
+    }
+    else if ( detailed )
+    {
+        fprintf(stderr,"%s", rules->rules);
+        exit(-1);
+    }
+    return regidx_init_string(rules->rules, parse_rules, NULL, sizeof(rule_t), args);
+}
+
+static int parse_rules(const char *line, char **chr_beg, char **chr_end, uint32_t *beg, uint32_t *end, void *payload, void *usr)
+{
+    // e.g. "Y:1-59373566        .   + F > . # daughter"
+
+    // eat any leading spaces
+    char *ss = (char*) line;
+    while ( *ss && isspace(*ss) ) ss++;
+    if ( !*ss ) return -1;      // skip empty lines
+
+    // chromosome name, beg, end
+    char *tmp, *se = ss;
+    while ( se[1] && !isspace(se[1]) ) se++;
+    while ( se > ss && isdigit(*se) ) se--;
+    if ( *se!='-' ) error("Could not parse the region: %s\n", line);
+    *end = strtol(se+1, &tmp, 10) - 1;
+    if ( tmp==se+1 ) error("Could not parse the region:%s\n",line);
+    while ( se > ss && *se!=':' ) se--;
+    *beg = strtol(se+1, &tmp, 10) - 1;
+    if ( tmp==se+1 ) error("Could not parse the region:%s\n",line);
+
+    *chr_beg = ss;
+    *chr_end = se-1;
+
+    // skip region
+    while ( *ss && !isspace(*ss) ) ss++;
+    while ( *ss && isspace(*ss) ) ss++;
+
+    rule_t *rule = (rule_t*) payload;
+    memset(rule, 0, sizeof(rule_t));
+
+    // maternal ploidy
+    se = ss;
+    while ( *se && !isspace(*se) ) se++;
+    int err = 0;
+    if ( se - ss == 1 )
+    {
+        if ( *ss=='M' ) rule->mpl = 1;
+        else if ( *ss=='.' ) rule->mpl = 0;
+        else err = 1;
+    }
+    else if ( se - ss == 3 )
+    {
+        if ( !strncmp(ss,"M/M",3) ) rule->mpl = 2;
+        else err = 1;
+    }
+    else err = 1;
+    if ( err ) error("Could not parse the maternal ploidy, only \"M\", \"M/M\" and \".\" currently supported: %s\n",line);
+
+    // skip "+"
+    while ( *se && isspace(*se) ) se++;
+    if ( *se != '+' ) error("Could not parse the line: %s\n",line);
+    se++;
+    while ( *se && isspace(*se) ) se++;
+
+    // paternal ploidy
+    ss = se;
+    while ( *se && !isspace(*se) ) se++;
+    if ( se - ss == 1 )
+    {
+        if ( *ss=='F' ) rule->fpl = 1;
+        else err = 1;
+    }
+    else err = 1;
+    if ( err ) error("Could not parse the paternal ploidy, only \"F\" is currently supported: %s [%s]\n",line, ss);
+
+    // skip ">"
+    while ( *se && isspace(*se) ) se++;
+    if ( *se != '>' ) error("Could not parse the line: %s\n",line);
+    se++;
+    while ( *se && isspace(*se) ) se++;
+
+    // ploidy in offspring
+    ss = se;
+    while ( *se && !isspace(*se) ) se++;
+    if ( se - ss == 3 )
+    {
+        if ( !strncmp(ss,"M/F",3) ) { rule->cpl = 2; rule->fal = 1; rule->mal = 1; }
+        else err = 1;
+    }
+    else if ( se - ss == 1 )
+    {
+        if ( *ss=='F' ) { rule->cpl = 1; rule->fal = 1; }
+        else if ( *ss=='M' ) { rule->cpl = 1; rule->mal = 1; }
+        else err = 1;
+    }
+    else err = 1;
+    if ( err ) error("Could not parse the offspring's ploidy, only \"M\", \"F\" or \"M/F\" is currently supported: %s\n",line);
+
+    return 0;
+}
+
+int run(int argc, char **argv)
+{
+    char *trio_samples = NULL, *trio_file = NULL, *rules_fname = NULL, *rules_string = NULL;
+    memset(&args,0,sizeof(args_t));
+    args.mode = 0;
+    args.output_fname = "-";
+
+    static struct option loptions[] =
+    {
+        {"trio",1,0,'t'},
+        {"trio-file",1,0,'T'},
+        {"delete",0,0,'d'},
+        {"list",1,0,'l'},
+        {"count",0,0,'c'},
+        {"rules",1,0,'r'},
+        {"rules-file",1,0,'R'},
+        {"output",required_argument,NULL,'o'},
+        {"output-type",required_argument,NULL,'O'},
+        {0,0,0,0}
+    };
+    int c;
+    while ((c = getopt_long(argc, argv, "?ht:T:l:cdr:R:o:O:",loptions,NULL)) >= 0)
+    {
+        switch (c) 
+        {
+            case 'o': args.output_fname = optarg; break;
+            case 'O':
+                      switch (optarg[0]) {
+                          case 'b': args.output_type = FT_BCF_GZ; break;
+                          case 'u': args.output_type = FT_BCF; break;
+                          case 'z': args.output_type = FT_VCF_GZ; break;
+                          case 'v': args.output_type = FT_VCF; break;
+                          default: error("The output type \"%s\" not recognised\n", optarg);
+                      };
+                      break;
+            case 'R': rules_fname = optarg; break;
+            case 'r': rules_string = optarg; break;
+            case 'd': args.mode |= MODE_DELETE; break;
+            case 'c': args.mode |= MODE_COUNT; break;
+            case 'l': 
+                if ( !strcmp("+",optarg) ) args.mode |= MODE_LIST_GOOD; 
+                else if ( !strcmp("x",optarg) ) args.mode |= MODE_LIST_BAD; 
+                else error("The argument not recognised: --list %s\n", optarg);
+                break;
+            case 't': trio_samples = optarg; break;
+            case 'T': trio_file = optarg; break;
+            case 'h':
+            case '?':
+            default: error("%s",usage()); break;
+        }
+    }
+    if ( rules_fname )
+        args.rules = regidx_init(rules_fname, parse_rules, NULL, sizeof(rule_t), &args);
+    else
+        args.rules = init_rules(&args, rules_string);
+    if ( !args.rules ) return -1;
+    args.itr     = regitr_init(args.rules);
+    args.itr_ori = regitr_init(args.rules);
+
+    char *fname = NULL;
+    if ( optind>=argc || argv[optind][0]=='-' )
+    {
+        if ( !isatty(fileno((FILE *)stdin)) ) fname = "-";  // reading from stdin
+        else error("%s",usage());
+    }
+    else
+        fname = argv[optind];
+
+    if ( !trio_samples && !trio_file ) error("Expected the -t/T option\n");
+    if ( !args.mode ) error("Expected one of the -c, -d or -l options\n");
+    if ( args.mode&MODE_DELETE && !(args.mode&(MODE_LIST_GOOD|MODE_LIST_BAD)) ) args.mode |= MODE_LIST_GOOD|MODE_LIST_BAD;
+
+    args.sr = bcf_sr_init();
+    if ( !bcf_sr_add_reader(args.sr, fname) ) error("Failed to open %s: %s\n", fname,bcf_sr_strerror(args.sr->errnum));
+    args.hdr = bcf_sr_get_header(args.sr, 0);
+    args.out_fh = hts_open(args.output_fname,hts_bcf_wmode(args.output_type));
+    if ( args.out_fh == NULL ) error("Can't write to \"%s\": %s\n", args.output_fname, strerror(errno));
+    bcf_hdr_write(args.out_fh, args.hdr);
+
+
+    int i, n = 0;
+    char **list;
+    if ( trio_samples )
+    {
+        args.ntrios = 1;
+        args.trios = (trio_t*) calloc(1,sizeof(trio_t));
+        list = hts_readlist(trio_samples, 0, &n);
+        if ( n!=3 ) error("Expected three sample names with -t\n");
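+        args.trios[0].imother = bcf_hdr_id2int(args.hdr, BCF_DT_SAMPLE, list[0]); if ( args.trios[0].imother<0 ) error("No such sample: \"%s\"\n", list[0]);
+        args.trios[0].ifather = bcf_hdr_id2int(args.hdr, BCF_DT_SAMPLE, list[1]); if ( args.trios[0].ifather<0 ) error("No such sample: \"%s\"\n", list[1]);
+        args.trios[0].ichild  = bcf_hdr_id2int(args.hdr, BCF_DT_SAMPLE, list[2]); if ( args.trios[0].ichild<0 ) error("No such sample: \"%s\"\n", list[2]);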
+        for (i=0; i<n; i++) free(list[i]);
+        free(list);
+    }
+    if ( trio_file )
+    {
+        list = hts_readlist(trio_file, 1, &n);
+        args.ntrios = n;
+        args.trios = (trio_t*) calloc(n,sizeof(trio_t));
+        for (i=0; i<n; i++)
+        {
+            char *ss = list[i], *se;
+            se = strchr(ss, ',');
+            if ( !se ) error("Could not parse %s: %s\n",trio_file, ss);
+            *se = 0;
+            args.trios[i].imother = bcf_hdr_id2int(args.hdr, BCF_DT_SAMPLE, ss);
+            if ( args.trios[i].imother<0 ) error("No such sample: \"%s\"\n", ss);
+            ss = ++se; 
+            se = strchr(ss, ',');
+            if ( !se ) error("Could not parse %s\n",trio_file);
+            *se = 0;
+            args.trios[i].ifather = bcf_hdr_id2int(args.hdr, BCF_DT_SAMPLE, ss);
+            if ( args.trios[i].ifather<0 ) error("No such sample: \"%s\"\n", ss);
+            ss = ++se; 
+            if ( *ss=='\0' ) error("Could not parse %s\n",trio_file);
+            args.trios[i].ichild = bcf_hdr_id2int(args.hdr, BCF_DT_SAMPLE, ss);
+            if ( args.trios[i].ichild<0 ) error("No such sample: \"%s\"\n", ss);
+            free(list[i]);
+        }
+        free(list);
+    }
+
+    while ( bcf_sr_next_line(args.sr) )
+    {
+        bcf1_t *line = bcf_sr_get_line(args.sr,0);
+        line = process(line);
+        if ( line )
+        {
+            if ( line->errcode ) error("TODO: Unchecked error (%d), exiting\n",line->errcode);
+            bcf_write1(args.out_fh, args.hdr, line);
+        }
+    }
+
+
+    fprintf(stderr,"# [1]nOK\t[2]nBad\t[3]nSkipped\t[4]Trio\n");
+    for (i=0; i<args.ntrios; i++)
+    {
+        trio_t *trio = &args.trios[i];
+        fprintf(stderr,"%d\t%d\t%d\t%s,%s,%s\n", 
+            trio->nok,trio->nbad,args.nrec-(trio->nok+trio->nbad),
+            bcf_hdr_int2id(args.hdr, BCF_DT_SAMPLE, trio->imother),
+            bcf_hdr_int2id(args.hdr, BCF_DT_SAMPLE, trio->ifather),
+            bcf_hdr_int2id(args.hdr, BCF_DT_SAMPLE, trio->ichild)
+            );
+    }
+    free(args.gt_arr);
+    free(args.trios);
+    regitr_destroy(args.itr);
+    regitr_destroy(args.itr_ori);
+    regidx_destroy(args.rules);
+    bcf_sr_destroy(args.sr);
+    if ( hts_close(args.out_fh)!=0 ) error("Error: close failed\n");
+    return 0;
+}
+
+static void warn_ploidy(bcf1_t *rec)
+{
+    static int warned = 0;
+    if ( warned ) return;
+    fprintf(stderr,"Incorrect ploidy at %s:%d, skipping the trio. (This warning is printed only once.)\n", bcf_seqname(args.hdr,rec),rec->pos+1);
+    warned = 1;
+}
+
+bcf1_t *process(bcf1_t *rec)
+{
+    bcf1_t *dflt = args.mode&MODE_LIST_GOOD ? rec : NULL;
+    args.nrec++;
+
+    if ( rec->n_allele > 63 ) return dflt;      // we use 64bit bitmask below
+
+    int ngt = bcf_get_genotypes(args.hdr, rec, &args.gt_arr, &args.ngt_arr);
+    if ( ngt<0 ) return dflt;
+    if ( ngt!=2*bcf_hdr_nsamples(args.hdr) && ngt!=bcf_hdr_nsamples(args.hdr) ) return dflt;
+    ngt /= bcf_hdr_nsamples(args.hdr);
+
+    int itr_set = regidx_overlap(args.rules, bcf_seqname(args.hdr,rec),rec->pos,rec->pos, args.itr_ori);
+
+    int i, has_bad = 0, needs_update = 0;
+    for (i=0; i<args.ntrios; i++)
+    {
+        int32_t a,b,c,d,e,f;
+        trio_t *trio = &args.trios[i];
+
+        a = args.gt_arr[ngt*trio->imother];
+        b = ngt==2 ? args.gt_arr[ngt*trio->imother+1] : bcf_int32_vector_end;
+        c = args.gt_arr[ngt*trio->ifather];
+        d = ngt==2 ? args.gt_arr[ngt*trio->ifather+1] : bcf_int32_vector_end;
+        e = args.gt_arr[ngt*trio->ichild];
+        f = ngt==2 ? args.gt_arr[ngt*trio->ichild+1] : bcf_int32_vector_end;
+
+        // skip sites with missing data in child
+        if ( bcf_gt_is_missing(e) || bcf_gt_is_missing(f) ) continue;
+
+        uint64_t mother = 0, father = 0,child1,child2;
+
+        int is_ok = 0;
+        if ( !itr_set )
+        {
+            if ( f==bcf_int32_vector_end ) { warn_ploidy(rec); continue; }
+
+            // All M,F,C genotypes are diploid. Missing data are considered consistent.
+            child1 = 1ULL<<bcf_gt_allele(e);
+            child2 = 1ULL<<bcf_gt_allele(f);
+            mother  = bcf_gt_is_missing(a) ? child1|child2 : 1ULL<<bcf_gt_allele(a);
+            mother |= bcf_gt_is_missing(b) || b==bcf_int32_vector_end ? child1|child2 : 1ULL<<bcf_gt_allele(b);
+            father  = bcf_gt_is_missing(c) ? child1|child2 : 1ULL<<bcf_gt_allele(c);
+            father |= bcf_gt_is_missing(d) || d==bcf_int32_vector_end ? child1|child2 : 1ULL<<bcf_gt_allele(d);
+
+            if ( (mother&child1 && father&child2) || (mother&child2 && father&child1) ) is_ok = 1;
+        }
+        else
+        {
+            child1  = 1ULL<<bcf_gt_allele(e);
+            child2  = bcf_gt_is_missing(f) || f==bcf_int32_vector_end ? 0 : 1ULL<<bcf_gt_allele(f);
+            mother |= bcf_gt_is_missing(a) ? 0 : 1ULL<<bcf_gt_allele(a);
+            mother |= bcf_gt_is_missing(b) || b==bcf_int32_vector_end ? 0 : 1ULL<<bcf_gt_allele(b);
+            father |= bcf_gt_is_missing(c) ? 0 : 1ULL<<bcf_gt_allele(c);
+            father |= bcf_gt_is_missing(d) || d==bcf_int32_vector_end ? 0 : 1ULL<<bcf_gt_allele(d);
+
+            regitr_copy(args.itr, args.itr_ori);
+            while ( !is_ok && regitr_overlap(args.itr) )
+            {
+                rule_t *rule = &regitr_payload(args.itr,rule_t);
+                if ( child1 && child2 )
+                {
+                    if ( !rule->mal || !rule->fal ) continue;   // wrong rule (haploid), but this is a diploid GT
+                    if ( !mother ) mother = child1|child2;
+                    if ( !father ) father = child1|child2;
+                    if ( (mother&child1 && father&child2) || (mother&child2 && father&child1) ) is_ok = 1; 
+                    continue;
+                }
+                if ( rule->mal )
+                {
+                    if ( mother && !(child1&mother) ) continue;
+                }
+                if ( rule->fal )
+                {
+                    if ( father && !(child1&father) ) continue;
+                }
+                is_ok = 1;
+            }
+        }
+        if ( is_ok )
+        {
+            trio->nok++;
+        }
+        else
+        {
+            trio->nbad++;
+            has_bad = 1;
+            if ( args.mode&MODE_DELETE )
+            {
+                args.gt_arr[ngt*trio->imother] = bcf_gt_missing;
+                if ( b!=bcf_int32_vector_end ) args.gt_arr[ngt*trio->imother+1] = bcf_gt_missing; // should be always true 
+                args.gt_arr[ngt*trio->ifather] = bcf_gt_missing;
+                if ( d!=bcf_int32_vector_end ) args.gt_arr[ngt*trio->ifather+1] = bcf_gt_missing;
+                args.gt_arr[ngt*trio->ichild] = bcf_gt_missing;
+                if ( f!=bcf_int32_vector_end ) args.gt_arr[ngt*trio->ichild+1]  = bcf_gt_missing;
+                needs_update = 1;
+            }
+        }
+    }
+
+    if ( needs_update && bcf_update_genotypes(args.hdr,rec,args.gt_arr,ngt*bcf_hdr_nsamples(args.hdr)) )
+        error("Could not update GT field at %s:%d\n", bcf_seqname(args.hdr,rec),rec->pos+1);
+
+    if ( args.mode&MODE_DELETE ) return rec;
+    if ( args.mode&MODE_LIST_GOOD ) return has_bad ? NULL : rec;
+    if ( args.mode&MODE_LIST_BAD ) return has_bad ? rec : NULL;
+
+    return NULL;
+}
+
diff --git a/bcftools/plugins/mendelian.c.pysam.c b/bcftools/plugins/mendelian.c.pysam.c
new file mode 100644
index 0000000..904587f
--- /dev/null
+++ b/bcftools/plugins/mendelian.c.pysam.c
@@ -0,0 +1,568 @@
+#include "bcftools.pysam.h"
+
+/* The MIT License
+
+   Copyright (c) 2015 Genome Research Ltd.
+
+   Author: Petr Danecek <pd3@sanger.ac.uk>
+   
+   Permission is hereby granted, free of charge, to any person obtaining a copy
+   of this software and associated documentation files (the "Software"), to deal
+   in the Software without restriction, including without limitation the rights
+   to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+   copies of the Software, and to permit persons to whom the Software is
+   furnished to do so, subject to the following conditions:
+   
+   The above copyright notice and this permission notice shall be included in
+   all copies or substantial portions of the Software.
+   
+   THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+   IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+   FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+   AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+   LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+   OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+   THE SOFTWARE.
+
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <getopt.h>
+#include <math.h>
+#include <htslib/hts.h>
+#include <htslib/vcf.h>
+#include <htslib/synced_bcf_reader.h>
+#include <errno.h>
+#include <ctype.h>
+#include <unistd.h>     // for isatty
+#include "bcftools.h"
+#include "regidx.h"
+
+#define MODE_COUNT     1
+#define MODE_LIST_GOOD 2
+#define MODE_LIST_BAD  4
+#define MODE_DELETE    8
+
+typedef struct
+{
+    int nok, nbad;
+    int imother,ifather,ichild;
+}
+trio_t;
+
+typedef struct
+{
+    int mpl, fpl, cpl;  // ploidies - mother, father, child
+    int mal, fal;       // expect an allele from mother and father
+}
+rule_t;
+
+typedef struct _args_t
+{
+    regidx_t *rules;
+    regitr_t *itr, *itr_ori;
+    bcf_hdr_t *hdr;
+    htsFile *out_fh;
+    int32_t *gt_arr;
+    int mode;
+    int ngt_arr, nrec;
+    trio_t *trios;
+    int ntrios;
+    int output_type;
+    char *output_fname;
+    bcf_srs_t *sr;
+}
+args_t;
+
+static args_t args;
+static int parse_rules(const char *line, char **chr_beg, char **chr_end, uint32_t *beg, uint32_t *end, void *payload, void *usr);
+static bcf1_t *process(bcf1_t *rec);
+
+const char *about(void)
+{
+    return "Count Mendelian consistent / inconsistent genotypes.\n";
+}
+
+typedef struct
+{
+    const char *alias, *about, *rules;
+}
+rules_predef_t;
+
+static rules_predef_t rules_predefs[] =
+{
+    { .alias = "GRCh37",
+      .about = "Human Genome reference assembly GRCh37 / hg19, both chr naming conventions",
+      .rules =
+            "   X:1-60000               M/M + F > M\n"
+            "   X:1-60000               M/M + F > M/F\n"
+            "   X:2699521-154931043     M/M + F > M\n"
+            "   X:2699521-154931043     M/M + F > M/F\n"
+            "   Y:1-59373566            .   + F > F\n"
+            "   MT:1-16569              M   + F > M\n"
+            "\n"
+            "   chrX:1-60000            M/M + F > M\n"
+            "   chrX:1-60000            M/M + F > M/F\n"
+            "   chrX:2699521-154931043  M/M + F > M\n"
+            "   chrX:2699521-154931043  M/M + F > M/F\n"
+            "   chrY:1-59373566         .   + F > F\n"
+            "   chrM:1-16569            M   + F > M\n"
+    },
+    { .alias = "GRCh38",
+      .about = "Human Genome reference assembly GRCh38 / hg38, both chr naming conventions",
+      .rules =
+            "   X:1-9999                M/M + F > M\n"
+            "   X:1-9999                M/M + F > M/F\n"
+            "   X:2781480-155701381     M/M + F > M\n"
+            "   X:2781480-155701381     M/M + F > M/F\n"
+            "   Y:1-57227415            .   + F > F\n"
+            "   MT:1-16569              M   + F > M\n"
+            "\n"
+            "   chrX:1-9999             M/M + F > M\n"
+            "   chrX:1-9999             M/M + F > M/F\n"
+            "   chrX:2781480-155701381  M/M + F > M\n"
+            "   chrX:2781480-155701381  M/M + F > M/F\n"
+            "   chrY:1-57227415         .   + F > F\n"
+            "   chrM:1-16569            M   + F > M\n"
+    },
+    {
+        .alias = NULL,
+        .about = NULL,
+        .rules = NULL,
+    }
+};
+
+
+const char *usage(void)
+{
+    return 
+        "\n"
+        "About: Count Mendelian consistent / inconsistent genotypes.\n"
+        "Usage: bcftools +mendelian [Options]\n"
+        "Options:\n"
+        "   -c, --count                 count the number of consistent sites\n"
+        "   -d, --delete                delete inconsistent genotypes (set to \"./.\")\n"
+        "   -l, --list [+x]             list consistent (+) or inconsistent (x) sites\n"
+        "   -o, --output <file>         write output to a file [standard output]\n"
+        "   -O, --output-type <type>    'b' compressed BCF; 'u' uncompressed BCF; 'z' compressed VCF; 'v' uncompressed VCF [v]\n"
+        "   -r, --rules <assembly>[?]   predefined rules, 'list' to print available settings, append '?' for details\n"
+        "   -R, --rules-file <file>     inheritance rules, see example below\n"
+        "   -t, --trio <m,f,c>          names of mother, father and the child\n"
+        "   -T, --trio-file <file>      list of trios, one per line\n"
+        "\n"
+        "Example:\n"
+        "   # Default inheritance patterns, override with -r\n"
+        "   #   region  maternal_ploidy + paternal > offspring\n"
+        "   X:1-60000            M/M + F > M\n"
+        "   X:1-60000            M/M + F > M/F\n"
+        "   X:2699521-154931043  M/M + F > M\n"
+        "   X:2699521-154931043  M/M + F > M/F\n"
+        "   Y:1-59373566         .   + F > F\n"
+        "   MT:1-16569           M   + F > M\n"
+        "\n"
+        "   bcftools +mendelian in.vcf -t Mother,Father,Child -c\n"
+        "\n";
+}
+
+regidx_t *init_rules(args_t *args, char *alias)
+{
+    const rules_predef_t *rules = rules_predefs;
+    if ( !alias ) alias = "GRCh37";
+
+    int detailed = 0, len = strlen(alias);
+    if ( len>0 && alias[len-1]=='?' ) { detailed = 1; alias[len-1] = 0; }
+
+    while ( rules->alias && strcasecmp(alias,rules->alias) ) rules++;
+
+    if ( !rules->alias )
+    {
+        fprintf(bcftools_stderr,"\nPRE-DEFINED INHERITANCE RULES\n\n");
+        fprintf(bcftools_stderr," * Columns are: CHROM:BEG-END MATERNAL_PLOIDY + PATERNAL_PLOIDY > OFFSPRING\n");
+        fprintf(bcftools_stderr," * Coordinates are 1-based inclusive.\n\n");
+        rules = rules_predefs;
+        while ( rules->alias )
+        {
+            fprintf(bcftools_stderr,"%s\n   .. %s\n\n", rules->alias,rules->about);
+            if ( detailed )
+                fprintf(bcftools_stderr,"%s\n", rules->rules);
+            rules++;
+        }
+        fprintf(bcftools_stderr,"Run as --rules <alias> (e.g. --rules GRCh37).\n");
+        fprintf(bcftools_stderr,"To see the detailed ploidy definition, append a question mark (e.g. --rules GRCh37?).\n");
+        fprintf(bcftools_stderr,"\n");
+        exit(-1);
+    }
+    else if ( detailed )
+    {
+        fprintf(bcftools_stderr,"%s", rules->rules);
+        exit(-1);
+    }
+    return regidx_init_string(rules->rules, parse_rules, NULL, sizeof(rule_t), args);
+}
+
+static int parse_rules(const char *line, char **chr_beg, char **chr_end, uint32_t *beg, uint32_t *end, void *payload, void *usr)
+{
+    // e.g. "Y:1-59373566        .   + F > . # daughter"
+
+    // eat any leading spaces
+    char *ss = (char*) line;
+    while ( *ss && isspace(*ss) ) ss++;
+    if ( !*ss ) return -1;      // skip empty lines
+
+    // chromosome name, beg, end
+    char *tmp, *se = ss;
+    while ( se[1] && !isspace(se[1]) ) se++;
+    while ( se > ss && isdigit(*se) ) se--;
+    if ( *se!='-' ) error("Could not parse the region: %s\n", line);
+    *end = strtol(se+1, &tmp, 10) - 1;
+    if ( tmp==se+1 ) error("Could not parse the region:%s\n",line);
+    while ( se > ss && *se!=':' ) se--;
+    *beg = strtol(se+1, &tmp, 10) - 1;
+    if ( tmp==se+1 ) error("Could not parse the region:%s\n",line);
+
+    *chr_beg = ss;
+    *chr_end = se-1;
+
+    // skip region
+    while ( *ss && !isspace(*ss) ) ss++;
+    while ( *ss && isspace(*ss) ) ss++;
+
+    rule_t *rule = (rule_t*) payload;
+    memset(rule, 0, sizeof(rule_t));
+
+    // maternal ploidy
+    se = ss;
+    while ( *se && !isspace(*se) ) se++;
+    int err = 0;
+    if ( se - ss == 1 )
+    {
+        if ( *ss=='M' ) rule->mpl = 1;
+        else if ( *ss=='.' ) rule->mpl = 0;
+        else err = 1;
+    }
+    else if ( se - ss == 3 )
+    {
+        if ( !strncmp(ss,"M/M",3) ) rule->mpl = 2;
+        else err = 1;
+    }
+    else err = 1;
+    if ( err ) error("Could not parse the maternal ploidy, only \"M\", \"M/M\" and \".\" are currently supported: %s\n",line);
+
+    // skip "+"
+    while ( *se && isspace(*se) ) se++;
+    if ( *se != '+' ) error("Could not parse the line: %s\n",line);
+    se++;
+    while ( *se && isspace(*se) ) se++;
+
+    // paternal ploidy
+    ss = se;
+    while ( *se && !isspace(*se) ) se++;
+    if ( se - ss == 1 )
+    {
+        if ( *ss=='F' ) rule->fpl = 1;
+        else err = 1;
+    }
+    else err = 1;
+    if ( err ) error("Could not parse the paternal ploidy, only \"F\" is currently supported: %s [%s]\n",line, ss);
+
+    // skip ">"
+    while ( *se && isspace(*se) ) se++;
+    if ( *se != '>' ) error("Could not parse the line: %s\n",line);
+    se++;
+    while ( *se && isspace(*se) ) se++;
+
+    // ploidy in offspring
+    ss = se;
+    while ( *se && !isspace(*se) ) se++;
+    if ( se - ss == 3 )
+    {
+        if ( !strncmp(ss,"M/F",3) ) { rule->cpl = 2; rule->fal = 1; rule->mal = 1; }
+        else err = 1;
+    }
+    else if ( se - ss == 1 )
+    {
+        if ( *ss=='F' ) { rule->cpl = 1; rule->fal = 1; }
+        else if ( *ss=='M' ) { rule->cpl = 1; rule->mal = 1; }
+        else err = 1;
+    }
+    else err = 1;
+    if ( err ) error("Could not parse the offspring's ploidy, only \"M\", \"F\" or \"M/F\" is currently supported: %s\n",line);
+
+    return 0;
+}
+
+int run(int argc, char **argv)
+{
+    char *trio_samples = NULL, *trio_file = NULL, *rules_fname = NULL, *rules_string = NULL;
+    memset(&args,0,sizeof(args_t));
+    args.mode = 0;
+    args.output_fname = "-";
+
+    static struct option loptions[] =
+    {
+        {"trio",1,0,'t'},
+        {"trio-file",1,0,'T'},
+        {"delete",0,0,'d'},
+        {"list",1,0,'l'},
+        {"count",0,0,'c'},
+        {"rules",1,0,'r'},
+        {"rules-file",1,0,'R'},
+        {"output",required_argument,NULL,'o'},
+        {"output-type",required_argument,NULL,'O'},
+        {0,0,0,0}
+    };
+    int c;
+    while ((c = getopt_long(argc, argv, "?ht:T:l:cdr:R:o:O:",loptions,NULL)) >= 0)
+    {
+        switch (c) 
+        {
+            case 'o': args.output_fname = optarg; break;
+            case 'O':
+                      switch (optarg[0]) {
+                          case 'b': args.output_type = FT_BCF_GZ; break;
+                          case 'u': args.output_type = FT_BCF; break;
+                          case 'z': args.output_type = FT_VCF_GZ; break;
+                          case 'v': args.output_type = FT_VCF; break;
+                          default: error("The output type \"%s\" not recognised\n", optarg);
+                      };
+                      break;
+            case 'R': rules_fname = optarg; break;
+            case 'r': rules_string = optarg; break;
+            case 'd': args.mode |= MODE_DELETE; break;
+            case 'c': args.mode |= MODE_COUNT; break;
+            case 'l': 
+                if ( !strcmp("+",optarg) ) args.mode |= MODE_LIST_GOOD; 
+                else if ( !strcmp("x",optarg) ) args.mode |= MODE_LIST_BAD; 
+                else error("The argument not recognised: --list %s\n", optarg);
+                break;
+            case 't': trio_samples = optarg; break;
+            case 'T': trio_file = optarg; break;
+            case 'h':
+            case '?':
+            default: error("%s",usage()); break;
+        }
+    }
+    if ( rules_fname )
+        args.rules = regidx_init(rules_fname, parse_rules, NULL, sizeof(rule_t), &args);
+    else
+        args.rules = init_rules(&args, rules_string);
+    if ( !args.rules ) return -1;
+    args.itr     = regitr_init(args.rules);
+    args.itr_ori = regitr_init(args.rules);
+
+    char *fname = NULL;
+    if ( optind>=argc || argv[optind][0]=='-' )
+    {
+        if ( !isatty(fileno((FILE *)stdin)) ) fname = "-";  // reading from stdin
+        else error("%s",usage());
+    }
+    else
+        fname = argv[optind];
+
+    if ( !trio_samples && !trio_file ) error("Expected the -t/T option\n");
+    if ( !args.mode ) error("Expected one of the -c, -d or -l options\n");
+    if ( args.mode&MODE_DELETE && !(args.mode&(MODE_LIST_GOOD|MODE_LIST_BAD)) ) args.mode |= MODE_LIST_GOOD|MODE_LIST_BAD;
+
+    args.sr = bcf_sr_init();
+    if ( !bcf_sr_add_reader(args.sr, fname) ) error("Failed to open %s: %s\n", fname,bcf_sr_strerror(args.sr->errnum));
+    args.hdr = bcf_sr_get_header(args.sr, 0);
+    args.out_fh = hts_open(args.output_fname,hts_bcf_wmode(args.output_type));
+    if ( args.out_fh == NULL ) error("Can't write to \"%s\": %s\n", args.output_fname, strerror(errno));
+    bcf_hdr_write(args.out_fh, args.hdr);
+
+
+    int i, n = 0;
+    char **list;
+    if ( trio_samples )
+    {
+        args.ntrios = 1;
+        args.trios = (trio_t*) calloc(1,sizeof(trio_t));
+        list = hts_readlist(trio_samples, 0, &n);
+        if ( n!=3 ) error("Expected three sample names with -t\n");
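+        args.trios[0].imother = bcf_hdr_id2int(args.hdr, BCF_DT_SAMPLE, list[0]); if ( args.trios[0].imother<0 ) error("No such sample: \"%s\"\n", list[0]);
+        args.trios[0].ifather = bcf_hdr_id2int(args.hdr, BCF_DT_SAMPLE, list[1]); if ( args.trios[0].ifather<0 ) error("No such sample: \"%s\"\n", list[1]);
+        args.trios[0].ichild  = bcf_hdr_id2int(args.hdr, BCF_DT_SAMPLE, list[2]); if ( args.trios[0].ichild<0 ) error("No such sample: \"%s\"\n", list[2]);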
+        for (i=0; i<n; i++) free(list[i]);
+        free(list);
+    }
+    if ( trio_file )
+    {
+        list = hts_readlist(trio_file, 1, &n);
+        args.ntrios = n;
+        args.trios = (trio_t*) calloc(n,sizeof(trio_t));
+        for (i=0; i<n; i++)
+        {
+            char *ss = list[i], *se;
+            se = strchr(ss, ',');
+            if ( !se ) error("Could not parse %s: %s\n",trio_file, ss);
+            *se = 0;
+            args.trios[i].imother = bcf_hdr_id2int(args.hdr, BCF_DT_SAMPLE, ss);
+            if ( args.trios[i].imother<0 ) error("No such sample: \"%s\"\n", ss);
+            ss = ++se; 
+            se = strchr(ss, ',');
+            if ( !se ) error("Could not parse %s\n",trio_file);
+            *se = 0;
+            args.trios[i].ifather = bcf_hdr_id2int(args.hdr, BCF_DT_SAMPLE, ss);
+            if ( args.trios[i].ifather<0 ) error("No such sample: \"%s\"\n", ss);
+            ss = ++se; 
+            if ( *ss=='\0' ) error("Could not parse %s\n",trio_file);
+            args.trios[i].ichild = bcf_hdr_id2int(args.hdr, BCF_DT_SAMPLE, ss);
+            if ( args.trios[i].ichild<0 ) error("No such sample: \"%s\"\n", ss);
+            free(list[i]);
+        }
+        free(list);
+    }
+
+    while ( bcf_sr_next_line(args.sr) )
+    {
+        bcf1_t *line = bcf_sr_get_line(args.sr,0);
+        line = process(line);
+        if ( line )
+        {
+            if ( line->errcode ) error("TODO: Unchecked error (%d), exiting\n",line->errcode);
+            bcf_write1(args.out_fh, args.hdr, line);
+        }
+    }
+
+
+    fprintf(bcftools_stderr,"# [1]nOK\t[2]nBad\t[3]nSkipped\t[4]Trio\n");
+    for (i=0; i<args.ntrios; i++)
+    {
+        trio_t *trio = &args.trios[i];
+        fprintf(bcftools_stderr,"%d\t%d\t%d\t%s,%s,%s\n", 
+            trio->nok,trio->nbad,args.nrec-(trio->nok+trio->nbad),
+            bcf_hdr_int2id(args.hdr, BCF_DT_SAMPLE, trio->imother),
+            bcf_hdr_int2id(args.hdr, BCF_DT_SAMPLE, trio->ifather),
+            bcf_hdr_int2id(args.hdr, BCF_DT_SAMPLE, trio->ichild)
+            );
+    }
+    free(args.gt_arr);
+    free(args.trios);
+    regitr_destroy(args.itr);
+    regitr_destroy(args.itr_ori);
+    regidx_destroy(args.rules);
+    bcf_sr_destroy(args.sr);
+    if ( hts_close(args.out_fh)!=0 ) error("Error: close failed\n");
+    return 0;
+}
+
+static void warn_ploidy(bcf1_t *rec)
+{
+    static int warned = 0;
+    if ( warned ) return;
+    fprintf(bcftools_stderr,"Incorrect ploidy at %s:%d, skipping the trio. (This warning is printed only once.)\n", bcf_seqname(args.hdr,rec),rec->pos+1);
+    warned = 1;
+}
+
+bcf1_t *process(bcf1_t *rec)
+{
+    bcf1_t *dflt = args.mode&MODE_LIST_GOOD ? rec : NULL;
+    args.nrec++;
+
+    if ( rec->n_allele > 63 ) return dflt;      // we use 64bit bitmask below
+
+    int ngt = bcf_get_genotypes(args.hdr, rec, &args.gt_arr, &args.ngt_arr);
+    if ( ngt<0 ) return dflt;
+    if ( ngt!=2*bcf_hdr_nsamples(args.hdr) && ngt!=bcf_hdr_nsamples(args.hdr) ) return dflt;
+    ngt /= bcf_hdr_nsamples(args.hdr);
+
+    int itr_set = regidx_overlap(args.rules, bcf_seqname(args.hdr,rec),rec->pos,rec->pos, args.itr_ori);
+
+    int i, has_bad = 0, needs_update = 0;
+    for (i=0; i<args.ntrios; i++)
+    {
+        int32_t a,b,c,d,e,f;
+        trio_t *trio = &args.trios[i];
+
+        a = args.gt_arr[ngt*trio->imother];
+        b = ngt==2 ? args.gt_arr[ngt*trio->imother+1] : bcf_int32_vector_end;
+        c = args.gt_arr[ngt*trio->ifather];
+        d = ngt==2 ? args.gt_arr[ngt*trio->ifather+1] : bcf_int32_vector_end;
+        e = args.gt_arr[ngt*trio->ichild];
+        f = ngt==2 ? args.gt_arr[ngt*trio->ichild+1] : bcf_int32_vector_end;
+
+        // skip sites with missing data in child
+        if ( bcf_gt_is_missing(e) || bcf_gt_is_missing(f) ) continue;
+
+        uint64_t mother = 0, father = 0,child1,child2;
+
+        int is_ok = 0;
+        if ( !itr_set )
+        {
+            if ( f==bcf_int32_vector_end ) { warn_ploidy(rec); continue; }
+
+            // All M,F,C genotypes are diploid. Missing data are considered consistent.
+            child1 = 1ULL<<bcf_gt_allele(e);
+            child2 = 1ULL<<bcf_gt_allele(f);
+            mother  = bcf_gt_is_missing(a) ? child1|child2 : 1ULL<<bcf_gt_allele(a);
+            mother |= bcf_gt_is_missing(b) || b==bcf_int32_vector_end ? child1|child2 : 1ULL<<bcf_gt_allele(b);
+            father  = bcf_gt_is_missing(c) ? child1|child2 : 1ULL<<bcf_gt_allele(c);
+            father |= bcf_gt_is_missing(d) || d==bcf_int32_vector_end ? child1|child2 : 1ULL<<bcf_gt_allele(d);
+
+            if ( (mother&child1 && father&child2) || (mother&child2 && father&child1) ) is_ok = 1;
+        }
+        else
+        {
+            child1  = 1ULL<<bcf_gt_allele(e);
+            child2  = bcf_gt_is_missing(f) || f==bcf_int32_vector_end ? 0 : 1ULL<<bcf_gt_allele(f);
+            mother |= bcf_gt_is_missing(a) ? 0 : 1ULL<<bcf_gt_allele(a);
+            mother |= bcf_gt_is_missing(b) || b==bcf_int32_vector_end ? 0 : 1ULL<<bcf_gt_allele(b);
+            father |= bcf_gt_is_missing(c) ? 0 : 1ULL<<bcf_gt_allele(c);
+            father |= bcf_gt_is_missing(d) || d==bcf_int32_vector_end ? 0 : 1ULL<<bcf_gt_allele(d);
+
+            regitr_copy(args.itr, args.itr_ori);
+            while ( !is_ok && regitr_overlap(args.itr) )
+            {
+                rule_t *rule = &regitr_payload(args.itr,rule_t);
+                if ( child1 && child2 )
+                {
+                    if ( !rule->mal || !rule->fal ) continue;   // wrong rule (haploid), but this is a diploid GT
+                    if ( !mother ) mother = child1|child2;
+                    if ( !father ) father = child1|child2;
+                    if ( (mother&child1 && father&child2) || (mother&child2 && father&child1) ) is_ok = 1; 
+                    continue;
+                }
+                if ( rule->mal )
+                {
+                    if ( mother && !(child1&mother) ) continue;
+                }
+                if ( rule->fal )
+                {
+                    if ( father && !(child1&father) ) continue;
+                }
+                is_ok = 1;
+            }
+        }
+        if ( is_ok )
+        {
+            trio->nok++;
+        }
+        else
+        {
+            trio->nbad++;
+            has_bad = 1;
+            if ( args.mode&MODE_DELETE )
+            {
+                args.gt_arr[ngt*trio->imother] = bcf_gt_missing;
+                if ( b!=bcf_int32_vector_end ) args.gt_arr[ngt*trio->imother+1] = bcf_gt_missing; // should be always true 
+                args.gt_arr[ngt*trio->ifather] = bcf_gt_missing;
+                if ( d!=bcf_int32_vector_end ) args.gt_arr[ngt*trio->ifather+1] = bcf_gt_missing;
+                args.gt_arr[ngt*trio->ichild] = bcf_gt_missing;
+                if ( f!=bcf_int32_vector_end ) args.gt_arr[ngt*trio->ichild+1]  = bcf_gt_missing;
+                needs_update = 1;
+            }
+        }
+    }
+
+    if ( needs_update && bcf_update_genotypes(args.hdr,rec,args.gt_arr,ngt*bcf_hdr_nsamples(args.hdr)) )
+        error("Could not update GT field at %s:%d\n", bcf_seqname(args.hdr,rec),rec->pos+1);
+
+    if ( args.mode&MODE_DELETE ) return rec;
+    if ( args.mode&MODE_LIST_GOOD ) return has_bad ? NULL : rec;
+    if ( args.mode&MODE_LIST_BAD ) return has_bad ? rec : NULL;
+
+    return NULL;
+}
+
diff --git a/bcftools/plugins/missing2ref.c b/bcftools/plugins/missing2ref.c
new file mode 100644
index 0000000..c4fb6ab
--- /dev/null
+++ b/bcftools/plugins/missing2ref.c
@@ -0,0 +1,144 @@
+/*  plugins/missing2ref.c -- sets missing genotypes to reference allele.
+
+    Copyright (C) 2014 Genome Research Ltd.
+
+    Author: Petr Danecek <pd3@sanger.ac.uk>
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE.  */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <htslib/vcf.h>
+#include <htslib/vcfutils.h>
+#include <inttypes.h>
+#include <getopt.h>
+
+bcf_hdr_t *in_hdr, *out_hdr;
+int32_t *gts = NULL, mgts = 0;
+int *arr = NULL, marr = 0;
+uint64_t nchanged = 0;
+int new_gt = bcf_gt_unphased(0);
+int use_major = 0;
+
+const char *about(void)
+{
+    return "Set missing genotypes (\"./.\") to ref or major allele (\"0/0\" or \"0|0\").\n";
+}
+
+const char *usage(void)
+{
+    return 
+        "\n"
+        "About: Set missing genotypes\n"
+        "Usage: bcftools +missing2ref [General Options] -- [Plugin Options]\n"
+        "Options:\n"
+        "   run \"bcftools plugin\" for a list of common options\n"
+        "\n"
+        "Plugin options:\n"
+        "   -p, --phased       Set to \"0|0\" \n"
+        "   -m, --major        Set to major allele \n"
+        "\n"
+        "Example:\n"
+        "   bcftools +missing2ref in.vcf -- -p\n"
+        "   bcftools +missing2ref in.vcf -- -p -m\n"
+        "\n";
+}
+
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+    int c;
+    static struct option loptions[] =
+    {
+        {"phased",0,0,'p'},
+        {"major",0,0,'m'},
+        {0,0,0,0}
+    };
+    while ((c = getopt_long(argc, argv, "mp?h",loptions,NULL)) >= 0)
+    {
+        switch (c) 
+        {
+            case 'p': new_gt = bcf_gt_phased(0); break;
+            case 'm': use_major = 1; break;
+            case 'h':
+            case '?':
+            default: fprintf(stderr,"%s", usage()); exit(1); break;
+        }
+    }
+    in_hdr  = in;
+    out_hdr = out;
+    return 0;
+}
+
+bcf1_t *process(bcf1_t *rec)
+{
+    int ngts = bcf_get_genotypes(in_hdr, rec, &gts, &mgts);
+    int i, changed = 0;
+    
+    // Calculating allele frequency for each allele and determining major allele
+    // only do this if use_major is true
+    int majorAllele = -1;
+    int maxAC = -1;
+    int an = 0;
+    if(use_major == 1){
+        hts_expand(int,rec->n_allele,marr,arr);
+        int ret = bcf_calc_ac(in_hdr,rec,arr,BCF_UN_FMT);
+        if(ret > 0){
+            for(i=0; i < rec->n_allele; ++i){
+                an += arr[i];
+                if(*(arr+i) > maxAC){
+                    maxAC = *(arr+i);
+                    majorAllele = i;
+                }
+            }
+        }
+        else{
+            fprintf(stderr,"Error: Could not calculate allele count at position %d\n", rec->pos+1);
+            exit(1);
+        }
+
+        // replacing new_gt by major allele
+        if(bcf_gt_is_phased(new_gt))
+            new_gt = bcf_gt_phased(majorAllele);
+        else
+            new_gt = bcf_gt_unphased(majorAllele);
+    }
+
+    // replace gts
+    for (i=0; i<ngts; i++)
+    {
+        if ( gts[i]==bcf_gt_missing )
+        {
+            gts[i] = new_gt;
+            changed++;
+        }
+    }
+    nchanged += changed;
+    if ( changed ) bcf_update_genotypes(out_hdr, rec, gts, ngts);
+    return rec;
+}
+
+void destroy(void)
+{
+    free(arr);
+    fprintf(stderr,"Filled %"PRIu64" REF alleles\n", nchanged);
+    free(gts);
+}
+
+
diff --git a/bcftools/plugins/missing2ref.c.pysam.c b/bcftools/plugins/missing2ref.c.pysam.c
new file mode 100644
index 0000000..e701dac
--- /dev/null
+++ b/bcftools/plugins/missing2ref.c.pysam.c
@@ -0,0 +1,146 @@
+#include "bcftools.pysam.h"
+
+/*  plugins/missing2ref.c -- sets missing genotypes to reference allele.
+
+    Copyright (C) 2014 Genome Research Ltd.
+
+    Author: Petr Danecek <pd3@sanger.ac.uk>
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE.  */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <htslib/vcf.h>
+#include <htslib/vcfutils.h>
+#include <inttypes.h>
+#include <getopt.h>
+
+bcf_hdr_t *in_hdr, *out_hdr;
+int32_t *gts = NULL, mgts = 0;
+int *arr = NULL, marr = 0;
+uint64_t nchanged = 0;
+int new_gt = bcf_gt_unphased(0);
+int use_major = 0;
+
+const char *about(void)
+{
+    return "Set missing genotypes (\"./.\") to ref or major allele (\"0/0\" or \"0|0\").\n";
+}
+
+const char *usage(void)
+{
+    return 
+        "\n"
+        "About: Set missing genotypes\n"
+        "Usage: bcftools +missing2ref [General Options] -- [Plugin Options]\n"
+        "Options:\n"
+        "   run \"bcftools plugin\" for a list of common options\n"
+        "\n"
+        "Plugin options:\n"
+        "   -p, --phased       Set to \"0|0\" \n"
+        "   -m, --major        Set to major allele \n"
+        "\n"
+        "Example:\n"
+        "   bcftools +missing2ref in.vcf -- -p\n"
+        "   bcftools +missing2ref in.vcf -- -p -m\n"
+        "\n";
+}
+
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+    int c;
+    static struct option loptions[] =
+    {
+        {"phased",0,0,'p'},
+        {"major",0,0,'m'},
+        {0,0,0,0}
+    };
+    while ((c = getopt_long(argc, argv, "mp?h",loptions,NULL)) >= 0)
+    {
+        switch (c) 
+        {
+            case 'p': new_gt = bcf_gt_phased(0); break;
+            case 'm': use_major = 1; break;
+            case 'h':
+            case '?':
+            default: fprintf(bcftools_stderr,"%s", usage()); exit(1); break;
+        }
+    }
+    in_hdr  = in;
+    out_hdr = out;
+    return 0;
+}
+
+bcf1_t *process(bcf1_t *rec)
+{
+    int ngts = bcf_get_genotypes(in_hdr, rec, &gts, &mgts);
+    int i, changed = 0;
+    
+    // Calculating allele frequency for each allele and determining major allele
+    // only do this if use_major is true
+    int majorAllele = -1;
+    int maxAC = -1;
+    int an = 0;
+    if(use_major == 1){
+        hts_expand(int,rec->n_allele,marr,arr);
+        int ret = bcf_calc_ac(in_hdr,rec,arr,BCF_UN_FMT);
+        if(ret > 0){
+            for(i=0; i < rec->n_allele; ++i){
+                an += arr[i];
+                if(*(arr+i) > maxAC){
+                    maxAC = *(arr+i);
+                    majorAllele = i;
+                }
+            }
+        }
+        else{
+            fprintf(bcftools_stderr,"Error: Could not calculate allele count at position %d\n", rec->pos+1);
+            exit(1);
+        }
+
+        // replacing new_gt by major allele
+        if(bcf_gt_is_phased(new_gt))
+            new_gt = bcf_gt_phased(majorAllele);
+        else
+            new_gt = bcf_gt_unphased(majorAllele);
+    }
+
+    // replace gts
+    for (i=0; i<ngts; i++)
+    {
+        if ( gts[i]==bcf_gt_missing )
+        {
+            gts[i] = new_gt;
+            changed++;
+        }
+    }
+    nchanged += changed;
+    if ( changed ) bcf_update_genotypes(out_hdr, rec, gts, ngts);
+    return rec;
+}
+
+void destroy(void)
+{
+    free(arr);
+    fprintf(bcftools_stderr,"Filled %"PRIu64" REF alleles\n", nchanged);
+    free(gts);
+}
+
+
diff --git a/bcftools/plugins/prune.c b/bcftools/plugins/prune.c
new file mode 100644
index 0000000..c2ac860
--- /dev/null
+++ b/bcftools/plugins/prune.c
@@ -0,0 +1,293 @@
+/* 
+    Copyright (C) 2017 Genome Research Ltd.
+
+    Author: Petr Danecek <pd3@sanger.ac.uk>
+
+    Permission is hereby granted, free of charge, to any person obtaining a copy
+    of this software and associated documentation files (the "Software"), to deal
+    in the Software without restriction, including without limitation the rights
+    to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+    copies of the Software, and to permit persons to whom the Software is
+    furnished to do so, subject to the following conditions:
+    
+    The above copyright notice and this permission notice shall be included in
+    all copies or substantial portions of the Software.
+    
+    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+    IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+    FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+    AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+    LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+    OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+    THE SOFTWARE.
+*/
+/*
+    Prune sites by missingness, LD
+
+    See calc_ld() in vcfbuf.c for the actual LD calculation
+
+*/
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <strings.h>
+#include <getopt.h>
+#include <stdarg.h>
+#include <unistd.h>
+#include <stdint.h>
+#include <errno.h>
+#include <htslib/vcf.h>
+#include <htslib/synced_bcf_reader.h>
+#include <htslib/vcfutils.h>
+#include "bcftools.h"
+#include "vcfbuf.h"
+#include "filter.h"
+
+#define FLT_INCLUDE 1
+#define FLT_EXCLUDE 2
+
+typedef struct
+{
+    filter_t *filter;
+    char *filter_str, *af_tag;
+    int filter_logic;   // one of FLT_INCLUDE/FLT_EXCLUDE (-i or -e)
+    vcfbuf_t *vcfbuf;
+    int argc, region_is_file, target_is_file, output_type, filter_r2_id, rand_missing, nsites, ld_win;
+    char **argv, *region, *target, *fname, *output_fname, *info_pos, *info_r2, *filter_r2;
+    htsFile *out_fh;
+    bcf_hdr_t *hdr;
+    bcf_srs_t *sr;
+    double max_ld;
+}
+args_t;
+
+const char *about(void)
+{
+    return "Prune sites by missingness, linkage disequilibrium\n";
+}
+
+static const char *usage_text(void)
+{
+    return 
+        "\n"
+        "About: Prune sites by missingness or linkage disequilibrium.\n"
+        "\n"
+        "Usage: bcftools +prune [Options]\n"
+        "Plugin options:\n"
+        "       --AF-tag STR                use this tag with -n to determine allele frequency\n"
+        "   -a, --annotate-info STR         add INFO/STR_POS and STR_R2 annotation: an upstream site with the biggest r2 value\n"
+        "   -e, --exclude EXPR              exclude sites for which the expression is true\n"
+        "   -f, --set-filter STR            annotate FILTER column with STR instead of discarding the site\n"
+        "   -i, --include EXPR              include only sites for which the expression is true\n"
+        "   -l, --max-LD R2                 remove sites with r2 bigger than R2 within the -w window\n"
+        "   -n, --nsites-per-win N          keep at most N sites in the -w window, removing sites with small AF first\n"
+        "   -o, --output FILE               write output to the FILE [standard output]\n"
+        "   -O, --output-type b|u|z|v       b: compressed BCF, u: uncompressed BCF, z: compressed VCF, v: uncompressed VCF [v]\n"
+        "       --randomize-missing         replace missing data with randomly assigned genotype based on site's allele frequency\n"
+        "   -r, --regions REGION            restrict to comma-separated list of regions\n"
+        "   -R, --regions-file FILE         restrict to regions listed in a file\n"
+        "   -t, --targets REGION            similar to -r but streams rather than index-jumps\n"
+        "   -T, --targets-file FILE         similar to -R but streams rather than index-jumps\n"
+        "   -w, --window INT[bp|kb]         the window size of INT sites/bp/kb for the -n/-l options [100kb]\n"
+        "Examples:\n"
+        "   # Discard records with r2 bigger than 0.6 in a window of 1000 sites\n"
+        "   bcftools +prune -l 0.6 -w 1000 input.bcf -Ob -o output.bcf\n"
+        "\n"
+        "   # Set FILTER (but do not discard) records with r2 bigger than 0.4 in the default window of 100kb\n"
+        "   bcftools +prune -l 0.4 -f MAX_R2 input.bcf -Ob -o output.bcf\n"
+        "\n"
+        "   # Annotate INFO field of all records with maximum r2 in a window of 1000 sites\n"
+        "   bcftools +prune -l 0.6 -w 1000 -a MAX -f MAX_R2 input.bcf -Ob -o output.bcf\n"
+        "\n"
+        "   # Discard records with r2 bigger than 0.6, first removing records with 2% or more of genotypes missing\n"
+        "   bcftools +prune -l 0.6 -e'F_MISSING>=0.02' input.bcf -Ob -o output.bcf\n"
+        "\n";
+}
+
+static void init_data(args_t *args)
+{
+    args->sr = bcf_sr_init();
+    if ( args->region )
+    {
+        args->sr->require_index = 1;
+        if ( bcf_sr_set_regions(args->sr, args->region, args->region_is_file)<0 ) error("Failed to read the regions: %s\n",args->region);
+    }
+    if ( args->target && bcf_sr_set_targets(args->sr, args->target, args->target_is_file, 0)<0 ) error("Failed to read the targets: %s\n",args->target);
+    if ( !bcf_sr_add_reader(args->sr,args->fname) ) error("Error: %s\n", bcf_sr_strerror(args->sr->errnum));
+    args->hdr = bcf_sr_get_header(args->sr,0);
+
+    args->out_fh = hts_open(args->output_fname,hts_bcf_wmode(args->output_type));
+    if ( args->out_fh == NULL ) error("Can't write to \"%s\": %s\n", args->output_fname, strerror(errno));
+    if ( args->filter_r2 )
+    {
+        bcf_hdr_printf(args->hdr,"##FILTER=<ID=%s,Description=\"A site with r2>%e upstream within %d%s\">",args->filter_r2,args->max_ld,
+                args->ld_win < 0 ? -args->ld_win/1000 : args->ld_win,
+                args->ld_win < 0 ? "kb" : " sites");
+    }
+    if ( args->info_r2 )
+    {
+        bcf_hdr_printf(args->hdr,"##INFO=<ID=%s,Number=1,Type=Integer,Description=\"A site with r2>%e upstream\">",args->info_pos,args->max_ld);
+        bcf_hdr_printf(args->hdr,"##INFO=<ID=%s,Number=1,Type=Float,Description=\"A site with r2>%e upstream\">",args->info_r2,args->max_ld);
+    }
+    bcf_hdr_write(args->out_fh, args->hdr);
+    if ( args->filter_r2 )
+        args->filter_r2_id = bcf_hdr_id2int(args->hdr, BCF_DT_ID, args->filter_r2);
+
+    args->vcfbuf = vcfbuf_init(args->hdr, args->ld_win);
+    vcfbuf_set_opt(args->vcfbuf,double,VCFBUF_LD_MAX,args->max_ld);
+    if ( args->nsites ) vcfbuf_set_opt(args->vcfbuf,int,VCFBUF_NSITES,args->nsites);
+    if ( args->af_tag ) vcfbuf_set_opt(args->vcfbuf,char*,VCFBUF_AF_TAG,args->af_tag);
+    if ( args->rand_missing ) vcfbuf_set_opt(args->vcfbuf,int,VCFBUF_RAND_MISSING,1);
+    vcfbuf_set_opt(args->vcfbuf,int,VCFBUF_SKIP_FILTER,args->filter_r2 ? 1 : 0);
+
+    if ( args->filter_str )
+        args->filter = filter_init(args->hdr, args->filter_str);
+}
+static void destroy_data(args_t *args)
+{
+    if ( args->filter )
+        filter_destroy(args->filter);
+    hts_close(args->out_fh);
+    vcfbuf_destroy(args->vcfbuf);
+    bcf_sr_destroy(args->sr);
+    free(args->info_pos);
+    free(args->info_r2);
+    free(args);
+}
+static void flush(args_t *args, int flush_all)
+{
+    bcf1_t *rec;
+    while ( (rec = vcfbuf_flush(args->vcfbuf, flush_all)) )
+        bcf_write1(args->out_fh, args->hdr, rec);
+}
+static void process(args_t *args)
+{
+    bcf1_t *rec = bcf_sr_get_line(args->sr,0);
+    if ( args->filter )
+    {
+        int ret  = filter_test(args->filter, rec, NULL);
+        if ( args->filter_logic==FLT_INCLUDE ) { if ( !ret ) return; }
+        else if ( ret ) return;
+    }
+    bcf_sr_t *sr = bcf_sr_get_reader(args->sr, 0);
+    if ( args->max_ld )
+    {
+        double ld_val;
+        bcf1_t *ld_rec = vcfbuf_max_ld(args->vcfbuf, rec, &ld_val);
+        if ( ld_rec && ld_val > args->max_ld )
+        {
+            if ( !args->filter_r2 ) return;
+            bcf_add_filter(args->hdr, rec, args->filter_r2_id);
+        }
+        if ( ld_rec && args->info_r2 )
+        {
+            float tmp = ld_val;
+            int32_t tmp_pos = ld_rec->pos + 1;
+            bcf_update_info_float(args->hdr, rec, args->info_r2, &tmp, 1);
+            bcf_update_info_int32(args->hdr, rec, args->info_pos, &tmp_pos, 1);
+        }
+    }
+    sr->buffer[0] = vcfbuf_push(args->vcfbuf, rec, 1);
+    flush(args,0);
+}
+
+int run(int argc, char **argv)
+{
+    args_t *args = (args_t*) calloc(1,sizeof(args_t));
+    args->argc   = argc; args->argv = argv;
+    args->output_type  = FT_VCF;
+    args->output_fname = "-";
+    args->ld_win = -100e3;
+    static struct option loptions[] =
+    {
+        {"randomize-missing",no_argument,NULL,1},
+        {"AF-tag",required_argument,NULL,2},
+        {"exclude",required_argument,NULL,'e'},
+        {"include",required_argument,NULL,'i'},
+        {"annotate-info",required_argument,NULL,'a'},
+        {"set-filter",required_argument,NULL,'f'},
+        {"max-LD",required_argument,NULL,'l'},
+        {"regions",required_argument,NULL,'r'},
+        {"regions-file",required_argument,NULL,'R'},
+        {"output",required_argument,NULL,'o'},
+        {"output-type",required_argument,NULL,'O'},
+        {"nsites-per-win",required_argument,NULL,'n'},
+        {"window",required_argument,NULL,'w'},
+        {NULL,0,NULL,0}
+    };
+    int c;
+    char *tmp;
+    while ((c = getopt_long(argc, argv, "vr:R:t:T:l:o:O:a:f:i:e:n:w:",loptions,NULL)) >= 0)
+    {
+        switch (c) 
+        {
+            case  1 : args->rand_missing = 1; break;
+            case  2 : args->af_tag = optarg; break;
+            case 'e': args->filter_str = optarg; args->filter_logic |= FLT_EXCLUDE; break;
+            case 'i': args->filter_str = optarg; args->filter_logic |= FLT_INCLUDE; break;
+            case 'a': 
+                {
+                    int l = strlen(optarg);
+                    args->info_pos = (char*)malloc(l+5);
+                    args->info_r2  = (char*)malloc(l+5);
+                    sprintf(args->info_pos,"%s_POS", optarg);
+                    sprintf(args->info_r2,"%s_R2", optarg);
+                }
+                break; 
+            case 'f': args->filter_r2 = optarg; break;
+            case 'n': 
+                args->nsites = strtod(optarg,&tmp);
+                if ( tmp==optarg || *tmp ) error("Could not parse: --nsites-per-win %s\n", optarg);
+                break;
+            case 'l': 
+                args->max_ld = strtod(optarg,&tmp);
+                if ( tmp==optarg || *tmp ) error("Could not parse: --max-LD %s\n", optarg);
+                break;
+            case 'w': 
+                args->ld_win = strtod(optarg,&tmp);
+                if ( !*tmp ) break;
+                if ( tmp==optarg ) error("Could not parse: --window %s\n", optarg);
+                else if ( !strcasecmp("bp",tmp) ) args->ld_win *= -1;
+                else if ( !strcasecmp("kb",tmp) ) args->ld_win *= -1000;
+                else error("Could not parse: --window %s\n", optarg);
+                break;
+            case 'T': args->target_is_file = 1; 
+            case 't': args->target = optarg; break; 
+            case 'R': args->region_is_file = 1; 
+            case 'r': args->region = optarg; break; 
+            case 'o': args->output_fname = optarg; break;
+            case 'O':
+                      switch (optarg[0]) {
+                          case 'b': args->output_type = FT_BCF_GZ; break;
+                          case 'u': args->output_type = FT_BCF; break;
+                          case 'z': args->output_type = FT_VCF_GZ; break;
+                          case 'v': args->output_type = FT_VCF; break;
+                          default: error("The output type \"%s\" not recognised\n", optarg);
+                      }
+                      break;
+            case 'h':
+            case '?':
+            default: error("%s", usage_text()); break;
+        }
+    }
+    if ( args->filter_logic == (FLT_EXCLUDE|FLT_INCLUDE) ) error("Only one of -i or -e can be given.\n");
+    if ( !args->max_ld && !args->nsites ) error("%sError: Expected --max-LD, --nsites-per-win or both\n\n", usage_text());
+
+    if ( optind==argc )
+    {
+        if ( !isatty(fileno((FILE *)stdin)) ) args->fname = "-";  // reading from stdin
+        else { error("%s",usage_text()); }
+    }
+    else if ( optind+1!=argc ) error("%s",usage_text());
+    else args->fname = argv[optind];
+
+    init_data(args);
+    
+    while ( bcf_sr_next_line(args->sr) ) process(args);
+    flush(args,1);
+
+    destroy_data(args);
+    return 0;
+}
+
+
diff --git a/bcftools/plugins/prune.c.pysam.c b/bcftools/plugins/prune.c.pysam.c
new file mode 100644
index 0000000..55f16e4
--- /dev/null
+++ b/bcftools/plugins/prune.c.pysam.c
@@ -0,0 +1,295 @@
+#include "bcftools.pysam.h"
+
+/* 
+    Copyright (C) 2017 Genome Research Ltd.
+
+    Author: Petr Danecek <pd3@sanger.ac.uk>
+
+    Permission is hereby granted, free of charge, to any person obtaining a copy
+    of this software and associated documentation files (the "Software"), to deal
+    in the Software without restriction, including without limitation the rights
+    to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+    copies of the Software, and to permit persons to whom the Software is
+    furnished to do so, subject to the following conditions:
+    
+    The above copyright notice and this permission notice shall be included in
+    all copies or substantial portions of the Software.
+    
+    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+    IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+    FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+    AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+    LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+    OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+    THE SOFTWARE.
+*/
+/*
+    Prune sites by missingness, LD
+
+    See calc_ld() in vcfbuf.c for the actual LD calculation
+
+*/
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <strings.h>
+#include <getopt.h>
+#include <stdarg.h>
+#include <unistd.h>
+#include <stdint.h>
+#include <errno.h>
+#include <htslib/vcf.h>
+#include <htslib/synced_bcf_reader.h>
+#include <htslib/vcfutils.h>
+#include "bcftools.h"
+#include "vcfbuf.h"
+#include "filter.h"
+
+#define FLT_INCLUDE 1
+#define FLT_EXCLUDE 2
+
+typedef struct
+{
+    filter_t *filter;
+    char *filter_str, *af_tag;
+    int filter_logic;   // one of FLT_INCLUDE/FLT_EXCLUDE (-i or -e)
+    vcfbuf_t *vcfbuf;
+    int argc, region_is_file, target_is_file, output_type, filter_r2_id, rand_missing, nsites, ld_win;
+    char **argv, *region, *target, *fname, *output_fname, *info_pos, *info_r2, *filter_r2;
+    htsFile *out_fh;
+    bcf_hdr_t *hdr;
+    bcf_srs_t *sr;
+    double max_ld;
+}
+args_t;
+
+const char *about(void)
+{
+    return "Prune sites by missingness, linkage disequilibrium\n";
+}
+
+static const char *usage_text(void)
+{
+    return 
+        "\n"
+        "About: Prune sites by missingness or linkage disequilibrium.\n"
+        "\n"
+        "Usage: bcftools +prune [Options]\n"
+        "Plugin options:\n"
+        "       --AF-tag STR                use this tag with -n to determine allele frequency\n"
+        "   -a, --annotate-info STR         add INFO/STR_POS and STR_R2 annotation: an upstream site with the biggest r2 value\n"
+        "   -e, --exclude EXPR              exclude sites for which the expression is true\n"
+        "   -f, --set-filter STR            annotate FILTER column with STR instead of discarding the site\n"
+        "   -i, --include EXPR              include only sites for which the expression is true\n"
+        "   -l, --max-LD R2                 remove sites with r2 bigger than R2 within the -w window\n"
+        "   -n, --nsites-per-win N          keep at most N sites in the -w window, removing sites with small AF first\n"
+        "   -o, --output FILE               write output to the FILE [standard output]\n"
+        "   -O, --output-type b|u|z|v       b: compressed BCF, u: uncompressed BCF, z: compressed VCF, v: uncompressed VCF [v]\n"
+        "       --randomize-missing         replace missing data with randomly assigned genotype based on site's allele frequency\n"
+        "   -r, --regions REGION            restrict to comma-separated list of regions\n"
+        "   -R, --regions-file FILE         restrict to regions listed in a file\n"
+        "   -t, --targets REGION            similar to -r but streams rather than index-jumps\n"
+        "   -T, --targets-file FILE         similar to -R but streams rather than index-jumps\n"
+        "   -w, --window INT[bp|kb]         the window size of INT sites/bp/kb for the -n/-l options [100kb]\n"
+        "Examples:\n"
+        "   # Discard records with r2 bigger than 0.6 in a window of 1000 sites\n"
+        "   bcftools +prune -l 0.6 -w 1000 input.bcf -Ob -o output.bcf\n"
+        "\n"
+        "   # Set FILTER (but do not discard) records with r2 bigger than 0.4 in the default window of 100kb\n"
+        "   bcftools +prune -l 0.4 -f MAX_R2 input.bcf -Ob -o output.bcf\n"
+        "\n"
+        "   # Annotate INFO field of all records with maximum r2 in a window of 1000 sites\n"
+        "   bcftools +prune -l 0.6 -w 1000 -a MAX -f MAX_R2 input.bcf -Ob -o output.bcf\n"
+        "\n"
+        "   # Discard records with r2 bigger than 0.6, first removing records with 2% or more of genotypes missing\n"
+        "   bcftools +prune -l 0.6 -e'F_MISSING>=0.02' input.bcf -Ob -o output.bcf\n"
+        "\n";
+}
+
+static void init_data(args_t *args)
+{
+    args->sr = bcf_sr_init();
+    if ( args->region )
+    {
+        args->sr->require_index = 1;
+        if ( bcf_sr_set_regions(args->sr, args->region, args->region_is_file)<0 ) error("Failed to read the regions: %s\n",args->region);
+    }
+    if ( args->target && bcf_sr_set_targets(args->sr, args->target, args->target_is_file, 0)<0 ) error("Failed to read the targets: %s\n",args->target);
+    if ( !bcf_sr_add_reader(args->sr,args->fname) ) error("Error: %s\n", bcf_sr_strerror(args->sr->errnum));
+    args->hdr = bcf_sr_get_header(args->sr,0);
+
+    args->out_fh = hts_open(args->output_fname,hts_bcf_wmode(args->output_type));
+    if ( args->out_fh == NULL ) error("Can't write to \"%s\": %s\n", args->output_fname, strerror(errno));
+    if ( args->filter_r2 )
+    {
+        bcf_hdr_printf(args->hdr,"##FILTER=<ID=%s,Description=\"A site with r2>%e upstream within %d%s\">",args->filter_r2,args->max_ld,
+                args->ld_win < 0 ? -args->ld_win/1000 : args->ld_win,
+                args->ld_win < 0 ? "kb" : " sites");
+    }
+    if ( args->info_r2 )
+    {
+        bcf_hdr_printf(args->hdr,"##INFO=<ID=%s,Number=1,Type=Integer,Description=\"A site with r2>%e upstream\">",args->info_pos,args->max_ld);
+        bcf_hdr_printf(args->hdr,"##INFO=<ID=%s,Number=1,Type=Float,Description=\"A site with r2>%e upstream\">",args->info_r2,args->max_ld);
+    }
+    bcf_hdr_write(args->out_fh, args->hdr);
+    if ( args->filter_r2 )
+        args->filter_r2_id = bcf_hdr_id2int(args->hdr, BCF_DT_ID, args->filter_r2);
+
+    args->vcfbuf = vcfbuf_init(args->hdr, args->ld_win);
+    vcfbuf_set_opt(args->vcfbuf,double,VCFBUF_LD_MAX,args->max_ld);
+    if ( args->nsites ) vcfbuf_set_opt(args->vcfbuf,int,VCFBUF_NSITES,args->nsites);
+    if ( args->af_tag ) vcfbuf_set_opt(args->vcfbuf,char*,VCFBUF_AF_TAG,args->af_tag);
+    if ( args->rand_missing ) vcfbuf_set_opt(args->vcfbuf,int,VCFBUF_RAND_MISSING,1);
+    vcfbuf_set_opt(args->vcfbuf,int,VCFBUF_SKIP_FILTER,args->filter_r2 ? 1 : 0);
+
+    if ( args->filter_str )
+        args->filter = filter_init(args->hdr, args->filter_str);
+}
+static void destroy_data(args_t *args)
+{
+    if ( args->filter )
+        filter_destroy(args->filter);
+    hts_close(args->out_fh);
+    vcfbuf_destroy(args->vcfbuf);
+    bcf_sr_destroy(args->sr);
+    free(args->info_pos);
+    free(args->info_r2);
+    free(args);
+}
+static void flush(args_t *args, int flush_all)
+{
+    bcf1_t *rec;
+    while ( (rec = vcfbuf_flush(args->vcfbuf, flush_all)) )
+        bcf_write1(args->out_fh, args->hdr, rec);
+}
+static void process(args_t *args)
+{
+    bcf1_t *rec = bcf_sr_get_line(args->sr,0);
+    if ( args->filter )
+    {
+        int ret  = filter_test(args->filter, rec, NULL);
+        if ( args->filter_logic==FLT_INCLUDE ) { if ( !ret ) return; }
+        else if ( ret ) return;
+    }
+    bcf_sr_t *sr = bcf_sr_get_reader(args->sr, 0);
+    if ( args->max_ld )
+    {
+        double ld_val;
+        bcf1_t *ld_rec = vcfbuf_max_ld(args->vcfbuf, rec, &ld_val);
+        if ( ld_rec && ld_val > args->max_ld )
+        {
+            if ( !args->filter_r2 ) return;
+            bcf_add_filter(args->hdr, rec, args->filter_r2_id);
+        }
+        if ( ld_rec && args->info_r2 )
+        {
+            float tmp = ld_val;
+            int32_t tmp_pos = ld_rec->pos + 1;
+            bcf_update_info_float(args->hdr, rec, args->info_r2, &tmp, 1);
+            bcf_update_info_int32(args->hdr, rec, args->info_pos, &tmp_pos, 1);
+        }
+    }
+    sr->buffer[0] = vcfbuf_push(args->vcfbuf, rec, 1);
+    flush(args,0);
+}
+
+int run(int argc, char **argv)
+{
+    args_t *args = (args_t*) calloc(1,sizeof(args_t));
+    args->argc   = argc; args->argv = argv;
+    args->output_type  = FT_VCF;
+    args->output_fname = "-";
+    args->ld_win = -100e3;
+    static struct option loptions[] =
+    {
+        {"randomize-missing",no_argument,NULL,1},
+        {"AF-tag",required_argument,NULL,2},
+        {"exclude",required_argument,NULL,'e'},
+        {"include",required_argument,NULL,'i'},
+        {"annotate-info",required_argument,NULL,'a'},
+        {"set-filter",required_argument,NULL,'f'},
+        {"max-LD",required_argument,NULL,'l'},
+        {"regions",required_argument,NULL,'r'},
+        {"regions-file",required_argument,NULL,'R'},
+        {"output",required_argument,NULL,'o'},
+        {"output-type",required_argument,NULL,'O'},
+        {"nsites-per-win",required_argument,NULL,'n'},
+        {"window",required_argument,NULL,'w'},
+        {NULL,0,NULL,0}
+    };
+    int c;
+    char *tmp;
+    while ((c = getopt_long(argc, argv, "vr:R:t:T:l:o:O:a:f:i:e:n:w:",loptions,NULL)) >= 0)
+    {
+        switch (c) 
+        {
+            case  1 : args->rand_missing = 1; break;
+            case  2 : args->af_tag = optarg; break;
+            case 'e': args->filter_str = optarg; args->filter_logic |= FLT_EXCLUDE; break;
+            case 'i': args->filter_str = optarg; args->filter_logic |= FLT_INCLUDE; break;
+            case 'a': 
+                {
+                    int l = strlen(optarg);
+                    args->info_pos = (char*)malloc(l+5);
+                    args->info_r2  = (char*)malloc(l+5);
+                    sprintf(args->info_pos,"%s_POS", optarg);
+                    sprintf(args->info_r2,"%s_R2", optarg);
+                }
+                break; 
+            case 'f': args->filter_r2 = optarg; break;
+            case 'n': 
+                args->nsites = strtod(optarg,&tmp);
+                if ( tmp==optarg || *tmp ) error("Could not parse: --nsites-per-win %s\n", optarg);
+                break;
+            case 'l': 
+                args->max_ld = strtod(optarg,&tmp);
+                if ( tmp==optarg || *tmp ) error("Could not parse: --max-LD %s\n", optarg);
+                break;
+            case 'w': 
+                args->ld_win = strtod(optarg,&tmp);
+                if ( !*tmp ) break;
+                if ( tmp==optarg ) error("Could not parse: --window %s\n", optarg);
+                else if ( !strcasecmp("bp",tmp) ) args->ld_win *= -1;
+                else if ( !strcasecmp("kb",tmp) ) args->ld_win *= -1000;
+                else error("Could not parse: --window %s\n", optarg);
+                break;
+            case 'T': args->target_is_file = 1; // fall through
+            case 't': args->target = optarg; break;
+            case 'R': args->region_is_file = 1; // fall through
+            case 'r': args->region = optarg; break;
+            case 'o': args->output_fname = optarg; break;
+            case 'O':
+                      switch (optarg[0]) {
+                          case 'b': args->output_type = FT_BCF_GZ; break;
+                          case 'u': args->output_type = FT_BCF; break;
+                          case 'z': args->output_type = FT_VCF_GZ; break;
+                          case 'v': args->output_type = FT_VCF; break;
+                          default: error("The output type \"%s\" not recognised\n", optarg);
+                      }
+                      break;
+            case 'h':
+            case '?':
+            default: error("%s", usage_text()); break;
+        }
+    }
+    if ( args->filter_logic == (FLT_EXCLUDE|FLT_INCLUDE) ) error("Only one of -i or -e can be given.\n");
+    if ( !args->max_ld && !args->nsites ) error("%sError: Expected --max-LD, --nsites-per-win or both\n\n", usage_text());
+
+    if ( optind==argc )
+    {
+        if ( !isatty(fileno((FILE *)stdin)) ) args->fname = "-";  // reading from stdin
+        else { error("%s",usage_text()); }
+    }
+    else if ( optind+1!=argc ) error("%s",usage_text());
+    else args->fname = argv[optind];
+
+    init_data(args);
+    
+    while ( bcf_sr_next_line(args->sr) ) process(args);
+    flush(args,1);
+
+    destroy_data(args);
+    return 0;
+}
+
+
diff --git a/bcftools/plugins/setGT.c b/bcftools/plugins/setGT.c
new file mode 100644
index 0000000..80943ff
--- /dev/null
+++ b/bcftools/plugins/setGT.c
@@ -0,0 +1,447 @@
+/*  plugins/setGT.c -- set genotypes to given values
+
+    Copyright (C) 2015-2017 Genome Research Ltd.
+
+    Author: Petr Danecek <pd3@sanger.ac.uk>
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE.  */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <htslib/vcf.h>
+#include <htslib/vcfutils.h>
+#include <htslib/kfunc.h>
+#include <inttypes.h>
+#include <getopt.h>
+#include <ctype.h>
+#include "bcftools.h"
+#include "filter.h"
+
+// Logic of the filters: include or exclude sites which match the filters?
+#define FLT_INCLUDE 1
+#define FLT_EXCLUDE 2
+
+typedef int (*cmp_f)(double a, double b);
+
+static int cmp_eq(double a, double b) { return a==b ? 1 : 0; }
+static int cmp_le(double a, double b) { return a<=b ? 1 : 0; }
+static int cmp_ge(double a, double b) { return a>=b ? 1 : 0; }
+static int cmp_lt(double a, double b) { return a<b ? 1 : 0; }
+static int cmp_gt(double a, double b) { return a>b ? 1 : 0; }
+
+typedef struct
+{
+    bcf_hdr_t *in_hdr, *out_hdr;
+    int32_t *gts, mgts, *iarr, miarr;
+    int *arr, marr;
+    uint64_t nchanged;
+    int tgt_mask, new_mask, new_gt;
+    filter_t *filter;
+    char *filter_str;
+    int filter_logic;
+    uint8_t *smpl_pass;
+    double binom_val;
+    char *binom_tag;
+    cmp_f binom_cmp;
+}
+args_t;
+
+args_t *args = NULL;
+
+#define GT_MISSING   1
+#define GT_PARTIAL  (1<<1)
+#define GT_REF      (1<<2)
+#define GT_MAJOR    (1<<3)
+#define GT_PHASED   (1<<4)
+#define GT_UNPHASED (1<<5)
+#define GT_ALL      (1<<6)
+#define GT_QUERY    (1<<7)
+#define GT_BINOM    (1<<8)
+
+const char *about(void)
+{
+    return "Set genotypes: partially missing to missing, missing to ref/major allele, etc.\n";
+}
+
+const char *usage(void)
+{
+    return 
+        "About: Sets genotypes. The target genotypes can be specified as:\n"
+        "           ./.  .. completely missing (\".\" or \"./.\", depending on ploidy)\n"
+        "           ./x  .. partially missing (e.g., \"./0\" or \".|1\" but not \"./.\")\n"
+        "           .    .. partially or completely missing\n"
+        "           a    .. all genotypes\n"
+        "           b    .. heterozygous genotypes failing two-tailed binomial test (example below)\n"
+        "           q    .. select genotypes using -i/-e options\n"
+        "       and the new genotype can be one of:\n"
+        "           .    .. missing (\".\" or \"./.\", keeps ploidy)\n"
+        "           0    .. reference allele\n"
+        "           M    .. major allele\n"
+        "           p    .. phased genotype\n"
+        "           u    .. unphase genotype and sort by allele (1|0 becomes 0/1)\n"
+        "Usage: bcftools +setGT [General Options] -- [Plugin Options]\n"
+        "Options:\n"
+        "   run \"bcftools plugin\" for a list of common options\n"
+        "\n"
+        "Plugin options:\n"
+        "   -e, --exclude <expr>        Exclude a genotype if true (requires -t q)\n"
+        "   -i, --include <expr>        Include a genotype if true (requires -t q)\n"
+        "   -n, --new-gt <type>         Genotypes to set, see above\n"
+        "   -t, --target-gt <type>      Genotypes to change, see above\n"
+        "\n"
+        "Example:\n"
+        "   # set missing genotypes (\"./.\") to phased ref genotypes (\"0|0\")\n"
+        "   bcftools +setGT in.vcf -- -t . -n 0p\n"
+        "\n"
+        "   # set missing genotypes with DP>0 and GQ>20 to ref genotypes (\"0/0\")\n"
+        "   bcftools +setGT in.vcf -- -t q -n 0 -i 'GT=\".\" && FMT/DP>0 && GQ>20'\n"
+        "\n"
+        "   # set partially missing genotypes to completely missing\n"
+        "   bcftools +setGT in.vcf -- -t ./x -n .\n"
+        "\n"
+        "   # set heterozygous genotypes to 0/0 if binom.test(nAlt,nRef+nAlt,0.5)<1e-3\n"
+        "   bcftools +setGT in.vcf -- -t \"b:AD<1e-3\" -n 0\n"  // todo: make -i/-e recognise something like is_het or gt="het" so that this can be generalized?
+        "\n";
+}
+
+static void _parse_binom_expr_error(char *str)
+{
+    error(
+            "Error parsing the expression: %s\n"
+            "Expected TAG CMP VAL, where\n"
+            "   TAG .. one of the format tags\n"
+            "   CMP .. operator, one of <, <=, >, >=, ==, =\n"
+            "   VAL .. value\n"
+            "For example:\n"
+            "   bcftools +setGT in.vcf -- -t \"b:AD>1e-3\" -n 0\n"
+            "\n", str
+         );
+}
+void parse_binom_expr(args_t *args, char *str)
+{
+    if ( str[1]!=':' ) _parse_binom_expr_error(str);
+
+    char *beg = str+2;
+    while ( *beg && isspace(*beg) ) beg++;
+    if ( !*beg ) _parse_binom_expr_error(str);
+    char *end = beg;
+    while ( *end )
+    {
+        if ( isspace(*end) || *end=='<' || *end=='=' || *end=='>' ) break;
+        end++;
+    }
+    if ( !*end ) _parse_binom_expr_error(str);
+    args->binom_tag = (char*) calloc(1,end-beg+1);
+    memcpy(args->binom_tag,beg,end-beg);
+    int tag_id = bcf_hdr_id2int(args->in_hdr,BCF_DT_ID,args->binom_tag);
+    if ( !bcf_hdr_idinfo_exists(args->in_hdr,BCF_HL_FMT,tag_id) ) error("The FORMAT tag \"%s\" is not present in the VCF\n", args->binom_tag);
+    
+    while ( *end && isspace(*end) ) end++;
+    if ( !*end ) _parse_binom_expr_error(str);
+
+    if ( !strncmp(end,"<=",2) ) { args->binom_cmp = cmp_le; beg = end+2; }
+    else if ( !strncmp(end,">=",2) ) { args->binom_cmp = cmp_ge; beg = end+2; }
+    else if ( !strncmp(end,"==",2) ) { args->binom_cmp = cmp_eq; beg = end+2; }
+    else if ( !strncmp(end,"<",1) ) { args->binom_cmp = cmp_lt; beg = end+1; }
+    else if ( !strncmp(end,">",1) ) { args->binom_cmp = cmp_gt; beg = end+1; }
+    else if ( !strncmp(end,"=",1) ) { args->binom_cmp = cmp_eq; beg = end+1; }
+    else _parse_binom_expr_error(str);
+
+    while ( *beg && isspace(*beg) ) beg++;
+    if ( !*beg ) _parse_binom_expr_error(str);
+
+    args->binom_val = strtod(beg, &end);
+    while ( *end && isspace(*end) ) end++;
+    if ( *end ) _parse_binom_expr_error(str);
+
+    args->tgt_mask |= GT_BINOM;
+    return;
+}
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+    args = (args_t*) calloc(1,sizeof(args_t));
+    args->in_hdr  = in;
+    args->out_hdr = out;
+
+    int c;
+    static struct option loptions[] =
+    {
+        {"include",required_argument,NULL,'i'},
+        {"exclude",required_argument,NULL,'e'},
+        {"new-gt",required_argument,NULL,'n'},
+        {"target-gt",required_argument,NULL,'t'},
+        {NULL,0,NULL,0}
+    };
+    while ((c = getopt_long(argc, argv, "?hn:t:i:e:",loptions,NULL)) >= 0)
+    {
+        switch (c) 
+        {
+            case 'i': args->filter_str = optarg; args->filter_logic = FLT_INCLUDE; break;
+            case 'e': args->filter_str = optarg; args->filter_logic = FLT_EXCLUDE; break;
+            case 'n': args->new_mask = 0;
+                if ( strchr(optarg,'.') ) args->new_mask |= GT_MISSING;
+                if ( strchr(optarg,'0') ) args->new_mask |= GT_REF;
+                if ( strchr(optarg,'M') ) args->new_mask |= GT_MAJOR;
+                if ( strchr(optarg,'p') ) args->new_mask |= GT_PHASED;
+                if ( strchr(optarg,'u') ) args->new_mask |= GT_UNPHASED;
+                if ( args->new_mask==0 ) error("Unknown parameter to --new-gt: %s\n", optarg);
+                break;
+            case 't':
+                if ( !strcmp(optarg,".") ) args->tgt_mask |= GT_MISSING|GT_PARTIAL;
+                if ( !strcmp(optarg,"./x") ) args->tgt_mask |= GT_PARTIAL;
+                if ( !strcmp(optarg,"./.") ) args->tgt_mask |= GT_MISSING;
+                if ( !strcmp(optarg,"a") ) args->tgt_mask |= GT_ALL;
+                if ( !strcmp(optarg,"q") ) args->tgt_mask |= GT_QUERY;
+                if ( !strcmp(optarg,"?") ) args->tgt_mask |= GT_QUERY;        // for backward compatibility
+                if ( strchr(optarg,'b') ) parse_binom_expr(args, strchr(optarg,'b'));
+                if ( args->tgt_mask==0 ) error("Unknown parameter to --target-gt: %s\n", optarg);
+                break;
+            case 'h':
+            case '?':
+            default: fprintf(stderr,"%s", usage()); exit(1); break;
+        }
+    }
+
+    if ( !args->new_mask ) error("Expected -n option\n");
+    if ( !args->tgt_mask ) error("Expected -t option\n");
+
+    if ( args->new_mask & GT_MISSING ) args->new_gt = bcf_gt_missing;
+    if ( args->new_mask & GT_REF ) args->new_gt = args->new_mask&GT_PHASED ? bcf_gt_phased(0) : bcf_gt_unphased(0);
+
+    if ( args->filter_str  && !(args->tgt_mask&GT_QUERY) ) error("Expected -tq with -i/-e\n");
+    if ( !args->filter_str && args->tgt_mask&GT_QUERY ) error("Expected -i/-e with -tq\n");
+    if ( args->filter_str ) args->filter = filter_init(in,args->filter_str);
+
+    return 0;
+}
+
+static inline int phase_gt(int32_t *ptr, int ngts)
+{
+    int j, changed = 0;
+    for (j=0; j<ngts; j++)
+    {
+        if ( ptr[j]==bcf_int32_vector_end ) break;
+        if ( bcf_gt_is_phased(ptr[j]) ) continue;
+        ptr[j] = bcf_gt_phased(bcf_gt_allele(ptr[j]));    // add phasing; this may need a fix, I think the flag should be set for one allele only?
+        changed++;
+    }
+    return changed;
+}
+
+static inline int unphase_gt(int32_t *ptr, int ngts)
+{
+    int j, changed = 0;
+    for (j=0; j<ngts; j++)
+    {
+        if ( ptr[j]==bcf_int32_vector_end ) break;
+        if ( !bcf_gt_is_phased(ptr[j]) ) continue;
+        ptr[j] = bcf_gt_unphased(bcf_gt_allele(ptr[j]));    // remove phasing
+        changed++;
+    }
+
+    // insertion sort
+    int k, l;
+    for (k=1; k<j; k++)
+    {
+        int32_t x = ptr[k];
+        l = k;
+        while ( l>0 && ptr[l-1]>x )
+        {
+            ptr[l] = ptr[l-1];
+            l--;
+        }
+        ptr[l] = x;
+    }
+    return changed;
+}
+static inline int set_gt(int32_t *ptr, int ngts, int gt)
+{
+    int j, changed = 0;
+    for (j=0; j<ngts; j++)
+    {
+        if ( ptr[j]==bcf_int32_vector_end ) break;
+        if ( ptr[j] != gt ) changed++;
+        ptr[j] = gt;
+    }
+    return changed;
+}
+
+static inline double calc_binom(int na, int nb)
+{
+    if ( na + nb == 0 ) return 1;
+
+    /*
+        kf_betai (kfunc.h) is the regularized incomplete beta function I_x(a,b); I_0.5(na,nb+1) = P(X>=na) for X~Binom(na+nb,0.5), doubled below for a two-tailed test
+    */
+    double prob = na > nb ? 2*kf_betai(na, nb + 1, 0.5) : 2*kf_betai(nb, na + 1, 0.5);
+    if ( prob > 1 ) prob = 1;
+
+    return prob;
+}
+
+bcf1_t *process(bcf1_t *rec)
+{
+    if ( !rec->n_sample ) return rec;
+
+    int ngts = bcf_get_genotypes(args->in_hdr, rec, &args->gts, &args->mgts);
+    ngts /= rec->n_sample;
+    int i, j, changed = 0;
+
+    int nbinom = 0;
+    if ( args->tgt_mask & GT_BINOM )
+    {
+        nbinom = bcf_get_format_int32(args->in_hdr, rec, args->binom_tag, &args->iarr, &args->miarr);
+        if ( nbinom<0 ) nbinom = 0;
+        nbinom /= rec->n_sample;
+    }
+    
+    // Count each allele and determine the major (most frequent) allele;
+    // done only when the major allele was requested as the new genotype (-n M)
+    int an = 0, maxAC = -1, majorAllele = -1;
+    if ( args->new_mask & GT_MAJOR )
+    {
+        hts_expand(int,rec->n_allele,args->marr,args->arr);
+        int ret = bcf_calc_ac(args->in_hdr,rec,args->arr,BCF_UN_FMT);
+        if ( ret<= 0 )
+            error("Could not calculate allele count at %s:%d\n", bcf_seqname(args->in_hdr,rec),rec->pos+1);
+
+        for(i=0; i < rec->n_allele; ++i)
+        {
+            an += args->arr[i];
+            if (args->arr[i] > maxAC)
+            {
+                maxAC = args->arr[i];
+                majorAllele = i;
+            }
+        }
+
+        // replacing new_gt by major allele
+        args->new_gt = args->new_mask & GT_PHASED ?  bcf_gt_phased(majorAllele) : bcf_gt_unphased(majorAllele);
+    }
+
+    // replace gts
+    if ( nbinom && ngts>=2 )    // only the first two alleles are tested: haploid records are skipped here, alleles beyond the second are ignored below
+    {
+        if ( args->filter ) filter_test(args->filter,rec,(const uint8_t **)&args->smpl_pass);
+        for (i=0; i<rec->n_sample; i++)
+        {
+            if ( args->smpl_pass )
+            {
+                if ( !args->smpl_pass[i] && args->filter_logic==FLT_INCLUDE ) continue;
+                if (  args->smpl_pass[i] && args->filter_logic==FLT_EXCLUDE ) continue;
+            }
+            int32_t *ptr = args->gts + i*ngts;
+            if ( bcf_gt_is_missing(ptr[0]) || bcf_gt_is_missing(ptr[1]) || ptr[1]==bcf_int32_vector_end ) continue;
+            if ( ptr[0]==ptr[1] ) continue; // a hom
+            int ia = bcf_gt_allele(ptr[0]); 
+            int ib = bcf_gt_allele(ptr[1]); 
+            if ( ia>=nbinom || ib>=nbinom ) 
+                error("The sample %s has incorrect number of %s fields at %s:%d\n",
+                        args->in_hdr->samples[i],args->binom_tag,bcf_seqname(args->in_hdr,rec),rec->pos+1);
+
+            double prob = calc_binom(args->iarr[i*nbinom+ia],args->iarr[i*nbinom+ib]);
+            if ( !args->binom_cmp(prob,args->binom_val) ) continue;
+
+            if ( args->new_mask&GT_UNPHASED )
+                changed += unphase_gt(ptr, ngts);
+            else if ( args->new_mask==GT_PHASED )
+                changed += phase_gt(ptr, ngts);
+            else
+                changed += set_gt(ptr, ngts, args->new_gt);
+        }
+    }
+    else if ( args->tgt_mask&GT_QUERY )
+    {
+        int pass_site = filter_test(args->filter,rec,(const uint8_t **)&args->smpl_pass);
+        if ( pass_site && args->filter_logic==FLT_EXCLUDE )
+        {
+            // -i can include a site but exclude a sample, -e exclude a site but include a sample
+            if ( pass_site )
+            {
+                if ( !args->smpl_pass ) return rec;
+                pass_site = 0;
+                for (i=0; i<rec->n_sample; i++)
+                {
+                    if ( args->smpl_pass[i] ) args->smpl_pass[i] = 0;
+                    else { args->smpl_pass[i] = 1; pass_site = 1; }
+                }
+                if ( !pass_site ) return rec;
+            }
+            else if ( args->smpl_pass )
+                for (i=0; i<rec->n_sample; i++) args->smpl_pass[i] = 1;
+        }
+        else if ( !pass_site ) return rec;
+
+        for (i=0; i<rec->n_sample; i++)
+        {
+            if ( !args->smpl_pass[i] ) continue;
+            if ( args->new_mask&GT_UNPHASED )
+                changed += unphase_gt(args->gts + i*ngts, ngts);
+            else if ( args->new_mask==GT_PHASED )
+                changed += phase_gt(args->gts + i*ngts, ngts);
+            else
+                changed += set_gt(args->gts + i*ngts, ngts, args->new_gt);
+        }
+    }
+    else
+    {
+        for (i=0; i<rec->n_sample; i++)
+        {
+            int ploidy = 0, nmiss = 0;
+            int32_t *ptr = args->gts + i*ngts;
+            for (j=0; j<ngts; j++)
+            {
+                if ( ptr[j]==bcf_int32_vector_end ) break;
+                ploidy++;
+                if ( ptr[j]==bcf_gt_missing ) nmiss++;
+            }
+
+            int do_set = 0;
+            if ( args->tgt_mask&GT_ALL ) do_set = 1;
+            else if ( args->tgt_mask&GT_PARTIAL && nmiss ) do_set = 1;
+            else if ( args->tgt_mask&GT_MISSING && ploidy==nmiss ) do_set = 1;
+
+            if ( !do_set ) continue;
+
+            if ( args->new_mask&GT_UNPHASED )
+                changed += unphase_gt(ptr, ngts);
+            else if ( args->new_mask==GT_PHASED )
+                changed += phase_gt(ptr, ngts);
+            else
+                changed += set_gt(ptr, ngts, args->new_gt);
+        }
+    }
+    args->nchanged += changed;
+    if ( changed ) bcf_update_genotypes(args->out_hdr, rec, args->gts, ngts*rec->n_sample);
+    return rec;
+}
+
+void destroy(void)
+{
+    fprintf(stderr,"Filled %"PRIu64" alleles\n", args->nchanged);
+    free(args->binom_tag);
+    if ( args->filter ) filter_destroy(args->filter);
+    free(args->arr);
+    free(args->iarr);
+    free(args->gts);
+    free(args);
+}
+
+
diff --git a/bcftools/plugins/setGT.c.pysam.c b/bcftools/plugins/setGT.c.pysam.c
new file mode 100644
index 0000000..c3759de
--- /dev/null
+++ b/bcftools/plugins/setGT.c.pysam.c
@@ -0,0 +1,449 @@
+#include "bcftools.pysam.h"
+
+/*  plugins/setGT.c -- set genotypes to given values
+
+    Copyright (C) 2015-2017 Genome Research Ltd.
+
+    Author: Petr Danecek <pd3@sanger.ac.uk>
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE.  */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <htslib/vcf.h>
+#include <htslib/vcfutils.h>
+#include <htslib/kfunc.h>
+#include <inttypes.h>
+#include <getopt.h>
+#include <ctype.h>
+#include "bcftools.h"
+#include "filter.h"
+
+// Logic of the filters: include or exclude sites which match the filters?
+#define FLT_INCLUDE 1
+#define FLT_EXCLUDE 2
+
+typedef int (*cmp_f)(double a, double b);
+
+static int cmp_eq(double a, double b) { return a==b ? 1 : 0; }
+static int cmp_le(double a, double b) { return a<=b ? 1 : 0; }
+static int cmp_ge(double a, double b) { return a>=b ? 1 : 0; }
+static int cmp_lt(double a, double b) { return a<b ? 1 : 0; }
+static int cmp_gt(double a, double b) { return a>b ? 1 : 0; }
+
+typedef struct
+{
+    bcf_hdr_t *in_hdr, *out_hdr;
+    int32_t *gts, mgts, *iarr, miarr;
+    int *arr, marr;
+    uint64_t nchanged;
+    int tgt_mask, new_mask, new_gt;
+    filter_t *filter;
+    char *filter_str;
+    int filter_logic;
+    uint8_t *smpl_pass;
+    double binom_val;
+    char *binom_tag;
+    cmp_f binom_cmp;
+}
+args_t;
+
+args_t *args = NULL;
+
+#define GT_MISSING   1
+#define GT_PARTIAL  (1<<1)
+#define GT_REF      (1<<2)
+#define GT_MAJOR    (1<<3)
+#define GT_PHASED   (1<<4)
+#define GT_UNPHASED (1<<5)
+#define GT_ALL      (1<<6)
+#define GT_QUERY    (1<<7)
+#define GT_BINOM    (1<<8)
+
+const char *about(void)
+{
+    return "Set genotypes: partially missing to missing, missing to ref/major allele, etc.\n";
+}
+
+const char *usage(void)
+{
+    return 
+        "About: Sets genotypes. The target genotypes can be specified as:\n"
+        "           ./.  .. completely missing (\".\" or \"./.\", depending on ploidy)\n"
+        "           ./x  .. partially missing (e.g., \"./0\" or \".|1\" but not \"./.\")\n"
+        "           .    .. partially or completely missing\n"
+        "           a    .. all genotypes\n"
+        "           b    .. heterozygous genotypes failing two-tailed binomial test (example below)\n"
+        "           q    .. select genotypes using -i/-e options\n"
+        "       and the new genotype can be one of:\n"
+        "           .    .. missing (\".\" or \"./.\", keeps ploidy)\n"
+        "           0    .. reference allele\n"
+        "           M    .. major allele\n"
+        "           p    .. phased genotype\n"
+        "           u    .. unphase genotype and sort by allele (1|0 becomes 0/1)\n"
+        "Usage: bcftools +setGT [General Options] -- [Plugin Options]\n"
+        "Options:\n"
+        "   run \"bcftools plugin\" for a list of common options\n"
+        "\n"
+        "Plugin options:\n"
+        "   -e, --exclude <expr>        Exclude a genotype if true (requires -t q)\n"
+        "   -i, --include <expr>        Include a genotype if true (requires -t q)\n"
+        "   -n, --new-gt <type>         Genotypes to set, see above\n"
+        "   -t, --target-gt <type>      Genotypes to change, see above\n"
+        "\n"
+        "Example:\n"
+        "   # set missing genotypes (\"./.\") to phased ref genotypes (\"0|0\")\n"
+        "   bcftools +setGT in.vcf -- -t . -n 0p\n"
+        "\n"
+        "   # set missing genotypes with DP>0 and GQ>20 to ref genotypes (\"0/0\")\n"
+        "   bcftools +setGT in.vcf -- -t q -n 0 -i 'GT=\".\" && FMT/DP>0 && GQ>20'\n"
+        "\n"
+        "   # set partially missing genotypes to completely missing\n"
+        "   bcftools +setGT in.vcf -- -t ./x -n .\n"
+        "\n"
+        "   # set heterozygous genotypes to 0/0 if binom.test(nAlt,nRef+nAlt,0.5)<1e-3\n"
+        "   bcftools +setGT in.vcf -- -t \"b:AD<1e-3\" -n 0\n"  // todo: make -i/-e recognise something like is_het or gt="het" so that this can be generalized?
+        "\n";
+}
+
+static void _parse_binom_expr_error(char *str)
+{
+    error(
+            "Error parsing the expression: %s\n"
+            "Expected TAG CMP VAL, where\n"
+            "   TAG .. one of the format tags\n"
+            "   CMP .. operator, one of <, <=, >, >=, ==, =\n"
+            "   VAL .. value\n"
+            "For example:\n"
+            "   bcftools +setGT in.vcf -- -t \"b:AD>1e-3\" -n 0\n"
+            "\n", str
+         );
+}
+void parse_binom_expr(args_t *args, char *str)
+{
+    if ( str[1]!=':' ) _parse_binom_expr_error(str);
+
+    char *beg = str+2;
+    while ( *beg && isspace(*beg) ) beg++;
+    if ( !*beg ) _parse_binom_expr_error(str);
+    char *end = beg;
+    while ( *end )
+    {
+        if ( isspace(*end) || *end=='<' || *end=='=' || *end=='>' ) break;
+        end++;
+    }
+    if ( !*end ) _parse_binom_expr_error(str);
+    args->binom_tag = (char*) calloc(1,end-beg+1);
+    memcpy(args->binom_tag,beg,end-beg);
+    int tag_id = bcf_hdr_id2int(args->in_hdr,BCF_DT_ID,args->binom_tag);
+    if ( !bcf_hdr_idinfo_exists(args->in_hdr,BCF_HL_FMT,tag_id) ) error("The FORMAT tag \"%s\" is not present in the VCF\n", args->binom_tag);
+    
+    while ( *end && isspace(*end) ) end++;
+    if ( !*end ) _parse_binom_expr_error(str);
+
+    if ( !strncmp(end,"<=",2) ) { args->binom_cmp = cmp_le; beg = end+2; }
+    else if ( !strncmp(end,">=",2) ) { args->binom_cmp = cmp_ge; beg = end+2; }
+    else if ( !strncmp(end,"==",2) ) { args->binom_cmp = cmp_eq; beg = end+2; }
+    else if ( !strncmp(end,"<",1) ) { args->binom_cmp = cmp_lt; beg = end+1; }
+    else if ( !strncmp(end,">",1) ) { args->binom_cmp = cmp_gt; beg = end+1; }
+    else if ( !strncmp(end,"=",1) ) { args->binom_cmp = cmp_eq; beg = end+1; }
+    else _parse_binom_expr_error(str);
+
+    while ( *beg && isspace(*beg) ) beg++;
+    if ( !*beg ) _parse_binom_expr_error(str);
+
+    args->binom_val = strtod(beg, &end);
+    while ( *end && isspace(*end) ) end++;
+    if ( *end ) _parse_binom_expr_error(str);
+
+    args->tgt_mask |= GT_BINOM;
+    return;
+}
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+    args = (args_t*) calloc(1,sizeof(args_t));
+    args->in_hdr  = in;
+    args->out_hdr = out;
+
+    int c;
+    static struct option loptions[] =
+    {
+        {"include",required_argument,NULL,'i'},
+        {"exclude",required_argument,NULL,'e'},
+        {"new-gt",required_argument,NULL,'n'},
+        {"target-gt",required_argument,NULL,'t'},
+        {NULL,0,NULL,0}
+    };
+    while ((c = getopt_long(argc, argv, "?hn:t:i:e:",loptions,NULL)) >= 0)
+    {
+        switch (c) 
+        {
+            case 'i': args->filter_str = optarg; args->filter_logic = FLT_INCLUDE; break;
+            case 'e': args->filter_str = optarg; args->filter_logic = FLT_EXCLUDE; break;
+            case 'n': args->new_mask = 0;
+                if ( strchr(optarg,'.') ) args->new_mask |= GT_MISSING;
+                if ( strchr(optarg,'0') ) args->new_mask |= GT_REF;
+                if ( strchr(optarg,'M') ) args->new_mask |= GT_MAJOR;
+                if ( strchr(optarg,'p') ) args->new_mask |= GT_PHASED;
+                if ( strchr(optarg,'u') ) args->new_mask |= GT_UNPHASED;
+                if ( args->new_mask==0 ) error("Unknown parameter to --new-gt: %s\n", optarg);
+                break;
+            case 't':
+                if ( !strcmp(optarg,".") ) args->tgt_mask |= GT_MISSING|GT_PARTIAL;
+                if ( !strcmp(optarg,"./x") ) args->tgt_mask |= GT_PARTIAL;
+                if ( !strcmp(optarg,"./.") ) args->tgt_mask |= GT_MISSING;
+                if ( !strcmp(optarg,"a") ) args->tgt_mask |= GT_ALL;
+                if ( !strcmp(optarg,"q") ) args->tgt_mask |= GT_QUERY;
+                if ( !strcmp(optarg,"?") ) args->tgt_mask |= GT_QUERY;        // for backward compatibility
+                if ( strchr(optarg,'b') ) parse_binom_expr(args, strchr(optarg,'b'));
+                if ( args->tgt_mask==0 ) error("Unknown parameter to --target-gt: %s\n", optarg);
+                break;
+            case 'h':
+            case '?':
+            default: fprintf(bcftools_stderr,"%s", usage()); exit(1); break;
+        }
+    }
+
+    if ( !args->new_mask ) error("Expected -n option\n");
+    if ( !args->tgt_mask ) error("Expected -t option\n");
+
+    if ( args->new_mask & GT_MISSING ) args->new_gt = bcf_gt_missing;
+    if ( args->new_mask & GT_REF ) args->new_gt = args->new_mask&GT_PHASED ? bcf_gt_phased(0) : bcf_gt_unphased(0);
+
+    if ( args->filter_str  && !(args->tgt_mask&GT_QUERY) ) error("Expected -tq with -i/-e\n");
+    if ( !args->filter_str && args->tgt_mask&GT_QUERY ) error("Expected -i/-e with -tq\n");
+    if ( args->filter_str ) args->filter = filter_init(in,args->filter_str);
+
+    return 0;
+}
+
+static inline int phase_gt(int32_t *ptr, int ngts)
+{
+    int j, changed = 0;
+    for (j=0; j<ngts; j++)
+    {
+        if ( ptr[j]==bcf_int32_vector_end ) break;
+        if ( bcf_gt_is_phased(ptr[j]) ) continue;
+        ptr[j] = bcf_gt_phased(bcf_gt_allele(ptr[j]));    // add phasing; this may need a fix, I think the flag should be set for one allele only?
+        changed++;
+    }
+    return changed;
+}
+
+static inline int unphase_gt(int32_t *ptr, int ngts)
+{
+    int j, changed = 0;
+    for (j=0; j<ngts; j++)
+    {
+        if ( ptr[j]==bcf_int32_vector_end ) break;
+        if ( !bcf_gt_is_phased(ptr[j]) ) continue;
+        ptr[j] = bcf_gt_unphased(bcf_gt_allele(ptr[j]));    // remove phasing
+        changed++;
+    }
+
+    // insertion sort
+    int k, l;
+    for (k=1; k<j; k++)
+    {
+        int32_t x = ptr[k];
+        l = k;
+        while ( l>0 && ptr[l-1]>x )
+        {
+            ptr[l] = ptr[l-1];
+            l--;
+        }
+        ptr[l] = x;
+    }
+    return changed;
+}
+static inline int set_gt(int32_t *ptr, int ngts, int gt)
+{
+    int j, changed = 0;
+    for (j=0; j<ngts; j++)
+    {
+        if ( ptr[j]==bcf_int32_vector_end ) break;
+        if ( ptr[j] != gt ) changed++;
+        ptr[j] = gt;
+    }
+    return changed;
+}
+
+static inline double calc_binom(int na, int nb)
+{
+    if ( na + nb == 0 ) return 1;
+
+    /*
+        kf_betai (kfunc.h) is the regularized incomplete beta function I_x(a,b); I_0.5(na,nb+1) = P(X>=na) for X~Binom(na+nb,0.5), doubled below for a two-tailed test
+    */
+    double prob = na > nb ? 2*kf_betai(na, nb + 1, 0.5) : 2*kf_betai(nb, na + 1, 0.5);
+    if ( prob > 1 ) prob = 1;
+
+    return prob;
+}
+
+bcf1_t *process(bcf1_t *rec)
+{
+    if ( !rec->n_sample ) return rec;
+
+    int ngts = bcf_get_genotypes(args->in_hdr, rec, &args->gts, &args->mgts);
+    ngts /= rec->n_sample;
+    int i, j, changed = 0;
+
+    int nbinom = 0;
+    if ( args->tgt_mask & GT_BINOM )
+    {
+        nbinom = bcf_get_format_int32(args->in_hdr, rec, args->binom_tag, &args->iarr, &args->miarr);
+        if ( nbinom<0 ) nbinom = 0;
+        nbinom /= rec->n_sample;
+    }
+    
+    // Count each allele and determine the major (most frequent) allele;
+    // done only when the major allele was requested as the new genotype (-n M)
+    int an = 0, maxAC = -1, majorAllele = -1;
+    if ( args->new_mask & GT_MAJOR )
+    {
+        hts_expand(int,rec->n_allele,args->marr,args->arr);
+        int ret = bcf_calc_ac(args->in_hdr,rec,args->arr,BCF_UN_FMT);
+        if ( ret<= 0 )
+            error("Could not calculate allele count at %s:%d\n", bcf_seqname(args->in_hdr,rec),rec->pos+1);
+
+        for(i=0; i < rec->n_allele; ++i)
+        {
+            an += args->arr[i];
+            if (args->arr[i] > maxAC)
+            {
+                maxAC = args->arr[i];
+                majorAllele = i;
+            }
+        }
+
+        // replacing new_gt by major allele
+        args->new_gt = args->new_mask & GT_PHASED ?  bcf_gt_phased(majorAllele) : bcf_gt_unphased(majorAllele);
+    }
+
+    // replace gts
+    if ( nbinom && ngts>=2 )    // only the first two alleles are tested: haploid sites are skipped here, alleles of higher ploidy are ignored below
+    {
+        if ( args->filter ) filter_test(args->filter,rec,(const uint8_t **)&args->smpl_pass);
+        for (i=0; i<rec->n_sample; i++)
+        {
+            if ( args->smpl_pass )
+            {
+                if ( !args->smpl_pass[i] && args->filter_logic==FLT_INCLUDE ) continue;
+                if (  args->smpl_pass[i] && args->filter_logic==FLT_EXCLUDE ) continue;
+            }
+            int32_t *ptr = args->gts + i*ngts;
+            if ( bcf_gt_is_missing(ptr[0]) || bcf_gt_is_missing(ptr[1]) || ptr[1]==bcf_int32_vector_end ) continue;
+            if ( ptr[0]==ptr[1] ) continue; // homozygous, nothing to test
+            int ia = bcf_gt_allele(ptr[0]); 
+            int ib = bcf_gt_allele(ptr[1]); 
+            if ( ia>=nbinom || ib>=nbinom ) 
+                error("The sample %s has incorrect number of %s fields at %s:%d\n",
+                        args->in_hdr->samples[i],args->binom_tag,bcf_seqname(args->in_hdr,rec),rec->pos+1);
+
+            double prob = calc_binom(args->iarr[i*nbinom+ia],args->iarr[i*nbinom+ib]);
+            if ( !args->binom_cmp(prob,args->binom_val) ) continue;
+
+            if ( args->new_mask&GT_UNPHASED )
+                changed += unphase_gt(ptr, ngts);
+            else if ( args->new_mask==GT_PHASED )
+                changed += phase_gt(ptr, ngts);
+            else
+                changed += set_gt(ptr, ngts, args->new_gt);
+        }
+    }
+    else if ( args->tgt_mask&GT_QUERY )
+    {
+        int pass_site = filter_test(args->filter,rec,(const uint8_t **)&args->smpl_pass);
+        if ( pass_site && args->filter_logic==FLT_EXCLUDE )
+        {
+            // -i can include a site but exclude a sample, -e exclude a site but include a sample
+            if ( pass_site )
+            {
+                if ( !args->smpl_pass ) return rec;
+                pass_site = 0;
+                for (i=0; i<rec->n_sample; i++)
+                {
+                    if ( args->smpl_pass[i] ) args->smpl_pass[i] = 0;
+                    else { args->smpl_pass[i] = 1; pass_site = 1; }
+                }
+                if ( !pass_site ) return rec;
+            }
+            else if ( args->smpl_pass )
+                for (i=0; i<rec->n_sample; i++) args->smpl_pass[i] = 1;
+        }
+        else if ( !pass_site ) return rec;
+
+        for (i=0; i<rec->n_sample; i++)
+        {
+            if ( !args->smpl_pass[i] ) continue;
+            if ( args->new_mask&GT_UNPHASED )
+                changed += unphase_gt(args->gts + i*ngts, ngts);
+            else if ( args->new_mask==GT_PHASED )
+                changed += phase_gt(args->gts + i*ngts, ngts);
+            else
+                changed += set_gt(args->gts + i*ngts, ngts, args->new_gt);
+        }
+    }
+    else
+    {
+        for (i=0; i<rec->n_sample; i++)
+        {
+            int ploidy = 0, nmiss = 0;
+            int32_t *ptr = args->gts + i*ngts;
+            for (j=0; j<ngts; j++)
+            {
+                if ( ptr[j]==bcf_int32_vector_end ) break;
+                ploidy++;
+                if ( ptr[j]==bcf_gt_missing ) nmiss++;
+            }
+
+            int do_set = 0;
+            if ( args->tgt_mask&GT_ALL ) do_set = 1;
+            else if ( args->tgt_mask&GT_PARTIAL && nmiss ) do_set = 1;
+            else if ( args->tgt_mask&GT_MISSING && ploidy==nmiss ) do_set = 1;
+
+            if ( !do_set ) continue;
+
+            if ( args->new_mask&GT_UNPHASED )
+                changed += unphase_gt(ptr, ngts);
+            else if ( args->new_mask==GT_PHASED )
+                changed += phase_gt(ptr, ngts);
+            else
+                changed += set_gt(ptr, ngts, args->new_gt);
+        }
+    }
+    args->nchanged += changed;
+    if ( changed ) bcf_update_genotypes(args->out_hdr, rec, args->gts, ngts*rec->n_sample);
+    return rec;
+}
+
+void destroy(void)
+{
+    fprintf(bcftools_stderr,"Filled %"PRId64" alleles\n", args->nchanged);
+    free(args->binom_tag);
+    if ( args->filter ) filter_destroy(args->filter);
+    free(args->arr);
+    free(args->iarr);
+    free(args->gts);
+    free(args);
+}
+
+
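Note on `calc_binom` in the hunk above: the sketch below reproduces the same two-sided test by exact summation, for readers who want to check values without `kfunc.h`. It is an illustration under assumptions, not the plugin's code: `binom_two_sided` is a hypothetical name, and the plain-double recurrence is only suitable for depths up to roughly a thousand reads, beyond which 0.5^n underflows.

```c
#include <assert.h>

/* Hypothetical stand-in for calc_binom(): two-sided binomial test of the
   counts na and nb under p = 0.5, computed by exact summation instead of
   the regularized incomplete beta function. */
static double binom_two_sided(int na, int nb)
{
    assert( na >= 0 && nb >= 0 );
    int n = na + nb;
    if ( n == 0 ) return 1;
    int k = na > nb ? na : nb;          /* the larger of the two counts */

    /* Start from P(X = n) = 0.5^n for X ~ Binomial(n, 0.5) */
    double pmf = 1;
    int i;
    for (i = 0; i < n; i++) pmf *= 0.5;

    /* Upper tail P(X >= k), which equals I_0.5(k, n-k+1) */
    double tail = 0;
    for (i = n; i >= k; i--)
    {
        tail += pmf;
        pmf *= (double)i / (n - i + 1); /* P(X = i-1) = P(X = i) * i / (n-i+1) */
    }

    double prob = 2 * tail;             /* double the tail, cap at 1 */
    return prob > 1 ? 1 : prob;
}
```

For example, `binom_two_sided(10, 0)` evaluates to 2/1024, while balanced counts such as `binom_two_sided(5, 5)` are capped at 1.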
diff --git a/bcftools/plugins/smpl-stats.c b/bcftools/plugins/smpl-stats.c
new file mode 100644
index 0000000..0139a9f
--- /dev/null
+++ b/bcftools/plugins/smpl-stats.c
@@ -0,0 +1,483 @@
+/* The MIT License
+
+   Copyright (c) 2018 Genome Research Ltd.
+
+   Author: Petr Danecek <pd3@sanger.ac.uk>
+   
+   Permission is hereby granted, free of charge, to any person obtaining a copy
+   of this software and associated documentation files (the "Software"), to deal
+   in the Software without restriction, including without limitation the rights
+   to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+   copies of the Software, and to permit persons to whom the Software is
+   furnished to do so, subject to the following conditions:
+   
+   The above copyright notice and this permission notice shall be included in
+   all copies or substantial portions of the Software.
+   
+   THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+   IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+   FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+   AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+   LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+   OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+   THE SOFTWARE.
+
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <getopt.h>
+#include <unistd.h>     // for isatty
+#include <htslib/hts.h>
+#include <htslib/vcf.h>
+#include <htslib/kstring.h>
+#include <htslib/kseq.h>
+#include <htslib/synced_bcf_reader.h>
+#include <htslib/vcfutils.h>
+#include "bcftools.h"
+#include "filter.h"
+
+
+// Logic of the filters: include or exclude sites which match the filters?
+#define FLT_INCLUDE 1
+#define FLT_EXCLUDE 2
+
+typedef struct
+{
+    uint32_t
+        npass,          // number of genotypes passing the filter
+        nnon_ref,       // number of non-reference genotypes
+        nhomRR,
+        nhomAA,
+        nhemi,
+        nhet,
+        nSNV,
+        nIndel,
+        nmissing,
+        nsingleton,     // het different from everyone else
+        nts, ntv;       // number of transitions and transversions
+}
+stats_t;
+
+typedef struct
+{
+    stats_t *stats, site_stats;
+    filter_t *filter;
+    char *expr;
+}
+flt_stats_t;
+
+typedef struct
+{
+    int argc, filter_logic, regions_is_file, targets_is_file;
+    int nflt_str;
+    char *filter_str, **flt_str;
+    char **argv, *output_fname, *fname, *regions, *targets;
+    bcf_srs_t *sr;
+    bcf_hdr_t *hdr;
+    flt_stats_t *filters;
+    int nfilters, nsmpl;
+    int32_t *gt_arr, *ac;
+    int mgt_arr, mac;
+}
+args_t;
+
+args_t args;
+
+const char *about(void)
+{
+    return "Calculate basic per-sample stats scanning over a range of thresholds simultaneously.\n";
+}
+
+static const char *usage_text(void)
+{
+    return 
+        "\n"
+        "About: Calculates basic per-sample stats. Use curly brackets to scan a range of values simultaneously\n"
+        "Usage: bcftools +smpl-stats [Plugin Options]\n"
+        "Plugin options:\n"
+        "   -e, --exclude EXPR          exclude sites and samples for which the expression is true\n"
+        "   -i, --include EXPR          include sites and samples for which the expression is true\n"
+        "   -o, --output FILE           output file name [stdout]\n"
+        "   -r, --regions REG           restrict to comma-separated list of regions\n"
+        "   -R, --regions-file FILE     restrict to regions listed in a file\n"
+        "   -t, --targets REG           similar to -r but streams rather than index-jumps\n"
+        "   -T, --targets-file FILE     similar to -R but streams rather than index-jumps\n"
+        "\n"
+        "Example:\n"
+        "   bcftools +smpl-stats -i 'GQ>{10,20,30,40,50}' file.bcf\n"
+        "\n";
+}
+
+static void parse_filters(args_t *args)
+{
+    if ( !args->filter_str ) return;
+    int mflt = 1;
+    args->nflt_str = 1;
+    args->flt_str  = (char**) malloc(sizeof(char*));
+    args->flt_str[0] = strdup(args->filter_str);
+    while (1)
+    {
+        int i, expanded = 0;
+        for (i=args->nflt_str-1; i>=0; i--)
+        {
+            char *exp_beg = strchr(args->flt_str[i], '{');
+            if ( !exp_beg ) continue;
+            char *exp_end = strchr(exp_beg+1, '}');
+            if ( !exp_end ) error("Could not parse the expression: %s\n", args->filter_str);
+            char *beg = exp_beg+1, *mid = beg;
+            while ( mid<exp_end )
+            {
+                while ( mid<exp_end && *mid!=',' ) mid++;
+                kstring_t tmp = {0,0,0};
+                kputsn(args->flt_str[i], exp_beg - args->flt_str[i], &tmp);
+                kputsn(beg, mid - beg, &tmp);
+                kputs(exp_end+1, &tmp);
+                args->nflt_str++;
+                hts_expand(char*, args->nflt_str, mflt, args->flt_str);
+                args->flt_str[args->nflt_str-1] = tmp.s;
+                beg = ++mid;
+            }
+            expanded = 1;
+            free(args->flt_str[i]);
+            memmove(&args->flt_str[i], &args->flt_str[i+1], (args->nflt_str-i-1)*sizeof(*args->flt_str));
+            args->nflt_str--;
+            args->flt_str[args->nflt_str] = NULL;
+        }
+        if ( !expanded ) break;
+    }
+    
+    fprintf(stderr,"Collecting data for %d filtering expressions\n", args->nflt_str);
+}
+
+static void init_data(args_t *args)
+{
+    args->sr = bcf_sr_init();
+    if ( args->regions )
+    {
+        args->sr->require_index = 1;
+        if ( bcf_sr_set_regions(args->sr, args->regions, args->regions_is_file)<0 ) error("Failed to read the regions: %s\n",args->regions);
+    }
+    if ( args->targets && bcf_sr_set_targets(args->sr, args->targets, args->targets_is_file, 0)<0 ) error("Failed to read the targets: %s\n",args->targets);
+    if ( !bcf_sr_add_reader(args->sr,args->fname) ) error("Error: %s\n", bcf_sr_strerror(args->sr->errnum));
+    args->hdr = bcf_sr_get_header(args->sr,0);
+
+    parse_filters(args);
+
+    int i;
+    if ( !args->nflt_str )
+    {
+        args->filters = (flt_stats_t*) calloc(1, sizeof(flt_stats_t));
+        args->nfilters = 1;
+        args->filters[0].expr = strdup("all");
+    }
+    else
+    {
+        args->nfilters = args->nflt_str;
+        args->filters = (flt_stats_t*) calloc(args->nfilters, sizeof(flt_stats_t));
+        for (i=0; i<args->nfilters; i++)
+        {
+            args->filters[i].filter = filter_init(args->hdr, args->flt_str[i]);
+            args->filters[i].expr   = strdup(args->flt_str[i]);
+
+            // replace tabs with spaces so that the output stays parsable
+            char *tmp = args->filters[i].expr;
+            while ( *tmp )
+            { 
+                if ( *tmp=='\t' ) *tmp = ' '; 
+                tmp++; 
+            }
+        }
+    }
+    args->nsmpl = bcf_hdr_nsamples(args->hdr);
+    for (i=0; i<args->nfilters; i++)
+        args->filters[i].stats = (stats_t*) calloc(args->nsmpl,sizeof(stats_t));
+}
+static void destroy_data(args_t *args)
+{
+    int i;
+    for (i=0; i<args->nfilters; i++)
+    {
+        if ( args->filters[i].filter ) filter_destroy(args->filters[i].filter);
+        free(args->filters[i].stats);
+        free(args->filters[i].expr);
+    }
+    free(args->filters);
+    for (i=0; i<args->nflt_str; i++) free(args->flt_str[i]);
+    free(args->flt_str);
+    bcf_sr_destroy(args->sr);
+    free(args->ac);
+    free(args->gt_arr);
+    free(args);
+}
+static void report_stats(args_t *args)
+{
+    int i = 0,j;
+    FILE *fh = !args->output_fname || !strcmp("-",args->output_fname) ? stdout : fopen(args->output_fname,"w");
+    if ( !fh ) error("Could not open the file for writing: %s\n", args->output_fname);
+    fprintf(fh,"# CMD line shows the command line used to generate this output\n");
+    fprintf(fh,"# DEF lines define expressions for all tested thresholds\n");
+    fprintf(fh,"# FLT* lines report numbers for every threshold and every sample:\n");
+    fprintf(fh,"#   %d) filter id\n", ++i);
+    fprintf(fh,"#   %d) sample\n", ++i);
+    fprintf(fh,"#   %d) number of genotypes which pass the filter\n", ++i);
+    fprintf(fh,"#   %d) number of non-reference genotypes\n", ++i);
+    fprintf(fh,"#   %d) number of homozygous ref genotypes (0/0 or 0)\n", ++i);
+    fprintf(fh,"#   %d) number of homozygous alt genotypes (1/1, 2/2, etc)\n", ++i);
+    fprintf(fh,"#   %d) number of heterozygous genotypes (0/1, 1/2, etc)\n", ++i);
+    fprintf(fh,"#   %d) number of hemizygous genotypes (0, 1, etc)\n", ++i);
+    fprintf(fh,"#   %d) number of SNVs\n", ++i);
+    fprintf(fh,"#   %d) number of indels\n", ++i);
+    fprintf(fh,"#   %d) number of singletons\n", ++i);
+    fprintf(fh,"#   %d) number of missing genotypes (./., ., ./0, etc)\n", ++i);
+    fprintf(fh,"#   %d) number of transitions (genotypes such as \"1/2\" are counted twice)\n", ++i);
+    fprintf(fh,"#   %d) number of transversions (genotypes such as \"1/2\" are counted twice)\n", ++i);
+    fprintf(fh,"#   %d) overall ts/tv\n", ++i);
+    i = 0;
+    fprintf(fh,"# SITE* lines report numbers for every threshold and site:\n");
+    fprintf(fh,"#   %d) filter id\n", ++i);
+    fprintf(fh,"#   %d) number of sites which pass the filter\n", ++i);
+    fprintf(fh,"#   %d) number of SNVs\n", ++i);
+    fprintf(fh,"#   %d) number of indels\n", ++i);
+    fprintf(fh,"#   %d) number of singletons\n", ++i);
+    fprintf(fh,"#   %d) number of transitions (counted at most once at multiallelic sites)\n", ++i);
+    fprintf(fh,"#   %d) number of transversions (counted at most once at multiallelic sites)\n", ++i);
+    fprintf(fh,"#   %d) overall ts/tv\n", ++i);
+    fprintf(fh, "CMD\t%s", args->argv[0]);
+    for (i=1; i<args->argc; i++) fprintf(fh, " %s",args->argv[i]);
+    fprintf(fh, "\n");
+    for (i=0; i<args->nfilters; i++)
+    {
+        flt_stats_t *flt = &args->filters[i];
+        fprintf(fh,"DEF\tFLT%d\t%s\n", i, flt->expr);
+    }
+    for (i=0; i<args->nfilters; i++)
+    {
+        flt_stats_t *flt = &args->filters[i];
+        for (j=0; j<args->nsmpl; j++)
+        {
+            fprintf(fh,"FLT%d", i);
+            fprintf(fh,"\t%s",args->hdr->samples[j]);
+            stats_t *stats = &flt->stats[j];
+            fprintf(fh,"\t%d", stats->npass);
+            fprintf(fh,"\t%d", stats->nnon_ref);
+            fprintf(fh,"\t%d", stats->nhomRR);
+            fprintf(fh,"\t%d", stats->nhomAA);
+            fprintf(fh,"\t%d", stats->nhet);
+            fprintf(fh,"\t%d", stats->nhemi);
+            fprintf(fh,"\t%d", stats->nSNV);
+            fprintf(fh,"\t%d", stats->nIndel);
+            fprintf(fh,"\t%d", stats->nsingleton);
+            fprintf(fh,"\t%d", stats->nmissing);
+            fprintf(fh,"\t%d", stats->nts);
+            fprintf(fh,"\t%d", stats->ntv);
+            fprintf(fh,"\t%.2f", stats->ntv ? (float)stats->nts/stats->ntv : INFINITY);
+            fprintf(fh,"\n");
+        }
+        fprintf(fh,"SITE%d", i);
+        stats_t *stats = &flt->site_stats;
+        fprintf(fh,"\t%d", stats->npass);
+        fprintf(fh,"\t%d", stats->nSNV);
+        fprintf(fh,"\t%d", stats->nIndel);
+        fprintf(fh,"\t%d", stats->nsingleton);
+        fprintf(fh,"\t%d", stats->nts);
+        fprintf(fh,"\t%d", stats->ntv);
+        fprintf(fh,"\t%.2f", stats->ntv ? (float)stats->nts/stats->ntv : INFINITY);
+        fprintf(fh,"\n");
+    }
+    if ( fclose(fh)!=0 ) error("Close failed: %s\n", (!args->output_fname || !strcmp("-",args->output_fname)) ? "stdout" : args->output_fname);
+}
+
+static inline int parse_genotype(int32_t *arr, int ngt1, int idx, int als[2])
+{
+    int32_t *ptr = arr + ngt1 * idx;
+    if ( bcf_gt_is_missing(ptr[0]) ) return -1;
+    als[0] = bcf_gt_allele(ptr[0]);
+
+    if ( ngt1==1 || ptr[1]==bcf_int32_vector_end ) { als[1] = als[0]; return -2; }  // haploid: duplicate the allele, never write past this sample's GT slots
+
+    if ( bcf_gt_is_missing(ptr[1]) ) return -1;
+    als[1] = bcf_gt_allele(ptr[1]);
+
+    return 0;
+}
+
+static void process_record(args_t *args, bcf1_t *rec, flt_stats_t *flt)
+{
+    int i,j;
+    uint8_t *smpl_pass = NULL;
+
+    // Find out which samples pass and if the site passes
+    if ( flt->filter )
+    {
+        int pass_site = filter_test(flt->filter, rec, (const uint8_t**) &smpl_pass);
+        if ( args->filter_logic & FLT_EXCLUDE )
+        {
+            if ( pass_site )
+            {
+                if ( !smpl_pass ) return;
+                pass_site = 0;
+                for (i=0; i<args->nsmpl; i++)
+                {
+                    if ( smpl_pass[i] ) smpl_pass[i] = 0;
+                    else { smpl_pass[i] = 1; pass_site = 1; }
+                }
+                if ( !pass_site ) return;
+            }
+            else if ( smpl_pass )
+                for (i=0; i<args->nsmpl; i++) smpl_pass[i] = 1;
+        }
+        else if ( !pass_site ) return;
+    }
+
+    // Find out the allele counts. Try to use INFO/AC, if not present, determine from the genotypes
+    hts_expand(int, rec->n_allele, args->mac, args->ac);
+    if ( !bcf_calc_ac(args->hdr, rec, args->ac, BCF_UN_INFO|BCF_UN_FMT) ) return;
+
+    // Get the genotypes
+    int ngt = bcf_get_genotypes(args->hdr, rec, &args->gt_arr, &args->mgt_arr);
+    if ( ngt<0 ) return;
+    int ngt1 = ngt / rec->n_sample;
+    
+
+    // For ts/tv: numeric code of the reference allele, -1 if REF is longer than one base
+    int ref = !rec->d.allele[0][1] ? bcf_acgt2int(*rec->d.allele[0]) : -1;
+
+    int star_allele = -1;
+    for (i=1; i<rec->n_allele; i++)
+        if ( !rec->d.allele[i][1] && rec->d.allele[i][0]=='*' ) { star_allele = i; break; }
+
+    // Run the stats
+    int site_pass      = 0;
+    int site_SNV       = 0;
+    int site_Indel     = 0;
+    int site_has_ts    = 0;
+    int site_has_tv    = 0;
+    int site_singleton = 0;
+    for (i=0; i<args->nsmpl; i++)
+    {
+        if ( smpl_pass && !smpl_pass[i] ) continue;
+        stats_t *stats = &flt->stats[i];
+
+        // Determine the alternate allele and the genotypes, skip if any of the alleles is missing.
+        int als[2];
+        int ret = parse_genotype(args->gt_arr, ngt1, i, als);
+        if ( ret==-1 ) { stats->nmissing++; continue; }   // missing allele
+        if ( ret==-2 ) stats->nhemi++;
+        else if ( als[0]!=als[1] ) stats->nhet++;
+        else if ( als[0]==0 ) stats->nhomRR++;
+        else stats->nhomAA++;
+
+        stats->npass++;
+        site_pass = 1;
+
+        // Is there an alternate allele other than *?
+        int has_nonref = 0;
+        for (j=0; j<2; j++)
+        {
+            if ( als[j]==star_allele ) continue;
+            if ( als[j]==0 ) continue;
+            has_nonref = 1;
+        }
+        if ( !has_nonref ) continue; // only ref or * in this genotype
+        
+        stats->nnon_ref++;
+
+        // Calculate ts/tv and count SNVs and indels. Genotypes with two different ALT alleles (HetAA, e.g. 1/2) are handled as well
+        {
+            int has_ts = 0, has_tv = 0, has_snv = 0, has_indel = 0;
+            for (j=0; j<2; j++)
+            {
+                if ( als[j]==0 || als[j]==star_allele ) continue;
+                if ( als[j] >= rec->n_allele )
+                    error("The GT index is out of range at %s:%d in %s\n", bcf_seqname(args->hdr,rec),rec->pos+1,args->hdr->samples[i]);
+
+                if ( args->ac[als[j]]==1 ) { stats->nsingleton++; site_singleton = 1; }
+
+                int var_type = bcf_get_variant_type(rec, als[j]);
+                if ( var_type==VCF_SNP || var_type==VCF_MNP )
+                {
+                    int k = 0;
+                    while ( rec->d.allele[0][k] && rec->d.allele[als[j]][k] )
+                    {
+                        if ( rec->d.allele[0][k]==rec->d.allele[als[j]][k] ) { k++; continue; }
+
+                        int alt = bcf_acgt2int(rec->d.allele[als[j]][k]);
+                        if ( abs(ref-alt)==2 ) has_ts = 1;
+                        else has_tv = 1;
+                        has_snv = 1;
+
+                        k++;
+                    }
+                }
+                else if ( var_type==VCF_INDEL ) has_indel = 1;
+            }
+            if ( has_ts ) { stats->nts++; site_has_ts = 1; }
+            if ( has_tv ) { stats->ntv++; site_has_tv = 1; }
+            if ( has_snv ) { stats->nSNV++; site_SNV = 1; }
+            if ( has_indel ) { stats->nIndel++; site_Indel = 1; }
+        }
+    }
+    flt->site_stats.npass  += site_pass;
+    flt->site_stats.nSNV   += site_SNV;
+    flt->site_stats.nIndel += site_Indel;
+    flt->site_stats.nts    += site_has_ts;
+    flt->site_stats.ntv    += site_has_tv;
+    flt->site_stats.nsingleton += site_singleton;
+}
+
+int run(int argc, char **argv)
+{
+    args_t *args = (args_t*) calloc(1,sizeof(args_t));
+    args->argc   = argc; args->argv = argv;
+    args->output_fname = "-";
+    static struct option loptions[] =
+    {
+        {"include",required_argument,0,'i'},
+        {"exclude",required_argument,0,'e'},
+        {"output",required_argument,NULL,'o'},
+        {"regions",1,0,'r'},
+        {"regions-file",1,0,'R'},
+        {"targets",1,0,'t'},
+        {"targets-file",1,0,'T'},
+        {NULL,0,NULL,0}
+    };
+    int c, i;
+    while ((c = getopt_long(argc, argv, "o:s:i:e:r:R:t:T:",loptions,NULL)) >= 0)
+    {
+        switch (c) 
+        {
+            case 'e': args->filter_str = optarg; args->filter_logic |= FLT_EXCLUDE; break;
+            case 'i': args->filter_str = optarg; args->filter_logic |= FLT_INCLUDE; break;
+            case 't': args->targets = optarg; break;
+            case 'T': args->targets = optarg; args->targets_is_file = 1; break;
+            case 'r': args->regions = optarg; break;
+            case 'R': args->regions = optarg; args->regions_is_file = 1; break;
+            case 'o': args->output_fname = optarg; break;
+            case 'h':
+            case '?':
+            default: error("%s", usage_text()); break;
+        }
+    }
+    if ( optind==argc )
+    {
+        if ( !isatty(fileno((FILE *)stdin)) ) args->fname = "-";  // reading from stdin
+        else { error("%s",usage_text()); }
+    }
+    else if ( optind+1!=argc ) error("%s",usage_text());
+    else args->fname = argv[optind];
+
+    init_data(args);
+
+    while ( bcf_sr_next_line(args->sr) )
+    {
+        bcf1_t *rec = bcf_sr_get_line(args->sr,0);
+        for (i=0; i<args->nfilters; i++)
+            process_record(args, rec, &args->filters[i]);
+    }
+
+    report_stats(args);
+    destroy_data(args);
+
+    return 0;
+}
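The curly-bracket expansion done by `parse_filters` above can be illustrated without htslib. The following is a minimal sketch, assuming a fixed-capacity output array; `expand_braces` is a hypothetical helper, it recurses instead of rescanning the list in place, and the order in which it emits combinations is not claimed to match the plugin's.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: expand the first {a,b,c} group of expr into one
   string per alternative and recurse on each result, so that several
   groups yield all combinations. Appends malloc'ed strings to out[]
   (capacity max) starting at index n; returns the new count, or -1 on
   an unbalanced bracket, a full array, or a failed allocation. */
static int expand_braces(const char *expr, char **out, int n, int max)
{
    assert( expr && out );
    const char *open = strchr(expr, '{');
    if ( !open )
    {
        /* nothing left to expand: store a copy of the expression */
        if ( n >= max ) return -1;
        out[n] = malloc(strlen(expr) + 1);
        if ( !out[n] ) return -1;
        strcpy(out[n], expr);
        return n + 1;
    }
    const char *close = strchr(open + 1, '}');
    if ( !close ) return -1;

    const char *beg = open + 1;
    while ( beg < close )
    {
        const char *end = beg;
        while ( end < close && *end != ',' ) end++;

        /* prefix + current alternative + everything after the '}' */
        size_t len = (open - expr) + (end - beg) + strlen(close + 1) + 1;
        char *tmp = malloc(len);
        if ( !tmp ) return -1;
        snprintf(tmp, len, "%.*s%.*s%s", (int)(open - expr), expr, (int)(end - beg), beg, close + 1);

        n = expand_braces(tmp, out, n, max);
        free(tmp);
        if ( n < 0 ) return -1;
        beg = end + 1;
    }
    return n;
}
```

Called on the expression from the usage text, `GQ>{10,20,30,40,50}`, this produces five strings, `GQ>10` through `GQ>50`, each of which the plugin would compile into its own filter.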
diff --git a/bcftools/plugins/smpl-stats.c.pysam.c b/bcftools/plugins/smpl-stats.c.pysam.c
new file mode 100644
index 0000000..5d9dc87
--- /dev/null
+++ b/bcftools/plugins/smpl-stats.c.pysam.c
@@ -0,0 +1,485 @@
+#include "bcftools.pysam.h"
+
+/* The MIT License
+
+   Copyright (c) 2018 Genome Research Ltd.
+
+   Author: Petr Danecek <pd3@sanger.ac.uk>
+   
+   Permission is hereby granted, free of charge, to any person obtaining a copy
+   of this software and associated documentation files (the "Software"), to deal
+   in the Software without restriction, including without limitation the rights
+   to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+   copies of the Software, and to permit persons to whom the Software is
+   furnished to do so, subject to the following conditions:
+   
+   The above copyright notice and this permission notice shall be included in
+   all copies or substantial portions of the Software.
+   
+   THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+   IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+   FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+   AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+   LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+   OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+   THE SOFTWARE.
+
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <getopt.h>
+#include <unistd.h>     // for isatty
+#include <htslib/hts.h>
+#include <htslib/vcf.h>
+#include <htslib/kstring.h>
+#include <htslib/kseq.h>
+#include <htslib/synced_bcf_reader.h>
+#include <htslib/vcfutils.h>
+#include "bcftools.h"
+#include "filter.h"
+
+
+// Logic of the filters: include or exclude sites which match the filters?
+#define FLT_INCLUDE 1
+#define FLT_EXCLUDE 2
+
+typedef struct
+{
+    uint32_t
+        npass,          // number of genotypes passing the filter
+        nnon_ref,       // number of non-reference genotypes
+        nhomRR,
+        nhomAA,
+        nhemi,
+        nhet,
+        nSNV,
+        nIndel,
+        nmissing,
+        nsingleton,     // het different from everyone else
+        nts, ntv;       // number of transitions and transversions
+}
+stats_t;
+
+typedef struct
+{
+    stats_t *stats, site_stats;
+    filter_t *filter;
+    char *expr;
+}
+flt_stats_t;
+
+typedef struct
+{
+    int argc, filter_logic, regions_is_file, targets_is_file;
+    int nflt_str;
+    char *filter_str, **flt_str;
+    char **argv, *output_fname, *fname, *regions, *targets;
+    bcf_srs_t *sr;
+    bcf_hdr_t *hdr;
+    flt_stats_t *filters;
+    int nfilters, nsmpl;
+    int32_t *gt_arr, *ac;
+    int mgt_arr, mac;
+}
+args_t;
+
+args_t args;
+
+const char *about(void)
+{
+    return "Calculate basic per-sample stats scanning over a range of thresholds simultaneously.\n";
+}
+
+static const char *usage_text(void)
+{
+    return 
+        "\n"
+        "About: Calculates basic per-sample stats. Use curly brackets to scan a range of values simultaneously\n"
+        "Usage: bcftools +smpl-stats [Plugin Options]\n"
+        "Plugin options:\n"
+        "   -e, --exclude EXPR          exclude sites and samples for which the expression is true\n"
+        "   -i, --include EXPR          include sites and samples for which the expression is true\n"
+        "   -o, --output FILE           output file name [bcftools_stdout]\n"
+        "   -r, --regions REG           restrict to comma-separated list of regions\n"
+        "   -R, --regions-file FILE     restrict to regions listed in a file\n"
+        "   -t, --targets REG           similar to -r but streams rather than index-jumps\n"
+        "   -T, --targets-file FILE     similar to -R but streams rather than index-jumps\n"
+        "\n"
+        "Example:\n"
+        "   bcftools +smpl-stats -i 'GQ>{10,20,30,40,50}' file.bcf\n"
+        "\n";
+}
+
+static void parse_filters(args_t *args)
+{
+    if ( !args->filter_str ) return;
+    int mflt = 1;
+    args->nflt_str = 1;
+    args->flt_str  = (char**) malloc(sizeof(char*));
+    args->flt_str[0] = strdup(args->filter_str);
+    while (1)
+    {
+        int i, expanded = 0;
+        for (i=args->nflt_str-1; i>=0; i--)
+        {
+            char *exp_beg = strchr(args->flt_str[i], '{');
+            if ( !exp_beg ) continue;
+            char *exp_end = strchr(exp_beg+1, '}');
+            if ( !exp_end ) error("Could not parse the expression: %s\n", args->filter_str);
+            char *beg = exp_beg+1, *mid = beg;
+            while ( mid<exp_end )
+            {
+                while ( mid<exp_end && *mid!=',' ) mid++;
+                kstring_t tmp = {0,0,0};
+                kputsn(args->flt_str[i], exp_beg - args->flt_str[i], &tmp);
+                kputsn(beg, mid - beg, &tmp);
+                kputs(exp_end+1, &tmp);
+                args->nflt_str++;
+                hts_expand(char*, args->nflt_str, mflt, args->flt_str);
+                args->flt_str[args->nflt_str-1] = tmp.s;
+                beg = ++mid;
+            }
+            expanded = 1;
+            free(args->flt_str[i]);
+            memmove(&args->flt_str[i], &args->flt_str[i+1], (args->nflt_str-i-1)*sizeof(*args->flt_str));
+            args->nflt_str--;
+            args->flt_str[args->nflt_str] = NULL;
+        }
+        if ( !expanded ) break;
+    }
+    
+    fprintf(bcftools_stderr,"Collecting data for %d filtering expressions\n", args->nflt_str);
+}
+
+static void init_data(args_t *args)
+{
+    args->sr = bcf_sr_init();
+    if ( args->regions )
+    {
+        args->sr->require_index = 1;
+        if ( bcf_sr_set_regions(args->sr, args->regions, args->regions_is_file)<0 ) error("Failed to read the regions: %s\n",args->regions);
+    }
+    if ( args->targets && bcf_sr_set_targets(args->sr, args->targets, args->targets_is_file, 0)<0 ) error("Failed to read the targets: %s\n",args->targets);
+    if ( !bcf_sr_add_reader(args->sr,args->fname) ) error("Error: %s\n", bcf_sr_strerror(args->sr->errnum));
+    args->hdr = bcf_sr_get_header(args->sr,0);
+
+    parse_filters(args);
+
+    int i;
+    if ( !args->nflt_str )
+    {
+        args->filters = (flt_stats_t*) calloc(1, sizeof(flt_stats_t));
+        args->nfilters = 1;
+        args->filters[0].expr = strdup("all");
+    }
+    else
+    {
+        args->nfilters = args->nflt_str;
+        args->filters = (flt_stats_t*) calloc(args->nfilters, sizeof(flt_stats_t));
+        for (i=0; i<args->nfilters; i++)
+        {
+            args->filters[i].filter = filter_init(args->hdr, args->flt_str[i]);
+            args->filters[i].expr   = strdup(args->flt_str[i]);
+
+            // replace tabs with spaces so that the output stays parsable
+            char *tmp = args->filters[i].expr;
+            while ( *tmp )
+            { 
+                if ( *tmp=='\t' ) *tmp = ' '; 
+                tmp++; 
+            }
+        }
+    }
+    args->nsmpl = bcf_hdr_nsamples(args->hdr);
+    for (i=0; i<args->nfilters; i++)
+        args->filters[i].stats = (stats_t*) calloc(args->nsmpl,sizeof(stats_t));
+}
+static void destroy_data(args_t *args)
+{
+    int i;
+    for (i=0; i<args->nfilters; i++)
+    {
+        if ( args->filters[i].filter ) filter_destroy(args->filters[i].filter);
+        free(args->filters[i].stats);
+        free(args->filters[i].expr);
+    }
+    free(args->filters);
+    for (i=0; i<args->nflt_str; i++) free(args->flt_str[i]);
+    free(args->flt_str);
+    bcf_sr_destroy(args->sr);
+    free(args->ac);
+    free(args->gt_arr);
+    free(args);
+}
+static void report_stats(args_t *args)
+{
+    int i = 0,j;
+    FILE *fh = !args->output_fname || !strcmp("-",args->output_fname) ? bcftools_stdout : fopen(args->output_fname,"w");
+    if ( !fh ) error("Could not open the file for writing: %s\n", args->output_fname);
+    fprintf(fh,"# CMD line shows the command line used to generate this output\n");
+    fprintf(fh,"# DEF lines define expressions for all tested thresholds\n");
+    fprintf(fh,"# FLT* lines report numbers for every threshold and every sample:\n");
+    fprintf(fh,"#   %d) filter id\n", ++i);
+    fprintf(fh,"#   %d) sample\n", ++i);
+    fprintf(fh,"#   %d) number of genotypes which pass the filter\n", ++i);
+    fprintf(fh,"#   %d) number of non-reference genotypes\n", ++i);
+    fprintf(fh,"#   %d) number of homozygous ref genotypes (0/0 or 0)\n", ++i);
+    fprintf(fh,"#   %d) number of homozygous alt genotypes (1/1, 2/2, etc)\n", ++i);
+    fprintf(fh,"#   %d) number of heterozygous genotypes (0/1, 1/2, etc)\n", ++i);
+    fprintf(fh,"#   %d) number of hemizygous genotypes (0, 1, etc)\n", ++i);
+    fprintf(fh,"#   %d) number of SNVs\n", ++i);
+    fprintf(fh,"#   %d) number of indels\n", ++i);
+    fprintf(fh,"#   %d) number of singletons\n", ++i);
+    fprintf(fh,"#   %d) number of missing genotypes (./., ., ./0, etc)\n", ++i);
+    fprintf(fh,"#   %d) number of transitions (genotypes such as \"1/2\" are counted twice)\n", ++i);
+    fprintf(fh,"#   %d) number of transversions (genotypes such as \"1/2\" are counted twice)\n", ++i);
+    fprintf(fh,"#   %d) overall ts/tv\n", ++i);
+    i = 0;
+    fprintf(fh,"# SITE* lines report numbers for every threshold and site:\n");
+    fprintf(fh,"#   %d) filter id\n", ++i);
+    fprintf(fh,"#   %d) number of sites which pass the filter\n", ++i);
+    fprintf(fh,"#   %d) number of SNVs\n", ++i);
+    fprintf(fh,"#   %d) number of indels\n", ++i);
+    fprintf(fh,"#   %d) number of singletons\n", ++i);
+    fprintf(fh,"#   %d) number of transitions (counted at most once at multiallelic sites)\n", ++i);
+    fprintf(fh,"#   %d) number of transversions (counted at most once at multiallelic sites)\n", ++i);
+    fprintf(fh,"#   %d) overall ts/tv\n", ++i);
+    fprintf(fh, "CMD\t%s", args->argv[0]);
+    for (i=1; i<args->argc; i++) fprintf(fh, " %s",args->argv[i]);
+    fprintf(fh, "\n");
+    for (i=0; i<args->nfilters; i++)
+    {
+        flt_stats_t *flt = &args->filters[i];
+        fprintf(fh,"DEF\tFLT%d\t%s\n", i, flt->expr);
+    }
+    for (i=0; i<args->nfilters; i++)
+    {
+        flt_stats_t *flt = &args->filters[i];
+        for (j=0; j<args->nsmpl; j++)
+        {
+            fprintf(fh,"FLT%d", i);
+            fprintf(fh,"\t%s",args->hdr->samples[j]);
+            stats_t *stats = &flt->stats[j];
+            fprintf(fh,"\t%d", stats->npass);
+            fprintf(fh,"\t%d", stats->nnon_ref);
+            fprintf(fh,"\t%d", stats->nhomRR);
+            fprintf(fh,"\t%d", stats->nhomAA);
+            fprintf(fh,"\t%d", stats->nhet);
+            fprintf(fh,"\t%d", stats->nhemi);
+            fprintf(fh,"\t%d", stats->nSNV);
+            fprintf(fh,"\t%d", stats->nIndel);
+            fprintf(fh,"\t%d", stats->nsingleton);
+            fprintf(fh,"\t%d", stats->nmissing);
+            fprintf(fh,"\t%d", stats->nts);
+            fprintf(fh,"\t%d", stats->ntv);
+            fprintf(fh,"\t%.2f", stats->ntv ? (float)stats->nts/stats->ntv : INFINITY);
+            fprintf(fh,"\n");
+        }
+        fprintf(fh,"SITE%d", i);
+        stats_t *stats = &flt->site_stats;
+        fprintf(fh,"\t%d", stats->npass);
+        fprintf(fh,"\t%d", stats->nSNV);
+        fprintf(fh,"\t%d", stats->nIndel);
+        fprintf(fh,"\t%d", stats->nsingleton);
+        fprintf(fh,"\t%d", stats->nts);
+        fprintf(fh,"\t%d", stats->ntv);
+        fprintf(fh,"\t%.2f", stats->ntv ? (float)stats->nts/stats->ntv : INFINITY);
+        fprintf(fh,"\n");
+    }
+    if ( fclose(fh)!=0 ) error("Close failed: %s\n", (!args->output_fname || !strcmp("-",args->output_fname)) ? "bcftools_stdout" : args->output_fname);
+}
+
+static inline int parse_genotype(int32_t *arr, int ngt1, int idx, int als[2])
+{
+    int32_t *ptr = arr + ngt1 * idx;
+    if ( bcf_gt_is_missing(ptr[0]) ) return -1;
+    als[0] = bcf_gt_allele(ptr[0]);
+
+    if ( ngt1==1 || ptr[1]==bcf_int32_vector_end ) { ptr[1] = ptr[0]; return -2; }
+
+    if ( bcf_gt_is_missing(ptr[1]) ) return -1;
+    als[1] = bcf_gt_allele(ptr[1]);
+
+    return 0;
+}
+
+static void process_record(args_t *args, bcf1_t *rec, flt_stats_t *flt)
+{
+    int i,j;
+    uint8_t *smpl_pass = NULL;
+
+    // Find out which samples pass and whether the site passes
+    if ( flt->filter )
+    {
+        int pass_site = filter_test(flt->filter, rec, (const uint8_t**) &smpl_pass);
+        if ( args->filter_logic & FLT_EXCLUDE )
+        {
+            if ( pass_site )
+            {
+                if ( !smpl_pass ) return;
+                pass_site = 0;
+                for (i=0; i<args->nsmpl; i++)
+                {
+                    if ( smpl_pass[i] ) smpl_pass[i] = 0;
+                    else { smpl_pass[i] = 1; pass_site = 1; }
+                }
+                if ( !pass_site ) return;
+            }
+            else
+                for (i=0; i<args->nsmpl; i++) smpl_pass[i] = 1;
+        }
+        else if ( !pass_site ) return;
+    }
+
+    // Find out the allele counts. Try to use INFO/AC, if not present, determine from the genotypes
+    hts_expand(int, rec->n_allele, args->mac, args->ac);
+    if ( !bcf_calc_ac(args->hdr, rec, args->ac, BCF_UN_INFO|BCF_UN_FMT) ) return;
+
+    // Get the genotypes
+    int ngt = bcf_get_genotypes(args->hdr, rec, &args->gt_arr, &args->mgt_arr);
+    if ( ngt<0 ) return;
+    int ngt1 = ngt / rec->n_sample;
+    
+
+    // For ts/tv: numeric code of the reference allele, -1 if REF is longer than one base
+    int ref = !rec->d.allele[0][1] ? bcf_acgt2int(*rec->d.allele[0]) : -1;
+
+    int star_allele = -1;
+    for (i=1; i<rec->n_allele; i++)
+        if ( !rec->d.allele[i][1] && rec->d.allele[i][0]=='*' ) { star_allele = i; break; }
+
+    // Run the stats
+    int site_pass      = 0;
+    int site_SNV       = 0;
+    int site_Indel     = 0;
+    int site_has_ts    = 0;
+    int site_has_tv    = 0;
+    int site_singleton = 0;
+    for (i=0; i<args->nsmpl; i++)
+    {
+        if ( smpl_pass && !smpl_pass[i] ) continue;
+        stats_t *stats = &flt->stats[i];
+
+        // Determine the alternate allele and the genotypes, skip if any of the alleles is missing.
+        int als[2];
+        int ret = parse_genotype(args->gt_arr, ngt1, i, als);
+        if ( ret==-1 ) { stats->nmissing++; continue; }   // missing allele
+        if ( ret==-2 ) stats->nhemi++;
+        else if ( als[0]!=als[1] ) stats->nhet++;
+        else if ( als[0]==0 ) stats->nhomRR++;
+        else stats->nhomAA++;
+
+        stats->npass++;
+        site_pass = 1;
+
+        // Is there an alternate allele other than *?
+        int has_nonref = 0;
+        for (j=0; j<2; j++)
+        {
+            if ( als[j]==star_allele ) continue;
+            if ( als[j]==0 ) continue;
+            has_nonref = 1;
+        }
+        if ( !has_nonref ) continue; // only ref or * in this genotype
+        
+        stats->nnon_ref++;
+
+        // Calculate ts/tv and count SNVs and indels; HetAA genotypes (e.g. 1/2) are handled as well
+        {
+            int has_ts = 0, has_tv = 0, has_snv = 0, has_indel = 0;
+            for (j=0; j<2; j++)
+            {
+                if ( als[j]==0 || als[j]==star_allele ) continue;
+                if ( als[j] >= rec->n_allele )
+                    error("The GT index is out of range at %s:%d in %s\n", bcf_seqname(args->hdr,rec),rec->pos+1,args->hdr->samples[i]);
+
+                if ( args->ac[als[j]]==1 ) { stats->nsingleton++; site_singleton = 1; }
+
+                int var_type = bcf_get_variant_type(rec, als[j]);
+                if ( var_type==VCF_SNP || var_type==VCF_MNP )
+                {
+                    int k = 0;
+                    while ( rec->d.allele[0][k] && rec->d.allele[als[j]][k] )
+                    {
+                        if ( rec->d.allele[0][k]==rec->d.allele[als[j]][k] ) { k++; continue; }
+
+                        int alt = bcf_acgt2int(rec->d.allele[als[j]][k]);
+                        if ( abs(ref-alt)==2 ) has_ts = 1;
+                        else has_tv = 1;
+                        has_snv = 1;
+
+                        k++;
+                    }
+                }
+                else if ( var_type==VCF_INDEL ) has_indel = 1;
+            }
+            if ( has_ts ) { stats->nts++; site_has_ts = 1; }
+            if ( has_tv ) { stats->ntv++; site_has_tv = 1; }
+            if ( has_snv ) { stats->nSNV++; site_SNV = 1; }
+            if ( has_indel ) { stats->nIndel++; site_Indel = 1; }
+        }
+    }
+    flt->site_stats.npass  += site_pass;
+    flt->site_stats.nSNV   += site_SNV;
+    flt->site_stats.nIndel += site_Indel;
+    flt->site_stats.nts    += site_has_ts;
+    flt->site_stats.ntv    += site_has_tv;
+    flt->site_stats.nsingleton += site_singleton;
+}
+
+int run(int argc, char **argv)
+{
+    args_t *args = (args_t*) calloc(1,sizeof(args_t));
+    args->argc   = argc; args->argv = argv;
+    args->output_fname = "-";
+    static struct option loptions[] =
+    {
+        {"include",required_argument,0,'i'},
+        {"exclude",required_argument,0,'e'},
+        {"output",required_argument,NULL,'o'},
+        {"regions",1,0,'r'},
+        {"regions-file",1,0,'R'},
+        {"targets",1,0,'t'},
+        {"targets-file",1,0,'T'},
+        {NULL,0,NULL,0}
+    };
+    int c, i;
+    while ((c = getopt_long(argc, argv, "o:s:i:e:r:R:t:T:",loptions,NULL)) >= 0)
+    {
+        switch (c) 
+        {
+            case 'e': args->filter_str = optarg; args->filter_logic |= FLT_EXCLUDE; break;
+            case 'i': args->filter_str = optarg; args->filter_logic |= FLT_INCLUDE; break;
+            case 't': args->targets = optarg; break;
+            case 'T': args->targets = optarg; args->targets_is_file = 1; break;
+            case 'r': args->regions = optarg; break;
+            case 'R': args->regions = optarg; args->regions_is_file = 1; break;
+            case 'o': args->output_fname = optarg; break;
+            case 'h':
+            case '?':
+            default: error("%s", usage_text()); break;
+        }
+    }
+    if ( optind==argc )
+    {
+        if ( !isatty(fileno((FILE *)stdin)) ) args->fname = "-";  // reading from stdin
+        else { error("%s",usage_text()); }
+    }
+    else if ( optind+1!=argc ) error("%s",usage_text());
+    else args->fname = argv[optind];
+
+    init_data(args);
+
+    while ( bcf_sr_next_line(args->sr) )
+    {
+        bcf1_t *rec = bcf_sr_get_line(args->sr,0);
+        for (i=0; i<args->nfilters; i++)
+            process_record(args, rec, &args->filters[i]);
+    }
+
+    report_stats(args);
+    destroy_data(args);
+
+    return 0;
+}
diff --git a/bcftools/plugins/split.c b/bcftools/plugins/split.c
new file mode 100644
index 0000000..7cd38a4
--- /dev/null
+++ b/bcftools/plugins/split.c
@@ -0,0 +1,415 @@
+/* 
+    Copyright (C) 2017 Genome Research Ltd.
+
+    Author: Petr Danecek <pd3@sanger.ac.uk>
+
+    Permission is hereby granted, free of charge, to any person obtaining a copy
+    of this software and associated documentation files (the "Software"), to deal
+    in the Software without restriction, including without limitation the rights
+    to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+    copies of the Software, and to permit persons to whom the Software is
+    furnished to do so, subject to the following conditions:
+    
+    The above copyright notice and this permission notice shall be included in
+    all copies or substantial portions of the Software.
+    
+    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+    IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+    FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+    AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+    LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+    OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+    THE SOFTWARE.
+*/
+/*
+    Split VCF by sample(s)
+
+*/
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <strings.h>
+#include <getopt.h>
+#include <stdarg.h>
+#include <unistd.h>
+#include <stdint.h>
+#include <errno.h>
+#include <ctype.h>
+#include <htslib/vcf.h>
+#include <htslib/synced_bcf_reader.h>
+#include "bcftools.h"
+#include "filter.h"
+
+#define FLT_INCLUDE 1
+#define FLT_EXCLUDE 2
+
+typedef struct
+{
+    htsFile **fh;
+    filter_t *filter;
+    char *filter_str;
+    int filter_logic;   // one of FLT_INCLUDE/FLT_EXCLUDE (-i or -e)
+    uint8_t *info_tags, *fmt_tags;
+    int ninfo_tags, minfo_tags, nfmt_tags, mfmt_tags, keep_info, keep_fmt;
+    int argc, region_is_file, target_is_file, output_type;
+    char **argv, *region, *target, *fname, *output_dir, *keep_tags, **bnames, *samples_fname;
+    bcf_hdr_t *hdr_in, *hdr_out;
+    bcf_srs_t *sr;
+}
+args_t;
+
+const char *about(void)
+{
+    return "Split VCF by sample, creating single-sample VCFs\n";
+}
+
+static const char *usage_text(void)
+{
+    return 
+        "\n"
+        "About: Split VCF by sample, creating single-sample VCFs.\n"
+        "\n"
+        "Usage: bcftools +split [Options]\n"
+        "Plugin options:\n"
+        "   -e, --exclude EXPR              exclude sites for which the expression is true (applied on the outputs)\n"
+        "   -i, --include EXPR              include only sites for which the expression is true (applied on the outputs)\n"
+        "   -k, --keep-tags LIST            list of tags to keep. By default all tags are preserved\n"
+        "   -o, --output DIR                write output to the directory DIR\n"
+        "   -O, --output-type b|u|z|v       b: compressed BCF, u: uncompressed BCF, z: compressed VCF, v: uncompressed VCF [v]\n"
+        "   -r, --regions REGION            restrict to comma-separated list of regions\n"
+        "   -R, --regions-file FILE         restrict to regions listed in a file\n"
+        "   -S, --samples-file FILE         list of samples to keep with second (optional) column for basename of the new file\n"
+        "   -t, --targets REGION            similar to -r but streams rather than index-jumps\n"
+        "   -T, --targets-file FILE         similar to -R but streams rather than index-jumps\n"
+        "Examples:\n"
+        "   # Split a VCF file\n"
+        "   bcftools +split input.bcf -Ob -o dir\n"
+        "\n"
+        "   # Exclude sites with missing or hom-ref genotypes\n"
+        "   bcftools +split input.bcf -Ob -o dir -i'GT=\"alt\"'\n"
+        "\n"
+        "   # Keep all INFO tags but only GT and PL in FORMAT\n"
+        "   bcftools +split input.bcf -Ob -o dir -k INFO,FMT/GT,PL\n"
+        "\n"
+        "   # Keep all FORMAT tags but drop all INFO tags\n"
+        "   bcftools +split input.bcf -Ob -o dir -k FMT\n"
+        "\n";
+}
+
+void mkdir_p(const char *fmt, ...);
+
+char **set_file_base_names(args_t *args)
+{
+    int i, nsmpl = bcf_hdr_nsamples(args->hdr_in);
+    char **fnames = (char**) calloc(nsmpl,sizeof(char*));
+    if ( args->samples_fname )
+    {
+        kstring_t str = {0,0,0};
+        int nsamples = 0;
+        char **samples = NULL;
+        samples = hts_readlines(args->samples_fname, &nsamples);
+        for (i=0; i<nsamples; i++)
+        {
+            str.l = 0;
+            int escaped = 0;
+            char *ptr = samples[i];
+            while ( *ptr )
+            {
+                if ( *ptr=='\\' && !escaped ) { escaped = 1; ptr++; continue; }
+                if ( isspace(*ptr) && !escaped ) break;
+                kputc(*ptr, &str);
+                escaped = 0;
+                ptr++;
+            }
+            int idx = bcf_hdr_id2int(args->hdr_in, BCF_DT_SAMPLE, str.s);
+            if ( idx<0 )
+            {
+                fprintf(stderr,"Warning: The sample \"%s\" is not present in %s\n", str.s,args->fname);
+                continue;
+            }
+            while ( *ptr && isspace(*ptr) ) ptr++;
+            if ( !*ptr )
+            {
+                fnames[idx] = strdup(str.s);
+                continue;
+            }
+            str.l = 0;
+            while ( *ptr )
+            {
+                if ( *ptr=='\\' && !escaped ) { escaped = 1; ptr++; continue; }
+                if ( isspace(*ptr) && !escaped ) break;
+                kputc(*ptr, &str);
+                escaped = 0;
+                ptr++;
+            }
+            fnames[idx] = strdup(str.s);
+        }
+        for (i=0; i<nsamples; i++) free(samples[i]);
+        free(samples);
+        free(str.s);
+    }
+    else
+    {
+        for (i=0; i<nsmpl; i++)
+            fnames[i] = strdup(args->hdr_in->samples[i]);
+    }
+    return fnames;
+}
+
+static void init_data(args_t *args)
+{
+    args->sr = bcf_sr_init();
+    if ( args->region )
+    {
+        args->sr->require_index = 1;
+        if ( bcf_sr_set_regions(args->sr, args->region, args->region_is_file)<0 ) error("Failed to read the regions: %s\n",args->region);
+    }
+    if ( args->target && bcf_sr_set_targets(args->sr, args->target, args->target_is_file, 0)<0 ) error("Failed to read the targets: %s\n",args->target);
+    if ( !bcf_sr_add_reader(args->sr,args->fname) ) error("Error: %s\n", bcf_sr_strerror(args->sr->errnum));
+    args->hdr_in  = bcf_sr_get_header(args->sr,0);
+    args->hdr_out = bcf_hdr_dup(args->hdr_in);
+
+    if ( args->filter_str )
+        args->filter = filter_init(args->hdr_in, args->filter_str);
+
+    mkdir_p("%s/",args->output_dir);
+
+    int i, nsmpl = bcf_hdr_nsamples(args->hdr_in);
+    if ( !nsmpl ) error("No samples to split: %s\n", args->fname);
+    args->fh = (htsFile**)calloc(nsmpl,sizeof(*args->fh));
+    args->bnames = set_file_base_names(args);
+    kstring_t str = {0,0,0};
+    for (i=0; i<nsmpl; i++)
+    {
+        if ( !args->bnames[i] ) continue;
+        str.l = 0;
+        kputs(args->output_dir, &str);
+        if ( str.s[str.l-1] != '/' ) kputc('/', &str);
+        int k, l = str.l;
+        kputs(args->bnames[i], &str);
+        for (k=l; k<str.l; k++) if ( isspace(str.s[k]) ) str.s[k] = '_';
+        if ( args->output_type & FT_BCF ) kputs(".bcf", &str);
+        else if ( args->output_type & FT_GZ ) kputs(".vcf.gz", &str);
+        else kputs(".vcf", &str);
+        args->fh[i] = hts_open(str.s, hts_bcf_wmode(args->output_type));
+        if ( args->fh[i] == NULL ) error("Can't write to \"%s\": %s\n", str.s, strerror(errno));
+        bcf_hdr_nsamples(args->hdr_out) = 1;
+        args->hdr_out->samples[0] = args->bnames[i];
+        bcf_hdr_write(args->fh[i], args->hdr_out);
+    }
+    free(str.s);
+
+    // parse tags
+    int is_info = 0, is_fmt = 0;
+    char *beg = args->keep_tags;
+    while ( beg && *beg )
+    {
+        if ( !strncasecmp("INFO/",beg,5) ) { is_info = 1; is_fmt = 0; beg += 5; }
+        else if ( !strcasecmp("INFO",beg) ) { args->keep_info = 1; break; }
+        else if ( !strncasecmp("INFO,",beg,5) ) { args->keep_info = 1; beg += 5; continue; }
+        else if ( !strncasecmp("FMT/",beg,4) ) { is_info = 0; is_fmt = 1; beg += 4; }
+        else if ( !strncasecmp("FORMAT/",beg,7) ) { is_info = 0; is_fmt = 1; beg += 7; }
+        else if ( !strcasecmp("FMT",beg) ) { args->keep_fmt = 1; break; }
+        else if ( !strcasecmp("FORMAT",beg) ) { args->keep_fmt = 1; break; }
+        else if ( !strncasecmp("FMT,",beg,4) ) { args->keep_fmt = 1; beg += 4; continue; }
+        else if ( !strncasecmp("FORMAT,",beg,7) ) { args->keep_fmt = 1; beg += 7; continue; }
+        char *end = beg;
+        while ( *end && *end!=',' ) end++;
+        char tmp = *end; *end = 0;
+        int id = bcf_hdr_id2int(args->hdr_in, BCF_DT_ID, beg);
+        beg = tmp ? end + 1 : end;
+        if ( is_info && bcf_hdr_idinfo_exists(args->hdr_in,BCF_HL_INFO,id) )
+        {
+            if ( id >= args->ninfo_tags ) args->ninfo_tags = id + 1;
+            hts_expand0(uint8_t, args->ninfo_tags, args->minfo_tags, args->info_tags);
+            args->info_tags[id] = 1;
+        }
+        if ( is_fmt && bcf_hdr_idinfo_exists(args->hdr_in,BCF_HL_FMT,id) )
+        {
+            if ( id >= args->nfmt_tags ) args->nfmt_tags = id + 1;
+            hts_expand0(uint8_t, args->nfmt_tags, args->mfmt_tags, args->fmt_tags);
+            args->fmt_tags[id] = 1;
+        }
+    }
+    if ( !args->keep_info && !args->keep_fmt && !args->ninfo_tags && !args->nfmt_tags )
+    {
+        args->keep_info = args->keep_fmt = 1;
+    }
+}
+static void destroy_data(args_t *args)
+{
+    free(args->info_tags);
+    free(args->fmt_tags);
+    if ( args->filter )
+        filter_destroy(args->filter);
+    int i, nsmpl = bcf_hdr_nsamples(args->hdr_in);
+    for (i=0; i<nsmpl; i++)
+    {
+        if ( args->fh[i] && hts_close(args->fh[i])!=0 ) error("Error: close failed!\n");
+        free(args->bnames[i]);
+    }
+    free(args->bnames);
+    free(args->fh);
+    bcf_sr_destroy(args->sr);
+    bcf_hdr_destroy(args->hdr_out);
+    free(args);
+}
+
+static bcf1_t *rec_set_info(args_t *args, bcf1_t *rec)
+{
+    bcf1_t *out = bcf_init1();
+    out->rid  = rec->rid;
+    out->pos  = rec->pos;
+    out->rlen = rec->rlen;
+    out->qual = rec->qual;
+    out->n_allele = rec->n_allele;
+    out->n_sample = 1;
+    if ( args->keep_info )
+    {
+        out->n_info = rec->n_info;
+        out->shared.m = out->shared.l = rec->shared.l;
+        out->shared.s = (char*) malloc(out->shared.l);
+        memcpy(out->shared.s,rec->shared.s,out->shared.l);
+        return out;
+    }
+
+    // build the BCF record
+    kstring_t tmp = {0,0,0};
+    char *ptr = rec->shared.s;
+    kputsn_(ptr, rec->unpack_size[0], &tmp); ptr += rec->unpack_size[0]; // ID
+    kputsn_(ptr, rec->unpack_size[1], &tmp); ptr += rec->unpack_size[1]; // REF+ALT
+    kputsn_(ptr, rec->unpack_size[2], &tmp);                             // FILTER
+    if ( args->ninfo_tags )
+    {
+        int i;
+        for (i=0; i<rec->n_info; i++)
+        {
+            bcf_info_t *info = &rec->d.info[i];
+            int id = info->key;
+            if ( !args->info_tags[id] ) continue;
+            kputsn_(info->vptr - info->vptr_off, info->vptr_len + info->vptr_off, &tmp);
+            out->n_info++;
+        }
+    }
+    out->shared.m = tmp.m;
+    out->shared.s = tmp.s;
+    out->shared.l = tmp.l;
+    out->unpacked = 0;
+    return out;
+}
+
+static bcf1_t *rec_set_format(args_t *args, bcf1_t *src, int ismpl, bcf1_t *dst)
+{
+    dst->n_fmt = 0;
+    kstring_t tmp = dst->indiv; tmp.l = 0;
+    int i;
+    for (i=0; i<src->n_fmt; i++)
+    {
+        bcf_fmt_t *fmt = &src->d.fmt[i];
+        int id = fmt->id;
+        if ( !args->keep_fmt && !args->fmt_tags[id] ) continue;
+
+        bcf_enc_int1(&tmp, id);
+        bcf_enc_size(&tmp, fmt->n, fmt->type);
+        kputsn_(fmt->p + ismpl*fmt->size, fmt->size, &tmp);
+
+        dst->n_fmt++;
+    }
+    dst->indiv = tmp;
+    return dst;
+}
+
+static void process(args_t *args)
+{
+    bcf1_t *rec = bcf_sr_get_line(args->sr,0);
+    bcf_unpack(rec, BCF_UN_ALL);
+
+    int i, site_pass = 1;
+    const uint8_t *smpl_pass = NULL;
+    if ( args->filter )
+    {
+        site_pass = filter_test(args->filter, rec, &smpl_pass);
+        if ( args->filter_logic & FLT_EXCLUDE ) site_pass = site_pass ? 0 : 1;
+    }
+    bcf1_t *out = NULL; 
+    for (i=0; i<rec->n_sample; i++)
+    {
+        if ( !args->fh[i] ) continue;
+        if ( !smpl_pass && !site_pass ) continue;
+        if ( smpl_pass )
+        {
+            int pass = args->filter_logic & FLT_EXCLUDE ? ( smpl_pass[i] ? 0 : 1) : smpl_pass[i];
+            if ( !pass ) continue;
+        }
+        if ( !out ) out = rec_set_info(args, rec);
+        rec_set_format(args, rec, i, out);
+        bcf_write(args->fh[i], args->hdr_out, out);
+    }
+    if ( out ) bcf_destroy(out);
+}
+
+int run(int argc, char **argv)
+{
+    args_t *args = (args_t*) calloc(1,sizeof(args_t));
+    args->argc   = argc; args->argv = argv;
+    args->output_type  = FT_VCF;
+    static struct option loptions[] =
+    {
+        {"keep-tags",required_argument,NULL,'k'},
+        {"exclude",required_argument,NULL,'e'},
+        {"include",required_argument,NULL,'i'},
+        {"regions",required_argument,NULL,'r'},
+        {"regions-file",required_argument,NULL,'R'},
+        {"samples-file",required_argument,NULL,'S'},
+        {"output",required_argument,NULL,'o'},
+        {"output-type",required_argument,NULL,'O'},
+        {NULL,0,NULL,0}
+    };
+    int c;
+    while ((c = getopt_long(argc, argv, "vr:R:t:T:o:O:i:e:k:S:",loptions,NULL)) >= 0)
+    {
+        switch (c) 
+        {
+            case 'k': args->keep_tags = optarg; break;
+            case 'e': args->filter_str = optarg; args->filter_logic |= FLT_EXCLUDE; break;
+            case 'i': args->filter_str = optarg; args->filter_logic |= FLT_INCLUDE; break;
+            case 'T': args->target = optarg; args->target_is_file = 1; break;
+            case 't': args->target = optarg; break; 
+            case 'R': args->region = optarg; args->region_is_file = 1;  break;
+            case 'S': args->samples_fname = optarg; break;
+            case 'r': args->region = optarg; break; 
+            case 'o': args->output_dir = optarg; break;
+            case 'O':
+                      switch (optarg[0]) {
+                          case 'b': args->output_type = FT_BCF_GZ; break;
+                          case 'u': args->output_type = FT_BCF; break;
+                          case 'z': args->output_type = FT_VCF_GZ; break;
+                          case 'v': args->output_type = FT_VCF; break;
+                          default: error("The output type \"%s\" not recognised\n", optarg);
+                      }
+                      break;
+            case 'h':
+            case '?':
+            default: error("%s", usage_text()); break;
+        }
+    }
+    if ( optind==argc )
+    {
+        if ( !isatty(fileno((FILE *)stdin)) ) args->fname = "-";  // reading from stdin
+        else { error("%s", usage_text()); }
+    }
+    else if ( optind+1!=argc ) error("%s", usage_text());
+    else args->fname = argv[optind];
+
+    if ( !args->output_dir ) error("Missing the -o option\n");
+    if ( args->filter_logic == (FLT_EXCLUDE|FLT_INCLUDE) ) error("Only one of -i or -e can be given.\n");
+
+    init_data(args);
+    
+    while ( bcf_sr_next_line(args->sr) ) process(args);
+
+    destroy_data(args);
+    return 0;
+}
+
+
diff --git a/bcftools/plugins/split.c.pysam.c b/bcftools/plugins/split.c.pysam.c
new file mode 100644
index 0000000..140ab5d
--- /dev/null
+++ b/bcftools/plugins/split.c.pysam.c
@@ -0,0 +1,417 @@
+#include "bcftools.pysam.h"
+
+/* 
+    Copyright (C) 2017 Genome Research Ltd.
+
+    Author: Petr Danecek <pd3@sanger.ac.uk>
+
+    Permission is hereby granted, free of charge, to any person obtaining a copy
+    of this software and associated documentation files (the "Software"), to deal
+    in the Software without restriction, including without limitation the rights
+    to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+    copies of the Software, and to permit persons to whom the Software is
+    furnished to do so, subject to the following conditions:
+    
+    The above copyright notice and this permission notice shall be included in
+    all copies or substantial portions of the Software.
+    
+    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+    IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+    FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+    AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+    LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+    OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+    THE SOFTWARE.
+*/
+/*
+    Split VCF by sample(s)
+
+*/
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <strings.h>
+#include <getopt.h>
+#include <stdarg.h>
+#include <unistd.h>
+#include <stdint.h>
+#include <errno.h>
+#include <ctype.h>
+#include <htslib/vcf.h>
+#include <htslib/synced_bcf_reader.h>
+#include "bcftools.h"
+#include "filter.h"
+
+#define FLT_INCLUDE 1
+#define FLT_EXCLUDE 2
+
+typedef struct
+{
+    htsFile **fh;
+    filter_t *filter;
+    char *filter_str;
+    int filter_logic;   // one of FLT_INCLUDE/FLT_EXCLUDE (-i or -e)
+    uint8_t *info_tags, *fmt_tags;
+    int ninfo_tags, minfo_tags, nfmt_tags, mfmt_tags, keep_info, keep_fmt;
+    int argc, region_is_file, target_is_file, output_type;
+    char **argv, *region, *target, *fname, *output_dir, *keep_tags, **bnames, *samples_fname;
+    bcf_hdr_t *hdr_in, *hdr_out;
+    bcf_srs_t *sr;
+}
+args_t;
+
+const char *about(void)
+{
+    return "Split VCF by sample, creating single-sample VCFs\n";
+}
+
+static const char *usage_text(void)
+{
+    return 
+        "\n"
+        "About: Split VCF by sample, creating single-sample VCFs.\n"
+        "\n"
+        "Usage: bcftools +split [Options]\n"
+        "Plugin options:\n"
+        "   -e, --exclude EXPR              exclude sites for which the expression is true (applied on the outputs)\n"
+        "   -i, --include EXPR              include only sites for which the expression is true (applied on the outputs)\n"
+        "   -k, --keep-tags LIST            list of tags to keep. By default all tags are preserved\n"
+        "   -o, --output DIR                write output to the directory DIR\n"
+        "   -O, --output-type b|u|z|v       b: compressed BCF, u: uncompressed BCF, z: compressed VCF, v: uncompressed VCF [v]\n"
+        "   -r, --regions REGION            restrict to comma-separated list of regions\n"
+        "   -R, --regions-file FILE         restrict to regions listed in a file\n"
+        "   -S, --samples-file FILE         list of samples to keep with second (optional) column for basename of the new file\n"
+        "   -t, --targets REGION            similar to -r but streams rather than index-jumps\n"
+        "   -T, --targets-file FILE         similar to -R but streams rather than index-jumps\n"
+        "Examples:\n"
+        "   # Split a VCF file\n"
+        "   bcftools +split input.bcf -Ob -o dir\n"
+        "\n"
+        "   # Exclude sites with missing or hom-ref genotypes\n"
+        "   bcftools +split input.bcf -Ob -o dir -i'GT=\"alt\"'\n"
+        "\n"
+        "   # Keep all INFO tags but only GT and PL in FORMAT\n"
+        "   bcftools +split input.bcf -Ob -o dir -k INFO,FMT/GT,PL\n"
+        "\n"
+        "   # Keep all FORMAT tags but drop all INFO tags\n"
+        "   bcftools +split input.bcf -Ob -o dir -k FMT\n"
+        "\n";
+}
+
+void mkdir_p(const char *fmt, ...);
+
+char **set_file_base_names(args_t *args)
+{
+    int i, nsmpl = bcf_hdr_nsamples(args->hdr_in);
+    char **fnames = (char**) calloc(nsmpl,sizeof(char*));
+    if ( args->samples_fname )
+    {
+        kstring_t str = {0,0,0};
+        int nsamples = 0;
+        char **samples = NULL;
+        samples = hts_readlines(args->samples_fname, &nsamples);
+        for (i=0; i<nsamples; i++)
+        {
+            str.l = 0;
+            int escaped = 0;
+            char *ptr = samples[i];
+            while ( *ptr )
+            {
+                if ( *ptr=='\\' && !escaped ) { escaped = 1; ptr++; continue; }
+                if ( isspace(*ptr) && !escaped ) break;
+                kputc(*ptr, &str);
+                escaped = 0;
+                ptr++;
+            }
+            int idx = bcf_hdr_id2int(args->hdr_in, BCF_DT_SAMPLE, str.s);
+            if ( idx<0 )
+            {
+                fprintf(bcftools_stderr,"Warning: The sample \"%s\" is not present in %s\n", str.s,args->fname);
+                continue;
+            }
+            while ( *ptr && isspace(*ptr) ) ptr++;
+            if ( !*ptr )
+            {
+                fnames[idx] = strdup(str.s);
+                continue;
+            }
+            str.l = 0;
+            while ( *ptr )
+            {
+                if ( *ptr=='\\' && !escaped ) { escaped = 1; ptr++; continue; }
+                if ( isspace(*ptr) && !escaped ) break;
+                kputc(*ptr, &str);
+                escaped = 0;
+                ptr++;
+            }
+            fnames[idx] = strdup(str.s);
+        }
+        for (i=0; i<nsamples; i++) free(samples[i]);
+        free(samples);
+        free(str.s);
+    }
+    else
+    {
+        for (i=0; i<nsmpl; i++)
+            fnames[i] = strdup(args->hdr_in->samples[i]);
+    }
+    return fnames;
+}
+
+static void init_data(args_t *args)
+{
+    args->sr = bcf_sr_init();
+    if ( args->region )
+    {
+        args->sr->require_index = 1;
+        if ( bcf_sr_set_regions(args->sr, args->region, args->region_is_file)<0 ) error("Failed to read the regions: %s\n",args->region);
+    }
+    if ( args->target && bcf_sr_set_targets(args->sr, args->target, args->target_is_file, 0)<0 ) error("Failed to read the targets: %s\n",args->target);
+    if ( !bcf_sr_add_reader(args->sr,args->fname) ) error("Error: %s\n", bcf_sr_strerror(args->sr->errnum));
+    args->hdr_in  = bcf_sr_get_header(args->sr,0);
+    args->hdr_out = bcf_hdr_dup(args->hdr_in);
+
+    if ( args->filter_str )
+        args->filter = filter_init(args->hdr_in, args->filter_str);
+
+    mkdir_p("%s/",args->output_dir);
+
+    int i, nsmpl = bcf_hdr_nsamples(args->hdr_in);
+    if ( !nsmpl ) error("No samples to split: %s\n", args->fname);
+    args->fh = (htsFile**)calloc(nsmpl,sizeof(*args->fh));
+    args->bnames = set_file_base_names(args);
+    kstring_t str = {0,0,0};
+    for (i=0; i<nsmpl; i++)
+    {
+        if ( !args->bnames[i] ) continue;
+        str.l = 0;
+        kputs(args->output_dir, &str);
+        if ( str.s[str.l-1] != '/' ) kputc('/', &str);
+        int k, l = str.l;
+        kputs(args->bnames[i], &str);
+        for (k=l; k<str.l; k++) if ( isspace(str.s[k]) ) str.s[k] = '_';
+        if ( args->output_type & FT_BCF ) kputs(".bcf", &str);
+        else if ( args->output_type & FT_GZ ) kputs(".vcf.gz", &str);
+        else kputs(".vcf", &str);
+        args->fh[i] = hts_open(str.s, hts_bcf_wmode(args->output_type));
+        if ( args->fh[i] == NULL ) error("Can't write to \"%s\": %s\n", str.s, strerror(errno));
+        bcf_hdr_nsamples(args->hdr_out) = 1;
+        args->hdr_out->samples[0] = args->bnames[i];
+        bcf_hdr_write(args->fh[i], args->hdr_out);
+    }
+    free(str.s);
+
+    // parse tags
+    int is_info = 0, is_fmt = 0;
+    char *beg = args->keep_tags;
+    while ( beg && *beg )
+    {
+        if ( !strncasecmp("INFO/",beg,5) ) { is_info = 1; is_fmt = 0; beg += 5; }
+        else if ( !strcasecmp("INFO",beg) ) { args->keep_info = 1; break; }
+        else if ( !strncasecmp("INFO,",beg,5) ) { args->keep_info = 1; beg += 5; continue; }
+        else if ( !strncasecmp("FMT/",beg,4) ) { is_info = 0; is_fmt = 1; beg += 4; }
+        else if ( !strncasecmp("FORMAT/",beg,7) ) { is_info = 0; is_fmt = 1; beg += 7; }
+        else if ( !strcasecmp("FMT",beg) ) { args->keep_fmt = 1; break; }
+        else if ( !strcasecmp("FORMAT",beg) ) { args->keep_fmt = 1; break; }
+        else if ( !strncasecmp("FMT,",beg,4) ) { args->keep_fmt = 1; beg += 4; continue; }
+        else if ( !strncasecmp("FORMAT,",beg,7) ) { args->keep_fmt = 1; beg += 7; continue; }
+        char *end = beg;
+        while ( *end && *end!=',' ) end++;
+        char tmp = *end; *end = 0;
+        int id = bcf_hdr_id2int(args->hdr_in, BCF_DT_ID, beg);
+        beg = tmp ? end + 1 : end;
+        if ( is_info && bcf_hdr_idinfo_exists(args->hdr_in,BCF_HL_INFO,id) )
+        {
+            if ( id >= args->ninfo_tags ) args->ninfo_tags = id + 1;
+            hts_expand0(uint8_t, args->ninfo_tags, args->minfo_tags, args->info_tags);
+            args->info_tags[id] = 1;
+        }
+        if ( is_fmt && bcf_hdr_idinfo_exists(args->hdr_in,BCF_HL_FMT,id) )
+        {
+            if ( id >= args->nfmt_tags ) args->nfmt_tags = id + 1;
+            hts_expand0(uint8_t, args->nfmt_tags, args->mfmt_tags, args->fmt_tags);
+            args->fmt_tags[id] = 1;
+        }
+    }
+    if ( !args->keep_info && !args->keep_fmt && !args->ninfo_tags && !args->nfmt_tags )
+    {
+        args->keep_info = args->keep_fmt = 1;
+    }
+}
+static void destroy_data(args_t *args)
+{
+    free(args->info_tags);
+    free(args->fmt_tags);
+    if ( args->filter )
+        filter_destroy(args->filter);
+    int i, nsmpl = bcf_hdr_nsamples(args->hdr_in);
+    for (i=0; i<nsmpl; i++)
+    {
+        if ( args->fh[i] && hts_close(args->fh[i])!=0 ) error("Error: close failed!\n");
+        free(args->bnames[i]);
+    }
+    free(args->bnames);
+    free(args->fh);
+    bcf_sr_destroy(args->sr);
+    bcf_hdr_destroy(args->hdr_out);
+    free(args);
+}
+
+static bcf1_t *rec_set_info(args_t *args, bcf1_t *rec)
+{
+    bcf1_t *out = bcf_init1();
+    out->rid  = rec->rid;
+    out->pos  = rec->pos;
+    out->rlen = rec->rlen;
+    out->qual = rec->qual;
+    out->n_allele = rec->n_allele;
+    out->n_sample = 1;
+    if ( args->keep_info )
+    {
+        out->n_info = rec->n_info;
+        out->shared.m = out->shared.l = rec->shared.l;
+        out->shared.s = (char*) malloc(out->shared.l);
+        memcpy(out->shared.s,rec->shared.s,out->shared.l);
+        return out;
+    }
+
+    // build the BCF record
+    kstring_t tmp = {0,0,0};
+    char *ptr = rec->shared.s;
+    kputsn_(ptr, rec->unpack_size[0], &tmp); ptr += rec->unpack_size[0]; // ID
+    kputsn_(ptr, rec->unpack_size[1], &tmp); ptr += rec->unpack_size[1]; // REF+ALT
+    kputsn_(ptr, rec->unpack_size[2], &tmp);                             // FILTER
+    if ( args->ninfo_tags )
+    {
+        int i;
+        for (i=0; i<rec->n_info; i++)
+        {
+            bcf_info_t *info = &rec->d.info[i];
+            int id = info->key;
+            if ( !args->info_tags[id] ) continue;
+            kputsn_(info->vptr - info->vptr_off, info->vptr_len + info->vptr_off, &tmp);
+            out->n_info++;
+        }
+    }
+    out->shared.m = tmp.m;
+    out->shared.s = tmp.s;
+    out->shared.l = tmp.l;
+    out->unpacked = 0;
+    return out;
+}
+
+static bcf1_t *rec_set_format(args_t *args, bcf1_t *src, int ismpl, bcf1_t *dst)
+{
+    dst->n_fmt = 0;
+    kstring_t tmp = dst->indiv; tmp.l = 0;
+    int i;
+    for (i=0; i<src->n_fmt; i++)
+    {
+        bcf_fmt_t *fmt = &src->d.fmt[i];
+        int id = fmt->id;
+        if ( !args->keep_fmt && !args->fmt_tags[id] ) continue;
+
+        bcf_enc_int1(&tmp, id);
+        bcf_enc_size(&tmp, fmt->n, fmt->type);
+        kputsn_(fmt->p + ismpl*fmt->size, fmt->size, &tmp);
+
+        dst->n_fmt++;
+    }
+    dst->indiv = tmp;
+    return dst;
+}
+
+static void process(args_t *args)
+{
+    bcf1_t *rec = bcf_sr_get_line(args->sr,0);
+    bcf_unpack(rec, BCF_UN_ALL);
+
+    int i, site_pass = 1;
+    const uint8_t *smpl_pass = NULL;
+    if ( args->filter )
+    {
+        site_pass = filter_test(args->filter, rec, &smpl_pass);
+        if ( args->filter_logic & FLT_EXCLUDE ) site_pass = site_pass ? 0 : 1;
+    }
+    bcf1_t *out = NULL; 
+    for (i=0; i<rec->n_sample; i++)
+    {
+        if ( !args->fh[i] ) continue;
+        if ( !smpl_pass && !site_pass ) continue;
+        if ( smpl_pass )
+        {
+            int pass = args->filter_logic & FLT_EXCLUDE ? ( smpl_pass[i] ? 0 : 1) : smpl_pass[i];
+            if ( !pass ) continue;
+        }
+        if ( !out ) out = rec_set_info(args, rec);
+        rec_set_format(args, rec, i, out);
+        bcf_write(args->fh[i], args->hdr_out, out);
+    }
+    if ( out ) bcf_destroy(out);
+}
+
+int run(int argc, char **argv)
+{
+    args_t *args = (args_t*) calloc(1,sizeof(args_t));
+    args->argc   = argc; args->argv = argv;
+    args->output_type  = FT_VCF;
+    static struct option loptions[] =
+    {
+        {"keep-tags",required_argument,NULL,'k'},
+        {"exclude",required_argument,NULL,'e'},
+        {"include",required_argument,NULL,'i'},
+        {"regions",required_argument,NULL,'r'},
+        {"regions-file",required_argument,NULL,'R'},
+        {"samples-file",required_argument,NULL,'S'},
+        {"output",required_argument,NULL,'o'},
+        {"output-type",required_argument,NULL,'O'},
+        {NULL,0,NULL,0}
+    };
+    int c;
+    while ((c = getopt_long(argc, argv, "vr:R:t:T:o:O:i:e:k:S:",loptions,NULL)) >= 0)
+    {
+        switch (c) 
+        {
+            case 'k': args->keep_tags = optarg; break;
+            case 'e': args->filter_str = optarg; args->filter_logic |= FLT_EXCLUDE; break;
+            case 'i': args->filter_str = optarg; args->filter_logic |= FLT_INCLUDE; break;
+            case 'T': args->target = optarg; args->target_is_file = 1; break;
+            case 't': args->target = optarg; break; 
+            case 'R': args->region = optarg; args->region_is_file = 1;  break;
+            case 'S': args->samples_fname = optarg; break;
+            case 'r': args->region = optarg; break; 
+            case 'o': args->output_dir = optarg; break;
+            case 'O':
+                      switch (optarg[0]) {
+                          case 'b': args->output_type = FT_BCF_GZ; break;
+                          case 'u': args->output_type = FT_BCF; break;
+                          case 'z': args->output_type = FT_VCF_GZ; break;
+                          case 'v': args->output_type = FT_VCF; break;
+                          default: error("The output type \"%s\" not recognised\n", optarg);
+                      }
+                      break;
+            case 'h':
+            case '?':
+            default: error("%s", usage_text()); break;
+        }
+    }
+    if ( optind==argc )
+    {
+        if ( !isatty(fileno((FILE *)stdin)) ) args->fname = "-";  // reading from stdin
+        else { error("%s", usage_text()); }
+    }
+    else if ( optind+1!=argc ) error("%s", usage_text());
+    else args->fname = argv[optind];
+
+    if ( !args->output_dir ) error("Missing the -o option\n");
+    if ( args->filter_logic == (FLT_EXCLUDE|FLT_INCLUDE) ) error("Only one of -i or -e can be given.\n");
+
+    init_data(args);
+    
+    while ( bcf_sr_next_line(args->sr) ) process(args);
+
+    destroy_data(args);
+    return 0;
+}
+
+
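Review note: the sample-file parsing in the hunk above reads up to two whitespace-separated tokens per line (sample name, then an optional output base name) and lets a backslash escape whitespace inside a token. The following is a minimal standalone sketch of that tokenising rule, not the plugin's code: `next_token` and `skip_space` are hypothetical helper names, and a fixed-size buffer stands in for the `kstring_t` used in the plugin.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper, not part of bcftools: copy the next token from `src`
 * into `dst` (capacity `cap`, including the terminating NUL). A backslash
 * makes the following character literal, so "\ " yields a space inside the
 * token; an unescaped whitespace character ends the token.
 * Returns a pointer to the first character after the token. */
static const char *next_token(const char *src, char *dst, size_t cap)
{
    assert(src && dst && cap > 0);
    size_t n = 0;
    int escaped = 0;
    while ( *src )
    {
        if ( *src=='\\' && !escaped ) { escaped = 1; src++; continue; }
        if ( isspace((unsigned char)*src) && !escaped ) break;
        if ( n + 1 < cap ) dst[n++] = *src;   /* silently truncate on overflow */
        escaped = 0;
        src++;
    }
    dst[n] = '\0';
    return src;
}

/* Hypothetical helper: skip the whitespace that separates two tokens. */
static const char *skip_space(const char *src)
{
    assert(src);
    while ( *src && isspace((unsigned char)*src) ) src++;
    return src;
}
```

Calling `next_token`, then `skip_space`, then `next_token` again on one line mirrors the two passes in the hunk: if nothing remains after the first token, the sample name doubles as the file base name.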
diff --git a/bcftools/plugins/tag2tag.c b/bcftools/plugins/tag2tag.c

new file mode 100644 (file)

index 0000000..6fd2255
--- /dev/null
+++ b/bcftools/plugins/tag2tag.c
@@ -0,0 +1,254 @@
+/*  plugins/tag2tag.c -- convert between similar tags
+
+    Copyright (C) 2014-2016 Genome Research Ltd.
+
+    Author: Petr Danecek <pd3@sanger.ac.uk>
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE.  */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <getopt.h>
+#include <math.h>
+#include <htslib/hts.h>
+#include <htslib/vcf.h>
+#include "bcftools.h"
+
+
+#define GP_TO_GL 1
+#define GL_TO_PL 2
+#define GP_TO_GT 3
+#define PL_TO_GL 4
+
+static int mode = 0, drop_source_tag = 0;
+static bcf_hdr_t *in_hdr, *out_hdr;
+static float *farr = NULL, thresh = 0.1;
+static int32_t *iarr = NULL;
+static int mfarr = 0, miarr = 0;
+
+const char *about(void)
+{
+    return "Convert between similar tags, such as GL, PL and GP.\n";
+}
+
+const char *usage(void)
+{
+    return 
+        "\n"
+        "About: Convert between similar tags, such as GL, PL and GP.\n"
+        "Usage: bcftools +tag2tag [General Options] -- [Plugin Options]\n"
+        "Options:\n"
+        "   run \"bcftools plugin\" for a list of common options\n"
+        "\n"
+        "Plugin options:\n"
+        "       --gp-to-gl           convert FORMAT/GP to FORMAT/GL\n"
+        "       --gp-to-gt           convert FORMAT/GP to FORMAT/GT by taking argmax of GP\n"
+        "       --gl-to-pl           convert FORMAT/GL to FORMAT/PL\n"
+        "       --pl-to-gl           convert FORMAT/PL to FORMAT/GL\n"
+        "   -r, --replace            drop the source tag\n"
+        "   -t, --threshold <float>  threshold for GP to GT hard-call [0.1]\n"
+        "\n"
+        "Example:\n"
+        "   bcftools +tag2tag in.vcf -- -r --gp-to-gl\n"
+        "\n";
+}
+
+
+static void init_header(bcf_hdr_t *hdr, const char *ori, int ori_type, const char *new_hdr_line)
+{
+    if ( ori )
+        bcf_hdr_remove(hdr,ori_type,ori);
+
+    bcf_hdr_append(hdr, new_hdr_line);
+}
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+    static struct option loptions[] =
+    {
+        {"replace",no_argument,NULL,'r'},
+        {"gp-to-gl",no_argument,NULL,1},
+        {"gl-to-pl",no_argument,NULL,2},
+        {"gp-to-gt",no_argument,NULL,3},
+        {"pl-to-gl",no_argument,NULL,4},
+        {"threshold",required_argument,NULL,'t'},
+        {NULL,0,NULL,0}
+    };
+    int c;
+    char *src_tag = "GP";
+    while ((c = getopt_long(argc, argv, "?hrt:",loptions,NULL)) >= 0)
+    {
+        switch (c) 
+        {
+            case  1 : src_tag = "GP"; mode = GP_TO_GL; break;
+            case  2 : src_tag = "GL"; mode = GL_TO_PL; break;
+            case  3 : src_tag = "GP"; mode = GP_TO_GT; break;
+            case  4 : src_tag = "PL"; mode = PL_TO_GL; break;
+            case 'r': drop_source_tag = 1; break;
+            case 't': thresh = atof(optarg); break;
+            case 'h':
+            case '?':
+            default: error("%s", usage()); break;
+        }
+    }
+    if ( !mode ) mode = GP_TO_GL;
+
+    in_hdr  = in;
+    out_hdr = out;
+
+    if ( mode==GP_TO_GL )
+        init_header(out_hdr,drop_source_tag?"GP":NULL,BCF_HL_FMT,"##FORMAT=<ID=GL,Number=G,Type=Float,Description=\"Genotype Likelihoods\">");
+    else if ( mode==GL_TO_PL )
+        init_header(out_hdr,drop_source_tag?"GL":NULL,BCF_HL_FMT,"##FORMAT=<ID=PL,Number=G,Type=Integer,Description=\"Phred-scaled genotype likelihoods\">");
+    else if ( mode==PL_TO_GL )
+        init_header(out_hdr,drop_source_tag?"PL":NULL,BCF_HL_FMT,"##FORMAT=<ID=GL,Number=G,Type=Float,Description=\"Genotype likelihoods\">");
+    else if ( mode==GP_TO_GT ) {
+        if (thresh<0||thresh>1) error("--threshold must be in the range [0,1]: %f\n", thresh);
+        init_header(out_hdr,drop_source_tag?"GP":NULL,BCF_HL_FMT,"##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">");
+    }
+
+    int tag_id;
+    if ( (tag_id=bcf_hdr_id2int(in_hdr,BCF_DT_ID,src_tag))<0 || !bcf_hdr_idinfo_exists(in_hdr,BCF_HL_FMT,tag_id) )
+        error("The source tag does not exist: %s\n", src_tag);
+
+    return 0;
+}
+
+bcf1_t *process(bcf1_t *rec)
+{
+    int i, n;
+    if ( mode==GP_TO_GL )
+    {
+        n = bcf_get_format_float(in_hdr,rec,"GP",&farr,&mfarr);
+        if ( n<=0 ) return rec;
+        for (i=0; i<n; i++)
+        {
+            if ( bcf_float_is_missing(farr[i]) || bcf_float_is_vector_end(farr[i]) ) continue;
+            farr[i] = farr[i] ? log10(farr[i]) : -99;
+        }
+        bcf_update_format_float(out_hdr,rec,"GL",farr,n);
+        if ( drop_source_tag )
+            bcf_update_format_float(out_hdr,rec,"GP",NULL,0);
+    }
+    else if ( mode==PL_TO_GL )
+    {
+        n = bcf_get_format_int32(in_hdr,rec,"PL",&iarr,&miarr);
+        if ( n<=0 ) return rec;
+        hts_expand(float, n, mfarr, farr);
+        for (i=0; i<n; i++)
+        {
+            if ( iarr[i]==bcf_int32_missing )
+                bcf_float_set_missing(farr[i]);
+            else if ( iarr[i]==bcf_int32_vector_end )
+                bcf_float_set_vector_end(farr[i]);
+            else
+                farr[i] = -0.1 * iarr[i];
+        }
+        bcf_update_format_float(out_hdr,rec,"GL",farr,n);
+        if ( drop_source_tag )
+            bcf_update_format_int32(out_hdr,rec,"PL",NULL,0);
+    }
+    else if ( mode==GL_TO_PL )
+    {
+        n = bcf_get_format_float(in_hdr,rec,"GL",&farr,&mfarr);
+        if ( n<=0 ) return rec;
+        hts_expand(int32_t, n, miarr, iarr);
+        for (i=0; i<n; i++)
+        {
+            if ( bcf_float_is_missing(farr[i]) )
+                iarr[i] = bcf_int32_missing;
+            else if ( bcf_float_is_vector_end(farr[i]) )
+                iarr[i] = bcf_int32_vector_end;
+            else
+                iarr[i] = lroundf(-10 * farr[i]);
+        }
+        bcf_update_format_int32(out_hdr,rec,"PL",iarr,n);
+        if ( drop_source_tag )
+            bcf_update_format_float(out_hdr,rec,"GL",NULL,0);
+    }
+    else if ( mode==GP_TO_GT )
+    {
+        int nals  = rec->n_allele;
+        int nsmpl = bcf_hdr_nsamples(in_hdr);
+        hts_expand(int32_t,nsmpl*2,miarr,iarr);
+
+        n = bcf_get_format_float(in_hdr,rec,"GP",&farr,&mfarr);
+        if ( n<=0 ) return rec;
+
+        n /= nsmpl;
+        for (i=0; i<nsmpl; i++)
+        {
+            float *ptr = farr + i*n;
+            if ( bcf_float_is_missing(ptr[0]) )
+            {
+                iarr[2*i] = iarr[2*i+1] = bcf_gt_missing;
+                continue;
+            }
+
+            int j, jmax = 0;
+            for (j=1; j<n; j++)
+            {
+                if ( bcf_float_is_missing(ptr[j]) || bcf_float_is_vector_end(ptr[j]) ) break;
+                if ( ptr[j] > ptr[jmax] ) jmax = j;
+            }
+
+            // haploid genotype
+            if ( j==nals )
+            {
+                iarr[2*i]   = ptr[jmax] < 1-thresh ? bcf_gt_missing : bcf_gt_unphased(jmax);
+                iarr[2*i+1] = bcf_int32_vector_end;
+                continue;
+            }
+
+            if ( j!=nals*(nals+1)/2 )
+                error("Wrong number of GP values for diploid genotype at %s:%d, expected %d, found %d\n",
+                    bcf_seqname(in_hdr,rec),rec->pos+1, nals*(nals+1)/2,j);
+
+            if (ptr[jmax] < 1-thresh)
+            {
+                iarr[2*i] = iarr[2*i+1] = bcf_gt_missing;
+                continue;
+            }
+
+            // most common case: RR
+            if ( jmax==0 )
+            {
+                iarr[2*i] = iarr[2*i+1] = bcf_gt_unphased(0);
+                continue;
+            }
+
+            int a,b;
+            bcf_gt2alleles(jmax,&a,&b);
+            iarr[2*i]   = bcf_gt_unphased(a);
+            iarr[2*i+1] = bcf_gt_unphased(b);
+        }
+        bcf_update_genotypes(out_hdr,rec,iarr,nsmpl*2);
+        if ( drop_source_tag )
+            bcf_update_format_float(out_hdr,rec,"GP",NULL,0);
+    }
+    return rec;
+}
+
+void destroy(void)
+{
+    free(farr);
+    free(iarr);
+}
+
+
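Review note: the PL_TO_GL and GL_TO_PL branches in the file above reduce to two one-line formulas, GL = -0.1 * PL and PL = round(-10 * GL), applied after the missing and vector-end sentinels have been handled. The sketch below isolates only that arithmetic. The function names `pl_to_gl` and `gl_to_pl` are hypothetical, and the rounding is done by hand rather than with `lroundf()` (which the plugin uses) so that the sketch does not need libm.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: Phred-scaled likelihood -> log10 likelihood,
 * GL = -PL/10. Sentinel values (missing, vector end) must be filtered
 * out by the caller, as the plugin does before converting. */
static float pl_to_gl(int32_t pl)
{
    assert(pl >= 0);
    return -0.1f * (float)pl;
}

/* Hypothetical helper: log10 likelihood -> Phred-scaled likelihood,
 * PL = round(-10*GL). Rounds half away from zero by hand; this is an
 * assumption standing in for the lroundf() call in the plugin. */
static int32_t gl_to_pl(float gl)
{
    float x = -10.0f * gl;
    return x >= 0 ? (int32_t)(x + 0.5f) : -(int32_t)(-x + 0.5f);
}
```

Because PL is an integer, GL to PL is lossy while PL to GL followed by GL to PL returns the starting value for typical PL ranges.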
diff --git a/bcftools/plugins/tag2tag.c.pysam.c b/bcftools/plugins/tag2tag.c.pysam.c

new file mode 100644 (file)

index 0000000..8d8d13b
--- /dev/null
+++ b/bcftools/plugins/tag2tag.c.pysam.c
@@ -0,0 +1,256 @@
+#include "bcftools.pysam.h"
+
+/*  plugins/tag2tag.c -- convert between similar tags
+
+    Copyright (C) 2014-2016 Genome Research Ltd.
+
+    Author: Petr Danecek <pd3@sanger.ac.uk>
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE.  */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <getopt.h>
+#include <math.h>
+#include <htslib/hts.h>
+#include <htslib/vcf.h>
+#include "bcftools.h"
+
+
+#define GP_TO_GL 1
+#define GL_TO_PL 2
+#define GP_TO_GT 3
+#define PL_TO_GL 4
+
+static int mode = 0, drop_source_tag = 0;
+static bcf_hdr_t *in_hdr, *out_hdr;
+static float *farr = NULL, thresh = 0.1;
+static int32_t *iarr = NULL;
+static int mfarr = 0, miarr = 0;
+
+const char *about(void)
+{
+    return "Convert between similar tags, such as GL, PL and GP.\n";
+}
+
+const char *usage(void)
+{
+    return 
+        "\n"
+        "About: Convert between similar tags, such as GL, PL and GP.\n"
+        "Usage: bcftools +tag2tag [General Options] -- [Plugin Options]\n"
+        "Options:\n"
+        "   run \"bcftools plugin\" for a list of common options\n"
+        "\n"
+        "Plugin options:\n"
+        "       --gp-to-gl           convert FORMAT/GP to FORMAT/GL\n"
+        "       --gp-to-gt           convert FORMAT/GP to FORMAT/GT by taking argmax of GP\n"
+        "       --gl-to-pl           convert FORMAT/GL to FORMAT/PL\n"
+        "       --pl-to-gl           convert FORMAT/PL to FORMAT/GL\n"
+        "   -r, --replace            drop the source tag\n"
+        "   -t, --threshold <float>  threshold for GP to GT hard-call [0.1]\n"
+        "\n"
+        "Example:\n"
+        "   bcftools +tag2tag in.vcf -- -r --gp-to-gl\n"
+        "\n";
+}
+
+
+static void init_header(bcf_hdr_t *hdr, const char *ori, int ori_type, const char *new_hdr_line)
+{
+    if ( ori )
+        bcf_hdr_remove(hdr,ori_type,ori);
+
+    bcf_hdr_append(hdr, new_hdr_line);
+}
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+    static struct option loptions[] =
+    {
+        {"replace",no_argument,NULL,'r'},
+        {"gp-to-gl",no_argument,NULL,1},
+        {"gl-to-pl",no_argument,NULL,2},
+        {"gp-to-gt",no_argument,NULL,3},
+        {"pl-to-gl",no_argument,NULL,4},
+        {"threshold",required_argument,NULL,'t'},
+        {NULL,0,NULL,0}
+    };
+    int c;
+    char *src_tag = "GP";
+    while ((c = getopt_long(argc, argv, "?hrt:",loptions,NULL)) >= 0)
+    {
+        switch (c) 
+        {
+            case  1 : src_tag = "GP"; mode = GP_TO_GL; break;
+            case  2 : src_tag = "GL"; mode = GL_TO_PL; break;
+            case  3 : src_tag = "GP"; mode = GP_TO_GT; break;
+            case  4 : src_tag = "PL"; mode = PL_TO_GL; break;
+            case 'r': drop_source_tag = 1; break;
+            case 't': thresh = atof(optarg); break;
+            case 'h':
+            case '?':
+            default: error("%s", usage()); break;
+        }
+    }
+    if ( !mode ) mode = GP_TO_GL;
+
+    in_hdr  = in;
+    out_hdr = out;
+
+    if ( mode==GP_TO_GL )
+        init_header(out_hdr,drop_source_tag?"GP":NULL,BCF_HL_FMT,"##FORMAT=<ID=GL,Number=G,Type=Float,Description=\"Genotype Likelihoods\">");
+    else if ( mode==GL_TO_PL )
+        init_header(out_hdr,drop_source_tag?"GL":NULL,BCF_HL_FMT,"##FORMAT=<ID=PL,Number=G,Type=Integer,Description=\"Phred-scaled genotype likelihoods\">");
+    else if ( mode==PL_TO_GL )
+        init_header(out_hdr,drop_source_tag?"PL":NULL,BCF_HL_FMT,"##FORMAT=<ID=GL,Number=G,Type=Float,Description=\"Genotype likelihoods\">");
+    else if ( mode==GP_TO_GT ) {
+        if (thresh<0||thresh>1) error("--threshold must be in the range [0,1]: %f\n", thresh);
+        init_header(out_hdr,drop_source_tag?"GP":NULL,BCF_HL_FMT,"##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">");
+    }
+
+    int tag_id;
+    if ( (tag_id=bcf_hdr_id2int(in_hdr,BCF_DT_ID,src_tag))<0 || !bcf_hdr_idinfo_exists(in_hdr,BCF_HL_FMT,tag_id) )
+        error("The source tag does not exist: %s\n", src_tag);
+
+    return 0;
+}
+
+bcf1_t *process(bcf1_t *rec)
+{
+    int i, n;
+    if ( mode==GP_TO_GL )
+    {
+        n = bcf_get_format_float(in_hdr,rec,"GP",&farr,&mfarr);
+        if ( n<=0 ) return rec;
+        for (i=0; i<n; i++)
+        {
+            if ( bcf_float_is_missing(farr[i]) || bcf_float_is_vector_end(farr[i]) ) continue;
+            farr[i] = farr[i] ? log10(farr[i]) : -99;
+        }
+        bcf_update_format_float(out_hdr,rec,"GL",farr,n);
+        if ( drop_source_tag )
+            bcf_update_format_float(out_hdr,rec,"GP",NULL,0);
+    }
+    else if ( mode==PL_TO_GL )
+    {
+        n = bcf_get_format_int32(in_hdr,rec,"PL",&iarr,&miarr);
+        if ( n<=0 ) return rec;
+        hts_expand(float, n, mfarr, farr);
+        for (i=0; i<n; i++)
+        {
+            if ( iarr[i]==bcf_int32_missing )
+                bcf_float_set_missing(farr[i]);
+            else if ( iarr[i]==bcf_int32_vector_end )
+                bcf_float_set_vector_end(farr[i]);
+            else
+                farr[i] = -0.1 * iarr[i];
+        }
+        bcf_update_format_float(out_hdr,rec,"GL",farr,n);
+        if ( drop_source_tag )
+            bcf_update_format_int32(out_hdr,rec,"PL",NULL,0);
+    }
+    else if ( mode==GL_TO_PL )
+    {
+        n = bcf_get_format_float(in_hdr,rec,"GL",&farr,&mfarr);
+        if ( n<=0 ) return rec;
+        hts_expand(int32_t, n, miarr, iarr);
+        for (i=0; i<n; i++)
+        {
+            if ( bcf_float_is_missing(farr[i]) )
+                iarr[i] = bcf_int32_missing;
+            else if ( bcf_float_is_vector_end(farr[i]) )
+                iarr[i] = bcf_int32_vector_end;
+            else
+                iarr[i] = lroundf(-10 * farr[i]);
+        }
+        bcf_update_format_int32(out_hdr,rec,"PL",iarr,n);
+        if ( drop_source_tag )
+            bcf_update_format_float(out_hdr,rec,"GL",NULL,0);
+    }
+    else if ( mode==GP_TO_GT )
+    {
+        int nals  = rec->n_allele;
+        int nsmpl = bcf_hdr_nsamples(in_hdr);
+        hts_expand(int32_t,nsmpl*2,miarr,iarr);
+
+        n = bcf_get_format_float(in_hdr,rec,"GP",&farr,&mfarr);
+        if ( n<=0 ) return rec;
+
+        n /= nsmpl;
+        for (i=0; i<nsmpl; i++)
+        {
+            float *ptr = farr + i*n;
+            if ( bcf_float_is_missing(ptr[0]) )
+            {
+                iarr[2*i] = iarr[2*i+1] = bcf_gt_missing;
+                continue;
+            }
+
+            int j, jmax = 0;
+            for (j=1; j<n; j++)
+            {
+                if ( bcf_float_is_missing(ptr[j]) || bcf_float_is_vector_end(ptr[j]) ) break;
+                if ( ptr[j] > ptr[jmax] ) jmax = j;
+            }
+
+            // haploid genotype
+            if ( j==nals )
+            {
+                iarr[2*i]   = ptr[jmax] < 1-thresh ? bcf_gt_missing : bcf_gt_unphased(jmax);
+                iarr[2*i+1] = bcf_int32_vector_end;
+                continue;
+            }
+
+            if ( j!=nals*(nals+1)/2 )
+                error("Wrong number of GP values for diploid genotype at %s:%d, expected %d, found %d\n",
+                    bcf_seqname(in_hdr,rec),rec->pos+1, nals*(nals+1)/2,j);
+
+            if (ptr[jmax] < 1-thresh)
+            {
+                iarr[2*i] = iarr[2*i+1] = bcf_gt_missing;
+                continue;
+            }
+
+            // most common case: RR
+            if ( jmax==0 )
+            {
+                iarr[2*i] = iarr[2*i+1] = bcf_gt_unphased(0);
+                continue;
+            }
+
+            int a,b;
+            bcf_gt2alleles(jmax,&a,&b);
+            iarr[2*i]   = bcf_gt_unphased(a);
+            iarr[2*i+1] = bcf_gt_unphased(b);
+        }
+        bcf_update_genotypes(out_hdr,rec,iarr,nsmpl*2);
+        if ( drop_source_tag )
+            bcf_update_format_float(out_hdr,rec,"GP",NULL,0);
+    }
+    return rec;
+}
+
+void destroy(void)
+{
+    free(farr);
+    free(iarr);
+}
+
+
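Review note: the GP_TO_GT branch above picks the genotype with the largest posterior, sets the call to missing when that posterior is below 1 - threshold, and otherwise maps the genotype index back to a pair of alleles. The sketch below illustrates both steps for the diploid case under the VCF genotype ordering (index = b*(b+1)/2 + a with a <= b). `hard_call` and `gt_index_to_alleles` are hypothetical names; the plugin itself relies on htslib's `bcf_gt2alleles` for the index mapping and also handles haploid and missing values, which this sketch omits.

```c
#include <assert.h>

/* Hypothetical helper: return the index of the most probable genotype,
 * or -1 when its probability is below 1 - thresh (call set to missing). */
static int hard_call(const float *gp, int n, float thresh)
{
    assert(gp && n > 0);
    int j, jmax = 0;
    for (j=1; j<n; j++)
        if ( gp[j] > gp[jmax] ) jmax = j;
    return gp[jmax] < 1 - thresh ? -1 : jmax;
}

/* Hypothetical helper: map a diploid genotype index to alleles a <= b,
 * assuming the VCF ordering 0/0, 0/1, 1/1, 0/2, 1/2, 2/2, ... */
static void gt_index_to_alleles(int igt, int *a, int *b)
{
    assert(igt >= 0 && a && b);
    int hi = 0;
    /* advance to the largest `hi` whose first genotype index is <= igt */
    while ( (hi+1)*(hi+2)/2 <= igt ) hi++;
    *b = hi;
    *a = igt - hi*(hi+1)/2;
}
```

With the default threshold of 0.1, a sample with GP = 0.02, 0.97, 0.01 is called as genotype index 1 (alleles 0 and 1), while GP = 0.5, 0.4, 0.1 stays uncalled.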
diff --git a/bcftools/plugins/trio-stats.c b/bcftools/plugins/trio-stats.c

new file mode 100644 (file)

index 0000000..b77d1df
--- /dev/null
+++ b/bcftools/plugins/trio-stats.c
@@ -0,0 +1,554 @@
+/* The MIT License
+
+   Copyright (c) 2018 Genome Research Ltd.
+
+   Author: Petr Danecek <pd3@sanger.ac.uk>
+   
+   Permission is hereby granted, free of charge, to any person obtaining a copy
+   of this software and associated documentation files (the "Software"), to deal
+   in the Software without restriction, including without limitation the rights
+   to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+   copies of the Software, and to permit persons to whom the Software is
+   furnished to do so, subject to the following conditions:
+   
+   The above copyright notice and this permission notice shall be included in
+   all copies or substantial portions of the Software.
+   
+   THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+   IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+   FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+   AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+   LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+   OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+   THE SOFTWARE.
+
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <getopt.h>
+#include <unistd.h>     // for isatty
+#include <htslib/hts.h>
+#include <htslib/vcf.h>
+#include <htslib/kstring.h>
+#include <htslib/kseq.h>
+#include <htslib/synced_bcf_reader.h>
+#include <htslib/vcfutils.h>
+#include "bcftools.h"
+#include "filter.h"
+
+
+// Logic of the filters: include or exclude sites which match the filters?
+#define FLT_INCLUDE 1
+#define FLT_EXCLUDE 2
+
+#define iCHILD  0
+#define iFATHER 1
+#define iMOTHER 2
+
+typedef struct
+{
+    int idx[3];     // VCF sample index for child, father and mother (see iCHILD, iFATHER, iMOTHER)
+    int pass;       // do all three pass the filters?
+}
+trio_t;
+
+typedef struct
+{
+    uint32_t
+        npass,          // number of genotypes passing the filter
+        nnon_ref,       // number of non-reference genotypes
+        nmendel_err,    // number of mendelian errors
+        nnovel,         // a singleton allele, but observed only in the child. Counted as mendel_err as well.
+        nsingleton,     // het mother or father different from everyone else
+        ndoubleton,     // het mother+child or father+child different from everyone else
+        nts, ntv;       // number of transitions and transversions
+}
+trio_stats_t;
+
+typedef struct
+{
+    trio_stats_t *stats;
+    filter_t *filter;
+    char *expr;
+}
+flt_stats_t;
+
+typedef struct
+{
+    int argc, filter_logic, regions_is_file, targets_is_file;
+    int nflt_str;
+    char *filter_str, **flt_str;
+    char **argv, *ped_fname, *output_fname, *fname, *regions, *targets;
+    bcf_srs_t *sr;
+    bcf_hdr_t *hdr;
+    trio_t *trio;
+    int ntrio, mtrio;
+    flt_stats_t *filters;
+    int nfilters;
+    int32_t *gt_arr, *ac, *ac_trio;
+    int mgt_arr, mac, mac_trio;
+}
+args_t;
+
+args_t args;
+
+const char *about(void)
+{
+    return "Calculate transmission rate and other stats in trio children.\n";
+}
+
+static const char *usage_text(void)
+{
+    return 
+        "\n"
+        "About: Calculate transmission rate in trio children. Use curly brackets to scan\n"
+        "       a range of values simultaneously\n"
+        "Usage: bcftools +trio-stats [Plugin Options]\n"
+        "Plugin options:\n"
+        "   -e, --exclude EXPR          exclude sites and samples for which the expression is true\n"
+        "   -i, --include EXPR          include sites and samples for which the expression is true\n"
+        "   -o, --output FILE           output file name [stdout]\n"
+        "   -p, --ped FILE              PED file\n"
+        "   -r, --regions REG           restrict to comma-separated list of regions\n"
+        "   -R, --regions-file FILE     restrict to regions listed in a file\n"
+        "   -t, --targets REG           similar to -r but streams rather than index-jumps\n"
+        "   -T, --targets-file FILE     similar to -R but streams rather than index-jumps\n"
+        "\n"
+        "Example:\n"
+        "   bcftools +trio-stats -p file.ped -i 'GQ>{10,20,30,40,50}' file.bcf\n"
+        "\n";
+}
+
+static int cmp_trios(const void *_a, const void *_b)
+{
+    trio_t *a = (trio_t *) _a;
+    trio_t *b = (trio_t *) _b;
+    int i;
+    int amin = a->idx[0];
+    for (i=1; i<3; i++)
+        if ( amin > a->idx[i] ) amin = a->idx[i];
+    int bmin = b->idx[0];
+    for (i=1; i<3; i++)
+        if ( bmin > b->idx[i] ) bmin = b->idx[i];
+    if ( amin < bmin ) return -1;
+    if ( amin > bmin ) return 1;
+    return 0;
+}
+
+static void parse_ped(args_t *args, char *fname)
+{
+    htsFile *fp = hts_open(fname, "r");
+    if ( !fp ) error("Could not read: %s\n", fname);
+
+    kstring_t str = {0,0,0};
+    if ( hts_getline(fp, KS_SEP_LINE, &str) <= 0 ) error("Empty file: %s\n", fname);
+
+    int moff = 0, *off = NULL;
+    do
+    {
+        // familyID    sampleID paternalID maternalID sex   phenotype   population relationship   siblings   secondOrder   thirdOrder   children    comment
+        // BB03    HG01884 HG01885 HG01956 2   0   ACB child   0   0   0   0
+        int ncols = ksplit_core(str.s,0,&moff,&off);
+        if ( ncols<4 ) error("Could not parse the ped file: %s\n", str.s);
+
+        int father = bcf_hdr_id2int(args->hdr,BCF_DT_SAMPLE,&str.s[off[2]]);
+        if ( father<0 ) continue;
+        int mother = bcf_hdr_id2int(args->hdr,BCF_DT_SAMPLE,&str.s[off[3]]);
+        if ( mother<0 ) continue;
+        int child = bcf_hdr_id2int(args->hdr,BCF_DT_SAMPLE,&str.s[off[1]]);
+        if ( child<0 ) continue;
+
+        args->ntrio++;
+        hts_expand0(trio_t,args->ntrio,args->mtrio,args->trio);
+        trio_t *trio = &args->trio[args->ntrio-1];
+        trio->idx[iFATHER] = father;
+        trio->idx[iMOTHER] = mother;
+        trio->idx[iCHILD]  = child;
+    }
+    while ( hts_getline(fp, KS_SEP_LINE, &str)>=0 );
+
+    fprintf(stderr,"Identified %d complete trios in the VCF file\n", args->ntrio);
+
+    // sort the trios by their lowest sample index so that samples are accessed more or less sequentially
+    qsort(args->trio,args->ntrio,sizeof(trio_t),cmp_trios);
+    
+    free(str.s);
+    free(off);
+    hts_close(fp);
+}
+
+static void parse_filters(args_t *args)
+{
+    if ( !args->filter_str ) return;
+    int mflt = 1;
+    args->nflt_str = 1;
+    args->flt_str  = (char**) malloc(sizeof(char*));
+    args->flt_str[0] = strdup(args->filter_str);
+    while (1)
+    {
+        int i, expanded = 0;
+        for (i=args->nflt_str-1; i>=0; i--)
+        {
+            char *exp_beg = strchr(args->flt_str[i], '{');
+            if ( !exp_beg ) continue;
+            char *exp_end = strchr(exp_beg+1, '}');
+            if ( !exp_end ) error("Could not parse the expression: %s\n", args->filter_str);
+            char *beg = exp_beg+1, *mid = beg;
+            while ( mid<exp_end )
+            {
+                while ( mid<exp_end && *mid!=',' ) mid++;
+                kstring_t tmp = {0,0,0};
+                kputsn(args->flt_str[i], exp_beg - args->flt_str[i], &tmp);
+                kputsn(beg, mid - beg, &tmp);
+                kputs(exp_end+1, &tmp);
+                args->nflt_str++;
+                hts_expand(char*, args->nflt_str, mflt, args->flt_str);
+                args->flt_str[args->nflt_str-1] = tmp.s;
+                beg = ++mid;
+            }
+            expanded = 1;
+            free(args->flt_str[i]);
+            memmove(&args->flt_str[i], &args->flt_str[i+1], (args->nflt_str-i-1)*sizeof(*args->flt_str));
+            args->nflt_str--;
+            args->flt_str[args->nflt_str] = NULL;
+        }
+        if ( !expanded ) break;
+    }
+    
+    fprintf(stderr,"Collecting data for %d filtering expressions\n", args->nflt_str);
+}
+
+static void init_data(args_t *args)
+{
+    args->sr = bcf_sr_init();
+    if ( args->regions )
+    {
+        args->sr->require_index = 1;
+        if ( bcf_sr_set_regions(args->sr, args->regions, args->regions_is_file)<0 ) error("Failed to read the regions: %s\n",args->regions);
+    }
+    if ( args->targets && bcf_sr_set_targets(args->sr, args->targets, args->targets_is_file, 0)<0 ) error("Failed to read the targets: %s\n",args->targets);
+    if ( !bcf_sr_add_reader(args->sr,args->fname) ) error("Error: %s\n", bcf_sr_strerror(args->sr->errnum));
+    args->hdr = bcf_sr_get_header(args->sr,0);
+
+    parse_ped(args, args->ped_fname);
+    parse_filters(args);
+
+    int i;
+    if ( !args->nflt_str )
+    {
+        args->filters = (flt_stats_t*) calloc(1, sizeof(flt_stats_t));
+        args->nfilters = 1;
+        args->filters[0].expr = strdup("all");
+    }
+    else
+    {
+        args->nfilters = args->nflt_str;
+        args->filters = (flt_stats_t*) calloc(args->nfilters, sizeof(flt_stats_t));
+        for (i=0; i<args->nfilters; i++)
+        {
+            args->filters[i].filter = filter_init(args->hdr, args->flt_str[i]);
+            args->filters[i].expr   = strdup(args->flt_str[i]);
+
+            // replace tabs with spaces so that the output stays parsable
+            char *tmp = args->filters[i].expr;
+            while ( *tmp )
+            { 
+                if ( *tmp=='\t' ) *tmp = ' '; 
+                tmp++; 
+            }
+        }
+    }
+    for (i=0; i<args->nfilters; i++)
+        args->filters[i].stats = (trio_stats_t*) calloc(args->ntrio,sizeof(trio_stats_t));
+}
+static void destroy_data(args_t *args)
+{
+    int i;
+    for (i=0; i<args->nfilters; i++)
+    {
+        if ( args->filters[i].filter ) filter_destroy(args->filters[i].filter);
+        free(args->filters[i].stats);
+        free(args->filters[i].expr);
+    }
+    free(args->filters);
+    for (i=0; i<args->nflt_str; i++) free(args->flt_str[i]);
+    free(args->flt_str);
+    bcf_sr_destroy(args->sr);
+    free(args->trio);
+    free(args->ac);
+    free(args->ac_trio);
+    free(args->gt_arr);
+    free(args);
+}
+static void report_stats(args_t *args)
+{
+    int i = 0,j;
+    FILE *fh = !args->output_fname || !strcmp("-",args->output_fname) ? stdout : fopen(args->output_fname,"w");
+    if ( !fh ) error("Could not open the file for writing: %s\n", args->output_fname);
+    fprintf(fh,"# CMD line shows the command line used to generate this output\n");
+    fprintf(fh,"# DEF lines define expressions for all tested thresholds\n");
+    fprintf(fh,"# FLT* lines report numbers for every threshold and every trio:\n");
+    fprintf(fh,"#   %d) filter id\n", ++i);
+    fprintf(fh,"#   %d) child\n", ++i);
+    fprintf(fh,"#   %d) father\n", ++i);
+    fprintf(fh,"#   %d) mother\n", ++i);
+    fprintf(fh,"#   %d) number of valid trio genotypes (all trio members pass filters, all non-missing)\n", ++i);
+    fprintf(fh,"#   %d) number of non-reference trio GTs (at least one trio member carries an alternate allele)\n", ++i);
+    fprintf(fh,"#   %d) number of Mendelian errors\n", ++i);
+    fprintf(fh,"#   %d) number of novel singleton alleles in the child (counted also as a Mendelian error)\n", ++i);
+    fprintf(fh,"#   %d) number of untransmitted singletons, present only in one parent\n", ++i);
+    fprintf(fh,"#   %d) number of transmitted singletons, present only in one parent and the child\n", ++i);
+    fprintf(fh,"#   %d) number of transitions, all ALT alleles present in the trio are considered\n", ++i);
+    fprintf(fh,"#   %d) number of transversions, all ALT alleles present in the trio are considered\n", ++i);
+    fprintf(fh,"#   %d) overall ts/tv, all ALT alleles present in the trio are considered\n", ++i);
+    fprintf(fh, "CMD\t%s", args->argv[0]);
+    for (i=1; i<args->argc; i++) fprintf(fh, " %s",args->argv[i]);
+    fprintf(fh, "\n");
+    for (i=0; i<args->nfilters; i++)
+    {
+        flt_stats_t *flt = &args->filters[i];
+        fprintf(fh,"DEF\tFLT%d\t%s\n", i, flt->expr);
+    }
+    for (i=0; i<args->nfilters; i++)
+    {
+        flt_stats_t *flt = &args->filters[i];
+        for (j=0; j<args->ntrio; j++)
+        {
+            fprintf(fh,"FLT%d", i);
+            fprintf(fh,"\t%s",args->hdr->samples[args->trio[j].idx[iCHILD]]);
+            fprintf(fh,"\t%s",args->hdr->samples[args->trio[j].idx[iFATHER]]);
+            fprintf(fh,"\t%s",args->hdr->samples[args->trio[j].idx[iMOTHER]]);
+            trio_stats_t *stats = &flt->stats[j];
+            fprintf(fh,"\t%u", stats->npass);
+            fprintf(fh,"\t%u", stats->nnon_ref);
+            fprintf(fh,"\t%u", stats->nmendel_err);
+            fprintf(fh,"\t%u", stats->nnovel);
+            fprintf(fh,"\t%u", stats->nsingleton);
+            fprintf(fh,"\t%u", stats->ndoubleton);
+            fprintf(fh,"\t%u", stats->nts);
+            fprintf(fh,"\t%u", stats->ntv);
+            fprintf(fh,"\t%.2f", stats->ntv ? (float)stats->nts/stats->ntv : INFINITY);
+            fprintf(fh,"\n");
+        }
+    }
+    if ( fclose(fh)!=0 ) error("Close failed: %s\n", (!args->output_fname || !strcmp("-",args->output_fname)) ? "stdout" : args->output_fname);
+}
+
+static inline int parse_genotype(int32_t *arr, int ngt1, int idx, int als[2])
+{
+    int32_t *ptr = arr + ngt1 * idx;
+    if ( bcf_gt_is_missing(ptr[0]) ) return -1;
+    als[0] = bcf_gt_allele(ptr[0]);
+
+    // treat haploid GTs as homozygous diploid
+    if ( ngt1==1 || ptr[1]==bcf_int32_vector_end ) { als[1] = als[0]; return 0; }
+
+    if ( bcf_gt_is_missing(ptr[1]) ) return -1;
+    als[1] = bcf_gt_allele(ptr[1]);
+
+    return 0;
+}
+
+static void process_record(args_t *args, bcf1_t *rec, flt_stats_t *flt)
+{
+    int i,j;
+
+    // Find out which trios pass and if the site passes
+    if ( flt->filter )
+    {
+        uint8_t *smpl_pass = NULL;
+        int pass_site = filter_test(flt->filter, rec, (const uint8_t**) &smpl_pass);
+        if ( args->filter_logic & FLT_EXCLUDE )
+        {
+            if ( pass_site )
+            {
+                if ( !smpl_pass ) return;
+                pass_site = 0;
+                for (i=0; i<args->ntrio; i++)
+                {
+                    int pass_trio = 1;
+                    for (j=0; j<3; j++)
+                    {
+                        int idx = args->trio[i].idx[j];
+                        if ( smpl_pass[idx] ) { pass_trio = 0; break; }
+                    }
+                    args->trio[i].pass = pass_trio;
+                    if ( pass_trio ) pass_site = 1;
+                }
+                if ( !pass_site ) return;
+            }
+            else
+                for (i=0; i<args->ntrio; i++) args->trio[i].pass = 1;
+        }
+        else if ( !pass_site ) return;
+        else if ( smpl_pass )
+        {
+            pass_site = 0;
+            for (i=0; i<args->ntrio; i++)
+            {
+                int pass_trio = 1;
+                for (j=0; j<3; j++)
+                {
+                    int idx = args->trio[i].idx[j];
+                    if ( !smpl_pass[idx] ) { pass_trio = 0; break; }
+                }
+                args->trio[i].pass = pass_trio;
+                if ( pass_trio ) pass_site = 1;
+            }
+            if ( !pass_site ) return;
+        }
+        else
+            for (i=0; i<args->ntrio; i++) args->trio[i].pass = 1;
+    }
+
+    // Find out the allele counts. Try to use INFO/AC, if not present, determine from the genotypes
+    hts_expand(int, rec->n_allele, args->mac, args->ac);
+    if ( !bcf_calc_ac(args->hdr, rec, args->ac, BCF_UN_INFO|BCF_UN_FMT) ) return;
+    hts_expand(int, rec->n_allele, args->mac_trio, args->ac_trio);
+
+    // Get the genotypes
+    int ngt = bcf_get_genotypes(args->hdr, rec, &args->gt_arr, &args->mgt_arr);
+    if ( ngt<0 ) return;
+    int ngt1 = ngt / rec->n_sample;
+    
+
+    // For ts/tv: numeric code of the reference allele, -1 if REF is not a single A, C, G or T base (e.g. at deletions)
+    int ref = !rec->d.allele[0][1] ? bcf_acgt2int(*rec->d.allele[0]) : -1;
+
+    int star_allele = -1;
+    for (i=1; i<rec->n_allele; i++)
+        if ( !rec->d.allele[i][1] && rec->d.allele[i][0]=='*' ) { star_allele = i; break; }
+
+    // Run the stats
+    for (i=0; i<args->ntrio; i++)
+    {
+        if ( flt->filter && !args->trio[i].pass ) continue;
+        trio_stats_t *stats = &flt->stats[i];
+
+        // Determine the alternate allele and the genotypes, skip if any of the alleles is missing.
+        // the order is: child, father, mother
+        int als[6], *als_child = als, *als_father = als+2, *als_mother = als+4; 
+        if ( parse_genotype(args->gt_arr, ngt1, args->trio[i].idx[iCHILD], als_child) < 0 ) continue;
+        if ( parse_genotype(args->gt_arr, ngt1, args->trio[i].idx[iFATHER], als_father) < 0 ) continue;
+        if ( parse_genotype(args->gt_arr, ngt1, args->trio[i].idx[iMOTHER], als_mother) < 0 ) continue;
+
+        stats->npass++;
+
+        // Has the trio an alternate allele other than *?
+        int has_star_allele = 0, has_nonref = 0;
+        memset(args->ac_trio,0,rec->n_allele*sizeof(*args->ac_trio));
+        for (j=0; j<6; j++)
+        {
+            if ( als[j]==star_allele ) { has_star_allele = 1; continue; }
+            if ( als[j]==0 ) continue;
+            has_nonref = 1;
+            args->ac_trio[ als[j] ]++;
+        }
+        if ( !has_nonref ) continue;   // only ref or * in this trio
+        
+        stats->nnon_ref++;
+
+        // Calculate ts/tv. Each trio adds at most one ts and one tv per site; multiallelic (HetAA) genotypes are handled too
+        if ( ref != -1 )
+        {
+            int has_ts = 0, has_tv = 0;
+            for (j=0; j<6; j++)
+            {
+                if ( als[j]==0 || als[j]==star_allele ) continue;
+                if ( als[j] >= rec->n_allele )
+                    error("The GT index is out of range at %s:%d in %s\n", bcf_seqname(args->hdr,rec),rec->pos+1,args->hdr->samples[args->trio[i].idx[j/2]]);
+                if ( rec->d.allele[als[j]][1] ) continue;
+
+                int alt = bcf_acgt2int(rec->d.allele[als[j]][0]);
+                if ( abs(ref-alt)==2 ) has_ts = 1;
+                else has_tv = 1;
+            }
+            if ( has_ts ) stats->nts++;
+            if ( has_tv ) stats->ntv++;
+        }
+
+        // Skip the remaining stats if the star allele is present: it was already checked at the primary record and we do not want to count
+        // the same thing multiple times. There can be other alternate alleles, but we ignore them for simplicity.
+        if ( has_star_allele ) continue;
+
+        // Detect Mendelian errors
+        int mendel_ok = (als_child[0]==als_father[0] || als_child[0]==als_father[1]) && (als_child[1]==als_mother[0] || als_child[1]==als_mother[1]) ? 1 : 0;
+        if ( !mendel_ok ) mendel_ok = (als_child[1]==als_father[0] || als_child[1]==als_father[1]) && (als_child[0]==als_mother[0] || als_child[0]==als_mother[1]) ? 1 : 0;
+        if ( !mendel_ok ) stats->nmendel_err++;
+
+        // Is this a singleton, a doubleton, or neither?
+        for (j=1; j<rec->n_allele; j++)
+        {
+            if ( args->ac_trio[j]==1 && args->ac[j]==1 )  // singleton (in parent) or novel (in child)
+            {
+                if ( als_child[0]==j || als_child[1]==j ) stats->nnovel++;
+                else stats->nsingleton++;
+            }
+            else if ( args->ac_trio[j]==2 && args->ac[j]==2 )   // possibly a doubleton
+            {
+                if ( (als_child[0]==j || als_child[1]==j) && (als_child[0]!=j || als_child[1]!=j) ) stats->ndoubleton++;
+            }
+        }
+    }
+}
+
+int run(int argc, char **argv)
+{
+    args_t *args = (args_t*) calloc(1,sizeof(args_t));
+    args->argc   = argc; args->argv = argv;
+    args->output_fname = "-";
+    static struct option loptions[] =
+    {
+        {"include",required_argument,0,'i'},
+        {"exclude",required_argument,0,'e'},
+        {"output",required_argument,NULL,'o'},
+        {"ped",required_argument,NULL,'p'},
+        {"regions",1,0,'r'},
+        {"regions-file",1,0,'R'},
+        {"targets",1,0,'t'},
+        {"targets-file",1,0,'T'},
+        {NULL,0,NULL,0}
+    };
+    int c, i;
+    while ((c = getopt_long(argc, argv, "hp:o:i:e:r:R:t:T:",loptions,NULL)) >= 0)
+    {
+        switch (c) 
+        {
+            case 'e': args->filter_str = optarg; args->filter_logic |= FLT_EXCLUDE; break;
+            case 'i': args->filter_str = optarg; args->filter_logic |= FLT_INCLUDE; break;
+            case 't': args->targets = optarg; break;
+            case 'T': args->targets = optarg; args->targets_is_file = 1; break;
+            case 'r': args->regions = optarg; break;
+            case 'R': args->regions = optarg; args->regions_is_file = 1; break;
+            case 'o': args->output_fname = optarg; break;
+            case 'p': args->ped_fname = optarg; break;
+            case 'h':
+            case '?':
+            default: error("%s", usage_text()); break;
+        }
+    }
+    if ( optind==argc )
+    {
+        if ( !isatty(fileno((FILE *)stdin)) ) args->fname = "-";  // reading from stdin
+        else { error("%s", usage_text()); }
+    }
+    else if ( optind+1!=argc ) error("%s", usage_text());
+    else args->fname = argv[optind];
+
+    if ( !args->ped_fname ) error("Missing the -p, --ped option\n");
+
+    init_data(args);
+
+    while ( bcf_sr_next_line(args->sr) )
+    {
+        bcf1_t *rec = bcf_sr_get_line(args->sr,0);
+        for (i=0; i<args->nfilters; i++)
+            process_record(args, rec, &args->filters[i]);
+    }
+
+    report_stats(args);
+    destroy_data(args);
+
+    return 0;
+}
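Note on the Mendelian check in process_record() above: a trio genotype is accepted when one child allele can come from the father and the other from the mother, in either order. The following is a minimal standalone sketch of that test, not part of the plugin. The names allele_in(), mendel_ok() and mendel_selfcheck() are hypothetical, and alleles are plain integer indexes as returned by bcf_gt_allele().

```c
#include <assert.h>

/* Hypothetical helper: is allele a one of the two parental alleles p0, p1? */
static int allele_in(int a, int p0, int p1)
{
    return a==p0 || a==p1;
}

/* Returns 1 if the child genotype (c0,c1) is consistent with Mendelian
 * inheritance from the father (f0,f1) and the mother (m0,m1), 0 otherwise.
 * Both assignments of the child alleles to the parents are tried, which
 * mirrors the two-step test in the plugin. */
int mendel_ok(int c0, int c1, int f0, int f1, int m0, int m1)
{
    if ( allele_in(c0,f0,f1) && allele_in(c1,m0,m1) ) return 1;
    return allele_in(c1,f0,f1) && allele_in(c0,m0,m1);
}

/* Small self-check with hand-picked genotypes (0 = REF, 1 and 2 = ALT) */
void mendel_selfcheck(void)
{
    assert( mendel_ok(0,1, 0,0, 0,1) );     /* ALT inherited from the mother */
    assert( mendel_ok(0,1, 1,1, 0,0) );     /* needs the swapped assignment */
    assert( !mendel_ok(1,1, 0,0, 0,1) );    /* father cannot supply allele 1 */
}
```

Haploid genotypes would have to be expanded to homozygous diploid before calling mendel_ok(), as parse_genotype() does in the plugin.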
diff --git a/bcftools/plugins/trio-stats.c.pysam.c b/bcftools/plugins/trio-stats.c.pysam.c
new file mode 100644
index 0000000..626cafa
--- /dev/null
+++ b/bcftools/plugins/trio-stats.c.pysam.c
@@ -0,0 +1,556 @@
+#include "bcftools.pysam.h"
+
+/* The MIT License
+
+   Copyright (c) 2018 Genome Research Ltd.
+
+   Author: Petr Danecek <pd3@sanger.ac.uk>
+   
+   Permission is hereby granted, free of charge, to any person obtaining a copy
+   of this software and associated documentation files (the "Software"), to deal
+   in the Software without restriction, including without limitation the rights
+   to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+   copies of the Software, and to permit persons to whom the Software is
+   furnished to do so, subject to the following conditions:
+   
+   The above copyright notice and this permission notice shall be included in
+   all copies or substantial portions of the Software.
+   
+   THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+   IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+   FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+   AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+   LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+   OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+   THE SOFTWARE.
+
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <getopt.h>
+#include <unistd.h>     // for isatty
+#include <htslib/hts.h>
+#include <htslib/vcf.h>
+#include <htslib/kstring.h>
+#include <htslib/kseq.h>
+#include <htslib/synced_bcf_reader.h>
+#include <htslib/vcfutils.h>
+#include "bcftools.h"
+#include "filter.h"
+
+
+// Logic of the filters: include or exclude sites which match the filters?
+#define FLT_INCLUDE 1
+#define FLT_EXCLUDE 2
+
+#define iCHILD  0
+#define iFATHER 1
+#define iMOTHER 2
+
+typedef struct
+{
+    int idx[3];     // VCF sample indexes of the child, father and mother (see iCHILD, iFATHER, iMOTHER)
+    int pass;       // do all three pass the filters?
+}
+trio_t;
+
+typedef struct
+{
+    uint32_t
+        npass,          // number of genotypes passing the filter
+        nnon_ref,       // number of non-reference genotypes
+        nmendel_err,    // number of mendelian errors
+        nnovel,         // a singleton allele, but observed only in the child. Counted as mendel_err as well.
+        nsingleton,     // het mother or father different from everyone else
+        ndoubleton,     // het mother+child or father+child different from everyone else
+        nts, ntv;       // number of transitions and transversions
+}
+trio_stats_t;
+
+typedef struct
+{
+    trio_stats_t *stats;
+    filter_t *filter;
+    char *expr;
+}
+flt_stats_t;
+
+typedef struct
+{
+    int argc, filter_logic, regions_is_file, targets_is_file;
+    int nflt_str;
+    char *filter_str, **flt_str;
+    char **argv, *ped_fname, *output_fname, *fname, *regions, *targets;
+    bcf_srs_t *sr;
+    bcf_hdr_t *hdr;
+    trio_t *trio;
+    int ntrio, mtrio;
+    flt_stats_t *filters;
+    int nfilters;
+    int32_t *gt_arr, *ac, *ac_trio;
+    int mgt_arr, mac, mac_trio;
+}
+args_t;
+
+args_t args;
+
+const char *about(void)
+{
+    return "Calculate transmission rate and other stats in trio children.\n";
+}
+
+static const char *usage_text(void)
+{
+    return 
+        "\n"
+        "About: Calculate transmission rate in trio children. Use curly brackets to scan\n"
+        "       a range of values simultaneously\n"
+        "Usage: bcftools +trio-stats [Plugin Options]\n"
+        "Plugin options:\n"
+        "   -e, --exclude EXPR          exclude sites and samples for which the expression is true\n"
+        "   -i, --include EXPR          include sites and samples for which the expression is true\n"
+        "   -o, --output FILE           output file name [bcftools_stdout]\n"
+        "   -p, --ped FILE              PED file\n"
+        "   -r, --regions REG           restrict to comma-separated list of regions\n"
+        "   -R, --regions-file FILE     restrict to regions listed in a file\n"
+        "   -t, --targets REG           similar to -r but streams rather than index-jumps\n"
+        "   -T, --targets-file FILE     similar to -R but streams rather than index-jumps\n"
+        "\n"
+        "Example:\n"
+        "   bcftools +trio-stats -p file.ped -i 'GQ>{10,20,30,40,50}' file.bcf\n"
+        "\n";
+}
+
+static int cmp_trios(const void *_a, const void *_b)
+{
+    trio_t *a = (trio_t *) _a;
+    trio_t *b = (trio_t *) _b;
+    int i;
+    int amin = a->idx[0];
+    for (i=1; i<3; i++)
+        if ( amin > a->idx[i] ) amin = a->idx[i];
+    int bmin = b->idx[0];
+    for (i=1; i<3; i++)
+        if ( bmin > b->idx[i] ) bmin = b->idx[i];
+    if ( amin < bmin ) return -1;
+    if ( amin > bmin ) return 1;
+    return 0;
+}
+
+static void parse_ped(args_t *args, char *fname)
+{
+    htsFile *fp = hts_open(fname, "r");
+    if ( !fp ) error("Could not read: %s\n", fname);
+
+    kstring_t str = {0,0,0};
+    if ( hts_getline(fp, KS_SEP_LINE, &str) <= 0 ) error("Empty file: %s\n", fname);
+
+    int moff = 0, *off = NULL;
+    do
+    {
+        // familyID    sampleID paternalID maternalID sex   phenotype   population relationship   siblings   secondOrder   thirdOrder   children    comment
+        // BB03    HG01884 HG01885 HG01956 2   0   ACB child   0   0   0   0
+        int ncols = ksplit_core(str.s,0,&moff,&off);
+        if ( ncols<4 ) error("Could not parse the ped file: %s\n", str.s);
+
+        int father = bcf_hdr_id2int(args->hdr,BCF_DT_SAMPLE,&str.s[off[2]]);
+        if ( father<0 ) continue;
+        int mother = bcf_hdr_id2int(args->hdr,BCF_DT_SAMPLE,&str.s[off[3]]);
+        if ( mother<0 ) continue;
+        int child = bcf_hdr_id2int(args->hdr,BCF_DT_SAMPLE,&str.s[off[1]]);
+        if ( child<0 ) continue;
+
+        args->ntrio++;
+        hts_expand0(trio_t,args->ntrio,args->mtrio,args->trio);
+        trio_t *trio = &args->trio[args->ntrio-1];
+        trio->idx[iFATHER] = father;
+        trio->idx[iMOTHER] = mother;
+        trio->idx[iCHILD]  = child;
+    }
+    while ( hts_getline(fp, KS_SEP_LINE, &str)>=0 );
+
+    fprintf(bcftools_stderr,"Identified %d complete trios in the VCF file\n", args->ntrio);
+
+    // sort the trios by their lowest sample index so that samples are accessed more or less sequentially
+    qsort(args->trio,args->ntrio,sizeof(trio_t),cmp_trios);
+    
+    free(str.s);
+    free(off);
+    hts_close(fp);
+}
+
+static void parse_filters(args_t *args)
+{
+    if ( !args->filter_str ) return;
+    int mflt = 1;
+    args->nflt_str = 1;
+    args->flt_str  = (char**) malloc(sizeof(char*));
+    args->flt_str[0] = strdup(args->filter_str);
+    while (1)
+    {
+        int i, expanded = 0;
+        for (i=args->nflt_str-1; i>=0; i--)
+        {
+            char *exp_beg = strchr(args->flt_str[i], '{');
+            if ( !exp_beg ) continue;
+            char *exp_end = strchr(exp_beg+1, '}');
+            if ( !exp_end ) error("Could not parse the expression: %s\n", args->filter_str);
+            char *beg = exp_beg+1, *mid = beg;
+            while ( mid<exp_end )
+            {
+                while ( mid<exp_end && *mid!=',' ) mid++;
+                kstring_t tmp = {0,0,0};
+                kputsn(args->flt_str[i], exp_beg - args->flt_str[i], &tmp);
+                kputsn(beg, mid - beg, &tmp);
+                kputs(exp_end+1, &tmp);
+                args->nflt_str++;
+                hts_expand(char*, args->nflt_str, mflt, args->flt_str);
+                args->flt_str[args->nflt_str-1] = tmp.s;
+                beg = ++mid;
+            }
+            expanded = 1;
+            free(args->flt_str[i]);
+            memmove(&args->flt_str[i], &args->flt_str[i+1], (args->nflt_str-i-1)*sizeof(*args->flt_str));
+            args->nflt_str--;
+            args->flt_str[args->nflt_str] = NULL;
+        }
+        if ( !expanded ) break;
+    }
+    
+    fprintf(bcftools_stderr,"Collecting data for %d filtering expressions\n", args->nflt_str);
+}
+
+static void init_data(args_t *args)
+{
+    args->sr = bcf_sr_init();
+    if ( args->regions )
+    {
+        args->sr->require_index = 1;
+        if ( bcf_sr_set_regions(args->sr, args->regions, args->regions_is_file)<0 ) error("Failed to read the regions: %s\n",args->regions);
+    }
+    if ( args->targets && bcf_sr_set_targets(args->sr, args->targets, args->targets_is_file, 0)<0 ) error("Failed to read the targets: %s\n",args->targets);
+    if ( !bcf_sr_add_reader(args->sr,args->fname) ) error("Error: %s\n", bcf_sr_strerror(args->sr->errnum));
+    args->hdr = bcf_sr_get_header(args->sr,0);
+
+    parse_ped(args, args->ped_fname);
+    parse_filters(args);
+
+    int i;
+    if ( !args->nflt_str )
+    {
+        args->filters = (flt_stats_t*) calloc(1, sizeof(flt_stats_t));
+        args->nfilters = 1;
+        args->filters[0].expr = strdup("all");
+    }
+    else
+    {
+        args->nfilters = args->nflt_str;
+        args->filters = (flt_stats_t*) calloc(args->nfilters, sizeof(flt_stats_t));
+        for (i=0; i<args->nfilters; i++)
+        {
+            args->filters[i].filter = filter_init(args->hdr, args->flt_str[i]);
+            args->filters[i].expr   = strdup(args->flt_str[i]);
+
+            // replace tabs with spaces so that the output stays parsable
+            char *tmp = args->filters[i].expr;
+            while ( *tmp )
+            { 
+                if ( *tmp=='\t' ) *tmp = ' '; 
+                tmp++; 
+            }
+        }
+    }
+    for (i=0; i<args->nfilters; i++)
+        args->filters[i].stats = (trio_stats_t*) calloc(args->ntrio,sizeof(trio_stats_t));
+}
+static void destroy_data(args_t *args)
+{
+    int i;
+    for (i=0; i<args->nfilters; i++)
+    {
+        if ( args->filters[i].filter ) filter_destroy(args->filters[i].filter);
+        free(args->filters[i].stats);
+        free(args->filters[i].expr);
+    }
+    free(args->filters);
+    for (i=0; i<args->nflt_str; i++) free(args->flt_str[i]);
+    free(args->flt_str);
+    bcf_sr_destroy(args->sr);
+    free(args->trio);
+    free(args->ac);
+    free(args->ac_trio);
+    free(args->gt_arr);
+    free(args);
+}
+static void report_stats(args_t *args)
+{
+    int i = 0,j;
+    FILE *fh = !args->output_fname || !strcmp("-",args->output_fname) ? bcftools_stdout : fopen(args->output_fname,"w");
+    if ( !fh ) error("Could not open the file for writing: %s\n", args->output_fname);
+    fprintf(fh,"# CMD line shows the command line used to generate this output\n");
+    fprintf(fh,"# DEF lines define expressions for all tested thresholds\n");
+    fprintf(fh,"# FLT* lines report numbers for every threshold and every trio:\n");
+    fprintf(fh,"#   %d) filter id\n", ++i);
+    fprintf(fh,"#   %d) child\n", ++i);
+    fprintf(fh,"#   %d) father\n", ++i);
+    fprintf(fh,"#   %d) mother\n", ++i);
+    fprintf(fh,"#   %d) number of valid trio genotypes (all trio members pass filters, all non-missing)\n", ++i);
+    fprintf(fh,"#   %d) number of non-reference trio GTs (at least one trio member carries an alternate allele)\n", ++i);
+    fprintf(fh,"#   %d) number of Mendelian errors\n", ++i);
+    fprintf(fh,"#   %d) number of novel singleton alleles in the child (counted also as a Mendelian error)\n", ++i);
+    fprintf(fh,"#   %d) number of untransmitted singletons, present only in one parent\n", ++i);
+    fprintf(fh,"#   %d) number of transmitted singletons, present only in one parent and the child\n", ++i);
+    fprintf(fh,"#   %d) number of transitions, all ALT alleles present in the trio are considered\n", ++i);
+    fprintf(fh,"#   %d) number of transversions, all ALT alleles present in the trio are considered\n", ++i);
+    fprintf(fh,"#   %d) overall ts/tv, all ALT alleles present in the trio are considered\n", ++i);
+    fprintf(fh, "CMD\t%s", args->argv[0]);
+    for (i=1; i<args->argc; i++) fprintf(fh, " %s",args->argv[i]);
+    fprintf(fh, "\n");
+    for (i=0; i<args->nfilters; i++)
+    {
+        flt_stats_t *flt = &args->filters[i];
+        fprintf(fh,"DEF\tFLT%d\t%s\n", i, flt->expr);
+    }
+    for (i=0; i<args->nfilters; i++)
+    {
+        flt_stats_t *flt = &args->filters[i];
+        for (j=0; j<args->ntrio; j++)
+        {
+            fprintf(fh,"FLT%d", i);
+            fprintf(fh,"\t%s",args->hdr->samples[args->trio[j].idx[iCHILD]]);
+            fprintf(fh,"\t%s",args->hdr->samples[args->trio[j].idx[iFATHER]]);
+            fprintf(fh,"\t%s",args->hdr->samples[args->trio[j].idx[iMOTHER]]);
+            trio_stats_t *stats = &flt->stats[j];
+            fprintf(fh,"\t%u", stats->npass);
+            fprintf(fh,"\t%u", stats->nnon_ref);
+            fprintf(fh,"\t%u", stats->nmendel_err);
+            fprintf(fh,"\t%u", stats->nnovel);
+            fprintf(fh,"\t%u", stats->nsingleton);
+            fprintf(fh,"\t%u", stats->ndoubleton);
+            fprintf(fh,"\t%u", stats->nts);
+            fprintf(fh,"\t%u", stats->ntv);
+            fprintf(fh,"\t%.2f", stats->ntv ? (float)stats->nts/stats->ntv : INFINITY);
+            fprintf(fh,"\n");
+        }
+    }
+    if ( fclose(fh)!=0 ) error("Close failed: %s\n", (!args->output_fname || !strcmp("-",args->output_fname)) ? "bcftools_stdout" : args->output_fname);
+}
+
+static inline int parse_genotype(int32_t *arr, int ngt1, int idx, int als[2])
+{
+    int32_t *ptr = arr + ngt1 * idx;
+    if ( bcf_gt_is_missing(ptr[0]) ) return -1;
+    als[0] = bcf_gt_allele(ptr[0]);
+
+    // treat haploid GTs as homozygous diploid
+    if ( ngt1==1 || ptr[1]==bcf_int32_vector_end ) { als[1] = als[0]; return 0; }
+
+    if ( bcf_gt_is_missing(ptr[1]) ) return -1;
+    als[1] = bcf_gt_allele(ptr[1]);
+
+    return 0;
+}
+
+static void process_record(args_t *args, bcf1_t *rec, flt_stats_t *flt)
+{
+    int i,j;
+
+    // Find out which trios pass and if the site passes
+    if ( flt->filter )
+    {
+        uint8_t *smpl_pass = NULL;
+        int pass_site = filter_test(flt->filter, rec, (const uint8_t**) &smpl_pass);
+        if ( args->filter_logic & FLT_EXCLUDE )
+        {
+            if ( pass_site )
+            {
+                if ( !smpl_pass ) return;
+                pass_site = 0;
+                for (i=0; i<args->ntrio; i++)
+                {
+                    int pass_trio = 1;
+                    for (j=0; j<3; j++)
+                    {
+                        int idx = args->trio[i].idx[j];
+                        if ( smpl_pass[idx] ) { pass_trio = 0; break; }
+                    }
+                    args->trio[i].pass = pass_trio;
+                    if ( pass_trio ) pass_site = 1;
+                }
+                if ( !pass_site ) return;
+            }
+            else
+                for (i=0; i<args->ntrio; i++) args->trio[i].pass = 1;
+        }
+        else if ( !pass_site ) return;
+        else if ( smpl_pass )
+        {
+            pass_site = 0;
+            for (i=0; i<args->ntrio; i++)
+            {
+                int pass_trio = 1;
+                for (j=0; j<3; j++)
+                {
+                    int idx = args->trio[i].idx[j];
+                    if ( !smpl_pass[idx] ) { pass_trio = 0; break; }
+                }
+                args->trio[i].pass = pass_trio;
+                if ( pass_trio ) pass_site = 1;
+            }
+            if ( !pass_site ) return;
+        }
+        else
+            for (i=0; i<args->ntrio; i++) args->trio[i].pass = 1;
+    }
+
+    // Find out the allele counts. Try to use INFO/AC, if not present, determine from the genotypes
+    hts_expand(int, rec->n_allele, args->mac, args->ac);
+    if ( !bcf_calc_ac(args->hdr, rec, args->ac, BCF_UN_INFO|BCF_UN_FMT) ) return;
+    hts_expand(int, rec->n_allele, args->mac_trio, args->ac_trio);
+
+    // Get the genotypes
+    int ngt = bcf_get_genotypes(args->hdr, rec, &args->gt_arr, &args->mgt_arr);
+    if ( ngt<0 ) return;
+    int ngt1 = ngt / rec->n_sample;
+    
+
+    // For ts/tv: numeric code of the reference allele, -1 if REF is longer than one base
+    int ref = !rec->d.allele[0][1] ? bcf_acgt2int(*rec->d.allele[0]) : -1;
+
+    int star_allele = -1;
+    for (i=1; i<rec->n_allele; i++)
+        if ( !rec->d.allele[i][1] && rec->d.allele[i][0]=='*' ) { star_allele = i; break; }
+
+    // Run the stats
+    for (i=0; i<args->ntrio; i++)
+    {
+        if ( flt->filter && !args->trio[i].pass ) continue;
+        trio_stats_t *stats = &flt->stats[i];
+
+        // Determine the alternate allele and the genotypes, skip if any of the alleles is missing.
+        // the order is: child, father, mother
+        int als[6], *als_child = als, *als_father = als+2, *als_mother = als+4; 
+        if ( parse_genotype(args->gt_arr, ngt1, args->trio[i].idx[iCHILD], als_child) < 0 ) continue;
+        if ( parse_genotype(args->gt_arr, ngt1, args->trio[i].idx[iFATHER], als_father) < 0 ) continue;
+        if ( parse_genotype(args->gt_arr, ngt1, args->trio[i].idx[iMOTHER], als_mother) < 0 ) continue;
+
+        stats->npass++;
+
+        // Does the trio have an alternate allele other than *?
+        int has_star_allele = 0, has_nonref = 0;
+        memset(args->ac_trio,0,rec->n_allele*sizeof(*args->ac_trio));
+        for (j=0; j<6; j++)
+        {
+            if ( als[j]==star_allele ) { has_star_allele = 1; continue; }
+            if ( als[j]==0 ) continue;
+            has_nonref = 1;
+            args->ac_trio[ als[j] ]++;
+        }
+        if ( !has_nonref ) continue;   // only ref or * in this trio
+        
+        stats->nnon_ref++;
+
+        // Calculate ts/tv; HetAA genotypes are handled correctly as well
+        if ( ref != -1 )
+        {
+            int has_ts = 0, has_tv = 0;
+            for (j=0; j<6; j++)
+            {
+                if ( als[j]==0 || als[j]==star_allele ) continue;
+                if ( als[j] >= rec->n_allele )
+                    error("The GT index is out of range at %s:%d in %s\n", bcf_seqname(args->hdr,rec),rec->pos+1,args->hdr->samples[args->trio[i].idx[j/2]]);
+                if ( rec->d.allele[als[j]][1] ) continue;
+
+                int alt = bcf_acgt2int(rec->d.allele[als[j]][0]);
+                if ( abs(ref-alt)==2 ) has_ts = 1;
+                else has_tv = 1;
+            }
+            if ( has_ts ) stats->nts++;
+            if ( has_tv ) stats->ntv++;
+        }
+
+        // Skip the remaining stats if the star allele is present: it was already checked at the primary record and we do not want to count the same
+        // thing multiple times. There can be other alternate alleles, but we ignore them for simplicity.
+        if ( has_star_allele ) continue;
+
+        // Detect mendelian errors
+        int mendel_ok = (als_child[0]==als_father[0] || als_child[0]==als_father[1]) && (als_child[1]==als_mother[0] || als_child[1]==als_mother[1]) ? 1 : 0;
+        if ( !mendel_ok ) mendel_ok = (als_child[1]==als_father[0] || als_child[1]==als_father[1]) && (als_child[0]==als_mother[0] || als_child[0]==als_mother[1]) ? 1 : 0;
+        if ( !mendel_ok ) stats->nmendel_err++;
+
+        // Is this a singleton, doubleton, neither?
+        for (j=1; j<rec->n_allele; j++)
+        {
+            if ( args->ac_trio[j]==1 && args->ac[j]==1 )  // singleton (in parent) or novel (in child)
+            {
+                if ( als_child[0]==j || als_child[1]==j ) stats->nnovel++;
+                else stats->nsingleton++;
+            }
+            else if ( args->ac_trio[j]==2 && args->ac[j]==2 )   // possibly a doubleton
+            {
+                if ( (als_child[0]==j || als_child[1]==j) && (als_child[0]!=j || als_child[1]!=j) ) stats->ndoubleton++;
+            }
+        }
+    }
+}
+
+int run(int argc, char **argv)
+{
+    args_t *args = (args_t*) calloc(1,sizeof(args_t));
+    args->argc   = argc; args->argv = argv;
+    args->output_fname = "-";
+    static struct option loptions[] =
+    {
+        {"include",required_argument,0,'i'},
+        {"exclude",required_argument,0,'e'},
+        {"output",required_argument,NULL,'o'},
+        {"ped",required_argument,NULL,'p'},
+        {"regions",1,0,'r'},
+        {"regions-file",1,0,'R'},
+        {"targets",1,0,'t'},
+        {"targets-file",1,0,'T'},
+        {NULL,0,NULL,0}
+    };
+    int c, i;
+    while ((c = getopt_long(argc, argv, "p:o:s:i:e:r:R:t:T:",loptions,NULL)) >= 0)
+    {
+        switch (c) 
+        {
+            case 'e': args->filter_str = optarg; args->filter_logic |= FLT_EXCLUDE; break;
+            case 'i': args->filter_str = optarg; args->filter_logic |= FLT_INCLUDE; break;
+            case 't': args->targets = optarg; break;
+            case 'T': args->targets = optarg; args->targets_is_file = 1; break;
+            case 'r': args->regions = optarg; break;
+            case 'R': args->regions = optarg; args->regions_is_file = 1; break;
+            case 'o': args->output_fname = optarg; break;
+            case 'p': args->ped_fname = optarg; break;
+            case 'h':
+            case '?':
+            default: error("%s", usage_text()); break;
+        }
+    }
+    if ( optind==argc )
+    {
+        if ( !isatty(fileno((FILE *)stdin)) ) args->fname = "-";  // reading from stdin
+        else { error("%s", usage_text()); }
+    }
+    else if ( optind+1!=argc ) error("%s", usage_text());
+    else args->fname = argv[optind];
+
+    if ( !args->ped_fname ) error("Missing the -p, --ped option\n");
+
+    init_data(args);
+
+    while ( bcf_sr_next_line(args->sr) )
+    {
+        bcf1_t *rec = bcf_sr_get_line(args->sr,0);
+        for (i=0; i<args->nfilters; i++)
+            process_record(args, rec, &args->filters[i]);
+    }
+
+    report_stats(args);
+    destroy_data(args);
+
+    return 0;
+}
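The Mendelian-error test in `process_record()` above can be read in isolation as follows. This is a minimal sketch, not the plugin's code: the function name `mendel_consistent` and its scalar parameters are hypothetical, and it assumes missing genotypes were already filtered out and haploid calls were expanded to homozygous diploid, as `parse_genotype()` does.

```c
#include <assert.h>

/* Hypothetical helper: returns 1 if the child's two alleles can be explained
 * by one allele from the father and one from the mother, 0 otherwise.
 * Both assignments are tried, so the order of the child's alleles is ignored. */
static int mendel_consistent(int c0, int c1, int f0, int f1, int m0, int m1)
{
    /* Allele indices are expected to be non-negative (no missing values). */
    assert(c0>=0 && c1>=0 && f0>=0 && f1>=0 && m0>=0 && m1>=0);

    /* First assignment: c0 from the father, c1 from the mother. */
    if ( (c0==f0 || c0==f1) && (c1==m0 || c1==m1) ) return 1;

    /* Second assignment: c1 from the father, c0 from the mother. */
    return (c1==f0 || c1==f1) && (c0==m0 || c0==m1);
}
```

For example, a 0/1 child with a 0/0 father and a 1/1 mother is consistent, while a 1/1 child with a 0/0 father is counted as an error regardless of the mother.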
diff --git a/bcftools/plugins/trio-switch-rate.c b/bcftools/plugins/trio-switch-rate.c
new file mode 100644
index 0000000..34f840d
--- /dev/null
+++ b/bcftools/plugins/trio-switch-rate.c
@@ -0,0 +1,273 @@
+/* The MIT License
+
+   Copyright (c) 2016 Genome Research Ltd.
+
+   Author: Petr Danecek <pd3@sanger.ac.uk>
+   
+   Permission is hereby granted, free of charge, to any person obtaining a copy
+   of this software and associated documentation files (the "Software"), to deal
+   in the Software without restriction, including without limitation the rights
+   to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+   copies of the Software, and to permit persons to whom the Software is
+   furnished to do so, subject to the following conditions:
+   
+   The above copyright notice and this permission notice shall be included in
+   all copies or substantial portions of the Software.
+   
+   THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+   IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+   FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+   AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+   LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+   OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+   THE SOFTWARE.
+
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <getopt.h>
+#include <math.h>
+#include <htslib/hts.h>
+#include <htslib/vcf.h>
+#include <htslib/kstring.h>
+#include <htslib/kseq.h>
+#include <htslib/khash_str2int.h>
+#include "bcftools.h"
+
+typedef struct
+{
+    int father, mother, child;      // VCF sample index
+    int prev, ipop;
+    uint32_t err, nswitch, ntest;
+}
+trio_t;
+
+typedef struct
+{
+    char *name;
+    uint32_t err, nswitch, ntest, ntrio;
+    float pswitch;
+}
+pop_t;
+
+typedef struct
+{
+    int argc;
+    char **argv;
+    bcf_hdr_t *hdr;
+    trio_t *trio;
+    int ntrio, mtrio;
+    int32_t *gt_arr;
+    int npop;
+    pop_t *pop;
+    int mgt_arr, prev_rid;
+}
+args_t;
+
+args_t args;
+
+const char *about(void)
+{
+    return "Calculate phase switch rate in trio samples, children samples must have phased GTs.\n";
+}
+
+const char *usage(void)
+{
+    return 
+        "\n"
+        "About: Calculate phase switch rate in trio children.\n"
+        "Usage: bcftools +trio-switch-rate [General Options] -- [Plugin Options]\n"
+        "Options:\n"
+        "   run \"bcftools plugin\" for a list of common options\n"
+        "\n"
+        "Plugin options:\n"
+        "   -p, --ped <file>        PED file with optional 7th column to group\n"
+        "                           results by population\n"
+        "\n"
+        "Example:\n"
+        "   bcftools +trio-switch-rate file.bcf -- -p file.ped\n"
+        "\n";
+}
+
+void parse_ped(args_t *args, char *fname)
+{
+    htsFile *fp = hts_open(fname, "r");
+    if ( !fp ) error("Could not read: %s\n", fname);
+
+    kstring_t str = {0,0,0};
+    if ( hts_getline(fp, KS_SEP_LINE, &str) <= 0 ) error("Empty file: %s\n", fname);
+
+    void *pop2i = khash_str2int_init();
+
+    int moff = 0, *off = NULL;
+    do
+    {
+        // familyID    sampleID paternalID maternalID sex   phenotype   population relationship   siblings   secondOrder   thirdOrder   children    comment
+        // BB03    HG01884 HG01885 HG01956 2   0   ACB child   0   0   0   0
+        int ncols = ksplit_core(str.s,0,&moff,&off);
+        if ( ncols<4 ) error("Could not parse the ped file: %s\n", str.s);
+
+        int father = bcf_hdr_id2int(args->hdr,BCF_DT_SAMPLE,&str.s[off[2]]);
+        if ( father<0 ) continue;
+        int mother = bcf_hdr_id2int(args->hdr,BCF_DT_SAMPLE,&str.s[off[3]]);
+        if ( mother<0 ) continue;
+        int child = bcf_hdr_id2int(args->hdr,BCF_DT_SAMPLE,&str.s[off[1]]);
+        if ( child<0 ) continue;
+
+        args->ntrio++;
+        hts_expand0(trio_t,args->ntrio,args->mtrio,args->trio);
+        trio_t *trio = &args->trio[args->ntrio-1];
+        trio->father = father;
+        trio->mother = mother;
+        trio->child  = child;
+
+        if (ncols>6) {
+            char *pop_name = &str.s[off[6]];
+            if ( !khash_str2int_has_key(pop2i,pop_name) )
+            {
+                pop_name = strdup(&str.s[off[6]]);
+                khash_str2int_set(pop2i,pop_name,args->npop);
+                args->npop++;
+                args->pop = (pop_t*) realloc(args->pop,args->npop*sizeof(*args->pop));
+                memset(args->pop+args->npop-1,0,sizeof(*args->pop));
+                args->pop[args->npop-1].name = pop_name;
+            }
+            khash_str2int_get(pop2i,pop_name,&trio->ipop);
+            args->pop[trio->ipop].ntrio++;
+        }
+    } while ( hts_getline(fp, KS_SEP_LINE, &str)>=0 );
+
+    khash_str2int_destroy(pop2i);
+    free(str.s);
+    free(off);
+    hts_close(fp);
+}
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+    memset(&args,0,sizeof(args_t));
+    args.argc   = argc; args.argv = argv;
+    args.prev_rid = -1;
+    args.hdr = in;
+    char *ped_fname = NULL;
+    static struct option loptions[] =
+    {
+        {"ped",required_argument,NULL,'p'},
+        {0,0,0,0}
+    };
+    int c;
+    while ((c = getopt_long(argc, argv, "?hp:",loptions,NULL)) >= 0)
+    {
+        switch (c) 
+        {
+            case 'p': ped_fname = optarg; break;
+            case 'h':
+            case '?':
+            default: error("%s", usage()); break;
+        }
+    }
+    if ( !ped_fname ) error("Expected the -p option\n");
+    parse_ped(&args, ped_fname);
+    return 1;
+}
+
+typedef struct
+{
+    int a, b, phased;
+}
+gt_t;
+
+int parse_genotype(gt_t *gt, int32_t *ptr);
+
+inline int parse_genotype(gt_t *gt, int32_t *ptr)
+{
+    if ( ptr[0]==bcf_gt_missing ) return 0;
+    if ( ptr[1]==bcf_gt_missing ) return 0;
+    if ( ptr[1]==bcf_int32_vector_end ) return 0;
+    gt->phased = bcf_gt_is_phased(ptr[1]) ? 1 : 0;
+    gt->a = bcf_gt_allele(ptr[0]); if ( gt->a > 1 ) return 0;  // consider only the first two alleles at biallelic sites
+    gt->b = bcf_gt_allele(ptr[1]); if ( gt->b > 1 ) return 0;
+    return 1;
+}
+
+bcf1_t *process(bcf1_t *rec)
+{
+    int ngt = bcf_get_genotypes(args.hdr, rec, &args.gt_arr, &args.mgt_arr);
+    if ( ngt<0 ) return NULL;
+    ngt /= bcf_hdr_nsamples(args.hdr);
+    if ( ngt!=2 ) return NULL;
+
+    int i;
+    if ( rec->rid!=args.prev_rid )
+    {
+        args.prev_rid = rec->rid;
+        for (i=0; i<args.ntrio; i++) args.trio[i].prev = 0;
+    }
+
+    gt_t child, father, mother;
+    for (i=0; i<args.ntrio; i++)
+    {
+        trio_t *trio = &args.trio[i];
+
+        if ( !parse_genotype(&child, args.gt_arr + ngt*trio->child) ) continue;
+        if ( !child.phased ) continue;
+        if ( child.a+child.b != 1 ) continue;       // child is not a het
+
+        if ( !parse_genotype(&father, args.gt_arr + ngt*trio->father) ) continue;
+        if ( !parse_genotype(&mother, args.gt_arr + ngt*trio->mother) ) continue;
+        if ( father.a+father.b == 1 && mother.a+mother.b == 1 ) continue;     // both parents are hets
+        if ( father.a+father.b == mother.a+mother.b ) { trio->err++; continue; }    // mendelian error
+
+        int test_phase = 0; 
+        if ( father.a==father.b ) test_phase = 1 + (child.a==father.a);
+        else if ( mother.a==mother.b ) test_phase = 1 + (child.b==mother.a);
+        if ( trio->prev > 0 )
+        {
+            if ( trio->prev!=test_phase ) trio->nswitch++;
+        }
+        trio->ntest++;
+        trio->prev = test_phase;
+    }
+    return NULL;
+}
+
+void destroy(void)
+{
+    int i;
+    printf("# This file was produced by: bcftools +trio-switch-rate(%s+htslib-%s)\n", bcftools_version(),hts_version());
+    printf("# The command line was:\tbcftools +trio-switch-rate %s", args.argv[0]);
+    for (i=1; i<args.argc; i++) printf(" %s",args.argv[i]);
+    printf("\n#\n");
+    printf("# TRIO\t[2]Father\t[3]Mother\t[4]Child\t[5]nTested\t[6]nMendelian Errors\t[7]nSwitch\t[8]nSwitch (%%)\n");
+    for (i=0; i<args.ntrio; i++)
+    {
+        trio_t *trio = &args.trio[i];
+        printf("TRIO\t%s\t%s\t%s\t%d\t%d\t%d\t%.2f\n",
+            bcf_hdr_int2id(args.hdr,BCF_DT_SAMPLE,trio->father),
+            bcf_hdr_int2id(args.hdr,BCF_DT_SAMPLE,trio->mother),
+            bcf_hdr_int2id(args.hdr,BCF_DT_SAMPLE,trio->child),
+            trio->ntest, trio->err, trio->nswitch, trio->ntest ? trio->nswitch*100./trio->ntest : 0
+        );
+        if (args.npop) {
+            pop_t *pop = &args.pop[trio->ipop];
+            pop->err     += trio->err;
+            pop->nswitch += trio->nswitch;
+            pop->ntest   += trio->ntest;
+            pop->pswitch += trio->ntest ? trio->nswitch*100./trio->ntest : 0;
+        }
+    }
+    printf("# POP\tpopulation or other grouping defined by an optional 7-th column of the PED file\n");
+    printf("# POP\t[2]Name\t[3]Number of trios\t[4]avgTested\t[5]avgMendelian Errors\t[6]avgSwitch\t[7]avgSwitch (%%)\n");
+    for (i=0; i<args.npop; i++)
+    {
+        pop_t *pop = &args.pop[i];
+        printf("POP\t%s\t%d\t%.0f\t%.0f\t%.0f\t%.2f\n", pop->name,pop->ntrio,
+            (float)pop->ntest/pop->ntrio,(float)pop->err/pop->ntrio,(float)pop->nswitch/pop->ntrio,
+            pop->pswitch/pop->ntrio);
+    }
+    for (i=0; i<args.npop; i++) free(args.pop[i].name);
+    free(args.pop);
+    free(args.trio);
+    free(args.gt_arr);
+}
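The switch counting done per trio in `process()` above can be sketched on a plain array. This is an illustration under stated assumptions, not the plugin's code: `switch_counts_t`, `count_switches` and `switch_rate_percent` are hypothetical names, the input holds one phase code per site on a single chromosome (1 or 2 at informative sites, 0 at sites that would be skipped), and the per-chromosome reset of `prev` is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical container for the two counters kept per trio. */
typedef struct { unsigned nswitch, ntest; } switch_counts_t;

/* Count informative sites and the number of times the phase code changes
 * between consecutive informative sites. Zero entries are skipped and do
 * not reset the previously seen phase. */
static switch_counts_t count_switches(const int *phase, size_t n)
{
    switch_counts_t cnt = {0, 0};
    int prev = 0;   /* 0 means no informative site seen yet */
    for (size_t i = 0; i < n; i++)
    {
        if ( phase[i]==0 ) continue;
        assert(phase[i]==1 || phase[i]==2);
        if ( prev > 0 && prev != phase[i] ) cnt.nswitch++;
        cnt.ntest++;
        prev = phase[i];
    }
    return cnt;
}

/* Switch rate in percent, 0 when no site was tested. */
static double switch_rate_percent(switch_counts_t cnt)
{
    return cnt.ntest ? cnt.nswitch*100.0/cnt.ntest : 0.0;
}
```

With the codes 1,1,2,0,2,1 the sketch tests five sites and counts two switches, which corresponds to the nTested, nSwitch and nSwitch (%) columns of the TRIO output lines.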
diff --git a/bcftools/plugins/trio-switch-rate.c.pysam.c b/bcftools/plugins/trio-switch-rate.c.pysam.c
new file mode 100644
index 0000000..3ac712a
--- /dev/null
+++ b/bcftools/plugins/trio-switch-rate.c.pysam.c
@@ -0,0 +1,275 @@
+#include "bcftools.pysam.h"
+
+/* The MIT License
+
+   Copyright (c) 2016 Genome Research Ltd.
+
+   Author: Petr Danecek <pd3@sanger.ac.uk>
+   
+   Permission is hereby granted, free of charge, to any person obtaining a copy
+   of this software and associated documentation files (the "Software"), to deal
+   in the Software without restriction, including without limitation the rights
+   to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+   copies of the Software, and to permit persons to whom the Software is
+   furnished to do so, subject to the following conditions:
+   
+   The above copyright notice and this permission notice shall be included in
+   all copies or substantial portions of the Software.
+   
+   THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+   IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+   FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+   AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+   LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+   OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+   THE SOFTWARE.
+
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <getopt.h>
+#include <math.h>
+#include <htslib/hts.h>
+#include <htslib/vcf.h>
+#include <htslib/kstring.h>
+#include <htslib/kseq.h>
+#include <htslib/khash_str2int.h>
+#include "bcftools.h"
+
+typedef struct
+{
+    int father, mother, child;      // VCF sample index
+    int prev, ipop;
+    uint32_t err, nswitch, ntest;
+}
+trio_t;
+
+typedef struct
+{
+    char *name;
+    uint32_t err, nswitch, ntest, ntrio;
+    float pswitch;
+}
+pop_t;
+
+typedef struct
+{
+    int argc;
+    char **argv;
+    bcf_hdr_t *hdr;
+    trio_t *trio;
+    int ntrio, mtrio;
+    int32_t *gt_arr;
+    int npop;
+    pop_t *pop;
+    int mgt_arr, prev_rid;
+}
+args_t;
+
+args_t args;
+
+const char *about(void)
+{
+    return "Calculate phase switch rate in trio samples, children samples must have phased GTs.\n";
+}
+
+const char *usage(void)
+{
+    return 
+        "\n"
+        "About: Calculate phase switch rate in trio children.\n"
+        "Usage: bcftools +trio-switch-rate [General Options] -- [Plugin Options]\n"
+        "Options:\n"
+        "   run \"bcftools plugin\" for a list of common options\n"
+        "\n"
+        "Plugin options:\n"
+        "   -p, --ped <file>        PED file with optional 7th column to group\n"
+        "                           results by population\n"
+        "\n"
+        "Example:\n"
+        "   bcftools +trio-switch-rate file.bcf -- -p file.ped\n"
+        "\n";
+}
+
+void parse_ped(args_t *args, char *fname)
+{
+    htsFile *fp = hts_open(fname, "r");
+    if ( !fp ) error("Could not read: %s\n", fname);
+
+    kstring_t str = {0,0,0};
+    if ( hts_getline(fp, KS_SEP_LINE, &str) <= 0 ) error("Empty file: %s\n", fname);
+
+    void *pop2i = khash_str2int_init();
+
+    int moff = 0, *off = NULL;
+    do
+    {
+        // familyID    sampleID paternalID maternalID sex   phenotype   population relationship   siblings   secondOrder   thirdOrder   children    comment
+        // BB03    HG01884 HG01885 HG01956 2   0   ACB child   0   0   0   0
+        int ncols = ksplit_core(str.s,0,&moff,&off);
+        if ( ncols<4 ) error("Could not parse the ped file: %s\n", str.s);
+
+        int father = bcf_hdr_id2int(args->hdr,BCF_DT_SAMPLE,&str.s[off[2]]);
+        if ( father<0 ) continue;
+        int mother = bcf_hdr_id2int(args->hdr,BCF_DT_SAMPLE,&str.s[off[3]]);
+        if ( mother<0 ) continue;
+        int child = bcf_hdr_id2int(args->hdr,BCF_DT_SAMPLE,&str.s[off[1]]);
+        if ( child<0 ) continue;
+
+        args->ntrio++;
+        hts_expand0(trio_t,args->ntrio,args->mtrio,args->trio);
+        trio_t *trio = &args->trio[args->ntrio-1];
+        trio->father = father;
+        trio->mother = mother;
+        trio->child  = child;
+
+        if (ncols>6) {
+            char *pop_name = &str.s[off[6]];
+            if ( !khash_str2int_has_key(pop2i,pop_name) )
+            {
+                pop_name = strdup(&str.s[off[6]]);
+                khash_str2int_set(pop2i,pop_name,args->npop);
+                args->npop++;
+                args->pop = (pop_t*) realloc(args->pop,args->npop*sizeof(*args->pop));
+                memset(args->pop+args->npop-1,0,sizeof(*args->pop));
+                args->pop[args->npop-1].name = pop_name;
+            }
+            khash_str2int_get(pop2i,pop_name,&trio->ipop);
+            args->pop[trio->ipop].ntrio++;
+        }
+    } while ( hts_getline(fp, KS_SEP_LINE, &str)>=0 );
+
+    khash_str2int_destroy(pop2i);
+    free(str.s);
+    free(off);
+    hts_close(fp);
+}
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+    memset(&args,0,sizeof(args_t));
+    args.argc   = argc; args.argv = argv;
+    args.prev_rid = -1;
+    args.hdr = in;
+    char *ped_fname = NULL;
+    static struct option loptions[] =
+    {
+        {"ped",required_argument,NULL,'p'},
+        {0,0,0,0}
+    };
+    int c;
+    while ((c = getopt_long(argc, argv, "?hp:",loptions,NULL)) >= 0)
+    {
+        switch (c) 
+        {
+            case 'p': ped_fname = optarg; break;
+            case 'h':
+            case '?':
+            default: error("%s", usage()); break;
+        }
+    }
+    if ( !ped_fname ) error("Expected the -p option\n");
+    parse_ped(&args, ped_fname);
+    return 1;
+}
+
+typedef struct
+{
+    int a, b, phased;
+}
+gt_t;
+
+int parse_genotype(gt_t *gt, int32_t *ptr);
+
+inline int parse_genotype(gt_t *gt, int32_t *ptr)
+{
+    if ( ptr[0]==bcf_gt_missing ) return 0;
+    if ( ptr[1]==bcf_gt_missing ) return 0;
+    if ( ptr[1]==bcf_int32_vector_end ) return 0;
+    gt->phased = bcf_gt_is_phased(ptr[1]) ? 1 : 0;
+    gt->a = bcf_gt_allele(ptr[0]); if ( gt->a > 1 ) return 0;  // consider only the first two alleles at biallelic sites
+    gt->b = bcf_gt_allele(ptr[1]); if ( gt->b > 1 ) return 0;
+    return 1;
+}
+
+bcf1_t *process(bcf1_t *rec)
+{
+    int ngt = bcf_get_genotypes(args.hdr, rec, &args.gt_arr, &args.mgt_arr);
+    if ( ngt<0 ) return NULL;
+    ngt /= bcf_hdr_nsamples(args.hdr);
+    if ( ngt!=2 ) return NULL;
+
+    int i;
+    if ( rec->rid!=args.prev_rid )
+    {
+        args.prev_rid = rec->rid;
+        for (i=0; i<args.ntrio; i++) args.trio[i].prev = 0;
+    }
+
+    gt_t child, father, mother;
+    for (i=0; i<args.ntrio; i++)
+    {
+        trio_t *trio = &args.trio[i];
+
+        if ( !parse_genotype(&child, args.gt_arr + ngt*trio->child) ) continue;
+        if ( !child.phased ) continue;
+        if ( child.a+child.b != 1 ) continue;       // child is not a het
+
+        if ( !parse_genotype(&father, args.gt_arr + ngt*trio->father) ) continue;
+        if ( !parse_genotype(&mother, args.gt_arr + ngt*trio->mother) ) continue;
+        if ( father.a+father.b == 1 && mother.a+mother.b == 1 ) continue;     // both parents are hets
+        if ( father.a+father.b == mother.a+mother.b ) { trio->err++; continue; }    // mendelian error
+
+        int test_phase = 0; 
+        if ( father.a==father.b ) test_phase = 1 + (child.a==father.a);
+        else if ( mother.a==mother.b ) test_phase = 1 + (child.b==mother.a);
+        if ( trio->prev > 0 )
+        {
+            if ( trio->prev!=test_phase ) trio->nswitch++;
+        }
+        trio->ntest++;
+        trio->prev = test_phase;
+    }
+    return NULL;
+}
+
+void destroy(void)
+{
+    int i;
+    fprintf(bcftools_stdout, "# This file was produced by: bcftools +trio-switch-rate(%s+htslib-%s)\n", bcftools_version(),hts_version());
+    fprintf(bcftools_stdout, "# The command line was:\tbcftools +trio-switch-rate %s", args.argv[0]);
+    for (i=1; i<args.argc; i++) fprintf(bcftools_stdout, " %s",args.argv[i]);
+    fprintf(bcftools_stdout, "\n#\n");
+    fprintf(bcftools_stdout, "# TRIO\t[2]Father\t[3]Mother\t[4]Child\t[5]nTested\t[6]nMendelian Errors\t[7]nSwitch\t[8]nSwitch (%%)\n");
+    for (i=0; i<args.ntrio; i++)
+    {
+        trio_t *trio = &args.trio[i];
+        fprintf(bcftools_stdout, "TRIO\t%s\t%s\t%s\t%d\t%d\t%d\t%.2f\n",
+            bcf_hdr_int2id(args.hdr,BCF_DT_SAMPLE,trio->father),
+            bcf_hdr_int2id(args.hdr,BCF_DT_SAMPLE,trio->mother),
+            bcf_hdr_int2id(args.hdr,BCF_DT_SAMPLE,trio->child),
+            trio->ntest, trio->err, trio->nswitch, trio->ntest ? trio->nswitch*100./trio->ntest : 0
+        );
+        if (args.npop) {
+            pop_t *pop = &args.pop[trio->ipop];
+            pop->err     += trio->err;
+            pop->nswitch += trio->nswitch;
+            pop->ntest   += trio->ntest;
+            pop->pswitch += trio->ntest ? trio->nswitch*100./trio->ntest : 0;
+        }
+    }
+    fprintf(bcftools_stdout, "# POP\tpopulation or other grouping defined by an optional 7-th column of the PED file\n");
+    fprintf(bcftools_stdout, "# POP\t[2]Name\t[3]Number of trios\t[4]avgTested\t[5]avgMendelian Errors\t[6]avgSwitch\t[7]avgSwitch (%%)\n");
+    for (i=0; i<args.npop; i++)
+    {
+        pop_t *pop = &args.pop[i];
+        fprintf(bcftools_stdout, "POP\t%s\t%d\t%.0f\t%.0f\t%.0f\t%.2f\n", pop->name,pop->ntrio,
+            (float)pop->ntest/pop->ntrio,(float)pop->err/pop->ntrio,(float)pop->nswitch/pop->ntrio,
+            pop->pswitch/pop->ntrio);
+    }
+    for (i=0; i<args.npop; i++) free(args.pop[i].name);
+    free(args.pop);
+    free(args.trio);
+    free(args.gt_arr);
+}
diff --git a/bcftools/regidx.c b/bcftools/regidx.c
index 9b2c66d71eae327872c98cecd1347621f4157a28..3e4620ff16b7337a171ca4492cc0543cbf8901d4 100644
--- a/bcftools/regidx.c
+++ b/bcftools/regidx.c
@@ -232,6 +232,10 @@ regidx_t *regidx_init(const char *fname, regidx_parse_f parser, regidx_free_f fr
                  parser = regidx_parse_bed;
              else if ( len>=4 && !strcasecmp(".bed",fname+len-4) )
                  parser = regidx_parse_bed;
+            else if ( len>=4 && !strcasecmp(".vcf",fname+len-4) )
+                parser = regidx_parse_vcf;
+            else if ( len>=7 && !strcasecmp(".vcf.gz",fname+len-7) )
+                parser = regidx_parse_vcf;
              else
                  parser = regidx_parse_tab;
          }
@@ -488,13 +492,20 @@ int regidx_parse_tab(const char *line, char **chr_beg, char **chr_end, uint32_t
      {
          ss = se+1;
          *end = strtod(ss, &se);
-        if ( ss==se ) *end = *beg;
+        if ( ss==se || (*se && !isspace(*se)) ) *end = *beg;
          else if ( *end==0 ) { fprintf(stderr,"Could not parse tab line, expected 1-based coordinate: %s\n", line); return -2; }
          else (*end)--;
      }
      return 0;
  }
  
+int regidx_parse_vcf(const char *line, char **chr_beg, char **chr_end, uint32_t *beg, uint32_t *end, void *payload, void *usr)
+{
+    int ret = regidx_parse_tab(line, chr_beg, chr_end, beg, end, payload, usr);
+    if ( !ret ) *end = *beg;
+    return ret;
+}
+
  int regidx_parse_reg(const char *line, char **chr_beg, char **chr_end, uint32_t *beg, uint32_t *end, void *payload, void *usr)
  {
      char *ss = (char*) line;
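The changed condition in `regidx_parse_tab()` above rejects an end column that only starts with digits, which is what a VCF ID column can look like. The following is a minimal sketch of that check on its own; `parse_end_column` is a hypothetical helper, and returning -1 for a zero coordinate is a simplification of the error path in the hunk.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical helper: `ss` points at the column after the begin coordinate,
 * `beg` is the already parsed 0-based begin. Returns the 0-based inclusive
 * end, `beg` if the column is not a clean number, or -1 if the column is 0
 * (coordinates are expected to be 1-based). */
static long parse_end_column(const char *ss, long beg)
{
    assert(ss);
    char *se;
    double val = strtod(ss, &se);

    /* Nothing parsed, or the number is followed by a non-space character:
     * treat the column as something other than a coordinate. */
    if ( ss==se || (*se && !isspace((unsigned char)*se)) ) return beg;

    if ( val==0 ) return -1;
    return (long)val - 1;
}
```

With this check an ID such as `1_100_A_T` no longer yields an end coordinate of 0, because `strtod` stops at the underscore and the position falls back to `beg`.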
diff --git a/bcftools/regidx.c.pysam.c b/bcftools/regidx.c.pysam.c
index 1082e4aa9a5ff94d3fa2630cfdca7f63bcae348c..2703eb1e3b3d666671c11d626bdd8dfe3a464e3b 100644
--- a/bcftools/regidx.c.pysam.c
+++ b/bcftools/regidx.c.pysam.c
@@ -234,6 +234,10 @@ regidx_t *regidx_init(const char *fname, regidx_parse_f parser, regidx_free_f fr
                  parser = regidx_parse_bed;
              else if ( len>=4 && !strcasecmp(".bed",fname+len-4) )
                  parser = regidx_parse_bed;
+            else if ( len>=4 && !strcasecmp(".vcf",fname+len-4) )
+                parser = regidx_parse_vcf;
+            else if ( len>=7 && !strcasecmp(".vcf.gz",fname+len-7) )
+                parser = regidx_parse_vcf;
              else
                  parser = regidx_parse_tab;
          }
@@ -490,13 +494,20 @@ int regidx_parse_tab(const char *line, char **chr_beg, char **chr_end, uint32_t
      {
          ss = se+1;
          *end = strtod(ss, &se);
-        if ( ss==se ) *end = *beg;
+        if ( ss==se || (*se && !isspace(*se)) ) *end = *beg;
          else if ( *end==0 ) { fprintf(bcftools_stderr,"Could not parse tab line, expected 1-based coordinate: %s\n", line); return -2; }
          else (*end)--;
      }
      return 0;
  }
  
+int regidx_parse_vcf(const char *line, char **chr_beg, char **chr_end, uint32_t *beg, uint32_t *end, void *payload, void *usr)
+{
+    int ret = regidx_parse_tab(line, chr_beg, chr_end, beg, end, payload, usr);
+    if ( !ret ) *end = *beg;
+    return ret;
+}
+
  int regidx_parse_reg(const char *line, char **chr_beg, char **chr_end, uint32_t *beg, uint32_t *end, void *payload, void *usr)
  {
      char *ss = (char*) line;
diff --git a/bcftools/regidx.h b/bcftools/regidx.h
index fe0a897e4ae550f0602a2fee5af990aafdee686f..eec978b6c9a9a0f4aad5ac6f7ffed6e0ef4d6707 100644
--- a/bcftools/regidx.h
+++ b/bcftools/regidx.h
@@ -106,6 +106,7 @@ typedef void (*regidx_free_f)(void *payload);
  int regidx_parse_bed(const char*,char**,char**,uint32_t*,uint32_t*,void*,void*);   // CHROM or whitespace-sepatated CHROM,FROM,TO (0-based,right-open)
  int regidx_parse_tab(const char*,char**,char**,uint32_t*,uint32_t*,void*,void*);   // CHROM or whitespace-separated CHROM,POS (1-based, inclusive)
  int regidx_parse_reg(const char*,char**,char**,uint32_t*,uint32_t*,void*,void*);   // CHROM, CHROM:POS, CHROM:FROM-TO, CHROM:FROM- (1-based, inclusive)
+int regidx_parse_vcf(const char*,char**,char**,uint32_t*,uint32_t*,void*,void*);
  
  /*
   *  regidx_init() - creates new index
diff --git a/bcftools/reheader.c b/bcftools/reheader.c
index 2776069009a718d24b854cb891d884b143dd173c..d38735f999c2b045458199c4f6850973d448d6bb 100644
--- a/bcftools/reheader.c
+++ b/bcftools/reheader.c
@@ -1,6 +1,6 @@
  /*  reheader.c -- reheader subcommand.
  
-    Copyright (C) 2014-2017 Genome Research Ltd.
+    Copyright (C) 2014-2018 Genome Research Ltd.
  
      Author: Petr Danecek <pd3@sanger.ac.uk>
  
@@ -30,12 +30,14 @@ THE SOFTWARE.  */
  #include <errno.h>
  #include <sys/stat.h>
  #include <sys/types.h>
+#include <inttypes.h>
  #include <fcntl.h>
  #include <math.h>
  #include <htslib/vcf.h>
  #include <htslib/bgzf.h>
  #include <htslib/tbx.h> // for hts_get_bgzfp()
  #include <htslib/kseq.h>
+#include <htslib/thread_pool.h>
  #include "bcftools.h"
  #include "khash_str2str.h"
  
@@ -44,7 +46,8 @@ typedef struct _args_t
      char **argv, *fname, *samples_fname, *header_fname, *output_fname;
      htsFile *fp;
      htsFormat type;
-    int argc;
+    htsThreadPool *threads;
+    int argc, n_threads;
  }
  args_t;
  
@@ -299,16 +302,16 @@ static void reheader_vcf(args_t *args)
  
      int out = args->output_fname ? open(args->output_fname, O_WRONLY|O_CREAT|O_TRUNC, 0666) : STDOUT_FILENO;
      if ( out==-1 ) error("%s: %s\n", args->output_fname,strerror(errno));
-    if ( write(out, hdr.s, hdr.l)!=hdr.l ) error("Failed to write %d bytes\n", hdr.l);
+    if ( write(out, hdr.s, hdr.l)!=hdr.l ) error("Failed to write %"PRIu64" bytes\n", (uint64_t)hdr.l);
      free(hdr.s);
      if ( fp->line.l )
      {
-        if ( write(out, fp->line.s, fp->line.l)!=fp->line.l ) error("Failed to write %d bytes\n", fp->line.l);
+        if ( write(out, fp->line.s, fp->line.l)!=fp->line.l ) error("Failed to write %"PRIu64" bytes\n", (uint64_t)fp->line.l);
      }
      while ( hts_getline(fp, KS_SEP_LINE, &fp->line) >=0 )   // uncompressed file implies small size, we don't worry about speed
      {
          kputc('\n',&fp->line);
-        if ( write(out, fp->line.s, fp->line.l)!=fp->line.l ) error("Failed to write %d bytes\n", fp->line.l);
+        if ( write(out, fp->line.s, fp->line.l)!=fp->line.l ) error("Failed to write %"PRIu64" bytes\n", (uint64_t)fp->line.l);
      }
      hts_close(fp);
      close(out);
@@ -369,6 +372,16 @@ static bcf_hdr_t *strip_header(bcf_hdr_t *src, bcf_hdr_t *dst)
  static void reheader_bcf(args_t *args, int is_compressed)
  {
      htsFile *fp = args->fp;
+
+    if ( args->n_threads > 0 )
+    {
+        args->threads = calloc(1, sizeof(*args->threads));
+        if ( !args->threads ) error("Could not allocate memory\n");
+        if ( !(args->threads->pool = hts_tpool_init(args->n_threads)) ) error("Could not initialize threading\n");
+        BGZF *bgzf = hts_get_bgzfp(fp);
+        if ( bgzf ) bgzf_thread_pool(bgzf, args->threads->pool, args->threads->qsize);
+    }
+
      bcf_hdr_t *hdr = bcf_hdr_read(fp); if ( !hdr ) error("Failed to read the header: %s\n", args->fname);
      kstring_t htxt = {0,0,0};
      bcf_hdr_format(hdr, 1, &htxt);
@@ -396,6 +409,11 @@ static void reheader_bcf(args_t *args, int is_compressed)
      // write the header and the body
      htsFile *fp_out = hts_open(args->output_fname ? args->output_fname : "-",is_compressed ? "wb" : "wbu");
      if ( !fp_out ) error("%s: %s\n", args->output_fname ? args->output_fname : "-", strerror(errno));
+    if ( args->threads )
+    {
+        BGZF *bgzf = hts_get_bgzfp(fp_out);
+        if ( bgzf ) bgzf_thread_pool(bgzf, args->threads->pool, args->threads->qsize);
+    }
      bcf_hdr_write(fp_out, hdr_out);
  
      bcf1_t *rec = bcf_init();
@@ -450,6 +468,11 @@ static void reheader_bcf(args_t *args, int is_compressed)
      hts_close(fp);
      bcf_hdr_destroy(hdr_out);
      bcf_hdr_destroy(hdr);
+    if ( args->threads )
+    {
+        hts_tpool_destroy(args->threads->pool);
+        free(args->threads);
+    }
  }
  
  
@@ -463,6 +486,7 @@ static void usage(args_t *args)
      fprintf(stderr, "    -h, --header <file>     new header\n");
      fprintf(stderr, "    -o, --output <file>     write output to a file [standard output]\n");
      fprintf(stderr, "    -s, --samples <file>    new sample names\n");
+    fprintf(stderr, "        --threads <int>     number of extra compression threads (BCF only) [0]\n");
      fprintf(stderr, "\n");
      exit(1);
  }
@@ -472,18 +496,20 @@ int main_reheader(int argc, char *argv[])
      int c;
      args_t *args  = (args_t*) calloc(1,sizeof(args_t));
      args->argc    = argc; args->argv = argv;
-
+    
      static struct option loptions[] =
      {
          {"output",1,0,'o'},
          {"header",1,0,'h'},
          {"samples",1,0,'s'},
+        {"threads",1,NULL,1},
          {0,0,0,0}
      };
      while ((c = getopt_long(argc, argv, "s:h:o:",loptions,NULL)) >= 0)
      {
          switch (c)
          {
+            case  1 : args->n_threads = strtol(optarg, 0, 0); break;
              case 'o': args->output_fname = optarg; break;
              case 's': args->samples_fname = optarg; break;
              case 'h': args->header_fname = optarg; break;
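The `--threads` handler in this hunk hands `optarg` straight to `strtol` and stores the result without checking for trailing characters, negative values, or overflow. A minimal sketch of a stricter parse is shown below. It assumes a hypothetical helper `parse_n_threads`, which is not part of bcftools, and it uses base 10 where the patch uses base 0.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper (not part of bcftools): parse a non-negative thread
   count. Returns 0 and stores the value in *out on success, -1 otherwise. */
static int parse_n_threads(const char *str, int *out)
{
    assert( str && out );
    char *end = NULL;
    errno = 0;
    long val = strtol(str, &end, 10);
    if ( end==str || *end!='\0' ) return -1;                 /* no digits, or trailing characters */
    if ( errno==ERANGE || val<0 || val>INT_MAX ) return -1;  /* out of range for an int count */
    *out = (int) val;
    return 0;
}
```

With such a helper, the `case 1` branch could report a usage error instead of silently accepting input such as `4x`.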
diff --git a/bcftools/reheader.c.pysam.c b/bcftools/reheader.c.pysam.c

index 11c0870a98f41e4733b4522b70410a12fd707356..dd588de1ccabee6d80855d6f3bd52ec1d45e97e6 100644 (file)
--- a/bcftools/reheader.c.pysam.c
+++ b/bcftools/reheader.c.pysam.c
@@ -2,7 +2,7 @@
  
  /*  reheader.c -- reheader subcommand.
  
-    Copyright (C) 2014-2017 Genome Research Ltd.
+    Copyright (C) 2014-2018 Genome Research Ltd.
  
      Author: Petr Danecek <pd3@sanger.ac.uk>
  
@@ -32,12 +32,14 @@ THE SOFTWARE.  */
  #include <errno.h>
  #include <sys/stat.h>
  #include <sys/types.h>
+#include <inttypes.h>
  #include <fcntl.h>
  #include <math.h>
  #include <htslib/vcf.h>
  #include <htslib/bgzf.h>
  #include <htslib/tbx.h> // for hts_get_bgzfp()
  #include <htslib/kseq.h>
+#include <htslib/thread_pool.h>
  #include "bcftools.h"
  #include "khash_str2str.h"
  
@@ -46,7 +48,8 @@ typedef struct _args_t
      char **argv, *fname, *samples_fname, *header_fname, *output_fname;
      htsFile *fp;
      htsFormat type;
-    int argc;
+    htsThreadPool *threads;
+    int argc, n_threads;
  }
  args_t;
  
@@ -301,16 +304,16 @@ static void reheader_vcf(args_t *args)
  
      int out = args->output_fname ? open(args->output_fname, O_WRONLY|O_CREAT|O_TRUNC, 0666) : STDOUT_FILENO;
      if ( out==-1 ) error("%s: %s\n", args->output_fname,strerror(errno));
-    if ( write(out, hdr.s, hdr.l)!=hdr.l ) error("Failed to write %d bytes\n", hdr.l);
+    if ( write(out, hdr.s, hdr.l)!=hdr.l ) error("Failed to write %"PRIu64" bytes\n", (uint64_t)hdr.l);
      free(hdr.s);
      if ( fp->line.l )
      {
-        if ( write(out, fp->line.s, fp->line.l)!=fp->line.l ) error("Failed to write %d bytes\n", fp->line.l);
+        if ( write(out, fp->line.s, fp->line.l)!=fp->line.l ) error("Failed to write %"PRIu64" bytes\n", (uint64_t)fp->line.l);
      }
      while ( hts_getline(fp, KS_SEP_LINE, &fp->line) >=0 )   // uncompressed file implies small size, we don't worry about speed
      {
          kputc('\n',&fp->line);
-        if ( write(out, fp->line.s, fp->line.l)!=fp->line.l ) error("Failed to write %d bytes\n", fp->line.l);
+        if ( write(out, fp->line.s, fp->line.l)!=fp->line.l ) error("Failed to write %"PRIu64" bytes\n", (uint64_t)fp->line.l);
      }
      hts_close(fp);
      close(out);
@@ -371,6 +374,16 @@ static bcf_hdr_t *strip_header(bcf_hdr_t *src, bcf_hdr_t *dst)
  static void reheader_bcf(args_t *args, int is_compressed)
  {
      htsFile *fp = args->fp;
+
+    if ( args->n_threads > 0 )
+    {
+        args->threads = calloc(1, sizeof(*args->threads));
+        if ( !args->threads ) error("Could not allocate memory\n");
+        if ( !(args->threads->pool = hts_tpool_init(args->n_threads)) ) error("Could not initialize threading\n");
+        BGZF *bgzf = hts_get_bgzfp(fp);
+        if ( bgzf ) bgzf_thread_pool(bgzf, args->threads->pool, args->threads->qsize);
+    }
+
      bcf_hdr_t *hdr = bcf_hdr_read(fp); if ( !hdr ) error("Failed to read the header: %s\n", args->fname);
      kstring_t htxt = {0,0,0};
      bcf_hdr_format(hdr, 1, &htxt);
@@ -398,6 +411,11 @@ static void reheader_bcf(args_t *args, int is_compressed)
      // write the header and the body
      htsFile *fp_out = hts_open(args->output_fname ? args->output_fname : "-",is_compressed ? "wb" : "wbu");
      if ( !fp_out ) error("%s: %s\n", args->output_fname ? args->output_fname : "-", strerror(errno));
+    if ( args->threads )
+    {
+        BGZF *bgzf = hts_get_bgzfp(fp_out);
+        if ( bgzf ) bgzf_thread_pool(bgzf, args->threads->pool, args->threads->qsize);
+    }
      bcf_hdr_write(fp_out, hdr_out);
  
      bcf1_t *rec = bcf_init();
@@ -452,6 +470,11 @@ static void reheader_bcf(args_t *args, int is_compressed)
      hts_close(fp);
      bcf_hdr_destroy(hdr_out);
      bcf_hdr_destroy(hdr);
+    if ( args->threads )
+    {
+        hts_tpool_destroy(args->threads->pool);
+        free(args->threads);
+    }
  }
  
  
@@ -465,6 +488,7 @@ static void usage(args_t *args)
      fprintf(bcftools_stderr, "    -h, --header <file>     new header\n");
      fprintf(bcftools_stderr, "    -o, --output <file>     write output to a file [standard output]\n");
      fprintf(bcftools_stderr, "    -s, --samples <file>    new sample names\n");
+    fprintf(bcftools_stderr, "        --threads <int>     number of extra compression threads (BCF only) [0]\n");
      fprintf(bcftools_stderr, "\n");
      exit(1);
  }
@@ -474,18 +498,20 @@ int main_reheader(int argc, char *argv[])
      int c;
      args_t *args  = (args_t*) calloc(1,sizeof(args_t));
      args->argc    = argc; args->argv = argv;
-
+    
      static struct option loptions[] =
      {
          {"output",1,0,'o'},
          {"header",1,0,'h'},
          {"samples",1,0,'s'},
+        {"threads",1,NULL,1},
          {0,0,0,0}
      };
      while ((c = getopt_long(argc, argv, "s:h:o:",loptions,NULL)) >= 0)
      {
          switch (c)
          {
+            case  1 : args->n_threads = strtol(optarg, 0, 0); break;
              case 'o': args->output_fname = optarg; break;
              case 's': args->samples_fname = optarg; break;
              case 'h': args->header_fname = optarg; break;
diff --git a/bcftools/tabix.c.pysam.c b/bcftools/tabix.c.pysam.c

index ba9e1b31f633ba4c99be644aa580915098bc6c2c..2216d6e6e2766beda5dae0b1836cd90c71bb6026 100644 (file)
--- a/bcftools/tabix.c.pysam.c
+++ b/bcftools/tabix.c.pysam.c
@@ -78,7 +78,7 @@ int main_tabix(int argc, char *argv[])
          BGZF *fp;
          s.l = s.m = 0; s.s = 0;
          fp = bgzf_open(argv[optind], "r");
-        while (bgzf_getline(fp, '\n', &s) >= 0) fputs(s.s, bcftools_stdout) & fputc('\n', bcftools_stdout);
+        while (bgzf_getline(fp, '\n', &s) >= 0) bcftools_puts(s.s);
          bgzf_close(fp);
          free(s.s);
      } else if (optind + 2 > argc) { // create index
@@ -122,7 +122,7 @@ int main_tabix(int argc, char *argv[])
          for (i = optind + 1; i < argc; ++i) {
              hts_itr_t *itr;
              if ((itr = tbx_itr_querys(tbx, argv[i])) == 0) continue;
-            while (tbx_bgzf_itr_next(fp, tbx, itr, &s) >= 0) fputs(s.s, bcftools_stdout) & fputc('\n', bcftools_stdout);
+            while (tbx_bgzf_itr_next(fp, tbx, itr, &s) >= 0) bcftools_puts(s.s);
              tbx_itr_destroy(itr);
          }
          free(s.s);
diff --git a/bcftools/test/test-rbuf.c b/bcftools/test/test-rbuf.c

new file mode 100644 (file)

index 0000000..5c0480f
--- /dev/null
+++ b/bcftools/test/test-rbuf.c
@@ -0,0 +1,75 @@
+/*  test/test-rbuf.c -- rbuf_t test harness.
+
+    Copyright (C) 2014 Genome Research Ltd.
+
+    Author: Petr Danecek <pd3@sanger.ac.uk>
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE.  */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include "rbuf.h"
+
+void debug_print(rbuf_t *rbuf, int *dat)
+{
+    int i;
+    for (i=-1; rbuf_next(rbuf, &i); ) printf(" %2d", i);
+    printf("\n");
+    for (i=-1; rbuf_next(rbuf, &i); ) printf(" %2d", dat[i]);
+    printf("\n");
+}
+
+int main(int argc, char **argv)
+{
+    int i, j, *dat = (int*)calloc(10,sizeof(int));
+    rbuf_t rbuf;
+    rbuf_init(&rbuf,10);
+
+    rbuf.f = 5; // force wrapping
+    for (i=0; i<9; i++)
+    {
+        j = rbuf_append(&rbuf);
+        dat[j] = i+1;
+    }
+    printf("Inserted 1-9 starting at offset 5:\n");
+    debug_print(&rbuf, dat);
+
+    i = rbuf_kth(&rbuf, 3);
+    printf("4th is %d\n", dat[i]);
+
+    printf("Deleting 1-2:\n");
+    rbuf_shift_n(&rbuf, 2);
+    debug_print(&rbuf, dat);
+
+    printf("Prepending 0-8:\n");
+    for (i=0; i<9; i++)
+    {
+        j = rbuf_prepend(&rbuf);
+        dat[j] = i;
+    }
+    debug_print(&rbuf, dat);
+
+    printf("Expanding:\n");
+    rbuf_expand0(&rbuf,int,rbuf.n+1,dat);
+    debug_print(&rbuf, dat);
+
+    free(dat);
+    return 0;
+}
+
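The harness above sets the start offset of the ring buffer to 5 so that nine appended values wrap past the end of the 10-element array. The index arithmetic behind that wrap can be sketched as follows. This is only an illustration built on a hypothetical `ring_t` type; it is not the `rbuf.h` implementation, whose internals are not part of this diff, and its behaviour when full (returning -1) is an assumption of the sketch.

```c
#include <assert.h>

/* Hypothetical fixed-capacity ring of storage indices; not the rbuf.h API. */
typedef struct { int m, n, f; } ring_t;   /* capacity, elements in use, index of the first element */

static void ring_init(ring_t *r, int m)
{
    assert( m>0 );
    r->m = m; r->n = 0; r->f = 0;
}

/* Reserve a slot after the last element; returns its storage index, or -1 when full */
static int ring_append(ring_t *r)
{
    if ( r->n==r->m ) return -1;
    int i = (r->f + r->n) % r->m;   /* wraps around the end of the array */
    r->n++;
    return i;
}

/* Storage index of the k-th element (0-based), or -1 if k is out of range */
static int ring_kth(const ring_t *r, int k)
{
    if ( k<0 || k>=r->n ) return -1;
    return (r->f + k) % r->m;
}

/* Drop the first n elements, or all of them when n exceeds the current count */
static void ring_shift_n(ring_t *r, int n)
{
    if ( n>=r->n ) { r->n = 0; r->f = 0; return; }
    r->n -= n;
    r->f = (r->f + n) % r->m;
}
```

The data array is kept separate from the ring, as in the harness: the ring only hands out indices and the caller stores values at those positions.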
diff --git a/bcftools/test/test-rbuf.c.pysam.c b/bcftools/test/test-rbuf.c.pysam.c

new file mode 100644 (file)

index 0000000..9dae8bc
--- /dev/null
+++ b/bcftools/test/test-rbuf.c.pysam.c
@@ -0,0 +1,77 @@
+#include "bcftools.pysam.h"
+
+/*  test/test-rbuf.c -- rbuf_t test harness.
+
+    Copyright (C) 2014 Genome Research Ltd.
+
+    Author: Petr Danecek <pd3@sanger.ac.uk>
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE.  */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include "rbuf.h"
+
+void debug_print(rbuf_t *rbuf, int *dat)
+{
+    int i;
+    for (i=-1; rbuf_next(rbuf, &i); ) fprintf(bcftools_stdout, " %2d", i);
+    fprintf(bcftools_stdout, "\n");
+    for (i=-1; rbuf_next(rbuf, &i); ) fprintf(bcftools_stdout, " %2d", dat[i]);
+    fprintf(bcftools_stdout, "\n");
+}
+
+int bcftools_test_rbuf_main(int argc, char **argv)
+{
+    int i, j, *dat = (int*)calloc(10,sizeof(int));
+    rbuf_t rbuf;
+    rbuf_init(&rbuf,10);
+
+    rbuf.f = 5; // force wrapping
+    for (i=0; i<9; i++)
+    {
+        j = rbuf_append(&rbuf);
+        dat[j] = i+1;
+    }
+    fprintf(bcftools_stdout, "Inserted 1-9 starting at offset 5:\n");
+    debug_print(&rbuf, dat);
+
+    i = rbuf_kth(&rbuf, 3);
+    fprintf(bcftools_stdout, "4th is %d\n", dat[i]);
+
+    fprintf(bcftools_stdout, "Deleting 1-2:\n");
+    rbuf_shift_n(&rbuf, 2);
+    debug_print(&rbuf, dat);
+
+    fprintf(bcftools_stdout, "Prepending 0-8:\n");
+    for (i=0; i<9; i++)
+    {
+        j = rbuf_prepend(&rbuf);
+        dat[j] = i;
+    }
+    debug_print(&rbuf, dat);
+
+    fprintf(bcftools_stdout, "Expanding:\n");
+    rbuf_expand0(&rbuf,int,rbuf.n+1,dat);
+    debug_print(&rbuf, dat);
+
+    free(dat);
+    return 0;
+}
+
diff --git a/bcftools/test/test-regidx.c b/bcftools/test/test-regidx.c

new file mode 100644 (file)

index 0000000..be44077
--- /dev/null
+++ b/bcftools/test/test-regidx.c
@@ -0,0 +1,374 @@
+/*  test/test-regidx.c -- Regions index test harness.
+
+    gcc -g -Wall -O0 -I. -I../htslib/ -L../htslib regidx.c -o test-regidx test-regidx.c -lhts
+
+    Copyright (C) 2014 Genome Research Ltd.
+
+    Author: Petr Danecek <pd3@sanger.ac.uk>
+
+    Permission is hereby granted, free of charge, to any person obtaining a copy
+    of this software and associated documentation files (the "Software"), to deal
+    in the Software without restriction, including without limitation the rights
+    to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+    copies of the Software, and to permit persons to whom the Software is
+    furnished to do so, subject to the following conditions:
+    
+    The above copyright notice and this permission notice shall be included in
+    all copies or substantial portions of the Software.
+    
+    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+    IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+    FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+    AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+    LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+    OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+    THE SOFTWARE.
+*/
+
+#include <stdarg.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <ctype.h>
+#include <string.h>
+#include <getopt.h>
+#include <htslib/kstring.h>
+#include <time.h>
+#include "regidx.h"
+
+static int verbose = 0;
+
+void debug(const char *format, ...)
+{
+    if ( verbose<2 ) return;
+    va_list ap;
+    va_start(ap, format);
+    vfprintf(stderr, format, ap);
+    va_end(ap);
+}
+void info(const char *format, ...)
+{
+    if ( verbose<1 ) return;
+    va_list ap;
+    va_start(ap, format);
+    vfprintf(stderr, format, ap);
+    va_end(ap);
+}
+void error(const char *format, ...)
+{
+    va_list ap;
+    va_start(ap, format);
+    vfprintf(stderr, format, ap);
+    va_end(ap);
+    exit(-1);
+}
+
+int custom_parse(const char *line, char **chr_beg, char **chr_end, uint32_t *beg, uint32_t *end, void *payload, void *usr)
+{
+    // Use the standard parser for CHROM,FROM,TO
+    int i, ret = regidx_parse_tab(line,chr_beg,chr_end,beg,end,NULL,NULL);
+    if ( ret!=0 ) return ret;
+
+    // Skip the fields that were parsed above
+    char *ss = (char*) line;
+    while ( *ss && isspace(*ss) ) ss++;
+    for (i=0; i<3; i++)
+    {
+        while ( *ss && !isspace(*ss) ) ss++;
+        if ( !*ss ) return -2;  // wrong number of fields
+        while ( *ss && isspace(*ss) ) ss++;
+    }
+    if ( !*ss ) return -2;
+
+    // Parse the payload
+    char *se = ss;
+    while ( *se && !isspace(*se) ) se++;
+    char **dat = (char**) payload;
+    *dat = (char*) malloc(se-ss+1);
+    memcpy(*dat,ss,se-ss+1);
+    (*dat)[se-ss] = 0;
+    return 0;
+}
+void custom_free(void *payload)
+{
+    char **dat = (char**)payload;
+    free(*dat);
+}
+
+void test_sequential_access(void)
+{
+    // Init index with no file name, we will insert the regions manually
+    regidx_t *idx = regidx_init(NULL,custom_parse,custom_free,sizeof(char*),NULL);
+    if ( !idx ) error("init failed\n");
+
+    // Insert regions
+    kstring_t str = {0,0,0};
+    int i, n = 10;
+    for (i=0; i<n; i++)
+    {
+        int beg = 10*(i+1);
+        str.l = 0;
+        ksprintf(&str,"1\t%d\t%d\t%d",beg,beg,beg);
+        if ( regidx_insert(idx,str.s)!=0 ) error("insert failed: %s\n",str.s);
+    }
+
+    // Test
+    regitr_t *itr = regitr_init(idx);
+    i = 0;
+    while ( regitr_loop(itr) )
+    {
+        if ( itr->beg!=itr->end || itr->beg+1!=10*(i+1) ) error("listing failed, expected %d, found %d\n",10*(i+1),itr->beg+1);
+        str.l = 0;
+        ksprintf(&str,"%d",itr->beg+1);
+        if ( strcmp(regitr_payload(itr,char*),str.s) ) error("listing failed, expected payload \"%s\", found \"%s\"\n",str.s,regitr_payload(itr,char*));
+        i++;
+    }
+    if ( i!=n ) error("Expected %d regions, listed %d\n", n,i);
+    debug("ok: listed %d regions\n", n);
+
+    // Clean up
+    regitr_destroy(itr);
+    regidx_destroy(idx);
+    free(str.s);
+}
+
+void test_custom_payload(void)
+{
+    // Init index with no file name, we will insert the regions manually
+    regidx_t *idx = regidx_init(NULL,custom_parse,custom_free,sizeof(char*),NULL);
+    if ( !idx ) error("init failed\n");
+
+    // Insert regions
+    char *line;
+    line = "1 10000000 10000000 1:10000000-10000000"; if ( regidx_insert(idx,line)!=0 ) error("insert failed: %s\n", line);
+    line = "1 20000000 20000001 1:20000000-20000001"; if ( regidx_insert(idx,line)!=0 ) error("insert failed: %s\n", line);
+    line = "1 20000002 20000002 1:20000002-20000002"; if ( regidx_insert(idx,line)!=0 ) error("insert failed: %s\n", line);
+    line = "1 30000000 30000000 1:30000000-30000000"; if ( regidx_insert(idx,line)!=0 ) error("insert failed: %s\n", line);
+
+    // Test 
+    regitr_t *itr = regitr_init(idx);
+    int from, to;
+
+    from = to = 10000000;
+    if ( !regidx_overlap(idx,"1",from-1,to-1,itr) ) error("query failed: 1:%d-%d\n",from,to);
+    if ( strcmp("1:10000000-10000000",regitr_payload(itr,char*)) ) error("query failed: 1:%d-%d vs %s\n", from,to,regitr_payload(itr,char*));
+    if ( !regidx_overlap(idx,"1",from-2,to-1,itr) ) error("query failed: 1:%d-%d\n",from-1,to);
+    if ( !regidx_overlap(idx,"1",from-2,to+3,itr) ) error("query failed: 1:%d-%d\n",from-1,to+2);
+    if ( regidx_overlap(idx,"1",from-2,to-2,itr) ) error("query failed: 1:%d-%d\n",from-1,to-1);
+
+    from = to = 20000000;
+    if ( !regidx_overlap(idx,"1",from-1,to-1,itr) ) error("query failed: 1:%d-%d\n",from,to);
+
+    from = to = 20000002;
+    if ( !regidx_overlap(idx,"1",from-1,to-1,itr) ) error("query failed: 1:%d-%d\n",from,to);
+
+    from = to = 30000000;
+    if ( !regidx_overlap(idx,"1",from-1,to-1,itr) ) error("query failed: 1:%d-%d\n",from,to);
+
+    // Clean up
+    regitr_destroy(itr);
+    regidx_destroy(idx);
+}
+
+void get_random_region(uint32_t min, uint32_t max, uint32_t *beg, uint32_t *end)
+{
+    long int b = random(), e = random();
+    *beg = min + (float)b * (max-min) / RAND_MAX;
+    *end = *beg + (float)e * (max-*beg) / RAND_MAX;
+}
+
+void test_random(int nregs, uint32_t min, uint32_t max)
+{
+    min--;
+    max--;
+
+    // Init index with no file name, we will insert the regions manually
+    regidx_t *idx = regidx_init(NULL,custom_parse,custom_free,sizeof(char*),NULL);
+    if ( !idx ) error("init failed\n");
+
+    // Test region
+    uint32_t beg,end;
+    get_random_region(min,max,&beg,&end);
+
+    // Insert regions
+    int i, nexp = 0;
+    kstring_t str = {0,0,0};
+    for (i=0; i<nregs; i++)
+    {
+        uint32_t b,e;
+        get_random_region(min,max,&b,&e);
+        str.l = 0;
+        ksprintf(&str,"1\t%"PRIu32"\t%"PRIu32"\t1:%"PRIu32"-%"PRIu32"",b+1,e+1,b+1,e+1);
+        if ( regidx_insert(idx,str.s)!=0 ) error("insert failed: %s\n", str.s);
+        if ( e>=beg && b<=end ) nexp++;
+    }
+
+    // Test 
+    regitr_t *itr = regitr_init(idx);
+    int nhit = 0, ret = regidx_overlap(idx,"1",beg,end,itr);
+    if ( nexp && !ret ) error("query failed, expected %d overlap(s), found none: %d-%d\n", nexp,beg+1,end+1);
+    if ( !nexp && ret ) error("query failed, expected no overlaps, found some: %d-%d\n", beg+1,end+1);
+    while ( ret && regitr_overlap(itr) )
+    {
+        str.l = 0;
+        ksprintf(&str,"1:%"PRIu32"-%"PRIu32"",itr->beg+1,itr->end+1);
+        if ( strcmp(str.s,regitr_payload(itr,char*)) )
+            error("query failed, incorrect payload: %s vs %s (%d-%d)\n",str.s,regitr_payload(itr,char*),beg+1,end+1);
+        if ( itr->beg > end || itr->end < beg )
+            error("query failed, incorrect hit: %d-%d vs %d-%d, payload %s\n", beg+1,end+1,itr->beg+1,itr->end+1,regitr_payload(itr,char*));
+        nhit++;
+    }
+    if ( nexp!=nhit ) error("query failed, expected %d overlap(s), found %d: %d-%d\n",nexp,nhit,beg+1,end+1);
+    debug("ok: found %d overlaps\n", nexp);
+
+    // Clean up
+    regitr_destroy(itr);
+    regidx_destroy(idx);
+    free(str.s);
+}
+
+void create_line_bed(char *line, char *chr, int start, int end)
+{
+    sprintf(line,"%s\t%d\t%d\n",chr,start-1,end);
+}
+void create_line_tab(char *line, char *chr, int start, int end)
+{
+    sprintf(line,"%s\t%d\t%d\n",chr,start,end);
+}
+void create_line_reg(char *line, char *chr, int start, int end)
+{
+    sprintf(line,"%s:%d-%d\n",chr,start,end);
+}
+
+typedef void (*set_line_f)(char *line, char *chr, int start, int end);
+
+void test(set_line_f set_line, regidx_parse_f parse)
+{
+    regidx_t *idx = regidx_init(NULL,parse,NULL,0,NULL);
+    if ( !idx ) error("init failed\n");
+
+    char line[250], *chr = "1";
+    int i, n = 10, start, end, nhit;
+    for (i=1; i<n; i++)
+    {
+        start = end = 10*i;
+        set_line(line,chr,start,end);
+        debug("insert: %s", line);
+        if ( regidx_insert(idx,line)!=0 ) error("insert failed: %s\n", line);
+
+        start = end = 10*i + 1;
+        set_line(line,chr,start,end);
+        debug("insert: %s", line);
+        if ( regidx_insert(idx,line)!=0 ) error("insert failed: %s\n", line);
+    }
+
+    regitr_t *itr = regitr_init(idx);
+    for (i=1; i<n; i++)
+    {
+        // no hit
+        start = end = 10*i - 1;
+        if ( regidx_overlap(idx,chr,start-1,end-1,itr) ) error("query failed, there should be no hit: %s:%d-%d\n",chr,start,end);
+        debug("ok: no overlap found for %s:%d-%d\n",chr,start,end);
+
+
+        // one hit
+        start = end = 10*i;
+        if ( !regidx_overlap(idx,chr,start-1,end-1,itr) ) error("query failed, there should be a hit: %s:%d-%d\n",chr,start,end);
+        debug("ok: overlap(s) found for %s:%d-%d\n",chr,start,end);
+        nhit = 0;
+        while ( regitr_overlap(itr) )
+        {
+            if ( itr->beg > end-1 || itr->end < start-1 ) error("query failed, incorrect region: %d-%d for %d-%d\n",itr->beg+1,itr->end+1,start,end);
+            debug("\t %d-%d\n",itr->beg+1,itr->end+1);
+            nhit++;
+        }
+        if ( nhit!=1 ) error("query failed, expected one hit, found %d: %s:%d-%d\n",nhit,chr,start,end);
+
+
+        // one hit
+        start = end = 10*i+1;
+        if ( !regidx_overlap(idx,chr,start-1,end-1,itr) ) error("query failed, there should be a hit: %s:%d-%d\n",chr,start,end);
+        debug("ok: overlap(s) found for %s:%d-%d\n",chr,start,end);
+        nhit = 0;
+        while ( regitr_overlap(itr) )
+        {
+            if ( itr->beg > end-1 || itr->end < start-1 ) error("query failed, incorrect region: %d-%d for %d-%d\n",itr->beg+1,itr->end+1,start,end);
+            debug("\t %d-%d\n",itr->beg+1,itr->end+1);
+            nhit++;
+        }
+        if ( nhit!=1 ) error("query failed, expected one hit, found %d: %s:%d-%d\n",nhit,chr,start,end);
+
+
+        // two hits
+        start = 10*i; end = start+1;
+        if ( !regidx_overlap(idx,chr,start-1,end-1,itr) ) error("query failed, there should be a hit: %s:%d-%d\n",chr,start,end);
+        debug("ok: overlap(s) found for %s:%d-%d\n",chr,start,end);
+        nhit = 0;
+        while ( regitr_overlap(itr) )
+        {
+            if ( itr->beg > end-1 || itr->end < start-1 ) error("query failed, incorrect region: %d-%d for %d-%d\n",itr->beg+1,itr->end+1,start,end);
+            debug("\t %d-%d\n",itr->beg+1,itr->end+1);
+            nhit++;
+        }
+        if ( nhit!=2 ) error("query failed, expected two hits, found %d: %s:%d-%d\n",nhit,chr,start,end);
+
+    }
+    regitr_destroy(itr);
+    regidx_destroy(idx);
+}
+
+static void usage(void)
+{
+    fprintf(stderr, "Usage: test-regidx [OPTIONS]\n");
+    fprintf(stderr, "Options:\n");
+    fprintf(stderr, "   -h, --help          this help message\n");
+    fprintf(stderr, "   -s, --seed <int>    random seed\n");
+    fprintf(stderr, "   -v, --verbose       increase verbosity by giving multiple times\n");
+
+    exit(1);
+}
+
+int main(int argc, char **argv)
+{
+    static struct option loptions[] =
+    {
+        {"help",0,0,'h'},
+        {"verbose",0,0,'v'},
+        {"seed",1,0,'s'},
+        {0,0,0,0}
+    };
+    int c;
+    int seed = (int)time(NULL);
+    while ((c = getopt_long(argc, argv, "hvs:",loptions,NULL)) >= 0) 
+    {
+        switch (c)
+        {
+            case 's': seed = atoi(optarg); break;
+            case 'v': verbose++; break;
+            default: usage(); break;
+        }
+    }
+
+    info("Testing sequential access\n");
+    test_sequential_access();
+
+    info("Testing TAB\n");
+    test(create_line_tab,regidx_parse_tab);
+
+    info("Testing REG\n");
+    test(create_line_reg,regidx_parse_reg);
+
+    info("Testing BED\n");
+    test(create_line_bed,regidx_parse_bed);
+
+    info("Testing custom payload\n");
+    test_custom_payload();
+
+    int i, ntest = 1000, nreg = 50;
+    srandom(seed);
+    info("%d randomized tests, %d regions per test. Random seed is %d\n", ntest,nreg,seed);
+    for (i=0; i<ntest; i++) test_random(nreg,1,1000);
+
+    return 0;
+}
+
+
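`test_random()` above computes the expected number of hits with the closed-interval test `e>=beg && b<=end` and compares it with what the index returns. A self-contained sketch of that brute-force reference check is given below. The names `intervals_overlap` and `count_overlaps` are hypothetical and do not appear in `regidx.h`.

```c
#include <assert.h>
#include <stdint.h>

/* Closed intervals [b1,e1] and [b2,e2] overlap iff each begins no later than the other ends */
static int intervals_overlap(uint32_t b1, uint32_t e1, uint32_t b2, uint32_t e2)
{
    assert( b1<=e1 && b2<=e2 );
    return e1>=b2 && b1<=e2;
}

/* Brute-force reference: count the regions that overlap the query [qbeg,qend] */
static int count_overlaps(const uint32_t *begs, const uint32_t *ends, int n,
                          uint32_t qbeg, uint32_t qend)
{
    int i, nhit = 0;
    for (i=0; i<n; i++)
        if ( intervals_overlap(begs[i], ends[i], qbeg, qend) ) nhit++;
    return nhit;
}
```

A linear scan like this is slow but easy to trust, which is what makes it a useful oracle for randomized tests of an index.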
diff --git a/bcftools/test/test-regidx.c.pysam.c b/bcftools/test/test-regidx.c.pysam.c

new file mode 100644 (file)

index 0000000..9eeba3d
--- /dev/null
+++ b/bcftools/test/test-regidx.c.pysam.c
@@ -0,0 +1,376 @@
+#include "bcftools.pysam.h"
+
+/*  test/test-regidx.c -- Regions index test harness.
+
+    gcc -g -Wall -O0 -I. -I../htslib/ -L../htslib regidx.c -o test-regidx test-regidx.c -lhts
+
+    Copyright (C) 2014 Genome Research Ltd.
+
+    Author: Petr Danecek <pd3@sanger.ac.uk>
+
+    Permission is hereby granted, free of charge, to any person obtaining a copy
+    of this software and associated documentation files (the "Software"), to deal
+    in the Software without restriction, including without limitation the rights
+    to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+    copies of the Software, and to permit persons to whom the Software is
+    furnished to do so, subject to the following conditions:
+    
+    The above copyright notice and this permission notice shall be included in
+    all copies or substantial portions of the Software.
+    
+    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+    IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+    FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+    AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+    LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+    OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+    THE SOFTWARE.
+*/
+
+#include <stdarg.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <ctype.h>
+#include <string.h>
+#include <getopt.h>
+#include <htslib/kstring.h>
+#include <time.h>
+#include "regidx.h"
+
+static int verbose = 0;
+
+void debug(const char *format, ...)
+{
+    if ( verbose<2 ) return;
+    va_list ap;
+    va_start(ap, format);
+    vfprintf(bcftools_stderr, format, ap);
+    va_end(ap);
+}
+void info(const char *format, ...)
+{
+    if ( verbose<1 ) return;
+    va_list ap;
+    va_start(ap, format);
+    vfprintf(bcftools_stderr, format, ap);
+    va_end(ap);
+}
+void error(const char *format, ...)
+{
+    va_list ap;
+    va_start(ap, format);
+    vfprintf(bcftools_stderr, format, ap);
+    va_end(ap);
+    exit(-1);
+}
+
+int custom_parse(const char *line, char **chr_beg, char **chr_end, uint32_t *beg, uint32_t *end, void *payload, void *usr)
+{
+    // Use the standard parser for CHROM,FROM,TO
+    int i, ret = regidx_parse_tab(line,chr_beg,chr_end,beg,end,NULL,NULL);
+    if ( ret!=0 ) return ret;
+
+    // Skip the fields that were parsed above
+    char *ss = (char*) line;
+    while ( *ss && isspace(*ss) ) ss++;
+    for (i=0; i<3; i++)
+    {
+        while ( *ss && !isspace(*ss) ) ss++;
+        if ( !*ss ) return -2;  // wrong number of fields
+        while ( *ss && isspace(*ss) ) ss++;
+    }
+    if ( !*ss ) return -2;
+
+    // Parse the payload
+    char *se = ss;
+    while ( *se && !isspace(*se) ) se++;
+    char **dat = (char**) payload;
+    *dat = (char*) malloc(se-ss+1);
+    memcpy(*dat,ss,se-ss+1);
+    (*dat)[se-ss] = 0;
+    return 0;
+}
+void custom_free(void *payload)
+{
+    char **dat = (char**)payload;
+    free(*dat);
+}
+
+void test_sequential_access(void)
+{
+    // Init index with no file name, we will insert the regions manually
+    regidx_t *idx = regidx_init(NULL,custom_parse,custom_free,sizeof(char*),NULL);
+    if ( !idx ) error("init failed\n");
+
+    // Insert regions
+    kstring_t str = {0,0,0};
+    int i, n = 10;
+    for (i=0; i<n; i++)
+    {
+        int beg = 10*(i+1);
+        str.l = 0;
+        ksprintf(&str,"1\t%d\t%d\t%d",beg,beg,beg);
+        if ( regidx_insert(idx,str.s)!=0 ) error("insert failed: %s\n",str.s);
+    }
+
+    // Test
+    regitr_t *itr = regitr_init(idx);
+    i = 0;
+    while ( regitr_loop(itr) )
+    {
+        if ( itr->beg!=itr->end || itr->beg+1!=10*(i+1) ) error("listing failed, expected %d, found %d\n",10*(i+1),itr->beg+1);
+        str.l = 0;
+        ksprintf(&str,"%d",itr->beg+1);
+        if ( strcmp(regitr_payload(itr,char*),str.s) ) error("listing failed, expected payload \"%s\", found \"%s\"\n",str.s,regitr_payload(itr,char*));
+        i++;
+    }
+    if ( i!=n ) error("Expected %d regions, listed %d\n", n,i);
+    debug("ok: listed %d regions\n", n);
+
+    // Clean up
+    regitr_destroy(itr);
+    regidx_destroy(idx);
+    free(str.s);
+}
+
+void test_custom_payload(void)
+{
+    // Init index with no file name, we will insert the regions manually
+    regidx_t *idx = regidx_init(NULL,custom_parse,custom_free,sizeof(char*),NULL);
+    if ( !idx ) error("init failed\n");
+
+    // Insert regions
+    char *line;
+    line = "1 10000000 10000000 1:10000000-10000000"; if ( regidx_insert(idx,line)!=0 ) error("insert failed: %s\n", line);
+    line = "1 20000000 20000001 1:20000000-20000001"; if ( regidx_insert(idx,line)!=0 ) error("insert failed: %s\n", line);
+    line = "1 20000002 20000002 1:20000002-20000002"; if ( regidx_insert(idx,line)!=0 ) error("insert failed: %s\n", line);
+    line = "1 30000000 30000000 1:30000000-30000000"; if ( regidx_insert(idx,line)!=0 ) error("insert failed: %s\n", line);
+
+    // Test 
+    regitr_t *itr = regitr_init(idx);
+    int from, to;
+
+    from = to = 10000000;
+    if ( !regidx_overlap(idx,"1",from-1,to-1,itr) ) error("query failed: 1:%d-%d\n",from,to);
+    if ( strcmp("1:10000000-10000000",regitr_payload(itr,char*)) ) error("query failed: 1:%d-%d vs %s\n", from,to,regitr_payload(itr,char*));
+    if ( !regidx_overlap(idx,"1",from-2,to-1,itr) ) error("query failed: 1:%d-%d\n",from-1,to);
+    if ( !regidx_overlap(idx,"1",from-2,to+3,itr) ) error("query failed: 1:%d-%d\n",from-1,to+2);
+    if ( regidx_overlap(idx,"1",from-2,to-2,itr) ) error("query failed: 1:%d-%d\n",from-1,to-1);
+
+    from = to = 20000000;
+    if ( !regidx_overlap(idx,"1",from-1,to-1,itr) ) error("query failed: 1:%d-%d\n",from,to);
+
+    from = to = 20000002;
+    if ( !regidx_overlap(idx,"1",from-1,to-1,itr) ) error("query failed: 1:%d-%d\n",from,to);
+
+    from = to = 30000000;
+    if ( !regidx_overlap(idx,"1",from-1,to-1,itr) ) error("query failed: 1:%d-%d\n",from,to);
+
+    // Clean up
+    regitr_destroy(itr);
+    regidx_destroy(idx);
+}
+
+void get_random_region(uint32_t min, uint32_t max, uint32_t *beg, uint32_t *end)
+{
+    long int b = random(), e = random();
+    *beg = min + (float)b * (max-min) / RAND_MAX;
+    *end = *beg + (float)e * (max-*beg) / RAND_MAX;
+}
+
+void test_random(int nregs, uint32_t min, uint32_t max)
+{
+    min--;
+    max--;
+
+    // Init index with no file name, we will insert the regions manually
+    regidx_t *idx = regidx_init(NULL,custom_parse,custom_free,sizeof(char*),NULL);
+    if ( !idx ) error("init failed\n");
+
+    // Test region
+    uint32_t beg,end;
+    get_random_region(min,max,&beg,&end);
+
+    // Insert regions
+    int i, nexp = 0;
+    kstring_t str = {0,0,0};
+    for (i=0; i<nregs; i++)
+    {
+        uint32_t b,e;
+        get_random_region(min,max,&b,&e);
+        str.l = 0;
+        ksprintf(&str,"1\t%"PRIu32"\t%"PRIu32"\t1:%"PRIu32"-%"PRIu32"",b+1,e+1,b+1,e+1);
+        if ( regidx_insert(idx,str.s)!=0 ) error("insert failed: %s\n", str.s);
+        if ( e>=beg && b<=end ) nexp++;
+    }
+
+    // Test 
+    regitr_t *itr = regitr_init(idx);
+    int nhit = 0, ret = regidx_overlap(idx,"1",beg,end,itr);
+    if ( nexp && !ret ) error("query failed, expected %d overlap(s), found none: %d-%d\n", nexp,beg+1,end+1);
+    if ( !nexp && ret ) error("query failed, expected no overlaps, found some: %d-%d\n", beg+1,end+1);
+    while ( ret && regitr_overlap(itr) )
+    {
+        str.l = 0;
+        ksprintf(&str,"1:%"PRIu32"-%"PRIu32"",itr->beg+1,itr->end+1);
+        if ( strcmp(str.s,regitr_payload(itr,char*)) )
+            error("query failed, incorrect payload: %s vs %s (%d-%d)\n",str.s,regitr_payload(itr,char*),beg+1,end+1);
+        if ( itr->beg > end || itr->end < beg )
+            error("query failed, incorrect hit: %d-%d vs %d-%d, payload %s\n", beg+1,end+1,itr->beg+1,itr->end+1,regitr_payload(itr,char*));
+        nhit++;
+    }
+    if ( nexp!=nhit ) error("query failed, expected %d overlap(s), found %d: %d-%d\n",nexp,nhit,beg+1,end+1);
+    debug("ok: found %d overlaps\n", nexp);
+
+    // Clean up
+    regitr_destroy(itr);
+    regidx_destroy(idx);
+    free(str.s);
+}
+
+void create_line_bed(char *line, char *chr, int start, int end)
+{
+    sprintf(line,"%s\t%d\t%d\n",chr,start-1,end);
+}
+void create_line_tab(char *line, char *chr, int start, int end)
+{
+    sprintf(line,"%s\t%d\t%d\n",chr,start,end);
+}
+void create_line_reg(char *line, char *chr, int start, int end)
+{
+    sprintf(line,"%s:%d-%d\n",chr,start,end);
+}
+
+typedef void (*set_line_f)(char *line, char *chr, int start, int end);
+
+void test(set_line_f set_line, regidx_parse_f parse)
+{
+    regidx_t *idx = regidx_init(NULL,parse,NULL,0,NULL);
+    if ( !idx ) error("init failed\n");
+
+    char line[250], *chr = "1";
+    int i, n = 10, start, end, nhit;
+    for (i=1; i<n; i++)
+    {
+        start = end = 10*i;
+        set_line(line,chr,start,end);
+        debug("insert: %s", line);
+        if ( regidx_insert(idx,line)!=0 ) error("insert failed: %s\n", line);
+
+        start = end = 10*i + 1;
+        set_line(line,chr,start,end);
+        debug("insert: %s", line);
+        if ( regidx_insert(idx,line)!=0 ) error("insert failed: %s\n", line);
+    }
+
+    regitr_t *itr = regitr_init(idx);
+    for (i=1; i<n; i++)
+    {
+        // no hit
+        start = end = 10*i - 1;
+        if ( regidx_overlap(idx,chr,start-1,end-1,itr) ) error("query failed, there should be no hit: %s:%d-%d\n",chr,start,end);
+        debug("ok: no overlap found for %s:%d-%d\n",chr,start,end);
+
+
+        // one hit
+        start = end = 10*i;
+        if ( !regidx_overlap(idx,chr,start-1,end-1,itr) ) error("query failed, there should be a hit: %s:%d-%d\n",chr,start,end);
+        debug("ok: overlap(s) found for %s:%d-%d\n",chr,start,end);
+        nhit = 0;
+        while ( regitr_overlap(itr) )
+        {
+            if ( itr->beg > end-1 || itr->end < start-1 ) error("query failed, incorrect region: %d-%d for %d-%d\n",itr->beg+1,itr->end+1,start,end);
+            debug("\t %d-%d\n",itr->beg+1,itr->end+1);
+            nhit++;
+        }
+        if ( nhit!=1 ) error("query failed, expected one hit, found %d: %s:%d-%d\n",nhit,chr,start,end);
+
+
+        // one hit
+        start = end = 10*i+1;
+        if ( !regidx_overlap(idx,chr,start-1,end-1,itr) ) error("query failed, there should be a hit: %s:%d-%d\n",chr,start,end);
+        debug("ok: overlap(s) found for %s:%d-%d\n",chr,start,end);
+        nhit = 0;
+        while ( regitr_overlap(itr) )
+        {
+            if ( itr->beg > end-1 || itr->end < start-1 ) error("query failed, incorrect region: %d-%d for %d-%d\n",itr->beg+1,itr->end+1,start,end);
+            debug("\t %d-%d\n",itr->beg+1,itr->end+1);
+            nhit++;
+        }
+        if ( nhit!=1 ) error("query failed, expected one hit, found %d: %s:%d-%d\n",nhit,chr,start,end);
+
+
+        // two hits
+        start = 10*i; end = start+1;
+        if ( !regidx_overlap(idx,chr,start-1,end-1,itr) ) error("query failed, there should be a hit: %s:%d-%d\n",chr,start,end);
+        debug("ok: overlap(s) found for %s:%d-%d\n",chr,start,end);
+        nhit = 0;
+        while ( regitr_overlap(itr) )
+        {
+            if ( itr->beg > end-1 || itr->end < start-1 ) error("query failed, incorrect region: %d-%d for %d-%d\n",itr->beg+1,itr->end+1,start,end);
+            debug("\t %d-%d\n",itr->beg+1,itr->end+1);
+            nhit++;
+        }
+        if ( nhit!=2 ) error("query failed, expected two hits, found %d: %s:%d-%d\n",nhit,chr,start,end);
+
+    }
+    regitr_destroy(itr);
+    regidx_destroy(idx);
+}
+
+static void usage(void)
+{
+    fprintf(bcftools_stderr, "Usage: test-regidx [OPTIONS]\n");
+    fprintf(bcftools_stderr, "Options:\n");
+    fprintf(bcftools_stderr, "   -h, --help          this help message\n");
+    fprintf(bcftools_stderr, "   -s, --seed <int>    random seed\n");
+    fprintf(bcftools_stderr, "   -v, --verbose       increase verbosity by giving multiple times\n");
+
+    exit(1);
+}
+
+int bcftools_test_regidx_main(int argc, char **argv)
+{
+    static struct option loptions[] =
+    {
+        {"help",0,0,'h'},
+        {"verbose",0,0,'v'},
+        {"seed",1,0,'s'},
+        {0,0,0,0}
+    };
+    int c;
+    int seed = (int)time(NULL);
+    while ((c = getopt_long(argc, argv, "hvs:",loptions,NULL)) >= 0) 
+    {
+        switch (c)
+        {
+            case 's': seed = atoi(optarg); break;
+            case 'v': verbose++; break;
+            default: usage(); break;
+        }
+    }
+
+    info("Testing sequential access\n");
+    test_sequential_access();
+
+    info("Testing TAB\n");
+    test(create_line_tab,regidx_parse_tab);
+
+    info("Testing REG\n");
+    test(create_line_reg,regidx_parse_reg);
+
+    info("Testing BED\n");
+    test(create_line_bed,regidx_parse_bed);
+
+    info("Testing custom payload\n");
+    test_custom_payload();
+
+    int i, ntest = 1000, nreg = 50;
+    srandom(seed);
+    info("%d randomized tests, %d regions per test. Random seed is %d\n", ntest,nreg,seed);
+    for (i=0; i<ntest; i++) test_random(nreg,1,1000);
+
+    return 0;
+}
+
+
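Reviewer note on the coordinate conventions exercised by the test above: `create_line_bed` writes `start-1`, while `create_line_tab` and `create_line_reg` write `start` unchanged, and every query is issued as `start-1,end-1`. That suggests BED input is 0-based half-open, TAB/REG input is 1-based inclusive, and the index is queried in 0-based inclusive coordinates. The sketch below only illustrates that arithmetic. `closed_overlap` and `format_bed` are hypothetical helpers and are not part of `regidx.h`.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper (not in regidx.h): closed-interval overlap on
   0-based inclusive coordinates. This is the same predicate that
   test_random() uses when counting expected hits: e>=beg && b<=end. */
static int closed_overlap(uint32_t b1, uint32_t e1, uint32_t b2, uint32_t e2)
{
    return e1 >= b2 && b1 <= e2;
}

/* Hypothetical helper: format a 1-based inclusive region as a BED line.
   BED is 0-based half-open, so only the start shifts by one; this
   mirrors create_line_bed() but uses a bounded snprintf. */
static int format_bed(char *buf, size_t n, const char *chr, int start, int end)
{
    assert(start >= 1 && end >= start);
    return snprintf(buf, n, "%s\t%d\t%d", chr, start - 1, end);
}
```

Under these assumptions, the 1-based single-base region 1:10-10 becomes the BED line `1<TAB>9<TAB>10` and is queried as the closed interval 9..9.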
diff --git a/bcftools/vcfannotate.c b/bcftools/vcfannotate.c
index abea98f9099cd7ea69a9a8dd6696d6133837c0dc..abb0d59a2edbb2d24272cf7d605dea0f225a7073 100644
--- a/bcftools/vcfannotate.c
+++ b/bcftools/vcfannotate.c
@@ -114,6 +114,7 @@ typedef struct _args_t
  
      int nsmpl_annot;
      int *sample_map, nsample_map, sample_is_file;   // map[idst] -> isrc
+    uint8_t *src_smpl_pld, *dst_smpl_pld;   // for Number=G format fields
      int mtmpi, mtmpf, mtmps;
      int mtmpi2, mtmpf2, mtmps2;
      int mtmpi3, mtmpf3, mtmps3;
@@ -462,6 +463,34 @@ static int vcf_setter_id(args_t *args, bcf1_t *line, annot_col_t *col, void *dat
          return bcf_update_id(args->hdr_out,line,rec->d.id);
      return 0;
  }
+static int vcf_setter_ref(args_t *args, bcf1_t *line, annot_col_t *col, void *data)
+{
+    bcf1_t *rec = (bcf1_t*) data;
+    if ( !strcmp(rec->d.allele[0],line->d.allele[0]) ) return 0;    // no update necessary
+    const char **als = (const char**) malloc(sizeof(char*)*line->n_allele);
+    als[0] = rec->d.allele[0];
+    int i;
+    for (i=1; i<line->n_allele; i++) als[i] = line->d.allele[i];
+    bcf_update_alleles(args->hdr_out, line, als, line->n_allele);
+    free(als);
+    return 0;
+}
+static int vcf_setter_alt(args_t *args, bcf1_t *line, annot_col_t *col, void *data)
+{
+    bcf1_t *rec = (bcf1_t*) data;
+    int i;
+    if ( rec->n_allele==line->n_allele )
+    {
+        for (i=1; i<rec->n_allele; i++) if ( strcmp(rec->d.allele[i],line->d.allele[i]) ) break;
+        if ( i==rec->n_allele ) return 0;   // no update necessary
+    }
+    const char **als = (const char**) malloc(sizeof(char*)*rec->n_allele);
+    als[0] = line->d.allele[0];
+    for (i=1; i<rec->n_allele; i++) als[i] = rec->d.allele[i];
+    bcf_update_alleles(args->hdr_out, line, als, rec->n_allele);
+    free(als);
+    return 0;
+}
  static int setter_qual(args_t *args, bcf1_t *line, annot_col_t *col, void *data)
  {
      annot_line_t *tab = (annot_line_t*) data;
@@ -491,7 +520,7 @@ static int setter_info_flag(args_t *args, bcf1_t *line, annot_col_t *col, void *
  
      if ( str[0]=='1' && str[1]==0 ) return bcf_update_info_flag(args->hdr_out,line,col->hdr_key_dst,NULL,1);
      if ( str[0]=='0' && str[1]==0 ) return bcf_update_info_flag(args->hdr_out,line,col->hdr_key_dst,NULL,0);
-    error("Could not parse %s at %s:%d .. [%s]\n", bcf_seqname(args->hdr,line),line->pos+1,tab->cols[col->icol]);
+    error("Could not parse %s at %s:%d .. [%s]\n", col->hdr_key_src,bcf_seqname(args->hdr,line),line->pos+1,tab->cols[col->icol]);
      return -1;
  }
  static int vcf_setter_info_flag(args_t *args, bcf1_t *line, annot_col_t *col, void *data)
@@ -510,7 +539,7 @@ static int setter_ARinfo_int32(args_t *args, bcf1_t *line, annot_col_t *col, int
  
      int ndst = col->number==BCF_VL_A ? line->n_allele - 1 : line->n_allele;
      int *map = vcmp_map_ARvalues(args->vcmp,ndst,nals,als,line->n_allele,line->d.allele);
-    if ( !map ) error("REF alleles not compatible at %s:%d\n");
+    if ( !map ) error("REF alleles not compatible at %s:%d\n", bcf_seqname(args->hdr,line),line->pos+1);
  
      // fill in any missing values in the target VCF (or all, if not present)
      int ntmpi2 = bcf_get_info_float(args->hdr, line, col->hdr_key_dst, &args->tmpi2, &args->mtmpi2);
@@ -544,7 +573,7 @@ static int setter_info_int(args_t *args, bcf1_t *line, annot_col_t *col, void *d
      {
          int val = strtol(str, &end, 10); 
          if ( end==str )
-            error("Could not parse %s at %s:%d .. [%s]\n", bcf_seqname(args->hdr,line),line->pos+1,tab->cols[col->icol]);
+            error("Could not parse %s at %s:%d .. [%s]\n", col->hdr_key_src,bcf_seqname(args->hdr,line),line->pos+1,tab->cols[col->icol]);
          ntmpi++;
          hts_expand(int32_t,ntmpi,args->mtmpi,args->tmpi);
          args->tmpi[ntmpi-1] = val;
@@ -590,7 +619,7 @@ static int setter_ARinfo_real(args_t *args, bcf1_t *line, annot_col_t *col, int
  
      int ndst = col->number==BCF_VL_A ? line->n_allele - 1 : line->n_allele;
      int *map = vcmp_map_ARvalues(args->vcmp,ndst,nals,als,line->n_allele,line->d.allele);
-    if ( !map ) error("REF alleles not compatible at %s:%d\n");
+    if ( !map ) error("REF alleles not compatible at %s:%d\n", bcf_seqname(args->hdr,line),line->pos+1);
  
      // fill in any missing values in the target VCF (or all, if not present)
      int ntmpf2 = bcf_get_info_float(args->hdr, line, col->hdr_key_dst, &args->tmpf2, &args->mtmpf2);
@@ -624,7 +653,7 @@ static int setter_info_real(args_t *args, bcf1_t *line, annot_col_t *col, void *
      {
          double val = strtod(str, &end);
          if ( end==str )
-            error("Could not parse %s at %s:%d .. [%s]\n", bcf_seqname(args->hdr,line),line->pos+1,tab->cols[col->icol]);
+            error("Could not parse %s at %s:%d .. [%s]\n", col->hdr_key_src,bcf_seqname(args->hdr,line),line->pos+1,tab->cols[col->icol]);
          ntmpf++;
          hts_expand(float,ntmpf,args->mtmpf,args->tmpf);
          args->tmpf[ntmpf-1] = val;
@@ -677,7 +706,7 @@ static int setter_ARinfo_string(args_t *args, bcf1_t *line, annot_col_t *col, in
  
      int ndst = col->number==BCF_VL_A ? line->n_allele - 1 : line->n_allele;
      int *map = vcmp_map_ARvalues(args->vcmp,ndst,nals,als,line->n_allele,line->d.allele);
-    if ( !map ) error("REF alleles not compatible at %s:%d\n");
+    if ( !map ) error("REF alleles not compatible at %s:%d\n", bcf_seqname(args->hdr,line),line->pos+1);
  
      // fill in any missing values in the target VCF (or all, if not present)
      int i, empty = 0, nstr, mstr = args->tmpks.m;
@@ -1113,21 +1142,261 @@ static int setter_format_str(args_t *args, bcf1_t *line, annot_col_t *col, void
  
      return core_setter_format_str(args,line,col,args->tmpp);
  }
+static int determine_ploidy(int nals, int *vals, int nvals1, uint8_t *smpl, int nsmpl)
+{
+    int i, j, ndip = nals*(nals+1)/2, max_ploidy = 0;
+    for (i=0; i<nsmpl; i++)
+    {
+        int *ptr = vals + i*nvals1;
+        int has_value = 0;
+        for (j=0; j<nvals1; j++)
+        {
+            if ( ptr[j]==bcf_int32_vector_end ) break;
+            if ( ptr[j]!=bcf_int32_missing ) has_value = 1;
+        }
+        if ( has_value )
+        {
+            if ( j==ndip )
+            { 
+                smpl[i] = 2;
+                max_ploidy = 2; 
+            }
+            else if ( j==nals )
+            { 
+                smpl[i] = 1;
+                if ( !max_ploidy ) max_ploidy = 1;
+            }
+            else return -j;
+        }
+        else smpl[i] = 0;
+    }
+    return max_ploidy;
+}
  static int vcf_setter_format_int(args_t *args, bcf1_t *line, annot_col_t *col, void *data)
  {
      bcf1_t *rec = (bcf1_t*) data;
      int nsrc = bcf_get_format_int32(args->files->readers[1].header,rec,col->hdr_key_src,&args->tmpi,&args->mtmpi);
      if ( nsrc==-3 ) return 0;    // the tag is not present
      if ( nsrc<=0 ) return 1;     // error
-    return core_setter_format_int(args,line,col,args->tmpi,nsrc/bcf_hdr_nsamples(args->files->readers[1].header));
+    int nsmpl_src = bcf_hdr_nsamples(args->files->readers[1].header);
+    int nsrc1 = nsrc / nsmpl_src;
+    if ( col->number!=BCF_VL_G && col->number!=BCF_VL_R && col->number!=BCF_VL_A )
+        return core_setter_format_int(args,line,col,args->tmpi,nsrc1);
+
+    // create mapping from src to dst genotypes, haploid and diploid version
+    int nmap_hap = col->number==BCF_VL_G || col->number==BCF_VL_R ? rec->n_allele : rec->n_allele - 1;
+    int *map_hap = vcmp_map_ARvalues(args->vcmp,nmap_hap,line->n_allele,line->d.allele,rec->n_allele,rec->d.allele);
+    if ( !map_hap ) error("REF alleles not compatible at %s:%d\n", bcf_seqname(args->hdr,line),line->pos+1);
+
+    int i, j;
+    if ( rec->n_allele==line->n_allele )
+    {
+        // alleles unchanged?
+        for (i=0; i<rec->n_allele; i++) if ( map_hap[i]!=i ) break;
+        if ( i==rec->n_allele )
+            return core_setter_format_int(args,line,col,args->tmpi,nsrc1);
+    }
+
+    int nsmpl_dst = rec->n_sample;
+    int ndst  = bcf_get_format_int32(args->hdr,line,col->hdr_key_dst,&args->tmpi2,&args->mtmpi2);
+    int ndst1 = ndst / nsmpl_dst;
+    if ( ndst <= 0 )
+    {
+        if ( col->replace==REPLACE_NON_MISSING ) return 0;  // overwrite only if present
+        if ( col->number==BCF_VL_G )
+            ndst1 = line->n_allele*(line->n_allele+1)/2;
+        else
+            ndst1 = col->number==BCF_VL_A ? line->n_allele - 1 : line->n_allele;
+        hts_expand(int, ndst1*nsmpl_dst, args->mtmpi2, args->tmpi2);
+        for (i=0; i<nsmpl_dst; i++)
+        {
+            int32_t *dst = args->tmpi2 + i*ndst1;
+            for (j=0; j<ndst1; j++) dst[j] = bcf_int32_missing;
+        }
+    }
+
+    int nmap_dip = 0, *map_dip = NULL;
+    if ( col->number==BCF_VL_G )
+    {
+        map_dip = vcmp_map_dipGvalues(args->vcmp, &nmap_dip);
+        if ( !args->src_smpl_pld )
+        {
+            args->src_smpl_pld = (uint8_t*) malloc(nsmpl_src);
+            args->dst_smpl_pld = (uint8_t*) malloc(nsmpl_dst);
+        }
+        int pld_src = determine_ploidy(rec->n_allele, args->tmpi, nsrc1, args->src_smpl_pld, nsmpl_src);
+        if ( pld_src<0 )
+            error("Unexpected number of %s values (%d) for %d alleles at %s:%d\n", col->hdr_key_src,-pld_src, rec->n_allele, bcf_seqname(bcf_sr_get_header(args->files,1),rec),rec->pos+1);
+        int pld_dst = determine_ploidy(line->n_allele, args->tmpi2, ndst1, args->dst_smpl_pld, nsmpl_dst);
+        if ( pld_dst<0 )
+            error("Unexpected number of %s values (%d) for %d alleles at %s:%d\n", col->hdr_key_src,-pld_dst, line->n_allele, bcf_seqname(args->hdr,line),line->pos+1);
+
+        int ndst1_new = pld_dst==1 ? line->n_allele : line->n_allele*(line->n_allele+1)/2;
+        if ( ndst1_new != ndst1 )
+        {
+            if ( ndst1 ) error("todo: %s ndst1!=ndst .. %d %d  at %s:%d\n",col->hdr_key_src,ndst1_new,ndst1,bcf_seqname(args->hdr,line),line->pos+1);
+            ndst1 = ndst1_new;
+            hts_expand(int32_t, ndst1*nsmpl_dst, args->mtmpi2, args->tmpi2);
+        }
+    }
+    else if ( !ndst1 )
+    {
+        ndst1 = col->number==BCF_VL_A ? line->n_allele - 1 : line->n_allele;
+        hts_expand(int32_t, ndst1*nsmpl_dst, args->mtmpi2, args->tmpi2);
+    }
+
+    for (i=0; i<nsmpl_dst; i++)
+    {
+        int ii = args->sample_map ? args->sample_map[i] : i;
+        int32_t *ptr_src = args->tmpi + i*nsrc1;
+        int32_t *ptr_dst = args->tmpi2 + ii*ndst1;
+
+        if ( col->number==BCF_VL_G )
+        {
+            if ( args->src_smpl_pld[ii] > 0 && args->dst_smpl_pld[i] > 0 && args->src_smpl_pld[ii]!=args->dst_smpl_pld[i] )
+                error("Sample ploidy differs at %s:%d\n", bcf_seqname(args->hdr,line),line->pos+1);
+            if ( !args->dst_smpl_pld[i] )
+                for (j=0; j<ndst1; j++) ptr_dst[j] = bcf_int32_missing;
+        }
+        if ( col->number!=BCF_VL_G || args->src_smpl_pld[i]==1 )
+        {
+            for (j=0; j<nmap_hap; j++)
+            {
+                int k = map_hap[j];
+                if ( k>=0 ) ptr_dst[k] = ptr_src[j];
+            }
+            if ( col->number==BCF_VL_G )
+                for (j=line->n_allele; j<ndst1; j++) ptr_dst[j] = bcf_int32_vector_end;
+        }
+        else
+        {
+            for (j=0; j<nmap_dip; j++)
+            {
+                int k = map_dip[j];
+                if ( k>=0 ) ptr_dst[k] = ptr_src[j];
+            }
+        }
+    }
+    return bcf_update_format_int32(args->hdr_out,line,col->hdr_key_dst,args->tmpi2,nsmpl_dst*ndst1);
  }
  static int vcf_setter_format_real(args_t *args, bcf1_t *line, annot_col_t *col, void *data)
  {
+
      bcf1_t *rec = (bcf1_t*) data;
      int nsrc = bcf_get_format_float(args->files->readers[1].header,rec,col->hdr_key_src,&args->tmpf,&args->mtmpf);
      if ( nsrc==-3 ) return 0;    // the tag is not present
      if ( nsrc<=0 ) return 1;     // error
-    return core_setter_format_real(args,line,col,args->tmpf,nsrc/bcf_hdr_nsamples(args->files->readers[1].header));
+    int nsmpl_src = bcf_hdr_nsamples(args->files->readers[1].header);
+    int nsrc1 = nsrc / nsmpl_src;
+    if ( col->number!=BCF_VL_G && col->number!=BCF_VL_R && col->number!=BCF_VL_A )
+        return core_setter_format_real(args,line,col,args->tmpf,nsrc1);
+
+    // create mapping from src to dst genotypes, haploid and diploid version
+    int nmap_hap = col->number==BCF_VL_G || col->number==BCF_VL_R ? rec->n_allele : rec->n_allele - 1;
+    int *map_hap = vcmp_map_ARvalues(args->vcmp,nmap_hap,line->n_allele,line->d.allele,rec->n_allele,rec->d.allele);
+    if ( !map_hap ) error("REF alleles not compatible at %s:%d\n", bcf_seqname(args->hdr,line),line->pos+1);
+
+    int i, j;
+    if ( rec->n_allele==line->n_allele )
+    {
+        // alleles unchanged?
+        for (i=0; i<rec->n_allele; i++) if ( map_hap[i]!=i ) break;
+        if ( i==rec->n_allele )
+            return core_setter_format_real(args,line,col,args->tmpf,nsrc1);
+    }
+
+    int nsmpl_dst = rec->n_sample;
+    int ndst  = bcf_get_format_float(args->hdr,line,col->hdr_key_dst,&args->tmpf2,&args->mtmpf2);
+    int ndst1 = ndst / nsmpl_dst;
+    if ( ndst <= 0 )
+    {
+        if ( col->replace==REPLACE_NON_MISSING ) return 0;  // overwrite only if present
+        if ( col->number==BCF_VL_G )
+            ndst1 = line->n_allele*(line->n_allele+1)/2;
+        else
+            ndst1 = col->number==BCF_VL_A ? line->n_allele - 1 : line->n_allele;
+        hts_expand(float, ndst1*nsmpl_dst, args->mtmpf2, args->tmpf2);
+        for (i=0; i<nsmpl_dst; i++)
+        {
+            float *dst = args->tmpf2 + i*ndst1;
+            for (j=0; j<ndst1; j++) bcf_float_set_missing(dst[j]);
+        }
+    }
+
+    int nmap_dip = 0, *map_dip = NULL;
+    if ( col->number==BCF_VL_G )
+    {
+        map_dip = vcmp_map_dipGvalues(args->vcmp, &nmap_dip);
+        if ( !args->src_smpl_pld )
+        {
+            args->src_smpl_pld = (uint8_t*) malloc(nsmpl_src);
+            args->dst_smpl_pld = (uint8_t*) malloc(nsmpl_dst);
+        }
+        int pld_src = determine_ploidy(rec->n_allele, args->tmpi, nsrc1, args->src_smpl_pld, nsmpl_src);
+        if ( pld_src<0 )
+            error("Unexpected number of %s values (%d) for %d alleles at %s:%d\n", col->hdr_key_src,-pld_src, rec->n_allele, bcf_seqname(bcf_sr_get_header(args->files,1),rec),rec->pos+1);
+        int pld_dst = determine_ploidy(line->n_allele, args->tmpi2, ndst1, args->dst_smpl_pld, nsmpl_dst);
+        if ( pld_dst<0 )
+            error("Unexpected number of %s values (%d) for %d alleles at %s:%d\n", col->hdr_key_src,-pld_dst, line->n_allele, bcf_seqname(args->hdr,line),line->pos+1);
+
+        int ndst1_new = pld_dst==1 ? line->n_allele : line->n_allele*(line->n_allele+1)/2;
+        if ( ndst1_new != ndst1 )
+        {
+            if ( ndst1 ) error("todo: %s ndst1!=ndst .. %d %d  at %s:%d\n",col->hdr_key_src,ndst1_new,ndst1,bcf_seqname(args->hdr,line),line->pos+1);
+            ndst1 = ndst1_new;
+            hts_expand(float, ndst1*nsmpl_dst, args->mtmpf2, args->tmpf2);
+        }
+    }
+    else if ( !ndst1 )
+    {
+        ndst1 = col->number==BCF_VL_A ? line->n_allele - 1 : line->n_allele;
+        hts_expand(float, ndst1*nsmpl_dst, args->mtmpf2, args->tmpf2);
+    }
+
+    for (i=0; i<nsmpl_dst; i++)
+    {
+        int ii = args->sample_map ? args->sample_map[i] : i;
+        float *ptr_src = args->tmpf + i*nsrc1;
+        float *ptr_dst = args->tmpf2 + ii*ndst1;
+
+        if ( col->number==BCF_VL_G )
+        {
+            if ( args->src_smpl_pld[ii] > 0 && args->dst_smpl_pld[i] > 0 && args->src_smpl_pld[ii]!=args->dst_smpl_pld[i] )
+                error("Sample ploidy differs at %s:%d\n", bcf_seqname(args->hdr,line),line->pos+1);
+            if ( !args->dst_smpl_pld[i] )
+                for (j=0; j<ndst1; j++) bcf_float_set_missing(ptr_dst[j]);
+        }
+        if ( col->number!=BCF_VL_G || args->src_smpl_pld[i]==1 )
+        {
+            for (j=0; j<nmap_hap; j++)
+            {
+                int k = map_hap[j];
+                if ( k>=0 )
+                {
+                    if ( bcf_float_is_missing(ptr_src[j]) ) bcf_float_set_missing(ptr_dst[k]);
+                    else if ( bcf_float_is_vector_end(ptr_src[j]) ) bcf_float_set_vector_end(ptr_dst[k]); 
+                    else ptr_dst[k] = ptr_src[j];
+                }
+            }
+            if ( col->number==BCF_VL_G )
+                for (j=line->n_allele; j<ndst1; j++) bcf_float_set_vector_end(ptr_dst[j]);
+        }
+        else
+        {
+            for (j=0; j<nmap_dip; j++)
+            {
+                int k = map_dip[j];
+                if ( k>=0 )
+                {
+                    if ( bcf_float_is_missing(ptr_src[j]) ) bcf_float_set_missing(ptr_dst[k]);
+                    else if ( bcf_float_is_vector_end(ptr_src[j]) ) bcf_float_set_vector_end(ptr_dst[k]);
+                    else ptr_dst[k] = ptr_src[j];
+                }
+            }
+        }
+    }
+    return bcf_update_format_float(args->hdr_out,line,col->hdr_key_dst,args->tmpf2,nsmpl_dst*ndst1);
+
  }
  
  static int vcf_setter_format_str(args_t *args, bcf1_t *line, annot_col_t *col, void *data)
@@ -1339,8 +1608,30 @@ static void init_columns(args_t *args)
          else if ( !strcasecmp("POS",str.s) ) args->from_idx = icol;
          else if ( !strcasecmp("FROM",str.s) ) args->from_idx = icol;
          else if ( !strcasecmp("TO",str.s) ) args->to_idx = icol;
-        else if ( !strcasecmp("REF",str.s) ) args->ref_idx = icol;
-        else if ( !strcasecmp("ALT",str.s) ) args->alt_idx = icol;
+        else if ( !strcasecmp("REF",str.s) )
+        {
+            if ( args->tgts_is_vcf )
+            {
+                args->ncols++; args->cols = (annot_col_t*) realloc(args->cols,sizeof(annot_col_t)*args->ncols);
+                annot_col_t *col = &args->cols[args->ncols-1];
+                col->setter = vcf_setter_ref;
+                col->hdr_key_src = strdup(str.s);
+                col->hdr_key_dst = strdup(str.s);
+            }
+            else args->ref_idx = icol;
+        }
+        else if ( !strcasecmp("ALT",str.s) )
+        {
+            if ( args->tgts_is_vcf )
+            {
+                args->ncols++; args->cols = (annot_col_t*) realloc(args->cols,sizeof(annot_col_t)*args->ncols);
+                annot_col_t *col = &args->cols[args->ncols-1];
+                col->setter = vcf_setter_alt;
+                col->hdr_key_src = strdup(str.s);
+                col->hdr_key_dst = strdup(str.s);
+            }
+            else args->alt_idx = icol;
+        }
          else if ( !strcasecmp("ID",str.s) )
          {
              if ( replace==REPLACE_NON_MISSING ) error("Apologies, the -ID feature has not been implemented yet.\n");
@@ -1458,6 +1749,8 @@ static void init_columns(args_t *args)
                          case BCF_HT_STR:    col->setter = vcf_setter_format_str; has_fmt_str = 1; break;
                          default: error("The type of %s not recognised (%d)\n", str.s,bcf_hdr_id2type(args->hdr_out,BCF_HL_FMT,hdr_id));
                      }
+                hdr_id = bcf_hdr_id2int(tgts_hdr, BCF_DT_ID, hrec->vals[k]);
+                col->number = bcf_hdr_id2length(tgts_hdr,BCF_HL_FMT,hdr_id);
              }
          }
          else if ( !strncasecmp("FORMAT/",str.s, 7) || !strncasecmp("FMT/",str.s,4) )
@@ -1477,6 +1770,7 @@ static void init_columns(args_t *args)
              if ( args->tgts_is_vcf )
              {
                  bcf_hrec_t *hrec = bcf_hdr_get_hrec(args->files->readers[1].header, BCF_HL_FMT, "ID", key_src, NULL);
+                if ( !hrec ) error("No such annotation \"%s\" in %s\n", key_src,args->targets_fname);
                  tmp.l = 0;
                  bcf_hrec_format_rename(hrec, key_dst, &tmp);
                  bcf_hdr_append(args->hdr_out, tmp.s);
@@ -1506,6 +1800,12 @@ static void init_columns(args_t *args)
                      case BCF_HT_STR:    col->setter = args->tgts_is_vcf ? vcf_setter_format_str  : setter_format_str; has_fmt_str = 1; break;
                      default: error("The type of %s not recognised (%d)\n", str.s,bcf_hdr_id2type(args->hdr_out,BCF_HL_FMT,hdr_id));
                  }
+            if ( args->tgts_is_vcf )
+            {
+                bcf_hdr_t *tgts_hdr = args->files->readers[1].header;
+                hdr_id = bcf_hdr_id2int(tgts_hdr, BCF_DT_ID, col->hdr_key_src);
+                col->number = bcf_hdr_id2length(tgts_hdr,BCF_HL_FMT,hdr_id);
+            }
          }
          else
          {
@@ -1697,6 +1997,8 @@ static void destroy_data(args_t *args)
      free(args->tmpp2);
      free(args->tmpi3);
      free(args->tmpf3);
+    free(args->src_smpl_pld);
+    free(args->dst_smpl_pld);
      if ( args->set_ids )
          convert_destroy(args->set_ids);
      if ( args->filter )
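Reviewer note on the `Number=G` handling added in this file: `determine_ploidy` decides per sample whether the values before the vector-end marker match the diploid layout (`nals*(nals+1)/2` values) or the haploid layout (`nals` values). The sketch below isolates that counting rule. `n_diploid_gt` and `ploidy_from_count` are hypothetical names used for illustration only. The sketch leaves out the missing-value handling and the `-j` error return of the real function.

```c
#include <assert.h>

/* Hypothetical helper: number of Number=G values for a diploid sample,
   i.e. unordered allele pairs with repetition: n*(n+1)/2. */
static int n_diploid_gt(int nals)
{
    assert(nals > 0);
    return nals * (nals + 1) / 2;
}

/* Hypothetical helper mirroring the per-sample decision in
   determine_ploidy(): given how many values precede the vector-end
   marker, return 2 (diploid), 1 (haploid), or -1 when the count
   matches neither layout. The diploid test comes first, as in the patch. */
static int ploidy_from_count(int nals, int nvals)
{
    if (nvals == n_diploid_gt(nals)) return 2;
    if (nvals == nals) return 1;
    return -1;
}
```

With a single allele both layouts have exactly one value, so the count alone cannot tell them apart. Because the diploid check runs first, such a sample is reported as diploid, which appears to match the order of the checks in the patch.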
diff --git a/bcftools/vcfannotate.c.pysam.c b/bcftools/vcfannotate.c.pysam.c
index 87a6cc4efd81a69396030428697a782593606cc9..b4993eb546120581fc941234891c09e2f63d715e 100644
--- a/bcftools/vcfannotate.c.pysam.c
+++ b/bcftools/vcfannotate.c.pysam.c
@@ -116,6 +116,7 @@ typedef struct _args_t
  
      int nsmpl_annot;
      int *sample_map, nsample_map, sample_is_file;   // map[idst] -> isrc
+    uint8_t *src_smpl_pld, *dst_smpl_pld;   // for Number=G format fields
      int mtmpi, mtmpf, mtmps;
      int mtmpi2, mtmpf2, mtmps2;
      int mtmpi3, mtmpf3, mtmps3;
@@ -464,6 +465,34 @@ static int vcf_setter_id(args_t *args, bcf1_t *line, annot_col_t *col, void *dat
          return bcf_update_id(args->hdr_out,line,rec->d.id);
      return 0;
  }
+static int vcf_setter_ref(args_t *args, bcf1_t *line, annot_col_t *col, void *data)
+{
+    bcf1_t *rec = (bcf1_t*) data;
+    if ( !strcmp(rec->d.allele[0],line->d.allele[0]) ) return 0;    // no update necessary
+    const char **als = (const char**) malloc(sizeof(char*)*line->n_allele);
+    als[0] = rec->d.allele[0];
+    int i;
+    for (i=1; i<line->n_allele; i++) als[i] = line->d.allele[i];
+    bcf_update_alleles(args->hdr_out, line, als, line->n_allele);
+    free(als);
+    return 0;
+}
+static int vcf_setter_alt(args_t *args, bcf1_t *line, annot_col_t *col, void *data)
+{
+    bcf1_t *rec = (bcf1_t*) data;
+    int i;
+    if ( rec->n_allele==line->n_allele )
+    {
+        for (i=1; i<rec->n_allele; i++) if ( strcmp(rec->d.allele[i],line->d.allele[i]) ) break;
+        if ( i==rec->n_allele ) return 0;   // no update necessary
+    }
+    const char **als = (const char**) malloc(sizeof(char*)*rec->n_allele);
+    als[0] = line->d.allele[0];
+    for (i=1; i<rec->n_allele; i++) als[i] = rec->d.allele[i];
+    bcf_update_alleles(args->hdr_out, line, als, rec->n_allele);
+    free(als);
+    return 0;
+}
  static int setter_qual(args_t *args, bcf1_t *line, annot_col_t *col, void *data)
  {
      annot_line_t *tab = (annot_line_t*) data;
@@ -493,7 +522,7 @@ static int setter_info_flag(args_t *args, bcf1_t *line, annot_col_t *col, void *
  
      if ( str[0]=='1' && str[1]==0 ) return bcf_update_info_flag(args->hdr_out,line,col->hdr_key_dst,NULL,1);
      if ( str[0]=='0' && str[1]==0 ) return bcf_update_info_flag(args->hdr_out,line,col->hdr_key_dst,NULL,0);
-    error("Could not parse %s at %s:%d .. [%s]\n", bcf_seqname(args->hdr,line),line->pos+1,tab->cols[col->icol]);
+    error("Could not parse %s at %s:%d .. [%s]\n", col->hdr_key_src,bcf_seqname(args->hdr,line),line->pos+1,tab->cols[col->icol]);
      return -1;
  }
  static int vcf_setter_info_flag(args_t *args, bcf1_t *line, annot_col_t *col, void *data)
@@ -512,7 +541,7 @@ static int setter_ARinfo_int32(args_t *args, bcf1_t *line, annot_col_t *col, int
  
      int ndst = col->number==BCF_VL_A ? line->n_allele - 1 : line->n_allele;
      int *map = vcmp_map_ARvalues(args->vcmp,ndst,nals,als,line->n_allele,line->d.allele);
-    if ( !map ) error("REF alleles not compatible at %s:%d\n");
+    if ( !map ) error("REF alleles not compatible at %s:%d\n", bcf_seqname(args->hdr,line),line->pos+1);
  
      // fill in any missing values in the target VCF (or all, if not present)
      int ntmpi2 = bcf_get_info_float(args->hdr, line, col->hdr_key_dst, &args->tmpi2, &args->mtmpi2);
@@ -546,7 +575,7 @@ static int setter_info_int(args_t *args, bcf1_t *line, annot_col_t *col, void *d
      {
          int val = strtol(str, &end, 10); 
          if ( end==str )
-            error("Could not parse %s at %s:%d .. [%s]\n", bcf_seqname(args->hdr,line),line->pos+1,tab->cols[col->icol]);
+            error("Could not parse %s at %s:%d .. [%s]\n", col->hdr_key_src,bcf_seqname(args->hdr,line),line->pos+1,tab->cols[col->icol]);
          ntmpi++;
          hts_expand(int32_t,ntmpi,args->mtmpi,args->tmpi);
          args->tmpi[ntmpi-1] = val;
@@ -592,7 +621,7 @@ static int setter_ARinfo_real(args_t *args, bcf1_t *line, annot_col_t *col, int
  
      int ndst = col->number==BCF_VL_A ? line->n_allele - 1 : line->n_allele;
      int *map = vcmp_map_ARvalues(args->vcmp,ndst,nals,als,line->n_allele,line->d.allele);
-    if ( !map ) error("REF alleles not compatible at %s:%d\n");
+    if ( !map ) error("REF alleles not compatible at %s:%d\n", bcf_seqname(args->hdr,line),line->pos+1);
  
      // fill in any missing values in the target VCF (or all, if not present)
      int ntmpf2 = bcf_get_info_float(args->hdr, line, col->hdr_key_dst, &args->tmpf2, &args->mtmpf2);
@@ -626,7 +655,7 @@ static int setter_info_real(args_t *args, bcf1_t *line, annot_col_t *col, void *
      {
          double val = strtod(str, &end);
          if ( end==str )
-            error("Could not parse %s at %s:%d .. [%s]\n", bcf_seqname(args->hdr,line),line->pos+1,tab->cols[col->icol]);
+            error("Could not parse %s at %s:%d .. [%s]\n", col->hdr_key_src,bcf_seqname(args->hdr,line),line->pos+1,tab->cols[col->icol]);
          ntmpf++;
          hts_expand(float,ntmpf,args->mtmpf,args->tmpf);
          args->tmpf[ntmpf-1] = val;
@@ -679,7 +708,7 @@ static int setter_ARinfo_string(args_t *args, bcf1_t *line, annot_col_t *col, in
  
      int ndst = col->number==BCF_VL_A ? line->n_allele - 1 : line->n_allele;
      int *map = vcmp_map_ARvalues(args->vcmp,ndst,nals,als,line->n_allele,line->d.allele);
-    if ( !map ) error("REF alleles not compatible at %s:%d\n");
+    if ( !map ) error("REF alleles not compatible at %s:%d\n", bcf_seqname(args->hdr,line),line->pos+1);
  
      // fill in any missing values in the target VCF (or all, if not present)
      int i, empty = 0, nstr, mstr = args->tmpks.m;
@@ -1115,21 +1144,261 @@ static int setter_format_str(args_t *args, bcf1_t *line, annot_col_t *col, void
  
      return core_setter_format_str(args,line,col,args->tmpp);
  }
+static int determine_ploidy(int nals, int *vals, int nvals1, uint8_t *smpl, int nsmpl)
+{
+    int i, j, ndip = nals*(nals+1)/2, max_ploidy = 0;
+    for (i=0; i<nsmpl; i++)
+    {
+        int *ptr = vals + i*nvals1;
+        int has_value = 0;
+        for (j=0; j<nvals1; j++)
+        {
+            if ( ptr[j]==bcf_int32_vector_end ) break;
+            if ( ptr[j]!=bcf_int32_missing ) has_value = 1;
+        }
+        if ( has_value )
+        {
+            if ( j==ndip )
+            { 
+                smpl[i] = 2;
+                max_ploidy = 2; 
+            }
+            else if ( j==nals )
+            { 
+                smpl[i] = 1;
+                if ( !max_ploidy ) max_ploidy = 1;
+            }
+            else return -j;
+        }
+        else smpl[i] = 0;
+    }
+    return max_ploidy;
+}
  static int vcf_setter_format_int(args_t *args, bcf1_t *line, annot_col_t *col, void *data)
  {
      bcf1_t *rec = (bcf1_t*) data;
      int nsrc = bcf_get_format_int32(args->files->readers[1].header,rec,col->hdr_key_src,&args->tmpi,&args->mtmpi);
      if ( nsrc==-3 ) return 0;    // the tag is not present
      if ( nsrc<=0 ) return 1;     // error
-    return core_setter_format_int(args,line,col,args->tmpi,nsrc/bcf_hdr_nsamples(args->files->readers[1].header));
+    int nsmpl_src = bcf_hdr_nsamples(args->files->readers[1].header);
+    int nsrc1 = nsrc / nsmpl_src;
+    if ( col->number!=BCF_VL_G && col->number!=BCF_VL_R && col->number!=BCF_VL_A )
+        return core_setter_format_int(args,line,col,args->tmpi,nsrc1);
+
+    // create mapping from src to dst genotypes, haploid and diploid version
+    int nmap_hap = col->number==BCF_VL_G || col->number==BCF_VL_R ? rec->n_allele : rec->n_allele - 1;
+    int *map_hap = vcmp_map_ARvalues(args->vcmp,nmap_hap,line->n_allele,line->d.allele,rec->n_allele,rec->d.allele);
+    if ( !map_hap ) error("REF alleles not compatible at %s:%d\n", bcf_seqname(args->hdr,line),line->pos+1);
+
+    int i, j;
+    if ( rec->n_allele==line->n_allele )
+    {
+        // alleles unchanged?
+        for (i=0; i<rec->n_allele; i++) if ( map_hap[i]!=i ) break;
+        if ( i==rec->n_allele )
+            return core_setter_format_int(args,line,col,args->tmpi,nsrc1);
+    }
+
+    int nsmpl_dst = rec->n_sample;
+    int ndst  = bcf_get_format_int32(args->hdr,line,col->hdr_key_dst,&args->tmpi2,&args->mtmpi2);
+    int ndst1 = ndst / nsmpl_dst;
+    if ( ndst <= 0 )
+    {
+        if ( col->replace==REPLACE_NON_MISSING ) return 0;  // overwrite only if present
+        if ( col->number==BCF_VL_G )
+            ndst1 = line->n_allele*(line->n_allele+1)/2;
+        else
+            ndst1 = col->number==BCF_VL_A ? line->n_allele - 1 : line->n_allele;
+        hts_expand(int, ndst1*nsmpl_dst, args->mtmpi2, args->tmpi2);
+        for (i=0; i<nsmpl_dst; i++)
+        {
+            int32_t *dst = args->tmpi2 + i*ndst1;
+            for (j=0; j<ndst1; j++) dst[j] = bcf_int32_missing;
+        }
+    }
+
+    int nmap_dip = 0, *map_dip = NULL;
+    if ( col->number==BCF_VL_G )
+    {
+        map_dip = vcmp_map_dipGvalues(args->vcmp, &nmap_dip);
+        if ( !args->src_smpl_pld )
+        {
+            args->src_smpl_pld = (uint8_t*) malloc(nsmpl_src);
+            args->dst_smpl_pld = (uint8_t*) malloc(nsmpl_dst);
+        }
+        int pld_src = determine_ploidy(rec->n_allele, args->tmpi, nsrc1, args->src_smpl_pld, nsmpl_src);
+        if ( pld_src<0 )
+            error("Unexpected number of %s values (%d) for %d alleles at %s:%d\n", col->hdr_key_src,-pld_src, rec->n_allele, bcf_seqname(bcf_sr_get_header(args->files,1),rec),rec->pos+1);
+        int pld_dst = determine_ploidy(line->n_allele, args->tmpi2, ndst1, args->dst_smpl_pld, nsmpl_dst);
+        if ( pld_dst<0 )
+            error("Unexpected number of %s values (%d) for %d alleles at %s:%d\n", col->hdr_key_src,-pld_dst, line->n_allele, bcf_seqname(args->hdr,line),line->pos+1);
+
+        int ndst1_new = pld_dst==1 ? line->n_allele : line->n_allele*(line->n_allele+1)/2;
+        if ( ndst1_new != ndst1 )
+        {
+            if ( ndst1 ) error("todo: %s ndst1!=ndst .. %d %d  at %s:%d\n",col->hdr_key_src,ndst1_new,ndst1,bcf_seqname(args->hdr,line),line->pos+1);
+            ndst1 = ndst1_new;
+            hts_expand(int32_t, ndst1*nsmpl_dst, args->mtmpi2, args->tmpi2);
+        }
+    }
+    else if ( !ndst1 )
+    {
+        ndst1 = col->number==BCF_VL_A ? line->n_allele - 1 : line->n_allele;
+        hts_expand(int32_t, ndst1*nsmpl_dst, args->mtmpi2, args->tmpi2);
+    }
+
+    for (i=0; i<nsmpl_dst; i++)
+    {
+        int ii = args->sample_map ? args->sample_map[i] : i;
+        int32_t *ptr_src = args->tmpi + i*nsrc1;
+        int32_t *ptr_dst = args->tmpi2 + ii*ndst1;
+
+        if ( col->number==BCF_VL_G )
+        {
+            if ( args->src_smpl_pld[ii] > 0 && args->dst_smpl_pld[i] > 0 && args->src_smpl_pld[ii]!=args->dst_smpl_pld[i] )
+                error("Sample ploidy differs at %s:%d\n", bcf_seqname(args->hdr,line),line->pos+1);
+            if ( !args->dst_smpl_pld[i] )
+                for (j=0; j<ndst1; j++) ptr_dst[j] = bcf_int32_missing;
+        }
+        if ( col->number!=BCF_VL_G || args->src_smpl_pld[i]==1 )
+        {
+            for (j=0; j<nmap_hap; j++)
+            {
+                int k = map_hap[j];
+                if ( k>=0 ) ptr_dst[k] = ptr_src[j];
+            }
+            if ( col->number==BCF_VL_G )
+                for (j=line->n_allele; j<ndst1; j++) ptr_dst[j++] = bcf_int32_vector_end;
+        }
+        else
+        {
+            for (j=0; j<nmap_dip; j++)
+            {
+                int k = map_dip[j];
+                if ( k>=0 ) ptr_dst[k] = ptr_src[j];
+            }
+        }
+    }
+    return bcf_update_format_int32(args->hdr_out,line,col->hdr_key_dst,args->tmpi2,nsmpl_dst*ndst1);
  }
  static int vcf_setter_format_real(args_t *args, bcf1_t *line, annot_col_t *col, void *data)
  {
+
      bcf1_t *rec = (bcf1_t*) data;
      int nsrc = bcf_get_format_float(args->files->readers[1].header,rec,col->hdr_key_src,&args->tmpf,&args->mtmpf);
      if ( nsrc==-3 ) return 0;    // the tag is not present
      if ( nsrc<=0 ) return 1;     // error
-    return core_setter_format_real(args,line,col,args->tmpf,nsrc/bcf_hdr_nsamples(args->files->readers[1].header));
+    int nsmpl_src = bcf_hdr_nsamples(args->files->readers[1].header);
+    int nsrc1 = nsrc / nsmpl_src;
+    if ( col->number!=BCF_VL_G && col->number!=BCF_VL_R && col->number!=BCF_VL_A )
+        return core_setter_format_real(args,line,col,args->tmpf,nsrc1);
+
+    // create mapping from src to dst genotypes, haploid and diploid version
+    int nmap_hap = col->number==BCF_VL_G || col->number==BCF_VL_R ? rec->n_allele : rec->n_allele - 1;
+    int *map_hap = vcmp_map_ARvalues(args->vcmp,nmap_hap,line->n_allele,line->d.allele,rec->n_allele,rec->d.allele);
+    if ( !map_hap ) error("REF alleles not compatible at %s:%d\n", bcf_seqname(args->hdr,line),line->pos+1);
+
+    int i, j;
+    if ( rec->n_allele==line->n_allele )
+    {
+        // alleles unchanged?
+        for (i=0; i<rec->n_allele; i++) if ( map_hap[i]!=i ) break;
+        if ( i==rec->n_allele )
+            return core_setter_format_real(args,line,col,args->tmpf,nsrc1);
+    }
+
+    int nsmpl_dst = rec->n_sample;
+    int ndst  = bcf_get_format_float(args->hdr,line,col->hdr_key_dst,&args->tmpf2,&args->mtmpf2);
+    int ndst1 = ndst / nsmpl_dst;
+    if ( ndst <= 0 )
+    {
+        if ( col->replace==REPLACE_NON_MISSING ) return 0;  // overwrite only if present
+        if ( col->number==BCF_VL_G )
+            ndst1 = line->n_allele*(line->n_allele+1)/2;
+        else
+            ndst1 = col->number==BCF_VL_A ? line->n_allele - 1 : line->n_allele;
+        hts_expand(float, ndst1*nsmpl_dst, args->mtmpf2, args->tmpf2);
+        for (i=0; i<nsmpl_dst; i++)
+        {
+            float *dst = args->tmpf2 + i*ndst1;
+            for (j=0; j<ndst1; j++) bcf_float_set_missing(dst[j]);
+        }
+    }
+
+    int nmap_dip = 0, *map_dip = NULL;
+    if ( col->number==BCF_VL_G )
+    {
+        map_dip = vcmp_map_dipGvalues(args->vcmp, &nmap_dip);
+        if ( !args->src_smpl_pld )
+        {
+            args->src_smpl_pld = (uint8_t*) malloc(nsmpl_src);
+            args->dst_smpl_pld = (uint8_t*) malloc(nsmpl_dst);
+        }
+        int pld_src = determine_ploidy(rec->n_allele, args->tmpi, nsrc1, args->src_smpl_pld, nsmpl_src);
+        if ( pld_src<0 )
+            error("Unexpected number of %s values (%d) for %d alleles at %s:%d\n", col->hdr_key_src,-pld_src, rec->n_allele, bcf_seqname(bcf_sr_get_header(args->files,1),rec),rec->pos+1);
+        int pld_dst = determine_ploidy(line->n_allele, args->tmpi2, ndst1, args->dst_smpl_pld, nsmpl_dst);
+        if ( pld_dst<0 )
+            error("Unexpected number of %s values (%d) for %d alleles at %s:%d\n", col->hdr_key_src,-pld_dst, line->n_allele, bcf_seqname(args->hdr,line),line->pos+1);
+
+        int ndst1_new = pld_dst==1 ? line->n_allele : line->n_allele*(line->n_allele+1)/2;
+        if ( ndst1_new != ndst1 )
+        {
+            if ( ndst1 ) error("todo: %s ndst1!=ndst .. %d %d  at %s:%d\n",col->hdr_key_src,ndst1_new,ndst1,bcf_seqname(args->hdr,line),line->pos+1);
+            ndst1 = ndst1_new;
+            hts_expand(float, ndst1*nsmpl_dst, args->mtmpf2, args->tmpf2);
+        }
+    }
+    else if ( !ndst1 )
+    {
+        ndst1 = col->number==BCF_VL_A ? line->n_allele - 1 : line->n_allele;
+        hts_expand(float, ndst1*nsmpl_dst, args->mtmpf2, args->tmpf2);
+    }
+
+    for (i=0; i<nsmpl_dst; i++)
+    {
+        int ii = args->sample_map ? args->sample_map[i] : i;
+        float *ptr_src = args->tmpf + i*nsrc1;
+        float *ptr_dst = args->tmpf2 + ii*ndst1;
+
+        if ( col->number==BCF_VL_G )
+        {
+            if ( args->src_smpl_pld[ii] > 0 && args->dst_smpl_pld[i] > 0 && args->src_smpl_pld[ii]!=args->dst_smpl_pld[i] )
+                error("Sample ploidy differs at %s:%d\n", bcf_seqname(args->hdr,line),line->pos+1);
+            if ( !args->dst_smpl_pld[i] )
+                for (j=0; j<ndst1; j++) bcf_float_set_missing(ptr_dst[j]);
+        }
+        if ( col->number!=BCF_VL_G || args->src_smpl_pld[i]==1 )
+        {
+            for (j=0; j<nmap_hap; j++)
+            {
+                int k = map_hap[j];
+                if ( k>=0 )
+                {
+                    if ( bcf_float_is_missing(ptr_src[j]) ) bcf_float_set_missing(ptr_dst[k]);
+                    else if ( bcf_float_is_vector_end(ptr_src[j]) ) bcf_float_set_vector_end(ptr_dst[k]); 
+                    else ptr_dst[k] = ptr_src[j];
+                }
+            }
+            if ( col->number==BCF_VL_G )
+                for (j=line->n_allele; j<ndst1; j++) bcf_float_set_vector_end(ptr_dst[j]);
+        }
+        else
+        {
+            for (j=0; j<nmap_dip; j++)
+            {
+                int k = map_dip[j];
+                if ( k>=0 )
+                {
+                    if ( bcf_float_is_missing(ptr_src[j]) ) bcf_float_set_missing(ptr_dst[k]);
+                    else if ( bcf_float_is_vector_end(ptr_src[j]) ) bcf_float_set_vector_end(ptr_dst[k]);
+                    else ptr_dst[k] = ptr_src[j];
+                }
+            }
+        }
+    }
+    return bcf_update_format_float(args->hdr_out,line,col->hdr_key_dst,args->tmpf2,nsmpl_dst*ndst1);
+
  }
  
  static int vcf_setter_format_str(args_t *args, bcf1_t *line, annot_col_t *col, void *data)
@@ -1341,8 +1610,30 @@ static void init_columns(args_t *args)
          else if ( !strcasecmp("POS",str.s) ) args->from_idx = icol;
          else if ( !strcasecmp("FROM",str.s) ) args->from_idx = icol;
          else if ( !strcasecmp("TO",str.s) ) args->to_idx = icol;
-        else if ( !strcasecmp("REF",str.s) ) args->ref_idx = icol;
-        else if ( !strcasecmp("ALT",str.s) ) args->alt_idx = icol;
+        else if ( !strcasecmp("REF",str.s) )
+        {
+            if ( args->tgts_is_vcf )
+            {
+                args->ncols++; args->cols = (annot_col_t*) realloc(args->cols,sizeof(annot_col_t)*args->ncols);
+                annot_col_t *col = &args->cols[args->ncols-1];
+                col->setter = vcf_setter_ref;
+                col->hdr_key_src = strdup(str.s);
+                col->hdr_key_dst = strdup(str.s);
+            }
+            else args->ref_idx = icol;
+        }
+        else if ( !strcasecmp("ALT",str.s) )
+        {
+            if ( args->tgts_is_vcf )
+            {
+                args->ncols++; args->cols = (annot_col_t*) realloc(args->cols,sizeof(annot_col_t)*args->ncols);
+                annot_col_t *col = &args->cols[args->ncols-1];
+                col->setter = vcf_setter_alt;
+                col->hdr_key_src = strdup(str.s);
+                col->hdr_key_dst = strdup(str.s);
+            }
+            else args->alt_idx = icol;
+        }
          else if ( !strcasecmp("ID",str.s) )
          {
              if ( replace==REPLACE_NON_MISSING ) error("Apologies, the -ID feature has not been implemented yet.\n");
@@ -1460,6 +1751,8 @@ static void init_columns(args_t *args)
                          case BCF_HT_STR:    col->setter = vcf_setter_format_str; has_fmt_str = 1; break;
                          default: error("The type of %s not recognised (%d)\n", str.s,bcf_hdr_id2type(args->hdr_out,BCF_HL_FMT,hdr_id));
                      }
+                hdr_id = bcf_hdr_id2int(tgts_hdr, BCF_DT_ID, hrec->vals[k]);
+                col->number = bcf_hdr_id2length(tgts_hdr,BCF_HL_FMT,hdr_id);
              }
          }
          else if ( !strncasecmp("FORMAT/",str.s, 7) || !strncasecmp("FMT/",str.s,4) )
@@ -1479,6 +1772,7 @@ static void init_columns(args_t *args)
              if ( args->tgts_is_vcf )
              {
                  bcf_hrec_t *hrec = bcf_hdr_get_hrec(args->files->readers[1].header, BCF_HL_FMT, "ID", key_src, NULL);
+                if ( !hrec ) error("No such annotation \"%s\" in %s\n", key_src,args->targets_fname);
                  tmp.l = 0;
                  bcf_hrec_format_rename(hrec, key_dst, &tmp);
                  bcf_hdr_append(args->hdr_out, tmp.s);
@@ -1508,6 +1802,12 @@ static void init_columns(args_t *args)
                      case BCF_HT_STR:    col->setter = args->tgts_is_vcf ? vcf_setter_format_str  : setter_format_str; has_fmt_str = 1; break;
                      default: error("The type of %s not recognised (%d)\n", str.s,bcf_hdr_id2type(args->hdr_out,BCF_HL_FMT,hdr_id));
                  }
+            if ( args->tgts_is_vcf )
+            {
+                bcf_hdr_t *tgts_hdr = args->files->readers[1].header;
+                hdr_id = bcf_hdr_id2int(tgts_hdr, BCF_DT_ID, col->hdr_key_src);
+                col->number = bcf_hdr_id2length(tgts_hdr,BCF_HL_FMT,hdr_id);
+            }
          }
          else
          {
@@ -1699,6 +1999,8 @@ static void destroy_data(args_t *args)
      free(args->tmpp2);
      free(args->tmpi3);
      free(args->tmpf3);
+    free(args->src_smpl_pld);
+    free(args->dst_smpl_pld);
      if ( args->set_ids )
          convert_destroy(args->set_ids);
      if ( args->filter )
diff --git a/bcftools/vcfconcat.c b/bcftools/vcfconcat.c
index 3345c203135e33b2569995a6847961998862c8d0..934125048a7269b490417690393ed543fb168771 100644
--- a/bcftools/vcfconcat.c
+++ b/bcftools/vcfconcat.c
@@ -28,6 +28,7 @@ THE SOFTWARE.  */
  #include <string.h>
  #include <errno.h>
  #include <math.h>
+#include <inttypes.h>
  #include <htslib/vcf.h>
  #include <htslib/synced_bcf_reader.h>
  #include <htslib/kseq.h>
@@ -522,7 +523,7 @@ static void concat(args_t *args)
                      args->seen_seq[chr_id] = 1;
                      prev_chr_id = chr_id;
  
-                    if ( vcf_write_line(args->out_fh, &fp->line)!=0 ) error("Failed to write %d bytes\n", fp->line.l);
+                    if ( vcf_write_line(args->out_fh, &fp->line)!=0 ) error("Failed to write %"PRIu64" bytes\n", (uint64_t)fp->line.l);
                  }
              }
              else
@@ -593,7 +594,7 @@ int print_vcf_gz_header(BGZF *fp, BGZF *bgzf_out, int print_header, kstring_t *t
      }
      if ( print_header )
      {
-        if ( bgzf_write(bgzf_out,tmp->s,tmp->l) != tmp->l ) error("Failed to write %d bytes\n", tmp->l);
+        if ( bgzf_write(bgzf_out,tmp->s,tmp->l) != tmp->l ) error("Failed to write %"PRIu64" bytes\n", (uint64_t)tmp->l);
          tmp->l = 0;
      }
      return nskip;
@@ -652,7 +653,7 @@ static void naive_concat(args_t *args)
              {
                  if ( bgzf_write(bgzf_out, "BCF\2\2", 5) !=5 ) error("Failed to write %d bytes to %s\n", 5,args->output_fname);
                  if ( bgzf_write(bgzf_out, &tmp.l, 4) !=4 ) error("Failed to write %d bytes to %s\n", 4,args->output_fname);
-                if ( bgzf_write(bgzf_out, tmp.s, tmp.l) != tmp.l) error("Failed to write %d bytes to %s\n", tmp.l,args->output_fname);
+                if ( bgzf_write(bgzf_out, tmp.s, tmp.l) != tmp.l) error("Failed to write %"PRId64" bytes to %s\n", (uint64_t)tmp.l,args->output_fname);
              }
              nskip = fp->block_offset;
          }
@@ -683,10 +684,10 @@ static void naive_concat(args_t *args)
              nblock = unpackInt16(buf+16) + 1;
              assert( nblock <= page_size && nblock >= nheader );
              nread += bgzf_raw_read(fp, buf+nheader, nblock - nheader);
-            if ( nread!=nblock ) error("Could not read %d bytes: %s\n",nblock,args->fnames[i]);
+            if ( nread!=nblock ) error("Could not read %"PRId64" bytes: %s\n",(uint64_t)nblock,args->fnames[i]);
              if ( nread==neof && !memcmp(buf,eof,neof) ) continue;
              nwr = bgzf_raw_write(bgzf_out, buf, nread);
-            if ( nwr != nread ) error("Write failed, wrote %d instead of %d bytes.\n", nwr,(int)nread);
+            if ( nwr != nread ) error("Write failed, wrote %"PRId64" instead of %d bytes.\n", (uint64_t)nwr,(int)nread);
          }
          if (hts_close(hts_fp)) error("Close failed: %s\n",args->fnames[i]);
      }
diff --git a/bcftools/vcfconcat.c.pysam.c b/bcftools/vcfconcat.c.pysam.c
index 1a67b8647d8de219c2f7f0b5a3a6b57fc95ad990..1bc00c4726f99f1aa7c3aeef433754c931d8b062 100644
--- a/bcftools/vcfconcat.c.pysam.c
+++ b/bcftools/vcfconcat.c.pysam.c
@@ -30,6 +30,7 @@ THE SOFTWARE.  */
  #include <string.h>
  #include <errno.h>
  #include <math.h>
+#include <inttypes.h>
  #include <htslib/vcf.h>
  #include <htslib/synced_bcf_reader.h>
  #include <htslib/kseq.h>
@@ -524,7 +525,7 @@ static void concat(args_t *args)
                      args->seen_seq[chr_id] = 1;
                      prev_chr_id = chr_id;
  
-                    if ( vcf_write_line(args->out_fh, &fp->line)!=0 ) error("Failed to write %d bytes\n", fp->line.l);
+                    if ( vcf_write_line(args->out_fh, &fp->line)!=0 ) error("Failed to write %"PRIu64" bytes\n", (uint64_t)fp->line.l);
                  }
              }
              else
@@ -595,7 +596,7 @@ int print_vcf_gz_header(BGZF *fp, BGZF *bgzf_out, int print_header, kstring_t *t
      }
      if ( print_header )
      {
-        if ( bgzf_write(bgzf_out,tmp->s,tmp->l) != tmp->l ) error("Failed to write %d bytes\n", tmp->l);
+        if ( bgzf_write(bgzf_out,tmp->s,tmp->l) != tmp->l ) error("Failed to write %"PRIu64" bytes\n", (uint64_t)tmp->l);
          tmp->l = 0;
      }
      return nskip;
@@ -654,7 +655,7 @@ static void naive_concat(args_t *args)
              {
                  if ( bgzf_write(bgzf_out, "BCF\2\2", 5) !=5 ) error("Failed to write %d bytes to %s\n", 5,args->output_fname);
                  if ( bgzf_write(bgzf_out, &tmp.l, 4) !=4 ) error("Failed to write %d bytes to %s\n", 4,args->output_fname);
-                if ( bgzf_write(bgzf_out, tmp.s, tmp.l) != tmp.l) error("Failed to write %d bytes to %s\n", tmp.l,args->output_fname);
+                if ( bgzf_write(bgzf_out, tmp.s, tmp.l) != tmp.l) error("Failed to write %"PRId64" bytes to %s\n", (uint64_t)tmp.l,args->output_fname);
              }
              nskip = fp->block_offset;
          }
@@ -685,10 +686,10 @@ static void naive_concat(args_t *args)
              nblock = unpackInt16(buf+16) + 1;
              assert( nblock <= page_size && nblock >= nheader );
              nread += bgzf_raw_read(fp, buf+nheader, nblock - nheader);
-            if ( nread!=nblock ) error("Could not read %d bytes: %s\n",nblock,args->fnames[i]);
+            if ( nread!=nblock ) error("Could not read %"PRId64" bytes: %s\n",(uint64_t)nblock,args->fnames[i]);
              if ( nread==neof && !memcmp(buf,eof,neof) ) continue;
              nwr = bgzf_raw_write(bgzf_out, buf, nread);
-            if ( nwr != nread ) error("Write failed, wrote %d instead of %d bytes.\n", nwr,(int)nread);
+            if ( nwr != nread ) error("Write failed, wrote %"PRId64" instead of %d bytes.\n", (uint64_t)nwr,(int)nread);
          }
          if (hts_close(hts_fp)) error("Close failed: %s\n",args->fnames[i]);
      }
diff --git a/bcftools/vcfconvert.c b/bcftools/vcfconvert.c
index f815e25cb3426225e1391211207a96fe2980da37..c8c0bd174de280a756f2fc7ecfd83cfffacecc85 100644
--- a/bcftools/vcfconvert.c
+++ b/bcftools/vcfconvert.c
@@ -1112,7 +1112,7 @@ static inline int tsv_setter_aa1(args_t *args, char *ss, char *se, int alleles[]
      {
          // missing GT
          gts[0] = bcf_gt_missing;
-        gts[1] = bcf_int32_vector_end;
+        gts[1] = bcf_gt_missing;
          args->n.missing++;
          return 0;
      }
@@ -1306,13 +1306,11 @@ static void gvcf_to_vcf(args_t *args)
          {
              int pass = filter_test(args->filter, line, NULL);
              if ( args->filter_logic & FLT_EXCLUDE ) pass = pass ? 0 : 1;
-            if ( !pass ) continue;
-        }
-
-        if (!bcf_has_filter(hdr,line,"PASS"))
-        {
-            bcf_write(out_fh,hdr,line);
-            continue;
+            if ( !pass ) 
+            {
+                bcf_write(out_fh,hdr,line);
+                continue;
+            }
          }
  
          // check if alleles compatible with being a gVCF record
diff --git a/bcftools/vcfconvert.c.pysam.c b/bcftools/vcfconvert.c.pysam.c
index a0ec8b9a5f934b2d49ac969bd52509a06cd0ab31..2efd4e197cf94928a7458752c46ad0e1112de3f2 100644
--- a/bcftools/vcfconvert.c.pysam.c
+++ b/bcftools/vcfconvert.c.pysam.c
@@ -1114,7 +1114,7 @@ static inline int tsv_setter_aa1(args_t *args, char *ss, char *se, int alleles[]
      {
          // missing GT
          gts[0] = bcf_gt_missing;
-        gts[1] = bcf_int32_vector_end;
+        gts[1] = bcf_gt_missing;
          args->n.missing++;
          return 0;
      }
@@ -1308,13 +1308,11 @@ static void gvcf_to_vcf(args_t *args)
          {
              int pass = filter_test(args->filter, line, NULL);
              if ( args->filter_logic & FLT_EXCLUDE ) pass = pass ? 0 : 1;
-            if ( !pass ) continue;
-        }
-
-        if (!bcf_has_filter(hdr,line,"PASS"))
-        {
-            bcf_write(out_fh,hdr,line);
-            continue;
+            if ( !pass ) 
+            {
+                bcf_write(out_fh,hdr,line);
+                continue;
+            }
          }
  
          // check if alleles compatible with being a gVCF record
diff --git a/bcftools/vcfindex.c b/bcftools/vcfindex.c
index 807fedd5f5c0dd4445d3d44ff79a84e750a06a09..c19587dc5f4629e8a3f0ff2c02b04431041037e3 100644
--- a/bcftools/vcfindex.c
+++ b/bcftools/vcfindex.c
@@ -213,6 +213,7 @@ int main_vcfindex(int argc, char *argv[])
          // check for truncated files, allow only with -f
          BGZF *fp = bgzf_open(fname, "r");
          if ( !fp ) error("index: failed to open %s\n", fname);
+        if ( bgzf_compression(fp)!=2 ) error("index: the file is not BGZF compressed, cannot index: %s\n", fname);
          if ( bgzf_check_EOF(fp)!=1 ) error("index: the input is probably truncated, use -f to index anyway: %s\n", fname);
          if ( bgzf_close(fp)!=0 ) error("index: close failed: %s\n", fname);
      }
diff --git a/bcftools/vcfindex.c.pysam.c b/bcftools/vcfindex.c.pysam.c
index 1e7578c0f4c80d59873bd0d252502f84321e0a53..bcd13c971e53bebb1a4d2affda176c873f295026 100644
--- a/bcftools/vcfindex.c.pysam.c
+++ b/bcftools/vcfindex.c.pysam.c
@@ -215,6 +215,7 @@ int main_vcfindex(int argc, char *argv[])
          // check for truncated files, allow only with -f
          BGZF *fp = bgzf_open(fname, "r");
          if ( !fp ) error("index: failed to open %s\n", fname);
+        if ( bgzf_compression(fp)!=2 ) error("index: the file is not BGZF compressed, cannot index: %s\n", fname);
          if ( bgzf_check_EOF(fp)!=1 ) error("index: the input is probably truncated, use -f to index anyway: %s\n", fname);
          if ( bgzf_close(fp)!=0 ) error("index: close failed: %s\n", fname);
      }
diff --git a/bcftools/vcfisec.c b/bcftools/vcfisec.c
index 3e0e1e534351d9fc64c2d4b1679bf127d59a8cd3..6c66ecba0699296cb12eea5af537f7eaa1425844 100644
--- a/bcftools/vcfisec.c
+++ b/bcftools/vcfisec.c
@@ -582,7 +582,7 @@ int main_vcfisec(int argc, char *argv[])
      if ( !args->targets_list )
      {
          if ( argc-optind<2  ) error("Expected multiple files or the --targets option\n");
-        if ( !args->isec_op ) error("Expected two file names or one of the options --complement, --nfiles or --targets\n");
+        if ( !args->isec_op ) error("One of the options --complement, --nfiles or --targets must be given with more than two files\n");
      }
      args->files->require_index = 1;
      while (optind<argc)
diff --git a/bcftools/vcfisec.c.pysam.c b/bcftools/vcfisec.c.pysam.c
index d168457992fc0884682b43fb514fc3ba6aaa757e..339834db99829619b4af5e0d95bb37b331fa655f 100644
--- a/bcftools/vcfisec.c.pysam.c
+++ b/bcftools/vcfisec.c.pysam.c
@@ -584,7 +584,7 @@ int main_vcfisec(int argc, char *argv[])
      if ( !args->targets_list )
      {
          if ( argc-optind<2  ) error("Expected multiple files or the --targets option\n");
-        if ( !args->isec_op ) error("Expected two file names or one of the options --complement, --nfiles or --targets\n");
+        if ( !args->isec_op ) error("One of the options --complement, --nfiles or --targets must be given with more than two files\n");
      }
      args->files->require_index = 1;
      while (optind<argc)
diff --git a/bcftools/vcfmerge.c b/bcftools/vcfmerge.c
index 31f5dad5137e226aa4c5439e3988624984f6aefd..27f0417cfb51202b82696b1c3168039056449ca3 100644
--- a/bcftools/vcfmerge.c
+++ b/bcftools/vcfmerge.c
@@ -1281,6 +1281,37 @@ void update_AN_AC(bcf_hdr_t *hdr, bcf1_t *line)
      free(tmp);
  }
  
+static inline int max_used_gt_ploidy(bcf_fmt_t *fmt, int nsmpl)
+{
+    int i,j, max_ploidy = 0;
+
+    #define BRANCH(type_t, vector_end) { \
+        type_t *ptr  = (type_t*) fmt->p; \
+        for (i=0; i<nsmpl; i++) \
+        { \
+            for (j=0; j<fmt->n; j++) \
+                if ( ptr[j]==vector_end ) break; \
+            if ( j==fmt->n ) \
+            { \
+                /* all fields were used */ \
+                if ( max_ploidy < j ) max_ploidy = j; \
+                break; \
+            } \
+            if ( max_ploidy < j ) max_ploidy = j; \
+            ptr += fmt->n; \
+        } \
+    }
+    switch (fmt->type)
+    {
+        case BCF_BT_INT8:  BRANCH(int8_t,   bcf_int8_vector_end); break;
+        case BCF_BT_INT16: BRANCH(int16_t, bcf_int16_vector_end); break;
+        case BCF_BT_INT32: BRANCH(int32_t, bcf_int32_vector_end); break;
+        default: error("Unexpected case: %d\n", fmt->type);
+    }
+    #undef BRANCH
+    return max_ploidy;
+}
+
  void merge_GT(args_t *args, bcf_fmt_t **fmt_map, bcf1_t *out)
  {
      bcf_srs_t *files = args->files;
@@ -1291,9 +1322,12 @@ void merge_GT(args_t *args, bcf_fmt_t **fmt_map, bcf1_t *out)
      int nsize = 0, msize = sizeof(int32_t);
      for (i=0; i<files->nreaders; i++)
      {
-        if ( !fmt_map[i] ) continue;
-        if ( fmt_map[i]->n > nsize ) nsize = fmt_map[i]->n;
+        bcf_fmt_t *fmt = fmt_map[i];
+        if ( !fmt ) continue;
+        int pld = max_used_gt_ploidy(fmt_map[i], bcf_hdr_nsamples(bcf_sr_get_header(args->files,i)));
+        if ( nsize < pld ) nsize = pld;
      }
+    if ( nsize==0 ) nsize = 1;
  
      if ( ma->ntmp_arr < nsamples*nsize*msize )
      {
@@ -1311,7 +1345,7 @@ void merge_GT(args_t *args, bcf_fmt_t **fmt_map, bcf1_t *out)
          int32_t *tmp  = (int32_t *) ma->tmp_arr + ismpl*nsize;
          int irec = ma->buf[i].cur;
  
-        int j, k;
+        int j,k;
          if ( !fmt_ori )
          {
              // missing values: assume maximum ploidy
@@ -1437,17 +1471,23 @@ void merge_format_field(args_t *args, bcf_fmt_t **fmt_map, bcf1_t *out)
              {
                  // if all fields are missing then n==1 is valid
                  if ( fmt_ori->n!=1 && fmt_ori->n != nals_ori*(nals_ori+1)/2 && fmt_map[i]->n != nals_ori )
-                    error("Incorrect number of %s fields (%d) at %s:%d, cannot merge.\n", key,fmt_ori->n,bcf_seqname(args->out_hdr,out),out->pos+1);
+                    error("Incorrect number of FORMAT/%s values at %s:%d, cannot merge. The tag is defined as Number=G, but found\n"
+                          "%d values and %d alleles. See also http://samtools.github.io/bcftools/howtos/FAQ.html#incorrect-nfields\n",
+                          key,bcf_seqname(args->out_hdr,out),out->pos+1,fmt_ori->n,nals_ori);
              }
              else if ( length==BCF_VL_A )
              {
                  if ( fmt_ori->n!=1 && fmt_ori->n != nals_ori-1 )
-                    error("Incorrect number of %s fields (%d) at %s:%d, cannot merge.\n", key,fmt_ori->n,bcf_seqname(args->out_hdr,out),out->pos+1);
+                    error("Incorrect number of FORMAT/%s values at %s:%d, cannot merge. The tag is defined as Number=A, but found\n"
+                          "%d values and %d alleles. See also http://samtools.github.io/bcftools/howtos/FAQ.html#incorrect-nfields\n",
+                          key,bcf_seqname(args->out_hdr,out),out->pos+1,fmt_ori->n,nals_ori);
              }
              else if ( length==BCF_VL_R )
              {
                  if ( fmt_ori->n!=1 && fmt_ori->n != nals_ori )
-                    error("Incorrect number of %s fields (%d) at %s:%d, cannot merge.\n", key,fmt_ori->n,bcf_seqname(args->out_hdr,out),out->pos+1);
+                    error("Incorrect number of FORMAT/%s values at %s:%d, cannot merge. The tag is defined as Number=R, but found\n"
+                          "%d values and %d alleles. See also http://samtools.github.io/bcftools/howtos/FAQ.html#incorrect-nfields\n",
+                          key,bcf_seqname(args->out_hdr,out),out->pos+1,fmt_ori->n,nals_ori);
              }
          }
  
@@ -2145,7 +2185,7 @@ int can_merge(args_t *args)
              }
              // normalize alleles
              maux->als = merge_alleles(line->d.allele, line->n_allele, buf->rec[j].map, maux->als, &maux->nals, &maux->mals);
-            if ( !maux->als ) error("Failed to merge alleles at %s:%d in %s\n",bcf_seqname(args->out_hdr,line),line->pos+1,reader->fname);
+            if ( !maux->als ) error("Failed to merge alleles at %s:%d in %s\n",maux->chr,line->pos+1,reader->fname);
              hts_expand0(int, maux->nals, maux->ncnt, maux->cnt);
              for (k=1; k<line->n_allele; k++)
                  maux->cnt[ buf->rec[j].map[k] ]++;    // how many times an allele appears in the files
diff --git a/bcftools/vcfmerge.c.pysam.c b/bcftools/vcfmerge.c.pysam.c
index e1047aeb6209c1c3cfbb5505cc5a6c94b804af35..6969da1b00cbb1e3b3be16263ca22294e37e13a9 100644
--- a/bcftools/vcfmerge.c.pysam.c
+++ b/bcftools/vcfmerge.c.pysam.c
@@ -1283,6 +1283,37 @@ void update_AN_AC(bcf_hdr_t *hdr, bcf1_t *line)
      free(tmp);
  }
  
+static inline int max_used_gt_ploidy(bcf_fmt_t *fmt, int nsmpl)
+{
+    int i,j, max_ploidy = 0;
+
+    #define BRANCH(type_t, vector_end) { \
+        type_t *ptr  = (type_t*) fmt->p; \
+        for (i=0; i<nsmpl; i++) \
+        { \
+            for (j=0; j<fmt->n; j++) \
+                if ( ptr[j]==vector_end ) break; \
+            if ( j==fmt->n ) \
+            { \
+                /* all fields were used */ \
+                if ( max_ploidy < j ) max_ploidy = j; \
+                break; \
+            } \
+            if ( max_ploidy < j ) max_ploidy = j; \
+            ptr += fmt->n; \
+        } \
+    }
+    switch (fmt->type)
+    {
+        case BCF_BT_INT8:  BRANCH(int8_t,   bcf_int8_vector_end); break;
+        case BCF_BT_INT16: BRANCH(int16_t, bcf_int16_vector_end); break;
+        case BCF_BT_INT32: BRANCH(int32_t, bcf_int32_vector_end); break;
+        default: error("Unexpected case: %d\n", fmt->type);
+    }
+    #undef BRANCH
+    return max_ploidy;
+}
+
  void merge_GT(args_t *args, bcf_fmt_t **fmt_map, bcf1_t *out)
  {
      bcf_srs_t *files = args->files;
@@ -1293,9 +1324,12 @@ void merge_GT(args_t *args, bcf_fmt_t **fmt_map, bcf1_t *out)
      int nsize = 0, msize = sizeof(int32_t);
      for (i=0; i<files->nreaders; i++)
      {
-        if ( !fmt_map[i] ) continue;
-        if ( fmt_map[i]->n > nsize ) nsize = fmt_map[i]->n;
+        bcf_fmt_t *fmt = fmt_map[i];
+        if ( !fmt ) continue;
+        int pld = max_used_gt_ploidy(fmt_map[i], bcf_hdr_nsamples(bcf_sr_get_header(args->files,i)));
+        if ( nsize < pld ) nsize = pld;
      }
+    if ( nsize==0 ) nsize = 1;
  
      if ( ma->ntmp_arr < nsamples*nsize*msize )
      {
@@ -1313,7 +1347,7 @@ void merge_GT(args_t *args, bcf_fmt_t **fmt_map, bcf1_t *out)
          int32_t *tmp  = (int32_t *) ma->tmp_arr + ismpl*nsize;
          int irec = ma->buf[i].cur;
  
-        int j, k;
+        int j,k;
          if ( !fmt_ori )
          {
              // missing values: assume maximum ploidy
@@ -1439,17 +1473,23 @@ void merge_format_field(args_t *args, bcf_fmt_t **fmt_map, bcf1_t *out)
              {
                  // if all fields are missing then n==1 is valid
                  if ( fmt_ori->n!=1 && fmt_ori->n != nals_ori*(nals_ori+1)/2 && fmt_map[i]->n != nals_ori )
-                    error("Incorrect number of %s fields (%d) at %s:%d, cannot merge.\n", key,fmt_ori->n,bcf_seqname(args->out_hdr,out),out->pos+1);
+                    error("Incorrect number of FORMAT/%s values at %s:%d, cannot merge. The tag is defined as Number=G, but found\n"
+                          "%d values and %d alleles. See also http://samtools.github.io/bcftools/howtos/FAQ.html#incorrect-nfields\n",
+                          key,bcf_seqname(args->out_hdr,out),out->pos+1,fmt_ori->n,nals_ori);
              }
              else if ( length==BCF_VL_A )
              {
                  if ( fmt_ori->n!=1 && fmt_ori->n != nals_ori-1 )
-                    error("Incorrect number of %s fields (%d) at %s:%d, cannot merge.\n", key,fmt_ori->n,bcf_seqname(args->out_hdr,out),out->pos+1);
+                    error("Incorrect number of FORMAT/%s values at %s:%d, cannot merge. The tag is defined as Number=A, but found\n"
+                          "%d values and %d alleles. See also http://samtools.github.io/bcftools/howtos/FAQ.html#incorrect-nfields\n",
+                          key,bcf_seqname(args->out_hdr,out),out->pos+1,fmt_ori->n,nals_ori);
              }
              else if ( length==BCF_VL_R )
              {
                  if ( fmt_ori->n!=1 && fmt_ori->n != nals_ori )
-                    error("Incorrect number of %s fields (%d) at %s:%d, cannot merge.\n", key,fmt_ori->n,bcf_seqname(args->out_hdr,out),out->pos+1);
+                    error("Incorrect number of FORMAT/%s values at %s:%d, cannot merge. The tag is defined as Number=R, but found\n"
+                          "%d values and %d alleles. See also http://samtools.github.io/bcftools/howtos/FAQ.html#incorrect-nfields\n",
+                          key,bcf_seqname(args->out_hdr,out),out->pos+1,fmt_ori->n,nals_ori);
              }
          }
  
@@ -2147,7 +2187,7 @@ int can_merge(args_t *args)
              }
              // normalize alleles
              maux->als = merge_alleles(line->d.allele, line->n_allele, buf->rec[j].map, maux->als, &maux->nals, &maux->mals);
-            if ( !maux->als ) error("Failed to merge alleles at %s:%d in %s\n",bcf_seqname(args->out_hdr,line),line->pos+1,reader->fname);
+            if ( !maux->als ) error("Failed to merge alleles at %s:%d in %s\n",maux->chr,line->pos+1,reader->fname);
              hts_expand0(int, maux->nals, maux->ncnt, maux->cnt);
              for (k=1; k<line->n_allele; k++)
                  maux->cnt[ buf->rec[j].map[k] ]++;    // how many times an allele appears in the files
diff --git a/bcftools/vcfnorm.c b/bcftools/vcfnorm.c
index ead1e8951d9ae02fa30d7d8ad7880b88d87fcc4b..7515e50afa1c58f4ec2a52dc36df13434c729990 100644
--- a/bcftools/vcfnorm.c
+++ b/bcftools/vcfnorm.c
@@ -73,7 +73,8 @@ typedef struct
      char **als;
      int mmaps, nals, mals;
      uint8_t *tmp_arr1, *tmp_arr2, *diploid;
-    int ntmp_arr1, ntmp_arr2;
+    int32_t *int32_arr;
+    int ntmp_arr1, ntmp_arr2, nint32_arr;
      kstring_t *tmp_str;
      kstring_t *tmp_als, tmp_als_str;
      int ntmp_als;
@@ -432,6 +433,15 @@ static int realign(args_t *args, bcf1_t *line)
      bcf_update_alleles_str(args->hdr,line,args->tmp_als_str.s);
      args->nchanged++;
  
+    // Update INFO/END if necessary
+    int new_reflen = strlen(line->d.allele[0]);
+    if ( (ori_pos!=line->pos || reflen!=new_reflen) && bcf_get_info_int32(args->hdr, line, "END", &args->int32_arr, &args->nint32_arr)==1 )
+    {
+        // bcf_update_alleles_str() messed up rlen because line->pos changed. This will be fixed by bcf_update_info_int32()
+        args->int32_arr[0] = line->pos + new_reflen;
+        bcf_update_info_int32(args->hdr, line, "END", args->int32_arr, 1);
+    }
+
      return ERR_OK;
  }
  
@@ -1036,6 +1046,7 @@ static void merge_info_string(args_t *args, bcf1_t **lines, int nlines, bcf_info
  }
  static void merge_format_genotype(args_t *args, bcf1_t **lines, int nlines, bcf_fmt_t *fmt, bcf1_t *dst)
  {
+    // reusing int8_t arrays as int32_t arrays
      int ntmp = args->ntmp_arr1 / 4;
      int ngts = bcf_get_genotypes(args->hdr,lines[0],&args->tmp_arr1,&ntmp);
      args->ntmp_arr1 = ntmp * 4;
@@ -1685,6 +1696,7 @@ static void destroy_data(args_t *args)
      }
      free(args->maps);
      free(args->als);
+    free(args->int32_arr);
      free(args->tmp_arr1);
      free(args->tmp_arr2);
      free(args->diploid);
@@ -1924,7 +1936,7 @@ int main_vcfnorm(int argc, char *argv[])
                  break;
              case 'o': args->output_fname = optarg; break;
              case 'D':
-                fprintf(stderr,"Warning: `-D` is functional but deprecated, replaced by `-d both`.\n"); 
+                fprintf(stderr,"Warning: `-D` is functional but deprecated, replaced by and alias of `-d none`.\n"); 
                  args->rmdup = BCF_SR_PAIR_EXACT;
                  break;
              case 's': args->strict_filter = 1; break;
@@ -1944,9 +1956,7 @@ int main_vcfnorm(int argc, char *argv[])
              default: error("Unknown argument: %s\n", optarg);
          }
      }
-    if ( argc>optind+1 ) usage();
-    if ( !args->ref_fname && !args->mrows_op && !args->rmdup ) usage();
-    if ( !args->ref_fname && args->check_ref&CHECK_REF_FIX ) error("Expected --fasta-ref with --check-ref s\n");
+
      char *fname = NULL;
      if ( optind>=argc )
      {
@@ -1955,6 +1965,9 @@ int main_vcfnorm(int argc, char *argv[])
      }
      else fname = argv[optind];
  
+    if ( !args->ref_fname && !args->mrows_op && !args->rmdup ) error("Expected -f, -m, -D or -d option\n");
+    if ( !args->ref_fname && args->check_ref&CHECK_REF_FIX ) error("Expected --fasta-ref with --check-ref s\n");
+
      if ( args->region )
      {
          if ( bcf_sr_set_regions(args->files, args->region,region_is_file)<0 )
diff --git a/bcftools/vcfnorm.c.pysam.c b/bcftools/vcfnorm.c.pysam.c
index d204de4fb1f343211e21d2f61095f6287009d18c..f92420e2ffdcc330b368d3dd88ba0b259ecd643e 100644
--- a/bcftools/vcfnorm.c.pysam.c
+++ b/bcftools/vcfnorm.c.pysam.c
@@ -75,7 +75,8 @@ typedef struct
      char **als;
      int mmaps, nals, mals;
      uint8_t *tmp_arr1, *tmp_arr2, *diploid;
-    int ntmp_arr1, ntmp_arr2;
+    int32_t *int32_arr;
+    int ntmp_arr1, ntmp_arr2, nint32_arr;
      kstring_t *tmp_str;
      kstring_t *tmp_als, tmp_als_str;
      int ntmp_als;
@@ -434,6 +435,15 @@ static int realign(args_t *args, bcf1_t *line)
      bcf_update_alleles_str(args->hdr,line,args->tmp_als_str.s);
      args->nchanged++;
  
+    // Update INFO/END if necessary
+    int new_reflen = strlen(line->d.allele[0]);
+    if ( (ori_pos!=line->pos || reflen!=new_reflen) && bcf_get_info_int32(args->hdr, line, "END", &args->int32_arr, &args->nint32_arr)==1 )
+    {
+        // bcf_update_alleles_str() messed up rlen because line->pos changed. This will be fixed by bcf_update_info_int32()
+        args->int32_arr[0] = line->pos + new_reflen;
+        bcf_update_info_int32(args->hdr, line, "END", args->int32_arr, 1);
+    }
+
      return ERR_OK;
  }
  
@@ -1038,6 +1048,7 @@ static void merge_info_string(args_t *args, bcf1_t **lines, int nlines, bcf_info
  }
  static void merge_format_genotype(args_t *args, bcf1_t **lines, int nlines, bcf_fmt_t *fmt, bcf1_t *dst)
  {
+    // reusing int8_t arrays as int32_t arrays
      int ntmp = args->ntmp_arr1 / 4;
      int ngts = bcf_get_genotypes(args->hdr,lines[0],&args->tmp_arr1,&ntmp);
      args->ntmp_arr1 = ntmp * 4;
@@ -1687,6 +1698,7 @@ static void destroy_data(args_t *args)
      }
      free(args->maps);
      free(args->als);
+    free(args->int32_arr);
      free(args->tmp_arr1);
      free(args->tmp_arr2);
      free(args->diploid);
@@ -1926,7 +1938,7 @@ int main_vcfnorm(int argc, char *argv[])
                  break;
              case 'o': args->output_fname = optarg; break;
              case 'D':
-                fprintf(bcftools_stderr,"Warning: `-D` is functional but deprecated, replaced by `-d both`.\n"); 
+                fprintf(bcftools_stderr,"Warning: `-D` is functional but deprecated, replaced by and alias of `-d none`.\n"); 
                  args->rmdup = BCF_SR_PAIR_EXACT;
                  break;
              case 's': args->strict_filter = 1; break;
@@ -1946,9 +1958,7 @@ int main_vcfnorm(int argc, char *argv[])
              default: error("Unknown argument: %s\n", optarg);
          }
      }
-    if ( argc>optind+1 ) usage();
-    if ( !args->ref_fname && !args->mrows_op && !args->rmdup ) usage();
-    if ( !args->ref_fname && args->check_ref&CHECK_REF_FIX ) error("Expected --fasta-ref with --check-ref s\n");
+
      char *fname = NULL;
      if ( optind>=argc )
      {
@@ -1957,6 +1967,9 @@ int main_vcfnorm(int argc, char *argv[])
      }
      else fname = argv[optind];
  
+    if ( !args->ref_fname && !args->mrows_op && !args->rmdup ) error("Expected -f, -m, -D or -d option\n");
+    if ( !args->ref_fname && args->check_ref&CHECK_REF_FIX ) error("Expected --fasta-ref with --check-ref s\n");
+
      if ( args->region )
      {
          if ( bcf_sr_set_regions(args->files, args->region,region_is_file)<0 )
diff --git a/bcftools/vcfquery.c b/bcftools/vcfquery.c
index 29c4a9df2c9558813df44273502f66679580c950..fefab2375dd2024453e7f2eea3e8d8ab7dfd108a 100644
--- a/bcftools/vcfquery.c
+++ b/bcftools/vcfquery.c
@@ -278,7 +278,7 @@ int main_vcfquery(int argc, char *argv[])
              case 'H': args->print_header = 1; break;
              case 'v': args->vcf_list = optarg; break;
              case 'c': 
-                error("The --collapse option is obsolete, pipe through `bcftools norm -c` instead.\n", optarg);
+                error("The --collapse option is obsolete, pipe through `bcftools norm -c` instead.\n");
                  break;
              case 'a':
                  {
diff --git a/bcftools/vcfquery.c.pysam.c b/bcftools/vcfquery.c.pysam.c
index 8186a59d970ccec91992ca307c830ffd662b55cd..04e0638f30dcfdc08183e3e2b1041567ebb03b24 100644
--- a/bcftools/vcfquery.c.pysam.c
+++ b/bcftools/vcfquery.c.pysam.c
@@ -280,7 +280,7 @@ int main_vcfquery(int argc, char *argv[])
              case 'H': args->print_header = 1; break;
              case 'v': args->vcf_list = optarg; break;
              case 'c': 
-                error("The --collapse option is obsolete, pipe through `bcftools norm -c` instead.\n", optarg);
+                error("The --collapse option is obsolete, pipe through `bcftools norm -c` instead.\n");
                  break;
              case 'a':
                  {
diff --git a/bcftools/vcfroh.c b/bcftools/vcfroh.c
index 626f97538c2ec651c0a4b8623fa7a7a56ce746cd..313daba9563a51eecdb07989c4c49ef95d02c9be 100644
--- a/bcftools/vcfroh.c
+++ b/bcftools/vcfroh.c
@@ -1,6 +1,6 @@
  /*  vcfroh.c -- HMM model for detecting runs of autozygosity.
  
-    Copyright (C) 2013-2017 Genome Research Ltd.
+    Copyright (C) 2013-2018 Genome Research Ltd.
  
      Author: Petr Danecek <pd3@sanger.ac.uk>
  
@@ -26,6 +26,7 @@ THE SOFTWARE.  */
  #include <unistd.h>
  #include <getopt.h>
  #include <math.h>
+#include <inttypes.h>
  #include <htslib/vcf.h>
  #include <htslib/synced_bcf_reader.h>
  #include <htslib/kstring.h>
@@ -35,6 +36,7 @@ THE SOFTWARE.  */
  #include "bcftools.h"
  #include "HMM.h"
  #include "smpl_ilist.h"
+#include "filter.h"
  
  #define STATE_HW 0        // normal state, follows Hardy-Weinberg allele frequencies
  #define STATE_AZ 1        // autozygous state
@@ -43,6 +45,11 @@ THE SOFTWARE.  */
  #define OUTPUT_RG (1<<2)
  #define OUTPUT_GZ (1<<3)
  
+// Logic of the filters: include or exclude sites which match the filters?
+#define FLT_INCLUDE 1
+#define FLT_EXCLUDE 2
+
+
  /** Genetic map */
  typedef struct
  {
@@ -94,6 +101,9 @@ typedef struct _args_t
      int32_t skip_rid, prev_rid, prev_pos;
  
      int ntot;                   // some stats to detect if things didn't go wrong
+    int nno_af;                 // number of sites rejected because AF could not be determined
+    int nfiltered;              // .. because of filters
+    int nnot_biallelic, ndup;
      smpl_t *smpl;               // HMM data for each sample
      smpl_ilist_t *af_smpl;      // list of samples to estimate AF from (--estimate-AF)
      smpl_ilist_t *roh_smpl;     // list of samples to analyze (--samples, --samples-file)
@@ -103,6 +113,10 @@ typedef struct _args_t
      int argc, fake_PLs, snps_only, vi_training, samples_is_file, output_type, skip_homref, n_threads;
      BGZF *out;
      kstring_t str;
+
+    int filter_logic;
+    filter_t *filter;
+    char *filter_str;
  }
  args_t;
  
@@ -125,6 +139,9 @@ static void init_data(args_t *args)
  
      if ( !bcf_hdr_nsamples(args->hdr) ) error("No samples in the VCF?\n");
  
+    if ( args->filter_str )
+        args->filter = filter_init(args->hdr, args->filter_str);
+
      if ( !args->fake_PLs )
      {
          args->pl_hdr_id = bcf_hdr_id2int(args->hdr, BCF_DT_ID, "PL");
@@ -318,6 +335,7 @@ static void init_data(args_t *args)
  
  static void destroy_data(args_t *args)
  {
+    if ( args->filter ) filter_destroy(args->filter);
      if ( bgzf_close(args->out)!=0 ) error("Error: close failed .. %s\n", args->output_fname);
      int i;
      for (i=0; i<args->roh_smpl->n; i++)
@@ -367,7 +385,7 @@ static int load_genmap(args_t *args, const char *chr)
  
      hts_getline(fp, KS_SEP_LINE, &str);
      if ( strcmp(str.s,"position COMBINED_rate(cM/Mb) Genetic_Map(cM)") )
-        error("Unexpected header, found:\n\t[%s], but expected:\n\t[position COMBINED_rate(cM/Mb) Genetic_Map(cM)]\n", fname, str.s);
+        error("Unexpected header in %s, found:\n\t[%s], but expected:\n\t[position COMBINED_rate(cM/Mb) Genetic_Map(cM)]\n", fname, str.s);
  
      args->ngenmap = args->igenmap = 0;
      while ( hts_getline(fp, KS_SEP_LINE, &str) > 0 )
@@ -731,9 +749,9 @@ int estimate_AF_from_PL(args_t *args, bcf_fmt_t *fmt_pl, int ial, double *alt_fr
                  if ( p[irr]<0 || p[ira]<0 || p[iaa]<0 ) continue;    /* missing value */ \
                  if ( p[irr]==p[ira] && p[irr]==p[iaa] ) continue;    /* all values are the same */ \
                  double prob[3], norm = 0; \
-                prob[0] = p[irr] < (type_t)256 ? args->pl2p[ p[irr] ] : args->pl2p[255]; \
-                prob[1] = p[ira] < (type_t)256 ? args->pl2p[ p[ira] ] : args->pl2p[255]; \
-                prob[2] = p[iaa] < (type_t)256 ? args->pl2p[ p[iaa] ] : args->pl2p[255]; \
+                prob[0] = p[irr] < 256 ? args->pl2p[ p[irr] ] : args->pl2p[255]; \
+                prob[1] = p[ira] < 256 ? args->pl2p[ p[ira] ] : args->pl2p[255]; \
+                prob[2] = p[iaa] < 256 ? args->pl2p[ p[iaa] ] : args->pl2p[255]; \
                  for (j=0; j<3; j++) norm += prob[j]; \
                  for (j=0; j<3; j++) prob[j] /= norm; \
                  af += 0.5*prob[1] + prob[2]; \
@@ -761,9 +779,9 @@ int estimate_AF_from_PL(args_t *args, bcf_fmt_t *fmt_pl, int ial, double *alt_fr
                  if ( p[irr]<0 || p[ira]<0 || p[iaa]<0 ) continue;    /* missing value */ \
                  if ( p[irr]==p[ira] && p[irr]==p[iaa] ) continue;    /* all values are the same */ \
                  double prob[3], norm = 0; \
-                prob[0] = p[irr] < (type_t)256 ? args->pl2p[ p[irr] ] : args->pl2p[255]; \
-                prob[1] = p[ira] < (type_t)256 ? args->pl2p[ p[ira] ] : args->pl2p[255]; \
-                prob[2] = p[iaa] < (type_t)256 ? args->pl2p[ p[iaa] ] : args->pl2p[255]; \
+                prob[0] = p[irr] < 256 ? args->pl2p[ p[irr] ] : args->pl2p[255]; \
+                prob[1] = p[ira] < 256 ? args->pl2p[ p[ira] ] : args->pl2p[255]; \
+                prob[2] = p[iaa] < 256 ? args->pl2p[ p[iaa] ] : args->pl2p[255]; \
                  for (j=0; j<3; j++) norm += prob[j]; \
                  for (j=0; j<3; j++) prob[j] /= norm; \
                  af += 0.5*prob[1] + prob[2]; \
@@ -854,8 +872,9 @@ int process_line(args_t *args, bcf1_t *line, int ial)
              alt_freq = (double) AC/AN;
      }
  
-    if ( ret<0 ) return ret;
-    if ( alt_freq==0.0 ) return -1;
+    if ( args->dflt_AF>0 && (ret<0 || alt_freq==0.0) ) alt_freq = args->dflt_AF;
+    else if ( ret<0 ) { args->nno_af++; return ret; }
+    else if ( alt_freq==0.0 ) { args->nno_af++; return -1; }
  
      int irr = bcf_alleles2gt(0,0), ira = bcf_alleles2gt(0,ial), iaa = bcf_alleles2gt(ial,ial);
      if ( args->fake_PLs )
@@ -907,9 +926,9 @@ int process_line(args_t *args, bcf1_t *line, int ial)
                  type_t *p = (type_t*)fmt_pl->p + fmt_pl->n*ismpl; \
                  if ( p[irr]<0 || p[ira]<0 || p[iaa]<0 ) continue;    /* missing value */ \
                  if ( p[irr]==p[ira] && p[irr]==p[iaa] ) continue;    /* all values are the same */ \
-                pdg[0] = p[irr] < (type_t)256 ? args->pl2p[ p[irr] ] : args->pl2p[255]; \
-                pdg[1] = p[ira] < (type_t)256 ? args->pl2p[ p[ira] ] : args->pl2p[255]; \
-                pdg[2] = p[iaa] < (type_t)256 ? args->pl2p[ p[iaa] ] : args->pl2p[255]; \
+                pdg[0] = p[irr] < 256 ? args->pl2p[ p[irr] ] : args->pl2p[255]; \
+                pdg[1] = p[ira] < 256 ? args->pl2p[ p[ira] ] : args->pl2p[255]; \
+                pdg[2] = p[iaa] < 256 ? args->pl2p[ p[iaa] ] : args->pl2p[255]; \
              }
              switch (fmt_pl->type) {
                  case BCF_BT_INT8:  BRANCH(int8_t); break;
@@ -932,7 +951,7 @@ int process_line(args_t *args, bcf1_t *line, int ial)
          {
              hts_expand(uint32_t,smpl->nsites+1,smpl->msites,smpl->sites);
              smpl->eprob = (double*) realloc(smpl->eprob,sizeof(*smpl->eprob)*smpl->msites*2);
-            if ( !smpl->eprob ) error("Error: failed to alloc %d bytes\n", sizeof(*smpl->eprob)*smpl->msites*2);
+            if ( !smpl->eprob ) error("Error: failed to alloc %"PRIu64" bytes\n", (uint64_t)(sizeof(*smpl->eprob)*smpl->msites*2));
          }
          
          // Calculate emission probabilities P(D|AZ) and P(D|HW)
@@ -970,12 +989,11 @@ static void vcfroh(args_t *args, bcf1_t *line)
          for (i=0; i<args->roh_smpl->n; i++) flush_viterbi(args, i);
          return; 
      }
-    args->ntot++;
  
      // Skip unwanted lines, for simplicity we consider only biallelic sites 
      if ( line->rid == args->skip_rid ) return;
-    if ( line->n_allele==1 ) return;    // no ALT allele
-    if ( line->n_allele > 3 ) return;   // cannot be bi-allelic, even with <*>
+    if ( line->n_allele==1 ) { args->nnot_biallelic++; return; }   // no ALT allele
+    if ( line->n_allele > 3 ) { args->nnot_biallelic++; return; }   // cannot be bi-allelic, even with <*>
  
      // This can be raw callable VCF with the symbolic unseen allele <*>
      int ial = 0;
@@ -983,7 +1001,7 @@ static void vcfroh(args_t *args, bcf1_t *line)
          if ( !strcmp("<*>",line->d.allele[i]) ) { ial = i; break; }
      if ( ial==0 )    // normal VCF, the symbolic allele is not present
      {
-        if ( line->n_allele!=2 ) return;    // not biallelic
+        if ( line->n_allele!=2 ) { args->nnot_biallelic++; return; }   // not biallelic
          ial = 1;
      }
      else
@@ -1017,7 +1035,11 @@ static void vcfroh(args_t *args, bcf1_t *line)
          args->prev_pos = line->pos;
          skip_rid = load_genmap(args, bcf_seqname(args->hdr,line));
      }
-    else if ( args->prev_pos == line->pos ) return;     // skip duplicate positions
+    else if ( args->prev_pos == line->pos ) 
+    {
+        args->ndup++;
+        return;     // skip duplicate positions
+    }
  
      if ( skip_rid )
      {
@@ -1044,14 +1066,16 @@ static void usage(args_t *args)
      fprintf(stderr, "General Options:\n");
      fprintf(stderr, "        --AF-dflt <float>              if AF is not known, use this allele frequency [skip]\n");
      fprintf(stderr, "        --AF-tag <TAG>                 use TAG for allele frequency\n");
-    fprintf(stderr, "        --AF-file <file>               read allele frequencies from file (CHR\\tPOS\\tREF,ALT\\tAF)\n");
+    fprintf(stderr, "        --AF-file <file>               read allele frequencies from file (CHR\\tPOS\\tREF\\tALT\\tAF)\n");
      fprintf(stderr, "    -b  --buffer-size <int[,int]>      buffer size and the number of overlapping sites, 0 for unlimited [0]\n");
      fprintf(stderr, "                                           If the first number is negative, it is interpreted as the maximum memory to\n");
      fprintf(stderr, "                                           use, in MB. The default overlap is set to roughly 1%% of the buffer size.\n");
      fprintf(stderr, "    -e, --estimate-AF [TAG],<file>     estimate AF from FORMAT/TAG (GT or PL) of all samples (\"-\") or samples listed\n");
      fprintf(stderr, "                                            in <file>. If TAG is not given, the frequency is estimated from GT by default\n");
+    fprintf(stderr, "        --exclude <expr>               exclude sites for which the expression is true\n");
      fprintf(stderr, "    -G, --GTs-only <float>             use GTs and ignore PLs, instead using <float> for PL of the two least likely genotypes.\n");
      fprintf(stderr, "                                           Safe value to use is 30 to account for GT errors.\n");
+    fprintf(stderr, "        --include <expr>               select sites for which the expression is true\n");
      fprintf(stderr, "    -i, --ignore-homref                skip hom-ref genotypes (0/0)\n");
      fprintf(stderr, "    -I, --skip-indels                  skip indels as their genotypes are enriched for errors\n");
      fprintf(stderr, "    -m, --genetic-map <file>           genetic map in IMPUTE2 format, single file or mask, where string \"{CHROM}\"\n");
@@ -1091,6 +1115,8 @@ int main_vcfroh(int argc, char *argv[])
          {"AF-tag",1,0,0},
          {"AF-file",1,0,1},
          {"AF-dflt",1,0,2},
+        {"include",1,0,3},
+        {"exclude",1,0,4},
          {"buffer-size",1,0,'b'},
          {"ignore-homref",0,0,'i'},
          {"estimate-AF",1,0,'e'},
@@ -1123,6 +1149,8 @@ int main_vcfroh(int argc, char *argv[])
                  args->dflt_AF = strtod(optarg,&tmp);
                  if ( *tmp ) error("Could not parse: --AF-dflt %s\n", optarg);
                  break;
+            case 3: args->filter_str = optarg; args->filter_logic = FLT_INCLUDE; break;
+            case 4: args->filter_str = optarg; args->filter_logic = FLT_EXCLUDE; break;
              case 'o': args->output_fname = optarg; break;
              case 'O': 
                  if ( strchr(optarg,'s') || strchr(optarg,'S') ) args->output_type |= OUTPUT_ST;
@@ -1180,8 +1208,8 @@ int main_vcfroh(int argc, char *argv[])
      else fname = argv[optind];
  
      if ( args->vi_training && args->buffer_size ) error("Error: cannot use -b with -V\n");
-    if ( args->t2AZ<0 || args->t2AZ>1 ) error("Error: The parameter --hw-to-az is not in [0,1]\n", args->t2AZ);
-    if ( args->t2HW<0 || args->t2HW>1 ) error("Error: The parameter --az-to-hw is not in [0,1]\n", args->t2HW);
+    if ( args->t2AZ<0 || args->t2AZ>1 ) error("Error: The parameter --hw-to-az is not in [0,1] .. %e\n", args->t2AZ);
+    if ( args->t2HW<0 || args->t2HW>1 ) error("Error: The parameter --az-to-hw is not in [0,1] .. %e\n", args->t2HW);
      if ( naf_opts>1 ) error("Error: The options --AF-tag, --AF-file and -e are mutually exclusive\n");
      if ( args->af_fname && args->targets_list ) error("Error: The options --AF-file and -t are mutually exclusive\n");
      if ( args->regions_list )
@@ -1206,16 +1234,28 @@ int main_vcfroh(int argc, char *argv[])
      init_data(args);
      while ( bcf_sr_next_line(args->files) )
      {
-        vcfroh(args, args->files->readers[0].buffer[0]);
+        args->ntot++;
+        bcf1_t *line = bcf_sr_get_line(args->files,0);
+        if ( args->filter )
+        {
+            int pass = filter_test(args->filter, line, NULL);
+            if ( args->filter_logic & FLT_EXCLUDE ) pass = pass ? 0 : 1;
+            if ( !pass ) { args->nfiltered++; continue; }
+        }
+        vcfroh(args, line);
      }
      vcfroh(args, NULL);
      int i, nmin = 0;
      for (i=0; i<args->roh_smpl->n; i++)
          if ( !i || args->smpl[i].nused < nmin ) nmin = args->smpl[i].nused;
-    fprintf(stderr,"Number of lines total/processed: %d/%d\n", args->ntot,nmin);
+    if ( args->af_fname )
+        fprintf(stderr,"Number of lines overlapping with --AF-file/processed: %d/%d\n", args->ntot,nmin);
+    else
+        fprintf(stderr,"Number of lines total/processed: %d/%d\n", args->ntot,nmin);
+    fprintf(stderr,"Number of lines filtered/no AF/not biallelic/dup: %d/%d/%d/%d\n", args->nfiltered,args->nno_af,args->nnot_biallelic,args->ndup);
      if ( nmin==0 )
      {
-        fprintf(stderr,"No usable sites were found.");
+        fprintf(stderr,"No usable sites were found.\n");
          if ( !naf_opts && !args->dflt_AF ) fprintf(stderr, " Consider using one of the AF options.\n");
      }
      destroy_data(args);
diff --git a/bcftools/vcfroh.c.pysam.c b/bcftools/vcfroh.c.pysam.c
index 77d2f4fdd801f751b118f8ffa68ec121cd7993b4..297aa6ec7e07aa4dab98849d045c75d03ce1f272 100644
--- a/bcftools/vcfroh.c.pysam.c
+++ b/bcftools/vcfroh.c.pysam.c
@@ -2,7 +2,7 @@
  
  /*  vcfroh.c -- HMM model for detecting runs of autozygosity.
  
-    Copyright (C) 2013-2017 Genome Research Ltd.
+    Copyright (C) 2013-2018 Genome Research Ltd.
  
      Author: Petr Danecek <pd3@sanger.ac.uk>
  
@@ -28,6 +28,7 @@ THE SOFTWARE.  */
  #include <unistd.h>
  #include <getopt.h>
  #include <math.h>
+#include <inttypes.h>
  #include <htslib/vcf.h>
  #include <htslib/synced_bcf_reader.h>
  #include <htslib/kstring.h>
@@ -37,6 +38,7 @@ THE SOFTWARE.  */
  #include "bcftools.h"
  #include "HMM.h"
  #include "smpl_ilist.h"
+#include "filter.h"
  
  #define STATE_HW 0        // normal state, follows Hardy-Weinberg allele frequencies
  #define STATE_AZ 1        // autozygous state
@@ -45,6 +47,11 @@ THE SOFTWARE.  */
  #define OUTPUT_RG (1<<2)
  #define OUTPUT_GZ (1<<3)
  
+// Logic of the filters: include or exclude sites which match the filters?
+#define FLT_INCLUDE 1
+#define FLT_EXCLUDE 2
+
+
  /** Genetic map */
  typedef struct
  {
@@ -96,6 +103,9 @@ typedef struct _args_t
      int32_t skip_rid, prev_rid, prev_pos;
  
      int ntot;                   // some stats to detect if things didn't go wrong
+    int nno_af;                 // number of sites rejected because AF could not be determined
+    int nfiltered;              // .. because of filters
+    int nnot_biallelic, ndup;
      smpl_t *smpl;               // HMM data for each sample
      smpl_ilist_t *af_smpl;      // list of samples to estimate AF from (--estimate-AF)
      smpl_ilist_t *roh_smpl;     // list of samples to analyze (--samples, --samples-file)
@@ -105,6 +115,10 @@ typedef struct _args_t
      int argc, fake_PLs, snps_only, vi_training, samples_is_file, output_type, skip_homref, n_threads;
      BGZF *out;
      kstring_t str;
+
+    int filter_logic;
+    filter_t *filter;
+    char *filter_str;
  }
  args_t;
  
@@ -127,6 +141,9 @@ static void init_data(args_t *args)
  
      if ( !bcf_hdr_nsamples(args->hdr) ) error("No samples in the VCF?\n");
  
+    if ( args->filter_str )
+        args->filter = filter_init(args->hdr, args->filter_str);
+
      if ( !args->fake_PLs )
      {
          args->pl_hdr_id = bcf_hdr_id2int(args->hdr, BCF_DT_ID, "PL");
@@ -320,6 +337,7 @@ static void init_data(args_t *args)
  
  static void destroy_data(args_t *args)
  {
+    if ( args->filter ) filter_destroy(args->filter);
      if ( bgzf_close(args->out)!=0 ) error("Error: close failed .. %s\n", args->output_fname);
      int i;
      for (i=0; i<args->roh_smpl->n; i++)
@@ -369,7 +387,7 @@ static int load_genmap(args_t *args, const char *chr)
  
      hts_getline(fp, KS_SEP_LINE, &str);
      if ( strcmp(str.s,"position COMBINED_rate(cM/Mb) Genetic_Map(cM)") )
-        error("Unexpected header, found:\n\t[%s], but expected:\n\t[position COMBINED_rate(cM/Mb) Genetic_Map(cM)]\n", fname, str.s);
+        error("Unexpected header in %s, found:\n\t[%s], but expected:\n\t[position COMBINED_rate(cM/Mb) Genetic_Map(cM)]\n", fname, str.s);
  
      args->ngenmap = args->igenmap = 0;
      while ( hts_getline(fp, KS_SEP_LINE, &str) > 0 )
@@ -733,9 +751,9 @@ int estimate_AF_from_PL(args_t *args, bcf_fmt_t *fmt_pl, int ial, double *alt_fr
                  if ( p[irr]<0 || p[ira]<0 || p[iaa]<0 ) continue;    /* missing value */ \
                  if ( p[irr]==p[ira] && p[irr]==p[iaa] ) continue;    /* all values are the same */ \
                  double prob[3], norm = 0; \
-                prob[0] = p[irr] < (type_t)256 ? args->pl2p[ p[irr] ] : args->pl2p[255]; \
-                prob[1] = p[ira] < (type_t)256 ? args->pl2p[ p[ira] ] : args->pl2p[255]; \
-                prob[2] = p[iaa] < (type_t)256 ? args->pl2p[ p[iaa] ] : args->pl2p[255]; \
+                prob[0] = p[irr] < 256 ? args->pl2p[ p[irr] ] : args->pl2p[255]; \
+                prob[1] = p[ira] < 256 ? args->pl2p[ p[ira] ] : args->pl2p[255]; \
+                prob[2] = p[iaa] < 256 ? args->pl2p[ p[iaa] ] : args->pl2p[255]; \
                  for (j=0; j<3; j++) norm += prob[j]; \
                  for (j=0; j<3; j++) prob[j] /= norm; \
                  af += 0.5*prob[1] + prob[2]; \
@@ -763,9 +781,9 @@ int estimate_AF_from_PL(args_t *args, bcf_fmt_t *fmt_pl, int ial, double *alt_fr
                  if ( p[irr]<0 || p[ira]<0 || p[iaa]<0 ) continue;    /* missing value */ \
                  if ( p[irr]==p[ira] && p[irr]==p[iaa] ) continue;    /* all values are the same */ \
                  double prob[3], norm = 0; \
-                prob[0] = p[irr] < (type_t)256 ? args->pl2p[ p[irr] ] : args->pl2p[255]; \
-                prob[1] = p[ira] < (type_t)256 ? args->pl2p[ p[ira] ] : args->pl2p[255]; \
-                prob[2] = p[iaa] < (type_t)256 ? args->pl2p[ p[iaa] ] : args->pl2p[255]; \
+                prob[0] = p[irr] < 256 ? args->pl2p[ p[irr] ] : args->pl2p[255]; \
+                prob[1] = p[ira] < 256 ? args->pl2p[ p[ira] ] : args->pl2p[255]; \
+                prob[2] = p[iaa] < 256 ? args->pl2p[ p[iaa] ] : args->pl2p[255]; \
                  for (j=0; j<3; j++) norm += prob[j]; \
                  for (j=0; j<3; j++) prob[j] /= norm; \
                  af += 0.5*prob[1] + prob[2]; \
@@ -856,8 +874,9 @@ int process_line(args_t *args, bcf1_t *line, int ial)
              alt_freq = (double) AC/AN;
      }
  
-    if ( ret<0 ) return ret;
-    if ( alt_freq==0.0 ) return -1;
+    if ( args->dflt_AF>0 && (ret<0 || alt_freq==0.0) ) alt_freq = args->dflt_AF;
+    else if ( ret<0 ) { args->nno_af++; return ret; }
+    else if ( alt_freq==0.0 ) { args->nno_af++; return -1; }
  
      int irr = bcf_alleles2gt(0,0), ira = bcf_alleles2gt(0,ial), iaa = bcf_alleles2gt(ial,ial);
      if ( args->fake_PLs )
@@ -909,9 +928,9 @@ int process_line(args_t *args, bcf1_t *line, int ial)
                  type_t *p = (type_t*)fmt_pl->p + fmt_pl->n*ismpl; \
                  if ( p[irr]<0 || p[ira]<0 || p[iaa]<0 ) continue;    /* missing value */ \
                  if ( p[irr]==p[ira] && p[irr]==p[iaa] ) continue;    /* all values are the same */ \
-                pdg[0] = p[irr] < (type_t)256 ? args->pl2p[ p[irr] ] : args->pl2p[255]; \
-                pdg[1] = p[ira] < (type_t)256 ? args->pl2p[ p[ira] ] : args->pl2p[255]; \
-                pdg[2] = p[iaa] < (type_t)256 ? args->pl2p[ p[iaa] ] : args->pl2p[255]; \
+                pdg[0] = p[irr] < 256 ? args->pl2p[ p[irr] ] : args->pl2p[255]; \
+                pdg[1] = p[ira] < 256 ? args->pl2p[ p[ira] ] : args->pl2p[255]; \
+                pdg[2] = p[iaa] < 256 ? args->pl2p[ p[iaa] ] : args->pl2p[255]; \
              }
              switch (fmt_pl->type) {
                  case BCF_BT_INT8:  BRANCH(int8_t); break;
@@ -934,7 +953,7 @@ int process_line(args_t *args, bcf1_t *line, int ial)
          {
              hts_expand(uint32_t,smpl->nsites+1,smpl->msites,smpl->sites);
              smpl->eprob = (double*) realloc(smpl->eprob,sizeof(*smpl->eprob)*smpl->msites*2);
-            if ( !smpl->eprob ) error("Error: failed to alloc %d bytes\n", sizeof(*smpl->eprob)*smpl->msites*2);
+            if ( !smpl->eprob ) error("Error: failed to alloc %"PRIu64" bytes\n", (uint64_t)(sizeof(*smpl->eprob)*smpl->msites*2));
          }
          
          // Calculate emission probabilities P(D|AZ) and P(D|HW)
@@ -972,12 +991,11 @@ static void vcfroh(args_t *args, bcf1_t *line)
          for (i=0; i<args->roh_smpl->n; i++) flush_viterbi(args, i);
          return; 
      }
-    args->ntot++;
  
      // Skip unwanted lines, for simplicity we consider only biallelic sites 
      if ( line->rid == args->skip_rid ) return;
-    if ( line->n_allele==1 ) return;    // no ALT allele
-    if ( line->n_allele > 3 ) return;   // cannot be bi-allelic, even with <*>
+    if ( line->n_allele==1 ) { args->nnot_biallelic++; return; }   // no ALT allele
+    if ( line->n_allele > 3 ) { args->nnot_biallelic++; return; }   // cannot be bi-allelic, even with <*>
  
      // This can be raw callable VCF with the symbolic unseen allele <*>
      int ial = 0;
@@ -985,7 +1003,7 @@ static void vcfroh(args_t *args, bcf1_t *line)
          if ( !strcmp("<*>",line->d.allele[i]) ) { ial = i; break; }
      if ( ial==0 )    // normal VCF, the symbolic allele is not present
      {
-        if ( line->n_allele!=2 ) return;    // not biallelic
+        if ( line->n_allele!=2 ) { args->nnot_biallelic++; return; }   // not biallelic
          ial = 1;
      }
      else
@@ -1019,7 +1037,11 @@ static void vcfroh(args_t *args, bcf1_t *line)
          args->prev_pos = line->pos;
          skip_rid = load_genmap(args, bcf_seqname(args->hdr,line));
      }
-    else if ( args->prev_pos == line->pos ) return;     // skip duplicate positions
+    else if ( args->prev_pos == line->pos ) 
+    {
+        args->ndup++;
+        return;     // skip duplicate positions
+    }
  
      if ( skip_rid )
      {
@@ -1046,14 +1068,16 @@ static void usage(args_t *args)
      fprintf(bcftools_stderr, "General Options:\n");
      fprintf(bcftools_stderr, "        --AF-dflt <float>              if AF is not known, use this allele frequency [skip]\n");
      fprintf(bcftools_stderr, "        --AF-tag <TAG>                 use TAG for allele frequency\n");
-    fprintf(bcftools_stderr, "        --AF-file <file>               read allele frequencies from file (CHR\\tPOS\\tREF,ALT\\tAF)\n");
+    fprintf(bcftools_stderr, "        --AF-file <file>               read allele frequencies from file (CHR\\tPOS\\tREF\\tALT\\tAF)\n");
      fprintf(bcftools_stderr, "    -b  --buffer-size <int[,int]>      buffer size and the number of overlapping sites, 0 for unlimited [0]\n");
      fprintf(bcftools_stderr, "                                           If the first number is negative, it is interpreted as the maximum memory to\n");
      fprintf(bcftools_stderr, "                                           use, in MB. The default overlap is set to roughly 1%% of the buffer size.\n");
      fprintf(bcftools_stderr, "    -e, --estimate-AF [TAG],<file>     estimate AF from FORMAT/TAG (GT or PL) of all samples (\"-\") or samples listed\n");
      fprintf(bcftools_stderr, "                                            in <file>. If TAG is not given, the frequency is estimated from GT by default\n");
+    fprintf(bcftools_stderr, "        --exclude <expr>               exclude sites for which the expression is true\n");
      fprintf(bcftools_stderr, "    -G, --GTs-only <float>             use GTs and ignore PLs, instead using <float> for PL of the two least likely genotypes.\n");
      fprintf(bcftools_stderr, "                                           Safe value to use is 30 to account for GT errors.\n");
+    fprintf(bcftools_stderr, "        --include <expr>               select sites for which the expression is true\n");
      fprintf(bcftools_stderr, "    -i, --ignore-homref                skip hom-ref genotypes (0/0)\n");
      fprintf(bcftools_stderr, "    -I, --skip-indels                  skip indels as their genotypes are enriched for errors\n");
      fprintf(bcftools_stderr, "    -m, --genetic-map <file>           genetic map in IMPUTE2 format, single file or mask, where string \"{CHROM}\"\n");
@@ -1093,6 +1117,8 @@ int main_vcfroh(int argc, char *argv[])
          {"AF-tag",1,0,0},
          {"AF-file",1,0,1},
          {"AF-dflt",1,0,2},
+        {"include",1,0,3},
+        {"exclude",1,0,4},
          {"buffer-size",1,0,'b'},
          {"ignore-homref",0,0,'i'},
          {"estimate-AF",1,0,'e'},
@@ -1125,6 +1151,8 @@ int main_vcfroh(int argc, char *argv[])
                  args->dflt_AF = strtod(optarg,&tmp);
                  if ( *tmp ) error("Could not parse: --AF-dflt %s\n", optarg);
                  break;
+            case 3: args->filter_str = optarg; args->filter_logic = FLT_INCLUDE; break;
+            case 4: args->filter_str = optarg; args->filter_logic = FLT_EXCLUDE; break;
              case 'o': args->output_fname = optarg; break;
              case 'O': 
                  if ( strchr(optarg,'s') || strchr(optarg,'S') ) args->output_type |= OUTPUT_ST;
@@ -1182,8 +1210,8 @@ int main_vcfroh(int argc, char *argv[])
      else fname = argv[optind];
  
      if ( args->vi_training && args->buffer_size ) error("Error: cannot use -b with -V\n");
-    if ( args->t2AZ<0 || args->t2AZ>1 ) error("Error: The parameter --hw-to-az is not in [0,1]\n", args->t2AZ);
-    if ( args->t2HW<0 || args->t2HW>1 ) error("Error: The parameter --az-to-hw is not in [0,1]\n", args->t2HW);
+    if ( args->t2AZ<0 || args->t2AZ>1 ) error("Error: The parameter --hw-to-az is not in [0,1] .. %e\n", args->t2AZ);
+    if ( args->t2HW<0 || args->t2HW>1 ) error("Error: The parameter --az-to-hw is not in [0,1] .. %e\n", args->t2HW);
      if ( naf_opts>1 ) error("Error: The options --AF-tag, --AF-file and -e are mutually exclusive\n");
      if ( args->af_fname && args->targets_list ) error("Error: The options --AF-file and -t are mutually exclusive\n");
      if ( args->regions_list )
@@ -1208,16 +1236,28 @@ int main_vcfroh(int argc, char *argv[])
      init_data(args);
      while ( bcf_sr_next_line(args->files) )
      {
-        vcfroh(args, args->files->readers[0].buffer[0]);
+        args->ntot++;
+        bcf1_t *line = bcf_sr_get_line(args->files,0);
+        if ( args->filter )
+        {
+            int pass = filter_test(args->filter, line, NULL);
+            if ( args->filter_logic & FLT_EXCLUDE ) pass = pass ? 0 : 1;
+            if ( !pass ) { args->nfiltered++; continue; }
+        }
+        vcfroh(args, line);
      }
      vcfroh(args, NULL);
      int i, nmin = 0;
      for (i=0; i<args->roh_smpl->n; i++)
          if ( !i || args->smpl[i].nused < nmin ) nmin = args->smpl[i].nused;
-    fprintf(bcftools_stderr,"Number of lines total/processed: %d/%d\n", args->ntot,nmin);
+    if ( args->af_fname )
+        fprintf(bcftools_stderr,"Number of lines overlapping with --AF-file/processed: %d/%d\n", args->ntot,nmin);
+    else
+        fprintf(bcftools_stderr,"Number of lines total/processed: %d/%d\n", args->ntot,nmin);
+    fprintf(bcftools_stderr,"Number of lines filtered/no AF/not biallelic/dup: %d/%d/%d/%d\n", args->nfiltered,args->nno_af,args->nnot_biallelic,args->ndup);
      if ( nmin==0 )
      {
-        fprintf(bcftools_stderr,"No usable sites were found.");
+        fprintf(bcftools_stderr,"No usable sites were found.\n");
          if ( !naf_opts && !args->dflt_AF ) fprintf(bcftools_stderr, " Consider using one of the AF options.\n");
      }
      destroy_data(args);
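Several hunks above drop the `(type_t)` cast from the `p[...] < 256` bound check. The likely motivation is that when `type_t` is `int8_t`, the value 256 cannot be represented. On common compilers the cast yields 0, so the old test was never true for a non-negative PL and the lookup always fell through to `pl2p[255]`. The sketch below shows the bound check done in `int`. The function names are hypothetical and the table contents are left to the caller.

```c
#include <stdint.h>

/* Look up the probability for a phred-scaled likelihood, clamping large
   values to the last table entry. The comparison is done in int, so it does
   not depend on the width of the type the PL was stored in.
   Negative (missing) values must be rejected by the caller beforehand. */
static double pl_lookup(const double pl2p[256], int pl)
{
    return pl < 256 ? pl2p[pl] : pl2p[255];
}

/* Thin wrappers mirroring the per-type branches in the original macro. */
double pl_lookup_int8(const double pl2p[256], int8_t pl)   { return pl_lookup(pl2p, pl); }
double pl_lookup_int32(const double pl2p[256], int32_t pl) { return pl_lookup(pl2p, pl); }
```

An 8-bit PL is promoted to `int` before the comparison, so it always indexes the table directly. Only wider types can reach the clamped branch.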
diff --git a/bcftools/vcfsom.c b/bcftools/vcfsom.c
index 434dc16115f7bebade90fefef85ad5b453aca885..9d9598fd72f38d9b34b08945ab7c73bbd94191cc 100644
--- a/bcftools/vcfsom.c
+++ b/bcftools/vcfsom.c
@@ -351,9 +351,9 @@ static som_t *som_init(args_t *args)
      som->bmu_th = args->bmu_th;
      som->size   = pow(som->nbin,som->ndim);
      som->w = (double*) malloc(sizeof(double)*som->size*som->kdim);
-    if ( !som->w ) error("Could not alloc %d bytes [nbin=%d ndim=%d kdim=%d]\n", sizeof(double)*som->size*som->kdim,som->nbin,som->ndim,som->kdim);
+    if ( !som->w ) error("Could not alloc %"PRIu64" bytes [nbin=%d ndim=%d kdim=%d]\n", (uint64_t)(sizeof(double)*som->size*som->kdim),som->nbin,som->ndim,som->kdim);
      som->c = (double*) calloc(som->size,sizeof(double));
-    if ( !som->w ) error("Could not alloc %d bytes [nbin=%d ndim=%d]\n", sizeof(double)*som->size,som->nbin,som->ndim);
+    if ( !som->w ) error("Could not alloc %"PRIu64" bytes [nbin=%d ndim=%d]\n", (uint64_t)(sizeof(double)*som->size),som->nbin,som->ndim);
      int i;
      for (i=0; i<som->size*som->kdim; i++)
          som->w[i] = (double)random()/RAND_MAX;
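The hunk above replaces `%d` with `PRIu64` plus an explicit cast when printing a `sizeof`-based byte count. A small sketch of the same pattern follows. `format_alloc_error` is a hypothetical helper that writes into a caller-supplied buffer, since the `error()` routine itself is not shown here.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: write an allocation-failure message into buf.
   size_t has no fixed width, so the value is cast to uint64_t and printed
   with the matching PRIu64 macro rather than "%d". */
int format_alloc_error(char *buf, size_t buflen, size_t nbytes)
{
    return snprintf(buf, buflen, "Could not alloc %" PRIu64 " bytes", (uint64_t)nbytes);
}
```

`%zu` would be the C99 alternative; the cast form may have been chosen for portability to older C libraries, though the patch does not say. Separately, the second check in this hunk still tests `som->w` after allocating `som->c`, so a failed `calloc` would not be caught by it.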
diff --git a/bcftools/vcfsom.c.pysam.c b/bcftools/vcfsom.c.pysam.c
index 092d37a580c95414c9499bfdc5faf975ab048dfe..6fc9f6add5498794b4e1171714f4db8a2cb4ce8c 100644
--- a/bcftools/vcfsom.c.pysam.c
+++ b/bcftools/vcfsom.c.pysam.c
@@ -353,9 +353,9 @@ static som_t *som_init(args_t *args)
      som->bmu_th = args->bmu_th;
      som->size   = pow(som->nbin,som->ndim);
      som->w = (double*) malloc(sizeof(double)*som->size*som->kdim);
-    if ( !som->w ) error("Could not alloc %d bytes [nbin=%d ndim=%d kdim=%d]\n", sizeof(double)*som->size*som->kdim,som->nbin,som->ndim,som->kdim);
+    if ( !som->w ) error("Could not alloc %"PRIu64" bytes [nbin=%d ndim=%d kdim=%d]\n", (uint64_t)(sizeof(double)*som->size*som->kdim),som->nbin,som->ndim,som->kdim);
      som->c = (double*) calloc(som->size,sizeof(double));
-    if ( !som->w ) error("Could not alloc %d bytes [nbin=%d ndim=%d]\n", sizeof(double)*som->size,som->nbin,som->ndim);
+    if ( !som->w ) error("Could not alloc %"PRIu64" bytes [nbin=%d ndim=%d]\n", (uint64_t)(sizeof(double)*som->size),som->nbin,som->ndim);
      int i;
      for (i=0; i<som->size*som->kdim; i++)
          som->w[i] = (double)random()/RAND_MAX;
diff --git a/bcftools/vcfsort.c b/bcftools/vcfsort.c
index e41b628c9689dcc207f0078677e0823ccf654a10..9bf151d63a66de80ea1be390fac96697bc74167c 100644
--- a/bcftools/vcfsort.c
+++ b/bcftools/vcfsort.c
@@ -67,6 +67,21 @@ int cmp_bcf_pos(const void *aptr, const void *bptr)
      if ( a->rid > b->rid ) return 1;
      if ( a->pos < b->pos ) return -1;
      if ( a->pos > b->pos ) return 1;
+
+    // Sort the same chr:pos records lexicographically by ref,alt.
+    // This will be called rarely so should not slow the sorting down
+    // noticeably.
+
+    if ( !a->unpacked ) bcf_unpack(a, BCF_UN_STR);
+    if ( !b->unpacked ) bcf_unpack(b, BCF_UN_STR);
+    int i;
+    for (i=0; i<a->n_allele; i++)
+    { 
+        if ( i >= b->n_allele ) return 1;
+        int ret = strcasecmp(a->d.allele[i],b->d.allele[i]);
+        if ( ret ) return ret;
+    }
+    if ( a->n_allele < b->n_allele ) return -1;
      return 0;
  }
  
@@ -138,9 +153,8 @@ static inline int blk_is_smaller(blk_t **aptr, blk_t **bptr)
  {
      blk_t *a = *aptr;
      blk_t *b = *bptr;
-    if ( a->rec->rid < b->rec->rid ) return 1;
-    if ( a->rec->rid > b->rec->rid ) return 0;
-    if ( a->rec->pos < b->rec->pos ) return 1;
+    int ret = cmp_bcf_pos(&a->rec, &b->rec);
+    if ( ret < 0 ) return 1;
      return 0;
  }
  KHEAP_INIT(blk, blk_t*, blk_is_smaller)
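The two hunks above extend the sort comparator with an allele tie-break and make the heap predicate reuse it. The sketch below reproduces that ordering on a simplified, hypothetical `rec_t` instead of `bcf1_t`. It takes records by pointer, whereas the real `cmp_bcf_pos` receives pointers to record pointers.

```c
#include <strings.h>   /* strcasecmp (POSIX) */

/* Hypothetical, simplified record; the real code works on bcf1_t. */
typedef struct
{
    int rid, pos;          /* chromosome index and position */
    int n_allele;          /* number of alleles, REF first */
    const char **allele;   /* allele strings */
}
rec_t;

/* Order by chromosome, position, then alleles (case-insensitive).
   A record whose allele list is a prefix of the other sorts first. */
int cmp_rec(const rec_t *a, const rec_t *b)
{
    if ( a->rid < b->rid ) return -1;
    if ( a->rid > b->rid ) return 1;
    if ( a->pos < b->pos ) return -1;
    if ( a->pos > b->pos ) return 1;
    int i;
    for (i=0; i<a->n_allele; i++)
    {
        if ( i >= b->n_allele ) return 1;
        int ret = strcasecmp(a->allele[i], b->allele[i]);
        if ( ret ) return ret;
    }
    if ( a->n_allele < b->n_allele ) return -1;
    return 0;
}

/* Heap predicate in the style of blk_is_smaller: 1 if a sorts before b. */
int rec_is_smaller(const rec_t *a, const rec_t *b)
{
    return cmp_rec(a, b) < 0 ? 1 : 0;
}
```

Sharing one comparator between the in-memory sort and the merge heap keeps records at the same position in the same relative order in both stages. The previous predicate compared only `rid` and `pos`.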
diff --git a/bcftools/vcfsort.c.pysam.c b/bcftools/vcfsort.c.pysam.c
index 4a0325fcc2aaf4e052deb8a9f990cd6446de6d88..9d92048486415f0ef06d5bcc02ebb731bc366157 100644
--- a/bcftools/vcfsort.c.pysam.c
+++ b/bcftools/vcfsort.c.pysam.c
@@ -69,6 +69,21 @@ int cmp_bcf_pos(const void *aptr, const void *bptr)
      if ( a->rid > b->rid ) return 1;
      if ( a->pos < b->pos ) return -1;
      if ( a->pos > b->pos ) return 1;
+
+    // Sort the same chr:pos records lexicographically by ref,alt.
+    // This will be called rarely so should not slow the sorting down
+    // noticeably.
+
+    if ( !a->unpacked ) bcf_unpack(a, BCF_UN_STR);
+    if ( !b->unpacked ) bcf_unpack(b, BCF_UN_STR);
+    int i;
+    for (i=0; i<a->n_allele; i++)
+    { 
+        if ( i >= b->n_allele ) return 1;
+        int ret = strcasecmp(a->d.allele[i],b->d.allele[i]);
+        if ( ret ) return ret;
+    }
+    if ( a->n_allele < b->n_allele ) return -1;
      return 0;
  }
  
@@ -140,9 +155,8 @@ static inline int blk_is_smaller(blk_t **aptr, blk_t **bptr)
  {
      blk_t *a = *aptr;
      blk_t *b = *bptr;
-    if ( a->rec->rid < b->rec->rid ) return 1;
-    if ( a->rec->rid > b->rec->rid ) return 0;
-    if ( a->rec->pos < b->rec->pos ) return 1;
+    int ret = cmp_bcf_pos(&a->rec, &b->rec);
+    if ( ret < 0 ) return 1;
      return 0;
  }
  KHEAP_INIT(blk, blk_t*, blk_is_smaller)
diff --git a/bcftools/vcfstats.c b/bcftools/vcfstats.c
index 4f7765e39fdb4078badd881ffc1dd7de3517ded0..c59e39a5654494d5edc5eac15ac4b31a7a4b3bca 100644
--- a/bcftools/vcfstats.c
+++ b/bcftools/vcfstats.c
@@ -392,7 +392,7 @@ static void init_user_stats(args_t *args, bcf_hdr_t *hdr, stats_t *stats)
          int id = bcf_hdr_id2int(hdr,BCF_DT_ID,usr->tag);
          if ( !bcf_hdr_idinfo_exists(hdr,BCF_HL_INFO,id) ) error("The INFO tag \"%s\" is not defined in the header\n", usr->tag);
          usr->type = bcf_hdr_id2type(hdr,BCF_HL_INFO,id);
-        if ( usr->type!=BCF_HT_REAL && usr->type!=BCF_HT_INT ) error("The INFO tag \"%s\" is not of Float or Integer type (%d)\n", usr->type);
+        if ( usr->type!=BCF_HT_REAL && usr->type!=BCF_HT_INT ) error("The INFO tag \"%s\" is not of Float or Integer type (%d)\n", usr->tag, usr->type);
      }
  }
  static void init_stats(args_t *args)
diff --git a/bcftools/vcfstats.c.pysam.c b/bcftools/vcfstats.c.pysam.c
index 46a051f63f91a19df6f4e918d371be65dab0f58c..8591d0fadfa9e8316cf3a041370e82bd3c77a28e 100644
--- a/bcftools/vcfstats.c.pysam.c
+++ b/bcftools/vcfstats.c.pysam.c
@@ -394,7 +394,7 @@ static void init_user_stats(args_t *args, bcf_hdr_t *hdr, stats_t *stats)
          int id = bcf_hdr_id2int(hdr,BCF_DT_ID,usr->tag);
          if ( !bcf_hdr_idinfo_exists(hdr,BCF_HL_INFO,id) ) error("The INFO tag \"%s\" is not defined in the header\n", usr->tag);
          usr->type = bcf_hdr_id2type(hdr,BCF_HL_INFO,id);
-        if ( usr->type!=BCF_HT_REAL && usr->type!=BCF_HT_INT ) error("The INFO tag \"%s\" is not of Float or Integer type (%d)\n", usr->type);
+        if ( usr->type!=BCF_HT_REAL && usr->type!=BCF_HT_INT ) error("The INFO tag \"%s\" is not of Float or Integer type (%d)\n", usr->tag, usr->type);
      }
  }
  static void init_stats(args_t *args)
diff --git a/bcftools/vcfview.c b/bcftools/vcfview.c
index 5d523f164c5571da709cc9053bb330efbfb542f0..412b6830e89c7aac778de5293845d5c2c1a540f5 100644
--- a/bcftools/vcfview.c
+++ b/bcftools/vcfview.c
@@ -589,7 +589,7 @@ int main_vcfview(int argc, char *argv[])
      char *tmp;
      while ((c = getopt_long(argc, argv, "l:t:T:r:R:o:O:s:S:Gf:knv:V:m:M:auUhHc:C:Ii:e:xXpPq:Q:g:",loptions,NULL)) >= 0)
      {
-        char allele_type[8] = "nref";
+        char allele_type[9] = "nref";
          switch (c)
          {
              case 'O':
@@ -641,7 +641,7 @@ int main_vcfview(int argc, char *argv[])
              case 'c':
              {
                  args->min_ac_type = ALLELE_NONREF;
-                if ( sscanf(optarg,"%d:%s",&args->min_ac, allele_type)!=2 && sscanf(optarg,"%d",&args->min_ac)!=1 )
+                if ( sscanf(optarg,"%d:%8s",&args->min_ac, allele_type)!=2 && sscanf(optarg,"%d",&args->min_ac)!=1 )
                      error("Error: Could not parse --min-ac %s\n", optarg);
                  set_allele_type(&args->min_ac_type, allele_type);
                  args->calc_ac = 1;
@@ -650,7 +650,7 @@ int main_vcfview(int argc, char *argv[])
              case 'C':
              {
                  args->max_ac_type = ALLELE_NONREF;
-                if ( sscanf(optarg,"%d:%s",&args->max_ac, allele_type)!=2 && sscanf(optarg,"%d",&args->max_ac)!=1 )
+                if ( sscanf(optarg,"%d:%8s",&args->max_ac, allele_type)!=2 && sscanf(optarg,"%d",&args->max_ac)!=1 )
                      error("Error: Could not parse --max-ac %s\n", optarg);
                  set_allele_type(&args->max_ac_type, allele_type);
                  args->calc_ac = 1;
@@ -659,8 +659,8 @@ int main_vcfview(int argc, char *argv[])
              case 'q':
              {
                  args->min_af_type = ALLELE_NONREF;
-                if ( sscanf(optarg,"%f:%s",&args->min_af, allele_type)!=2 && sscanf(optarg,"%f",&args->min_af)!=1 )
-                    error("Error: Could not parse --min_af %s\n", optarg);
+                if ( sscanf(optarg,"%f:%8s",&args->min_af, allele_type)!=2 && sscanf(optarg,"%f",&args->min_af)!=1 )
+                    error("Error: Could not parse --min-af %s\n", optarg);
                  set_allele_type(&args->min_af_type, allele_type);
                  args->calc_ac = 1;
                  break;
@@ -668,8 +668,8 @@ int main_vcfview(int argc, char *argv[])
              case 'Q':
              {
                  args->max_af_type = ALLELE_NONREF;
-                if ( sscanf(optarg,"%f:%s",&args->max_af, allele_type)!=2 && sscanf(optarg,"%f",&args->max_af)!=1 )
-                    error("Error: Could not parse --min_af %s\n", optarg);
+                if ( sscanf(optarg,"%f:%8s",&args->max_af, allele_type)!=2 && sscanf(optarg,"%f",&args->max_af)!=1 )
+                    error("Error: Could not parse --max-af %s\n", optarg);
                  set_allele_type(&args->max_af_type, allele_type);
                  args->calc_ac = 1;
                  break;
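The vcfview hunks above bound the `%s` conversion to 8 characters and grow the buffer to 9 bytes, so that the longest accepted type string plus its terminator fits. A minimal sketch of the same parsing pattern follows. `parse_count_type` is a hypothetical helper that folds the two `sscanf` calls of the original into one.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: parse "<int>[:<type>]".
   Returns 2 if both fields were read, 1 if only the integer was read,
   0 if the argument could not be parsed. `type` defaults to "nref". */
int parse_count_type(const char *arg, int *count, char type[9])
{
    strcpy(type, "nref");
    /* %8s writes at most 8 characters plus the terminating NUL,
       which is exactly the 9 bytes available in `type`. */
    int n = sscanf(arg, "%d:%8s", count, type);
    if ( n == 2 ) return 2;
    if ( n == 1 ) return 1;
    return 0;
}
```

With an unbounded `%s`, a longer suffix on the command line would write past the end of the stack buffer. With the width in place the excess is simply left unread.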
diff --git a/bcftools/vcfview.c.pysam.c b/bcftools/vcfview.c.pysam.c
index 4de3ac4529566ecf396f4376c8850c18514f81b4..1a7866aecdc3bac5c58a3b4306efafe64250d39f 100644
--- a/bcftools/vcfview.c.pysam.c
+++ b/bcftools/vcfview.c.pysam.c
@@ -591,7 +591,7 @@ int main_vcfview(int argc, char *argv[])
      char *tmp;
      while ((c = getopt_long(argc, argv, "l:t:T:r:R:o:O:s:S:Gf:knv:V:m:M:auUhHc:C:Ii:e:xXpPq:Q:g:",loptions,NULL)) >= 0)
      {
-        char allele_type[8] = "nref";
+        char allele_type[9] = "nref";
          switch (c)
          {
              case 'O':
@@ -643,7 +643,7 @@ int main_vcfview(int argc, char *argv[])
              case 'c':
              {
                  args->min_ac_type = ALLELE_NONREF;
-                if ( sscanf(optarg,"%d:%s",&args->min_ac, allele_type)!=2 && sscanf(optarg,"%d",&args->min_ac)!=1 )
+                if ( sscanf(optarg,"%d:%8s",&args->min_ac, allele_type)!=2 && sscanf(optarg,"%d",&args->min_ac)!=1 )
                      error("Error: Could not parse --min-ac %s\n", optarg);
                  set_allele_type(&args->min_ac_type, allele_type);
                  args->calc_ac = 1;
@@ -652,7 +652,7 @@ int main_vcfview(int argc, char *argv[])
              case 'C':
              {
                  args->max_ac_type = ALLELE_NONREF;
-                if ( sscanf(optarg,"%d:%s",&args->max_ac, allele_type)!=2 && sscanf(optarg,"%d",&args->max_ac)!=1 )
+                if ( sscanf(optarg,"%d:%8s",&args->max_ac, allele_type)!=2 && sscanf(optarg,"%d",&args->max_ac)!=1 )
                      error("Error: Could not parse --max-ac %s\n", optarg);
                  set_allele_type(&args->max_ac_type, allele_type);
                  args->calc_ac = 1;
@@ -661,8 +661,8 @@ int main_vcfview(int argc, char *argv[])
              case 'q':
              {
                  args->min_af_type = ALLELE_NONREF;
-                if ( sscanf(optarg,"%f:%s",&args->min_af, allele_type)!=2 && sscanf(optarg,"%f",&args->min_af)!=1 )
-                    error("Error: Could not parse --min_af %s\n", optarg);
+                if ( sscanf(optarg,"%f:%8s",&args->min_af, allele_type)!=2 && sscanf(optarg,"%f",&args->min_af)!=1 )
+                    error("Error: Could not parse --min-af %s\n", optarg);
                  set_allele_type(&args->min_af_type, allele_type);
                  args->calc_ac = 1;
                  break;
@@ -670,8 +670,8 @@ int main_vcfview(int argc, char *argv[])
              case 'Q':
              {
                  args->max_af_type = ALLELE_NONREF;
-                if ( sscanf(optarg,"%f:%s",&args->max_af, allele_type)!=2 && sscanf(optarg,"%f",&args->max_af)!=1 )
-                    error("Error: Could not parse --min_af %s\n", optarg);
+                if ( sscanf(optarg,"%f:%8s",&args->max_af, allele_type)!=2 && sscanf(optarg,"%f",&args->max_af)!=1 )
+                    error("Error: Could not parse --max-af %s\n", optarg);
                  set_allele_type(&args->max_af_type, allele_type);
                  args->calc_ac = 1;
                  break;
diff --git a/bcftools/vcmp.c b/bcftools/vcmp.c
index 8d04b8942da8e03e7e41f460ce898af0fdede70c..7d3b0f9ea81518217168b97709e3c8b0db513c3d 100644
--- a/bcftools/vcmp.c
+++ b/bcftools/vcmp.c
@@ -26,6 +26,7 @@ THE SOFTWARE.  */
  #include <string.h>
  #include <stdlib.h>
  #include <htslib/hts.h>
+#include <htslib/vcf.h>
  #include <ctype.h>
  #include "vcmp.h"
  
@@ -34,7 +35,8 @@ struct _vcmp_t
      char *dref;
      int ndref, mdref;   // ndref: positive when ref1 longer, negative when ref2 is longer
      int nmatch;
-    int *map, mmap;
+    int *map, mmap, nmap;
+    int *map_dip, mmap_dip, nmap_dip;
  };
  
  vcmp_t *vcmp_init()
@@ -44,6 +46,7 @@ vcmp_t *vcmp_init()
  
  void vcmp_destroy(vcmp_t *vcmp)
  {
+    free(vcmp->map_dip);
      free(vcmp->map);
      free(vcmp->dref);
      free(vcmp);
@@ -120,7 +123,8 @@ int *vcmp_map_ARvalues(vcmp_t *vcmp, int n, int nals1, char **als1, int nals2, c
  {
      if ( vcmp_set_ref(vcmp,als1[0],als2[0]) < 0 ) return NULL;
  
-    vcmp->map = (int*) realloc(vcmp->map,sizeof(int)*n);
+    vcmp->nmap = n;
+    hts_expand(int, vcmp->nmap, vcmp->mmap, vcmp->map);
  
      int i, ifrom = n==nals2 ? 0 : 1;
      for (i=ifrom; i<nals2; i++)
@@ -130,3 +134,22 @@ int *vcmp_map_ARvalues(vcmp_t *vcmp, int n, int nals1, char **als1, int nals2, c
      return vcmp->map;
  }
  
+int *vcmp_map_dipGvalues(vcmp_t *vcmp, int *nmap)
+{
+    vcmp->nmap_dip = vcmp->nmap*(vcmp->nmap+1)/2;
+    hts_expand(int, vcmp->nmap_dip, vcmp->mmap_dip, vcmp->map_dip);
+
+    int i, j, k = 0;
+    for (i=0; i<vcmp->nmap; i++)
+    {
+        for (j=0; j<=i; j++)
+        {
+            vcmp->map_dip[k] = vcmp->map[i]>=0 && vcmp->map[j]>=0 ? bcf_alleles2gt(vcmp->map[i],vcmp->map[j]) : -1;
+            k++;
+        }
+    }
+    *nmap = k;
+    return vcmp->map_dip;
+}
+
+
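The new `vcmp_map_dipGvalues` expands the per-allele map into a per-genotype map for diploid Number=G fields. The sketch below shows the same expansion without htslib. `alleles2gt` is a local reimplementation of what the `bcf_alleles2gt` macro computes, as far as I recall its definition, and `map_diploid` is a hypothetical name.

```c
/* Index of the unordered diploid genotype a/b in VCF genotype ordering. */
static int alleles2gt(int a, int b)
{
    return a > b ? a*(a+1)/2 + b : b*(b+1)/2 + a;
}

/* Hypothetical helper: expand a per-allele map of length n (-1 = allele has
   no counterpart) into a per-genotype map of length n*(n+1)/2, which the
   caller must have allocated. Returns the number of entries written. */
int map_diploid(const int *map, int n, int *out)
{
    int i, j, k = 0;
    for (i=0; i<n; i++)
    {
        for (j=0; j<=i; j++)
        {
            out[k] = map[i]>=0 && map[j]>=0 ? alleles2gt(map[i], map[j]) : -1;
            k++;
        }
    }
    return k;
}
```

For example, with the allele map {0, 2, 1} (the two ALT alleles swapped) the genotype map comes out as {0, 3, 5, 1, 4, 2}. Any genotype involving an unmatched allele maps to -1.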
diff --git a/bcftools/vcmp.c.pysam.c b/bcftools/vcmp.c.pysam.c
index 80b242048641d3bff849b0ccfde2735cf9660022..00435bd744cf6b8892dfcd88f97f8f192930f6a0 100644
--- a/bcftools/vcmp.c.pysam.c
+++ b/bcftools/vcmp.c.pysam.c
@@ -28,6 +28,7 @@ THE SOFTWARE.  */
  #include <string.h>
  #include <stdlib.h>
  #include <htslib/hts.h>
+#include <htslib/vcf.h>
  #include <ctype.h>
  #include "vcmp.h"
  
@@ -36,7 +37,8 @@ struct _vcmp_t
      char *dref;
      int ndref, mdref;   // ndref: positive when ref1 longer, negative when ref2 is longer
      int nmatch;
-    int *map, mmap;
+    int *map, mmap, nmap;
+    int *map_dip, mmap_dip, nmap_dip;
  };
  
  vcmp_t *vcmp_init()
@@ -46,6 +48,7 @@ vcmp_t *vcmp_init()
  
  void vcmp_destroy(vcmp_t *vcmp)
  {
+    free(vcmp->map_dip);
      free(vcmp->map);
      free(vcmp->dref);
      free(vcmp);
@@ -122,7 +125,8 @@ int *vcmp_map_ARvalues(vcmp_t *vcmp, int n, int nals1, char **als1, int nals2, c
  {
      if ( vcmp_set_ref(vcmp,als1[0],als2[0]) < 0 ) return NULL;
  
-    vcmp->map = (int*) realloc(vcmp->map,sizeof(int)*n);
+    vcmp->nmap = n;
+    hts_expand(int, vcmp->nmap, vcmp->mmap, vcmp->map);
  
      int i, ifrom = n==nals2 ? 0 : 1;
      for (i=ifrom; i<nals2; i++)
@@ -132,3 +136,22 @@ int *vcmp_map_ARvalues(vcmp_t *vcmp, int n, int nals1, char **als1, int nals2, c
      return vcmp->map;
  }
  
+int *vcmp_map_dipGvalues(vcmp_t *vcmp, int *nmap)
+{
+    vcmp->nmap_dip = vcmp->nmap*(vcmp->nmap+1)/2;
+    hts_expand(int, vcmp->nmap_dip, vcmp->mmap_dip, vcmp->map_dip);
+
+    int i, j, k = 0;
+    for (i=0; i<vcmp->nmap; i++)
+    {
+        for (j=0; j<=i; j++)
+        {
+            vcmp->map_dip[k] = vcmp->map[i]>=0 && vcmp->map[j]>=0 ? bcf_alleles2gt(vcmp->map[i],vcmp->map[j]) : -1;
+            k++;
+        }
+    }
+    *nmap = k;
+    return vcmp->map_dip;
+}
+
+
diff --git a/bcftools/vcmp.h b/bcftools/vcmp.h
index 031770404abb0228d7b1a8bad065ea61a491357e..9c6370ce206e98e28af7b42e3a668988b74a6ae9 100644
--- a/bcftools/vcmp.h
+++ b/bcftools/vcmp.h
@@ -58,5 +58,6 @@ int vcmp_find_allele(vcmp_t *vcmp, char **als1, int nals1, char *al2);
   */
  int *vcmp_map_ARvalues(vcmp_t *vcmp, int number, int nals1, char **als1, int nals2, char **als2);
  
+int *vcmp_map_dipGvalues(vcmp_t *vcmp, int *nmap);
  
  #endif
diff --git a/bcftools/version.h b/bcftools/version.h
index 29d20170f95e8477bdfec44e6e5683447bc2725d..4da990f34b9f38e96a80f6ca49173ce381e60e8c 100644
--- a/bcftools/version.h
+++ b/bcftools/version.h
@@ -1 +1 @@
-#define BCFTOOLS_VERSION "1.7"
+#define BCFTOOLS_VERSION "1.9"
diff --git a/benchmark/cython_flagstat.py b/benchmark/cython_flagstat.py

deleted file mode 100644 (file)

index 6a9b7df..0000000
--- a/benchmark/cython_flagstat.py
+++ /dev/null
@@ -1,23 +0,0 @@
-"""compute number of reads/alignments from BAM file
-===================================================
-
-This is a benchmarking utility script with limited functionality.
-
-Compute simple flag stats on a BAM-file using
-the pysam cython interface.
-
-"""
-
-import sys
-import pysam
-import pyximport
-pyximport.install()
-import _cython_flagstat
-
-assert len(sys.argv) == 2, "USAGE: {} filename.bam".format(sys.argv[0])
-
-is_paired, is_proper = _cython_flagstat.count(
-    pysam.AlignmentFile(sys.argv[1], "rb"))
-
-print ("there are alignments of %i paired reads" % is_paired)
-print ("there are %i proper paired alignments" % is_proper)
diff --git a/benchmark/python_flagstat.py b/benchmark/python_flagstat.py

deleted file mode 100644 (file)

index 0a14d80..0000000
--- a/benchmark/python_flagstat.py
+++ /dev/null
@@ -1,23 +0,0 @@
-"""compute number of reads/alignments from BAM file
-===================================================
-
-This is a benchmarking utility script with limited functionality.
-
-Compute simple flag stats on a BAM-file using
-the pysam python interface.
-"""
-
-import sys
-import pysam
-
-assert len(sys.argv) == 2, "USAGE: {} filename.bam".format(sys.argv[0])
-
-is_paired = 0
-is_proper = 0
-
-for read in pysam.AlignmentFile(sys.argv[1], "rb"):
-    is_paired += read.is_paired
-    is_proper += read.is_proper_pair
-
-print ("there are alignments of %i paired reads" % is_paired)
-print ("there are %i proper paired alignments" % is_proper)
diff --git a/buildwheels.sh b/buildwheels.sh

deleted file mode 100755 (executable)

index ae0d953..0000000
--- a/buildwheels.sh
+++ /dev/null
@@ -1,69 +0,0 @@
-#!/bin/bash
-#
-# Build manylinux1 wheels for pysam. Based on the example at
-# <https://github.com/pypa/python-manylinux-demo>
-#
-# It is best to run this in a fresh clone of the repository!
-#
-# Run this within the repository root:
-#   docker run --rm -v $(pwd):/io quay.io/pypa/manylinux1_x86_64 /io/buildwheels.sh
-#
-# The wheels will be put into the wheelhouse/ subdirectory.
-#
-# For interactive tests:
-#   docker run -it -v $(pwd):/io quay.io/pypa/manylinux1_x86_64 /bin/bash
-
-set -xeuo pipefail
-
-# For convenience, if this script is called from outside of a docker container,
-# it starts a container and runs itself inside of it.
-if ! grep -q docker /proc/1/cgroup; then
-  # We are not inside a container
-  exec docker run --rm -v $(pwd):/io quay.io/pypa/manylinux1_x86_64 /io/$0
-fi
-
-yum install -y zlib-devel bzip2-devel xz-devel
-
-# Python 2.6 is not supported
-rm -r /opt/python/cp26*
-
-# Python 3.3 builds fail with:
-#  /opt/rh/devtoolset-2/root/usr/libexec/gcc/x86_64-CentOS-linux/4.8.2/ld: cannot find -lchtslib
-rm -r /opt/python/cp33*
-
-# Without libcurl support, htslib can open files from HTTP and FTP URLs.
-# With libcurl support, it also supports HTTPS and S3 URLs, but libcurl needs a
-# current version of OpenSSL, and we do not want to be responsible for
-# updating the wheels as soon as there are any security issues. So disable
-# libcurl for now.
-# See also <https://github.com/pypa/manylinux/issues/74>.
-#
-export HTSLIB_CONFIGURE_OPTIONS="--disable-libcurl"
-
-PYBINS="/opt/python/*/bin"
-for PYBIN in ${PYBINS}; do
-    ${PYBIN}/pip install -r /io/requirements.txt
-    ${PYBIN}/pip wheel /io/ -w wheelhouse/
-done
-
-# Bundle external shared libraries into the wheels
-#
-# The '-L .' option is a workaround. By default, auditwheel puts all external
-# libraries (.so files) into a .libs directory and sets the RUNPATH to $ORIGIN/.libs.
-# When HTSLIB_MODE is 'shared' (now the default), then all so libraries part of
-# pysam require that RUNPATH is set to $ORIGIN (without the .libs). It seems
-# auditwheel overwrites $ORIGIN with $ORIGIN/.libs. This workaround makes
-# auditwheel set the RUNPATH to "$ORIGIN/." and it will work as desired.
-#
-for whl in wheelhouse/*.whl; do
-    auditwheel repair -L . $whl -w /io/wheelhouse/
-done
-
-# Created files are owned by root, so fix permissions.
-chown -R --reference=/io/setup.py /io/wheelhouse/
-
-# TODO Install packages and test them
-#for PYBIN in ${PYBINS}; do
-#    ${PYBIN}/pip install pysam --no-index -f /io/wheelhouse
-#    (cd $HOME; ${PYBIN}/nosetests ...)
-#done
diff --git a/ci/conda-recipe/build.sh b/ci/conda-recipe/build.sh

deleted file mode 100644 (file)

index 32b67db..0000000
--- a/ci/conda-recipe/build.sh
+++ /dev/null
@@ -1,8 +0,0 @@
-#!/bin/bash
-
-# Use internal htslib
-chmod a+x ./htslib/configure
-export CFLAGS="-I${PREFIX}/include/curl/ -I${PREFIX}/include -L${PREFIX}/lib"
-export HTSLIB_CONFIGURE_OPTIONS="--disable-libcurl"
-
-$PYTHON setup.py install
diff --git a/ci/conda-recipe/meta.yaml b/ci/conda-recipe/meta.yaml

deleted file mode 100644 (file)

index 4e57895..0000000
--- a/ci/conda-recipe/meta.yaml
+++ /dev/null
@@ -1,29 +0,0 @@
-package:
-  name: pysam
-  version: 0.8.5
-
-source:
-  path: ../../
-
-build:
-  number: 0
-
-requirements:
-  build:
-    - python
-    - setuptools
-    - zlib
-    - cython
-
-  run:
-    - python
-    - zlib
-
-test:
-  imports:
-    - pysam
-
-about:
-  home: https://github.com/pysam-developers/pysam
-  license: MIT
-  summary: Pysam is a python module for reading and manipulating Samfiles. It's a lightweight wrapper of the samtools C-API. Pysam also includes an interface for tabix.
diff --git a/ci/install-CGAT-tools.sh b/ci/install-CGAT-tools.sh

deleted file mode 100755 (executable)

index 27eb481..0000000
--- a/ci/install-CGAT-tools.sh
+++ /dev/null
@@ -1,281 +0,0 @@
-#!/usr/bin/env bash
-
-# function to detect the Operating System
-detect_os(){
-
-if [ -f /etc/os-release ]; then
-
-   OS=$(cat /etc/os-release | awk '/VERSION_ID/ {sub("="," "); print $2;}' | sed 's/\"//g' | awk '{sub("\\."," "); print $1;}')
-   if [ "$OS" != "12" ] ; then
-
-      echo       
-      echo " Ubuntu version not supported "
-      echo
-      echo " Only Ubuntu 12.x has been tested so far "
-      echo 
-      exit 1;
-
-   fi
-
-   OS="ubuntu"
-
-elif [ -f /etc/system-release ]; then
-
-   OS=$(cat /etc/system-release | awk ' {print $4;}' | awk '{sub("\\."," "); print $1;}')
-   if [ "$OS" != "6" ] ; then
-      echo
-      echo " Scientific Linux version not supported "
-      echo
-      echo " Only 6.x Scientific Linux has been tested so far "
-      echo
-      exit 1;
-   fi
-
-   OS="sl"
-
-else
-
-   echo
-   echo " Operating system not supported "
-   echo
-   echo " Exiting installation "
-   echo
-   exit 1;
-
-fi
-} # detect_os
-
-# message to display when the OS is not correct
-sanity_check_os() {
-   echo
-   echo " Unsupported operating system: $OS "
-   echo " Installation aborted "
-   echo
-   exit 1;
-} # sanity_check_os
-
-# function to install operating system dependencies
-install_os_packages() {
-
-if [ "$OS" == "ubuntu" -o "$OS" == "travis" ] ; then
-
-   echo
-   echo " Installing packages for Ubuntu "
-   echo
-
-   apt-get install -y gcc g++
-
-elif [ "$OS" == "sl" ] ; then
-
-   echo 
-   echo " Installing packages for Scientific Linux "
-   echo
-
-   yum -y install gcc zlib-devel gcc-c++
-
-else
-
-   sanity_check_os
-
-fi # if-OS
-} # install_os_packages
-
-# funcion to install Python dependencies
-install_python_deps() {
-
-if [ "$OS" == "ubuntu" -o "$OS" == "sl" ] ; then
-
-   echo
-   echo " Installing Python dependencies for $1 "
-   echo
-
-   # Create virtual environment
-   cd
-   mkdir CGAT
-   cd CGAT
-   wget --no-check-certificate https://pypi.python.org/packages/source/v/virtualenv/virtualenv-1.10.1.tar.gz
-   tar xvfz virtualenv-1.10.1.tar.gz
-   rm virtualenv-1.10.1.tar.gz
-   cd virtualenv-1.10.1
-   python virtualenv.py cgat-venv
-   source cgat-venv/bin/activate
-
-   # Install Python prerequisites
-   pip install cython
-
-elif [ "$OS" == "travis" ] ; then
-   # Travis-CI provides a virtualenv with Python 2.7
-   echo 
-   echo " Installing Python dependencies in travis "
-   echo
-
-   # Install Python prerequisites
-   pip install cython
-   pip install nose
-
-else
-
-   sanity_check_os
-
-fi # if-OS
-} # install_python_deps
-
-# common set of tasks to prepare external dependencies
-nosetests_external_deps() {
-echo
-echo " Running nosetests for $1 "
-echo
-
-pushd .
-
-# create a new folder to store external tools
-mkdir -p $HOME/CGAT/external-tools
-
-# install samtools
-cd $HOME/CGAT/external-tools
-curl -L http://downloads.sourceforge.net/project/samtools/samtools/1.3/samtools-1.3.tar.bz2 > samtools-1.3.tar.bz2
-tar xjf samtools-1.3.tar.bz2 
-cd samtools-1.3
-make
-PATH=$PATH:$HOME/CGAT/external-tools/samtools-1.3
-
-echo "installed samtools"
-samtools --version
-
-if [ $? != 0 ]; then
-    exit 1
-fi
-
-# install bcftools
-cd $HOME/CGAT/external-tools
-curl -L https://github.com/samtools/bcftools/releases/download/1.3/bcftools-1.3.tar.bz2 > bcftools-1.3.tar.bz2
-tar xjf bcftools-1.3.tar.bz2
-cd bcftools-1.3
-make
-PATH=$PATH:$HOME/CGAT/external-tools/bcftools-1.3
-
-echo "installed bcftools"
-bcftools --version
-
-if [ $? != 0 ]; then
-    exit 1
-fi
-
-popd
-
-} # nosetests_external_deps
-
-
-# function to run nosetests
-run_nosetests() {
-
-echo
-echo " Running nosetests for $1 "
-echo
-
-# prepare external dependencies
-nosetests_external_deps $OS
-
-# install code
-python setup.py install
-
-# change into tests directory. Otherwise,
-# 'import pysam' will import the repository,
-# not the installed version. This causes
-# problems in the compilation test.
-cd tests
-
-# create auxilliary data
-echo
-echo 'building test data'
-echo 
-make -C pysam_data all
-make -C cbcf_data all
-
-# run nosetests
-# -s: do not capture stdout, conflicts with pysam.dispatch
-# -v: verbose output
-nosetests -s -v 
-
-} # run_nosetests
-
-# function to display help message
-help_message() {
-echo
-echo " Use this script as follows: "
-echo
-echo " 1) Become root and install the operating system* packages: "
-echo " ./install-CGAT-tools.sh --install-os-packages"
-echo
-echo " 2) Now, as a normal user (non root), install the Python dependencies**: "
-echo " ./install-CGAT-tools.sh --install-python-deps"
-echo
-echo " At this stage the CGAT Code Collection is ready to go and you do not need further steps. Please type the following for more information:"
-echo " source $HOME/CGAT/virtualenv-1.10.1/cgat-venv/bin/activate"
-echo " cgat --help "
-echo
-echo " The CGAT Code Collection tests the software with nosetests. If you are interested in running those, please continue with the following steps:"
-echo
-echo " 3) Become root to install external tools and set up the environment: "
-echo " ./install-CGAT-tools.sh --install-nosetests-deps"
-echo
-echo " 4) Then, back as a normal user (non root), run nosetests as follows:"
-echo " ./install-CGAT-tools.sh --run-nosetests"
-echo 
-echo " NOTES: "
-echo " * Supported operating systems: Ubuntu 12.x and Scientific Linux 6.x "
-echo " ** An isolated virtual environment will be created to install Python dependencies "
-echo
-exit 1;
-}
-
-
-# the main script starts here
-
-if [ $# -eq 0 -o $# -gt 1 ] ; then
-
-   help_message
-
-else
-
-   if [ "$1" == "--help" ] ; then
-
-      help_message
-
-   elif [ "$1" == "--travis" ] ; then
-
-      OS="travis"
-      install_os_packages
-      install_python_deps
-      run_nosetests
-
-   elif [ "$1" == "--install-os-packages" ] ; then
-
-      detect_os
-      install_os_packages
-
-   elif [ "$1" == "--install-python-deps" ] ; then
-
-      detect_os
-      install_python_deps
-
-   elif [ "$1" == "--install-nosetests-deps" ] ; then
-
-      detect_os
-      install_nosetests_deps
-
-   elif [ "$1" == "--run-nosetests" ] ; then
-
-      detect_os
-      run_nosetests
-
-   else 
-
-      echo 
-      echo " Incorrect input parameter: $1 "
-      help_message
-
-   fi # if argument 1
-
-fi # if number of input parameters
-
diff --git a/devtools/buildwheels.sh b/devtools/buildwheels.sh

new file mode 100755 (executable)

index 0000000..ae0d953
--- /dev/null
+++ b/devtools/buildwheels.sh
@@ -0,0 +1,69 @@
+#!/bin/bash
+#
+# Build manylinux1 wheels for pysam. Based on the example at
+# <https://github.com/pypa/python-manylinux-demo>
+#
+# It is best to run this in a fresh clone of the repository!
+#
+# Run this within the repository root:
+#   docker run --rm -v $(pwd):/io quay.io/pypa/manylinux1_x86_64 /io/buildwheels.sh
+#
+# The wheels will be put into the wheelhouse/ subdirectory.
+#
+# For interactive tests:
+#   docker run -it -v $(pwd):/io quay.io/pypa/manylinux1_x86_64 /bin/bash
+
+set -xeuo pipefail
+
+# For convenience, if this script is called from outside of a docker container,
+# it starts a container and runs itself inside of it.
+if ! grep -q docker /proc/1/cgroup; then
+  # We are not inside a container
+  exec docker run --rm -v $(pwd):/io quay.io/pypa/manylinux1_x86_64 /io/$0
+fi
+
+yum install -y zlib-devel bzip2-devel xz-devel
+
+# Python 2.6 is not supported
+rm -r /opt/python/cp26*
+
+# Python 3.3 builds fail with:
+#  /opt/rh/devtoolset-2/root/usr/libexec/gcc/x86_64-CentOS-linux/4.8.2/ld: cannot find -lchtslib
+rm -r /opt/python/cp33*
+
+# Without libcurl support, htslib can open files from HTTP and FTP URLs.
+# With libcurl support, it also supports HTTPS and S3 URLs, but libcurl needs a
+# current version of OpenSSL, and we do not want to be responsible for
+# updating the wheels as soon as there are any security issues. So disable
+# libcurl for now.
+# See also <https://github.com/pypa/manylinux/issues/74>.
+#
+export HTSLIB_CONFIGURE_OPTIONS="--disable-libcurl"
+
+PYBINS="/opt/python/*/bin"
+for PYBIN in ${PYBINS}; do
+    ${PYBIN}/pip install -r /io/requirements.txt
+    ${PYBIN}/pip wheel /io/ -w wheelhouse/
+done
+
+# Bundle external shared libraries into the wheels
+#
+# The '-L .' option is a workaround. By default, auditwheel puts all external
+# libraries (.so files) into a .libs directory and sets the RUNPATH to $ORIGIN/.libs.
+# When HTSLIB_MODE is 'shared' (now the default), then all so libraries part of
+# pysam require that RUNPATH is set to $ORIGIN (without the .libs). It seems
+# auditwheel overwrites $ORIGIN with $ORIGIN/.libs. This workaround makes
+# auditwheel set the RUNPATH to "$ORIGIN/." and it will work as desired.
+#
+for whl in wheelhouse/*.whl; do
+    auditwheel repair -L . $whl -w /io/wheelhouse/
+done
+
+# Created files are owned by root, so fix permissions.
+chown -R --reference=/io/setup.py /io/wheelhouse/
+
+# TODO Install packages and test them
+#for PYBIN in ${PYBINS}; do
+#    ${PYBIN}/pip install pysam --no-index -f /io/wheelhouse
+#    (cd $HOME; ${PYBIN}/nosetests ...)
+#done
diff --git a/devtools/conda-recipe/build.sh b/devtools/conda-recipe/build.sh

new file mode 100644 (file)

index 0000000..32b67db
--- /dev/null
+++ b/devtools/conda-recipe/build.sh
@@ -0,0 +1,8 @@
+#!/bin/bash
+
+# Use internal htslib
+chmod a+x ./htslib/configure
+export CFLAGS="-I${PREFIX}/include/curl/ -I${PREFIX}/include -L${PREFIX}/lib"
+export HTSLIB_CONFIGURE_OPTIONS="--disable-libcurl"
+
+$PYTHON setup.py install
diff --git a/devtools/conda-recipe/meta.yaml b/devtools/conda-recipe/meta.yaml

new file mode 100644 (file)

index 0000000..4e57895
--- /dev/null
+++ b/devtools/conda-recipe/meta.yaml
@@ -0,0 +1,29 @@
+package:
+  name: pysam
+  version: 0.8.5
+
+source:
+  path: ../../
+
+build:
+  number: 0
+
+requirements:
+  build:
+    - python
+    - setuptools
+    - zlib
+    - cython
+
+  run:
+    - python
+    - zlib
+
+test:
+  imports:
+    - pysam
+
+about:
+  home: https://github.com/pysam-developers/pysam
+  license: MIT
+  summary: Pysam is a python module for reading and manipulating Samfiles. It's a lightweight wrapper of the samtools C-API. Pysam also includes an interface for tabix.
diff --git a/devtools/import.py b/devtools/import.py

new file mode 100644 (file)

index 0000000..44e606a
--- /dev/null
+++ b/devtools/import.py
@@ -0,0 +1,226 @@
+#################################################################
+# Importing samtools and htslib
+#
+# For htslib, simply copy the whole release tar-ball
+# into the directory "htslib" and recreate the file version.h
+#
+# rm -rf htslib
+# mv download/htslib htslib
+# git checkout -- htslib/version.h
+# Edit the file htslib/version.h to set the right version number.
+#
+# For samtools, type:
+# rm -rf samtools
+# python import.py samtools download/samtools
+# git checkout -- samtools/version.h
+#
+# Manually, then:
+# modify config.h to set compatibility flags
+#
+# For bcftools, type:
+# rm -rf bcftools
+# python import.py bcftools download/bcftools
+# git checkout -- bcftools/version.h
+# rm -rf bcftools/test bcftools/plugins
+
+import fnmatch
+import os
+import re
+import itertools
+import shutil
+import sys
+import hashlib
+
+
+EXCLUDE = {
+    "samtools": (
+        "razip.c",
+        "bgzip.c",
+        "main.c",
+        "calDepth.c",
+        "bam2bed.c",
+        "wgsim.c",
+        "bam_tview.c",
+        "bam_tview.h",
+        "bam_tview_html.c",
+        "bam_tview_curses.c",
+        "md5fa.c",
+        "md5sum-lite.c",
+        "maq2sam.c",
+        "bamcheck.c",
+        "chk_indel.c",
+        "vcf-miniview.c",
+        "hfile_irods.c",  # requires irods library
+    ),
+    "bcftools": (
+        "test", "plugins", "peakfit.c",
+        "peakfit.h",
+        # needs to be renamed; name conflicts with samtools reheader
+        # "reheader.c",
+        "polysomy.c"),
+    "htslib": (
+        'htslib/tabix.c', 'htslib/bgzip.c',
+        'htslib/htsfile.c', 'htslib/hfile_irods.c'),
+}
+
+
+MAIN = {
+    "samtools": "bamtk",
+    "bcftools": "main"
+}
+
+
+
+def locate(pattern, root=os.curdir):
+    '''Locate all files matching supplied filename pattern in and below
+    supplied root directory.
+    '''
+    for path, dirs, files in os.walk(os.path.abspath(root)):
+        for filename in fnmatch.filter(files, pattern):
+            yield os.path.join(path, filename)
+
+
+def _update_pysam_files(cf, destdir):
+    '''Update pysam files, applying redirection of output.'''
+    basename = os.path.basename(destdir)
+    for filename in cf:
+        if not filename:
+            continue
+        dest = filename + ".pysam.c"
+        with open(filename, encoding="utf-8") as infile:
+            lines = "".join(infile.readlines())
+
+            with open(dest, "w", encoding="utf-8") as outfile:
+                outfile.write('#include "{}.pysam.h"\n\n'.format(basename))
+                subname, _ = os.path.splitext(os.path.basename(filename))
+                if subname == MAIN.get(basename):
+                    lines = re.sub(r"int main\(", "int {}_main(".format(
+                        basename), lines)
+                else:
+                    lines = re.sub(r"int main\(", "int {}_{}_main(".format(
+                        basename, subname), lines)
+                lines = re.sub("stderr", "{}_stderr".format(basename), lines)
+                lines = re.sub("stdout", "{}_stdout".format(basename), lines)
+                lines = re.sub(r" printf\(", " fprintf({}_stdout, ".format(basename), lines)
+                lines = re.sub(r"([^kf])puts\(", r"\1{}_puts(".format(basename), lines)
+                lines = re.sub(r"putchar\(([^)]+)\)",
+                               r"fputc(\1, {}_stdout)".format(basename), lines)
+
+                fn = os.path.basename(filename)
+                # some specific fixes:
+                SPECIFIC_SUBSTITUTIONS = {
+                    "bam_md.c": (
+                        'sam_open_format("-", mode_w',
+                        'sam_open_format({}_stdout_fn, mode_w'.format(basename)),
+                    "phase.c": (
+                        'putc("ACGT"[f->seq[j] == 1? (c&3, {}_stdout) : (c>>16&3)]);'.format(basename),
+                        'putc("ACGT"[f->seq[j] == 1? (c&3) : (c>>16&3)], {}_stdout);'.format(basename)),
+                    "cut_target.c": (
+                        'putc(33 + (cns[j]>>8>>2, {}_stdout));'.format(basename),
+                        'putc(33 + (cns[j]>>8>>2), {}_stdout);'.format(basename))
+                    }
+                if fn in SPECIFIC_SUBSTITUTIONS:
+                    lines = lines.replace(
+                        SPECIFIC_SUBSTITUTIONS[fn][0],
+                        SPECIFIC_SUBSTITUTIONS[fn][1])
+                outfile.write(lines)
+
+    with open(os.path.join("import", "pysam.h")) as inf, \
+         open(os.path.join(destdir, "{}.pysam.h".format(basename)), "w") as outf:
+        outf.write(re.sub("@pysam@", basename, inf.read()))
+
+    with open(os.path.join("import", "pysam.c")) as inf, \
+         open(os.path.join(destdir, "{}.pysam.c".format(basename)), "w") as outf:
+        outf.write(re.sub("@pysam@", basename, inf.read()))
+
+
+if len(sys.argv) >= 1:
+    if len(sys.argv) != 3:
+        raise ValueError("import requires dest src")
+
+    dest, srcdir = sys.argv[1:3]
+    if dest not in EXCLUDE:
+        raise ValueError("import expected one of %s" %
+                         ",".join(EXCLUDE.keys()))
+    exclude = EXCLUDE[dest]
+    destdir = os.path.abspath(dest)
+    srcdir = os.path.abspath(srcdir)
+    if not os.path.exists(srcdir):
+        raise IOError(
+            "source directory `%s` does not exist." % srcdir)
+
+    cfiles = locate("*.c", srcdir)
+    hfiles = locate("*.h", srcdir)
+    mfiles = itertools.chain(locate("README", srcdir), locate("LICENSE", srcdir))
+    
+    # remove unwanted files and htslib subdirectory.
+    cfiles = [x for x in cfiles if os.path.basename(x) not in exclude
+              and not re.search("htslib-", x)]
+
+    hfiles = [x for x in hfiles if os.path.basename(x) not in exclude
+              and not re.search("htslib-", x)]
+
+    ncopied = 0
+
+    def _compareAndCopy(src, srcdir, destdir, exclude):
+
+        d, f = os.path.split(src)
+        common_prefix = os.path.commonprefix((d, srcdir))
+        subdir = re.sub(common_prefix, "", d)[1:]
+        targetdir = os.path.join(destdir, subdir)
+        if not os.path.exists(targetdir):
+            os.makedirs(targetdir)
+        old_file = os.path.join(targetdir, f)
+        if os.path.exists(old_file):
+            md5_old = hashlib.md5(
+                "".join(open(old_file, "r", encoding="utf-8").readlines()).encode()).digest()
+            md5_new = hashlib.md5(
+                "".join(open(src, "r", encoding="utf-8").readlines()).encode()).digest()
+            if md5_old != md5_new:
+                raise ValueError(
+                    "incompatible files for %s and %s" %
+                    (old_file, src))
+
+        shutil.copy(src, targetdir)
+        return old_file
+
+    for src_file in hfiles:
+        _compareAndCopy(src_file, srcdir, destdir, exclude)
+        ncopied += 1
+
+    for src_file in mfiles:
+        _compareAndCopy(src_file, srcdir, destdir, exclude)
+        ncopied += 1
+
+    cf = []
+    for src_file in cfiles:
+        cf.append(_compareAndCopy(src_file,
+                                  srcdir,
+                                  destdir,
+                                  exclude))
+        ncopied += 1
+
+    sys.stdout.write(
+        "installed latest source code from %s: "
+        "%i files copied\n" % (srcdir, ncopied))
+    # redirect stderr to pysamerr and replace bam.h with a stub.
+    sys.stdout.write("applying stderr redirection\n")
+
+    _update_pysam_files(cf, destdir)
+
+    sys.exit(0)
+
+
+# if len(sys.argv) >= 2 and sys.argv[1] == "refresh":
+#     sys.stdout.write("refreshing latest source code from .c to .pysam.c")
+#     # redirect stderr to pysamerr and replace bam.h with a stub.
+#     sys.stdout.write("applying stderr redirection")
+#     for destdir in ('samtools', ):
+#         pysamcfiles = locate("*.pysam.c", destdir)
+#         for f in pysamcfiles:
+#             os.remove(f)
+#         cfiles = locate("*.c", destdir)
+#         _update_pysam_files(cfiles, destdir)
+
+#     sys.exit(0)
+
diff --git a/devtools/install-CGAT-tools.sh b/devtools/install-CGAT-tools.sh

new file mode 100755 (executable)

index 0000000..27eb481
--- /dev/null
+++ b/devtools/install-CGAT-tools.sh
@@ -0,0 +1,281 @@
+#!/usr/bin/env bash
+
+# function to detect the Operating System
+detect_os(){
+
+if [ -f /etc/os-release ]; then
+
+   OS=$(cat /etc/os-release | awk '/VERSION_ID/ {sub("="," "); print $2;}' | sed 's/\"//g' | awk '{sub("\\."," "); print $1;}')
+   if [ "$OS" != "12" ] ; then
+
+      echo       
+      echo " Ubuntu version not supported "
+      echo
+      echo " Only Ubuntu 12.x has been tested so far "
+      echo 
+      exit 1;
+
+   fi
+
+   OS="ubuntu"
+
+elif [ -f /etc/system-release ]; then
+
+   OS=$(cat /etc/system-release | awk ' {print $4;}' | awk '{sub("\\."," "); print $1;}')
+   if [ "$OS" != "6" ] ; then
+      echo
+      echo " Scientific Linux version not supported "
+      echo
+      echo " Only 6.x Scientific Linux has been tested so far "
+      echo
+      exit 1;
+   fi
+
+   OS="sl"
+
+else
+
+   echo
+   echo " Operating system not supported "
+   echo
+   echo " Exiting installation "
+   echo
+   exit 1;
+
+fi
+} # detect_os
+
+# message to display when the OS is not correct
+sanity_check_os() {
+   echo
+   echo " Unsupported operating system: $OS "
+   echo " Installation aborted "
+   echo
+   exit 1;
+} # sanity_check_os
+
+# function to install operating system dependencies
+install_os_packages() {
+
+if [ "$OS" == "ubuntu" -o "$OS" == "travis" ] ; then
+
+   echo
+   echo " Installing packages for Ubuntu "
+   echo
+
+   apt-get install -y gcc g++
+
+elif [ "$OS" == "sl" ] ; then
+
+   echo 
+   echo " Installing packages for Scientific Linux "
+   echo
+
+   yum -y install gcc zlib-devel gcc-c++
+
+else
+
+   sanity_check_os
+
+fi # if-OS
+} # install_os_packages
+
+# function to install Python dependencies
+install_python_deps() {
+
+if [ "$OS" == "ubuntu" -o "$OS" == "sl" ] ; then
+
+   echo
+   echo " Installing Python dependencies for $1 "
+   echo
+
+   # Create virtual environment
+   cd
+   mkdir CGAT
+   cd CGAT
+   wget --no-check-certificate https://pypi.python.org/packages/source/v/virtualenv/virtualenv-1.10.1.tar.gz
+   tar xvfz virtualenv-1.10.1.tar.gz
+   rm virtualenv-1.10.1.tar.gz
+   cd virtualenv-1.10.1
+   python virtualenv.py cgat-venv
+   source cgat-venv/bin/activate
+
+   # Install Python prerequisites
+   pip install cython
+
+elif [ "$OS" == "travis" ] ; then
+   # Travis-CI provides a virtualenv with Python 2.7
+   echo 
+   echo " Installing Python dependencies in travis "
+   echo
+
+   # Install Python prerequisites
+   pip install cython
+   pip install nose
+
+else
+
+   sanity_check_os
+
+fi # if-OS
+} # install_python_deps
+
+# common set of tasks to prepare external dependencies
+nosetests_external_deps() {
+echo
+echo " Running nosetests for $1 "
+echo
+
+pushd .
+
+# create a new folder to store external tools
+mkdir -p $HOME/CGAT/external-tools
+
+# install samtools
+cd $HOME/CGAT/external-tools
+curl -L http://downloads.sourceforge.net/project/samtools/samtools/1.3/samtools-1.3.tar.bz2 > samtools-1.3.tar.bz2
+tar xjf samtools-1.3.tar.bz2 
+cd samtools-1.3
+make
+PATH=$PATH:$HOME/CGAT/external-tools/samtools-1.3
+
+echo "installed samtools"
+samtools --version
+
+if [ $? != 0 ]; then
+    exit 1
+fi
+
+# install bcftools
+cd $HOME/CGAT/external-tools
+curl -L https://github.com/samtools/bcftools/releases/download/1.3/bcftools-1.3.tar.bz2 > bcftools-1.3.tar.bz2
+tar xjf bcftools-1.3.tar.bz2
+cd bcftools-1.3
+make
+PATH=$PATH:$HOME/CGAT/external-tools/bcftools-1.3
+
+echo "installed bcftools"
+bcftools --version
+
+if [ $? != 0 ]; then
+    exit 1
+fi
+
+popd
+
+} # nosetests_external_deps
+
+
+# function to run nosetests
+run_nosetests() {
+
+echo
+echo " Running nosetests for $1 "
+echo
+
+# prepare external dependencies
+nosetests_external_deps $OS
+
+# install code
+python setup.py install
+
+# change into tests directory. Otherwise,
+# 'import pysam' will import the repository,
+# not the installed version. This causes
+# problems in the compilation test.
+cd tests
+
+# create auxiliary data
+echo
+echo 'building test data'
+echo 
+make -C pysam_data all
+make -C cbcf_data all
+
+# run nosetests
+# -s: do not capture stdout, conflicts with pysam.dispatch
+# -v: verbose output
+nosetests -s -v 
+
+} # run_nosetests
+
+# function to display help message
+help_message() {
+echo
+echo " Use this script as follows: "
+echo
+echo " 1) Become root and install the operating system* packages: "
+echo " ./install-CGAT-tools.sh --install-os-packages"
+echo
+echo " 2) Now, as a normal user (non root), install the Python dependencies**: "
+echo " ./install-CGAT-tools.sh --install-python-deps"
+echo
+echo " At this stage the CGAT Code Collection is ready to go and you do not need further steps. Please type the following for more information:"
+echo " source $HOME/CGAT/virtualenv-1.10.1/cgat-venv/bin/activate"
+echo " cgat --help "
+echo
+echo " The CGAT Code Collection tests the software with nosetests. If you are interested in running those, please continue with the following steps:"
+echo
+echo " 3) Become root to install external tools and set up the environment: "
+echo " ./install-CGAT-tools.sh --install-nosetests-deps"
+echo
+echo " 4) Then, back as a normal user (non root), run nosetests as follows:"
+echo " ./install-CGAT-tools.sh --run-nosetests"
+echo 
+echo " NOTES: "
+echo " * Supported operating systems: Ubuntu 12.x and Scientific Linux 6.x "
+echo " ** An isolated virtual environment will be created to install Python dependencies "
+echo
+exit 1;
+}
+
+
+# the main script starts here
+
+if [ $# -eq 0 -o $# -gt 1 ] ; then
+
+   help_message
+
+else
+
+   if [ "$1" == "--help" ] ; then
+
+      help_message
+
+   elif [ "$1" == "--travis" ] ; then
+
+      OS="travis"
+      install_os_packages
+      install_python_deps
+      run_nosetests
+
+   elif [ "$1" == "--install-os-packages" ] ; then
+
+      detect_os
+      install_os_packages
+
+   elif [ "$1" == "--install-python-deps" ] ; then
+
+      detect_os
+      install_python_deps
+
+   elif [ "$1" == "--install-nosetests-deps" ] ; then
+
+      detect_os
+      nosetests_external_deps $OS
+
+   elif [ "$1" == "--run-nosetests" ] ; then
+
+      detect_os
+      run_nosetests
+
+   else 
+
+      echo 
+      echo " Incorrect input parameter: $1 "
+      help_message
+
+   fi # if argument 1
+
+fi # if number of input parameters
+
diff --git a/devtools/run_tests_travis.sh b/devtools/run_tests_travis.sh

new file mode 100755 (executable)

index 0000000..89988bb
--- /dev/null
+++ b/devtools/run_tests_travis.sh
@@ -0,0 +1,121 @@
+#!/usr/bin/env bash
+
+# test script for pysam.
+# The script performs the following tasks:
+# 1. Set up a conda environment and install dependencies via conda
+# 2. Build pysam via the conda recipe
+# 3. Build pysam via setup.py from repository
+# 4. Run tests on the setup.py version
+# 5. Additional build tests
+# 5.1 pip install with cython
+# 5.2 pip install without cython
+# 5.3 pip install without cython and without configure options
+
+pushd .
+
+WORKDIR=`pwd`
+
+# Install miniconda python
+if [ "$TRAVIS_OS_NAME" == "osx" ]; then
+       wget https://repo.continuum.io/miniconda/Miniconda3-latest-MacOSX-x86_64.sh -O Miniconda3.sh
+else
+       wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O Miniconda3.sh --no-check-certificate  # Default OS versions are old and have SSL / CERT issues
+fi
+
+bash Miniconda3.sh -b
+
+# Create a new conda environment with the target python version
+~/miniconda3/bin/conda install conda-build -y
+~/miniconda3/bin/conda create -q -y --name testenv python=$CONDA_PY cython numpy pytest psutil pip
+
+# activate testenv environment
+source ~/miniconda3/bin/activate testenv
+
+conda config --add channels r
+conda config --add channels defaults
+conda config --add channels conda-forge
+conda config --add channels bioconda
+
+# pin versions, so that tests do not fail when pysam/htslib out of step
+# add htslib dependencies
+conda install -y "samtools=1.7" "bcftools=1.6" "htslib=1.7" xz curl bzip2
+
+# Need to make C compiler and linker use the anaconda includes and libraries:
+export PREFIX=~/miniconda3/
+export CFLAGS="-I${PREFIX}/include -L${PREFIX}/lib"
+export HTSLIB_CONFIGURE_OPTIONS="--disable-libcurl --disable-lzma"
+
+samtools --version
+htslib --version
+bcftools --version
+
+# Try building conda recipe first
+~/miniconda3/bin/conda-build devtools/conda-recipe/ --python=$CONDA_PY
+
+# install code from the repository via setup.py
+echo "installing via setup.py from repository"
+python setup.py install
+
+# create auxiliary data
+echo
+echo 'building test data'
+echo
+make -C tests/pysam_data
+make -C tests/cbcf_data
+
+# echo any limits that are in place
+ulimit -a
+
+# run tests
+pytest
+
+if [ $? != 0 ]; then
+    exit 1
+fi
+
+# build source tar-ball. Make sure to run 'build' target so that .pyx
+# files are cythonized.
+python setup.py build sdist
+
+if [ $? != 0 ]; then
+    exit 1
+fi
+
+# check for presence of config.h files
+echo "checking for presence of config.h files in tar-ball"
+tar -tvzf dist/pysam-*.tar.gz | grep "config.h$"
+
+if [ $? != 1 ]; then
+    echo "ERROR: found config.h in tar-ball"
+    tar -tvzf dist/pysam-*.tar.gz | grep "config.h$"
+    exit 1
+fi
+
+# test pip installation from tar-ball with cython
+echo "pip installing with cython"
+pip install --verbose --no-deps --no-binary=:all: dist/pysam-*.tar.gz
+
+if [ $? != 0 ]; then
+    exit 1
+fi
+
+# attempt pip installation without cython
+echo "pip installing without cython"
+~/miniconda3/bin/conda remove -y cython
+~/miniconda3/bin/conda list
+echo "python is" `which python`
+pip install --verbose --no-deps --no-binary=:all: --force-reinstall --upgrade dist/pysam-*.tar.gz
+
+if [ $? != 0 ]; then
+    exit 1
+fi
+
+# attempt pip installation without cython and without
+# command line options
+echo "pip installing without cython and no configure options"
+export HTSLIB_CONFIGURE_OPTIONS=""
+pip install --verbose --no-deps --no-binary=:all: --force-reinstall --upgrade dist/pysam-*.tar.gz
+
+if [ $? != 0 ]; then
+    exit 1
+fi
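The tar-ball check in the script above (fail if any `config.h` ended up in the sdist) can also be sketched in Python. This is an illustration only, not part of the commit: `find_config_headers` and `make_tarball` are hypothetical helper names, and the match uses a plain suffix test, which is slightly stricter than the `grep "config.h$"` pattern, where `.` matches any character.

```python
import io
import tarfile


def find_config_headers(tar_bytes):
    """Return names of archive members whose path ends in 'config.h'."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:gz") as archive:
        return [name for name in archive.getnames()
                if name.endswith("config.h")]


def make_tarball(names):
    """Build a small in-memory .tar.gz of empty files (demo helper only)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name in names:
            # TarInfo defaults to a zero-length regular file.
            archive.addfile(tarfile.TarInfo(name))
    return buffer.getvalue()


if __name__ == "__main__":
    demo = make_tarball(["pysam-x/setup.py", "pysam-x/htslib/config.h"])
    offending = find_config_headers(demo)
    if offending:
        print("ERROR: found config.h in tar-ball:", offending)
```

An empty result corresponds to the `grep` exit status of 1 that the shell script treats as success.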
diff --git a/doc/api.rst b/doc/api.rst

index 8e766863ecaf96d22348be610b9d3741b85be6d0..d2a9a80fceb21a16698528c3b2f39d199dd177cc 100644 (file)
--- a/doc/api.rst
+++ b/doc/api.rst
@@ -133,8 +133,8 @@ More detailed usage instructions is at :ref:`usage`.
  API
  ===
  
-SAM/BAM files
--------------
+SAM/BAM/CRAM files
+-------------------
  
  Objects of type :class:`~pysam.AlignmentFile` allow working with
  BAM/SAM formatted files.
@@ -219,3 +219,12 @@ VCF files
  
  .. autoclass:: pysam.VariantHeaderRecord
     :members:
+
+HTSFile
+-------
+
+HTSFile is the base class for :class:`pysam.AlignmentFile` and
+:class:`pysam.VariantFile`.
+
+.. autoclass:: pysam.HTSFile
+   :members:
diff --git a/doc/glossary.rst b/doc/glossary.rst

index 3e739f9022be6850305628ffcd977f7cb6c96e46..de3e032622696cfdefeb4f69ceb7dfe4faba9f87 100644 (file)
--- a/doc/glossary.rst
+++ b/doc/glossary.rst
@@ -1,4 +1,4 @@
- ========
+========
  Glossary
  ========
  
diff --git a/doc/release.rst b/doc/release.rst

index 1a255b1409b04fead2d9caae1cdba5f521946b8f..0e3b96360c8bc3ceebd4dec27f600118383617ae 100644 (file)
--- a/doc/release.rst
+++ b/doc/release.rst
@@ -2,6 +2,17 @@
  Release notes
  =============
  
+Release 0.15.0
+==============
+
+This release wraps htslib/samtools/bcftools version 1.9.0.
+
+* [#673] permit dash in chromosome name of region string
+* [#656] Support `text` when opening a SAM file for writing
+* [#658] return None in get_forward_sequence if sequence not in record
+* [#683] allow lower case bases in MD tags
+* Ensure that = and X CIGAR ops are treated the same as M
+
  Release 0.14.1
  ==============
  
diff --git a/import.py b/import.py

deleted file mode 100644 (file)

index 89aa9f1..0000000
--- a/import.py
+++ /dev/null
@@ -1,228 +0,0 @@
-#################################################################
-# Importing samtools and htslib
-#
-# For htslib, simply copy the whole release tar-ball
-# into the directory "htslib" and recreate the file version.h
-#
-# rm -rf htslib
-# mv download/htslib htslib
-# git checkout -- htslib/version.h
-# Edit the file htslib/version.h to set the right version number.
-#
-# For samtools, type:
-# rm -rf samtools
-# python import.py samtools download/samtools
-# git checkout -- samtools/version.h
-#
-# Manually, then:
-# modify config.h to set compatibility flags
-#
-# For bcftools, type:
-# rm -rf bcftools
-# python import.py bcftools download/bedtools
-# git checkout -- bcftools/version.h
-# rm -rf bedtools/test bedtools/plugins
-
-import fnmatch
-import os
-import re
-import itertools
-import shutil
-import sys
-import hashlib
-
-
-EXCLUDE = {
-    "samtools": (
-        "razip.c",
-        "bgzip.c",
-        "main.c",
-        "calDepth.c",
-        "bam2bed.c",
-        "wgsim.c",
-        "bam_tview.c",
-        "bam_tview.h",
-        "bam_tview_html.c",
-        "bam_tview_curses.c",
-        "md5fa.c",
-        "md5sum-lite.c",
-        "maq2sam.c",
-        "bamcheck.c",
-        "chk_indel.c",
-        "vcf-miniview.c",
-        "hfile_irods.c",  # requires irods library
-    ),
-    "bcftools": (
-        "test", "plugins", "peakfit.c",
-        "peakfit.h",
-        # needs to renamed, name conflict with samtools reheader
-        # "reheader.c",
-        "polysomy.c"),
-    "htslib": (
-        'htslib/tabix.c', 'htslib/bgzip.c',
-        'htslib/htsfile.c', 'htslib/hfile_irods.c'),
-}
-
-
-MAIN = {
-    "samtools": "bamtk",
-    "bcftools": "main"
-}
-
-
-
-def locate(pattern, root=os.curdir):
-    '''Locate all files matching supplied filename pattern in and below
-    supplied root directory.
-    '''
-    for path, dirs, files in os.walk(os.path.abspath(root)):
-        for filename in fnmatch.filter(files, pattern):
-            yield os.path.join(path, filename)
-
-
-def _update_pysam_files(cf, destdir):
-    '''update pysam files applying redirection of ouput'''
-    basename = os.path.basename(destdir)
-    for filename in cf:
-        if not filename:
-            continue
-        dest = filename + ".pysam.c"
-        with open(filename, encoding="utf-8") as infile:
-            lines = "".join(infile.readlines())
-
-            with open(dest, "w", encoding="utf-8") as outfile:
-                outfile.write('#include "{}.pysam.h"\n\n'.format(basename))
-                subname, _ = os.path.splitext(os.path.basename(filename))
-                if subname in MAIN.get(basename, []):
-                    lines = re.sub("int main\(", "int {}_main(".format(
-                        basename), lines)
-                else:
-                    lines = re.sub("int main\(", "int {}_{}_main(".format(
-                        basename, subname), lines)
-                lines = re.sub("stderr", "{}_stderr".format(basename), lines)
-                lines = re.sub("stdout", "{}_stdout".format(basename), lines)
-                lines = re.sub(" printf\(", " fprintf({}_stdout, ".format(basename), lines)
-                lines = re.sub("([^kf])puts\(([^)]+)\)",
-                               r"\1fputs(\2, {}_stdout) & fputc('\\n', {}_stdout)".format(basename, basename),
-                               lines)
-                lines = re.sub("putchar\(([^)]+)\)",
-                               r"fputc(\1, {}_stdout)".format(basename), lines)
-
-                fn = os.path.basename(filename)
-                # some specific fixes:
-                SPECIFIC_SUBSTITUTIONS = {
-                    "bam_md.c": (
-                        'sam_open_format("-", mode_w',
-                        'sam_open_format({}_stdout_fn, mode_w'.format(basename)),
-                    "phase.c": (
-                        'putc("ACGT"[f->seq[j] == 1? (c&3, {}_stdout) : (c>>16&3)]);'.format(basename),
-                        'putc("ACGT"[f->seq[j] == 1? (c&3) : (c>>16&3)], {}_stdout);'.format(basename)),
-                    "cut_target.c": (
-                        'putc(33 + (cns[j]>>8>>2, {}_stdout));'.format(basename),
-                        'putc(33 + (cns[j]>>8>>2), {}_stdout);'.format(basename))
-                    }
-                if fn in SPECIFIC_SUBSTITUTIONS:
-                    lines = lines.replace(
-                        SPECIFIC_SUBSTITUTIONS[fn][0],
-                        SPECIFIC_SUBSTITUTIONS[fn][1])
-                outfile.write(lines)
-
-    with open(os.path.join("import", "pysam.h")) as inf, \
-         open(os.path.join(destdir, "{}.pysam.h".format(basename)), "w") as outf:
-        outf.write(re.sub("@pysam@", basename, inf.read()))
-
-    with open(os.path.join("import", "pysam.c")) as inf, \
-         open(os.path.join(destdir, "{}.pysam.c".format(basename)), "w") as outf:
-        outf.write(re.sub("@pysam@", basename, inf.read()))
-
-
-if len(sys.argv) >= 1:
-    if len(sys.argv) != 3:
-        raise ValueError("import requires dest src")
-
-    dest, srcdir = sys.argv[1:3]
-    if dest not in EXCLUDE:
-        raise ValueError("import expected one of %s" %
-                         ",".join(EXCLUDE.keys()))
-    exclude = EXCLUDE[dest]
-    destdir = os.path.abspath(dest)
-    srcdir = os.path.abspath(srcdir)
-    if not os.path.exists(srcdir):
-        raise IOError(
-            "source directory `%s` does not exist." % srcdir)
-
-    cfiles = locate("*.c", srcdir)
-    hfiles = locate("*.h", srcdir)
-    mfiles = itertools.chain(locate("README", srcdir), locate("LICENSE", srcdir))
-    
-    # remove unwanted files and htslib subdirectory.
-    cfiles = [x for x in cfiles if os.path.basename(x) not in exclude
-              and not re.search("htslib-", x)]
-
-    hfiles = [x for x in hfiles if os.path.basename(x) not in exclude
-              and not re.search("htslib-", x)]
-
-    ncopied = 0
-
-    def _compareAndCopy(src, srcdir, destdir, exclude):
-
-        d, f = os.path.split(src)
-        common_prefix = os.path.commonprefix((d, srcdir))
-        subdir = re.sub(common_prefix, "", d)[1:]
-        targetdir = os.path.join(destdir, subdir)
-        if not os.path.exists(targetdir):
-            os.makedirs(targetdir)
-        old_file = os.path.join(targetdir, f)
-        if os.path.exists(old_file):
-            md5_old = hashlib.md5(
-                "".join(open(old_file, "r", encoding="utf-8").readlines()).encode()).digest()
-            md5_new = hashlib.md5(
-                "".join(open(src, "r", encoding="utf-8").readlines()).encode()).digest()
-            if md5_old != md5_new:
-                raise ValueError(
-                    "incompatible files for %s and %s" %
-                    (old_file, src))
-
-        shutil.copy(src, targetdir)
-        return old_file
-
-    for src_file in hfiles:
-        _compareAndCopy(src_file, srcdir, destdir, exclude)
-        ncopied += 1
-
-    for src_file in mfiles:
-        _compareAndCopy(src_file, srcdir, destdir, exclude)
-        ncopied += 1
-
-    cf = []
-    for src_file in cfiles:
-        cf.append(_compareAndCopy(src_file,
-                                  srcdir,
-                                  destdir,
-                                  exclude))
-        ncopied += 1
-
-    sys.stdout.write(
-        "installed latest source code from %s: "
-        "%i files copied\n" % (srcdir, ncopied))
-    # redirect stderr to pysamerr and replace bam.h with a stub.
-    sys.stdout.write("applying stderr redirection\n")
-
-    _update_pysam_files(cf, destdir)
-
-    sys.exit(0)
-
-
-# if len(sys.argv) >= 2 and sys.argv[1] == "refresh":
-#     sys.stdout.write("refreshing latest source code from .c to .pysam.c")
-#     # redirect stderr to pysamerr and replace bam.h with a stub.
-#     sys.stdout.write("applying stderr redirection")
-#     for destdir in ('samtools', ):
-#         pysamcfiles = locate("*.pysam.c", destdir)
-#         for f in pysamcfiles:
-#             os.remove(f)
-#         cfiles = locate("*.c", destdir)
-#         _update_pysam_files(cfiles, destdir)
-
-#     sys.exit(0)
-
diff --git a/import/pysam.c b/import/pysam.c

index 16420137537948315a0e44334892103462f03740..d08cf50e5542366ddb63ffed52a5eb16c7b505fc 100644 (file)
--- a/import/pysam.c
+++ b/import/pysam.c
@@ -54,6 +54,12 @@ void @pysam@_unset_stdout(void)
    @pysam@_stdout_fileno = STDOUT_FILENO;
  }
  
+int @pysam@_puts(const char *s)
+{
+  if (fputs(s, @pysam@_stdout) == EOF) return EOF;
+  return putc('\n', @pysam@_stdout);
+}
+
  void @pysam@_set_optind(int val)
  {
    // setting this in cython via 
diff --git a/import/pysam.h b/import/pysam.h

index 4a6ec2961efeb4286c2c248b515696c72dc6d2e7..c3cc6231462eea087048f53a1e8d80ae35cbbcea 100644 (file)
--- a/import/pysam.h
+++ b/import/pysam.h
@@ -38,6 +38,8 @@ void @pysam@_unset_stderr(void);
   */
  void @pysam@_unset_stdout(void);
  
+int @pysam@_puts(const char *s);
+
  int @pysam@_dispatch(int argc, char *argv[]);
  
  void @pysam@_set_optind(int);
diff --git a/pysam/libcalignedsegment.pyx b/pysam/libcalignedsegment.pyx

index edb5eaa224310eb75bcb15af53cab53354b5f291..d0a20d52a26ec24c7ffd033ea80ed71c17979e86 100644 (file)
--- a/pysam/libcalignedsegment.pyx
+++ b/pysam/libcalignedsegment.pyx
@@ -777,6 +777,7 @@ cdef inline bytes build_alignment_sequence(bam1_t * src):
  
      cdef char * md_tag = <char*>bam_aux2Z(md_tag_ptr)
      cdef int md_idx = 0
+    cdef char c
      s_idx = 0
  
      # Check if MD tag is valid by matching CIGAR length to MD tag defined length
@@ -822,8 +823,12 @@ cdef inline bytes build_alignment_sequence(bam1_t * src):
                      s_idx += 1
                      md_idx += 1
              else:
-                # save mismatch and change to lower case
-                s[s_idx] = md_tag[md_idx] + 32
+                # save mismatch
+                # enforce lower case
+                c = md_tag[md_idx]
+                if c <= 90:
+                    c += 32
+                s[s_idx] = c
                  s_idx += 1
                  r_idx += 1
                  md_idx += 1
@@ -1774,7 +1779,7 @@ cdef class AlignedSegment:
                  if _full:
                      for i from 0 <= i < l:
                          result.append(None)
-            elif op == BAM_CMATCH:
+            elif op == BAM_CMATCH or op == BAM_CEQUAL or op == BAM_CDIFF:
                  for i from pos <= i < pos + l:
                      result.append(i)
                  pos += l
@@ -1830,7 +1835,11 @@ cdef class AlignedSegment:
          
          Reads mapping to the reverse strand will be reverse
          complemented.
+
+        Returns None if the record has no query sequence.
          """
+        if self.query_sequence is None:
+            return None
          s = force_str(self.query_sequence)
          if self.is_reverse:
              s = s.translate(maketrans("ACGTacgtNnXx", "TGCAtgcaNnXx"))[::-1]
@@ -1989,7 +1998,7 @@ cdef class AlignedSegment:
          for k from 0 <= k < pysam_get_n_cigar(src):
              op = cigar_p[k] & BAM_CIGAR_MASK
              l = cigar_p[k] >> BAM_CIGAR_SHIFT
-            if op == BAM_CMATCH:
+            if op == BAM_CMATCH or op == BAM_CEQUAL or op == BAM_CDIFF:
                  result.append((pos, pos + l))
                  pos += l
              elif op == BAM_CDEL or op == BAM_CREF_SKIP:
@@ -2021,11 +2030,11 @@ cdef class AlignedSegment:
              op = cigar_p[k] & BAM_CIGAR_MASK
              l = cigar_p[k] >> BAM_CIGAR_SHIFT
  
-            if op == BAM_CMATCH:
+            if op == BAM_CMATCH or op == BAM_CEQUAL or op == BAM_CDIFF:
                  o = min( pos + l, end) - max( pos, start )
                  if o > 0: overlap += o
  
-            if op == BAM_CMATCH or op == BAM_CDEL or op == BAM_CREF_SKIP:
+            if op == BAM_CMATCH or op == BAM_CDEL or op == BAM_CREF_SKIP or op == BAM_CEQUAL or op == BAM_CDIFF:
                  pos += l
  
          return overlap
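The hunks above make `=` (BAM_CEQUAL) and `X` (BAM_CDIFF) behave like `M` (BAM_CMATCH) when walking a CIGAR. A minimal pure-Python sketch of that rule follows, assuming the numeric operation codes from the SAM specification. `aligned_blocks` is a hypothetical stand-in for illustration, not the Cython method changed in this commit.

```python
# BAM CIGAR operation codes as defined in the SAM specification.
BAM_CMATCH, BAM_CINS, BAM_CDEL, BAM_CREF_SKIP = 0, 1, 2, 3
BAM_CSOFT_CLIP, BAM_CHARD_CLIP, BAM_CPAD = 4, 5, 6
BAM_CEQUAL, BAM_CDIFF = 7, 8

# Operations that align query bases to reference bases.
ALIGNED_OPS = {BAM_CMATCH, BAM_CEQUAL, BAM_CDIFF}
# Operations that advance the reference without aligned query bases.
REF_ONLY_OPS = {BAM_CDEL, BAM_CREF_SKIP}


def aligned_blocks(reference_start, cigartuples):
    """Return (start, end) reference intervals covered by aligned bases."""
    blocks = []
    pos = reference_start
    for op, length in cigartuples:
        if op in ALIGNED_OPS:
            blocks.append((pos, pos + length))
            pos += length
        elif op in REF_ONLY_OPS:
            pos += length
        # Insertions, clips and padding do not advance the reference.
    return blocks
```

With this rule, a read written as `10=5D3X2I4M` yields the same blocks as the equivalent `10M5D3M2I4M`.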
diff --git a/pysam/libcalignmentfile.pyx b/pysam/libcalignmentfile.pyx

index 739013dd85c5e053adac02265561558bc863e5e3..9e4f5a71c797b6a6e95aa241617070f25d937722 100644 (file)
--- a/pysam/libcalignmentfile.pyx
+++ b/pysam/libcalignmentfile.pyx
@@ -568,7 +568,7 @@ cdef class AlignmentFile(HTSFile):
      header=None, add_sq_text=False, check_header=True, check_sq=True,
      reference_filename=None, filename=None, index_filename=None,
      filepath_index=None, require_index=False, duplicate_filehandle=True,
-    ignore_truncation=False)
+    ignore_truncation=False, threads=1)
  
      A :term:`SAM`/:term:`BAM`/:term:`CRAM` formatted file.
  
@@ -712,12 +712,17 @@ cdef class AlignmentFile(HTSFile):
      format_options: list
          A list of key=value strings, as accepted by --input-fmt-option and
          --output-fmt-option in samtools.
+    threads: integer
+        Number of threads to use for compressing/decompressing BAM/CRAM files.
+        Setting threads to > 1 cannot be combined with `ignore_truncation`.
+        (Default=1)
      """
  
      def __cinit__(self, *args, **kwargs):
          self.htsfile = NULL
          self.filename = None
          self.mode = None
+        self.threads = 1
          self.is_stream = False
          self.is_remote = False
          self.index = NULL
@@ -783,7 +788,8 @@ cdef class AlignmentFile(HTSFile):
                referencelengths=None,
                duplicate_filehandle=True,
                ignore_truncation=False,
-              format_options=None):
+              format_options=None,
+              threads=1):
          '''open a sam, bam or cram formatted file.
  
          If _open is called on an existing file, the current file
@@ -795,7 +801,16 @@ cdef class AlignmentFile(HTSFile):
          cdef char *cindexname = NULL
          cdef char *cmode = NULL
          cdef bam_hdr_t * hdr = NULL
-        
+
+        if threads > 1 and ignore_truncation:
+           # This won't raise errors if reaching a truncated alignment,
+           # because bgzf_mt_reader in htslib does not deal with
+           # bgzf_mt_read_block returning non-zero values, contrary
+           # to bgzf_read (https://github.com/samtools/htslib/blob/1.7/bgzf.c#L888)
+           # Better to avoid this (for now) than to produce seemingly correct results.
+           raise ValueError('Cannot add extra threads when "ignore_truncation" is True')
+        self.threads = threads
+
          # for backwards compatibility:
          if referencenames is not None:
              reference_names = referencenames
@@ -861,9 +876,9 @@ cdef class AlignmentFile(HTSFile):
          if mode[0] == 'w':
              # open file for writing
  
-            if not (template or header or reference_names):
+            if not (template or header or text or (reference_names and reference_lengths)):
                  raise ValueError(
-                    "either supply options `template`, `header` or  both `reference_names` "
+                    "either supply options `template`, `header`, `text` or  both `reference_names` "
                      "and `reference_lengths` for writing")
              
              if template:
@@ -885,7 +900,6 @@ cdef class AlignmentFile(HTSFile):
              else:
                  raise ValueError("not enough information to construct header. Please provide template, "
                                   "header, text or reference_names/reference_lengths")
-            
              self.htsfile = self._open_htsfile()
  
              if self.htsfile == NULL:
@@ -901,8 +915,8 @@ cdef class AlignmentFile(HTSFile):
              # is given, the CRAM reference arrays will be built from
              # the @SQ header in the header
              if "c" in mode and reference_filename:
-                # note that fn_aux takes ownership, so create a copy
-                self.htsfile.fn_aux = strdup(self.reference_filename)
+                if (hts_set_fai_filename(self.htsfile, self.reference_filename) != 0):
+                    raise ValueError("failure when setting reference filename")
  
              # write header to htsfile
              if "b" in mode or "c" in mode or "h" in mode:
@@ -1005,10 +1019,10 @@ cdef class AlignmentFile(HTSFile):
                end=None):
          """fetch reads aligned in a :term:`region`.
  
-        See :meth:`AlignmentFile.parse_region` for more information
-        on genomic regions.  :term:`reference` and `end` are also accepted for
-        backward compatiblity as synonyms for :term:`contig` and `stop`,
-        respectively.
+        See :meth:`~pysam.HTSFile.parse_region` for more information
+        on how genomic regions can be specified. :term:`reference` and
+        `end` are also accepted for backward compatibility as synonyms
+        for :term:`contig` and `stop`, respectively.
  
          Without a `contig` or `region` all mapped reads in the file
          will be fetched. The reads will be returned ordered by reference
@@ -1016,18 +1030,12 @@ cdef class AlignmentFile(HTSFile):
          file. This mode of iteration still requires an index. If there is
          no index, use `until_eof=True`.
  
-        If only `reference` is set, all reads aligned to `reference`
+        If only `contig` is set, all reads aligned to `contig`
          will be fetched.
  
          A :term:`SAM` file does not allow random access. If `region`
          or `contig` are given, an exception is raised.
  
-        :class:`~pysam.FastaFile`
-        :class:`~pysam.IteratorRow`
-        :class:`~pysam.IteratorRow`
-        :class:`~IteratorRow`
-        :class:`IteratorRow`
-
          Parameters
          ----------
  
diff --git a/pysam/libcbcf.pyx b/pysam/libcbcf.pyx

index 5087dff9707d9d5861b97c8b462ce625f11afce9..b74ada136349da578919f58c976bc54cb9559623 100644 (file)
--- a/pysam/libcbcf.pyx
+++ b/pysam/libcbcf.pyx
@@ -3934,7 +3934,7 @@ cdef class TabixIterator(BaseIterator):
  
  cdef class VariantFile(HTSFile):
      """*(filename, mode=None, index_filename=None, header=None, drop_samples=False,
-    duplicate_filehandle=True, ignore_truncation=False)*
+    duplicate_filehandle=True, ignore_truncation=False, threads=1)*
  
      A :term:`VCF`/:term:`BCF` formatted file. The file is automatically
      opened.
@@ -3989,6 +3989,11 @@ cdef class VariantFile(HTSFile):
          appears to be truncated due to a missing EOF marker.  Only applies
          to bgzipped formats. (Default=False)
  
+    threads: integer
+        Number of threads to use for compressing/decompressing VCF/BCF files.
+        Setting threads to > 1 cannot be combined with `ignore_truncation`.
+        (Default=1)
+
      """
      def __cinit__(self, *args, **kwargs):
          self.htsfile = NULL
@@ -3998,6 +4003,7 @@ cdef class VariantFile(HTSFile):
          self.index          = None
          self.filename       = None
          self.mode           = None
+        self.threads        = 1
          self.index_filename = None
          self.is_stream      = False
          self.is_remote      = False
@@ -4106,6 +4112,7 @@ cdef class VariantFile(HTSFile):
  
          vars.filename       = self.filename
          vars.mode           = self.mode
+        vars.threads        = self.threads
          vars.index_filename = self.index_filename
          vars.drop_samples   = self.drop_samples
          vars.is_stream      = self.is_stream
@@ -4128,7 +4135,8 @@ cdef class VariantFile(HTSFile):
               VariantHeader header=None,
               drop_samples=False,
               duplicate_filehandle=True,
-             ignore_truncation=False):
+             ignore_truncation=False,
+             threads=1):
          """open a vcf/bcf file.
  
          If open is called on an existing VariantFile, the current file will be
@@ -4142,6 +4150,15 @@ cdef class VariantFile(HTSFile):
          cdef char *cindex_filename = NULL
          cdef char *cmode
  
+        if threads > 1 and ignore_truncation:
+            # This won't raise errors if reaching a truncated alignment,
+            # because bgzf_mt_reader in htslib does not deal with
+            # bgzf_mt_read_block returning non-zero values, contrary
+            # to bgzf_read (https://github.com/samtools/htslib/blob/1.7/bgzf.c#L888)
+            # Better to avoid this (for now) than to produce seemingly correct results.
+            raise ValueError('Cannot add extra threads when "ignore_truncation" is True')
+        self.threads = threads
+
          # close a previously opened file
          if self.is_open:
              self.close()
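Both `AlignmentFile._open` and `VariantFile.open` now reject `threads > 1` together with `ignore_truncation`. The guard can be sketched as a standalone function. `check_thread_options` is a hypothetical name used only for illustration; the returned value mirrors the `threads - 1` that the libchtslib.pyx hunk of this commit passes to `hts_set_threads`.

```python
def check_thread_options(threads=1, ignore_truncation=False):
    """Validate the option pair and return the number of extra threads.

    Raises ValueError for the combination this commit refuses, because
    a truncated file could otherwise be read without any error.
    """
    if threads > 1 and ignore_truncation:
        raise ValueError(
            'Cannot add extra threads when "ignore_truncation" is True')
    # Assumption: one thread is the caller itself, the rest are extra.
    return threads - 1
```

Note that `threads=1` with `ignore_truncation=True` remains valid, since no extra threads are requested.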
diff --git a/pysam/libchtslib.pxd b/pysam/libchtslib.pxd

index 119dab2f3bf06da3a1f2a5b19a86182935213699..8bcf399f36266ef4c2d07fb053a39d5392da5690 100644 (file)
--- a/pysam/libchtslib.pxd
+++ b/pysam/libchtslib.pxd
@@ -2593,6 +2593,7 @@ cdef class HTSFile(object):
  
      cdef readonly object  filename       # filename as supplied by user
      cdef readonly object  mode           # file opening mode
+    cdef readonly object  threads        # number of threads to use
      cdef readonly object  index_filename # filename of index, if supplied by user
  
      cdef readonly bint    is_stream      # Is htsfile a non-seekable stream
diff --git a/pysam/libchtslib.pyx b/pysam/libchtslib.pyx

index 7096a99052d31834dffb4e954693b0ac7958d4e6..040ad1fa8fe33e8d45e9a44fc060dbab9640dbf9 100644 (file)
--- a/pysam/libchtslib.pyx
+++ b/pysam/libchtslib.pyx
@@ -327,8 +327,10 @@ cdef class HTSFile(object):
      """
      Base class for HTS file types
      """
+
      def __cinit__(self, *args, **kwargs):
          self.htsfile = NULL
+        self.threads = 1
          self.duplicate_filehandle = True
  
      def close(self):
@@ -522,12 +524,16 @@ cdef class HTSFile(object):
      cdef htsFile *_open_htsfile(self) except? NULL:
          cdef char *cfilename
          cdef char *cmode = self.mode
-        cdef int fd, dup_fd
+        cdef int fd, dup_fd, threads
  
+        threads = self.threads - 1
          if isinstance(self.filename, bytes):
              cfilename = self.filename
              with nogil:
-                return hts_open(cfilename, cmode)
+                htsfile = hts_open(cfilename, cmode)
+                if htsfile != NULL:
+                    hts_set_threads(htsfile, threads)
+                return htsfile
          else:
              if isinstance(self.filename, int):
                  fd = self.filename
@@ -560,7 +566,10 @@ cdef class HTSFile(object):
              filename = encode_filename(filename)
              cfilename = filename
              with nogil:
-                return hts_hopen(hfile, cfilename, cmode)
+                htsfile = hts_hopen(hfile, cfilename, cmode)
+                if htsfile != NULL:
+                    hts_set_threads(htsfile, threads)
+                return htsfile
  
      def add_hts_options(self, format_options=None):
          """Given a list of key=value format option strings, add them to an open htsFile
@@ -582,12 +591,13 @@ cdef class HTSFile(object):
                      raise RuntimeError('An error occured while applying the requested format options')
                  hts_opt_free(opts)
  
-    def parse_region(self, contig=None, start=None, stop=None, region=None,tid=None,
+    def parse_region(self, contig=None, start=None, stop=None,
+                     region=None, tid=None,
                       reference=None, end=None):
          """parse alternative ways to specify a genomic region. A region can
          either be specified by :term:`contig`, `start` and
          `stop`. `start` and `stop` denote 0-based, half-open
-        intervals.  :term:`reference` and `end` are also accepted for
+        intervals. :term:`reference` and `end` are also accepted for
          backward compatiblity as synonyms for :term:`contig` and
          `stop`, respectively.
  
@@ -653,12 +663,14 @@ cdef class HTSFile(object):
  
          if region:
              region = force_str(region)
-            parts = re.split('[:-]', region)
-            contig = parts[0]
-            if len(parts) >= 2:
-                rstart = int(parts[1]) - 1
-            if len(parts) >= 3:
-                rstop = int(parts[2])
+            if ":" in region:
+                contig, coord = region.split(":")
+                parts = coord.split("-")
+                rstart = int(parts[0]) - 1
+                if len(parts) >= 2:
+                    rstop = int(parts[1])
+            else:
+                contig = region
  
          if tid is not None:
              if not self.is_valid_tid(tid):
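The region-string change above splits on `:` first and only then on `-`, so a dash inside the contig name no longer confuses the parser ([#673]). A minimal sketch of that order of operations follows. `split_region` is a hypothetical helper, not the pysam method; it returns `None` for missing coordinates and checks that a second coordinate field exists before reading it.

```python
def split_region(region):
    """Split 'contig[:start[-stop]]' into (contig, start, stop).

    start is converted from 1-based to 0-based; missing coordinates
    are returned as None. A region containing more than one ':' raises
    ValueError, as the two-way unpacking below cannot succeed.
    """
    if ":" not in region:
        return region, None, None
    contig, coord = region.split(":")
    parts = coord.split("-")
    start = int(parts[0]) - 1
    stop = int(parts[1]) if len(parts) >= 2 else None
    return contig, start, stop
```

For example, `split_region("chr1-alt:101-200")` keeps `chr1-alt` intact as the contig, whereas splitting on `[:-]` in one step would have cut it at the first dash.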
diff --git a/pysam/libctabix.pyx b/pysam/libctabix.pyx

index 10177ceaa0148697abe261192b16f8f5ab5680b2..e581b617eb7eef694fb31e54d15cb9b876a9e041 100644 (file)
--- a/pysam/libctabix.pyx
+++ b/pysam/libctabix.pyx
@@ -319,6 +319,11 @@ cdef class TabixFile:
  
          The encoding passed to the parser
  
+    threads: integer
+        Number of threads to use for decompressing Tabix files.
+        (Default=1)
+
+
      Raises
      ------
      
@@ -334,6 +339,7 @@ cdef class TabixFile:
                    parser=None,
                    index=None,
                    encoding="ascii",
+                  threads=1,
                    *args,
                    **kwargs ):
  
@@ -341,13 +347,15 @@ cdef class TabixFile:
          self.is_remote = False
          self.is_stream = False
          self.parser = parser
+        self.threads = threads
          self._open(filename, mode, index, *args, **kwargs)
          self.encoding = encoding
  
-    def _open( self, 
+    def _open( self,
                 filename,
                 mode='r',
                 index=None,
+               threads=1,
                ):
          '''open a :term:`tabix file` for reading.'''
  
@@ -357,6 +365,7 @@ cdef class TabixFile:
          if self.htsfile != NULL:
              self.close()
          self.htsfile = NULL
+        self.threads=threads
  
          filename_index = index or (filename + ".tbi")
          # encode all the strings to pass to tabix
@@ -400,7 +409,8 @@ cdef class TabixFile:
          The file is being re-opened.
          '''
          return TabixFile(self.filename,
-                         mode="r", 
+                         mode="r",
+                         threads=self.threads,
                           parser=self.parser,
                           index=self.filename_index,
                           encoding=self.encoding)
diff --git a/pysam/libcutils.pxd b/pysam/libcutils.pxd

index f2d0aeb895d0b0e7ec09eb7076ff4be34218c151..9e1cce1a8e64ad5cb7f847e640f85020d132baab 100644 (file)
--- a/pysam/libcutils.pxd
+++ b/pysam/libcutils.pxd
@@ -4,7 +4,7 @@
  cimport cython
  from cpython cimport array as c_array
  
-cpdef parse_region(reference=*, start=*, end=*, region=*)
+cpdef parse_region(contig=*, start=*, stop=*, region=*, reference=*, end=*)
  
  #########################################################################
  # Utility functions for quality string conversions
diff --git a/pysam/libcutils.pyx b/pysam/libcutils.pyx

index 66f9bf939b53f2f855ed5fe3757b171528b8b8c9..75989b1f39adafd9f504a5f282dc3c5566e8a190 100644 (file)
--- a/pysam/libcutils.pyx
+++ b/pysam/libcutils.pyx
@@ -164,13 +164,19 @@ cdef force_str(object s, encoding="ascii"):
          return s
  
  
-cpdef parse_region(reference=None,
+cpdef parse_region(contig=None,
                     start=None,
-                   end=None,
-                   region=None):
+                   stop=None,
+                   region=None,
+                   reference=None,
+                   end=None):
      """parse alternative ways to specify a genomic region. A region can
      either be specified by :term:`reference`, `start` and
      `end`. `start` and `end` denote 0-based, half-open intervals.
+    
+    :term:`reference` and `end` are also accepted for backward
+    compatibility as synonyms for :term:`contig` and `stop`,
+    respectively.
  
      Alternatively, a samtools :term:`region` string can be supplied.
  
@@ -195,6 +201,11 @@ cpdef parse_region(reference=None,
      cdef long long rstart
      cdef long long rend
  
+    if contig is not None:
+        reference = contig
+    if stop is not None:
+        end = stop
+
      rstart = 0
      rend = MAX_POS
      if start != None:
@@ -211,12 +222,14 @@ cpdef parse_region(reference=None,
  
      if region:
          region = force_str(region)
-        parts = re.split("[:-]", region)
-        reference = parts[0]
-        if len(parts) >= 2:
-            rstart = int(parts[1]) - 1
-        if len(parts) >= 3:
-            rend = int(parts[2])
+        if ":" in region:
+            reference, coord = region.split(":")
+            parts = coord.split("-")
+            rstart = int(parts[0]) - 1
+            if len(parts) >= 2:
+                rend = int(parts[1])
+        else:
+            reference = region
  
      if not reference:
          return None, 0, 0
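
The region-string handling in the hunk above can be sketched in plain Python. This is only an illustration of the intended behaviour (NEWS item #673, "permit dash in chromosome name of region string"), not the Cython code itself. The name `parse_region_string` and the value of `MAX_POS` are assumptions made for this sketch, and `rsplit(":", 1)` is a choice of the sketch rather than something the patch does.

```python
# Hypothetical stand-in for pysam's MAX_POS constant; the real value is
# defined inside pysam and may differ.
MAX_POS = 2 << 29


def parse_region_string(region):
    """Split 'contig[:start[-stop]]' into (contig, start, stop).

    Input coordinates are 1-based and inclusive, as in samtools region
    strings; the result is a 0-based, half-open interval.
    """
    start, stop = 0, MAX_POS
    if ":" in region:
        # Only the part after the colon is split on "-", so a dash in
        # the contig name is left alone.
        contig, coord = region.rsplit(":", 1)
        parts = coord.split("-")
        start = int(parts[0]) - 1
        # The stop coordinate is optional ("chr1:101" is valid).
        if len(parts) >= 2:
            stop = int(parts[1])
    else:
        contig = region
    return contig, start, stop
```

Splitting on the colon first is what makes names such as `HLA-A` usable: the old `re.split("[:-]", region)` would have cut the contig name at its dash.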
diff --git a/pysam/version.py b/pysam/version.py

index 8a68bcc6704d0e20f01346b9fccdd70ea2635f5f..b3f022fa7231c24d17ee3a711c1be0aea361e660 100644 (file)
--- a/pysam/version.py
+++ b/pysam/version.py
@@ -1,10 +1,10 @@
  # pysam versioning information
-__version__ = "0.14.1"
+__version__ = "0.15.0"
  
  # TODO: upgrade number
-__samtools_version__ = "1.7"
+__samtools_version__ = "1.9"
  
  # TODO: upgrade code and number
-__bcftools_version__ = "1.6"
+__bcftools_version__ = "1.9"
  
-__htslib_version__ = "1.7"
+__htslib_version__ = "1.9"
diff --git a/run_tests_travis.sh b/run_tests_travis.sh

deleted file mode 100755 (executable)

index b2659bc..0000000
--- a/run_tests_travis.sh
+++ /dev/null
@@ -1,121 +0,0 @@
-#!/usr/bin/env bash
-
-# test script for pysam.
-# The script performs the following tasks:
-# 1. Setup a conda environment and install dependencies via conda
-# 2. Build pysam via the conda recipe
-# 3. Build pysam via setup.py from repository
-# 4. Run tests on the setup.py version
-# 5. Additional build tests
-# 5.1 pip install with cython
-# 5.2 pip install without cython
-# 5.3 pip install without cython and without configure options
-
-pushd .
-
-WORKDIR=`pwd`
-
-#Install miniconda python
-if [ $TRAVIS_OS_NAME == "osx" ]; then
-       wget https://repo.continuum.io/miniconda/Miniconda3-latest-MacOSX-x86_64.sh -O Miniconda3.sh
-else
-       wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O Miniconda3.sh --no-check-certificate  # Default OS versions are old and have SSL / CERT issues
-fi
-
-bash Miniconda3.sh -b
-
-# Create a new conda environment with the target python version
-~/miniconda3/bin/conda install conda-build -y
-~/miniconda3/bin/conda create -q -y --name testenv python=$CONDA_PY cython numpy pytest psutil pip
-
-# activate testenv environment
-source ~/miniconda3/bin/activate testenv
-
-conda config --add channels r
-conda config --add channels defaults
-conda config --add channels conda-forge
-conda config --add channels bioconda
-
-# pin versions, so that tests do not fail when pysam/htslib out of step
-# add htslib dependencies
-conda install -y "samtools=1.7" "bcftools=1.6" "htslib=1.7" xz curl bzip2
-
-# Need to make C compiler and linker use the anaconda includes and libraries:
-export PREFIX=~/miniconda3/
-export CFLAGS="-I${PREFIX}/include -L${PREFIX}/lib"
-export HTSLIB_CONFIGURE_OPTIONS="--disable-libcurl --disable-lzma"
-
-samtools --version
-htslib --version
-bcftools --version
-
-# Try building conda recipe first
-~/miniconda3/bin/conda-build ci/conda-recipe/ --python=$CONDA_PY
-
-# install code from the repository via setup.py
-echo "installing via setup.py from repository"
-python setup.py install
-
-# create auxilliary data
-echo
-echo 'building test data'
-echo
-make -C tests/pysam_data
-make -C tests/cbcf_data
-
-# echo any limits that are in place
-ulimit -a
-
-# run tests
-pytest
-
-if [ $? != 0 ]; then
-    exit 1
-fi
-
-# build source tar-ball. Make sure to run 'build' target so that .pyx
-# files are cythonized.
-python setup.py build sdist
-
-if [ $? != 0 ]; then
-    exit 1
-fi
-
-# check for presence of config.h files
-echo "checking for presence of config.h files in tar-ball"
-tar -tvzf dist/pysam-*.tar.gz | grep "config.h$"
-
-if [ $? != 1 ]; then
-    echo "ERROR: found config.h in tar-ball"
-    tar -tvzf dist/pysam-*.tar.gz | grep "config.h%"
-    exit 1
-fi
-
-# test pip installation from tar-ball with cython
-echo "pip installing with cython"
-pip install --verbose --no-deps --no-binary=:all: dist/pysam-*.tar.gz
-
-if [ $? != 0 ]; then
-    exit 1
-fi
-
-# attempt pip installation without cython
-echo "pip installing without cython"
-~/miniconda3/bin/conda remove -y cython
-~/miniconda3/bin/conda list
-echo "python is" `which python`
-pip install --verbose --no-deps --no-binary=:all: --force-reinstall --upgrade dist/pysam-*.tar.gz
-
-if [ $? != 0 ]; then
-    exit 1
-fi
-
-# attempt pip installation without cython and without
-# command line options
-echo "pip installing without cython and no configure options"
-export HTSLIB_CONFIGURE_OPTIONS=""
-pip install --verbose --no-deps --no-binary=:all: --force-reinstall --upgrade dist/pysam-*.tar.gz
-
-if [ $? != 0 ]; then
-    exit 1
-fi
diff --git a/samtools/LICENSE b/samtools/LICENSE

index aeaae3cfc517406c81362bfaadc66173dfc996aa..68b8723aa342f3e9e519501a13648e8c3b077d4b 100644 (file)
--- a/samtools/LICENSE
+++ b/samtools/LICENSE
@@ -1,6 +1,6 @@
  The MIT/Expat License
  
-Copyright (C) 2008-2014 Genome Research Ltd.
+Copyright (C) 2008-2018 Genome Research Ltd.
  
  Permission is hereby granted, free of charge, to any person obtaining a copy
  of this software and associated documentation files (the "Software"), to deal
diff --git a/samtools/README b/samtools/README

index 7088f79253f45af6437c8731d5e6a12c70335f2a..48ddf8833bde581cb71e78cf682c9c7f02c47c10 100644 (file)
--- a/samtools/README
+++ b/samtools/README
@@ -9,7 +9,7 @@ Building samtools
  The typical simple case of building Samtools using the HTSlib bundled within
  this Samtools release tarball is done as follows:
  
-    cd .../samtools-1.7 # Within the unpacked release directory
+    cd .../samtools-1.9 # Within the unpacked release directory
      ./configure
      make
  
@@ -21,7 +21,7 @@ install samtools etc properly into a directory of your choosing.  Building for
  installation using the HTSlib bundled within this Samtools release tarball,
  and building the various HTSlib utilities such as bgzip is done as follows:
  
-    cd .../samtools-1.7 # Within the unpacked release directory
+    cd .../samtools-1.9 # Within the unpacked release directory
      ./configure --prefix=/path/to/location
      make all all-htslib
      make install install-htslib
@@ -30,6 +30,53 @@ You will likely wish to add /path/to/location/bin to your $PATH.
  
  See INSTALL for full building and installation instructions and details.
  
+Building with HTSlib plug-in support
+====================================
+
+Enabling plug-ins causes some parts of HTSlib to be built as separate modules.
+There are two advantages to this:
+
+ * The static library libhts.a has fewer dependencies, which makes linking
+   third-party code against it easier.
+
+ * It is possible to build extra plug-ins in addition to the ones that are
+   bundled with HTSlib.  For example, the hts-plugins repository
+   <https://github.com/samtools/htslib-plugins> includes a module that
+   allows direct access to files stored in an iRODS data management
+   repository (see <https://irods.org/>).
+
+To build with plug-ins, you need to use the --enable-plugins configure option
+as follows:
+
+    cd .../samtools-1.9 # Within the unpacked release directory
+    ./configure --enable-plugins --prefix=/path/to/location
+    make all all-htslib
+    make install install-htslib
+
+There are two other configure options that affect plug-ins.  These are:
+   --with-plugin-dir=DIR     plug-in installation location
+   --with-plugin-path=PATH   default plug-in search path
+
+The default for --with-plugin-dir is <prefix>/libexec/htslib.
+--with-plugin-path sets the built-in search path used to find the plug-ins.  By
+default this is the directory set by the --with-plugin-dir option.  Multiple
+directories should be separated by colons.
+
+Setting --with-plugin-path is useful if you want to run directly from
+the source distribution instead of installing the package.  In that case
+you can use:
+
+    cd .../samtools-1.9 # Within the unpacked release directory
+    ./configure --enable-plugins --with-plugin-path=$PWD/htslib-1.9
+    make all all-htslib
+
+It is possible to override the built-in search path using the HTS_PATH
+environment variable.  Directories should be separated by colons.  To
+include the built-in path, add an empty entry to HTS_PATH:
+
+   export HTS_PATH=:/my/path            # Search built-in path first
+   export HTS_PATH=/my/path:            # Search built-in path last
+   export HTS_PATH=/my/path1::/my/path2 # Search built-in path between others
  
  Using an optimised zlib library
  ===============================
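
The HTS_PATH rule described in the README hunk above (an empty entry stands for the built-in search path) can be modelled with a short Python sketch. It follows the README text only; it is not HTSlib's implementation, and the function name `expand_plugin_path` and the example directories are hypothetical.

```python
def expand_plugin_path(hts_path, builtin_path):
    """Return the ordered list of plug-in directories to search.

    hts_path     -- value of the HTS_PATH environment variable, or None
    builtin_path -- colon-separated path configured at build time
    """
    builtin_dirs = builtin_path.split(":")
    if not hts_path:
        # HTS_PATH unset or empty: only the built-in path is searched.
        return builtin_dirs
    dirs = []
    for entry in hts_path.split(":"):
        if entry == "":
            # An empty entry is replaced by the built-in path.
            dirs.extend(builtin_dirs)
        else:
            dirs.append(entry)
    return dirs
```

Applied to the three README examples, the built-in directory ends up first, last, or between the two user directories respectively.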
diff --git a/samtools/bam.c.pysam.c b/samtools/bam.c.pysam.c

index 982bf41064044f5c83d0f6cdd3e63f10b05ae688..a92d1153007d42352ee73dea42a8e8662ce70603 100644 (file)
--- a/samtools/bam.c.pysam.c
+++ b/samtools/bam.c.pysam.c
@@ -51,7 +51,7 @@ int bam_view1(const bam_header_t *header, const bam1_t *b)
      char *s = bam_format1(header, b);
      int ret = -1;
      if (!s) return -1;
-    if (fputs(s, samtools_stdout) & fputc('\n', samtools_stdout) != EOF) ret = 0;
+    if (samtools_puts(s) != EOF) ret = 0;
      free(s);
      return ret;
  }
diff --git a/samtools/bam.h b/samtools/bam.h

index d4df9377a1ccf7e4f22bee82c949e899041ee05a..b7058392944e8a72536498c59d0e1330ca874ddc 100644 (file)
--- a/samtools/bam.h
+++ b/samtools/bam.h
@@ -38,7 +38,7 @@ DEALINGS IN THE SOFTWARE.  */
    @copyright Genome Research Ltd.
   */
  
-#define BAM_VERSION "1.7"
+#define BAM_VERSION "1.9"
  
  #include <stdint.h>
  #include <stdlib.h>
diff --git a/samtools/bam2depth.c b/samtools/bam2depth.c

index b732e8e60ba6af9040ddf1e2f9811b5bab83b170..f8f132bfaed32372e6a40ab4c299611409245a69 100644 (file)
--- a/samtools/bam2depth.c
+++ b/samtools/bam2depth.c
@@ -81,7 +81,8 @@ static int usage() {
      fprintf(stderr, "   -b <bed>            list of positions or regions\n");
      fprintf(stderr, "   -f <list>           list of input BAM filenames, one per line [null]\n");
      fprintf(stderr, "   -l <int>            read length threshold (ignore reads shorter than <int>) [0]\n");
-    fprintf(stderr, "   -d/-m <int>         maximum coverage depth [8000]\n");  // the htslib's default
+    fprintf(stderr, "   -d/-m <int>         maximum coverage depth [8000]. If 0, depth is set to the maximum\n"
+                    "                       integer value, effectively removing any depth limit.\n");  // the htslib's default
      fprintf(stderr, "   -q <int>            base quality threshold [0]\n");
      fprintf(stderr, "   -Q <int>            mapping quality threshold [0]\n");
      fprintf(stderr, "   -r <chr:from-to>    region\n");
@@ -206,6 +207,8 @@ int main_depth(int argc, char *argv[])
      mplp = bam_mplp_init(n, read_bam, (void**)data); // initialization
      if (0 < max_depth)
          bam_mplp_set_maxcnt(mplp,max_depth);  // set maximum coverage depth
+    else if (!max_depth)
+        bam_mplp_set_maxcnt(mplp,INT_MAX);
      n_plp = calloc(n, sizeof(int)); // n_plp[i] is the number of covering reads from the i-th BAM
      plp = calloc(n, sizeof(bam_pileup1_t*)); // plp[i] points to the array of covering reads (internal in mplp)
      while ((ret=bam_mplp_auto(mplp, &tid, &pos, n_plp, plp)) > 0) { // come to the next covered position
@@ -252,7 +255,8 @@ int main_depth(int argc, char *argv[])
              for (j = 0; j < n_plp[i]; ++j) {
                  const bam_pileup1_t *p = plp[i] + j; // DON'T modfity plp[][] unless you really know
                  if (p->is_del || p->is_refskip) ++m; // having dels or refskips at tid:pos
-                else if (bam_get_qual(p->b)[p->qpos] < baseQ) ++m; // low base quality
+                else if (p->qpos < p->b->core.l_qseq &&
+                         bam_get_qual(p->b)[p->qpos] < baseQ) ++m; // low base quality
              }
              printf("\t%d", n_plp[i] - m); // this the depth to output
          }
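
The two behavioural changes to `samtools depth` in this file can be summarised in a small Python sketch. It assumes a 32-bit C `int` for `INT_MAX` and takes the default of 8000 from the usage text; the function names are hypothetical and exist only for this illustration.

```python
INT_MAX = 2**31 - 1  # assumption: C int is 32 bits wide


def effective_max_depth(max_depth, default=8000):
    """Map the -d/-m option value to the depth cap that is applied."""
    if max_depth > 0:
        return max_depth      # explicit cap
    if max_depth == 0:
        return INT_MAX        # 0 now means "effectively no limit"
    return default            # negative: the library default is kept


def is_low_quality(qual, qpos, base_q):
    """Bounds-checked base quality test, mirroring the new guard.

    The quality is only read when qpos lies inside the read, so a
    position past the end of the sequence is never treated as low
    quality (and never read out of bounds).
    """
    return qpos < len(qual) and qual[qpos] < base_q
```

For example, `effective_max_depth(0)` yields `INT_MAX`, while `is_low_quality([30, 10], 5, 20)` is `False` because position 5 is outside the two-base read.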
diff --git a/samtools/bam2depth.c.pysam.c b/samtools/bam2depth.c.pysam.c

index ebe60d5d5696f454d92b96e0e9488198f3eb19c5..ba6e7e0babbbd29d53a43d6cac2c2727ff686b9e 100644 (file)
--- a/samtools/bam2depth.c.pysam.c
+++ b/samtools/bam2depth.c.pysam.c
@@ -83,7 +83,8 @@ static int usage() {
      fprintf(samtools_stderr, "   -b <bed>            list of positions or regions\n");
      fprintf(samtools_stderr, "   -f <list>           list of input BAM filenames, one per line [null]\n");
      fprintf(samtools_stderr, "   -l <int>            read length threshold (ignore reads shorter than <int>) [0]\n");
-    fprintf(samtools_stderr, "   -d/-m <int>         maximum coverage depth [8000]\n");  // the htslib's default
+    fprintf(samtools_stderr, "   -d/-m <int>         maximum coverage depth [8000]. If 0, depth is set to the maximum\n"
+                    "                       integer value, effectively removing any depth limit.\n");  // the htslib's default
      fprintf(samtools_stderr, "   -q <int>            base quality threshold [0]\n");
      fprintf(samtools_stderr, "   -Q <int>            mapping quality threshold [0]\n");
      fprintf(samtools_stderr, "   -r <chr:from-to>    region\n");
@@ -208,6 +209,8 @@ int main_depth(int argc, char *argv[])
      mplp = bam_mplp_init(n, read_bam, (void**)data); // initialization
      if (0 < max_depth)
          bam_mplp_set_maxcnt(mplp,max_depth);  // set maximum coverage depth
+    else if (!max_depth)
+        bam_mplp_set_maxcnt(mplp,INT_MAX);
      n_plp = calloc(n, sizeof(int)); // n_plp[i] is the number of covering reads from the i-th BAM
      plp = calloc(n, sizeof(bam_pileup1_t*)); // plp[i] points to the array of covering reads (internal in mplp)
      while ((ret=bam_mplp_auto(mplp, &tid, &pos, n_plp, plp)) > 0) { // come to the next covered position
@@ -254,7 +257,8 @@ int main_depth(int argc, char *argv[])
              for (j = 0; j < n_plp[i]; ++j) {
                  const bam_pileup1_t *p = plp[i] + j; // DON'T modfity plp[][] unless you really know
                  if (p->is_del || p->is_refskip) ++m; // having dels or refskips at tid:pos
-                else if (bam_get_qual(p->b)[p->qpos] < baseQ) ++m; // low base quality
+                else if (p->qpos < p->b->core.l_qseq &&
+                         bam_get_qual(p->b)[p->qpos] < baseQ) ++m; // low base quality
              }
              fprintf(samtools_stdout, "\t%d", n_plp[i] - m); // this the depth to output
          }
diff --git a/samtools/bam_addrprg.c b/samtools/bam_addrprg.c

index 99a198d222ee09032a3041ae553a3b9f5ae07ce7..90c5d12176cc18ce155c9604ff456f8ab6d1c2e3 100644 (file)
--- a/samtools/bam_addrprg.c
+++ b/samtools/bam_addrprg.c
@@ -358,7 +358,9 @@ static void orphan_only_func(const state_t* state, bam1_t* file_read)
  }
  
  static bool init(const parsed_opts_t* opts, state_t** state_out) {
+    char output_mode[8] = "w";
      state_t* retval = (state_t*) calloc(1, sizeof(state_t));
+
      if (retval == NULL) {
          fprintf(stderr, "[init] Out of memory allocating state struct.\n");
          return false;
@@ -374,7 +376,9 @@ static bool init(const parsed_opts_t* opts, state_t** state_out) {
      retval->input_header = sam_hdr_read(retval->input_file);
  
      retval->output_header = bam_hdr_dup(retval->input_header);
-    retval->output_file = sam_open_format(opts->output_name == NULL?"-":opts->output_name, "w", &opts->ga.out);
+    if (opts->output_name) // File format auto-detection
+        sam_open_mode(output_mode + 1, opts->output_name, NULL);
+    retval->output_file = sam_open_format(opts->output_name == NULL?"-":opts->output_name, output_mode, &opts->ga.out);
  
      if (retval->output_file == NULL) {
          print_error_errno("addreplacerg", "could not create \"%s\"", opts->output_name);
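
The output-format auto-detection added to `addreplacerg` can be pictured as follows. This Python sketch reflects one reading of what `sam_open_mode` does with a `NULL` format argument (derive the mode suffix from the file extension); the extension table is an assumption and should be checked against the HTSlib sources. The name `output_mode_for` is hypothetical.

```python
# Assumed extension-to-mode-suffix mapping, not copied from HTSlib.
_EXT_TO_SUFFIX = {"bam": "b", "cram": "c", "sam": ""}


def output_mode_for(filename):
    """Build the mode string passed on to the file-open call."""
    mode = "w"
    # No output name means stdout ("-"); the mode stays plain "w".
    if filename is not None and "." in filename:
        ext = filename.rsplit(".", 1)[1]
        # Unknown extensions leave the mode untouched.
        mode += _EXT_TO_SUFFIX.get(ext, "")
    return mode
```

Under these assumptions `output_mode_for("out.bam")` gives `"wb"` and `output_mode_for("out.cram")` gives `"wc"`, whereas the previous code always opened the output with `"w"`.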
diff --git a/samtools/bam_addrprg.c.pysam.c b/samtools/bam_addrprg.c.pysam.c

index 6d65ccb5ca030b0b7fc80996a02da00e8d997f4f..25d0238fec0a7d09f9fd0865c537c4cff684bcce 100644 (file)
--- a/samtools/bam_addrprg.c.pysam.c
+++ b/samtools/bam_addrprg.c.pysam.c
@@ -360,7 +360,9 @@ static void orphan_only_func(const state_t* state, bam1_t* file_read)
  }
  
  static bool init(const parsed_opts_t* opts, state_t** state_out) {
+    char output_mode[8] = "w";
      state_t* retval = (state_t*) calloc(1, sizeof(state_t));
+
      if (retval == NULL) {
          fprintf(samtools_stderr, "[init] Out of memory allocating state struct.\n");
          return false;
@@ -376,7 +378,9 @@ static bool init(const parsed_opts_t* opts, state_t** state_out) {
      retval->input_header = sam_hdr_read(retval->input_file);
  
      retval->output_header = bam_hdr_dup(retval->input_header);
-    retval->output_file = sam_open_format(opts->output_name == NULL?"-":opts->output_name, "w", &opts->ga.out);
+    if (opts->output_name) // File format auto-detection
+        sam_open_mode(output_mode + 1, opts->output_name, NULL);
+    retval->output_file = sam_open_format(opts->output_name == NULL?"-":opts->output_name, output_mode, &opts->ga.out);
  
      if (retval->output_file == NULL) {
          print_error_errno("addreplacerg", "could not create \"%s\"", opts->output_name);
diff --git a/samtools/bam_cat.c b/samtools/bam_cat.c

index 95498ec369c25809eb996c3b7156facb799e3e99..14698762989d0fca06ae80e5b7e102eac19feae5 100644 (file)
--- a/samtools/bam_cat.c
+++ b/samtools/bam_cat.c
@@ -542,14 +542,14 @@ int main_cat(int argc, char *argv[])
              case 'h': {
                  samFile *fph = sam_open(optarg, "r");
                  if (fph == 0) {
-                    fprintf(stderr, "[%s] ERROR: fail to read the header from '%s'.\n", __func__, argv[1]);
+                    fprintf(stderr, "[%s] ERROR: fail to read the header from '%s'.\n", __func__, optarg);
                      return 1;
                  }
                  h = sam_hdr_read(fph);
                  if (h == NULL) {
                      fprintf(stderr,
-                            "[%s] ERROR: failed to read the header for '%s'.\n",
-                            __func__, argv[1]);
+                            "[%s] ERROR: failed to read the header from '%s'.\n",
+                            __func__, optarg);
                      return 1;
                  }
                  sam_close(fph);
diff --git a/samtools/bam_cat.c.pysam.c b/samtools/bam_cat.c.pysam.c

index 4cf5540ad92d2f280cfb9f68e38278832a26e560..97b14a24dd6280b7a913680534661bcf3e49b6fe 100644 (file)
--- a/samtools/bam_cat.c.pysam.c
+++ b/samtools/bam_cat.c.pysam.c
@@ -544,14 +544,14 @@ int main_cat(int argc, char *argv[])
              case 'h': {
                  samFile *fph = sam_open(optarg, "r");
                  if (fph == 0) {
-                    fprintf(samtools_stderr, "[%s] ERROR: fail to read the header from '%s'.\n", __func__, argv[1]);
+                    fprintf(samtools_stderr, "[%s] ERROR: fail to read the header from '%s'.\n", __func__, optarg);
                      return 1;
                  }
                  h = sam_hdr_read(fph);
                  if (h == NULL) {
                      fprintf(samtools_stderr,
-                            "[%s] ERROR: failed to read the header for '%s'.\n",
-                            __func__, argv[1]);
+                            "[%s] ERROR: failed to read the header from '%s'.\n",
+                            __func__, optarg);
                      return 1;
                  }
                  sam_close(fph);
diff --git a/samtools/bam_index.c b/samtools/bam_index.c

index 40b7e0fd20fb2934e82ada93627a5e0d04fdf34c..3b6d6b1d751009c829db77d641c2b72a28d8db0d 100644 (file)
--- a/samtools/bam_index.c
+++ b/samtools/bam_index.c
@@ -34,8 +34,10 @@ DEALINGS IN THE SOFTWARE.  */
  #define __STDC_FORMAT_MACROS
  #include <inttypes.h>
  #include <unistd.h>
+#include <getopt.h>
  
  #include "samtools.h"
+#include "sam_opts.h"
  
  #define BAM_LIDX_SHIFT    14
  
@@ -101,45 +103,144 @@ int bam_index(int argc, char *argv[])
      return EXIT_FAILURE;
  }
  
+/*
+ * Cram indices do not contain mapped/unmapped record counts, so we have to
+ * decode each record and count.  However we can speed this up as much as
+ * possible by using the required fields parameter.
+ *
+ * This prints the stats to stdout in the same manner as the BAM function
+ * does.
+ *
+ * Returns 0 on success,
+ *        -1 on failure.
+ */
+int slow_idxstats(samFile *fp, bam_hdr_t *header) {
+    int ret, last_tid = -2;
+    bam1_t *b = bam_init1();
+
+    if (hts_set_opt(fp, CRAM_OPT_REQUIRED_FIELDS, SAM_RNAME | SAM_FLAG))
+        return -1;
+
+    uint64_t (*count0)[2] = calloc(header->n_targets+1, sizeof(*count0));
+    uint64_t (*counts)[2] = count0+1;
+    if (!count0)
+        return -1;
+
+    while ((ret = sam_read1(fp, header, b)) >= 0) {
+        if (b->core.tid >= header->n_targets || b->core.tid < -1) {
+            free(count0);
+            return -1;
+        }
+
+        if (b->core.tid != last_tid) {
+            if (last_tid >= -1) {
+                if (counts[b->core.tid][0] + counts[b->core.tid][1]) {
+                    print_error("idxstats", "file is not position sorted");
+                    free(count0);
+                    return -1;
+                }
+            }
+            last_tid = b->core.tid;
+        }
+
+        counts[b->core.tid][(b->core.flag & BAM_FUNMAP) ? 1 : 0]++;
+    }
+
+    if (ret == -1) {
+        int i;
+        for (i = 0; i < header->n_targets; i++) {
+            printf("%s\t%d\t%"PRIu64"\t%"PRIu64"\n",
+                   header->target_name[i],
+                   header->target_len[i],
+                   counts[i][0], counts[i][1]);
+        }
+        printf("*\t0\t%"PRIu64"\t%"PRIu64"\n", counts[-1][0], counts[-1][1]);
+    }
+
+    free(count0);
+
+    bam_destroy1(b);
+
+    return (ret == -1) ? 0 : -1;
+}
+
+static void usage_exit(FILE *fp, int exit_status)
+{
+    fprintf(fp, "Usage: samtools idxstats [options] <in.bam>\n");
+    sam_global_opt_help(fp, "-.---@");
+    exit(exit_status);
+}
+
  int bam_idxstats(int argc, char *argv[])
  {
      hts_idx_t* idx;
      bam_hdr_t* header;
      samFile* fp;
+    int c;
  
-    if (argc < 2) {
-        fprintf(stderr, "Usage: samtools idxstats <in.bam>\n");
-        return 1;
+    sam_global_args ga = SAM_GLOBAL_ARGS_INIT;
+    static const struct option lopts[] = {
+        SAM_OPT_GLOBAL_OPTIONS('-', 0, '-', '-', '-', '@'),
+        {NULL, 0, NULL, 0}
+    };
+
+    while ((c = getopt_long(argc, argv, "@:", lopts, NULL)) >= 0) {
+        switch (c) {
+        default:  if (parse_sam_global_opt(c, optarg, lopts, &ga) == 0) break;
+            /* else fall-through */
+        case '?':
+            usage_exit(stderr, EXIT_FAILURE);
+        }
      }
-    fp = sam_open(argv[1], "r");
+
+    if (argc != optind+1) {
+        if (argc == optind) usage_exit(stdout, EXIT_SUCCESS);
+        else usage_exit(stderr, EXIT_FAILURE);
+    }
+
+    fp = sam_open_format(argv[optind], "r", &ga.in);
      if (fp == NULL) {
-        print_error_errno("idxstats", "failed to open \"%s\"", argv[1]);
+        print_error_errno("idxstats", "failed to open \"%s\"", argv[optind]);
          return 1;
      }
      header = sam_hdr_read(fp);
      if (header == NULL) {
-        print_error("idxstats", "failed to read header for \"%s\"", argv[1]);
-        return 1;
-    }
-    idx = sam_index_load(fp, argv[1]);
-    if (idx == NULL) {
-        print_error("idxstats", "fail to load index for \"%s\"", argv[1]);
+        print_error("idxstats", "failed to read header for \"%s\"", argv[optind]);
          return 1;
      }
  
-    int i;
-    for (i = 0; i < header->n_targets; ++i) {
-        // Print out contig name and length
-        printf("%s\t%d", header->target_name[i], header->target_len[i]);
-        // Now fetch info about it from the meta bin
-        uint64_t u, v;
-        hts_idx_get_stat(idx, i, &u, &v);
-        printf("\t%" PRIu64 "\t%" PRIu64 "\n", u, v);
+    if (hts_get_format(fp)->format != bam) {
+    slow_method:
+        if (ga.nthreads)
+            hts_set_threads(fp, ga.nthreads);
+
+        if (slow_idxstats(fp, header) < 0) {
+            print_error("idxstats", "failed to process \"%s\"", argv[optind]);
+            return 1;
+        }
+    } else {
+        idx = sam_index_load(fp, argv[optind]);
+        if (idx == NULL) {
+            print_error("idxstats", "fail to load index for \"%s\", "
+                        "reverting to slow method", argv[optind]);
+            goto slow_method;
+        }
+
+        int i;
+        for (i = 0; i < header->n_targets; ++i) {
+            // Print out contig name and length
+            printf("%s\t%d", header->target_name[i], header->target_len[i]);
+            // Now fetch info about it from the meta bin
+            uint64_t u, v;
+            hts_idx_get_stat(idx, i, &u, &v);
+            printf("\t%" PRIu64 "\t%" PRIu64 "\n", u, v);
+        }
+        // Dump information about unmapped reads
+        printf("*\t0\t0\t%" PRIu64 "\n", hts_idx_get_n_no_coor(idx));
+        hts_idx_destroy(idx);
      }
-    // Dump information about unmapped reads
-    printf("*\t0\t0\t%" PRIu64 "\n", hts_idx_get_n_no_coor(idx));
+
      bam_hdr_destroy(header);
-    hts_idx_destroy(idx);
      sam_close(fp);
      return 0;
  }
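
The counting scheme of `slow_idxstats` above is easier to follow in a short Python sketch. Records are modelled as `(tid, flag)` tuples instead of BAM records, and the function returns rows instead of printing them; both are simplifications made for this illustration. `BAM_FUNMAP` is the standard 0x4 "read unmapped" flag bit.

```python
BAM_FUNMAP = 0x4  # SAM flag bit: read is unmapped


def slow_idxstats(records, target_names, target_lens):
    """Count mapped/unmapped records per reference by scanning them all.

    records -- iterable of (tid, flag); tid -1 means "no reference"
    Returns rows of (name, length, n_mapped, n_unmapped), ending with
    the "*" row for records without a reference.
    """
    n = len(target_names)
    # One extra slot at index 0 holds tid == -1, like count0/counts in C.
    counts = [[0, 0] for _ in range(n + 1)]
    last_tid = -2
    for tid, flag in records:
        if tid >= n or tid < -1:
            raise ValueError("tid out of range")
        if tid != last_tid:
            # Seeing a reference again after leaving it means the input
            # is not position sorted.
            if last_tid >= -1 and sum(counts[tid + 1]):
                raise ValueError("file is not position sorted")
            last_tid = tid
        counts[tid + 1][1 if flag & BAM_FUNMAP else 0] += 1
    rows = [
        (target_names[i], target_lens[i], counts[i + 1][0], counts[i + 1][1])
        for i in range(n)
    ]
    rows.append(("*", 0, counts[0][0], counts[0][1]))
    return rows
```

In the C code this path is taken for any non-BAM input and also as a fallback when a BAM index cannot be loaded; otherwise the counts are read directly from the index.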
diff --git a/samtools/bam_index.c.pysam.c b/samtools/bam_index.c.pysam.c

index e13d453a332a82feac03e50ef689e9e021b16452..9a96d143aca5b95e719c9f88841c3be5ee7589a0 100644 (file)
--- a/samtools/bam_index.c.pysam.c
+++ b/samtools/bam_index.c.pysam.c
@@ -36,8 +36,10 @@ DEALINGS IN THE SOFTWARE.  */
  #define __STDC_FORMAT_MACROS
  #include <inttypes.h>
  #include <unistd.h>
+#include <getopt.h>
  
  #include "samtools.h"
+#include "sam_opts.h"
  
  #define BAM_LIDX_SHIFT    14
  
@@ -103,45 +105,144 @@ int bam_index(int argc, char *argv[])
      return EXIT_FAILURE;
  }
  
+/*
+ * Cram indices do not contain mapped/unmapped record counts, so we have to
+ * decode each record and count.  However we can speed this up as much as
+ * possible by using the required fields parameter.
+ *
+ * This prints the stats to samtools_stdout in the same manner as the BAM function
+ * does.
+ *
+ * Returns 0 on success,
+ *        -1 on failure.
+ */
+int slow_idxstats(samFile *fp, bam_hdr_t *header) {
+    int ret, last_tid = -2;
+    bam1_t *b = bam_init1();
+
+    if (hts_set_opt(fp, CRAM_OPT_REQUIRED_FIELDS, SAM_RNAME | SAM_FLAG))
+        return -1;
+
+    uint64_t (*count0)[2] = calloc(header->n_targets+1, sizeof(*count0));
+    uint64_t (*counts)[2] = count0+1;
+    if (!count0)
+        return -1;
+
+    while ((ret = sam_read1(fp, header, b)) >= 0) {
+        if (b->core.tid >= header->n_targets || b->core.tid < -1) {
+            free(count0);
+            return -1;
+        }
+
+        if (b->core.tid != last_tid) {
+            if (last_tid >= -1) {
+                if (counts[b->core.tid][0] + counts[b->core.tid][1]) {
+                    print_error("idxstats", "file is not position sorted");
+                    free(count0);
+                    return -1;
+                }
+            }
+            last_tid = b->core.tid;
+        }
+
+        counts[b->core.tid][(b->core.flag & BAM_FUNMAP) ? 1 : 0]++;
+    }
+
+    if (ret == -1) {
+        int i;
+        for (i = 0; i < header->n_targets; i++) {
+            fprintf(samtools_stdout, "%s\t%d\t%"PRIu64"\t%"PRIu64"\n",
+                   header->target_name[i],
+                   header->target_len[i],
+                   counts[i][0], counts[i][1]);
+        }
+        fprintf(samtools_stdout, "*\t0\t%"PRIu64"\t%"PRIu64"\n", counts[-1][0], counts[-1][1]);
+    }
+
+    free(count0);
+
+    bam_destroy1(b);
+
+    return (ret == -1) ? 0 : -1;
+}
+
+static void usage_exit(FILE *fp, int exit_status)
+{
+    fprintf(fp, "Usage: samtools idxstats [options] <in.bam>\n");
+    sam_global_opt_help(fp, "-.---@");
+    exit(exit_status);
+}
+
  int bam_idxstats(int argc, char *argv[])
  {
      hts_idx_t* idx;
      bam_hdr_t* header;
      samFile* fp;
+    int c;
  
-    if (argc < 2) {
-        fprintf(samtools_stderr, "Usage: samtools idxstats <in.bam>\n");
-        return 1;
+    sam_global_args ga = SAM_GLOBAL_ARGS_INIT;
+    static const struct option lopts[] = {
+        SAM_OPT_GLOBAL_OPTIONS('-', 0, '-', '-', '-', '@'),
+        {NULL, 0, NULL, 0}
+    };
+
+    while ((c = getopt_long(argc, argv, "@:", lopts, NULL)) >= 0) {
+        switch (c) {
+        default:  if (parse_sam_global_opt(c, optarg, lopts, &ga) == 0) break;
+            /* else fall-through */
+        case '?':
+            usage_exit(samtools_stderr, EXIT_FAILURE);
+        }
      }
-    fp = sam_open(argv[1], "r");
+
+    if (argc != optind+1) {
+        if (argc == optind) usage_exit(samtools_stdout, EXIT_SUCCESS);
+        else usage_exit(samtools_stderr, EXIT_FAILURE);
+    }
+
+    fp = sam_open_format(argv[optind], "r", &ga.in);
      if (fp == NULL) {
-        print_error_errno("idxstats", "failed to open \"%s\"", argv[1]);
+        print_error_errno("idxstats", "failed to open \"%s\"", argv[optind]);
          return 1;
      }
      header = sam_hdr_read(fp);
      if (header == NULL) {
-        print_error("idxstats", "failed to read header for \"%s\"", argv[1]);
-        return 1;
-    }
-    idx = sam_index_load(fp, argv[1]);
-    if (idx == NULL) {
-        print_error("idxstats", "fail to load index for \"%s\"", argv[1]);
+        print_error("idxstats", "failed to read header for \"%s\"", argv[optind]);
          return 1;
      }
  
-    int i;
-    for (i = 0; i < header->n_targets; ++i) {
-        // Print out contig name and length
-        fprintf(samtools_stdout, "%s\t%d", header->target_name[i], header->target_len[i]);
-        // Now fetch info about it from the meta bin
-        uint64_t u, v;
-        hts_idx_get_stat(idx, i, &u, &v);
-        fprintf(samtools_stdout, "\t%" PRIu64 "\t%" PRIu64 "\n", u, v);
+    if (hts_get_format(fp)->format != bam) {
+    slow_method:
+        if (ga.nthreads)
+            hts_set_threads(fp, ga.nthreads);
+
+        if (slow_idxstats(fp, header) < 0) {
+            print_error("idxstats", "failed to process \"%s\"", argv[optind]);
+            return 1;
+        }
+    } else {
+        idx = sam_index_load(fp, argv[optind]);
+        if (idx == NULL) {
+            print_error("idxstats", "fail to load index for \"%s\", "
+                        "reverting to slow method", argv[optind]);
+            goto slow_method;
+        }
+
+        int i;
+        for (i = 0; i < header->n_targets; ++i) {
+            // Print out contig name and length
+            fprintf(samtools_stdout, "%s\t%d", header->target_name[i], header->target_len[i]);
+            // Now fetch info about it from the meta bin
+            uint64_t u, v;
+            hts_idx_get_stat(idx, i, &u, &v);
+            fprintf(samtools_stdout, "\t%" PRIu64 "\t%" PRIu64 "\n", u, v);
+        }
+        // Dump information about unmapped reads
+        fprintf(samtools_stdout, "*\t0\t0\t%" PRIu64 "\n", hts_idx_get_n_no_coor(idx));
+        hts_idx_destroy(idx);
      }
-    // Dump information about unmapped reads
-    fprintf(samtools_stdout, "*\t0\t0\t%" PRIu64 "\n", hts_idx_get_n_no_coor(idx));
+
      bam_hdr_destroy(header);
-    hts_idx_destroy(idx);
      sam_close(fp);
      return 0;
  }
diff --git a/samtools/bam_lpileup.c b/samtools/bam_lpileup.c

index cc7a75bdfd7ea0ce63d611da5a8f5dc2035da20c..e20cc92a5eb3ee97d5a1792bfddef495e6886512 100644 (file)
--- a/samtools/bam_lpileup.c
+++ b/samtools/bam_lpileup.c
@@ -29,7 +29,6 @@ DEALINGS IN THE SOFTWARE.  */
  #include <assert.h>
  #include "bam_plbuf.h"
  #include "bam_lpileup.h"
-#include "samtools.h"
  #include <htslib/ksort.h>
  
  #define TV_GAP 2
diff --git a/samtools/bam_lpileup.c.pysam.c b/samtools/bam_lpileup.c.pysam.c
index 8a1555cc5033ca3f28d491c8c65f4955246dc900..1719ffb8549117a90d57694b1b6d890b72d99099 100644
--- a/samtools/bam_lpileup.c.pysam.c
+++ b/samtools/bam_lpileup.c.pysam.c
@@ -31,7 +31,6 @@ DEALINGS IN THE SOFTWARE.  */
  #include <assert.h>
  #include "bam_plbuf.h"
  #include "bam_lpileup.h"
-#include "samtools.h"
  #include <htslib/ksort.h>
  
  #define TV_GAP 2
diff --git a/samtools/bam_markdup.c b/samtools/bam_markdup.c
index 21bf90aa995583ab8bcfe8f28ae3bbd8d95b0773..41d847519021e6db911f811543fddf81a6905664 100644
--- a/samtools/bam_markdup.c
+++ b/samtools/bam_markdup.c
@@ -1,7 +1,7 @@
  /*  bam_markdup.c -- Mark duplicates from a coord sorted file that has gone
                       through fixmates with the mate scoring option on.
  
-    Copyright (C) 2017 Genome Research Ltd.
+    Copyright (C) 2017-18 Genome Research Ltd.
  
      Author: Andrew Whitwham <aw7@sanger.ac.uk>
  
@@ -276,7 +276,7 @@ static int64_t get_mate_score(bam1_t *b) {
      if ((data = bam_aux_get(b, "ms"))) {
          score = bam_aux2i(data);
      } else {
-        fprintf(stderr, "[markdup] error: no ms score tag.\n");
+        fprintf(stderr, "[markdup] error: no ms score tag. Please run samtools fixmate on file first.\n");
          return -1;
      }
  
@@ -323,7 +323,7 @@ static int make_pair_key(key_data_t *key, bam1_t *bam) {
          other_end   = unclipped_other_end(bam->core.mpos, cig);
          other_coord = unclipped_other_start(bam->core.mpos, cig);
      } else {
-        fprintf(stderr, "[markdup] error: no MC tag.\n");
+        fprintf(stderr, "[markdup] error: no MC tag. Please run samtools fixmate on file first.\n");
          return 1;
      }
  
@@ -634,14 +634,14 @@ static int bam_mark_duplicates(samFile *in, samFile *out, char *prefix, int remo
                      bp = &kh_val(pair_hash, k);
  
                      if ((mate_tmp = get_mate_score(bp->p)) == -1) {
-                        fprintf(stderr, "[markdup] error: no ms score tag.\n");
+                        fprintf(stderr, "[markdup] error: no ms score tag. Please run samtools fixmate on file first.\n");
                          return 1;
                      } else {
                          old_score = calc_score(bp->p) + mate_tmp;
                      }
  
                      if ((mate_tmp = get_mate_score(in_read->b)) == -1) {
-                        fprintf(stderr, "[markdup] error: no ms score tag.\n");
+                        fprintf(stderr, "[markdup] error: no ms score tag. Please run samtools fixmate on file first.\n");
                          return 1;
                      } else {
                          new_score = calc_score(in_read->b) + mate_tmp;
diff --git a/samtools/bam_markdup.c.pysam.c b/samtools/bam_markdup.c.pysam.c
index ce621d3bffc35ebdafca7583d06b7f20cca5ac32..4b00967278f1096eba5422046de60affb4a07593 100644
--- a/samtools/bam_markdup.c.pysam.c
+++ b/samtools/bam_markdup.c.pysam.c
@@ -3,7 +3,7 @@
  /*  bam_markdup.c -- Mark duplicates from a coord sorted file that has gone
                       through fixmates with the mate scoring option on.
  
-    Copyright (C) 2017 Genome Research Ltd.
+    Copyright (C) 2017-18 Genome Research Ltd.
  
      Author: Andrew Whitwham <aw7@sanger.ac.uk>
  
@@ -278,7 +278,7 @@ static int64_t get_mate_score(bam1_t *b) {
      if ((data = bam_aux_get(b, "ms"))) {
          score = bam_aux2i(data);
      } else {
-        fprintf(samtools_stderr, "[markdup] error: no ms score tag.\n");
+        fprintf(samtools_stderr, "[markdup] error: no ms score tag. Please run samtools fixmate on file first.\n");
          return -1;
      }
  
@@ -325,7 +325,7 @@ static int make_pair_key(key_data_t *key, bam1_t *bam) {
          other_end   = unclipped_other_end(bam->core.mpos, cig);
          other_coord = unclipped_other_start(bam->core.mpos, cig);
      } else {
-        fprintf(samtools_stderr, "[markdup] error: no MC tag.\n");
+        fprintf(samtools_stderr, "[markdup] error: no MC tag. Please run samtools fixmate on file first.\n");
          return 1;
      }
  
@@ -636,14 +636,14 @@ static int bam_mark_duplicates(samFile *in, samFile *out, char *prefix, int remo
                      bp = &kh_val(pair_hash, k);
  
                      if ((mate_tmp = get_mate_score(bp->p)) == -1) {
-                        fprintf(samtools_stderr, "[markdup] error: no ms score tag.\n");
+                        fprintf(samtools_stderr, "[markdup] error: no ms score tag. Please run samtools fixmate on file first.\n");
                          return 1;
                      } else {
                          old_score = calc_score(bp->p) + mate_tmp;
                      }
  
                      if ((mate_tmp = get_mate_score(in_read->b)) == -1) {
-                        fprintf(samtools_stderr, "[markdup] error: no ms score tag.\n");
+                        fprintf(samtools_stderr, "[markdup] error: no ms score tag. Please run samtools fixmate on file first.\n");
                          return 1;
                      } else {
                          new_score = calc_score(in_read->b) + mate_tmp;
diff --git a/samtools/bam_mate.c b/samtools/bam_mate.c
index 1d6c55f3cb80f60bc5d185ece844dd539b4d58c2..2690b5ce302af451e0ed2bbb02ab44bb8634d8ee 100644
--- a/samtools/bam_mate.c
+++ b/samtools/bam_mate.c
@@ -254,7 +254,7 @@ static int bam_mating_core(samFile *in, samFile *out, int remove_reads, int prop
  {
      bam_hdr_t *header;
      bam1_t *b[2] = { NULL, NULL };
-    int curr, has_prev, pre_end = 0, cur_end = 0;
+    int curr, has_prev, pre_end = 0, cur_end = 0, result;
      kstring_t str;
  
      str.l = str.m = 0; str.s = 0;
@@ -280,7 +280,7 @@ static int bam_mating_core(samFile *in, samFile *out, int remove_reads, int prop
      b[0] = bam_init1();
      b[1] = bam_init1();
      curr = 0; has_prev = 0;
-    while (sam_read1(in, header, b[curr]) >= 0) {
+    while ((result = sam_read1(in, header, b[curr])) >= 0) {
          bam1_t *cur = b[curr], *pre = b[1-curr];
          if (cur->core.flag & BAM_FSECONDARY)
          {
@@ -365,6 +365,7 @@ static int bam_mating_core(samFile *in, samFile *out, int remove_reads, int prop
          curr = 1 - curr;
          pre_end = cur_end;
      }
+    if (result < -1) goto fail;
      if (has_prev && !remove_reads) { // If we still have a BAM in the buffer it must be unpaired
          bam1_t *pre = b[1-curr];
          if (pre->core.tid < 0 || pre->core.pos < 0 || pre->core.flag&BAM_FUNMAP) { // If unmapped
diff --git a/samtools/bam_mate.c.pysam.c b/samtools/bam_mate.c.pysam.c
index 57159cc7e7d68aac6c18222aecf9566a7b3a5fb1..cf938779fcc70232f104753217ca0363f09aed35 100644
--- a/samtools/bam_mate.c.pysam.c
+++ b/samtools/bam_mate.c.pysam.c
@@ -256,7 +256,7 @@ static int bam_mating_core(samFile *in, samFile *out, int remove_reads, int prop
  {
      bam_hdr_t *header;
      bam1_t *b[2] = { NULL, NULL };
-    int curr, has_prev, pre_end = 0, cur_end = 0;
+    int curr, has_prev, pre_end = 0, cur_end = 0, result;
      kstring_t str;
  
      str.l = str.m = 0; str.s = 0;
@@ -282,7 +282,7 @@ static int bam_mating_core(samFile *in, samFile *out, int remove_reads, int prop
      b[0] = bam_init1();
      b[1] = bam_init1();
      curr = 0; has_prev = 0;
-    while (sam_read1(in, header, b[curr]) >= 0) {
+    while ((result = sam_read1(in, header, b[curr])) >= 0) {
          bam1_t *cur = b[curr], *pre = b[1-curr];
          if (cur->core.flag & BAM_FSECONDARY)
          {
@@ -367,6 +367,7 @@ static int bam_mating_core(samFile *in, samFile *out, int remove_reads, int prop
          curr = 1 - curr;
          pre_end = cur_end;
      }
+    if (result < -1) goto fail;
      if (has_prev && !remove_reads) { // If we still have a BAM in the buffer it must be unpaired
          bam1_t *pre = b[1-curr];
          if (pre->core.tid < 0 || pre->core.pos < 0 || pre->core.flag&BAM_FUNMAP) { // If unmapped
diff --git a/samtools/bam_md.c b/samtools/bam_md.c
index f09503030a32e1e2df90053b6711c0527206f859..e9205c72693ab78501295771cd121d2f6c470bd2 100644
--- a/samtools/bam_md.c
+++ b/samtools/bam_md.c
@@ -46,7 +46,7 @@ DEALINGS IN THE SOFTWARE.  */
  
  int bam_aux_drop_other(bam1_t *b, uint8_t *s);
  
-void bam_fillmd1_core(bam1_t *b, char *ref, int ref_len, int flag, int max_nm)
+void bam_fillmd1_core(bam1_t *b, char *ref, int ref_len, int flag, int max_nm, int quiet_mode)
  {
      uint8_t *seq = bam_get_seq(b);
      uint32_t *cigar = bam_get_cigar(b);
@@ -116,7 +116,9 @@ void bam_fillmd1_core(bam1_t *b, char *ref, int ref_len, int flag, int max_nm)
          if (old_nm) old_nm_i = bam_aux2i(old_nm);
          if (!old_nm) bam_aux_append(b, "NM", 'i', 4, (uint8_t*)&nm);
          else if (nm != old_nm_i) {
-            fprintf(stderr, "[bam_fillmd1] different NM for read '%s': %d -> %d\n", bam_get_qname(b), old_nm_i, nm);
+            if (!quiet_mode) {
+                fprintf(stderr, "[bam_fillmd1] different NM for read '%s': %d -> %d\n", bam_get_qname(b), old_nm_i, nm);
+            }
              bam_aux_del(b, old_nm);
              bam_aux_append(b, "NM", 'i', 4, (uint8_t*)&nm);
          }
@@ -134,7 +136,9 @@ void bam_fillmd1_core(bam1_t *b, char *ref, int ref_len, int flag, int max_nm)
                  if (i < str->l) is_diff = 1;
              } else is_diff = 1;
              if (is_diff) {
-                fprintf(stderr, "[bam_fillmd1] different MD for read '%s': '%s' -> '%s'\n", bam_get_qname(b), old_md+1, str->s);
+                if (!quiet_mode) {
+                    fprintf(stderr, "[bam_fillmd1] different MD for read '%s': '%s' -> '%s'\n", bam_get_qname(b), old_md+1, str->s);
+                }
                  bam_aux_del(b, old_md);
                  bam_aux_append(b, "MD", 'Z', str->l + 1, (uint8_t*)str->s);
              }
@@ -156,20 +160,21 @@ void bam_fillmd1_core(bam1_t *b, char *ref, int ref_len, int flag, int max_nm)
      free(str->s); free(str);
  }
  
-void bam_fillmd1(bam1_t *b, char *ref, int flag)
+void bam_fillmd1(bam1_t *b, char *ref, int flag, int quiet_mode)
  {
-    bam_fillmd1_core(b, ref, INT_MAX, flag, 0);
+    bam_fillmd1_core(b, ref, INT_MAX, flag, 0, quiet_mode);
  }
  
  int calmd_usage() {
      fprintf(stderr,
-"Usage: samtools calmd [-eubrAES] <aln.bam> <ref.fasta>\n"
+"Usage: samtools calmd [-eubrAESQ] <aln.bam> <ref.fasta>\n"
  "Options:\n"
  "  -e       change identical bases to '='\n"
  "  -u       uncompressed BAM output (for piping)\n"
  "  -b       compressed BAM output\n"
  "  -S       ignored (input format is auto-detected)\n"
  "  -A       modify the quality string\n"
+"  -Q       use quiet mode to output less debug info to stdout\n"
  "  -r       compute the BQ tag (without -A) or cap baseQ by BAQ (with -A)\n"
  "  -E       extended BAQ for better sensitivity but lower specificity\n");
  
@@ -179,7 +184,7 @@ int calmd_usage() {
  
  int bam_fillmd(int argc, char *argv[])
  {
-    int c, flt_flag, tid = -2, ret, len, is_bam_out, is_uncompressed, max_nm, is_realn, capQ, baq_flag;
+    int c, flt_flag, tid = -2, ret, len, is_bam_out, is_uncompressed, max_nm, is_realn, capQ, baq_flag, quiet_mode;
      htsThreadPool p = {NULL, 0};
      samFile *fp = NULL, *fpout = NULL;
      bam_hdr_t *header = NULL;
@@ -194,9 +199,9 @@ int bam_fillmd(int argc, char *argv[])
      };
  
      flt_flag = UPDATE_NM | UPDATE_MD;
-    is_bam_out = is_uncompressed = is_realn = max_nm = capQ = baq_flag = 0;
+    is_bam_out = is_uncompressed = is_realn = max_nm = capQ = baq_flag = quiet_mode = 0;
      strcpy(mode_w, "w");
-    while ((c = getopt_long(argc, argv, "EqreuNhbSC:n:Ad@:", lopts, NULL)) >= 0) {
+    while ((c = getopt_long(argc, argv, "EqQreuNhbSC:n:Ad@:", lopts, NULL)) >= 0) {
          switch (c) {
          case 'r': is_realn = 1; break;
          case 'e': flt_flag |= USE_EQUAL; break;
@@ -211,6 +216,7 @@ int bam_fillmd(int argc, char *argv[])
          case 'C': capQ = atoi(optarg); break;
          case 'A': baq_flag |= 1; break;
          case 'E': baq_flag |= 2; break;
+        case 'Q': quiet_mode = 1; break;
          default:  if (parse_sam_global_opt(c, optarg, lopts, &ga) == 0) break;
              fprintf(stderr, "[bam_fillmd] unrecognized option '-%c'\n\n", c);
              /* else fall-through */
@@ -283,7 +289,7 @@ int bam_fillmd(int argc, char *argv[])
                  int q = sam_cap_mapq(b, ref, len, capQ);
                  if (b->core.qual > q) b->core.qual = q;
              }
-            if (ref) bam_fillmd1_core(b, ref, len, flt_flag, max_nm);
+            if (ref) bam_fillmd1_core(b, ref, len, flt_flag, max_nm, quiet_mode);
          }
          if (sam_write1(fpout, header, b) < 0) {
              print_error_errno("calmd", "failed to write to output file");
diff --git a/samtools/bam_md.c.pysam.c b/samtools/bam_md.c.pysam.c
index f266fe7b65df8e92b02a740970ea1bf1f4a42699..42c3a87da2879c044365e950366c5c4620674ee9 100644
--- a/samtools/bam_md.c.pysam.c
+++ b/samtools/bam_md.c.pysam.c
@@ -48,7 +48,7 @@ DEALINGS IN THE SOFTWARE.  */
  
  int bam_aux_drop_other(bam1_t *b, uint8_t *s);
  
-void bam_fillmd1_core(bam1_t *b, char *ref, int ref_len, int flag, int max_nm)
+void bam_fillmd1_core(bam1_t *b, char *ref, int ref_len, int flag, int max_nm, int quiet_mode)
  {
      uint8_t *seq = bam_get_seq(b);
      uint32_t *cigar = bam_get_cigar(b);
@@ -118,7 +118,9 @@ void bam_fillmd1_core(bam1_t *b, char *ref, int ref_len, int flag, int max_nm)
          if (old_nm) old_nm_i = bam_aux2i(old_nm);
          if (!old_nm) bam_aux_append(b, "NM", 'i', 4, (uint8_t*)&nm);
          else if (nm != old_nm_i) {
-            fprintf(samtools_stderr, "[bam_fillmd1] different NM for read '%s': %d -> %d\n", bam_get_qname(b), old_nm_i, nm);
+            if (!quiet_mode) {
+                fprintf(samtools_stderr, "[bam_fillmd1] different NM for read '%s': %d -> %d\n", bam_get_qname(b), old_nm_i, nm);
+            }
              bam_aux_del(b, old_nm);
              bam_aux_append(b, "NM", 'i', 4, (uint8_t*)&nm);
          }
@@ -136,7 +138,9 @@ void bam_fillmd1_core(bam1_t *b, char *ref, int ref_len, int flag, int max_nm)
                  if (i < str->l) is_diff = 1;
              } else is_diff = 1;
              if (is_diff) {
-                fprintf(samtools_stderr, "[bam_fillmd1] different MD for read '%s': '%s' -> '%s'\n", bam_get_qname(b), old_md+1, str->s);
+                if (!quiet_mode) {
+                    fprintf(samtools_stderr, "[bam_fillmd1] different MD for read '%s': '%s' -> '%s'\n", bam_get_qname(b), old_md+1, str->s);
+                }
                  bam_aux_del(b, old_md);
                  bam_aux_append(b, "MD", 'Z', str->l + 1, (uint8_t*)str->s);
              }
@@ -158,20 +162,21 @@ void bam_fillmd1_core(bam1_t *b, char *ref, int ref_len, int flag, int max_nm)
      free(str->s); free(str);
  }
  
-void bam_fillmd1(bam1_t *b, char *ref, int flag)
+void bam_fillmd1(bam1_t *b, char *ref, int flag, int quiet_mode)
  {
-    bam_fillmd1_core(b, ref, INT_MAX, flag, 0);
+    bam_fillmd1_core(b, ref, INT_MAX, flag, 0, quiet_mode);
  }
  
  int calmd_usage() {
      fprintf(samtools_stderr,
-"Usage: samtools calmd [-eubrAES] <aln.bam> <ref.fasta>\n"
+"Usage: samtools calmd [-eubrAESQ] <aln.bam> <ref.fasta>\n"
  "Options:\n"
  "  -e       change identical bases to '='\n"
  "  -u       uncompressed BAM output (for piping)\n"
  "  -b       compressed BAM output\n"
  "  -S       ignored (input format is auto-detected)\n"
  "  -A       modify the quality string\n"
+"  -Q       use quiet mode to output less debug info to samtools_stdout\n"
  "  -r       compute the BQ tag (without -A) or cap baseQ by BAQ (with -A)\n"
  "  -E       extended BAQ for better sensitivity but lower specificity\n");
  
@@ -181,7 +186,7 @@ int calmd_usage() {
  
  int bam_fillmd(int argc, char *argv[])
  {
-    int c, flt_flag, tid = -2, ret, len, is_bam_out, is_uncompressed, max_nm, is_realn, capQ, baq_flag;
+    int c, flt_flag, tid = -2, ret, len, is_bam_out, is_uncompressed, max_nm, is_realn, capQ, baq_flag, quiet_mode;
      htsThreadPool p = {NULL, 0};
      samFile *fp = NULL, *fpout = NULL;
      bam_hdr_t *header = NULL;
@@ -196,9 +201,9 @@ int bam_fillmd(int argc, char *argv[])
      };
  
      flt_flag = UPDATE_NM | UPDATE_MD;
-    is_bam_out = is_uncompressed = is_realn = max_nm = capQ = baq_flag = 0;
+    is_bam_out = is_uncompressed = is_realn = max_nm = capQ = baq_flag = quiet_mode = 0;
      strcpy(mode_w, "w");
-    while ((c = getopt_long(argc, argv, "EqreuNhbSC:n:Ad@:", lopts, NULL)) >= 0) {
+    while ((c = getopt_long(argc, argv, "EqQreuNhbSC:n:Ad@:", lopts, NULL)) >= 0) {
          switch (c) {
          case 'r': is_realn = 1; break;
          case 'e': flt_flag |= USE_EQUAL; break;
@@ -213,6 +218,7 @@ int bam_fillmd(int argc, char *argv[])
          case 'C': capQ = atoi(optarg); break;
          case 'A': baq_flag |= 1; break;
          case 'E': baq_flag |= 2; break;
+        case 'Q': quiet_mode = 1; break;
          default:  if (parse_sam_global_opt(c, optarg, lopts, &ga) == 0) break;
              fprintf(samtools_stderr, "[bam_fillmd] unrecognized option '-%c'\n\n", c);
              /* else fall-through */
@@ -285,7 +291,7 @@ int bam_fillmd(int argc, char *argv[])
                  int q = sam_cap_mapq(b, ref, len, capQ);
                  if (b->core.qual > q) b->core.qual = q;
              }
-            if (ref) bam_fillmd1_core(b, ref, len, flt_flag, max_nm);
+            if (ref) bam_fillmd1_core(b, ref, len, flt_flag, max_nm, quiet_mode);
          }
          if (sam_write1(fpout, header, b) < 0) {
              print_error_errno("calmd", "failed to write to output file");
diff --git a/samtools/bam_plcmd.c b/samtools/bam_plcmd.c
index d451ffdfad19d0672fe6f0f9b06ee5b8c8ebc32a..bd79d99effd91e516aed97c5c8739071184d6beb 100644
--- a/samtools/bam_plcmd.c
+++ b/samtools/bam_plcmd.c
@@ -115,6 +115,9 @@ static inline void pileup_seq(FILE *fp, const bam_pileup1_t *p, int pos, int ref
  #define MPLP_SMART_OVERLAPS (1<<12)
  #define MPLP_PRINT_QNAME (1<<13)
  
+#define MPLP_MAX_DEPTH 8000
+#define MPLP_MAX_INDEL_DEPTH 250
+
  void *bed_read(const char *fn);
  void bed_destroy(void *_h);
  int bed_overlap(const void *_h, const char *chr, int beg, int end);
@@ -381,8 +384,10 @@ static int mpileup(mplp_conf_t *conf, int n, char **fn)
              exit(EXIT_FAILURE);
          }
          bam_smpl_add(sm, fn[i], (conf->flag&MPLP_IGNORE_RG)? 0 : h_tmp->text);
-        // Collect read group IDs with PL (platform) listed in pl_list (note: fragile, strstr search)
-        rghash = bcf_call_add_rg(rghash, h_tmp->text, conf->pl_list);
+        if (conf->flag & MPLP_BCF) {
+            // Collect read group IDs with PL (platform) listed in pl_list (note: fragile, strstr search)
+            rghash = bcf_call_add_rg(rghash, h_tmp->text, conf->pl_list);
+        }
          if (conf->reg) {
              hts_idx_t *idx = sam_index_load(data[i]->fp, fn[i]);
              if (idx == NULL) {
@@ -409,17 +414,18 @@ static int mpileup(mplp_conf_t *conf, int n, char **fn)
              data[i]->h = h;
          }
      }
-    // allocate data storage proportionate to number of samples being studied sm->n
-    gplp.n = sm->n;
-    gplp.n_plp = calloc(sm->n, sizeof(int));
-    gplp.m_plp = calloc(sm->n, sizeof(int));
-    gplp.plp = calloc(sm->n, sizeof(bam_pileup1_t*));
-
      fprintf(stderr, "[%s] %d samples in %d input files\n", __func__, sm->n, n);
-    // write the VCF header
      if (conf->flag & MPLP_BCF)
      {
          const char *mode;
+        // allocate data storage proportionate to number of samples being studied sm->n
+        gplp.n = sm->n;
+        gplp.n_plp = calloc(sm->n, sizeof(int));
+        gplp.m_plp = calloc(sm->n, sizeof(int));
+        gplp.plp = calloc(sm->n, sizeof(bam_pileup1_t*));
+
+        // write the VCF header
+
          if ( conf->flag & MPLP_VCF )
              mode = (conf->flag&MPLP_NO_COMP)? "wu" : "wz";   // uncompressed VCF or compressed VCF
          else
@@ -554,13 +560,16 @@ static int mpileup(mplp_conf_t *conf, int n, char **fn)
      // init pileup
      iter = bam_mplp_init(n, mplp_func, (void**)data);
      if ( conf->flag & MPLP_SMART_OVERLAPS ) bam_mplp_init_overlaps(iter);
-    max_depth = conf->max_depth;
-    if (max_depth * sm->n > 1<<20)
-        fprintf(stderr, "(%s) Max depth is above 1M. Potential memory hog!\n", __func__);
-    if (max_depth * sm->n < 8000) {
-        max_depth = 8000 / sm->n;
-        fprintf(stderr, "<%s> Set max per-file depth to %d\n", __func__, max_depth);
+    if ( !conf->max_depth ) {
+        max_depth = INT_MAX;
+        fprintf(stderr, "[%s] Max depth set to maximum value (%d)\n", __func__, INT_MAX);
+    } else {
+        max_depth = conf->max_depth;
+        if ( max_depth * n > 1<<20 )
+            fprintf(stderr, "[%s] Combined max depth is above 1M. Potential memory hog!\n", __func__);
      }
+
+    // Only used when writing BCF
      max_indel_depth = conf->max_indel_depth * sm->n;
      bam_mplp_set_maxcnt(iter, max_depth);
      bcf1_t *bcf_rec = bcf_init1();
@@ -677,7 +686,9 @@ static int mpileup(mplp_conf_t *conf, int n, char **fn)
                          putc('\t', pileup_fp);
                          for (j = 0; j < n_plp[i]; ++j) {
                              const bam_pileup1_t *p = plp[i] + j;
-                            int c = bam_get_qual(p->b)[p->qpos];
+                            int c = p->qpos < p->b->core.l_qseq
+                                ? bam_get_qual(p->b)[p->qpos]
+                                : 0;
                              if ( c < conf->min_baseQ ) continue;
                              c = plp[i][j].b->core.qual + 33;
                              if (c > 126) c = 126;
@@ -692,7 +703,9 @@ static int mpileup(mplp_conf_t *conf, int n, char **fn)
                          putc('\t', pileup_fp);
                          for (j = 0; j < n_plp[i]; ++j) {
                              const bam_pileup1_t *p = plp[i] + j;
-                            int c = bam_get_qual(p->b)[p->qpos];
+                            int c = p->qpos < p->b->core.l_qseq
+                                ? bam_get_qual(p->b)[p->qpos]
+                                : 0;
                              if ( c < conf->min_baseQ ) continue;
  
                              if (n > 0) putc(',', pileup_fp);
@@ -707,7 +720,9 @@ static int mpileup(mplp_conf_t *conf, int n, char **fn)
                          putc('\t', pileup_fp);
                          for (j = 0; j < n_plp[i]; ++j) {
                              const bam_pileup1_t *p = &plp[i][j];
-                            int c = bam_get_qual(p->b)[p->qpos];
+                            int c = p->qpos < p->b->core.l_qseq
+                                ? bam_get_qual(p->b)[p->qpos]
+                                : 0;
                              if ( c < conf->min_baseQ ) continue;
  
                              if (n > 0) putc(',', pileup_fp);
@@ -910,46 +925,28 @@ static void print_usage(FILE *fp, const mplp_conf_t *mplp)
  "\n"
  "Output options:\n"
  "  -o, --output FILE       write output to FILE [standard output]\n"
-"  -g, --BCF               generate genotype likelihoods in BCF format\n"
-"  -v, --VCF               generate genotype likelihoods in VCF format\n"
-"\n"
-"Output options for mpileup format (without -g/-v):\n"
  "  -O, --output-BP         output base positions on reads\n"
  "  -s, --output-MQ         output mapping quality\n"
  "      --output-QNAME      output read names\n"
  "  -a                      output all positions (including zero depth)\n"
  "  -a -a (or -aa)          output absolutely all positions, including unused ref. sequences\n"
  "\n"
-"Output options for genotype likelihoods (when -g/-v is used):\n"
-"  -t, --output-tags LIST  optional tags to output:\n"
-"               DP,AD,ADF,ADR,SP,INFO/AD,INFO/ADF,INFO/ADR []\n"
-"  -u, --uncompressed      generate uncompressed VCF/BCF output\n"
-"\n"
-"SNP/INDEL genotype likelihoods options (effective with -g/-v):\n"
-"  -e, --ext-prob INT      Phred-scaled gap extension seq error probability [%d]\n", mplp->extQ);
-    fprintf(fp,
-"  -F, --gap-frac FLOAT    minimum fraction of gapped reads [%g]\n", mplp->min_frac);
-    fprintf(fp,
-"  -h, --tandem-qual INT   coefficient for homopolymer errors [%d]\n", mplp->tandemQ);
-    fprintf(fp,
-"  -I, --skip-indels       do not perform indel calling\n"
-"  -L, --max-idepth INT    maximum per-file depth for INDEL calling [%d]\n", mplp->max_indel_depth);
-    fprintf(fp,
-"  -m, --min-ireads INT    minimum number gapped reads for indel candidates [%d]\n", mplp->min_support);
-    fprintf(fp,
-"  -o, --open-prob INT     Phred-scaled gap open seq error probability [%d]\n", mplp->openQ);
-    fprintf(fp,
-"  -p, --per-sample-mF     apply -m and -F per-sample for increased sensitivity\n"
-"  -P, --platforms STR     comma separated list of platforms for indels [all]\n");
+"Generic options:\n");
      sam_global_opt_help(fp, "-.--.-");
-    fprintf(fp,
-"\n"
-"Notes: Assuming diploid individuals.\n");
+
+    fprintf(fp, "\n"
+"Note that using \"samtools mpileup\" to generate BCF or VCF files is now\n"
+"deprecated.  To output these formats, please use \"bcftools mpileup\" instead.\n");
  
      free(tmp_require);
      free(tmp_filter);
  }
  
+static void deprecated(char opt) {
+    fprintf(stderr, "[warning] samtools mpileup option `%c` is functional, "
+            "but deprecated. Please switch to using bcftools mpileup in future.\n", opt);
+}
+
  int bam_mpileup(int argc, char *argv[])
  {
      int c;
@@ -960,7 +957,8 @@ int bam_mpileup(int argc, char *argv[])
      memset(&mplp, 0, sizeof(mplp_conf_t));
      mplp.min_baseQ = 13;
      mplp.capQ_thres = 0;
-    mplp.max_depth = 250; mplp.max_indel_depth = 250;
+    mplp.max_depth = MPLP_MAX_DEPTH;
+    mplp.max_indel_depth = MPLP_MAX_INDEL_DEPTH;
      mplp.openQ = 40; mplp.extQ = 20; mplp.tandemQ = 100;
      mplp.min_frac = 0.002; mplp.min_support = 1;
      mplp.flag = MPLP_NO_ORPHAN | MPLP_REALN | MPLP_SMART_OVERLAPS;
@@ -1052,16 +1050,16 @@ int bam_mpileup(int argc, char *argv[])
                    mplp.bed = bed_read(optarg);
                    if (!mplp.bed) { print_error_errno("mpileup", "Could not read file \"%s\"", optarg); return 1; }
                    break;
-        case 'P': mplp.pl_list = strdup(optarg); break;
-        case 'p': mplp.flag |= MPLP_PER_SAMPLE; break;
-        case 'g': mplp.flag |= MPLP_BCF; break;
-        case 'v': mplp.flag |= MPLP_BCF | MPLP_VCF; break;
-        case 'u': mplp.flag |= MPLP_NO_COMP | MPLP_BCF; break;
+        case 'P': mplp.pl_list = strdup(optarg); deprecated(c); break;
+        case 'p': mplp.flag |= MPLP_PER_SAMPLE; deprecated(c); break;
+        case 'g': mplp.flag |= MPLP_BCF; deprecated(c); break;
+        case 'v': mplp.flag |= MPLP_BCF | MPLP_VCF; deprecated(c); break;
+        case 'u': mplp.flag |= MPLP_NO_COMP | MPLP_BCF; deprecated(c); break;
          case 'B': mplp.flag &= ~MPLP_REALN; break;
-        case 'D': mplp.fmt_flag |= B2B_FMT_DP; fprintf(stderr, "[warning] samtools mpileup option `-D` is functional, but deprecated. Please switch to `-t DP` in future.\n"); break;
-        case 'S': mplp.fmt_flag |= B2B_FMT_SP; fprintf(stderr, "[warning] samtools mpileup option `-S` is functional, but deprecated. Please switch to `-t SP` in future.\n"); break;
-        case 'V': mplp.fmt_flag |= B2B_FMT_DV; fprintf(stderr, "[warning] samtools mpileup option `-V` is functional, but deprecated. Please switch to `-t DV` in future.\n"); break;
-        case 'I': mplp.flag |= MPLP_NO_INDEL; break;
+        case 'D': mplp.fmt_flag |= B2B_FMT_DP; deprecated(c); break;
+        case 'S': mplp.fmt_flag |= B2B_FMT_SP; deprecated(c); break;
+        case 'V': mplp.fmt_flag |= B2B_FMT_DV; deprecated(c); break;
+        case 'I': mplp.flag |= MPLP_NO_INDEL; deprecated(c); break;
          case 'E': mplp.flag |= MPLP_REDO_BAQ; break;
          case '6': mplp.flag |= MPLP_ILLUMINA13; break;
          case 'R': mplp.flag |= MPLP_IGNORE_RG; break;
@@ -1075,28 +1073,34 @@ int bam_mpileup(int argc, char *argv[])
                  char *end;
                  long value = strtol(optarg, &end, 10);
                  // Distinguish between -o INT and -o FILE (a bit of a hack!)
-                if (*end == '\0') mplp.openQ = value;
-                else mplp.output_fname = optarg;
+                if (*end == '\0') {
+                    mplp.openQ = value;
+                    fprintf(stderr, "[warning] samtools mpileup option "
+                            "'--open-prob INT' is functional, but deprecated. "
+                            "Please switch to using bcftools mpileup in future.\n");
+                } else {
+                    mplp.output_fname = optarg;
+                }
              }
              break;
-        case 'e': mplp.extQ = atoi(optarg); break;
-        case 'h': mplp.tandemQ = atoi(optarg); break;
+        case 'e': mplp.extQ = atoi(optarg); deprecated(c); break;
+        case 'h': mplp.tandemQ = atoi(optarg); deprecated(c); break;
          case 'A': use_orphan = 1; break;
-        case 'F': mplp.min_frac = atof(optarg); break;
-        case 'm': mplp.min_support = atoi(optarg); break;
-        case 'L': mplp.max_indel_depth = atoi(optarg); break;
+        case 'F': mplp.min_frac = atof(optarg); deprecated(c); break;
+        case 'm': mplp.min_support = atoi(optarg); deprecated(c); break;
+        case 'L': mplp.max_indel_depth = atoi(optarg); deprecated(c); break;
          case 'G': {
                  FILE *fp_rg;
                  char buf[1024];
                  mplp.rghash = khash_str2int_init();
                  if ((fp_rg = fopen(optarg, "r")) == NULL)
-                    fprintf(stderr, "(%s) Fail to open file %s. Continue anyway.\n", __func__, optarg);
+                    fprintf(stderr, "[%s] Fail to open file %s. Continue anyway.\n", __func__, optarg);
                  while (!feof(fp_rg) && fscanf(fp_rg, "%s", buf) > 0) // this is not a good style, but forgive me...
                      khash_str2int_inc(mplp.rghash, strdup(buf));
                  fclose(fp_rg);
              }
              break;
-        case 't': mplp.fmt_flag |= parse_format_flag(optarg); break;
+        case 't': mplp.fmt_flag |= parse_format_flag(optarg); deprecated(c); break;
          case 'a': mplp.all++; break;
          default:
              if (parse_sam_global_opt(c, optarg, lopts, &mplp.ga) == 0) break;
diff --git a/samtools/bam_plcmd.c.pysam.c b/samtools/bam_plcmd.c.pysam.c
index 77999e65ff7dc1a1bd45000680a7355ee9502991..0d46cca351e4f36c7f2d47fd1cbb40ca864ab521 100644
--- a/samtools/bam_plcmd.c.pysam.c
+++ b/samtools/bam_plcmd.c.pysam.c
@@ -117,6 +117,9 @@ static inline void pileup_seq(FILE *fp, const bam_pileup1_t *p, int pos, int ref
  #define MPLP_SMART_OVERLAPS (1<<12)
  #define MPLP_PRINT_QNAME (1<<13)
  
+#define MPLP_MAX_DEPTH 8000
+#define MPLP_MAX_INDEL_DEPTH 250
+
  void *bed_read(const char *fn);
  void bed_destroy(void *_h);
  int bed_overlap(const void *_h, const char *chr, int beg, int end);
@@ -383,8 +386,10 @@ static int mpileup(mplp_conf_t *conf, int n, char **fn)
              exit(EXIT_FAILURE);
          }
          bam_smpl_add(sm, fn[i], (conf->flag&MPLP_IGNORE_RG)? 0 : h_tmp->text);
-        // Collect read group IDs with PL (platform) listed in pl_list (note: fragile, strstr search)
-        rghash = bcf_call_add_rg(rghash, h_tmp->text, conf->pl_list);
+        if (conf->flag & MPLP_BCF) {
+            // Collect read group IDs with PL (platform) listed in pl_list (note: fragile, strstr search)
+            rghash = bcf_call_add_rg(rghash, h_tmp->text, conf->pl_list);
+        }
          if (conf->reg) {
              hts_idx_t *idx = sam_index_load(data[i]->fp, fn[i]);
              if (idx == NULL) {
@@ -411,17 +416,18 @@ static int mpileup(mplp_conf_t *conf, int n, char **fn)
              data[i]->h = h;
          }
      }
-    // allocate data storage proportionate to number of samples being studied sm->n
-    gplp.n = sm->n;
-    gplp.n_plp = calloc(sm->n, sizeof(int));
-    gplp.m_plp = calloc(sm->n, sizeof(int));
-    gplp.plp = calloc(sm->n, sizeof(bam_pileup1_t*));
-
      fprintf(samtools_stderr, "[%s] %d samples in %d input files\n", __func__, sm->n, n);
-    // write the VCF header
      if (conf->flag & MPLP_BCF)
      {
          const char *mode;
+        // allocate data storage proportionate to number of samples being studied sm->n
+        gplp.n = sm->n;
+        gplp.n_plp = calloc(sm->n, sizeof(int));
+        gplp.m_plp = calloc(sm->n, sizeof(int));
+        gplp.plp = calloc(sm->n, sizeof(bam_pileup1_t*));
+
+        // write the VCF header
+
          if ( conf->flag & MPLP_VCF )
              mode = (conf->flag&MPLP_NO_COMP)? "wu" : "wz";   // uncompressed VCF or compressed VCF
          else
@@ -556,13 +562,16 @@ static int mpileup(mplp_conf_t *conf, int n, char **fn)
      // init pileup
      iter = bam_mplp_init(n, mplp_func, (void**)data);
      if ( conf->flag & MPLP_SMART_OVERLAPS ) bam_mplp_init_overlaps(iter);
-    max_depth = conf->max_depth;
-    if (max_depth * sm->n > 1<<20)
-        fprintf(samtools_stderr, "(%s) Max depth is above 1M. Potential memory hog!\n", __func__);
-    if (max_depth * sm->n < 8000) {
-        max_depth = 8000 / sm->n;
-        fprintf(samtools_stderr, "<%s> Set max per-file depth to %d\n", __func__, max_depth);
+    if ( !conf->max_depth ) {
+        max_depth = INT_MAX;
+        fprintf(samtools_stderr, "[%s] Max depth set to maximum value (%d)\n", __func__, INT_MAX);
+    } else {
+        max_depth = conf->max_depth;
+        if ( max_depth * n > 1<<20 )
+            fprintf(samtools_stderr, "[%s] Combined max depth is above 1M. Potential memory hog!\n", __func__);
      }
+
+    // Only used when writing BCF
      max_indel_depth = conf->max_indel_depth * sm->n;
      bam_mplp_set_maxcnt(iter, max_depth);
      bcf1_t *bcf_rec = bcf_init1();
@@ -679,7 +688,9 @@ static int mpileup(mplp_conf_t *conf, int n, char **fn)
                          putc('\t', pileup_fp);
                          for (j = 0; j < n_plp[i]; ++j) {
                              const bam_pileup1_t *p = plp[i] + j;
-                            int c = bam_get_qual(p->b)[p->qpos];
+                            int c = p->qpos < p->b->core.l_qseq
+                                ? bam_get_qual(p->b)[p->qpos]
+                                : 0;
                              if ( c < conf->min_baseQ ) continue;
                              c = plp[i][j].b->core.qual + 33;
                              if (c > 126) c = 126;
@@ -694,7 +705,9 @@ static int mpileup(mplp_conf_t *conf, int n, char **fn)
                          putc('\t', pileup_fp);
                          for (j = 0; j < n_plp[i]; ++j) {
                              const bam_pileup1_t *p = plp[i] + j;
-                            int c = bam_get_qual(p->b)[p->qpos];
+                            int c = p->qpos < p->b->core.l_qseq
+                                ? bam_get_qual(p->b)[p->qpos]
+                                : 0;
                              if ( c < conf->min_baseQ ) continue;
  
                              if (n > 0) putc(',', pileup_fp);
@@ -709,7 +722,9 @@ static int mpileup(mplp_conf_t *conf, int n, char **fn)
                          putc('\t', pileup_fp);
                          for (j = 0; j < n_plp[i]; ++j) {
                              const bam_pileup1_t *p = &plp[i][j];
-                            int c = bam_get_qual(p->b)[p->qpos];
+                            int c = p->qpos < p->b->core.l_qseq
+                                ? bam_get_qual(p->b)[p->qpos]
+                                : 0;
                              if ( c < conf->min_baseQ ) continue;
  
                              if (n > 0) putc(',', pileup_fp);
@@ -912,46 +927,28 @@ static void print_usage(FILE *fp, const mplp_conf_t *mplp)
  "\n"
  "Output options:\n"
  "  -o, --output FILE       write output to FILE [standard output]\n"
-"  -g, --BCF               generate genotype likelihoods in BCF format\n"
-"  -v, --VCF               generate genotype likelihoods in VCF format\n"
-"\n"
-"Output options for mpileup format (without -g/-v):\n"
  "  -O, --output-BP         output base positions on reads\n"
  "  -s, --output-MQ         output mapping quality\n"
  "      --output-QNAME      output read names\n"
  "  -a                      output all positions (including zero depth)\n"
  "  -a -a (or -aa)          output absolutely all positions, including unused ref. sequences\n"
  "\n"
-"Output options for genotype likelihoods (when -g/-v is used):\n"
-"  -t, --output-tags LIST  optional tags to output:\n"
-"               DP,AD,ADF,ADR,SP,INFO/AD,INFO/ADF,INFO/ADR []\n"
-"  -u, --uncompressed      generate uncompressed VCF/BCF output\n"
-"\n"
-"SNP/INDEL genotype likelihoods options (effective with -g/-v):\n"
-"  -e, --ext-prob INT      Phred-scaled gap extension seq error probability [%d]\n", mplp->extQ);
-    fprintf(fp,
-"  -F, --gap-frac FLOAT    minimum fraction of gapped reads [%g]\n", mplp->min_frac);
-    fprintf(fp,
-"  -h, --tandem-qual INT   coefficient for homopolymer errors [%d]\n", mplp->tandemQ);
-    fprintf(fp,
-"  -I, --skip-indels       do not perform indel calling\n"
-"  -L, --max-idepth INT    maximum per-file depth for INDEL calling [%d]\n", mplp->max_indel_depth);
-    fprintf(fp,
-"  -m, --min-ireads INT    minimum number gapped reads for indel candidates [%d]\n", mplp->min_support);
-    fprintf(fp,
-"  -o, --open-prob INT     Phred-scaled gap open seq error probability [%d]\n", mplp->openQ);
-    fprintf(fp,
-"  -p, --per-sample-mF     apply -m and -F per-sample for increased sensitivity\n"
-"  -P, --platforms STR     comma separated list of platforms for indels [all]\n");
+"Generic options:\n");
      sam_global_opt_help(fp, "-.--.-");
-    fprintf(fp,
-"\n"
-"Notes: Assuming diploid individuals.\n");
+
+    fprintf(fp, "\n"
+"Note that using \"samtools mpileup\" to generate BCF or VCF files is now\n"
+"deprecated.  To output these formats, please use \"bcftools mpileup\" instead.\n");
  
      free(tmp_require);
      free(tmp_filter);
  }
  
+static void deprecated(char opt) {
+    fprintf(samtools_stderr, "[warning] samtools mpileup option `%c` is functional, "
+            "but deprecated. Please switch to using bcftools mpileup in future.\n", opt);
+}
+
  int bam_mpileup(int argc, char *argv[])
  {
      int c;
@@ -962,7 +959,8 @@ int bam_mpileup(int argc, char *argv[])
      memset(&mplp, 0, sizeof(mplp_conf_t));
      mplp.min_baseQ = 13;
      mplp.capQ_thres = 0;
-    mplp.max_depth = 250; mplp.max_indel_depth = 250;
+    mplp.max_depth = MPLP_MAX_DEPTH;
+    mplp.max_indel_depth = MPLP_MAX_INDEL_DEPTH;
      mplp.openQ = 40; mplp.extQ = 20; mplp.tandemQ = 100;
      mplp.min_frac = 0.002; mplp.min_support = 1;
      mplp.flag = MPLP_NO_ORPHAN | MPLP_REALN | MPLP_SMART_OVERLAPS;
@@ -1054,16 +1052,16 @@ int bam_mpileup(int argc, char *argv[])
                    mplp.bed = bed_read(optarg);
                    if (!mplp.bed) { print_error_errno("mpileup", "Could not read file \"%s\"", optarg); return 1; }
                    break;
-        case 'P': mplp.pl_list = strdup(optarg); break;
-        case 'p': mplp.flag |= MPLP_PER_SAMPLE; break;
-        case 'g': mplp.flag |= MPLP_BCF; break;
-        case 'v': mplp.flag |= MPLP_BCF | MPLP_VCF; break;
-        case 'u': mplp.flag |= MPLP_NO_COMP | MPLP_BCF; break;
+        case 'P': mplp.pl_list = strdup(optarg); deprecated(c); break;
+        case 'p': mplp.flag |= MPLP_PER_SAMPLE; deprecated(c); break;
+        case 'g': mplp.flag |= MPLP_BCF; deprecated(c); break;
+        case 'v': mplp.flag |= MPLP_BCF | MPLP_VCF; deprecated(c); break;
+        case 'u': mplp.flag |= MPLP_NO_COMP | MPLP_BCF; deprecated(c); break;
          case 'B': mplp.flag &= ~MPLP_REALN; break;
-        case 'D': mplp.fmt_flag |= B2B_FMT_DP; fprintf(samtools_stderr, "[warning] samtools mpileup option `-D` is functional, but deprecated. Please switch to `-t DP` in future.\n"); break;
-        case 'S': mplp.fmt_flag |= B2B_FMT_SP; fprintf(samtools_stderr, "[warning] samtools mpileup option `-S` is functional, but deprecated. Please switch to `-t SP` in future.\n"); break;
-        case 'V': mplp.fmt_flag |= B2B_FMT_DV; fprintf(samtools_stderr, "[warning] samtools mpileup option `-V` is functional, but deprecated. Please switch to `-t DV` in future.\n"); break;
-        case 'I': mplp.flag |= MPLP_NO_INDEL; break;
+        case 'D': mplp.fmt_flag |= B2B_FMT_DP; deprecated(c); break;
+        case 'S': mplp.fmt_flag |= B2B_FMT_SP; deprecated(c); break;
+        case 'V': mplp.fmt_flag |= B2B_FMT_DV; deprecated(c); break;
+        case 'I': mplp.flag |= MPLP_NO_INDEL; deprecated(c); break;
          case 'E': mplp.flag |= MPLP_REDO_BAQ; break;
          case '6': mplp.flag |= MPLP_ILLUMINA13; break;
          case 'R': mplp.flag |= MPLP_IGNORE_RG; break;
@@ -1077,28 +1075,34 @@ int bam_mpileup(int argc, char *argv[])
                  char *end;
                  long value = strtol(optarg, &end, 10);
                  // Distinguish between -o INT and -o FILE (a bit of a hack!)
-                if (*end == '\0') mplp.openQ = value;
-                else mplp.output_fname = optarg;
+                if (*end == '\0') {
+                    mplp.openQ = value;
+                    fprintf(samtools_stderr, "[warning] samtools mpileup option "
+                            "'--open-prob INT' is functional, but deprecated. "
+                            "Please switch to using bcftools mpileup in future.\n");
+                } else {
+                    mplp.output_fname = optarg;
+                }
              }
              break;
-        case 'e': mplp.extQ = atoi(optarg); break;
-        case 'h': mplp.tandemQ = atoi(optarg); break;
+        case 'e': mplp.extQ = atoi(optarg); deprecated(c); break;
+        case 'h': mplp.tandemQ = atoi(optarg); deprecated(c); break;
          case 'A': use_orphan = 1; break;
-        case 'F': mplp.min_frac = atof(optarg); break;
-        case 'm': mplp.min_support = atoi(optarg); break;
-        case 'L': mplp.max_indel_depth = atoi(optarg); break;
+        case 'F': mplp.min_frac = atof(optarg); deprecated(c); break;
+        case 'm': mplp.min_support = atoi(optarg); deprecated(c); break;
+        case 'L': mplp.max_indel_depth = atoi(optarg); deprecated(c); break;
          case 'G': {
                  FILE *fp_rg;
                  char buf[1024];
                  mplp.rghash = khash_str2int_init();
                  if ((fp_rg = fopen(optarg, "r")) == NULL)
-                    fprintf(samtools_stderr, "(%s) Fail to open file %s. Continue anyway.\n", __func__, optarg);
+                    fprintf(samtools_stderr, "[%s] Fail to open file %s. Continue anyway.\n", __func__, optarg);
                  while (!feof(fp_rg) && fscanf(fp_rg, "%s", buf) > 0) // this is not a good style, but forgive me...
                      khash_str2int_inc(mplp.rghash, strdup(buf));
                  fclose(fp_rg);
              }
              break;
-        case 't': mplp.fmt_flag |= parse_format_flag(optarg); break;
+        case 't': mplp.fmt_flag |= parse_format_flag(optarg); deprecated(c); break;
          case 'a': mplp.all++; break;
          default:
              if (parse_sam_global_opt(c, optarg, lopts, &mplp.ga) == 0) break;
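Note on the `-o` handling in bam_mpileup() above: the option doubles as `--open-prob INT` and `--output FILE`, and the code tells the two apart by checking whether strtol() consumed the whole argument. A minimal standalone sketch of that test follows. The helper name `parse_int_or_name` is hypothetical and not part of samtools; only the `*end == '\0'` check is taken from the patch.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper (not in samtools): returns 1 and stores the parsed
 * number in *value when arg is consumed entirely as a base-10 integer,
 * and returns 0 when any trailing characters remain (treat as file name). */
static int parse_int_or_name(const char *arg, long *value)
{
    char *end;
    long v;

    assert(arg != NULL && value != NULL);
    v = strtol(arg, &end, 10);
    if (*end != '\0')
        return 0;   /* leftover characters: not a pure integer */
    *value = v;
    return 1;
}
```

One consequence of this test is that an empty argument also counts as an integer (strtol() converts nothing, leaves `end` at the start of the string, and returns 0), and a file whose name consists only of digits cannot be selected with `-o`.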
diff --git a/samtools/bam_sort.c b/samtools/bam_sort.c

index 509c1d96b0d69c0b123b0db8851479fa5b054b40..ee4a518007ecc4b36ff15fb5a0dd6eb74e844e1d 100644
--- a/samtools/bam_sort.c
+++ b/samtools/bam_sort.c
@@ -38,7 +38,6 @@ DEALINGS IN THE SOFTWARE.  */
  #include <getopt.h>
  #include <assert.h>
  #include <pthread.h>
-#include "htslib/bgzf.h"
  #include "htslib/ksort.h"
  #include "htslib/hts_os.h"
  #include "htslib/khash.h"
@@ -49,11 +48,15 @@ DEALINGS IN THE SOFTWARE.  */
  #include "samtools.h"
  
  
-// Struct which contains the a record, and the pointer to the sort tag (if any)
-// Used to speed up sort-by-tag.
+// Struct which contains the a record, and the pointer to the sort tag (if any) or
+// a combined ref / position / strand.
+// Used to speed up tag and position sorts.
  typedef struct bam1_tag {
      bam1_t *bam_record;
-    const uint8_t *tag;
+    union {
+        const uint8_t *tag;
+        uint64_t pos;
+    } u;
  } bam1_tag;
  
  /* Minimum memory required in megabytes before sort will attempt to run. This
@@ -123,6 +126,7 @@ static int strnum_cmp(const char *_a, const char *_b)
  
  typedef struct {
      int i;
+    uint32_t rev;
      uint64_t pos, idx;
      bam1_tag entry;
  } heap1_t;
@@ -150,6 +154,7 @@ static inline int heap_lt(const heap1_t a, const heap1_t b)
          if (fa != fb) return fa > fb;
      } else {
          if (a.pos != b.pos) return a.pos > b.pos;
+        if (a.rev != b.rev) return a.rev > b.rev;
      }
      // This compares by position in the input file(s)
      if (a.i != b.i) return a.i > b.i;
@@ -1361,24 +1366,25 @@ int bam_merge_core2(int by_qname, char* sort_tag, const char *out, const char *m
          int res;
          h->i = i;
          h->entry.bam_record = bam_init1();
-        h->entry.tag = NULL;
+        h->entry.u.tag = NULL;
          if (!h->entry.bam_record) goto mem_fail;
          res = iter[i] ? sam_itr_next(fp[i], iter[i], h->entry.bam_record) : sam_read1(fp[i], hdr[i], h->entry.bam_record);
          if (res >= 0) {
              bam_translate(h->entry.bam_record, translation_tbl + i);
-            h->pos = ((uint64_t)h->entry.bam_record->core.tid<<32) | (uint32_t)((int32_t)h->entry.bam_record->core.pos+1)<<1 | bam_is_rev(h->entry.bam_record);
+            h->pos = ((uint64_t)h->entry.bam_record->core.tid<<32) | (uint32_t)((int32_t)h->entry.bam_record->core.pos+1);
+            h->rev = bam_is_rev(h->entry.bam_record);
              h->idx = idx++;
              if (g_is_by_tag) {
-                h->entry.tag = bam_aux_get(h->entry.bam_record, g_sort_tag);
+                h->entry.u.tag = bam_aux_get(h->entry.bam_record, g_sort_tag);
              } else {
-                h->entry.tag = NULL;
+                h->entry.u.tag = NULL;
              }
          }
          else if (res == -1 && (!iter[i] || iter[i]->finished)) {
              h->pos = HEAP_EMPTY;
              bam_destroy1(h->entry.bam_record);
              h->entry.bam_record = NULL;
-            h->entry.tag = NULL;
+            h->entry.u.tag = NULL;
          } else {
              print_error(cmd, "failed to read first record from \"%s\"", fn[i]);
              goto fail;
@@ -1413,18 +1419,19 @@ int bam_merge_core2(int by_qname, char* sort_tag, const char *out, const char *m
          }
          if ((j = (iter[heap->i]? sam_itr_next(fp[heap->i], iter[heap->i], b) : sam_read1(fp[heap->i], hdr[heap->i], b))) >= 0) {
              bam_translate(b, translation_tbl + heap->i);
-            heap->pos = ((uint64_t)b->core.tid<<32) | (uint32_t)((int)b->core.pos+1)<<1 | bam_is_rev(b);
+            heap->pos = ((uint64_t)b->core.tid<<32) | (uint32_t)((int)b->core.pos+1);
+            heap->rev = bam_is_rev(b);
              heap->idx = idx++;
              if (g_is_by_tag) {
-                heap->entry.tag = bam_aux_get(heap->entry.bam_record, g_sort_tag);
+                heap->entry.u.tag = bam_aux_get(heap->entry.bam_record, g_sort_tag);
              } else {
-                heap->entry.tag = NULL;
+                heap->entry.u.tag = NULL;
              }
          } else if (j == -1 && (!iter[heap->i] || iter[heap->i]->finished)) {
              heap->pos = HEAP_EMPTY;
              bam_destroy1(heap->entry.bam_record);
              heap->entry.bam_record = NULL;
-            heap->entry.tag = NULL;
+            heap->entry.u.tag = NULL;
          } else {
              print_error(cmd, "\"%s\" is truncated", fn[heap->i]);
              goto fail;
@@ -1649,19 +1656,19 @@ static inline int heap_add_read(heap1_t *heap, int nfiles, samFile **fp,
      }
      if (res >= 0) {
          heap->pos = (((uint64_t)heap->entry.bam_record->core.tid<<32)
-                     | (uint32_t)((int32_t)heap->entry.bam_record->core.pos+1)<<1
-                     | bam_is_rev(heap->entry.bam_record));
+                     | (uint32_t)((int32_t)heap->entry.bam_record->core.pos+1));
+        heap->rev = bam_is_rev(heap->entry.bam_record);
          heap->idx = (*idx)++;
          if (g_is_by_tag) {
-            heap->entry.tag = bam_aux_get(heap->entry.bam_record, g_sort_tag);
+            heap->entry.u.tag = bam_aux_get(heap->entry.bam_record, g_sort_tag);
          } else {
-            heap->entry.tag = NULL;
+            heap->entry.u.tag = NULL;
          }
      } else if (res == -1) {
          heap->pos = HEAP_EMPTY;
          if (i < nfiles) bam_destroy1(heap->entry.bam_record);
          heap->entry.bam_record = NULL;
-        heap->entry.tag = NULL;
+        heap->entry.u.tag = NULL;
      } else {
          return -1;
      }
@@ -1716,7 +1723,7 @@ static int bam_merge_simple(int by_qname, char *sort_tag, const char *out,
  
          // Get a read into the heap
          h->i = i;
-        h->entry.tag = NULL;
+        h->entry.u.tag = NULL;
          if (i < n) {
              h->entry.bam_record = bam_init1();
              if (!h->entry.bam_record) goto mem_fail;
@@ -1734,7 +1741,7 @@ static int bam_merge_simple(int by_qname, char *sort_tag, const char *out,
          return -1;
      }
  
-    hts_set_threads(fpout, n_threads);
+    if (n_threads > 1) hts_set_threads(fpout, n_threads);
  
      if (sam_hdr_write(fpout, hout) != 0) {
          print_error_errno(cmd, "failed to write header to \"%s\"", out);
@@ -1804,8 +1811,14 @@ static inline int bam1_cmp_core(const bam1_tag a, const bam1_tag b)
          if (t != 0) return t;
          return (int) (a.bam_record->core.flag&0xc0) - (int) (b.bam_record->core.flag&0xc0);
      } else {
-        pa = (uint64_t)a.bam_record->core.tid<<32|(a.bam_record->core.pos+1)<<1|bam_is_rev(a.bam_record);
-        pb = (uint64_t)b.bam_record->core.tid<<32|(b.bam_record->core.pos+1)<<1|bam_is_rev(b.bam_record);
+        pa = (uint64_t)a.bam_record->core.tid<<32|(a.bam_record->core.pos+1);
+        pb = (uint64_t)b.bam_record->core.tid<<32|(b.bam_record->core.pos+1);
+
+        if (pa == pb) {
+            pa = bam_is_rev(a.bam_record);
+            pb = bam_is_rev(b.bam_record);
+        }
+
          return pa < pb ? -1 : (pa > pb ? 1 : 0);
      }
  }
@@ -1828,8 +1841,8 @@ uint8_t normalize_type(const uint8_t* aux) {
  // equal to or greater than b, respectively.
  static inline int bam1_cmp_by_tag(const bam1_tag a, const bam1_tag b)
  {
-    const uint8_t* aux_a = a.tag;
-    const uint8_t* aux_b = b.tag;
+    const uint8_t* aux_a = a.u.tag;
+    const uint8_t* aux_b = b.u.tag;
  
      if (aux_a == NULL && aux_b != NULL) {
          return -1;
@@ -1926,12 +1939,76 @@ static int write_buffer(const char *fn, const char *mode, size_t l, bam1_tag *bu
      return -1;
  }
  
+#define NUMBASE 256
+#define STEP 8
+
+static int ks_radixsort(size_t n, bam1_tag *buf, const bam_hdr_t *h)
+{
+    int curr = 0, ret = -1;
+    ssize_t i;
+    bam1_tag *buf_ar2[2], *bam_a, *bam_b;
+    uint64_t max_pos = 0, max_digit = 0, shift = 0;
+
+    for (i = 0; i < n; i++) {
+        bam1_t *b = buf[i].bam_record;
+        int32_t tid = b->core.tid == -1 ? h->n_targets : b->core.tid;
+        buf[i].u.pos = (uint64_t)tid<<32 | (b->core.pos+1)<<1 | bam_is_rev(b);
+        if (max_pos < buf[i].u.pos)
+            max_pos = buf[i].u.pos;
+    }
+
+    while (max_pos) {
+        ++max_digit;
+        max_pos = max_pos >> 1;
+    }
+
+    buf_ar2[0] = buf;
+    buf_ar2[1] = (bam1_tag *)malloc(sizeof(bam1_tag) * n);
+    if (buf_ar2[1] == NULL) {
+        print_error("sort", "couldn't allocate memory for temporary buf");
+        goto err;
+    }
+
+    while (shift < max_digit){
+        size_t remainders[NUMBASE] = { 0 };
+        bam_a = buf_ar2[curr]; bam_b = buf_ar2[1-curr];
+        for (i = 0; i < n; ++i)
+            remainders[(bam_a[i].u.pos >> shift) % NUMBASE]++;
+        for (i = 1; i < NUMBASE; ++i)
+            remainders[i] += remainders[i - 1];
+        for (i = n - 1; i >= 0; i--) {
+            size_t j = --remainders[(bam_a[i].u.pos >> shift) % NUMBASE];
+            bam_b[j] = bam_a[i];
+        }
+        shift += STEP;
+        curr = 1 - curr;
+    }
+    if (curr == 1) {
+        bam1_tag *end = buf + n;
+        bam_a = buf_ar2[0]; bam_b = buf_ar2[1];
+        while (bam_a < end) *bam_a++ = *bam_b++;
+    }
+
+    ret = 0;
+err:
+    free(buf_ar2[1]);
+    return ret;
+}
+
  static void *worker(void *data)
  {
      worker_t *w = (worker_t*)data;
      char *name;
      w->error = 0;
-    ks_mergesort(sort, w->buf_len, w->buf, 0);
+
+    if (!g_is_by_qname && !g_is_by_tag) {
+        if (ks_radixsort(w->buf_len, w->buf, w->h) < 0) {
+            w->error = errno;
+            return NULL;
+        }
+    } else {
+        ks_mergesort(sort, w->buf_len, w->buf, 0);
+    }
  
      if (w->no_save)
          return 0;
@@ -2138,11 +2215,12 @@ int bam_sort_core_ext(int is_by_qname, char* sort_by_tag, const char *fn, const
              mem_full = 1;
          }
  
-        // Pull out the pointer to the sort tag if applicable
+        // Pull out the value of the position
+        // or the pointer to the sort tag if applicable
          if (g_is_by_tag) {
-            buf[k].tag = bam_aux_get(buf[k].bam_record, g_sort_tag);
+            buf[k].u.tag = bam_aux_get(buf[k].bam_record, g_sort_tag);
          } else {
-            buf[k].tag = NULL;
+            buf[k].u.tag = NULL;
          }
          ++k;
  
@@ -2166,7 +2244,7 @@ int bam_sort_core_ext(int is_by_qname, char* sort_by_tag, const char *fn, const
          in_mem = calloc(n_threads > 0 ? n_threads : 1, sizeof(in_mem[0]));
          if (!in_mem) goto err;
          num_in_mem = sort_blocks(n_files, k, buf, prefix, header, n_threads,
-                                   in_mem);
+                                 in_mem);
          if (num_in_mem < 0) goto err;
      } else {
          num_in_mem = 0;
@@ -2174,7 +2252,6 @@ int bam_sort_core_ext(int is_by_qname, char* sort_by_tag, const char *fn, const
  
      // write the final output
      if (n_files == 0 && num_in_mem < 2) { // a single block
-        ks_mergesort(sort, k, buf, 0);
          if (write_buffer(fnout, modeout, k, buf, header, n_threads, out_fmt) != 0) {
              print_error_errno("sort", "failed to create \"%s\"", fnout);
              goto err;
@@ -2215,6 +2292,7 @@ int bam_sort_core_ext(int is_by_qname, char* sort_by_tag, const char *fn, const
      bam_destroy1(b);
      free(buf);
      free(bam_mem);
+    free(in_mem);
      bam_hdr_destroy(header);
      if (fp) sam_close(fp);
      return ret;
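Note on ks_radixsort() as added to bam_sort.c above: it is a least-significant-digit radix sort that processes the packed reference / position / strand key 8 bits at a time, using a stable counting pass per digit and two buffers that swap roles. The sketch below shows the same technique on plain 64-bit keys so it can be compiled on its own. The names `make_key` and `radix_sort_u64` are illustrative assumptions rather than samtools functions, and the sketch assumes `tid >= 0` (the patch maps unmapped reads to `n_targets` first) and `pos` between -1 and 2^31-2.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define RADIX_BITS 8
#define RADIX_SIZE (1u << RADIX_BITS)

/* Illustrative key layout, mirroring the one built in ks_radixsort():
 * reference id in the high 32 bits, (pos+1)<<1 | strand in the low 32. */
static uint64_t make_key(int32_t tid, int32_t pos, int rev)
{
    return ((uint64_t)(uint32_t)tid << 32)
         | ((uint64_t)(uint32_t)(pos + 1) << 1)
         | (uint64_t)(rev & 1);
}

/* Sorts keys[0..n-1] ascending; returns 0 on success, -1 if malloc fails. */
static int radix_sort_u64(uint64_t *keys, size_t n)
{
    uint64_t *tmp, *src, *dst, max_key = 0;
    unsigned shift;
    size_t i;

    if (n < 2) return 0;
    assert(keys != NULL);
    tmp = malloc(n * sizeof(*tmp));
    if (tmp == NULL) return -1;

    /* The largest key decides how many 8-bit digits need a pass. */
    for (i = 0; i < n; i++)
        if (keys[i] > max_key) max_key = keys[i];

    src = keys; dst = tmp;
    for (shift = 0; shift < 64 && (max_key >> shift) != 0; shift += RADIX_BITS) {
        size_t count[RADIX_SIZE] = { 0 };
        uint64_t *swap;

        /* Histogram of the current digit, then prefix sums give end offsets. */
        for (i = 0; i < n; i++)
            count[(src[i] >> shift) & (RADIX_SIZE - 1)]++;
        for (i = 1; i < RADIX_SIZE; i++)
            count[i] += count[i - 1];
        /* Walk backwards so equal digits keep their relative order. */
        for (i = n; i-- > 0; ) {
            size_t j = --count[(src[i] >> shift) & (RADIX_SIZE - 1)];
            dst[j] = src[i];
        }
        swap = src; src = dst; dst = swap;
    }

    /* After an odd number of passes the sorted data sits in tmp. */
    if (src != keys) memcpy(keys, src, n * sizeof(*keys));
    free(tmp);
    return 0;
}
```

Because the strand bit is the lowest bit of the key, forward-strand records sort before reverse-strand records at the same position, which matches the tie-break that the patch adds to heap_lt() and bam1_cmp_core().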
diff --git a/samtools/bam_sort.c.pysam.c b/samtools/bam_sort.c.pysam.c

index d38a311cfd978e9fc292097e84be1caf8b421657..50285721afb19de5343ad9932671675d40fe8ece 100644
--- a/samtools/bam_sort.c.pysam.c
+++ b/samtools/bam_sort.c.pysam.c
@@ -40,7 +40,6 @@ DEALINGS IN THE SOFTWARE.  */
  #include <getopt.h>
  #include <assert.h>
  #include <pthread.h>
-#include "htslib/bgzf.h"
  #include "htslib/ksort.h"
  #include "htslib/hts_os.h"
  #include "htslib/khash.h"
@@ -51,11 +50,15 @@ DEALINGS IN THE SOFTWARE.  */
  #include "samtools.h"
  
  
-// Struct which contains the a record, and the pointer to the sort tag (if any)
-// Used to speed up sort-by-tag.
+// Struct which contains the a record, and the pointer to the sort tag (if any) or
+// a combined ref / position / strand.
+// Used to speed up tag and position sorts.
  typedef struct bam1_tag {
      bam1_t *bam_record;
-    const uint8_t *tag;
+    union {
+        const uint8_t *tag;
+        uint64_t pos;
+    } u;
  } bam1_tag;
  
  /* Minimum memory required in megabytes before sort will attempt to run. This
@@ -125,6 +128,7 @@ static int strnum_cmp(const char *_a, const char *_b)
  
  typedef struct {
      int i;
+    uint32_t rev;
      uint64_t pos, idx;
      bam1_tag entry;
  } heap1_t;
@@ -152,6 +156,7 @@ static inline int heap_lt(const heap1_t a, const heap1_t b)
          if (fa != fb) return fa > fb;
      } else {
          if (a.pos != b.pos) return a.pos > b.pos;
+        if (a.rev != b.rev) return a.rev > b.rev;
      }
      // This compares by position in the input file(s)
      if (a.i != b.i) return a.i > b.i;
@@ -1363,24 +1368,25 @@ int bam_merge_core2(int by_qname, char* sort_tag, const char *out, const char *m
          int res;
          h->i = i;
          h->entry.bam_record = bam_init1();
-        h->entry.tag = NULL;
+        h->entry.u.tag = NULL;
          if (!h->entry.bam_record) goto mem_fail;
          res = iter[i] ? sam_itr_next(fp[i], iter[i], h->entry.bam_record) : sam_read1(fp[i], hdr[i], h->entry.bam_record);
          if (res >= 0) {
              bam_translate(h->entry.bam_record, translation_tbl + i);
-            h->pos = ((uint64_t)h->entry.bam_record->core.tid<<32) | (uint32_t)((int32_t)h->entry.bam_record->core.pos+1)<<1 | bam_is_rev(h->entry.bam_record);
+            h->pos = ((uint64_t)h->entry.bam_record->core.tid<<32) | (uint32_t)((int32_t)h->entry.bam_record->core.pos+1);
+            h->rev = bam_is_rev(h->entry.bam_record);
              h->idx = idx++;
              if (g_is_by_tag) {
-                h->entry.tag = bam_aux_get(h->entry.bam_record, g_sort_tag);
+                h->entry.u.tag = bam_aux_get(h->entry.bam_record, g_sort_tag);
              } else {
-                h->entry.tag = NULL;
+                h->entry.u.tag = NULL;
              }
          }
          else if (res == -1 && (!iter[i] || iter[i]->finished)) {
              h->pos = HEAP_EMPTY;
              bam_destroy1(h->entry.bam_record);
              h->entry.bam_record = NULL;
-            h->entry.tag = NULL;
+            h->entry.u.tag = NULL;
          } else {
              print_error(cmd, "failed to read first record from \"%s\"", fn[i]);
              goto fail;
@@ -1415,18 +1421,19 @@ int bam_merge_core2(int by_qname, char* sort_tag, const char *out, const char *m
          }
          if ((j = (iter[heap->i]? sam_itr_next(fp[heap->i], iter[heap->i], b) : sam_read1(fp[heap->i], hdr[heap->i], b))) >= 0) {
              bam_translate(b, translation_tbl + heap->i);
-            heap->pos = ((uint64_t)b->core.tid<<32) | (uint32_t)((int)b->core.pos+1)<<1 | bam_is_rev(b);
+            heap->pos = ((uint64_t)b->core.tid<<32) | (uint32_t)((int)b->core.pos+1);
+            heap->rev = bam_is_rev(b);
              heap->idx = idx++;
              if (g_is_by_tag) {
-                heap->entry.tag = bam_aux_get(heap->entry.bam_record, g_sort_tag);
+                heap->entry.u.tag = bam_aux_get(heap->entry.bam_record, g_sort_tag);
              } else {
-                heap->entry.tag = NULL;
+                heap->entry.u.tag = NULL;
              }
          } else if (j == -1 && (!iter[heap->i] || iter[heap->i]->finished)) {
              heap->pos = HEAP_EMPTY;
              bam_destroy1(heap->entry.bam_record);
              heap->entry.bam_record = NULL;
-            heap->entry.tag = NULL;
+            heap->entry.u.tag = NULL;
          } else {
              print_error(cmd, "\"%s\" is truncated", fn[heap->i]);
              goto fail;
@@ -1651,19 +1658,19 @@ static inline int heap_add_read(heap1_t *heap, int nfiles, samFile **fp,
      }
      if (res >= 0) {
          heap->pos = (((uint64_t)heap->entry.bam_record->core.tid<<32)
-                     | (uint32_t)((int32_t)heap->entry.bam_record->core.pos+1)<<1
-                     | bam_is_rev(heap->entry.bam_record));
+                     | (uint32_t)((int32_t)heap->entry.bam_record->core.pos+1));
+        heap->rev = bam_is_rev(heap->entry.bam_record);
          heap->idx = (*idx)++;
          if (g_is_by_tag) {
-            heap->entry.tag = bam_aux_get(heap->entry.bam_record, g_sort_tag);
+            heap->entry.u.tag = bam_aux_get(heap->entry.bam_record, g_sort_tag);
          } else {
-            heap->entry.tag = NULL;
+            heap->entry.u.tag = NULL;
          }
      } else if (res == -1) {
          heap->pos = HEAP_EMPTY;
          if (i < nfiles) bam_destroy1(heap->entry.bam_record);
          heap->entry.bam_record = NULL;
-        heap->entry.tag = NULL;
+        heap->entry.u.tag = NULL;
      } else {
          return -1;
      }
@@ -1718,7 +1725,7 @@ static int bam_merge_simple(int by_qname, char *sort_tag, const char *out,
  
          // Get a read into the heap
          h->i = i;
-        h->entry.tag = NULL;
+        h->entry.u.tag = NULL;
          if (i < n) {
              h->entry.bam_record = bam_init1();
              if (!h->entry.bam_record) goto mem_fail;
@@ -1736,7 +1743,7 @@ static int bam_merge_simple(int by_qname, char *sort_tag, const char *out,
          return -1;
      }
  
-    hts_set_threads(fpout, n_threads);
+    if (n_threads > 1) hts_set_threads(fpout, n_threads);
  
      if (sam_hdr_write(fpout, hout) != 0) {
          print_error_errno(cmd, "failed to write header to \"%s\"", out);
@@ -1806,8 +1813,14 @@ static inline int bam1_cmp_core(const bam1_tag a, const bam1_tag b)
          if (t != 0) return t;
          return (int) (a.bam_record->core.flag&0xc0) - (int) (b.bam_record->core.flag&0xc0);
      } else {
-        pa = (uint64_t)a.bam_record->core.tid<<32|(a.bam_record->core.pos+1)<<1|bam_is_rev(a.bam_record);
-        pb = (uint64_t)b.bam_record->core.tid<<32|(b.bam_record->core.pos+1)<<1|bam_is_rev(b.bam_record);
+        pa = (uint64_t)a.bam_record->core.tid<<32|(a.bam_record->core.pos+1);
+        pb = (uint64_t)b.bam_record->core.tid<<32|(b.bam_record->core.pos+1);
+
+        if (pa == pb) {
+            pa = bam_is_rev(a.bam_record);
+            pb = bam_is_rev(b.bam_record);
+        }
+
          return pa < pb ? -1 : (pa > pb ? 1 : 0);
      }
  }
@@ -1830,8 +1843,8 @@ uint8_t normalize_type(const uint8_t* aux) {
  // equal to or greater than b, respectively.
  static inline int bam1_cmp_by_tag(const bam1_tag a, const bam1_tag b)
  {
-    const uint8_t* aux_a = a.tag;
-    const uint8_t* aux_b = b.tag;
+    const uint8_t* aux_a = a.u.tag;
+    const uint8_t* aux_b = b.u.tag;
  
      if (aux_a == NULL && aux_b != NULL) {
          return -1;
@@ -1928,12 +1941,76 @@ static int write_buffer(const char *fn, const char *mode, size_t l, bam1_tag *bu
      return -1;
  }
  
+#define NUMBASE 256
+#define STEP 8
+
+static int ks_radixsort(size_t n, bam1_tag *buf, const bam_hdr_t *h)
+{
+    int curr = 0, ret = -1;
+    ssize_t i;
+    bam1_tag *buf_ar2[2], *bam_a, *bam_b;
+    uint64_t max_pos = 0, max_digit = 0, shift = 0;
+
+    for (i = 0; i < n; i++) {
+        bam1_t *b = buf[i].bam_record;
+        int32_t tid = b->core.tid == -1 ? h->n_targets : b->core.tid;
+        buf[i].u.pos = (uint64_t)tid<<32 | (b->core.pos+1)<<1 | bam_is_rev(b);
+        if (max_pos < buf[i].u.pos)
+            max_pos = buf[i].u.pos;
+    }
+
+    while (max_pos) {
+        ++max_digit;
+        max_pos = max_pos >> 1;
+    }
+
+    buf_ar2[0] = buf;
+    buf_ar2[1] = (bam1_tag *)malloc(sizeof(bam1_tag) * n);
+    if (buf_ar2[1] == NULL) {
+        print_error("sort", "couldn't allocate memory for temporary buf");
+        goto err;
+    }
+
+    while (shift < max_digit){
+        size_t remainders[NUMBASE] = { 0 };
+        bam_a = buf_ar2[curr]; bam_b = buf_ar2[1-curr];
+        for (i = 0; i < n; ++i)
+            remainders[(bam_a[i].u.pos >> shift) % NUMBASE]++;
+        for (i = 1; i < NUMBASE; ++i)
+            remainders[i] += remainders[i - 1];
+        for (i = n - 1; i >= 0; i--) {
+            size_t j = --remainders[(bam_a[i].u.pos >> shift) % NUMBASE];
+            bam_b[j] = bam_a[i];
+        }
+        shift += STEP;
+        curr = 1 - curr;
+    }
+    if (curr == 1) {
+        bam1_tag *end = buf + n;
+        bam_a = buf_ar2[0]; bam_b = buf_ar2[1];
+        while (bam_a < end) *bam_a++ = *bam_b++;
+    }
+
+    ret = 0;
+err:
+    free(buf_ar2[1]);
+    return ret;
+}
+
  static void *worker(void *data)
  {
      worker_t *w = (worker_t*)data;
      char *name;
      w->error = 0;
-    ks_mergesort(sort, w->buf_len, w->buf, 0);
+
+    if (!g_is_by_qname && !g_is_by_tag) {
+        if (ks_radixsort(w->buf_len, w->buf, w->h) < 0) {
+            w->error = errno;
+            return NULL;
+        }
+    } else {
+        ks_mergesort(sort, w->buf_len, w->buf, 0);
+    }
  
      if (w->no_save)
          return 0;
@@ -2140,11 +2217,12 @@ int bam_sort_core_ext(int is_by_qname, char* sort_by_tag, const char *fn, const
              mem_full = 1;
          }
  
-        // Pull out the pointer to the sort tag if applicable
+        // Pull out the value of the position
+        // or the pointer to the sort tag if applicable
          if (g_is_by_tag) {
-            buf[k].tag = bam_aux_get(buf[k].bam_record, g_sort_tag);
+            buf[k].u.tag = bam_aux_get(buf[k].bam_record, g_sort_tag);
          } else {
-            buf[k].tag = NULL;
+            buf[k].u.tag = NULL;
          }
          ++k;
  
@@ -2168,7 +2246,7 @@ int bam_sort_core_ext(int is_by_qname, char* sort_by_tag, const char *fn, const
          in_mem = calloc(n_threads > 0 ? n_threads : 1, sizeof(in_mem[0]));
          if (!in_mem) goto err;
          num_in_mem = sort_blocks(n_files, k, buf, prefix, header, n_threads,
-                                   in_mem);
+                                 in_mem);
          if (num_in_mem < 0) goto err;
      } else {
          num_in_mem = 0;
@@ -2176,7 +2254,6 @@ int bam_sort_core_ext(int is_by_qname, char* sort_by_tag, const char *fn, const
  
      // write the final output
      if (n_files == 0 && num_in_mem < 2) { // a single block
-        ks_mergesort(sort, k, buf, 0);
          if (write_buffer(fnout, modeout, k, buf, header, n_threads, out_fmt) != 0) {
              print_error_errno("sort", "failed to create \"%s\"", fnout);
              goto err;
@@ -2217,6 +2294,7 @@ int bam_sort_core_ext(int is_by_qname, char* sort_by_tag, const char *fn, const
      bam_destroy1(b);
      free(buf);
      free(bam_mem);
+    free(in_mem);
      bam_hdr_destroy(header);
      if (fp) sam_close(fp);
      return ret;
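Note on the position sort in the hunks above: ks_radixsort packs tid, pos and strand into one 64-bit key and then runs least-significant-digit counting passes of 8 bits each, swapping between two buffers. The following is only a minimal standalone sketch of that technique on plain uint64_t keys, not the samtools implementation; the names pack_key and radix_sort_u64 are hypothetical, and pos is assumed to be smaller than 2^31 - 1 so the shifted value stays in the low 32 bits.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: build a sortable key with the same layout as the
   diff above (tid in the high 32 bits, then pos+1, then the strand bit).
   Assumes pos < 2^31 - 1. */
static uint64_t pack_key(uint32_t tid, uint32_t pos, int is_rev)
{
    return (uint64_t)tid << 32 | ((uint64_t)pos + 1) << 1 | (is_rev ? 1u : 0u);
}

/* LSD radix sort, one byte per pass. Returns 0 on success, -1 if the
   temporary buffer cannot be allocated. */
static int radix_sort_u64(uint64_t *keys, size_t n)
{
    uint64_t *tmp, *src = keys, *dst, max_key = 0;
    size_t i;
    unsigned shift;

    assert(n == 0 || keys != NULL);
    if (n < 2) return 0;

    tmp = malloc(n * sizeof(*tmp));
    if (!tmp) return -1;
    dst = tmp;

    /* The largest key tells us how many byte passes are needed. */
    for (i = 0; i < n; i++)
        if (keys[i] > max_key) max_key = keys[i];

    for (shift = 0; shift < 64 && (max_key >> shift) != 0; shift += 8) {
        size_t count[256] = { 0 };
        uint64_t *swap;

        /* Histogram of the current byte, then prefix sums give end offsets. */
        for (i = 0; i < n; i++)
            count[(src[i] >> shift) & 0xff]++;
        for (i = 1; i < 256; i++)
            count[i] += count[i - 1];

        /* Walk backwards so equal bytes keep their order (stable pass). */
        for (i = n; i-- > 0; )
            dst[--count[(src[i] >> shift) & 0xff]] = src[i];

        swap = src; src = dst; dst = swap;
    }

    /* After an odd number of passes the sorted data sits in tmp. */
    if (src != keys) memcpy(keys, src, n * sizeof(*keys));
    free(tmp);
    return 0;
}
```

Because every pass is stable, sorting from the lowest byte upwards leaves the keys ordered by tid first, then position, then strand, which is the ordering the comparison function in the earlier hunk produces.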
diff --git a/samtools/bamshuf.c b/samtools/bamshuf.c

index c1c89fb93d75203555b2a48984eab52bb1c6a75b..8c53f35bf9e0b845d65f2db1ba5e8ebabe3b6c97 100644
--- a/samtools/bamshuf.c
+++ b/samtools/bamshuf.c
@@ -1,7 +1,7 @@
  /*  bamshuf.c -- collate subcommand.
  
      Copyright (C) 2012 Broad Institute.
-    Copyright (C) 2013, 2015 Genome Research Ltd.
+    Copyright (C) 2013, 2015, 2018 Genome Research Ltd.
  
      Author: Heng Li <lh3@sanger.ac.uk>
  
@@ -30,12 +30,18 @@ DEALINGS IN THE SOFTWARE.  */
  #include <stdlib.h>
  #include <string.h>
  #include <assert.h>
+#include <errno.h>
+#ifdef _WIN32
+#  define WIN32_LEAN_AND_MEAN
+#  include <windows.h>
+#endif
  #include "htslib/sam.h"
  #include "htslib/hts.h"
  #include "htslib/ksort.h"
  #include "samtools.h"
  #include "htslib/thread_pool.h"
  #include "sam_opts.h"
+#include "htslib/khash.h"
  
  #define DEF_CLEVEL 1
  
@@ -77,13 +83,110 @@ static inline int elem_lt(elem_t x, elem_t y)
  
  KSORT_INIT(bamshuf, elem_t, elem_lt)
  
+
+typedef struct {
+    int written;
+    bam1_t *b;
+} bam_item_t;
+
+typedef struct {
+    bam1_t *bam_pool;
+    bam_item_t *items;
+    size_t size;
+    size_t index;
+} bam_list_t;
+
+typedef struct {
+    bam_item_t *bi;
+} store_item_t;
+
+KHASH_MAP_INIT_STR(bam_store, store_item_t)
+
+
+static bam_item_t *store_bam(bam_list_t *list) {
+    size_t old_index = list->index;
+
+    list->items[list->index++].written = 0;
+
+    if (list->index >= list->size)
+        list->index = 0;
+
+    return &list->items[old_index];
+}
+
+
+static int write_bam_needed(bam_list_t *list) {
+    return !list->items[list->index].written;
+}
+
+
+static void mark_bam_as_written(bam_list_t *list) {
+    list->items[list->index].written = 1;
+}
+
+
+static int create_bam_list(bam_list_t *list, size_t max_size) {
+    size_t i;
+
+    list->size = list->index = 0;
+    list->items    = NULL;
+    list->bam_pool = NULL;
+
+    if ((list->items = malloc(max_size * sizeof(bam_item_t))) == NULL) {
+        return 1;
+    }
+
+    if ((list->bam_pool = calloc(max_size, sizeof(bam1_t))) == NULL) {
+        return 1;
+    }
+
+    for (i = 0; i < max_size; i++) {
+        list->items[i].b = &list->bam_pool[i];
+        list->items[i].written = 1;
+    }
+
+    list->size  = max_size;
+    list->index = 0;
+
+    return 0;
+}
+
+
+static void destroy_bam_list(bam_list_t *list) {
+    size_t i;
+
+    for (i = 0; i < list->size; i++) {
+        free(list->bam_pool[i].data);
+    }
+
+    free(list->bam_pool);
+    free(list->items);
+}
+
+
+static inline int write_to_bin_file(bam1_t *bam, int64_t *count, samFile **bin_files, char **names, bam_hdr_t *header, int files) {
+    uint32_t x;
+
+    x = hash_X31_Wang(bam_get_qname(bam)) % files;
+
+    if (sam_write1(bin_files[x], header, bam) < 0) {
+        print_error_errno("collate", "Couldn't write to intermediate file \"%s\"", names[x]);
+        return 1;
+    }
+
+    ++count[x];
+
+    return 0;
+}
+
+
  static int bamshuf(const char *fn, int n_files, const char *pre, int clevel,
-                   int is_stdout, sam_global_args *ga)
+                   int is_stdout, const char *output_file, int fast, int store_max, sam_global_args *ga)
  {
      samFile *fp, *fpw = NULL, **fpt = NULL;
      char **fnt = NULL, modew[8];
      bam1_t *b = NULL;
-    int i, l, r;
+    int i, counter, l, r;
      bam_hdr_t *h = NULL;
      int64_t j, max_cnt = 0, *cnt = NULL;
      elem_t *a = NULL;
@@ -122,6 +225,40 @@ static int bamshuf(const char *fn, int n_files, const char *pre, int clevel,
          goto fail;
      }
  
+    // open final output file
+    l = strlen(pre);
+
+    sprintf(modew, "wb%d", (clevel >= 0 && clevel <= 9)? clevel : DEF_CLEVEL);
+
+    if (!is_stdout && !output_file) { // output to a file (name based on prefix)
+        char *fnw = (char*)calloc(l + 5, 1);
+        if (!fnw) goto mem_fail;
+        if (ga->out.format == unknown_format)
+            sprintf(fnw, "%s.bam", pre); // "wb" above makes BAM the default
+        else
+            sprintf(fnw, "%s.%s", pre,  hts_format_file_extension(&ga->out));
+        fpw = sam_open_format(fnw, modew, &ga->out);
+        free(fnw);
+    } else if (output_file) { // output to a given file
+        modew[0] = 'w'; modew[1] = '\0';
+        sam_open_mode(modew + 1, output_file, NULL);
+        j = strlen(modew);
+        snprintf(modew + j, sizeof(modew) - j, "%d",
+                 (clevel >= 0 && clevel <= 9)? clevel : DEF_CLEVEL);
+        fpw = sam_open_format(output_file, modew, &ga->out);
+    } else fpw = sam_open_format("-", modew, &ga->out); // output to stdout
+    if (fpw == NULL) {
+        if (is_stdout) print_error_errno("collate", "Cannot open standard output");
+        else print_error_errno("collate", "Cannot open output file \"%s.bam\"", pre);
+        goto fail;
+    }
+    if (p.pool) hts_set_opt(fpw, HTS_OPT_THREAD_POOL, &p);
+
+    if (sam_hdr_write(fpw, h) < 0) {
+        print_error_errno("collate", "Couldn't write header");
+        goto fail;
+    }
+
      fnt = (char**)calloc(n_files, sizeof(char*));
      if (!fnt) goto mem_fail;
      fpt = (samFile**)calloc(n_files, sizeof(samFile*));
@@ -129,35 +266,162 @@ static int bamshuf(const char *fn, int n_files, const char *pre, int clevel,
      cnt = (int64_t*)calloc(n_files, 8);
      if (!cnt) goto mem_fail;
  
-    l = strlen(pre);
-
-    for (i = 0; i < n_files; ++i) {
-        fnt[i] = (char*)calloc(l + 10, 1);
+    for (i = counter = 0; i < n_files; ++i) {
+        fnt[i] = (char*)calloc(l + 20, 1);
          if (!fnt[i]) goto mem_fail;
-        sprintf(fnt[i], "%s.%.4d.bam", pre, i);
-        fpt[i] = sam_open(fnt[i], "wb1");
+        do {
+            sprintf(fnt[i], "%s.%04d.bam", pre, counter++);
+            fpt[i] = sam_open(fnt[i], "wxb1");
+        } while (!fpt[i] && errno == EEXIST);
          if (fpt[i] == NULL) {
              print_error_errno("collate", "Cannot open intermediate file \"%s\"", fnt[i]);
              goto fail;
          }
+        if (p.pool) hts_set_opt(fpt[i], HTS_OPT_THREAD_POOL, &p);
          if (sam_hdr_write(fpt[i], h) < 0) {
              print_error_errno("collate", "Couldn't write header to intermediate file \"%s\"", fnt[i]);
              goto fail;
          }
      }
-    b = bam_init1();
-    if (!b) goto mem_fail;
-    while ((r = sam_read1(fp, h, b)) >= 0) {
-        uint32_t x;
-        x = hash_X31_Wang(bam_get_qname(b)) % n_files;
-        if (sam_write1(fpt[x], h, b) < 0) {
-            print_error_errno("collate", "Couldn't write to intermediate file \"%s\"", fnt[x]);
+
+    if (fast) {
+        khash_t(bam_store) *stored = kh_init(bam_store);
+        khiter_t itr;
+        bam_list_t list;
+        int err = 0;
+        if (!stored) goto mem_fail;
+
+        if (store_max < 2) store_max = 2;
+
+        if (create_bam_list(&list, store_max)) {
+            fprintf(stderr, "[collate] ERROR: unable to create bam list.\n");
+            err = 1;
+            goto fast_fail;
+        }
+
+        while ((r = sam_read1(fp, h, list.items[list.index].b)) >= 0) {
+            int ret;
+            bam1_t *b = list.items[list.index].b;
+            int readflag = b->core.flag & (BAM_FREAD1 | BAM_FREAD2);
+
+            // strictly paired reads only
+            if (!(b->core.flag & (BAM_FSECONDARY | BAM_FSUPPLEMENTARY)) && (readflag == BAM_FREAD1 || readflag == BAM_FREAD2)) {
+
+                itr = kh_get(bam_store, stored, bam_get_qname(b));
+
+                if (itr == kh_end(stored)) {
+                    // new read
+                    itr = kh_put(bam_store, stored, bam_get_qname(b), &ret);
+
+                    if (ret > 0) { // okay to go ahead store it
+                        kh_value(stored, itr).bi = store_bam(&list);
+
+                        // see if the next one on the list needs to be written out
+                        if (write_bam_needed(&list)) {
+                            if (write_to_bin_file(list.items[list.index].b, cnt, fpt, fnt, h, n_files) < 0) {
+                                fprintf(stderr, "[collate] ERROR: could not write line.\n");
+                                err = 1;
+                                goto fast_fail;
+                            } else {
+                                mark_bam_as_written(&list);
+
+                                itr = kh_get(bam_store, stored, bam_get_qname(list.items[list.index].b));
+
+                                if (itr != kh_end(stored)) {
+                                    kh_del(bam_store, stored, itr);
+                                } else {
+                                    fprintf(stderr, "[collate] ERROR: stored value not in hash.\n");
+                                    err = 1;
+                                    goto fast_fail;
+                                }
+                            }
+                        }
+                    } else if (ret == 0) {
+                        fprintf(stderr, "[collate] ERROR: value already in hash.\n");
+                        err = 1;
+                        goto fast_fail;
+                    } else {
+                        fprintf(stderr, "[collate] ERROR: unable to store in hash.\n");
+                        err = 1;
+                        goto fast_fail;
+                    }
+                } else { // we have a match
+                    // write out the reads in R1 R2 order
+                    bam1_t *r1, *r2;
+
+                    if (b->core.flag & BAM_FREAD1) {
+                        r1 = b;
+                        r2 = kh_value(stored, itr).bi->b;
+                    } else {
+                        r1 = kh_value(stored, itr).bi->b;
+                        r2 = b;
+                    }
+
+                    if (sam_write1(fpw, h, r1) < 0) {
+                        fprintf(stderr, "[collate] ERROR: could not write r1 alignment.\n");
+                        err = 1;
+                        goto fast_fail;
+                    }
+
+                    if (sam_write1(fpw, h, r2) < 0) {
+                        fprintf(stderr, "[collate] ERROR: could not write r2 alignment.\n");
+                        err = 1;
+                        goto fast_fail;
+                    }
+
+                    mark_bam_as_written(&list);
+
+                    // remove stored read
+                    kh_value(stored, itr).bi->written = 1;
+                    kh_del(bam_store, stored, itr);
+                }
+            }
+        }
+
+        for (list.index = 0; list.index < list.size; list.index++) {
+            if (write_bam_needed(&list)) {
+                bam1_t *b = list.items[list.index].b;
+
+                if (write_to_bin_file(b, cnt, fpt, fnt, h, n_files)) {
+                    err = 1;
+                    goto fast_fail;
+                } else {
+                    itr = kh_get(bam_store, stored, bam_get_qname(b));
+                    kh_del(bam_store, stored, itr);
+                }
+            }
+        }
+
+ fast_fail:
+        if (err) {
+            for (itr = kh_begin(stored); itr != kh_end(stored); ++itr) {
+                if (kh_exist(stored, itr)) {
+                    kh_del(bam_store, stored, itr);
+                }
+            }
+
+            kh_destroy(bam_store, stored);
+            destroy_bam_list(&list);
              goto fail;
+        } else {
+            kh_destroy(bam_store, stored);
+            destroy_bam_list(&list);
          }
-        ++cnt[x];
+
+    } else {
+        b = bam_init1();
+        if (!b) goto mem_fail;
+
+        while ((r = sam_read1(fp, h, b)) >= 0) {
+            if (write_to_bin_file(b, cnt, fpt, fnt, h, n_files)) {
+                bam_destroy1(b);
+                goto fail;
+            }
+        }
+
+        bam_destroy1(b);
      }
-    bam_destroy1(b);
-    b = NULL;
+
      if (r < -1) {
          fprintf(stderr, "Error reading input file\n");
          goto fail;
@@ -178,30 +442,8 @@ static int bamshuf(const char *fn, int n_files, const char *pre, int clevel,
      fpt = NULL;
      sam_close(fp);
      fp = NULL;
-    // merge
-    sprintf(modew, "wb%d", (clevel >= 0 && clevel <= 9)? clevel : DEF_CLEVEL);
-    if (!is_stdout) { // output to a file
-        char *fnw = (char*)calloc(l + 5, 1);
-        if (!fnw) goto mem_fail;
-        if (ga->out.format == unknown_format)
-            sprintf(fnw, "%s.bam", pre); // "wb" above makes BAM the default
-        else
-            sprintf(fnw, "%s.%s", pre,  hts_format_file_extension(&ga->out));
-        fpw = sam_open_format(fnw, modew, &ga->out);
-        free(fnw);
-    } else fpw = sam_open_format("-", modew, &ga->out); // output to stdout
-    if (fpw == NULL) {
-        if (is_stdout) print_error_errno("collate", "Cannot open standard output");
-        else print_error_errno("collate", "Cannot open output file \"%s.bam\"", pre);
-        goto fail;
-    }
-    if (p.pool) hts_set_opt(fpw, HTS_OPT_THREAD_POOL, &p);
-
-    if (sam_hdr_write(fpw, h) < 0) {
-        print_error_errno("collate", "Couldn't write header");
-        goto fail;
-    }
  
+    // merge
      a = malloc(max_cnt * sizeof(elem_t));
      if (!a) goto mem_fail;
      for (j = 0; j < max_cnt; ++j) {
@@ -262,7 +504,6 @@ static int bamshuf(const char *fn, int n_files, const char *pre, int clevel,
      if (fp) sam_close(fp);
      if (fpw) sam_close(fpw);
      if (h) bam_hdr_destroy(h);
-    if (b) bam_destroy1(b);
      for (i = 0; i < n_files; ++i) {
          if (fnt) free(fnt[i]);
          if (fpt && fpt[i]) sam_close(fpt[i]);
@@ -279,44 +520,102 @@ static int bamshuf(const char *fn, int n_files, const char *pre, int clevel,
      return 1;
  }
  
-static int usage(FILE *fp, int n_files) {
+static int usage(FILE *fp, int n_files, int reads_store) {
      fprintf(fp,
-            "Usage:   samtools collate [-Ou] [-n nFiles] [-l cLevel] <in.bam> <out.prefix>\n\n"
+            "Usage: samtools collate [-Ou] [-o <name>] [-n nFiles] [-l cLevel] <in.bam> [<prefix>]\n\n"
              "Options:\n"
              "      -O       output to stdout\n"
+            "      -o       output file name (use prefix if not set)\n"
              "      -u       uncompressed BAM output\n"
+            "      -f       fast (only primary alignments)\n"
+            "      -r       working reads stored (with -f) [%d]\n" // reads_store
              "      -l INT   compression level [%d]\n" // DEF_CLEVEL
              "      -n INT   number of temporary files [%d]\n", // n_files
-            DEF_CLEVEL, n_files);
+            reads_store, DEF_CLEVEL, n_files);
  
      sam_global_opt_help(fp, "-....@");
+    fprintf(fp,
+            "  <prefix> is required unless the -o or -O options are used.\n");
  
      return 1;
  }
  
+char * generate_prefix() {
+    char *prefix;
+    unsigned int pid = getpid();
+#ifdef _WIN32
+#  define PREFIX_LEN (MAX_PATH + 16)
+    DWORD ret;
+    prefix = calloc(PREFIX_LEN, sizeof(*prefix));
+    if (!prefix) {
+        perror("collate");
+        return NULL;
+    }
+    ret = GetTempPathA(MAX_PATH, prefix);
+    if (ret > MAX_PATH || ret == 0) {
+        fprintf(stderr,
+                "[E::collate] Couldn't get path for temporary files.\n");
+        free(prefix);
+        return NULL;
+    }
+    snprintf(prefix + ret, PREFIX_LEN - ret, "\\%x", pid);
+    return prefix;
+#else
+#  define PREFIX_LEN 64
+    prefix = malloc(PREFIX_LEN);
+    if (!prefix) {
+        perror("collate");
+        return NULL;
+    }
+    snprintf(prefix, PREFIX_LEN, "/tmp/collate%x", pid);
+    return prefix;
+#endif
+}
+
  int main_bamshuf(int argc, char *argv[])
  {
-    int c, n_files = 64, clevel = DEF_CLEVEL, is_stdout = 0, is_un = 0;
+    int c, n_files = 64, clevel = DEF_CLEVEL, is_stdout = 0, is_un = 0, fast_coll = 0, reads_store = 10000, ret, pre_mem = 0;
+    const char *output_file = NULL;
+    char *prefix = NULL;
      sam_global_args ga = SAM_GLOBAL_ARGS_INIT;
      static const struct option lopts[] = {
          SAM_OPT_GLOBAL_OPTIONS('-', 0, 0, 0, 0, '@'),
          { NULL, 0, NULL, 0 }
      };
  
-    while ((c = getopt_long(argc, argv, "n:l:uO@:", lopts, NULL)) >= 0) {
+    while ((c = getopt_long(argc, argv, "n:l:uOo:@:fr:", lopts, NULL)) >= 0) {
          switch (c) {
          case 'n': n_files = atoi(optarg); break;
          case 'l': clevel = atoi(optarg); break;
          case 'u': is_un = 1; break;
          case 'O': is_stdout = 1; break;
+        case 'o': output_file = optarg; break;
+        case 'f': fast_coll = 1; break;
+        case 'r': reads_store = atoi(optarg); break;
          default:  if (parse_sam_global_opt(c, optarg, lopts, &ga) == 0) break;
                    /* else fall-through */
-        case '?': return usage(stderr, n_files);
+        case '?': return usage(stderr, n_files, reads_store);
          }
      }
      if (is_un) clevel = 0;
-    if (optind + 2 > argc)
-        return usage(stderr, n_files);
+    if (argc >= optind + 2) prefix = argv[optind+1];
+    if (!(prefix || is_stdout || output_file))
+        return usage(stderr, n_files, reads_store);
+    if (is_stdout && output_file) {
+        fprintf(stderr, "collate: -o and -O options cannot be used together.\n");
+        return usage(stderr, n_files, reads_store);
+    }
+    if (!prefix) {
+        prefix = generate_prefix();
+        pre_mem = 1;
+    }
+
+    if (!prefix) return EXIT_FAILURE;
+
+    ret = bamshuf(argv[optind], n_files, prefix, clevel, is_stdout,
+                   output_file, fast_coll, reads_store, &ga);
+
+    if (pre_mem) free(prefix);
  
-    return bamshuf(argv[optind], n_files, argv[optind+1], clevel, is_stdout, &ga);
+    return ret;
  }
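Note on the fast collate path above: unmatched reads are parked in a fixed-size circular list, and each time a new read is stored the slot the index advances onto is checked; if it still holds an unmatched read, that oldest read is flushed to a temporary bin file. The sketch below illustrates only that eviction pattern with plain integers. It is not the samtools code: RING_SIZE, slot_t, ring_t, ring_init and ring_store are hypothetical names, and the tiny capacity is chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SIZE 4   /* hypothetical capacity, kept small for illustration */

typedef struct {
    int value;
    int pending;      /* 1 while the slot holds an item still waiting for its mate */
} slot_t;

typedef struct {
    slot_t slots[RING_SIZE];
    size_t index;     /* slot the next incoming item is placed in */
} ring_t;

static void ring_init(ring_t *r)
{
    size_t i;
    assert(r != NULL);
    for (i = 0; i < RING_SIZE; i++) {
        r->slots[i].value = 0;
        r->slots[i].pending = 0;
    }
    r->index = 0;
}

/* Store value in the current slot and advance the index, wrapping at the end.
   If the slot we advance onto still holds a pending item, hand it back through
   *evicted, clear it, and return 1; otherwise return 0. */
static int ring_store(ring_t *r, int value, int *evicted)
{
    assert(r != NULL && evicted != NULL);

    r->slots[r->index].value = value;
    r->slots[r->index].pending = 1;
    r->index = (r->index + 1) % RING_SIZE;

    if (r->slots[r->index].pending) {
        *evicted = r->slots[r->index].value;
        r->slots[r->index].pending = 0;
        return 1;
    }
    return 0;
}
```

With this layout at most RING_SIZE - 1 items are pending at once, since one slot is always kept free for the next incoming item; that is consistent with the diff raising store_max to at least 2.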
diff --git a/samtools/bamshuf.c.pysam.c b/samtools/bamshuf.c.pysam.c

index 008aa0c33f4e43bda6f2d57954b5aff653e34c7b..c89a50785a45533f95c81f62df138faa989aafb1 100644
--- a/samtools/bamshuf.c.pysam.c
+++ b/samtools/bamshuf.c.pysam.c
@@ -3,7 +3,7 @@
  /*  bamshuf.c -- collate subcommand.
  
      Copyright (C) 2012 Broad Institute.
-    Copyright (C) 2013, 2015 Genome Research Ltd.
+    Copyright (C) 2013, 2015, 2018 Genome Research Ltd.
  
      Author: Heng Li <lh3@sanger.ac.uk>
  
@@ -32,12 +32,18 @@ DEALINGS IN THE SOFTWARE.  */
  #include <stdlib.h>
  #include <string.h>
  #include <assert.h>
+#include <errno.h>
+#ifdef _WIN32
+#  define WIN32_LEAN_AND_MEAN
+#  include <windows.h>
+#endif
  #include "htslib/sam.h"
  #include "htslib/hts.h"
  #include "htslib/ksort.h"
  #include "samtools.h"
  #include "htslib/thread_pool.h"
  #include "sam_opts.h"
+#include "htslib/khash.h"
  
  #define DEF_CLEVEL 1
  
@@ -79,13 +85,110 @@ static inline int elem_lt(elem_t x, elem_t y)
  
  KSORT_INIT(bamshuf, elem_t, elem_lt)
  
+
+typedef struct {
+    int written;
+    bam1_t *b;
+} bam_item_t;
+
+typedef struct {
+    bam1_t *bam_pool;
+    bam_item_t *items;
+    size_t size;
+    size_t index;
+} bam_list_t;
+
+typedef struct {
+    bam_item_t *bi;
+} store_item_t;
+
+KHASH_MAP_INIT_STR(bam_store, store_item_t)
+
+
+static bam_item_t *store_bam(bam_list_t *list) {
+    size_t old_index = list->index;
+
+    list->items[list->index++].written = 0;
+
+    if (list->index >= list->size)
+        list->index = 0;
+
+    return &list->items[old_index];
+}
+
+
+static int write_bam_needed(bam_list_t *list) {
+    return !list->items[list->index].written;
+}
+
+
+static void mark_bam_as_written(bam_list_t *list) {
+    list->items[list->index].written = 1;
+}
+
+
+static int create_bam_list(bam_list_t *list, size_t max_size) {
+    size_t i;
+
+    list->size = list->index = 0;
+    list->items    = NULL;
+    list->bam_pool = NULL;
+
+    if ((list->items = malloc(max_size * sizeof(bam_item_t))) == NULL) {
+        return 1;
+    }
+
+    if ((list->bam_pool = calloc(max_size, sizeof(bam1_t))) == NULL) {
+        return 1;
+    }
+
+    for (i = 0; i < max_size; i++) {
+        list->items[i].b = &list->bam_pool[i];
+        list->items[i].written = 1;
+    }
+
+    list->size  = max_size;
+    list->index = 0;
+
+    return 0;
+}
+
+
+static void destroy_bam_list(bam_list_t *list) {
+    size_t i;
+
+    for (i = 0; i < list->size; i++) {
+        free(list->bam_pool[i].data);
+    }
+
+    free(list->bam_pool);
+    free(list->items);
+}
+
+
+static inline int write_to_bin_file(bam1_t *bam, int64_t *count, samFile **bin_files, char **names, bam_hdr_t *header, int files) {
+    uint32_t x;
+
+    x = hash_X31_Wang(bam_get_qname(bam)) % files;
+
+    if (sam_write1(bin_files[x], header, bam) < 0) {
+        print_error_errno("collate", "Couldn't write to intermediate file \"%s\"", names[x]);
+        return 1;
+    }
+
+    ++count[x];
+
+    return 0;
+}
+
+
  static int bamshuf(const char *fn, int n_files, const char *pre, int clevel,
-                   int is_samtools_stdout, sam_global_args *ga)
+                   int is_samtools_stdout, const char *output_file, int fast, int store_max, sam_global_args *ga)
  {
      samFile *fp, *fpw = NULL, **fpt = NULL;
      char **fnt = NULL, modew[8];
      bam1_t *b = NULL;
-    int i, l, r;
+    int i, counter, l, r;
      bam_hdr_t *h = NULL;
      int64_t j, max_cnt = 0, *cnt = NULL;
      elem_t *a = NULL;
@@ -124,6 +227,40 @@ static int bamshuf(const char *fn, int n_files, const char *pre, int clevel,
          goto fail;
      }
  
+    // open final output file
+    l = strlen(pre);
+
+    sprintf(modew, "wb%d", (clevel >= 0 && clevel <= 9)? clevel : DEF_CLEVEL);
+
+    if (!is_samtools_stdout && !output_file) { // output to a file (name based on prefix)
+        char *fnw = (char*)calloc(l + 5, 1);
+        if (!fnw) goto mem_fail;
+        if (ga->out.format == unknown_format)
+            sprintf(fnw, "%s.bam", pre); // "wb" above makes BAM the default
+        else
+            sprintf(fnw, "%s.%s", pre,  hts_format_file_extension(&ga->out));
+        fpw = sam_open_format(fnw, modew, &ga->out);
+        free(fnw);
+    } else if (output_file) { // output to a given file
+        modew[0] = 'w'; modew[1] = '\0';
+        sam_open_mode(modew + 1, output_file, NULL);
+        j = strlen(modew);
+        snprintf(modew + j, sizeof(modew) - j, "%d",
+                 (clevel >= 0 && clevel <= 9)? clevel : DEF_CLEVEL);
+        fpw = sam_open_format(output_file, modew, &ga->out);
+    } else fpw = sam_open_format("-", modew, &ga->out); // output to samtools_stdout
+    if (fpw == NULL) {
+        if (is_samtools_stdout) print_error_errno("collate", "Cannot open standard output");
+        else print_error_errno("collate", "Cannot open output file \"%s.bam\"", pre);
+        goto fail;
+    }
+    if (p.pool) hts_set_opt(fpw, HTS_OPT_THREAD_POOL, &p);
+
+    if (sam_hdr_write(fpw, h) < 0) {
+        print_error_errno("collate", "Couldn't write header");
+        goto fail;
+    }
+
      fnt = (char**)calloc(n_files, sizeof(char*));
      if (!fnt) goto mem_fail;
      fpt = (samFile**)calloc(n_files, sizeof(samFile*));
@@ -131,35 +268,162 @@ static int bamshuf(const char *fn, int n_files, const char *pre, int clevel,
      cnt = (int64_t*)calloc(n_files, 8);
      if (!cnt) goto mem_fail;
  
-    l = strlen(pre);
-
-    for (i = 0; i < n_files; ++i) {
-        fnt[i] = (char*)calloc(l + 10, 1);
+    for (i = counter = 0; i < n_files; ++i) {
+        fnt[i] = (char*)calloc(l + 20, 1);
          if (!fnt[i]) goto mem_fail;
-        sprintf(fnt[i], "%s.%.4d.bam", pre, i);
-        fpt[i] = sam_open(fnt[i], "wb1");
+        do {
+            sprintf(fnt[i], "%s.%04d.bam", pre, counter++);
+            fpt[i] = sam_open(fnt[i], "wxb1");
+        } while (!fpt[i] && errno == EEXIST);
          if (fpt[i] == NULL) {
              print_error_errno("collate", "Cannot open intermediate file \"%s\"", fnt[i]);
              goto fail;
          }
+        if (p.pool) hts_set_opt(fpt[i], HTS_OPT_THREAD_POOL, &p);
          if (sam_hdr_write(fpt[i], h) < 0) {
              print_error_errno("collate", "Couldn't write header to intermediate file \"%s\"", fnt[i]);
              goto fail;
          }
      }
-    b = bam_init1();
-    if (!b) goto mem_fail;
-    while ((r = sam_read1(fp, h, b)) >= 0) {
-        uint32_t x;
-        x = hash_X31_Wang(bam_get_qname(b)) % n_files;
-        if (sam_write1(fpt[x], h, b) < 0) {
-            print_error_errno("collate", "Couldn't write to intermediate file \"%s\"", fnt[x]);
+
+    if (fast) {
+        khash_t(bam_store) *stored = kh_init(bam_store);
+        khiter_t itr;
+        bam_list_t list;
+        int err = 0;
+        if (!stored) goto mem_fail;
+
+        if (store_max < 2) store_max = 2;
+
+        if (create_bam_list(&list, store_max)) {
+            fprintf(samtools_stderr, "[collate] ERROR: unable to create bam list.\n");
+            err = 1;
+            goto fast_fail;
+        }
+
+        while ((r = sam_read1(fp, h, list.items[list.index].b)) >= 0) {
+            int ret;
+            bam1_t *b = list.items[list.index].b;
+            int readflag = b->core.flag & (BAM_FREAD1 | BAM_FREAD2);
+
+            // strictly paired reads only
+            if (!(b->core.flag & (BAM_FSECONDARY | BAM_FSUPPLEMENTARY)) && (readflag == BAM_FREAD1 || readflag == BAM_FREAD2)) {
+
+                itr = kh_get(bam_store, stored, bam_get_qname(b));
+
+                if (itr == kh_end(stored)) {
+                    // new read
+                    itr = kh_put(bam_store, stored, bam_get_qname(b), &ret);
+
+                    if (ret > 0) { // okay to go ahead store it
+                        kh_value(stored, itr).bi = store_bam(&list);
+
+                        // see if the next one on the list needs to be written out
+                        if (write_bam_needed(&list)) {
+                            if (write_to_bin_file(list.items[list.index].b, cnt, fpt, fnt, h, n_files) < 0) {
+                                fprintf(samtools_stderr, "[collate] ERROR: could not write line.\n");
+                                err = 1;
+                                goto fast_fail;
+                            } else {
+                                mark_bam_as_written(&list);
+
+                                itr = kh_get(bam_store, stored, bam_get_qname(list.items[list.index].b));
+
+                                if (itr != kh_end(stored)) {
+                                    kh_del(bam_store, stored, itr);
+                                } else {
+                                    fprintf(samtools_stderr, "[collate] ERROR: stored value not in hash.\n");
+                                    err = 1;
+                                    goto fast_fail;
+                                }
+                            }
+                        }
+                    } else if (ret == 0) {
+                        fprintf(samtools_stderr, "[collate] ERROR: value already in hash.\n");
+                        err = 1;
+                        goto fast_fail;
+                    } else {
+                        fprintf(samtools_stderr, "[collate] ERROR: unable to store in hash.\n");
+                        err = 1;
+                        goto fast_fail;
+                    }
+                } else { // we have a match
+                    // write out the reads in R1 R2 order
+                    bam1_t *r1, *r2;
+
+                    if (b->core.flag & BAM_FREAD1) {
+                        r1 = b;
+                        r2 = kh_value(stored, itr).bi->b;
+                    } else {
+                        r1 = kh_value(stored, itr).bi->b;
+                        r2 = b;
+                    }
+
+                    if (sam_write1(fpw, h, r1) < 0) {
+                        fprintf(samtools_stderr, "[collate] ERROR: could not write r1 alignment.\n");
+                        err = 1;
+                        goto fast_fail;
+                    }
+
+                    if (sam_write1(fpw, h, r2) < 0) {
+                        fprintf(samtools_stderr, "[collate] ERROR: could not write r2 alignment.\n");
+                        err = 1;
+                        goto fast_fail;
+                    }
+
+                    mark_bam_as_written(&list);
+
+                    // remove stored read
+                    kh_value(stored, itr).bi->written = 1;
+                    kh_del(bam_store, stored, itr);
+                }
+            }
+        }
+
+        for (list.index = 0; list.index < list.size; list.index++) {
+            if (write_bam_needed(&list)) {
+                bam1_t *b = list.items[list.index].b;
+
+                if (write_to_bin_file(b, cnt, fpt, fnt, h, n_files)) {
+                    err = 1;
+                    goto fast_fail;
+                } else {
+                    itr = kh_get(bam_store, stored, bam_get_qname(b));
+                    kh_del(bam_store, stored, itr);
+                }
+            }
+        }
+
+ fast_fail:
+        if (err) {
+            for (itr = kh_begin(stored); itr != kh_end(stored); ++itr) {
+                if (kh_exist(stored, itr)) {
+                    kh_del(bam_store, stored, itr);
+                }
+            }
+
+            kh_destroy(bam_store, stored);
+            destroy_bam_list(&list);
              goto fail;
+        } else {
+            kh_destroy(bam_store, stored);
+            destroy_bam_list(&list);
          }
-        ++cnt[x];
+
+    } else {
+        b = bam_init1();
+        if (!b) goto mem_fail;
+
+        while ((r = sam_read1(fp, h, b)) >= 0) {
+            if (write_to_bin_file(b, cnt, fpt, fnt, h, n_files)) {
+                bam_destroy1(b);
+                goto fail;
+            }
+        }
+
+        bam_destroy1(b);
      }
-    bam_destroy1(b);
-    b = NULL;
+
      if (r < -1) {
          fprintf(samtools_stderr, "Error reading input file\n");
          goto fail;
@@ -180,30 +444,8 @@ static int bamshuf(const char *fn, int n_files, const char *pre, int clevel,
      fpt = NULL;
      sam_close(fp);
      fp = NULL;
-    // merge
-    sprintf(modew, "wb%d", (clevel >= 0 && clevel <= 9)? clevel : DEF_CLEVEL);
-    if (!is_samtools_stdout) { // output to a file
-        char *fnw = (char*)calloc(l + 5, 1);
-        if (!fnw) goto mem_fail;
-        if (ga->out.format == unknown_format)
-            sprintf(fnw, "%s.bam", pre); // "wb" above makes BAM the default
-        else
-            sprintf(fnw, "%s.%s", pre,  hts_format_file_extension(&ga->out));
-        fpw = sam_open_format(fnw, modew, &ga->out);
-        free(fnw);
-    } else fpw = sam_open_format("-", modew, &ga->out); // output to samtools_stdout
-    if (fpw == NULL) {
-        if (is_samtools_stdout) print_error_errno("collate", "Cannot open standard output");
-        else print_error_errno("collate", "Cannot open output file \"%s.bam\"", pre);
-        goto fail;
-    }
-    if (p.pool) hts_set_opt(fpw, HTS_OPT_THREAD_POOL, &p);
-
-    if (sam_hdr_write(fpw, h) < 0) {
-        print_error_errno("collate", "Couldn't write header");
-        goto fail;
-    }
  
+    // merge
      a = malloc(max_cnt * sizeof(elem_t));
      if (!a) goto mem_fail;
      for (j = 0; j < max_cnt; ++j) {
@@ -264,7 +506,6 @@ static int bamshuf(const char *fn, int n_files, const char *pre, int clevel,
      if (fp) sam_close(fp);
      if (fpw) sam_close(fpw);
      if (h) bam_hdr_destroy(h);
-    if (b) bam_destroy1(b);
      for (i = 0; i < n_files; ++i) {
          if (fnt) free(fnt[i]);
          if (fpt && fpt[i]) sam_close(fpt[i]);
@@ -281,44 +522,102 @@ static int bamshuf(const char *fn, int n_files, const char *pre, int clevel,
      return 1;
  }
  
-static int usage(FILE *fp, int n_files) {
+static int usage(FILE *fp, int n_files, int reads_store) {
      fprintf(fp,
-            "Usage:   samtools collate [-Ou] [-n nFiles] [-l cLevel] <in.bam> <out.prefix>\n\n"
+            "Usage: samtools collate [-Ou] [-o <name>] [-n nFiles] [-l cLevel] <in.bam> [<prefix>]\n\n"
              "Options:\n"
              "      -O       output to samtools_stdout\n"
+            "      -o       output file name (use prefix if not set)\n"
              "      -u       uncompressed BAM output\n"
+            "      -f       fast (only primary alignments)\n"
+            "      -r INT   working reads stored (with -f) [%d]\n" // reads_store
              "      -l INT   compression level [%d]\n" // DEF_CLEVEL
              "      -n INT   number of temporary files [%d]\n", // n_files
-            DEF_CLEVEL, n_files);
+            reads_store, DEF_CLEVEL, n_files);
  
      sam_global_opt_help(fp, "-....@");
+    fprintf(fp,
+            "  <prefix> is required unless the -o or -O options are used.\n");
  
      return 1;
  }
  
+char * generate_prefix() {
+    char *prefix;
+    unsigned int pid = getpid();
+#ifdef _WIN32
+#  define PREFIX_LEN (MAX_PATH + 16)
+    DWORD ret;
+    prefix = calloc(PREFIX_LEN, sizeof(*prefix));
+    if (!prefix) {
+        perror("collate");
+        return NULL;
+    }
+    ret = GetTempPathA(MAX_PATH, prefix);
+    if (ret > MAX_PATH || ret == 0) {
+        fprintf(samtools_stderr,
+                "[E::collate] Couldn't get path for temporary files.\n");
+        free(prefix);
+        return NULL;
+    }
+    snprintf(prefix + ret, PREFIX_LEN - ret, "\\%x", pid);
+    return prefix;
+#else
+#  define PREFIX_LEN 64
+    prefix = malloc(PREFIX_LEN);
+    if (!prefix) {
+        perror("collate");
+        return NULL;
+    }
+    snprintf(prefix, PREFIX_LEN, "/tmp/collate%x", pid);
+    return prefix;
+#endif
+}
+
  int main_bamshuf(int argc, char *argv[])
  {
-    int c, n_files = 64, clevel = DEF_CLEVEL, is_samtools_stdout = 0, is_un = 0;
+    int c, n_files = 64, clevel = DEF_CLEVEL, is_samtools_stdout = 0, is_un = 0, fast_coll = 0, reads_store = 10000, ret, pre_mem = 0;
+    const char *output_file = NULL;
+    char *prefix = NULL;
      sam_global_args ga = SAM_GLOBAL_ARGS_INIT;
      static const struct option lopts[] = {
          SAM_OPT_GLOBAL_OPTIONS('-', 0, 0, 0, 0, '@'),
          { NULL, 0, NULL, 0 }
      };
  
-    while ((c = getopt_long(argc, argv, "n:l:uO@:", lopts, NULL)) >= 0) {
+    while ((c = getopt_long(argc, argv, "n:l:uOo:@:fr:", lopts, NULL)) >= 0) {
          switch (c) {
          case 'n': n_files = atoi(optarg); break;
          case 'l': clevel = atoi(optarg); break;
          case 'u': is_un = 1; break;
          case 'O': is_samtools_stdout = 1; break;
+        case 'o': output_file = optarg; break;
+        case 'f': fast_coll = 1; break;
+        case 'r': reads_store = atoi(optarg); break;
          default:  if (parse_sam_global_opt(c, optarg, lopts, &ga) == 0) break;
                    /* else fall-through */
-        case '?': return usage(samtools_stderr, n_files);
+        case '?': return usage(samtools_stderr, n_files, reads_store);
          }
      }
      if (is_un) clevel = 0;
-    if (optind + 2 > argc)
-        return usage(samtools_stderr, n_files);
+    if (argc >= optind + 2) prefix = argv[optind+1];
+    if (!(prefix || is_samtools_stdout || output_file))
+        return usage(samtools_stderr, n_files, reads_store);
+    if (is_samtools_stdout && output_file) {
+        fprintf(samtools_stderr, "collate: -o and -O options cannot be used together.\n");
+        return usage(samtools_stderr, n_files, reads_store);
+    }
+    if (!prefix) {
+        prefix = generate_prefix();
+        pre_mem = 1;
+    }
+
+    if (!prefix) return EXIT_FAILURE;
+
+    ret = bamshuf(argv[optind], n_files, prefix, clevel, is_samtools_stdout,
+                   output_file, fast_coll, reads_store, &ga);
+
+    if (pre_mem) free(prefix);
  
-    return bamshuf(argv[optind], n_files, argv[optind+1], clevel, is_samtools_stdout, &ga);
+    return ret;
  }
diff --git a/samtools/bamtk.c b/samtools/bamtk.c

index d1e89c685516e19d0d29274f6ace06931b31ca48..8e7b0b0e6d52bb6356b88847476aa9b6c4a2ec6a 100644 (file)
--- a/samtools/bamtk.c
+++ b/samtools/bamtk.c
@@ -31,6 +31,7 @@ DEALINGS IN THE SOFTWARE.  */
  
  #include "htslib/hts.h"
  #include "samtools.h"
+#include "version.h"
  
  int bam_taf2baf(int argc, char *argv[]);
  int bam_mpileup(int argc, char *argv[]);
@@ -62,7 +63,12 @@ int main_quickcheck(int argc, char *argv[]);
  int main_addreplacerg(int argc, char *argv[]);
  int faidx_main(int argc, char *argv[]);
  int dict_main(int argc, char *argv[]);
+int fqidx_main(int argc, char *argv[]);
  
+const char *samtools_version()
+{
+    return SAMTOOLS_VERSION;
+}
  
  static void usage(FILE *fp)
  {
@@ -79,6 +85,7 @@ static void usage(FILE *fp)
  "  -- Indexing\n"
  "     dict           create a sequence dictionary file\n"
  "     faidx          index/extract FASTA\n"
+"     fqidx          index/extract FASTQ\n"
  "     index          index alignment\n"
  "\n"
  "  -- Editing\n"
@@ -161,6 +168,7 @@ int main(int argc, char *argv[])
      else if (strcmp(argv[1], "index") == 0)     ret = bam_index(argc-1, argv+1);
      else if (strcmp(argv[1], "idxstats") == 0)  ret = bam_idxstats(argc-1, argv+1);
      else if (strcmp(argv[1], "faidx") == 0)     ret = faidx_main(argc-1, argv+1);
+    else if (strcmp(argv[1], "fqidx") == 0)     ret = fqidx_main(argc-1, argv+1);
      else if (strcmp(argv[1], "dict") == 0)      ret = dict_main(argc-1, argv+1);
      else if (strcmp(argv[1], "fixmate") == 0)   ret = bam_mating(argc-1, argv+1);
      else if (strcmp(argv[1], "rmdup") == 0)     ret = bam_rmdup(argc-1, argv+1);
diff --git a/samtools/bamtk.c.pysam.c b/samtools/bamtk.c.pysam.c

index e14f01c7ac7b93440775c75a258a0877664fbdd5..b2e8fa654cf6faaa6902fc14a67a25341ac3a06e 100644 (file)
--- a/samtools/bamtk.c.pysam.c
+++ b/samtools/bamtk.c.pysam.c
@@ -33,13 +33,14 @@ DEALINGS IN THE SOFTWARE.  */
  
  #include "htslib/hts.h"
  #include "samtools.h"
+#include "version.h"
  
  int bam_taf2baf(int argc, char *argv[]);
  int bam_mpileup(int argc, char *argv[]);
  int bam_merge(int argc, char *argv[]);
  int bam_index(int argc, char *argv[]);
  int bam_sort(int argc, char *argv[]);
-// int bam_tview_main(int argc, char *argv[]);
+//int bam_tview_main(int argc, char *argv[]);
  int bam_mating(int argc, char *argv[]);
  int bam_rmdup(int argc, char *argv[]);
  int bam_flagstat(int argc, char *argv[]);
@@ -64,7 +65,12 @@ int main_quickcheck(int argc, char *argv[]);
  int main_addreplacerg(int argc, char *argv[]);
  int faidx_main(int argc, char *argv[]);
  int dict_main(int argc, char *argv[]);
+int fqidx_main(int argc, char *argv[]);
  
+const char *samtools_version()
+{
+    return SAMTOOLS_VERSION;
+}
  
  static void usage(FILE *fp)
  {
@@ -81,6 +87,7 @@ static void usage(FILE *fp)
  "  -- Indexing\n"
  "     dict           create a sequence dictionary file\n"
  "     faidx          index/extract FASTA\n"
+"     fqidx          index/extract FASTQ\n"
  "     index          index alignment\n"
  "\n"
  "  -- Editing\n"
@@ -163,6 +170,7 @@ int samtools_main(int argc, char *argv[])
      else if (strcmp(argv[1], "index") == 0)     ret = bam_index(argc-1, argv+1);
      else if (strcmp(argv[1], "idxstats") == 0)  ret = bam_idxstats(argc-1, argv+1);
      else if (strcmp(argv[1], "faidx") == 0)     ret = faidx_main(argc-1, argv+1);
+    else if (strcmp(argv[1], "fqidx") == 0)     ret = fqidx_main(argc-1, argv+1);
      else if (strcmp(argv[1], "dict") == 0)      ret = dict_main(argc-1, argv+1);
      else if (strcmp(argv[1], "fixmate") == 0)   ret = bam_mating(argc-1, argv+1);
      else if (strcmp(argv[1], "rmdup") == 0)     ret = bam_rmdup(argc-1, argv+1);
@@ -192,9 +200,9 @@ int samtools_main(int argc, char *argv[])
          fprintf(samtools_stderr, "[main] The `pileup' command has been removed. Please use `mpileup' instead.\n");
          return 1;
      }
-    //    else if (strcmp(argv[1], "tview") == 0)   ret = bam_tview_main(argc-1, argv+1);
+    //else if (strcmp(argv[1], "tview") == 0)   ret = bam_tview_main(argc-1, argv+1);
      else if (strcmp(argv[1], "--version") == 0) {
-        fprintf(samtools_stdout, 
+        fprintf(samtools_stdout,
  "samtools %s\n"
  "Using htslib %s\n"
  "Copyright (C) 2018 Genome Research Ltd.\n",
diff --git a/samtools/bedcov.c b/samtools/bedcov.c

index 10983097e52fae233ee27e693d662b5e398cb01d..40306e42573f7e6e296013707a3a10c808a5eb08 100644 (file)
--- a/samtools/bedcov.c
+++ b/samtools/bedcov.c
@@ -68,7 +68,7 @@ int main_bedcov(int argc, char *argv[])
      kstream_t *ks;
      hts_idx_t **idx;
      aux_t **aux;
-    int *n_plp, dret, i, n, c, min_mapQ = 0;
+    int *n_plp, dret, i, j, m, n, c, min_mapQ = 0, skip_DN = 0;
      int64_t *cnt;
      const bam_pileup1_t **plp;
      int usage = 0;
@@ -79,9 +79,10 @@ int main_bedcov(int argc, char *argv[])
          { NULL, 0, NULL, 0 }
      };
  
-    while ((c = getopt_long(argc, argv, "Q:", lopts, NULL)) >= 0) {
+    while ((c = getopt_long(argc, argv, "Q:j", lopts, NULL)) >= 0) {
          switch (c) {
          case 'Q': min_mapQ = atoi(optarg); break;
+        case 'j': skip_DN = 1; break;
          default:  if (parse_sam_global_opt(c, optarg, lopts, &ga) == 0) break;
                    /* else fall-through */
          case '?': usage = 1; break;
@@ -91,7 +92,8 @@ int main_bedcov(int argc, char *argv[])
      if (usage || optind + 2 > argc) {
          fprintf(stderr, "Usage: samtools bedcov [options] <in.bed> <in1.bam> [...]\n\n");
          fprintf(stderr, "Options:\n");
-        fprintf(stderr, "   -Q <int>            mapping quality threshold [0]\n");
+        fprintf(stderr, "      -Q <int>            mapping quality threshold [0]\n");
+        fprintf(stderr, "      -j                  do not include deletions (D) and ref skips (N) in bedcov computation\n");
          sam_global_opt_help(stderr, "-.--.-");
          return 1;
      }
@@ -155,8 +157,16 @@ int main_bedcov(int argc, char *argv[])
          bam_mplp_set_maxcnt(mplp, 64000);
          memset(cnt, 0, 8 * n);
          while (bam_mplp_auto(mplp, &tid, &pos, n_plp, plp) > 0)
-            if (pos >= beg && pos < end)
-                for (i = 0; i < n; ++i) cnt[i] += n_plp[i];
+            if (pos >= beg && pos < end) {
+                for (i = 0, m = 0; i < n; ++i) {
+                    if (skip_DN)
+                        for (j = 0, m = 0; j < n_plp[i]; ++j) { // reset m for each input file
+                            const bam_pileup1_t *pi = plp[i] + j;
+                            if (pi->is_del || pi->is_refskip) ++m;
+                        }
+                    cnt[i] += n_plp[i] - m;
+                }
+            }
          for (i = 0; i < n; ++i) {
              kputc('\t', &str);
              kputl(cnt[i], &str);
diff --git a/samtools/bedcov.c.pysam.c b/samtools/bedcov.c.pysam.c

index fa7c9a26350adeadd5126c2939309584ce843f74..165fed72685ee8de03d3a2c1745f45922a649550 100644 (file)
--- a/samtools/bedcov.c.pysam.c
+++ b/samtools/bedcov.c.pysam.c
@@ -70,7 +70,7 @@ int main_bedcov(int argc, char *argv[])
      kstream_t *ks;
      hts_idx_t **idx;
      aux_t **aux;
-    int *n_plp, dret, i, n, c, min_mapQ = 0;
+    int *n_plp, dret, i, j, m, n, c, min_mapQ = 0, skip_DN = 0;
      int64_t *cnt;
      const bam_pileup1_t **plp;
      int usage = 0;
@@ -81,9 +81,10 @@ int main_bedcov(int argc, char *argv[])
          { NULL, 0, NULL, 0 }
      };
  
-    while ((c = getopt_long(argc, argv, "Q:", lopts, NULL)) >= 0) {
+    while ((c = getopt_long(argc, argv, "Q:j", lopts, NULL)) >= 0) {
          switch (c) {
          case 'Q': min_mapQ = atoi(optarg); break;
+        case 'j': skip_DN = 1; break;
          default:  if (parse_sam_global_opt(c, optarg, lopts, &ga) == 0) break;
                    /* else fall-through */
          case '?': usage = 1; break;
@@ -93,7 +94,8 @@ int main_bedcov(int argc, char *argv[])
      if (usage || optind + 2 > argc) {
          fprintf(samtools_stderr, "Usage: samtools bedcov [options] <in.bed> <in1.bam> [...]\n\n");
          fprintf(samtools_stderr, "Options:\n");
-        fprintf(samtools_stderr, "   -Q <int>            mapping quality threshold [0]\n");
+        fprintf(samtools_stderr, "      -Q <int>            mapping quality threshold [0]\n");
+        fprintf(samtools_stderr, "      -j                  do not include deletions (D) and ref skips (N) in bedcov computation\n");
          sam_global_opt_help(samtools_stderr, "-.--.-");
          return 1;
      }
@@ -157,13 +159,21 @@ int main_bedcov(int argc, char *argv[])
          bam_mplp_set_maxcnt(mplp, 64000);
          memset(cnt, 0, 8 * n);
          while (bam_mplp_auto(mplp, &tid, &pos, n_plp, plp) > 0)
-            if (pos >= beg && pos < end)
-                for (i = 0; i < n; ++i) cnt[i] += n_plp[i];
+            if (pos >= beg && pos < end) {
+                for (i = 0, m = 0; i < n; ++i) {
+                    if (skip_DN)
+                        for (j = 0, m = 0; j < n_plp[i]; ++j) { // reset m for each input file
+                            const bam_pileup1_t *pi = plp[i] + j;
+                            if (pi->is_del || pi->is_refskip) ++m;
+                        }
+                    cnt[i] += n_plp[i] - m;
+                }
+            }
          for (i = 0; i < n; ++i) {
              kputc('\t', &str);
              kputl(cnt[i], &str);
          }
-        fputs(str.s, samtools_stdout) & fputc('\n', samtools_stdout);
+        samtools_puts(str.s);
          bam_mplp_destroy(mplp);
          continue;
  
diff --git a/samtools/bedidx.c b/samtools/bedidx.c

index 3489c27688e2195f2a74bdde1d81778ca219b6c6..ec66d0ff80157ef35fe347a041f674d510dafefc 100644 (file)
--- a/samtools/bedidx.c
+++ b/samtools/bedidx.c
@@ -1,7 +1,7 @@
  /*  bedidx.c -- BED file indexing.
  
      Copyright (C) 2011 Broad Institute.
-    Copyright (C) 2014 Genome Research Ltd.
+    Copyright (C) 2014,2017 Genome Research Ltd.
  
      Author: Heng Li <lh3@sanger.ac.uk>
  
@@ -186,7 +186,7 @@ int bed_overlap(const void *_h, const char *chr, int beg, int end)
   *  @param reg_hash    the region hash table with interval lists as values
   */
  
-static void bed_unify(void *reg_hash) {
+void bed_unify(void *reg_hash) {
  
      int i, j, new_n;
      reghash_t *h;
@@ -251,7 +251,7 @@ void *bed_read(const char *fn)
      gzFile fp;
      kstream_t *ks = NULL;
      int dret;
-    unsigned int line = 0;
+    unsigned int line = 0, save_errno;
      kstring_t str = { 0, 0, NULL };
  
      if (NULL == h) return NULL;
@@ -286,9 +286,18 @@ void *bed_read(const char *fn)
              // has called their reference "browser" or "track".
              if (0 == strcmp(ref, "browser")) continue;
              if (0 == strcmp(ref, "track")) continue;
-            fprintf(stderr, "[bed_read] Parse error reading %s at line %u\n",
-                    fn, line);
-            goto fail_no_msg;
+            if (num < 1) {
+                fprintf(stderr,
+                        "[bed_read] Parse error reading \"%s\" at line %u\n",
+                        fn, line);
+            } else {
+                fprintf(stderr,
+                        "[bed_read] Parse error reading \"%s\" at line %u : "
+                        "end (%u) must not be less than start (%u)\n",
+                        fn, line, end, beg);
+            }
+            errno = 0; // Prevent caller from printing misleading error messages
+            goto fail;
          }
  
          // Put reg in the hash table if not already there
@@ -324,12 +333,12 @@ void *bed_read(const char *fn)
      //bed_unify(h);
      return h;
   fail:
-    fprintf(stderr, "[bed_read] Error reading %s : %s\n", fn, strerror(errno));
- fail_no_msg:
+    save_errno = errno;
      if (ks) ks_destroy(ks);
      if (fp) gzclose(fp);
      free(str.s);
      bed_destroy(h);
+    errno = save_errno;
      return NULL;
  }
  
diff --git a/samtools/bedidx.c.pysam.c b/samtools/bedidx.c.pysam.c

index fa92fa0cc7ed845c18cdbf8c3272a0e4ad9f82f9..f6cee07f0f4697f14376e2b9924abd6178e2d0ba 100644 (file)
--- a/samtools/bedidx.c.pysam.c
+++ b/samtools/bedidx.c.pysam.c
@@ -3,7 +3,7 @@
  /*  bedidx.c -- BED file indexing.
  
      Copyright (C) 2011 Broad Institute.
-    Copyright (C) 2014 Genome Research Ltd.
+    Copyright (C) 2014,2017 Genome Research Ltd.
  
      Author: Heng Li <lh3@sanger.ac.uk>
  
@@ -188,7 +188,7 @@ int bed_overlap(const void *_h, const char *chr, int beg, int end)
   *  @param reg_hash    the region hash table with interval lists as values
   */
  
-static void bed_unify(void *reg_hash) {
+void bed_unify(void *reg_hash) {
  
      int i, j, new_n;
      reghash_t *h;
@@ -253,7 +253,7 @@ void *bed_read(const char *fn)
      gzFile fp;
      kstream_t *ks = NULL;
      int dret;
-    unsigned int line = 0;
+    unsigned int line = 0, save_errno;
      kstring_t str = { 0, 0, NULL };
  
      if (NULL == h) return NULL;
@@ -288,9 +288,18 @@ void *bed_read(const char *fn)
              // has called their reference "browser" or "track".
              if (0 == strcmp(ref, "browser")) continue;
              if (0 == strcmp(ref, "track")) continue;
-            fprintf(samtools_stderr, "[bed_read] Parse error reading %s at line %u\n",
-                    fn, line);
-            goto fail_no_msg;
+            if (num < 1) {
+                fprintf(samtools_stderr,
+                        "[bed_read] Parse error reading \"%s\" at line %u\n",
+                        fn, line);
+            } else {
+                fprintf(samtools_stderr,
+                        "[bed_read] Parse error reading \"%s\" at line %u : "
+                        "end (%u) must not be less than start (%u)\n",
+                        fn, line, end, beg);
+            }
+            errno = 0; // Prevent caller from printing misleading error messages
+            goto fail;
          }
  
          // Put reg in the hash table if not already there
@@ -326,12 +335,12 @@ void *bed_read(const char *fn)
      //bed_unify(h);
      return h;
   fail:
-    fprintf(samtools_stderr, "[bed_read] Error reading %s : %s\n", fn, strerror(errno));
- fail_no_msg:
+    save_errno = errno;
      if (ks) ks_destroy(ks);
      if (fp) gzclose(fp);
      free(str.s);
      bed_destroy(h);
+    errno = save_errno;
      return NULL;
  }
  
diff --git a/samtools/bedidx.h b/samtools/bedidx.h

index a33a65f2ea3fe3deec7847d0ec2a22e948728889..1baeec1b0a9bee1c5fb4ebce557aedcd604d7238 100644 (file)
--- a/samtools/bedidx.h
+++ b/samtools/bedidx.h
@@ -1,3 +1,27 @@
+/*  bedidx.h -- BED file indexing header file.
+
+    Copyright (C) 2017 Genome Research Ltd.
+
+    Author: Valeriu Ohan <vo2@sanger.ac.uk>
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE.  */
+
  #ifndef BEDIDX_H
  #define BEDIDX_H
  
@@ -16,5 +40,6 @@ int bed_overlap(const void *_h, const char *chr, int beg, int end);
  void *bed_hash_regions(void *reg_hash, char **regs, int first, int last, int *op);
  const char* bed_get(void *reg_hash, int index, int filter);
  hts_reglist_t *bed_reglist(void *reg_hash, int filter, int *count_regs);
+void bed_unify(void *_h);
  
  #endif
diff --git a/samtools/faidx.c b/samtools/faidx.c

index c5c9ed635cfb95f908d49c3786d666720fbb8d3c..6654cf2de1ff622d18d690e4ab0bf7de39e1be8b 100644 (file)
--- a/samtools/faidx.c
+++ b/samtools/faidx.c
@@ -1,6 +1,6 @@
  /*  faidx.c -- faidx subcommand.
  
-    Copyright (C) 2008, 2009, 2013, 2016 Genome Research Ltd.
+    Copyright (C) 2008, 2009, 2013, 2016, 2018 Genome Research Ltd.
      Portions copyright (C) 2011 Broad Institute.
  
      Author: Heng Li <lh3@sanger.ac.uk>
@@ -21,85 +21,361 @@ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
  THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
  LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
  FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
-DEALINGS IN THE SOFTWARE.  */
+DEALINGS IN THE SOFTWARE.
+
+History:
+
+  * 2016-01-12: Pierre Lindenbaum @yokofakun : added options -o -n
+
+*/
  
  #include <config.h>
  
+#include <ctype.h>
+#include <string.h>
  #include <stdlib.h>
  #include <stdio.h>
  #include <unistd.h>
-
+#include <stdarg.h>
+#include <errno.h>
+#include <getopt.h>
+#include <limits.h>
  #include <htslib/faidx.h>
+#include <htslib/hts.h>
+#include <htslib/hfile.h>
+#include <htslib/kstring.h>
  #include "samtools.h"
  
-static int usage(FILE *fp, int exit_status)
+#define DEFAULT_FASTA_LINE_LEN 60
+
+static unsigned char comp_base[256] = {
+  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,  14,  15,
+ 16,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,  27,  28,  29,  30,  31,
+ 32, '!', '"', '#', '$', '%', '&', '\'','(', ')', '*', '+', ',', '-', '.', '/',
+'0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<', '=', '>', '?',
+'@', 'T', 'V', 'G', 'H', 'E', 'F', 'C', 'D', 'I', 'J', 'M', 'L', 'K', 'N', 'O',
+'P', 'Q', 'Y', 'S', 'A', 'A', 'B', 'W', 'X', 'R', 'Z', '[', '\\',']', '^', '_',
+'`', 't', 'v', 'g', 'h', 'e', 'f', 'c', 'd', 'i', 'j', 'm', 'l', 'k', 'n', 'o',
+'p', 'q', 'y', 's', 'a', 'a', 'b', 'w', 'x', 'r', 'z', '{', '|', '}', '~', 127,
+128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143,
+144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159,
+160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175,
+176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191,
+192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207,
+208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223,
+224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239,
+240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255,
+};
+
+static void reverse_complement(char *str, int len) {
+    char c;
+    int i = 0, j = len - 1;
+
+    while (i <= j) {
+        c = str[i];
+        str[i] = comp_base[(unsigned char)str[j]];
+        str[j] = comp_base[(unsigned char)c];
+        i++;
+        j--;
+    }
+}
+
+
+static void reverse(char *str, int len) {
+    char c;
+    int i = 0, j = len - 1;
+
+    while (i < j) {
+        c = str[i];
+        str[i] = str[j];
+        str[j] = c;
+        i++;
+        j--;
+    }
+}
+
+
+static int write_line(FILE *file, const char *line, const char *name, const int ignore,
+                      const int length, const int seq_len) {
+    int beg, end;
+
+    if (seq_len < 0) {
+        fprintf(stderr, "[faidx] Failed to fetch sequence in %s\n", name);
+
+        if (ignore && seq_len == -2) {
+            return EXIT_SUCCESS;
+        } else {
+            return EXIT_FAILURE;
+        }
+    } else if (seq_len == 0) {
+        fprintf(stderr, "[faidx] Zero length sequence: %s\n", name);
+    } else if (hts_parse_reg(name, &beg, &end) && (end < INT_MAX) && (seq_len != end - beg)) {
+        fprintf(stderr, "[faidx] Truncated sequence: %s\n", name);
+    }
+
+    size_t i, seq_sz = seq_len;
+
+    for (i = 0; i < seq_sz; i += length)
+    {
+        size_t len = i + length < seq_sz ? length : seq_sz - i;
+        if (fwrite(line + i, 1, len, file) < len ||
+            fputc('\n', file) == EOF) {
+            print_error_errno("faidx", "failed to write output");
+            return EXIT_FAILURE;
+        }
+    }
+
+    return EXIT_SUCCESS;
+}
+
+
+static int write_output(faidx_t *faid, FILE *file, const char *name, const int ignore,
+                        const int length, const int rev,
+                        const char *pos_strand_name, const char *neg_strand_name,
+                        enum fai_format_options format) {
+    int seq_len;
+    char *seq = fai_fetch(faid, name, &seq_len);
+
+    if (format == FAI_FASTA) {
+        fprintf(file, ">%s%s\n", name, rev ? neg_strand_name : pos_strand_name);
+    } else {
+        fprintf(file, "@%s%s\n", name, rev ? neg_strand_name : pos_strand_name);
+    }
+
+    if (rev && seq_len > 0) {
+        reverse_complement(seq, seq_len);
+    }
+
+    if (write_line(file, seq, name, ignore, length, seq_len) == EXIT_FAILURE) {
+        free(seq);
+        return EXIT_FAILURE;
+    }
+
+    free(seq);
+
+    if (format == FAI_FASTQ) {
+        fprintf(file, "+\n");
+
+        char *qual = fai_fetchqual(faid, name, &seq_len);
+
+        if (rev && seq_len > 0) {
+            reverse(qual, seq_len);
+        }
+
+        if (write_line(file, qual, name, ignore, length, seq_len) == EXIT_FAILURE) {
+            free(qual);
+            return EXIT_FAILURE;
+        }
+
+        free(qual);
+    }
+
+    return EXIT_SUCCESS;
+}
+
+
+static int read_regions_from_file(faidx_t *faid, hFILE *in_file, FILE *file, const int ignore,
+                                  const int length, const int rev,
+                                  const char *pos_strand_name,
+                                  const char *neg_strand_name,
+                                  enum fai_format_options format) {
+    kstring_t line = {0, 0, NULL};
+    int ret = EXIT_FAILURE;
+
+    while (line.l = 0, kgetline(&line, (kgets_func *)hgets, in_file) >= 0) {
+        if ((ret = write_output(faid, file, line.s, ignore, length, rev, pos_strand_name, neg_strand_name, format)) == EXIT_FAILURE) {
+            break;
+        }
+    }
+
+    free(line.s);
+
+    return ret;
+}
+
+static int usage(FILE *fp, enum fai_format_options format, int exit_status)
  {
-    fprintf(fp, "Usage: samtools faidx <file.fa|file.fa.gz> [<reg> [...]]\n");
+    char *tool, *file_type;
+
+    if (format == FAI_FASTA) {
+        tool = "faidx <file.fa|file.fa.gz>";
+        file_type = "FASTA";
+    } else {
+        tool = "fqidx <file.fq|file.fq.gz>";
+        file_type = "FASTQ";
+    }
+
+    fprintf(fp, "Usage: samtools %s [<reg> [...]]\n", tool);
+    fprintf(fp, "Options:\n"
+                " -o, --output FILE        Write %s to file.\n"
+                " -n, --length INT         Length of %s sequence line. [60]\n"
+                " -c, --continue           Continue after trying to retrieve missing region.\n"
+                " -r, --region-file FILE   File of regions.  Format is chr:from-to. One per line.\n"
+                " -i, --reverse-complement Reverse complement sequences.\n"
+                "     --mark-strand TYPE   Add strand indicator to sequence name\n"
+                "                          TYPE = rc   for /rc on negative strand (default)\n"
+                "                                 no   for no strand indicator\n"
+                "                                 sign for (+) / (-)\n"
+                "                                 custom,<pos>,<neg> for custom indicator\n",
+                file_type, file_type);
+
+
+    if (format == FAI_FASTA) {
+       fprintf(fp, " -f, --fastq              File and index in FASTQ format.\n");
+    }
+
+    fprintf(fp, " -h, --help               This message.\n");
+
      return exit_status;
  }
  
-int faidx_main(int argc, char *argv[])
+int faidx_core(int argc, char *argv[], enum fai_format_options format)
  {
-    int c;
-    while((c  = getopt(argc, argv, "h")) >= 0)
-    {
-        switch(c)
-        {
-            case 'h':
-                return usage(stdout, EXIT_SUCCESS);
+    int c, ignore_error = 0, rev = 0;
+    int line_len = DEFAULT_FASTA_LINE_LEN ;/* fasta line len */
+    char* output_file = NULL; /* output file (default is stdout ) */
+    char *region_file = NULL; // list of regions from file, one per line
+    char *pos_strand_name = ""; // Extension to add to name for +ve strand
+    char *neg_strand_name = "/rc"; // Extension to add to name for -ve strand
+    char *strand_names = NULL; // Used for custom strand annotation
+    FILE* file_out = stdout;/* output stream */
+
+    static const struct option lopts[] = {
+        { "output", required_argument,       NULL, 'o' },
+        { "help",   no_argument,             NULL, 'h' },
+        { "length", required_argument,       NULL, 'n' },
+        { "continue", no_argument,           NULL, 'c' },
+        { "region-file", required_argument,  NULL, 'r' },
+        { "fastq", no_argument,              NULL, 'f' },
+        { "reverse-complement", no_argument, NULL, 'i' },
+        { "mark-strand", required_argument, NULL, 1000 },
+        { NULL, 0, NULL, 0 }
+    };
  
-            default:
-                return usage(stderr, EXIT_FAILURE);
+    while ((c = getopt_long(argc, argv, "ho:n:cr:fi", lopts, NULL)) >= 0) {
+        switch (c) {
+            case 'o': output_file = optarg; break;
+            case 'n': line_len = atoi(optarg);
+                      if(line_len<1) {
+                        fprintf(stderr,"[faidx] bad line length '%s', using default:%d\n",optarg,DEFAULT_FASTA_LINE_LEN);
+                        line_len= DEFAULT_FASTA_LINE_LEN ;
+                        }
+                      break;
+            case 'c': ignore_error = 1; break;
+            case 'r': region_file = optarg; break;
+            case 'f': format = FAI_FASTQ; break;
+            case 'i': rev = 1; break;
+            case '?': return usage(stderr, format, EXIT_FAILURE);
+            case 'h': return usage(stdout, format, EXIT_SUCCESS);
+            case 1000:
+                if (strcmp(optarg, "no") == 0) {
+                    pos_strand_name = neg_strand_name = "";
+                } else if (strcmp(optarg, "sign") == 0) {
+                    pos_strand_name = "(+)";
+                    neg_strand_name = "(-)";
+                } else if (strcmp(optarg, "rc") == 0) {
+                    pos_strand_name = "";
+                    neg_strand_name = "/rc";
+                } else if (strncmp(optarg, "custom,", 7) == 0) {
+                    size_t len = strlen(optarg + 7);
+                    size_t comma = strcspn(optarg + 7, ",");
+                    free(strand_names);
+                    strand_names = pos_strand_name = malloc(len + 2);
+                    if (!strand_names) {
+                        fprintf(stderr, "[faidx] Out of memory\n");
+                        return EXIT_FAILURE;
+                    }
+                    neg_strand_name = pos_strand_name + comma + 1;
+                    memcpy(pos_strand_name, optarg + 7, comma);
+                    pos_strand_name[comma] = '\0';
+                    if (comma < len)
+                        memcpy(neg_strand_name, optarg + 7 + comma + 1,
+                               len - comma);
+                    neg_strand_name[len - comma] = '\0';
+                } else {
+                    fprintf(stderr, "[faidx] Unknown --mark-strand option \"%s\"\n", optarg);
+                    return usage(stderr, format, EXIT_FAILURE);
+                }
+                break;
+            default:  break;
          }
      }
+
      if ( argc==optind )
-        return usage(stdout, EXIT_SUCCESS);
-    if ( argc==2 )
+        return usage(stdout, format, EXIT_SUCCESS);
+
+    if ( optind+1 == argc && !region_file)
      {
          if (fai_build(argv[optind]) != 0) {
-            fprintf(stderr, "Could not build fai index %s.fai\n", argv[optind]);
+            fprintf(stderr, "[faidx] Could not build fai index %s.fai\n", argv[optind]);
              return EXIT_FAILURE;
          }
          return 0;
      }
  
-    faidx_t *fai = fai_load(argv[optind]);
+    faidx_t *fai = fai_load_format(argv[optind], format);
+
      if ( !fai ) {
-        fprintf(stderr, "Could not load fai index of %s\n", argv[optind]);
+        fprintf(stderr, "[faidx] Could not load fai index of %s\n", argv[optind]);
          return EXIT_FAILURE;
      }
  
-    int exit_status = EXIT_SUCCESS;
+    /** output file provided by user */
+    if( output_file != NULL ) {
+        if( strcmp( output_file, argv[optind] ) == 0 ) {
+            fprintf(stderr,"[faidx] Same input/output : %s\n", output_file);
+            return EXIT_FAILURE;
+        }
  
-    while ( ++optind<argc && exit_status == EXIT_SUCCESS)
-    {
-        printf(">%s\n", argv[optind]);
-        int seq_len;
-        char *seq = fai_fetch(fai, argv[optind], &seq_len);
-        if ( seq_len < 0 ) {
-            fprintf(stderr, "Failed to fetch sequence in %s\n", argv[optind]);
-            exit_status = EXIT_FAILURE;
-            break;
+        file_out = fopen( output_file, "w" );
+
+        if( file_out == NULL) {
+            fprintf(stderr,"[faidx] Cannot open \"%s\" for writing :%s.\n", output_file, strerror(errno) );
+            return EXIT_FAILURE;
          }
-        size_t i, seq_sz = seq_len;
-        for (i=0; i<seq_sz; i+=60)
-        {
-            size_t len = i + 60 < seq_sz ? 60 : seq_sz - i;
-            if (fwrite(seq + i, 1, len, stdout) < len ||
-                putchar('\n') == EOF) {
-                print_error_errno("faidx", "failed to write output");
-                exit_status = EXIT_FAILURE;
-                break;
+    }
+
+    int exit_status = EXIT_SUCCESS;
+
+    if (region_file) {
+        hFILE *rf;
+
+        if ((rf = hopen(region_file, "r"))) {
+            exit_status = read_regions_from_file(fai, rf, file_out, ignore_error, line_len, rev, pos_strand_name, neg_strand_name, format);
+
+            if (hclose(rf) != 0) {
+                fprintf(stderr, "[faidx] Warning: failed to close %s", region_file);
              }
+        } else {
+            fprintf(stderr, "[faidx] Failed to open \"%s\" for reading.\n", region_file);
+            exit_status = EXIT_FAILURE;
          }
-        free(seq);
      }
+
+    while ( ++optind<argc && exit_status == EXIT_SUCCESS) {
+        exit_status = write_output(fai, file_out, argv[optind], ignore_error, line_len, rev, pos_strand_name, neg_strand_name, format);
+    }
+
      fai_destroy(fai);
  
-    if (fflush(stdout) == EOF) {
+    if (fflush(file_out) == EOF) {
          print_error_errno("faidx", "failed to flush output");
          exit_status = EXIT_FAILURE;
      }
  
+    if( output_file != NULL) fclose(file_out);
+    free(strand_names);
+
      return exit_status;
  }
+
+
+int faidx_main(int argc, char *argv[]) {
+    return faidx_core(argc, argv, FAI_FASTA);
+}
+
+
+int fqidx_main(int argc, char *argv[]) {
+    return faidx_core(argc, argv, FAI_FASTQ);
+}
+
diff --git a/samtools/faidx.c.pysam.c b/samtools/faidx.c.pysam.c
index 37eb247a179deee94d7ae2657b87c6832f9927d1..559f724900150436a5705597ebf7d2851cdaa5b2 100644
--- a/samtools/faidx.c.pysam.c
+++ b/samtools/faidx.c.pysam.c
@@ -2,7 +2,7 @@
  
  /*  faidx.c -- faidx subcommand.
  
-    Copyright (C) 2008, 2009, 2013, 2016 Genome Research Ltd.
+    Copyright (C) 2008, 2009, 2013, 2016, 2018 Genome Research Ltd.
      Portions copyright (C) 2011 Broad Institute.
  
      Author: Heng Li <lh3@sanger.ac.uk>
@@ -23,85 +23,361 @@ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
  THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
  LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
  FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
-DEALINGS IN THE SOFTWARE.  */
+DEALINGS IN THE SOFTWARE.
+
+History:
+
+  * 2016-01-12: Pierre Lindenbaum @yokofakun : added options -o -n
+
+*/
  
  #include <config.h>
  
+#include <ctype.h>
+#include <string.h>
  #include <stdlib.h>
  #include <stdio.h>
  #include <unistd.h>
-
+#include <stdarg.h>
+#include <errno.h>
+#include <getopt.h>
+#include <limits.h>
  #include <htslib/faidx.h>
+#include <htslib/hts.h>
+#include <htslib/hfile.h>
+#include <htslib/kstring.h>
  #include "samtools.h"
  
-static int usage(FILE *fp, int exit_status)
+#define DEFAULT_FASTA_LINE_LEN 60
+
+static unsigned char comp_base[256] = {
+  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,  14,  15,
+ 16,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,  27,  28,  29,  30,  31,
+ 32, '!', '"', '#', '$', '%', '&', '\'','(', ')', '*', '+', ',', '-', '.', '/',
+'0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<', '=', '>', '?',
+'@', 'T', 'V', 'G', 'H', 'E', 'F', 'C', 'D', 'I', 'J', 'M', 'L', 'K', 'N', 'O',
+'P', 'Q', 'Y', 'S', 'A', 'A', 'B', 'W', 'X', 'R', 'Z', '[', '\\',']', '^', '_',
+'`', 't', 'v', 'g', 'h', 'e', 'f', 'c', 'd', 'i', 'j', 'm', 'l', 'k', 'n', 'o',
+'p', 'q', 'y', 's', 'a', 'a', 'b', 'w', 'x', 'r', 'z', '{', '|', '}', '~', 127,
+128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143,
+144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159,
+160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175,
+176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191,
+192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207,
+208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223,
+224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239,
+240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255,
+};
+
+static void reverse_complement(char *str, int len) {
+    char c;
+    int i = 0, j = len - 1;
+
+    while (i <= j) {
+        c = str[i];
+        str[i] = comp_base[(unsigned char)str[j]];
+        str[j] = comp_base[(unsigned char)c];
+        i++;
+        j--;
+    }
+}
+
+
+static void reverse(char *str, int len) {
+    char c;
+    int i = 0, j = len - 1;
+
+    while (i < j) {
+        c = str[i];
+        str[i] = str[j];
+        str[j] = c;
+        i++;
+        j--;
+    }
+}
+
+
+static int write_line(FILE *file, const char *line, const char *name, const int ignore,
+                      const int length, const int seq_len) {
+    int beg, end;
+
+    if (seq_len < 0) {
+        fprintf(samtools_stderr, "[faidx] Failed to fetch sequence in %s\n", name);
+
+        if (ignore && seq_len == -2) {
+            return EXIT_SUCCESS;
+        } else {
+            return EXIT_FAILURE;
+        }
+    } else if (seq_len == 0) {
+        fprintf(samtools_stderr, "[faidx] Zero length sequence: %s\n", name);
+    } else if (hts_parse_reg(name, &beg, &end) && (end < INT_MAX) && (seq_len != end - beg)) {
+        fprintf(samtools_stderr, "[faidx] Truncated sequence: %s\n", name);
+    }
+
+    size_t i, seq_sz = seq_len;
+
+    for (i = 0; i < seq_sz; i += length)
+    {
+        size_t len = i + length < seq_sz ? length : seq_sz - i;
+        if (fwrite(line + i, 1, len, file) < len ||
+            fputc('\n', file) == EOF) {
+            print_error_errno("faidx", "failed to write output");
+            return EXIT_FAILURE;
+        }
+    }
+
+    return EXIT_SUCCESS;
+}
+
+
+static int write_output(faidx_t *faid, FILE *file, const char *name, const int ignore,
+                        const int length, const int rev,
+                        const char *pos_strand_name, const char *neg_strand_name,
+                        enum fai_format_options format) {
+    int seq_len;
+    char *seq = fai_fetch(faid, name, &seq_len);
+
+    if (format == FAI_FASTA) {
+        fprintf(file, ">%s%s\n", name, rev ? neg_strand_name : pos_strand_name);
+    } else {
+        fprintf(file, "@%s%s\n", name, rev ? neg_strand_name : pos_strand_name);
+    }
+
+    if (rev && seq_len > 0) {
+        reverse_complement(seq, seq_len);
+    }
+
+    if (write_line(file, seq, name, ignore, length, seq_len) == EXIT_FAILURE) {
+        free(seq);
+        return EXIT_FAILURE;
+    }
+
+    free(seq);
+
+    if (format == FAI_FASTQ) {
+        fprintf(file, "+\n");
+
+        char *qual = fai_fetchqual(faid, name, &seq_len);
+
+        if (rev && seq_len > 0) {
+            reverse(qual, seq_len);
+        }
+
+        if (write_line(file, qual, name, ignore, length, seq_len) == EXIT_FAILURE) {
+            free(seq);
+            return EXIT_FAILURE;
+        }
+
+        free(qual);
+    }
+
+    return EXIT_SUCCESS;
+}
+
+
+static int read_regions_from_file(faidx_t *faid, hFILE *in_file, FILE *file, const int ignore,
+                                  const int length, const int rev,
+                                  const char *pos_strand_name,
+                                  const char *neg_strand_name,
+                                  enum fai_format_options format) {
+    kstring_t line = {0, 0, NULL};
+    int ret = EXIT_FAILURE;
+
+    while (line.l = 0, kgetline(&line, (kgets_func *)hgets, in_file) >= 0) {
+        if ((ret = write_output(faid, file, line.s, ignore, length, rev, pos_strand_name, neg_strand_name, format)) == EXIT_FAILURE) {
+            break;
+        }
+    }
+
+    free(line.s);
+
+    return ret;
+}
+
+static int usage(FILE *fp, enum fai_format_options format, int exit_status)
  {
-    fprintf(fp, "Usage: samtools faidx <file.fa|file.fa.gz> [<reg> [...]]\n");
+    char *tool, *file_type;
+
+    if (format == FAI_FASTA) {
+        tool = "faidx <file.fa|file.fa.gz>";
+        file_type = "FASTA";
+    } else {
+        tool = "fqidx <file.fq|file.fq.gz>";
+        file_type = "FASTQ";
+    }
+
+    fprintf(fp, "Usage: samtools %s [<reg> [...]]\n", tool);
+    fprintf(fp, "Option: \n"
+                " -o, --output FILE        Write %s to file.\n"
+                " -n, --length INT         Length of %s sequence line. [60]\n"
+                " -c, --continue           Continue after trying to retrieve missing region.\n"
+                " -r, --region-file FILE   File of regions.  Format is chr:from-to. One per line.\n"
+                " -i, --reverse-complement Reverse complement sequences.\n"
+                "     --mark-strand TYPE   Add strand indicator to sequence name\n"
+                "                          TYPE = rc   for /rc on negative strand (default)\n"
+                "                                 no   for no strand indicator\n"
+                "                                 sign for (+) / (-)\n"
+                "                                 custom,<pos>,<neg> for custom indicator\n",
+                file_type, file_type);
+
+
+    if (format == FAI_FASTA) {
+       fprintf(fp, " -f, --fastq              File and index in FASTQ format.\n");
+    }
+
+    fprintf(fp, " -h, --help               This message.\n");
+
      return exit_status;
  }
  
-int faidx_main(int argc, char *argv[])
+int faidx_core(int argc, char *argv[], enum fai_format_options format)
  {
-    int c;
-    while((c  = getopt(argc, argv, "h")) >= 0)
-    {
-        switch(c)
-        {
-            case 'h':
-                return usage(samtools_stdout, EXIT_SUCCESS);
+    int c, ignore_error = 0, rev = 0;
+    int line_len = DEFAULT_FASTA_LINE_LEN ;/* fasta line len */
+    char* output_file = NULL; /* output file (default is samtools_stdout ) */
+    char *region_file = NULL; // list of regions from file, one per line
+    char *pos_strand_name = ""; // Extension to add to name for +ve strand
+    char *neg_strand_name = "/rc"; // Extension to add to name for -ve strand
+    char *strand_names = NULL; // Used for custom strand annotation
+    FILE* file_out = samtools_stdout;/* output stream */
+
+    static const struct option lopts[] = {
+        { "output", required_argument,       NULL, 'o' },
+        { "help",   no_argument,             NULL, 'h' },
+        { "length", required_argument,       NULL, 'n' },
+        { "continue", no_argument,           NULL, 'c' },
+        { "region-file", required_argument,  NULL, 'r' },
+        { "fastq", no_argument,              NULL, 'f' },
+        { "reverse-complement", no_argument, NULL, 'i' },
+        { "mark-strand", required_argument, NULL, 1000 },
+        { NULL, 0, NULL, 0 }
+    };
  
-            default:
-                return usage(samtools_stderr, EXIT_FAILURE);
+    while ((c = getopt_long(argc, argv, "ho:n:cr:fi", lopts, NULL)) >= 0) {
+        switch (c) {
+            case 'o': output_file = optarg; break;
+            case 'n': line_len = atoi(optarg);
+                      if(line_len<1) {
+                        fprintf(samtools_stderr,"[faidx] bad line length '%s', using default:%d\n",optarg,DEFAULT_FASTA_LINE_LEN);
+                        line_len= DEFAULT_FASTA_LINE_LEN ;
+                        }
+                      break;
+            case 'c': ignore_error = 1; break;
+            case 'r': region_file = optarg; break;
+            case 'f': format = FAI_FASTQ; break;
+            case 'i': rev = 1; break;
+            case '?': return usage(samtools_stderr, format, EXIT_FAILURE);
+            case 'h': return usage(samtools_stdout, format, EXIT_SUCCESS);
+            case 1000:
+                if (strcmp(optarg, "no") == 0) {
+                    pos_strand_name = neg_strand_name = "";
+                } else if (strcmp(optarg, "sign") == 0) {
+                    pos_strand_name = "(+)";
+                    neg_strand_name = "(-)";
+                } else if (strcmp(optarg, "rc") == 0) {
+                    pos_strand_name = "";
+                    neg_strand_name = "/rc";
+                } else if (strncmp(optarg, "custom,", 7) == 0) {
+                    size_t len = strlen(optarg + 7);
+                    size_t comma = strcspn(optarg + 7, ",");
+                    free(strand_names);
+                    strand_names = pos_strand_name = malloc(len + 2);
+                    if (!strand_names) {
+                        fprintf(samtools_stderr, "[faidx] Out of memory\n");
+                        return EXIT_FAILURE;
+                    }
+                    neg_strand_name = pos_strand_name + comma + 1;
+                    memcpy(pos_strand_name, optarg + 7, comma);
+                    pos_strand_name[comma] = '\0';
+                    if (comma < len)
+                        memcpy(neg_strand_name, optarg + 7 + comma + 1,
+                               len - comma);
+                    neg_strand_name[len - comma] = '\0';
+                } else {
+                    fprintf(samtools_stderr, "[faidx] Unknown --mark-strand option \"%s\"\n", optarg);
+                    return usage(samtools_stderr, format, EXIT_FAILURE);
+                }
+                break;
+            default:  break;
          }
      }
+
      if ( argc==optind )
-        return usage(samtools_stdout, EXIT_SUCCESS);
-    if ( argc==2 )
+        return usage(samtools_stdout, format, EXIT_SUCCESS);
+
+    if ( optind+1 == argc && !region_file)
      {
          if (fai_build(argv[optind]) != 0) {
-            fprintf(samtools_stderr, "Could not build fai index %s.fai\n", argv[optind]);
+            fprintf(samtools_stderr, "[faidx] Could not build fai index %s.fai\n", argv[optind]);
              return EXIT_FAILURE;
          }
          return 0;
      }
  
-    faidx_t *fai = fai_load(argv[optind]);
+    faidx_t *fai = fai_load_format(argv[optind], format);
+
      if ( !fai ) {
-        fprintf(samtools_stderr, "Could not load fai index of %s\n", argv[optind]);
+        fprintf(samtools_stderr, "[faidx] Could not load fai index of %s\n", argv[optind]);
          return EXIT_FAILURE;
      }
  
-    int exit_status = EXIT_SUCCESS;
+    /** output file provided by user */
+    if( output_file != NULL ) {
+        if( strcmp( output_file, argv[optind] ) == 0 ) {
+            fprintf(samtools_stderr,"[faidx] Same input/output : %s\n", output_file);
+            return EXIT_FAILURE;
+        }
  
-    while ( ++optind<argc && exit_status == EXIT_SUCCESS)
-    {
-        fprintf(samtools_stdout, ">%s\n", argv[optind]);
-        int seq_len;
-        char *seq = fai_fetch(fai, argv[optind], &seq_len);
-        if ( seq_len < 0 ) {
-            fprintf(samtools_stderr, "Failed to fetch sequence in %s\n", argv[optind]);
-            exit_status = EXIT_FAILURE;
-            break;
+        file_out = fopen( output_file, "w" );
+
+        if( file_out == NULL) {
+            fprintf(samtools_stderr,"[faidx] Cannot open \"%s\" for writing :%s.\n", output_file, strerror(errno) );
+            return EXIT_FAILURE;
          }
-        size_t i, seq_sz = seq_len;
-        for (i=0; i<seq_sz; i+=60)
-        {
-            size_t len = i + 60 < seq_sz ? 60 : seq_sz - i;
-            if (fwrite(seq + i, 1, len, samtools_stdout) < len ||
-                fputc('\n', samtools_stdout) == EOF) {
-                print_error_errno("faidx", "failed to write output");
-                exit_status = EXIT_FAILURE;
-                break;
+    }
+
+    int exit_status = EXIT_SUCCESS;
+
+    if (region_file) {
+        hFILE *rf;
+
+        if ((rf = hopen(region_file, "r"))) {
+            exit_status = read_regions_from_file(fai, rf, file_out, ignore_error, line_len, rev, pos_strand_name, neg_strand_name, format);
+
+            if (hclose(rf) != 0) {
+                fprintf(samtools_stderr, "[faidx] Warning: failed to close %s", region_file);
              }
+        } else {
+            fprintf(samtools_stderr, "[faidx] Failed to open \"%s\" for reading.\n", region_file);
+            exit_status = EXIT_FAILURE;
          }
-        free(seq);
      }
+
+    while ( ++optind<argc && exit_status == EXIT_SUCCESS) {
+        exit_status = write_output(fai, file_out, argv[optind], ignore_error, line_len, rev, pos_strand_name, neg_strand_name, format);
+    }
+
      fai_destroy(fai);
  
-    if (fflush(samtools_stdout) == EOF) {
+    if (fflush(file_out) == EOF) {
          print_error_errno("faidx", "failed to flush output");
          exit_status = EXIT_FAILURE;
      }
  
+    if( output_file != NULL) fclose(file_out);
+    free(strand_names);
+
      return exit_status;
  }
+
+
+int faidx_main(int argc, char *argv[]) {
+    return faidx_core(argc, argv, FAI_FASTA);
+}
+
+
+int fqidx_main(int argc, char *argv[]) {
+    return faidx_core(argc, argv, FAI_FASTQ);
+}
+
diff --git a/samtools/htslib-1.9/LICENSE b/samtools/htslib-1.9/LICENSE
new file mode 100644
index 0000000..86f782b
--- /dev/null
+++ b/samtools/htslib-1.9/LICENSE
@@ -0,0 +1,69 @@
+[Files in this distribution outwith the cram/ subdirectory are distributed
+according to the terms of the following MIT/Expat license.]
+
+The MIT/Expat License
+
+Copyright (C) 2012-2018 Genome Research Ltd.
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE.
+
+
+[Files within the cram/ subdirectory in this distribution are distributed
+according to the terms of the following Modified 3-Clause BSD license.]
+
+The Modified-BSD License
+
+Copyright (C) 2012-2018 Genome Research Ltd.
+
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions are met:
+
+1. Redistributions of source code must retain the above copyright notice,
+   this list of conditions and the following disclaimer.
+
+2. Redistributions in binary form must reproduce the above copyright notice,
+   this list of conditions and the following disclaimer in the documentation
+   and/or other materials provided with the distribution.
+
+3. Neither the names Genome Research Ltd and Wellcome Trust Sanger Institute
+   nor the names of its contributors may be used to endorse or promote products
+   derived from this software without specific prior written permission.
+
+THIS SOFTWARE IS PROVIDED BY GENOME RESEARCH LTD AND CONTRIBUTORS "AS IS"
+AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+DISCLAIMED. IN NO EVENT SHALL GENOME RESEARCH LTD OR ITS CONTRIBUTORS BE LIABLE
+FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+
+[The use of a range of years within a copyright notice in this distribution
+should be interpreted as being equivalent to a list of years including the
+first and last year specified and all consecutive years between them.
+
+For example, a copyright notice that reads "Copyright (C) 2005, 2007-2009,
+2011-2012" should be interpreted as being identical to a notice that reads
+"Copyright (C) 2005, 2007, 2008, 2009, 2011, 2012" and a copyright notice
+that reads "Copyright (C) 2005-2012" should be interpreted as being identical
+to a notice that reads "Copyright (C) 2005, 2006, 2007, 2008, 2009, 2010,
+2011, 2012".]
diff --git a/samtools/htslib-1.9/README b/samtools/htslib-1.9/README
new file mode 100644
index 0000000..4225bec
--- /dev/null
+++ b/samtools/htslib-1.9/README
@@ -0,0 +1,5 @@
+HTSlib is an implementation of a unified C library for accessing common file
+formats, such as SAM, CRAM, VCF, and BCF, used for high-throughput sequencing
+data.  It is the core library used by samtools and bcftools.
+
+See INSTALL for building and installation instructions.
diff --git a/samtools/misc/ace2sam.c.pysam.c b/samtools/misc/ace2sam.c.pysam.c
index 11359674025983f8eed7ccf66faa6ca6e645cbfe..ee8ef5e4a5cb348852b2a07d3826353c7caef7a9 100644
--- a/samtools/misc/ace2sam.c.pysam.c
+++ b/samtools/misc/ace2sam.c.pysam.c
@@ -164,14 +164,14 @@ int samtools_ace2sam_main(int argc, char *argv[])
              if (dret != '\n') ks_getuntil(ks, '\n', &s, &dret);
              ks_getuntil(ks, '\n', &s, &dret); // skip the empty line
              if (write_cns) {
-                if (t[4].l) fputs(t[4].s, samtools_stdout) & fputc('\n', samtools_stdout);
+                if (t[4].l) samtools_puts(t[4].s);
                  t[4].l = 0;
              }
          } else if (strcmp(s.s, "AF") == 0) { // padded read position
              int reversed, neg, pos;
              if (t[0].l == 0) fatal("come to 'AF' before reading 'CO'");
              if (write_cns) {
-                if (t[4].l) fputs(t[4].s, samtools_stdout) & fputc('\n', samtools_stdout);
+                if (t[4].l) samtools_puts(t[4].s);
                  t[4].l = 0;
              }
              ks_getuntil(ks, 0, &s, &dret); // read name
@@ -244,7 +244,7 @@ int samtools_ace2sam_main(int argc, char *argv[])
              kputs("\t*\t0\t0\t", &t[4]); // empty MRNM, MPOS and TLEN
              kputsn(t[3].s, t[3].l, &t[4]); // unpadded SEQ
              kputs("\t*", &t[4]); // QUAL
-            fputs(t[4].s, samtools_stdout) & fputc('\n', samtools_stdout); // print to samtools_stdout
+            samtools_puts(t[4].s); // print to samtools_stdout
              ++af_i;
          } else if (dret != '\n') ks_getuntil(ks, '\n', &s, &dret);
      }
diff --git a/samtools/phase.c b/samtools/phase.c
index 0e00d9b0fe7403b1404fae16fe3394b84f81a675..14f60085ad6ad655e1eb663399ea9c18ce77fa31 100644
--- a/samtools/phase.c
+++ b/samtools/phase.c
@@ -590,7 +590,7 @@ int main_phase(int argc, char *argv[])
      memset(&g, 0, sizeof(phaseg_t));
      g.flag = FLAG_FIX_CHIMERA;
      g.min_varLOD = 37; g.k = 13; g.min_baseQ = 13; g.max_depth = 256;
-    while ((c = getopt_long(argc, argv, "Q:eFq:k:b:l:D:A:", lopts, NULL)) >= 0) {
+    while ((c = getopt_long(argc, argv, "Q:eFq:k:b:l:D:A", lopts, NULL)) >= 0) {
          switch (c) {
              case 'D': g.max_depth = atoi(optarg); break;
              case 'q': g.min_varLOD = atoi(optarg); break;
diff --git a/samtools/phase.c.pysam.c b/samtools/phase.c.pysam.c
index f74ba489c458baab975c61709b9e99570ed0545f..783029a928dabd84f8b3bccf5d7fc73453aa55c5 100644
--- a/samtools/phase.c.pysam.c
+++ b/samtools/phase.c.pysam.c
@@ -592,7 +592,7 @@ int main_phase(int argc, char *argv[])
      memset(&g, 0, sizeof(phaseg_t));
      g.flag = FLAG_FIX_CHIMERA;
      g.min_varLOD = 37; g.k = 13; g.min_baseQ = 13; g.max_depth = 256;
-    while ((c = getopt_long(argc, argv, "Q:eFq:k:b:l:D:A:", lopts, NULL)) >= 0) {
+    while ((c = getopt_long(argc, argv, "Q:eFq:k:b:l:D:A", lopts, NULL)) >= 0) {
          switch (c) {
              case 'D': g.max_depth = atoi(optarg); break;
              case 'q': g.min_varLOD = atoi(optarg); break;
diff --git a/samtools/sam_utils.c b/samtools/sam_utils.c
index efa6e2fb181c35b28800e02dbe7a50135d00d285..4f8964a6d8ba5e85bdd965bf1cca4c1e2e103211 100644
--- a/samtools/sam_utils.c
+++ b/samtools/sam_utils.c
@@ -28,10 +28,8 @@ DEALINGS IN THE SOFTWARE.  */
  #include <stdarg.h>
  #include <string.h>
  #include <errno.h>
-#include <stdlib.h>
  
  #include "samtools.h"
-#include "version.h"
  
  static void vprint_error_core(const char *subcommand, const char *format, va_list args, const char *extra)
  {
@@ -60,29 +58,3 @@ void print_error_errno(const char *subcommand, const char *format, ...)
      vprint_error_core(subcommand, format, args, err? strerror(err) : NULL);
      va_end(args);
  }
-
-const char *samtools_version()
-{
-    return SAMTOOLS_VERSION;
-}
-
-const char *samtools_version_short()
-{
-    char *sv, *hyph, *v;
-    int len;
-
-    v = SAMTOOLS_VERSION;
-    hyph = strchr(v, '-');
-    if (!hyph)
-        return strdup(v);
-
-    len = hyph - v;
-    sv = (char *)malloc(len+1);
-    if (!sv)
-        return NULL;
-
-    strncpy(sv, v, len);
-    sv[len] = '\0';
-
-    return (const char*)sv;
-}
diff --git a/samtools/sam_utils.c.pysam.c b/samtools/sam_utils.c.pysam.c
index 53f1763cb1d90e2ec745c11974c925a2fe2f8f79..cfa4e83bc04f792dccfc290273da176eb3a7e2e9 100644
--- a/samtools/sam_utils.c.pysam.c
+++ b/samtools/sam_utils.c.pysam.c
@@ -30,10 +30,8 @@ DEALINGS IN THE SOFTWARE.  */
  #include <stdarg.h>
  #include <string.h>
  #include <errno.h>
-#include <stdlib.h>
  
  #include "samtools.h"
-#include "version.h"
  
  static void vprint_error_core(const char *subcommand, const char *format, va_list args, const char *extra)
  {
@@ -62,29 +60,3 @@ void print_error_errno(const char *subcommand, const char *format, ...)
      vprint_error_core(subcommand, format, args, err? strerror(err) : NULL);
      va_end(args);
  }
-
-const char *samtools_version()
-{
-    return SAMTOOLS_VERSION;
-}
-
-const char *samtools_version_short()
-{
-    char *sv, *hyph, *v;
-    int len;
-
-    v = SAMTOOLS_VERSION;
-    hyph = strchr(v, '-');
-    if (!hyph)
-        return strdup(v);
-
-    len = hyph - v;
-    sv = (char *)malloc(len+1);
-    if (!sv)
-        return NULL;
-
-    strncpy(sv, v, len);
-    sv[len] = '\0';
-
-    return (const char*)sv;
-}
diff --git a/samtools/sam_view.c b/samtools/sam_view.c
index bce2c064344ac2e7914d28f00bff4a91bd4cf72f..7d0352c36f4ed3a763ba87758fa2883d4c3fdf5e 100644
--- a/samtools/sam_view.c
+++ b/samtools/sam_view.c
@@ -244,7 +244,6 @@ static void check_sam_close(const char *subcmd, samFile *fp, const char *fname,
  int main_samview(int argc, char *argv[])
  {
      int c, is_header = 0, is_header_only = 0, ret = 0, compress_level = -1, is_count = 0;
-    int is_long_help = 0;
      int64_t count = 0;
      samFile *in = 0, *out = 0, *un_out=0;
      FILE *fp_out = NULL;
@@ -279,8 +278,17 @@ int main_samview(int argc, char *argv[])
      /* parse command-line options */
      strcpy(out_mode, "w");
      strcpy(out_un_mode, "w");
+    if (argc == 1 && isatty(STDIN_FILENO))
+        return usage(stdout, EXIT_SUCCESS, 0);
+
+    // Suppress complaints about '?' being an unrecognised option.  Without
+    // this we have to put '?' in the options list, which makes it hard to
+    // tell a bad long option from the use of '-?' (both return '?' and
+    // set optopt to '\0').
+    opterr = 0;
+
      while ((c = getopt_long(argc, argv,
-                            "SbBcCt:h1Ho:O:q:f:F:G:ul:r:?T:R:L:s:@:m:x:U:M",
+                            "SbBcCt:h1Ho:O:q:f:F:G:ul:r:T:R:L:s:@:m:x:U:M",
                              lopts, NULL)) >= 0) {
          switch (c) {
          case 's':
@@ -290,7 +298,18 @@ int main_samview(int argc, char *argv[])
                  srand(settings.subsam_seed);
                  settings.subsam_seed = rand();
              }
-            settings.subsam_frac = strtod(q, &q);
+ 
+            if (q && *q == '.') {
+                settings.subsam_frac = strtod(q, &q);
+                if (*q) ret = 1;
+            } else {
+                ret = 1;
+            }
+
+            if (ret == 1) {
+                print_error("view", "Incorrect sampling argument \"%s\"", optarg);
+                goto view_end;
+            }
              break;
          case 'm': settings.min_qlen = atoi(optarg); break;
          case 'c': is_count = 1; break;
@@ -332,13 +351,29 @@ int main_samview(int argc, char *argv[])
          //case 'x': out_format = "x"; break;
          //case 'X': out_format = "X"; break;
                   */
-        case '?': is_long_help = 1; break;
+        case '?':
+            if (optopt == '?') {  // '-?' appeared on command line
+                return usage(stdout, EXIT_SUCCESS, 1);
+            } else {
+                if (optopt) { // Bad short option
+                    print_error("view", "invalid option -- '%c'", optopt);
+                } else { // Bad long option
+                    // Do our best.  There is no good solution to finding
+                    // out what the bad option was.
+                    // See, e.g. https://stackoverflow.com/questions/2723888/where-does-getopt-long-store-an-unrecognized-option
+                    if (optind > 0 && strncmp(argv[optind - 1], "--", 2) == 0) {
+                        print_error("view", "unrecognised option '%s'",
+                                    argv[optind - 1]);
+                    }
+                }
+                return usage(stderr, EXIT_FAILURE, 0);
+            }
          case 'B': settings.remove_B = 1; break;
          case 'x':
              {
                  if (strlen(optarg) != 2) {
                      fprintf(stderr, "main_samview: Error parsing -x auxiliary tags should be exactly two characters long.\n");
-                    return usage(stderr, EXIT_FAILURE, is_long_help);
+                    return usage(stderr, EXIT_FAILURE, 0);
                  }
                  settings.remove_aux = (char**)realloc(settings.remove_aux, sizeof(char*) * (++settings.remove_aux_len));
                  settings.remove_aux[settings.remove_aux_len-1] = optarg;
@@ -347,7 +382,7 @@ int main_samview(int argc, char *argv[])
          case 'M': settings.multi_region = 1; break;
          default:
              if (parse_sam_global_opt(c, optarg, lopts, &ga) != 0)
-                return usage(stderr, EXIT_FAILURE, is_long_help);
+                return usage(stderr, EXIT_FAILURE, 0);
              break;
          }
      }
@@ -367,7 +402,10 @@ int main_samview(int argc, char *argv[])
          strcat(out_mode, tmp);
          strcat(out_un_mode, tmp);
      }
-    if (argc == optind && isatty(STDIN_FILENO)) return usage(stdout, EXIT_SUCCESS, is_long_help); // potential memory leak...
+    if (argc == optind && isatty(STDIN_FILENO)) {
+        print_error("view", "No input provided or missing option argument.");
+        return usage(stderr, EXIT_FAILURE, 0); // potential memory leak...
+    }
  
      fn_in = (optind < argc)? argv[optind] : "-";
      // generate the fn_list if necessary
@@ -472,22 +510,13 @@ int main_samview(int argc, char *argv[])
              settings.bed = bed_hash_regions(settings.bed, argv, optind+1, argc, &filter_op); //insert(1) or filter out(0) the regions from the command line in the same hash table as the bed file
              if (!filter_op)
                  filter_state = FILTERED;
+        } else {
+            bed_unify(settings.bed);
          }
  
          bam1_t *b = bam_init1();
          if (settings.bed == NULL) { // index is unavailable or no regions have been specified
-            while ((result = sam_read1(in, header, b)) >= 0) { // read one alignment from `in'
-                if (!process_aln(header, b, &settings)) {
-                    if (!is_count) { if (check_sam_write1(out, header, b, fn_out, &ret) < 0) break; }
-                    count++;
-                } else {
-                    if (un_out) { if (check_sam_write1(un_out, header, b, fn_un_out, &ret) < 0) break; }
-                }
-            }
-            if (result < -1) {
-                fprintf(stderr, "[main_samview] truncated file.\n");
-                ret = 1;
-            }
+            fprintf(stderr, "[main_samview] no regions or BED file have been provided. Aborting.\n");
          } else {
              hts_idx_t *idx = sam_index_load(in, fn_in); // load index
              if (idx != NULL) {
@@ -729,9 +758,11 @@ static void bam2fq_usage(FILE *to, const char *command)
  "Usage: samtools %s [options...] <in.bam>\n", command);
      fprintf(to,
  "Options:\n"
-"  -0 FILE              write paired reads flagged both or neither READ1 and READ2 to FILE\n"
-"  -1 FILE              write paired reads flagged READ1 to FILE\n"
-"  -2 FILE              write paired reads flagged READ2 to FILE\n"
+"  -0 FILE              write reads designated READ_OTHER to FILE\n"
+"  -1 FILE              write reads designated READ1 to FILE\n"
+"  -2 FILE              write reads designated READ2 to FILE\n"
+"                       note: if a singleton file is specified with -s, only\n"
+"                       paired reads will be written to the -1 and -2 files.\n"
  "  -f INT               only include reads with all  of the FLAGs in INT present [0]\n"       //   F&x == x
  "  -F INT               only include reads with none of the FLAGS in INT present [0]\n"       //   F&x == 0
  "  -G INT               only EXCLUDE reads with all  of the FLAGs in INT present [0]\n"       // !(F&x == x)
@@ -740,7 +771,7 @@ static void bam2fq_usage(FILE *to, const char *command)
      if (fq) fprintf(to,
  "  -O                   output quality in the OQ tag if present\n");
      fprintf(to,
-"  -s FILE              write singleton reads to FILE [assume single-end]\n"
+"  -s FILE              write singleton reads designated READ1 or READ2 to FILE\n"
  "  -t                   copy RG, BC and QT tags to the %s header line\n",
      fq ? "FASTQ" : "FASTA");
      fprintf(to,
@@ -757,14 +788,30 @@ static void bam2fq_usage(FILE *to, const char *command)
  "  --index-format STR   How to parse barcode and quality tags\n\n");
      sam_global_opt_help(to, "-.--.@");
      fprintf(to,
-"   \n"
-"   The index-format string describes how to parse the barcode and quality tags, for example:\n"
+"\n"
+"Reads are designated READ1 if FLAG READ1 is set and READ2 is not set.\n"
+"Reads are designated READ2 if FLAG READ1 is not set and READ2 is set.\n"
+"Reads are designated READ_OTHER if FLAGs READ1 and READ2 are either both set\n"
+"or both unset.\n"
+"Run 'samtools flags' for more information on flag codes and meanings.\n");
+    fprintf(to,
+"\n"
+"The index-format string describes how to parse the barcode and quality tags, for example:\n"
  "   i14i8       the first 14 characters are index 1, the next 8 characters are index 2\n"
  "   n8i14       ignore the first 8 characters, and use the next 14 characters for index 1\n"
-"   If the tag contains a separator, then the numeric part can be replaced with '*' to mean\n"
-"   'read until the separator or end of tag', for example:\n"
+"If the tag contains a separator, then the numeric part can be replaced with '*' to mean\n"
+"'read until the separator or end of tag', for example:\n"
  "   n*i*        ignore the left part of the tag until the separator, then use the second part\n"
  "               of the tag as index 1\n");
+    fprintf(to,
+"\n"
+"Examples:\n"
+" To get just the paired reads in separate files, use:\n"
+"   samtools %s -1 paired1.%s -2 paired2.%s -0 /dev/null -s /dev/null -n -F 0x900 in.bam\n"
+"\n To get all non-supplementary/secondary reads in a single file, redirect the output:\n"
+"   samtools %s -F 0x900 in.bam > all_reads.%s\n",
+            command, fq ? "fq" : "fa", fq ? "fq" : "fa",
+            command, fq ? "fq" : "fa");
  }
  
  typedef enum { READ_UNKNOWN = 0, READ_1 = 1, READ_2 = 2 } readpart;
@@ -1344,7 +1391,7 @@ static bool init_state(const bam2fq_opts_t* opts, bam2fq_state_t** state_out)
      state->filetype = opts->filetype;
      state->def_qual = opts->def_qual;
      state->index_sequence = NULL;
-    state->hstdout = bgzf_dopen(fileno(stdout), "wu");
+    state->hstdout = NULL;
      state->compression_level = opts->compression_level;
  
      state->taglist = kl_init(ktaglist);
@@ -1392,6 +1439,14 @@ static bool init_state(const bam2fq_opts_t* opts, bam2fq_state_t** state_out)
          }
      }
  
+    if (opts->ga.reference) {
+        if (hts_set_fai_filename(state->fp, opts->ga.reference) != 0) {
+            print_error_errno("bam2fq", "cannot load reference \"%s\"", opts->ga.reference);
+            free(state);
+            return false;
+        }
+    }
+
      int i;
      for (i = 0; i < 3; ++i) {
          if (opts->fnr[i]) {
@@ -1402,6 +1457,14 @@ static bool init_state(const bam2fq_opts_t* opts, bam2fq_state_t** state_out)
                  return false;
              }
          } else {
+            if (!state->hstdout) {
+                state->hstdout = bgzf_dopen(fileno(stdout), "wu");
+                if (!state->hstdout) {
+                    print_error_errno("bam2fq", "Cannot open STDOUT");
+                    free(state);
+                    return false;
+                }
+            }
              state->fpr[i] = state->hstdout;
          }
      }
@@ -1436,12 +1499,16 @@ static bool destroy_state(const bam2fq_opts_t *opts, bam2fq_state_t *state, int*
      if (state->fpse && bgzf_close(state->fpse)) { print_error_errno("bam2fq", "Error closing singleton file \"%s\"", opts->fnse); valid = false; }
      int i;
      for (i = 0; i < 3; ++i) {
-        if (state->fpr[i] == state->hstdout) {
-            if (i==0 && bgzf_close(state->fpr[i])) { print_error_errno("bam2fq", "Error closing STDOUT"); valid = false; }
-        } else {
+        if (state->fpr[i] != state->hstdout) {
              if (bgzf_close(state->fpr[i])) { print_error_errno("bam2fq", "Error closing r%d file \"%s\"", i, opts->fnr[i]); valid = false; }
          }
      }
+    if (state->hstdout) {
+        if (bgzf_close(state->hstdout)) {
+            print_error_errno("bam2fq", "Error closing STDOUT");
+            valid = false;
+        }
+    }
      for (i = 0; i < 2; i++) {
          if (state->fpi[i] && bgzf_close(state->fpi[i])) {
              print_error_errno("bam2fq", "Error closing i%d file \"%s\"", i+1, opts->index_file[i]);
diff --git a/samtools/sam_view.c.pysam.c b/samtools/sam_view.c.pysam.c
index 4e3f8ab48f3cbc4a789e4e11775173064b708073..bf7bb832ab27e328682cadbb9957771390f35138 100644
--- a/samtools/sam_view.c.pysam.c
+++ b/samtools/sam_view.c.pysam.c
@@ -246,7 +246,6 @@ static void check_sam_close(const char *subcmd, samFile *fp, const char *fname,
  int main_samview(int argc, char *argv[])
  {
      int c, is_header = 0, is_header_only = 0, ret = 0, compress_level = -1, is_count = 0;
-    int is_long_help = 0;
      int64_t count = 0;
      samFile *in = 0, *out = 0, *un_out=0;
      FILE *fp_out = NULL;
@@ -281,8 +280,17 @@ int main_samview(int argc, char *argv[])
      /* parse command-line options */
      strcpy(out_mode, "w");
      strcpy(out_un_mode, "w");
+    if (argc == 1 && isatty(STDIN_FILENO))
+        return usage(samtools_stdout, EXIT_SUCCESS, 0);
+
+    // Suppress complaints about '?' being an unrecognised option.  Without
+    // this we have to put '?' in the options list, which makes it hard to
+    // tell a bad long option from the use of '-?' (both return '?' and
+    // set optopt to '\0').
+    opterr = 0;
+
      while ((c = getopt_long(argc, argv,
-                            "SbBcCt:h1Ho:O:q:f:F:G:ul:r:?T:R:L:s:@:m:x:U:M",
+                            "SbBcCt:h1Ho:O:q:f:F:G:ul:r:T:R:L:s:@:m:x:U:M",
                              lopts, NULL)) >= 0) {
          switch (c) {
          case 's':
@@ -292,7 +300,18 @@ int main_samview(int argc, char *argv[])
                  srand(settings.subsam_seed);
                  settings.subsam_seed = rand();
              }
-            settings.subsam_frac = strtod(q, &q);
+ 
+            if (q && *q == '.') {
+                settings.subsam_frac = strtod(q, &q);
+                if (*q) ret = 1;
+            } else {
+                ret = 1;
+            }
+
+            if (ret == 1) {
+                print_error("view", "Incorrect sampling argument \"%s\"", optarg);
+                goto view_end;
+            }
              break;
          case 'm': settings.min_qlen = atoi(optarg); break;
          case 'c': is_count = 1; break;
@@ -334,13 +353,29 @@ int main_samview(int argc, char *argv[])
          //case 'x': out_format = "x"; break;
          //case 'X': out_format = "X"; break;
                   */
-        case '?': is_long_help = 1; break;
+        case '?':
+            if (optopt == '?') {  // '-?' appeared on command line
+                return usage(samtools_stdout, EXIT_SUCCESS, 1);
+            } else {
+                if (optopt) { // Bad short option
+                    print_error("view", "invalid option -- '%c'", optopt);
+                } else { // Bad long option
+                    // Do our best.  There is no good solution to finding
+                    // out what the bad option was.
+                    // See, e.g. https://stackoverflow.com/questions/2723888/where-does-getopt-long-store-an-unrecognized-option
+                    if (optind > 0 && strncmp(argv[optind - 1], "--", 2) == 0) {
+                        print_error("view", "unrecognised option '%s'",
+                                    argv[optind - 1]);
+                    }
+                }
+                return usage(samtools_stderr, EXIT_FAILURE, 0);
+            }
          case 'B': settings.remove_B = 1; break;
          case 'x':
              {
                  if (strlen(optarg) != 2) {
                      fprintf(samtools_stderr, "main_samview: Error parsing -x auxiliary tags should be exactly two characters long.\n");
-                    return usage(samtools_stderr, EXIT_FAILURE, is_long_help);
+                    return usage(samtools_stderr, EXIT_FAILURE, 0);
                  }
                  settings.remove_aux = (char**)realloc(settings.remove_aux, sizeof(char*) * (++settings.remove_aux_len));
                  settings.remove_aux[settings.remove_aux_len-1] = optarg;
@@ -349,7 +384,7 @@ int main_samview(int argc, char *argv[])
          case 'M': settings.multi_region = 1; break;
          default:
              if (parse_sam_global_opt(c, optarg, lopts, &ga) != 0)
-                return usage(samtools_stderr, EXIT_FAILURE, is_long_help);
+                return usage(samtools_stderr, EXIT_FAILURE, 0);
              break;
          }
      }
@@ -369,7 +404,10 @@ int main_samview(int argc, char *argv[])
          strcat(out_mode, tmp);
          strcat(out_un_mode, tmp);
      }
-    if (argc == optind && isatty(STDIN_FILENO)) return usage(samtools_stdout, EXIT_SUCCESS, is_long_help); // potential memory leak...
+    if (argc == optind && isatty(STDIN_FILENO)) {
+        print_error("view", "No input provided or missing option argument.");
+        return usage(samtools_stderr, EXIT_FAILURE, 0); // potential memory leak...
+    }
  
      fn_in = (optind < argc)? argv[optind] : "-";
      // generate the fn_list if necessary
@@ -474,22 +512,13 @@ int main_samview(int argc, char *argv[])
              settings.bed = bed_hash_regions(settings.bed, argv, optind+1, argc, &filter_op); //insert(1) or filter out(0) the regions from the command line in the same hash table as the bed file
              if (!filter_op)
                  filter_state = FILTERED;
+        } else {
+            bed_unify(settings.bed);
          }
  
          bam1_t *b = bam_init1();
          if (settings.bed == NULL) { // index is unavailable or no regions have been specified
-            while ((result = sam_read1(in, header, b)) >= 0) { // read one alignment from `in'
-                if (!process_aln(header, b, &settings)) {
-                    if (!is_count) { if (check_sam_write1(out, header, b, fn_out, &ret) < 0) break; }
-                    count++;
-                } else {
-                    if (un_out) { if (check_sam_write1(un_out, header, b, fn_un_out, &ret) < 0) break; }
-                }
-            }
-            if (result < -1) {
-                fprintf(samtools_stderr, "[main_samview] truncated file.\n");
-                ret = 1;
-            }
+            fprintf(samtools_stderr, "[main_samview] no regions or BED file have been provided. Aborting.\n");
          } else {
              hts_idx_t *idx = sam_index_load(in, fn_in); // load index
              if (idx != NULL) {
@@ -731,9 +760,11 @@ static void bam2fq_usage(FILE *to, const char *command)
  "Usage: samtools %s [options...] <in.bam>\n", command);
      fprintf(to,
  "Options:\n"
-"  -0 FILE              write paired reads flagged both or neither READ1 and READ2 to FILE\n"
-"  -1 FILE              write paired reads flagged READ1 to FILE\n"
-"  -2 FILE              write paired reads flagged READ2 to FILE\n"
+"  -0 FILE              write reads designated READ_OTHER to FILE\n"
+"  -1 FILE              write reads designated READ1 to FILE\n"
+"  -2 FILE              write reads designated READ2 to FILE\n"
+"                       note: if a singleton file is specified with -s, only\n"
+"                       paired reads will be written to the -1 and -2 files.\n"
  "  -f INT               only include reads with all  of the FLAGs in INT present [0]\n"       //   F&x == x
  "  -F INT               only include reads with none of the FLAGS in INT present [0]\n"       //   F&x == 0
  "  -G INT               only EXCLUDE reads with all  of the FLAGs in INT present [0]\n"       // !(F&x == x)
@@ -742,7 +773,7 @@ static void bam2fq_usage(FILE *to, const char *command)
      if (fq) fprintf(to,
  "  -O                   output quality in the OQ tag if present\n");
      fprintf(to,
-"  -s FILE              write singleton reads to FILE [assume single-end]\n"
+"  -s FILE              write singleton reads designated READ1 or READ2 to FILE\n"
  "  -t                   copy RG, BC and QT tags to the %s header line\n",
      fq ? "FASTQ" : "FASTA");
      fprintf(to,
@@ -759,14 +790,30 @@ static void bam2fq_usage(FILE *to, const char *command)
  "  --index-format STR   How to parse barcode and quality tags\n\n");
      sam_global_opt_help(to, "-.--.@");
      fprintf(to,
-"   \n"
-"   The index-format string describes how to parse the barcode and quality tags, for example:\n"
+"\n"
+"Reads are designated READ1 if FLAG READ1 is set and READ2 is not set.\n"
+"Reads are designated READ2 if FLAG READ1 is not set and READ2 is set.\n"
+"Reads are designated READ_OTHER if FLAGs READ1 and READ2 are either both set\n"
+"or both unset.\n"
+"Run 'samtools flags' for more information on flag codes and meanings.\n");
+    fprintf(to,
+"\n"
+"The index-format string describes how to parse the barcode and quality tags, for example:\n"
  "   i14i8       the first 14 characters are index 1, the next 8 characters are index 2\n"
  "   n8i14       ignore the first 8 characters, and use the next 14 characters for index 1\n"
-"   If the tag contains a separator, then the numeric part can be replaced with '*' to mean\n"
-"   'read until the separator or end of tag', for example:\n"
+"If the tag contains a separator, then the numeric part can be replaced with '*' to mean\n"
+"'read until the separator or end of tag', for example:\n"
  "   n*i*        ignore the left part of the tag until the separator, then use the second part\n"
  "               of the tag as index 1\n");
+    fprintf(to,
+"\n"
+"Examples:\n"
+" To get just the paired reads in separate files, use:\n"
+"   samtools %s -1 paired1.%s -2 paired2.%s -0 /dev/null -s /dev/null -n -F 0x900 in.bam\n"
+"\n To get all non-supplementary/secondary reads in a single file, redirect the output:\n"
+"   samtools %s -F 0x900 in.bam > all_reads.%s\n",
+            command, fq ? "fq" : "fa", fq ? "fq" : "fa",
+            command, fq ? "fq" : "fa");
  }
  
  typedef enum { READ_UNKNOWN = 0, READ_1 = 1, READ_2 = 2 } readpart;
@@ -1346,7 +1393,7 @@ static bool init_state(const bam2fq_opts_t* opts, bam2fq_state_t** state_out)
      state->filetype = opts->filetype;
      state->def_qual = opts->def_qual;
      state->index_sequence = NULL;
-    state->hsamtools_stdout = bgzf_dopen(fileno(samtools_stdout), "wu");
+    state->hsamtools_stdout = NULL;
      state->compression_level = opts->compression_level;
  
      state->taglist = kl_init(ktaglist);
@@ -1394,6 +1441,14 @@ static bool init_state(const bam2fq_opts_t* opts, bam2fq_state_t** state_out)
          }
      }
  
+    if (opts->ga.reference) {
+        if (hts_set_fai_filename(state->fp, opts->ga.reference) != 0) {
+            print_error_errno("bam2fq", "cannot load reference \"%s\"", opts->ga.reference);
+            free(state);
+            return false;
+        }
+    }
+
      int i;
      for (i = 0; i < 3; ++i) {
          if (opts->fnr[i]) {
@@ -1404,6 +1459,14 @@ static bool init_state(const bam2fq_opts_t* opts, bam2fq_state_t** state_out)
                  return false;
              }
          } else {
+            if (!state->hsamtools_stdout) {
+                state->hsamtools_stdout = bgzf_dopen(fileno(samtools_stdout), "wu");
+                if (!state->hsamtools_stdout) {
+                    print_error_errno("bam2fq", "Cannot open STDOUT");
+                    free(state);
+                    return false;
+                }
+            }
              state->fpr[i] = state->hsamtools_stdout;
          }
      }
@@ -1438,12 +1501,16 @@ static bool destroy_state(const bam2fq_opts_t *opts, bam2fq_state_t *state, int*
      if (state->fpse && bgzf_close(state->fpse)) { print_error_errno("bam2fq", "Error closing singleton file \"%s\"", opts->fnse); valid = false; }
      int i;
      for (i = 0; i < 3; ++i) {
-        if (state->fpr[i] == state->hsamtools_stdout) {
-            if (i==0 && bgzf_close(state->fpr[i])) { print_error_errno("bam2fq", "Error closing STDOUT"); valid = false; }
-        } else {
+        if (state->fpr[i] != state->hsamtools_stdout) {
              if (bgzf_close(state->fpr[i])) { print_error_errno("bam2fq", "Error closing r%d file \"%s\"", i, opts->fnr[i]); valid = false; }
          }
      }
+    if (state->hsamtools_stdout) {
+        if (bgzf_close(state->hsamtools_stdout)) {
+            print_error_errno("bam2fq", "Error closing STDOUT");
+            valid = false;
+        }
+    }
      for (i = 0; i < 2; i++) {
          if (state->fpi[i] && bgzf_close(state->fpi[i])) {
              print_error_errno("bam2fq", "Error closing i%d file \"%s\"", i+1, opts->index_file[i]);
diff --git a/samtools/samtools.h b/samtools/samtools.h
index 7a406a22a78f6be6fa070b3f1d4bb18f7609181e..1e726543dd00e9ec8d6bff15c3d11db0f146c64a 100644
--- a/samtools/samtools.h
+++ b/samtools/samtools.h
@@ -26,7 +26,6 @@ DEALINGS IN THE SOFTWARE.  */
  #define SAMTOOLS_H
  
  const char *samtools_version(void);
-const char *samtools_version_short(void);
  
  #if defined __GNUC__ && __GNUC__ >= 2
  #define CHECK_PRINTF(fmt,args) __attribute__ ((format (printf, fmt, args)))
diff --git a/samtools/samtools.pysam.c b/samtools/samtools.pysam.c
index c276a8a091ec7fbcc7ce0589c04cbdfdf7147bd4..edbd9a2d4821ff2e0fae274f548306f24942a3a8 100644
--- a/samtools/samtools.pysam.c
+++ b/samtools/samtools.pysam.c
@@ -54,6 +54,12 @@ void samtools_unset_stdout(void)
    samtools_stdout_fileno = STDOUT_FILENO;
  }
  
+int samtools_puts(const char *s)
+{
+  if (fputs(s, samtools_stdout) == EOF) return EOF;
+  return putc('\n', samtools_stdout);
+}
+
  void samtools_set_optind(int val)
  {
    // setting this in cython via 
diff --git a/samtools/samtools.pysam.h b/samtools/samtools.pysam.h
index e2bfd85e71b3c04ec6df36a5716c78600c431f79..d97ee250acc5da1e9837170f71a6d248308cc32b 100644
--- a/samtools/samtools.pysam.h
+++ b/samtools/samtools.pysam.h
@@ -38,6 +38,8 @@ void samtools_unset_stderr(void);
   */
  void samtools_unset_stdout(void);
  
+int samtools_puts(const char *s);
+
  int samtools_dispatch(int argc, char *argv[]);
  
  void samtools_set_optind(int);
diff --git a/samtools/stats.c b/samtools/stats.c
index 35574ed831d85490c84cc80157669ceddff15048..8d7acbf43aa565aeaae734917091f5d778757bb9 100644
--- a/samtools/stats.c
+++ b/samtools/stats.c
@@ -60,8 +60,11 @@ DEALINGS IN THE SOFTWARE.  */
  #include <htslib/kstring.h>
  #include "stats_isize.h"
  #include "sam_opts.h"
+#include "bedidx.h"
  
  #define BWA_MIN_RDLEN 35
+#define DEFAULT_CHUNK_NO 8
+#define DEFAULT_PAIR_MAX 10000
  // From the spec
  // If 0x4 is set, no assumptions can be made about RNAME, POS, CIGAR, MAPQ, bits 0x2, 0x10, 0x100 and 0x800, and the bit 0x20 of the previous read in the template.
  #define IS_PAIRED_AND_MAPPED(bam) (((bam)->core.flag&BAM_FPAIRED) && !((bam)->core.flag&BAM_FUNMAP) && !((bam)->core.flag&BAM_FMUNMAP))
@@ -134,6 +137,8 @@ typedef struct
      // Misc
      char *split_tag;      // Tag on which to perform stats splitting
      char *split_prefix;   // Path or string prefix for filenames created when splitting
+    int remove_overlaps;
+    int cov_threshold;
  }
  stats_info_t;
  
@@ -149,19 +154,24 @@ typedef struct
      // Arrays for the histogram data
      uint64_t *quals_1st, *quals_2nd;
      uint64_t *gc_1st, *gc_2nd;
-    acgtno_count_t *acgtno_cycles;
-    uint64_t *read_lengths;
+    acgtno_count_t *acgtno_cycles_1st;
+    acgtno_count_t *acgtno_cycles_2nd;
+    uint64_t *read_lengths, *read_lengths_1st, *read_lengths_2nd;
      uint64_t *insertions, *deletions;
      uint64_t *ins_cycles_1st, *ins_cycles_2nd, *del_cycles_1st, *del_cycles_2nd;
      isize_t *isize;
  
      // The extremes encountered
      int max_len;            // Maximum read length
+    int max_len_1st;        // Maximum read length for forward reads
+    int max_len_2nd;        // Maximum read length for reverse reads
      int max_qual;           // Maximum quality
      int is_sorted;
  
      // Summary numbers
      uint64_t total_len;
+    uint64_t total_len_1st;
+    uint64_t total_len_2nd;
      uint64_t total_len_dup;
      uint64_t nreads_1st;
      uint64_t nreads_2nd;
@@ -202,7 +212,7 @@ typedef struct
      uint64_t *mpc_buf;              // Mismatches per cycle
  
      // Target regions
-    int nregions, reg_from,reg_to;
+    int nregions, reg_from, reg_to;
      regions_t *regions;
  
      // Auxiliary data
@@ -213,15 +223,35 @@ typedef struct
      char* split_name;
  
      stats_info_t* info;             // Pointer to options and settings struct
+    pos_t *chunks;
+    uint32_t nchunks;
  
+    uint32_t pair_count;          // Number of active pairs in the pairing hash table
+    uint32_t target_count;        // Number of bases covered by the target file
+    uint32_t last_pair_tid;
+    uint32_t last_read_flush;
  }
  stats_t;
  KHASH_MAP_INIT_STR(c2stats, stats_t*)
  
+typedef struct {
+    uint32_t first;     // 1 - first read, 2 - second read
+    uint32_t n, m;      // number of chunks, allocated chunks
+    pos_t *chunks;      // chunk array of size m
+} pair_t;
+KHASH_MAP_INIT_STR(qn2pair, pair_t*)
+
+
  static void error(const char *format, ...);
  int is_in_regions(bam1_t *bam_line, stats_t *stats);
  void realloc_buffers(stats_t *stats, int seq_len);
  
+static int regions_lt(const void *r1, const void *r2) {
+    int64_t from_diff = (int64_t)((pos_t *)r1)->from - (int64_t)((pos_t *)r2)->from;
+    int64_t to_diff = (int64_t)((pos_t *)r1)->to - (int64_t)((pos_t *)r2)->to;
+
+    return from_diff > 0 ? 1 : from_diff < 0 ? -1 : to_diff > 0 ? 1 : to_diff < 0 ? -1 : 0;
+}
  
  // Coverage distribution methods
  static inline int coverage_idx(int min, int max, int n, int step, int depth)
@@ -570,16 +600,31 @@ void realloc_buffers(stats_t *stats, int seq_len)
          memset(stats->mpc_buf + stats->nbases*stats->nquals, 0, (n-stats->nbases)*stats->nquals*sizeof(uint64_t));
      }
  
-    stats->acgtno_cycles = realloc(stats->acgtno_cycles, n*sizeof(acgtno_count_t));
-    if ( !stats->acgtno_cycles )
+    stats->acgtno_cycles_1st = realloc(stats->acgtno_cycles_1st, n*sizeof(acgtno_count_t));
+    if ( !stats->acgtno_cycles_1st )
          error("Could not realloc buffers, the sequence too long: %d (%ld)\n", seq_len, n*sizeof(acgtno_count_t));
-    memset(stats->acgtno_cycles + stats->nbases, 0, (n-stats->nbases)*sizeof(acgtno_count_t));
+    memset(stats->acgtno_cycles_1st + stats->nbases, 0, (n-stats->nbases)*sizeof(acgtno_count_t));
+
+    stats->acgtno_cycles_2nd = realloc(stats->acgtno_cycles_2nd, n*sizeof(acgtno_count_t));
+    if ( !stats->acgtno_cycles_2nd )
+        error("Could not realloc buffers, the sequence too long: %d (%ld)\n", seq_len, n*sizeof(acgtno_count_t));
+    memset(stats->acgtno_cycles_2nd + stats->nbases, 0, (n-stats->nbases)*sizeof(acgtno_count_t));
  
      stats->read_lengths = realloc(stats->read_lengths, n*sizeof(uint64_t));
      if ( !stats->read_lengths )
          error("Could not realloc buffers, the sequence too long: %d (%ld)\n", seq_len,n*sizeof(uint64_t));
      memset(stats->read_lengths + stats->nbases, 0, (n-stats->nbases)*sizeof(uint64_t));
  
+    stats->read_lengths_1st = realloc(stats->read_lengths_1st, n*sizeof(uint64_t));
+    if ( !stats->read_lengths_1st )
+        error("Could not realloc buffers, the sequence too long: %d (%ld)\n", seq_len,n*sizeof(uint64_t));
+    memset(stats->read_lengths_1st + stats->nbases, 0, (n-stats->nbases)*sizeof(uint64_t));
+
+    stats->read_lengths_2nd = realloc(stats->read_lengths_2nd, n*sizeof(uint64_t));
+    if ( !stats->read_lengths_2nd )
+        error("Could not realloc buffers, the sequence too long: %d (%ld)\n", seq_len,n*sizeof(uint64_t));
+    memset(stats->read_lengths_2nd + stats->nbases, 0, (n-stats->nbases)*sizeof(uint64_t));
+
      stats->insertions = realloc(stats->insertions, n*sizeof(uint64_t));
      if ( !stats->insertions )
          error("Could not realloc buffers, the sequence too long: %d (%ld)\n", seq_len,n*sizeof(uint64_t));
@@ -655,7 +700,7 @@ void collect_orig_read_stats(bam1_t *bam_line, stats_t *stats, int* gc_count_out
  
      // Count GC and ACGT per cycle. Note that cycle is approximate, clipping is ignored
      uint8_t *seq  = bam_get_seq(bam_line);
-    int i, read_cycle, gc_count = 0, reverse = IS_REVERSE(bam_line);
+    int i, read_cycle, gc_count = 0, reverse = IS_REVERSE(bam_line), is_first = IS_READ1(bam_line);
      for (i=0; i<seq_len; i++)
      {
          // Read cycle for current index
@@ -666,28 +711,28 @@ void collect_orig_read_stats(bam1_t *bam_line, stats_t *stats, int* gc_count_out
          //      =ACMGRSVTWYHKDBN
          switch (bam_seqi(seq, i)) {
          case 1:
-            stats->acgtno_cycles[ read_cycle ].a++;
+            is_first ? stats->acgtno_cycles_1st[ read_cycle ].a++ : stats->acgtno_cycles_2nd[ read_cycle ].a++;
              break;
          case 2:
-            stats->acgtno_cycles[ read_cycle ].c++;
+            is_first ? stats->acgtno_cycles_1st[ read_cycle ].c++ : stats->acgtno_cycles_2nd[ read_cycle ].c++;
              gc_count++;
              break;
          case 4:
-            stats->acgtno_cycles[ read_cycle ].g++;
+            is_first ? stats->acgtno_cycles_1st[ read_cycle ].g++ : stats->acgtno_cycles_2nd[ read_cycle ].g++;
              gc_count++;
              break;
          case 8:
-            stats->acgtno_cycles[ read_cycle ].t++;
+            is_first ? stats->acgtno_cycles_1st[ read_cycle ].t++ : stats->acgtno_cycles_2nd[ read_cycle ].t++;
              break;
          case 15:
-            stats->acgtno_cycles[ read_cycle ].n++;
+            is_first ? stats->acgtno_cycles_1st[ read_cycle ].n++ : stats->acgtno_cycles_2nd[ read_cycle ].n++;
              break;
          default:
              /*
               * count "=" sequences in "other" along
               * with MRSVWYHKDB ambiguity codes
               */
-            stats->acgtno_cycles[ read_cycle ].other++;
+            is_first ? stats->acgtno_cycles_1st[ read_cycle ].other++ : stats->acgtno_cycles_2nd[ read_cycle ].other++;
              break;
          }
      }
@@ -700,10 +745,11 @@ void collect_orig_read_stats(bam1_t *bam_line, stats_t *stats, int* gc_count_out
      //  fill GC histogram
      uint64_t *quals;
      uint8_t *bam_quals = bam_get_qual(bam_line);
-    if ( bam_line->core.flag&BAM_FREAD2 )
+    if ( IS_READ2(bam_line) )
      {
          quals  = stats->quals_2nd;
          stats->nreads_2nd++;
+        stats->total_len_2nd += seq_len;
          for (i=gc_idx_min; i<gc_idx_max; i++)
              stats->gc_2nd[i]++;
      }
@@ -711,6 +757,7 @@ void collect_orig_read_stats(bam1_t *bam_line, stats_t *stats, int* gc_count_out
      {
          quals = stats->quals_1st;
          stats->nreads_1st++;
+        stats->total_len_1st += seq_len;
          for (i=gc_idx_min; i<gc_idx_max; i++)
              stats->gc_1st[i]++;
      }
@@ -756,7 +803,160 @@ void collect_orig_read_stats(bam1_t *bam_line, stats_t *stats, int* gc_count_out
      *gc_count_out = gc_count;
  }
  
-void collect_stats(bam1_t *bam_line, stats_t *stats)
+static int cleanup_overlaps(khash_t(qn2pair) *read_pairs, int max) {
+    if ( !read_pairs )
+        return 0;
+
+    int count = 0;
+    khint_t k;
+    for (k = kh_begin(read_pairs); k < kh_end(read_pairs); k++) {
+        if ( kh_exist(read_pairs, k) ) {
+            char *key = (char *)kh_key(read_pairs, k);
+            pair_t *val = kh_val(read_pairs, k);
+            if ( val && val->chunks ) {
+                if ( val->chunks[val->n-1].to < max ) {
+                    free(val->chunks);
+                    free(val);
+                    free(key);
+                    kh_del(qn2pair, read_pairs, k);
+                    count++;
+                }
+            } else {
+                free(key);
+                kh_del(qn2pair, read_pairs, k);
+                count++;
+            }
+        }
+    }
+    if ( max == INT_MAX )
+        kh_destroy(qn2pair, read_pairs);
+
+    return count;
+}
+
+static void remove_overlaps(bam1_t *bam_line, khash_t(qn2pair) *read_pairs, stats_t *stats, int pmin, int pmax) {
+    if ( !bam_line || !read_pairs || !stats )
+        return;
+
+    uint32_t first = (IS_READ1(bam_line) > 0 ? 1 : 0) + (IS_READ2(bam_line) > 0 ? 2 : 0) ;
+    if ( !(bam_line->core.flag & BAM_FPAIRED) ||
+         (bam_line->core.flag & BAM_FMUNMAP) ||
+         (abs(bam_line->core.isize) >= 2*bam_line->core.l_qseq) ||
+         (first != 1 && first != 2) ) {
+        if ( pmin >= 0 )
+            round_buffer_insert_read(&(stats->cov_rbuf), pmin, pmax-1);
+        return;
+    }
+
+    char *qname = bam_get_qname(bam_line);
+    if ( !qname ) {
+        fprintf(stderr, "Error retrieving qname for line starting at pos %d\n", bam_line->core.pos);
+        return;
+    }
+
+    khint_t k = kh_get(qn2pair, read_pairs, qname);
+    if ( k == kh_end(read_pairs) ) { //first chunk from this template
+        if ( pmin == -1 )
+            return;
+
+        int ret;
+        char *s = strdup(qname);
+        if ( !s ) {
+            fprintf(stderr, "Error allocating memory\n");
+            return;
+        }
+
+        k = kh_put(qn2pair, read_pairs, s, &ret);
+        if ( -1 == ret ) {
+            fprintf(stderr, "Error inserting read '%s' in pair hash table\n", qname);
+            return;
+        }
+
+        pair_t *pc = calloc(1, sizeof(pair_t));
+        if ( !pc ) {
+            fprintf(stderr, "Error allocating memory\n");
+            return;
+        }
+
+        pc->m = DEFAULT_CHUNK_NO;
+        pc->chunks = calloc(pc->m, sizeof(pos_t));
+        if ( !pc->chunks ) {
+            fprintf(stderr, "Error allocating memory\n");
+            return;
+        }
+
+        pc->chunks[0].from = pmin;
+        pc->chunks[0].to = pmax;
+        pc->n = 1;
+        pc->first = first;
+
+        kh_val(read_pairs, k) = pc;
+        stats->pair_count++;
+    } else { //template already present
+        pair_t *pc = kh_val(read_pairs, k);
+        if ( !pc ) {
+            fprintf(stderr, "Invalid hash table entry\n");
+            return;
+        }
+
+        if ( first == pc->first ) { //chunk from an existing line
+            if ( pmin == -1 )
+                return;
+
+            if ( pc->n == pc->m ) {
+                pos_t *tmp = realloc(pc->chunks, (pc->m<<1)*sizeof(pos_t));
+                if ( !tmp ) {
+                    fprintf(stderr, "Error allocating memory\n");
+                    return;
+                }
+                pc->chunks = tmp;
+                pc->m<<=1;
+            }
+
+            pc->chunks[pc->n].from = pmin;
+            pc->chunks[pc->n].to = pmax;
+            pc->n++;
+        } else { //the other line, check for overlapping
+            if ( pmin == -1 && kh_exist(read_pairs, k) ) { //job done, delete entry
+                char *key = (char *)kh_key(read_pairs, k);
+                pair_t *val = kh_val(read_pairs, k);
+                if ( val) {
+                    free(val->chunks);
+                    free(val);
+                }
+                free(key);
+                kh_del(qn2pair, read_pairs, k);
+                stats->pair_count--;
+                return;
+            }
+
+            int i;
+            for (i=0; i<pc->n; i++) {
+                if ( pmin >= pc->chunks[i].to )
+                    continue;
+
+                if ( pmax <= pc->chunks[i].from ) //no overlap
+                    break;
+
+                if ( pmin < pc->chunks[i].from ) { //overlap at the beginning
+                    round_buffer_insert_read(&(stats->cov_rbuf), pmin, pc->chunks[i].from-1);
+                    pmin = pc->chunks[i].from;
+                }
+
+                if ( pmax <= pc->chunks[i].to ) { //completely contained
+                    stats->nbases_mapped_cigar -= (pmax - pmin);
+                    return;
+                } else {                           //overlap at the end
+                    stats->nbases_mapped_cigar -= (pc->chunks[i].to - pmin);
+                    pmin = pc->chunks[i].to;
+                }
+            }
+        }
+    }
+    round_buffer_insert_read(&(stats->cov_rbuf), pmin, pmax-1);
+}
+
+void collect_stats(bam1_t *bam_line, stats_t *stats, khash_t(qn2pair) *read_pairs)
  {
      if ( stats->rg_hash )
      {
@@ -804,6 +1004,11 @@ void collect_stats(bam1_t *bam_line, stats_t *stats)
      // Update max_len observed
      if ( stats->max_len<read_len )
          stats->max_len = read_len;
+    if ( IS_READ1(bam_line) && stats->max_len_1st < read_len )
+        stats->max_len_1st = read_len;
+    if ( IS_READ2(bam_line) && stats->max_len_2nd < read_len )
+        stats->max_len_2nd = read_len;
+
      int i;
      int gc_count = 0;
  
@@ -812,6 +1017,8 @@ void collect_stats(bam1_t *bam_line, stats_t *stats)
      if ( IS_ORIGINAL(bam_line) )
      {
          stats->read_lengths[read_len]++;
+        if ( IS_READ1(bam_line) ) stats->read_lengths_1st[read_len]++;
+        if ( IS_READ2(bam_line) ) stats->read_lengths_2nd[read_len]++;
          collect_orig_read_stats(bam_line, stats, &gc_count);
      }
  
@@ -839,7 +1046,7 @@ void collect_stats(bam1_t *bam_line, stats_t *stats)
  
              if ( is_fwd*is_mfwd>0 )
                  stats->isize->inc_other(stats->isize->data, isize);
-            else if ( is_fst*pos_fst>0 )
+            else if ( is_fst*pos_fst>=0 )
              {
                  if ( is_fst*is_fwd>0 )
                      stats->isize->inc_inward(stats->isize->data, isize);
@@ -875,7 +1082,7 @@ void collect_stats(bam1_t *bam_line, stats_t *stats)
              int ncig = bam_cigar_oplen(bam_get_cigar(bam_line)[i]);
              if ( !ncig ) continue;  // curiously, this can happen: 0D
              if ( cig==BAM_CDEL ) readlen += ncig;
-            else if ( cig==BAM_CMATCH )
+            else if ( cig==BAM_CMATCH || cig==BAM_CEQUAL || cig==BAM_CDIFF )
              {
                  if ( iref < stats->reg_from ) ncig -= stats->reg_from-iref;
                  else if ( iref+ncig-1 > stats->reg_to ) ncig -= iref+ncig-1 - stats->reg_to;
@@ -896,9 +1103,10 @@ void collect_stats(bam1_t *bam_line, stats_t *stats)
          // Count the whole read
          for (i=0; i<bam_line->core.n_cigar; i++)
          {
-            if ( bam_cigar_op(bam_get_cigar(bam_line)[i])==BAM_CMATCH || bam_cigar_op(bam_get_cigar(bam_line)[i])==BAM_CINS )
+            int cig  = bam_cigar_op(bam_get_cigar(bam_line)[i]);
+            if ( cig==BAM_CMATCH || cig==BAM_CINS || cig==BAM_CEQUAL || cig==BAM_CDIFF )
                  stats->nbases_mapped_cigar += bam_cigar_oplen(bam_get_cigar(bam_line)[i]);
-            if ( bam_cigar_op(bam_get_cigar(bam_line)[i])==BAM_CDEL )
+            if ( cig==BAM_CDEL )
                  readlen += bam_cigar_oplen(bam_get_cigar(bam_line)[i]);
          }
      }
@@ -909,8 +1117,22 @@ void collect_stats(bam1_t *bam_line, stats_t *stats)
  
      if ( stats->is_sorted )
      {
-        if ( stats->tid==-1 || stats->tid!=bam_line->core.tid )
+        if ( stats->tid==-1 || stats->tid!=bam_line->core.tid ) {
              round_buffer_flush(stats, -1);
+        }
+
+        //cleanup the pair hash table to free memory
+        stats->last_read_flush++;
+        if ( stats->pair_count > DEFAULT_PAIR_MAX && stats->last_read_flush > DEFAULT_PAIR_MAX) {
+            stats->pair_count -= cleanup_overlaps(read_pairs, bam_line->core.pos);
+            stats->last_read_flush = 0;
+        }
+
+        if ( stats->last_pair_tid != bam_line->core.tid) {
+            stats->pair_count -= cleanup_overlaps(read_pairs, INT_MAX-1);
+            stats->last_pair_tid = bam_line->core.tid;
+            stats->last_read_flush = 0;
+        }
  
          // Mismatches per cycle and GC-depth graph. For simplicity, reads overlapping GCD bins
          //  are not splitted which results in up to seq_len-1 overlaps. The default bin size is
@@ -958,7 +1180,61 @@ void collect_stats(bam1_t *bam_line, stats_t *stats)
  
          // Coverage distribution graph
          round_buffer_flush(stats,bam_line->core.pos);
-        round_buffer_insert_read(&(stats->cov_rbuf),bam_line->core.pos,bam_line->core.pos+seq_len-1);
+        if ( stats->regions ) {
+            uint32_t p = bam_line->core.pos, pnew, pmin, pmax, j;
+            pmin = pmax = i = j = 0;
+            while ( j < bam_line->core.n_cigar && i < stats->nchunks ) {
+                int op = bam_cigar_op(bam_get_cigar(bam_line)[j]);
+                int oplen = bam_cigar_oplen(bam_get_cigar(bam_line)[j]);
+                switch(op) {
+                case BAM_CMATCH:
+                case BAM_CEQUAL:
+                case BAM_CDIFF:
+                    pmin = MAX(p, stats->chunks[i].from-1);
+                    pmax = MIN(p+oplen, stats->chunks[i].to);
+                    if ( pmax >= pmin ) {
+                        if ( stats->info->remove_overlaps )
+                            remove_overlaps(bam_line, read_pairs, stats, pmin, pmax);
+                        else
+                            round_buffer_insert_read(&(stats->cov_rbuf), pmin, pmax-1);
+                    }
+                    break;
+                case BAM_CDEL:
+                    break;
+                }
+                pnew = p + (bam_cigar_type(op)&2 ? oplen : 0); // consumes reference
+
+                if ( pnew >= stats->chunks[i].to ) {
+                    // go to the next chunk
+                    i++;
+                } else {
+                    // go to the next CIGAR op
+                    j++;
+                    p = pnew;
+                }
+            }
+        } else {
+            uint32_t p = bam_line->core.pos, j;
+            for (j = 0; j < bam_line->core.n_cigar; j++) {
+                int op = bam_cigar_op(bam_get_cigar(bam_line)[j]);
+                int oplen = bam_cigar_oplen(bam_get_cigar(bam_line)[j]);
+                switch(op) {
+                case BAM_CMATCH:
+                case BAM_CEQUAL:
+                case BAM_CDIFF:
+                    if ( stats->info->remove_overlaps )
+                        remove_overlaps(bam_line, read_pairs, stats, p, p+oplen);
+                    else
+                        round_buffer_insert_read(&(stats->cov_rbuf), p, p+oplen-1);
+                    break;
+                case BAM_CDEL:
+                    break;
+                }
+                p += bam_cigar_type(op)&2 ? oplen : 0; // consumes reference
+            }
+        }
+        if ( stats->info->remove_overlaps )
+           remove_overlaps(bam_line, read_pairs, stats, -1, -1); //remove the line from the hash table
      }
  }
  
@@ -993,8 +1269,9 @@ float gcd_percentile(gc_depth_t *gcd, int N, int p)
  void output_stats(FILE *to, stats_t *stats, int sparse)
  {
      // Calculate average insert size and standard deviation (from the main bulk data only)
-    int isize, ibulk=0;
-    uint64_t nisize=0, nisize_inward=0, nisize_outward=0, nisize_other=0;
+    int isize, ibulk=0, icov;
+    uint64_t nisize=0, nisize_inward=0, nisize_outward=0, nisize_other=0, cov_sum=0;
+    double bulk=0, avg_isize=0, sd_isize=0;
      for (isize=0; isize<stats->isize->nitems(stats->isize->data); isize++)
      {
          // Each pair was counted twice
@@ -1008,10 +1285,11 @@ void output_stats(FILE *to, stats_t *stats, int sparse)
          nisize += stats->isize->inward(stats->isize->data, isize) + stats->isize->outward(stats->isize->data, isize) + stats->isize->other(stats->isize->data, isize);
      }
  
-    double bulk=0, avg_isize=0, sd_isize=0;
      for (isize=0; isize<stats->isize->nitems(stats->isize->data); isize++)
      {
-        bulk += stats->isize->inward(stats->isize->data, isize) +  stats->isize->outward(stats->isize->data, isize) + stats->isize->other(stats->isize->data, isize);
+        uint64_t num = stats->isize->inward(stats->isize->data, isize) +  stats->isize->outward(stats->isize->data, isize) + stats->isize->other(stats->isize->data, isize);
+        if (num > 0) ibulk = isize + 1;
+        bulk += num;
          avg_isize += isize * (stats->isize->inward(stats->isize->data, isize) +  stats->isize->outward(stats->isize->data, isize) + stats->isize->other(stats->isize->data, isize));
  
          if ( bulk/nisize > stats->info->isize_main_bulk )
@@ -1023,10 +1301,9 @@ void output_stats(FILE *to, stats_t *stats, int sparse)
      }
      avg_isize /= nisize ? nisize : 1;
      for (isize=1; isize<ibulk; isize++)
-        sd_isize += (stats->isize->inward(stats->isize->data, isize) + stats->isize->outward(stats->isize->data, isize) +stats->isize->other(stats->isize->data, isize)) * (isize-avg_isize)*(isize-avg_isize) / nisize;
+        sd_isize += (stats->isize->inward(stats->isize->data, isize) + stats->isize->outward(stats->isize->data, isize) +stats->isize->other(stats->isize->data, isize)) * (isize-avg_isize)*(isize-avg_isize) / (nisize ? nisize : 1);
      sd_isize = sqrt(sd_isize);
  
-
      fprintf(to, "# This file was produced by samtools stats (%s+htslib-%s) and can be plotted using plot-bamstats\n", samtools_version(), hts_version());
      if( stats->split_name != NULL ){
          fprintf(to, "# This file contains statistics only for reads with tag: %s=%s\n", stats->info->split_tag, stats->split_name);
@@ -1059,6 +1336,8 @@ void output_stats(FILE *to, stats_t *stats, int sparse)
      fprintf(to, "SN\treads QC failed:\t%ld\n", (long)stats->nreads_QCfailed);
      fprintf(to, "SN\tnon-primary alignments:\t%ld\n", (long)stats->nreads_secondary);
      fprintf(to, "SN\ttotal length:\t%ld\t# ignores clipping\n", (long)stats->total_len);
+    fprintf(to, "SN\ttotal first fragment length:\t%ld\t# ignores clipping\n", (long)stats->total_len_1st);
+    fprintf(to, "SN\ttotal last fragment length:\t%ld\t# ignores clipping\n", (long)stats->total_len_2nd);
      fprintf(to, "SN\tbases mapped:\t%ld\t# ignores clipping\n", (long)stats->nbases_mapped);                 // the length of the whole read goes here, including soft-clips etc.
      fprintf(to, "SN\tbases mapped (cigar):\t%ld\t# more accurate\n", (long)stats->nbases_mapped_cigar);   // only matched and inserted bases are counted here
      fprintf(to, "SN\tbases trimmed:\t%ld\n", (long)stats->nbases_trimmed);
@@ -1067,7 +1346,11 @@ void output_stats(FILE *to, stats_t *stats, int sparse)
      fprintf(to, "SN\terror rate:\t%e\t# mismatches / bases mapped (cigar)\n", stats->nbases_mapped_cigar ? (float)stats->nmismatches/stats->nbases_mapped_cigar : 0);
      float avg_read_length = (stats->nreads_1st+stats->nreads_2nd)?stats->total_len/(stats->nreads_1st+stats->nreads_2nd):0;
      fprintf(to, "SN\taverage length:\t%.0f\n", avg_read_length);
+    fprintf(to, "SN\taverage first fragment length:\t%.0f\n", stats->nreads_1st? (float)stats->total_len_1st/stats->nreads_1st:0);
+    fprintf(to, "SN\taverage last fragment length:\t%.0f\n", stats->nreads_2nd? (float)stats->total_len_2nd/stats->nreads_2nd:0);
      fprintf(to, "SN\tmaximum length:\t%d\n", stats->max_len);
+    fprintf(to, "SN\tmaximum first fragment length:\t%d\n", stats->max_len_1st);
+    fprintf(to, "SN\tmaximum last fragment length:\t%d\n", stats->max_len_2nd);
      fprintf(to, "SN\taverage quality:\t%.1f\n", stats->total_len?stats->sum_qual/stats->total_len:0);
      fprintf(to, "SN\tinsert size average:\t%.1f\n", avg_isize);
      fprintf(to, "SN\tinsert size standard deviation:\t%.1f\n", sd_isize);
@@ -1075,13 +1358,20 @@ void output_stats(FILE *to, stats_t *stats, int sparse)
      fprintf(to, "SN\toutward oriented pairs:\t%ld\n", (long)nisize_outward);
      fprintf(to, "SN\tpairs with other orientation:\t%ld\n", (long)nisize_other);
      fprintf(to, "SN\tpairs on different chromosomes:\t%ld\n", (long)stats->nreads_anomalous/2);
+    fprintf(to, "SN\tpercentage of properly paired reads (%%):\t%.1f\n", (stats->nreads_1st+stats->nreads_2nd)? (float)(100*stats->nreads_properly_paired)/(stats->nreads_1st+stats->nreads_2nd):0);
+    if ( stats->target_count ) {
+        fprintf(to, "SN\tbases inside the target:\t%u\n", stats->target_count);
+        for (icov=stats->info->cov_threshold+1; icov<stats->ncov; icov++)
+            cov_sum += stats->cov[icov];
+        fprintf(to, "SN\tpercentage of target genome with coverage > %d (%%):\t%.2f\n", stats->info->cov_threshold, (float)(100*cov_sum)/stats->target_count);
+    }
  
      int ibase,iqual;
      if ( stats->max_len<stats->nbases ) stats->max_len++;
      if ( stats->max_qual+1<stats->nquals ) stats->max_qual++;
-    fprintf(to, "# First Fragment Qualitites. Use `grep ^FFQ | cut -f 2-` to extract this part.\n");
+    fprintf(to, "# First Fragment Qualities. Use `grep ^FFQ | cut -f 2-` to extract this part.\n");
      fprintf(to, "# Columns correspond to qualities and rows to cycles. First column is the cycle number.\n");
-    for (ibase=0; ibase<stats->max_len; ibase++)
+    for (ibase=0; ibase<stats->max_len_1st; ibase++)
      {
          fprintf(to, "FFQ\t%d",ibase+1);
          for (iqual=0; iqual<=stats->max_qual; iqual++)
@@ -1090,9 +1380,9 @@ void output_stats(FILE *to, stats_t *stats, int sparse)
          }
          fprintf(to, "\n");
      }
-    fprintf(to, "# Last Fragment Qualitites. Use `grep ^LFQ | cut -f 2-` to extract this part.\n");
+    fprintf(to, "# Last Fragment Qualities. Use `grep ^LFQ | cut -f 2-` to extract this part.\n");
      fprintf(to, "# Columns correspond to qualities and rows to cycles. First column is the cycle number.\n");
-    for (ibase=0; ibase<stats->max_len; ibase++)
+    for (ibase=0; ibase<stats->max_len_2nd; ibase++)
      {
          fprintf(to, "LFQ\t%d",ibase+1);
          for (iqual=0; iqual<=stats->max_qual; iqual++)
@@ -1135,10 +1425,51 @@ void output_stats(FILE *to, stats_t *stats, int sparse)
      fprintf(to, "# ACGT content per cycle. Use `grep ^GCC | cut -f 2-` to extract this part. The columns are: cycle; A,C,G,T base counts as a percentage of all A/C/G/T bases [%%]; and N and O counts as a percentage of all A/C/G/T bases [%%]\n");
      for (ibase=0; ibase<stats->max_len; ibase++)
      {
-        acgtno_count_t *acgtno_count = &(stats->acgtno_cycles[ibase]);
-        uint64_t acgt_sum = acgtno_count->a + acgtno_count->c + acgtno_count->g + acgtno_count->t;
+        acgtno_count_t *acgtno_count_1st = &(stats->acgtno_cycles_1st[ibase]);
+        acgtno_count_t *acgtno_count_2nd = &(stats->acgtno_cycles_2nd[ibase]);
+        uint64_t acgt_sum = acgtno_count_1st->a + acgtno_count_1st->c + acgtno_count_1st->g + acgtno_count_1st->t +
+                acgtno_count_2nd->a + acgtno_count_2nd->c + acgtno_count_2nd->g + acgtno_count_2nd->t;
          if ( ! acgt_sum ) continue;
-        fprintf(to, "GCC\t%d\t%.2f\t%.2f\t%.2f\t%.2f\t%.2f\t%.2f\n", ibase+1, 100.*acgtno_count->a/acgt_sum, 100.*acgtno_count->c/acgt_sum, 100.*acgtno_count->g/acgt_sum, 100.*acgtno_count->t/acgt_sum, 100.*acgtno_count->n/acgt_sum, 100.*acgtno_count->other/acgt_sum);
+        fprintf(to, "GCC\t%d\t%.2f\t%.2f\t%.2f\t%.2f\t%.2f\t%.2f\n", ibase+1,
+                100.*(acgtno_count_1st->a + acgtno_count_2nd->a)/acgt_sum,
+                100.*(acgtno_count_1st->c + acgtno_count_2nd->c)/acgt_sum,
+                100.*(acgtno_count_1st->g + acgtno_count_2nd->g)/acgt_sum,
+                100.*(acgtno_count_1st->t + acgtno_count_2nd->t)/acgt_sum,
+                100.*(acgtno_count_1st->n + acgtno_count_2nd->n)/acgt_sum,
+                100.*(acgtno_count_1st->other + acgtno_count_2nd->other)/acgt_sum);
+
+    }
+    fprintf(to, "# ACGT content per cycle for first fragments. Use `grep ^FBC | cut -f 2-` to extract this part. The columns are: cycle; A,C,G,T base counts as a percentage of all A/C/G/T bases [%%]; and N and O counts as a percentage of all A/C/G/T bases [%%]\n");
+    for (ibase=0; ibase<stats->max_len; ibase++)
+    {
+        acgtno_count_t *acgtno_count_1st = &(stats->acgtno_cycles_1st[ibase]);
+        uint64_t acgt_sum_1st = acgtno_count_1st->a + acgtno_count_1st->c + acgtno_count_1st->g + acgtno_count_1st->t;
+
+        if ( acgt_sum_1st )
+            fprintf(to, "FBC\t%d\t%.2f\t%.2f\t%.2f\t%.2f\t%.2f\t%.2f\n", ibase+1,
+                    100.*acgtno_count_1st->a/acgt_sum_1st,
+                    100.*acgtno_count_1st->c/acgt_sum_1st,
+                    100.*acgtno_count_1st->g/acgt_sum_1st,
+                    100.*acgtno_count_1st->t/acgt_sum_1st,
+                    100.*acgtno_count_1st->n/acgt_sum_1st,
+                    100.*acgtno_count_1st->other/acgt_sum_1st);
+
+    }
+    fprintf(to, "# ACGT content per cycle for last fragments. Use `grep ^LBC | cut -f 2-` to extract this part. The columns are: cycle; A,C,G,T base counts as a percentage of all A/C/G/T bases [%%]; and N and O counts as a percentage of all A/C/G/T bases [%%]\n");
+    for (ibase=0; ibase<stats->max_len; ibase++)
+    {
+        acgtno_count_t *acgtno_count_2nd = &(stats->acgtno_cycles_2nd[ibase]);
+        uint64_t acgt_sum_2nd = acgtno_count_2nd->a + acgtno_count_2nd->c + acgtno_count_2nd->g + acgtno_count_2nd->t;
+
+        if ( acgt_sum_2nd )
+            fprintf(to, "LBC\t%d\t%.2f\t%.2f\t%.2f\t%.2f\t%.2f\t%.2f\n", ibase+1,
+                    100.*acgtno_count_2nd->a/acgt_sum_2nd,
+                    100.*acgtno_count_2nd->c/acgt_sum_2nd,
+                    100.*acgtno_count_2nd->g/acgt_sum_2nd,
+                    100.*acgtno_count_2nd->t/acgt_sum_2nd,
+                    100.*acgtno_count_2nd->n/acgt_sum_2nd,
+                    100.*acgtno_count_2nd->other/acgt_sum_2nd);
+
      }
      fprintf(to, "# Insert sizes. Use `grep ^IS | cut -f 2-` to extract this part. The columns are: insert size, pairs total, inward oriented pairs, outward oriented pairs, other pairs\n");
      for (isize=0; isize<ibulk; isize++) {
@@ -1155,11 +1486,26 @@ void output_stats(FILE *to, stats_t *stats, int sparse)
      int ilen;
      for (ilen=0; ilen<stats->max_len; ilen++)
      {
-        if ( stats->read_lengths[ilen]>0 )
-            fprintf(to, "RL\t%d\t%ld\n", ilen, (long)stats->read_lengths[ilen]);
+        if ( stats->read_lengths[ilen+1]>0 )
+            fprintf(to, "RL\t%d\t%ld\n", ilen+1, (long)stats->read_lengths[ilen+1]);
+    }
+
+    fprintf(to, "# Read lengths - first fragments. Use `grep ^FRL | cut -f 2-` to extract this part. The columns are: read length, count\n");
+    for (ilen=0; ilen<stats->max_len_1st; ilen++)
+    {
+        if ( stats->read_lengths_1st[ilen+1]>0 )
+            fprintf(to, "FRL\t%d\t%ld\n", ilen+1, (long)stats->read_lengths_1st[ilen+1]);
+    }
+
+    fprintf(to, "# Read lengths - last fragments. Use `grep ^LRL | cut -f 2-` to extract this part. The columns are: read length, count\n");
+    for (ilen=0; ilen<stats->max_len_2nd; ilen++)
+    {
+        if ( stats->read_lengths_2nd[ilen+1]>0 )
+            fprintf(to, "LRL\t%d\t%ld\n", ilen+1, (long)stats->read_lengths_2nd[ilen+1]);
      }
  
      fprintf(to, "# Indel distribution. Use `grep ^ID | cut -f 2-` to extract this part. The columns are: length, number of insertions, number of deletions\n");
+
      for (ilen=0; ilen<stats->nindels; ilen++)
      {
          if ( stats->insertions[ilen]>0 || stats->deletions[ilen]>0 )
@@ -1178,7 +1524,6 @@ void output_stats(FILE *to, stats_t *stats, int sparse)
      fprintf(to, "# Coverage distribution. Use `grep ^COV | cut -f 2-` to extract this part.\n");
      if  ( stats->cov[0] )
          fprintf(to, "COV\t[<%d]\t%d\t%ld\n",stats->info->cov_min,stats->info->cov_min-1, (long)stats->cov[0]);
-    int icov;
      for (icov=1; icov<stats->ncov-1; icov++)
          if ( stats->cov[icov] )
              fprintf(to, "COV\t[%d-%d]\t%d\t%ld\n",stats->info->cov_min + (icov-1)*stats->info->cov_step, stats->info->cov_min + icov*stats->info->cov_step-1,stats->info->cov_min + icov*stats->info->cov_step-1, (long)stats->cov[icov]);
@@ -1225,7 +1570,7 @@ void init_regions(stats_t *stats, const char *file)
      if ( !fp ) error("%s: %s\n",file,strerror(errno));
  
      kstring_t line = { 0, 0, NULL };
-    int warned = 0;
+    int warned = 0, r, p, new_p;
      int prev_tid=-1, prev_pos=-1;
      while (line.l = 0, kgetline(&line, (kgets_func *)fgets, fp) >= 0)
      {
@@ -1272,10 +1617,31 @@ void init_regions(stats_t *stats, const char *file)
          if ( prev_pos>stats->regions[tid].pos[npos].from )
              error("The positions are not in chromosomal order (%s:%d comes after %d)\n", line.s,stats->regions[tid].pos[npos].from,prev_pos);
          stats->regions[tid].npos++;
+        if ( stats->regions[tid].npos > stats->nchunks )
+            stats->nchunks = stats->regions[tid].npos;
      }
      free(line.s);
      if ( !stats->regions ) error("Unable to map the -t sequences to the BAM sequences.\n");
      fclose(fp);
+
+    // sort region intervals and remove duplicates
+    for (r = 0; r < stats->nregions; r++) {
+        regions_t *reg = &stats->regions[r];
+        if ( reg->npos > 1 ) {
+            qsort(reg->pos, reg->npos, sizeof(pos_t), regions_lt);
+            for (new_p = 0, p = 1; p < reg->npos; p++) {
+                if ( reg->pos[new_p].to < reg->pos[p].from )
+                    reg->pos[++new_p] = reg->pos[p];
+                else if ( reg->pos[new_p].to < reg->pos[p].to )
+                    reg->pos[new_p].to = reg->pos[p].to;
+            }
+            reg->npos = ++new_p;
+        }
+        for (p = 0; p < reg->npos; p++)
+            stats->target_count += (reg->pos[p].to - reg->pos[p].from + 1);
+    }
+
+    stats->chunks = calloc(stats->nchunks, sizeof(pos_t));
  }
  
  void destroy_regions(stats_t *stats)
@@ -1287,6 +1653,7 @@ void destroy_regions(stats_t *stats)
          free(stats->regions[i].pos);
      }
      if ( stats->regions ) free(stats->regions);
+    if ( stats->chunks ) free(stats->chunks);
  }
  
  void reset_regions(stats_t *stats)
@@ -1311,14 +1678,70 @@ int is_in_regions(bam1_t *bam_line, stats_t *stats)
      int i = reg->cpos;
      while ( i<reg->npos && reg->pos[i].to<=bam_line->core.pos ) i++;
      if ( i>=reg->npos ) { reg->cpos = reg->npos; return 0; }
-    if ( bam_line->core.pos + bam_line->core.l_qseq + 1 < reg->pos[i].from ) return 0;
+    int64_t endpos = bam_endpos(bam_line);
+    if ( endpos < reg->pos[i].from ) return 0;
+
+    //found a read overlapping a region
      reg->cpos = i;
      stats->reg_from = reg->pos[i].from;
      stats->reg_to   = reg->pos[i].to;
  
+    //now find all the overlapping chunks
+    stats->nchunks = 0;
+    while (i < reg->npos) {
+        if (bam_line->core.pos < reg->pos[i].to && endpos >= reg->pos[i].from) {
+            stats->chunks[stats->nchunks].from = MAX(bam_line->core.pos+1, reg->pos[i].from);
+            stats->chunks[stats->nchunks].to = MIN(endpos, reg->pos[i].to);
+            stats->nchunks++;
+        }
+        i++;
+    }
+
      return 1;
  }
  
+int replicate_regions(stats_t *stats, hts_itr_multi_t *iter) {
+    if ( !stats || !iter)
+        return 1;
+
+    int i, j, tid;
+    stats->nregions = iter->n_reg;
+    stats->regions = calloc(stats->nregions, sizeof(regions_t));
+    stats->chunks = calloc(stats->nchunks, sizeof(pos_t));
+    if ( !stats->regions || !stats->chunks )
+        return 1;
+
+    for (i = 0; i < iter->n_reg; i++) {
+        tid = iter->reg_list[i].tid;
+        if ( tid < 0 )
+            continue;
+
+        if ( tid >= stats->nregions ) {
+            regions_t *tmp = realloc(stats->regions, (tid+10) * sizeof(regions_t));
+            if ( !tmp )
+                return 1;
+            stats->regions = tmp;
+            memset(stats->regions + stats->nregions, 0,
+                   (tid+10-stats->nregions) * sizeof(regions_t));
+            stats->nregions = tid+10;
+        }
+
+        stats->regions[tid].mpos = stats->regions[tid].npos = iter->reg_list[i].count;
+        stats->regions[tid].pos = calloc(stats->regions[tid].mpos, sizeof(pos_t));
+        if ( !stats->regions[tid].pos )
+            return 1;
+
+        for (j = 0; j < stats->regions[tid].npos; j++) {
+            stats->regions[tid].pos[j].from = iter->reg_list[i].intervals[j].beg+1;
+            stats->regions[tid].pos[j].to = iter->reg_list[i].intervals[j].end;
+
+            stats->target_count += (stats->regions[tid].pos[j].to - stats->regions[tid].pos[j].from + 1);
+        }
+    }
+
+    return 0;
+}
+
  void init_group_id(stats_t *stats, const char *id)
  {
  #if 0
@@ -1375,6 +1798,8 @@ static void error(const char *format, ...)
          printf("    -S, --split <tag>                   Also write statistics to separate files split by tagged field.\n");
          printf("    -t, --target-regions <file>         Do stats in these regions only. Tab-delimited file chr,from,to, 1-based, inclusive.\n");
          printf("    -x, --sparse                        Suppress outputting IS rows where there are no insertions.\n");
+        printf("    -p, --remove-overlaps               Remove overlaps of paired-end reads from coverage and base count computations.\n");
+        printf("    -g, --cov-threshold                 Only bases with coverage above this value will be included in the target percentage computation.\n");
          sam_global_opt_help(stdout, "-.--.@");
          printf("\n");
      }
@@ -1404,8 +1829,11 @@ void cleanup_stats(stats_t* stats)
      free(stats->gcd);
      free(stats->rseq_buf);
      free(stats->mpc_buf);
-    free(stats->acgtno_cycles);
+    free(stats->acgtno_cycles_1st);
+    free(stats->acgtno_cycles_2nd);
      free(stats->read_lengths);
+    free(stats->read_lengths_1st);
+    free(stats->read_lengths_2nd);
      free(stats->insertions);
      free(stats->deletions);
      free(stats->ins_cycles_1st);
@@ -1454,8 +1882,8 @@ void destroy_split_stats(khash_t(c2stats) *split_hash)
      stats_t *curr_stats = NULL;
      for(i = kh_begin(split_hash); i != kh_end(split_hash); ++i){
          if(!kh_exist(split_hash, i)) continue;
-            curr_stats = kh_value(split_hash, i);
-            cleanup_stats(curr_stats);
+        curr_stats = kh_value(split_hash, i);
+        cleanup_stats(curr_stats);
      }
      kh_destroy(c2stats, split_hash);
  }
@@ -1472,6 +1900,8 @@ stats_info_t* stats_info_init(int argc, char *argv[])
      info->filter_readlen = -1;
      info->argc = argc;
      info->argv = argv;
+    info->remove_overlaps = 0;
+    info->cov_threshold = 0;
  
      return info;
  }
@@ -1499,14 +1929,17 @@ stats_t* stats_init()
      stats->ngc    = 200;
      stats->nquals = 256;
      stats->nbases = 300;
-    stats->max_len   = 30;
-    stats->max_qual  = 40;
      stats->rseq_pos     = -1;
      stats->tid = stats->gcd_pos = -1;
      stats->igcd = 0;
      stats->is_sorted = 1;
      stats->nindels = stats->nbases;
      stats->split_name = NULL;
+    stats->nchunks = 0;
+    stats->pair_count = 0;
+    stats->last_pair_tid = -2;
+    stats->last_read_flush = 0;
+    stats->target_count = 0;
  
      return stats;
  }
@@ -1540,8 +1973,11 @@ static void init_stat_structs(stats_t* stats, stats_info_t* info, const char* gr
      stats->isize          = init_isize_t(info->nisize ?info->nisize+1 :0);
      stats->gcd            = calloc(stats->ngcd,sizeof(gc_depth_t));
      stats->mpc_buf        = info->fai ? calloc(stats->nquals*stats->nbases,sizeof(uint64_t)) : NULL;
-    stats->acgtno_cycles  = calloc(stats->nbases,sizeof(acgtno_count_t));
+    stats->acgtno_cycles_1st  = calloc(stats->nbases,sizeof(acgtno_count_t));
+    stats->acgtno_cycles_2nd  = calloc(stats->nbases,sizeof(acgtno_count_t));
      stats->read_lengths   = calloc(stats->nbases,sizeof(uint64_t));
+    stats->read_lengths_1st   = calloc(stats->nbases,sizeof(uint64_t));
+    stats->read_lengths_2nd   = calloc(stats->nbases,sizeof(uint64_t));
      stats->insertions     = calloc(stats->nbases,sizeof(uint64_t));
      stats->deletions      = calloc(stats->nbases,sizeof(uint64_t));
      stats->ins_cycles_1st = calloc(stats->nbases+1,sizeof(uint64_t));
@@ -1614,16 +2050,18 @@ int main_stats(int argc, char *argv[])
          {"sparse", no_argument, NULL, 'x'},
          {"split", required_argument, NULL, 'S'},
          {"split-prefix", required_argument, NULL, 'P'},
+        {"remove-overlaps", no_argument, NULL, 'p'},
+        {"cov-threshold", required_argument, NULL, 'g'},
          {NULL, 0, NULL, 0}
      };
      int opt;
  
-    while ( (opt=getopt_long(argc,argv,"?hdsxr:c:l:i:t:m:q:f:F:I:1:S:P:@:",loptions,NULL))>0 )
+    while ( (opt=getopt_long(argc,argv,"?hdsxpr:c:l:i:t:m:q:f:F:g:I:1:S:P:@:",loptions,NULL))>0 )
      {
          switch (opt)
          {
              case 'f': info->flag_require = bam_str2flag(optarg); break;
-            case 'F': info->flag_filter = bam_str2flag(optarg); break;
+            case 'F': info->flag_filter |= bam_str2flag(optarg); break;
              case 'd': info->flag_filter |= BAM_FDUP; break;
              case 's': break;
              case 'r': info->fai = fai_load(optarg);
@@ -1643,6 +2081,11 @@ int main_stats(int argc, char *argv[])
              case 'x': sparse = 1; break;
              case 'S': info->split_tag = optarg; break;
              case 'P': info->split_prefix = optarg; break;
+            case 'p': info->remove_overlaps = 1; break;
+            case 'g': info->cov_threshold = atoi(optarg);
+                      if ( info->cov_threshold < 0 || info->cov_threshold == INT_MAX )
+                          error("Unsupported value for coverage threshold %d\n", info->cov_threshold);
+                      break;
              case '?':
              case 'h': error(NULL);
              default:
@@ -1661,7 +2104,10 @@ int main_stats(int argc, char *argv[])
          bam_fname = "-";
      }
  
-    if (init_stat_info_fname(info, bam_fname, &ga.in)) return 1;
+    if (init_stat_info_fname(info, bam_fname, &ga.in)) {
+        free(info);
+        return 1;
+    }
      if (ga.nthreads > 0)
          hts_set_threads(info->sam, ga.nthreads);
  
@@ -1672,41 +2118,78 @@ int main_stats(int argc, char *argv[])
      // .. hash
      khash_t(c2stats)* split_hash = kh_init(c2stats);
  
+    khash_t(qn2pair)* read_pairs = kh_init(qn2pair);
+
      // Collect statistics
      bam1_t *bam_line = bam_init1();
      if ( optind<argc )
      {
-        // Collect stats in selected regions only
-        hts_idx_t *bam_idx = sam_index_load(info->sam,bam_fname);
-        if (bam_idx == 0)
-            error("Random alignment retrieval only works for indexed BAM files.\n");
-
-        int i;
-        for (i=optind; i<argc; i++)
-        {
-            hts_itr_t* iter = bam_itr_querys(bam_idx, info->sam_header, argv[i]);
-            while (sam_itr_next(info->sam, iter, bam_line) >= 0) {
-                if (info->split_tag) {
-                    curr_stats = get_curr_split_stats(bam_line, split_hash, info, targets);
-                    collect_stats(bam_line, curr_stats);
+        int filter = 1;
+        // Prepare the region hash table for the multi-region iterator
+        void *region_hash = bed_hash_regions(NULL, argv, optind, argc, &filter);
+        if (region_hash) {
+
+            // Collect stats in selected regions only
+            hts_idx_t *bam_idx = sam_index_load(info->sam,bam_fname);
+            if (bam_idx) {
+
+                int regcount = 0;
+                hts_reglist_t *reglist = bed_reglist(region_hash, ALL, &regcount);
+                if (reglist) {
+
+                    hts_itr_multi_t *iter = sam_itr_regions(bam_idx, info->sam_header, reglist, regcount);
+                    if (iter) {
+
+                        if (!targets) {
+                            all_stats->nchunks = argc-optind;
+                            if ( replicate_regions(all_stats, iter) )
+                                fprintf(stderr, "Replication of the regions failed.\n");
+                        }
+
+                        if ( all_stats->nregions && all_stats->regions ) {
+                            while (sam_itr_multi_next(info->sam, iter, bam_line) >= 0) {
+                               if (info->split_tag) {
+                                   curr_stats = get_curr_split_stats(bam_line, split_hash, info, targets);
+                                   collect_stats(bam_line, curr_stats, read_pairs);
+                               }
+                               collect_stats(bam_line, all_stats, read_pairs);
+                            }
+                        }
+
+                        hts_itr_multi_destroy(iter);
+                    } else {
+                       fprintf(stderr, "Creation of the region iterator failed.\n");
+                       hts_reglist_free(reglist, regcount);
+                    }
+                } else {
+                    fprintf(stderr, "Creation of the region list failed.\n");
                  }
-                collect_stats(bam_line, all_stats);
+
+                hts_idx_destroy(bam_idx);
+            } else {
+                fprintf(stderr, "Random alignment retrieval only works for indexed BAM files.\n");
              }
-            reset_regions(all_stats);
-            bam_itr_destroy(iter);
+
+            bed_destroy(region_hash);
+        } else {
+            fprintf(stderr, "Creation of the region hash table failed.\n");
          }
-        hts_idx_destroy(bam_idx);
      }
      else
      {
+        if ( info->cov_threshold > 0 && !targets ) {
+            fprintf(stderr, "Coverage percentage calculation requires a list of target regions\n");
+            goto cleanup;
+        }
+
          // Stream through the entire BAM ignoring off-target regions if -t is given
          int ret;
          while ((ret = sam_read1(info->sam, info->sam_header, bam_line)) >= 0) {
              if (info->split_tag) {
                  curr_stats = get_curr_split_stats(bam_line, split_hash, info, targets);
-                collect_stats(bam_line, curr_stats);
+                collect_stats(bam_line, curr_stats, read_pairs);
              }
-            collect_stats(bam_line, all_stats);
+            collect_stats(bam_line, all_stats, read_pairs);
          }
  
          if (ret < -1) {
@@ -1720,6 +2203,7 @@ int main_stats(int argc, char *argv[])
      if (info->split_tag)
          output_split_stats(split_hash, bam_fname, sparse);
  
+cleanup:
      bam_destroy1(bam_line);
      bam_hdr_destroy(info->sam_header);
      sam_global_args_free(&ga);
@@ -1727,6 +2211,7 @@ int main_stats(int argc, char *argv[])
      cleanup_stats(all_stats);
      cleanup_stats_info(info);
      destroy_split_stats(split_hash);
+    cleanup_overlaps(read_pairs, INT_MAX);
  
      return 0;
  }
diff --git a/samtools/stats.c.pysam.c b/samtools/stats.c.pysam.c
index 1c94a10a72c6b2e985d5099ec3189de8c6ac5e5c..45c7a51dfcd25b5a0f61869152b8f072e39dbaa5 100644
--- a/samtools/stats.c.pysam.c
+++ b/samtools/stats.c.pysam.c
@@ -62,8 +62,11 @@ DEALINGS IN THE SOFTWARE.  */
  #include <htslib/kstring.h>
  #include "stats_isize.h"
  #include "sam_opts.h"
+#include "bedidx.h"
  
  #define BWA_MIN_RDLEN 35
+#define DEFAULT_CHUNK_NO 8
+#define DEFAULT_PAIR_MAX 10000
  // From the spec
  // If 0x4 is set, no assumptions can be made about RNAME, POS, CIGAR, MAPQ, bits 0x2, 0x10, 0x100 and 0x800, and the bit 0x20 of the previous read in the template.
  #define IS_PAIRED_AND_MAPPED(bam) (((bam)->core.flag&BAM_FPAIRED) && !((bam)->core.flag&BAM_FUNMAP) && !((bam)->core.flag&BAM_FMUNMAP))
@@ -136,6 +139,8 @@ typedef struct
      // Misc
      char *split_tag;      // Tag on which to perform stats splitting
      char *split_prefix;   // Path or string prefix for filenames created when splitting
+    int remove_overlaps;
+    int cov_threshold;
  }
  stats_info_t;
  
@@ -151,19 +156,24 @@ typedef struct
      // Arrays for the histogram data
      uint64_t *quals_1st, *quals_2nd;
      uint64_t *gc_1st, *gc_2nd;
-    acgtno_count_t *acgtno_cycles;
-    uint64_t *read_lengths;
+    acgtno_count_t *acgtno_cycles_1st;
+    acgtno_count_t *acgtno_cycles_2nd;
+    uint64_t *read_lengths, *read_lengths_1st, *read_lengths_2nd;
      uint64_t *insertions, *deletions;
      uint64_t *ins_cycles_1st, *ins_cycles_2nd, *del_cycles_1st, *del_cycles_2nd;
      isize_t *isize;
  
      // The extremes encountered
      int max_len;            // Maximum read length
+    int max_len_1st;        // Maximum read length for first fragments (READ1)
+    int max_len_2nd;        // Maximum read length for last fragments (READ2)
      int max_qual;           // Maximum quality
      int is_sorted;
  
      // Summary numbers
      uint64_t total_len;
+    uint64_t total_len_1st;
+    uint64_t total_len_2nd;
      uint64_t total_len_dup;
      uint64_t nreads_1st;
      uint64_t nreads_2nd;
@@ -204,7 +214,7 @@ typedef struct
      uint64_t *mpc_buf;              // Mismatches per cycle
  
      // Target regions
-    int nregions, reg_from,reg_to;
+    int nregions, reg_from, reg_to;
      regions_t *regions;
  
      // Auxiliary data
@@ -215,15 +225,35 @@ typedef struct
      char* split_name;
  
      stats_info_t* info;             // Pointer to options and settings struct
+    pos_t *chunks;
+    uint32_t nchunks;
  
+    uint32_t pair_count;          // Number of active pairs in the pairing hash table
+    uint32_t target_count;        // Number of bases covered by the target file
+    uint32_t last_pair_tid;
+    uint32_t last_read_flush;
  }
  stats_t;
  KHASH_MAP_INIT_STR(c2stats, stats_t*)
  
+typedef struct {
+    uint32_t first;     // 1 - first read, 2 - second read
+    uint32_t n, m;      // number of chunks, allocated chunks
+    pos_t *chunks;      // chunk array of size m
+} pair_t;
+KHASH_MAP_INIT_STR(qn2pair, pair_t*)
+
+
  static void error(const char *format, ...);
  int is_in_regions(bam1_t *bam_line, stats_t *stats);
  void realloc_buffers(stats_t *stats, int seq_len);
  
+static int regions_lt(const void *r1, const void *r2) {
+    int64_t from_diff = (int64_t)((pos_t *)r1)->from - (int64_t)((pos_t *)r2)->from;
+    int64_t to_diff = (int64_t)((pos_t *)r1)->to - (int64_t)((pos_t *)r2)->to;
+
+    return from_diff > 0 ? 1 : from_diff < 0 ? -1 : to_diff > 0 ? 1 : to_diff < 0 ? -1 : 0;
+}
  
  // Coverage distribution methods
  static inline int coverage_idx(int min, int max, int n, int step, int depth)
@@ -572,16 +602,31 @@ void realloc_buffers(stats_t *stats, int seq_len)
          memset(stats->mpc_buf + stats->nbases*stats->nquals, 0, (n-stats->nbases)*stats->nquals*sizeof(uint64_t));
      }
  
-    stats->acgtno_cycles = realloc(stats->acgtno_cycles, n*sizeof(acgtno_count_t));
-    if ( !stats->acgtno_cycles )
+    stats->acgtno_cycles_1st = realloc(stats->acgtno_cycles_1st, n*sizeof(acgtno_count_t));
+    if ( !stats->acgtno_cycles_1st )
          error("Could not realloc buffers, the sequence too long: %d (%ld)\n", seq_len, n*sizeof(acgtno_count_t));
-    memset(stats->acgtno_cycles + stats->nbases, 0, (n-stats->nbases)*sizeof(acgtno_count_t));
+    memset(stats->acgtno_cycles_1st + stats->nbases, 0, (n-stats->nbases)*sizeof(acgtno_count_t));
+
+    stats->acgtno_cycles_2nd = realloc(stats->acgtno_cycles_2nd, n*sizeof(acgtno_count_t));
+    if ( !stats->acgtno_cycles_2nd )
+        error("Could not realloc buffers, the sequence too long: %d (%ld)\n", seq_len, n*sizeof(acgtno_count_t));
+    memset(stats->acgtno_cycles_2nd + stats->nbases, 0, (n-stats->nbases)*sizeof(acgtno_count_t));
  
      stats->read_lengths = realloc(stats->read_lengths, n*sizeof(uint64_t));
      if ( !stats->read_lengths )
          error("Could not realloc buffers, the sequence too long: %d (%ld)\n", seq_len,n*sizeof(uint64_t));
      memset(stats->read_lengths + stats->nbases, 0, (n-stats->nbases)*sizeof(uint64_t));
  
+    stats->read_lengths_1st = realloc(stats->read_lengths_1st, n*sizeof(uint64_t));
+    if ( !stats->read_lengths_1st )
+        error("Could not realloc buffers, the sequence too long: %d (%ld)\n", seq_len,n*sizeof(uint64_t));
+    memset(stats->read_lengths_1st + stats->nbases, 0, (n-stats->nbases)*sizeof(uint64_t));
+
+    stats->read_lengths_2nd = realloc(stats->read_lengths_2nd, n*sizeof(uint64_t));
+    if ( !stats->read_lengths_2nd )
+        error("Could not realloc buffers, the sequence too long: %d (%ld)\n", seq_len,n*sizeof(uint64_t));
+    memset(stats->read_lengths_2nd + stats->nbases, 0, (n-stats->nbases)*sizeof(uint64_t));
+
      stats->insertions = realloc(stats->insertions, n*sizeof(uint64_t));
      if ( !stats->insertions )
          error("Could not realloc buffers, the sequence too long: %d (%ld)\n", seq_len,n*sizeof(uint64_t));
@@ -657,7 +702,7 @@ void collect_orig_read_stats(bam1_t *bam_line, stats_t *stats, int* gc_count_out
  
      // Count GC and ACGT per cycle. Note that cycle is approximate, clipping is ignored
      uint8_t *seq  = bam_get_seq(bam_line);
-    int i, read_cycle, gc_count = 0, reverse = IS_REVERSE(bam_line);
+    int i, read_cycle, gc_count = 0, reverse = IS_REVERSE(bam_line), is_first = IS_READ1(bam_line);
      for (i=0; i<seq_len; i++)
      {
          // Read cycle for current index
@@ -668,28 +713,28 @@ void collect_orig_read_stats(bam1_t *bam_line, stats_t *stats, int* gc_count_out
          //      =ACMGRSVTWYHKDBN
          switch (bam_seqi(seq, i)) {
          case 1:
-            stats->acgtno_cycles[ read_cycle ].a++;
+            is_first ? stats->acgtno_cycles_1st[ read_cycle ].a++ : stats->acgtno_cycles_2nd[ read_cycle ].a++;
              break;
          case 2:
-            stats->acgtno_cycles[ read_cycle ].c++;
+            is_first ? stats->acgtno_cycles_1st[ read_cycle ].c++ : stats->acgtno_cycles_2nd[ read_cycle ].c++;
              gc_count++;
              break;
          case 4:
-            stats->acgtno_cycles[ read_cycle ].g++;
+            is_first ? stats->acgtno_cycles_1st[ read_cycle ].g++ : stats->acgtno_cycles_2nd[ read_cycle ].g++;
              gc_count++;
              break;
          case 8:
-            stats->acgtno_cycles[ read_cycle ].t++;
+            is_first ? stats->acgtno_cycles_1st[ read_cycle ].t++ : stats->acgtno_cycles_2nd[ read_cycle ].t++;
              break;
          case 15:
-            stats->acgtno_cycles[ read_cycle ].n++;
+            is_first ? stats->acgtno_cycles_1st[ read_cycle ].n++ : stats->acgtno_cycles_2nd[ read_cycle ].n++;
              break;
          default:
              /*
               * count "=" sequences in "other" along
               * with MRSVWYHKDB ambiguity codes
               */
-            stats->acgtno_cycles[ read_cycle ].other++;
+            is_first ? stats->acgtno_cycles_1st[ read_cycle ].other++ : stats->acgtno_cycles_2nd[ read_cycle ].other++;
              break;
          }
      }
@@ -702,10 +747,11 @@ void collect_orig_read_stats(bam1_t *bam_line, stats_t *stats, int* gc_count_out
      //  fill GC histogram
      uint64_t *quals;
      uint8_t *bam_quals = bam_get_qual(bam_line);
-    if ( bam_line->core.flag&BAM_FREAD2 )
+    if ( IS_READ2(bam_line) )
      {
          quals  = stats->quals_2nd;
          stats->nreads_2nd++;
+        stats->total_len_2nd += seq_len;
          for (i=gc_idx_min; i<gc_idx_max; i++)
              stats->gc_2nd[i]++;
      }
@@ -713,6 +759,7 @@ void collect_orig_read_stats(bam1_t *bam_line, stats_t *stats, int* gc_count_out
      {
          quals = stats->quals_1st;
          stats->nreads_1st++;
+        stats->total_len_1st += seq_len;
          for (i=gc_idx_min; i<gc_idx_max; i++)
              stats->gc_1st[i]++;
      }
@@ -758,7 +805,160 @@ void collect_orig_read_stats(bam1_t *bam_line, stats_t *stats, int* gc_count_out
      *gc_count_out = gc_count;
  }
  
-void collect_stats(bam1_t *bam_line, stats_t *stats)
+static int cleanup_overlaps(khash_t(qn2pair) *read_pairs, int max) {
+    if ( !read_pairs )
+        return 0;
+
+    int count = 0;
+    khint_t k;
+    for (k = kh_begin(read_pairs); k < kh_end(read_pairs); k++) {
+        if ( kh_exist(read_pairs, k) ) {
+            char *key = (char *)kh_key(read_pairs, k);
+            pair_t *val = kh_val(read_pairs, k);
+            if ( val && val->chunks ) {
+                if ( val->chunks[val->n-1].to < max ) {
+                    free(val->chunks);
+                    free(val);
+                    free(key);
+                    kh_del(qn2pair, read_pairs, k);
+                    count++;
+                }
+            } else {
+                free(key);
+                kh_del(qn2pair, read_pairs, k);
+                count++;
+            }
+        }
+    }
+    if ( max == INT_MAX )
+        kh_destroy(qn2pair, read_pairs);
+
+    return count;
+}
+
+static void remove_overlaps(bam1_t *bam_line, khash_t(qn2pair) *read_pairs, stats_t *stats, int pmin, int pmax) {
+    if ( !bam_line || !read_pairs || !stats )
+        return;
+
+    uint32_t first = (IS_READ1(bam_line) > 0 ? 1 : 0) + (IS_READ2(bam_line) > 0 ? 2 : 0) ;
+    if ( !(bam_line->core.flag & BAM_FPAIRED) ||
+         (bam_line->core.flag & BAM_FMUNMAP) ||
+         (abs(bam_line->core.isize) >= 2*bam_line->core.l_qseq) ||
+         (first != 1 && first != 2) ) {
+        if ( pmin >= 0 )
+            round_buffer_insert_read(&(stats->cov_rbuf), pmin, pmax-1);
+        return;
+    }
+
+    char *qname = bam_get_qname(bam_line);
+    if ( !qname ) {
+        fprintf(samtools_stderr, "Error retrieving qname for line starting at pos %d\n", bam_line->core.pos);
+        return;
+    }
+
+    khint_t k = kh_get(qn2pair, read_pairs, qname);
+    if ( k == kh_end(read_pairs) ) { //first chunk from this template
+        if ( pmin == -1 )
+            return;
+
+        int ret;
+        char *s = strdup(qname);
+        if ( !s ) {
+            fprintf(samtools_stderr, "Error allocating memory\n");
+            return;
+        }
+
+        k = kh_put(qn2pair, read_pairs, s, &ret);
+        if ( -1 == ret ) {
+            fprintf(samtools_stderr, "Error inserting read '%s' in pair hash table\n", qname);
+            return;
+        }
+
+        pair_t *pc = calloc(1, sizeof(pair_t));
+        if ( !pc ) {
+            fprintf(samtools_stderr, "Error allocating memory\n");
+            return;
+        }
+
+        pc->m = DEFAULT_CHUNK_NO;
+        pc->chunks = calloc(pc->m, sizeof(pos_t));
+        if ( !pc->chunks ) {
+            fprintf(samtools_stderr, "Error allocating memory\n");
+            return;
+        }
+
+        pc->chunks[0].from = pmin;
+        pc->chunks[0].to = pmax;
+        pc->n = 1;
+        pc->first = first;
+
+        kh_val(read_pairs, k) = pc;
+        stats->pair_count++;
+    } else { //template already present
+        pair_t *pc = kh_val(read_pairs, k);
+        if ( !pc ) {
+            fprintf(samtools_stderr, "Invalid hash table entry\n");
+            return;
+        }
+
+        if ( first == pc->first ) { //chunk from an existing line
+            if ( pmin == -1 )
+                return;
+
+            if ( pc->n == pc->m ) {
+                pos_t *tmp = realloc(pc->chunks, (pc->m<<1)*sizeof(pos_t));
+                if ( !tmp ) {
+                    fprintf(samtools_stderr, "Error allocating memory\n");
+                    return;
+                }
+                pc->chunks = tmp;
+                pc->m<<=1;
+            }
+
+            pc->chunks[pc->n].from = pmin;
+            pc->chunks[pc->n].to = pmax;
+            pc->n++;
+        } else { //the other line, check for overlapping
+            if ( pmin == -1 && kh_exist(read_pairs, k) ) { //job done, delete entry
+                char *key = (char *)kh_key(read_pairs, k);
+                pair_t *val = kh_val(read_pairs, k);
+                if ( val) {
+                    free(val->chunks);
+                    free(val);
+                }
+                free(key);
+                kh_del(qn2pair, read_pairs, k);
+                stats->pair_count--;
+                return;
+            }
+
+            int i;
+            for (i=0; i<pc->n; i++) {
+                if ( pmin >= pc->chunks[i].to )
+                    continue;
+
+                if ( pmax <= pc->chunks[i].from ) //no overlap
+                    break;
+
+                if ( pmin < pc->chunks[i].from ) { //overlap at the beginning
+                    round_buffer_insert_read(&(stats->cov_rbuf), pmin, pc->chunks[i].from-1);
+                    pmin = pc->chunks[i].from;
+                }
+
+                if ( pmax <= pc->chunks[i].to ) { //completely contained
+                    stats->nbases_mapped_cigar -= (pmax - pmin);
+                    return;
+                } else {                           //overlap at the end
+                    stats->nbases_mapped_cigar -= (pc->chunks[i].to - pmin);
+                    pmin = pc->chunks[i].to;
+                }
+            }
+        }
+    }
+    round_buffer_insert_read(&(stats->cov_rbuf), pmin, pmax-1);
+}
+
+void collect_stats(bam1_t *bam_line, stats_t *stats, khash_t(qn2pair) *read_pairs)
  {
      if ( stats->rg_hash )
      {
@@ -806,6 +1006,11 @@ void collect_stats(bam1_t *bam_line, stats_t *stats)
      // Update max_len observed
      if ( stats->max_len<read_len )
          stats->max_len = read_len;
+    if ( IS_READ1(bam_line) && stats->max_len_1st < read_len )
+        stats->max_len_1st = read_len;
+    if ( IS_READ2(bam_line) && stats->max_len_2nd < read_len )
+        stats->max_len_2nd = read_len;
+
      int i;
      int gc_count = 0;
  
@@ -814,6 +1019,8 @@ void collect_stats(bam1_t *bam_line, stats_t *stats)
      if ( IS_ORIGINAL(bam_line) )
      {
          stats->read_lengths[read_len]++;
+        if ( IS_READ1(bam_line) ) stats->read_lengths_1st[read_len]++;
+        if ( IS_READ2(bam_line) ) stats->read_lengths_2nd[read_len]++;
          collect_orig_read_stats(bam_line, stats, &gc_count);
      }
  
@@ -841,7 +1048,7 @@ void collect_stats(bam1_t *bam_line, stats_t *stats)
  
              if ( is_fwd*is_mfwd>0 )
                  stats->isize->inc_other(stats->isize->data, isize);
-            else if ( is_fst*pos_fst>0 )
+            else if ( is_fst*pos_fst>=0 )
              {
                  if ( is_fst*is_fwd>0 )
                      stats->isize->inc_inward(stats->isize->data, isize);
@@ -877,7 +1084,7 @@ void collect_stats(bam1_t *bam_line, stats_t *stats)
              int ncig = bam_cigar_oplen(bam_get_cigar(bam_line)[i]);
              if ( !ncig ) continue;  // curiously, this can happen: 0D
              if ( cig==BAM_CDEL ) readlen += ncig;
-            else if ( cig==BAM_CMATCH )
+            else if ( cig==BAM_CMATCH || cig==BAM_CEQUAL || cig==BAM_CDIFF )
              {
                  if ( iref < stats->reg_from ) ncig -= stats->reg_from-iref;
                  else if ( iref+ncig-1 > stats->reg_to ) ncig -= iref+ncig-1 - stats->reg_to;
@@ -898,9 +1105,10 @@ void collect_stats(bam1_t *bam_line, stats_t *stats)
          // Count the whole read
          for (i=0; i<bam_line->core.n_cigar; i++)
          {
-            if ( bam_cigar_op(bam_get_cigar(bam_line)[i])==BAM_CMATCH || bam_cigar_op(bam_get_cigar(bam_line)[i])==BAM_CINS )
+            int cig  = bam_cigar_op(bam_get_cigar(bam_line)[i]);
+            if ( cig==BAM_CMATCH || cig==BAM_CINS || cig==BAM_CEQUAL || cig==BAM_CDIFF )
                  stats->nbases_mapped_cigar += bam_cigar_oplen(bam_get_cigar(bam_line)[i]);
-            if ( bam_cigar_op(bam_get_cigar(bam_line)[i])==BAM_CDEL )
+            if ( cig==BAM_CDEL )
                  readlen += bam_cigar_oplen(bam_get_cigar(bam_line)[i]);
          }
      }
@@ -911,8 +1119,22 @@ void collect_stats(bam1_t *bam_line, stats_t *stats)
  
      if ( stats->is_sorted )
      {
-        if ( stats->tid==-1 || stats->tid!=bam_line->core.tid )
+        if ( stats->tid==-1 || stats->tid!=bam_line->core.tid ) {
              round_buffer_flush(stats, -1);
+        }
+
+        //cleanup the pair hash table to free memory
+        stats->last_read_flush++;
+        if ( stats->pair_count > DEFAULT_PAIR_MAX && stats->last_read_flush > DEFAULT_PAIR_MAX) {
+            stats->pair_count -= cleanup_overlaps(read_pairs, bam_line->core.pos);
+            stats->last_read_flush = 0;
+        }
+
+        if ( stats->last_pair_tid != bam_line->core.tid) {
+            stats->pair_count -= cleanup_overlaps(read_pairs, INT_MAX-1);
+            stats->last_pair_tid = bam_line->core.tid;
+            stats->last_read_flush = 0;
+        }
  
          // Mismatches per cycle and GC-depth graph. For simplicity, reads overlapping GCD bins
          //  are not splitted which results in up to seq_len-1 overlaps. The default bin size is
@@ -960,7 +1182,61 @@ void collect_stats(bam1_t *bam_line, stats_t *stats)
  
          // Coverage distribution graph
          round_buffer_flush(stats,bam_line->core.pos);
-        round_buffer_insert_read(&(stats->cov_rbuf),bam_line->core.pos,bam_line->core.pos+seq_len-1);
+        if ( stats->regions ) {
+            uint32_t p = bam_line->core.pos, pnew, pmin, pmax, j;
+            pmin = pmax = i = j = 0;
+            while ( j < bam_line->core.n_cigar && i < stats->nchunks ) {
+                int op = bam_cigar_op(bam_get_cigar(bam_line)[j]);
+                int oplen = bam_cigar_oplen(bam_get_cigar(bam_line)[j]);
+                switch(op) {
+                case BAM_CMATCH:
+                case BAM_CEQUAL:
+                case BAM_CDIFF:
+                    pmin = MAX(p, stats->chunks[i].from-1);
+                    pmax = MIN(p+oplen, stats->chunks[i].to);
+                    if ( pmax >= pmin ) {
+                        if ( stats->info->remove_overlaps )
+                            remove_overlaps(bam_line, read_pairs, stats, pmin, pmax);
+                        else
+                            round_buffer_insert_read(&(stats->cov_rbuf), pmin, pmax-1);
+                    }
+                    break;
+                case BAM_CDEL:
+                    break;
+                }
+                pnew = p + (bam_cigar_type(op)&2 ? oplen : 0); // consumes reference
+
+                if ( pnew >= stats->chunks[i].to ) {
+                    // go to the next chunk
+                    i++;
+                } else {
+                    // go to the next CIGAR op
+                    j++;
+                    p = pnew;
+                }
+            }
+        } else {
+            uint32_t p = bam_line->core.pos, j;
+            for (j = 0; j < bam_line->core.n_cigar; j++) {
+                int op = bam_cigar_op(bam_get_cigar(bam_line)[j]);
+                int oplen = bam_cigar_oplen(bam_get_cigar(bam_line)[j]);
+                switch(op) {
+                case BAM_CMATCH:
+                case BAM_CEQUAL:
+                case BAM_CDIFF:
+                    if ( stats->info->remove_overlaps )
+                        remove_overlaps(bam_line, read_pairs, stats, p, p+oplen);
+                    else
+                        round_buffer_insert_read(&(stats->cov_rbuf), p, p+oplen-1);
+                    break;
+                case BAM_CDEL:
+                    break;
+                }
+                p += bam_cigar_type(op)&2 ? oplen : 0; // consumes reference
+            }
+        }
+        if ( stats->info->remove_overlaps )
+           remove_overlaps(bam_line, read_pairs, stats, -1, -1); //remove the line from the hash table
      }
  }
  
@@ -995,8 +1271,9 @@ float gcd_percentile(gc_depth_t *gcd, int N, int p)
  void output_stats(FILE *to, stats_t *stats, int sparse)
  {
      // Calculate average insert size and standard deviation (from the main bulk data only)
-    int isize, ibulk=0;
-    uint64_t nisize=0, nisize_inward=0, nisize_outward=0, nisize_other=0;
+    int isize, ibulk=0, icov;
+    uint64_t nisize=0, nisize_inward=0, nisize_outward=0, nisize_other=0, cov_sum=0;
+    double bulk=0, avg_isize=0, sd_isize=0;
      for (isize=0; isize<stats->isize->nitems(stats->isize->data); isize++)
      {
          // Each pair was counted twice
@@ -1010,10 +1287,11 @@ void output_stats(FILE *to, stats_t *stats, int sparse)
          nisize += stats->isize->inward(stats->isize->data, isize) + stats->isize->outward(stats->isize->data, isize) + stats->isize->other(stats->isize->data, isize);
      }
  
-    double bulk=0, avg_isize=0, sd_isize=0;
      for (isize=0; isize<stats->isize->nitems(stats->isize->data); isize++)
      {
-        bulk += stats->isize->inward(stats->isize->data, isize) +  stats->isize->outward(stats->isize->data, isize) + stats->isize->other(stats->isize->data, isize);
+        uint64_t num = stats->isize->inward(stats->isize->data, isize) +  stats->isize->outward(stats->isize->data, isize) + stats->isize->other(stats->isize->data, isize);
+        if (num > 0) ibulk = isize + 1;
+        bulk += num;
          avg_isize += isize * (stats->isize->inward(stats->isize->data, isize) +  stats->isize->outward(stats->isize->data, isize) + stats->isize->other(stats->isize->data, isize));
  
          if ( bulk/nisize > stats->info->isize_main_bulk )
@@ -1025,10 +1303,9 @@ void output_stats(FILE *to, stats_t *stats, int sparse)
      }
      avg_isize /= nisize ? nisize : 1;
      for (isize=1; isize<ibulk; isize++)
-        sd_isize += (stats->isize->inward(stats->isize->data, isize) + stats->isize->outward(stats->isize->data, isize) +stats->isize->other(stats->isize->data, isize)) * (isize-avg_isize)*(isize-avg_isize) / nisize;
+        sd_isize += (stats->isize->inward(stats->isize->data, isize) + stats->isize->outward(stats->isize->data, isize) +stats->isize->other(stats->isize->data, isize)) * (isize-avg_isize)*(isize-avg_isize) / (nisize ? nisize : 1);
      sd_isize = sqrt(sd_isize);
  
-
      fprintf(to, "# This file was produced by samtools stats (%s+htslib-%s) and can be plotted using plot-bamstats\n", samtools_version(), hts_version());
      if( stats->split_name != NULL ){
          fprintf(to, "# This file contains statistics only for reads with tag: %s=%s\n", stats->info->split_tag, stats->split_name);
@@ -1061,6 +1338,8 @@ void output_stats(FILE *to, stats_t *stats, int sparse)
      fprintf(to, "SN\treads QC failed:\t%ld\n", (long)stats->nreads_QCfailed);
      fprintf(to, "SN\tnon-primary alignments:\t%ld\n", (long)stats->nreads_secondary);
      fprintf(to, "SN\ttotal length:\t%ld\t# ignores clipping\n", (long)stats->total_len);
+    fprintf(to, "SN\ttotal first fragment length:\t%ld\t# ignores clipping\n", (long)stats->total_len_1st);
+    fprintf(to, "SN\ttotal last fragment length:\t%ld\t# ignores clipping\n", (long)stats->total_len_2nd);
      fprintf(to, "SN\tbases mapped:\t%ld\t# ignores clipping\n", (long)stats->nbases_mapped);                 // the length of the whole read goes here, including soft-clips etc.
      fprintf(to, "SN\tbases mapped (cigar):\t%ld\t# more accurate\n", (long)stats->nbases_mapped_cigar);   // only matched and inserted bases are counted here
      fprintf(to, "SN\tbases trimmed:\t%ld\n", (long)stats->nbases_trimmed);
@@ -1069,7 +1348,11 @@ void output_stats(FILE *to, stats_t *stats, int sparse)
      fprintf(to, "SN\terror rate:\t%e\t# mismatches / bases mapped (cigar)\n", stats->nbases_mapped_cigar ? (float)stats->nmismatches/stats->nbases_mapped_cigar : 0);
      float avg_read_length = (stats->nreads_1st+stats->nreads_2nd)?stats->total_len/(stats->nreads_1st+stats->nreads_2nd):0;
      fprintf(to, "SN\taverage length:\t%.0f\n", avg_read_length);
+    fprintf(to, "SN\taverage first fragment length:\t%.0f\n", stats->nreads_1st? (float)stats->total_len_1st/stats->nreads_1st:0);
+    fprintf(to, "SN\taverage last fragment length:\t%.0f\n", stats->nreads_2nd? (float)stats->total_len_2nd/stats->nreads_2nd:0);
      fprintf(to, "SN\tmaximum length:\t%d\n", stats->max_len);
+    fprintf(to, "SN\tmaximum first fragment length:\t%d\n", stats->max_len_1st);
+    fprintf(to, "SN\tmaximum last fragment length:\t%d\n", stats->max_len_2nd);
      fprintf(to, "SN\taverage quality:\t%.1f\n", stats->total_len?stats->sum_qual/stats->total_len:0);
      fprintf(to, "SN\tinsert size average:\t%.1f\n", avg_isize);
      fprintf(to, "SN\tinsert size standard deviation:\t%.1f\n", sd_isize);
@@ -1077,13 +1360,20 @@ void output_stats(FILE *to, stats_t *stats, int sparse)
      fprintf(to, "SN\toutward oriented pairs:\t%ld\n", (long)nisize_outward);
      fprintf(to, "SN\tpairs with other orientation:\t%ld\n", (long)nisize_other);
      fprintf(to, "SN\tpairs on different chromosomes:\t%ld\n", (long)stats->nreads_anomalous/2);
+    fprintf(to, "SN\tpercentage of properly paired reads (%%):\t%.1f\n", (stats->nreads_1st+stats->nreads_2nd)? (float)(100*stats->nreads_properly_paired)/(stats->nreads_1st+stats->nreads_2nd):0);
+    if ( stats->target_count ) {
+        fprintf(to, "SN\tbases inside the target:\t%u\n", stats->target_count);
+        for (icov=stats->info->cov_threshold+1; icov<stats->ncov; icov++)
+            cov_sum += stats->cov[icov];
+        fprintf(to, "SN\tpercentage of target genome with coverage > %d (%%):\t%.2f\n", stats->info->cov_threshold, (float)(100*cov_sum)/stats->target_count);
+    }
  
      int ibase,iqual;
      if ( stats->max_len<stats->nbases ) stats->max_len++;
      if ( stats->max_qual+1<stats->nquals ) stats->max_qual++;
-    fprintf(to, "# First Fragment Qualitites. Use `grep ^FFQ | cut -f 2-` to extract this part.\n");
+    fprintf(to, "# First Fragment Qualities. Use `grep ^FFQ | cut -f 2-` to extract this part.\n");
      fprintf(to, "# Columns correspond to qualities and rows to cycles. First column is the cycle number.\n");
-    for (ibase=0; ibase<stats->max_len; ibase++)
+    for (ibase=0; ibase<stats->max_len_1st; ibase++)
      {
          fprintf(to, "FFQ\t%d",ibase+1);
          for (iqual=0; iqual<=stats->max_qual; iqual++)
@@ -1092,9 +1382,9 @@ void output_stats(FILE *to, stats_t *stats, int sparse)
          }
          fprintf(to, "\n");
      }
-    fprintf(to, "# Last Fragment Qualitites. Use `grep ^LFQ | cut -f 2-` to extract this part.\n");
+    fprintf(to, "# Last Fragment Qualities. Use `grep ^LFQ | cut -f 2-` to extract this part.\n");
      fprintf(to, "# Columns correspond to qualities and rows to cycles. First column is the cycle number.\n");
-    for (ibase=0; ibase<stats->max_len; ibase++)
+    for (ibase=0; ibase<stats->max_len_2nd; ibase++)
      {
          fprintf(to, "LFQ\t%d",ibase+1);
          for (iqual=0; iqual<=stats->max_qual; iqual++)
@@ -1137,10 +1427,51 @@ void output_stats(FILE *to, stats_t *stats, int sparse)
      fprintf(to, "# ACGT content per cycle. Use `grep ^GCC | cut -f 2-` to extract this part. The columns are: cycle; A,C,G,T base counts as a percentage of all A/C/G/T bases [%%]; and N and O counts as a percentage of all A/C/G/T bases [%%]\n");
      for (ibase=0; ibase<stats->max_len; ibase++)
      {
-        acgtno_count_t *acgtno_count = &(stats->acgtno_cycles[ibase]);
-        uint64_t acgt_sum = acgtno_count->a + acgtno_count->c + acgtno_count->g + acgtno_count->t;
+        acgtno_count_t *acgtno_count_1st = &(stats->acgtno_cycles_1st[ibase]);
+        acgtno_count_t *acgtno_count_2nd = &(stats->acgtno_cycles_2nd[ibase]);
+        uint64_t acgt_sum = acgtno_count_1st->a + acgtno_count_1st->c + acgtno_count_1st->g + acgtno_count_1st->t +
+                acgtno_count_2nd->a + acgtno_count_2nd->c + acgtno_count_2nd->g + acgtno_count_2nd->t;
          if ( ! acgt_sum ) continue;
-        fprintf(to, "GCC\t%d\t%.2f\t%.2f\t%.2f\t%.2f\t%.2f\t%.2f\n", ibase+1, 100.*acgtno_count->a/acgt_sum, 100.*acgtno_count->c/acgt_sum, 100.*acgtno_count->g/acgt_sum, 100.*acgtno_count->t/acgt_sum, 100.*acgtno_count->n/acgt_sum, 100.*acgtno_count->other/acgt_sum);
+        fprintf(to, "GCC\t%d\t%.2f\t%.2f\t%.2f\t%.2f\t%.2f\t%.2f\n", ibase+1,
+                100.*(acgtno_count_1st->a + acgtno_count_2nd->a)/acgt_sum,
+                100.*(acgtno_count_1st->c + acgtno_count_2nd->c)/acgt_sum,
+                100.*(acgtno_count_1st->g + acgtno_count_2nd->g)/acgt_sum,
+                100.*(acgtno_count_1st->t + acgtno_count_2nd->t)/acgt_sum,
+                100.*(acgtno_count_1st->n + acgtno_count_2nd->n)/acgt_sum,
+                100.*(acgtno_count_1st->other + acgtno_count_2nd->other)/acgt_sum);
+
+    }
+    fprintf(to, "# ACGT content per cycle for first fragments. Use `grep ^FBC | cut -f 2-` to extract this part. The columns are: cycle; A,C,G,T base counts as a percentage of all A/C/G/T bases [%%]; and N and O counts as a percentage of all A/C/G/T bases [%%]\n");
+    for (ibase=0; ibase<stats->max_len; ibase++)
+    {
+        acgtno_count_t *acgtno_count_1st = &(stats->acgtno_cycles_1st[ibase]);
+        uint64_t acgt_sum_1st = acgtno_count_1st->a + acgtno_count_1st->c + acgtno_count_1st->g + acgtno_count_1st->t;
+
+        if ( acgt_sum_1st )
+            fprintf(to, "FBC\t%d\t%.2f\t%.2f\t%.2f\t%.2f\t%.2f\t%.2f\n", ibase+1,
+                    100.*acgtno_count_1st->a/acgt_sum_1st,
+                    100.*acgtno_count_1st->c/acgt_sum_1st,
+                    100.*acgtno_count_1st->g/acgt_sum_1st,
+                    100.*acgtno_count_1st->t/acgt_sum_1st,
+                    100.*acgtno_count_1st->n/acgt_sum_1st,
+                    100.*acgtno_count_1st->other/acgt_sum_1st);
+
+    }
+    fprintf(to, "# ACGT content per cycle for last fragments. Use `grep ^LBC | cut -f 2-` to extract this part. The columns are: cycle; A,C,G,T base counts as a percentage of all A/C/G/T bases [%%]; and N and O counts as a percentage of all A/C/G/T bases [%%]\n");
+    for (ibase=0; ibase<stats->max_len; ibase++)
+    {
+        acgtno_count_t *acgtno_count_2nd = &(stats->acgtno_cycles_2nd[ibase]);
+        uint64_t acgt_sum_2nd = acgtno_count_2nd->a + acgtno_count_2nd->c + acgtno_count_2nd->g + acgtno_count_2nd->t;
+
+        if ( acgt_sum_2nd )
+            fprintf(to, "LBC\t%d\t%.2f\t%.2f\t%.2f\t%.2f\t%.2f\t%.2f\n", ibase+1,
+                    100.*acgtno_count_2nd->a/acgt_sum_2nd,
+                    100.*acgtno_count_2nd->c/acgt_sum_2nd,
+                    100.*acgtno_count_2nd->g/acgt_sum_2nd,
+                    100.*acgtno_count_2nd->t/acgt_sum_2nd,
+                    100.*acgtno_count_2nd->n/acgt_sum_2nd,
+                    100.*acgtno_count_2nd->other/acgt_sum_2nd);
+
      }
      fprintf(to, "# Insert sizes. Use `grep ^IS | cut -f 2-` to extract this part. The columns are: insert size, pairs total, inward oriented pairs, outward oriented pairs, other pairs\n");
      for (isize=0; isize<ibulk; isize++) {
@@ -1157,11 +1488,26 @@ void output_stats(FILE *to, stats_t *stats, int sparse)
      int ilen;
      for (ilen=0; ilen<stats->max_len; ilen++)
      {
-        if ( stats->read_lengths[ilen]>0 )
-            fprintf(to, "RL\t%d\t%ld\n", ilen, (long)stats->read_lengths[ilen]);
+        if ( stats->read_lengths[ilen+1]>0 )
+            fprintf(to, "RL\t%d\t%ld\n", ilen+1, (long)stats->read_lengths[ilen+1]);
+    }
+
+    fprintf(to, "# Read lengths - first fragments. Use `grep ^FRL | cut -f 2-` to extract this part. The columns are: read length, count\n");
+    for (ilen=0; ilen<stats->max_len_1st; ilen++)
+    {
+        if ( stats->read_lengths_1st[ilen+1]>0 )
+            fprintf(to, "FRL\t%d\t%ld\n", ilen+1, (long)stats->read_lengths_1st[ilen+1]);
+    }
+
+    fprintf(to, "# Read lengths - last fragments. Use `grep ^LRL | cut -f 2-` to extract this part. The columns are: read length, count\n");
+    for (ilen=0; ilen<stats->max_len_2nd; ilen++)
+    {
+        if ( stats->read_lengths_2nd[ilen+1]>0 )
+            fprintf(to, "LRL\t%d\t%ld\n", ilen+1, (long)stats->read_lengths_2nd[ilen+1]);
      }
  
      fprintf(to, "# Indel distribution. Use `grep ^ID | cut -f 2-` to extract this part. The columns are: length, number of insertions, number of deletions\n");
+
      for (ilen=0; ilen<stats->nindels; ilen++)
      {
          if ( stats->insertions[ilen]>0 || stats->deletions[ilen]>0 )
@@ -1180,7 +1526,6 @@ void output_stats(FILE *to, stats_t *stats, int sparse)
      fprintf(to, "# Coverage distribution. Use `grep ^COV | cut -f 2-` to extract this part.\n");
      if  ( stats->cov[0] )
          fprintf(to, "COV\t[<%d]\t%d\t%ld\n",stats->info->cov_min,stats->info->cov_min-1, (long)stats->cov[0]);
-    int icov;
      for (icov=1; icov<stats->ncov-1; icov++)
          if ( stats->cov[icov] )
              fprintf(to, "COV\t[%d-%d]\t%d\t%ld\n",stats->info->cov_min + (icov-1)*stats->info->cov_step, stats->info->cov_min + icov*stats->info->cov_step-1,stats->info->cov_min + icov*stats->info->cov_step-1, (long)stats->cov[icov]);
@@ -1227,7 +1572,7 @@ void init_regions(stats_t *stats, const char *file)
      if ( !fp ) error("%s: %s\n",file,strerror(errno));
  
      kstring_t line = { 0, 0, NULL };
-    int warned = 0;
+    int warned = 0, r, p, new_p;
      int prev_tid=-1, prev_pos=-1;
      while (line.l = 0, kgetline(&line, (kgets_func *)fgets, fp) >= 0)
      {
@@ -1274,10 +1619,31 @@ void init_regions(stats_t *stats, const char *file)
          if ( prev_pos>stats->regions[tid].pos[npos].from )
              error("The positions are not in chromosomal order (%s:%d comes after %d)\n", line.s,stats->regions[tid].pos[npos].from,prev_pos);
          stats->regions[tid].npos++;
+        if ( stats->regions[tid].npos > stats->nchunks )
+            stats->nchunks = stats->regions[tid].npos;
      }
      free(line.s);
      if ( !stats->regions ) error("Unable to map the -t sequences to the BAM sequences.\n");
      fclose(fp);
+
+    // sort region intervals and remove duplicates
+    for (r = 0; r < stats->nregions; r++) {
+        regions_t *reg = &stats->regions[r];
+        if ( reg->npos > 1 ) {
+            qsort(reg->pos, reg->npos, sizeof(pos_t), regions_lt);
+            for (new_p = 0, p = 1; p < reg->npos; p++) {
+                if ( reg->pos[new_p].to < reg->pos[p].from )
+                    reg->pos[++new_p] = reg->pos[p];
+                else if ( reg->pos[new_p].to < reg->pos[p].to )
+                    reg->pos[new_p].to = reg->pos[p].to;
+            }
+            reg->npos = ++new_p;
+        }
+        for (p = 0; p < reg->npos; p++)
+            stats->target_count += (reg->pos[p].to - reg->pos[p].from + 1);
+    }
+
+    stats->chunks = calloc(stats->nchunks, sizeof(pos_t));
  }
  
  void destroy_regions(stats_t *stats)
@@ -1289,6 +1655,7 @@ void destroy_regions(stats_t *stats)
          free(stats->regions[i].pos);
      }
      if ( stats->regions ) free(stats->regions);
+    if ( stats->chunks ) free(stats->chunks);
  }
  
  void reset_regions(stats_t *stats)
@@ -1313,14 +1680,70 @@ int is_in_regions(bam1_t *bam_line, stats_t *stats)
      int i = reg->cpos;
      while ( i<reg->npos && reg->pos[i].to<=bam_line->core.pos ) i++;
      if ( i>=reg->npos ) { reg->cpos = reg->npos; return 0; }
-    if ( bam_line->core.pos + bam_line->core.l_qseq + 1 < reg->pos[i].from ) return 0;
+    int64_t endpos = bam_endpos(bam_line);
+    if ( endpos < reg->pos[i].from ) return 0;
+
+    //found a read overlapping a region
      reg->cpos = i;
      stats->reg_from = reg->pos[i].from;
      stats->reg_to   = reg->pos[i].to;
  
+    //now find all the overlapping chunks
+    stats->nchunks = 0;
+    while (i < reg->npos) {
+        if (bam_line->core.pos < reg->pos[i].to && endpos >= reg->pos[i].from) {
+            stats->chunks[stats->nchunks].from = MAX(bam_line->core.pos+1, reg->pos[i].from);
+            stats->chunks[stats->nchunks].to = MIN(endpos, reg->pos[i].to);
+            stats->nchunks++;
+        }
+        i++;
+    }
+
      return 1;
  }
  
+int replicate_regions(stats_t *stats, hts_itr_multi_t *iter) {
+    if ( !stats || !iter)
+        return 1;
+
+    int i, j, tid;
+    stats->nregions = iter->n_reg;
+    stats->regions = calloc(stats->nregions, sizeof(regions_t));
+    stats->chunks = calloc(stats->nchunks, sizeof(pos_t));
+    if ( !stats->regions || !stats->chunks )
+        return 1;
+
+    for (i = 0; i < iter->n_reg; i++) {
+        tid = iter->reg_list[i].tid;
+        if ( tid < 0 )
+            continue;
+
+        if ( tid >= stats->nregions ) {
+            regions_t *tmp = realloc(stats->regions, (tid+10) * sizeof(regions_t));
+            if ( !tmp )
+                return 1;
+            stats->regions = tmp;
+            memset(stats->regions + stats->nregions, 0,
+                   (tid+10-stats->nregions) * sizeof(regions_t));
+            stats->nregions = tid+10;
+        }
+
+        stats->regions[tid].mpos = stats->regions[tid].npos = iter->reg_list[i].count;
+        stats->regions[tid].pos = calloc(stats->regions[tid].mpos, sizeof(pos_t));
+        if ( !stats->regions[tid].pos )
+            return 1;
+
+        for (j = 0; j < stats->regions[tid].npos; j++) {
+            stats->regions[tid].pos[j].from = iter->reg_list[i].intervals[j].beg+1;
+            stats->regions[tid].pos[j].to = iter->reg_list[i].intervals[j].end;
+
+            stats->target_count += (stats->regions[tid].pos[j].to - stats->regions[tid].pos[j].from + 1);
+        }
+    }
+
+    return 0;
+}
+
  void init_group_id(stats_t *stats, const char *id)
  {
  #if 0
@@ -1377,6 +1800,8 @@ static void error(const char *format, ...)
          fprintf(samtools_stdout, "    -S, --split <tag>                   Also write statistics to separate files split by tagged field.\n");
          fprintf(samtools_stdout, "    -t, --target-regions <file>         Do stats in these regions only. Tab-delimited file chr,from,to, 1-based, inclusive.\n");
          fprintf(samtools_stdout, "    -x, --sparse                        Suppress outputting IS rows where there are no insertions.\n");
+        fprintf(samtools_stdout, "    -p, --remove-overlaps               Remove overlaps of paired-end reads from coverage and base count computations.\n");
+        fprintf(samtools_stdout, "    -g, --cov-threshold                 Only bases with coverage above this value will be included in the target percentage computation.\n");
          sam_global_opt_help(samtools_stdout, "-.--.@");
          fprintf(samtools_stdout, "\n");
      }
@@ -1406,8 +1831,11 @@ void cleanup_stats(stats_t* stats)
      free(stats->gcd);
      free(stats->rseq_buf);
      free(stats->mpc_buf);
-    free(stats->acgtno_cycles);
+    free(stats->acgtno_cycles_1st);
+    free(stats->acgtno_cycles_2nd);
      free(stats->read_lengths);
+    free(stats->read_lengths_1st);
+    free(stats->read_lengths_2nd);
      free(stats->insertions);
      free(stats->deletions);
      free(stats->ins_cycles_1st);
@@ -1456,8 +1884,8 @@ void destroy_split_stats(khash_t(c2stats) *split_hash)
      stats_t *curr_stats = NULL;
      for(i = kh_begin(split_hash); i != kh_end(split_hash); ++i){
          if(!kh_exist(split_hash, i)) continue;
-            curr_stats = kh_value(split_hash, i);
-            cleanup_stats(curr_stats);
+        curr_stats = kh_value(split_hash, i);
+        cleanup_stats(curr_stats);
      }
      kh_destroy(c2stats, split_hash);
  }
@@ -1474,6 +1902,8 @@ stats_info_t* stats_info_init(int argc, char *argv[])
      info->filter_readlen = -1;
      info->argc = argc;
      info->argv = argv;
+    info->remove_overlaps = 0;
+    info->cov_threshold = 0;
  
      return info;
  }
@@ -1501,14 +1931,17 @@ stats_t* stats_init()
      stats->ngc    = 200;
      stats->nquals = 256;
      stats->nbases = 300;
-    stats->max_len   = 30;
-    stats->max_qual  = 40;
      stats->rseq_pos     = -1;
      stats->tid = stats->gcd_pos = -1;
      stats->igcd = 0;
      stats->is_sorted = 1;
      stats->nindels = stats->nbases;
      stats->split_name = NULL;
+    stats->nchunks = 0;
+    stats->pair_count = 0;
+    stats->last_pair_tid = -2;
+    stats->last_read_flush = 0;
+    stats->target_count = 0;
  
      return stats;
  }
@@ -1542,8 +1975,11 @@ static void init_stat_structs(stats_t* stats, stats_info_t* info, const char* gr
      stats->isize          = init_isize_t(info->nisize ?info->nisize+1 :0);
      stats->gcd            = calloc(stats->ngcd,sizeof(gc_depth_t));
      stats->mpc_buf        = info->fai ? calloc(stats->nquals*stats->nbases,sizeof(uint64_t)) : NULL;
-    stats->acgtno_cycles  = calloc(stats->nbases,sizeof(acgtno_count_t));
+    stats->acgtno_cycles_1st  = calloc(stats->nbases,sizeof(acgtno_count_t));
+    stats->acgtno_cycles_2nd  = calloc(stats->nbases,sizeof(acgtno_count_t));
      stats->read_lengths   = calloc(stats->nbases,sizeof(uint64_t));
+    stats->read_lengths_1st   = calloc(stats->nbases,sizeof(uint64_t));
+    stats->read_lengths_2nd   = calloc(stats->nbases,sizeof(uint64_t));
      stats->insertions     = calloc(stats->nbases,sizeof(uint64_t));
      stats->deletions      = calloc(stats->nbases,sizeof(uint64_t));
      stats->ins_cycles_1st = calloc(stats->nbases+1,sizeof(uint64_t));
@@ -1616,16 +2052,18 @@ int main_stats(int argc, char *argv[])
          {"sparse", no_argument, NULL, 'x'},
          {"split", required_argument, NULL, 'S'},
          {"split-prefix", required_argument, NULL, 'P'},
+        {"remove-overlaps", no_argument, NULL, 'p'},
+        {"cov-threshold", required_argument, NULL, 'g'},
          {NULL, 0, NULL, 0}
      };
      int opt;
  
-    while ( (opt=getopt_long(argc,argv,"?hdsxr:c:l:i:t:m:q:f:F:I:1:S:P:@:",loptions,NULL))>0 )
+    while ( (opt=getopt_long(argc,argv,"?hdsxpr:c:l:i:t:m:q:f:F:g:I:1:S:P:@:",loptions,NULL))>0 )
      {
          switch (opt)
          {
              case 'f': info->flag_require = bam_str2flag(optarg); break;
-            case 'F': info->flag_filter = bam_str2flag(optarg); break;
+            case 'F': info->flag_filter |= bam_str2flag(optarg); break;
              case 'd': info->flag_filter |= BAM_FDUP; break;
              case 's': break;
              case 'r': info->fai = fai_load(optarg);
@@ -1645,6 +2083,11 @@ int main_stats(int argc, char *argv[])
              case 'x': sparse = 1; break;
              case 'S': info->split_tag = optarg; break;
              case 'P': info->split_prefix = optarg; break;
+            case 'p': info->remove_overlaps = 1; break;
+            case 'g': info->cov_threshold = atoi(optarg);
+                      if ( info->cov_threshold < 0 || info->cov_threshold == INT_MAX )
+                          error("Unsupported value for coverage threshold %d\n", info->cov_threshold);
+                      break;
              case '?':
              case 'h': error(NULL);
              default:
@@ -1663,7 +2106,10 @@ int main_stats(int argc, char *argv[])
          bam_fname = "-";
      }
  
-    if (init_stat_info_fname(info, bam_fname, &ga.in)) return 1;
+    if (init_stat_info_fname(info, bam_fname, &ga.in)) {
+        free(info);
+        return 1;
+    }
      if (ga.nthreads > 0)
          hts_set_threads(info->sam, ga.nthreads);
  
@@ -1674,41 +2120,78 @@ int main_stats(int argc, char *argv[])
      // .. hash
      khash_t(c2stats)* split_hash = kh_init(c2stats);
  
+    khash_t(qn2pair)* read_pairs = kh_init(qn2pair);
+
      // Collect statistics
      bam1_t *bam_line = bam_init1();
      if ( optind<argc )
      {
-        // Collect stats in selected regions only
-        hts_idx_t *bam_idx = sam_index_load(info->sam,bam_fname);
-        if (bam_idx == 0)
-            error("Random alignment retrieval only works for indexed BAM files.\n");
-
-        int i;
-        for (i=optind; i<argc; i++)
-        {
-            hts_itr_t* iter = bam_itr_querys(bam_idx, info->sam_header, argv[i]);
-            while (sam_itr_next(info->sam, iter, bam_line) >= 0) {
-                if (info->split_tag) {
-                    curr_stats = get_curr_split_stats(bam_line, split_hash, info, targets);
-                    collect_stats(bam_line, curr_stats);
+        int filter = 1;
+        // Prepare the region hash table for the multi-region iterator
+        void *region_hash = bed_hash_regions(NULL, argv, optind, argc, &filter);
+        if (region_hash) {
+
+            // Collect stats in selected regions only
+            hts_idx_t *bam_idx = sam_index_load(info->sam,bam_fname);
+            if (bam_idx) {
+
+                int regcount = 0;
+                hts_reglist_t *reglist = bed_reglist(region_hash, ALL, &regcount);
+                if (reglist) {
+
+                    hts_itr_multi_t *iter = sam_itr_regions(bam_idx, info->sam_header, reglist, regcount);
+                    if (iter) {
+
+                        if (!targets) {
+                            all_stats->nchunks = argc-optind;
+                            if ( replicate_regions(all_stats, iter) )
+                                fprintf(samtools_stderr, "Replications of the regions failed.");
+                        }
+
+                        if ( all_stats->nregions && all_stats->regions ) {
+                            while (sam_itr_multi_next(info->sam, iter, bam_line) >= 0) {
+                               if (info->split_tag) {
+                                   curr_stats = get_curr_split_stats(bam_line, split_hash, info, targets);
+                                   collect_stats(bam_line, curr_stats, read_pairs);
+                               }
+                               collect_stats(bam_line, all_stats, read_pairs);
+                            }
+                        }
+
+                        hts_itr_multi_destroy(iter);
+                    } else {
+                       fprintf(samtools_stderr, "Creation of the region iterator failed.");
+                       hts_reglist_free(reglist, regcount);
+                    }
+                } else {
+                    fprintf(samtools_stderr, "Creation of the region list failed.");
                  }
-                collect_stats(bam_line, all_stats);
+
+                hts_idx_destroy(bam_idx);
+            } else {
+                fprintf(samtools_stderr, "Random alignment retrieval only works for indexed BAM files.\n");
              }
-            reset_regions(all_stats);
-            bam_itr_destroy(iter);
+
+            bed_destroy(region_hash);
+        } else {
+            fprintf(samtools_stderr, "Creation of the region hash table failed.\n");
          }
-        hts_idx_destroy(bam_idx);
      }
      else
      {
+        if ( info->cov_threshold > 0 && !targets ) {
+            fprintf(samtools_stderr, "Coverage percentage calcuation requires a list of target regions\n");
+            goto cleanup;
+        }
+
          // Stream through the entire BAM ignoring off-target regions if -t is given
          int ret;
          while ((ret = sam_read1(info->sam, info->sam_header, bam_line)) >= 0) {
              if (info->split_tag) {
                  curr_stats = get_curr_split_stats(bam_line, split_hash, info, targets);
-                collect_stats(bam_line, curr_stats);
+                collect_stats(bam_line, curr_stats, read_pairs);
              }
-            collect_stats(bam_line, all_stats);
+            collect_stats(bam_line, all_stats, read_pairs);
          }
  
          if (ret < -1) {
@@ -1722,6 +2205,7 @@ int main_stats(int argc, char *argv[])
      if (info->split_tag)
          output_split_stats(split_hash, bam_fname, sparse);
  
+cleanup:
      bam_destroy1(bam_line);
      bam_hdr_destroy(info->sam_header);
      sam_global_args_free(&ga);
@@ -1729,6 +2213,7 @@ int main_stats(int argc, char *argv[])
      cleanup_stats(all_stats);
      cleanup_stats_info(info);
      destroy_split_stats(split_hash);
+    cleanup_overlaps(read_pairs, INT_MAX);
  
      return 0;
  }
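
Note on the init_regions hunk above: the "sort region intervals and remove duplicates" step can be tried out in isolation. The sketch below is only an illustration under stated assumptions, not the samtools implementation. interval_t is a hypothetical stand-in for pos_t, interval_cmp is a guess at what a comparator such as regions_lt might do (its body is not part of this diff), and covered_bases mirrors the way target_count is accumulated from 1-based, inclusive intervals.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for pos_t: a 1-based, inclusive interval. */
typedef struct { int from, to; } interval_t;

/* Assumed ordering: by start, then by end. The real regions_lt is not shown in the diff. */
static int interval_cmp(const void *a, const void *b)
{
    const interval_t *x = a, *y = b;
    if (x->from != y->from) return x->from < y->from ? -1 : 1;
    if (x->to != y->to) return x->to < y->to ? -1 : 1;
    return 0;
}

/* Sort the intervals, then fold overlapping ones in place. Returns the new count. */
static int merge_intervals(interval_t *iv, int n)
{
    int kept, i;
    assert(iv != NULL || n == 0);
    if (n < 2) return n;
    qsort(iv, (size_t)n, sizeof(*iv), interval_cmp);
    for (kept = 0, i = 1; i < n; i++) {
        if (iv[kept].to < iv[i].from)
            iv[++kept] = iv[i];        /* disjoint: keep as a new interval */
        else if (iv[kept].to < iv[i].to)
            iv[kept].to = iv[i].to;    /* overlapping: extend the kept interval */
        /* otherwise iv[i] is fully contained in iv[kept] and is dropped */
    }
    return kept + 1;
}

/* Sum of interval lengths, counting both endpoints (as target_count does). */
static long covered_bases(const interval_t *iv, int n)
{
    long total = 0;
    int i;
    for (i = 0; i < n; i++)
        total += iv[i].to - iv[i].from + 1;
    return total;
}
```

As in the hunk, intervals that merely touch end to start without sharing a base (for example 1-5 and 6-7) are kept separate. This does not change the base count, since each base is still counted once.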
diff --git a/samtools/test/split/test_filter_header_rg.c b/samtools/test/split/test_filter_header_rg.c

index 3792ab5465c38027e7a06136bf913be18cac8a55..cccf0e933319d5e9b7324ba526d45ff4d4b05a5d 100644
--- a/samtools/test/split/test_filter_header_rg.c
+++ b/samtools/test/split/test_filter_header_rg.c
@@ -40,11 +40,10 @@ void setup_test_1(bam_hdr_t** hdr_in)
  }
  
  bool check_test_1(const bam_hdr_t* hdr) {
-    char test1_res[200];
-    snprintf(test1_res, 199,
+    const char *test1_res =
      "@HD\tVN:1.4\n"
      "@SQ\tSN:blah\n"
-    "@PG\tID:samtools\tPN:samtools\tVN:%s\tCL:test_filter_header_rg foo bar baz\n", samtools_version());
+    "@PG\tID:samtools\tPN:samtools\tVN:x.y.test\tCL:test_filter_header_rg foo bar baz\n";
  
      if (strcmp(hdr->text, test1_res)) {
          return false;
@@ -64,12 +63,11 @@ void setup_test_2(bam_hdr_t** hdr_in)
  }
  
  bool check_test_2(const bam_hdr_t* hdr) {
-    char test2_res[200];
-    snprintf(test2_res, 199,
+    const char *test2_res =
      "@HD\tVN:1.4\n"
      "@SQ\tSN:blah\n"
      "@RG\tID:fish\n"
-    "@PG\tID:samtools\tPN:samtools\tVN:%s\tCL:test_filter_header_rg foo bar baz\n", samtools_version());
+    "@PG\tID:samtools\tPN:samtools\tVN:x.y.test\tCL:test_filter_header_rg foo bar baz\n";
  
      if (strcmp(hdr->text, test2_res)) {
          return false;
@@ -114,7 +112,7 @@ int main(int argc, char *argv[])
      bam_hdr_t* hdr1;
      const char* id_to_keep_1 = "1#2.3";
      setup_test_1(&hdr1);
-    if (verbose > 0) {
+    if (verbose > 1) {
          printf("hdr1\n");
          dump_hdr(hdr1);
      }
@@ -126,7 +124,7 @@ int main(int argc, char *argv[])
      fclose(stderr);
  
      if (verbose) printf("END RUN test 1\n");
-    if (verbose > 0) {
+    if (verbose > 1) {
          printf("hdr1\n");
          dump_hdr(hdr1);
      }
@@ -153,7 +151,7 @@ int main(int argc, char *argv[])
      bam_hdr_t* hdr2;
      const char* id_to_keep_2 = "fish";
      setup_test_2(&hdr2);
-    if (verbose > 0) {
+    if (verbose > 1) {
          printf("hdr2\n");
          dump_hdr(hdr2);
      }
@@ -165,7 +163,7 @@ int main(int argc, char *argv[])
      fclose(stderr);
  
      if (verbose) printf("END RUN test 2\n");
-    if (verbose > 0) {
+    if (verbose > 1) {
          printf("hdr2\n");
          dump_hdr(hdr2);
      }
diff --git a/samtools/test/split/test_filter_header_rg.c.pysam.c b/samtools/test/split/test_filter_header_rg.c.pysam.c

index 54227fce4ae1c458938e725d73a1fea5650cbe1f..18e3adf0e198800416295192d7a1ee7e21ac24b0 100644
--- a/samtools/test/split/test_filter_header_rg.c.pysam.c
+++ b/samtools/test/split/test_filter_header_rg.c.pysam.c
@@ -42,11 +42,10 @@ void setup_test_1(bam_hdr_t** hdr_in)
  }
  
  bool check_test_1(const bam_hdr_t* hdr) {
-    char test1_res[200];
-    snprintf(test1_res, 199,
+    const char *test1_res =
      "@HD\tVN:1.4\n"
      "@SQ\tSN:blah\n"
-    "@PG\tID:samtools\tPN:samtools\tVN:%s\tCL:test_filter_header_rg foo bar baz\n", samtools_version());
+    "@PG\tID:samtools\tPN:samtools\tVN:x.y.test\tCL:test_filter_header_rg foo bar baz\n";
  
      if (strcmp(hdr->text, test1_res)) {
          return false;
@@ -66,12 +65,11 @@ void setup_test_2(bam_hdr_t** hdr_in)
  }
  
  bool check_test_2(const bam_hdr_t* hdr) {
-    char test2_res[200];
-    snprintf(test2_res, 199,
+    const char *test2_res =
      "@HD\tVN:1.4\n"
      "@SQ\tSN:blah\n"
      "@RG\tID:fish\n"
-    "@PG\tID:samtools\tPN:samtools\tVN:%s\tCL:test_filter_header_rg foo bar baz\n", samtools_version());
+    "@PG\tID:samtools\tPN:samtools\tVN:x.y.test\tCL:test_filter_header_rg foo bar baz\n";
  
      if (strcmp(hdr->text, test2_res)) {
          return false;
@@ -116,7 +114,7 @@ int samtools_test_filter_header_rg_main(int argc, char *argv[])
      bam_hdr_t* hdr1;
      const char* id_to_keep_1 = "1#2.3";
      setup_test_1(&hdr1);
-    if (verbose > 0) {
+    if (verbose > 1) {
          fprintf(samtools_stdout, "hdr1\n");
          dump_hdr(hdr1);
      }
@@ -128,7 +126,7 @@ int samtools_test_filter_header_rg_main(int argc, char *argv[])
      fclose(samtools_stderr);
  
      if (verbose) fprintf(samtools_stdout, "END RUN test 1\n");
-    if (verbose > 0) {
+    if (verbose > 1) {
          fprintf(samtools_stdout, "hdr1\n");
          dump_hdr(hdr1);
      }
@@ -155,7 +153,7 @@ int samtools_test_filter_header_rg_main(int argc, char *argv[])
      bam_hdr_t* hdr2;
      const char* id_to_keep_2 = "fish";
      setup_test_2(&hdr2);
-    if (verbose > 0) {
+    if (verbose > 1) {
          fprintf(samtools_stdout, "hdr2\n");
          dump_hdr(hdr2);
      }
@@ -167,7 +165,7 @@ int samtools_test_filter_header_rg_main(int argc, char *argv[])
      fclose(samtools_stderr);
  
      if (verbose) fprintf(samtools_stdout, "END RUN test 2\n");
-    if (verbose > 0) {
+    if (verbose > 1) {
          fprintf(samtools_stdout, "hdr2\n");
          dump_hdr(hdr2);
      }
diff --git a/samtools/test/test.c b/samtools/test/test.c

index 0b4d585111e76b8d4fd928ace98e64d0a809359c..fb0b54927184a614b3d0eb656f531efa89ac9bcf 100644
--- a/samtools/test/test.c
+++ b/samtools/test/test.c
@@ -53,3 +53,9 @@ void dump_hdr(const bam_hdr_t* hdr)
      }
      printf("text: \"%s\"\n", hdr->text);
  }
+
+// For tests, just return a constant that can be embedded in expected output.
+const char *samtools_version(void)
+{
+    return "x.y.test";
+}
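
Note on the samtools_version() stub above: returning a constant lets the test files compare against a fixed string literal instead of formatting the expected header at run time. The sketch below illustrates the idea under stated assumptions. format_pg_line is a hypothetical helper invented for this example; it is not a samtools function, and the real code that writes the @PG line is not part of this diff.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Test-only stub, mirroring the one added to test.c: a fixed version string. */
const char *samtools_version(void)
{
    return "x.y.test";
}

/* Hypothetical helper: formats a @PG header line that embeds the version. */
static int format_pg_line(char *buf, size_t size, const char *cmdline)
{
    assert(buf != NULL && cmdline != NULL);
    return snprintf(buf, size,
                    "@PG\tID:samtools\tPN:samtools\tVN:%s\tCL:%s\n",
                    samtools_version(), cmdline);
}
```

With the stub linked in, the expected output can be written as a plain constant containing VN:x.y.test, which is the approach check_test_1 and check_test_2 switch to in this commit.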
diff --git a/samtools/test/test.c.pysam.c b/samtools/test/test.c.pysam.c

index df87fbbd8fa9a0d196a19c2d5b9a9e6bcecbd988..cd41889c4b0a5863fedcfe2d6cebc2e96a267db3 100644
--- a/samtools/test/test.c.pysam.c
+++ b/samtools/test/test.c.pysam.c
@@ -55,3 +55,9 @@ void dump_hdr(const bam_hdr_t* hdr)
      }
      fprintf(samtools_stdout, "text: \"%s\"\n", hdr->text);
  }
+
+// For tests, just return a constant that can be embedded in expected output.
+const char *samtools_version(void)
+{
+    return "x.y.test";
+}
diff --git a/samtools/version.h b/samtools/version.h

index 9dcb73f1892fe4ce21c65a916910c5a4ee6a45f6..9af0b738784611f3b6b31189fbe582d1d27ed0e7 100644
--- a/samtools/version.h
+++ b/samtools/version.h
@@ -1 +1 @@
-#define SAMTOOLS_VERSION "1.7"
+#define SAMTOOLS_VERSION "1.9"
diff --git a/save/example.py b/save/example.py
deleted file mode 100644
index 473158e..0000000
--- a/save/example.py
+++ /dev/null
@@ -1,79 +0,0 @@
-## This script contains some example code 
-## illustrating ways to to use the pysam 
-## interface to samtools.
-##
-## The unit tests in the script pysam_test.py
-## contain more examples.
-##
-
-import pysam
-
-samfile = pysam.Samfile( "ex1.bam", "rb" )
-
-print "###################"
-# check different ways to iterate
-print len(list(samfile.fetch()))
-print len(list(samfile.fetch( "chr1", 10, 200 )))
-print len(list(samfile.fetch( region="chr1:10-200" )))
-print len(list(samfile.fetch( "chr1" )))
-print len(list(samfile.fetch( region="chr1")))
-print len(list(samfile.fetch( "chr2" )))
-print len(list(samfile.fetch( region="chr2")))
-print len(list(samfile.fetch()))
-print len(list(samfile.fetch( "chr1" )))
-print len(list(samfile.fetch( region="chr1")))
-print len(list(samfile.fetch()))
-
-print len(list(samfile.pileup( "chr1", 10, 200 )))
-print len(list(samfile.pileup( region="chr1:10-200" )))
-print len(list(samfile.pileup( "chr1" )))
-print len(list(samfile.pileup( region="chr1")))
-print len(list(samfile.pileup( "chr2" )))
-print len(list(samfile.pileup( region="chr2")))
-print len(list(samfile.pileup()))
-print len(list(samfile.pileup()))
-
-print "########### fetch with callback ################"
-def my_fetch_callback( alignment ): print str(alignment)
-samfile.fetch( region="chr1:10-200", callback=my_fetch_callback )
-
-print "########## pileup with callback ################"
-def my_pileup_callback( column ): print str(column)
-samfile.pileup( region="chr1:10-200", callback=my_pileup_callback )
-
-
-print "########### Using a callback object ###### "
-
-class Counter:
-    mCounts = 0
-    def __call__(self, alignment):
-        self.mCounts += 1
-
-c = Counter()
-samfile.fetch( region = "chr1:10-200", callback = c )
-print "counts=", c.mCounts
-
-print "########### Calling a samtools command line function ############"
-
-for p in pysam.mpileup( "-c", "ex1.bam" ):
-    print str(p)
-
-print pysam.mpileup.getMessages()
-
-print "########### Investigating headers #######################"
-
-# playing arount with headers
-samfile = pysam.Samfile( "ex3.sam", "r" )
-print samfile.references
-print samfile.lengths
-print samfile.text
-print samfile.header
-header = samfile.header
-samfile.close()
-
-header["HD"]["SO"] = "unsorted"
-outfile = pysam.Samfile( "out.sam", "wh", 
-                         header = header )
-
-outfile.close()
-
diff --git a/save/pysam_bench.py b/save/pysam_bench.py
deleted file mode 100644
index 03503ec..0000000
--- a/save/pysam_bench.py
+++ /dev/null
@@ -1,63 +0,0 @@
-'''benchmark pysam BAM/SAM access with the samtools commandline tools.
-
-samtools functions are called via the pysam interface to avoid the over-head
-of starting additional processes.
-'''
-
-import pysam
-import timeit 
-
-iterations = 10
-
-def runBenchmark( test, 
-                  pysam_way,
-                  samtools_way = None):
-    print test
-    print timeit.repeat( pysam_way, number = iterations, setup="from __main__ import pysam" )
-    if samtools_way:
-        print timeit.repeat( samtools_way, number = iterations, setup="from __main__ import pysam" )
-
-runBenchmark( "Samfile.fetch",
-'''
-f = pysam.Samfile( "ex1.bam", "rb" )
-results = list(f.fetch())
-''',
-'''
-f = pysam.view( "ex1.bam" )
-'''
-)
-
-runBenchmark( "Samfile.pileup",
-'''
-f = pysam.Samfile( "ex1.bam", "rb" )
-results = list(f.pileup())
-''',
-'''
-f = pysam.pileup( "ex1.bam" )
-''')
-
-runBenchmark( "Samfile.pileup with coverage retrieval",
-'''
-f = pysam.Samfile( "ex1.bam", "rb" )
-results = [ x.n for x in f.pileup() ]
-''' )
-
-runBenchmark( "Samfile.pileup with full retrieval",
-'''
-f = pysam.Samfile( "ex1.bam", "rb" )
-results = [ x.pileups for x in f.pileup() ]
-''' )
-
-runBenchmark( "Samfile.pileup - many references",
-'''
-f = pysam.Samfile( "manyrefs.bam", "rb" )
-results = [ x.pileups for x in f.pileup() ]
-''',
-'''
-f = pysam.pileup( "manyrefs.bam" )
-'''
- )
-
-
-
-
diff --git a/save/pysam_test2.6.py b/save/pysam_test2.6.py
deleted file mode 100755
index eb4848a..0000000
--- a/save/pysam_test2.6.py
+++ /dev/null
@@ -1,1607 +0,0 @@
-#!/usr/bin/env python
-'''unit testing code for pysam.
-
-Execute in the :file:`tests` directory as it requires the Makefile
-and data files located there.
-'''
-
-import pysam
-import unittest
-import os, re, sys
-import itertools
-import collections
-import subprocess
-import shutil
-import logging
-
-
-if sys.version_info[0] < 3:
-    from itertools import izip as zip_longest
-else:
-    from itertools import zip_longest
-
-
-SAMTOOLS="samtools"
-WORKDIR="pysam_test_work"
-
-def checkBinaryEqual( filename1, filename2 ):
-    '''return true if the two files are binary equal.'''
-    if os.path.getsize( filename1 ) !=  os.path.getsize( filename2 ):
-        return False
-
-    infile1 = open(filename1, "rb")
-    infile2 = open(filename2, "rb")
-
-    def chariter( infile ):
-        while 1:
-            c = infile.read(1)
-            if c == b"": break
-            yield c
-
-    found = False
-    for c1,c2 in zip_longest( chariter( infile1), chariter( infile2) ):
-        if c1 != c2: break
-    else:
-        found = True
-
-    infile1.close()
-    infile2.close()
-    return found
-
-def runSamtools( cmd ):
-    '''run a samtools command'''
-
-    try:
-        retcode = subprocess.call(cmd, shell=True)
-        if retcode < 0:
-            print >>sys.stderr, "Child was terminated by signal", -retcode
-    except OSError as e:
-        print >>sys.stderr, "Execution failed:", e
-
-def getSamtoolsVersion():
-    '''return samtools version'''
-
-    pipe = subprocess.Popen(SAMTOOLS, shell=True, stderr=subprocess.PIPE).stderr
-    lines = b"".join(pipe.readlines())
-    if sys.version_info[0] >= 3:
-        lines = lines.decode('ascii')
-    return re.search( "Version:\s+(\S+)", lines).groups()[0]
-
-class BinaryTest(unittest.TestCase):
-    '''test samtools command line commands and compare
-    against pysam commands.
-
-    Tests fail, if the output is not binary identical.
-    '''
-
-    first_time = True
-
-    # a list of commands to test
-    commands = \
-        { 
-          "view" :
-              (
-                ("ex1.view", "view ex1.bam > ex1.view"),
-                ("pysam_ex1.view", (pysam.view, "ex1.bam" ) ),
-                ),
-          "view2" :
-              (
-                ("ex1.view", "view -bT ex1.fa -o ex1.view2 ex1.sam"),
-                # note that -o ex1.view2 throws exception.
-                ("pysam_ex1.view", (pysam.view, "-bT ex1.fa -oex1.view2 ex1.sam" ) ),
-                ),
-          "sort" :
-              (
-                ( "ex1.sort.bam", "sort ex1.bam ex1.sort" ),
-                ( "pysam_ex1.sort.bam", (pysam.sort, "ex1.bam pysam_ex1.sort" ) ),
-                ),
-          "mpileup" :
-              (
-                ("ex1.pileup", "mpileup ex1.bam > ex1.pileup" ),
-                ("pysam_ex1.mpileup", (pysam.mpileup, "ex1.bam" ) ),
-                ),
-          "depth" :
-              (
-                ("ex1.depth", "depth ex1.bam > ex1.depth" ),
-                ("pysam_ex1.depth", (pysam.depth, "ex1.bam" ) ),
-                ),
-          "faidx" : 
-              ( 
-                ("ex1.fa.fai", "faidx ex1.fa"), 
-                ("pysam_ex1.fa.fai", (pysam.faidx, "ex1.fa") ),
-                ),
-          "index":
-              (
-                ("ex1.bam.bai", "index ex1.bam" ),
-                ("pysam_ex1.bam.bai", (pysam.index, "pysam_ex1.bam" ) ),
-                ),
-          "idxstats" :
-              ( 
-                ("ex1.idxstats", "idxstats ex1.bam > ex1.idxstats" ),
-                ("pysam_ex1.idxstats", (pysam.idxstats, "pysam_ex1.bam" ) ),
-                ),
-          "fixmate" :
-              (
-                ("ex1.fixmate", "fixmate ex1.bam ex1.fixmate" ),
-                ("pysam_ex1.fixmate", (pysam.fixmate, "pysam_ex1.bam pysam_ex1.fixmate") ),
-                ),
-          "flagstat" :
-              (
-                ("ex1.flagstat", "flagstat ex1.bam > ex1.flagstat" ),
-                ("pysam_ex1.flagstat", (pysam.flagstat, "pysam_ex1.bam") ),
-                ),
-          "calmd" :
-              (
-                ("ex1.calmd", "calmd ex1.bam ex1.fa > ex1.calmd" ),
-                ("pysam_ex1.calmd", (pysam.calmd, "pysam_ex1.bam ex1.fa") ),
-                ),
-          "merge" :
-              (
-                ("ex1.merge", "merge -f ex1.merge ex1.bam ex1.bam" ),
-                # -f option does not work - following command will cause the subsequent
-                # command to fail
-                ("pysam_ex1.merge", (pysam.merge, "pysam_ex1.merge pysam_ex1.bam pysam_ex1.bam") ),
-                ),
-          "rmdup" :
-              (
-                ("ex1.rmdup", "rmdup ex1.bam ex1.rmdup" ),
-                ("pysam_ex1.rmdup", (pysam.rmdup, "pysam_ex1.bam pysam_ex1.rmdup" )),
-                ),
-          "reheader" :
-              (
-                ( "ex1.reheader", "reheader ex1.bam ex1.bam > ex1.reheader"),
-                ( "pysam_ex1.reheader", (pysam.reheader, "ex1.bam ex1.bam" ) ),
-                ),
-          "cat":
-              (
-                ( "ex1.cat", "cat ex1.bam ex1.bam > ex1.cat"),
-                ( "pysam_ex1.cat", (pysam.cat, "ex1.bam ex1.bam" ) ),
-                ),
-          "targetcut":
-              (
-                ("ex1.targetcut", "targetcut ex1.bam > ex1.targetcut" ),
-                ("pysam_ex1.targetcut", (pysam.targetcut, "pysam_ex1.bam") ),
-                ),
-          "phase":
-              (
-                ("ex1.phase", "phase ex1.bam > ex1.phase" ),
-                ("pysam_ex1.phase", (pysam.phase, "pysam_ex1.bam") ),
-                ),
-          "import" :
-              (
-                ("ex1.bam", "import ex1.fa.fai ex1.sam.gz ex1.bam" ),
-                ("pysam_ex1.bam", (pysam.samimport, "ex1.fa.fai ex1.sam.gz pysam_ex1.bam") ),
-                ),
-          "bam2fq":
-              (
-                ("ex1.bam2fq", "bam2fq ex1.bam > ex1.bam2fq" ),
-                ("pysam_ex1.bam2fq", (pysam.bam2fq, "pysam_ex1.bam") ),
-                ),
-        }
-
-    # some tests depend on others. The order specifies in which order
-    # the samtools commands are executed.
-    # The first three (faidx, import, index) need to be in that order, 
-    # the rest is arbitrary.
-    order = ('faidx', 'import', 'index', 
-              # 'pileup1', 'pileup2', deprecated
-              # 'glfview', deprecated
-              'view', 'view2',
-              'sort',
-              'mpileup',
-              'depth',
-              'idxstats',
-              'fixmate',
-              'flagstat',
-              # 'calmd',
-              'merge',
-              'rmdup',
-              'reheader',
-              'cat',
-              'targetcut',
-              'phase',
-              'bam2fq',
-              )
-
-    def setUp( self ):
-        '''setup tests. 
-
-        For setup, all commands will be run before the first test is
-        executed. Individual tests will then just compare the output
-        files.
-        '''
-        if BinaryTest.first_time:
-
-            # remove previous files
-            if os.path.exists( WORKDIR ):
-                shutil.rmtree( WORKDIR )
-                
-            # copy the source files to WORKDIR
-            os.makedirs( WORKDIR )
-
-            shutil.copy( "ex1.fa", os.path.join( WORKDIR, "pysam_ex1.fa" ) )
-            shutil.copy( "ex1.fa", os.path.join( WORKDIR, "ex1.fa" ) )
-            shutil.copy( "ex1.sam.gz", os.path.join( WORKDIR, "ex1.sam.gz" ) )
-            shutil.copy( "ex1.sam", os.path.join( WORKDIR, "ex1.sam" ) )
-
-            # cd to workdir
-            savedir = os.getcwd()
-            os.chdir( WORKDIR )
-            
-            for label in self.order:
-                command = self.commands[label]
-                samtools_target, samtools_command = command[0]
-                try:
-                    pysam_target, pysam_command = command[1]
-                except ValueError as msg:
-                    raise ValueError( "error while setting up %s=%s: %s" %\
-                                          (label, command, msg) )
-                runSamtools( " ".join( (SAMTOOLS, samtools_command )))
-                pysam_method, pysam_options = pysam_command
-                try:
-                    output = pysam_method( *pysam_options.split(" "), raw=True)
-                except pysam.SamtoolsError as msg:
-                    raise pysam.SamtoolsError( "error while executing %s: options=%s: msg=%s" %\
-                                                   (label, pysam_options, msg) )
-                if ">" in samtools_command:
-                    outfile = open( pysam_target, "wb" )
-                    if sys.version_info[0] < 3:
-                        for line in output: outfile.write( line )
-                    else:
-                        for line in output: outfile.write(line.encode('ascii'))
-                    outfile.close()
-                    
-            os.chdir( savedir )
-            BinaryTest.first_time = False
-
-            
-
-        samtools_version = getSamtoolsVersion()
-
-        
-        def _r( s ):
-            # patch - remove any of the alpha/beta suffixes, i.e., 0.1.12a -> 0.1.12
-            if s.count('-') > 0: s = s[0:s.find('-')]
-            return re.sub( "[^0-9.]", "", s )
-
-        if _r(samtools_version) != _r( pysam.__samtools_version__):
-            raise ValueError("versions of pysam/samtools and samtools differ: %s != %s" % \
-                                 (pysam.__samtools_version__,
-                                  samtools_version ))
-
-    def checkCommand( self, command ):
-        if command:
-            samtools_target, pysam_target = self.commands[command][0][0], self.commands[command][1][0]
-            samtools_target = os.path.join( WORKDIR, samtools_target )
-            pysam_target = os.path.join( WORKDIR, pysam_target )
-            self.assertTrue( checkBinaryEqual( samtools_target, pysam_target ), 
-                             "%s failed: files %s and %s are not the same" % (command, samtools_target, pysam_target) )
-            
-    def testImport( self ):
-        self.checkCommand( "import" )
-
-    def testIndex( self ):
-        self.checkCommand( "index" )
-
-    def testSort( self ):
-        self.checkCommand( "sort" )
-
-    def testMpileup( self ):
-        self.checkCommand( "mpileup" )
-
-    def testDepth( self ):
-        self.checkCommand( "depth" )
-
-    def testIdxstats( self ):
-        self.checkCommand( "idxstats" )
-
-    def testFixmate( self ):
-        self.checkCommand( "fixmate" )
-
-    def testFlagstat( self ):
-        self.checkCommand( "flagstat" )
-        
-    def testMerge( self ):
-        self.checkCommand( "merge" )
-
-    def testRmdup( self ):
-        self.checkCommand( "rmdup" )
-
-    def testReheader( self ):
-        self.checkCommand( "reheader" )
-
-    def testCat( self ):
-        self.checkCommand( "cat" )
-
-    def testTargetcut( self ):
-        self.checkCommand( "targetcut" )
-
-    def testPhase( self ):
-        self.checkCommand( "phase" )
-
-    def testBam2fq( self ):
-        self.checkCommand( "bam2fq" )
-
-    # def testPileup1( self ):
-    #     self.checkCommand( "pileup1" )
-    
-    # def testPileup2( self ):
-    #     self.checkCommand( "pileup2" )
-
-    # deprecated
-    # def testGLFView( self ):
-    #     self.checkCommand( "glfview" )
-
-    def testView( self ):
-        self.checkCommand( "view" )
-
-    def testEmptyIndex( self ):
-        self.assertRaises( IOError, pysam.index, "exdoesntexist.bam" )
-
-    def __del__(self):
-        if os.path.exists( WORKDIR ):
-            shutil.rmtree( WORKDIR )
-
-class IOTest(unittest.TestCase):
-    '''check if reading samfile and writing a samfile are consistent.'''
-
-    def checkEcho( self, input_filename, reference_filename, 
-                   output_filename, 
-                   input_mode, output_mode, use_template = True ):
-        '''iterate through *input_filename* writing to *output_filename* and
-        comparing the output to *reference_filename*. 
-        
-        The files are opened according to the *input_mode* and *output_mode*.
-
-        If *use_template* is set, the header is copied from infile using the
-        template mechanism, otherwise target names and lengths are passed 
-        explicitly. 
-
-        '''
-
-        infile = pysam.Samfile( input_filename, input_mode )
-        if use_template:
-            outfile = pysam.Samfile( output_filename, output_mode, template = infile )
-        else:
-            outfile = pysam.Samfile( output_filename, output_mode, 
-                                     referencenames = infile.references,
-                                     referencelengths = infile.lengths,
-                                     add_sq_text = False )
-            
-        iter = infile.fetch()
-        for x in iter: outfile.write( x )
-        infile.close()
-        outfile.close()
-
-        self.assertTrue( checkBinaryEqual( reference_filename, output_filename), 
-                         "files %s and %s are not the same" % (reference_filename, output_filename) )
-
-
-    def testReadWriteBam( self ):
-        
-        input_filename = "ex1.bam"
-        output_filename = "pysam_ex1.bam"
-        reference_filename = "ex1.bam"
-
-        self.checkEcho( input_filename, reference_filename, output_filename,
-                        "rb", "wb" )
-
-
-    def testReadWriteBamWithTargetNames( self ):
-        
-        input_filename = "ex1.bam"
-        output_filename = "pysam_ex1.bam"
-        reference_filename = "ex1.bam"
-
-        self.checkEcho( input_filename, reference_filename, output_filename,
-                        "rb", "wb", use_template = False )
-
-    def testReadWriteSamWithHeader( self ):
-        
-        input_filename = "ex2.sam"
-        output_filename = "pysam_ex2.sam"
-        reference_filename = "ex2.sam"
-
-        self.checkEcho( input_filename, reference_filename, output_filename,
-                        "r", "wh" )
-
-    def testReadWriteSamWithoutHeader( self ):
-        
-        input_filename = "ex2.sam"
-        output_filename = "pysam_ex2.sam"
-        reference_filename = "ex1.sam"
-
-        self.checkEcho( input_filename, reference_filename, output_filename,
-                        "r", "w" )
-
-    def testReadSamWithoutHeaderWriteSamWithoutHeader( self ):
-        
-        input_filename = "ex1.sam"
-        output_filename = "pysam_ex1.sam"
-        reference_filename = "ex1.sam"
-
-        # disabled - reading from a samfile without header
-        # is not implemented.
-        
-        # self.checkEcho( input_filename, reference_filename, output_filename,
-        #                 "r", "w" )
-
-    def testFetchFromClosedFile( self ):
-
-        samfile = pysam.Samfile( "ex1.bam", "rb" )
-        samfile.close()
-        self.assertRaises( ValueError, samfile.fetch, 'chr1', 100, 120)
-
-    def testClosedFile( self ):
-        '''test that access to a closed samfile raises ValueError.'''
-
-        samfile = pysam.Samfile( "ex1.bam", "rb" )
-        samfile.close()
-        self.assertRaises( ValueError, samfile.fetch, 'chr1', 100, 120)
-        self.assertRaises( ValueError, samfile.pileup, 'chr1', 100, 120)
-        self.assertRaises( ValueError, samfile.getrname, 0 )
-        self.assertRaises( ValueError, samfile.tell )
-        self.assertRaises( ValueError, samfile.seek, 0 )
-        self.assertRaises( ValueError, getattr, samfile, "nreferences" )
-        self.assertRaises( ValueError, getattr, samfile, "references" )
-        self.assertRaises( ValueError, getattr, samfile, "lengths" )
-        self.assertRaises( ValueError, getattr, samfile, "text" )
-        self.assertRaises( ValueError, getattr, samfile, "header" )
-
-        # write on closed file 
-        self.assertEqual( 0, samfile.write(None) )
-
-    def testAutoDetection( self ):
-        '''test if autodetection works.'''
-
-        samfile = pysam.Samfile( "ex3.sam" )
-        self.assertRaises( ValueError, samfile.fetch, 'chr1' )
-        samfile.close()
-
-        samfile = pysam.Samfile( "ex3.bam" )
-        samfile.fetch('chr1')
-        samfile.close()
-
-    def testReadingFromSamFileWithoutHeader( self ):
-        '''read from samfile without header.
-        '''
-        samfile = pysam.Samfile( "ex7.sam" )
-        self.assertRaises( NotImplementedError, samfile.__iter__ )
-
-    def testReadingFromFileWithoutIndex( self ):
-        '''read from bam file without index.'''
-
-        assert not os.path.exists( "ex2.bam.bai" )
-        samfile = pysam.Samfile( "ex2.bam", "rb" )
-        self.assertRaises( ValueError, samfile.fetch )
-        self.assertEqual( len(list( samfile.fetch(until_eof = True) )), 3270 )
-
-    def testReadingUniversalFileMode( self ):
-        '''read from samfile without header.
-        '''
-
-        input_filename = "ex2.sam"
-        output_filename = "pysam_ex2.sam"
-        reference_filename = "ex1.sam"
-
-        self.checkEcho( input_filename, reference_filename, output_filename,
-                        "rU", "w" )
-
-class TestFloatTagBug( unittest.TestCase ):
-    '''see issue 71'''
-
-    def testFloatTagBug( self ): 
-        '''a float tag before another exposed a parsing bug in bam_aux_get - expected to fail
-
-        This test is expected to fail until samtools is fixed.
-        '''
-        samfile = pysam.Samfile("tag_bug.bam")
-        read = samfile.fetch(until_eof=True).next()
-        self.assertTrue( ('XC',1) in read.tags )
-        self.assertEqual(read.opt('XC'), 1)
-
-class TestTagParsing( unittest.TestCase ):
-    '''tests checking the accuracy of tag setting and retrieval.'''
-
-    def makeRead( self ):
-        a = pysam.AlignedRead()
-        a.qname = "read_12345"
-        a.tid = 0
-        a.seq="ACGT" * 3
-        a.flag = 0
-        a.rname = 0
-        a.pos = 1
-        a.mapq = 20
-        a.cigar = ( (0,10), (2,1), (0,25) )
-        a.mrnm = 0
-        a.mpos=200
-        a.isize = 0
-        a.qual ="1234" * 3
-        # todo: create tags
-        return a
-
-    def testNegativeIntegers( self ):
-        x = -2
-        aligned_read = self.makeRead()
-        aligned_read.tags = [("XD", int(x) ) ]
-        # print (aligned_read.tags)
-
-    def testNegativeIntegers2( self ):
-        x = -2
-        r = self.makeRead()
-        r.tags = [("XD", int(x) ) ]
-        outfile = pysam.Samfile( "test.bam",
-                                 "wb",
-                                 referencenames = ("chr1",),
-                                 referencelengths = (1000,) )
-        outfile.write (r )
-        outfile.close()
-
-
-class TestIteratorRow(unittest.TestCase):
-
-    def setUp(self):
-        self.samfile=pysam.Samfile( "ex1.bam","rb" )
-
-    def checkRange( self, rnge ):
-        '''compare results from iterator with those from samtools.'''
-        ps = list(self.samfile.fetch(region=rnge))
-        sa = list(pysam.view( "ex1.bam", rnge, raw = True) )
-        self.assertEqual( len(ps), len(sa), "unequal number of results for range %s: %i != %i" % (rnge, len(ps), len(sa) ))
-        # check if the same reads are returned and in the same order
-        for line, (a, b) in enumerate( zip( ps, sa ) ):
-            d = b.split("\t")
-            self.assertEqual( a.qname, d[0], "line %i: read id mismatch: %s != %s" % (line, a.rname, d[0]) )
-            self.assertEqual( a.pos, int(d[3])-1, "line %i: read position mismatch: %s != %s, \n%s\n%s\n" % \
-                                  (line, a.pos, int(d[3])-1,
-                                   str(a), str(d) ) )
-            if sys.version_info[0] < 3:
-                qual = d[10]
-            else:
-                qual = d[10].encode('ascii')
-            self.assertEqual( a.qual, qual, "line %i: quality mismatch: %s != %s, \n%s\n%s\n" % \
-                                  (line, a.qual, qual,
-                                   str(a), str(d) ) )
-
-    def testIteratePerContig(self):
-        '''check random access per contig'''
-        for contig in self.samfile.references:
-            self.checkRange( contig )
-
-    def testIterateRanges(self):
-        '''check random access per range'''
-        for contig, length in zip(self.samfile.references, self.samfile.lengths):
-            for start in range( 1, length, 90):
-                self.checkRange( "%s:%i-%i" % (contig, start, start + 90) ) # this includes empty ranges
-
-    def tearDown(self):
-        self.samfile.close()
-
-class TestIteratorRowAll(unittest.TestCase):
-
-    def setUp(self):
-        self.samfile=pysam.Samfile( "ex1.bam","rb" )
-
-    def testIterate(self):
-        '''compare results from iterator with those from samtools.'''
-        ps = list(self.samfile.fetch())
-        sa = list(pysam.view( "ex1.bam", raw = True) )
-        self.assertEqual( len(ps), len(sa), "unequal number of results: %i != %i" % (len(ps), len(sa) ))
-        # check if the same reads are returned
-        for line, pair in enumerate( zip( ps, sa ) ):
-            data = pair[1].split("\t")
-            self.assertEqual( pair[0].qname, data[0], "read id mismatch in line %i: %s != %s" % (line, pair[0].rname, data[0]) )
-
-    def tearDown(self):
-        self.samfile.close()
-
-class TestIteratorColumn(unittest.TestCase):
-    '''test iterator column against contents of ex3.bam.'''
-    
-    # note that samfile contains 1-based coordinates
-    # 1D means deletion with respect to reference sequence
-    # 
-    mCoverages = { 'chr1' : [ 0 ] * 20 + [1] * 36 + [0] * (100 - 20 -35 ),
-                   'chr2' : [ 0 ] * 20 + [1] * 35 + [0] * (100 - 20 -35 ),
-                   }
-
-    def setUp(self):
-        self.samfile=pysam.Samfile( "ex4.bam","rb" )
-
-    def checkRange( self, rnge ):
-        '''compare results from iterator with those from samtools.'''
-        # check if the same reads are returned and in the same order
-        for column in self.samfile.pileup(region=rnge):
-            thiscov = len(column.pileups)
-            refcov = self.mCoverages[self.samfile.getrname(column.tid)][column.pos]
-            self.assertEqual( thiscov, refcov, "wrong coverage at pos %s:%i %i should be %i" % (self.samfile.getrname(column.tid), column.pos, thiscov, refcov))
-
-    def testIterateAll(self):
-        '''check random access per contig'''
-        self.checkRange( None )
-
-    def testIteratePerContig(self):
-        '''check random access per contig'''
-        for contig in self.samfile.references:
-            self.checkRange( contig )
-
-    def testIterateRanges(self):
-        '''check random access per range'''
-        for contig, length in zip(self.samfile.references, self.samfile.lengths):
-            for start in range( 1, length, 90):
-                self.checkRange( "%s:%i-%i" % (contig, start, start + 90) ) # this includes empty ranges
-
-    def testInverse( self ):
-        '''test the inverse, is point-wise pileup accurate.'''
-        for contig, refseq in self.mCoverages.items():
-            refcolumns = sum(refseq)
-            for pos, refcov in enumerate( refseq ):
-                columns = list(self.samfile.pileup( contig, pos, pos+1) )
-                if refcov == 0:
-                    # if no read, no coverage
-                    self.assertEqual( len(columns), refcov, "wrong number of pileup columns returned for position %s:%i, %i should be %i" %(contig,pos,len(columns), refcov) )
-                elif refcov == 1:
-                    # one read, all columns of the read are returned
-                    self.assertEqual( len(columns), refcolumns, "pileup incomplete at position %i: got %i, expected %i " %\
-                                          (pos, len(columns), refcolumns))
-
-                    
-    
-    def tearDown(self):
-        self.samfile.close()
-    
-class TestAlignedReadFromBam(unittest.TestCase):
-
-    def setUp(self):
-        self.samfile=pysam.Samfile( "ex3.bam","rb" )
-        self.reads=list(self.samfile.fetch())
-
-    def testARqname(self):
-        self.assertEqual( self.reads[0].qname, "read_28833_29006_6945", "read name mismatch in read 1: %s != %s" % (self.reads[0].qname, "read_28833_29006_6945") )
-        self.assertEqual( self.reads[1].qname, "read_28701_28881_323b", "read name mismatch in read 2: %s != %s" % (self.reads[1].qname, "read_28701_28881_323b") )
-
-    def testARflag(self):
-        self.assertEqual( self.reads[0].flag, 99, "flag mismatch in read 1: %s != %s" % (self.reads[0].flag, 99) )
-        self.assertEqual( self.reads[1].flag, 147, "flag mismatch in read 2: %s != %s" % (self.reads[1].flag, 147) )
-
-    def testARrname(self):
-        self.assertEqual( self.reads[0].rname, 0, "chromosome/target id mismatch in read 1: %s != %s" % (self.reads[0].rname, 0) )
-        self.assertEqual( self.reads[1].rname, 1, "chromosome/target id mismatch in read 2: %s != %s" % (self.reads[1].rname, 1) )
-
-    def testARpos(self):
-        self.assertEqual( self.reads[0].pos, 33-1, "mapping position mismatch in read 1: %s != %s" % (self.reads[0].pos, 33-1) )
-        self.assertEqual( self.reads[1].pos, 88-1, "mapping position mismatch in read 2: %s != %s" % (self.reads[1].pos, 88-1) )
-
-    def testARmapq(self):
-        self.assertEqual( self.reads[0].mapq, 20, "mapping quality mismatch in read 1: %s != %s" % (self.reads[0].mapq, 20) )
-        self.assertEqual( self.reads[1].mapq, 30, "mapping quality mismatch in read 2: %s != %s" % (self.reads[1].mapq, 30) )
-
-    def testARcigar(self):
-        self.assertEqual( self.reads[0].cigar, [(0, 10), (2, 1), (0, 25)], "read name length mismatch in read 1: %s != %s" % (self.reads[0].cigar, [(0, 10), (2, 1), (0, 25)]) )
-        self.assertEqual( self.reads[1].cigar, [(0, 35)], "read name length mismatch in read 2: %s != %s" % (self.reads[1].cigar, [(0, 35)]) )
-
-    def testARmrnm(self):
-        self.assertEqual( self.reads[0].mrnm, 0, "mate reference sequence name mismatch in read 1: %s != %s" % (self.reads[0].mrnm, 0) )
-        self.assertEqual( self.reads[1].mrnm, 1, "mate reference sequence name mismatch in read 2: %s != %s" % (self.reads[1].mrnm, 1) )
-        self.assertEqual( self.reads[0].rnext, 0, "mate reference sequence name mismatch in read 1: %s != %s" % (self.reads[0].rnext, 0) )
-        self.assertEqual( self.reads[1].rnext, 1, "mate reference sequence name mismatch in read 2: %s != %s" % (self.reads[1].rnext, 1) )
-
-    def testARmpos(self):
-        self.assertEqual( self.reads[0].mpos, 200-1, "mate mapping position mismatch in read 1: %s != %s" % (self.reads[0].mpos, 200-1) )
-        self.assertEqual( self.reads[1].mpos, 500-1, "mate mapping position mismatch in read 2: %s != %s" % (self.reads[1].mpos, 500-1) )
-        self.assertEqual( self.reads[0].pnext, 200-1, "mate mapping position mismatch in read 1: %s != %s" % (self.reads[0].pnext, 200-1) )
-        self.assertEqual( self.reads[1].pnext, 500-1, "mate mapping position mismatch in read 2: %s != %s" % (self.reads[1].pnext, 500-1) )
-
-    def testARisize(self):
-        self.assertEqual( self.reads[0].isize, 167, "insert size mismatch in read 1: %s != %s" % (self.reads[0].isize, 167) )
-        self.assertEqual( self.reads[1].isize, 412, "insert size mismatch in read 2: %s != %s" % (self.reads[1].isize, 412) )
-        self.assertEqual( self.reads[0].tlen, 167, "insert size mismatch in read 1: %s != %s" % (self.reads[0].tlen, 167) )
-        self.assertEqual( self.reads[1].tlen, 412, "insert size mismatch in read 2: %s != %s" % (self.reads[1].tlen, 412) )
-
-    def testARseq(self):
-        self.assertEqual( self.reads[0].seq, b"AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG", "sequence mismatch in read 1: %s != %s" % (self.reads[0].seq, b"AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG") )
-        self.assertEqual( self.reads[1].seq, b"ACCTATATCTTGGCCTTGGCCGATGCGGCCTTGCA", "sequence size mismatch in read 2: %s != %s" % (self.reads[1].seq, b"ACCTATATCTTGGCCTTGGCCGATGCGGCCTTGCA") )
-        self.assertEqual( self.reads[3].seq, b"AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG", "sequence mismatch in read 4: %s != %s" % (self.reads[3].seq, b"AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG") )
-
-    def testARqual(self):
-        self.assertEqual( self.reads[0].qual, b"<<<<<<<<<<<<<<<<<<<<<:<9/,&,22;;<<<", "quality string mismatch in read 1: %s != %s" % (self.reads[0].qual, b"<<<<<<<<<<<<<<<<<<<<<:<9/,&,22;;<<<") )
-        self.assertEqual( self.reads[1].qual, b"<<<<<;<<<<7;:<<<6;<<<<<<<<<<<<7<<<<", "quality string mismatch in read 2: %s != %s" % (self.reads[1].qual, b"<<<<<;<<<<7;:<<<6;<<<<<<<<<<<<7<<<<") )
-        self.assertEqual( self.reads[3].qual, b"<<<<<<<<<<<<<<<<<<<<<:<9/,&,22;;<<<", "quality string mismatch in read 3: %s != %s" % (self.reads[3].qual, b"<<<<<<<<<<<<<<<<<<<<<:<9/,&,22;;<<<") )
-
-    def testARquery(self):
-        self.assertEqual( self.reads[0].query, b"AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG", "query mismatch in read 1: %s != %s" % (self.reads[0].query, b"AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG") )
-        self.assertEqual( self.reads[1].query, b"ACCTATATCTTGGCCTTGGCCGATGCGGCCTTGCA", "query size mismatch in read 2: %s != %s" % (self.reads[1].query, b"ACCTATATCTTGGCCTTGGCCGATGCGGCCTTGCA") )
-        self.assertEqual( self.reads[3].query, b"TAGCTAGCTACCTATATCTTGGTCTT", "query mismatch in read 4: %s != %s" % (self.reads[3].query, b"TAGCTAGCTACCTATATCTTGGTCTT") )
-
-    def testARqqual(self):
-        self.assertEqual( self.reads[0].qqual, b"<<<<<<<<<<<<<<<<<<<<<:<9/,&,22;;<<<", "qquality string mismatch in read 1: %s != %s" % (self.reads[0].qqual, b"<<<<<<<<<<<<<<<<<<<<<:<9/,&,22;;<<<") )
-        self.assertEqual( self.reads[1].qqual, b"<<<<<;<<<<7;:<<<6;<<<<<<<<<<<<7<<<<", "qquality string mismatch in read 2: %s != %s" % (self.reads[1].qqual, b"<<<<<;<<<<7;:<<<6;<<<<<<<<<<<<7<<<<") )
-        self.assertEqual( self.reads[3].qqual, b"<<<<<<<<<<<<<<<<<:<9/,&,22", "qquality string mismatch in read 3: %s != %s" % (self.reads[3].qqual, b"<<<<<<<<<<<<<<<<<:<9/,&,22") )
-
-    def testPresentOptionalFields(self):
-        self.assertEqual( self.reads[0].opt('NM'), 1, "optional field mismatch in read 1, NM: %s != %s" % (self.reads[0].opt('NM'), 1) )
-        self.assertEqual( self.reads[0].opt('RG'), 'L1', "optional field mismatch in read 1, RG: %s != %s" % (self.reads[0].opt('RG'), 'L1') )
-        self.assertEqual( self.reads[1].opt('RG'), 'L2', "optional field mismatch in read 2, RG: %s != %s" % (self.reads[1].opt('RG'), 'L2') )
-        self.assertEqual( self.reads[1].opt('MF'), 18, "optional field mismatch in read 2, MF: %s != %s" % (self.reads[1].opt('MF'), 18) )
-
-    def testPairedBools(self):
-        self.assertEqual( self.reads[0].is_paired, True, "is paired mismatch in read 1: %s != %s" % (self.reads[0].is_paired, True) )
-        self.assertEqual( self.reads[1].is_paired, True, "is paired mismatch in read 2: %s != %s" % (self.reads[1].is_paired, True) )
-        self.assertEqual( self.reads[0].is_proper_pair, True, "is proper pair mismatch in read 1: %s != %s" % (self.reads[0].is_proper_pair, True) )
-        self.assertEqual( self.reads[1].is_proper_pair, True, "is proper pair mismatch in read 2: %s != %s" % (self.reads[1].is_proper_pair, True) )
-
-    def testTags( self ):
-        self.assertEqual( self.reads[0].tags, 
-                          [('NM', 1), ('RG', 'L1'), 
-                           ('PG', 'P1'), ('XT', 'U')] )
-        self.assertEqual( self.reads[1].tags, 
-                          [('MF', 18), ('RG', 'L2'), 
-                           ('PG', 'P2'),('XT', 'R') ] )
-
-    def testOpt( self ):
-        self.assertEqual( self.reads[0].opt("XT"), "U" )
-        self.assertEqual( self.reads[1].opt("XT"), "R" )
-
-    def testMissingOpt( self ):
-        self.assertRaises( KeyError, self.reads[0].opt, "XP" )
-
-    def testEmptyOpt( self ):
-        self.assertRaises( KeyError, self.reads[2].opt, "XT" )
-
-    def tearDown(self):
-        self.samfile.close()
-
-class TestAlignedReadFromSam(TestAlignedReadFromBam):
-
-    def setUp(self):
-        self.samfile=pysam.Samfile( "ex3.sam","r" )
-        self.reads=list(self.samfile.fetch())
-
-# needs to be implemented 
-# class TestAlignedReadFromSamWithoutHeader(TestAlignedReadFromBam):
-#
-#     def setUp(self):
-#         self.samfile=pysam.Samfile( "ex7.sam","r" )
-#         self.reads=list(self.samfile.fetch())
-
-class TestHeaderSam(unittest.TestCase):
-
-    header = {'SQ': [{'LN': 1575, 'SN': 'chr1'}, 
-                     {'LN': 1584, 'SN': 'chr2'}], 
-              'RG': [{'LB': 'SC_1', 'ID': 'L1', 'SM': 'NA12891', 'PU': 'SC_1_10', "CN":"name:with:colon"}, 
-                     {'LB': 'SC_2', 'ID': 'L2', 'SM': 'NA12891', 'PU': 'SC_2_12', "CN":"name:with:colon"}],
-              'PG': [{'ID': 'P1', 'VN': '1.0'}, {'ID': 'P2', 'VN': '1.1'}], 
-              'HD': {'VN': '1.0'},
-              'CO' : [ 'this is a comment', 'this is another comment'],
-              }
-
-    def compareHeaders( self, a, b ):
-        '''compare two headers a and b.'''
-        for ak,av in a.iteritems():
-            self.assertTrue( ak in b, "key '%s' not in '%s' " % (ak,b) )
-            self.assertEqual( av, b[ak] )
-
-    def setUp(self):
-        self.samfile=pysam.Samfile( "ex3.sam","r" )
-
-    def testHeaders(self):
-        self.compareHeaders( self.header, self.samfile.header )
-        self.compareHeaders( self.samfile.header, self.header )
-
-    def testNameMapping( self ):
-        for x, y in enumerate( ("chr1", "chr2")):
-            tid = self.samfile.gettid( y )
-            ref = self.samfile.getrname( x )
-            self.assertEqual( tid, x )
-            self.assertEqual( ref, y )
-
-        self.assertEqual( self.samfile.gettid("chr?"), -1 )
-        self.assertRaises( ValueError, self.samfile.getrname, 2 )
-
-    def tearDown(self):
-        self.samfile.close()
-
-class TestHeaderBam(TestHeaderSam):
-
-    def setUp(self):
-        self.samfile=pysam.Samfile( "ex3.bam","rb" )
-
-class TestUnmappedReads(unittest.TestCase):
-
-    def testSAM(self):
-        samfile=pysam.Samfile( "ex5.sam","r" )
-        self.assertEqual( len(list(samfile.fetch( until_eof = True))), 2 ) 
-        samfile.close()
-
-    def testBAM(self):
-        samfile=pysam.Samfile( "ex5.bam","rb" )
-        self.assertEqual( len(list(samfile.fetch( until_eof = True))), 2 ) 
-        samfile.close()
-
-class TestPileupObjects(unittest.TestCase):
-
-    def setUp(self):
-        self.samfile=pysam.Samfile( "ex1.bam","rb" )
-
-    def testPileupColumn(self):
-        for pcolumn1 in self.samfile.pileup( region="chr1:105" ):
-            if pcolumn1.pos == 104:
-                self.assertEqual( pcolumn1.tid, 0, "chromosome/target id mismatch in position 1: %s != %s" % (pcolumn1.tid, 0) )
-                self.assertEqual( pcolumn1.pos, 105-1, "position mismatch in position 1: %s != %s" % (pcolumn1.pos, 105-1) )
-                self.assertEqual( pcolumn1.n, 2, "# reads mismatch in position 1: %s != %s" % (pcolumn1.n, 2) )
-        for pcolumn2 in self.samfile.pileup( region="chr2:1480" ):
-            if pcolumn2.pos == 1479:
-                self.assertEqual( pcolumn2.tid, 1, "chromosome/target id mismatch in position 1: %s != %s" % (pcolumn2.tid, 1) )
-                self.assertEqual( pcolumn2.pos, 1480-1, "position mismatch in position 1: %s != %s" % (pcolumn2.pos, 1480-1) )
-                self.assertEqual( pcolumn2.n, 12, "# reads mismatch in position 1: %s != %s" % (pcolumn2.n, 12) )
-
-    def testPileupRead(self):
-        for pcolumn1 in self.samfile.pileup( region="chr1:105" ):
-            if pcolumn1.pos == 104:
-                self.assertEqual( len(pcolumn1.pileups), 2, "# reads aligned to column mismatch in position 1: %s != %s" % (len(pcolumn1.pileups), 2) )
-#                self.assertEqual( pcolumn1.pileups[0]  # need to test additional properties here
-
-    def tearDown(self):
-        self.samfile.close()
-
-class TestContextManager(unittest.TestCase):
-
-    def testManager( self ):
-        with pysam.Samfile('ex1.bam', 'rb') as samfile:
-            samfile.fetch()
-        self.assertEqual( samfile._isOpen(), False )
-
-class TestExceptions(unittest.TestCase):
-
-    def setUp(self):
-        self.samfile=pysam.Samfile( "ex1.bam","rb" )
-
-    def testMissingFile(self):
-
-        self.assertRaises( IOError, pysam.Samfile, "exdoesntexist.bam", "rb" )
-        self.assertRaises( IOError, pysam.Samfile, "exdoesntexist.sam", "r" )
-        self.assertRaises( IOError, pysam.Samfile, "exdoesntexist.bam", "r" )
-        self.assertRaises( IOError, pysam.Samfile, "exdoesntexist.sam", "rb" )
-
-    def testBadContig(self):
-        self.assertRaises( ValueError, self.samfile.fetch, "chr88" )
-
-    def testMeaninglessCrap(self):
-        self.assertRaises( ValueError, self.samfile.fetch, "skljf" )
-
-    def testBackwardsOrderNewFormat(self):
-        self.assertRaises( ValueError, self.samfile.fetch, 'chr1', 100, 10 )
-
-    def testBackwardsOrderOldFormat(self):
-        self.assertRaises( ValueError, self.samfile.fetch, region="chr1:100-10")
-        
-    def testOutOfRangeNegativeNewFormat(self):
-        self.assertRaises( ValueError, self.samfile.fetch, "chr1", 5, -10 )
-        self.assertRaises( ValueError, self.samfile.fetch, "chr1", 5, 0 )
-        self.assertRaises( ValueError, self.samfile.fetch, "chr1", -5, -10 )
-
-        self.assertRaises( ValueError, self.samfile.count, "chr1", 5, -10 )
-        self.assertRaises( ValueError, self.samfile.count, "chr1", 5, 0 )        
-        self.assertRaises( ValueError, self.samfile.count, "chr1", -5, -10 )
-
-    def testOutOfRangeNegativeOldFormat(self):
-        self.assertRaises( ValueError, self.samfile.fetch, region="chr1:-5-10" )
-        self.assertRaises( ValueError, self.samfile.fetch, region="chr1:-5-0" )
-        self.assertRaises( ValueError, self.samfile.fetch, region="chr1:-5--10" )
-
-        self.assertRaises( ValueError, self.samfile.count, region="chr1:-5-10" )
-        self.assertRaises( ValueError, self.samfile.count, region="chr1:-5-0" )
-        self.assertRaises( ValueError, self.samfile.count, region="chr1:-5--10" )
-
-    def testOutOfRangNewFormat(self):
-        self.assertRaises( ValueError, self.samfile.fetch, "chr1", 9999999999, 99999999999 )
-        self.assertRaises( ValueError, self.samfile.count, "chr1", 9999999999, 99999999999 )
-
-    def testOutOfRangeLargeNewFormat(self):
-        self.assertRaises( ValueError, self.samfile.fetch, "chr1", 9999999999999999999999999999999, 9999999999999999999999999999999999999999 )
-        self.assertRaises( ValueError, self.samfile.count, "chr1", 9999999999999999999999999999999, 9999999999999999999999999999999999999999 )
-
-    def testOutOfRangeLargeOldFormat(self):
-        self.assertRaises( ValueError, self.samfile.fetch, "chr1:99999999999999999-999999999999999999" )
-        self.assertRaises( ValueError, self.samfile.count, "chr1:99999999999999999-999999999999999999" )
-
-    def testZeroToZero(self):        
-        '''see issue 44'''
-        self.assertEqual( len(list(self.samfile.fetch('chr1', 0, 0))), 0)
-
-    def tearDown(self):
-        self.samfile.close()
-
-class TestWrongFormat(unittest.TestCase):
-    '''test cases for opening files not in bam/sam format.'''
-
-    def testOpenSamAsBam( self ):
-        self.assertRaises( ValueError, pysam.Samfile, 'ex1.sam', 'rb' )
-
-    def testOpenBamAsSam( self ):
-        # test fails, needs to be implemented.
-        # sam.fetch() fails on reading, not on opening
-        # self.assertRaises( ValueError, pysam.Samfile, 'ex1.bam', 'r' )
-        pass
-
-    def testOpenFastaAsSam( self ):
-        # test fails, needs to be implemented.
-        # sam.fetch() fails on reading, not on opening
-        # self.assertRaises( ValueError, pysam.Samfile, 'ex1.fa', 'r' )
-        pass
-
-    def testOpenFastaAsBam( self ):
-        self.assertRaises( ValueError, pysam.Samfile, 'ex1.fa', 'rb' )
-
-class TestFastaFile(unittest.TestCase):
-
-    mSequences = { 'chr1' :
-                       b"CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCGTCCATGGCCCAGCATTAGGGAGCTGTGGACCCTGCAGCCTGGCTGTGGGGGCCGCAGTGGCTGAGGGGTGCAGAGCCGAGTCACGGGGTTGCCAGCACAGGGGCTTAACCTCTGGTGACTGCCAGAGCTGCTGGCAAGCTAGAGTCCCATTTGGAGCCCCTCTAAGCCGTTCTATTTGTAATGAAAACTATATTTATGCTATTCAGTTCTAAATATAGAAATTGAAACAGCTGTGTTTAGTGCCTTTGTTCAACCCCCTTGCAACAACCTTGAGAACCCCAGGGAATTTGTCAATGTCAGGGAAGGAGCATTTTGTCAGTTACCAAATGTGTTTATTACCAGAGGGATGGAGGGAAGAGGGACGCTGAAGAACTTTGATGCCCTCTTCTTCCAAAGATGAAACGCGTAACTGCGCTCTCATTCACTCCAGCTCCCTGTCACCCAATGGACCTGTGATATCTGGATTCTGGGAAATTCTTCATCCTGGACCCTGAGAGATTCTGCAGCCCAGCTCCAGATTGCTTGTGGTCTGACAGGCTGCAACTGTGAGCCATCACAATGAACAACAGGAAGAAAAGGTCTTTCAAAAGGTGATGTGTGTTCTCATCAACCTCATACACACACATGGTTTAGGGGTATAATACCTCTACATGGCTGATTATGAAAACAATGTTCCCCAGATACCATCCCTGTCTTACTTCCAGCTCCCCAGAGGGAAAGCTTTCAACGCTTCTAGCCATTTCTTTTGGCATTTGCCTTCAGACCCTACACGAATGCGTCTCTACCACAGGGGGCTGCGCGGTTTCCCATCATGAAGCACTGAACTTCCACGTCTCATCTAGGGGAACAGGGAGGTGCACTAATGCGCTCCACGCCCAAGCCCTTCTCACAGTTTCTGCCCCCAGCATGGTTGTACTGGGCAATACATGAGATTATTAGGAAATGCTTTACTGTCATAACTATGAAGAGACTATTGCCAGATGAACCACACATTAATACTATGTTTCTTATCTGCACATTACTACCCTGCAATTAATATAATTGTGTCCATGTACACACGCTGTCCTATGTACTTATCATGACTCTATCCCAAATTCCCAATTACGTCCTATCTTCTTCTTAGGGAAGAACAGCTTAGGTATCAATTTGGTGTTCTGTGTAAAGTCTCAGGGAGCCGTCCGTGTCCTCCCATCTGGCCTCGTCCACACTGGTTCTCTTGAAAGCTTGGGCTGTAATGATGCCCCTTGGCCATCACCCAGTCCCTGCCCCATCTCTTGTAATCTCTCTCCTTTTTGCTGCATCCCTGTCTTCCTCTGTCTTGATTTACTTGTTGTTGGTTTTCTGTTTCTTTGTTTGATTTGGTGGAAGACATAATCCCACGCTTCCTATGGAAAGGTTGTTGGGAGATTTTTAATGATTCCTCAATGTTAAAATGTCTATTTTTGTCTTGACACCCAACTAATATTTGTCTGAGCAAAACAGTCTAGATGAGAGAGAACTTCCCTGGAGGTCTGATGGCGTTTCTCCCTCGTCTTCTTA",
-                   'chr2' :
-                       b"TTCAAATGAACTTCTGTAATTGAAAAATTCATTTAAGAAATTACAAAATATAGTTGAAAGCTCTAACAATAGACTAAACCAAGCAGAAGAAAGAGGTTCAGAACTTGAAGACAAGTCTCTTATGAATTAACCCAGTCAGACAAAAATAAAGAAAAAAATTTTAAAAATGAACAGAGCTTTCAAGAAGTATGAGATTATGTAAAGTAACTGAACCTATGAGTCACAGGTATTCCTGAGGAAAAAGAAAAAGTGAGAAGTTTGGAAAAACTATTTGAGGAAGTAATTGGGGAAAACCTCTTTAGTCTTGCTAGAGATTTAGACATCTAAATGAAAGAGGCTCAAAGAATGCCAGGAAGATACATTGCAAGACAGACTTCATCAAGATATGTAGTCATCAGACTATCTAAAGTCAACATGAAGGAAAAAAATTCTAAAATCAGCAAGAGAAAAGCATACAGTCATCTATAAAGGAAATCCCATCAGAATAACAATGGGCTTCTCAGCAGAAACCTTACAAGCCAGAAGAGATTGGATCTAATTTTTGGACTTCTTAAAGAAAAAAAAACCTGTCAAACACGAATGTTATGCCCTGCTAAACTAAGCATCATAAATGAAGGGGAAATAAAGTCAAGTCTTTCCTGACAAGCAAATGCTAAGATAATTCATCATCACTAAACCAGTCCTATAAGAAATGCTCAAAAGAATTGTAAAAGTCAAAATTAAAGTTCAATACTCACCATCATAAATACACACAAAAGTACAAAACTCACAGGTTTTATAAAACAATTGAGACTACAGAGCAACTAGGTAAAAAATTAACATTACAACAGGAACAAAACCTCATATATCAATATTAACTTTGAATAAAAAGGGATTAAATTCCCCCACTTAAGAGATATAGATTGGCAGAACAGATTTAAAAACATGAACTAACTATATGCTGTTTACAAGAAACTCATTAATAAAGACATGAGTTCAGGTAAAGGGGTGGAAAAAGATGTTCTACGCAAACAGAAACCAAATGAGAGAAGGAGTAGCTATACTTATATCAGATAAAGCACACTTTAAATCAACAACAGTAAAATAAAACAAAGGAGGTCATCATACAATGATAAAAAGATCAATTCAGCAAGAAGATATAACCATCCTACTAAATACATATGCACCTAACACAAGACTACCCAGATTCATAAAACAAATACTACTAGACCTAAGAGGGATGAGAAATTACCTAATTGGTACAATGTACAATATTCTGATGATGGTTACACTAAAAGCCCATACTTTACTGCTACTCAATATATCCATGTAACAAATCTGCGCTTGTACTTCTAAATCTATAAAAAAATTAAAATTTAACAAAAGTAAATAAAACACATAGCTAAAACTAAAAAAGCAAAAACAAAAACTATGCTAAGTATTGGTAAAGATGTGGGGAAAAAAGTAAACTCTCAAATATTGCTAGTGGGAGTATAAATTGTTTTCCACTTTGGAAAACAATTTGGTAATTTCGTTTTTTTTTTTTTCTTTTCTCTTTTTTTTTTTTTTTTTTTTGCATGCCAGAAAAAAATATTTACAGTAACT",
-                   }
-
-    def setUp(self):
-        self.file=pysam.Fastafile( "ex1.fa" )
-
-    def testFetch(self):
-        for id, seq in self.mSequences.items():
-            self.assertEqual( seq, self.file.fetch( id ) )
-            for x in range( 0, len(seq), 10):
-                self.assertEqual( seq[x:x+10], self.file.fetch( id, x, x+10) )
-                # test x:end
-                self.assertEqual( seq[x:], self.file.fetch( id, x) )
-                # test 0:x
-                self.assertEqual( seq[:x], self.file.fetch( id, None, x) )
-
-        
-        # unknown sequence returns ""
-        self.assertEqual( b"", self.file.fetch("chr12") )
-
-    def testOutOfRangeAccess( self ):
-        '''test out of range access.'''
-        # out of range access returns an empty string
-        for contig, s in self.mSequences.iteritems():
-            self.assertEqual( self.file.fetch( contig, len(s), len(s)+1), b"" )
-
-        self.assertEqual( self.file.fetch( "chr3", 0 , 100), b"" ) 
-
-    def testFetchErrors( self ):
-        self.assertRaises( ValueError, self.file.fetch )
-        self.assertRaises( ValueError, self.file.fetch, "chr1", -1, 10 )
-        self.assertRaises( ValueError, self.file.fetch, "chr1", 20, 10 )
-
-    def testLength( self ):
-        self.assertEqual( len(self.file), 2 )
-        
-    def tearDown(self):
-        self.file.close()
-
-class TestAlignedRead(unittest.TestCase):
-    '''tests to check if aligned read can be constructed
-    and manipulated.
-    '''
-
-    def checkFieldEqual( self, read1, read2, exclude = []):
-        '''check if two reads are equal by comparing each field.'''
-
-        for x in ("qname", "seq", "flag",
-                  "rname", "pos", "mapq", "cigar",
-                  "mrnm", "mpos", "isize", "qual",
-                  "is_paired", "is_proper_pair",
-                  "is_unmapped", "mate_is_unmapped",
-                  "is_reverse", "mate_is_reverse",
-                  "is_read1", "is_read2",
-                  "is_secondary", "is_qcfail",
-                  "is_duplicate", "bin"):
-            if x in exclude: continue
-            self.assertEqual( getattr(read1, x), getattr(read2,x), "attribute mismatch for %s: %s != %s" % 
-                              (x, getattr(read1, x), getattr(read2,x)))
-    
-    def testEmpty( self ):
-        a = pysam.AlignedRead()
-        self.assertEqual( a.qname, None )
-        self.assertEqual( a.seq, None )
-        self.assertEqual( a.qual, None )
-        self.assertEqual( a.flag, 0 )
-        self.assertEqual( a.rname, 0 )
-        self.assertEqual( a.mapq, 0 )
-        self.assertEqual( a.cigar, None )
-        self.assertEqual( a.tags, [] )
-        self.assertEqual( a.mrnm, 0 )
-        self.assertEqual( a.mpos, 0 )
-        self.assertEqual( a.isize, 0 )
-
-    def buildRead( self ):
-        '''build an example read.'''
-        
-        a = pysam.AlignedRead()
-        a.qname = "read_12345"
-        a.seq="ACGT" * 10
-        a.flag = 0
-        a.rname = 0
-        a.pos = 20
-        a.mapq = 20
-        a.cigar = ( (0,10), (2,1), (0,9), (1,1), (0,20) )
-        a.mrnm = 0
-        a.mpos=200
-        a.isize=167
-        a.qual="1234" * 10
-        # todo: create tags
-        return a
-
-    def testUpdate( self ):
-        '''check if updating fields affects other variable length data
-        '''
-        a = self.buildRead()
-        b = self.buildRead()
-
-        # check qname
-        b.qname = "read_123"
-        self.checkFieldEqual( a, b, "qname" )
-        b.qname = "read_12345678"
-        self.checkFieldEqual( a, b, "qname" )
-        b.qname = "read_12345"
-        self.checkFieldEqual( a, b)
-
-        # check cigar
-        b.cigar = ( (0,10), )
-        self.checkFieldEqual( a, b, "cigar" )
-        b.cigar = ( (0,10), (2,1), (0,10) )
-        self.checkFieldEqual( a, b, "cigar" )
-        b.cigar = ( (0,10), (2,1), (0,9), (1,1), (0,20) )
-        self.checkFieldEqual( a, b)
-
-        # check seq 
-        b.seq = "ACGT"
-        self.checkFieldEqual( a, b, ("seq", "qual") )
-        b.seq = "ACGT" * 3
-        self.checkFieldEqual( a, b, ("seq", "qual") )
-        b.seq = "ACGT" * 10
-        self.checkFieldEqual( a, b, ("qual",))
-
-        # reset qual
-        b = self.buildRead()
-
-        # check flags:
-        for x in (
-            "is_paired", "is_proper_pair",
-            "is_unmapped", "mate_is_unmapped",
-            "is_reverse", "mate_is_reverse",
-            "is_read1", "is_read2",
-            "is_secondary", "is_qcfail",
-            "is_duplicate"):
-            setattr( b, x, True )
-            self.assertEqual( getattr(b, x), True )
-            self.checkFieldEqual( a, b, ("flag", x,) )
-            setattr( b, x, False )
-            self.assertEqual( getattr(b, x), False )
-            self.checkFieldEqual( a, b )
-
-    def testLargeRead( self ):
-        '''build an example read.'''
-        
-        a = pysam.AlignedRead()
-        a.qname = "read_12345"
-        a.seq="ACGT" * 200
-        a.flag = 0
-        a.rname = 0
-        a.pos = 20
-        a.mapq = 20
-        a.cigar = ( (0, 4 * 200), )
-        a.mrnm = 0
-        a.mpos=200
-        a.isize=167
-        a.qual="1234" * 200
-
-        return a
-
-    def testTagParsing( self ):
-        '''test for tag parsing
-
-        see http://groups.google.com/group/pysam-user-group/browse_thread/thread/67ca204059ea465a
-        '''
-        samfile=pysam.Samfile( "ex8.bam","rb" )
-
-        for entry in samfile:
-            before = entry.tags
-            entry.tags = entry.tags
-            after = entry.tags
-            self.assertEqual( after, before )
-
-    def testUpdateTlen( self ):
-        '''check if updating tlen works'''
-        a = self.buildRead()
-        oldlen = a.tlen
-        oldlen *= 2
-        a.tlen = oldlen
-        self.assertEqual( a.tlen, oldlen )
-
-    def testPositions( self ):
-        a = self.buildRead()
-        self.assertEqual( a.positions,
-                          [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 
-                                31, 32, 33, 34, 35, 36, 37, 38, 39, 
-                           40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 
-                           50, 51, 52, 53, 54, 55, 56, 57, 58, 59] )
-
-        self.assertEqual( a.aligned_pairs,
-                          [(0, 20), (1, 21), (2, 22), (3, 23), (4, 24), 
-                           (5, 25), (6, 26), (7, 27), (8, 28), (9, 29), 
-                           (None, 30), 
-                           (10, 31), (11, 32), (12, 33), (13, 34), (14, 35), 
-                           (15, 36), (16, 37), (17, 38), (18, 39), (19, None), 
-                           (20, 40), (21, 41), (22, 42), (23, 43), (24, 44), 
-                           (25, 45), (26, 46), (27, 47), (28, 48), (29, 49), 
-                           (30, 50), (31, 51), (32, 52), (33, 53), (34, 54), 
-                           (35, 55), (36, 56), (37, 57), (38, 58), (39, 59)] )
-
-        self.assertEqual( a.positions, [x[1] for x in a.aligned_pairs if x[0] != None and x[1] != None] )
-        # alen is the length of the aligned read in genome
-        self.assertEqual( a.alen, a.aligned_pairs[-1][0] + 1 )
-        # aend points to one beyond last aligned base in ref
-        self.assertEqual( a.positions[-1], a.aend - 1 )
-
-class TestDeNovoConstruction(unittest.TestCase):
-    '''check BAM/SAM file construction using ex3.sam
-    
-    (note these are +1 coordinates):
-    
-    read_28833_29006_6945      99      chr1    33      20      10M1D25M        =       200     167     AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG     <<<<<<<<<<<<<<<<<<<<<:<9/,&,22;;<<<     NM:i:1  RG:Z:L1
-    read_28701_28881_323b      147     chr2    88      30      35M     =       500     412     ACCTATATCTTGGCCTTGGCCGATGCGGCCTTGCA     <<<<<;<<<<7;:<<<6;<<<<<<<<<<<<7<<<<     MF:i:18 RG:Z:L2
-    '''
-
-    header = { 'HD': {'VN': '1.0'},
-               'SQ': [{'LN': 1575, 'SN': 'chr1'}, 
-                      {'LN': 1584, 'SN': 'chr2'}], }
-
-    bamfile = "ex6.bam"
-    samfile = "ex6.sam"
-
-    def checkFieldEqual( self, read1, read2, exclude = []):
-        '''check if two reads are equal by comparing each field.'''
-
-        for x in ("qname", "seq", "flag",
-                  "rname", "pos", "mapq", "cigar",
-                  "mrnm", "mpos", "isize", "qual",
-                  "bin",
-                  "is_paired", "is_proper_pair",
-                  "is_unmapped", "mate_is_unmapped",
-                  "is_reverse", "mate_is_reverse",
-                  "is_read1", "is_read2",
-                  "is_secondary", "is_qcfail",
-                  "is_duplicate"):
-            if x in exclude: continue
-            self.assertEqual( getattr(read1, x), getattr(read2,x), "attribute mismatch for %s: %s != %s" % 
-                              (x, getattr(read1, x), getattr(read2,x)))
-
-    def setUp( self ):
-
-        
-        a = pysam.AlignedRead()
-        a.qname = "read_28833_29006_6945"
-        a.seq="AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG"
-        a.flag = 99
-        a.rname = 0
-        a.pos = 32
-        a.mapq = 20
-        a.cigar = ( (0,10), (2,1), (0,25) )
-        a.mrnm = 0
-        a.mpos=199
-        a.isize=167
-        a.qual="<<<<<<<<<<<<<<<<<<<<<:<9/,&,22;;<<<"
-        a.tags = ( ("NM", 1),
-                   ("RG", "L1") )
-
-        b = pysam.AlignedRead()
-        b.qname = "read_28701_28881_323b"
-        b.seq="ACCTATATCTTGGCCTTGGCCGATGCGGCCTTGCA"
-        b.flag = 147
-        b.rname = 1
-        b.pos = 87
-        b.mapq = 30
-        b.cigar = ( (0,35), )
-        b.mrnm = 1
-        b.mpos=499
-        b.isize=412
-        b.qual="<<<<<;<<<<7;:<<<6;<<<<<<<<<<<<7<<<<"
-        b.tags = ( ("MF", 18),
-                   ("RG", "L2") )
-
-        self.reads = (a,b)
-
-    def testSAMWholeFile( self ):
-        
-        tmpfilename = "tmp_%i.sam" % id(self)
-
-        outfile = pysam.Samfile( tmpfilename, "wh", header = self.header )
-
-        for x in self.reads: outfile.write( x )
-        outfile.close()
-        
-        self.assertTrue( checkBinaryEqual( tmpfilename, self.samfile ),
-                         "mismatch when construction SAM file, see %s %s" % (tmpfilename, self.samfile))
-        
-        os.unlink( tmpfilename )
-
-    def testBAMPerRead( self ):
-        '''check if individual reads are binary equal.'''
-        infile = pysam.Samfile( self.bamfile, "rb")
-
-        others = list(infile)
-        for denovo, other in zip( others, self.reads):
-            self.checkFieldEqual( other, denovo )
-            self.assertEqual( other.compare( denovo ), 0 )
-
-    def testSAMPerRead( self ):
-        '''check if individual reads are binary equal.'''
-        infile = pysam.Samfile( self.samfile, "r")
-
-        others = list(infile)
-        for denovo, other in zip( others, self.reads):
-            self.checkFieldEqual( other, denovo )
-            self.assertEqual( other.compare( denovo), 0 )
-            
-    def testBAMWholeFile( self ):
-        
-        tmpfilename = "tmp_%i.bam" % id(self)
-
-        outfile = pysam.Samfile( tmpfilename, "wb", header = self.header )
-
-        for x in self.reads: outfile.write( x )
-        outfile.close()
-        
-        self.assertTrue( checkBinaryEqual( tmpfilename, self.bamfile ),
-                         "mismatch when construction BAM file, see %s %s" % (tmpfilename, self.bamfile))
-        
-        os.unlink( tmpfilename )
-
-
-class TestDoubleFetch(unittest.TestCase):
-    '''check if two iterators on the same bamfile are independent.'''
-    
-    def testDoubleFetch( self ):
-
-        samfile1 = pysam.Samfile('ex1.bam', 'rb')
-
-        for a,b in zip(samfile1.fetch(), samfile1.fetch()):
-            self.assertEqual( a.compare( b ), 0 )
-
-    def testDoubleFetchWithRegion( self ):
-
-        samfile1 = pysam.Samfile('ex1.bam', 'rb')
-        chr, start, stop = 'chr1', 200, 3000000
-        self.assertTrue(len(list(samfile1.fetch ( chr, start, stop))) > 0) #just making sure the test has something to catch
-
-        for a,b in zip(samfile1.fetch( chr, start, stop), samfile1.fetch( chr, start, stop)):
-            self.assertEqual( a.compare( b ), 0 ) 
-
-    def testDoubleFetchUntilEOF( self ):
-
-        samfile1 = pysam.Samfile('ex1.bam', 'rb')
-
-        for a,b in zip(samfile1.fetch( until_eof = True), 
-                       samfile1.fetch( until_eof = True )):
-            self.assertEqual( a.compare( b), 0 )
-
-class TestRemoteFileFTP(unittest.TestCase):
-    '''test remote access.
-
-    '''
-
-    # Need to find an ftp server without password on standard
-    # port.
-
-    url = "ftp://ftp.sanger.ac.uk/pub/rd/humanSequences/CV.bam"
-    region = "1:1-1000"
-
-    def testFTPView( self ):
-        return
-        result = pysam.view( self.url, self.region )
-        self.assertEqual( len(result), 36 )
-        
-    def testFTPFetch( self ):
-        return
-        samfile = pysam.Samfile(self.url, "rb")  
-        result = list(samfile.fetch( region = self.region ))
-        self.assertEqual( len(result), 36 )
-
-class TestRemoteFileHTTP( unittest.TestCase):
-
-    url = "http://genserv.anat.ox.ac.uk/downloads/pysam/test/ex1.bam"
-    region = "chr1:1-1000"
-    local = "ex1.bam"
-
-    def testView( self ):
-        samfile_local = pysam.Samfile(self.local, "rb")  
-        ref = list(samfile_local.fetch( region = self.region ))
-        
-        result = pysam.view( self.url, self.region )
-        self.assertEqual( len(result), len(ref) )
-        
-    def testFetch( self ):
-        samfile = pysam.Samfile(self.url, "rb")  
-        result = list(samfile.fetch( region = self.region ))
-        samfile_local = pysam.Samfile(self.local, "rb")  
-        ref = list(samfile_local.fetch( region = self.region ))
-
-        self.assertEqual( len(ref), len(result) )
-        for x, y in zip(result, ref):
-            self.assertEqual( x.compare( y ), 0 )
-
-    def testFetchAll( self ):
-        samfile = pysam.Samfile(self.url, "rb")  
-        result = list(samfile.fetch())
-        samfile_local = pysam.Samfile(self.local, "rb")  
-        ref = list(samfile_local.fetch() )
-
-        self.assertEqual( len(ref), len(result) )
-        for x, y in zip(result, ref):
-            self.assertEqual( x.compare( y ), 0 )
-
-class TestLargeOptValues( unittest.TestCase ):
-
-    ints = ( 65536, 214748, 2147484, 2147483647 )
-    floats = ( 65536.0, 214748.0, 2147484.0 )
-
-    def check( self, samfile ):
-        
-        i = samfile.fetch()
-        for exp in self.ints:
-            rr = i.next()
-            obs = rr.opt("ZP")
-            self.assertEqual( exp, obs, "expected %s, got %s\n%s" % (str(exp), str(obs), str(rr)))
-
-        for exp in [ -x for x in self.ints ]:
-            rr = i.next()
-            obs = rr.opt("ZP")
-            self.assertEqual( exp, obs, "expected %s, got %s\n%s" % (str(exp), str(obs), str(rr)))
-
-        for exp in self.floats:
-            rr = i.next()
-            obs = rr.opt("ZP")
-            self.assertEqual( exp, obs, "expected %s, got %s\n%s" % (str(exp), str(obs), str(rr)))
-
-        for exp in [ -x for x in self.floats ]:
-            rr = i.next()
-            obs = rr.opt("ZP")
-            self.assertEqual( exp, obs, "expected %s, got %s\n%s" % (str(exp), str(obs), str(rr)))
-
-    def testSAM( self ):
-        samfile = pysam.Samfile("ex10.sam", "r")
-        self.check( samfile )
-
-    def testBAM( self ):
-        samfile = pysam.Samfile("ex10.bam", "rb")
-        self.check( samfile )
-
-# class TestSNPCalls( unittest.TestCase ):
-#     '''test pysam SNP calling ability.'''
-
-#     def checkEqual( self, a, b ):
-#         for x in ("reference_base", 
-#                   "pos",
-#                   "genotype",
-#                   "consensus_quality",
-#                   "snp_quality",
-#                   "mapping_quality",
-#                   "coverage" ):
-#             self.assertEqual( getattr(a, x), getattr(b,x), "%s mismatch: %s != %s\n%s\n%s" % 
-#                               (x, getattr(a, x), getattr(b,x), str(a), str(b)))
-
-#     def testAllPositionsViaIterator( self ):
-#         samfile = pysam.Samfile( "ex1.bam", "rb")  
-#         fastafile = pysam.Fastafile( "ex1.fa" )
-#         try: 
-#             refs = [ x for x in pysam.pileup( "-c", "-f", "ex1.fa", "ex1.bam" ) if x.reference_base != "*"]
-#         except pysam.SamtoolsError:
-#             pass
-
-#         i = samfile.pileup( stepper = "samtools", fastafile = fastafile )
-#         calls = list(pysam.IteratorSNPCalls(i))
-#         for x,y in zip( refs, calls ):
-#             self.checkEqual( x, y )
-
-#     def testPerPositionViaIterator( self ):
-#         # test pileup for each position. This is a slow operation
-#         # so this test is disabled 
-#         return
-#         samfile = pysam.Samfile( "ex1.bam", "rb")  
-#         fastafile = pysam.Fastafile( "ex1.fa" )
-#         for x in pysam.pileup( "-c", "-f", "ex1.fa", "ex1.bam" ):
-#             if x.reference_base == "*": continue
-#             i = samfile.pileup( x.chromosome, x.pos, x.pos+1,
-#                                 fastafile = fastafile,
-#                                 stepper = "samtools" )
-#             z = [ zz for zz in pysam.IteratorSamtools(i) if zz.pos == x.pos ]
-#             self.assertEqual( len(z), 1 )
-#             self.checkEqual( x, z[0] )
-
-#     def testPerPositionViaCaller( self ):
-#         # test pileup for each position. This is a fast operation
-#         samfile = pysam.Samfile( "ex1.bam", "rb")  
-#         fastafile = pysam.Fastafile( "ex1.fa" )
-#         i = samfile.pileup( stepper = "samtools", fastafile = fastafile )
-#         caller = pysam.SNPCaller( i )
-
-#         for x in pysam.pileup( "-c", "-f", "ex1.fa", "ex1.bam" ):
-#             if x.reference_base == "*": continue
-#             call = caller.call( x.chromosome, x.pos )
-#             self.checkEqual( x, call )
-
-# class TestIndelCalls( unittest.TestCase ):
-#     '''test pysam indel calling.'''
-
-#     def checkEqual( self, a, b ):
-
-#         for x in ("pos",
-#                   "genotype",
-#                   "consensus_quality",
-#                   "snp_quality",
-#                   "mapping_quality",
-#                   "coverage",
-#                   "first_allele",
-#                   "second_allele",
-#                   "reads_first",
-#                   "reads_second",
-#                   "reads_diff"):
-#             if b.genotype == "*/*" and x == "second_allele":
-#                 # ignore test for second allele (positions chr2:439 and chr2:1512)
-#                 continue
-#             self.assertEqual( getattr(a, x), getattr(b,x), "%s mismatch: %s != %s\n%s\n%s" % 
-#                               (x, getattr(a, x), getattr(b,x), str(a), str(b)))
-
-#     def testAllPositionsViaIterator( self ):
-
-#         samfile = pysam.Samfile( "ex1.bam", "rb")  
-#         fastafile = pysam.Fastafile( "ex1.fa" )
-#         try: 
-#             refs = [ x for x in pysam.pileup( "-c", "-f", "ex1.fa", "ex1.bam" ) if x.reference_base == "*"]
-#         except pysam.SamtoolsError:
-#             pass
-
-#         i = samfile.pileup( stepper = "samtools", fastafile = fastafile )
-#         calls = [ x for x in list(pysam.IteratorIndelCalls(i)) if x != None ]
-#         for x,y in zip( refs, calls ):
-#             self.checkEqual( x, y )
-
-#     def testPerPositionViaCaller( self ):
-#         # test pileup for each position. This is a fast operation
-#         samfile = pysam.Samfile( "ex1.bam", "rb")  
-#         fastafile = pysam.Fastafile( "ex1.fa" )
-#         i = samfile.pileup( stepper = "samtools", fastafile = fastafile )
-#         caller = pysam.IndelCaller( i )
-
-#         for x in pysam.pileup( "-c", "-f", "ex1.fa", "ex1.bam" ):
-#             if x.reference_base != "*": continue
-#             call = caller.call( x.chromosome, x.pos )
-#             self.checkEqual( x, call )
-
-class TestLogging( unittest.TestCase ):
-    '''test around bug issue 42,
-
-    failed in versions < 0.4
-    '''
-
-    def check( self, bamfile, log ):
-
-        if log:
-            logger = logging.getLogger('franklin')
-            logger.setLevel(logging.INFO)
-            formatter = logging.Formatter('%(asctime)s %(levelname)s %(message)s')
-            log_hand  = logging.FileHandler('log.txt')
-            log_hand.setFormatter(formatter)
-            logger.addHandler(log_hand)
-
-        bam  = pysam.Samfile(bamfile, 'rb')
-        cols = bam.pileup()
-        self.assert_( True )
-
-    def testFail1( self ):
-        self.check( "ex9_fail.bam", False )
-        self.check( "ex9_fail.bam", True )
-
-    def testNoFail1( self ):
-        self.check( "ex9_nofail.bam", False )
-        self.check( "ex9_nofail.bam", True )
-
-    def testNoFail2( self ):
-        self.check( "ex9_nofail.bam", True )
-        self.check( "ex9_nofail.bam", True )
-        
-# TODOS
-# 1. finish testing all properties within pileup objects
-# 2. check exceptions and bad input problems (missing files, optional fields that aren't present, etc...)
-# 3. check: presence of sequence
-
-class TestSamfileUtilityFunctions( unittest.TestCase ):
-
-    def testCount( self ):
-
-        samfile = pysam.Samfile( "ex1.bam", "rb" )
-
-        for contig in ("chr1", "chr2" ):
-            for start in xrange( 0, 2000, 100 ):
-                end = start + 1
-                self.assertEqual( len( list( samfile.fetch( contig, start, end ) ) ),
-                                  samfile.count( contig, start, end ) )
-
-                # test empty intervals
-                self.assertEqual( len( list( samfile.fetch( contig, start, start ) ) ),
-                                  samfile.count( contig, start, start ) )
-
-                # test half empty intervals
-                self.assertEqual( len( list( samfile.fetch( contig, start ) ) ),
-                                  samfile.count( contig, start ) )
-
-    def testMate( self ):
-        '''test mate access.'''
-
-        readnames = [ x.split(b"\t")[0] for x in open( "ex1.sam", "rb" ).readlines() ]
-        if sys.version_info[0] >= 3:
-            readnames = [ name.decode('ascii') for name in readnames ]
-            
-        counts = collections.defaultdict( int )
-        for x in readnames: counts[x] += 1
-
-        samfile = pysam.Samfile( "ex1.bam", "rb" )
-        for read in samfile.fetch():
-            if not read.is_paired:
-                self.assertRaises( ValueError, samfile.mate, read )
-            elif read.mate_is_unmapped:
-                self.assertRaises( ValueError, samfile.mate, read )
-            else:
-                if counts[read.qname] == 1:
-                    self.assertRaises( ValueError, samfile.mate, read )
-                else:
-                    mate = samfile.mate( read )
-                    self.assertEqual( read.qname, mate.qname )
-                    self.assertEqual( read.is_read1, mate.is_read2 )
-                    self.assertEqual( read.is_read2, mate.is_read1 )
-                    self.assertEqual( read.pos, mate.mpos )
-                    self.assertEqual( read.mpos, mate.pos )
-
-    def testIndexStats( self ):
-        '''test if total number of mapped/unmapped reads is correct.'''
-
-        samfile = pysam.Samfile( "ex1.bam", "rb" )
-        self.assertEqual( samfile.mapped, 3235 )
-        self.assertEqual( samfile.unmapped, 35 )
-
-class TestSamtoolsProxy( unittest.TestCase ):
-    '''tests for sanity checking access to samtools functions.'''
-
-    def testIndex( self ):
-        self.assertRaises( IOError, pysam.index, "missing_file" )
-
-    def testView( self ):
-        # note that view still echos "open: No such file or directory"
-        self.assertRaises( pysam.SamtoolsError, pysam.view, "missing_file" )
-
-    def testSort( self ):
-        self.assertRaises( pysam.SamtoolsError, pysam.sort, "missing_file" )
-
-class TestSamfileIndex( unittest.TestCase):
-    
-    def testIndex( self ):
-        samfile = pysam.Samfile( "ex1.bam", "rb" )
-        index = pysam.IndexedReads( samfile )
-        index.build()
-
-        reads = collections.defaultdict( int )
-
-        for read in samfile: reads[read.qname] += 1
-            
-        for qname, counts in reads.iteritems():
-            found = list(index.find( qname ))
-            self.assertEqual( len(found), counts )
-            for x in found: self.assertEqual( x.qname, qname )
-            
-
-if __name__ == "__main__":
-    # build data files
-    print ("building data files")
-    subprocess.call( "make", shell=True)
-    print ("starting tests")
-    unittest.main()
-    print ("completed tests")
diff --git a/save/pysam_test_stdin.py b/save/pysam_test_stdin.py
deleted file mode 100644
index 84e7366..0000000
--- a/save/pysam_test_stdin.py
+++ /dev/null
@@ -1,10 +0,0 @@
-# usage: cat pysam_ex1.bam | python pysam_test_stdin.pyx
-
-import pysam
-
-samfile = pysam.Samfile( "-", "rb" )
-
-# set up the modifying iterators
-l = list(samfile.fetch( until_eof = True ))
-assert len(l) == 3270
-
diff --git a/save/pysam_testpp.py b/save/pysam_testpp.py
deleted file mode 100644
index 4e4da2e..0000000
--- a/save/pysam_testpp.py
+++ /dev/null
@@ -1,13 +0,0 @@
-import pp
-import pysam
-
-def sortBam(bam_filepath, out_prefix):
-    pysam.sort(bam_filepath, out_prefix)
-
-job_server = pp.Server(ncpus=2, ppservers = (), secret='secret')
-
-job1 = job_server.submit(sortBam, ('ex1.bam', 'tmp1_'), modules=('pysam',))
-job2 = job_server.submit(sortBam, ('ex2.bam', 'tmp2_'), modules=('pysam',))
-
-result1 = job1()
-result2 = job2()
diff --git a/save/segfault_tests.py b/save/segfault_tests.py
deleted file mode 100755
index ff32fec..0000000
--- a/save/segfault_tests.py
+++ /dev/null
@@ -1,37 +0,0 @@
-#!/usr/bin/env python
-'''unit testing code for pysam.'''
-
-import pysam
-import unittest
-import os
-import itertools
-import subprocess
-import shutil
-
-class TestExceptions(unittest.TestCase):
-
-    def setUp(self):
-        self.samfile=pysam.Samfile( "ex1.bam","rb" )
-
-    def testOutOfRangeNegativeNewFormat(self):
-        self.assertRaises( ValueError, self.samfile.fetch, "chr1", 5, -10 )
-        self.assertRaises( ValueError, self.samfile.fetch, "chr1", 5, 0 )
-        self.assertRaises( ValueError, self.samfile.fetch, "chr1", -5, -10 )
-
-    def testOutOfRangeNegativeOldFormat(self):
-        self.assertRaises( ValueError, self.samfile.fetch, "chr1:-5-10" )
-        self.assertRaises( ValueError, self.samfile.fetch, "chr1:-5-0" )
-        self.assertRaises( ValueError, self.samfile.fetch, "chr1:-5--10" )
-
-    def testOutOfRangeLargeNewFormat(self):
-        self.assertRaises( ValueError, self.samfile.fetch, "chr1", 99999999999999999, 999999999999999999 )
-
-    def testOutOfRangeLargeOldFormat(self):
-        self.assertRaises( ValueError, self.samfile.fetch, "chr1:99999999999999999-999999999999999999" )
-
-    def tearDown(self):
-        self.samfile.close()
-
-if __name__ == "__main__":
-    unittest.main()
-
diff --git a/save/vcf_test.py b/save/vcf_test.py
deleted file mode 100644
index 396edc1..0000000
--- a/save/vcf_test.py
+++ /dev/null
@@ -1,123 +0,0 @@
-#!/usr/bin/env python
-'''unit testing code for pysam.
-
-Execute in the :file:`tests` directory as it requires the Makefile
-and data files located there.
-'''
-
-import sys, os, shutil, gzip
-import pysam
-import unittest
-import itertools
-import subprocess
-
-class TestVCFIterator( unittest.TestCase ):
-
-    filename = "example.vcf40.gz"
-    columns = ("contig", "pos", "id", 
-               "ref", "alt", "qual", 
-               "filter", "info", "format" )
-
-    def testRead( self ):
-        self.vcf = pysam.VCF()
-        self.vcf.connect( self.filename )
-        
-        for x in self.vcf.fetch():
-            print str(x)
-            print x.pos
-            print x.alt
-            print x.id
-            print x.qual
-            print x.filter
-            print x.info
-            print x.format
-
-            for s in x.samples:
-                print s, x[s]
-        
-if __name__ == "__main__":
-
-    unittest.main()
-
-
-def Test():
-
-    vcf33 = """##fileformat=VCFv3.3
-##fileDate=20090805
-##source=myImputationProgramV3.1
-##reference=1000GenomesPilot-NCBI36
-##phasing=partial
-##INFO=NS,1,Integer,"Number of Samples With Data"
-##INFO=DP,1,Integer,"Total Depth"
-##INFO=AF,-1,Float,"Allele Frequency"
-##INFO=AA,1,String,"Ancestral Allele"
-##INFO=DB,0,Flag,"dbSNP membership, build 129"
-##INFO=H2,0,Flag,"HapMap2 membership"
-##FILTER=q10,"Quality below 10"
-##FILTER=s50,"Less than 50% of samples have data"
-##FORMAT=GT,1,String,"Genotype"
-##FORMAT=GQ,1,Integer,"Genotype Quality"
-##FORMAT=DP,1,Integer,"Read Depth"
-##FORMAT=HQ,2,Integer,"Haplotype Quality"
-#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA00001\tNA00002\tNA00003
-20\t14370\trs6054257\tG\tA\t29\t0\tNS=3;DP=14;AF=0.5;DB;H2\tGT:GQ:DP:HQ\t0|0:48:1:51,51\t1|0:48:8:51,51\t1/1:43:5:-1,-1
-17\t17330\t.\tT\tA\t3\tq10\tNS=3;DP=11;AF=0.017\tGT:GQ:DP:HQ\t0|0:49:3:58,50\t0|1:3:5:65,3\t0/0:41:3:-1,-1
-20\t1110696\trs6040355\tA\tG,T\t67\t0\tNS=2;DP=10;AF=0.333,0.667;AA=T;DB\tGT:GQ:DP:HQ\t1|2:21:6:23,27\t2|1:2:0:18,2\t2/2:35:4:-1,-1
-17\t1230237\t.\tT\t.\t47\t0\tNS=3;DP=13;AA=T\tGT:GQ:DP:HQ\t0|0:54:7:56,60\t0|0:48:4:51,51\t0/0:61:2:-1,-1
-20\t1234567\tmicrosat1\tG\tD4,IGA\t50\t0\tNS=3;DP=9;AA=G\tGT:GQ:DP\t0/1:35:4\t0/2:17:2\t1/1:40:3"""
-
-    vcf40 = """##fileformat=VCFv4.0
-##fileDate=20090805
-##source=myImputationProgramV3.1
-##reference=1000GenomesPilot-NCBI36
-##phasing=partial
-##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
-##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
-##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
-##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
-##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
-##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
-##FILTER=<ID=q10,Description="Quality below 10">
-##FILTER=<ID=s50,Description="Less than 50% of samples have data">
-##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
-##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
-##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
-##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
-#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA00001\tNA00002\tNA00003
-M\t1230237\t.\tT\t.\t47\tPASS\tNS=3;DP=13;AA=T\tGT:GQ:DP:HQ\t0|0:54:7:56,60\t0|0:48:4:51,51\t0/0:61:2
-20\t1234567\tmicrosat1\tGTCT\tG,GTACT\t50\tPASS\tNS=3;DP=9;AA=G\tGT:GQ:DP\t0/1:35:4\t0/2:17:2\t1/1:40:3
-17\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB;H2\tGT:GQ:DP:HQ\t0|0:48:1:51,51\t1|0:48:8:51,51\t1/1:43:5:.,.
-20\t17330\t.\tT\tA\t3\tq10\tNS=3;DP=11;AF=0.017\tGT:GQ:DP:HQ\t0|0:49:3:58,50\t0|1:3:5:65,3\t0/0:41:3
-20\t1110696\trs6040355\tA\tG,T\t67\tPASS\tNS=2;DP=10;AF=0.333,0.667;AA=T;DB\tGT:GQ:DP:HQ\t1|2:21:6:23,27\t2|1:2:0:18,2\t2/2:35:4"""
-
-    if False:
-        print "Parsing v3.3 file:"
-        print vcf33
-        vcf = VCFFile()
-        lines = [data for data in vcf.parse( (line+"\n" for line in vcf33.split('\n') ) )]
-        print "Writing v3.3 file:"
-        vcf.write( sys.stdout, lines )
-
-    if False:
-        print "Parsing v4.0 file:"
-        print vcf40
-        vcf = VCFFile()
-        lines = [data for data in vcf.parse( (line+"\n" for line in vcf40.split('\n') ) )]
-        print "Writing v4.0 file:"
-        vcf.write( sys.stdout, lines )
-
-    if True:
-        print "Parsing v3.3 file:"
-        print vcf33
-        vcf = sortedVCFFile()
-        lines = [data for data in vcf.parse( (line+"\n" for line in vcf33.split('\n') ) )]
-        print "Writing v3.3 file:"
-        vcf.write( sys.stdout, lines )
-
-    if True:
-        print "Parsing v4.0 file:"
-        print vcf40
-        vcf = sortedVCFFile()
-        lines = [data for data in vcf.parse( (line+"\n" for line in vcf40.split('\n') ) )]
-        print "Writing v4.0 file:"
-        vcf.write( sys.stdout, lines )
diff --git a/tests/AlignedSegment_test.py b/tests/AlignedSegment_test.py
index 332727901513fce4a4e52b947180d08364195faf..5a759a5ba357d58246b64936f8eb73dae381508a 100644
--- a/tests/AlignedSegment_test.py
+++ b/tests/AlignedSegment_test.py
@@ -430,6 +430,42 @@ class TestAlignedSegment(ReadTest):
               (6, 27), (7, 28),
               (8, 29), (9, 30)])
  
+    def test_get_aligned_pairs_lowercase_md(self):
+        a = self.build_read()
+        a.query_sequence = "A" * 10
+        a.cigarstring = "10M"
+        a.set_tag("MD", "5g4")
+        self.assertEqual(
+            a.get_aligned_pairs(with_seq=True),
+            [(0, 20, 'A'),
+             (1, 21, 'A'),
+             (2, 22, 'A'),
+             (3, 23, 'A'),
+             (4, 24, 'A'),
+             (5, 25, 'g'),
+             (6, 26, 'A'),
+             (7, 27, 'A'),
+             (8, 28, 'A'),
+             (9, 29, 'A')])
+
+    def test_get_aligned_pairs_uppercase_md(self):
+        a = self.build_read()
+        a.query_sequence = "A" * 10
+        a.cigarstring = "10M"
+        a.set_tag("MD", "5G4")
+        self.assertEqual(
+            a.get_aligned_pairs(with_seq=True),
+            [(0, 20, 'A'),
+             (1, 21, 'A'),
+             (2, 22, 'A'),
+             (3, 23, 'A'),
+             (4, 24, 'A'),
+             (5, 25, 'g'),
+             (6, 26, 'A'),
+             (7, 27, 'A'),
+             (8, 28, 'A'),
+             (9, 29, 'A')])
+        
      def testNoSequence(self):
          '''issue 176: retrieving length without query sequence
          with soft-clipping.
diff --git a/tests/AlignmentFilePileup_test.py b/tests/AlignmentFilePileup_test.py
index 32738fe713d0088eec10f2eccee0ea59f6b0e5ad..6872b019c18cb4cd27597ce4f071af071b3f7d61 100644
--- a/tests/AlignmentFilePileup_test.py
+++ b/tests/AlignmentFilePileup_test.py
@@ -311,7 +311,7 @@ class TestIteratorColumn2(unittest.TestCase):
      def testTruncate(self):
          '''see issue 107.'''
          # note that ranges in regions start from 1
-        p = self.samfile.pileup(region='chr1:170:172', truncate=True)
+        p = self.samfile.pileup(region='chr1:170-172', truncate=True)
          columns = [x.reference_pos for x in p]
          self.assertEqual(len(columns), 3)
          self.assertEqual(columns, [169, 170, 171])
diff --git a/tests/AlignmentFile_test.py b/tests/AlignmentFile_test.py
index 73a7b0a81067eba7058c89e6288030a2d2eb2a62..0068bb7ac5e8f38468dc23a58acdb02632f60020 100644
--- a/tests/AlignmentFile_test.py
+++ b/tests/AlignmentFile_test.py
@@ -1122,6 +1122,39 @@ class TestContextManager(unittest.TestCase):
          self.assertEqual(samfile.closed, True)
  
  
+class TestMultiThread(unittest.TestCase):
+
+    def testSingleThreadEqualsMultithread(self):
+        input_bam = os.path.join(BAM_DATADIR, 'ex1.bam')
+        single_thread_out = get_temp_filename("tmp_single.bam")
+        multi_thread_out = get_temp_filename("tmp_multi.bam")
+        with pysam.AlignmentFile(input_bam,
+                                 'rb') as samfile:
+            reads = [r for r in samfile]
+            with pysam.AlignmentFile(single_thread_out,
+                                     mode='wb',
+                                     template=samfile,
+                                     threads=1) as single_out:
+                [single_out.write(r) for r in reads]
+            with pysam.AlignmentFile(multi_thread_out,
+                                     mode='wb',
+                                     template=samfile,
+                                     threads=2) as multi_out:
+                [multi_out.write(r) for r in reads]
+        with pysam.AlignmentFile(input_bam) as inp, \
+            pysam.AlignmentFile(single_thread_out) as single, \
+            pysam.AlignmentFile(multi_thread_out) as multi:
+            for r1, r2, r3 in zip(inp, single, multi):
+                assert r1.to_string() == r2.to_string() == r3.to_string()
+
+    def testNoMultiThreadingWithIgnoreTruncation(self):
+        # pass the callable and its arguments so assertRaises can invoke it
+        self.assertRaises(
+            ValueError, pysam.AlignmentFile,
+            os.path.join(BAM_DATADIR, 'ex1.bam'),
+            threads=2, ignore_truncation=True)
+
+
  class TestExceptions(unittest.TestCase):
  
      def setUp(self):
@@ -1226,6 +1259,17 @@ class TestWrongFormat(unittest.TestCase):
                            'rb')
  
  
+class TestRegionParsing(unittest.TestCase):
+
+    def test_dash_in_chr(self):
+        with pysam.AlignmentFile(
+                os.path.join(BAM_DATADIR, "example_dash_in_chr.bam")) as inf:
+            self.assertEqual(len(list(inf.fetch(contig="chr-1"))), 1)
+            self.assertEqual(len(list(inf.fetch(contig="chr2"))), 1)
+            self.assertEqual(len(list(inf.fetch(region="chr-1"))), 1)
+            self.assertEqual(len(list(inf.fetch(region="chr2"))), 1)
+
+
  class TestDeNovoConstruction(unittest.TestCase):
  
      '''check BAM/SAM file construction using ex6.sam
@@ -2304,19 +2348,18 @@ class TestLargeCigar(unittest.TestCase):
          with open(fn_reference, "w") as outf:
              outf.write(">chr1\n{seq}\n>chr2\n{seq}\n".format(
                  seq=s))
-                
+
          if mode == "bam":
              write_mode = "wb"
          elif mode == "sam":
              write_mode = "w"
          elif mode == "cram":
              write_mode = "wc"
-        
+
          with pysam.AlignmentFile(fn, write_mode,
                                   header=self.header,
                                   reference_filename=fn_reference) as outf:
              outf.write(read)
-
          with pysam.AlignmentFile(fn) as inf:
              ref_read = next(inf)
  
@@ -2328,7 +2371,7 @@ class TestLargeCigar(unittest.TestCase):
  
          os.unlink(fn)
          os.unlink(fn_reference)
-            
+
      def test_reading_writing_sam(self):
          read = self.build_read()
          self.check_read(read, mode="sam")
@@ -2337,7 +2380,15 @@ class TestLargeCigar(unittest.TestCase):
          read = self.build_read()
          self.check_read(read, mode="bam")
  
+    @unittest.skip("fails on linux - https issue?")
      def test_reading_writing_cram(self):
+        # The following test fails with htslib 1.9, but worked with previous versions.
+        # Error is:
+        # [W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/ac9fac7c3e9c476f74f1d0e47abc8be2": Input/output error
+        # Error can be reproduced using samtools 1.9 command line.
+        # Could be a conda configuration issue, see
+        # https://github.com/bioconda/bioconda-recipes/issues/9056
+        return
          read = self.build_read()
          self.check_read(read, mode="cram")
          
diff --git a/tests/VariantFile_test.py b/tests/VariantFile_test.py
index eedcc9ac48d20cbf3d4d975892d8261c00301692..88bd34ff4a8674033fc102362e3ec1f9ec613c50 100644
--- a/tests/VariantFile_test.py
+++ b/tests/VariantFile_test.py
@@ -592,6 +592,37 @@ class TestSettingRecordValues(unittest.TestCase):
              sample["GT"] = sample["GT"]
  
  
+class TestMultiThreading(unittest.TestCase):
+
+    filename = os.path.join(CBCF_DATADIR, "example_vcf42.vcf.gz")
+
+    def testSingleThreadEqualsMultithreadResult(self):
+        with pysam.VariantFile(self.filename) as inf:
+            header = inf.header
+            single = [r for r in inf]
+        with pysam.VariantFile(self.filename, threads=2) as inf:
+            multi = [r for r in inf]
+        for r1, r2 in zip(single, multi):
+            assert str(r1) == str(r2)
+
+        bcf_out = get_temp_filename(suffix=".bcf")
+        with pysam.VariantFile(bcf_out, mode='wb',
+                               header=header,
+                               threads=2) as out:
+            for r in single:
+                out.write(r)
+        with pysam.VariantFile(bcf_out) as inf:
+            multi_out = [r for r in inf]
+        for r1, r2 in zip(single, multi_out):
+            assert str(r1) == str(r2)
+
+    def testNoMultiThreadingWithIgnoreTruncation(self):
+        with self.assertRaises(ValueError):
+            pysam.VariantFile(self.filename,
+                              threads=2,
+                              ignore_truncation=True)
+
+
  class TestSubsetting(unittest.TestCase):
  
      filename = "example_vcf42.vcf.gz"
diff --git a/tests/pysam_data/Makefile b/tests/pysam_data/Makefile
index b48a27c027e0ae5dc321e746485c7dda35a3d23e..bb38a4f4548327c3ae96e4cce624b271adea6f30 100644
--- a/tests/pysam_data/Makefile
+++ b/tests/pysam_data/Makefile
@@ -21,7 +21,8 @@ all: ex1.pileup.gz \
         faidx_empty_seq.fq.gz \
         ex1.fa.gz ex1.fa.gz.csi \
         ex1_csi.bam \
-       example_reverse_complement.bam
+       example_reverse_complement.bam \
+       example_dash_in_chr.bam
  
  # ex2.sam - as ex1.sam, but with header
  ex2.sam.gz: ex1.bam ex1.bam.bai
diff --git a/tests/pysam_data/example_dash_in_chr.sam b/tests/pysam_data/example_dash_in_chr.sam
new file mode 100644
index 0000000..c2a0a1e
--- /dev/null
+++ b/tests/pysam_data/example_dash_in_chr.sam
@@ -0,0 +1,7 @@
+@HD    VN:1.0
+@SQ    SN:chr-1        LN:1575
+@SQ    SN:chr2 LN:1584
+@CO    this is a comment
+@CO    this is another comment
+read_28833_29006_6945  99      chr-1   33      20      10M1D25M        =       200     167     AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG     <<<<<<<<<<<<<<<<<<<<<:<9/,&,22;;<<<     NM:i:1  RG:Z:L1 PG:Z:P1 XT:A:U
+read_28701_28881_323b  147     chr2    88      30      35M     =       500     412     ACCTATATCTTGGCCTTGGCCGATGCGGCCTTGCA     <<<<<;<<<<7;:<<<6;<<<<<<<<<<<<7<<<<     MF:i:18 RG:Z:L2 PG:Z:P2 XT:A:R
diff --git a/tests/tabix_test.py b/tests/tabix_test.py
index 265c60a41868b2736c506b63f830a48bcee4c5a4..fdd39b02624275b10a889fd6b50ce12e02fe993e 100644
--- a/tests/tabix_test.py
+++ b/tests/tabix_test.py
@@ -1250,6 +1250,19 @@ class TestContextManager(unittest.TestCase):
          self.assertEqual(tabixfile.closed, True)
  
  
+class TestMultithreadTabixFile(unittest.TestCase):
+
+    filename = os.path.join(TABIX_DATADIR, "example.gtf.gz")
+
+    def testMultithreadEqualsSinglethread(self):
+        with pysam.TabixFile(self.filename) as tabixfile:
+            single = [r for r in tabixfile.fetch()]
+        with pysam.TabixFile(self.filename, threads=2) as tabixfile:
+            multi = [r for r in tabixfile.fetch()]
+        for r1, r2 in zip(single, multi):
+            assert str(r1) == str(r2)
+
+
  if __name__ == "__main__":
      subprocess.call("make -C %s" % TABIX_DATADIR, shell=True)
      unittest.main()
bcftools/vcfstats.c.pysam.c		patch \| blob \| history
bcftools/vcfview.c		patch \| blob \| history
bcftools/vcfview.c.pysam.c		patch \| blob \| history
bcftools/vcmp.c		patch \| blob \| history
bcftools/vcmp.c.pysam.c		patch \| blob \| history
bcftools/vcmp.h		patch \| blob \| history
bcftools/version.h		patch \| blob \| history
benchmark/cython_flagstat.py	[deleted file]	patch \| blob \| history
benchmark/python_flagstat.py	[deleted file]	patch \| blob \| history
buildwheels.sh	[deleted file]	patch \| blob \| history
ci/conda-recipe/build.sh	[deleted file]	patch \| blob \| history
ci/conda-recipe/meta.yaml	[deleted file]	patch \| blob \| history
ci/install-CGAT-tools.sh	[deleted file]	patch \| blob \| history
devtools/buildwheels.sh	[new file with mode: 0755]	patch \| blob
devtools/conda-recipe/build.sh	[new file with mode: 0644]	patch \| blob
devtools/conda-recipe/meta.yaml	[new file with mode: 0644]	patch \| blob
devtools/import.py	[new file with mode: 0644]	patch \| blob
devtools/install-CGAT-tools.sh	[new file with mode: 0755]	patch \| blob
devtools/run_tests_travis.sh	[new file with mode: 0755]	patch \| blob
doc/api.rst		patch \| blob \| history
doc/glossary.rst		patch \| blob \| history
doc/release.rst		patch \| blob \| history
import.py	[deleted file]	patch \| blob \| history
import/pysam.c		patch \| blob \| history
import/pysam.h		patch \| blob \| history
pysam/libcalignedsegment.pyx		patch \| blob \| history
pysam/libcalignmentfile.pyx		patch \| blob \| history
pysam/libcbcf.pyx		patch \| blob \| history
pysam/libchtslib.pxd		patch \| blob \| history
pysam/libchtslib.pyx		patch \| blob \| history
pysam/libctabix.pyx		patch \| blob \| history
pysam/libcutils.pxd		patch \| blob \| history
pysam/libcutils.pyx		patch \| blob \| history
pysam/version.py		patch \| blob \| history
run_tests_travis.sh	[deleted file]	patch \| blob \| history
samtools/LICENSE		patch \| blob \| history
samtools/README		patch \| blob \| history
samtools/bam.c.pysam.c		patch \| blob \| history
samtools/bam.h		patch \| blob \| history
samtools/bam2depth.c		patch \| blob \| history
samtools/bam2depth.c.pysam.c		patch \| blob \| history
samtools/bam_addrprg.c		patch \| blob \| history
samtools/bam_addrprg.c.pysam.c		patch \| blob \| history
samtools/bam_cat.c		patch \| blob \| history
samtools/bam_cat.c.pysam.c		patch \| blob \| history
samtools/bam_index.c		patch \| blob \| history
samtools/bam_index.c.pysam.c		patch \| blob \| history
samtools/bam_lpileup.c		patch \| blob \| history
samtools/bam_lpileup.c.pysam.c		patch \| blob \| history
samtools/bam_markdup.c		patch \| blob \| history
samtools/bam_markdup.c.pysam.c		patch \| blob \| history
samtools/bam_mate.c		patch \| blob \| history
samtools/bam_mate.c.pysam.c		patch \| blob \| history
samtools/bam_md.c		patch \| blob \| history
samtools/bam_md.c.pysam.c		patch \| blob \| history
samtools/bam_plcmd.c		patch \| blob \| history
samtools/bam_plcmd.c.pysam.c		patch \| blob \| history
samtools/bam_sort.c		patch \| blob \| history
samtools/bam_sort.c.pysam.c		patch \| blob \| history
samtools/bamshuf.c		patch \| blob \| history
samtools/bamshuf.c.pysam.c		patch \| blob \| history
samtools/bamtk.c		patch \| blob \| history
samtools/bamtk.c.pysam.c		patch \| blob \| history
samtools/bedcov.c		patch \| blob \| history
samtools/bedcov.c.pysam.c		patch \| blob \| history
samtools/bedidx.c		patch \| blob \| history
samtools/bedidx.c.pysam.c		patch \| blob \| history
samtools/bedidx.h		patch \| blob \| history
samtools/faidx.c		patch \| blob \| history
samtools/faidx.c.pysam.c		patch \| blob \| history
samtools/htslib-1.9/LICENSE	[new file with mode: 0644]	patch \| blob
samtools/htslib-1.9/README	[new file with mode: 0644]	patch \| blob
samtools/misc/ace2sam.c.pysam.c		patch \| blob \| history
samtools/phase.c		patch \| blob \| history
samtools/phase.c.pysam.c		patch \| blob \| history
samtools/sam_utils.c		patch \| blob \| history
samtools/sam_utils.c.pysam.c		patch \| blob \| history
samtools/sam_view.c		patch \| blob \| history
samtools/sam_view.c.pysam.c		patch \| blob \| history
samtools/samtools.h		patch \| blob \| history
samtools/samtools.pysam.c		patch \| blob \| history
samtools/samtools.pysam.h		patch \| blob \| history
samtools/stats.c		patch \| blob \| history
samtools/stats.c.pysam.c		patch \| blob \| history
samtools/test/split/test_filter_header_rg.c		patch \| blob \| history
samtools/test/split/test_filter_header_rg.c.pysam.c		patch \| blob \| history
samtools/test/test.c		patch \| blob \| history
samtools/test/test.c.pysam.c		patch \| blob \| history
samtools/version.h		patch \| blob \| history
save/example.py	[deleted file]	patch \| blob \| history
save/pysam_bench.py	[deleted file]	patch \| blob \| history
save/pysam_test2.6.py	[deleted file]	patch \| blob \| history
save/pysam_test_stdin.py	[deleted file]	patch \| blob \| history
save/pysam_testpp.py	[deleted file]	patch \| blob \| history
save/segfault_tests.py	[deleted file]	patch \| blob \| history
save/vcf_test.py	[deleted file]	patch \| blob \| history
tests/AlignedSegment_test.py		patch \| blob \| history
tests/AlignmentFilePileup_test.py		patch \| blob \| history
tests/AlignmentFile_test.py		patch \| blob \| history
tests/VariantFile_test.py		patch \| blob \| history
tests/pysam_data/Makefile		patch \| blob \| history
tests/pysam_data/example_dash_in_chr.sam	[new file with mode: 0644]	patch \| blob
tests/tabix_test.py		patch \| blob \| history