- g++
script:
- - ./run_tests_travis.sh
+ - ./devtools/run_tests_travis.sh
notifications:
email:
Release notes
=============
+Release 0.15.0
+==============
+
+This release wraps htslib/samtools/bcftools version 1.9.0.
+
+* [#673] Permit dashes in the chromosome name of a region string
+* [#656] Support `text` when opening a SAM file for writing
+* [#658] Return None from get_forward_sequence if the record has no sequence
+* [#683] Allow lower-case bases in MD tags
+* Ensure that = and X CIGAR ops are treated the same as M
+
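To illustrate the region-string change noted for #673, here is a minimal sketch of how a region such as ``chr-1:100-200`` can be split without being confused by a dash in the chromosome name. The function name ``parse_region`` and its return convention (0-based, half-open coordinates) are assumptions made for this example; this is not pysam's or htslib's actual parser.

```python
def parse_region(region):
    """Split a samtools-style region string into (contig, start, end).

    Hypothetical helper for illustration only. Coordinates in the string
    are taken to be 1-based and inclusive; the result is 0-based and
    half-open. A missing interval yields (contig, None, None).
    """
    # Split on the LAST colon so that only the trailing part is read as
    # the interval; everything before it is the contig name.
    contig, sep, interval = region.rpartition(":")
    if not sep:
        # No colon at all: the whole string is the contig name,
        # dashes included.
        return region, None, None

    # Only the interval is split on "-", so dashes in the contig
    # name are never interpreted as a range separator.
    start_text, _, end_text = interval.partition("-")
    start = int(start_text.replace(",", "")) - 1
    end = int(end_text.replace(",", "")) if end_text else None
    return contig, start, end


print(parse_region("chr-1:100-200"))
print(parse_region("chr-1"))
```

The key design choice is to isolate the interval before looking for the range separator. Contig names that themselves contain a colon are not handled by this sketch.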
Release 0.14.1
==============
If you are using the conda packaging manager (e.g. miniconda or anaconda),
you can install pysam from the `bioconda channel <https://bioconda.github.io/>`_::
- conda config --add channels r
+ conda config --add channels defaults
+ conda config --add channels conda-forge
conda config --add channels bioconda
conda install pysam
pip install pysam
-Pysam documentation is available through https://readthedocs.org/ from
+Pysam documentation is available
`here <http://pysam.readthedocs.org/en/latest/>`_
Questions and comments are very welcome and should be sent to the
--- /dev/null
+This software is available to you under a choice of one of two licenses. You
+may choose to be licensed under the terms of the MIT/Expat license or the GNU
+General Public License (GPL), both included below. If compiled with the GNU
+Scientific Library (which is optional and disabled by default as explained in
+the INSTALL document), the use of this software is governed by the GPL license.
+
+
+-----------------------------------------------------------------------------
+
+The MIT/Expat License
+
+Copyright (C) 2012-2014 Genome Research Ltd.
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE.
+
+
+[The use of a range of years within a copyright notice in this distribution
+should be interpreted as being equivalent to a list of years including the
+first and last year specified and all consecutive years between them.
+
+For example, a copyright notice that reads "Copyright (C) 2005, 2007-2009,
+2011-2012" should be interpreted as being identical to a notice that reads
+"Copyright (C) 2005, 2007, 2008, 2009, 2011, 2012" and a copyright notice
+that reads "Copyright (C) 2005-2012" should be interpreted as being identical
+to a notice that reads "Copyright (C) 2005, 2006, 2007, 2008, 2009, 2010,
+2011, 2012".]
+
+
+-----------------------------------------------------------------------------
+
+
+GNU GENERAL PUBLIC LICENSE
+Version 3, 29 June 2007
+
+ Copyright (C) 2007 Free Software Foundation, Inc. <http://fsf.org/>
+ Everyone is permitted to copy and distribute verbatim copies
+ of this license document, but changing it is not allowed.
+
+ Preamble
+
+ The GNU General Public License is a free, copyleft license for
+software and other kinds of works.
+
+ The licenses for most software and other practical works are designed
+to take away your freedom to share and change the works. By contrast,
+the GNU General Public License is intended to guarantee your freedom to
+share and change all versions of a program--to make sure it remains free
+software for all its users. We, the Free Software Foundation, use the
+GNU General Public License for most of our software; it applies also to
+any other work released this way by its authors. You can apply it to
+your programs, too.
+
+ When we speak of free software, we are referring to freedom, not
+price. Our General Public Licenses are designed to make sure that you
+have the freedom to distribute copies of free software (and charge for
+them if you wish), that you receive source code or can get it if you
+want it, that you can change the software or use pieces of it in new
+free programs, and that you know you can do these things.
+
+ To protect your rights, we need to prevent others from denying you
+these rights or asking you to surrender the rights. Therefore, you have
+certain responsibilities if you distribute copies of the software, or if
+you modify it: responsibilities to respect the freedom of others.
+
+ For example, if you distribute copies of such a program, whether
+gratis or for a fee, you must pass on to the recipients the same
+freedoms that you received. You must make sure that they, too, receive
+or can get the source code. And you must show them these terms so they
+know their rights.
+
+ Developers that use the GNU GPL protect your rights with two steps:
+(1) assert copyright on the software, and (2) offer you this License
+giving you legal permission to copy, distribute and/or modify it.
+
+ For the developers' and authors' protection, the GPL clearly explains
+that there is no warranty for this free software. For both users' and
+authors' sake, the GPL requires that modified versions be marked as
+changed, so that their problems will not be attributed erroneously to
+authors of previous versions.
+
+ Some devices are designed to deny users access to install or run
+modified versions of the software inside them, although the manufacturer
+can do so. This is fundamentally incompatible with the aim of
+protecting users' freedom to change the software. The systematic
+pattern of such abuse occurs in the area of products for individuals to
+use, which is precisely where it is most unacceptable. Therefore, we
+have designed this version of the GPL to prohibit the practice for those
+products. If such problems arise substantially in other domains, we
+stand ready to extend this provision to those domains in future versions
+of the GPL, as needed to protect the freedom of users.
+
+ Finally, every program is threatened constantly by software patents.
+States should not allow patents to restrict development and use of
+software on general-purpose computers, but in those that do, we wish to
+avoid the special danger that patents applied to a free program could
+make it effectively proprietary. To prevent this, the GPL assures that
+patents cannot be used to render the program non-free.
+
+ The precise terms and conditions for copying, distribution and
+modification follow.
+
+ TERMS AND CONDITIONS
+
+ 0. Definitions.
+
+ "This License" refers to version 3 of the GNU General Public License.
+
+ "Copyright" also means copyright-like laws that apply to other kinds of
+works, such as semiconductor masks.
+
+ "The Program" refers to any copyrightable work licensed under this
+License. Each licensee is addressed as "you". "Licensees" and
+"recipients" may be individuals or organizations.
+
+ To "modify" a work means to copy from or adapt all or part of the work
+in a fashion requiring copyright permission, other than the making of an
+exact copy. The resulting work is called a "modified version" of the
+earlier work or a work "based on" the earlier work.
+
+ A "covered work" means either the unmodified Program or a work based
+on the Program.
+
+ To "propagate" a work means to do anything with it that, without
+permission, would make you directly or secondarily liable for
+infringement under applicable copyright law, except executing it on a
+computer or modifying a private copy. Propagation includes copying,
+distribution (with or without modification), making available to the
+public, and in some countries other activities as well.
+
+ To "convey" a work means any kind of propagation that enables other
+parties to make or receive copies. Mere interaction with a user through
+a computer network, with no transfer of a copy, is not conveying.
+
+ An interactive user interface displays "Appropriate Legal Notices"
+to the extent that it includes a convenient and prominently visible
+feature that (1) displays an appropriate copyright notice, and (2)
+tells the user that there is no warranty for the work (except to the
+extent that warranties are provided), that licensees may convey the
+work under this License, and how to view a copy of this License. If
+the interface presents a list of user commands or options, such as a
+menu, a prominent item in the list meets this criterion.
+
+ 1. Source Code.
+
+ The "source code" for a work means the preferred form of the work
+for making modifications to it. "Object code" means any non-source
+form of a work.
+
+ A "Standard Interface" means an interface that either is an official
+standard defined by a recognized standards body, or, in the case of
+interfaces specified for a particular programming language, one that
+is widely used among developers working in that language.
+
+ The "System Libraries" of an executable work include anything, other
+than the work as a whole, that (a) is included in the normal form of
+packaging a Major Component, but which is not part of that Major
+Component, and (b) serves only to enable use of the work with that
+Major Component, or to implement a Standard Interface for which an
+implementation is available to the public in source code form. A
+"Major Component", in this context, means a major essential component
+(kernel, window system, and so on) of the specific operating system
+(if any) on which the executable work runs, or a compiler used to
+produce the work, or an object code interpreter used to run it.
+
+ The "Corresponding Source" for a work in object code form means all
+the source code needed to generate, install, and (for an executable
+work) run the object code and to modify the work, including scripts to
+control those activities. However, it does not include the work's
+System Libraries, or general-purpose tools or generally available free
+programs which are used unmodified in performing those activities but
+which are not part of the work. For example, Corresponding Source
+includes interface definition files associated with source files for
+the work, and the source code for shared libraries and dynamically
+linked subprograms that the work is specifically designed to require,
+such as by intimate data communication or control flow between those
+subprograms and other parts of the work.
+
+ The Corresponding Source need not include anything that users
+can regenerate automatically from other parts of the Corresponding
+Source.
+
+ The Corresponding Source for a work in source code form is that
+same work.
+
+ 2. Basic Permissions.
+
+ All rights granted under this License are granted for the term of
+copyright on the Program, and are irrevocable provided the stated
+conditions are met. This License explicitly affirms your unlimited
+permission to run the unmodified Program. The output from running a
+covered work is covered by this License only if the output, given its
+content, constitutes a covered work. This License acknowledges your
+rights of fair use or other equivalent, as provided by copyright law.
+
+ You may make, run and propagate covered works that you do not
+convey, without conditions so long as your license otherwise remains
+in force. You may convey covered works to others for the sole purpose
+of having them make modifications exclusively for you, or provide you
+with facilities for running those works, provided that you comply with
+the terms of this License in conveying all material for which you do
+not control copyright. Those thus making or running the covered works
+for you must do so exclusively on your behalf, under your direction
+and control, on terms that prohibit them from making any copies of
+your copyrighted material outside their relationship with you.
+
+ Conveying under any other circumstances is permitted solely under
+the conditions stated below. Sublicensing is not allowed; section 10
+makes it unnecessary.
+
+ 3. Protecting Users' Legal Rights From Anti-Circumvention Law.
+
+ No covered work shall be deemed part of an effective technological
+measure under any applicable law fulfilling obligations under article
+11 of the WIPO copyright treaty adopted on 20 December 1996, or
+similar laws prohibiting or restricting circumvention of such
+measures.
+
+ When you convey a covered work, you waive any legal power to forbid
+circumvention of technological measures to the extent such circumvention
+is effected by exercising rights under this License with respect to
+the covered work, and you disclaim any intention to limit operation or
+modification of the work as a means of enforcing, against the work's
+users, your or third parties' legal rights to forbid circumvention of
+technological measures.
+
+ 4. Conveying Verbatim Copies.
+
+ You may convey verbatim copies of the Program's source code as you
+receive it, in any medium, provided that you conspicuously and
+appropriately publish on each copy an appropriate copyright notice;
+keep intact all notices stating that this License and any
+non-permissive terms added in accord with section 7 apply to the code;
+keep intact all notices of the absence of any warranty; and give all
+recipients a copy of this License along with the Program.
+
+ You may charge any price or no price for each copy that you convey,
+and you may offer support or warranty protection for a fee.
+
+ 5. Conveying Modified Source Versions.
+
+ You may convey a work based on the Program, or the modifications to
+produce it from the Program, in the form of source code under the
+terms of section 4, provided that you also meet all of these conditions:
+
+ a) The work must carry prominent notices stating that you modified
+ it, and giving a relevant date.
+
+ b) The work must carry prominent notices stating that it is
+ released under this License and any conditions added under section
+ 7. This requirement modifies the requirement in section 4 to
+ "keep intact all notices".
+
+ c) You must license the entire work, as a whole, under this
+ License to anyone who comes into possession of a copy. This
+ License will therefore apply, along with any applicable section 7
+ additional terms, to the whole of the work, and all its parts,
+ regardless of how they are packaged. This License gives no
+ permission to license the work in any other way, but it does not
+ invalidate such permission if you have separately received it.
+
+ d) If the work has interactive user interfaces, each must display
+ Appropriate Legal Notices; however, if the Program has interactive
+ interfaces that do not display Appropriate Legal Notices, your
+ work need not make them do so.
+
+ A compilation of a covered work with other separate and independent
+works, which are not by their nature extensions of the covered work,
+and which are not combined with it such as to form a larger program,
+in or on a volume of a storage or distribution medium, is called an
+"aggregate" if the compilation and its resulting copyright are not
+used to limit the access or legal rights of the compilation's users
+beyond what the individual works permit. Inclusion of a covered work
+in an aggregate does not cause this License to apply to the other
+parts of the aggregate.
+
+ 6. Conveying Non-Source Forms.
+
+ You may convey a covered work in object code form under the terms
+of sections 4 and 5, provided that you also convey the
+machine-readable Corresponding Source under the terms of this License,
+in one of these ways:
+
+ a) Convey the object code in, or embodied in, a physical product
+ (including a physical distribution medium), accompanied by the
+ Corresponding Source fixed on a durable physical medium
+ customarily used for software interchange.
+
+ b) Convey the object code in, or embodied in, a physical product
+ (including a physical distribution medium), accompanied by a
+ written offer, valid for at least three years and valid for as
+ long as you offer spare parts or customer support for that product
+ model, to give anyone who possesses the object code either (1) a
+ copy of the Corresponding Source for all the software in the
+ product that is covered by this License, on a durable physical
+ medium customarily used for software interchange, for a price no
+ more than your reasonable cost of physically performing this
+ conveying of source, or (2) access to copy the
+ Corresponding Source from a network server at no charge.
+
+ c) Convey individual copies of the object code with a copy of the
+ written offer to provide the Corresponding Source. This
+ alternative is allowed only occasionally and noncommercially, and
+ only if you received the object code with such an offer, in accord
+ with subsection 6b.
+
+ d) Convey the object code by offering access from a designated
+ place (gratis or for a charge), and offer equivalent access to the
+ Corresponding Source in the same way through the same place at no
+ further charge. You need not require recipients to copy the
+ Corresponding Source along with the object code. If the place to
+ copy the object code is a network server, the Corresponding Source
+ may be on a different server (operated by you or a third party)
+ that supports equivalent copying facilities, provided you maintain
+ clear directions next to the object code saying where to find the
+ Corresponding Source. Regardless of what server hosts the
+ Corresponding Source, you remain obligated to ensure that it is
+ available for as long as needed to satisfy these requirements.
+
+ e) Convey the object code using peer-to-peer transmission, provided
+ you inform other peers where the object code and Corresponding
+ Source of the work are being offered to the general public at no
+ charge under subsection 6d.
+
+ A separable portion of the object code, whose source code is excluded
+from the Corresponding Source as a System Library, need not be
+included in conveying the object code work.
+
+ A "User Product" is either (1) a "consumer product", which means any
+tangible personal property which is normally used for personal, family,
+or household purposes, or (2) anything designed or sold for incorporation
+into a dwelling. In determining whether a product is a consumer product,
+doubtful cases shall be resolved in favor of coverage. For a particular
+product received by a particular user, "normally used" refers to a
+typical or common use of that class of product, regardless of the status
+of the particular user or of the way in which the particular user
+actually uses, or expects or is expected to use, the product. A product
+is a consumer product regardless of whether the product has substantial
+commercial, industrial or non-consumer uses, unless such uses represent
+the only significant mode of use of the product.
+
+ "Installation Information" for a User Product means any methods,
+procedures, authorization keys, or other information required to install
+and execute modified versions of a covered work in that User Product from
+a modified version of its Corresponding Source. The information must
+suffice to ensure that the continued functioning of the modified object
+code is in no case prevented or interfered with solely because
+modification has been made.
+
+ If you convey an object code work under this section in, or with, or
+specifically for use in, a User Product, and the conveying occurs as
+part of a transaction in which the right of possession and use of the
+User Product is transferred to the recipient in perpetuity or for a
+fixed term (regardless of how the transaction is characterized), the
+Corresponding Source conveyed under this section must be accompanied
+by the Installation Information. But this requirement does not apply
+if neither you nor any third party retains the ability to install
+modified object code on the User Product (for example, the work has
+been installed in ROM).
+
+ The requirement to provide Installation Information does not include a
+requirement to continue to provide support service, warranty, or updates
+for a work that has been modified or installed by the recipient, or for
+the User Product in which it has been modified or installed. Access to a
+network may be denied when the modification itself materially and
+adversely affects the operation of the network or violates the rules and
+protocols for communication across the network.
+
+ Corresponding Source conveyed, and Installation Information provided,
+in accord with this section must be in a format that is publicly
+documented (and with an implementation available to the public in
+source code form), and must require no special password or key for
+unpacking, reading or copying.
+
+ 7. Additional Terms.
+
+ "Additional permissions" are terms that supplement the terms of this
+License by making exceptions from one or more of its conditions.
+Additional permissions that are applicable to the entire Program shall
+be treated as though they were included in this License, to the extent
+that they are valid under applicable law. If additional permissions
+apply only to part of the Program, that part may be used separately
+under those permissions, but the entire Program remains governed by
+this License without regard to the additional permissions.
+
+ When you convey a copy of a covered work, you may at your option
+remove any additional permissions from that copy, or from any part of
+it. (Additional permissions may be written to require their own
+removal in certain cases when you modify the work.) You may place
+additional permissions on material, added by you to a covered work,
+for which you have or can give appropriate copyright permission.
+
+ Notwithstanding any other provision of this License, for material you
+add to a covered work, you may (if authorized by the copyright holders of
+that material) supplement the terms of this License with terms:
+
+ a) Disclaiming warranty or limiting liability differently from the
+ terms of sections 15 and 16 of this License; or
+
+ b) Requiring preservation of specified reasonable legal notices or
+ author attributions in that material or in the Appropriate Legal
+ Notices displayed by works containing it; or
+
+ c) Prohibiting misrepresentation of the origin of that material, or
+ requiring that modified versions of such material be marked in
+ reasonable ways as different from the original version; or
+
+ d) Limiting the use for publicity purposes of names of licensors or
+ authors of the material; or
+
+ e) Declining to grant rights under trademark law for use of some
+ trade names, trademarks, or service marks; or
+
+ f) Requiring indemnification of licensors and authors of that
+ material by anyone who conveys the material (or modified versions of
+ it) with contractual assumptions of liability to the recipient, for
+ any liability that these contractual assumptions directly impose on
+ those licensors and authors.
+
+ All other non-permissive additional terms are considered "further
+restrictions" within the meaning of section 10. If the Program as you
+received it, or any part of it, contains a notice stating that it is
+governed by this License along with a term that is a further
+restriction, you may remove that term. If a license document contains
+a further restriction but permits relicensing or conveying under this
+License, you may add to a covered work material governed by the terms
+of that license document, provided that the further restriction does
+not survive such relicensing or conveying.
+
+ If you add terms to a covered work in accord with this section, you
+must place, in the relevant source files, a statement of the
+additional terms that apply to those files, or a notice indicating
+where to find the applicable terms.
+
+ Additional terms, permissive or non-permissive, may be stated in the
+form of a separately written license, or stated as exceptions;
+the above requirements apply either way.
+
+ 8. Termination.
+
+ You may not propagate or modify a covered work except as expressly
+provided under this License. Any attempt otherwise to propagate or
+modify it is void, and will automatically terminate your rights under
+this License (including any patent licenses granted under the third
+paragraph of section 11).
+
+ However, if you cease all violation of this License, then your
+license from a particular copyright holder is reinstated (a)
+provisionally, unless and until the copyright holder explicitly and
+finally terminates your license, and (b) permanently, if the copyright
+holder fails to notify you of the violation by some reasonable means
+prior to 60 days after the cessation.
+
+ Moreover, your license from a particular copyright holder is
+reinstated permanently if the copyright holder notifies you of the
+violation by some reasonable means, this is the first time you have
+received notice of violation of this License (for any work) from that
+copyright holder, and you cure the violation prior to 30 days after
+your receipt of the notice.
+
+ Termination of your rights under this section does not terminate the
+licenses of parties who have received copies or rights from you under
+this License. If your rights have been terminated and not permanently
+reinstated, you do not qualify to receive new licenses for the same
+material under section 10.
+
+ 9. Acceptance Not Required for Having Copies.
+
+ You are not required to accept this License in order to receive or
+run a copy of the Program. Ancillary propagation of a covered work
+occurring solely as a consequence of using peer-to-peer transmission
+to receive a copy likewise does not require acceptance. However,
+nothing other than this License grants you permission to propagate or
+modify any covered work. These actions infringe copyright if you do
+not accept this License. Therefore, by modifying or propagating a
+covered work, you indicate your acceptance of this License to do so.
+
+ 10. Automatic Licensing of Downstream Recipients.
+
+ Each time you convey a covered work, the recipient automatically
+receives a license from the original licensors, to run, modify and
+propagate that work, subject to this License. You are not responsible
+for enforcing compliance by third parties with this License.
+
+ An "entity transaction" is a transaction transferring control of an
+organization, or substantially all assets of one, or subdividing an
+organization, or merging organizations. If propagation of a covered
+work results from an entity transaction, each party to that
+transaction who receives a copy of the work also receives whatever
+licenses to the work the party's predecessor in interest had or could
+give under the previous paragraph, plus a right to possession of the
+Corresponding Source of the work from the predecessor in interest, if
+the predecessor has it or can get it with reasonable efforts.
+
+ You may not impose any further restrictions on the exercise of the
+rights granted or affirmed under this License. For example, you may
+not impose a license fee, royalty, or other charge for exercise of
+rights granted under this License, and you may not initiate litigation
+(including a cross-claim or counterclaim in a lawsuit) alleging that
+any patent claim is infringed by making, using, selling, offering for
+sale, or importing the Program or any portion of it.
+
+ 11. Patents.
+
+ A "contributor" is a copyright holder who authorizes use under this
+License of the Program or a work on which the Program is based. The
+work thus licensed is called the contributor's "contributor version".
+
+ A contributor's "essential patent claims" are all patent claims
+owned or controlled by the contributor, whether already acquired or
+hereafter acquired, that would be infringed by some manner, permitted
+by this License, of making, using, or selling its contributor version,
+but do not include claims that would be infringed only as a
+consequence of further modification of the contributor version. For
+purposes of this definition, "control" includes the right to grant
+patent sublicenses in a manner consistent with the requirements of
+this License.
+
+ Each contributor grants you a non-exclusive, worldwide, royalty-free
+patent license under the contributor's essential patent claims, to
+make, use, sell, offer for sale, import and otherwise run, modify and
+propagate the contents of its contributor version.
+
+ In the following three paragraphs, a "patent license" is any express
+agreement or commitment, however denominated, not to enforce a patent
+(such as an express permission to practice a patent or covenant not to
+sue for patent infringement). To "grant" such a patent license to a
+party means to make such an agreement or commitment not to enforce a
+patent against the party.
+
+ If you convey a covered work, knowingly relying on a patent license,
+and the Corresponding Source of the work is not available for anyone
+to copy, free of charge and under the terms of this License, through a
+publicly available network server or other readily accessible means,
+then you must either (1) cause the Corresponding Source to be so
+available, or (2) arrange to deprive yourself of the benefit of the
+patent license for this particular work, or (3) arrange, in a manner
+consistent with the requirements of this License, to extend the patent
+license to downstream recipients. "Knowingly relying" means you have
+actual knowledge that, but for the patent license, your conveying the
+covered work in a country, or your recipient's use of the covered work
+in a country, would infringe one or more identifiable patents in that
+country that you have reason to believe are valid.
+
+ If, pursuant to or in connection with a single transaction or
+arrangement, you convey, or propagate by procuring conveyance of, a
+covered work, and grant a patent license to some of the parties
+receiving the covered work authorizing them to use, propagate, modify
+or convey a specific copy of the covered work, then the patent license
+you grant is automatically extended to all recipients of the covered
+work and works based on it.
+
+ A patent license is "discriminatory" if it does not include within
+the scope of its coverage, prohibits the exercise of, or is
+conditioned on the non-exercise of one or more of the rights that are
+specifically granted under this License. You may not convey a covered
+work if you are a party to an arrangement with a third party that is
+in the business of distributing software, under which you make payment
+to the third party based on the extent of your activity of conveying
+the work, and under which the third party grants, to any of the
+parties who would receive the covered work from you, a discriminatory
+patent license (a) in connection with copies of the covered work
+conveyed by you (or copies made from those copies), or (b) primarily
+for and in connection with specific products or compilations that
+contain the covered work, unless you entered into that arrangement,
+or that patent license was granted, prior to 28 March 2007.
+
+ Nothing in this License shall be construed as excluding or limiting
+any implied license or other defenses to infringement that may
+otherwise be available to you under applicable patent law.
+
+ 12. No Surrender of Others' Freedom.
+
+ If conditions are imposed on you (whether by court order, agreement or
+otherwise) that contradict the conditions of this License, they do not
+excuse you from the conditions of this License. If you cannot convey a
+covered work so as to satisfy simultaneously your obligations under this
+License and any other pertinent obligations, then as a consequence you may
+not convey it at all. For example, if you agree to terms that obligate you
+to collect a royalty for further conveying from those to whom you convey
+the Program, the only way you could satisfy both those terms and this
+License would be to refrain entirely from conveying the Program.
+
+ 13. Use with the GNU Affero General Public License.
+
+ Notwithstanding any other provision of this License, you have
+permission to link or combine any covered work with a work licensed
+under version 3 of the GNU Affero General Public License into a single
+combined work, and to convey the resulting work. The terms of this
+License will continue to apply to the part which is the covered work,
+but the special requirements of the GNU Affero General Public License,
+section 13, concerning interaction through a network will apply to the
+combination as such.
+
+ 14. Revised Versions of this License.
+
+ The Free Software Foundation may publish revised and/or new versions of
+the GNU General Public License from time to time. Such new versions will
+be similar in spirit to the present version, but may differ in detail to
+address new problems or concerns.
+
+ Each version is given a distinguishing version number. If the
+Program specifies that a certain numbered version of the GNU General
+Public License "or any later version" applies to it, you have the
+option of following the terms and conditions either of that numbered
+version or of any later version published by the Free Software
+Foundation. If the Program does not specify a version number of the
+GNU General Public License, you may choose any version ever published
+by the Free Software Foundation.
+
+ If the Program specifies that a proxy can decide which future
+versions of the GNU General Public License can be used, that proxy's
+public statement of acceptance of a version permanently authorizes you
+to choose that version for the Program.
+
+ Later license versions may give you additional or different
+permissions. However, no additional obligations are imposed on any
+author or copyright holder as a result of your choosing to follow a
+later version.
+
+ 15. Disclaimer of Warranty.
+
+ THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY
+APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT
+HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY
+OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO,
+THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM
+IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF
+ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
+
+ 16. Limitation of Liability.
+
+ IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
+WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS
+THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY
+GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE
+USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF
+DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD
+PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS),
+EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF
+SUCH DAMAGES.
+
+ 17. Interpretation of Sections 15 and 16.
+
+ If the disclaimer of warranty and limitation of liability provided
+above cannot be given local legal effect according to their terms,
+reviewing courts shall apply local law that most closely approximates
+an absolute waiver of all civil liability in connection with the
+Program, unless a warranty or assumption of liability accompanies a
+copy of the Program in return for a fee.
+
+ END OF TERMS AND CONDITIONS
+
+ How to Apply These Terms to Your New Programs
+
+ If you develop a new program, and you want it to be of the greatest
+possible use to the public, the best way to achieve this is to make it
+free software which everyone can redistribute and change under these terms.
+
+ To do so, attach the following notices to the program. It is safest
+to attach them to the start of each source file to most effectively
+state the exclusion of warranty; and each file should have at least
+the "copyright" line and a pointer to where the full notice is found.
+
+ <one line to give the program's name and a brief idea of what it does.>
+ Copyright (C) <year> <name of author>
+
+ This program is free software: you can redistribute it and/or modify
+ it under the terms of the GNU General Public License as published by
+ the Free Software Foundation, either version 3 of the License, or
+ (at your option) any later version.
+
+ This program is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ GNU General Public License for more details.
+
+ You should have received a copy of the GNU General Public License
+ along with this program. If not, see <http://www.gnu.org/licenses/>.
+
+Also add information on how to contact you by electronic and paper mail.
+
+ If the program does terminal interaction, make it output a short
+notice like this when it starts in an interactive mode:
+
+ <program> Copyright (C) <year> <name of author>
+ This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
+ This is free software, and you are welcome to redistribute it
+ under certain conditions; type `show c' for details.
+
+The hypothetical commands `show w' and `show c' should show the appropriate
+parts of the General Public License. Of course, your program's commands
+might be different; for a GUI interface, you would use an "about box".
+
+ You should also get your employer (if you work as a programmer) or school,
+if any, to sign a "copyright disclaimer" for the program, if necessary.
+For more information on this, and how to apply and follow the GNU GPL, see
+<http://www.gnu.org/licenses/>.
+
+ The GNU General Public License does not permit incorporating your program
+into proprietary programs. If your program is a subroutine library, you
+may consider it more useful to permit linking proprietary applications with
+the library. If this is what you want to do, use the GNU Lesser General
+Public License instead of this License. But first, please read
+<http://www.gnu.org/philosophy/why-not-lgpl.html>.
+
+
+-----------------------------------------------------------------------------
+
--- /dev/null
+BCFtools implements utilities for variant calling (in conjunction with
+SAMtools) and manipulating VCF and BCF files. The program is intended
+to replace the Perl-based tools from vcftools.
+
+See INSTALL for building and installation instructions.
/* bam_sample.c -- group data by sample.
Copyright (C) 2010, 2011 Broad Institute.
- Copyright (C) 2013, 2016 Genome Research Ltd.
+ Copyright (C) 2013, 2016-2018 Genome Research Ltd.
Author: Heng Li <lh3@sanger.ac.uk>, Petr Danecek <pd3@sanger.ac.uk>
void *bam_smpls = khash_str2int_init();
int first_smpl = -1, nskipped = 0;
const char *p = bam_hdr, *q, *r;
- while ((q = strstr(p, "@RG")) != 0)
+ while (p != NULL && (q = strstr(p, "@RG")) != 0)
{
+ char *eol = strchr(q + 3, '\n');
+ if (q > bam_hdr && *(q - 1) != '\n') { // @RG must be at start of line
+ p = eol;
+ continue;
+ }
p = q + 3;
- r = q = 0;
if ((q = strstr(p, "\tID:")) != 0) q += 4;
if ((r = strstr(p, "\tSM:")) != 0) r += 4;
if (r && q)
}
else
break;
- p = q > r ? q : r;
+ p = eol;
}
int nsmpls = khash_str2int_size(bam_smpls);
khash_str2int_destroy_free(bam_smpls);
{
// no suitable read group is available in this bam: ignore the whole file.
free(file->fname);
+ if ( file->rg2idx ) khash_str2int_destroy_free(file->rg2idx);
bsmpl->nfiles--;
return -1;
}
/* bam_sample.c -- group data by sample.
Copyright (C) 2010, 2011 Broad Institute.
- Copyright (C) 2013, 2016 Genome Research Ltd.
+ Copyright (C) 2013, 2016-2018 Genome Research Ltd.
Author: Heng Li <lh3@sanger.ac.uk>, Petr Danecek <pd3@sanger.ac.uk>
void *bam_smpls = khash_str2int_init();
int first_smpl = -1, nskipped = 0;
const char *p = bam_hdr, *q, *r;
- while ((q = strstr(p, "@RG")) != 0)
+ while (p != NULL && (q = strstr(p, "@RG")) != 0)
{
+ char *eol = strchr(q + 3, '\n');
+ if (q > bam_hdr && *(q - 1) != '\n') { // @RG must be at start of line
+ p = eol;
+ continue;
+ }
p = q + 3;
- r = q = 0;
if ((q = strstr(p, "\tID:")) != 0) q += 4;
if ((r = strstr(p, "\tSM:")) != 0) r += 4;
if (r && q)
}
else
break;
- p = q > r ? q : r;
+ p = eol;
}
int nsmpls = khash_str2int_size(bam_smpls);
khash_str2int_destroy_free(bam_smpls);
{
// no suitable read group is available in this bam: ignore the whole file.
free(file->fname);
+ if ( file->rg2idx ) khash_str2int_destroy_free(file->rg2idx);
bsmpl->nfiles--;
return -1;
}
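
The read-group hunks above change the header scan so that an "@RG" match counts only at the start of a line, and so that the scan resumes from the end of that line rather than from the ID or SM pointer. A minimal sketch of that scanning pattern follows. `count_rg_lines` is a hypothetical helper for illustration and is not part of bcftools; it only counts qualifying lines and does not parse ID or SM.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not in bcftools): count header lines that begin
 * with "@RG", ignoring "@RG" substrings that occur inside other lines. */
static int count_rg_lines(const char *hdr)
{
    int n = 0;
    const char *p = hdr, *q;
    while (p != NULL && (q = strstr(p, "@RG")) != NULL)
    {
        /* End of the line holding this match; NULL if it is the last line. */
        const char *eol = strchr(q + 3, '\n');
        if (q == hdr || *(q - 1) == '\n')
            n++;        /* the match sits at the start of a line */
        p = eol;        /* resume at the line end; NULL terminates the loop */
    }
    return n;
}
```

Checking `p != NULL` before calling `strstr` matters here because `strchr` returns NULL when the final header line has no trailing newline.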
#define FT_STDIN (1<<3)
char *bcftools_version(void);
-void error(const char *format, ...) HTS_NORETURN;
+void error(const char *format, ...) HTS_NORETURN HTS_FORMAT(HTS_PRINTF_FMT, 1, 2);
void bcf_hdr_append_version(bcf_hdr_t *hdr, int argc, char **argv, const char *cmd);
const char *hts_bcf_wmode(int file_type);
bcftools_stdout_fileno = STDOUT_FILENO;
}
+int bcftools_puts(const char *s)
+{
+ if (fputs(s, bcftools_stdout) == EOF) return EOF;
+ return putc('\n', bcftools_stdout);
+}
+
void bcftools_set_optind(int val)
{
// setting this in cython via
-#ifndef BCFTOOLS_PYSAM_H
-#define BCFTOOLS_PYSAM_H
+#ifndef PYSAM_H
+#define PYSAM_H
#include "stdio.h"
*/
void bcftools_unset_stdout(void);
+int bcftools_puts(const char *s);
+
int bcftools_dispatch(int argc, char *argv[]);
void bcftools_set_optind(int);
{
char *tmp;
bin->bins[i] = strtod(list[i],&tmp);
- if ( !tmp ) error("Could not parse %s: %s\n", list_def, list[i]);
+ if ( *tmp ) error("Could not parse %s: %s\n", list_def, list[i]);
if ( min!=max && (bin->bins[i]<min || bin->bins[i]>max) )
- error("Expected values from the interval [%f,%f], found %s\n", list[i]);
+ error("Expected values from the interval [%f,%f], found %s\n", min, max, list[i]);
free(list[i]);
}
free(list);
{
char *tmp;
bin->bins[i] = strtod(list[i],&tmp);
- if ( !tmp ) error("Could not parse %s: %s\n", list_def, list[i]);
+ if ( *tmp ) error("Could not parse %s: %s\n", list_def, list[i]);
if ( min!=max && (bin->bins[i]<min || bin->bins[i]>max) )
- error("Expected values from the interval [%f,%f], found %s\n", list[i]);
+ error("Expected values from the interval [%f,%f], found %s\n", min, max, list[i]);
free(list[i]);
}
free(list);
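
The bin-parsing hunks above replace `!tmp` with `*tmp`. After `strtod` returns, the end pointer is set to the first unparsed character and is not NULL, so the old test could not fire; dereferencing it detects trailing characters. A minimal sketch of the check, assuming a hypothetical helper `parse_double` that is not part of bcftools. The extra `end == s` test for input with no digits at all is an addition of this sketch and does not appear in the diff.

```c
#include <stdlib.h>

/* Hypothetical helper: parse the whole string as a double.
 * Returns 0 on success, -1 if nothing was parsed or characters are left over. */
static int parse_double(const char *s, double *out)
{
    char *end;
    double v = strtod(s, &end);
    if (end == s) return -1;        /* no conversion was performed */
    if (*end != '\0') return -1;    /* trailing characters after the number */
    *out = v;
    return 0;
}
```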
#include <htslib/kstring.h>
#include <htslib/synced_bcf_reader.h>
#include <htslib/kseq.h>
+#include <htslib/bgzf.h>
#include "regidx.h"
#include "bcftools.h"
#include "rbuf.h"
int fa_length; // region's length in the original sequence (in case end_pos not provided in the FASTA header)
int fa_case; // output upper case or lower case?
int fa_src_pos; // last genomic coordinate read from the input fasta (0-based)
+ char prev_base; // this is only to validate the REF allele in the VCF - the modified fa_buf cannot be used for inserts following deletions, see 600#issuecomment-383186778
+ int prev_base_pos; // the position of prev_base
rbuf_t vcf_rbuf;
bcf1_t **vcf_buf;
FILE *fp_chain;
char **argv;
int argc, output_iupac, haplotype, allele, isample;
- char *fname, *ref_fname, *sample, *output_fname, *mask_fname, *chain_fname;
+ char *fname, *ref_fname, *sample, *output_fname, *mask_fname, *chain_fname, missing_allele;
}
args_t;
if ( ! args->fp_out ) error("Failed to create %s: %s\n", args->output_fname, strerror(errno));
}
else args->fp_out = stdout;
- if ( args->isample<0 ) fprintf(stderr,"Note: the --sample option not given, applying all records\n");
+ if ( args->isample<0 ) fprintf(stderr,"Note: the --sample option not given, applying all records regardless of the genotype\n");
if ( args->filter_str )
args->filter = filter_init(args->hdr, args->filter_str);
}
char *ss, *se = line;
while ( *se && !isspace(*se) && *se!=':' ) se++;
int from = 0, to = 0;
- char tmp, *tmp_ptr = NULL;
+ char tmp = 0, *tmp_ptr = NULL;
if ( *se )
{
tmp = *se; *se = 0; tmp_ptr = se;
else to--;
}
}
+ free(args->chr);
args->chr = strdup(line);
args->rid = bcf_hdr_name2id(args->hdr,line);
if ( args->rid<0 ) fprintf(stderr,"Warning: Sequence \"%s\" not in %s\n", line,args->fname);
- args->fa_buf.l = 0;
+ args->prev_base_pos = -1;
+ args->fa_buf.l = 0;
args->fa_length = 0;
args->fa_end_pos = to;
args->fa_ori_pos = from;
}
static void apply_variant(args_t *args, bcf1_t *rec)
{
- if ( rec->n_allele==1 ) return;
+ static int warned_haplotype = 0;
+
+ if ( rec->n_allele==1 && !args->missing_allele ) return;
- if ( rec->pos <= args->fa_frz_pos )
- {
- fprintf(stderr,"The site %s:%d overlaps with another variant, skipping...\n", bcf_seqname(args->hdr,rec),rec->pos+1);
- return;
- }
if ( args->mask )
{
char *chr = (char*)bcf_hdr_id2name(args->hdr,args->rid);
if ( fmt->type!=BCF_BT_INT8 )
error("Todo: GT field represented with BCF_BT_INT8, too many alleles at %s:%d?\n",bcf_seqname(args->hdr,rec),rec->pos+1);
uint8_t *ptr = fmt->p + fmt->size*args->isample;
-
if ( args->haplotype )
{
- if ( args->haplotype > fmt->n ) error("Can't apply %d-th haplotype at %s:%d\n", args->haplotype,bcf_seqname(args->hdr,rec),rec->pos+1);
- ialt = ptr[args->haplotype-1];
- if ( bcf_gt_is_missing(ialt) || ialt==bcf_int32_vector_end ) return;
- ialt = bcf_gt_allele(ialt);
+ if ( args->haplotype > fmt->n )
+ {
+ if ( bcf_gt_is_missing(ptr[fmt->n-1]) || bcf_gt_is_missing(ptr[0]) )
+ {
+ if ( !args->missing_allele ) return;
+ ialt = -1;
+ }
+ else
+ {
+ if ( !warned_haplotype )
+ {
+ fprintf(stderr, "Can't apply %d-th haplotype at %s:%d. (This warning is printed only once.)\n", args->haplotype,bcf_seqname(args->hdr,rec),rec->pos+1);
+ warned_haplotype = 1;
+ }
+ return;
+ }
+ }
+ else
+ {
+ ialt = (int8_t)ptr[args->haplotype-1];
+ if ( bcf_gt_is_missing(ialt) || ialt==bcf_int8_vector_end )
+ {
+ if ( !args->missing_allele ) return;
+ ialt = -1;
+ }
+ else
+ ialt = bcf_gt_allele(ialt);
+ }
}
else if ( args->output_iupac )
{
ialt = ptr[0];
- if ( bcf_gt_is_missing(ialt) || ialt==bcf_int32_vector_end ) return;
- ialt = bcf_gt_allele(ialt);
+ if ( bcf_gt_is_missing(ialt) || ialt==bcf_int32_vector_end )
+ {
+ if ( !args->missing_allele ) return;
+ ialt = -1;
+ }
+ else
+ ialt = bcf_gt_allele(ialt);
int jalt;
if ( fmt->n>1 )
{
jalt = ptr[1];
- if ( bcf_gt_is_missing(jalt) || jalt==bcf_int32_vector_end ) jalt = ialt;
- else jalt = bcf_gt_allele(jalt);
+ if ( bcf_gt_is_missing(jalt) )
+ {
+ if ( !args->missing_allele ) return;
+ ialt = -1;
+ }
+ else if ( jalt==bcf_int32_vector_end ) jalt = ialt;
+ else
+ jalt = bcf_gt_allele(jalt);
}
else jalt = ialt;
- if ( rec->n_allele <= ialt || rec->n_allele <= jalt ) error("Invalid VCF, too few ALT alleles at %s:%d\n", bcf_seqname(args->hdr,rec),rec->pos+1);
- if ( ialt!=jalt && !rec->d.allele[ialt][1] && !rec->d.allele[jalt][1] ) // is this a het snp?
+
+ if ( ialt>=0 )
{
- char ial = rec->d.allele[ialt][0];
- char jal = rec->d.allele[jalt][0];
- if ( !ialt ) ialt = jalt; // only ialt is used, make sure 0/1 is not ignored
- rec->d.allele[ialt][0] = gt2iupac(ial,jal);
+ if ( rec->n_allele <= ialt || rec->n_allele <= jalt ) error("Invalid VCF, too few ALT alleles at %s:%d\n", bcf_seqname(args->hdr,rec),rec->pos+1);
+ if ( ialt!=jalt && !rec->d.allele[ialt][1] && !rec->d.allele[jalt][1] ) // is this a het snp?
+ {
+ char ial = rec->d.allele[ialt][0];
+ char jal = rec->d.allele[jalt][0];
+ if ( !ialt ) ialt = jalt; // only ialt is used, make sure 0/1 is not ignored
+ rec->d.allele[ialt][0] = gt2iupac(ial,jal);
+ }
}
}
else
int is_hom = 1;
for (i=0; i<fmt->n; i++)
{
- if ( bcf_gt_is_missing(ptr[i]) ) return; // ignore missing or half-missing genotypes
- if ( ptr[i]==bcf_int32_vector_end ) break;
+ if ( bcf_gt_is_missing(ptr[i]) )
+ {
+ if ( !args->missing_allele ) return; // ignore missing or half-missing genotypes
+ ialt = -1;
+ break;
+ }
+ if ( ptr[i]==(uint8_t)bcf_int8_vector_end ) break;
ialt = bcf_gt_allele(ptr[i]);
if ( i>0 && ialt!=bcf_gt_allele(ptr[i-1]) ) { is_hom = 0; break; }
}
int prev_len = 0, jalt;
for (i=0; i<fmt->n; i++)
{
- if ( ptr[i]==bcf_int32_vector_end ) break;
+ if ( ptr[i]==(uint8_t)bcf_int8_vector_end ) break;
jalt = bcf_gt_allele(ptr[i]);
if ( rec->n_allele <= jalt ) error("Broken VCF, too few alts at %s:%d\n", bcf_seqname(args->hdr,rec),rec->pos+1);
if ( args->allele & (PICK_LONG|PICK_SHORT) )
rec->d.allele[1][0] = gt2iupac(ial,jal);
}
+ if ( rec->n_allele==1 && ialt!=-1 ) return; // non-missing reference
+ if ( ialt==-1 )
+ {
+ char alleles[4];
+ alleles[0] = rec->d.allele[0][0];
+ alleles[1] = ',';
+ alleles[2] = args->missing_allele;
+ alleles[3] = 0;
+ bcf_update_alleles_str(args->hdr, rec, alleles);
+ ialt = 1;
+ }
+
+ // Overlapping variant? Can be still OK iff this is an insertion
+ if ( rec->pos <= args->fa_frz_pos && (rec->pos!=args->fa_frz_pos || rec->d.allele[0][0]!=rec->d.allele[ialt][0]) )
+ {
+ fprintf(stderr,"The site %s:%d overlaps with another variant, skipping...\n", bcf_seqname(args->hdr,rec),rec->pos+1);
+ return;
+ }
+
int len_diff = 0, alen = 0;
int idx = rec->pos - args->fa_ori_pos + args->fa_mod_off;
if ( idx<0 )
}
}
if ( idx>=args->fa_buf.l )
- error("FIXME: %s:%d .. idx=%d, ori_pos=%d, len=%d, off=%d\n",bcf_seqname(args->hdr,rec),rec->pos+1,idx,args->fa_ori_pos,args->fa_buf.l,args->fa_mod_off);
+ error("FIXME: %s:%d .. idx=%d, ori_pos=%d, len=%"PRIu64", off=%d\n",bcf_seqname(args->hdr,rec),rec->pos+1,idx,args->fa_ori_pos,(uint64_t)args->fa_buf.l,args->fa_mod_off);
// sanity check the reference base
if ( rec->d.allele[ialt][0]=='<' )
}
else if ( strncasecmp(rec->d.allele[0],args->fa_buf.s+idx,rec->rlen) )
{
- // fprintf(stderr,"%d .. [%s], idx=%d ori=%d off=%d\n",args->fa_ori_pos,args->fa_buf.s,idx,args->fa_ori_pos,args->fa_mod_off);
- char tmp = 0;
- if ( args->fa_buf.l - idx > rec->rlen )
- {
- tmp = args->fa_buf.s[idx+rec->rlen];
- args->fa_buf.s[idx+rec->rlen] = 0;
+ // This is hacky, handle a special case: if insert follows a deletion (AAC>A, C>CAA),
+ // the reference base in fa_buf is lost and the check fails. We do not keep a buffer
+ // with the original sequence as it should not be necessary, we should encounter max
+ // one base overlap
+
+ int fail = 1;
+ if ( args->prev_base_pos==rec->pos && toupper(rec->d.allele[0][0])==toupper(args->prev_base) )
+ {
+ if ( rec->rlen==1 ) fail = 0;
+ else if ( !strncasecmp(rec->d.allele[0]+1,args->fa_buf.s+idx+1,rec->rlen-1) ) fail = 0;
+ }
+
+ if ( fail )
+ {
+ char tmp = 0;
+ if ( args->fa_buf.l - idx > rec->rlen )
+ {
+ tmp = args->fa_buf.s[idx+rec->rlen];
+ args->fa_buf.s[idx+rec->rlen] = 0;
+ }
+ error(
+ "The fasta sequence does not match the REF allele at %s:%d:\n"
+ " .vcf: [%s]\n"
+ " .vcf: [%s] <- (ALT)\n"
+ " .fa: [%s]%c%s\n",
+ bcf_seqname(args->hdr,rec),rec->pos+1, rec->d.allele[0], rec->d.allele[ialt], args->fa_buf.s+idx,
+ tmp?tmp:' ',tmp?args->fa_buf.s+idx+rec->rlen+1:""
+ );
}
- error(
- "The fasta sequence does not match the REF allele at %s:%d:\n"
- " .vcf: [%s]\n"
- " .vcf: [%s] <- (ALT)\n"
- " .fa: [%s]%c%s\n",
- bcf_seqname(args->hdr,rec),rec->pos+1, rec->d.allele[0], rec->d.allele[ialt], args->fa_buf.s+idx,
- tmp?tmp:' ',tmp?args->fa_buf.s+idx+rec->rlen+1:""
- );
+ alen = strlen(rec->d.allele[ialt]);
+ len_diff = alen - rec->rlen;
}
else
{
for (i=0; i<alen; i++)
args->fa_buf.s[idx+i] = rec->d.allele[ialt][i];
if ( len_diff )
+ {
+ args->prev_base = rec->d.allele[0][rec->rlen - 1];
+ args->prev_base_pos = rec->pos + rec->rlen - 1;
memmove(args->fa_buf.s+idx+alen,args->fa_buf.s+idx+rec->rlen,args->fa_buf.l-idx-rec->rlen);
+ }
}
else
{
static void consensus(args_t *args)
{
- htsFile *fasta = hts_open(args->ref_fname, "rb");
+ BGZF *fasta = bgzf_open(args->ref_fname, "r");
if ( !fasta ) error("Error reading %s\n", args->ref_fname);
kstring_t str = {0,0,0};
- while ( hts_getline(fasta, KS_SEP_LINE, &str) > 0 )
+ while ( bgzf_getline(fasta, '\n', &str) > 0 )
{
if ( str.s[0]=='>' )
{
destroy_chain(args);
}
flush_fa_buffer(args, 0);
- hts_close(fasta);
+ bgzf_close(fasta);
free(str.s);
}
fprintf(stderr, " --sample (and, optionally, --haplotype) option will apply genotype\n");
fprintf(stderr, " (or haplotype) calls from FORMAT/GT. The program ignores allelic depth\n");
fprintf(stderr, " information, such as INFO/AD or FORMAT/AD.\n");
- fprintf(stderr, "Usage: bcftools consensus [OPTIONS] <file.vcf>\n");
+ fprintf(stderr, "Usage: bcftools consensus [OPTIONS] <file.vcf.gz>\n");
fprintf(stderr, "Options:\n");
fprintf(stderr, " -c, --chain <file> write a chain file for liftover\n");
fprintf(stderr, " -e, --exclude <expr> exclude sites for which the expression is true (see man page for details)\n");
fprintf(stderr, " -i, --include <expr> select sites for which the expression is true (see man page for details)\n");
fprintf(stderr, " -I, --iupac-codes output variants in the form of IUPAC ambiguity codes\n");
fprintf(stderr, " -m, --mask <file> replace regions with N\n");
+ fprintf(stderr, " -M, --missing <char> output <char> instead of skipping the missing genotypes\n");
fprintf(stderr, " -o, --output <file> write output to a file [standard output]\n");
fprintf(stderr, " -s, --sample <name> apply variants of the given sample\n");
fprintf(stderr, "Examples:\n");
{"output",1,0,'o'},
{"fasta-ref",1,0,'f'},
{"mask",1,0,'m'},
+ {"missing",1,0,'M'},
{"chain",1,0,'c'},
{0,0,0,0}
};
int c;
- while ((c = getopt_long(argc, argv, "h?s:1Ii:e:H:f:o:m:c:",loptions,NULL)) >= 0)
+ while ((c = getopt_long(argc, argv, "h?s:1Ii:e:H:f:o:m:c:M:",loptions,NULL)) >= 0)
{
switch (c)
{
case 'i': args->filter_str = optarg; args->filter_logic |= FLT_INCLUDE; break;
case 'f': args->ref_fname = optarg; break;
case 'm': args->mask_fname = optarg; break;
+ case 'M':
+ args->missing_allele = optarg[0];
+ if ( optarg[1]!=0 ) error("Expected single character with -M, got \"%s\"\n", optarg);
+ break;
case 'c': args->chain_fname = optarg; break;
case 'H':
if ( !strcasecmp(optarg,"R") ) args->allele |= PICK_REF;
#include <htslib/kstring.h>
#include <htslib/synced_bcf_reader.h>
#include <htslib/kseq.h>
+#include <htslib/bgzf.h>
#include "regidx.h"
#include "bcftools.h"
#include "rbuf.h"
int fa_length; // region's length in the original sequence (in case end_pos not provided in the FASTA header)
int fa_case; // output upper case or lower case?
int fa_src_pos; // last genomic coordinate read from the input fasta (0-based)
+ char prev_base; // this is only to validate the REF allele in the VCF - the modified fa_buf cannot be used for inserts following deletions, see 600#issuecomment-383186778
+ int prev_base_pos; // the position of prev_base
rbuf_t vcf_rbuf;
bcf1_t **vcf_buf;
FILE *fp_chain;
char **argv;
int argc, output_iupac, haplotype, allele, isample;
- char *fname, *ref_fname, *sample, *output_fname, *mask_fname, *chain_fname;
+ char *fname, *ref_fname, *sample, *output_fname, *mask_fname, *chain_fname, missing_allele;
}
args_t;
if ( ! args->fp_out ) error("Failed to create %s: %s\n", args->output_fname, strerror(errno));
}
else args->fp_out = bcftools_stdout;
- if ( args->isample<0 ) fprintf(bcftools_stderr,"Note: the --sample option not given, applying all records\n");
+ if ( args->isample<0 ) fprintf(bcftools_stderr,"Note: the --sample option not given, applying all records regardless of the genotype\n");
if ( args->filter_str )
args->filter = filter_init(args->hdr, args->filter_str);
}
char *ss, *se = line;
while ( *se && !isspace(*se) && *se!=':' ) se++;
int from = 0, to = 0;
- char tmp, *tmp_ptr = NULL;
+ char tmp = 0, *tmp_ptr = NULL;
if ( *se )
{
tmp = *se; *se = 0; tmp_ptr = se;
else to--;
}
}
+ free(args->chr);
args->chr = strdup(line);
args->rid = bcf_hdr_name2id(args->hdr,line);
if ( args->rid<0 ) fprintf(bcftools_stderr,"Warning: Sequence \"%s\" not in %s\n", line,args->fname);
- args->fa_buf.l = 0;
+ args->prev_base_pos = -1;
+ args->fa_buf.l = 0;
args->fa_length = 0;
args->fa_end_pos = to;
args->fa_ori_pos = from;
}
static void apply_variant(args_t *args, bcf1_t *rec)
{
- if ( rec->n_allele==1 ) return;
+ static int warned_haplotype = 0;
+
+ if ( rec->n_allele==1 && !args->missing_allele ) return;
- if ( rec->pos <= args->fa_frz_pos )
- {
- fprintf(bcftools_stderr,"The site %s:%d overlaps with another variant, skipping...\n", bcf_seqname(args->hdr,rec),rec->pos+1);
- return;
- }
if ( args->mask )
{
char *chr = (char*)bcf_hdr_id2name(args->hdr,args->rid);
if ( fmt->type!=BCF_BT_INT8 )
error("Todo: GT field represented with BCF_BT_INT8, too many alleles at %s:%d?\n",bcf_seqname(args->hdr,rec),rec->pos+1);
uint8_t *ptr = fmt->p + fmt->size*args->isample;
-
if ( args->haplotype )
{
- if ( args->haplotype > fmt->n ) error("Can't apply %d-th haplotype at %s:%d\n", args->haplotype,bcf_seqname(args->hdr,rec),rec->pos+1);
- ialt = ptr[args->haplotype-1];
- if ( bcf_gt_is_missing(ialt) || ialt==bcf_int32_vector_end ) return;
- ialt = bcf_gt_allele(ialt);
+ if ( args->haplotype > fmt->n )
+ {
+ if ( bcf_gt_is_missing(ptr[fmt->n-1]) || bcf_gt_is_missing(ptr[0]) )
+ {
+ if ( !args->missing_allele ) return;
+ ialt = -1;
+ }
+ else
+ {
+ if ( !warned_haplotype )
+ {
+ fprintf(bcftools_stderr, "Can't apply %d-th haplotype at %s:%d. (This warning is printed only once.)\n", args->haplotype,bcf_seqname(args->hdr,rec),rec->pos+1);
+ warned_haplotype = 1;
+ }
+ return;
+ }
+ }
+ else
+ {
+ ialt = (int8_t)ptr[args->haplotype-1];
+ if ( bcf_gt_is_missing(ialt) || ialt==bcf_int8_vector_end )
+ {
+ if ( !args->missing_allele ) return;
+ ialt = -1;
+ }
+ else
+ ialt = bcf_gt_allele(ialt);
+ }
}
else if ( args->output_iupac )
{
ialt = ptr[0];
- if ( bcf_gt_is_missing(ialt) || ialt==bcf_int32_vector_end ) return;
- ialt = bcf_gt_allele(ialt);
+ if ( bcf_gt_is_missing(ialt) || ialt==bcf_int32_vector_end )
+ {
+ if ( !args->missing_allele ) return;
+ ialt = -1;
+ }
+ else
+ ialt = bcf_gt_allele(ialt);
int jalt;
if ( fmt->n>1 )
{
jalt = ptr[1];
- if ( bcf_gt_is_missing(jalt) || jalt==bcf_int32_vector_end ) jalt = ialt;
- else jalt = bcf_gt_allele(jalt);
+ if ( bcf_gt_is_missing(jalt) )
+ {
+ if ( !args->missing_allele ) return;
+ ialt = -1;
+ }
+ else if ( jalt==bcf_int32_vector_end ) jalt = ialt;
+ else
+ jalt = bcf_gt_allele(jalt);
}
else jalt = ialt;
- if ( rec->n_allele <= ialt || rec->n_allele <= jalt ) error("Invalid VCF, too few ALT alleles at %s:%d\n", bcf_seqname(args->hdr,rec),rec->pos+1);
- if ( ialt!=jalt && !rec->d.allele[ialt][1] && !rec->d.allele[jalt][1] ) // is this a het snp?
+
+ if ( ialt>=0 )
{
- char ial = rec->d.allele[ialt][0];
- char jal = rec->d.allele[jalt][0];
- if ( !ialt ) ialt = jalt; // only ialt is used, make sure 0/1 is not ignored
- rec->d.allele[ialt][0] = gt2iupac(ial,jal);
+ if ( rec->n_allele <= ialt || rec->n_allele <= jalt ) error("Invalid VCF, too few ALT alleles at %s:%d\n", bcf_seqname(args->hdr,rec),rec->pos+1);
+ if ( ialt!=jalt && !rec->d.allele[ialt][1] && !rec->d.allele[jalt][1] ) // is this a het snp?
+ {
+ char ial = rec->d.allele[ialt][0];
+ char jal = rec->d.allele[jalt][0];
+ if ( !ialt ) ialt = jalt; // only ialt is used, make sure 0/1 is not ignored
+ rec->d.allele[ialt][0] = gt2iupac(ial,jal);
+ }
}
}
else
int is_hom = 1;
for (i=0; i<fmt->n; i++)
{
- if ( bcf_gt_is_missing(ptr[i]) ) return; // ignore missing or half-missing genotypes
- if ( ptr[i]==bcf_int32_vector_end ) break;
+ if ( bcf_gt_is_missing(ptr[i]) )
+ {
+ if ( !args->missing_allele ) return; // ignore missing or half-missing genotypes
+ ialt = -1;
+ break;
+ }
+ if ( ptr[i]==(uint8_t)bcf_int8_vector_end ) break;
ialt = bcf_gt_allele(ptr[i]);
if ( i>0 && ialt!=bcf_gt_allele(ptr[i-1]) ) { is_hom = 0; break; }
}
int prev_len = 0, jalt;
for (i=0; i<fmt->n; i++)
{
- if ( ptr[i]==bcf_int32_vector_end ) break;
+ if ( ptr[i]==(uint8_t)bcf_int8_vector_end ) break;
jalt = bcf_gt_allele(ptr[i]);
if ( rec->n_allele <= jalt ) error("Broken VCF, too few alts at %s:%d\n", bcf_seqname(args->hdr,rec),rec->pos+1);
if ( args->allele & (PICK_LONG|PICK_SHORT) )
rec->d.allele[1][0] = gt2iupac(ial,jal);
}
+ if ( rec->n_allele==1 && ialt!=-1 ) return; // non-missing reference
+ if ( ialt==-1 )
+ {
+ char alleles[4];
+ alleles[0] = rec->d.allele[0][0];
+ alleles[1] = ',';
+ alleles[2] = args->missing_allele;
+ alleles[3] = 0;
+ bcf_update_alleles_str(args->hdr, rec, alleles);
+ ialt = 1;
+ }
+
+ // Overlapping variant? Can be still OK iff this is an insertion
+ if ( rec->pos <= args->fa_frz_pos && (rec->pos!=args->fa_frz_pos || rec->d.allele[0][0]!=rec->d.allele[ialt][0]) )
+ {
+ fprintf(bcftools_stderr,"The site %s:%d overlaps with another variant, skipping...\n", bcf_seqname(args->hdr,rec),rec->pos+1);
+ return;
+ }
+
int len_diff = 0, alen = 0;
int idx = rec->pos - args->fa_ori_pos + args->fa_mod_off;
if ( idx<0 )
}
}
if ( idx>=args->fa_buf.l )
- error("FIXME: %s:%d .. idx=%d, ori_pos=%d, len=%d, off=%d\n",bcf_seqname(args->hdr,rec),rec->pos+1,idx,args->fa_ori_pos,args->fa_buf.l,args->fa_mod_off);
+ error("FIXME: %s:%d .. idx=%d, ori_pos=%d, len=%"PRIu64", off=%d\n",bcf_seqname(args->hdr,rec),rec->pos+1,idx,args->fa_ori_pos,(uint64_t)args->fa_buf.l,args->fa_mod_off);
// sanity check the reference base
if ( rec->d.allele[ialt][0]=='<' )
}
else if ( strncasecmp(rec->d.allele[0],args->fa_buf.s+idx,rec->rlen) )
{
- // fprintf(bcftools_stderr,"%d .. [%s], idx=%d ori=%d off=%d\n",args->fa_ori_pos,args->fa_buf.s,idx,args->fa_ori_pos,args->fa_mod_off);
- char tmp = 0;
- if ( args->fa_buf.l - idx > rec->rlen )
- {
- tmp = args->fa_buf.s[idx+rec->rlen];
- args->fa_buf.s[idx+rec->rlen] = 0;
+ // This is hacky, handle a special case: if insert follows a deletion (AAC>A, C>CAA),
+ // the reference base in fa_buf is lost and the check fails. We do not keep a buffer
+ // with the original sequence as it should not be necessary, we should encounter max
+ // one base overlap
+
+ int fail = 1;
+ if ( args->prev_base_pos==rec->pos && toupper(rec->d.allele[0][0])==toupper(args->prev_base) )
+ {
+ if ( rec->rlen==1 ) fail = 0;
+ else if ( !strncasecmp(rec->d.allele[0]+1,args->fa_buf.s+idx+1,rec->rlen-1) ) fail = 0;
+ }
+
+ if ( fail )
+ {
+ char tmp = 0;
+ if ( args->fa_buf.l - idx > rec->rlen )
+ {
+ tmp = args->fa_buf.s[idx+rec->rlen];
+ args->fa_buf.s[idx+rec->rlen] = 0;
+ }
+ error(
+ "The fasta sequence does not match the REF allele at %s:%d:\n"
+ " .vcf: [%s]\n"
+ " .vcf: [%s] <- (ALT)\n"
+ " .fa: [%s]%c%s\n",
+ bcf_seqname(args->hdr,rec),rec->pos+1, rec->d.allele[0], rec->d.allele[ialt], args->fa_buf.s+idx,
+ tmp?tmp:' ',tmp?args->fa_buf.s+idx+rec->rlen+1:""
+ );
}
- error(
- "The fasta sequence does not match the REF allele at %s:%d:\n"
- " .vcf: [%s]\n"
- " .vcf: [%s] <- (ALT)\n"
- " .fa: [%s]%c%s\n",
- bcf_seqname(args->hdr,rec),rec->pos+1, rec->d.allele[0], rec->d.allele[ialt], args->fa_buf.s+idx,
- tmp?tmp:' ',tmp?args->fa_buf.s+idx+rec->rlen+1:""
- );
+ alen = strlen(rec->d.allele[ialt]);
+ len_diff = alen - rec->rlen;
}
else
{
for (i=0; i<alen; i++)
args->fa_buf.s[idx+i] = rec->d.allele[ialt][i];
if ( len_diff )
+ {
+ args->prev_base = rec->d.allele[0][rec->rlen - 1];
+ args->prev_base_pos = rec->pos + rec->rlen - 1;
memmove(args->fa_buf.s+idx+alen,args->fa_buf.s+idx+rec->rlen,args->fa_buf.l-idx-rec->rlen);
+ }
}
else
{
static void consensus(args_t *args)
{
- htsFile *fasta = hts_open(args->ref_fname, "rb");
+ BGZF *fasta = bgzf_open(args->ref_fname, "r");
if ( !fasta ) error("Error reading %s\n", args->ref_fname);
kstring_t str = {0,0,0};
- while ( hts_getline(fasta, KS_SEP_LINE, &str) > 0 )
+ while ( bgzf_getline(fasta, '\n', &str) > 0 )
{
if ( str.s[0]=='>' )
{
destroy_chain(args);
}
flush_fa_buffer(args, 0);
- hts_close(fasta);
+ bgzf_close(fasta);
free(str.s);
}
fprintf(bcftools_stderr, " --sample (and, optionally, --haplotype) option will apply genotype\n");
fprintf(bcftools_stderr, " (or haplotype) calls from FORMAT/GT. The program ignores allelic depth\n");
fprintf(bcftools_stderr, " information, such as INFO/AD or FORMAT/AD.\n");
- fprintf(bcftools_stderr, "Usage: bcftools consensus [OPTIONS] <file.vcf>\n");
+ fprintf(bcftools_stderr, "Usage: bcftools consensus [OPTIONS] <file.vcf.gz>\n");
fprintf(bcftools_stderr, "Options:\n");
fprintf(bcftools_stderr, " -c, --chain <file> write a chain file for liftover\n");
fprintf(bcftools_stderr, " -e, --exclude <expr> exclude sites for which the expression is true (see man page for details)\n");
fprintf(bcftools_stderr, " -i, --include <expr> select sites for which the expression is true (see man page for details)\n");
fprintf(bcftools_stderr, " -I, --iupac-codes output variants in the form of IUPAC ambiguity codes\n");
fprintf(bcftools_stderr, " -m, --mask <file> replace regions with N\n");
+ fprintf(bcftools_stderr, " -M, --missing <char> output <char> instead of skipping the missing genotypes\n");
fprintf(bcftools_stderr, " -o, --output <file> write output to a file [standard output]\n");
fprintf(bcftools_stderr, " -s, --sample <name> apply variants of the given sample\n");
fprintf(bcftools_stderr, "Examples:\n");
{"output",1,0,'o'},
{"fasta-ref",1,0,'f'},
{"mask",1,0,'m'},
+ {"missing",1,0,'M'},
{"chain",1,0,'c'},
{0,0,0,0}
};
int c;
- while ((c = getopt_long(argc, argv, "h?s:1Ii:e:H:f:o:m:c:",loptions,NULL)) >= 0)
+ while ((c = getopt_long(argc, argv, "h?s:1Ii:e:H:f:o:m:c:M:",loptions,NULL)) >= 0)
{
switch (c)
{
case 'i': args->filter_str = optarg; args->filter_logic |= FLT_INCLUDE; break;
case 'f': args->ref_fname = optarg; break;
case 'm': args->mask_fname = optarg; break;
+ case 'M':
+ args->missing_allele = optarg[0];
+ if ( optarg[1]!=0 ) error("Expected single character with -M, got \"%s\"\n", optarg);
+ break;
case 'c': args->chain_fname = optarg; break;
case 'H':
if ( !strcasecmp(optarg,"R") ) args->allele |= PICK_REF;
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>
+#include <inttypes.h>
#include <math.h>
#include <htslib/vcf.h>
#include <htslib/synced_bcf_reader.h>
if ( line->n_allele > 100 )
error("Too many alleles (%d) at %s:%d\n", line->n_allele, bcf_seqname(convert->header, line), line->pos+1);
if ( ks_resize(str, str->l+convert->nsamples*8) != 0 )
- error("Could not alloc %d bytes\n", str->l + convert->nsamples*8);
+ error("Could not alloc %"PRIu64" bytes\n", (uint64_t)(str->l + convert->nsamples*8));
if ( fmt_gt->type!=BCF_BT_INT8 ) // todo: use BRANCH_INT if the VCF is valid
error("Uh, too many alleles (%d) or redundant BCF representation at %s:%d\n", line->n_allele, bcf_seqname(convert->header, line), line->pos+1);
+ if ( fmt_gt->n!=1 && fmt_gt->n!=2 )
+ error("Uh, ploidy of %d not supported, see %s:%d\n", fmt_gt->n, bcf_seqname(convert->header, line), line->pos+1);
int8_t *ptr = ((int8_t*) fmt_gt->p) - fmt_gt->n;
for (i=0; i<convert->nsamples; i++)
{
ptr += fmt_gt->n;
- if ( ptr[0]==2 )
+ if ( fmt_gt->n==1 ) // haploid genotypes
+ {
+ if ( ptr[0]==2 ) /* 0 */
+ {
+ str->s[str->l++] = '0'; str->s[str->l++] = ' '; str->s[str->l++] = '-'; str->s[str->l++] = ' ';
+ }
+ else if ( ptr[0]==bcf_int8_missing ) /* . */
+ {
+ str->s[str->l++] = '?'; str->s[str->l++] = ' '; str->s[str->l++] = '?'; str->s[str->l++] = ' ';
+ }
+ else if ( ptr[0]==4 ) /* 1 */
+ {
+ str->s[str->l++] = '1'; str->s[str->l++] = ' '; str->s[str->l++] = '-'; str->s[str->l++] = ' ';
+ }
+ else
+ {
+ kputw(bcf_gt_allele(ptr[0]),str); str->s[str->l++] = ' '; str->s[str->l++] = '-'; str->s[str->l++] = ' ';
+ }
+ }
+ else if ( ptr[0]==2 )
{
if ( ptr[1]==3 ) /* 0|0 */
{
if ( line->n_allele > 100 )
error("Too many alleles (%d) at %s:%d\n", line->n_allele, bcf_seqname(convert->header, line), line->pos+1);
if ( ks_resize(str, str->l+convert->nsamples*8) != 0 )
- error("Could not alloc %d bytes\n", str->l + convert->nsamples*8);
+ error("Could not alloc %"PRIu64" bytes\n", (uint64_t)(str->l + convert->nsamples*8));
if ( fmt_gt->type!=BCF_BT_INT8 ) // todo: use BRANCH_INT if the VCF is valid
error("Uh, too many alleles (%d) or redundant BCF representation at %s:%d\n", line->n_allele, bcf_seqname(convert->header, line), line->pos+1);
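The constants 2, 3 and 4 tested in the genotype branches above follow from the BCF genotype encoding, in which each allele is stored as `(allele+1)<<1` with the lowest bit marking phasing. The following is a minimal standalone sketch of that encoding; the helper names are hypothetical stand-ins that mirror the `bcf_gt_unphased`, `bcf_gt_phased` and `bcf_gt_allele` macros from htslib, and the block does not depend on htslib itself.

```c
#include <assert.h>

/* Hypothetical stand-ins for the htslib bcf_gt_* macros (assumed, not taken from this diff). */

/* Encode an unphased allele index: 0 -> 2, 1 -> 4, ... */
static int gt_unphased(int allele)
{
    assert(allele >= 0);
    return (allele + 1) << 1;
}

/* Encode a phased allele index: 0 -> 3, 1 -> 5, ... */
static int gt_phased(int allele)
{
    assert(allele >= 0);
    return ((allele + 1) << 1) | 1;
}

/* Decode the allele index from an encoded value; the phasing bit is dropped. */
static int gt_allele(int val)
{
    return (val >> 1) - 1;
}

/* Non-zero if the encoded value carries the phasing bit. */
static int gt_is_phased(int val)
{
    return val & 1;
}
```

Under this encoding a haploid `0` is stored as 2 and a haploid `1` as 4, while a diploid `0|0` is stored as the pair (2, 3), which is what the `ptr[0]==2` and `ptr[1]==3` comparisons check.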
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>
+#include <inttypes.h>
#include <math.h>
#include <htslib/vcf.h>
#include <htslib/synced_bcf_reader.h>
if ( line->n_allele > 100 )
error("Too many alleles (%d) at %s:%d\n", line->n_allele, bcf_seqname(convert->header, line), line->pos+1);
if ( ks_resize(str, str->l+convert->nsamples*8) != 0 )
- error("Could not alloc %d bytes\n", str->l + convert->nsamples*8);
+ error("Could not alloc %"PRIu64" bytes\n", (uint64_t)(str->l + convert->nsamples*8));
if ( fmt_gt->type!=BCF_BT_INT8 ) // todo: use BRANCH_INT if the VCF is valid
error("Uh, too many alleles (%d) or redundant BCF representation at %s:%d\n", line->n_allele, bcf_seqname(convert->header, line), line->pos+1);
+ if ( fmt_gt->n!=1 && fmt_gt->n!=2 )
+ error("Uh, ploidy of %d not supported, see %s:%d\n", fmt_gt->n, bcf_seqname(convert->header, line), line->pos+1);
int8_t *ptr = ((int8_t*) fmt_gt->p) - fmt_gt->n;
for (i=0; i<convert->nsamples; i++)
{
ptr += fmt_gt->n;
- if ( ptr[0]==2 )
+ if ( fmt_gt->n==1 ) // haploid genotypes
+ {
+ if ( ptr[0]==2 ) /* 0 */
+ {
+ str->s[str->l++] = '0'; str->s[str->l++] = ' '; str->s[str->l++] = '-'; str->s[str->l++] = ' ';
+ }
+ else if ( ptr[0]==bcf_int8_missing ) /* . */
+ {
+ str->s[str->l++] = '?'; str->s[str->l++] = ' '; str->s[str->l++] = '?'; str->s[str->l++] = ' ';
+ }
+ else if ( ptr[0]==4 ) /* 1 */
+ {
+ str->s[str->l++] = '1'; str->s[str->l++] = ' '; str->s[str->l++] = '-'; str->s[str->l++] = ' ';
+ }
+ else
+ {
+ kputw(bcf_gt_allele(ptr[0]),str); str->s[str->l++] = ' '; str->s[str->l++] = '-'; str->s[str->l++] = ' ';
+ }
+ }
+ else if ( ptr[0]==2 )
{
if ( ptr[1]==3 ) /* 0|0 */
{
if ( line->n_allele > 100 )
error("Too many alleles (%d) at %s:%d\n", line->n_allele, bcf_seqname(convert->header, line), line->pos+1);
if ( ks_resize(str, str->l+convert->nsamples*8) != 0 )
- error("Could not alloc %d bytes\n", str->l + convert->nsamples*8);
+ error("Could not alloc %"PRIu64" bytes\n", (uint64_t)(str->l + convert->nsamples*8));
if ( fmt_gt->type!=BCF_BT_INT8 ) // todo: use BRANCH_INT if the VCF is valid
error("Uh, too many alleles (%d) or redundant BCF representation at %s:%d\n", line->n_allele, bcf_seqname(convert->header, line), line->pos+1);
A .. gene line with a supported biotype
A.ID=~/^gene:/
- B .. transcript line referencing A
+ B .. transcript line referencing A with supported biotype
B.ID=~/^transcript:/ && B.Parent=~/^gene:A.ID/
C .. corresponding CDS, exon, and UTR lines:
csq_t *csq_buf; // pool of csq not managed by hap_node_t, i.e. non-CDS csqs
int ncsq_buf, mcsq_buf;
id_tbl_t tscript_ids; // mapping between transcript id (eg. Zm00001d027245_T001) and a numeric idx
+ int force; // force run under various conditions. Currently only to skip out-of-phase transcripts
faidx_t *fai;
kstring_t str, str2;
tr->cds[0]->len -= tr->cds[0]->phase;
tr->cds[0]->phase = 0;
- // sanity check phase
+ // sanity check phase; the phase number in gff tells us how many bases to skip in this
+ // feature to reach the first base of the next codon
+ int tscript_ok = 1;
for (i=0; i<tr->ncds; i++)
{
int phase = tr->cds[i]->phase ? 3 - tr->cds[i]->phase : 0;
if ( phase!=len%3)
- error("GFF3 assumption failed for transcript %s, CDS=%d: phase!=len%%3 (phase=%d, len=%d)\n",args->tscript_ids.str[tr->id],tr->cds[i]->beg+1,phase,len);
- assert( phase == len%3 );
+ {
+ if ( args->force )
+ {
+ if ( args->quiet < 2 )
+ fprintf(stderr,"Warning: GFF3 assumption failed for transcript %s, CDS=%d: phase!=len%%3 (phase=%d, len=%d)\n",args->tscript_ids.str[tr->id],tr->cds[i]->beg+1,phase,len);
+ tscript_ok = 0;
+ break;
+ }
+ error("Error: GFF3 assumption failed for transcript %s, CDS=%d: phase!=len%%3 (phase=%d, len=%d)\n",args->tscript_ids.str[tr->id],tr->cds[i]->beg+1,phase,len);
+ }
len += tr->cds[i]->len;
}
+ if ( !tscript_ok ) continue; // skip this transcript
}
else
{
tr->cds[i]->phase = 0;
// sanity check phase
+ int tscript_ok = 1;
for (i=tr->ncds-1; i>=0; i--)
{
int phase = tr->cds[i]->phase ? 3 - tr->cds[i]->phase : 0;
if ( phase!=len%3)
- error("GFF3 assumption failed for transcript %s, CDS=%d: phase!=len%%3 (phase=%d, len=%d)\n",args->tscript_ids.str[tr->id],tr->cds[i]->beg+1,phase,len);
+ {
+ if ( args->force )
+ {
+ if ( args->quiet < 2 )
+ fprintf(stderr,"Warning: GFF3 assumption failed for transcript %s, CDS=%d: phase!=len%%3 (phase=%d, len=%d)\n",args->tscript_ids.str[tr->id],tr->cds[i]->beg+1,phase,len);
+ tscript_ok = 0;
+ break;
+ }
+ error("Error: GFF3 assumption failed for transcript %s, CDS=%d: phase!=len%%3 (phase=%d, len=%d)\n",args->tscript_ids.str[tr->id],tr->cds[i]->beg+1,phase,len);
+ }
len += tr->cds[i]->len;
}
+ if ( !tscript_ok ) continue; // skip this transcript
}
// set len. At the same time check that CDS within a transcript do not overlap
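The phase sanity check above can be illustrated in isolation. This is a hedged sketch for the forward strand only, assuming the first CDS has already been trimmed to phase 0 as the surrounding code does; the function name and the plain-array interface are hypothetical and not part of bcftools.

```c
#include <assert.h>

/*
 * Returns the index of the first CDS whose GFF3 phase disagrees with the
 * accumulated coding length, or -1 if all CDS are consistent.
 * gff_phase[i] is the GFF3 phase (0, 1 or 2): the number of bases to skip
 * in CDS i to reach the first base of the next codon.
 */
static int first_bad_phase(const int *gff_phase, const int *cds_len, int ncds)
{
    assert(ncds >= 0);
    int i, len = 0;
    for (i = 0; i < ncds; i++)
    {
        /* Bases of an unfinished codon carried over from the previous CDS */
        int phase = gff_phase[i] ? 3 - gff_phase[i] : 0;
        if ( phase != len % 3 ) return i;
        len += cds_len[i];
    }
    return -1;
}
```

For example, CDS lengths {10, 5, 6} with phases {0, 2, 0} are consistent, whereas phases {0, 1, 0} fail at the second CDS. With `--force` the transcript would be skipped with a warning in that situation rather than aborting the run.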
splice->kalt.l = 0; kputsn(splice->vcf.alt + splice->tbeg, splice->vcf.alen, &splice->kalt);
if ( (splice->ref_beg+1 < ex_beg && splice->ref_end >= ex_beg) || (splice->ref_beg+1 < ex_end && splice->ref_end >= ex_end) ) // ouch, ugly ENST00000409523/long-overlapping-del.vcf
{
- splice->csq |= (splice->ref_end - splice->ref_beg + 1)%3 ? CSQ_FRAMESHIFT_VARIANT : CSQ_INFRAME_DELETION;
+ splice->csq |= (splice->ref_end - splice->ref_beg)%3 ? CSQ_FRAMESHIFT_VARIANT : CSQ_INFRAME_DELETION;
return SPLICE_OVERLAP;
}
}
child->var = str.s;
child->type = HAP_SSS;
child->csq = splice.csq;
- child->prev = parent->type==HAP_SSS ? parent->prev : parent;
child->rec = rec;
return 0;
}
assert( dbeg <= splice.kalt.l );
}
- if ( parent->type==HAP_SSS ) parent = parent->prev;
+ assert( parent->type!=HAP_SSS );
if ( parent->type==HAP_CDS )
{
i = parent->icds;
#endif
khint_t k = kh_get(pos2vbuf, args->pos2vbuf, (int)csq->pos);
vbuf_t *vbuf = (k == kh_end(args->pos2vbuf)) ? NULL : kh_val(args->pos2vbuf, k);
- if ( !vbuf ) error("This should not happen. %s:%d %s\n",bcf_seqname(args->hdr,rec),csq->pos+1,csq->type.vstr);
+ if ( !vbuf ) error("This should not happen. %s:%d %s\n",bcf_seqname(args->hdr,rec),csq->pos+1,csq->type.vstr.s);
int i;
for (i=0; i<vbuf->n; i++)
if ( vbuf->vrec[i]->line==rec ) break;
- if ( i==vbuf->n ) error("This should not happen.. %s:%d %s\n", bcf_seqname(args->hdr,rec),csq->pos+1,csq->type.vstr);
+ if ( i==vbuf->n ) error("This should not happen.. %s:%d %s\n", bcf_seqname(args->hdr,rec),csq->pos+1,csq->type.vstr.s);
vrec_t *vrec = vbuf->vrec[i];
// if the variant overlaps donor/acceptor and also splice region, report only donor/acceptor
hap->upstream_stop = 0;
int i = 1, dlen = 0, ibeg, indel = 0;
- while ( i<istack && hap->stack[i].node->type == HAP_SSS ) i++;
hap->sbeg = hap->stack[i].node->sbeg;
-
+ assert( hap->stack[istack].node->type != HAP_SSS );
if ( tr->strand==STRAND_FWD )
{
i = 0, ibeg = -1;
while ( ++i <= istack )
{
- if ( hap->stack[i].node->type == HAP_SSS )
- {
- // start/stop/splice site overlap: don't know how to build the haplotypes correctly, skipping
- hap_add_csq(args,hap,node,0,i,i,0,0);
- continue;
- }
+ assert( hap->stack[i].node->type != HAP_SSS );
+
dlen += hap->stack[i].node->dlen;
if ( hap->stack[i].node->dlen ) indel = 1;
- // This condition extends compound variants. Note that s/s/s sites are forced out to always break
- // a compound block. See ENST00000271583/splice-acceptor.vcf for motivation.
- if ( i<istack && hap->stack[i+1].node->type != HAP_SSS )
+ // This condition extends compound variants.
+ if ( i<istack )
{
if ( dlen%3 ) // frameshift
{
i = istack + 1, ibeg = -1;
while ( --i > 0 )
{
- if ( hap->stack[i].node->type == HAP_SSS )
- {
- hap_add_csq(args,hap,node,0,i,i,0,0);
- continue;
- }
+ assert ( hap->stack[i].node->type != HAP_SSS );
dlen += hap->stack[i].node->dlen;
if ( hap->stack[i].node->dlen ) indel = 1;
- if ( i>1 && hap->stack[i-1].node->type != HAP_SSS )
+ if ( i>1 )
{
if ( dlen%3 )
{
if ( rec->d.allele[1][0]=='<' || rec->d.allele[1][0]=='*' ) { continue; }
hap_node_t *parent = tr->hap[0] ? tr->hap[0] : tr->root;
hap_node_t *child = (hap_node_t*)calloc(1,sizeof(hap_node_t));
- if ( (hap_ret=hap_init(args, parent, child, cds, rec, 1))!=0 )
+ hap_ret = hap_init(args, parent, child, cds, rec, 1);
+ if ( hap_ret!=0 )
{
// overlapping or intron variant, cannot apply
if ( hap_ret==1 )
fprintf(args->out,"LOG\tWarning: Skipping overlapping variants at %s:%d\t%s>%s\n", chr,rec->pos+1,rec->d.allele[0],rec->d.allele[1]);
}
else ret = 1; // prevent reporting as intron in test_tscript
- free(child);
+ hap_destroy(child);
+ continue;
+ }
+ if ( child->type==HAP_SSS )
+ {
+ csq_t csq;
+ memset(&csq, 0, sizeof(csq_t));
+ csq.pos = rec->pos;
+ csq.type.biotype = tr->type;
+ csq.type.strand = tr->strand;
+ csq.type.trid = tr->id;
+ csq.type.gene = tr->gene->name;
+ csq.type.type = child->csq;
+ csq_stage(args, &csq, rec);
+ hap_destroy(child);
+ ret = 1;
continue;
}
parent->nend--;
}
hap_node_t *child = (hap_node_t*)calloc(1,sizeof(hap_node_t));
- if ( (hap_ret=hap_init(args, parent, child, cds, rec, ial))!=0 )
+ hap_ret = hap_init(args, parent, child, cds, rec, ial);
+ if ( hap_ret!=0 )
{
// overlapping or intron variant, cannot apply
if ( hap_ret==1 )
fprintf(args->out,"LOG\tWarning: Skipping overlapping variants at %s:%d, sample %s\t%s>%s\n",
chr,rec->pos+1,args->hdr->samples[args->smpl->idx[ismpl]],rec->d.allele[0],rec->d.allele[ial]);
}
- free(child);
+ hap_destroy(child);
+ continue;
+ }
+ if ( child->type==HAP_SSS )
+ {
+ csq_t csq;
+ memset(&csq, 0, sizeof(csq_t));
+ csq.pos = rec->pos;
+ csq.type.biotype = tr->type;
+ csq.type.strand = tr->strand;
+ csq.type.trid = tr->id;
+ csq.type.gene = tr->gene->name;
+ csq.type.type = child->csq;
+ csq_stage(args, &csq, rec);
+ hap_destroy(child);
continue;
}
-
if ( parent->cur_rec!=rec )
{
hts_expand(int,rec->n_allele,parent->mcur_child,parent->cur_child);
" s: skip unphased hets\n"
"Options:\n"
" -e, --exclude <expr> exclude sites for which the expression is true\n"
+ " --force run even if some sanity checks fail\n"
" -i, --include <expr> select sites for which the expression is true\n"
" -o, --output <file> write output to a file [standard output]\n"
" -O, --output-type <b|u|z|v|t> b: compressed BCF, u: uncompressed BCF, z: compressed VCF\n"
static struct option loptions[] =
{
+ {"force",0,0,1},
{"help",0,0,'h'},
{"ncsq",1,0,'n'},
{"custom-tag",1,0,'c'},
{
switch (c)
{
+ case 1 : args->force = 1; break;
case 'l': args->local_csq = 1; break;
case 'c': args->bcsq_tag = optarg; break;
case 'q': args->quiet++; break;
A .. gene line with a supported biotype
A.ID=~/^gene:/
- B .. transcript line referencing A
+ B .. transcript line referencing A with supported biotype
B.ID=~/^transcript:/ && B.Parent=~/^gene:A.ID/
C .. corresponding CDS, exon, and UTR lines:
csq_t *csq_buf; // pool of csq not managed by hap_node_t, i.e. non-CDS csqs
int ncsq_buf, mcsq_buf;
id_tbl_t tscript_ids; // mapping between transcript id (eg. Zm00001d027245_T001) and a numeric idx
+ int force; // force run under various conditions. Currently only to skip out-of-phase transcripts
faidx_t *fai;
kstring_t str, str2;
tr->cds[0]->len -= tr->cds[0]->phase;
tr->cds[0]->phase = 0;
- // sanity check phase
+ // sanity check phase; the phase number in gff tells us how many bases to skip in this
+ // feature to reach the first base of the next codon
+ int tscript_ok = 1;
for (i=0; i<tr->ncds; i++)
{
int phase = tr->cds[i]->phase ? 3 - tr->cds[i]->phase : 0;
if ( phase!=len%3)
- error("GFF3 assumption failed for transcript %s, CDS=%d: phase!=len%%3 (phase=%d, len=%d)\n",args->tscript_ids.str[tr->id],tr->cds[i]->beg+1,phase,len);
- assert( phase == len%3 );
+ {
+ if ( args->force )
+ {
+ if ( args->quiet < 2 )
+ fprintf(bcftools_stderr,"Warning: GFF3 assumption failed for transcript %s, CDS=%d: phase!=len%%3 (phase=%d, len=%d)\n",args->tscript_ids.str[tr->id],tr->cds[i]->beg+1,phase,len);
+ tscript_ok = 0;
+ break;
+ }
+ error("Error: GFF3 assumption failed for transcript %s, CDS=%d: phase!=len%%3 (phase=%d, len=%d)\n",args->tscript_ids.str[tr->id],tr->cds[i]->beg+1,phase,len);
+ }
len += tr->cds[i]->len;
}
+ if ( !tscript_ok ) continue; // skip this transcript
}
else
{
tr->cds[i]->phase = 0;
// sanity check phase
+ int tscript_ok = 1;
for (i=tr->ncds-1; i>=0; i--)
{
int phase = tr->cds[i]->phase ? 3 - tr->cds[i]->phase : 0;
if ( phase!=len%3)
- error("GFF3 assumption failed for transcript %s, CDS=%d: phase!=len%%3 (phase=%d, len=%d)\n",args->tscript_ids.str[tr->id],tr->cds[i]->beg+1,phase,len);
+ {
+ if ( args->force )
+ {
+ if ( args->quiet < 2 )
+ fprintf(bcftools_stderr,"Warning: GFF3 assumption failed for transcript %s, CDS=%d: phase!=len%%3 (phase=%d, len=%d)\n",args->tscript_ids.str[tr->id],tr->cds[i]->beg+1,phase,len);
+ tscript_ok = 0;
+ break;
+ }
+ error("Error: GFF3 assumption failed for transcript %s, CDS=%d: phase!=len%%3 (phase=%d, len=%d)\n",args->tscript_ids.str[tr->id],tr->cds[i]->beg+1,phase,len);
+ }
len += tr->cds[i]->len;
}
+ if ( !tscript_ok ) continue; // skip this transcript
}
// set len. At the same time check that CDS within a transcript do not overlap
splice->kalt.l = 0; kputsn(splice->vcf.alt + splice->tbeg, splice->vcf.alen, &splice->kalt);
if ( (splice->ref_beg+1 < ex_beg && splice->ref_end >= ex_beg) || (splice->ref_beg+1 < ex_end && splice->ref_end >= ex_end) ) // ouch, ugly ENST00000409523/long-overlapping-del.vcf
{
- splice->csq |= (splice->ref_end - splice->ref_beg + 1)%3 ? CSQ_FRAMESHIFT_VARIANT : CSQ_INFRAME_DELETION;
+ splice->csq |= (splice->ref_end - splice->ref_beg)%3 ? CSQ_FRAMESHIFT_VARIANT : CSQ_INFRAME_DELETION;
return SPLICE_OVERLAP;
}
}
child->var = str.s;
child->type = HAP_SSS;
child->csq = splice.csq;
- child->prev = parent->type==HAP_SSS ? parent->prev : parent;
child->rec = rec;
return 0;
}
assert( dbeg <= splice.kalt.l );
}
- if ( parent->type==HAP_SSS ) parent = parent->prev;
+ assert( parent->type!=HAP_SSS );
if ( parent->type==HAP_CDS )
{
i = parent->icds;
#endif
khint_t k = kh_get(pos2vbuf, args->pos2vbuf, (int)csq->pos);
vbuf_t *vbuf = (k == kh_end(args->pos2vbuf)) ? NULL : kh_val(args->pos2vbuf, k);
- if ( !vbuf ) error("This should not happen. %s:%d %s\n",bcf_seqname(args->hdr,rec),csq->pos+1,csq->type.vstr);
+ if ( !vbuf ) error("This should not happen. %s:%d %s\n",bcf_seqname(args->hdr,rec),csq->pos+1,csq->type.vstr.s);
int i;
for (i=0; i<vbuf->n; i++)
if ( vbuf->vrec[i]->line==rec ) break;
- if ( i==vbuf->n ) error("This should not happen.. %s:%d %s\n", bcf_seqname(args->hdr,rec),csq->pos+1,csq->type.vstr);
+ if ( i==vbuf->n ) error("This should not happen.. %s:%d %s\n", bcf_seqname(args->hdr,rec),csq->pos+1,csq->type.vstr.s);
vrec_t *vrec = vbuf->vrec[i];
// if the variant overlaps donor/acceptor and also splice region, report only donor/acceptor
hap->upstream_stop = 0;
int i = 1, dlen = 0, ibeg, indel = 0;
- while ( i<istack && hap->stack[i].node->type == HAP_SSS ) i++;
hap->sbeg = hap->stack[i].node->sbeg;
-
+ assert( hap->stack[istack].node->type != HAP_SSS );
if ( tr->strand==STRAND_FWD )
{
i = 0, ibeg = -1;
while ( ++i <= istack )
{
- if ( hap->stack[i].node->type == HAP_SSS )
- {
- // start/stop/splice site overlap: don't know how to build the haplotypes correctly, skipping
- hap_add_csq(args,hap,node,0,i,i,0,0);
- continue;
- }
+ assert( hap->stack[i].node->type != HAP_SSS );
+
dlen += hap->stack[i].node->dlen;
if ( hap->stack[i].node->dlen ) indel = 1;
- // This condition extends compound variants. Note that s/s/s sites are forced out to always break
- // a compound block. See ENST00000271583/splice-acceptor.vcf for motivation.
- if ( i<istack && hap->stack[i+1].node->type != HAP_SSS )
+ // This condition extends compound variants.
+ if ( i<istack )
{
if ( dlen%3 ) // frameshift
{
i = istack + 1, ibeg = -1;
while ( --i > 0 )
{
- if ( hap->stack[i].node->type == HAP_SSS )
- {
- hap_add_csq(args,hap,node,0,i,i,0,0);
- continue;
- }
+ assert ( hap->stack[i].node->type != HAP_SSS );
dlen += hap->stack[i].node->dlen;
if ( hap->stack[i].node->dlen ) indel = 1;
- if ( i>1 && hap->stack[i-1].node->type != HAP_SSS )
+ if ( i>1 )
{
if ( dlen%3 )
{
if ( rec->d.allele[1][0]=='<' || rec->d.allele[1][0]=='*' ) { continue; }
hap_node_t *parent = tr->hap[0] ? tr->hap[0] : tr->root;
hap_node_t *child = (hap_node_t*)calloc(1,sizeof(hap_node_t));
- if ( (hap_ret=hap_init(args, parent, child, cds, rec, 1))!=0 )
+ hap_ret = hap_init(args, parent, child, cds, rec, 1);
+ if ( hap_ret!=0 )
{
// overlapping or intron variant, cannot apply
if ( hap_ret==1 )
fprintf(args->out,"LOG\tWarning: Skipping overlapping variants at %s:%d\t%s>%s\n", chr,rec->pos+1,rec->d.allele[0],rec->d.allele[1]);
}
else ret = 1; // prevent reporting as intron in test_tscript
- free(child);
+ hap_destroy(child);
+ continue;
+ }
+ if ( child->type==HAP_SSS )
+ {
+ csq_t csq;
+ memset(&csq, 0, sizeof(csq_t));
+ csq.pos = rec->pos;
+ csq.type.biotype = tr->type;
+ csq.type.strand = tr->strand;
+ csq.type.trid = tr->id;
+ csq.type.gene = tr->gene->name;
+ csq.type.type = child->csq;
+ csq_stage(args, &csq, rec);
+ hap_destroy(child);
+ ret = 1;
continue;
}
parent->nend--;
}
hap_node_t *child = (hap_node_t*)calloc(1,sizeof(hap_node_t));
- if ( (hap_ret=hap_init(args, parent, child, cds, rec, ial))!=0 )
+ hap_ret = hap_init(args, parent, child, cds, rec, ial);
+ if ( hap_ret!=0 )
{
// overlapping or intron variant, cannot apply
if ( hap_ret==1 )
fprintf(args->out,"LOG\tWarning: Skipping overlapping variants at %s:%d, sample %s\t%s>%s\n",
chr,rec->pos+1,args->hdr->samples[args->smpl->idx[ismpl]],rec->d.allele[0],rec->d.allele[ial]);
}
- free(child);
+ hap_destroy(child);
+ continue;
+ }
+ if ( child->type==HAP_SSS )
+ {
+ csq_t csq;
+ memset(&csq, 0, sizeof(csq_t));
+ csq.pos = rec->pos;
+ csq.type.biotype = tr->type;
+ csq.type.strand = tr->strand;
+ csq.type.trid = tr->id;
+ csq.type.gene = tr->gene->name;
+ csq.type.type = child->csq;
+ csq_stage(args, &csq, rec);
+ hap_destroy(child);
continue;
}
-
if ( parent->cur_rec!=rec )
{
hts_expand(int,rec->n_allele,parent->mcur_child,parent->cur_child);
" s: skip unphased hets\n"
"Options:\n"
" -e, --exclude <expr> exclude sites for which the expression is true\n"
+ " --force run even if some sanity checks fail\n"
" -i, --include <expr> select sites for which the expression is true\n"
" -o, --output <file> write output to a file [standard output]\n"
" -O, --output-type <b|u|z|v|t> b: compressed BCF, u: uncompressed BCF, z: compressed VCF\n"
static struct option loptions[] =
{
+ {"force",0,0,1},
{"help",0,0,'h'},
{"ncsq",1,0,'n'},
{"custom-tag",1,0,'c'},
{
switch (c)
{
+ case 1 : args->force = 1; break;
case 'l': args->local_csq = 1; break;
case 'c': args->bcsq_tag = optarg; break;
case 'q': args->quiet++; break;
#include <strings.h>
#include <errno.h>
#include <math.h>
-#include <wordexp.h>
+#include <sys/types.h>
+#include <pwd.h>
#include <regex.h>
#include <htslib/khash_str2int.h>
-#include "filter.h"
-#include "bcftools.h"
#include <htslib/hts_defs.h>
#include <htslib/vcfutils.h>
+#include <htslib/kfunc.h>
+#include "config.h"
+#include "filter.h"
+#include "bcftools.h"
+
+#if ENABLE_PERL_FILTERS
+# define filter_t perl_filter_t
+# include <EXTERN.h>
+# include <perl.h>
+# undef filter_t
+# define my_perl perl
+
+static int filter_ninit = 0;
+#endif
+
#ifndef __FUNCTION__
# define __FUNCTION__ __func__
{
// read-only values, same for all VCF lines
int tok_type; // one of the TOK_* keys below
+ int nargs; // with TOK_PERLSUB the first argument is the name of the subroutine
char *key; // set only for string constants, otherwise NULL
char *tag; // for debugging and printout only, VCF tag name
double threshold; // filtering threshold
+ int is_constant; // the threshold is set
int hdr_id, type; // BCF header lookup ID and one of BCF_HT_* types
int idx; // 0-based index to VCF vectors,
// -2: list (e.g. [0,1,2] or [1..3] or [1..] or any field[*], which is equivalent to [0..])
float *tmpf;
kstring_t tmps;
int max_unpack, mtmpi, mtmpf, nsamples;
+#if ENABLE_PERL_FILTERS
+ PerlInterpreter *perl;
+#endif
};
#define TOK_LEN 24
#define TOK_FUNC 25
#define TOK_CNT 26
+#define TOK_PERLSUB 27
+#define TOK_BINOM 28
-// 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
-// ( ) [ < = > ] ! | & + - * / M m a A O ~ ^ S . l
-static int op_prec[] = {0,1,1,5,5,5,5,5,5,2,3, 6, 6, 7, 7, 8, 8, 8, 3, 2, 5, 5, 8, 8, 8, 8, 8};
-#define TOKEN_STRING "x()[<=>]!|&+-*/MmaAO~^f"
+// 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
+// ( ) [ < = > ] ! | & + - * / M m a A O ~ ^ S . l f c p
+static int op_prec[] = {0,1,1,5,5,5,5,5,5,2,3, 6, 6, 7, 7, 8, 8, 8, 3, 2, 5, 5, 8, 8, 8, 8, 8, 8};
+#define TOKEN_STRING "x()[<=>]!|&+-*/MmaAO~^S.lfcp"
+// Return negative values if it is a function with variable number of arguments
static int filters_next_token(char **str, int *len)
{
char *tmp = *str;
if ( !strncasecmp(tmp,"ABS(",4) ) { (*str) += 3; return TOK_ABS; }
if ( !strncasecmp(tmp,"COUNT(",6) ) { (*str) += 5; return TOK_CNT; }
if ( !strncasecmp(tmp,"STRLEN(",7) ) { (*str) += 6; return TOK_LEN; }
+ if ( !strncasecmp(tmp,"BINOM(",6) ) { (*str) += 5; return -TOK_BINOM; }
if ( !strncasecmp(tmp,"%MAX(",5) ) { (*str) += 4; return TOK_MAX; } // for backward compatibility
if ( !strncasecmp(tmp,"%MIN(",5) ) { (*str) += 4; return TOK_MIN; } // for backward compatibility
if ( !strncasecmp(tmp,"%AVG(",5) ) { (*str) += 4; return TOK_AVG; } // for backward compatibility
if ( !strncasecmp(tmp,"INFO/",5) ) tmp += 5;
if ( !strncasecmp(tmp,"FORMAT/",7) ) tmp += 7;
if ( !strncasecmp(tmp,"FMT/",4) ) tmp += 4;
+ if ( !strncasecmp(tmp,"PERL.",5) ) { (*str) += 5; return -TOK_PERLSUB; }
+ if ( !strncasecmp(tmp,"N_PASS(",7) ) { *len = 6; (*str) += 6; return -TOK_FUNC; }
+ if ( !strncasecmp(tmp,"F_PASS(",7) ) { *len = 6; (*str) += 6; return -TOK_FUNC; }
if ( tmp[0]=='@' ) // file name
{
int square_brackets = 0;
while ( tmp[0] )
{
- if ( tmp[0]=='"' ) break;
- if ( tmp[0]=='\'' ) break;
- if ( isspace(tmp[0]) ) break;
- if ( tmp[0]=='<' ) break;
- if ( tmp[0]=='>' ) break;
- if ( tmp[0]=='=' ) break;
- if ( tmp[0]=='!' ) break;
- if ( tmp[0]=='&' ) break;
- if ( tmp[0]=='|' ) break;
- if ( tmp[0]=='(' ) break;
- if ( tmp[0]==')' ) break;
- if ( tmp[0]=='+' ) break;
- if ( tmp[0]=='*' && !square_brackets ) break;
- if ( tmp[0]=='-' && !square_brackets ) break;
- if ( tmp[0]=='/' ) break;
- if ( tmp[0]=='~' ) break;
+ if ( !square_brackets )
+ {
+ if ( tmp[0]=='"' ) break;
+ if ( tmp[0]=='\'' ) break;
+ if ( isspace(tmp[0]) ) break;
+ if ( tmp[0]=='<' ) break;
+ if ( tmp[0]=='>' ) break;
+ if ( tmp[0]=='=' ) break;
+ if ( tmp[0]=='!' ) break;
+ if ( tmp[0]=='&' ) break;
+ if ( tmp[0]=='|' ) break;
+ if ( tmp[0]=='(' ) break;
+ if ( tmp[0]==')' ) break;
+ if ( tmp[0]=='+' ) break;
+ if ( tmp[0]=='*' ) break;
+ if ( tmp[0]=='-' ) break;
+ if ( tmp[0]=='/' ) break;
+ if ( tmp[0]=='~' ) break;
+ }
if ( tmp[0]==']' ) { if (square_brackets) tmp++; break; }
if ( tmp[0]=='[' ) square_brackets++;
tmp++;
return TOK_VAL;
}
+
+/*
+ Simple path expansion, expands ~/, ~user, $var. The result must be freed by the caller.
+
+ Based on jkb's staden code with some adjustments.
+ https://sourceforge.net/p/staden/code/HEAD/tree/staden/trunk/src/Misc/getfile.c#l123
+*/
+char *expand_path(char *path)
+{
+#ifdef _WIN32
+ return strdup(path); // windows expansion: todo
+#endif
+
+ kstring_t str = {0,0,0};
+
+ if ( path[0] == '~' )
+ {
+ if ( !path[1] || path[1] == '/' )
+ {
+ // ~ or ~/path
+ kputs(getenv("HOME"), &str);
+ if ( path[1] ) kputs(path+1, &str);
+ }
+ else
+ {
+ // user name: ~pd3/path
+ char *end = path;
+ while ( *end && *end!='/' ) end++;
+ kputsn(path+1, end-path-1, &str);
+ struct passwd *pwentry = getpwnam(str.s);
+ str.l = 0;
+
+ if ( !pwentry ) kputsn(path, end-path, &str);
+ else kputs(pwentry->pw_dir, &str);
+ kputs(end, &str);
+ }
+ return str.s;
+ }
+ if ( path[0] == '$' )
+ {
+ char *var = getenv(path+1);
+ if ( var ) path = var;
+ }
+ return strdup(path);
+}
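As an illustration of the `~` and `~/path` branch only, here is a simplified standalone sketch that uses just the C library instead of `kstring_t`. The name `expand_home` is hypothetical; unlike the function above it does not handle `~user` or `$var`, and it returns an unmodified copy when `HOME` is unset, which is an assumption of this sketch rather than the behaviour of the diff.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Expands a leading "~" or "~/" using $HOME. The result must be freed by the caller. */
static char *expand_home(const char *path)
{
    assert(path != NULL);
    const char *home = getenv("HOME");
    if ( path[0]=='~' && (!path[1] || path[1]=='/') && home )
    {
        size_t nhome = strlen(home), nrest = strlen(path + 1);
        char *out = malloc(nhome + nrest + 1);
        if ( !out ) return NULL;
        memcpy(out, home, nhome);
        memcpy(out + nhome, path + 1, nrest + 1);   /* copies the terminating NUL as well */
        return out;
    }
    /* Anything else is returned as a plain copy */
    char *out = malloc(strlen(path) + 1);
    if ( out ) strcpy(out, path);
    return out;
}
```

With `HOME=/home/u`, calling `expand_home("~/x")` yields `/home/u/x`, and a path without a leading tilde comes back unchanged.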
+
static void filters_set_qual(filter_t *flt, bcf1_t *line, token_t *tok)
{
float *ptr = &line->qual;
kputs(line->d.allele[tok->idx + 1], &tok->str_value);
else
kputc('.', &tok->str_value);
- tok->idx = 0;
}
else if ( tok->idx==-2 )
{
tok->nvalues = 1;
tok->values[0] = tok->tag[0]=='N' ? nmissing : (double)nmissing / line->n_sample;
}
+static int func_npass(filter_t *flt, bcf1_t *line, token_t *rtok, token_t **stack, int nstack)
+{
+ if ( nstack==0 ) error("Error parsing the expression\n");
+ token_t *tok = stack[nstack - 1];
+ if ( !tok->nsamples ) error("The function %s works with FORMAT fields\n", rtok->tag);
+
+ rtok->nsamples = tok->nsamples;
+ memcpy(rtok->pass_samples, tok->pass_samples, rtok->nsamples*sizeof(*rtok->pass_samples));
+
+ assert(tok->usmpl);
+ if ( !rtok->usmpl )
+ {
+ rtok->usmpl = (uint8_t*) malloc(tok->nsamples*sizeof(*rtok->usmpl));
+ memcpy(rtok->usmpl, tok->usmpl, tok->nsamples*sizeof(*rtok->usmpl));
+ }
+
+ int i, npass = 0;
+ for (i=0; i<rtok->nsamples; i++)
+ {
+ if ( !rtok->usmpl[i] ) continue;
+ if ( rtok->pass_samples[i] ) npass++;
+ }
+
+ assert( rtok->values );
+ rtok->nvalues = 1;
+ rtok->values[0] = rtok->tag[0]=='N' ? npass : (line->n_sample ? 1.0*npass/line->n_sample : 0);
+ rtok->nsamples = 0;
+
+ return 1;
+}
static void filters_set_nalt(filter_t *flt, bcf1_t *line, token_t *tok)
{
tok->nvalues = 1;
}
else
{
- rtok->values[0] = strlen(tok->str_value.s);
+ if ( !tok->str_value.s[1] && tok->str_value.s[0]=='.' )
+ rtok->values[0] = 0;
+ else
+ rtok->values[0] = strlen(tok->str_value.s);
rtok->nvalues = 1;
}
return 1;
}
+static inline double calc_binom(int na, int nb)
+{
+ if ( na==0 && nb==0 ) return -1;
+ if ( na==nb ) return 1;
+
+ // kfunc.h implements kf_betai, which is the regularized beta function P(X<=k/N;p) = I_{1-p}(N-k,k+1)
+
+ double pval = na < nb ? kf_betai(nb, na + 1, 0.5) : kf_betai(na, nb + 1, 0.5);
+ pval *= 2;
+ assert( pval-1 < 1e-10 );
+ if ( pval>1 ) pval = 1; // this can happen, machine precision error, eg. kf_betai(1,0,0.5)
+
+ return pval;
+}
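For intuition, the two-sided test with an expected proportion of 0.5 can also be written as a direct sum over the binomial tail, without `kf_betai`. The sketch below is intended to mirror the return conventions of `calc_binom` (-1 for no data, 1 for equal counts, capped at 1), but it is an independent illustration with a hypothetical name, and the naive summation is only suitable for modest depths because `0.5^n` underflows for very large `n`.

```c
#include <assert.h>

/* Two-sided binomial p-value for counts na, nb under p = 0.5, by direct summation. */
static double binom_two_sided(int na, int nb)
{
    assert(na >= 0 && nb >= 0);
    if ( na==0 && nb==0 ) return -1;    /* no data, caller treats as missing */
    if ( na==nb ) return 1;

    int n = na + nb, kmin = na < nb ? na : nb, k;
    double term = 1, sum = 0;
    for (k = 0; k < n; k++) term *= 0.5;        /* P(X=0) = 0.5^n */
    for (k = 0; k <= kmin; k++)
    {
        sum += term;                            /* accumulate P(X=k) */
        term *= (double)(n - k) / (k + 1);      /* P(X=k+1) from P(X=k) */
    }
    double pval = 2 * sum;                      /* both tails are equal when p = 0.5 */
    return pval > 1 ? 1 : pval;
}
```

For counts 1 and 3 the lower tail is (1+4)/16, so the p-value is 0.625; for counts 1 and 2 the doubled tail reaches 1 and is capped there.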
+static int func_binom(filter_t *flt, bcf1_t *line, token_t *rtok, token_t **stack, int nstack)
+{
+ int i, istack = nstack - rtok->nargs;
+
+ if ( rtok->nargs!=2 && rtok->nargs!=1 ) error("Error: binom() takes one or two arguments\n");
+ assert( istack>=0 );
+
+ // The expected mean is 0.5. Should we support also prob!=0.5?
+ //
+ // double prob = 0.5;
+ // if ( nstack==3 )
+ // {
+ // // three parameters, the first one must be a scalar: binom(0.25,AD[0],AD[1])
+ // if ( !stack[istack]->is_constant )
+ // error("The first argument of binom() must be a numeric constant if three parameters are given\n");
+ // prob = stack[istack]->threshold;
+ // istack++;
+ // }
+ // else if ( nstack==2 && stack[istack]->is_constant )
+ // {
+ // // two parameters, the first can be a scalar: binom(0.25,AD) or binom(AD[0],AD[1])
+ // prob = stack[istack]->threshold;
+ // istack++;
+ // }
+
+ token_t *tok = stack[istack];
+ if ( tok->nsamples )
+ {
+ // working with a FORMAT tag
+ rtok->nval1 = 1;
+ rtok->nvalues = tok->nsamples;
+ rtok->nsamples = tok->nsamples;
+ hts_expand(double, rtok->nvalues, rtok->mvalues, rtok->values);
+ assert(tok->usmpl);
+ if ( !rtok->usmpl ) rtok->usmpl = (uint8_t*) malloc(tok->nsamples);
+ memcpy(rtok->usmpl, tok->usmpl, tok->nsamples);
+
+ if ( istack+1==nstack )
+ {
+ // determine the index from the GT field: binom(AD)
+ int ngt = bcf_get_genotypes(flt->hdr, line, &flt->tmpi, &flt->mtmpi);
+ int max_ploidy = ngt/line->n_sample;
+ if ( ngt <= 0 || max_ploidy < 2 ) // GT not present or not diploid, cannot set
+ {
+ for (i=0; i<rtok->nsamples; i++)
+ if ( rtok->usmpl[i] ) bcf_double_set_missing(rtok->values[i]);
+ return rtok->nargs;
+ }
+ for (i=0; i<rtok->nsamples; i++)
+ {
+ if ( !rtok->usmpl[i] ) continue;
+ int32_t *ptr = flt->tmpi + i*max_ploidy;
+ if ( bcf_gt_is_missing(ptr[0]) || bcf_gt_is_missing(ptr[1]) || ptr[1]==bcf_int32_vector_end )
+ {
+ bcf_double_set_missing(rtok->values[i]);
+ continue;
+ }
+ int idx1 = bcf_gt_allele(ptr[0]);
+ int idx2 = bcf_gt_allele(ptr[1]);
+ if ( idx1>=line->n_allele ) error("Incorrect allele index at %s:%d, sample %s\n", bcf_seqname(flt->hdr,line),line->pos+1,flt->hdr->samples[i]);
+ if ( idx2>=line->n_allele ) error("Incorrect allele index at %s:%d, sample %s\n", bcf_seqname(flt->hdr,line),line->pos+1,flt->hdr->samples[i]);
+ double *vals = tok->values + tok->nval1*i;
+ if ( bcf_double_is_missing(vals[idx1]) || bcf_double_is_missing(vals[idx2]) )
+ {
+ bcf_double_set_missing(rtok->values[i]);
+ continue;
+ }
+ rtok->values[i] = calc_binom(vals[idx1],vals[idx2]);
+ if ( rtok->values[i] < 0 )
+ {
+ bcf_double_set_missing(rtok->values[i]);
+ continue;
+ }
+ }
+ }
+ else
+ {
+ // the fields given explicitly: binom(AD[:0],AD[:1])
+ token_t *tok2 = stack[istack+1];
+ if ( tok->nval1!=1 || tok2->nval1!=1 )
+ error("Expected one value per binom() argument, found %d and %d at %s:%d\n",tok->nval1,tok2->nval1, bcf_seqname(flt->hdr,line),line->pos+1);
+ for (i=0; i<rtok->nsamples; i++)
+ {
+ if ( !rtok->usmpl[i] ) continue;
+ double *ptr1 = tok->values + tok->nval1*i;
+ double *ptr2 = tok2->values + tok2->nval1*i;
+ if ( bcf_double_is_missing(ptr1[0]) || bcf_double_is_missing(ptr2[0]) )
+ {
+ bcf_double_set_missing(rtok->values[i]);
+ continue;
+ }
+ rtok->values[i] = calc_binom(ptr1[0],ptr2[0]);
+ if ( rtok->values[i] < 0 )
+ {
+ bcf_double_set_missing(rtok->values[i]);
+ continue;
+ }
+ }
+ }
+ }
+ else
+ {
+ // working with an INFO tag
+ rtok->nvalues = 1;
+ hts_expand(double, rtok->nvalues, rtok->mvalues, rtok->values);
+
+ double *ptr1 = NULL, *ptr2 = NULL;
+ if ( istack+1==nstack )
+ {
+ // only one tag, expecting two values: binom(INFO/AD)
+ if ( tok->nvalues==2 )
+ {
+ ptr1 = &tok->values[0];
+ ptr2 = &tok->values[1];
+ }
+ }
+ else
+ {
+ // two tags, expecting one value in each: binom(INFO/AD[0],INFO/AD[1])
+ token_t *tok2 = stack[istack+1];
+ if ( tok->nvalues==1 && tok2->nvalues==1 )
+ {
+ ptr1 = &tok->values[0];
+ ptr2 = &tok2->values[0];
+ }
+ }
+ if ( !ptr1 || !ptr2 || bcf_double_is_missing(ptr1[0]) || bcf_double_is_missing(ptr2[0]) )
+ bcf_double_set_missing(rtok->values[0]);
+ else
+ {
+ rtok->values[0] = calc_binom(ptr1[0],ptr2[0]);
+ if ( rtok->values[0] < 0 )
+ bcf_double_set_missing(rtok->values[0]);
+ }
+ }
+ return rtok->nargs;
+}
inline static void tok_init_values(token_t *atok, token_t *btok, token_t *rtok)
{
token_t *tok = atok->nvalues > btok->nvalues ? atok : btok;
rtok->nvalues = tok->nvalues;
rtok->nval1 = tok->nval1;
- hts_expand(double*, rtok->nvalues, rtok->mvalues, rtok->values);
+ hts_expand(double, rtok->nvalues, rtok->mvalues, rtok->values);
}
inline static void tok_init_samples(token_t *atok, token_t *btok, token_t *rtok)
{
static int vector_logic_or(filter_t *filter, bcf1_t *line, token_t *rtok, token_t **stack, int nstack)
{
+ if ( nstack < 2 ) error("Error occurred while processing the filter \"%s\"\n", filter->str);
+
token_t *atok = stack[nstack-2];
token_t *btok = stack[nstack-1];
tok_init_samples(atok, btok, rtok);
}
static int vector_logic_and(filter_t *filter, bcf1_t *line, token_t *rtok, token_t **stack, int nstack)
{
+ if ( nstack < 2 ) error("Error occurred while processing the filter \"%s\". (nstack=%d)\n", filter->str,nstack);
+
token_t *atok = stack[nstack-2];
token_t *btok = stack[nstack-1];
tok_init_samples(atok, btok, rtok);
int *idxs2 = NULL, nidxs2 = 0, idx2 = 0;
int set_samples = 0;
- char *colon = index(tag_idx, ':');
- if ( colon )
+ char *colon = rindex(tag_idx, ':');
+ if ( tag_idx[0]=='@' ) // file list with sample names
+ {
+ if ( !is_fmt ) error("Could not parse \"%s\". (Not a FORMAT tag yet a sample list provided.)\n", ori);
+ char *fname = expand_path(tag_idx+1);
+ int nsmpl;
+ char **list = hts_readlist(fname, 1, &nsmpl);
+ if ( !list && colon )
+ {
+ if ( parse_idxs(colon+1, &idxs2, &nidxs2, &idx2) != 0 ) error("Could not parse the index: %s\n", ori);
+ tok->idxs = idxs2;
+ tok->nidxs = nidxs2;
+ tok->idx = idx2;
+ colon = rindex(fname, ':');
+ *colon = 0;
+ list = hts_readlist(fname, 1, &nsmpl);
+ }
+ if ( !list ) error("Could not read: %s\n", fname);
+ free(fname);
+ tok->nsamples = bcf_hdr_nsamples(hdr);
+ tok->usmpl = (uint8_t*) calloc(tok->nsamples,1);
+ for (i=0; i<nsmpl; i++)
+ {
+ int ismpl = bcf_hdr_id2int(hdr,BCF_DT_SAMPLE,list[i]);
+ if ( ismpl<0 ) error("No such sample in the VCF: \"%s\"\n", list[i]);
+ free(list[i]);
+ tok->usmpl[ismpl] = 1;
+ }
+ free(list);
+ if ( !colon )
+ {
+ tok->idxs = (int*) malloc(sizeof(int));
+ tok->idxs[0] = -1;
+ tok->nidxs = 1;
+ tok->idx = -2;
+ }
+ }
+ else if ( colon )
{
+ if ( !is_fmt ) error("Could not parse the index \"%s\". (Not a FORMAT tag yet sample index implied.)\n", ori);
*colon = 0;
if ( parse_idxs(tag_idx, &idxs1, &nidxs1, &idx1) != 0 ) error("Could not parse the index: %s\n", ori);
if ( parse_idxs(colon+1, &idxs2, &nidxs2, &idx2) != 0 ) error("Could not parse the index: %s\n", ori);
for (i=0; i<tok->nidxs; i++) if ( tok->idxs[i] ) tok->nuidxs++;
}
}
+static int max_ac_an_unpack(bcf_hdr_t *hdr)
+{
+ int hdr_id = bcf_hdr_id2int(hdr,BCF_DT_ID,"AC");
+ if ( hdr_id<0 ) return BCF_UN_FMT;
+ if ( !bcf_hdr_idinfo_exists(hdr,BCF_HL_INFO,hdr_id) ) return BCF_UN_FMT;
+
+ hdr_id = bcf_hdr_id2int(hdr,BCF_DT_ID,"AN");
+ if ( hdr_id<0 ) return BCF_UN_FMT;
+ if ( !bcf_hdr_idinfo_exists(hdr,BCF_HL_INFO,hdr_id) ) return BCF_UN_FMT;
+
+ return BCF_UN_INFO;
+}
static int filters_init1(filter_t *filter, char *str, int len, token_t *tok)
{
tok->tok_type = TOK_VAL;
tok->tag = (char*) calloc(len+1,sizeof(char));
memcpy(tok->tag,str,len);
tok->tag[len] = 0;
- wordexp_t wexp;
- wordexp(tok->tag+1, &wexp, 0);
- if ( !wexp.we_wordc ) error("No such file: %s\n", tok->tag+1);
+ char *fname = expand_path(tok->tag+1);
int i, n;
- char **list = hts_readlist(wexp.we_wordv[0], 1, &n);
- if ( !list ) error("Could not read: %s\n", wexp.we_wordv[0]);
- wordfree(&wexp);
+ char **list = hts_readlist(fname, 1, &n);
+ if ( !list ) error("Could not read: %s\n", fname);
+ free(fname);
tok->hash = khash_str2int_init();
for (i=0; i<n; i++)
{
{
if ( tok->hdr_id >=0 )
{
- if ( bcf_hdr_idinfo_exists(filter->hdr,BCF_HL_INFO,tok->hdr_id) ) is_fmt = 0;
- else if ( bcf_hdr_idinfo_exists(filter->hdr,BCF_HL_FMT,tok->hdr_id) ) is_fmt = 1;
+ int is_info = bcf_hdr_idinfo_exists(filter->hdr,BCF_HL_INFO,tok->hdr_id) ? 1 : 0;
+ is_fmt = bcf_hdr_idinfo_exists(filter->hdr,BCF_HL_FMT,tok->hdr_id) ? 1 : 0;
+ if ( is_info && is_fmt ) error("Both INFO/%s and FORMAT/%s exist, which one do you want?\n", tmp.s,tmp.s);
}
if ( is_fmt==-1 ) is_fmt = 0;
}
}
else if ( !strcasecmp(tmp.s,"AN") )
{
+ filter->max_unpack |= BCF_UN_FMT;
tok->setter = &filters_set_an;
tok->tag = strdup("AN");
free(tmp.s);
}
else if ( !strcasecmp(tmp.s,"AC") )
{
+ filter->max_unpack |= BCF_UN_FMT;
tok->setter = &filters_set_ac;
tok->tag = strdup("AC");
free(tmp.s);
}
else if ( !strcasecmp(tmp.s,"MAC") )
{
+ filter->max_unpack |= max_ac_an_unpack(filter->hdr);
tok->setter = &filters_set_mac;
tok->tag = strdup("MAC");
free(tmp.s);
}
else if ( !strcasecmp(tmp.s,"AF") )
{
+ filter->max_unpack |= max_ac_an_unpack(filter->hdr);
tok->setter = &filters_set_af;
tok->tag = strdup("AF");
free(tmp.s);
}
else if ( !strcasecmp(tmp.s,"MAF") )
{
+ filter->max_unpack |= max_ac_an_unpack(filter->hdr);
tok->setter = &filters_set_maf;
tok->tag = strdup("MAF");
free(tmp.s);
tok->threshold = strtof(tmp.s, &end); // float?
if ( errno!=0 || end!=tmp.s+len ) error("[%s:%d %s] Error: the tag \"%s\" is not defined in the VCF header\n", __FILE__,__LINE__,__FUNCTION__,tmp.s);
}
+ tok->is_constant = 1;
if ( tmp.s ) free(tmp.s);
return 0;
{
while ( *str ) { *str = tolower(*str); str++; }
}
+static int perl_exec(filter_t *flt, bcf1_t *line, token_t *rtok, token_t **stack, int nstack)
+{
+#if ENABLE_PERL_FILTERS
+
+ PerlInterpreter *perl = flt->perl;
+ if ( !perl ) error("Error: perl expression without a perl script name\n");
+
+ dSP;
+ ENTER;
+ SAVETMPS;
+
+ PUSHMARK(SP);
+ int i,j, istack = nstack - rtok->nargs;
+ for (i=istack+1; i<nstack; i++)
+ {
+ token_t *tok = stack[i];
+ if ( tok->is_str )
+ XPUSHs(sv_2mortal(newSVpvn(tok->str_value.s,tok->str_value.l)));
+ else if ( tok->nvalues==1 )
+ XPUSHs(sv_2mortal(newSVnv(tok->values[0])));
+ else if ( tok->nvalues>1 )
+ {
+ AV *av = newAV();
+ for (j=0; j<tok->nvalues; j++) av_push(av, newSVnv(tok->values[j]));
+ SV *rv = newRV_inc((SV*)av);
+ XPUSHs(rv);
+ }
+ else
+ {
+ bcf_double_set_missing(tok->values[0]);
+ XPUSHs(sv_2mortal(newSVnv(tok->values[0])));
+ }
+ }
+ PUTBACK;
+
+ // A possible future todo: provide a means to select samples and indexes,
+ // expressions like this don't work yet
+ // perl.filter(FMT/AD)[1:0]
+
+ int nret = call_pv(stack[istack]->str_value.s, G_ARRAY);
+
+ SPAGAIN;
+
+ rtok->nvalues = nret;
+ hts_expand(double, rtok->nvalues, rtok->mvalues, rtok->values);
+ for (i=nret; i>0; i--)
+ {
+ rtok->values[i-1] = (double) POPn;
+ if ( isnan(rtok->values[i-1]) ) bcf_double_set_missing(rtok->values[i-1]);
+ }
+
+ PUTBACK;
+ FREETMPS;
+ LEAVE;
+
+#else
+ error("\nPerl filtering requires running `configure --enable-perl-filters` at compile time.\n\n");
+#endif
+ return rtok->nargs;
+}
+static void perl_init(filter_t *filter, char **str)
+{
+ char *beg = *str;
+ while ( *beg && isspace(*beg) ) beg++;
+ if ( !*beg ) return;
+ if ( strncasecmp("perl:", beg, 5) ) return;
+#if ENABLE_PERL_FILTERS
+ beg += 5;
+ char *end = beg;
+ while ( *end && *end!=';' ) end++; // for now not escaping semicolons
+ *str = end+1;
+
+ if ( ++filter_ninit == 1 )
+ {
+ // must be executed only once, even for multiple filters; first time here
+ int argc = 0;
+ char **argv = NULL;
+ char **env = NULL;
+ PERL_SYS_INIT3(&argc, &argv, &env);
+ }
+
+ filter->perl = perl_alloc();
+ PerlInterpreter *perl = filter->perl;
+
+ if ( !perl ) error("perl_alloc failed\n");
+ perl_construct(perl);
+
+ // name of the perl script to run
+ char *rmme = (char*) calloc(end - beg + 1,1);
+ memcpy(rmme, beg, end - beg);
+ char *argv[] = { "", "" };
+ argv[1] = expand_path(rmme);
+ free(rmme);
+
+ PL_origalen = 1; // don't allow $0 change
+ int ret = perl_parse(filter->perl, NULL, 2, argv, NULL);
+ PL_exit_flags |= PERL_EXIT_DESTRUCT_END;
+ if ( ret ) error("Failed to parse: %s\n", argv[1]);
+ free(argv[1]);
+
+ perl_run(perl);
+#else
+ error("\nPerl filtering requires running `configure --enable-perl-filters` at compile time.\n\n");
+#endif
+}
+static void perl_destroy(filter_t *filter)
+{
+#if ENABLE_PERL_FILTERS
+ if ( !filter->perl ) return;
+
+ PerlInterpreter *perl = filter->perl;
+ perl_destruct(perl);
+ perl_free(perl);
+ if ( --filter_ninit <= 0 )
+ {
+ // similarly to PERL_SYS_INIT3, must be executed only once? todo: test
+ PERL_SYS_TERM();
+ }
+#endif
+}
// Parse filter expression and convert to reverse polish notation. Dijkstra's shunting-yard algorithm
filter->hdr = hdr;
filter->max_unpack |= BCF_UN_STR;
- int nops = 0, mops = 0, *ops = NULL; // operators stack
- int nout = 0, mout = 0; // filter tokens, RPN
+ int nops = 0, mops = 0; // operators stack
+ int nout = 0, mout = 0; // filter tokens, RPN
token_t *out = NULL;
+ token_t *ops = NULL;
char *tmp = filter->str;
+ perl_init(filter, &tmp);
+
int last_op = -1;
while ( *tmp )
{
if ( ret==-1 ) error("Missing quotes in: %s\n", str);
// fprintf(stderr,"token=[%c] .. [%s] %d\n", TOKEN_STRING[ret], tmp, len);
- // int i; for (i=0; i<nops; i++) fprintf(stderr," .%c.", TOKEN_STRING[ops[i]]); fprintf(stderr,"\n");
+ // int i; for (i=0; i<nops; i++) fprintf(stderr," .%c", TOKEN_STRING[ops[i]]); fprintf(stderr,"\n");
if ( ret==TOK_LFT ) // left bracket
{
nops++;
- hts_expand(int, nops, mops, ops);
- ops[nops-1] = ret;
+ hts_expand0(token_t, nops, mops, ops);
+ ops[nops-1].tok_type = ret;
}
else if ( ret==TOK_RGT ) // right bracket
{
- while ( nops>0 && ops[nops-1]!=TOK_LFT )
+ while ( nops>0 && ops[nops-1].tok_type!=TOK_LFT )
{
nout++;
hts_expand0(token_t, nout, mout, out);
- out[nout-1].tok_type = ops[nops-1];
+ out[nout-1] = ops[nops-1];
+ memset(&ops[nops-1],0,sizeof(token_t));
nops--;
}
if ( nops<=0 ) error("Could not parse: %s\n", str);
+ memset(&ops[nops-1],0,sizeof(token_t));
nops--;
}
else if ( ret!=TOK_VAL ) // one of the operators
tok->threshold = -1.0;
ret = TOK_MULT;
}
+ else if ( ret == -TOK_FUNC )
+ {
+ // this is different from TOK_PERLSUB,TOK_BINOM in that the expression inside the
+ // brackets gets evaluated as normal expression
+ nops++;
+ hts_expand0(token_t, nops, mops, ops);
+ token_t *tok = &ops[nops-1];
+ tok->tok_type = -ret;
+ tok->hdr_id = -1;
+ tok->pass_site = -1;
+ tok->threshold = -1.0;
+ if ( !strncasecmp(tmp-len,"N_PASS",6) ) { tok->func = func_npass; tok->tag = strdup("N_PASS"); }
+ else if ( !strncasecmp(tmp-len,"F_PASS",6) ) { tok->func = func_npass; tok->tag = strdup("F_PASS"); }
+ else error("The function \"%s\" is not supported\n", tmp-len);
+ continue;
+ }
+ else if ( ret < 0 ) // variable number of arguments: TOK_PERLSUB,TOK_BINOM
+ {
+ ret = -ret;
+
+ tmp += len;
+ char *beg = tmp;
+ kstring_t rmme = {0,0,0};
+ int i, margs, nargs = 0;
+
+ if ( ret == TOK_PERLSUB )
+ {
+ while ( *beg && ((isalnum(*beg) && !ispunct(*beg)) || *beg=='_') ) beg++;
+ if ( *beg!='(' ) error("Could not parse the expression: %s\n", str);
+
+ // the subroutine name
+ kputc('"', &rmme);
+ kputsn(tmp, beg-tmp, &rmme);
+ kputc('"', &rmme);
+ nout++;
+ hts_expand0(token_t, nout, mout, out);
+ filters_init1(filter, rmme.s, rmme.l, &out[nout-1]);
+ nargs++;
+ }
+ char *end = beg;
+ while ( *end && *end!=')' ) end++;
+ if ( !*end ) error("Could not parse the expression: %s\n", str);
+
+ // subroutine arguments
+ rmme.l = 0;
+ kputsn(beg+1, end-beg-1, &rmme);
+ char **rmme_list = hts_readlist(rmme.s, 0, &margs);
+ for (i=0; i<margs; i++)
+ {
+ nargs++;
+ nout++;
+ hts_expand0(token_t, nout, mout, out);
+ filters_init1(filter, rmme_list[i], strlen(rmme_list[i]), &out[nout-1]);
+ free(rmme_list[i]);
+ }
+ free(rmme_list);
+ free(rmme.s);
+
+ nout++;
+ hts_expand0(token_t, nout, mout, out);
+ token_t *tok = &out[nout-1];
+ tok->tok_type = ret;
+ tok->nargs = nargs;
+ tok->hdr_id = -1;
+ tok->pass_site = -1;
+ tok->threshold = -1.0;
+
+ tmp = end + 1;
+ continue;
+ }
else
{
- while ( nops>0 && op_prec[ret] < op_prec[ops[nops-1]] )
+ while ( nops>0 && op_prec[ret] < op_prec[ops[nops-1].tok_type] )
{
nout++;
hts_expand0(token_t, nout, mout, out);
- out[nout-1].tok_type = ops[nops-1];
+ out[nout-1] = ops[nops-1];
+ memset(&ops[nops-1],0,sizeof(token_t));
nops--;
}
}
nops++;
- hts_expand(int, nops, mops, ops);
- ops[nops-1] = ret;
+ hts_expand0(token_t, nops, mops, ops);
+ ops[nops-1].tok_type = ret;
}
else if ( !len )
{
{
nout++;
hts_expand0(token_t, nout, mout, out);
- filters_init1(filter, tmp, len, &out[nout-1]);
+ if ( tmp[len-1]==',' )
+ filters_init1(filter, tmp, len-1, &out[nout-1]);
+ else
+ filters_init1(filter, tmp, len, &out[nout-1]);
tmp += len;
}
last_op = ret;
}
while ( nops>0 )
{
- if ( ops[nops-1]==TOK_LFT || ops[nops-1]==TOK_RGT ) error("Could not parse the expression: [%s]\n", filter->str);
+ if ( ops[nops-1].tok_type==TOK_LFT || ops[nops-1].tok_type==TOK_RGT ) error("Could not parse the expression: [%s]\n", filter->str);
nout++;
hts_expand0(token_t, nout, mout, out);
- out[nout-1].tok_type = ops[nops-1];
+ out[nout-1] = ops[nops-1];
+ memset(&ops[nops-1],0,sizeof(token_t));
nops--;
}
int i;
for (i=0; i<nout; i++)
{
+ if ( i+1<nout && (out[i].tok_type==TOK_LT || out[i].tok_type==TOK_BT) && out[i+1].tok_type==TOK_EQ )
+ error("Error parsing the expression: \"%s\"\n", filter->str);
+
if ( out[i].tok_type==TOK_OR || out[i].tok_type==TOK_OR_VEC )
out[i].func = vector_logic_or;
if ( out[i].tok_type==TOK_AND || out[i].tok_type==TOK_AND_VEC )
filter->nsamples = filter->max_unpack&BCF_UN_FMT ? bcf_hdr_nsamples(filter->hdr) : 0;
for (i=0; i<nout; i++)
{
- if ( out[i].tok_type==TOK_MAX ) { out[i].func = func_max; out[i].tok_type = TOK_FUNC; }
- else if ( out[i].tok_type==TOK_MIN ) { out[i].func = func_min; out[i].tok_type = TOK_FUNC; }
- else if ( out[i].tok_type==TOK_AVG ) { out[i].func = func_avg; out[i].tok_type = TOK_FUNC; }
- else if ( out[i].tok_type==TOK_SUM ) { out[i].func = func_sum; out[i].tok_type = TOK_FUNC; }
- else if ( out[i].tok_type==TOK_ABS ) { out[i].func = func_abs; out[i].tok_type = TOK_FUNC; }
- else if ( out[i].tok_type==TOK_CNT ) { out[i].func = func_count; out[i].tok_type = TOK_FUNC; }
- else if ( out[i].tok_type==TOK_LEN ) { out[i].func = func_strlen; out[i].tok_type = TOK_FUNC; }
+ if ( out[i].tok_type==TOK_MAX ) { out[i].func = func_max; out[i].tok_type = TOK_FUNC; out[i].tok_type = 1; }
+ else if ( out[i].tok_type==TOK_MIN ) { out[i].func = func_min; out[i].tok_type = TOK_FUNC; out[i].tok_type = 1; }
+ else if ( out[i].tok_type==TOK_AVG ) { out[i].func = func_avg; out[i].tok_type = TOK_FUNC; out[i].tok_type = 1; }
+ else if ( out[i].tok_type==TOK_SUM ) { out[i].func = func_sum; out[i].tok_type = TOK_FUNC; out[i].tok_type = 1; }
+ else if ( out[i].tok_type==TOK_ABS ) { out[i].func = func_abs; out[i].tok_type = TOK_FUNC; out[i].tok_type = 1; }
+ else if ( out[i].tok_type==TOK_CNT ) { out[i].func = func_count; out[i].tok_type = TOK_FUNC; out[i].tok_type = 1; }
+ else if ( out[i].tok_type==TOK_LEN ) { out[i].func = func_strlen; out[i].tok_type = TOK_FUNC; out[i].tok_type = 1; }
+ else if ( out[i].tok_type==TOK_BINOM ) { out[i].func = func_binom; out[i].tok_type = TOK_FUNC; }
+ else if ( out[i].tok_type==TOK_PERLSUB ) { out[i].func = perl_exec; out[i].tok_type = TOK_FUNC; }
hts_expand0(double,1,out[i].mvalues,out[i].values);
if ( filter->nsamples )
{
void filter_destroy(filter_t *filter)
{
+ perl_destroy(filter);
int i;
for (i=0; i<filter->nfilters; i++)
{
for (i=0; i<filter->nfilters; i++)
{
filter->filters[i].pass_site = 0;
-
if ( filter->filters[i].tok_type == TOK_VAL )
{
if ( filter->filters[i].setter ) // variable, query the VCF line
#include <strings.h>
#include <errno.h>
#include <math.h>
-#include <wordexp.h>
+#include <sys/types.h>
+#include <pwd.h>
#include <regex.h>
#include <htslib/khash_str2int.h>
-#include "filter.h"
-#include "bcftools.h"
#include <htslib/hts_defs.h>
#include <htslib/vcfutils.h>
+#include <htslib/kfunc.h>
+#include "config.h"
+#include "filter.h"
+#include "bcftools.h"
+
+#if ENABLE_PERL_FILTERS
+# define filter_t perl_filter_t
+# include <EXTERN.h>
+# include <perl.h>
+# undef filter_t
+# define my_perl perl
+
+static int filter_ninit = 0;
+#endif
+
#ifndef __FUNCTION__
# define __FUNCTION__ __func__
{
// read-only values, same for all VCF lines
int tok_type; // one of the TOK_* keys below
+ int nargs; // with TOK_PERLSUB the first argument is the name of the subroutine
char *key; // set only for string constants, otherwise NULL
char *tag; // for debugging and printout only, VCF tag name
double threshold; // filtering threshold
+ int is_constant; // the threshold is set
int hdr_id, type; // BCF header lookup ID and one of BCF_HT_* types
int idx; // 0-based index to VCF vectors,
// -2: list (e.g. [0,1,2] or [1..3] or [1..] or any field[*], which is equivalent to [0..])
float *tmpf;
kstring_t tmps;
int max_unpack, mtmpi, mtmpf, nsamples;
+#if ENABLE_PERL_FILTERS
+ PerlInterpreter *perl;
+#endif
};
#define TOK_LEN 24
#define TOK_FUNC 25
#define TOK_CNT 26
+#define TOK_PERLSUB 27
+#define TOK_BINOM 28
-// 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
-// ( ) [ < = > ] ! | & + - * / M m a A O ~ ^ S . l
-static int op_prec[] = {0,1,1,5,5,5,5,5,5,2,3, 6, 6, 7, 7, 8, 8, 8, 3, 2, 5, 5, 8, 8, 8, 8, 8};
-#define TOKEN_STRING "x()[<=>]!|&+-*/MmaAO~^f"
+// 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
+// ( ) [ < = > ] ! | & + - * / M m a A O ~ ^ S . l f c p
+static int op_prec[] = {0,1,1,5,5,5,5,5,5,2,3, 6, 6, 7, 7, 8, 8, 8, 3, 2, 5, 5, 8, 8, 8, 8, 8, 8};
+#define TOKEN_STRING "x()[<=>]!|&+-*/MmaAO~^S.lfcp"
+// Return negative values if it is a function with variable number of arguments
static int filters_next_token(char **str, int *len)
{
char *tmp = *str;
if ( !strncasecmp(tmp,"ABS(",4) ) { (*str) += 3; return TOK_ABS; }
if ( !strncasecmp(tmp,"COUNT(",4) ) { (*str) += 5; return TOK_CNT; }
if ( !strncasecmp(tmp,"STRLEN(",7) ) { (*str) += 6; return TOK_LEN; }
+ if ( !strncasecmp(tmp,"BINOM(",6) ) { (*str) += 5; return -TOK_BINOM; }
if ( !strncasecmp(tmp,"%MAX(",5) ) { (*str) += 4; return TOK_MAX; } // for backward compatibility
if ( !strncasecmp(tmp,"%MIN(",5) ) { (*str) += 4; return TOK_MIN; } // for backward compatibility
if ( !strncasecmp(tmp,"%AVG(",5) ) { (*str) += 4; return TOK_AVG; } // for backward compatibility
if ( !strncasecmp(tmp,"INFO/",5) ) tmp += 5;
if ( !strncasecmp(tmp,"FORMAT/",7) ) tmp += 7;
if ( !strncasecmp(tmp,"FMT/",4) ) tmp += 4;
+ if ( !strncasecmp(tmp,"PERL.",5) ) { (*str) += 5; return -TOK_PERLSUB; }
+ if ( !strncasecmp(tmp,"N_PASS(",7) ) { *len = 6; (*str) += 6; return -TOK_FUNC; }
+ if ( !strncasecmp(tmp,"F_PASS(",7) ) { *len = 6; (*str) += 6; return -TOK_FUNC; }
if ( tmp[0]=='@' ) // file name
{
int square_brackets = 0;
while ( tmp[0] )
{
- if ( tmp[0]=='"' ) break;
- if ( tmp[0]=='\'' ) break;
- if ( isspace(tmp[0]) ) break;
- if ( tmp[0]=='<' ) break;
- if ( tmp[0]=='>' ) break;
- if ( tmp[0]=='=' ) break;
- if ( tmp[0]=='!' ) break;
- if ( tmp[0]=='&' ) break;
- if ( tmp[0]=='|' ) break;
- if ( tmp[0]=='(' ) break;
- if ( tmp[0]==')' ) break;
- if ( tmp[0]=='+' ) break;
- if ( tmp[0]=='*' && !square_brackets ) break;
- if ( tmp[0]=='-' && !square_brackets ) break;
- if ( tmp[0]=='/' ) break;
- if ( tmp[0]=='~' ) break;
+ if ( !square_brackets )
+ {
+ if ( tmp[0]=='"' ) break;
+ if ( tmp[0]=='\'' ) break;
+ if ( isspace(tmp[0]) ) break;
+ if ( tmp[0]=='<' ) break;
+ if ( tmp[0]=='>' ) break;
+ if ( tmp[0]=='=' ) break;
+ if ( tmp[0]=='!' ) break;
+ if ( tmp[0]=='&' ) break;
+ if ( tmp[0]=='|' ) break;
+ if ( tmp[0]=='(' ) break;
+ if ( tmp[0]==')' ) break;
+ if ( tmp[0]=='+' ) break;
+ if ( tmp[0]=='*' ) break;
+ if ( tmp[0]=='-' ) break;
+ if ( tmp[0]=='/' ) break;
+ if ( tmp[0]=='~' ) break;
+ }
if ( tmp[0]==']' ) { if (square_brackets) tmp++; break; }
if ( tmp[0]=='[' ) square_brackets++;
tmp++;
return TOK_VAL;
}
+
+/*
+ Simple path expansion, expands ~/, ~user, $var. The result must be freed by the caller.
+
+ Based on jkb's staden code with some adjustments.
+ https://sourceforge.net/p/staden/code/HEAD/tree/staden/trunk/src/Misc/getfile.c#l123
+*/
+char *expand_path(char *path)
+{
+#ifdef _WIN32
+ return strdup(path); // windows expansion: todo
+#endif
+
+ kstring_t str = {0,0,0};
+
+ if ( path[0] == '~' )
+ {
+ if ( !path[1] || path[1] == '/' )
+ {
+ // ~ or ~/path
+ kputs(getenv("HOME"), &str);
+ if ( path[1] ) kputs(path+1, &str);
+ }
+ else
+ {
+ // user name: ~pd3/path
+ char *end = path;
+ while ( *end && *end!='/' ) end++;
+ kputsn(path+1, end-path-1, &str);
+ struct passwd *pwentry = getpwnam(str.s);
+ str.l = 0;
+
+ if ( !pwentry ) kputsn(path, end-path, &str);
+ else kputs(pwentry->pw_dir, &str);
+ kputs(end, &str);
+ }
+ return str.s;
+ }
+ if ( path[0] == '$' )
+ {
+ char *var = getenv(path+1);
+ if ( var ) path = var;
+ }
+ return strdup(path);
+}
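The expansion rules of expand_path() above can be illustrated with a minimal, standalone sketch. It is not the patch's code: `sketch_expand_path` is a hypothetical name, it uses plain malloc instead of kstring, it skips the `~user` lookup via getpwnam, and it treats an unset HOME as an empty string (that last choice is an assumption of the sketch). As in the patch, `$VAR` is expanded only when the whole path is a variable name.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of bcftools: expands "~" and "~/path" via
 * $HOME, and a whole-string "$VAR" via getenv(). The "~user" form handled
 * by the patch is left out. The caller frees the returned string. */
static char *sketch_expand_path(const char *path)
{
    assert(path != NULL);

    if (path[0] == '~' && (path[1] == '\0' || path[1] == '/')) {
        const char *home = getenv("HOME");
        if (!home) home = "";   /* assumption of this sketch: tolerate unset HOME */
        char *out = malloc(strlen(home) + strlen(path + 1) + 1);
        if (!out) return NULL;
        strcpy(out, home);
        strcat(out, path + 1);  /* appends "" or "/path" */
        return out;
    }

    if (path[0] == '$') {
        /* only a path consisting solely of "$NAME" is replaced */
        const char *var = getenv(path + 1);
        if (var) path = var;
    }

    char *out = malloc(strlen(path) + 1);
    if (out) strcpy(out, path);
    return out;
}
```

A path without a leading `~` or `$` is returned as an unchanged copy, so callers can free the result unconditionally.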
+
static void filters_set_qual(filter_t *flt, bcf1_t *line, token_t *tok)
{
float *ptr = &line->qual;
kputs(line->d.allele[tok->idx + 1], &tok->str_value);
else
kputc('.', &tok->str_value);
- tok->idx = 0;
}
else if ( tok->idx==-2 )
{
tok->nvalues = 1;
tok->values[0] = tok->tag[0]=='N' ? nmissing : (double)nmissing / line->n_sample;
}
+static int func_npass(filter_t *flt, bcf1_t *line, token_t *rtok, token_t **stack, int nstack)
+{
+ if ( nstack==0 ) error("Error parsing the expression\n");
+ token_t *tok = stack[nstack - 1];
+ if ( !tok->nsamples ) error("The function %s works with FORMAT fields\n", rtok->tag);
+
+ rtok->nsamples = tok->nsamples;
+ memcpy(rtok->pass_samples, tok->pass_samples, rtok->nsamples*sizeof(*rtok->pass_samples));
+
+ assert(tok->usmpl);
+ if ( !rtok->usmpl )
+ {
+ rtok->usmpl = (uint8_t*) malloc(tok->nsamples*sizeof(*rtok->usmpl));
+ memcpy(rtok->usmpl, tok->usmpl, tok->nsamples*sizeof(*rtok->usmpl));
+ }
+
+ int i, npass = 0;
+ for (i=0; i<rtok->nsamples; i++)
+ {
+ if ( !rtok->usmpl[i] ) continue;
+ if ( rtok->pass_samples[i] ) npass++;
+ }
+
+ assert( rtok->values );
+ rtok->nvalues = 1;
+ rtok->values[0] = rtok->tag[0]=='N' ? npass : (line->n_sample ? 1.0*npass/line->n_sample : 0);
+ rtok->nsamples = 0;
+
+ return 1;
+}
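The counting done by func_npass() above can be sketched in isolation. The function names and the flat `uint8_t` arrays below are hypothetical stand-ins for `usmpl` and `pass_samples`. The sketch also assumes the number of samples in the token equals the number of samples on the line, which is what the F_PASS denominator (`line->n_sample`) uses in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* N_PASS: count samples that are both selected (usmpl) and passing (pass). */
static int sketch_n_pass(const uint8_t *usmpl, const uint8_t *pass, size_t nsmpl)
{
    assert(usmpl != NULL && pass != NULL);
    int npass = 0;
    for (size_t i = 0; i < nsmpl; i++)
        if (usmpl[i] && pass[i]) npass++;
    return npass;
}

/* F_PASS: the same count divided by the total number of samples, including
 * samples that were not selected; 0 when there are no samples at all. */
static double sketch_f_pass(const uint8_t *usmpl, const uint8_t *pass, size_t nsmpl)
{
    if (nsmpl == 0) return 0.0;
    return (double) sketch_n_pass(usmpl, pass, nsmpl) / (double) nsmpl;
}
```

Note that the fraction is taken over all samples rather than over the selected subset, so restricting the sample list lowers F_PASS even if every selected sample passes.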
static void filters_set_nalt(filter_t *flt, bcf1_t *line, token_t *tok)
{
tok->nvalues = 1;
}
else
{
- rtok->values[0] = strlen(tok->str_value.s);
+ if ( !tok->str_value.s[1] && tok->str_value.s[0]=='.' )
+ rtok->values[0] = 0;
+ else
+ rtok->values[0] = strlen(tok->str_value.s);
rtok->nvalues = 1;
}
return 1;
}
+static inline double calc_binom(int na, int nb)
+{
+ if ( na==0 && nb==0 ) return -1;
+ if ( na==nb ) return 1;
+
+ // kfunc.h implements kf_betai, which is the regularized incomplete beta function P(X<=k/N;p) = I_{1-p}(N-k,k+1)
+
+ double pval = na < nb ? kf_betai(nb, na + 1, 0.5) : kf_betai(na, nb + 1, 0.5);
+ pval *= 2;
+ assert( pval-1 < 1e-10 );
+ if ( pval>1 ) pval = 1; // this can happen, machine precision error, eg. kf_betai(1,0,0.5)
+
+ return pval;
+}
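For intuition, the value returned by calc_binom() above is a two-sided binomial p-value with p = 0.5: twice the probability of seeing at least max(na,nb) successes in na+nb trials, capped at 1. The sketch below computes the same quantity by direct summation instead of kf_betai. `sketch_binom_two_sided` is a hypothetical name, and the approach is only meant for small counts (a few hundred reads at most), because it accumulates raw binomial coefficients in doubles.

```c
#include <assert.h>

/* Hypothetical stand-in for calc_binom(): exact two-sided binomial test
 * with p = 0.5. Returns -1 when both counts are zero, mirroring the patch. */
static double sketch_binom_two_sided(int na, int nb)
{
    assert(na >= 0 && nb >= 0);
    if (na == 0 && nb == 0) return -1;
    if (na == nb) return 1;

    int n  = na + nb;
    int hi = na > nb ? na : nb;

    /* coef walks through C(n,0), C(n,1), ..., C(n,n); total sums to 2^n */
    double coef = 1.0, total = 0.0, tail = 0.0;
    for (int k = 0; k <= n; k++) {
        total += coef;
        if (k >= hi) tail += coef;      /* upper tail: P(X >= hi) * 2^n */
        coef = coef * (n - k) / (k + 1);
    }

    double pval = 2.0 * tail / total;   /* double the one-sided tail */
    return pval > 1.0 ? 1.0 : pval;
}
```

For example, counts of 0 and 4 give 2 * (1/16) = 0.125, and counts of 1 and 3 give 2 * (5/16) = 0.625.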
+static int func_binom(filter_t *flt, bcf1_t *line, token_t *rtok, token_t **stack, int nstack)
+{
+ int i, istack = nstack - rtok->nargs;
+
+ if ( rtok->nargs!=2 && rtok->nargs!=1 ) error("Error: binom() takes one or two arguments\n");
+ assert( istack>=0 );
+
+ // The expected mean is 0.5. Should we support also prob!=0.5?
+ //
+ // double prob = 0.5;
+ // if ( nstack==3 )
+ // {
+ // // three parameters, the first one must be a scalar: binom(0.25,AD[0],AD[1])
+ // if ( !stack[istack]->is_constant )
+ // error("The first argument of binom() must be a numeric constant if three parameters are given\n");
+ // prob = stack[istack]->threshold;
+ // istack++;
+ // }
+ // else if ( nstack==2 && stack[istack]->is_constant )
+ // {
+ // // two parameters, the first can be a scalar: binom(0.25,AD) or binom(AD[0],AD[1])
+ // prob = stack[istack]->threshold;
+ // istack++;
+ // }
+
+ token_t *tok = stack[istack];
+ if ( tok->nsamples )
+ {
+ // working with a FORMAT tag
+ rtok->nval1 = 1;
+ rtok->nvalues = tok->nsamples;
+ rtok->nsamples = tok->nsamples;
+ hts_expand(double, rtok->nvalues, rtok->mvalues, rtok->values);
+ assert(tok->usmpl);
+ if ( !rtok->usmpl ) rtok->usmpl = (uint8_t*) malloc(tok->nsamples);
+ memcpy(rtok->usmpl, tok->usmpl, tok->nsamples);
+
+ if ( istack+1==nstack )
+ {
+ // determine the index from the GT field: binom(AD)
+ int ngt = bcf_get_genotypes(flt->hdr, line, &flt->tmpi, &flt->mtmpi);
+ int max_ploidy = ngt/line->n_sample;
+ if ( ngt <= 0 || max_ploidy < 2 ) // GT not present or not diploid, cannot set
+ {
+ for (i=0; i<rtok->nsamples; i++)
+ if ( rtok->usmpl[i] ) bcf_double_set_missing(rtok->values[i]);
+ return rtok->nargs;
+ }
+ for (i=0; i<rtok->nsamples; i++)
+ {
+ if ( !rtok->usmpl[i] ) continue;
+ int32_t *ptr = flt->tmpi + i*max_ploidy;
+ if ( bcf_gt_is_missing(ptr[0]) || bcf_gt_is_missing(ptr[1]) || ptr[1]==bcf_int32_vector_end )
+ {
+ bcf_double_set_missing(rtok->values[i]);
+ continue;
+ }
+ int idx1 = bcf_gt_allele(ptr[0]);
+ int idx2 = bcf_gt_allele(ptr[1]);
+ if ( idx1>=line->n_allele ) error("Incorrect allele index at %s:%d, sample %s\n", bcf_seqname(flt->hdr,line),line->pos+1,flt->hdr->samples[i]);
+ if ( idx2>=line->n_allele ) error("Incorrect allele index at %s:%d, sample %s\n", bcf_seqname(flt->hdr,line),line->pos+1,flt->hdr->samples[i]);
+ double *vals = tok->values + tok->nval1*i;
+ if ( bcf_double_is_missing(vals[idx1]) || bcf_double_is_missing(vals[idx2]) )
+ {
+ bcf_double_set_missing(rtok->values[i]);
+ continue;
+ }
+ rtok->values[i] = calc_binom(vals[idx1],vals[idx2]);
+ if ( rtok->values[i] < 0 )
+ {
+ bcf_double_set_missing(rtok->values[i]);
+ continue;
+ }
+ }
+ }
+ else
+ {
+ // the fields given explicitly: binom(AD[:0],AD[:1])
+ token_t *tok2 = stack[istack+1];
+ if ( tok->nval1!=1 || tok2->nval1!=1 )
+ error("Expected one value per binom() argument, found %d and %d at %s:%d\n",tok->nval1,tok2->nval1, bcf_seqname(flt->hdr,line),line->pos+1);
+ for (i=0; i<rtok->nsamples; i++)
+ {
+ if ( !rtok->usmpl[i] ) continue;
+ double *ptr1 = tok->values + tok->nval1*i;
+ double *ptr2 = tok2->values + tok2->nval1*i;
+ if ( bcf_double_is_missing(ptr1[0]) || bcf_double_is_missing(ptr2[0]) )
+ {
+ bcf_double_set_missing(rtok->values[i]);
+ continue;
+ }
+ rtok->values[i] = calc_binom(ptr1[0],ptr2[0]);
+ if ( rtok->values[i] < 0 )
+ {
+ bcf_double_set_missing(rtok->values[i]);
+ continue;
+ }
+ }
+ }
+ }
+ else
+ {
+ // working with an INFO tag
+ rtok->nvalues = 1;
+ hts_expand(double, rtok->nvalues, rtok->mvalues, rtok->values);
+
+ double *ptr1 = NULL, *ptr2 = NULL;
+ if ( istack+1==nstack )
+ {
+ // only one tag, expecting two values: binom(INFO/AD)
+ if ( tok->nvalues==2 )
+ {
+ ptr1 = &tok->values[0];
+ ptr2 = &tok->values[1];
+ }
+ }
+ else
+ {
+ // two tags, expecting one value in each: binom(INFO/AD[0],INFO/AD[1])
+ token_t *tok2 = stack[istack+1];
+ if ( tok->nvalues==1 && tok2->nvalues==1 )
+ {
+ ptr1 = &tok->values[0];
+ ptr2 = &tok2->values[0];
+ }
+ }
+ if ( !ptr1 || !ptr2 || bcf_double_is_missing(ptr1[0]) || bcf_double_is_missing(ptr2[0]) )
+ bcf_double_set_missing(rtok->values[0]);
+ else
+ {
+ rtok->values[0] = calc_binom(ptr1[0],ptr2[0]);
+ if ( rtok->values[0] < 0 )
+ bcf_double_set_missing(rtok->values[0]);
+ }
+ }
+ return rtok->nargs;
+}
inline static void tok_init_values(token_t *atok, token_t *btok, token_t *rtok)
{
token_t *tok = atok->nvalues > btok->nvalues ? atok : btok;
rtok->nvalues = tok->nvalues;
rtok->nval1 = tok->nval1;
- hts_expand(double*, rtok->nvalues, rtok->mvalues, rtok->values);
+ hts_expand(double, rtok->nvalues, rtok->mvalues, rtok->values);
}
inline static void tok_init_samples(token_t *atok, token_t *btok, token_t *rtok)
{
static int vector_logic_or(filter_t *filter, bcf1_t *line, token_t *rtok, token_t **stack, int nstack)
{
+ if ( nstack < 2 ) error("Error occurred while processing the filter \"%s\"\n", filter->str);
+
token_t *atok = stack[nstack-2];
token_t *btok = stack[nstack-1];
tok_init_samples(atok, btok, rtok);
}
static int vector_logic_and(filter_t *filter, bcf1_t *line, token_t *rtok, token_t **stack, int nstack)
{
+ if ( nstack < 2 ) error("Error occurred while processing the filter \"%s\". (nstack=%d)\n", filter->str,nstack);
+
token_t *atok = stack[nstack-2];
token_t *btok = stack[nstack-1];
tok_init_samples(atok, btok, rtok);
int *idxs2 = NULL, nidxs2 = 0, idx2 = 0;
int set_samples = 0;
- char *colon = index(tag_idx, ':');
- if ( colon )
+ char *colon = rindex(tag_idx, ':');
+ if ( tag_idx[0]=='@' ) // file list with sample names
+ {
+ if ( !is_fmt ) error("Could not parse \"%s\". (Not a FORMAT tag yet a sample list provided.)\n", ori);
+ char *fname = expand_path(tag_idx+1);
+ int nsmpl;
+ char **list = hts_readlist(fname, 1, &nsmpl);
+ if ( !list && colon )
+ {
+ if ( parse_idxs(colon+1, &idxs2, &nidxs2, &idx2) != 0 ) error("Could not parse the index: %s\n", ori);
+ tok->idxs = idxs2;
+ tok->nidxs = nidxs2;
+ tok->idx = idx2;
+ colon = rindex(fname, ':');
+ *colon = 0;
+ list = hts_readlist(fname, 1, &nsmpl);
+ }
+ if ( !list ) error("Could not read: %s\n", fname);
+ free(fname);
+ tok->nsamples = bcf_hdr_nsamples(hdr);
+ tok->usmpl = (uint8_t*) calloc(tok->nsamples,1);
+ for (i=0; i<nsmpl; i++)
+ {
+ int ismpl = bcf_hdr_id2int(hdr,BCF_DT_SAMPLE,list[i]);
+ if ( ismpl<0 ) error("No such sample in the VCF: \"%s\"\n", list[i]);
+ free(list[i]);
+ tok->usmpl[ismpl] = 1;
+ }
+ free(list);
+ if ( !colon )
+ {
+ tok->idxs = (int*) malloc(sizeof(int));
+ tok->idxs[0] = -1;
+ tok->nidxs = 1;
+ tok->idx = -2;
+ }
+ }
+ else if ( colon )
{
+ if ( !is_fmt ) error("Could not parse the index \"%s\". (Not a FORMAT tag yet sample index implied.)\n", ori);
*colon = 0;
if ( parse_idxs(tag_idx, &idxs1, &nidxs1, &idx1) != 0 ) error("Could not parse the index: %s\n", ori);
if ( parse_idxs(colon+1, &idxs2, &nidxs2, &idx2) != 0 ) error("Could not parse the index: %s\n", ori);
for (i=0; i<tok->nidxs; i++) if ( tok->idxs[i] ) tok->nuidxs++;
}
}
+static int max_ac_an_unpack(bcf_hdr_t *hdr)
+{
+ int hdr_id = bcf_hdr_id2int(hdr,BCF_DT_ID,"AC");
+ if ( hdr_id<0 ) return BCF_UN_FMT;
+ if ( !bcf_hdr_idinfo_exists(hdr,BCF_HL_INFO,hdr_id) ) return BCF_UN_FMT;
+
+ hdr_id = bcf_hdr_id2int(hdr,BCF_DT_ID,"AN");
+ if ( hdr_id<0 ) return BCF_UN_FMT;
+ if ( !bcf_hdr_idinfo_exists(hdr,BCF_HL_INFO,hdr_id) ) return BCF_UN_FMT;
+
+ return BCF_UN_INFO;
+}
static int filters_init1(filter_t *filter, char *str, int len, token_t *tok)
{
tok->tok_type = TOK_VAL;
tok->tag = (char*) calloc(len+1,sizeof(char));
memcpy(tok->tag,str,len);
tok->tag[len] = 0;
- wordexp_t wexp;
- wordexp(tok->tag+1, &wexp, 0);
- if ( !wexp.we_wordc ) error("No such file: %s\n", tok->tag+1);
+ char *fname = expand_path(tok->tag+1);
int i, n;
- char **list = hts_readlist(wexp.we_wordv[0], 1, &n);
- if ( !list ) error("Could not read: %s\n", wexp.we_wordv[0]);
- wordfree(&wexp);
+ char **list = hts_readlist(fname, 1, &n);
+ if ( !list ) error("Could not read: %s\n", fname);
+ free(fname);
tok->hash = khash_str2int_init();
for (i=0; i<n; i++)
{
{
if ( tok->hdr_id >=0 )
{
- if ( bcf_hdr_idinfo_exists(filter->hdr,BCF_HL_INFO,tok->hdr_id) ) is_fmt = 0;
- else if ( bcf_hdr_idinfo_exists(filter->hdr,BCF_HL_FMT,tok->hdr_id) ) is_fmt = 1;
+ int is_info = bcf_hdr_idinfo_exists(filter->hdr,BCF_HL_INFO,tok->hdr_id) ? 1 : 0;
+ is_fmt = bcf_hdr_idinfo_exists(filter->hdr,BCF_HL_FMT,tok->hdr_id) ? 1 : 0;
+ if ( is_info && is_fmt ) error("Both INFO/%s and FORMAT/%s exist, which one do you want?\n", tmp.s,tmp.s);
}
if ( is_fmt==-1 ) is_fmt = 0;
}
}
else if ( !strcasecmp(tmp.s,"AN") )
{
+ filter->max_unpack |= BCF_UN_FMT;
tok->setter = &filters_set_an;
tok->tag = strdup("AN");
free(tmp.s);
}
else if ( !strcasecmp(tmp.s,"AC") )
{
+ filter->max_unpack |= BCF_UN_FMT;
tok->setter = &filters_set_ac;
tok->tag = strdup("AC");
free(tmp.s);
}
else if ( !strcasecmp(tmp.s,"MAC") )
{
+ filter->max_unpack |= max_ac_an_unpack(filter->hdr);
tok->setter = &filters_set_mac;
tok->tag = strdup("MAC");
free(tmp.s);
}
else if ( !strcasecmp(tmp.s,"AF") )
{
+ filter->max_unpack |= max_ac_an_unpack(filter->hdr);
tok->setter = &filters_set_af;
tok->tag = strdup("AF");
free(tmp.s);
}
else if ( !strcasecmp(tmp.s,"MAF") )
{
+ filter->max_unpack |= max_ac_an_unpack(filter->hdr);
tok->setter = &filters_set_maf;
tok->tag = strdup("MAF");
free(tmp.s);
tok->threshold = strtof(tmp.s, &end); // float?
if ( errno!=0 || end!=tmp.s+len ) error("[%s:%d %s] Error: the tag \"%s\" is not defined in the VCF header\n", __FILE__,__LINE__,__FUNCTION__,tmp.s);
}
+ tok->is_constant = 1;
if ( tmp.s ) free(tmp.s);
return 0;
{
while ( *str ) { *str = tolower(*str); str++; }
}
+static int perl_exec(filter_t *flt, bcf1_t *line, token_t *rtok, token_t **stack, int nstack)
+{
+#if ENABLE_PERL_FILTERS
+
+ PerlInterpreter *perl = flt->perl;
+ if ( !perl ) error("Error: perl expression without a perl script name\n");
+
+ dSP;
+ ENTER;
+ SAVETMPS;
+
+ PUSHMARK(SP);
+ int i,j, istack = nstack - rtok->nargs;
+ for (i=istack+1; i<nstack; i++)
+ {
+ token_t *tok = stack[i];
+ if ( tok->is_str )
+ XPUSHs(sv_2mortal(newSVpvn(tok->str_value.s,tok->str_value.l)));
+ else if ( tok->nvalues==1 )
+ XPUSHs(sv_2mortal(newSVnv(tok->values[0])));
+ else if ( tok->nvalues>1 )
+ {
+ AV *av = newAV();
+ for (j=0; j<tok->nvalues; j++) av_push(av, newSVnv(tok->values[j]));
+ SV *rv = newRV_inc((SV*)av);
+ XPUSHs(rv);
+ }
+ else
+ {
+ bcf_double_set_missing(tok->values[0]);
+ XPUSHs(sv_2mortal(newSVnv(tok->values[0])));
+ }
+ }
+ PUTBACK;
+
+ // A possible future todo: provide a means to select samples and indexes,
+ // expressions like this don't work yet
+ // perl.filter(FMT/AD)[1:0]
+
+ int nret = call_pv(stack[istack]->str_value.s, G_ARRAY);
+
+ SPAGAIN;
+
+ rtok->nvalues = nret;
+ hts_expand(double, rtok->nvalues, rtok->mvalues, rtok->values);
+ for (i=nret; i>0; i--)
+ {
+ rtok->values[i-1] = (double) POPn;
+ if ( isnan(rtok->values[i-1]) ) bcf_double_set_missing(rtok->values[i-1]);
+ }
+
+ PUTBACK;
+ FREETMPS;
+ LEAVE;
+
+#else
+ error("\nPerl filtering requires running `configure --enable-perl-filters` at compile time.\n\n");
+#endif
+ return rtok->nargs;
+}
+static void perl_init(filter_t *filter, char **str)
+{
+ char *beg = *str;
+ while ( *beg && isspace(*beg) ) beg++;
+ if ( !*beg ) return;
+ if ( strncasecmp("perl:", beg, 5) ) return;
+#if ENABLE_PERL_FILTERS
+ beg += 5;
+ char *end = beg;
+ while ( *end && *end!=';' ) end++; // for now not escaping semicolons
+ *str = end+1;
+
+ if ( ++filter_ninit == 1 )
+ {
+ // must be executed only once, even for multiple filters; first time here
+ int argc = 0;
+ char **argv = NULL;
+ char **env = NULL;
+ PERL_SYS_INIT3(&argc, &argv, &env);
+ }
+
+ filter->perl = perl_alloc();
+ PerlInterpreter *perl = filter->perl;
+
+ if ( !perl ) error("perl_alloc failed\n");
+ perl_construct(perl);
+
+ // name of the perl script to run
+ char *rmme = (char*) calloc(end - beg + 1,1);
+ memcpy(rmme, beg, end - beg);
+ char *argv[] = { "", "" };
+ argv[1] = expand_path(rmme);
+ free(rmme);
+
+ PL_origalen = 1; // don't allow $0 change
+ int ret = perl_parse(filter->perl, NULL, 2, argv, NULL);
+ PL_exit_flags |= PERL_EXIT_DESTRUCT_END;
+ if ( ret ) error("Failed to parse: %s\n", argv[1]);
+ free(argv[1]);
+
+ perl_run(perl);
+#else
+ error("\nPerl filtering requires running `configure --enable-perl-filters` at compile time.\n\n");
+#endif
+}
+static void perl_destroy(filter_t *filter)
+{
+#if ENABLE_PERL_FILTERS
+ if ( !filter->perl ) return;
+
+ PerlInterpreter *perl = filter->perl;
+ perl_destruct(perl);
+ perl_free(perl);
+ if ( --filter_ninit <= 0 )
+ {
+ // similarly to PERL_SYS_INIT3, must be executed only once? todo: test
+ PERL_SYS_TERM();
+ }
+#endif
+}
// Parse filter expression and convert to reverse polish notation. Dijkstra's shunting-yard algorithm
filter->hdr = hdr;
filter->max_unpack |= BCF_UN_STR;
- int nops = 0, mops = 0, *ops = NULL; // operators stack
- int nout = 0, mout = 0; // filter tokens, RPN
+ int nops = 0, mops = 0; // operators stack
+ int nout = 0, mout = 0; // filter tokens, RPN
token_t *out = NULL;
+ token_t *ops = NULL;
char *tmp = filter->str;
+ perl_init(filter, &tmp);
+
int last_op = -1;
while ( *tmp )
{
if ( ret==-1 ) error("Missing quotes in: %s\n", str);
// fprintf(bcftools_stderr,"token=[%c] .. [%s] %d\n", TOKEN_STRING[ret], tmp, len);
- // int i; for (i=0; i<nops; i++) fprintf(bcftools_stderr," .%c.", TOKEN_STRING[ops[i]]); fprintf(bcftools_stderr,"\n");
+ // int i; for (i=0; i<nops; i++) fprintf(bcftools_stderr," .%c", TOKEN_STRING[ops[i]]); fprintf(bcftools_stderr,"\n");
if ( ret==TOK_LFT ) // left bracket
{
nops++;
- hts_expand(int, nops, mops, ops);
- ops[nops-1] = ret;
+ hts_expand0(token_t, nops, mops, ops);
+ ops[nops-1].tok_type = ret;
}
else if ( ret==TOK_RGT ) // right bracket
{
- while ( nops>0 && ops[nops-1]!=TOK_LFT )
+ while ( nops>0 && ops[nops-1].tok_type!=TOK_LFT )
{
nout++;
hts_expand0(token_t, nout, mout, out);
- out[nout-1].tok_type = ops[nops-1];
+ out[nout-1] = ops[nops-1];
+ memset(&ops[nops-1],0,sizeof(token_t));
nops--;
}
if ( nops<=0 ) error("Could not parse: %s\n", str);
+ memset(&ops[nops-1],0,sizeof(token_t));
nops--;
}
else if ( ret!=TOK_VAL ) // one of the operators
tok->threshold = -1.0;
ret = TOK_MULT;
}
+ else if ( ret == -TOK_FUNC )
+ {
+ // this is different from TOK_PERLSUB,TOK_BINOM in that the expression inside the
+ // brackets gets evaluated as normal expression
+ nops++;
+ hts_expand0(token_t, nops, mops, ops);
+ token_t *tok = &ops[nops-1];
+ tok->tok_type = -ret;
+ tok->hdr_id = -1;
+ tok->pass_site = -1;
+ tok->threshold = -1.0;
+ if ( !strncasecmp(tmp-len,"N_PASS",6) ) { tok->func = func_npass; tok->tag = strdup("N_PASS"); }
+ else if ( !strncasecmp(tmp-len,"F_PASS",6) ) { tok->func = func_npass; tok->tag = strdup("F_PASS"); }
+ else error("The function \"%s\" is not supported\n", tmp-len);
+ continue;
+ }
+ else if ( ret < 0 ) // variable number of arguments: TOK_PERLSUB,TOK_BINOM
+ {
+ ret = -ret;
+
+ tmp += len;
+ char *beg = tmp;
+ kstring_t rmme = {0,0,0};
+ int i, margs, nargs = 0;
+
+ if ( ret == TOK_PERLSUB )
+ {
+ while ( *beg && ((isalnum(*beg) && !ispunct(*beg)) || *beg=='_') ) beg++;
+ if ( *beg!='(' ) error("Could not parse the expression: %s\n", str);
+
+ // the subroutine name
+ kputc('"', &rmme);
+ kputsn(tmp, beg-tmp, &rmme);
+ kputc('"', &rmme);
+ nout++;
+ hts_expand0(token_t, nout, mout, out);
+ filters_init1(filter, rmme.s, rmme.l, &out[nout-1]);
+ nargs++;
+ }
+ char *end = beg;
+ while ( *end && *end!=')' ) end++;
+ if ( !*end ) error("Could not parse the expression: %s\n", str);
+
+ // subroutine arguments
+ rmme.l = 0;
+ kputsn(beg+1, end-beg-1, &rmme);
+ char **rmme_list = hts_readlist(rmme.s, 0, &margs);
+ for (i=0; i<margs; i++)
+ {
+ nargs++;
+ nout++;
+ hts_expand0(token_t, nout, mout, out);
+ filters_init1(filter, rmme_list[i], strlen(rmme_list[i]), &out[nout-1]);
+ free(rmme_list[i]);
+ }
+ free(rmme_list);
+ free(rmme.s);
+
+ nout++;
+ hts_expand0(token_t, nout, mout, out);
+ token_t *tok = &out[nout-1];
+ tok->tok_type = ret;
+ tok->nargs = nargs;
+ tok->hdr_id = -1;
+ tok->pass_site = -1;
+ tok->threshold = -1.0;
+
+ tmp = end + 1;
+ continue;
+ }
else
{
- while ( nops>0 && op_prec[ret] < op_prec[ops[nops-1]] )
+ while ( nops>0 && op_prec[ret] < op_prec[ops[nops-1].tok_type] )
{
nout++;
hts_expand0(token_t, nout, mout, out);
- out[nout-1].tok_type = ops[nops-1];
+ out[nout-1] = ops[nops-1];
+ memset(&ops[nops-1],0,sizeof(token_t));
nops--;
}
}
nops++;
- hts_expand(int, nops, mops, ops);
- ops[nops-1] = ret;
+ hts_expand0(token_t, nops, mops, ops);
+ ops[nops-1].tok_type = ret;
}
else if ( !len )
{
{
nout++;
hts_expand0(token_t, nout, mout, out);
- filters_init1(filter, tmp, len, &out[nout-1]);
+ if ( tmp[len-1]==',' )
+ filters_init1(filter, tmp, len-1, &out[nout-1]);
+ else
+ filters_init1(filter, tmp, len, &out[nout-1]);
tmp += len;
}
last_op = ret;
}
while ( nops>0 )
{
- if ( ops[nops-1]==TOK_LFT || ops[nops-1]==TOK_RGT ) error("Could not parse the expression: [%s]\n", filter->str);
+ if ( ops[nops-1].tok_type==TOK_LFT || ops[nops-1].tok_type==TOK_RGT ) error("Could not parse the expression: [%s]\n", filter->str);
nout++;
hts_expand0(token_t, nout, mout, out);
- out[nout-1].tok_type = ops[nops-1];
+ out[nout-1] = ops[nops-1];
+ memset(&ops[nops-1],0,sizeof(token_t));
nops--;
}
int i;
for (i=0; i<nout; i++)
{
+ if ( i+1<nout && (out[i].tok_type==TOK_LT || out[i].tok_type==TOK_BT) && out[i+1].tok_type==TOK_EQ )
+ error("Error parsing the expression: \"%s\"\n", filter->str);
+
if ( out[i].tok_type==TOK_OR || out[i].tok_type==TOK_OR_VEC )
out[i].func = vector_logic_or;
if ( out[i].tok_type==TOK_AND || out[i].tok_type==TOK_AND_VEC )
filter->nsamples = filter->max_unpack&BCF_UN_FMT ? bcf_hdr_nsamples(filter->hdr) : 0;
for (i=0; i<nout; i++)
{
- if ( out[i].tok_type==TOK_MAX ) { out[i].func = func_max; out[i].tok_type = TOK_FUNC; }
- else if ( out[i].tok_type==TOK_MIN ) { out[i].func = func_min; out[i].tok_type = TOK_FUNC; }
- else if ( out[i].tok_type==TOK_AVG ) { out[i].func = func_avg; out[i].tok_type = TOK_FUNC; }
- else if ( out[i].tok_type==TOK_SUM ) { out[i].func = func_sum; out[i].tok_type = TOK_FUNC; }
- else if ( out[i].tok_type==TOK_ABS ) { out[i].func = func_abs; out[i].tok_type = TOK_FUNC; }
- else if ( out[i].tok_type==TOK_CNT ) { out[i].func = func_count; out[i].tok_type = TOK_FUNC; }
- else if ( out[i].tok_type==TOK_LEN ) { out[i].func = func_strlen; out[i].tok_type = TOK_FUNC; }
+ if ( out[i].tok_type==TOK_MAX ) { out[i].func = func_max; out[i].tok_type = TOK_FUNC; out[i].tok_type = 1; }
+ else if ( out[i].tok_type==TOK_MIN ) { out[i].func = func_min; out[i].tok_type = TOK_FUNC; out[i].tok_type = 1; }
+ else if ( out[i].tok_type==TOK_AVG ) { out[i].func = func_avg; out[i].tok_type = TOK_FUNC; out[i].tok_type = 1; }
+ else if ( out[i].tok_type==TOK_SUM ) { out[i].func = func_sum; out[i].tok_type = TOK_FUNC; out[i].tok_type = 1; }
+ else if ( out[i].tok_type==TOK_ABS ) { out[i].func = func_abs; out[i].tok_type = TOK_FUNC; out[i].tok_type = 1; }
+ else if ( out[i].tok_type==TOK_CNT ) { out[i].func = func_count; out[i].tok_type = TOK_FUNC; out[i].tok_type = 1; }
+ else if ( out[i].tok_type==TOK_LEN ) { out[i].func = func_strlen; out[i].tok_type = TOK_FUNC; out[i].tok_type = 1; }
+ else if ( out[i].tok_type==TOK_BINOM ) { out[i].func = func_binom; out[i].tok_type = TOK_FUNC; }
+ else if ( out[i].tok_type==TOK_PERLSUB ) { out[i].func = perl_exec; out[i].tok_type = TOK_FUNC; }
hts_expand0(double,1,out[i].mvalues,out[i].values);
if ( filter->nsamples )
{
void filter_destroy(filter_t *filter)
{
+ perl_destroy(filter);
int i;
for (i=0; i<filter->nfilters; i++)
{
for (i=0; i<filter->nfilters; i++)
{
filter->filters[i].pass_site = 0;
-
if ( filter->filters[i].tok_type == TOK_VAL )
{
if ( filter->filters[i].setter ) // variable, query the VCF line
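
The filter parser in this hunk converts an expression to reverse Polish notation with Dijkstra's shunting-yard algorithm, using an operator stack (`ops`) and an output queue (`out`). The following is a minimal textbook sketch of that idea, not the bcftools implementation: `prec` and `to_rpn` are hypothetical names, operands are single alphanumeric characters, and all operators are treated as left-associative, which is an assumption of this sketch.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Precedence of a binary operator; 0 means "not an operator".
   The operator set and its ranking are illustrative only. */
static int prec(char op)
{
    switch (op) {
        case '|': return 1;
        case '&': return 2;
        case '<': case '>': return 3;
        case '+': case '-': return 4;
        case '*': case '/': return 5;
        default:  return 0;
    }
}

/* Convert an infix expression to RPN.
   Returns 0 on success and -1 on error (unbalanced brackets, unknown
   character, or buffers too small). */
static int to_rpn(const char *infix, char *out, size_t nout)
{
    char ops[256];                 /* operator stack */
    size_t nops = 0, n = 0, len;
    const char *p;

    assert(infix && out);
    len = strlen(infix);
    /* Every input character yields at most one stack or output entry. */
    if (len >= sizeof ops || len >= nout) return -1;

    for (p = infix; *p; p++) {
        char c = *p;
        if (isspace((unsigned char)c)) continue;
        if (isalnum((unsigned char)c)) {
            out[n++] = c;                          /* operand: straight to output */
        } else if (c == '(') {
            ops[nops++] = c;                       /* left bracket: push */
        } else if (c == ')') {
            while (nops > 0 && ops[nops-1] != '(') /* pop until the matching '(' */
                out[n++] = ops[--nops];
            if (nops == 0) return -1;              /* no matching '(' */
            nops--;                                /* discard the '(' */
        } else if (prec(c)) {
            /* pop operators that bind at least as tightly (left-associative) */
            while (nops > 0 && ops[nops-1] != '(' && prec(ops[nops-1]) >= prec(c))
                out[n++] = ops[--nops];
            ops[nops++] = c;
        } else {
            return -1;                             /* unknown character */
        }
    }
    while (nops > 0) {                             /* flush remaining operators */
        if (ops[nops-1] == '(') return -1;         /* unclosed '(' */
        out[n++] = ops[--nops];
    }
    out[n] = 0;
    return 0;
}
```

Calling `to_rpn("(a+b)*c", buf, sizeof buf)` fills `buf` with the postfix form. The real parser differs in several ways visible in the diff: it carries full `token_t` records on the operator stack, and it handles function tokens and variable-argument calls separately.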
--- /dev/null
+[Files in this distribution outwith the cram/ subdirectory are distributed
+according to the terms of the following MIT/Expat license.]
+
+The MIT/Expat License
+
+Copyright (C) 2012-2018 Genome Research Ltd.
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE.
+
+
+[Files within the cram/ subdirectory in this distribution are distributed
+according to the terms of the following Modified 3-Clause BSD license.]
+
+The Modified-BSD License
+
+Copyright (C) 2012-2018 Genome Research Ltd.
+
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions are met:
+
+1. Redistributions of source code must retain the above copyright notice,
+ this list of conditions and the following disclaimer.
+
+2. Redistributions in binary form must reproduce the above copyright notice,
+ this list of conditions and the following disclaimer in the documentation
+ and/or other materials provided with the distribution.
+
+3. Neither the names Genome Research Ltd and Wellcome Trust Sanger Institute
+ nor the names of its contributors may be used to endorse or promote products
+ derived from this software without specific prior written permission.
+
+THIS SOFTWARE IS PROVIDED BY GENOME RESEARCH LTD AND CONTRIBUTORS "AS IS"
+AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+DISCLAIMED. IN NO EVENT SHALL GENOME RESEARCH LTD OR ITS CONTRIBUTORS BE LIABLE
+FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+
+[The use of a range of years within a copyright notice in this distribution
+should be interpreted as being equivalent to a list of years including the
+first and last year specified and all consecutive years between them.
+
+For example, a copyright notice that reads "Copyright (C) 2005, 2007-2009,
+2011-2012" should be interpreted as being identical to a notice that reads
+"Copyright (C) 2005, 2007, 2008, 2009, 2011, 2012" and a copyright notice
+that reads "Copyright (C) 2005-2012" should be interpreted as being identical
+to a notice that reads "Copyright (C) 2005, 2006, 2007, 2008, 2009, 2010,
+2011, 2012".]
--- /dev/null
+HTSlib is an implementation of a unified C library for accessing common file
+formats, such as SAM, CRAM, VCF, and BCF, used for high-throughput sequencing
+data. It is the core library used by samtools and bcftools.
+
+See INSTALL for building and installation instructions.
/* main.c -- main bcftools command front-end.
- Copyright (C) 2012-2016 Genome Research Ltd.
+ Copyright (C) 2012-2018 Genome Research Ltd.
Author: Petr Danecek <pd3@sanger.ac.uk>
if (argc < 2) { usage(stderr); return 1; }
if (strcmp(argv[1], "version") == 0 || strcmp(argv[1], "--version") == 0 || strcmp(argv[1], "-v") == 0) {
- printf("bcftools %s\nUsing htslib %s\nCopyright (C) 2016 Genome Research Ltd.\n", bcftools_version(), hts_version());
+ printf("bcftools %s\nUsing htslib %s\nCopyright (C) 2018 Genome Research Ltd.\n", bcftools_version(), hts_version());
#if USE_GPL
printf("License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>\n");
#else
/* main.c -- main bcftools command front-end.
- Copyright (C) 2012-2016 Genome Research Ltd.
+ Copyright (C) 2012-2018 Genome Research Ltd.
Author: Petr Danecek <pd3@sanger.ac.uk>
if (argc < 2) { usage(bcftools_stderr); return 1; }
if (strcmp(argv[1], "version") == 0 || strcmp(argv[1], "--version") == 0 || strcmp(argv[1], "-v") == 0) {
- fprintf(bcftools_stdout, "bcftools %s\nUsing htslib %s\nCopyright (C) 2016 Genome Research Ltd.\n", bcftools_version(), hts_version());
+ fprintf(bcftools_stdout, "bcftools %s\nUsing htslib %s\nCopyright (C) 2018 Genome Research Ltd.\n", bcftools_version(), hts_version());
#if USE_GPL
fprintf(bcftools_stdout, "License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>\n");
#else
case 'u': mplp.output_type = FT_BCF; break;
case 'z': mplp.output_type = FT_VCF_GZ; break;
case 'v': mplp.output_type = FT_VCF; break;
- default: error("[error] The option \"-O\" changed meaning when mpileup moved to bcftools. Did you mean: \"bcftools mpileup --output-type\" or \"samtools mpileup --output-BP\"?\n", optarg);
+ default: error("[error] The option \"-O\" changed meaning when mpileup moved to bcftools. Did you mean: \"bcftools mpileup --output-type\" or \"samtools mpileup --output-BP\"?\n");
}
break;
case 'C': mplp.capQ_thres = atoi(optarg); break;
case 'u': mplp.output_type = FT_BCF; break;
case 'z': mplp.output_type = FT_VCF_GZ; break;
case 'v': mplp.output_type = FT_VCF; break;
- default: error("[error] The option \"-O\" changed meaning when mpileup moved to bcftools. Did you mean: \"bcftools mpileup --output-type\" or \"samtools mpileup --output-BP\"?\n", optarg);
+ default: error("[error] The option \"-O\" changed meaning when mpileup moved to bcftools. Did you mean: \"bcftools mpileup --output-type\" or \"samtools mpileup --output-BP\"?\n");
}
break;
case 'C': mplp.capQ_thres = atoi(optarg); break;
--- /dev/null
+/* plugins/GTisec.c -- collect genotype intersection counts of all possible
+ subsets of the present samples and output in banker's
+ sequence order (in this sequence, the number of contained
+ samples increases monotonically, a property that is e.g.
+ useful for programmatically creating plotting files for the
+ R package VennDiagram or the plotting tool circos from the
+ counts, as in the command line tools bankers2VennDiagram and
+ bankers2circos at https://github.com/dlaehnemann/bankers2)
+
+ Copyright (C) 2016 Computational Biology of Infection Research,
+ Helmholtz Centre for Infection Research, Braunschweig,
+ Germany
+
+ Author: David Laehnemann <david.laehnemann@helmholtz-hzi.de>
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE. */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <getopt.h>
+#include <math.h>
+#include <unistd.h>
+#include <inttypes.h>
+
+#include <htslib/vcf.h>
+#include <htslib/synced_bcf_reader.h>
+#include <htslib/khash.h>
+KHASH_MAP_INIT_INT(gts2smps, uint32_t)
+
+#include "bcftools.h"
+
+/*!
+ * Flag definitions for args.flag
+ */
+#define MISSING (1<<0)
+#define VERBOSE (1<<1)
+#define SMPORDER (1<<2)
+
+typedef struct _args_t
+{
+ bcf_srs_t *file; /*! multi-sample VCF file */
+ bcf_hdr_t *hdr; /*! VCF file header */
+ FILE *out; /*! output file pointer */
+ int nsmp; /*! number of samples, can be determined from header but is needed in multiple contexts */
+ int nsmpp2; /*! 2^(nsmp) (is needed multiple times) */
+ int *gt_arr; /*! temporary array, to store GTs of current line/record */
+ int ngt_arr; /*! hold the number of current GT array entries */
+ uint32_t *bankers; /*! array to store banker's sequence for all possible sample subsets for
+ programmatic indexing into smp_is for output printing, e.g. for three
+ samples A, B and C this would be the following order:
+ [ C, B, A, CB, CA, BA, CBA ]
+ [ 100, 010, 001, 110, 101, 011, 111 ]
+ */
+ uint64_t *quick; /*! array to store n choose k lookup table of choose() function */
+ uint8_t flag; /*! several flags, for positions see above*/
+ uint64_t *missing_gts; /*! array to count missing genotypes of each sample */
+ uint64_t *smp_is; /*! array to track all possible intersections between
+ samples, with each bit in the index integer belonging to one
+ sample. E.g. for three samples A, B and C, count would be in
+ the following order:
+ [ A, B, AB, C, AC, BC, ABC ]
+ [ 001, 010, 011, 100, 101, 110, 111 ]
+ */
+}
+args_t;
+
+static args_t args;
+uint32_t compute_bankers(unsigned long a);
+
+const char *about(void)
+{
+ return "Count genotype intersections across all possible sample subsets in a vcf file.\n";
+}
+
+
+const char *usage(void)
+{
+ return
+ "\n"
+ "About: Count genotype intersections across all possible sample subsets in a vcf file.\n"
+ "Usage: bcftools +GTisec <multisample.bcf/.vcf.gz> [General Options] -- [Plugin Options] \n"
+ "\n"
+ "Options:\n"
+ " run \"bcftools plugin\" for a list of common options\n"
+ "\n"
+ "Plugin options:\n"
+ " -m, --missing if set, include count of missing genotypes per sample in output\n"
+ " -v, --verbose if set, annotate count rows with corresponding sample subset lists\n"
+ " -H, --human-readable if set, create human readable output; i.e. sort output by sample and\n"
+ " print each subset's intersection count once for each sample contained\n"
+ " in the subset; implies verbose output (-v)\n"
+ "\n"
+ "Example:\n"
+ " bcftools +GTisec in.vcf -- -v # for verbose output\n"
+ " bcftools +GTisec in.vcf -- -H # for human readable output\n"
+ "\n";
+}
+
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+ memset(&args,0,sizeof(args_t));
+ args.flag = 0;
+
+ static struct option loptions[] =
+ {
+ {"help", no_argument, 0,'h'},
+ {"missing", no_argument, 0,'m'},
+ {"verbose", no_argument, 0,'v'},
+ {"human-readable", no_argument, 0,'H'},
+ {0,0,0,0}
+ };
+
+ int c;
+ while ((c = getopt_long(argc, argv, "?mvHh",loptions,NULL)) >= 0)
+ {
+ switch (c)
+ {
+ case 'm': args.flag |= MISSING; break;
+ case 'v': args.flag |= VERBOSE; break;
+ case 'H': args.flag |= ( SMPORDER | VERBOSE ); break;
+ case 'h': usage(); break;
+ case '?':
+ default: error("%s", usage()); break;
+ }
+ }
+ if ( optind != argc ) usage(); // too many files given
+
+
+ args.hdr = in;
+
+ if ( !bcf_hdr_nsamples(args.hdr) )
+ {
+ error("No samples in input file.\n");
+ }
+
+ args.nsmp = bcf_hdr_nsamples(args.hdr);
+ if ( args.nsmp > 32 ) error("Too many samples. A maximum of 32 is supported.\n");
+ args.nsmpp2 = pow( 2, args.nsmp);
+ args.bankers = (uint32_t*) calloc( args.nsmpp2, sizeof(uint32_t) );
+ args.quick = (uint64_t*) calloc((args.nsmp * (args.nsmp + 1)) / 4, sizeof(unsigned long));
+ if ( args.flag & MISSING ) args.missing_gts = (uint64_t*) calloc( args.nsmp, sizeof(uint64_t));
+ args.smp_is = (uint64_t*) calloc( args.nsmpp2, sizeof(uint64_t));
+ if ( bcf_hdr_id2int(args.hdr, BCF_DT_ID, "GT")<0 ) error("[E::%s] GT not present in the header\n", __func__);
+
+ args.gt_arr = NULL;
+ args.ngt_arr = 0;
+
+ args.out = stdout;
+
+ /*! Header printing */
+ FILE *fp = args.out;
+ fprintf(fp, "# This file was produced by bcftools +GTisec (%s+htslib-%s)\n", bcftools_version(), hts_version());
+ fprintf(fp, "# The command line was:\tbcftools +GTisec %s ", argv[0]);
+ int i;
+ for (i=1; i < argc; i++)
+ {
+ fprintf(fp, " %s", argv[i]);
+ }
+ fprintf(fp,"\n");
+ fprintf(fp,"# This file can be used as input to the subset plotting tools at:\n"
+ "# https://github.com/dlaehnemann/bankers2\n");
+ fprintf(fp,"# Genotype intersections across samples:\n");
+ fprintf(fp,"@SMPS");
+ for (i = args.nsmp-1; i >= 0; i--)
+ {
+ fprintf(fp," %s", bcf_hdr_int2id(args.hdr, BCF_DT_SAMPLE, i));
+ }
+ fprintf(fp,"\n");
+ if ( args.flag & MISSING )
+ {
+ if ( args.flag & SMPORDER )
+ {
+ fprintf(fp, "# The first line of each sample contains its count of missing genotypes, with a '-' appended\n"
+ "# to the sample name.\n");
+ }
+ else
+ {
+ fprintf(fp, "# The first %i lines contain the counts for missing values of each sample in the order provided\n"
+ "# in the SMPS-line above. Intersection counts only start afterwards.\n", args.nsmp);
+ }
+ }
+ if ( args.flag & SMPORDER )
+ {
+ fprintf(fp, "# Human readable output (-H) was requested. Subset intersection counts are therefore sorted by\n"
+ "# sample and repeated for each contained sample. For each sample, counts are in banker's \n"
+ "# sequence order regarding all other samples.\n");
+ }
+ else
+ {
+ fprintf(fp, "# Subset intersection counts are in global banker's sequence order.\n");
+ if ( args.nsmp > 2 )
+ {
+ fprintf(fp, "# After exclusive sample counts in order of the SMPS-line, banker's sequence continues with:\n"
+ "# %s,%s %s,%s ...\n",
+ bcf_hdr_int2id(in, BCF_DT_SAMPLE, args.nsmp-1 ),
+ bcf_hdr_int2id(in, BCF_DT_SAMPLE, args.nsmp-2 ),
+ bcf_hdr_int2id(in, BCF_DT_SAMPLE, args.nsmp-1 ),
+ bcf_hdr_int2id(in, BCF_DT_SAMPLE, args.nsmp-3 )
+ );
+ }
+ }
+ if (args.flag & VERBOSE )
+ {
+ fprintf(fp,"# [1] Number of shared non-ref genotypes \t[2] Samples sharing non-ref genotype (GT)\n");
+ }
+ else
+ {
+ fprintf(fp,"# [1] Number of shared non-ref genotypes\n");
+ }
+
+ /* Compute banker's sequence for following printing by sample and
+ * with increasing subset size.
+ */
+ uint32_t j;
+ for ( j = 0; j < args.nsmpp2; j++ )
+ {
+ args.bankers[j] = compute_bankers(j);
+ }
+
+ return 1;
+}
+
+
+/* ADAPTED CODE FROM CORIN LAWSON (START)
+ * https://github.com/au-phiware/bankers/blob/master/c/bankers.c
+ * who implemented ideas of Eric Burnett:
+ * http://www.thelowlyprogrammer.com/2010/04/indexing-and-enumerating-subsets-of.html
+ */
+
+/*
+ * Compute the binomial coefficient of `n choose k'.
+ * Use the fact that binom(n, k) = binom(n, n - k).
+ * Use a lookup table (triangle, actually) for speed.
+ * Otherwise it's dumb (heart) recursion.
+ * Added relative to Corin Lawson:
+ * * Passing in of sample number through pointer to args struct
+ * * Make quick lookup table external to keep it persistent with clean allocation
+ * and freeing
+ */
+uint64_t choose(unsigned int n, unsigned int k) {
+ if (n == 0)
+ return 0;
+ if (n == k || k == 0)
+ return 1;
+ if (k > n / 2)
+ k = n - k;
+
+ unsigned int i = (n * (n - 1)) / 4 + k - 1;
+ if (args.quick[i] == 0)
+ args.quick[i] = choose(n - 1, k - 1) + choose(n - 1, k);
+
+ return args.quick[i];
+}
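
The comment above `choose()` describes a memoized binomial coefficient built on the symmetry binom(n, k) = binom(n, n - k) and Pascal's rule. As a hedged illustration of the same idea, the sketch below uses a plain two-dimensional table instead of the plugin's packed triangular index. `choose_memo`, `NMAX` and `table` are hypothetical names, and, unlike the plugin's `choose()`, this version returns 1 for n = 0, k = 0.

```c
#include <assert.h>
#include <stdint.h>

#define NMAX 32   /* assumed upper bound on n, matching the 32-sample limit */

/* Lookup table; zero means "not computed yet". Static storage starts zeroed. */
static uint64_t table[NMAX + 1][NMAX + 1];

/* Binomial coefficient "n choose k" with memoization. */
static uint64_t choose_memo(unsigned int n, unsigned int k)
{
    assert(n <= NMAX);
    if (k > n) return 0;
    if (k == 0 || k == n) return 1;
    if (k > n - k) k = n - k;              /* symmetry: C(n,k) == C(n,n-k) */
    if (table[n][k] == 0)                  /* Pascal's rule, computed once */
        table[n][k] = choose_memo(n - 1, k - 1) + choose_memo(n - 1, k);
    return table[n][k];
}
```

The two-dimensional table wastes roughly three quarters of its entries compared with a packed triangle, but it makes the indexing trivially collision-free, which is the design choice this sketch favors.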
+
+/*
+ * Returns the Banker's number at the specified position, a.
+ * Derived from the recursive bit flip method.
+ * Added relative to Corin Lawson:
+ * * Uses same lookup table solution as choose function, just
+ * maintained externally to persist across separate function calls.
+ * * Uses bitwise symmetry of banker's sequence to use bitwise inversion
+ * instead of recursive bit flip for second half of sequence.
+ */
+uint32_t compute_bankers(unsigned long a)
+{
+ if (a == 0)
+ return 0;
+
+ if ( args.bankers[a] == 0 )
+ {
+ if ( a >= (args.nsmpp2 / 2) )
+ return args.bankers[a] = ( compute_bankers(args.nsmpp2 - (a+1)) ^ (args.nsmpp2 - 1) ); // use bitwise symmetry of bankers sequence
+ unsigned int c = 0;
+ uint32_t n = args.nsmp;
+ uint64_t e = a, binom;
+ binom = choose(n, c);
+ do {
+ e -= binom;
+ } while ((binom = choose(n, ++c)) <= e);
+
+ do {
+ if (e == 0 || (binom = choose(n - 1, c - 1)) > e)
+ c--, args.bankers[a] |= 1;
+ else
+ e -= binom;
+ } while (--n && c && ((args.bankers[a] <<= 1) || 1));
+ args.bankers[a] <<= n;
+ }
+
+ return args.bankers[a];
+}
+
+// ADAPTED CODE FROM CORIN LAWSON END
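
For readers unfamiliar with banker's sequence order, the sketch below reproduces the order documented in the `bankers` field comment for three samples (`100, 010, 001, 110, 101, 011, 111`) by sorting all bitmasks by subset size and then by descending value. This is a simple stand-in under that stated assumption about the ordering, not the recursive method used by `compute_bankers()`; `popcount32`, `cmp_bankers` and `bankers_by_sorting` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Count the set bits of a 32-bit mask. */
static int popcount32(uint32_t x)
{
    int n = 0;
    while (x) { x &= x - 1; n++; }
    return n;
}

/* qsort comparator: smaller subsets first; within one subset size,
   larger mask values first. */
static int cmp_bankers(const void *pa, const void *pb)
{
    uint32_t a = *(const uint32_t *)pa, b = *(const uint32_t *)pb;
    int ca = popcount32(a), cb = popcount32(b);
    if (ca != cb) return ca < cb ? -1 : 1;
    if (a != b) return a > b ? -1 : 1;
    return 0;
}

/* Fill seq[0 .. 2^nsmp - 1] with all sample subsets in the assumed order.
   The caller provides room for 2^nsmp entries. */
static void bankers_by_sorting(int nsmp, uint32_t *seq)
{
    uint32_t i, n;
    assert(nsmp >= 0 && nsmp < 32);
    n = (uint32_t)1 << nsmp;
    for (i = 0; i < n; i++) seq[i] = i;
    qsort(seq, n, sizeof(uint32_t), cmp_bankers);
}
```

With `nsmp = 3` the result also shows the bitwise symmetry that `compute_bankers()` exploits for the second half of the sequence: the entry at position `7 - a` is the complement of the entry at position `a`. Sorting costs more than direct computation, so this form is only meant for checking small cases.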
+
+
+/*
+ * GT field (genotype) comparison function.
+ */
+bcf1_t *process(bcf1_t *rec)
+{
+ uint64_t i;
+ bcf_unpack(rec, BCF_UN_FMT); // unpack the Format fields, including the GT field
+ int gte_smp = 0; // number of GT array entries per sample (should be 2, one entry per allele)
+ if ( (gte_smp = bcf_get_genotypes(args.hdr, rec, &(args.gt_arr), &(args.ngt_arr) ) ) <= 0 )
+ {
+ error("GT not present at %s: %d\n", args.hdr->id[BCF_DT_CTG][rec->rid].key, rec->pos+1);
+ }
+
+ gte_smp /= args.nsmp; // divide total number of genotypes array entries (= args.ngt_arr) by number of samples
+ int ret;
+
+ // stick all genotypes in a hash as keys and store up to 32 samples in a corresponding flag as its value
+ khiter_t bucket;
+ khash_t(gts2smps) *gts = kh_init(gts2smps); // create hash
+ for ( i = 0; i < args.nsmp; i++ )
+ {
+ int *gt_ptr = args.gt_arr + gte_smp * i;
+
+ if (bcf_gt_is_missing(gt_ptr[0]) || ( gte_smp == 2 && bcf_gt_is_missing(gt_ptr[1]) ) )
+ {
+ if ( args.flag & MISSING ) args.missing_gts[i]++; // count missing genotypes, if requested
+ continue; // don't do anything else for missing genotypes, their "sharing" gives no info...
+ }
+
+ int a = bcf_gt_allele(gt_ptr[0]);
+ int b;
+ if ( gte_smp == 2 ) // two entries available per sample, padded with missing values for haploid genotypes
+ {
+ b = bcf_gt_allele(gt_ptr[1]);
+ }
+ else if (gte_smp == 1 ) // use missing value for second entry in hash key generation below, if only one is available
+ {
+ b = bcf_gt_allele(bcf_int32_vector_end);
+ }
+ else
+ {
+ error("gtisec does not support ploidy higher than 2.\n");
+ }
+
+ int idx = bcf_alleles2gt(a,b); // generate genotype specific hash key
+
+ bucket = kh_get(gts2smps, gts, idx); // get the genotype's hash bucket
+
+ if ( bucket == kh_end(gts) ) { // means that key does not exist
+ bucket = kh_put(gts2smps, gts, idx, &ret); // create bucket with genotype index as key and return its iterator
+ kh_val(gts, bucket) = 0; // initialize the bucket with all sample bits unset
+ }
+ kh_value(gts, bucket) |= (1<<i); // set the sample's bit to 1 in this genotype's bucket
+ }
+
+ // iterate over genotypes and for each genotype increment the appropriate smp_is entry
+ for ( bucket = kh_begin(gts); bucket != kh_end(gts); ++bucket ) // iterate over all genotypes at this position
+ {
+ if ( kh_exist(gts, bucket) ) // for existing genotype buckets
+ {
+ uint32_t s = kh_val(gts, bucket); // get the 32 bit flag
+ args.smp_is[s]++; // add to the corresponding subset
+ }
+ }
+ kh_destroy(gts2smps, gts); // destroy hash
+
+ return NULL;
+}
+
+void destroy(void)
+{
+ int32_t i;
+ int s;
+
+ FILE *fp = args.out;
+
+ /* Printing to File */
+ if ( args.flag & SMPORDER )
+ {
+ /* Iterate over samples, printing out all subsets including
+ * the current sample, with the current sample first. This
+ * includes multiple printouts of the same sample but makes
+ * output more readable and is also needed for circos files
+ * printing.
+ */
+ for ( s = args.nsmp-1; s >= 0; s--)
+ {
+ if ( args.flag & MISSING ) // if missing genotype counts are requested, print them to standard output
+ {
+ fprintf(fp, "%"PRIu64"\t%s-\n", args.missing_gts[s], bcf_hdr_int2id(args.hdr, BCF_DT_SAMPLE, s));
+ }
+ for ( i = 1; i < args.nsmpp2; i++ )
+ {
+ if ( (args.bankers[i]>>s) & 1 )
+ {
+ fprintf(fp, "%"PRIu64"\t", args.smp_is[ args.bankers[i] ]); // print out count of genotypes shared by samples in current banker's sequence position
+ int j;
+ /* Print sample list */
+ fprintf(fp, "%s", bcf_hdr_int2id(args.hdr, BCF_DT_SAMPLE, s)); // print current sample first
+ for ( j = args.nsmp-1; j >= 0; j-- )
+ {
+ if ( (args.bankers[i] ^ ((uint32_t)1<<s)) & ((uint32_t)1<<j) ) // exclude current sample from printing again
+ {
+ fprintf(fp, ",%s", bcf_hdr_int2id(args.hdr, BCF_DT_SAMPLE, j ) ); // print out sample list, starting with our current major sample
+ }
+ }
+ fprintf(fp, "\n" );
+ }
+ }
+ }
+ }
+ else if ( args.flag & VERBOSE )
+ {
+ if ( args.flag & MISSING ) // if missing genotype counts are requested, print them to standard output
+ {
+ for ( s = args.nsmp-1; s >= 0; s--)
+ {
+ fprintf(fp, "%"PRIu64"\t%s-\n", args.missing_gts[s], bcf_hdr_int2id(args.hdr, BCF_DT_SAMPLE, s));
+ }
+ }
+ for ( i = 1; i < args.nsmpp2; i++ )
+ {
+ fprintf(fp, "%"PRIu64"\t", args.smp_is[ args.bankers[i] ]); // print out count of genotypes shared by samples in current banker's sequence position
+ int j = 0;
+ for ( s = args.nsmp-1; s >= 0; s--)
+ {
+ if ( (args.bankers[i]>>s) & 1 )
+ {
+ fprintf(fp, "%s%s", j ? "," : "", bcf_hdr_int2id(args.hdr, BCF_DT_SAMPLE, s) ); // samples in specified order
+ j = 1;
+ }
+ }
+ fprintf(fp, "\n" );
+ }
+ }
+ else
+ {
+ if ( args.flag & MISSING ) // if missing genotype counts are requested, print them to standard output
+ {
+ for ( s = args.nsmp-1; s >= 0; s--)
+ {
+ fprintf(fp, "%"PRIu64"\n", args.missing_gts[s]);
+ }
+ }
+ for ( i = 1; i < args.nsmpp2; i++ )
+ {
+ fprintf(fp, "%"PRIu64"\n", args.smp_is[ args.bankers[i] ]); // print out count of genotypes shared by samples in current banker's sequence position
+ }
+ }
+ fclose(fp);
+
+ /* freeing up args */
+ free(args.gt_arr);
+ free(args.bankers);
+ free(args.quick);
+ if (args.flag & MISSING) free(args.missing_gts);
+ free(args.smp_is);
+}
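The per-site counting in `process()` above can be illustrated without htslib or khash. The following is a minimal sketch, assuming each sample's genotype has already been reduced to a single integer key (negative meaning missing); `count_site` and `MAX_SMP` are hypothetical names used only for this illustration and are not part of bcftools.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_SMP 32  /* one bit per sample in a 32-bit mask (illustrative limit) */

/* Hypothetical helper, not part of bcftools: for one site, group samples by
 * identical genotype key and increment counts[mask] once per distinct
 * genotype, where bit i of mask is set if sample i carries that genotype.
 * Negative keys stand in for missing genotypes and are skipped.
 * counts must provide 2^nsmp entries. */
static void count_site(const int *gt_key, int nsmp, uint64_t *counts)
{
    assert(nsmp > 0 && nsmp <= MAX_SMP);
    uint32_t seen = 0;  /* samples already assigned to a genotype group */
    for (int i = 0; i < nsmp; i++) {
        if (gt_key[i] < 0 || ((seen >> i) & 1u)) continue;
        uint32_t mask = 0;
        for (int j = i; j < nsmp; j++)
            if (gt_key[j] == gt_key[i]) mask |= (uint32_t)1 << j;
        seen |= mask;     /* mark every sample sharing this genotype */
        counts[mask]++;   /* one increment per distinct genotype at the site */
    }
}
```

The plugin uses a hash keyed by genotype index instead of the quadratic grouping shown here; the sketch only mirrors the resulting bookkeeping, namely one counter per sample subset, indexed by the subset's bitmask.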
--- /dev/null
+#include "bcftools.pysam.h"
+
+/* plugins/GTisec.c -- collect genotype intersection counts of all possible
+ subsets of the present samples and output in banker's
+ sequence order (in this sequence, the number of contained
+ samples increases monotonically, a property that is e.g.
+ useful for programmatically creating plotting files for the
+ R package VennDiagram or the plotting tool circos from the
+ counts, as in the command line tools bankers2VennDiagram and
+ bankers2circos at https://github.com/dlaehnemann/bankers2)
+
+ Copyright (C) 2016 Computational Biology of Infection Research,
+ Helmholtz Centre for Infection Research, Braunschweig,
+ Germany
+
+ Author: David Laehnemann <david.laehnemann@helmholtz-hzi.de>
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE. */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <getopt.h>
+#include <math.h>
+#include <unistd.h>
+#include <inttypes.h>
+
+#include <htslib/vcf.h>
+#include <htslib/synced_bcf_reader.h>
+#include <htslib/khash.h>
+KHASH_MAP_INIT_INT(gts2smps, uint32_t)
+
+#include "bcftools.h"
+
+/*!
+ * Flag definitions for args.flag
+ */
+#define MISSING (1<<0)
+#define VERBOSE (1<<1)
+#define SMPORDER (1<<2)
+
+typedef struct _args_t
+{
+ bcf_srs_t *file; /*! multi-sample VCF file */
+ bcf_hdr_t *hdr; /*! VCF file header */
+ FILE *out; /*! output file pointer */
+ int nsmp; /*! number of samples, can be determined from header but is needed in multiple contexts */
+ int nsmpp2; /*! 2^(nsmp) (is needed multiple times) */
+ int *gt_arr; /*! temporary array, to store GTs of current line/record */
+ int ngt_arr; /*! hold the number of current GT array entries */
+ uint32_t *bankers; /*! array to store banker's sequence for all possible sample subsets for
+ programmatic indexing into smp_is for output printing, e.g. for three
+ samples A, B and C this would be the following order:
+ [ C, B, A, CB, CA, BA, CBA ]
+ [ 100, 010, 001, 110, 101, 011, 111 ]
+ */
+ uint64_t *quick; /*! array to store n choose k lookup table of choose() function */
+ uint8_t flag; /*! several flags, for positions see above*/
+ uint64_t *missing_gts; /*! array to count missing genotypes of each sample */
+ uint64_t *smp_is; /*! array to track all possible intersections between
+ samples, with each bit in the index integer belonging to one
+ sample. E.g. for three samples A, B and C, count would be in
+ the following order:
+ [ A, B, AB, C, AC, BC, ABC ]
+ [ 001, 010, 011, 100, 101, 110, 111 ]
+ */
+}
+args_t;
+
+static args_t args;
+uint32_t compute_bankers(unsigned long a);
+
+const char *about(void)
+{
+ return "Count genotype intersections across all possible sample subsets in a vcf file.\n";
+}
+
+
+const char *usage(void)
+{
+ return
+ "\n"
+ "About: Count genotype intersections across all possible sample subsets in a vcf file.\n"
+ "Usage: bcftools +GTisec <multisample.bcf/.vcf.gz> [General Options] -- [Plugin Options] \n"
+ "\n"
+ "Options:\n"
+ " run \"bcftools plugin\" for a list of common options\n"
+ "\n"
+ "Plugin options:\n"
+ " -m, --missing if set, include count of missing genotypes per sample in output\n"
+ " -v, --verbose if set, annotate count rows with corresponding sample subset lists\n"
+ " -H, --human-readable if set, create human readable output; i.e. sort output by sample and\n"
+ " print each subset's intersection count once for each sample contained\n"
+ " in the subset; implies verbose output (-v)\n"
+ "\n"
+ "Example:\n"
+ " bcftools +GTisec in.vcf -- -v # for verbose output\n"
+ " bcftools +GTisec in.vcf -- -H # for human readable output\n"
+ "\n";
+}
+
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+ memset(&args,0,sizeof(args_t));
+ args.flag = 0;
+
+ static struct option loptions[] =
+ {
+ {"help", no_argument, 0,'h'},
+ {"missing", no_argument, 0,'m'},
+ {"verbose", no_argument, 0,'v'},
+ {"human-readable", no_argument, 0,'H'},
+ {0,0,0,0}
+ };
+
+ int c;
+ while ((c = getopt_long(argc, argv, "?mvHh",loptions,NULL)) >= 0)
+ {
+ switch (c)
+ {
+ case 'm': args.flag |= MISSING; break;
+ case 'v': args.flag |= VERBOSE; break;
+ case 'H': args.flag |= ( SMPORDER | VERBOSE ); break;
+ case 'h': error("%s", usage()); break;
+ case '?':
+ default: error("%s", usage()); break;
+ }
+ }
+ if ( optind != argc ) error("%s", usage()); // too many files given
+
+
+ args.hdr = in;
+
+ if ( !bcf_hdr_nsamples(args.hdr) )
+ {
+ error("No samples in input file.\n");
+ }
+
+ args.nsmp = bcf_hdr_nsamples(args.hdr);
+ if ( args.nsmp > 32 ) error("Too many samples. A maximum of 32 is supported.\n");
+ args.nsmpp2 = pow( 2, args.nsmp);
+ args.bankers = (uint32_t*) calloc( args.nsmpp2, sizeof(uint32_t) );
+ args.quick = (uint64_t*) calloc((args.nsmp * (args.nsmp + 1)) / 4, sizeof(uint64_t));
+ if ( args.flag & MISSING ) args.missing_gts = (uint64_t*) calloc( args.nsmp, sizeof(uint64_t));
+ args.smp_is = (uint64_t*) calloc( args.nsmpp2, sizeof(uint64_t));
+ if ( bcf_hdr_id2int(args.hdr, BCF_DT_ID, "GT")<0 ) error("[E::%s] GT not present in the header\n", __func__);
+
+ args.gt_arr = NULL;
+ args.ngt_arr = 0;
+
+ args.out = bcftools_stdout;
+
+ /*! Header printing */
+ FILE *fp = args.out;
+ fprintf(fp, "# This file was produced by bcftools +GTisec (%s+htslib-%s)\n", bcftools_version(), hts_version());
+ fprintf(fp, "# The command line was:\tbcftools +GTisec %s ", argv[0]);
+ int i;
+ for (i=1; i < argc; i++)
+ {
+ fprintf(fp, " %s", argv[i]);
+ }
+ fprintf(fp,"\n");
+ fprintf(fp,"# This file can be used as input to the subset plotting tools at:\n"
+ "# https://github.com/dlaehnemann/bankers2\n");
+ fprintf(fp,"# Genotype intersections across samples:\n");
+ fprintf(fp,"@SMPS");
+ for (i = args.nsmp-1; i >= 0; i--)
+ {
+ fprintf(fp," %s", bcf_hdr_int2id(args.hdr, BCF_DT_SAMPLE, i));
+ }
+ fprintf(fp,"\n");
+ if ( args.flag & MISSING )
+ {
+ if ( args.flag & SMPORDER )
+ {
+ fprintf(fp, "# The first line of each sample contains its count of missing genotypes, with a '-' appended\n"
+ "# to the sample name.\n");
+ }
+ else
+ {
+ fprintf(fp, "# The first %i lines contain the counts for missing values of each sample in the order provided\n"
+ "# in the SMPS-line above. Intersection counts only start afterwards.\n", args.nsmp);
+ }
+ }
+ if ( args.flag & SMPORDER )
+ {
+ fprintf(fp, "# Human readable output (-H) was requested. Subset intersection counts are therefore sorted by\n"
+ "# sample and repeated for each contained sample. For each sample, counts are in banker's \n"
+ "# sequence order regarding all other samples.\n");
+ }
+ else
+ {
+ fprintf(fp, "# Subset intersection counts are in global banker's sequence order.\n");
+ if ( args.nsmp > 2 )
+ {
+ fprintf(fp, "# After exclusive sample counts in order of the SMPS-line, banker's sequence continues with:\n"
+ "# %s,%s %s,%s ...\n",
+ bcf_hdr_int2id(in, BCF_DT_SAMPLE, args.nsmp-1 ),
+ bcf_hdr_int2id(in, BCF_DT_SAMPLE, args.nsmp-2 ),
+ bcf_hdr_int2id(in, BCF_DT_SAMPLE, args.nsmp-1 ),
+ bcf_hdr_int2id(in, BCF_DT_SAMPLE, args.nsmp-3 )
+ );
+ }
+ }
+ if (args.flag & VERBOSE )
+ {
+ fprintf(fp,"# [1] Number of shared non-ref genotypes \t[2] Samples sharing non-ref genotype (GT)\n");
+ }
+ else
+ {
+ fprintf(fp,"# [1] Number of shared non-ref genotypes\n");
+ }
+
+ /* Compute banker's sequence for following printing by sample and
+ * with increasing subset size.
+ */
+ uint32_t j;
+ for ( j = 0; j < args.nsmpp2; j++ )
+ {
+ args.bankers[j] = compute_bankers(j);
+ }
+
+ return 1;
+}
+
+
+/* ADAPTED CODE FROM CORIN LAWSON (START)
+ * https://github.com/au-phiware/bankers/blob/master/c/bankers.c
+ * who implemented ideas of Eric Burnett:
+ * http://www.thelowlyprogrammer.com/2010/04/indexing-and-enumerating-subsets-of.html
+ */
+
+/*
+ * Compute the binomial coefficient of `n choose k'.
+ * Use the fact that binom(n, k) = binom(n, n - k).
+ * Use a lookup table (triangle, actually) for speed.
+ * Otherwise it's dumb (heart) recursion.
+ * Added relative to Corin Lawson:
+ * * Passing in of sample number through pointer to args struct
+ * * Make quick lookup table external to keep it persistent with clean allocation
+ * and freeing
+ */
+uint64_t choose(unsigned int n, unsigned int k) {
+ if (n == 0)
+ return 0;
+ if (n == k || k == 0)
+ return 1;
+ if (k > n / 2)
+ k = n - k;
+
+ unsigned int i = (n * (n - 1)) / 4 + k - 1;
+ if (args.quick[i] == 0)
+ args.quick[i] = choose(n - 1, k - 1) + choose(n - 1, k);
+
+ return args.quick[i];
+}
+
+/*
+ * Returns the Banker's number at the specified position, a.
+ * Derived from the recursive bit flip method.
+ * Added relative to Corin Lawson:
+ * * Uses same lookup table solution as choose function, just
+ * maintained externally to persist across separate function calls.
+ * * Uses bitwise symmetry of banker's sequence to use bitwise inversion
+ * instead of recursive bit flip for second half of sequence.
+ */
+uint32_t compute_bankers(unsigned long a)
+{
+ if (a == 0)
+ return 0;
+
+ if ( args.bankers[a] == 0 )
+ {
+ if ( a >= (args.nsmpp2 / 2) )
+ return args.bankers[a] = ( compute_bankers(args.nsmpp2 - (a+1)) ^ (args.nsmpp2 - 1) ); // use bitwise symmetry of bankers sequence
+ unsigned int c = 0;
+ uint32_t n = args.nsmp;
+ uint64_t e = a, binom;
+ binom = choose(n, c);
+ do {
+ e -= binom;
+ } while ((binom = choose(n, ++c)) <= e);
+
+ do {
+ if (e == 0 || (binom = choose(n - 1, c - 1)) > e)
+ c--, args.bankers[a] |= 1;
+ else
+ e -= binom;
+ } while (--n && c && ((args.bankers[a] <<= 1) || 1));
+ args.bankers[a] <<= n;
+ }
+
+ return args.bankers[a];
+}
+
+// ADAPTED CODE FROM CORIN LAWSON END
+
+
+/*
+ * GT field (genotype) comparison function.
+ */
+bcf1_t *process(bcf1_t *rec)
+{
+ uint64_t i;
+ bcf_unpack(rec, BCF_UN_FMT); // unpack the Format fields, including the GT field
+ int gte_smp = 0; // number GT array entries per sample (should be 2, one entry per allele)
+ if ( (gte_smp = bcf_get_genotypes(args.hdr, rec, &(args.gt_arr), &(args.ngt_arr) ) ) <= 0 )
+ {
+ error("GT not present at %s: %d\n", args.hdr->id[BCF_DT_CTG][rec->rid].key, rec->pos+1);
+ }
+
+ gte_smp /= args.nsmp; // divide total number of genotypes array entries (= args.ngt_arr) by number of samples
+ int ret;
+
+ // stick all genotypes in a hash as keys and store up to 32 samples in a corresponding flag as its value
+ khiter_t bucket;
+ khash_t(gts2smps) *gts = kh_init(gts2smps); // create hash
+ for ( i = 0; i < args.nsmp; i++ )
+ {
+ int *gt_ptr = args.gt_arr + gte_smp * i;
+
+ if (bcf_gt_is_missing(gt_ptr[0]) || ( gte_smp == 2 && bcf_gt_is_missing(gt_ptr[1]) ) )
+ {
+ if ( args.flag & MISSING ) args.missing_gts[i]++; // count missing genotypes, if requested
+ continue; // don't do anything else for missing genotypes, their "sharing" gives no info...
+ }
+
+ int a = bcf_gt_allele(gt_ptr[0]);
+ int b;
+ if ( gte_smp == 2 ) // two entries available per sample, padded with missing values for haploid genotypes
+ {
+ b = bcf_gt_allele(gt_ptr[1]);
+ }
+ else if (gte_smp == 1 ) // use missing value for second entry in hash key generation below, if only one is available
+ {
+ b = bcf_gt_allele(bcf_int32_vector_end);
+ }
+ else
+ {
+ error("gtisec does not support ploidy higher than 2.\n");
+ }
+
+ int idx = bcf_alleles2gt(a,b); // generate genotype specific hash key
+
+ bucket = kh_get(gts2smps, gts, idx); // get the genotype's hash bucket
+
+ if ( bucket == kh_end(gts) ) { // means that key does not exist
+ bucket = kh_put(gts2smps, gts, idx, &ret); // create bucket with genotype index as key and return its iterator
+ kh_val(gts, bucket) = 0; // initialize the bucket with all sample bits unset
+ }
+ kh_value(gts, bucket) |= ((uint32_t)1 << i); // set the sample's bit to 1 in this genotype's bucket
+ }
+
+ // iterate over genotypes and for each genotype increment the appropriate smp_is entry
+ for ( bucket = kh_begin(gts); bucket != kh_end(gts); ++bucket ) // iterate over all genotypes at this position
+ {
+ if ( kh_exist(gts, bucket) ) // for existing genotype buckets
+ {
+ uint32_t s = kh_val(gts, bucket); // get the 32 bit flag
+ args.smp_is[s]++; // add to the corresponding subset
+ }
+ }
+ kh_destroy(gts2smps, gts); // destroy hash
+
+ return NULL;
+}
+
+void destroy(void)
+{
+ int32_t i;
+ int s;
+
+ FILE *fp = args.out;
+
+ /* Printing to File */
+ if ( args.flag & SMPORDER )
+ {
+ /* Iterate over samples, printing out all subsets including
+ * the current sample, with the current sample first. This
+ * includes multiple printouts of the same sample but makes
+ * output more readable and is also needed for circos files
+ * printing.
+ */
+ for ( s = args.nsmp-1; s >= 0; s--)
+ {
+ if ( args.flag & MISSING ) // if missing genotype counts are requested, print them to standard output
+ {
+ fprintf(fp, "%"PRIu64"\t%s-\n", args.missing_gts[s], bcf_hdr_int2id(args.hdr, BCF_DT_SAMPLE, s));
+ }
+ for ( i = 1; i < args.nsmpp2; i++ )
+ {
+ if ( (args.bankers[i]>>s) & 1 )
+ {
+ fprintf(fp, "%"PRIu64"\t", args.smp_is[ args.bankers[i] ]); // print out count of genotypes shared by samples in current banker's sequence position
+ int j;
+ /* Print sample list */
+ fprintf(fp, "%s", bcf_hdr_int2id(args.hdr, BCF_DT_SAMPLE, s)); // print current sample first
+ for ( j = args.nsmp-1; j >= 0; j-- )
+ {
+ if ( (args.bankers[i] ^ ((uint32_t)1<<s)) & ((uint32_t)1<<j) ) // exclude current sample from printing again
+ {
+ fprintf(fp, ",%s", bcf_hdr_int2id(args.hdr, BCF_DT_SAMPLE, j ) ); // print out sample list, starting with our current major sample
+ }
+ }
+ fprintf(fp, "\n" );
+ }
+ }
+ }
+ }
+ else if ( args.flag & VERBOSE )
+ {
+ if ( args.flag & MISSING ) // if missing genotype counts are requested, print them to standard output
+ {
+ for ( s = args.nsmp-1; s >= 0; s--)
+ {
+ fprintf(fp, "%"PRIu64"\t%s-\n", args.missing_gts[s], bcf_hdr_int2id(args.hdr, BCF_DT_SAMPLE, s));
+ }
+ }
+ for ( i = 1; i < args.nsmpp2; i++ )
+ {
+ fprintf(fp, "%"PRIu64"\t", args.smp_is[ args.bankers[i] ]); // print out count of genotypes shared by samples in current banker's sequence position
+ int j = 0;
+ for ( s = args.nsmp-1; s >= 0; s--)
+ {
+ if ( (args.bankers[i]>>s) & 1 )
+ {
+ fprintf(fp, "%s%s", j ? "," : "", bcf_hdr_int2id(args.hdr, BCF_DT_SAMPLE, s) ); // samples in specified order
+ j = 1;
+ }
+ }
+ fprintf(fp, "\n" );
+ }
+ }
+ else
+ {
+ if ( args.flag & MISSING ) // if missing genotype counts are requested, print them to standard output
+ {
+ for ( s = args.nsmp-1; s >= 0; s--)
+ {
+ fprintf(fp, "%"PRIu64"\n", args.missing_gts[s]);
+ }
+ }
+ for ( i = 1; i < args.nsmpp2; i++ )
+ {
+ fprintf(fp, "%"PRIu64"\n", args.smp_is[ args.bankers[i] ]); // print out count of genotypes shared by samples in current banker's sequence position
+ }
+ }
+ fclose(fp);
+
+ /* freeing up args */
+ free(args.gt_arr);
+ free(args.bankers);
+ free(args.quick);
+ if (args.flag & MISSING) free(args.missing_gts);
+ free(args.smp_is);
+}
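The output order used above (subsets by increasing size) can be cross-checked on small inputs with a brute-force enumeration. This is a minimal sketch, not the plugin's method: it assumes that within one subset size the masks appear in decreasing numeric order, which reproduces the three-sample example `[ 100, 010, 001, 110, 101, 011, 111 ]` given in the `args_t` comment. `popcount32` and `bankers_bruteforce` are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Count the set bits of a 32-bit mask (portable, no compiler builtins). */
static int popcount32(uint32_t x)
{
    int n = 0;
    for (; x; x &= x - 1) n++;  /* clear the lowest set bit each round */
    return n;
}

/* Hypothetical brute-force reference, not part of bcftools: fill
 * out[0 .. 2^nsmp - 1] with all subset masks, ordered by increasing subset
 * size and, within one size, by decreasing mask value. out[0] is the empty
 * set. Runs in O(nsmp * 2^nsmp), so it is only meant for small checks. */
static void bankers_bruteforce(int nsmp, uint32_t *out)
{
    assert(nsmp > 0 && nsmp < 32);
    uint32_t total = (uint32_t)1 << nsmp;
    uint32_t pos = 0;
    for (int k = 0; k <= nsmp; k++)
        for (uint32_t m = total; m-- > 0; )
            if (popcount32(m) == k) out[pos++] = m;
}
```

Comparing this table against `args.bankers` for a handful of sample counts would be one way to sanity-check the lookup-table implementation; the direct computation in `compute_bankers()` avoids the repeated full scans.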
--- /dev/null
+/* plugins/GTsubset.c -- output only positions where the selected samples exclusively
+ share a genotype, i.e. all selected samples must have the same
+ genotype (including both alleles) and none of the unselected
+ samples can have the same genotype
+
+ Copyright (C) 2016 Computational Biology of Infection Research,
+ Helmholtz Centre for Infection Research, Braunschweig,
+ Germany
+
+ Author: David Laehnemann <david.laehnemann@hhu.de>
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE. */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <getopt.h>
+#include <math.h>
+#include <unistd.h>
+#include <inttypes.h>
+
+#include <htslib/vcf.h>
+#include <htslib/synced_bcf_reader.h>
+
+#include "bcftools.h"
+
+typedef struct _args_t
+{
+ bcf_hdr_t *hdr; /*! VCF file header */
+ int *gt_arr; /*! temporary array, to store GTs of current line/record */
+ int ngt_arr; /*! hold the number of current GT array entries */
+ int nsmp; /*! number of samples, can be determined from header but is needed in multiple contexts */
+ int n_sel_smps; /*! number of selected samples who should exclusively share genotypes */
+ int *selected_smps; /*! pointer to start of array containing 1 at indices corresponding to selected samples in header dict and 0 at others*/
+}
+args_t;
+
+static args_t args;
+
+const char *about(void)
+{
+ return "Output only sites where the requested samples all exclusively share a genotype (GT).\n";
+}
+
+
+const char *usage(void)
+{
+ return
+ "\n"
+ "About: Output only sites where the requested samples all exclusively share a genotype (GT), i.e.\n"
+ " all selected samples must have the same GT, while none of the others can have it.\n"
+ "Usage: bcftools +GTsubset <multisample.bcf/.vcf.gz> [General Options] -- [Plugin Options] \n"
+ "\n"
+ "Options:\n"
+ " run \"bcftools plugin\" for a list of common options\n"
+ "\n"
+ "Plugin options:\n"
+ " -s,--sample-list comma-separated list of samples; only those sites where all of these\n"
+ " samples exclusively share their genotype are given as output\n"
+ "\n"
+ "Example:\n"
+ " bcftools +GTsubset in.vcf -- -s SMP1,SMP2 \n"
+ "\n";
+}
+
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+ memset(&args,0,sizeof(args_t));
+
+ int i;
+
+ static struct option loptions[] =
+ {
+ {"help", no_argument, 0,'h'},
+ {"sample-list", required_argument, 0,'s'},
+ {0,0,0,0}
+ };
+
+ char **smps_strs = NULL;
+
+ int c;
+ while ((c = getopt_long(argc, argv, "?s:h",loptions,NULL)) >= 0)
+ {
+ switch (c)
+ {
+ case 's': smps_strs = hts_readlist(optarg,0,&(args.n_sel_smps));
+ if ( args.n_sel_smps == 0 )
+ {
+ fprintf(stderr, "Sample specification not valid.\n");
+ error("%s", usage());
+ }
+ break;
+ case 'h': error("%s", usage()); break;
+ case '?':
+ default: error("%s", usage()); break;
+ }
+ }
+ if ( optind != argc ) error("%s", usage()); // too many files given
+
+ args.hdr = bcf_hdr_dup(in);
+
+ // Samples parsing from header and input option
+ if ( !bcf_hdr_nsamples(args.hdr) )
+ {
+ error("No samples in input file.\n");
+ }
+ args.nsmp = bcf_hdr_nsamples(args.hdr);
+ args.selected_smps = (int*) calloc(args.nsmp,sizeof(int));
+ for ( i = 0; i < args.n_sel_smps; i++ )
+ {
+ int ind = bcf_hdr_id2int(args.hdr, BCF_DT_SAMPLE, smps_strs[i]);
+ if ( ind == -1 )
+ {
+ error("Sample '%s' not in input vcf file.\n", smps_strs[i]);
+ } else {
+ args.selected_smps[ind] = 1;
+ }
+ free(smps_strs[i]);
+ }
+ free(smps_strs);
+
+ /*
+ fprintf(stderr, "Selected samples array:[");
+ for (i=0;i<args.nsmp;i++)
+ {
+ fprintf(stderr, " %i", args.selected_smps[i]);
+ }
+ fprintf(stderr, " ]\n");
+ */
+
+ if ( bcf_hdr_id2int(args.hdr, BCF_DT_ID, "GT")<0 ) error("[E::%s] GT not present in the header\n", __func__);
+
+ args.gt_arr = NULL;
+
+ return 0;
+}
+
+
+/*
+ * GT field (genotype) comparison function.
+ */
+bcf1_t *process(bcf1_t *rec)
+{
+ uint64_t i;
+ bcf_unpack(rec, BCF_UN_FMT); // unpack the Format fields, including the GT field
+ int gte_smp = 0; // number GT array entries per sample (should be 2, one entry per allele)
+ args.ngt_arr = 0; /*! hold the number of current GT array entries */
+ if ( (gte_smp = bcf_get_genotypes(args.hdr, rec, &(args.gt_arr), &(args.ngt_arr) ) ) <= 0 )
+ {
+ error("GT not present at %s: %d\n", args.hdr->id[BCF_DT_CTG][rec->rid].key, rec->pos+1);
+ }
+
+ gte_smp /= args.nsmp; // divide total number of genotypes array entries (= args.ngt_arr) by number of samples
+
+ // initialize with missing genotype
+ int a1 = 0;
+ int a2 = 0;
+
+ // initialize with first selected sample genotype that is not missing
+ int gt = -1;
+ while ( (a1 == 0) || (a2 == 0) )
+ {
+ gt++;
+ if (gt == args.nsmp) break;
+ if (args.selected_smps[gt] == 0) continue;
+ a1 = (args.gt_arr + gte_smp * gt)[0];
+ if ( gte_smp == 2 ) a2 = (args.gt_arr + gte_smp * gt)[1];
+ else if ( gte_smp == 1 ) a2 = bcf_int32_vector_end;
+ else error("GTsubset does not support ploidy higher than 2.\n");
+ }
+// fprintf(stderr, "a1: %i a2: %i\n", a1, a2);
+
+ // check all genotypes if they match (for included samples) or disagree (for samples not included)
+ gt = 0;
+ for ( i = 0; i < args.nsmp; i++ )
+ {
+ int *gt_ptr = args.gt_arr + gte_smp * i;
+
+ int b1 = gt_ptr[0];
+ int b2;
+ if ( gte_smp == 2 ) // two entries available per sample, padded with missing values for haploid genotypes
+ {
+ b2 = gt_ptr[1];
+ }
+ else if (gte_smp == 1 ) // use vector end value for second entry, if only one is available
+ {
+ b2 = bcf_int32_vector_end;
+ }
+ else
+ {
+ error("GTsubset does not support ploidy higher than 2.\n");
+ }
+
+ // fprintf(stderr, "b1: %i b2: %i\n", b1, b2);
+ /* missing genotypes are counted as always passing, as they neither
+ * mismatch the initial selected genotype for a selected sample, nor
+ * do they match the initial selected genotype for an excluded sample's
+ * genotype */
+ if ( (b1 == 0) || (b2 == 0) )
+ {
+ gt++;
+// fprintf(stderr, "missing => pass\n");
+ continue;
+ }
+ else if ( args.selected_smps[i] == 1 )
+ {
+ if ( (b1 == a1) && (b2 == a2) )
+ {
+ gt++;
+// fprintf(stderr, "match => pass\n");
+ continue;
+ }
+ else
+ {
+// fprintf(stderr, "no match => fail\n");
+ break;
+ }
+ }
+ else if ( args.selected_smps[i] == 0 )
+ {
+ if ( (b1 != a1 ) || (b2 != a2) )
+ {
+ gt++;
+ // fprintf(stderr, "no match => pass\n");
+ continue;
+ }
+ else
+ {
+// fprintf(stderr, "match => fail\n");
+ break;
+ }
+ }
+ }
+ if ( gt == args.nsmp )
+ {
+ return rec;
+ }
+ else
+ {
+ return NULL;
+ }
+}
+
+void destroy(void)
+{
+ /* freeing up args */
+ bcf_hdr_destroy(args.hdr);
+ free(args.gt_arr);
+ free(args.selected_smps);
+}
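The pass/fail rule implemented by `process()` in GTsubset can be restated on plain arrays. This is a minimal sketch under simplifying assumptions: each sample's genotype is collapsed to one integer key, with 0 standing for a missing genotype, whereas the plugin compares both allele entries separately. `exclusively_shared` is a hypothetical name, not a bcftools function.

```c
#include <assert.h>

/* Hypothetical helper, not part of bcftools: gt[i] is an opaque genotype key
 * for sample i (0 = missing), selected[i] is 1 for selected samples and 0
 * otherwise. Returns 1 if every non-missing selected sample carries the
 * reference genotype and no non-missing unselected sample does. */
static int exclusively_shared(const int *gt, const int *selected, int nsmp)
{
    assert(nsmp > 0);
    int ref = 0;  /* first non-missing genotype among the selected samples */
    for (int i = 0; i < nsmp && ref == 0; i++)
        if (selected[i]) ref = gt[i];
    for (int i = 0; i < nsmp; i++) {
        if (gt[i] == 0) continue;                    /* missing always passes */
        if (selected[i] && gt[i] != ref) return 0;   /* selected must match */
        if (!selected[i] && gt[i] == ref) return 0;  /* others must differ */
    }
    return 1;
}
```

As in the plugin, a missing genotype neither breaks the match for a selected sample nor counts as sharing for an unselected one.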
--- /dev/null
+#include "bcftools.pysam.h"
+
+/* plugins/GTsubset.c -- output only positions where the selected samples exclusively
+ share a genotype, i.e. all selected samples must have the same
+ genotype (including both alleles) and none of the unselected
+ samples can have the same genotype
+
+ Copyright (C) 2016 Computational Biology of Infection Research,
+ Helmholtz Centre for Infection Research, Braunschweig,
+ Germany
+
+ Author: David Laehnemann <david.laehnemann@hhu.de>
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE. */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <getopt.h>
+#include <math.h>
+#include <unistd.h>
+#include <inttypes.h>
+
+#include <htslib/vcf.h>
+#include <htslib/synced_bcf_reader.h>
+
+#include "bcftools.h"
+
+typedef struct _args_t
+{
+ bcf_hdr_t *hdr; /*! VCF file header */
+ int *gt_arr; /*! temporary array, to store GTs of current line/record */
+ int ngt_arr; /*! hold the number of current GT array entries */
+ int nsmp; /*! number of samples, can be determined from header but is needed in multiple contexts */
+ int n_sel_smps; /*! number of selected samples who should exclusively share genotypes */
+ int *selected_smps; /*! pointer to start of array containing 1 at indices corresponding to selected samples in header dict and 0 at others*/
+}
+args_t;
+
+static args_t args;
+
+const char *about(void)
+{
+ return "Output only sites where the requested samples all exclusively share a genotype (GT).\n";
+}
+
+
+const char *usage(void)
+{
+ return
+ "\n"
+ "About: Output only sites where the requested samples all exclusively share a genotype (GT), i.e.\n"
+ " all selected samples must have the same GT, while none of the others can have it.\n"
+ "Usage: bcftools +GTsubset <multisample.bcf/.vcf.gz> [General Options] -- [Plugin Options] \n"
+ "\n"
+ "Options:\n"
+ " run \"bcftools plugin\" for a list of common options\n"
+ "\n"
+ "Plugin options:\n"
+ " -s,--sample-list comma-separated list of samples; only those sites where all of these\n"
+ " samples exclusively share their genotype are given as output\n"
+ "\n"
+ "Example:\n"
+ " bcftools +GTsubset in.vcf -- -s SMP1,SMP2 \n"
+ "\n";
+}
+
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+ memset(&args,0,sizeof(args_t));
+
+ int i;
+
+ static struct option loptions[] =
+ {
+ {"help", no_argument, 0,'h'},
+ {"sample-list", required_argument, 0,'s'},
+ {0,0,0,0}
+ };
+
+ char **smps_strs = NULL;
+
+ int c;
+ while ((c = getopt_long(argc, argv, "?s:h",loptions,NULL)) >= 0)
+ {
+ switch (c)
+ {
+ case 's': smps_strs = hts_readlist(optarg,0,&(args.n_sel_smps));
+ if ( args.n_sel_smps == 0 )
+ {
+ fprintf(bcftools_stderr, "Sample specification not valid.\n");
+ error("%s", usage());
+ }
+ break;
+ case 'h': error("%s", usage()); break;
+ case '?':
+ default: error("%s", usage()); break;
+ }
+ }
+ if ( optind != argc ) error("%s", usage()); // too many files given
+
+ args.hdr = bcf_hdr_dup(in);
+
+ // Samples parsing from header and input option
+ if ( !bcf_hdr_nsamples(args.hdr) )
+ {
+ error("No samples in input file.\n");
+ }
+ args.nsmp = bcf_hdr_nsamples(args.hdr);
+ args.selected_smps = (int*) calloc(args.nsmp,sizeof(int));
+ for ( i = 0; i < args.n_sel_smps; i++ )
+ {
+ int ind = bcf_hdr_id2int(args.hdr, BCF_DT_SAMPLE, smps_strs[i]);
+ if ( ind == -1 )
+ {
+ error("Sample '%s' not in input vcf file.\n", smps_strs[i]);
+ } else {
+ args.selected_smps[ind] = 1;
+ }
+ free(smps_strs[i]);
+ }
+ free(smps_strs);
+
+ /*
+ fprintf(bcftools_stderr, "Selected samples array:[");
+ for (i=0;i<args.nsmp;i++)
+ {
+ fprintf(bcftools_stderr, " %i", args.selected_smps[i]);
+ }
+ fprintf(bcftools_stderr, " ]\n");
+ */
+
+ if ( bcf_hdr_id2int(args.hdr, BCF_DT_ID, "GT")<0 ) error("[E::%s] GT not present in the header\n", __func__);
+
+ args.gt_arr = NULL;
+
+ return 0;
+}
+
+
+/*
+ * GT field (genotype) comparison function.
+ */
+bcf1_t *process(bcf1_t *rec)
+{
+ uint64_t i;
+ bcf_unpack(rec, BCF_UN_FMT); // unpack the Format fields, including the GT field
+ int gte_smp = 0; // number GT array entries per sample (should be 2, one entry per allele)
+ args.ngt_arr = 0; /*! hold the number of current GT array entries */
+ if ( (gte_smp = bcf_get_genotypes(args.hdr, rec, &(args.gt_arr), &(args.ngt_arr) ) ) <= 0 )
+ {
+ error("GT not present at %s: %d\n", args.hdr->id[BCF_DT_CTG][rec->rid].key, rec->pos+1);
+ }
+
+ gte_smp /= args.nsmp; // divide total number of genotypes array entries (= args.ngt_arr) by number of samples
+
+ // initialize with missing genotype
+ int a1 = 0;
+ int a2 = 0;
+
+ // initialize with first selected sample genotype that is not missing
+ int gt = -1;
+ while ( (a1 == 0) || (a2 == 0) )
+ {
+ gt++;
+ if (gt == args.nsmp) break;
+ if (args.selected_smps[gt] == 0) continue;
+ a1 = (args.gt_arr + gte_smp * gt)[0];
+ if ( gte_smp == 2 ) a2 = (args.gt_arr + gte_smp * gt)[1];
+ else if ( gte_smp == 1 ) a2 = bcf_int32_vector_end;
+ else error("GTsubset does not support ploidy higher than 2.\n");
+ }
+// fprintf(bcftools_stderr, "a1: %i a2: %i\n", a1, a2);
+
+ // check every genotype against the first selected sample's non-missing genotype (a1/a2):
+ // selected samples must match it, samples not selected must differ from it
+ gt = 0;
+ for ( i = 0; i < args.nsmp; i++ )
+ {
+ int *gt_ptr = args.gt_arr + gte_smp * i;
+
+ int b1 = gt_ptr[0];
+ int b2;
+ if ( gte_smp == 2 ) // two entries available per sample, padded with missing values for haploid genotypes
+ {
+ b2 = gt_ptr[1];
+ }
+ else if (gte_smp == 1 ) // use vector end value for second entry, if only one is available
+ {
+ b2 = bcf_int32_vector_end;
+ }
+ else
+ {
+ error("GTsubset does not support ploidy higher than 2.\n");
+ }
+
+ // fprintf(bcftools_stderr, "b1: %i b2: %i\n", b1, b2);
+ /* missing genotypes are counted as always passing, as they neither
+ * mismatch the initial selected genotype for a selected sample, nor
+ * do they match the initial selected genotype for an excluded sample's
+ * genotype */
+ if ( (b1 == 0) || (b2 == 0) )
+ {
+ gt++;
+// fprintf(bcftools_stderr, "missing => pass\n");
+ continue;
+ }
+ else if ( args.selected_smps[i] == 1 )
+ {
+ if ( (b1 == a1) && (b2 == a2) )
+ {
+ gt++;
+// fprintf(bcftools_stderr, "match => pass\n");
+ continue;
+ }
+ else
+ {
+// fprintf(bcftools_stderr, "no match => fail\n");
+ break;
+ }
+ }
+ else if ( args.selected_smps[i] == 0 )
+ {
+ if ( (b1 != a1 ) || (b2 != a2) )
+ {
+ gt++;
+ // fprintf(bcftools_stderr, "no match => pass\n");
+ continue;
+ }
+ else
+ {
+// fprintf(bcftools_stderr, "match => fail\n");
+ break;
+ }
+ }
+ }
+ if ( gt == args.nsmp )
+ {
+ return rec;
+ }
+ else
+ {
+ return NULL;
+ }
+}
+
+void destroy(void)
+{
+ /* freeing up args */
+ bcf_hdr_destroy(args.hdr);
+ free(args.gt_arr);
+ free(args.selected_smps);
+}
--- /dev/null
+/* The MIT License
+
+ Copyright (c) 2016 Genome Research Ltd.
+
+ Author: Petr Danecek <pd3@sanger.ac.uk>
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in
+ all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ THE SOFTWARE.
+
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <getopt.h>
+#include <math.h>
+#include <htslib/hts.h>
+#include <htslib/vcf.h>
+#include <htslib/kstring.h>
+#include <htslib/kseq.h>
+#include <htslib/kfunc.h>
+#include <inttypes.h>
+#include "bcftools.h"
+#include "convert.h"
+
+typedef struct
+{
+ int smpl,ctrl; // VCF sample index
+ const char *smpl_name, *ctrl_name;
+}
+pair_t;
+
+typedef struct
+{
+ bcf_hdr_t *hdr;
+ pair_t *pair;
+ int npair, mpair, min_dp, min_alt_dp;
+ int32_t *ad_arr;
+ int mad_arr;
+ double th;
+ convert_t *convert;
+ kstring_t str;
+ uint64_t nsite,ncmp;
+}
+args_t;
+
+args_t args;
+
+const char *about(void)
+{
+ return "Find positions with wildly varying ALT allele frequency (Fisher test on FMT/AD).\n";
+}
+
+const char *usage(void)
+{
+ return
+ "\n"
+ "About: Find positions with wildly varying ALT allele frequency (Fisher test on FMT/AD).\n"
+ "Usage: bcftools +ad-bias [General Options] -- [Plugin Options]\n"
+ "Options:\n"
+ " run \"bcftools plugin\" for a list of common options\n"
+ "\n"
+ "Plugin options:\n"
+ " -a, --min-alt-dp <int> Minimum required alternate allele depth [1]\n"
+ " -d, --min-dp <int> Minimum required depth [0]\n"
+ " -f, --format <string> Optional tags to append to output (`bcftools query` style of format)\n"
+ " -s, --samples <file> List of sample pairs, one tab-delimited pair per line\n"
+ " -t, --threshold <float> Output only hits with p-value smaller than <float> [1e-3]\n"
+ "\n"
+ "Example:\n"
+ " bcftools +ad-bias file.bcf -- -t 1e-3 -s samples.txt\n"
+ "\n";
+}
+
+void parse_samples(args_t *args, char *fname)
+{
+ htsFile *fp = hts_open(fname, "r");
+ if ( !fp ) error("Could not read: %s\n", fname);
+
+ kstring_t str = {0,0,0};
+ if ( hts_getline(fp, KS_SEP_LINE, &str) <= 0 ) error("Empty file: %s\n", fname);
+
+ int moff = 0, *off = NULL;
+ do
+ {
+ // HPSI0513i-veqz_6 HPSI0513pf-veqz
+ int ncols = ksplit_core(str.s,'\t',&moff,&off);
+ if ( ncols<2 ) error("Could not parse the sample file: %s\n", str.s);
+
+ int smpl = bcf_hdr_id2int(args->hdr,BCF_DT_SAMPLE,&str.s[off[0]]);
+ if ( smpl<0 ) continue;
+ int ctrl = bcf_hdr_id2int(args->hdr,BCF_DT_SAMPLE,&str.s[off[1]]);
+ if ( ctrl<0 ) continue;
+
+ args->npair++;
+ hts_expand0(pair_t,args->npair,args->mpair,args->pair);
+ pair_t *pair = &args->pair[args->npair-1];
+ pair->ctrl = ctrl;
+ pair->smpl = smpl;
+ pair->smpl_name = bcf_hdr_int2id(args->hdr,BCF_DT_SAMPLE,pair->smpl);
+ pair->ctrl_name = bcf_hdr_int2id(args->hdr,BCF_DT_SAMPLE,pair->ctrl);
+ } while ( hts_getline(fp, KS_SEP_LINE, &str)>=0 );
+
+ free(str.s);
+ free(off);
+ hts_close(fp);
+}
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+ memset(&args,0,sizeof(args_t));
+ args.hdr = in;
+ args.th = 1e-3;
+ args.min_alt_dp = 1;
+ char *fname = NULL, *format = NULL;
+ static struct option loptions[] =
+ {
+ {"min-dp",required_argument,NULL,'d'},
+ {"min-alt-dp",required_argument,NULL,'a'},
+ {"format",required_argument,NULL,'f'},
+ {"samples",required_argument,NULL,'s'},
+ {"threshold",required_argument,NULL,'t'},
+ {NULL,0,NULL,0}
+ };
+ int c;
+ char *tmp;
+ while ((c = getopt_long(argc, argv, "?hs:t:f:d:a:",loptions,NULL)) >= 0)
+ {
+ switch (c)
+ {
+ case 'a':
+ args.min_alt_dp = strtol(optarg,&tmp,10);
+ if ( *tmp ) error("Could not parse: -a %s\n", optarg);
+ break;
+ case 'd':
+ args.min_dp = strtol(optarg,&tmp,10);
+ if ( *tmp ) error("Could not parse: -d %s\n", optarg);
+ break;
+ case 't':
+ args.th = strtod(optarg,&tmp);
+ if ( *tmp ) error("Could not parse: -t %s\n", optarg);
+ break;
+ case 's': fname = optarg; break;
+ case 'f': format = optarg; break;
+ case 'h':
+ case '?':
+ default: error("%s", usage()); break;
+ }
+ }
+ if ( !fname ) error("Expected the -s option\n");
+ parse_samples(&args, fname);
+ if ( format ) args.convert = convert_init(args.hdr, NULL, 0, format);
+ printf("# This file was produced by: bcftools +ad-bias(%s+htslib-%s)\n", bcftools_version(),hts_version());
+ printf("# The command line was:\tbcftools +ad-bias %s", argv[0]);
+ for (c=1; c<argc; c++) printf(" %s",argv[c]);
+ printf("\n#\n");
+ printf("# FT, Fisher Test\t[2]Sample\t[3]Control\t[4]Chrom\t[5]Pos\t[6]smpl.nREF\t[7]smpl.nALT\t[8]ctrl.nREF\t[9]ctrl.nALT\t[10]P-value");
+ if ( format ) printf("\t[11-]User data: %s", format);
+ printf("\n");
+ return 1;
+}
+
+bcf1_t *process(bcf1_t *rec)
+{
+ int nad = bcf_get_format_int32(args.hdr, rec, "AD", &args.ad_arr, &args.mad_arr);
+ if ( nad<0 ) return NULL;
+ nad /= bcf_hdr_nsamples(args.hdr);
+
+ if ( args.convert ) convert_line(args.convert, rec, &args.str);
+ args.nsite++;
+
+ int i;
+ for (i=0; i<args.npair; i++)
+ {
+ pair_t *pair = &args.pair[i];
+ int32_t *aptr = args.ad_arr + nad*pair->smpl;
+ int32_t *bptr = args.ad_arr + nad*pair->ctrl;
+
+ if ( aptr[0]==bcf_int32_missing ) continue;
+ if ( bptr[0]==bcf_int32_missing ) continue;
+ if ( aptr[0]+aptr[1] < args.min_dp ) continue;
+ if ( bptr[0]+bptr[1] < args.min_dp ) continue;
+ if ( aptr[1] < args.min_alt_dp && bptr[1] < args.min_alt_dp ) continue;
+
+ args.ncmp++;
+
+ int n11 = aptr[0], n12 = aptr[1];
+ int n21 = bptr[0], n22 = bptr[1];
+ double left, right, fisher;
+ kt_fisher_exact(n11,n12,n21,n22, &left,&right,&fisher);
+ if ( fisher >= args.th ) continue;
+
+ printf("FT\t%s\t%s\t%s\t%d\t%d\t%d\t%d\t%d\t%e",
+ pair->smpl_name,pair->ctrl_name,
+ bcf_hdr_id2name(args.hdr,rec->rid), rec->pos+1,
+ n11,n12,n21,n22, fisher
+ );
+ if ( args.convert ) printf("\t%s", args.str.s);
+ printf("\n");
+ }
+ return NULL;
+}
+
+void destroy(void)
+{
+ printf("# SN, Summary Numbers\t[2]Number of Pairs\t[3]Number of Sites\t[4]Number of comparisons\t[5]P-value output threshold\n");
+ printf("SN\t%d\t%"PRId64"\t%"PRId64"\t%e\n",args.npair,args.nsite,args.ncmp,args.th);
+ if (args.convert) convert_destroy(args.convert);
+ free(args.str.s);
+ free(args.pair);
+ free(args.ad_arr);
+}
--- /dev/null
+#include "bcftools.pysam.h"
+
+/* The MIT License
+
+ Copyright (c) 2016 Genome Research Ltd.
+
+ Author: Petr Danecek <pd3@sanger.ac.uk>
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in
+ all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ THE SOFTWARE.
+
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <getopt.h>
+#include <math.h>
+#include <htslib/hts.h>
+#include <htslib/vcf.h>
+#include <htslib/kstring.h>
+#include <htslib/kseq.h>
+#include <htslib/kfunc.h>
+#include <inttypes.h>
+#include "bcftools.h"
+#include "convert.h"
+
+typedef struct
+{
+ int smpl,ctrl; // VCF sample index
+ const char *smpl_name, *ctrl_name;
+}
+pair_t;
+
+typedef struct
+{
+ bcf_hdr_t *hdr;
+ pair_t *pair;
+ int npair, mpair, min_dp, min_alt_dp;
+ int32_t *ad_arr;
+ int mad_arr;
+ double th;
+ convert_t *convert;
+ kstring_t str;
+ uint64_t nsite,ncmp;
+}
+args_t;
+
+args_t args;
+
+const char *about(void)
+{
+ return "Find positions with wildly varying ALT allele frequency (Fisher test on FMT/AD).\n";
+}
+
+const char *usage(void)
+{
+ return
+ "\n"
+ "About: Find positions with wildly varying ALT allele frequency (Fisher test on FMT/AD).\n"
+ "Usage: bcftools +ad-bias [General Options] -- [Plugin Options]\n"
+ "Options:\n"
+ " run \"bcftools plugin\" for a list of common options\n"
+ "\n"
+ "Plugin options:\n"
+ " -a, --min-alt-dp <int> Minimum required alternate allele depth [1]\n"
+ " -d, --min-dp <int> Minimum required depth [0]\n"
+ " -f, --format <string> Optional tags to append to output (`bcftools query` style of format)\n"
+ " -s, --samples <file> List of sample pairs, one tab-delimited pair per line\n"
+ " -t, --threshold <float> Output only hits with p-value smaller than <float> [1e-3]\n"
+ "\n"
+ "Example:\n"
+ " bcftools +ad-bias file.bcf -- -t 1e-3 -s samples.txt\n"
+ "\n";
+}
+
+void parse_samples(args_t *args, char *fname)
+{
+ htsFile *fp = hts_open(fname, "r");
+ if ( !fp ) error("Could not read: %s\n", fname);
+
+ kstring_t str = {0,0,0};
+ if ( hts_getline(fp, KS_SEP_LINE, &str) <= 0 ) error("Empty file: %s\n", fname);
+
+ int moff = 0, *off = NULL;
+ do
+ {
+ // HPSI0513i-veqz_6 HPSI0513pf-veqz
+ int ncols = ksplit_core(str.s,'\t',&moff,&off);
+ if ( ncols<2 ) error("Could not parse the sample file: %s\n", str.s);
+
+ int smpl = bcf_hdr_id2int(args->hdr,BCF_DT_SAMPLE,&str.s[off[0]]);
+ if ( smpl<0 ) continue;
+ int ctrl = bcf_hdr_id2int(args->hdr,BCF_DT_SAMPLE,&str.s[off[1]]);
+ if ( ctrl<0 ) continue;
+
+ args->npair++;
+ hts_expand0(pair_t,args->npair,args->mpair,args->pair);
+ pair_t *pair = &args->pair[args->npair-1];
+ pair->ctrl = ctrl;
+ pair->smpl = smpl;
+ pair->smpl_name = bcf_hdr_int2id(args->hdr,BCF_DT_SAMPLE,pair->smpl);
+ pair->ctrl_name = bcf_hdr_int2id(args->hdr,BCF_DT_SAMPLE,pair->ctrl);
+ } while ( hts_getline(fp, KS_SEP_LINE, &str)>=0 );
+
+ free(str.s);
+ free(off);
+ hts_close(fp);
+}
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+ memset(&args,0,sizeof(args_t));
+ args.hdr = in;
+ args.th = 1e-3;
+ args.min_alt_dp = 1;
+ char *fname = NULL, *format = NULL;
+ static struct option loptions[] =
+ {
+ {"min-dp",required_argument,NULL,'d'},
+ {"min-alt-dp",required_argument,NULL,'a'},
+ {"format",required_argument,NULL,'f'},
+ {"samples",required_argument,NULL,'s'},
+ {"threshold",required_argument,NULL,'t'},
+ {NULL,0,NULL,0}
+ };
+ int c;
+ char *tmp;
+ while ((c = getopt_long(argc, argv, "?hs:t:f:d:a:",loptions,NULL)) >= 0)
+ {
+ switch (c)
+ {
+ case 'a':
+ args.min_alt_dp = strtol(optarg,&tmp,10);
+ if ( *tmp ) error("Could not parse: -a %s\n", optarg);
+ break;
+ case 'd':
+ args.min_dp = strtol(optarg,&tmp,10);
+ if ( *tmp ) error("Could not parse: -d %s\n", optarg);
+ break;
+ case 't':
+ args.th = strtod(optarg,&tmp);
+ if ( *tmp ) error("Could not parse: -t %s\n", optarg);
+ break;
+ case 's': fname = optarg; break;
+ case 'f': format = optarg; break;
+ case 'h':
+ case '?':
+ default: error("%s", usage()); break;
+ }
+ }
+ if ( !fname ) error("Expected the -s option\n");
+ parse_samples(&args, fname);
+ if ( format ) args.convert = convert_init(args.hdr, NULL, 0, format);
+ fprintf(bcftools_stdout, "# This file was produced by: bcftools +ad-bias(%s+htslib-%s)\n", bcftools_version(),hts_version());
+ fprintf(bcftools_stdout, "# The command line was:\tbcftools +ad-bias %s", argv[0]);
+ for (c=1; c<argc; c++) fprintf(bcftools_stdout, " %s",argv[c]);
+ fprintf(bcftools_stdout, "\n#\n");
+ fprintf(bcftools_stdout, "# FT, Fisher Test\t[2]Sample\t[3]Control\t[4]Chrom\t[5]Pos\t[6]smpl.nREF\t[7]smpl.nALT\t[8]ctrl.nREF\t[9]ctrl.nALT\t[10]P-value");
+ if ( format ) fprintf(bcftools_stdout, "\t[11-]User data: %s", format);
+ fprintf(bcftools_stdout, "\n");
+ return 1;
+}
+
+bcf1_t *process(bcf1_t *rec)
+{
+ int nad = bcf_get_format_int32(args.hdr, rec, "AD", &args.ad_arr, &args.mad_arr);
+ if ( nad<0 ) return NULL;
+ nad /= bcf_hdr_nsamples(args.hdr);
+
+ if ( args.convert ) convert_line(args.convert, rec, &args.str);
+ args.nsite++;
+
+ int i;
+ for (i=0; i<args.npair; i++)
+ {
+ pair_t *pair = &args.pair[i];
+ int32_t *aptr = args.ad_arr + nad*pair->smpl;
+ int32_t *bptr = args.ad_arr + nad*pair->ctrl;
+
+ if ( aptr[0]==bcf_int32_missing ) continue;
+ if ( bptr[0]==bcf_int32_missing ) continue;
+ if ( aptr[0]+aptr[1] < args.min_dp ) continue;
+ if ( bptr[0]+bptr[1] < args.min_dp ) continue;
+ if ( aptr[1] < args.min_alt_dp && bptr[1] < args.min_alt_dp ) continue;
+
+ args.ncmp++;
+
+ int n11 = aptr[0], n12 = aptr[1];
+ int n21 = bptr[0], n22 = bptr[1];
+ double left, right, fisher;
+ kt_fisher_exact(n11,n12,n21,n22, &left,&right,&fisher);
+ if ( fisher >= args.th ) continue;
+
+ fprintf(bcftools_stdout, "FT\t%s\t%s\t%s\t%d\t%d\t%d\t%d\t%d\t%e",
+ pair->smpl_name,pair->ctrl_name,
+ bcf_hdr_id2name(args.hdr,rec->rid), rec->pos+1,
+ n11,n12,n21,n22, fisher
+ );
+ if ( args.convert ) fprintf(bcftools_stdout, "\t%s", args.str.s);
+ fprintf(bcftools_stdout, "\n");
+ }
+ return NULL;
+}
+
+void destroy(void)
+{
+ fprintf(bcftools_stdout, "# SN, Summary Numbers\t[2]Number of Pairs\t[3]Number of Sites\t[4]Number of comparisons\t[5]P-value output threshold\n");
+ fprintf(bcftools_stdout, "SN\t%d\t%"PRId64"\t%"PRId64"\t%e\n",args.npair,args.nsite,args.ncmp,args.th);
+ if (args.convert) convert_destroy(args.convert);
+ free(args.str.s);
+ free(args.pair);
+ free(args.ad_arr);
+}
--- /dev/null
+/* The MIT License
+
+ Copyright (c) 2016 Genome Research Ltd.
+
+ Author: Petr Danecek <pd3@sanger.ac.uk>
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in
+ all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ THE SOFTWARE.
+
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <getopt.h>
+#include <inttypes.h>
+#include <htslib/hts.h>
+#include <htslib/vcf.h>
+#include "bcftools.h"
+#include "bin.h"
+
+typedef struct
+{
+ char *af_tag;
+ bcf_hdr_t *hdr;
+ int32_t *gt, ngt, naf;
+ float *af, list_min, list_max;
+ bin_t *dev_bins, *prob_bins;
+ uint64_t *dev_dist, *prob_dist;
+}
+args_t;
+
+args_t *args;
+
+const char *about(void)
+{
+ return "AF and GT probability distribution stats.\n";
+}
+
+const char *usage(void)
+{
+ return
+ "\n"
+ "About: Collect AF deviation stats and GT probability distribution\n"
+ " given AF and assuming HWE\n"
+ "Usage: bcftools +af-dist [General Options] -- [Plugin Options]\n"
+ "Options:\n"
+ " run \"bcftools plugin\" for a list of common options\n"
+ "\n"
+ "Plugin options:\n"
+ " -d, --dev-bins <list> AF deviation bins\n"
+ " -l, --list <min,max> list genotypes from the given bin (for debugging)\n"
+ " -p, --prob-bins <list> probability distribution bins\n"
+ " -t, --af-tag <tag> VCF INFO tag to use [AF]\n"
+ "\n"
+ "Default binning:\n"
+ " -d: 0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1\n"
+ " -p: 0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1\n"
+ "Example:\n"
+ " bcftools +af-dist file.bcf -- -t EUR_AF -p bins.txt\n"
+ "\n";
+}
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+ args = (args_t*) calloc(1,sizeof(args_t));
+ char *dev_bins = "0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1";
+ char *prob_bins = "0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1";
+ args->hdr = in;
+ args->af_tag = "AF";
+ args->list_min = -1;
+ static struct option loptions[] =
+ {
+ {"list",required_argument,NULL,'l'},
+ {"dev-bins",required_argument,NULL,'d'},
+ {"prob-bins",required_argument,NULL,'p'},
+ {"af-tag",required_argument,NULL,'t'},
+ {NULL,0,NULL,0}
+ };
+ int c;
+ while ((c = getopt_long(argc, argv, "?ht:d:p:l:",loptions,NULL)) >= 0)
+ {
+ switch (c)
+ {
+ case 'l':
+ {
+ char *a,*b;
+ args->list_min = strtod(optarg,&a);
+ if ( a==optarg || *a!=',' ) error("Could not parse: --list %s\n", optarg);
+ args->list_max = strtod(a+1,&b);
+ if ( a+1==b || *b ) error("Could not parse: --list %s\n", optarg);
+ break;
+ }
+ case 'd': dev_bins = optarg; break;
+ case 'p': prob_bins = optarg; break;
+ case 't': args->af_tag = optarg; break;
+ case 'h':
+ case '?':
+ default: error("%s", usage()); break;
+ }
+ }
+
+ args->dev_bins = bin_init(dev_bins,0,1);
+ int nbins = bin_get_size(args->dev_bins);
+ args->dev_dist = (uint64_t*)calloc(nbins,sizeof(*args->dev_dist));
+
+ args->prob_bins = bin_init(prob_bins,0,1);
+ nbins = bin_get_size(args->prob_bins);
+ args->prob_dist = (uint64_t*)calloc(nbins,sizeof(*args->prob_dist));
+
+ printf("# This file was produced by: bcftools +af-dist(%s+htslib-%s)\n", bcftools_version(),hts_version());
+ printf("# The command line was:\tbcftools +af-dist %s", argv[0]);
+ for (c=1; c<argc; c++) printf(" %s",argv[c]);
+ printf("\n#\n");
+
+ if ( args->list_min!=-1 )
+ printf("# GT, genotypes with P(AF) in [%f,%f]; [2]Chromosome\t[3]Position\t[4]Sample\t[5]Genotype\t[6]AF-based probability\n",args->list_min,args->list_max);
+
+ return 1;
+}
+
+bcf1_t *process(bcf1_t *rec)
+{
+ int naf = bcf_get_info_float(args->hdr,rec,args->af_tag,&args->af,&args->naf);
+ if ( naf<=0 ) return NULL;
+ float af = args->af[0];
+
+ float pRA = 2*af*(1-af);
+ float pAA = af*af;
+ int iRA = bin_get_idx(args->prob_bins,pRA);
+ int iAA = bin_get_idx(args->prob_bins,pAA);
+
+ int list_RA = args->list_min==-1 || pRA < args->list_min || pRA > args->list_max ? 0 : 1;
+ int list_AA = args->list_min==-1 || pAA < args->list_min || pAA > args->list_max ? 0 : 1;
+ const char *chr = bcf_seqname(args->hdr,rec);
+
+ int ngt = bcf_get_genotypes(args->hdr, rec, &args->gt, &args->ngt);
+ int i, j, nsmpl = bcf_hdr_nsamples(args->hdr);
+ int nals = 0, nalt = 0;
+ ngt /= nsmpl;
+ for (i=0; i<nsmpl; i++)
+ {
+ int32_t *ptr = args->gt + i*ngt;
+ int dosage = 0;
+ for (j=0; j<ngt; j++)
+ {
+ if ( bcf_gt_is_missing(ptr[j]) ) break;
+ if ( ptr[j]==bcf_int32_vector_end ) break;
+ if ( bcf_gt_allele(ptr[j])==1 ) dosage++;
+ }
+ if ( j!=ngt ) continue;
+
+ nals += j;
+ nalt += dosage;
+
+ if ( dosage==1 )
+ {
+ args->prob_dist[iRA]++;
+ if ( list_RA ) printf("GT\t%s\t%d\t%s\t1\t%f\n",chr,rec->pos+1,args->hdr->samples[i],pRA);
+ }
+ else if ( dosage==2 )
+ {
+ args->prob_dist[iAA]++;
+ if ( list_AA ) printf("GT\t%s\t%d\t%s\t2\t%f\n",chr,rec->pos+1,args->hdr->samples[i],pAA);
+ }
+ }
+
+ if ( nals && (nalt || af) )
+ {
+ float af_dev = fabs(af - (float)nalt/nals);
+ int iAF = bin_get_idx(args->dev_bins,af_dev);
+ args->dev_dist[iAF]++;
+ }
+
+ return NULL;
+}
+
+void destroy(void)
+{
+ printf("# PROB_DIST, genotype probability distribution, assumes HWE\n");
+ int i, n;
+ n = bin_get_size(args->prob_bins);
+ for (i=0; i<n-1; i++)
+ {
+ float min = bin_get_value(args->prob_bins,i);
+ float max = bin_get_value(args->prob_bins,i+1);
+ printf("PROB_DIST\t%f\t%f\t%"PRId64"\n", min,max,args->prob_dist[i]);
+ }
+ printf("# DEV_DIST, distribution of AF deviation, based on %s and INFO/AN, AC calculated on the fly\n", args->af_tag);
+ n = bin_get_size(args->dev_bins);
+ for (i=0; i<n-1; i++)
+ {
+ float min = bin_get_value(args->dev_bins,i);
+ float max = bin_get_value(args->dev_bins,i+1);
+ printf("DEV_DIST\t%f\t%f\t%"PRId64"\n", min,max,args->dev_dist[i]);
+ }
+ bin_destroy(args->dev_bins);
+ bin_destroy(args->prob_bins);
+ free(args->dev_dist);
+ free(args->prob_dist);
+ free(args->gt);
+ free(args->af);
+ free(args);
+}
+
+
--- /dev/null
+#include "bcftools.pysam.h"
+
+/* The MIT License
+
+ Copyright (c) 2016 Genome Research Ltd.
+
+ Author: Petr Danecek <pd3@sanger.ac.uk>
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in
+ all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ THE SOFTWARE.
+
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <getopt.h>
+#include <inttypes.h>
+#include <htslib/hts.h>
+#include <htslib/vcf.h>
+#include "bcftools.h"
+#include "bin.h"
+
+typedef struct
+{
+ char *af_tag;
+ bcf_hdr_t *hdr;
+ int32_t *gt, ngt, naf;
+ float *af, list_min, list_max;
+ bin_t *dev_bins, *prob_bins;
+ uint64_t *dev_dist, *prob_dist;
+}
+args_t;
+
+args_t *args;
+
+const char *about(void)
+{
+ return "AF and GT probability distribution stats.\n";
+}
+
+const char *usage(void)
+{
+ return
+ "\n"
+ "About: Collect AF deviation stats and GT probability distribution\n"
+ " given AF and assuming HWE\n"
+ "Usage: bcftools +af-dist [General Options] -- [Plugin Options]\n"
+ "Options:\n"
+ " run \"bcftools plugin\" for a list of common options\n"
+ "\n"
+ "Plugin options:\n"
+ " -d, --dev-bins <list> AF deviation bins\n"
+ " -l, --list <min,max> list genotypes from the given bin (for debugging)\n"
+ " -p, --prob-bins <list> probability distribution bins\n"
+ " -t, --af-tag <tag> VCF INFO tag to use [AF]\n"
+ "\n"
+ "Default binning:\n"
+ " -d: 0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1\n"
+ " -p: 0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1\n"
+ "Example:\n"
+ " bcftools +af-dist file.bcf -- -t EUR_AF -p bins.txt\n"
+ "\n";
+}
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+ args = (args_t*) calloc(1,sizeof(args_t));
+ char *dev_bins = "0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1";
+ char *prob_bins = "0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1";
+ args->hdr = in;
+ args->af_tag = "AF";
+ args->list_min = -1;
+ static struct option loptions[] =
+ {
+ {"list",required_argument,NULL,'l'},
+ {"dev-bins",required_argument,NULL,'d'},
+ {"prob-bins",required_argument,NULL,'p'},
+ {"af-tag",required_argument,NULL,'t'},
+ {NULL,0,NULL,0}
+ };
+ int c;
+ while ((c = getopt_long(argc, argv, "?ht:d:p:l:",loptions,NULL)) >= 0)
+ {
+ switch (c)
+ {
+ case 'l':
+ {
+ char *a,*b;
+ args->list_min = strtod(optarg,&a);
+ if ( a==optarg || *a!=',' ) error("Could not parse: --list %s\n", optarg);
+ args->list_max = strtod(a+1,&b);
+ if ( a+1==b || *b ) error("Could not parse: --list %s\n", optarg);
+ break;
+ }
+ case 'd': dev_bins = optarg; break;
+ case 'p': prob_bins = optarg; break;
+ case 't': args->af_tag = optarg; break;
+ case 'h':
+ case '?':
+ default: error("%s", usage()); break;
+ }
+ }
+
+ args->dev_bins = bin_init(dev_bins,0,1);
+ int nbins = bin_get_size(args->dev_bins);
+ args->dev_dist = (uint64_t*)calloc(nbins,sizeof(*args->dev_dist));
+
+ args->prob_bins = bin_init(prob_bins,0,1);
+ nbins = bin_get_size(args->prob_bins);
+ args->prob_dist = (uint64_t*)calloc(nbins,sizeof(*args->prob_dist));
+
+ fprintf(bcftools_stdout, "# This file was produced by: bcftools +af-dist(%s+htslib-%s)\n", bcftools_version(),hts_version());
+ fprintf(bcftools_stdout, "# The command line was:\tbcftools +af-dist %s", argv[0]);
+ for (c=1; c<argc; c++) fprintf(bcftools_stdout, " %s",argv[c]);
+ fprintf(bcftools_stdout, "\n#\n");
+
+ if ( args->list_min!=-1 )
+ fprintf(bcftools_stdout, "# GT, genotypes with P(AF) in [%f,%f]; [2]Chromosome\t[3]Position\t[4]Sample\t[5]Genotype\t[6]AF-based probability\n",args->list_min,args->list_max);
+
+ return 1;
+}
+
+bcf1_t *process(bcf1_t *rec)
+{
+ int naf = bcf_get_info_float(args->hdr,rec,args->af_tag,&args->af,&args->naf);
+ if ( naf<=0 ) return NULL;
+ float af = args->af[0];
+
+ float pRA = 2*af*(1-af);
+ float pAA = af*af;
+ int iRA = bin_get_idx(args->prob_bins,pRA);
+ int iAA = bin_get_idx(args->prob_bins,pAA);
+
+ int list_RA = args->list_min==-1 || pRA < args->list_min || pRA > args->list_max ? 0 : 1;
+ int list_AA = args->list_min==-1 || pAA < args->list_min || pAA > args->list_max ? 0 : 1;
+ const char *chr = bcf_seqname(args->hdr,rec);
+
+ int ngt = bcf_get_genotypes(args->hdr, rec, &args->gt, &args->ngt);
+ int i, j, nsmpl = bcf_hdr_nsamples(args->hdr);
+ int nals = 0, nalt = 0;
+ ngt /= nsmpl;
+ for (i=0; i<nsmpl; i++)
+ {
+ int32_t *ptr = args->gt + i*ngt;
+ int dosage = 0;
+ for (j=0; j<ngt; j++)
+ {
+ if ( bcf_gt_is_missing(ptr[j]) ) break;
+ if ( ptr[j]==bcf_int32_vector_end ) break;
+ if ( bcf_gt_allele(ptr[j])==1 ) dosage++;
+ }
+ if ( j!=ngt ) continue;
+
+ nals += j;
+ nalt += dosage;
+
+ if ( dosage==1 )
+ {
+ args->prob_dist[iRA]++;
+ if ( list_RA ) fprintf(bcftools_stdout, "GT\t%s\t%d\t%s\t1\t%f\n",chr,rec->pos+1,args->hdr->samples[i],pRA);
+ }
+ else if ( dosage==2 )
+ {
+ args->prob_dist[iAA]++;
+ if ( list_AA ) fprintf(bcftools_stdout, "GT\t%s\t%d\t%s\t2\t%f\n",chr,rec->pos+1,args->hdr->samples[i],pAA);
+ }
+ }
+
+ if ( nals && (nalt || af) )
+ {
+ float af_dev = fabs(af - (float)nalt/nals);
+ int iAF = bin_get_idx(args->dev_bins,af_dev);
+ args->dev_dist[iAF]++;
+ }
+
+ return NULL;
+}
+
+void destroy(void)
+{
+ fprintf(bcftools_stdout, "# PROB_DIST, genotype probability distribution, assumes HWE\n");
+ int i, n;
+ n = bin_get_size(args->prob_bins);
+ for (i=0; i<n-1; i++)
+ {
+ float min = bin_get_value(args->prob_bins,i);
+ float max = bin_get_value(args->prob_bins,i+1);
+ fprintf(bcftools_stdout, "PROB_DIST\t%f\t%f\t%"PRId64"\n", min,max,args->prob_dist[i]);
+ }
+ fprintf(bcftools_stdout, "# DEV_DIST, distribution of AF deviation, based on %s and INFO/AN, AC calculated on the fly\n", args->af_tag);
+ n = bin_get_size(args->dev_bins);
+ for (i=0; i<n-1; i++)
+ {
+ float min = bin_get_value(args->dev_bins,i);
+ float max = bin_get_value(args->dev_bins,i+1);
+ fprintf(bcftools_stdout, "DEV_DIST\t%f\t%f\t%"PRId64"\n", min,max,args->dev_dist[i]);
+ }
+ bin_destroy(args->dev_bins);
+ bin_destroy(args->prob_bins);
+ free(args->dev_dist);
+ free(args->prob_dist);
+ free(args->gt);
+ free(args->af);
+ free(args);
+}
+
+
--- /dev/null
+/*
+ Copyright (C) 2017 Genome Research Ltd.
+
+ Author: Petr Danecek <pd3@sanger.ac.uk>
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in
+ all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ THE SOFTWARE.
+*/
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <strings.h>
+#include <getopt.h>
+#include <stdarg.h>
+#include <stdint.h>
+#include <htslib/vcf.h>
+#include <htslib/synced_bcf_reader.h>
+#include <htslib/vcfutils.h>
+#include <htslib/kseq.h>
+#include <inttypes.h>
+#include <unistd.h>
+#include "bcftools.h"
+
+typedef struct
+{
+ char *sample;
+ int beg,end,ploidy;
+}
+dat_t;
+
+typedef struct
+{
+ int argc;
+ char **argv;
+ int rid, gt_id, ndat;
+ dat_t *dat;
+ bcf_hdr_t *hdr;
+}
+args_t;
+
+static args_t *args;
+
+const char *about(void)
+{
+ return "Check if ploidy of samples is consistent for all sites\n";
+}
+
+const char *usage(void)
+{
+ return
+ "\n"
+ "About: Check if ploidy of samples is consistent for all sites.\n"
+ "Usage: bcftools +check-ploidy [General Options] -- [Plugin Options]\n"
+ "Options:\n"
+ " run \"bcftools plugin\" for a list of common options\n"
+ "\n"
+ "Example:\n"
+ " bcftools +check-ploidy file.bcf\n"
+ "\n";
+}
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+ args = (args_t*) calloc(1,sizeof(args_t));
+ args->argc = argc; args->argv = argv;
+ args->hdr = in;
+ args->ndat = bcf_hdr_nsamples(args->hdr);
+ args->dat = (dat_t*) calloc(args->ndat,sizeof(dat_t));
+ int i;
+ for (i=0; i<args->ndat; i++) args->dat[i].sample = args->hdr->samples[i];
+ args->rid = -1;
+ args->gt_id = bcf_hdr_id2int(args->hdr,BCF_DT_ID,"GT");
+ if ( args->gt_id<0 ) error("Error: GT field is not present\n");
+ printf("# [1]Sample\t[2]Chromosome\t[3]Region Start\t[4]Region End\t[5]Ploidy\n");
+ return 1;
+}
+
+bcf1_t *process(bcf1_t *rec)
+{
+ int i;
+
+ bcf_unpack(rec, BCF_UN_FMT);
+ bcf_fmt_t *fmt_gt = NULL;
+ for (i=0; i<rec->n_fmt; i++)
+ if ( rec->d.fmt[i].id==args->gt_id ) { fmt_gt = &rec->d.fmt[i]; break; }
+ if ( !fmt_gt ) return NULL; // no GT tag
+
+ if ( args->ndat != rec->n_sample )
+ error("Incorrect number of samples at %s:%d .. found %d, expected %d\n",bcf_seqname(args->hdr,rec),rec->pos+1,rec->n_sample,args->ndat);
+
+ if ( args->rid!=rec->rid && args->rid!=-1 )
+ {
+ for (i=0; i<args->ndat; i++)
+ {
+ dat_t *dat = &args->dat[i];
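+ // the pending region lies on the previous chromosome (args->rid), not on the one rec is on
+ if ( dat->ploidy!=0 ) printf("%s\t%s\t%d\t%d\t%d\n", dat->sample,bcf_hdr_id2name(args->hdr,args->rid),dat->beg+1,dat->end+1,dat->ploidy);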
+ dat->ploidy = 0;
+ }
+ }
+ args->rid = rec->rid;
+
+ #define BRANCH_INT(type_t,vector_end) \
+ { \
+ for (i=0; i<rec->n_sample; i++) \
+ { \
+ type_t *p = (type_t*) (fmt_gt->p + i*fmt_gt->size); \
+ int nal, missing = 0; \
+ for (nal=0; nal<fmt_gt->n; nal++) \
+ { \
+ if ( p[nal]==vector_end ) break; /* smaller ploidy */ \
+ if ( bcf_gt_is_missing(p[nal]) ) { missing=1; break; } /* missing allele */ \
+ } \
+ if ( !nal || missing ) continue; /* missing genotype */ \
+ dat_t *dat = &args->dat[i]; \
+ if ( dat->ploidy==nal ) \
+ { \
+ dat->end = rec->pos; \
+ continue; \
+ } \
+ if ( dat->ploidy!=0 ) \
+ printf("%s\t%s\t%d\t%d\t%d\n", dat->sample,bcf_seqname(args->hdr,rec),dat->beg+1,dat->end+1,dat->ploidy); \
+ dat->ploidy = nal; \
+ dat->beg = rec->pos; \
+ dat->end = rec->pos; \
+ } \
+ }
+ switch (fmt_gt->type) {
+ case BCF_BT_INT8: BRANCH_INT(int8_t, bcf_int8_vector_end); break;
+ case BCF_BT_INT16: BRANCH_INT(int16_t, bcf_int16_vector_end); break;
+ case BCF_BT_INT32: BRANCH_INT(int32_t, bcf_int32_vector_end); break;
+ default: error("The GT type is not recognised: %d at %s:%d\n",fmt_gt->type, bcf_seqname(args->hdr,rec),rec->pos+1); break;
+ }
+ #undef BRANCH_INT
+
+ return NULL;
+}
+
+void destroy(void)
+{
+ int i;
+ for (i=0; i<args->ndat; i++)
+ {
+ dat_t *dat = &args->dat[i];
+ if ( dat->ploidy!=0 ) printf("%s\t%s\t%d\t%d\t%d\n", dat->sample,bcf_hdr_id2name(args->hdr,args->rid),dat->beg+1,dat->end+1,dat->ploidy);
+ dat->ploidy = 0;
+ }
+ free(args->dat);
+ free(args);
+}
+
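The check-ploidy plugin above merges consecutive sites of equal ploidy into one region per sample and prints a region only when the ploidy changes, the chromosome changes, or the input ends. A minimal standalone sketch of that bookkeeping follows. It has no htslib dependency, and `segment_t`, `count_alleles` and `track_ploidy` are hypothetical names, not part of bcftools. The missing-allele test is simplified to a plain sentinel comparison, whereas the plugin uses `bcf_gt_is_missing`.

```c
#include <assert.h>

/* Hypothetical stand-in for the plugin's dat_t: one open region per sample. */
typedef struct
{
    int beg, end, ploidy;   /* 0-based positions; ploidy==0 means no open region */
}
segment_t;

/* Count the alleles in one sample's GT vector. `vector_end` marks unused
   slots, `missing` marks a missing allele (simplified sentinel, see above).
   Returns 0 when the genotype should be skipped. */
static int count_alleles(const int *gt, int n, int vector_end, int missing)
{
    int nal;
    for (nal=0; nal<n; nal++)
    {
        if ( gt[nal]==vector_end ) break;   /* smaller ploidy */
        if ( gt[nal]==missing ) return 0;   /* missing allele */
    }
    return nal;
}

/* Extend the open region, or close it and open a new one at `pos`.
   Returns 1 and copies the finished region into `closed` when there is
   something to report, 0 otherwise. */
static int track_ploidy(segment_t *seg, int pos, int nal, segment_t *closed)
{
    assert( seg && closed );
    if ( !nal ) return 0;                                   /* missing genotype */
    if ( seg->ploidy==nal ) { seg->end = pos; return 0; }   /* same ploidy: extend */
    int report = seg->ploidy!=0;
    if ( report ) *closed = *seg;
    seg->ploidy = nal;
    seg->beg = seg->end = pos;
    return report;
}
```

A caller would run `track_ploidy` once per sample and site, print `closed` whenever it returns 1, and flush every open region when the chromosome changes or the input ends.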
--- /dev/null
+#include "bcftools.pysam.h"
+
+/*
+ Copyright (C) 2017 Genome Research Ltd.
+
+ Author: Petr Danecek <pd3@sanger.ac.uk>
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in
+ all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ THE SOFTWARE.
+*/
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <strings.h>
+#include <getopt.h>
+#include <stdarg.h>
+#include <stdint.h>
+#include <htslib/vcf.h>
+#include <htslib/synced_bcf_reader.h>
+#include <htslib/vcfutils.h>
+#include <htslib/kseq.h>
+#include <inttypes.h>
+#include <unistd.h>
+#include "bcftools.h"
+
+typedef struct
+{
+ char *sample;
+ int beg,end,ploidy;
+}
+dat_t;
+
+typedef struct
+{
+ int argc;
+ char **argv;
+ int rid, gt_id, ndat;
+ dat_t *dat;
+ bcf_hdr_t *hdr;
+}
+args_t;
+
+static args_t *args;
+
+const char *about(void)
+{
+ return "Check if ploidy of samples is consistent for all sites\n";
+}
+
+const char *usage(void)
+{
+ return
+ "\n"
+ "About: Check if ploidy of samples is consistent for all sites.\n"
+ "Usage: bcftools +check-ploidy [General Options] -- [Plugin Options]\n"
+ "Options:\n"
+ " run \"bcftools plugin\" for a list of common options\n"
+ "\n"
+ "Example:\n"
+ " bcftools +check-ploidy file.bcf\n"
+ "\n";
+}
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+ args = (args_t*) calloc(1,sizeof(args_t));
+ args->argc = argc; args->argv = argv;
+ args->hdr = in;
+ args->ndat = bcf_hdr_nsamples(args->hdr);
+ args->dat = (dat_t*) calloc(args->ndat,sizeof(dat_t));
+ int i;
+ for (i=0; i<args->ndat; i++) args->dat[i].sample = args->hdr->samples[i];
+ args->rid = -1;
+ args->gt_id = bcf_hdr_id2int(args->hdr,BCF_DT_ID,"GT");
+ if ( args->gt_id<0 ) error("Error: GT field is not present\n");
+ fprintf(bcftools_stdout, "# [1]Sample\t[2]Chromosome\t[3]Region Start\t[4]Region End\t[5]Ploidy\n");
+ return 1;
+}
+
+bcf1_t *process(bcf1_t *rec)
+{
+ int i;
+
+ bcf_unpack(rec, BCF_UN_FMT);
+ bcf_fmt_t *fmt_gt = NULL;
+ for (i=0; i<rec->n_fmt; i++)
+ if ( rec->d.fmt[i].id==args->gt_id ) { fmt_gt = &rec->d.fmt[i]; break; }
+ if ( !fmt_gt ) return NULL; // no GT tag
+
+ if ( args->ndat != rec->n_sample )
+ error("Incorrect number of samples at %s:%d .. found %d, expected %d\n",bcf_seqname(args->hdr,rec),rec->pos+1,rec->n_sample,args->ndat);
+
+ if ( args->rid!=rec->rid && args->rid!=-1 )
+ {
+ for (i=0; i<args->ndat; i++)
+ {
+ dat_t *dat = &args->dat[i];
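+ // the pending region lies on the previous chromosome (args->rid), not on the one rec is on
+ if ( dat->ploidy!=0 ) fprintf(bcftools_stdout, "%s\t%s\t%d\t%d\t%d\n", dat->sample,bcf_hdr_id2name(args->hdr,args->rid),dat->beg+1,dat->end+1,dat->ploidy);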
+ dat->ploidy = 0;
+ }
+ }
+ args->rid = rec->rid;
+
+ #define BRANCH_INT(type_t,vector_end) \
+ { \
+ for (i=0; i<rec->n_sample; i++) \
+ { \
+ type_t *p = (type_t*) (fmt_gt->p + i*fmt_gt->size); \
+ int nal, missing = 0; \
+ for (nal=0; nal<fmt_gt->n; nal++) \
+ { \
+ if ( p[nal]==vector_end ) break; /* smaller ploidy */ \
+ if ( bcf_gt_is_missing(p[nal]) ) { missing=1; break; } /* missing allele */ \
+ } \
+ if ( !nal || missing ) continue; /* missing genotype */ \
+ dat_t *dat = &args->dat[i]; \
+ if ( dat->ploidy==nal ) \
+ { \
+ dat->end = rec->pos; \
+ continue; \
+ } \
+ if ( dat->ploidy!=0 ) \
+ fprintf(bcftools_stdout, "%s\t%s\t%d\t%d\t%d\n", dat->sample,bcf_seqname(args->hdr,rec),dat->beg+1,dat->end+1,dat->ploidy); \
+ dat->ploidy = nal; \
+ dat->beg = rec->pos; \
+ dat->end = rec->pos; \
+ } \
+ }
+ switch (fmt_gt->type) {
+ case BCF_BT_INT8: BRANCH_INT(int8_t, bcf_int8_vector_end); break;
+ case BCF_BT_INT16: BRANCH_INT(int16_t, bcf_int16_vector_end); break;
+ case BCF_BT_INT32: BRANCH_INT(int32_t, bcf_int32_vector_end); break;
+ default: error("The GT type is not recognised: %d at %s:%d\n",fmt_gt->type, bcf_seqname(args->hdr,rec),rec->pos+1); break;
+ }
+ #undef BRANCH_INT
+
+ return NULL;
+}
+
+void destroy(void)
+{
+ int i;
+ for (i=0; i<args->ndat; i++)
+ {
+ dat_t *dat = &args->dat[i];
+ if ( dat->ploidy!=0 ) fprintf(bcftools_stdout, "%s\t%s\t%d\t%d\t%d\n", dat->sample,bcf_hdr_id2name(args->hdr,args->rid),dat->beg+1,dat->end+1,dat->ploidy);
+ dat->ploidy = 0;
+ }
+ free(args->dat);
+ free(args);
+}
+
--- /dev/null
+/*
+ Copyright (C) 2017 Genome Research Ltd.
+
+ Author: Petr Danecek <pd3@sanger.ac.uk>
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in
+ all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ THE SOFTWARE.
+*/
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <strings.h>
+#include <getopt.h>
+#include <stdarg.h>
+#include <stdint.h>
+#include <htslib/vcf.h>
+#include <htslib/synced_bcf_reader.h>
+#include <htslib/vcfutils.h>
+#include <htslib/kseq.h>
+#include <inttypes.h>
+#include <unistd.h>
+#include "bcftools.h"
+
+typedef struct
+{
+ int argc;
+ char **argv, *fname, *region, **regs;
+ int region_is_file, nregs, regs_free;
+ int *smpl, nsmpl, *nsites, min_sites, gt_id;
+ kstring_t tmps;
+ bcf1_t *rec;
+ tbx_t *tbx;
+ hts_idx_t *idx;
+ hts_itr_t *itr;
+ htsFile *fp;
+ bcf_hdr_t *hdr;
+}
+args_t;
+
+const char *about(void)
+{
+ return "Print samples without genotypes in a region or chromosome\n";
+}
+
+static const char *usage_text(void)
+{
+ return
+ "\n"
+ "About: Print samples without genotypes in a region (-r/-R) or chromosome (the default)\n"
+ "\n"
+ "Usage: bcftools +check-sparsity <file.vcf.gz> [Plugin Options]\n"
+ "Plugin options:\n"
+ " -n, --n-markers <int> minimum number of required markers [1]\n"
+ " -r, --regions <chr:beg-end> restrict to comma-separated list of regions\n"
+ " -R, --regions-file <file> restrict to regions listed in a file\n"
+ "\n";
+}
+
+static void init_data(args_t *args)
+{
+ args->fp = hts_open(args->fname,"r");
+ if ( !args->fp ) error("Could not read %s\n", args->fname);
+ args->hdr = bcf_hdr_read(args->fp);
+ if ( !args->hdr ) error("Could not read the header: %s\n", args->fname);
+
+ args->rec = bcf_init1();
+ args->gt_id = bcf_hdr_id2int(args->hdr,BCF_DT_ID,"GT");
+ if ( args->gt_id<0 ) error("Error: GT field is not present\n");
+
+ int i;
+ args->nsmpl = bcf_hdr_nsamples(args->hdr);
+ args->nsites = (int*) calloc(args->nsmpl, sizeof(int));
+ args->smpl = (int*) malloc(sizeof(int)*args->nsmpl);
+ for (i=0; i<args->nsmpl; i++) args->smpl[i] = i;
+
+ if ( strcmp("-",args->fname) ) // not reading from stdin
+ {
+ if ( hts_get_format(args->fp)->format==vcf )
+ {
+ args->tbx = tbx_index_load(args->fname);
+ if ( !args->tbx && args->region ) error("Could not load the VCF index, please drop the -r/-R option\n");
+ }
+ else if ( hts_get_format(args->fp)->format==bcf )
+ {
+ args->idx = bcf_index_load(args->fname);
+ if ( !args->idx && args->region ) error("Could not load the BCF index, please drop the -r/-R option\n");
+ }
+ }
+ else if ( args->region ) error("Cannot use index with this file, please drop the -r/-R option\n");
+
+ if ( args->tbx || args->idx )
+ {
+ if ( args->region )
+ {
+ args->regs = hts_readlist(args->region, args->region_is_file, &args->nregs);
+ if ( !args->regs ) error("Could not parse regions: %s\n", args->region);
+ args->regs_free = 1;
+ }
+ else
+ args->regs = (char**) (args->tbx ? tbx_seqnames(args->tbx, &args->nregs) : bcf_index_seqnames(args->idx, args->hdr, &args->nregs));
+ }
+}
+static void destroy_data(args_t *args)
+{
+ int i;
+ if ( args->regs_free )
+ for (i=0; i<args->nregs; i++) free(args->regs[i]);
+ free(args->regs);
+ bcf_hdr_destroy(args->hdr);
+ bcf_destroy(args->rec);
+ free(args->tmps.s);
+ free(args->smpl);
+ free(args->nsites);
+ if ( args->itr ) hts_itr_destroy(args->itr);
+ if ( args->tbx ) tbx_destroy(args->tbx);
+ if ( args->idx ) hts_idx_destroy(args->idx);
+ hts_close(args->fp);
+}
+
+static void report(args_t *args, const char *reg)
+{
+ int i;
+ for (i=0; i<args->nsmpl; i++)
+ printf("%s\t%s\n", reg, args->hdr->samples[args->smpl[i]]);
+ args->nsmpl = bcf_hdr_nsamples(args->hdr);
+ for (i=0; i<args->nsmpl; i++) args->smpl[i] = i;
+ memset(args->nsites, 0, sizeof(int)*args->nsmpl);
+}
+static void test_region(args_t *args, char *reg)
+{
+ if ( args->tbx )
+ {
+ args->itr = tbx_itr_querys(args->tbx,reg);
+ if ( !args->itr ) return;
+ }
+ else if ( args->idx )
+ {
+ args->itr = bcf_itr_querys(args->idx,args->hdr,reg);
+ if ( !args->itr ) return;
+ }
+
+ int ret,i, rid = -1, nread = 0;
+ while (1)
+ {
+ if ( args->tbx )
+ {
+ if ( (ret=tbx_itr_next(args->fp, args->tbx, args->itr, &args->tmps)) < 0 ) break; // no more lines
+ ret = vcf_parse1(&args->tmps, args->hdr, args->rec);
+ if ( ret<0 ) error("Could not parse the line: %s\n", args->tmps.s);
+ }
+ else if ( args->idx )
+ {
+ ret = bcf_itr_next(args->fp, args->itr, args->rec);
+ if ( ret < -1 ) error("Could not parse a line from %s\n", reg);
+ if ( ret < 0 ) break; // no more lines or an error
+ }
+ else
+ {
+ if ( args->fp->format.format==vcf )
+ {
+ if ( (ret=hts_getline(args->fp, KS_SEP_LINE, &args->tmps)) < 0 ) break; // no more lines
+ ret = vcf_parse1(&args->tmps, args->hdr, args->rec);
+ if ( ret<0 ) error("Could not parse the line: %s\n", args->tmps.s);
+ }
+ else if ( args->fp->format.format==bcf )
+ {
+ ret = bcf_read1(args->fp, args->hdr, args->rec);
+ if ( ret < -1 ) error("Could not parse %s\n", args->fname);
+ if ( ret < 0 ) break; // no more lines or an error
+ }
+ if ( rid!=-1 && rid!=args->rec->rid )
+ {
+ report(args, bcf_hdr_id2name(args->hdr,rid));
+ nread = 0;
+ }
+ rid = args->rec->rid;
+ }
+
+ bcf_unpack(args->rec, BCF_UN_FMT);
+ bcf_fmt_t *fmt_gt = NULL;
+ for (i=0; i<args->rec->n_fmt; i++)
+ if ( args->rec->d.fmt[i].id==args->gt_id ) { fmt_gt = &args->rec->d.fmt[i]; break; }
+ if ( !fmt_gt ) continue; // no GT tag
+ if ( fmt_gt->n==0 ) continue; // empty?!
+ if ( fmt_gt->type!=BCF_BT_INT8 ) error("TODO: the GT fmt_type is not int8!\n");
+
+ // update the array of missing samples
+ for (i=0; i<args->nsmpl; i++)
+ {
+ int8_t *ptr = (int8_t*) (fmt_gt->p + args->smpl[i]*fmt_gt->size);
+ int ial = 0;
+ for (ial=0; ial<fmt_gt->n; ial++)
+ if ( ptr[ial]==bcf_gt_missing || ptr[ial]==bcf_int8_vector_end ) break;
+ if ( ial==0 ) continue; // missing
+ if ( ++args->nsites[i] < args->min_sites ) continue;
+ if ( i+1<args->nsmpl )
+ {
+ memmove(args->smpl+i, args->smpl+i+1, sizeof(int)*(args->nsmpl-i-1));
+ memmove(args->nsites+i, args->nsites+i+1, sizeof(int)*(args->nsmpl-i-1));
+ }
+ args->nsmpl--;
+ i--;
+ }
+ nread = 1;
+ if ( !args->nsmpl ) break;
+ }
+ if ( nread ) report(args, rid==-1 ? reg : bcf_hdr_id2name(args->hdr,rid));
+
+ tbx_itr_destroy(args->itr);
+ args->itr = NULL;
+}
+
+int run(int argc, char **argv)
+{
+ args_t *args = (args_t*) calloc(1,sizeof(args_t));
+ args->argc = argc; args->argv = argv;
+ args->min_sites = 1;
+ static struct option loptions[] =
+ {
+ {"n-markers",required_argument,NULL,'n'},
+ {"regions",required_argument,NULL,'r'},
+ {"regions-file",required_argument,NULL,'R'},
+ {NULL,0,NULL,0}
+ };
+ int c,i;
+ char *tmp;
+ while ((c = getopt_long(argc, argv, "vr:R:n:",loptions,NULL)) >= 0)
+ {
+ switch (c)
+ {
+ case 'n':
+ args->min_sites = strtol(optarg,&tmp,10);
+ if ( *tmp ) error("Could not parse: -n %s\n", optarg);
+ break;
+ case 'R': args->region_is_file = 1; // fall through: -R also sets args->region
+ case 'r': args->region = optarg; break;
+ case 'h':
+ case '?':
+ default: error("%s", usage_text()); break;
+ }
+ }
+
+ if ( optind>=argc )
+ {
+ if ( !isatty(fileno((FILE *)stdin)) ) args->fname = "-"; // reading from stdin
+ else error("%s",usage_text());
+ }
+ else args->fname = argv[optind];
+ init_data(args);
+
+ for (i=0; i<args->nregs; i++) test_region(args, args->regs[i]);
+ if ( !args->nregs ) test_region(args, NULL);
+
+ destroy_data(args);
+ free(args);
+ return 0;
+}
+
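The check-sparsity plugin above keeps a shrinking list of samples that have not yet reached the required number of genotyped sites. Once a sample qualifies it is removed from the list in place, so whatever is left at the end of a region is exactly what gets reported. The sketch below isolates that step without htslib. `drop_sample`, `update_missing` and the `has_gt` array are hypothetical names introduced for illustration, with `has_gt` standing in for the per-sample GT inspection.

```c
#include <assert.h>
#include <string.h>

/* Remove element i from two parallel int arrays of length *n, keeping order. */
static void drop_sample(int *smpl, int *nsites, int *n, int i)
{
    assert( i>=0 && i<*n );
    if ( i+1 < *n )
    {
        memmove(smpl+i, smpl+i+1, sizeof(int)*(*n-i-1));
        memmove(nsites+i, nsites+i+1, sizeof(int)*(*n-i-1));
    }
    (*n)--;
}

/* One pass over the still-sparse samples at a single site. has_gt[k] is
   non-zero if original sample k carries a genotype at this site. Returns
   the number of samples that remain below min_sites genotyped sites. */
static int update_missing(int *smpl, int *nsites, int n, const int *has_gt, int min_sites)
{
    int i;
    for (i=0; i<n; i++)
    {
        if ( !has_gt[smpl[i]] ) continue;       /* missing at this site */
        if ( ++nsites[i] < min_sites ) continue;
        drop_sample(smpl, nsites, &n, i);
        i--;    /* the next candidate has moved into slot i */
    }
    return n;
}
```

Shifting both arrays together keeps each counter aligned with its sample index, which is why the plugin calls `memmove` on `smpl` and `nsites` with the same offsets.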
--- /dev/null
+#include "bcftools.pysam.h"
+
+/*
+ Copyright (C) 2017 Genome Research Ltd.
+
+ Author: Petr Danecek <pd3@sanger.ac.uk>
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in
+ all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ THE SOFTWARE.
+*/
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <strings.h>
+#include <getopt.h>
+#include <stdarg.h>
+#include <stdint.h>
+#include <htslib/vcf.h>
+#include <htslib/synced_bcf_reader.h>
+#include <htslib/vcfutils.h>
+#include <htslib/kseq.h>
+#include <inttypes.h>
+#include <unistd.h>
+#include "bcftools.h"
+
+typedef struct
+{
+ int argc;
+ char **argv, *fname, *region, **regs;
+ int region_is_file, nregs, regs_free;
+ int *smpl, nsmpl, *nsites, min_sites, gt_id;
+ kstring_t tmps;
+ bcf1_t *rec;
+ tbx_t *tbx;
+ hts_idx_t *idx;
+ hts_itr_t *itr;
+ htsFile *fp;
+ bcf_hdr_t *hdr;
+}
+args_t;
+
+const char *about(void)
+{
+ return "Print samples without genotypes in a region or chromosome\n";
+}
+
+static const char *usage_text(void)
+{
+ return
+ "\n"
+ "About: Print samples without genotypes in a region (-r/-R) or chromosome (the default)\n"
+ "\n"
+ "Usage: bcftools +check-sparsity <file.vcf.gz> [Plugin Options]\n"
+ "Plugin options:\n"
+ " -n, --n-markers <int> minimum number of required markers [1]\n"
+ " -r, --regions <chr:beg-end> restrict to comma-separated list of regions\n"
+ " -R, --regions-file <file> restrict to regions listed in a file\n"
+ "\n";
+}
+
+static void init_data(args_t *args)
+{
+ args->fp = hts_open(args->fname,"r");
+ if ( !args->fp ) error("Could not read %s\n", args->fname);
+ args->hdr = bcf_hdr_read(args->fp);
+ if ( !args->hdr ) error("Could not read the header: %s\n", args->fname);
+
+ args->rec = bcf_init1();
+ args->gt_id = bcf_hdr_id2int(args->hdr,BCF_DT_ID,"GT");
+ if ( args->gt_id<0 ) error("Error: GT field is not present\n");
+
+ int i;
+ args->nsmpl = bcf_hdr_nsamples(args->hdr);
+ args->nsites = (int*) calloc(args->nsmpl, sizeof(int));
+ args->smpl = (int*) malloc(sizeof(int)*args->nsmpl);
+ for (i=0; i<args->nsmpl; i++) args->smpl[i] = i;
+
+ if ( strcmp("-",args->fname) ) // not reading from stdin
+ {
+ if ( hts_get_format(args->fp)->format==vcf )
+ {
+ args->tbx = tbx_index_load(args->fname);
+ if ( !args->tbx && args->region ) error("Could not load the VCF index, please drop the -r/-R option\n");
+ }
+ else if ( hts_get_format(args->fp)->format==bcf )
+ {
+ args->idx = bcf_index_load(args->fname);
+ if ( !args->idx && args->region ) error("Could not load the BCF index, please drop the -r/-R option\n");
+ }
+ }
+ else if ( args->region ) error("Cannot use index with this file, please drop the -r/-R option\n");
+
+ if ( args->tbx || args->idx )
+ {
+ if ( args->region )
+ {
+ args->regs = hts_readlist(args->region, args->region_is_file, &args->nregs);
+ if ( !args->regs ) error("Could not parse regions: %s\n", args->region);
+ args->regs_free = 1;
+ }
+ else
+ args->regs = (char**) (args->tbx ? tbx_seqnames(args->tbx, &args->nregs) : bcf_index_seqnames(args->idx, args->hdr, &args->nregs));
+ }
+}
+static void destroy_data(args_t *args)
+{
+ int i;
+ if ( args->regs_free )
+ for (i=0; i<args->nregs; i++) free(args->regs[i]);
+ free(args->regs);
+ bcf_hdr_destroy(args->hdr);
+ bcf_destroy(args->rec);
+ free(args->tmps.s);
+ free(args->smpl);
+ free(args->nsites);
+ if ( args->itr ) hts_itr_destroy(args->itr);
+ if ( args->tbx ) tbx_destroy(args->tbx);
+ if ( args->idx ) hts_idx_destroy(args->idx);
+ hts_close(args->fp);
+}
+
+static void report(args_t *args, const char *reg)
+{
+ int i;
+ for (i=0; i<args->nsmpl; i++)
+ fprintf(bcftools_stdout, "%s\t%s\n", reg, args->hdr->samples[args->smpl[i]]);
+ args->nsmpl = bcf_hdr_nsamples(args->hdr);
+ for (i=0; i<args->nsmpl; i++) args->smpl[i] = i;
+ memset(args->nsites, 0, sizeof(int)*args->nsmpl);
+}
+static void test_region(args_t *args, char *reg)
+{
+ if ( args->tbx )
+ {
+ args->itr = tbx_itr_querys(args->tbx,reg);
+ if ( !args->itr ) return;
+ }
+ else if ( args->idx )
+ {
+ args->itr = bcf_itr_querys(args->idx,args->hdr,reg);
+ if ( !args->itr ) return;
+ }
+
+ int ret,i, rid = -1, nread = 0;
+ while (1)
+ {
+ if ( args->tbx )
+ {
+ if ( (ret=tbx_itr_next(args->fp, args->tbx, args->itr, &args->tmps)) < 0 ) break; // no more lines
+ ret = vcf_parse1(&args->tmps, args->hdr, args->rec);
+ if ( ret<0 ) error("Could not parse the line: %s\n", args->tmps.s);
+ }
+ else if ( args->idx )
+ {
+ ret = bcf_itr_next(args->fp, args->itr, args->rec);
+ if ( ret < -1 ) error("Could not parse a line from %s\n", reg);
+ if ( ret < 0 ) break; // no more lines or an error
+ }
+ else
+ {
+ if ( args->fp->format.format==vcf )
+ {
+ if ( (ret=hts_getline(args->fp, KS_SEP_LINE, &args->tmps)) < 0 ) break; // no more lines
+ ret = vcf_parse1(&args->tmps, args->hdr, args->rec);
+ if ( ret<0 ) error("Could not parse the line: %s\n", args->tmps.s);
+ }
+ else if ( args->fp->format.format==bcf )
+ {
+ ret = bcf_read1(args->fp, args->hdr, args->rec);
+ if ( ret < -1 ) error("Could not parse %s\n", args->fname);
+ if ( ret < 0 ) break; // no more lines or an error
+ }
+ if ( rid!=-1 && rid!=args->rec->rid )
+ {
+ report(args, bcf_hdr_id2name(args->hdr,rid));
+ nread = 0;
+ }
+ rid = args->rec->rid;
+ }
+
+ bcf_unpack(args->rec, BCF_UN_FMT);
+ bcf_fmt_t *fmt_gt = NULL;
+ for (i=0; i<args->rec->n_fmt; i++)
+ if ( args->rec->d.fmt[i].id==args->gt_id ) { fmt_gt = &args->rec->d.fmt[i]; break; }
+ if ( !fmt_gt ) continue; // no GT tag
+ if ( fmt_gt->n==0 ) continue; // empty?!
+ if ( fmt_gt->type!=BCF_BT_INT8 ) error("TODO: the GT fmt_type is not int8!\n");
+
+ // update the array of missing samples
+ for (i=0; i<args->nsmpl; i++)
+ {
+ int8_t *ptr = (int8_t*) (fmt_gt->p + args->smpl[i]*fmt_gt->size);
+ int ial = 0;
+ for (ial=0; ial<fmt_gt->n; ial++)
+ if ( ptr[ial]==bcf_gt_missing || ptr[ial]==bcf_int8_vector_end ) break;
+ if ( ial==0 ) continue; // missing
+ if ( ++args->nsites[i] < args->min_sites ) continue;
+ if ( i+1<args->nsmpl )
+ {
+ memmove(args->smpl+i, args->smpl+i+1, sizeof(int)*(args->nsmpl-i-1));
+ memmove(args->nsites+i, args->nsites+i+1, sizeof(int)*(args->nsmpl-i-1));
+ }
+ args->nsmpl--;
+ i--;
+ }
+ nread = 1;
+ if ( !args->nsmpl ) break;
+ }
+ if ( nread ) report(args, rid==-1 ? reg : bcf_hdr_id2name(args->hdr,rid));
+
+ tbx_itr_destroy(args->itr);
+ args->itr = NULL;
+}
+
+int run(int argc, char **argv)
+{
+ args_t *args = (args_t*) calloc(1,sizeof(args_t));
+ args->argc = argc; args->argv = argv;
+ args->min_sites = 1;
+ static struct option loptions[] =
+ {
+ {"n-markers",required_argument,NULL,'n'},
+ {"regions",required_argument,NULL,'r'},
+ {"regions-file",required_argument,NULL,'R'},
+ {NULL,0,NULL,0}
+ };
+ int c,i;
+ char *tmp;
+ while ((c = getopt_long(argc, argv, "vr:R:n:",loptions,NULL)) >= 0)
+ {
+ switch (c)
+ {
+ case 'n':
+ args->min_sites = strtol(optarg,&tmp,10);
+ if ( *tmp ) error("Could not parse: -n %s\n", optarg);
+ break;
+ case 'R': args->region_is_file = 1; // fall through: -R also sets args->region
+ case 'r': args->region = optarg; break;
+ case 'h':
+ case '?':
+ default: error("%s", usage_text()); break;
+ }
+ }
+
+ if ( optind>=argc )
+ {
+ if ( !isatty(fileno((FILE *)stdin)) ) args->fname = "-"; // reading from stdin
+ else error("%s",usage_text());
+ }
+ else args->fname = argv[optind];
+ init_data(args);
+
+ for (i=0; i<args->nregs; i++) test_region(args, args->regs[i]);
+ if ( !args->nregs ) test_region(args, NULL);
+
+ destroy_data(args);
+ free(args);
+ return 0;
+}
+
--- /dev/null
+/* The MIT License
+
+ Copyright (c) 2015 Genome Research Ltd.
+
+ Author: Petr Danecek <pd3@sanger.ac.uk>
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in
+ all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ THE SOFTWARE.
+
+ */
+
+/*
+ Trio haplotypes: mother (A,B), father (C,D), child (E,F)
+ Modeling the following states:
+ 01|23|02
+ 01|23|03
+ 01|23|12
+ 01|23|13
+ 01|23|20
+ 01|23|30
+ 01|23|21
+ 01|23|31
+ with the likelihoods of two haplotypes A,B segments sharing an allele:
+ P(01|A==B) .. e (P of error)
+ P(00|A==B) .. 1-e
+ and
+ P(ab,cd,ef|E=A,F=C) = P(ea|E=A)*P(fc|F=C)
+
+
+ Unrelated samples: (A,B) and (C,D)
+ Modeling the states:
+ xxxx .. A!=C,A!=D,B!=C,B!=D
+ 0x0x .. A=C,B!=D
+ 0xx0 .. A=D,B!=C
+ x00x .. B=C,A!=D
+ x0x0 .. B=D,A!=C
+ 0101 .. A=C,B=D
+ 0110 .. A=D,B=C
+ with the likelihoods
+ P(01|A!=B) .. f*(1-f)
+ P(00|A!=B) .. (1-f)*(1-f)
+ P(11|A!=B) .. f*f
+
+ Assuming 2x30 crossovers, P=2e-8.
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <getopt.h>
+#include <math.h>
+#include <htslib/hts.h>
+#include <htslib/vcf.h>
+#include <errno.h>
+#include "bcftools.h"
+#include "HMM.h"
+
+#define C_TRIO 1
+#define C_UNRL 2
+
+// states for unrelated samples
+#define UNRL_xxxx 0
+#define UNRL_0x0x 1
+#define UNRL_0xx0 2
+#define UNRL_x00x 3
+#define UNRL_x0x0 4
+#define UNRL_0101 5
+#define UNRL_0110 6
+
+// trio states
+#define TRIO_AC 0
+#define TRIO_AD 1
+#define TRIO_BC 2
+#define TRIO_BD 3
+#define TRIO_CA 4
+#define TRIO_DA 5
+#define TRIO_CB 6
+#define TRIO_DB 7
+
+typedef struct _args_t
+{
+ bcf_hdr_t *hdr;
+ hmm_t *hmm;
+ double *eprob, *tprob, pij, pgt_err;
+ uint32_t *sites;
+ int32_t *gt_arr;
+ int nsites, msites, ngt_arr, prev_rid;
+ int mode, nstates, nhet_father, nhet_mother;
+ int imother,ifather,ichild, isample,jsample;
+ void (*set_observed_prob) (bcf1_t *rec);
+ char *prefix;
+ FILE *fp;
+}
+args_t;
+
+static args_t args;
+
+#define SW_MOTHER 1
+#define SW_FATHER 2
+static int hap_switch[8][8];
+
+static void set_observed_prob_trio(bcf1_t *rec);
+static void set_observed_prob_unrelated(bcf1_t *rec);
+static void init_hmm_trio(args_t *args);
+static void init_hmm_unrelated(args_t *args);
+
+
+const char *about(void)
+{
+ return "Color shared chromosomal segments, requires phased GTs.\n";
+}
+
+const char *usage(void)
+{
+ return
+ "\n"
+ "About: Color shared chromosomal segments, requires phased GTs. The output\n"
+ " can be visualized using the color-chrs.pl script.\n"
+ "Usage: bcftools +color-chrs [General Options] -- [Plugin Options]\n"
+ "Options:\n"
+ " run \"bcftools plugin\" for a list of common options\n"
+ "\n"
+ "Plugin options:\n"
+ " -p, --prefix <path> output files prefix\n"
+ " -t, --trio <m,f,c> names of mother, father and the child\n"
+ " -u, --unrelated <a,b> names of two unrelated samples\n"
+ "\n"
+ "Example:\n"
+ " bcftools +color-chrs in.vcf --\n"
+ "\n";
+}
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+ char *trio_samples = NULL, *unrelated_samples = NULL;
+ memset(&args,0,sizeof(args_t));
+ args.prev_rid = -1;
+ args.hdr = in;
+ args.pij = 2e-8;
+ args.pgt_err = 1e-9;
+
+ static struct option loptions[] =
+ {
+ {"prefix",1,0,'p'},
+ {"trio",1,0,'t'},
+ {"unrelated",1,0,'u'},
+ {0,0,0,0}
+ };
+ int c;
+ while ((c = getopt_long(argc, argv, "?ht:u:p:",loptions,NULL)) >= 0)
+ {
+ switch (c)
+ {
+ case 'p': args.prefix = optarg; break;
+ case 't': trio_samples = optarg; break;
+ case 'u': unrelated_samples = optarg; break;
+ case 'h':
+ case '?':
+ default: error("%s", usage()); break;
+ }
+ }
+ if ( optind != argc ) error("%s",usage());
+ if ( trio_samples && unrelated_samples ) error("Expected only one of the -t/-u options\n");
+ if ( !trio_samples && !unrelated_samples ) error("Expected one of the -t/-u options\n");
+ if ( !args.prefix ) error("Expected the -p option\n");
+
+ int ret = bcf_hdr_set_samples(args.hdr, trio_samples ? trio_samples : unrelated_samples, 0);
+ if ( ret<0 ) error("Could not parse samples: %s\n", trio_samples ? trio_samples : unrelated_samples);
+ else if ( ret>0 ) error("%d-th sample not found: %s\n", ret,trio_samples ? trio_samples : unrelated_samples);
+
+ if ( trio_samples )
+ {
+ int i,n = 0;
+ char **list = hts_readlist(trio_samples, 0, &n);
+ if ( n!=3 ) error("Expected three sample names with -t\n");
+ args.imother = bcf_hdr_id2int(args.hdr, BCF_DT_SAMPLE, list[0]);
+ args.ifather = bcf_hdr_id2int(args.hdr, BCF_DT_SAMPLE, list[1]);
+ args.ichild = bcf_hdr_id2int(args.hdr, BCF_DT_SAMPLE, list[2]);
+ for (i=0; i<n; i++) free(list[i]);
+ free(list);
+ args.set_observed_prob = set_observed_prob_trio;
+ args.mode = C_TRIO;
+ init_hmm_trio(&args);
+ }
+ else
+ {
+ int i,n = 0;
+ char **list = hts_readlist(unrelated_samples, 0, &n);
+ if ( n!=2 ) error("Expected two sample names with -u\n");
+ args.isample = bcf_hdr_id2int(args.hdr, BCF_DT_SAMPLE, list[0]);
+ args.jsample = bcf_hdr_id2int(args.hdr, BCF_DT_SAMPLE, list[1]);
+ for (i=0; i<n; i++) free(list[i]);
+ free(list);
+ args.set_observed_prob = set_observed_prob_unrelated;
+ args.mode = C_UNRL;
+ init_hmm_unrelated(&args);
+ }
+ return 1;
+}
+
+static void init_hmm_trio(args_t *args)
+{
+ int i,j;
+ args->nstates = 8;
+ args->tprob = (double*) malloc(sizeof(double)*args->nstates*args->nstates);
+
+ for (i=0; i<args->nstates; i++)
+ for (j=0; j<args->nstates; j++) hap_switch[i][j] = 0;
+
+ hap_switch[TRIO_AD][TRIO_AC] = SW_FATHER;
+ hap_switch[TRIO_BC][TRIO_AC] = SW_MOTHER;
+ hap_switch[TRIO_BD][TRIO_AC] = SW_MOTHER | SW_FATHER;
+ hap_switch[TRIO_AC][TRIO_AD] = SW_FATHER;
+ hap_switch[TRIO_BC][TRIO_AD] = SW_MOTHER | SW_FATHER;
+ hap_switch[TRIO_BD][TRIO_AD] = SW_MOTHER;
+ hap_switch[TRIO_AC][TRIO_BC] = SW_MOTHER;
+ hap_switch[TRIO_AD][TRIO_BC] = SW_MOTHER | SW_FATHER;
+ hap_switch[TRIO_BD][TRIO_BC] = SW_FATHER;
+ hap_switch[TRIO_AC][TRIO_BD] = SW_MOTHER | SW_FATHER;
+ hap_switch[TRIO_AD][TRIO_BD] = SW_MOTHER;
+ hap_switch[TRIO_BC][TRIO_BD] = SW_FATHER;
+
+ hap_switch[TRIO_DA][TRIO_CA] = SW_FATHER;
+ hap_switch[TRIO_CB][TRIO_CA] = SW_MOTHER;
+ hap_switch[TRIO_DB][TRIO_CA] = SW_MOTHER | SW_FATHER;
+ hap_switch[TRIO_CA][TRIO_DA] = SW_FATHER;
+ hap_switch[TRIO_CB][TRIO_DA] = SW_MOTHER | SW_FATHER;
+ hap_switch[TRIO_DB][TRIO_DA] = SW_MOTHER;
+ hap_switch[TRIO_CA][TRIO_CB] = SW_MOTHER;
+ hap_switch[TRIO_DA][TRIO_CB] = SW_MOTHER | SW_FATHER;
+ hap_switch[TRIO_DB][TRIO_CB] = SW_FATHER;
+ hap_switch[TRIO_CA][TRIO_DB] = SW_MOTHER | SW_FATHER;
+ hap_switch[TRIO_DA][TRIO_DB] = SW_MOTHER;
+ hap_switch[TRIO_CB][TRIO_DB] = SW_FATHER;
+
+ for (i=0; i<args->nstates; i++)
+ {
+ for (j=0; j<args->nstates; j++)
+ {
+ if ( !hap_switch[i][j] ) MAT(args->tprob,args->nstates,i,j) = 0;
+ else
+ {
+ MAT(args->tprob,args->nstates,i,j) = 1;
+ if ( hap_switch[i][j] & SW_MOTHER ) MAT(args->tprob,args->nstates,i,j) *= args->pij;
+ if ( hap_switch[i][j] & SW_FATHER ) MAT(args->tprob,args->nstates,i,j) *= args->pij;
+ }
+ }
+ }
+ for (i=0; i<args->nstates; i++)
+ {
+ double sum = 0;
+ for (j=0; j<args->nstates; j++)
+ {
+ if ( i!=j ) sum += MAT(args->tprob,args->nstates,i,j);
+ }
+ MAT(args->tprob,args->nstates,i,i) = 1 - sum;
+ }
+
+ #if 0
+ for (i=0; i<args->nstates; i++)
+ {
+ for (j=0; j<args->nstates; j++)
+ fprintf(stderr,"\t%d",hap_switch[j][i]);
+ fprintf(stderr,"\n");
+ }
+ for (i=0; i<args->nstates; i++)
+ {
+ for (j=0; j<args->nstates; j++)
+ fprintf(stderr,"\t%e",MAT(args->tprob,args->nstates,j,i));
+ fprintf(stderr,"\n");
+ }
+ #endif
+
+ args->hmm = hmm_init(args->nstates, args->tprob, 10000);
+}
+static void init_hmm_unrelated(args_t *args)
+{
+ int i,j;
+ args->nstates = 7;
+ args->tprob = (double*) malloc(sizeof(double)*args->nstates*args->nstates);
+
+ for (i=0; i<args->nstates; i++)
+ {
+ for (j=0; j<args->nstates; j++)
+ MAT(args->tprob,args->nstates,i,j) = args->pij;
+ }
+ MAT(args->tprob,args->nstates,UNRL_0101,UNRL_xxxx) = args->pij*args->pij;
+ MAT(args->tprob,args->nstates,UNRL_0110,UNRL_xxxx) = args->pij*args->pij;
+ MAT(args->tprob,args->nstates,UNRL_x0x0,UNRL_0x0x) = args->pij*args->pij;
+ MAT(args->tprob,args->nstates,UNRL_0110,UNRL_0x0x) = args->pij*args->pij;
+ MAT(args->tprob,args->nstates,UNRL_x00x,UNRL_0xx0) = args->pij*args->pij;
+ MAT(args->tprob,args->nstates,UNRL_0101,UNRL_0xx0) = args->pij*args->pij;
+ MAT(args->tprob,args->nstates,UNRL_0101,UNRL_x00x) = args->pij*args->pij;
+ MAT(args->tprob,args->nstates,UNRL_0110,UNRL_x0x0) = args->pij*args->pij;
+ MAT(args->tprob,args->nstates,UNRL_0110,UNRL_0101) = args->pij*args->pij;
+
+ for (i=0; i<args->nstates; i++)
+ {
+ for (j=i+1; j<args->nstates; j++)
+ MAT(args->tprob,args->nstates,i,j) = MAT(args->tprob,args->nstates,j,i);
+ }
+ for (i=0; i<args->nstates; i++)
+ {
+ double sum = 0;
+ for (j=0; j<args->nstates; j++)
+ if ( i!=j ) sum += MAT(args->tprob,args->nstates,i,j);
+ MAT(args->tprob,args->nstates,i,i) = 1 - sum;
+ }
+
+ #if 0
+ for (i=0; i<args->nstates; i++)
+ {
+ for (j=0; j<args->nstates; j++)
+ fprintf(stderr,"\t%e",MAT(args->tprob,args->nstates,j,i));
+ fprintf(stderr,"\n");
+ }
+ #endif
+
+ args->hmm = hmm_init(args->nstates, args->tprob, 10000);
+}
+static inline double prob_shared(float af, int a, int b)
+{
+ return a==b ? 1-args.pgt_err : args.pgt_err;
+}
+static inline double prob_not_shared(float af, int a, int b)
+{
+ if ( a!=b ) return af*(1-af);
+ else if ( a==0 ) return (1-af)*(1-af);
+ else return af*af;
+}
+static void set_observed_prob_unrelated(bcf1_t *rec)
+{
+ float af = 0.5; // alternate allele frequency
+
+ int ngt = bcf_get_genotypes(args.hdr, rec, &args.gt_arr, &args.ngt_arr);
+ if ( ngt<0 ) return;
+ if ( ngt!=4 ) return; // not diploid in both samples, e.g. haploid chrX calls
+
+ int32_t a,b,c,d;
+ a = args.gt_arr[2*args.isample];
+ b = args.gt_arr[2*args.isample+1];
+ c = args.gt_arr[2*args.jsample];
+ d = args.gt_arr[2*args.jsample+1];
+ if ( bcf_gt_is_missing(a) || bcf_gt_is_missing(b) ) return;
+ if ( bcf_gt_is_missing(c) || bcf_gt_is_missing(d) ) return;
+ if ( !bcf_gt_is_phased(a) && !bcf_gt_is_phased(b) ) return; // only the second allele should be set when phased
+ if ( !bcf_gt_is_phased(c) && !bcf_gt_is_phased(d) ) return;
+ a = bcf_gt_allele(a);
+ b = bcf_gt_allele(b);
+ c = bcf_gt_allele(c);
+ d = bcf_gt_allele(d);
+
+ int m = args.msites;
+ args.nsites++;
+ hts_expand(uint32_t,args.nsites,args.msites,args.sites);
+ if ( m!=args.msites ) args.eprob = (double*) realloc(args.eprob, sizeof(double)*args.msites*args.nstates);
+
+ args.sites[args.nsites-1] = rec->pos;
+ double *prob = args.eprob + args.nstates*(args.nsites-1);
+ prob[UNRL_xxxx] = prob_not_shared(af,a,c) * prob_not_shared(af,a,d) * prob_not_shared(af,b,c) * prob_not_shared(af,b,d);
+ prob[UNRL_0x0x] = prob_shared(af,a,c) * prob_not_shared(af,b,d);
+ prob[UNRL_0xx0] = prob_shared(af,a,d) * prob_not_shared(af,b,c);
+ prob[UNRL_x00x] = prob_shared(af,b,c) * prob_not_shared(af,a,d);
+ prob[UNRL_x0x0] = prob_shared(af,b,d) * prob_not_shared(af,a,c);
+ prob[UNRL_0101] = prob_shared(af,a,c) * prob_shared(af,b,d);
+ prob[UNRL_0110] = prob_shared(af,a,d) * prob_shared(af,b,c);
+
+#if 0
+ static int x = 0;
+ if ( !x++)
+ {
+ printf("p(0==0) .. %f\n", prob_shared(af,0,0));
+ printf("p(0!=0) .. %f\n", prob_not_shared(af,0,0));
+ printf("p(0==1) .. %f\n", prob_shared(af,0,1));
+ printf("p(0!=1) .. %f\n", prob_not_shared(af,0,1));
+ }
+ printf("%d|%d %d|%d x:%f 11:%f 12:%f 21:%f 22:%f 11,22:%f 12,21:%f %d\n", a,b,c,d,
+ prob[UNRL_xxxx], prob[UNRL_0x0x], prob[UNRL_0xx0], prob[UNRL_x00x], prob[UNRL_x0x0], prob[UNRL_0101], prob[UNRL_0110], rec->pos+1);
+#endif
+}
+static void set_observed_prob_trio(bcf1_t *rec)
+{
+ int ngt = bcf_get_genotypes(args.hdr, rec, &args.gt_arr, &args.ngt_arr);
+ if ( ngt<0 ) return;
+ if ( ngt!=6 ) return; // not diploid in all three samples, e.g. haploid chrX calls
+
+ int32_t a,b,c,d,e,f;
+ a = args.gt_arr[2*args.imother];
+ b = args.gt_arr[2*args.imother+1];
+ c = args.gt_arr[2*args.ifather];
+ d = args.gt_arr[2*args.ifather+1];
+ e = args.gt_arr[2*args.ichild];
+ f = args.gt_arr[2*args.ichild+1];
+ if ( bcf_gt_is_missing(a) || bcf_gt_is_missing(b) ) return;
+ if ( bcf_gt_is_missing(c) || bcf_gt_is_missing(d) ) return;
+ if ( bcf_gt_is_missing(e) || bcf_gt_is_missing(f) ) return;
+ if ( !bcf_gt_is_phased(a) && !bcf_gt_is_phased(b) ) return; // only the second allele should be set when phased
+ if ( !bcf_gt_is_phased(c) && !bcf_gt_is_phased(d) ) return;
+ if ( !bcf_gt_is_phased(e) && !bcf_gt_is_phased(f) ) return;
+ a = bcf_gt_allele(a);
+ b = bcf_gt_allele(b);
+ c = bcf_gt_allele(c);
+ d = bcf_gt_allele(d);
+ e = bcf_gt_allele(e);
+ f = bcf_gt_allele(f);
+
+ int mother = (1<<a) | (1<<b);
+ int father = (1<<c) | (1<<d);
+ int child = (1<<e) | (1<<f);
+ if ( !(mother&child) || !(father&child) ) return; // Mendelian-inconsistent site, skip
+
+ if ( a!=b ) args.nhet_mother++;
+ if ( c!=d ) args.nhet_father++;
+
+ int m = args.msites;
+ args.nsites++;
+ hts_expand(uint32_t,args.nsites,args.msites,args.sites);
+ if ( m!=args.msites ) args.eprob = (double*) realloc(args.eprob, sizeof(double)*args.msites*args.nstates);
+
+ args.sites[args.nsites-1] = rec->pos;
+ double *prob = args.eprob + args.nstates*(args.nsites-1);
+ prob[TRIO_AC] = prob_shared(0,e,a) * prob_shared(0,f,c);
+ prob[TRIO_AD] = prob_shared(0,e,a) * prob_shared(0,f,d);
+ prob[TRIO_BC] = prob_shared(0,e,b) * prob_shared(0,f,c);
+ prob[TRIO_BD] = prob_shared(0,e,b) * prob_shared(0,f,d);
+ prob[TRIO_CA] = prob_shared(0,e,c) * prob_shared(0,f,a);
+ prob[TRIO_DA] = prob_shared(0,e,d) * prob_shared(0,f,a);
+ prob[TRIO_CB] = prob_shared(0,e,c) * prob_shared(0,f,b);
+ prob[TRIO_DB] = prob_shared(0,e,d) * prob_shared(0,f,b);
+}
+
+void flush_viterbi(args_t *args)
+{
+ const char *s1, *s2, *s3 = NULL;
+ if ( args->mode==C_UNRL )
+ {
+ s1 = bcf_hdr_int2id(args->hdr,BCF_DT_SAMPLE,args->isample);
+ s2 = bcf_hdr_int2id(args->hdr,BCF_DT_SAMPLE,args->jsample);
+ }
+ else if ( args->mode==C_TRIO )
+ {
+ s1 = bcf_hdr_int2id(args->hdr,BCF_DT_SAMPLE,args->imother);
+ s3 = bcf_hdr_int2id(args->hdr,BCF_DT_SAMPLE,args->ifather);
+ s2 = bcf_hdr_int2id(args->hdr,BCF_DT_SAMPLE,args->ichild);
+ }
+ else abort();
+
+ if ( !args->fp )
+ {
+ kstring_t str = {0,0,0};
+ kputs(args->prefix, &str);
+ kputs(".dat", &str);
+ args->fp = fopen(str.s,"w");
+ if ( !args->fp ) error("%s: %s\n", str.s,strerror(errno));
+ free(str.s);
+ fprintf(args->fp,"# SG, shared segment\t[2]Chromosome\t[3]Start\t[4]End\t[5]%s:1\t[6]%s:2\n",s2,s2);
+ fprintf(args->fp,"# SW, number of switches\t[2]Sample\t[3]Chromosome\t[4]nHets\t[5]nSwitches\t[6]switch rate\n");
+ }
+
+ hmm_run_viterbi(args->hmm,args->nsites,args->eprob,args->sites);
+ uint8_t *vpath = hmm_get_viterbi_path(args->hmm);
+ int i, iprev = -1, prev_state = -1, nstates = hmm_get_nstates(args->hmm);
+ int nswitch_mother = 0, nswitch_father = 0;
+ for (i=0; i<args->nsites; i++)
+ {
+ int state = vpath[i*nstates];
+ if ( state!=prev_state || i+1==args->nsites )
+ {
+ uint32_t start = iprev>=0 ? args->sites[iprev]+1 : 1, end = i>0 ? args->sites[i-1] : 1;
+ const char *chr = bcf_hdr_id2name(args->hdr,args->prev_rid);
+ if ( args->mode==C_UNRL )
+ {
+ switch (prev_state)
+ {
+ case UNRL_0x0x:
+ fprintf(args->fp,"SG\t%s\t%d\t%d\t%s:1\t-\n", chr,start,end,s1); break;
+ case UNRL_0xx0:
+ fprintf(args->fp,"SG\t%s\t%d\t%d\t-\t%s:1\n", chr,start,end,s1); break;
+ case UNRL_x00x:
+ fprintf(args->fp,"SG\t%s\t%d\t%d\t%s:2\t-\n", chr,start,end,s1); break;
+ case UNRL_x0x0:
+ fprintf(args->fp,"SG\t%s\t%d\t%d\t-\t%s:2\n", chr,start,end,s1); break;
+ case UNRL_0101:
+ fprintf(args->fp,"SG\t%s\t%d\t%d\t%s:1\t%s:2\n", chr,start,end,s1,s1); break;
+ case UNRL_0110:
+ fprintf(args->fp,"SG\t%s\t%d\t%d\t%s:2\t%s:1\n", chr,start,end,s1,s1); break;
+ }
+ }
+ else if ( args->mode==C_TRIO )
+ {
+ switch (prev_state)
+ {
+ case TRIO_AC:
+ fprintf(args->fp,"SG\t%s\t%d\t%d\t%s:1\t%s:1\n", chr,start,end,s1,s3); break;
+ case TRIO_AD:
+ fprintf(args->fp,"SG\t%s\t%d\t%d\t%s:1\t%s:2\n", chr,start,end,s1,s3); break;
+ case TRIO_BC:
+ fprintf(args->fp,"SG\t%s\t%d\t%d\t%s:2\t%s:1\n", chr,start,end,s1,s3); break;
+ case TRIO_BD:
+ fprintf(args->fp,"SG\t%s\t%d\t%d\t%s:2\t%s:2\n", chr,start,end,s1,s3); break;
+ case TRIO_CA:
+ fprintf(args->fp,"SG\t%s\t%d\t%d\t%s:1\t%s:1\n", chr,start,end,s3,s1); break;
+ case TRIO_DA:
+ fprintf(args->fp,"SG\t%s\t%d\t%d\t%s:2\t%s:1\n", chr,start,end,s3,s1); break;
+ case TRIO_CB:
+ fprintf(args->fp,"SG\t%s\t%d\t%d\t%s:1\t%s:2\n", chr,start,end,s3,s1); break;
+ case TRIO_DB:
+ fprintf(args->fp,"SG\t%s\t%d\t%d\t%s:2\t%s:2\n", chr,start,end,s3,s1); break;
+ }
+ if ( hap_switch[state][prev_state] & SW_MOTHER ) nswitch_mother++;
+ if ( hap_switch[state][prev_state] & SW_FATHER ) nswitch_father++;
+ }
+ iprev = i-1;
+ }
+ prev_state = state;
+ }
+ if ( args->mode==C_TRIO ) // switch rates are defined for trios only; s3 is NULL otherwise
+ {
+ float mrate = args->nhet_mother>1 ? (float)nswitch_mother/(args->nhet_mother-1) : 0;
+ float frate = args->nhet_father>1 ? (float)nswitch_father/(args->nhet_father-1) : 0;
+ fprintf(args->fp,"SW\t%s\t%s\t%d\t%d\t%f\n", s1,bcf_hdr_id2name(args->hdr,args->prev_rid),args->nhet_mother,nswitch_mother,mrate);
+ fprintf(args->fp,"SW\t%s\t%s\t%d\t%d\t%f\n", s3,bcf_hdr_id2name(args->hdr,args->prev_rid),args->nhet_father,nswitch_father,frate);
+ }
+ args->nsites = 0;
+ args->nhet_father = args->nhet_mother = 0;
+}
+
+bcf1_t *process(bcf1_t *rec)
+{
+ if ( args.prev_rid==-1 ) args.prev_rid = rec->rid;
+ if ( args.prev_rid!=rec->rid ) flush_viterbi(&args);
+ args.prev_rid = rec->rid;
+ args.set_observed_prob(rec);
+ return NULL;
+}
+
+void destroy(void)
+{
+ flush_viterbi(&args);
+ fclose(args.fp);
+
+ free(args.gt_arr);
+ free(args.tprob);
+ free(args.sites);
+ free(args.eprob);
+ hmm_destroy(args.hmm);
+}
+
+
+
--- /dev/null
+#include "bcftools.pysam.h"
+
+/* The MIT License
+
+ Copyright (c) 2015 Genome Research Ltd.
+
+ Author: Petr Danecek <pd3@sanger.ac.uk>
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in
+ all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ THE SOFTWARE.
+
+ */
+
+/*
+ Trio haplotypes: mother (A,B), father (C,D), child (E,F)
+ Modeling the following states:
+ 01|23|02
+ 01|23|03
+ 01|23|12
+ 01|23|13
+ 01|23|20
+ 01|23|30
+ 01|23|21
+ 01|23|31
+ with the likelihoods of two haplotype segments A,B sharing an allele:
+ P(01|A==B) .. e (P of error)
+ P(00|A==B) .. 1-e
+ and
+ P(ab,cd,ef|E=A,F=C) = P(ea|E=A)*P(fc|F=C)
+
+
+ Unrelated samples: (A,B) and (C,D)
+ Modeling the states:
+ xxxx .. A!=C,A!=D,B!=C,B!=D
+ 0x0x .. A=C,B!=D
+ 0xx0 .. A=D,B!=C
+ x00x .. B=C,A!=D
+ x0x0 .. B=D,A!=C
+ 0101 .. A=C,B=D
+ 0110 .. A=D,B=C
+ with the likelihoods
+ P(01|A!=B) .. f*(1-f)
+ P(00|A!=B) .. (1-f)*(1-f)
+ P(11|A!=B) .. f*f
+
+ Assuming 2x30 crossovers, P=2e-8.
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <getopt.h>
+#include <math.h>
+#include <htslib/hts.h>
+#include <htslib/vcf.h>
+#include <errno.h>
+#include <string.h> // for memset, strerror
+#include "bcftools.h"
+#include "HMM.h"
+
+#define C_TRIO 1
+#define C_UNRL 2
+
+// states for unrelated samples
+#define UNRL_xxxx 0
+#define UNRL_0x0x 1
+#define UNRL_0xx0 2
+#define UNRL_x00x 3
+#define UNRL_x0x0 4
+#define UNRL_0101 5
+#define UNRL_0110 6
+
+// trio states
+#define TRIO_AC 0
+#define TRIO_AD 1
+#define TRIO_BC 2
+#define TRIO_BD 3
+#define TRIO_CA 4
+#define TRIO_DA 5
+#define TRIO_CB 6
+#define TRIO_DB 7
+
+typedef struct _args_t
+{
+ bcf_hdr_t *hdr;
+ hmm_t *hmm;
+ double *eprob, *tprob, pij, pgt_err;
+ uint32_t *sites;
+ int32_t *gt_arr;
+ int nsites, msites, ngt_arr, prev_rid;
+ int mode, nstates, nhet_father, nhet_mother;
+ int imother,ifather,ichild, isample,jsample;
+ void (*set_observed_prob) (bcf1_t *rec);
+ char *prefix;
+ FILE *fp;
+}
+args_t;
+
+static args_t args;
+
+#define SW_MOTHER 1
+#define SW_FATHER 2
+static int hap_switch[8][8];
+
+static void set_observed_prob_trio(bcf1_t *rec);
+static void set_observed_prob_unrelated(bcf1_t *rec);
+static void init_hmm_trio(args_t *args);
+static void init_hmm_unrelated(args_t *args);
+
+
+const char *about(void)
+{
+ return "Color shared chromosomal segments, requires phased GTs.\n";
+}
+
+const char *usage(void)
+{
+ return
+ "\n"
+ "About: Color shared chromosomal segments, requires phased GTs. The output\n"
+ " can be visualized using the color-chrs.pl script.\n"
+ "Usage: bcftools +color-chrs [General Options] -- [Plugin Options]\n"
+ "Options:\n"
+ " run \"bcftools plugin\" for a list of common options\n"
+ "\n"
+ "Plugin options:\n"
+ " -p, --prefix <path> output files prefix\n"
+ " -t, --trio <m,f,c> names of mother, father and the child\n"
+ " -u, --unrelated <a,b> names of two unrelated samples\n"
+ "\n"
+ "Example:\n"
+ " bcftools +color-chrs in.vcf -- -p prefix -t mother,father,child\n"
+ "\n";
+}
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+ char *trio_samples = NULL, *unrelated_samples = NULL;
+ memset(&args,0,sizeof(args_t));
+ args.prev_rid = -1;
+ args.hdr = in;
+ args.pij = 2e-8;
+ args.pgt_err = 1e-9;
+
+ static struct option loptions[] =
+ {
+ {"prefix",1,0,'p'},
+ {"trio",1,0,'t'},
+ {"unrelated",1,0,'u'},
+ {0,0,0,0}
+ };
+ int c;
+ while ((c = getopt_long(argc, argv, "?ht:u:p:",loptions,NULL)) >= 0)
+ {
+ switch (c)
+ {
+ case 'p': args.prefix = optarg; break;
+ case 't': trio_samples = optarg; break;
+ case 'u': unrelated_samples = optarg; break;
+ case 'h':
+ case '?':
+ default: error("%s", usage()); break;
+ }
+ }
+ if ( optind != argc ) error("%s",usage());
+ if ( trio_samples && unrelated_samples ) error("Expected only one of the -t/-u options\n");
+ if ( !trio_samples && !unrelated_samples ) error("Expected one of the -t/-u options\n");
+ if ( !args.prefix ) error("Expected the -p option\n");
+
+ int ret = bcf_hdr_set_samples(args.hdr, trio_samples ? trio_samples : unrelated_samples, 0);
+ if ( ret<0 ) error("Could not parse samples: %s\n", trio_samples ? trio_samples : unrelated_samples);
+ else if ( ret>0 ) error("%d-th sample not found: %s\n", ret,trio_samples ? trio_samples : unrelated_samples);
+
+ if ( trio_samples )
+ {
+ int i,n = 0;
+ char **list = hts_readlist(trio_samples, 0, &n);
+ if ( n!=3 ) error("Expected three sample names with -t\n");
+ args.imother = bcf_hdr_id2int(args.hdr, BCF_DT_SAMPLE, list[0]);
+ args.ifather = bcf_hdr_id2int(args.hdr, BCF_DT_SAMPLE, list[1]);
+ args.ichild = bcf_hdr_id2int(args.hdr, BCF_DT_SAMPLE, list[2]);
+ for (i=0; i<n; i++) free(list[i]);
+ free(list);
+ args.set_observed_prob = set_observed_prob_trio;
+ args.mode = C_TRIO;
+ init_hmm_trio(&args);
+ }
+ else
+ {
+ int i,n = 0;
+ char **list = hts_readlist(unrelated_samples, 0, &n);
+ if ( n!=2 ) error("Expected two sample names with -u\n");
+ args.isample = bcf_hdr_id2int(args.hdr, BCF_DT_SAMPLE, list[0]);
+ args.jsample = bcf_hdr_id2int(args.hdr, BCF_DT_SAMPLE, list[1]);
+ for (i=0; i<n; i++) free(list[i]);
+ free(list);
+ args.set_observed_prob = set_observed_prob_unrelated;
+ args.mode = C_UNRL;
+ init_hmm_unrelated(&args);
+ }
+ return 1;
+}
+
+static void init_hmm_trio(args_t *args)
+{
+ int i,j;
+ args->nstates = 8;
+ args->tprob = (double*) malloc(sizeof(double)*args->nstates*args->nstates);
+
+ for (i=0; i<args->nstates; i++)
+ for (j=0; j<args->nstates; j++) hap_switch[i][j] = 0;
+
+ hap_switch[TRIO_AD][TRIO_AC] = SW_FATHER;
+ hap_switch[TRIO_BC][TRIO_AC] = SW_MOTHER;
+ hap_switch[TRIO_BD][TRIO_AC] = SW_MOTHER | SW_FATHER;
+ hap_switch[TRIO_AC][TRIO_AD] = SW_FATHER;
+ hap_switch[TRIO_BC][TRIO_AD] = SW_MOTHER | SW_FATHER;
+ hap_switch[TRIO_BD][TRIO_AD] = SW_MOTHER;
+ hap_switch[TRIO_AC][TRIO_BC] = SW_MOTHER;
+ hap_switch[TRIO_AD][TRIO_BC] = SW_MOTHER | SW_FATHER;
+ hap_switch[TRIO_BD][TRIO_BC] = SW_FATHER;
+ hap_switch[TRIO_AC][TRIO_BD] = SW_MOTHER | SW_FATHER;
+ hap_switch[TRIO_AD][TRIO_BD] = SW_MOTHER;
+ hap_switch[TRIO_BC][TRIO_BD] = SW_FATHER;
+
+ hap_switch[TRIO_DA][TRIO_CA] = SW_FATHER;
+ hap_switch[TRIO_CB][TRIO_CA] = SW_MOTHER;
+ hap_switch[TRIO_DB][TRIO_CA] = SW_MOTHER | SW_FATHER;
+ hap_switch[TRIO_CA][TRIO_DA] = SW_FATHER;
+ hap_switch[TRIO_CB][TRIO_DA] = SW_MOTHER | SW_FATHER;
+ hap_switch[TRIO_DB][TRIO_DA] = SW_MOTHER;
+ hap_switch[TRIO_CA][TRIO_CB] = SW_MOTHER;
+ hap_switch[TRIO_DA][TRIO_CB] = SW_MOTHER | SW_FATHER;
+ hap_switch[TRIO_DB][TRIO_CB] = SW_FATHER;
+ hap_switch[TRIO_CA][TRIO_DB] = SW_MOTHER | SW_FATHER;
+ hap_switch[TRIO_DA][TRIO_DB] = SW_MOTHER;
+ hap_switch[TRIO_CB][TRIO_DB] = SW_FATHER;
+
+ for (i=0; i<args->nstates; i++)
+ {
+ for (j=0; j<args->nstates; j++)
+ {
+ if ( !hap_switch[i][j] ) MAT(args->tprob,args->nstates,i,j) = 0;
+ else
+ {
+ MAT(args->tprob,args->nstates,i,j) = 1;
+ if ( hap_switch[i][j] & SW_MOTHER ) MAT(args->tprob,args->nstates,i,j) *= args->pij;
+ if ( hap_switch[i][j] & SW_FATHER ) MAT(args->tprob,args->nstates,i,j) *= args->pij;
+ }
+ }
+ }
+ for (i=0; i<args->nstates; i++)
+ {
+ double sum = 0;
+ for (j=0; j<args->nstates; j++)
+ {
+ if ( i!=j ) sum += MAT(args->tprob,args->nstates,i,j);
+ }
+ MAT(args->tprob,args->nstates,i,i) = 1 - sum;
+ }
+
+ #if 0
+ for (i=0; i<args->nstates; i++)
+ {
+ for (j=0; j<args->nstates; j++)
+ fprintf(bcftools_stderr,"\t%d",hap_switch[j][i]);
+ fprintf(bcftools_stderr,"\n");
+ }
+ for (i=0; i<args->nstates; i++)
+ {
+ for (j=0; j<args->nstates; j++)
+ fprintf(bcftools_stderr,"\t%e",MAT(args->tprob,args->nstates,j,i));
+ fprintf(bcftools_stderr,"\n");
+ }
+ #endif
+
+ args->hmm = hmm_init(args->nstates, args->tprob, 10000);
+}
+static void init_hmm_unrelated(args_t *args)
+{
+ int i,j;
+ args->nstates = 7;
+ args->tprob = (double*) malloc(sizeof(double)*args->nstates*args->nstates);
+
+ for (i=0; i<args->nstates; i++)
+ {
+ for (j=0; j<args->nstates; j++)
+ MAT(args->tprob,args->nstates,i,j) = args->pij;
+ }
+ MAT(args->tprob,args->nstates,UNRL_0101,UNRL_xxxx) = args->pij*args->pij;
+ MAT(args->tprob,args->nstates,UNRL_0110,UNRL_xxxx) = args->pij*args->pij;
+ MAT(args->tprob,args->nstates,UNRL_x0x0,UNRL_0x0x) = args->pij*args->pij;
+ MAT(args->tprob,args->nstates,UNRL_0110,UNRL_0x0x) = args->pij*args->pij;
+ MAT(args->tprob,args->nstates,UNRL_x00x,UNRL_0xx0) = args->pij*args->pij;
+ MAT(args->tprob,args->nstates,UNRL_0101,UNRL_0xx0) = args->pij*args->pij;
+ MAT(args->tprob,args->nstates,UNRL_0101,UNRL_x00x) = args->pij*args->pij;
+ MAT(args->tprob,args->nstates,UNRL_0110,UNRL_x0x0) = args->pij*args->pij;
+ MAT(args->tprob,args->nstates,UNRL_0110,UNRL_0101) = args->pij*args->pij;
+
+ for (i=0; i<args->nstates; i++)
+ {
+ for (j=i+1; j<args->nstates; j++)
+ MAT(args->tprob,args->nstates,i,j) = MAT(args->tprob,args->nstates,j,i);
+ }
+ for (i=0; i<args->nstates; i++)
+ {
+ double sum = 0;
+ for (j=0; j<args->nstates; j++)
+ if ( i!=j ) sum += MAT(args->tprob,args->nstates,i,j);
+ MAT(args->tprob,args->nstates,i,i) = 1 - sum;
+ }
+
+ #if 0
+ for (i=0; i<args->nstates; i++)
+ {
+ for (j=0; j<args->nstates; j++)
+ fprintf(bcftools_stderr,"\t%e",MAT(args->tprob,args->nstates,j,i));
+ fprintf(bcftools_stderr,"\n");
+ }
+ #endif
+
+ args->hmm = hmm_init(args->nstates, args->tprob, 10000);
+}
+static inline double prob_shared(float af, int a, int b)
+{
+ return a==b ? 1-args.pgt_err : args.pgt_err;
+}
+static inline double prob_not_shared(float af, int a, int b)
+{
+ if ( a!=b ) return af*(1-af);
+ else if ( a==0 ) return (1-af)*(1-af);
+ else return af*af;
+}
+static void set_observed_prob_unrelated(bcf1_t *rec)
+{
+ float af = 0.5; // alternate allele frequency
+
+ int ngt = bcf_get_genotypes(args.hdr, rec, &args.gt_arr, &args.ngt_arr);
+ if ( ngt<0 ) return;
+ if ( ngt!=4 ) return; // not diploid in both samples, e.g. haploid chrX calls
+
+ int32_t a,b,c,d;
+ a = args.gt_arr[2*args.isample];
+ b = args.gt_arr[2*args.isample+1];
+ c = args.gt_arr[2*args.jsample];
+ d = args.gt_arr[2*args.jsample+1];
+ if ( bcf_gt_is_missing(a) || bcf_gt_is_missing(b) ) return;
+ if ( bcf_gt_is_missing(c) || bcf_gt_is_missing(d) ) return;
+ if ( !bcf_gt_is_phased(a) && !bcf_gt_is_phased(b) ) return; // only the second allele should be set when phased
+ if ( !bcf_gt_is_phased(c) && !bcf_gt_is_phased(d) ) return;
+ a = bcf_gt_allele(a);
+ b = bcf_gt_allele(b);
+ c = bcf_gt_allele(c);
+ d = bcf_gt_allele(d);
+
+ int m = args.msites;
+ args.nsites++;
+ hts_expand(uint32_t,args.nsites,args.msites,args.sites);
+ if ( m!=args.msites ) args.eprob = (double*) realloc(args.eprob, sizeof(double)*args.msites*args.nstates);
+
+ args.sites[args.nsites-1] = rec->pos;
+ double *prob = args.eprob + args.nstates*(args.nsites-1);
+ prob[UNRL_xxxx] = prob_not_shared(af,a,c) * prob_not_shared(af,a,d) * prob_not_shared(af,b,c) * prob_not_shared(af,b,d);
+ prob[UNRL_0x0x] = prob_shared(af,a,c) * prob_not_shared(af,b,d);
+ prob[UNRL_0xx0] = prob_shared(af,a,d) * prob_not_shared(af,b,c);
+ prob[UNRL_x00x] = prob_shared(af,b,c) * prob_not_shared(af,a,d);
+ prob[UNRL_x0x0] = prob_shared(af,b,d) * prob_not_shared(af,a,c);
+ prob[UNRL_0101] = prob_shared(af,a,c) * prob_shared(af,b,d);
+ prob[UNRL_0110] = prob_shared(af,a,d) * prob_shared(af,b,c);
+
+#if 0
+ static int x = 0;
+ if ( !x++)
+ {
+ fprintf(bcftools_stdout, "p(0==0) .. %f\n", prob_shared(af,0,0));
+ fprintf(bcftools_stdout, "p(0!=0) .. %f\n", prob_not_shared(af,0,0));
+ fprintf(bcftools_stdout, "p(0==1) .. %f\n", prob_shared(af,0,1));
+ fprintf(bcftools_stdout, "p(0!=1) .. %f\n", prob_not_shared(af,0,1));
+ }
+ fprintf(bcftools_stdout, "%d|%d %d|%d x:%f 11:%f 12:%f 21:%f 22:%f 11,22:%f 12,21:%f %d\n", a,b,c,d,
+ prob[UNRL_xxxx], prob[UNRL_0x0x], prob[UNRL_0xx0], prob[UNRL_x00x], prob[UNRL_x0x0], prob[UNRL_0101], prob[UNRL_0110], rec->pos+1);
+#endif
+}
+static void set_observed_prob_trio(bcf1_t *rec)
+{
+ int ngt = bcf_get_genotypes(args.hdr, rec, &args.gt_arr, &args.ngt_arr);
+ if ( ngt<0 ) return;
+ if ( ngt!=6 ) return; // not diploid in all three samples, e.g. haploid chrX calls
+
+ int32_t a,b,c,d,e,f;
+ a = args.gt_arr[2*args.imother];
+ b = args.gt_arr[2*args.imother+1];
+ c = args.gt_arr[2*args.ifather];
+ d = args.gt_arr[2*args.ifather+1];
+ e = args.gt_arr[2*args.ichild];
+ f = args.gt_arr[2*args.ichild+1];
+ if ( bcf_gt_is_missing(a) || bcf_gt_is_missing(b) ) return;
+ if ( bcf_gt_is_missing(c) || bcf_gt_is_missing(d) ) return;
+ if ( bcf_gt_is_missing(e) || bcf_gt_is_missing(f) ) return;
+ if ( !bcf_gt_is_phased(a) && !bcf_gt_is_phased(b) ) return; // only the second allele should be set when phased
+ if ( !bcf_gt_is_phased(c) && !bcf_gt_is_phased(d) ) return;
+ if ( !bcf_gt_is_phased(e) && !bcf_gt_is_phased(f) ) return;
+ a = bcf_gt_allele(a);
+ b = bcf_gt_allele(b);
+ c = bcf_gt_allele(c);
+ d = bcf_gt_allele(d);
+ e = bcf_gt_allele(e);
+ f = bcf_gt_allele(f);
+
+ int mother = (1<<a) | (1<<b);
+ int father = (1<<c) | (1<<d);
+ int child = (1<<e) | (1<<f);
+ if ( !(mother&child) || !(father&child) ) return; // Mendelian-inconsistent site, skip
+
+ if ( a!=b ) args.nhet_mother++;
+ if ( c!=d ) args.nhet_father++;
+
+ int m = args.msites;
+ args.nsites++;
+ hts_expand(uint32_t,args.nsites,args.msites,args.sites);
+ if ( m!=args.msites ) args.eprob = (double*) realloc(args.eprob, sizeof(double)*args.msites*args.nstates);
+
+ args.sites[args.nsites-1] = rec->pos;
+ double *prob = args.eprob + args.nstates*(args.nsites-1);
+ prob[TRIO_AC] = prob_shared(0,e,a) * prob_shared(0,f,c);
+ prob[TRIO_AD] = prob_shared(0,e,a) * prob_shared(0,f,d);
+ prob[TRIO_BC] = prob_shared(0,e,b) * prob_shared(0,f,c);
+ prob[TRIO_BD] = prob_shared(0,e,b) * prob_shared(0,f,d);
+ prob[TRIO_CA] = prob_shared(0,e,c) * prob_shared(0,f,a);
+ prob[TRIO_DA] = prob_shared(0,e,d) * prob_shared(0,f,a);
+ prob[TRIO_CB] = prob_shared(0,e,c) * prob_shared(0,f,b);
+ prob[TRIO_DB] = prob_shared(0,e,d) * prob_shared(0,f,b);
+}
+
+void flush_viterbi(args_t *args)
+{
+ const char *s1, *s2, *s3 = NULL;
+ if ( args->mode==C_UNRL )
+ {
+ s1 = bcf_hdr_int2id(args->hdr,BCF_DT_SAMPLE,args->isample);
+ s2 = bcf_hdr_int2id(args->hdr,BCF_DT_SAMPLE,args->jsample);
+ }
+ else if ( args->mode==C_TRIO )
+ {
+ s1 = bcf_hdr_int2id(args->hdr,BCF_DT_SAMPLE,args->imother);
+ s3 = bcf_hdr_int2id(args->hdr,BCF_DT_SAMPLE,args->ifather);
+ s2 = bcf_hdr_int2id(args->hdr,BCF_DT_SAMPLE,args->ichild);
+ }
+ else abort();
+
+ if ( !args->fp )
+ {
+ kstring_t str = {0,0,0};
+ kputs(args->prefix, &str);
+ kputs(".dat", &str);
+ args->fp = fopen(str.s,"w");
+ if ( !args->fp ) error("%s: %s\n", str.s,strerror(errno));
+ free(str.s);
+ fprintf(args->fp,"# SG, shared segment\t[2]Chromosome\t[3]Start\t[4]End\t[5]%s:1\t[6]%s:2\n",s2,s2);
+ fprintf(args->fp,"# SW, number of switches\t[2]Sample\t[3]Chromosome\t[4]nHets\t[5]nSwitches\t[6]switch rate\n");
+ }
+
+ hmm_run_viterbi(args->hmm,args->nsites,args->eprob,args->sites);
+ uint8_t *vpath = hmm_get_viterbi_path(args->hmm);
+ int i, iprev = -1, prev_state = -1, nstates = hmm_get_nstates(args->hmm);
+ int nswitch_mother = 0, nswitch_father = 0;
+ for (i=0; i<args->nsites; i++)
+ {
+ int state = vpath[i*nstates];
+ if ( state!=prev_state || i+1==args->nsites )
+ {
+ uint32_t start = iprev>=0 ? args->sites[iprev]+1 : 1, end = i>0 ? args->sites[i-1] : 1;
+ const char *chr = bcf_hdr_id2name(args->hdr,args->prev_rid);
+ if ( args->mode==C_UNRL )
+ {
+ switch (prev_state)
+ {
+ case UNRL_0x0x:
+ fprintf(args->fp,"SG\t%s\t%d\t%d\t%s:1\t-\n", chr,start,end,s1); break;
+ case UNRL_0xx0:
+ fprintf(args->fp,"SG\t%s\t%d\t%d\t-\t%s:1\n", chr,start,end,s1); break;
+ case UNRL_x00x:
+ fprintf(args->fp,"SG\t%s\t%d\t%d\t%s:2\t-\n", chr,start,end,s1); break;
+ case UNRL_x0x0:
+ fprintf(args->fp,"SG\t%s\t%d\t%d\t-\t%s:2\n", chr,start,end,s1); break;
+ case UNRL_0101:
+ fprintf(args->fp,"SG\t%s\t%d\t%d\t%s:1\t%s:2\n", chr,start,end,s1,s1); break;
+ case UNRL_0110:
+ fprintf(args->fp,"SG\t%s\t%d\t%d\t%s:2\t%s:1\n", chr,start,end,s1,s1); break;
+ }
+ }
+ else if ( args->mode==C_TRIO )
+ {
+ switch (prev_state)
+ {
+ case TRIO_AC:
+ fprintf(args->fp,"SG\t%s\t%d\t%d\t%s:1\t%s:1\n", chr,start,end,s1,s3); break;
+ case TRIO_AD:
+ fprintf(args->fp,"SG\t%s\t%d\t%d\t%s:1\t%s:2\n", chr,start,end,s1,s3); break;
+ case TRIO_BC:
+ fprintf(args->fp,"SG\t%s\t%d\t%d\t%s:2\t%s:1\n", chr,start,end,s1,s3); break;
+ case TRIO_BD:
+ fprintf(args->fp,"SG\t%s\t%d\t%d\t%s:2\t%s:2\n", chr,start,end,s1,s3); break;
+ case TRIO_CA:
+ fprintf(args->fp,"SG\t%s\t%d\t%d\t%s:1\t%s:1\n", chr,start,end,s3,s1); break;
+ case TRIO_DA:
+ fprintf(args->fp,"SG\t%s\t%d\t%d\t%s:2\t%s:1\n", chr,start,end,s3,s1); break;
+ case TRIO_CB:
+ fprintf(args->fp,"SG\t%s\t%d\t%d\t%s:1\t%s:2\n", chr,start,end,s3,s1); break;
+ case TRIO_DB:
+ fprintf(args->fp,"SG\t%s\t%d\t%d\t%s:2\t%s:2\n", chr,start,end,s3,s1); break;
+ }
+ if ( hap_switch[state][prev_state] & SW_MOTHER ) nswitch_mother++;
+ if ( hap_switch[state][prev_state] & SW_FATHER ) nswitch_father++;
+ }
+ iprev = i-1;
+ }
+ prev_state = state;
+ }
+ if ( args->mode==C_TRIO ) // switch rates are defined for trios only; s3 is NULL otherwise
+ {
+ float mrate = args->nhet_mother>1 ? (float)nswitch_mother/(args->nhet_mother-1) : 0;
+ float frate = args->nhet_father>1 ? (float)nswitch_father/(args->nhet_father-1) : 0;
+ fprintf(args->fp,"SW\t%s\t%s\t%d\t%d\t%f\n", s1,bcf_hdr_id2name(args->hdr,args->prev_rid),args->nhet_mother,nswitch_mother,mrate);
+ fprintf(args->fp,"SW\t%s\t%s\t%d\t%d\t%f\n", s3,bcf_hdr_id2name(args->hdr,args->prev_rid),args->nhet_father,nswitch_father,frate);
+ }
+ args->nsites = 0;
+ args->nhet_father = args->nhet_mother = 0;
+}
+
+bcf1_t *process(bcf1_t *rec)
+{
+ if ( args.prev_rid==-1 ) args.prev_rid = rec->rid;
+ if ( args.prev_rid!=rec->rid ) flush_viterbi(&args);
+ args.prev_rid = rec->rid;
+ args.set_observed_prob(rec);
+ return NULL;
+}
+
+void destroy(void)
+{
+ flush_viterbi(&args);
+ fclose(args.fp);
+
+ free(args.gt_arr);
+ free(args.tprob);
+ free(args.sites);
+ free(args.eprob);
+ hmm_destroy(args.hmm);
+}
+
+
+
--- /dev/null
+/* The MIT License
+
+ Copyright (c) 2018 Genome Research Ltd.
+
+ Author: Petr Danecek <pd3@sanger.ac.uk>
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in
+ all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ THE SOFTWARE.
+
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <getopt.h>
+#include <errno.h>
+#include <unistd.h> // for isatty
+#include <htslib/hts.h>
+#include <htslib/vcf.h>
+#include <htslib/kstring.h>
+#include <htslib/kseq.h>
+#include <htslib/synced_bcf_reader.h>
+#include "bcftools.h"
+#include "filter.h"
+
+
+// Logic of the filters: include or exclude sites which match the filters?
+#define FLT_INCLUDE 1
+#define FLT_EXCLUDE 2
+
+typedef struct
+{
+ int argc, filter_logic, regions_is_file, targets_is_file, output_type;
+ char **argv, *output_fname, *fname, *regions, *targets, *filter_str;
+ char *bg_samples_str, *novel_samples_str;
+ int *bg_smpl, *novel_smpl, nbg_smpl, nnovel_smpl;
+ filter_t *filter;
+ bcf_srs_t *sr;
+ bcf_hdr_t *hdr, *hdr_out;
+ htsFile *out_fh;
+ int32_t *gts;
+ int mgts;
+ uint32_t *bg_gts;
+ int nbg_gts, mbg_gts, ntotal, nskipped, ntested, nnovel_al, nnovel_gt;
+ kstring_t novel_als_smpl, novel_gts_smpl;
+}
+args_t;
+
+args_t args;
+
+const char *about(void)
+{
+ return "Find novel alleles and genotypes in two groups of samples.\n";
+}
+
+static const char *usage_text(void)
+{
+ return
+ "\n"
+ "About: Finds novel alleles and genotypes in two groups of samples. Adds\n"
+ " an annotation which lists samples with a novel allele (INFO/NOVELAL)\n"
+ " or a novel genotype (INFO/NOVELGT)\n"
+ "Usage: bcftools +contrast [Plugin Options]\n"
+ "Plugin options:\n"
+ " -0, --bg-samples <list> list of background samples\n"
+ " -1, --novel-samples <list> list of samples where a novel allele or genotype is expected\n"
+ " -e, --exclude EXPR exclude sites and samples for which the expression is true\n"
+ " -i, --include EXPR include sites and samples for which the expression is true\n"
+ " -o, --output FILE output file name [stdout]\n"
+ " -O, --output-type <b|u|z|v> b: compressed BCF, u: uncompressed BCF, z: compressed VCF, v: uncompressed VCF [v]\n"
+ " -r, --regions REG restrict to comma-separated list of regions\n"
+ " -R, --regions-file FILE restrict to regions listed in a file\n"
+ " -t, --targets REG similar to -r but streams rather than index-jumps\n"
+ " -T, --targets-file FILE similar to -R but streams rather than index-jumps\n"
+ "\n"
+ "Example:\n"
+ " # Test if any of the samples a,b is different from the samples c,d,e\n"
+ " bcftools +contrast -0 c,d,e -1 a,b file.bcf\n"
+ "\n";
+}
+
+static void init_data(args_t *args)
+{
+ args->sr = bcf_sr_init();
+ if ( args->regions )
+ {
+ args->sr->require_index = 1;
+ if ( bcf_sr_set_regions(args->sr, args->regions, args->regions_is_file)<0 ) error("Failed to read the regions: %s\n",args->regions);
+ }
+ if ( args->targets && bcf_sr_set_targets(args->sr, args->targets, args->targets_is_file, 0)<0 ) error("Failed to read the targets: %s\n",args->targets);
+ if ( !bcf_sr_add_reader(args->sr,args->fname) ) error("Error: %s\n", bcf_sr_strerror(args->sr->errnum));
+ args->hdr = bcf_sr_get_header(args->sr,0);
+ args->hdr_out = bcf_hdr_dup(args->hdr);
+ bcf_hdr_append(args->hdr_out, "##INFO=<ID=NOVELAL,Number=.,Type=String,Description=\"List of samples with novel alleles\">");
+ bcf_hdr_append(args->hdr_out, "##INFO=<ID=NOVELGT,Number=.,Type=String,Description=\"List of samples with novel genotypes. Note that only samples w/o a novel allele are listed.\">");
+
+ if ( args->filter_str )
+ args->filter = filter_init(args->hdr, args->filter_str);
+
+ int i;
+ char **smpl = hts_readlist(args->bg_samples_str, 0, &args->nbg_smpl);
+ args->bg_smpl = (int*) malloc(sizeof(int)*args->nbg_smpl);
+ for (i=0; i<args->nbg_smpl; i++)
+ {
+ args->bg_smpl[i] = bcf_hdr_id2int(args->hdr, BCF_DT_SAMPLE, smpl[i]);
+ if ( args->bg_smpl[i]<0 ) error("The sample not present in the VCF: \"%s\"\n", smpl[i]);
+ free(smpl[i]);
+ }
+ free(smpl);
+
+ smpl = hts_readlist(args->novel_samples_str, 0, &args->nnovel_smpl);
+ args->novel_smpl = (int*) malloc(sizeof(int)*args->nnovel_smpl);
+ for (i=0; i<args->nnovel_smpl; i++)
+ {
+ args->novel_smpl[i] = bcf_hdr_id2int(args->hdr, BCF_DT_SAMPLE, smpl[i]);
+ if ( args->novel_smpl[i]<0 ) error("The sample not present in the VCF: \"%s\"\n", smpl[i]);
+ free(smpl[i]);
+ }
+ free(smpl);
+
+ args->out_fh = hts_open(args->output_fname,hts_bcf_wmode(args->output_type));
+ if ( args->out_fh == NULL ) error("Can't write to \"%s\": %s\n", args->output_fname, strerror(errno));
+ bcf_hdr_write(args->out_fh, args->hdr_out);
+}
+static void destroy_data(args_t *args)
+{
+ bcf_hdr_destroy(args->hdr_out);
+ hts_close(args->out_fh);
+ free(args->novel_als_smpl.s);
+ free(args->novel_gts_smpl.s);
+ free(args->gts);
+ free(args->bg_gts);
+ free(args->bg_smpl);
+ free(args->novel_smpl);
+ if ( args->filter ) filter_destroy(args->filter);
+ bcf_sr_destroy(args->sr);
+ free(args);
+}
+static inline int binary_search(uint32_t val, uint32_t *dat, int ndat)
+{
+ int i = -1, imin = 0, imax = ndat - 1;
+ while ( imin<=imax )
+ {
+ i = (imin+imax)/2;
+ if ( dat[i] < val ) imin = i + 1;
+ else if ( dat[i] > val ) imax = i - 1;
+ else return 1;
+ }
+ return 0;
+}
+static inline void binary_insert(uint32_t val, uint32_t **dat, int *ndat, int *mdat)
+{
+ int i = -1, imin = 0, imax = *ndat - 1;
+ while ( imin<=imax )
+ {
+ i = (imin+imax)/2;
+ if ( (*dat)[i] < val ) imin = i + 1;
+ else if ( (*dat)[i] > val ) imax = i - 1;
+ else return;
+ }
+ while ( i>=0 && (*dat)[i]>val ) i--;
+
+ (*ndat)++;
+ hts_expand(uint32_t, (*ndat), (*mdat), (*dat));
+
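+ // Shift the tail [i+1, old ndat) one slot to the right; starting at i+1 avoids reading dat[-1] when i is -1
+ if ( *ndat > 1 )
+ memmove(*dat + i + 2, *dat + i + 1, sizeof(uint32_t)*(*ndat - i - 2));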
+
+ (*dat)[i+1] = val;
+}
+static int process_record(args_t *args, bcf1_t *rec)
+{
+ args->ntotal++;
+
+ static int warned = 0;
+ int ngts = bcf_get_genotypes(args->hdr, rec, &args->gts, &args->mgts);
+ ngts /= rec->n_sample;
+ if ( ngts>2 ) error("todo: ploidy=%d\n", ngts);
+
+ args->nbg_gts = 0;
+ uint32_t bg_als = 0;
+ int i,j;
+ for (i=0; i<args->nbg_smpl; i++)
+ {
+ uint32_t gt = 0;
+ int32_t *ptr = args->gts + args->bg_smpl[i]*ngts;
+ for (j=0; j<ngts; j++)
+ {
+ if ( ptr[j]==bcf_int32_vector_end ) break;
+ if ( bcf_gt_is_missing(ptr[j]) ) continue;
+ int ial = bcf_gt_allele(ptr[j]);
+ if ( ial > 31 )
+ {
+ if ( !warned )
+ {
+ fprintf(stderr,"Too many alleles (>32) at %s:%d, skipping. (todo?)\n", bcf_seqname(args->hdr,rec),rec->pos+1);
+ warned = 1;
+ }
+ args->nskipped++;
+ return -1;
+ }
+ bg_als |= 1<<ial;
+ gt |= 1<<ial;
+ }
+ binary_insert(gt, &args->bg_gts, &args->nbg_gts, &args->mbg_gts);
+ }
+ if ( !bg_als )
+ {
+ // all are missing
+ args->nskipped++;
+ return -1;
+ }
+
+ args->novel_als_smpl.l = 0;
+ args->novel_gts_smpl.l = 0;
+
+ int has_gt = 0;
+ for (i=0; i<args->nnovel_smpl; i++)
+ {
+ int novel_al = 0;
+ uint32_t gt = 0;
+ int32_t *ptr = args->gts + args->novel_smpl[i]*ngts;
+ for (j=0; j<ngts; j++)
+ {
+ if ( ptr[j]==bcf_int32_vector_end ) break;
+ if ( bcf_gt_is_missing(ptr[j]) ) continue;
+ int ial = bcf_gt_allele(ptr[j]);
+ if ( ial > 31 )
+ {
+ if ( !warned )
+ {
+ fprintf(stderr,"Too many alleles (>32) at %s:%d, skipping. (todo?)\n", bcf_seqname(args->hdr,rec),rec->pos+1);
+ warned = 1;
+ }
+ args->nskipped++;
+ return -1;
+ }
+ if ( !(bg_als & (1<<ial)) ) novel_al = 1;
+ gt |= 1<<ial;
+ }
+ if ( !gt ) continue;
+ has_gt = 1;
+
+ char *smpl = args->hdr->samples[ args->novel_smpl[i] ];
+ if ( novel_al )
+ {
+ if ( args->novel_als_smpl.l ) kputc(',', &args->novel_als_smpl);
+ kputs(smpl, &args->novel_als_smpl);
+ }
+ else if ( !binary_search(gt, args->bg_gts, args->nbg_gts) )
+ {
+ if ( args->novel_gts_smpl.l ) kputc(',', &args->novel_gts_smpl);
+ kputs(smpl, &args->novel_gts_smpl);
+ }
+ }
+ if ( !has_gt )
+ {
+ // all are missing
+ args->nskipped++;
+ return -1;
+ }
+ if ( args->novel_als_smpl.l )
+ {
+ bcf_update_info_string(args->hdr_out, rec, "NOVELAL", args->novel_als_smpl.s);
+ args->nnovel_al++;
+ }
+ if ( args->novel_gts_smpl.l )
+ {
+ bcf_update_info_string(args->hdr_out, rec, "NOVELGT", args->novel_gts_smpl.s);
+ args->nnovel_gt++;
+ }
+ args->ntested++;
+ return 0;
+}
+
+int run(int argc, char **argv)
+{
+ args_t *args = (args_t*) calloc(1,sizeof(args_t));
+ args->argc = argc; args->argv = argv;
+ args->output_fname = "-";
+ static struct option loptions[] =
+ {
+ {"bg-samples",required_argument,0,'0'},
+ {"novel-samples",required_argument,0,'1'},
+ {"include",required_argument,0,'i'},
+ {"exclude",required_argument,0,'e'},
+ {"output",required_argument,NULL,'o'},
+ {"output-type",required_argument,NULL,'O'},
+ {"regions",1,0,'r'},
+ {"regions-file",1,0,'R'},
+ {"targets",1,0,'t'},
+ {"targets-file",1,0,'T'},
+ {NULL,0,NULL,0}
+ };
+ int c;
+ while ((c = getopt_long(argc, argv, "O:o:i:e:r:R:t:T:0:1:",loptions,NULL)) >= 0)
+ {
+ switch (c)
+ {
+ case '0': args->bg_samples_str = optarg; break;
+ case '1': args->novel_samples_str = optarg; break;
+ case 'e': args->filter_str = optarg; args->filter_logic |= FLT_EXCLUDE; break;
+ case 'i': args->filter_str = optarg; args->filter_logic |= FLT_INCLUDE; break;
+ case 't': args->targets = optarg; break;
+ case 'T': args->targets = optarg; args->targets_is_file = 1; break;
+ case 'r': args->regions = optarg; break;
+ case 'R': args->regions = optarg; args->regions_is_file = 1; break;
+ case 'o': args->output_fname = optarg; break;
+ case 'O':
+ switch (optarg[0]) {
+ case 'b': args->output_type = FT_BCF_GZ; break;
+ case 'u': args->output_type = FT_BCF; break;
+ case 'z': args->output_type = FT_VCF_GZ; break;
+ case 'v': args->output_type = FT_VCF; break;
+ default: error("The output type \"%s\" not recognised\n", optarg);
+ };
+ break;
+ case 'h':
+ case '?':
+ default: error("%s", usage_text()); break;
+ }
+ }
+ if ( optind==argc )
+ {
+ if ( !isatty(fileno((FILE *)stdin)) ) args->fname = "-"; // reading from stdin
+ else { error("%s",usage_text()); }
+ }
+ else if ( optind+1!=argc ) error("%s",usage_text());
+ else args->fname = argv[optind];
+
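+ // Both sample lists are required: init_data() passes them to hts_readlist() unchecked
+ if ( !args->bg_samples_str || !args->novel_samples_str ) error("%s",usage_text());
+ init_data(args);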
+
+ while ( bcf_sr_next_line(args->sr) )
+ {
+ bcf1_t *rec = bcf_sr_get_line(args->sr,0);
+ if ( args->filter )
+ {
+ int pass = filter_test(args->filter, rec, NULL);
+ if ( args->filter_logic & FLT_EXCLUDE ) pass = pass ? 0 : 1;
+ if ( !pass ) continue;
+ }
+ process_record(args, rec);
+ bcf_write(args->out_fh, args->hdr_out, rec);
+ }
+
+ fprintf(stderr,"Total/processed/skipped/novel_allele/novel_gt:\t%d\t%d\t%d\t%d\t%d\n", args->ntotal, args->ntested, args->nskipped, args->nnovel_al, args->nnovel_gt);
+ destroy_data(args);
+
+ return 0;
+}
--- /dev/null
+#include "bcftools.pysam.h"
+
+/* The MIT License
+
+ Copyright (c) 2018 Genome Research Ltd.
+
+ Author: Petr Danecek <pd3@sanger.ac.uk>
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in
+ all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ THE SOFTWARE.
+
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <getopt.h>
+#include <errno.h>
+#include <string.h> // for strerror, memmove
+#include <unistd.h> // for isatty
+#include <htslib/hts.h>
+#include <htslib/vcf.h>
+#include <htslib/kstring.h>
+#include <htslib/kseq.h>
+#include <htslib/synced_bcf_reader.h>
+#include "bcftools.h"
+#include "filter.h"
+
+
+// Logic of the filters: include or exclude sites which match the filters?
+#define FLT_INCLUDE 1
+#define FLT_EXCLUDE 2
+
+typedef struct
+{
+ int argc, filter_logic, regions_is_file, targets_is_file, output_type;
+ char **argv, *output_fname, *fname, *regions, *targets, *filter_str;
+ char *bg_samples_str, *novel_samples_str;
+ int *bg_smpl, *novel_smpl, nbg_smpl, nnovel_smpl;
+ filter_t *filter;
+ bcf_srs_t *sr;
+ bcf_hdr_t *hdr, *hdr_out;
+ htsFile *out_fh;
+ int32_t *gts;
+ int mgts;
+ uint32_t *bg_gts;
+ int nbg_gts, mbg_gts, ntotal, nskipped, ntested, nnovel_al, nnovel_gt;
+ kstring_t novel_als_smpl, novel_gts_smpl;
+}
+args_t;
+
+args_t args;
+
+const char *about(void)
+{
+ return "Find novel alleles and genotypes in two groups of samples.\n";
+}
+
+static const char *usage_text(void)
+{
+ return
+ "\n"
+ "About: Finds novel alleles and genotypes in two groups of samples. Adds\n"
+ " an annotation which lists samples with a novel allele (INFO/NOVELAL)\n"
+ " or a novel genotype (INFO/NOVELGT)\n"
+ "Usage: bcftools +contrast [Plugin Options]\n"
+ "Plugin options:\n"
+ " -0, --bg-samples <list> list of background samples\n"
+ " -1, --novel-samples <list> list of samples where a novel allele or genotype is expected\n"
+ " -e, --exclude EXPR exclude sites and samples for which the expression is true\n"
+ " -i, --include EXPR include sites and samples for which the expression is true\n"
+ " -o, --output FILE output file name [stdout]\n"
+ " -O, --output-type <b|u|z|v> b: compressed BCF, u: uncompressed BCF, z: compressed VCF, v: uncompressed VCF [v]\n"
+ " -r, --regions REG restrict to comma-separated list of regions\n"
+ " -R, --regions-file FILE restrict to regions listed in a file\n"
+ " -t, --targets REG similar to -r but streams rather than index-jumps\n"
+ " -T, --targets-file FILE similar to -R but streams rather than index-jumps\n"
+ "\n"
+ "Example:\n"
+ " # Test if any of the samples a,b is different from the samples c,d,e\n"
+ " bcftools +contrast -0 c,d,e -1 a,b file.bcf\n"
+ "\n";
+}
+
+static void init_data(args_t *args)
+{
+ args->sr = bcf_sr_init();
+ if ( args->regions )
+ {
+ args->sr->require_index = 1;
+ if ( bcf_sr_set_regions(args->sr, args->regions, args->regions_is_file)<0 ) error("Failed to read the regions: %s\n",args->regions);
+ }
+ if ( args->targets && bcf_sr_set_targets(args->sr, args->targets, args->targets_is_file, 0)<0 ) error("Failed to read the targets: %s\n",args->targets);
+ if ( !bcf_sr_add_reader(args->sr,args->fname) ) error("Error: %s\n", bcf_sr_strerror(args->sr->errnum));
+ args->hdr = bcf_sr_get_header(args->sr,0);
+ args->hdr_out = bcf_hdr_dup(args->hdr);
+ bcf_hdr_append(args->hdr_out, "##INFO=<ID=NOVELAL,Number=.,Type=String,Description=\"List of samples with novel alleles\">");
+ bcf_hdr_append(args->hdr_out, "##INFO=<ID=NOVELGT,Number=.,Type=String,Description=\"List of samples with novel genotypes. Note that only samples w/o a novel allele are listed.\">");
+
+ if ( args->filter_str )
+ args->filter = filter_init(args->hdr, args->filter_str);
+
+ int i;
+ char **smpl = hts_readlist(args->bg_samples_str, 0, &args->nbg_smpl);
+ args->bg_smpl = (int*) malloc(sizeof(int)*args->nbg_smpl);
+ for (i=0; i<args->nbg_smpl; i++)
+ {
+ args->bg_smpl[i] = bcf_hdr_id2int(args->hdr, BCF_DT_SAMPLE, smpl[i]);
+ if ( args->bg_smpl[i]<0 ) error("The sample not present in the VCF: \"%s\"\n", smpl[i]);
+ free(smpl[i]);
+ }
+ free(smpl);
+
+ smpl = hts_readlist(args->novel_samples_str, 0, &args->nnovel_smpl);
+ args->novel_smpl = (int*) malloc(sizeof(int)*args->nnovel_smpl);
+ for (i=0; i<args->nnovel_smpl; i++)
+ {
+ args->novel_smpl[i] = bcf_hdr_id2int(args->hdr, BCF_DT_SAMPLE, smpl[i]);
+ if ( args->novel_smpl[i]<0 ) error("The sample not present in the VCF: \"%s\"\n", smpl[i]);
+ free(smpl[i]);
+ }
+ free(smpl);
+
+ args->out_fh = hts_open(args->output_fname,hts_bcf_wmode(args->output_type));
+ if ( args->out_fh == NULL ) error("Can't write to \"%s\": %s\n", args->output_fname, strerror(errno));
+ bcf_hdr_write(args->out_fh, args->hdr_out);
+}
+static void destroy_data(args_t *args)
+{
+ bcf_hdr_destroy(args->hdr_out);
+ hts_close(args->out_fh);
+ free(args->novel_als_smpl.s);
+ free(args->novel_gts_smpl.s);
+ free(args->gts);
+ free(args->bg_gts);
+ free(args->bg_smpl);
+ free(args->novel_smpl);
+ if ( args->filter ) filter_destroy(args->filter);
+ bcf_sr_destroy(args->sr);
+ free(args);
+}
+static inline int binary_search(uint32_t val, uint32_t *dat, int ndat)
+{
+ int i = -1, imin = 0, imax = ndat - 1;
+ while ( imin<=imax )
+ {
+ i = (imin+imax)/2;
+ if ( dat[i] < val ) imin = i + 1;
+ else if ( dat[i] > val ) imax = i - 1;
+ else return 1;
+ }
+ return 0;
+}
+static inline void binary_insert(uint32_t val, uint32_t **dat, int *ndat, int *mdat)
+{
+ int i = -1, imin = 0, imax = *ndat - 1;
+ while ( imin<=imax )
+ {
+ i = (imin+imax)/2;
+ if ( (*dat)[i] < val ) imin = i + 1;
+ else if ( (*dat)[i] > val ) imax = i - 1;
+ else return;
+ }
+ while ( i>=0 && (*dat)[i]>val ) i--;
+
+ (*ndat)++;
+ hts_expand(uint32_t, (*ndat), (*mdat), (*dat));
+
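+ // Shift the tail [i+1, old ndat) one slot to the right; starting at i+1 avoids reading dat[-1] when i is -1
+ if ( *ndat > 1 )
+ memmove(*dat + i + 2, *dat + i + 1, sizeof(uint32_t)*(*ndat - i - 2));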
+
+ (*dat)[i+1] = val;
+}
+static int process_record(args_t *args, bcf1_t *rec)
+{
+ args->ntotal++;
+
+ static int warned = 0;
+ int ngts = bcf_get_genotypes(args->hdr, rec, &args->gts, &args->mgts);
+ ngts /= rec->n_sample;
+ if ( ngts>2 ) error("todo: ploidy=%d\n", ngts);
+
+ args->nbg_gts = 0;
+ uint32_t bg_als = 0;
+ int i,j;
+ for (i=0; i<args->nbg_smpl; i++)
+ {
+ uint32_t gt = 0;
+ int32_t *ptr = args->gts + args->bg_smpl[i]*ngts;
+ for (j=0; j<ngts; j++)
+ {
+ if ( ptr[j]==bcf_int32_vector_end ) break;
+ if ( bcf_gt_is_missing(ptr[j]) ) continue;
+ int ial = bcf_gt_allele(ptr[j]);
+ if ( ial > 31 )
+ {
+ if ( !warned )
+ {
+ fprintf(bcftools_stderr,"Too many alleles (>32) at %s:%d, skipping. (todo?)\n", bcf_seqname(args->hdr,rec),rec->pos+1);
+ warned = 1;
+ }
+ args->nskipped++;
+ return -1;
+ }
+ bg_als |= 1<<ial;
+ gt |= 1<<ial;
+ }
+ binary_insert(gt, &args->bg_gts, &args->nbg_gts, &args->mbg_gts);
+ }
+ if ( !bg_als )
+ {
+ // all are missing
+ args->nskipped++;
+ return -1;
+ }
+
+ args->novel_als_smpl.l = 0;
+ args->novel_gts_smpl.l = 0;
+
+ int has_gt = 0;
+ for (i=0; i<args->nnovel_smpl; i++)
+ {
+ int novel_al = 0;
+ uint32_t gt = 0;
+ int32_t *ptr = args->gts + args->novel_smpl[i]*ngts;
+ for (j=0; j<ngts; j++)
+ {
+ if ( ptr[j]==bcf_int32_vector_end ) break;
+ if ( bcf_gt_is_missing(ptr[j]) ) continue;
+ int ial = bcf_gt_allele(ptr[j]);
+ if ( ial > 31 )
+ {
+ if ( !warned )
+ {
+ fprintf(bcftools_stderr,"Too many alleles (>32) at %s:%d, skipping. (todo?)\n", bcf_seqname(args->hdr,rec),rec->pos+1);
+ warned = 1;
+ }
+ args->nskipped++;
+ return -1;
+ }
+ if ( !(bg_als & (1<<ial)) ) novel_al = 1;
+ gt |= 1<<ial;
+ }
+ if ( !gt ) continue;
+ has_gt = 1;
+
+ char *smpl = args->hdr->samples[ args->novel_smpl[i] ];
+ if ( novel_al )
+ {
+ if ( args->novel_als_smpl.l ) kputc(',', &args->novel_als_smpl);
+ kputs(smpl, &args->novel_als_smpl);
+ }
+ else if ( !binary_search(gt, args->bg_gts, args->nbg_gts) )
+ {
+ if ( args->novel_gts_smpl.l ) kputc(',', &args->novel_gts_smpl);
+ kputs(smpl, &args->novel_gts_smpl);
+ }
+ }
+ if ( !has_gt )
+ {
+ // all are missing
+ args->nskipped++;
+ return -1;
+ }
+ if ( args->novel_als_smpl.l )
+ {
+ bcf_update_info_string(args->hdr_out, rec, "NOVELAL", args->novel_als_smpl.s);
+ args->nnovel_al++;
+ }
+ if ( args->novel_gts_smpl.l )
+ {
+ bcf_update_info_string(args->hdr_out, rec, "NOVELGT", args->novel_gts_smpl.s);
+ args->nnovel_gt++;
+ }
+ args->ntested++;
+ return 0;
+}
+
+int run(int argc, char **argv)
+{
+ args_t *args = (args_t*) calloc(1,sizeof(args_t));
+ args->argc = argc; args->argv = argv;
+ args->output_fname = "-";
+ static struct option loptions[] =
+ {
+ {"bg-samples",required_argument,0,'0'},
+ {"novel-samples",required_argument,0,'1'},
+ {"include",required_argument,0,'i'},
+ {"exclude",required_argument,0,'e'},
+ {"output",required_argument,NULL,'o'},
+ {"output-type",required_argument,NULL,'O'},
+ {"regions",1,0,'r'},
+ {"regions-file",1,0,'R'},
+ {"targets",1,0,'t'},
+ {"targets-file",1,0,'T'},
+ {NULL,0,NULL,0}
+ };
+ int c;
+ while ((c = getopt_long(argc, argv, "O:o:i:e:r:R:t:T:0:1:",loptions,NULL)) >= 0)
+ {
+ switch (c)
+ {
+ case '0': args->bg_samples_str = optarg; break;
+ case '1': args->novel_samples_str = optarg; break;
+ case 'e': args->filter_str = optarg; args->filter_logic |= FLT_EXCLUDE; break;
+ case 'i': args->filter_str = optarg; args->filter_logic |= FLT_INCLUDE; break;
+ case 't': args->targets = optarg; break;
+ case 'T': args->targets = optarg; args->targets_is_file = 1; break;
+ case 'r': args->regions = optarg; break;
+ case 'R': args->regions = optarg; args->regions_is_file = 1; break;
+ case 'o': args->output_fname = optarg; break;
+ case 'O':
+ switch (optarg[0]) {
+ case 'b': args->output_type = FT_BCF_GZ; break;
+ case 'u': args->output_type = FT_BCF; break;
+ case 'z': args->output_type = FT_VCF_GZ; break;
+ case 'v': args->output_type = FT_VCF; break;
+ default: error("The output type \"%s\" not recognised\n", optarg);
+ };
+ break;
+ case 'h':
+ case '?':
+ default: error("%s", usage_text()); break;
+ }
+ }
+ if ( optind==argc )
+ {
+ if ( !isatty(fileno((FILE *)stdin)) ) args->fname = "-"; // reading from stdin
+ else { error("%s",usage_text()); }
+ }
+ else if ( optind+1!=argc ) error("%s",usage_text());
+ else args->fname = argv[optind];
+
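+ // Both sample lists are required: init_data() passes them to hts_readlist() unchecked
+ if ( !args->bg_samples_str || !args->novel_samples_str ) error("%s",usage_text());
+ init_data(args);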
+
+ while ( bcf_sr_next_line(args->sr) )
+ {
+ bcf1_t *rec = bcf_sr_get_line(args->sr,0);
+ if ( args->filter )
+ {
+ int pass = filter_test(args->filter, rec, NULL);
+ if ( args->filter_logic & FLT_EXCLUDE ) pass = pass ? 0 : 1;
+ if ( !pass ) continue;
+ }
+ process_record(args, rec);
+ bcf_write(args->out_fh, args->hdr_out, rec);
+ }
+
+ fprintf(bcftools_stderr,"Total/processed/skipped/novel_allele/novel_gt:\t%d\t%d\t%d\t%d\t%d\n", args->ntotal, args->ntested, args->nskipped, args->nnovel_al, args->nnovel_gt);
+ destroy_data(args);
+
+ return 0;
+}
--- /dev/null
+/* plugins/counts.c -- counts SNPs, Indels, and total number of sites.
+
+ Copyright (C) 2013, 2014 Genome Research Ltd.
+
+ Author: Petr Danecek <pd3@sanger.ac.uk>
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE. */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <htslib/vcf.h>
+
+int nsamples, nsnps, nindels, nmnps, nothers, nsites;
+
+/*
+ This short description is used to generate the output of `bcftools plugin -l`.
+*/
+const char *about(void)
+{
+ return
+ "A minimal plugin which counts number of samples, SNPs,\n"
+ "INDELs, MNPs and total number of sites.\n";
+}
+
+/*
+ Called once at startup to initialize local variables.
+ Return 1 to suppress VCF/BCF header from printing, 0 otherwise.
+*/
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+ nsamples = bcf_hdr_nsamples(in);
+ nsnps = nindels = nmnps = nothers = nsites = 0;
+ return 1;
+}
+
+
+/*
+ Called for each VCF record. Return rec to output the line or NULL
+ to suppress output.
+*/
+bcf1_t *process(bcf1_t *rec)
+{
+ int type = bcf_get_variant_types(rec);
+ if ( type & VCF_SNP ) nsnps++;
+ if ( type & VCF_INDEL ) nindels++;
+ if ( type & VCF_MNP ) nmnps++;
+ if ( type & VCF_OTHER ) nothers++;
+ nsites++;
+ return NULL;
+}
+
+
+/*
+ Clean up.
+*/
+void destroy(void)
+{
+ printf("Number of samples: %d\n", nsamples);
+ printf("Number of SNPs: %d\n", nsnps);
+ printf("Number of INDELs: %d\n", nindels);
+ printf("Number of MNPs: %d\n", nmnps);
+ printf("Number of others: %d\n", nothers);
+ printf("Number of sites: %d\n", nsites);
+}
+
+
--- /dev/null
+#include "bcftools.pysam.h"
+
+/* plugins/counts.c -- counts SNPs, Indels, and total number of sites.
+
+ Copyright (C) 2013, 2014 Genome Research Ltd.
+
+ Author: Petr Danecek <pd3@sanger.ac.uk>
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE. */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <htslib/vcf.h>
+
+int nsamples, nsnps, nindels, nmnps, nothers, nsites;
+
+/*
+ This short description is used to generate the output of `bcftools plugin -l`.
+*/
+const char *about(void)
+{
+ return
+ "A minimal plugin which counts number of samples, SNPs,\n"
+ "INDELs, MNPs and total number of sites.\n";
+}
+
+/*
+ Called once at startup to initialize local variables.
+ Return 1 to suppress VCF/BCF header from printing, 0 otherwise.
+*/
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+ nsamples = bcf_hdr_nsamples(in);
+ nsnps = nindels = nmnps = nothers = nsites = 0;
+ return 1;
+}
+
+
+/*
+ Called for each VCF record. Return rec to output the line or NULL
+ to suppress output.
+*/
+bcf1_t *process(bcf1_t *rec)
+{
+ int type = bcf_get_variant_types(rec);
+ if ( type & VCF_SNP ) nsnps++;
+ if ( type & VCF_INDEL ) nindels++;
+ if ( type & VCF_MNP ) nmnps++;
+ if ( type & VCF_OTHER ) nothers++;
+ nsites++;
+ return NULL;
+}
+
+
+/*
+ Clean up.
+*/
+void destroy(void)
+{
+ fprintf(bcftools_stdout, "Number of samples: %d\n", nsamples);
+ fprintf(bcftools_stdout, "Number of SNPs: %d\n", nsnps);
+ fprintf(bcftools_stdout, "Number of INDELs: %d\n", nindels);
+ fprintf(bcftools_stdout, "Number of MNPs: %d\n", nmnps);
+ fprintf(bcftools_stdout, "Number of others: %d\n", nothers);
+ fprintf(bcftools_stdout, "Number of sites: %d\n", nsites);
+}
+
+
--- /dev/null
+/* plugins/dosage.c -- prints genotype dosage.
+
+ Copyright (C) 2014 Genome Research Ltd.
+
+ Author: Petr Danecek <pd3@sanger.ac.uk>
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE. */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h> // for strdup, strcmp, memset
+#include <htslib/vcf.h>
+#include <math.h>
+#include <getopt.h>
+#include "bcftools.h"
+
+
+/*
+ This short description is used to generate the output of `bcftools plugin -l`.
+*/
+const char *about(void)
+{
+ return "Prints genotype dosage determined from tags requested by the user.\n";
+}
+
+const char *usage(void)
+{
+ return
+ "\n"
+ "About: Print genotype dosage\n"
+ "Usage: bcftools +dosage [General Options] -- [Plugin Options]\n"
+ "Options:\n"
+ " run \"bcftools plugin\" for a list of common options\n"
+ "\n"
+ "Plugin options:\n"
+ " -t, --tags <list> VCF tags to determine the dosage from [PL,GL,GT]\n"
+ "\n"
+ "Example:\n"
+ " bcftools +dosage in.vcf -- -t GT\n"
+ "\n";
+}
+
+bcf_hdr_t *in_hdr = NULL;
+int pl_type = 0, gl_type = 0;
+uint8_t *buf = NULL;
+int nbuf = 0; // NB: number of elements, not bytes
+char **tags = NULL;
+int ntags = 0;
+float *vals = NULL, *dsg = NULL;
+int mvals, mdsg;
+
+typedef int (*dosage_f) (bcf1_t *);
+dosage_f *handlers = NULL;
+int nhandlers = 0;
+
+
+int calc_dosage_PL(bcf1_t *rec)
+{
+ int i, j, nret = bcf_get_format_values(in_hdr,rec,"PL",(void**)&buf,&nbuf,pl_type);
+ if ( nret<0 ) return -1;
+
+ nret /= rec->n_sample;
+ if ( nret != rec->n_allele*(rec->n_allele+1)/2 ) return -1; // not diploid
+ hts_expand(float, nret, mvals, vals);
+ hts_expand(float, rec->n_allele, mdsg, dsg);
+ #define BRANCH(type_t,is_missing,is_vector_end) \
+ { \
+ type_t *ptr = (type_t*) buf; \
+ for (i=0; i<rec->n_sample; i++) \
+ { \
+ float sum = 0; \
+ for (j=0; j<nret; j++) \
+ { \
+ if ( is_missing || is_vector_end ) break; \
+ vals[j] = exp(-0.1*ptr[j]); \
+ sum += vals[j]; \
+ } \
+ if ( j<nret ) \
+ for (j=0; j<rec->n_allele; j++) dsg[j] = -1; \
+ else \
+ { \
+ if ( sum ) for (j=0; j<nret; j++) vals[j] /= sum; \
+ memset(dsg, 0, sizeof(float)*rec->n_allele); \
+ int k, l = 0; \
+ for (j=0; j<rec->n_allele; j++) \
+ { \
+ for (k=0; k<=j; k++) \
+ { \
+ dsg[j] += vals[l]; \
+ dsg[k] += vals[l]; \
+ l++; /* advance to the next genotype likelihood */ \
+ } \
+ } \
+ } \
+ for (j=1; j<rec->n_allele; j++) \
+ printf("%c%.1f",j==1?'\t':',',dsg[j]); \
+ ptr += nret; \
+ } \
+ }
+ switch (pl_type)
+ {
+ case BCF_HT_INT: BRANCH(int32_t,ptr[j]==bcf_int32_missing,ptr[j]==bcf_int32_vector_end); break;
+ case BCF_HT_REAL: BRANCH(float,bcf_float_is_missing(ptr[j]),bcf_float_is_vector_end(ptr[j])); break;
+ }
+ #undef BRANCH
+ return 0;
+}
+
+int calc_dosage_GL(bcf1_t *rec)
+{
+ // GL values must be fetched and dispatched with gl_type (set in init), not pl_type
+ int i, j, nret = bcf_get_format_values(in_hdr,rec,"GL",(void**)&buf,&nbuf,gl_type);
+ if ( nret<0 ) return -1;
+
+ nret /= rec->n_sample;
+ if ( nret != rec->n_allele*(rec->n_allele+1)/2 ) return -1; // not diploid
+ hts_expand(float, nret, mvals, vals);
+ hts_expand(float, rec->n_allele, mdsg, dsg);
+ #define BRANCH(type_t,is_missing,is_vector_end) \
+ { \
+ type_t *ptr = (type_t*) buf; \
+ for (i=0; i<rec->n_sample; i++) \
+ { \
+ float sum = 0; \
+ for (j=0; j<nret; j++) \
+ { \
+ if ( is_missing || is_vector_end ) break; \
+ vals[j] = exp(ptr[j]); \
+ sum += vals[j]; \
+ } \
+ if ( j<nret ) \
+ for (j=0; j<rec->n_allele; j++) dsg[j] = -1; \
+ else \
+ { \
+ for (; j<nret; j++) vals[j] = 0; \
+ if ( sum ) for (j=0; j<nret; j++) vals[j] /= sum; \
+ memset(dsg, 0, sizeof(float)*rec->n_allele); \
+ int k, l = 0; \
+ for (j=0; j<rec->n_allele; j++) \
+ { \
+ for (k=0; k<=j; k++) \
+ { \
+ dsg[j] += vals[l]; \
+ dsg[k] += vals[l]; \
+ l++; /* advance to the next genotype likelihood */ \
+ } \
+ } \
+ } \
+ for (j=1; j<rec->n_allele; j++) \
+ printf("%c%.1f",j==1?'\t':',',dsg[j]); \
+ ptr += nret; \
+ } \
+ }
+ switch (gl_type)
+ {
+ case BCF_HT_INT: BRANCH(int32_t,ptr[j]==bcf_int32_missing,ptr[j]==bcf_int32_vector_end); break;
+ case BCF_HT_REAL: BRANCH(float,bcf_float_is_missing(ptr[j]),bcf_float_is_vector_end(ptr[j])); break;
+ }
+ #undef BRANCH
+ return 0;
+}
+
+int calc_dosage_GT(bcf1_t *rec)
+{
+ int i, j, nret = bcf_get_genotypes(in_hdr,rec,(void**)&buf,&nbuf);
+ if ( nret<0 ) return -1;
+
+ nret /= rec->n_sample;
+ int32_t *ptr = (int32_t*) buf;
+ hts_expand(float, rec->n_allele, mdsg, dsg);
+ for (i=0; i<rec->n_sample; i++)
+ {
+ memset(dsg, 0, sizeof(float)*rec->n_allele);
+ for (j=0; j<nret; j++)
+ {
+ if ( ptr[j]==bcf_int32_vector_end || bcf_gt_is_missing(ptr[j]) ) break;
+ int idx = bcf_gt_allele(ptr[j]);
+ if ( idx >= rec->n_allele ) error("The allele index is out of range at %s:%d\n", bcf_seqname(in_hdr,rec),rec->pos+1);
+ dsg[idx] += 1;
+ }
+ if ( !j )
+ for (j=0; j<rec->n_allele; j++) dsg[j] = -1;
+ for (j=1; j<rec->n_allele; j++)
+ printf("%c%.1f",j==1?'\t':',',dsg[j]);
+ ptr += nret;
+ }
+ return 0;
+}
+
+
+char **split_list(char *str, int *nitems)
+{
+ int n = 0, done = 0;
+ char *ss = strdup(str), **out = NULL;
+ while ( !done && *ss )
+ {
+ char *se = ss;
+ while ( *se && *se!=',' ) se++;
+ if ( !*se ) done = 1;
+ *se = 0;
+ n++;
+ out = (char**) realloc(out,sizeof(char*)*n);
+ out[n-1] = ss;
+ ss = se+1;
+ }
+ *nitems = n;
+ return out;
+}
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+ int i, id, c;
+ char *tags_str = "PL,GL,GT";
+
+ static struct option loptions[] =
+ {
+ {"tags",1,0,'t'},
+ {0,0,0,0}
+ };
+ while ((c = getopt_long(argc, argv, "t:?h",loptions,NULL)) >= 0)
+ {
+ switch (c)
+ {
+ case 't': tags_str = optarg; break;
+ case 'h':
+ case '?':
+ default: fprintf(stderr,"%s", usage()); exit(1); break;
+ }
+ }
+ tags = split_list(tags_str, &ntags);
+
+ in_hdr = in;
+ for (i=0; i<ntags; i++)
+ {
+ if ( !strcmp("PL",tags[i]) )
+ {
+ id = bcf_hdr_id2int(in_hdr,BCF_DT_ID,"PL");
+ if ( bcf_hdr_idinfo_exists(in_hdr,BCF_HL_FMT,id) )
+ {
+ pl_type = bcf_hdr_id2type(in_hdr,BCF_HL_FMT,id);
+ if ( pl_type!=BCF_HT_INT && pl_type!=BCF_HT_REAL )
+ {
+ fprintf(stderr,"Expected numeric type of FORMAT/PL\n");
+ return -1;
+ }
+ handlers = (dosage_f*) realloc(handlers,(nhandlers+1)*sizeof(*handlers));
+ handlers[nhandlers++] = calc_dosage_PL;
+ }
+ }
+ else if ( !strcmp("GL",tags[i]) )
+ {
+ id = bcf_hdr_id2int(in_hdr,BCF_DT_ID,"GL");
+ if ( bcf_hdr_idinfo_exists(in_hdr,BCF_HL_FMT,id) )
+ {
+ gl_type = bcf_hdr_id2type(in_hdr,BCF_HL_FMT,id);
+ if ( gl_type!=BCF_HT_INT && gl_type!=BCF_HT_REAL )
+ {
+ fprintf(stderr,"Expected numeric type of FORMAT/GL\n");
+ return -1;
+ }
+ handlers = (dosage_f*) realloc(handlers,(nhandlers+1)*sizeof(*handlers));
+ handlers[nhandlers++] = calc_dosage_GL;
+ }
+ }
+ else if ( !strcmp("GT",tags[i]) )
+ {
+ handlers = (dosage_f*) realloc(handlers,(nhandlers+1)*sizeof(*handlers));
+ handlers[nhandlers++] = calc_dosage_GT;
+ }
+ else
+ {
+ fprintf(stderr,"No handler for tag \"%s\"\n", tags[i]);
+ return -1;
+ }
+ }
+ if ( ntags ) free(tags[0]); // tags is NULL when the list is empty
+ free(tags);
+
+ printf("#[1]CHROM\t[2]POS\t[3]REF\t[4]ALT");
+ for (i=0; i<bcf_hdr_nsamples(in_hdr); i++) printf("\t[%d]%s", i+5,in_hdr->samples[i]);
+ printf("\n");
+
+ return 1;
+}
+
+
+bcf1_t *process(bcf1_t *rec)
+{
+ int i,j, ret;
+
+ printf("%s\t%d\t%s", bcf_seqname(in_hdr,rec),rec->pos+1,rec->d.allele[0]);
+ if ( rec->n_allele == 1 ) printf("\t.");
+ else for (i=1; i<rec->n_allele; i++) printf("%c%s", i==1?'\t':',', rec->d.allele[i]);
+ if ( rec->n_allele==1 )
+ {
+ for (j=0; j<rec->n_sample; j++) printf("\t0.0");
+ }
+ else
+ {
+ for (i=0; i<nhandlers; i++)
+ {
+ ret = handlers[i](rec);
+ if ( !ret ) break; // successfully printed
+ }
+ if ( i==nhandlers )
+ {
+ // none of the annotations present
+ for (i=0; i<rec->n_sample; i++) printf("\t-1.0");
+ }
+ }
+ printf("\n");
+
+ return NULL;
+}
+
+
+void destroy(void)
+{
+ free(vals);
+ free(dsg);
+ free(handlers);
+ free(buf);
+}
+
+
--- /dev/null
+#include "bcftools.pysam.h"
+
+/* plugins/dosage.c -- prints genotype dosage.
+
+ Copyright (C) 2014 Genome Research Ltd.
+
+ Author: Petr Danecek <pd3@sanger.ac.uk>
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE. */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <htslib/vcf.h>
+#include <math.h>
+#include <getopt.h>
+#include "bcftools.h"
+
+
+/*
+ This short description is used to generate the output of `bcftools plugin -l`.
+*/
+const char *about(void)
+{
+ return "Prints genotype dosage determined from tags requested by the user.\n";
+}
+
+const char *usage(void)
+{
+ return
+ "\n"
+ "About: Print genotype dosage\n"
+ "Usage: bcftools +dosage [General Options] -- [Plugin Options]\n"
+ "Options:\n"
+ " run \"bcftools plugin\" for a list of common options\n"
+ "\n"
+ "Plugin options:\n"
+ " -t, --tags <list> VCF tags to determine the dosage from [PL,GL,GT]\n"
+ "\n"
+ "Example:\n"
+ " bcftools +dosage in.vcf -- -t GT\n"
+ "\n";
+}
+
+bcf_hdr_t *in_hdr = NULL;
+int pl_type = 0, gl_type = 0;
+uint8_t *buf = NULL;
+int nbuf = 0; // NB: number of elements, not bytes
+char **tags = NULL;
+int ntags = 0;
+float *vals = NULL, *dsg = NULL;
+int mvals, mdsg;
+
+typedef int (*dosage_f) (bcf1_t *);
+dosage_f *handlers = NULL;
+int nhandlers = 0;
+
+
+int calc_dosage_PL(bcf1_t *rec)
+{
+ int i, j, nret = bcf_get_format_values(in_hdr,rec,"PL",(void**)&buf,&nbuf,pl_type);
+ if ( nret<0 ) return -1;
+
+ nret /= rec->n_sample;
+ if ( nret != rec->n_allele*(rec->n_allele+1)/2 ) return -1; // not diploid
+ hts_expand(float, nret, mvals, vals);
+ hts_expand(float, rec->n_allele, mdsg, dsg);
+ #define BRANCH(type_t,is_missing,is_vector_end) \
+ { \
+ type_t *ptr = (type_t*) buf; \
+ for (i=0; i<rec->n_sample; i++) \
+ { \
+ float sum = 0; \
+ for (j=0; j<nret; j++) \
+ { \
+ if ( is_missing || is_vector_end ) break; \
+ vals[j] = pow(10,-0.1*ptr[j]); /* PL is phred-scaled: -10*log10(L) */ \
+ sum += vals[j]; \
+ } \
+ if ( j<nret ) \
+ for (j=0; j<rec->n_allele; j++) dsg[j] = -1; \
+ else \
+ { \
+ if ( sum ) for (j=0; j<nret; j++) vals[j] /= sum; \
+ memset(dsg, 0, sizeof(float)*rec->n_allele); \
+ int k, l = 0; \
+ for (j=0; j<rec->n_allele; j++) \
+ { \
+ for (k=0; k<=j; k++) \
+ { \
+ dsg[j] += vals[l]; \
+ dsg[k] += vals[l]; \
+ l++; \
+ } \
+ } \
+ } \
+ for (j=1; j<rec->n_allele; j++) \
+ fprintf(bcftools_stdout, "%c%.1f",j==1?'\t':',',dsg[j]); \
+ ptr += nret; \
+ } \
+ }
+ switch (pl_type)
+ {
+ case BCF_HT_INT: BRANCH(int32_t,ptr[j]==bcf_int32_missing,ptr[j]==bcf_int32_vector_end); break;
+ case BCF_HT_REAL: BRANCH(float,bcf_float_is_missing(ptr[j]),bcf_float_is_vector_end(ptr[j])); break;
+ }
+ #undef BRANCH
+ return 0;
+}
+
+int calc_dosage_GL(bcf1_t *rec)
+{
+ // FORMAT/GL is typed independently of FORMAT/PL, so use gl_type throughout
+ int i, j, nret = bcf_get_format_values(in_hdr,rec,"GL",(void**)&buf,&nbuf,gl_type);
+ if ( nret<0 ) return -1;
+
+ nret /= rec->n_sample;
+ if ( nret != rec->n_allele*(rec->n_allele+1)/2 ) return -1; // not diploid
+ hts_expand(float, nret, mvals, vals);
+ hts_expand(float, rec->n_allele, mdsg, dsg);
+ // GL values are log10-scaled likelihoods; genotypes are ordered 0/0,0/1,1/1,0/2,...
+ #define BRANCH(type_t,is_missing,is_vector_end) \
+ { \
+ type_t *ptr = (type_t*) buf; \
+ for (i=0; i<rec->n_sample; i++) \
+ { \
+ float sum = 0; \
+ for (j=0; j<nret; j++) \
+ { \
+ if ( is_missing || is_vector_end ) break; \
+ vals[j] = pow(10,ptr[j]); \
+ sum += vals[j]; \
+ } \
+ if ( j<nret ) \
+ for (j=0; j<rec->n_allele; j++) dsg[j] = -1; \
+ else \
+ { \
+ if ( sum ) for (j=0; j<nret; j++) vals[j] /= sum; \
+ memset(dsg, 0, sizeof(float)*rec->n_allele); \
+ int k, l = 0; \
+ for (j=0; j<rec->n_allele; j++) \
+ { \
+ for (k=0; k<=j; k++) \
+ { \
+ dsg[j] += vals[l]; \
+ dsg[k] += vals[l]; \
+ l++; \
+ } \
+ } \
+ } \
+ for (j=1; j<rec->n_allele; j++) \
+ fprintf(bcftools_stdout, "%c%.1f",j==1?'\t':',',dsg[j]); \
+ ptr += nret; \
+ } \
+ }
+ switch (gl_type)
+ {
+ case BCF_HT_INT: BRANCH(int32_t,ptr[j]==bcf_int32_missing,ptr[j]==bcf_int32_vector_end); break;
+ case BCF_HT_REAL: BRANCH(float,bcf_float_is_missing(ptr[j]),bcf_float_is_vector_end(ptr[j])); break;
+ }
+ #undef BRANCH
+ return 0;
+}
+
+int calc_dosage_GT(bcf1_t *rec)
+{
+ int i, j, nret = bcf_get_genotypes(in_hdr,rec,(void**)&buf,&nbuf);
+ if ( nret<0 ) return -1;
+
+ nret /= rec->n_sample;
+ int32_t *ptr = (int32_t*) buf;
+ hts_expand(float, rec->n_allele, mdsg, dsg);
+ for (i=0; i<rec->n_sample; i++)
+ {
+ memset(dsg, 0, sizeof(float)*rec->n_allele);
+ for (j=0; j<nret; j++)
+ {
+ if ( ptr[j]==bcf_int32_vector_end || bcf_gt_is_missing(ptr[j]) ) break;
+ int idx = bcf_gt_allele(ptr[j]);
+ if ( idx >= rec->n_allele ) error("The allele index is out of range at %s:%d\n", bcf_seqname(in_hdr,rec),rec->pos+1);
+ dsg[idx] += 1;
+ }
+ if ( !j )
+ for (j=0; j<rec->n_allele; j++) dsg[j] = -1;
+ for (j=1; j<rec->n_allele; j++)
+ fprintf(bcftools_stdout, "%c%.1f",j==1?'\t':',',dsg[j]);
+ ptr += nret;
+ }
+ return 0;
+}
+
+
+char **split_list(char *str, int *nitems)
+{
+ int n = 0, done = 0;
+ char *ss = strdup(str), **out = NULL;
+ while ( !done && *ss )
+ {
+ char *se = ss;
+ while ( *se && *se!=',' ) se++;
+ if ( !*se ) done = 1;
+ *se = 0;
+ n++;
+ out = (char**) realloc(out,sizeof(char*)*n);
+ out[n-1] = ss;
+ ss = se+1;
+ }
+ *nitems = n;
+ return out;
+}
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+ int i, id, c;
+ char *tags_str = "PL,GL,GT";
+
+ static struct option loptions[] =
+ {
+ {"tags",1,0,'t'},
+ {0,0,0,0}
+ };
+ while ((c = getopt_long(argc, argv, "t:?h",loptions,NULL)) >= 0)
+ {
+ switch (c)
+ {
+ case 't': tags_str = optarg; break;
+ case 'h':
+ case '?':
+ default: fprintf(bcftools_stderr,"%s", usage()); exit(1); break;
+ }
+ }
+ tags = split_list(tags_str, &ntags);
+
+ in_hdr = in;
+ for (i=0; i<ntags; i++)
+ {
+ if ( !strcmp("PL",tags[i]) )
+ {
+ id = bcf_hdr_id2int(in_hdr,BCF_DT_ID,"PL");
+ if ( bcf_hdr_idinfo_exists(in_hdr,BCF_HL_FMT,id) )
+ {
+ pl_type = bcf_hdr_id2type(in_hdr,BCF_HL_FMT,id);
+ if ( pl_type!=BCF_HT_INT && pl_type!=BCF_HT_REAL )
+ {
+ fprintf(bcftools_stderr,"Expected numeric type of FORMAT/PL\n");
+ return -1;
+ }
+ handlers = (dosage_f*) realloc(handlers,(nhandlers+1)*sizeof(*handlers));
+ handlers[nhandlers++] = calc_dosage_PL;
+ }
+ }
+ else if ( !strcmp("GL",tags[i]) )
+ {
+ id = bcf_hdr_id2int(in_hdr,BCF_DT_ID,"GL");
+ if ( bcf_hdr_idinfo_exists(in_hdr,BCF_HL_FMT,id) )
+ {
+ gl_type = bcf_hdr_id2type(in_hdr,BCF_HL_FMT,id);
+ if ( gl_type!=BCF_HT_INT && gl_type!=BCF_HT_REAL )
+ {
+ fprintf(bcftools_stderr,"Expected numeric type of FORMAT/GL\n");
+ return -1;
+ }
+ handlers = (dosage_f*) realloc(handlers,(nhandlers+1)*sizeof(*handlers));
+ handlers[nhandlers++] = calc_dosage_GL;
+ }
+ }
+ else if ( !strcmp("GT",tags[i]) )
+ {
+ handlers = (dosage_f*) realloc(handlers,(nhandlers+1)*sizeof(*handlers));
+ handlers[nhandlers++] = calc_dosage_GT;
+ }
+ else
+ {
+ fprintf(bcftools_stderr,"No handler for tag \"%s\"\n", tags[i]);
+ return -1;
+ }
+ }
+ if ( ntags ) free(tags[0]); // tags is NULL when the list is empty
+ free(tags);
+
+ fprintf(bcftools_stdout, "#[1]CHROM\t[2]POS\t[3]REF\t[4]ALT");
+ for (i=0; i<bcf_hdr_nsamples(in_hdr); i++) fprintf(bcftools_stdout, "\t[%d]%s", i+5,in_hdr->samples[i]);
+ fprintf(bcftools_stdout, "\n");
+
+ return 1;
+}
+
+
+bcf1_t *process(bcf1_t *rec)
+{
+ int i,j, ret;
+
+ fprintf(bcftools_stdout, "%s\t%d\t%s", bcf_seqname(in_hdr,rec),rec->pos+1,rec->d.allele[0]);
+ if ( rec->n_allele == 1 ) fprintf(bcftools_stdout, "\t.");
+ else for (i=1; i<rec->n_allele; i++) fprintf(bcftools_stdout, "%c%s", i==1?'\t':',', rec->d.allele[i]);
+ if ( rec->n_allele==1 )
+ {
+ for (j=0; j<rec->n_sample; j++) fprintf(bcftools_stdout, "\t0.0");
+ }
+ else
+ {
+ for (i=0; i<nhandlers; i++)
+ {
+ ret = handlers[i](rec);
+ if ( !ret ) break; // successfully printed
+ }
+ if ( i==nhandlers )
+ {
+ // none of the annotations present
+ for (i=0; i<rec->n_sample; i++) fprintf(bcftools_stdout, "\t-1.0");
+ }
+ }
+ fprintf(bcftools_stdout, "\n");
+
+ return NULL;
+}
+
+
+void destroy(void)
+{
+ free(vals);
+ free(dsg);
+ free(handlers);
+ free(buf);
+}
+
+
--- /dev/null
+/* plugins/fill-AN-AC.c -- fills AN and AC INFO fields.
+
+ Copyright (C) 2014 Genome Research Ltd.
+
+ Author: Petr Danecek <pd3@sanger.ac.uk>
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE. */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <htslib/hts.h>
+#include <htslib/vcf.h>
+#include <htslib/vcfutils.h>
+
+bcf_hdr_t *in_hdr, *out_hdr;
+int *arr = NULL, marr = 0;
+
+const char *about(void)
+{
+ return "Fill INFO fields AN and AC.\n";
+}
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+ in_hdr = in;
+ out_hdr = out;
+ bcf_hdr_append(out_hdr, "##INFO=<ID=AC,Number=A,Type=Integer,Description=\"Allele count in genotypes\">");
+ bcf_hdr_append(out_hdr, "##INFO=<ID=AN,Number=1,Type=Integer,Description=\"Total number of alleles in called genotypes\">");
+ return 0;
+}
+
+bcf1_t *process(bcf1_t *rec)
+{
+ hts_expand(int,rec->n_allele,marr,arr);
+ int ret = bcf_calc_ac(in_hdr,rec,arr,BCF_UN_FMT);
+ if ( ret>0 )
+ {
+ int i, an = 0;
+ for (i=0; i<rec->n_allele; i++) an += arr[i];
+ bcf_update_info_int32(out_hdr, rec, "AN", &an, 1);
+ bcf_update_info_int32(out_hdr, rec, "AC", arr+1, rec->n_allele-1);
+ }
+ return rec;
+}
+
+void destroy(void)
+{
+ free(arr);
+}
+
+
--- /dev/null
+#include "bcftools.pysam.h"
+
+/* plugins/fill-AN-AC.c -- fills AN and AC INFO fields.
+
+ Copyright (C) 2014 Genome Research Ltd.
+
+ Author: Petr Danecek <pd3@sanger.ac.uk>
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE. */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <htslib/hts.h>
+#include <htslib/vcf.h>
+#include <htslib/vcfutils.h>
+
+bcf_hdr_t *in_hdr, *out_hdr;
+int *arr = NULL, marr = 0;
+
+const char *about(void)
+{
+ return "Fill INFO fields AN and AC.\n";
+}
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+ in_hdr = in;
+ out_hdr = out;
+ bcf_hdr_append(out_hdr, "##INFO=<ID=AC,Number=A,Type=Integer,Description=\"Allele count in genotypes\">");
+ bcf_hdr_append(out_hdr, "##INFO=<ID=AN,Number=1,Type=Integer,Description=\"Total number of alleles in called genotypes\">");
+ return 0;
+}
+
+bcf1_t *process(bcf1_t *rec)
+{
+ hts_expand(int,rec->n_allele,marr,arr);
+ int ret = bcf_calc_ac(in_hdr,rec,arr,BCF_UN_FMT);
+ if ( ret>0 )
+ {
+ int i, an = 0;
+ for (i=0; i<rec->n_allele; i++) an += arr[i];
+ bcf_update_info_int32(out_hdr, rec, "AN", &an, 1);
+ bcf_update_info_int32(out_hdr, rec, "AC", arr+1, rec->n_allele-1);
+ }
+ return rec;
+}
+
+void destroy(void)
+{
+ free(arr);
+}
+
+
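Note on plugins/fill-AN-AC.c: the plugin delegates the counting to `bcf_calc_ac` and then sums the per-allele counts to obtain AN. The sketch below shows the same bookkeeping on a plain array of allele indices. It is an illustration under stated assumptions, not htslib's implementation: the helper name `count_alleles` is hypothetical, and negative values stand in for missing alleles.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: count called alleles.
 * alleles   allele indices for all samples, negative meaning missing
 * n         number of entries in alleles
 * n_allele  number of alleles at the site including REF
 * ac        output array of n_allele counts (ac[0] is the REF count)
 * Returns AN, the total number of called alleles. */
static int count_alleles(const int *alleles, int n, int n_allele, int *ac)
{
    int i, an = 0;

    assert(n_allele > 0);
    memset(ac, 0, sizeof(int) * n_allele);
    for (i = 0; i < n; i++)
    {
        /* skip missing or out-of-range indices */
        if (alleles[i] < 0 || alleles[i] >= n_allele) continue;
        ac[alleles[i]]++;
        an++;
    }
    return an;
}
```

The INFO/AC field lists ALT alleles only, which is why `process()` passes `arr+1` with `rec->n_allele-1` values, while AN counts REF as well.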
--- /dev/null
+/* plugin/fill-from-fasta.c -- fill-from-fasta plugin.
+
+ Copyright (C) 2016 Genome Research Ltd.
+
+ Author: Shane McCarthy <sm15@sanger.ac.uk>
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE. */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <strings.h>
+#include <string.h>
+#include <assert.h>
+#include <getopt.h>
+#include <htslib/vcf.h>
+#include <htslib/faidx.h>
+#include <htslib/kseq.h>
+#include "filter.h"
+#include "bcftools.h"
+
+const char *about(void)
+{
+ return "Fill INFO or REF field based on values in a fasta file\n";
+}
+
+const char *usage(void)
+{
+ return
+"\n"
+"About: Fill INFO or REF field based on values in a fasta file.\n"
+" The fasta file must be indexed with samtools faidx.\n"
+"Usage: bcftools +fill-from-fasta [General Options] -- [Plugin Options]\n"
+"\n"
+"General options:\n"
+" run \"bcftools plugin\" for a list of common options\n"
+"\n"
+"Plugin options:\n"
+" -c, --column <str> REF or INFO tag, e.g. AA for ancestral allele\n"
+" -f, --fasta <file> fasta file\n"
+" -h, --header-lines <file> optional file containing header lines to append\n"
+" -i, --include <expr> annotate only records passing filter expression\n"
+" -e, --exclude <expr> annotate only records failing filter expression\n"
+
+"\n"
+"Examples:\n"
+" # fill ancestral allele as INFO/AA for SNP records\n"
+" echo '##INFO=<ID=AA,Number=1,Type=String,Description=\"Ancestral allele\">' > aa.hdr\n"
+" bcftools +fill-from-fasta in.vcf -- -c AA -f aa.fasta -h aa.hdr -i 'TYPE=\"snp\"'\n"
+"\n"
+" # fix the REF allele in VCFs where REF=N or other\n"
+" bcftools +fill-from-fasta in.vcf -- -c REF -f reference.fasta\n"
+"\n"
+" # select sites marked as P (PASS) in the 1000G Phase3 mask\n"
+" echo '##INFO=<ID=P3_MASK,Number=1,Type=String,Description=\"1000G Phase 3 mask\">' > mask.hdr\n"
+" bcftools +fill-from-fasta in.vcf -Ou -- -c P3_MASK -f 1000G_mask.fasta -h mask.hdr | bcftools view -i 'P3_MASK=\"P\"'\n"
+"\n";
+}
+
+bcf_hdr_t *in_hdr = NULL, *out_hdr = NULL;
+faidx_t *faidx;
+int anno = 0;
+char *column = NULL;
+
+#define ANNO_REF 1
+#define ANNO_STRING 2
+#define ANNO_INT 3
+
+filter_t *filter;
+char *filter_str;
+int filter_logic; // one of FLT_INCLUDE/FLT_EXCLUDE (-i or -e)
+
+#define FLT_INCLUDE 1
+#define FLT_EXCLUDE 2
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+ int c;
+ char *ref_fname = NULL, *header_fname = NULL;
+ static struct option loptions[] =
+ {
+ {"exclude",required_argument,NULL,'e'},
+ {"include",required_argument,NULL,'i'},
+ {"column",required_argument,NULL,'c'},
+ {"fasta",required_argument,NULL,'f'},
+ {"header-lines",required_argument,NULL,'h'},
+ {NULL,0,NULL,0}
+ };
+ while ((c = getopt_long(argc, argv, "c:f:?h:i:e:",loptions,NULL)) >= 0)
+ {
+ switch (c)
+ {
+ case 'e': filter_str = optarg; filter_logic |= FLT_EXCLUDE; break;
+ case 'i': filter_str = optarg; filter_logic |= FLT_INCLUDE; break;
+ case 'c': column = optarg; break;
+ case 'f': ref_fname = optarg; break;
+ case 'h': header_fname = optarg; break;
+ case '?':
+ default: fprintf(stderr,"%s", usage()); exit(1); break;
+ }
+ }
+ in_hdr = in;
+ out_hdr = out;
+ if ( filter_logic == (FLT_EXCLUDE|FLT_INCLUDE) ) { fprintf(stderr,"Only one of -i or -e can be given.\n"); return -1; }
+
+ if ( !column )
+ {
+ fprintf(stderr,"--column option is required.\n");
+ return -1;
+ }
+ if (header_fname)
+ {
+ htsFile *file = hts_open(header_fname, "rb");
+ if ( !file ) { fprintf(stderr,"Error reading %s\n", header_fname); return -1; }
+ kstring_t str = {0,0,0};
+ while ( hts_getline(file, KS_SEP_LINE, &str) > 0 )
+ {
+ if ( bcf_hdr_append(out_hdr, str.s) ) { fprintf(stderr,"Could not parse %s: %s\n", header_fname, str.s); return -1; }
+ }
+ hts_close(file);
+ free(str.s);
+ bcf_hdr_sync(out_hdr);
+ }
+ if (!strcasecmp("REF", column)) anno = ANNO_REF;
+ else {
+ if ( !strncasecmp(column,"INFO/",5) ) column += 5;
+ int hdr_id = bcf_hdr_id2int(out_hdr, BCF_DT_ID, column);
+ if (hdr_id<0) { fprintf(stderr,"No header ID found for %s. Header lines can be added with the --header-lines option\n", column); return -1; }
+ switch ( bcf_hdr_id2type(out_hdr,BCF_HL_INFO,hdr_id) )
+ {
+ case BCF_HT_INT:
+ anno=ANNO_INT;
+ break;
+ case BCF_HT_STR:
+ anno=ANNO_STRING;
+ break;
+ default:
+ fprintf(stderr,"The type of %s not recognised (%d)\n", column, bcf_hdr_id2type(out_hdr,BCF_HL_INFO,hdr_id));
+ return -1;
+ }
+ }
+ if ( !ref_fname )
+ {
+ fprintf(stderr,"No fasta given.\n");
+ return -1;
+ }
+ faidx = fai_load(ref_fname);
+ if ( !faidx ) { fprintf(stderr,"Failed to load the fasta index: %s\n", ref_fname); return -1; }
+ if ( filter_str )
+ filter = filter_init(in, filter_str);
+ return 0;
+}
+
+bcf1_t *process(bcf1_t *rec)
+{
+ // filter determines if we will annotate the record
+ // return record unchanged if filter applied
+ if ( filter )
+ {
+ int ret = filter_test(filter, rec, NULL);
+ if ( filter_logic==FLT_INCLUDE ) { if ( !ret ) return rec; }
+ else if ( ret ) return rec;
+ }
+
+ int i;
+ char *ref = rec->d.allele[0];
+ int ref_len = strlen(ref);
+ int fa_len;
+ // could be sped up here by fetching the whole chromosome? could assume
+ // sorted, but revert to this when non-sorted records found?
+ char *fa = faidx_fetch_seq(faidx, bcf_seqname(in_hdr,rec), rec->pos, rec->pos+ref_len-1, &fa_len);
+ if ( !fa ) error("faidx_fetch_seq failed at %s:%d\n", bcf_hdr_id2name(in_hdr,rec->rid), rec->pos+1);
+ for (i=0; i<fa_len; i++)
+ if ( fa[i]>='a' && fa[i]<='z' ) fa[i] -= 32; // uppercase letters only
+
+ assert(ref_len == fa_len);
+ if (anno==ANNO_REF)
+ strncpy(rec->d.allele[0], fa, fa_len);
+ else if (anno==ANNO_STRING)
+ bcf_update_info_string(out_hdr, rec, column, fa);
+ else if (anno==ANNO_INT && ref_len==1)
+ {
+ int val = atoi(&fa[0]);
+ bcf_update_info_int32(out_hdr, rec, column, &val, 1);
+ }
+ free(fa);
+ return rec;
+}
+
+void destroy(void)
+{
+ fai_destroy(faidx);
+ if (filter) filter_destroy(filter);
+}
--- /dev/null
+#include "bcftools.pysam.h"
+
+/* plugin/fill-from-fasta.c -- fill-from-fasta plugin.
+
+ Copyright (C) 2016 Genome Research Ltd.
+
+ Author: Shane McCarthy <sm15@sanger.ac.uk>
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE. */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <strings.h>
+#include <string.h>
+#include <assert.h>
+#include <getopt.h>
+#include <htslib/vcf.h>
+#include <htslib/faidx.h>
+#include <htslib/kseq.h>
+#include "filter.h"
+#include "bcftools.h"
+
+const char *about(void)
+{
+ return "Fill INFO or REF field based on values in a fasta file\n";
+}
+
+const char *usage(void)
+{
+ return
+"\n"
+"About: Fill INFO or REF field based on values in a fasta file.\n"
+" The fasta file must be indexed with samtools faidx.\n"
+"Usage: bcftools +fill-from-fasta [General Options] -- [Plugin Options]\n"
+"\n"
+"General options:\n"
+" run \"bcftools plugin\" for a list of common options\n"
+"\n"
+"Plugin options:\n"
+" -c, --column <str> REF or INFO tag, e.g. AA for ancestral allele\n"
+" -f, --fasta <file> fasta file\n"
+" -h, --header-lines <file> optional file containing header lines to append\n"
+" -i, --include <expr> annotate only records passing filter expression\n"
+" -e, --exclude <expr> annotate only records failing filter expression\n"
+
+"\n"
+"Examples:\n"
+" # fill ancestral allele as INFO/AA for SNP records\n"
+" echo '##INFO=<ID=AA,Number=1,Type=String,Description=\"Ancestral allele\">' > aa.hdr\n"
+" bcftools +fill-from-fasta in.vcf -- -c AA -f aa.fasta -h aa.hdr -i 'TYPE=\"snp\"'\n"
+"\n"
+" # fix the REF allele in VCFs where REF=N or other\n"
+" bcftools +fill-from-fasta in.vcf -- -c REF -f reference.fasta\n"
+"\n"
+" # select sites marked as P (PASS) in the 1000G Phase3 mask\n"
+" echo '##INFO=<ID=P3_MASK,Number=1,Type=String,Description=\"1000G Phase 3 mask\">' > mask.hdr\n"
+" bcftools +fill-from-fasta in.vcf -Ou -- -c P3_MASK -f 1000G_mask.fasta -h mask.hdr | bcftools view -i 'P3_MASK=\"P\"'\n"
+"\n";
+}
+
+bcf_hdr_t *in_hdr = NULL, *out_hdr = NULL;
+faidx_t *faidx;
+int anno = 0;
+char *column = NULL;
+
+#define ANNO_REF 1
+#define ANNO_STRING 2
+#define ANNO_INT 3
+
+filter_t *filter;
+char *filter_str;
+int filter_logic; // one of FLT_INCLUDE/FLT_EXCLUDE (-i or -e)
+
+#define FLT_INCLUDE 1
+#define FLT_EXCLUDE 2
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+ int c;
+ char *ref_fname = NULL, *header_fname = NULL;
+ static struct option loptions[] =
+ {
+ {"exclude",required_argument,NULL,'e'},
+ {"include",required_argument,NULL,'i'},
+ {"column",required_argument,NULL,'c'},
+ {"fasta",required_argument,NULL,'f'},
+ {"header-lines",required_argument,NULL,'h'},
+ {NULL,0,NULL,0}
+ };
+ while ((c = getopt_long(argc, argv, "c:f:?h:i:e:",loptions,NULL)) >= 0)
+ {
+ switch (c)
+ {
+ case 'e': filter_str = optarg; filter_logic |= FLT_EXCLUDE; break;
+ case 'i': filter_str = optarg; filter_logic |= FLT_INCLUDE; break;
+ case 'c': column = optarg; break;
+ case 'f': ref_fname = optarg; break;
+ case 'h': header_fname = optarg; break;
+ case '?':
+ default: fprintf(bcftools_stderr,"%s", usage()); exit(1); break;
+ }
+ }
+ in_hdr = in;
+ out_hdr = out;
+ if ( filter_logic == (FLT_EXCLUDE|FLT_INCLUDE) ) { fprintf(bcftools_stderr,"Only one of -i or -e can be given.\n"); return -1; }
+
+ if ( !column )
+ {
+ fprintf(bcftools_stderr,"--column option is required.\n");
+ return -1;
+ }
+ if (header_fname)
+ {
+ htsFile *file = hts_open(header_fname, "rb");
+ if ( !file ) { fprintf(bcftools_stderr,"Error reading %s\n", header_fname); return -1; }
+ kstring_t str = {0,0,0};
+ while ( hts_getline(file, KS_SEP_LINE, &str) > 0 )
+ {
+ if ( bcf_hdr_append(out_hdr, str.s) ) { fprintf(bcftools_stderr,"Could not parse %s: %s\n", header_fname, str.s); return -1; }
+ }
+ hts_close(file);
+ free(str.s);
+ bcf_hdr_sync(out_hdr);
+ }
+ if (!strcasecmp("REF", column)) anno = ANNO_REF;
+ else {
+ if ( !strncasecmp(column,"INFO/",5) ) column += 5;
+ int hdr_id = bcf_hdr_id2int(out_hdr, BCF_DT_ID, column);
+ if (hdr_id<0) { fprintf(bcftools_stderr,"No header ID found for %s. Header lines can be added with the --header-lines option\n", column); return -1; }
+ switch ( bcf_hdr_id2type(out_hdr,BCF_HL_INFO,hdr_id) )
+ {
+ case BCF_HT_INT:
+ anno=ANNO_INT;
+ break;
+ case BCF_HT_STR:
+ anno=ANNO_STRING;
+ break;
+ default:
+ fprintf(bcftools_stderr,"The type of %s not recognised (%d)\n", column, bcf_hdr_id2type(out_hdr,BCF_HL_INFO,hdr_id));
+ return -1;
+ }
+ }
+ if ( !ref_fname )
+ {
+ fprintf(bcftools_stderr,"No fasta given.\n");
+ return -1;
+ }
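+ faidx = fai_load(ref_fname);
+ if ( !faidx ) { fprintf(bcftools_stderr,"Failed to load the fasta index: %s\n", ref_fname); return -1; }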
+ if ( filter_str )
+ filter = filter_init(in, filter_str);
+ return 0;
+}
+
+bcf1_t *process(bcf1_t *rec)
+{
+ // filter determines if we will annotate the record
+ // return record unchanged if filter applied
+ if ( filter )
+ {
+ int ret = filter_test(filter, rec, NULL);
+ if ( filter_logic==FLT_INCLUDE ) { if ( !ret ) return rec; }
+ else if ( ret ) return rec;
+ }
+
+ int i;
+ char *ref = rec->d.allele[0];
+ int ref_len = strlen(ref);
+ int fa_len;
+ // could be sped up here by fetching the whole chromosome? could assume
+ // sorted, but revert to this when non-sorted records found?
+ char *fa = faidx_fetch_seq(faidx, bcf_seqname(in_hdr,rec), rec->pos, rec->pos+ref_len-1, &fa_len);
+ if ( !fa ) error("faidx_fetch_seq failed at %s:%d\n", bcf_hdr_id2name(in_hdr,rec->rid), rec->pos+1);
+ for (i=0; i<fa_len; i++)
+ if ( fa[i]>='a' && fa[i]<='z' ) fa[i] -= 32; // uppercase letters only
+
+ assert(ref_len == fa_len);
+ if (anno==ANNO_REF)
+ strncpy(rec->d.allele[0], fa, fa_len);
+ else if (anno==ANNO_STRING)
+ bcf_update_info_string(out_hdr, rec, column, fa);
+ else if (anno==ANNO_INT && ref_len==1)
+ {
+ int val = atoi(&fa[0]);
+ bcf_update_info_int32(out_hdr, rec, column, &val, 1);
+ }
+ free(fa);
+ return rec;
+}
+
+void destroy(void)
+{
+ fai_destroy(faidx);
+ if (filter) filter_destroy(filter);
+}
--- /dev/null
+/* The MIT License
+
+ Copyright (c) 2015 Genome Research Ltd.
+
+ Author: Petr Danecek <pd3@sanger.ac.uk>
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in
+ all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ THE SOFTWARE.
+
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <strings.h>
+#include <string.h>
+#include <ctype.h>
+#include <getopt.h>
+#include <math.h>
+#include <htslib/hts.h>
+#include <htslib/kseq.h>
+#include <htslib/vcf.h>
+#include <htslib/khash_str2int.h>
+#include "bcftools.h"
+
+#define SET_AN (1<<0)
+#define SET_AC (1<<1)
+#define SET_AC_Hom (1<<2)
+#define SET_AC_Het (1<<3)
+#define SET_AC_Hemi (1<<4)
+#define SET_AF (1<<5)
+#define SET_NS (1<<6)
+#define SET_MAF (1<<7)
+#define SET_HWE (1<<8)
+#define SET_ExcHet (1<<9)
+
+typedef struct
+{
+ int nhom, nhet, nhemi, nac;
+}
+counts_t;
+
+typedef struct
+{
+ int ns;
+ int ncounts, mcounts;
+ counts_t *counts;
+ char *name, *suffix;
+ int nsmpl, *smpl;
+}
+pop_t;
+
+typedef struct
+{
+ bcf_hdr_t *in_hdr, *out_hdr;
+ int npop, tags, drop_missing, gt_id;
+ pop_t *pop, **smpl2pop;
+ float *farr;
+ int32_t *iarr, niarr, miarr, nfarr, mfarr;
+ double *hwe_probs;
+ int mhwe_probs;
+ kstring_t str;
+}
+args_t;
+
+static args_t *args;
+
+const char *about(void)
+{
+ return "Set INFO tags AF, AC, AC_Hemi, AC_Hom, AC_Het, AN, ExcHet, HWE, MAF, NS.\n";
+}
+
+const char *usage(void)
+{
+ return
+ "\n"
+ "About: Set INFO tags AF, AC, AC_Hemi, AC_Hom, AC_Het, AN, ExcHet, HWE, MAF, NS.\n"
+ "Usage: bcftools +fill-tags [General Options] -- [Plugin Options]\n"
+ "Options:\n"
+ " run \"bcftools plugin\" for a list of common options\n"
+ "\n"
+ "Plugin options:\n"
+ " -d, --drop-missing do not count half-missing genotypes \"./1\" as hemizygous\n"
+ " -l, --list-tags list available tags with description\n"
+ " -t, --tags LIST list of output tags. By default, all tags are filled.\n"
+ " -S, --samples-file FILE list of samples (first column) and comma-separated list of populations (second column)\n"
+ "\n"
+ "Example:\n"
+ " bcftools +fill-tags in.bcf -Ob -o out.bcf\n"
+ " bcftools +fill-tags in.bcf -Ob -o out.bcf -- -t AN,AC\n"
+ " bcftools +fill-tags in.bcf -Ob -o out.bcf -- -d\n"
+ " bcftools +fill-tags in.bcf -Ob -o out.bcf -- -S sample-group.txt -t HWE\n"
+ "\n";
+}
+
+void parse_samples(args_t *args, char *fname)
+{
+ htsFile *fp = hts_open(fname, "r");
+ if ( !fp ) error("Could not read: %s\n", fname);
+
+ void *pop2i = khash_str2int_init();
+ void *smpli = khash_str2int_init();
+ kstring_t str = {0,0,0};
+
+ int moff = 0, *off = NULL, nsmpl = 0;
+ while ( hts_getline(fp, KS_SEP_LINE, &str)>=0 )
+ {
+ // NA12400 GRP1
+ // NA18507 GRP1,GRP2
+ char *pop_names = str.s + str.l - 1;
+ while ( pop_names >= str.s && isspace(*pop_names) ) pop_names--;
+ if ( pop_names <= str.s ) error("Could not parse the file: %s\n", str.s);
+ pop_names[1] = 0; // trailing spaces
+ while ( pop_names >= str.s && !isspace(*pop_names) ) pop_names--;
+ if ( pop_names <= str.s ) error("Could not parse the file: %s\n", str.s);
+
+ char *smpl = pop_names++;
+ while ( smpl >= str.s && isspace(*smpl) ) smpl--;
+ if ( smpl < str.s ) error("Could not parse the file: %s\n", str.s); // smpl points at the last character of the sample name; one- and two-character names are valid
+ smpl[1] = 0;
+ smpl = str.s;
+
+ int ismpl = bcf_hdr_id2int(args->in_hdr,BCF_DT_SAMPLE,smpl);
+ if ( ismpl<0 )
+ {
+ fprintf(stderr,"Warning: The sample is not present in the VCF: %s\n",smpl);
+ continue;
+ }
+ if ( khash_str2int_has_key(smpli,smpl) )
+ {
+ fprintf(stderr,"Warning: The sample is listed twice in %s: %s\n",fname,smpl);
+ continue;
+ }
+ khash_str2int_inc(smpli,strdup(smpl));
+
+ int i,npops = ksplit_core(pop_names,',',&moff,&off);
+ for (i=0; i<npops; i++)
+ {
+ char *pop_name = &pop_names[off[i]];
+ if ( !khash_str2int_has_key(pop2i,pop_name) )
+ {
+ pop_name = strdup(pop_name);
+ khash_str2int_set(pop2i,pop_name,args->npop);
+ args->npop++;
+ args->pop = (pop_t*) realloc(args->pop,args->npop*sizeof(*args->pop));
+ memset(args->pop+args->npop-1,0,sizeof(*args->pop));
+ args->pop[args->npop-1].name = pop_name;
+ args->pop[args->npop-1].suffix = (char*)malloc(strlen(pop_name)+2);
+ memcpy(args->pop[args->npop-1].suffix+1,pop_name,strlen(pop_name)+1);
+ args->pop[args->npop-1].suffix[0] = '_';
+ }
+ int ipop = 0;
+ khash_str2int_get(pop2i,pop_name,&ipop);
+ pop_t *pop = &args->pop[ipop];
+ pop->nsmpl++;
+ pop->smpl = (int*) realloc(pop->smpl,pop->nsmpl*sizeof(*pop->smpl));
+ pop->smpl[pop->nsmpl-1] = ismpl;
+ }
+ nsmpl++;
+ }
+
+ if ( nsmpl != bcf_hdr_nsamples(args->in_hdr) )
+ fprintf(stderr,"Warning: %d samples in the list, %d samples in the VCF.\n", nsmpl,bcf_hdr_nsamples(args->in_hdr));
+
+ if ( !args->npop ) error("No populations given?\n");
+
+ khash_str2int_destroy(pop2i);
+ khash_str2int_destroy_free(smpli);
+ free(str.s);
+ free(off);
+ hts_close(fp);
+}
+
+void init_pops(args_t *args)
+{
+ int i,j, nsmpl;
+
+ // add the summary population covering all samples; its name and suffix are empty,
+ // so its tags are written without a suffix (e.g. AN rather than AN_GRP1)
+ args->npop++;
+ args->pop = (pop_t*) realloc(args->pop,args->npop*sizeof(*args->pop));
+ memset(args->pop+args->npop-1,0,sizeof(*args->pop));
+ args->pop[args->npop-1].name = strdup("");
+ args->pop[args->npop-1].suffix = strdup("");
+
+ nsmpl = bcf_hdr_nsamples(args->in_hdr);
+ // each sample gets npop+1 slots: pointers to its populations followed by a NULL terminator
+ args->smpl2pop = (pop_t**) calloc(nsmpl*(args->npop+1),sizeof(pop_t*));
+ for (i=0; i<nsmpl; i++)
+ args->smpl2pop[i*(args->npop+1)] = &args->pop[args->npop-1];
+
+ for (i=0; i<args->npop; i++)
+ {
+ for (j=0; j<args->pop[i].nsmpl; j++)
+ {
+ int ismpl = args->pop[i].smpl[j];
+ pop_t **smpl2pop = &args->smpl2pop[ismpl*(args->npop+1)];
+ while (*smpl2pop) smpl2pop++;
+ *smpl2pop = &args->pop[i];
+ }
+ }
+}
+
+int parse_tags(args_t *args, const char *str)
+{
+ int i, flag = 0, n_tags;
+ char **tags = hts_readlist(str, 0, &n_tags);
+ for(i=0; i<n_tags; i++)
+ {
+ if ( !strcasecmp(tags[i],"AN") ) flag |= SET_AN;
+ else if ( !strcasecmp(tags[i],"AC") ) flag |= SET_AC;
+ else if ( !strcasecmp(tags[i],"NS") ) flag |= SET_NS;
+ else if ( !strcasecmp(tags[i],"AC_Hom") ) flag |= SET_AC_Hom;
+ else if ( !strcasecmp(tags[i],"AC_Het") ) flag |= SET_AC_Het;
+ else if ( !strcasecmp(tags[i],"AC_Hemi") ) flag |= SET_AC_Hemi;
+ else if ( !strcasecmp(tags[i],"AF") ) flag |= SET_AF;
+ else if ( !strcasecmp(tags[i],"MAF") ) flag |= SET_MAF;
+ else if ( !strcasecmp(tags[i],"HWE") ) flag |= SET_HWE;
+ else if ( !strcasecmp(tags[i],"ExcHet") ) flag |= SET_ExcHet;
+ else
+ {
+ fprintf(stderr,"Error parsing \"--tags %s\": the tag \"%s\" is not supported\n", str,tags[i]);
+ exit(1);
+ }
+ free(tags[i]);
+ }
+ if (n_tags) free(tags);
+ return flag;
+}
+
+void hdr_append(args_t *args, char *fmt)
+{
+ int i;
+ for (i=0; i<args->npop; i++)
+ bcf_hdr_printf(args->out_hdr, fmt, args->pop[i].suffix,*args->pop[i].name ? " in " : "",args->pop[i].name);
+}
+
+void list_tags(void)
+{
+ error(
+ "INFO/AN Number:1 Type:Integer .. Total number of alleles in called genotypes\n"
+ "INFO/AC Number:A Type:Integer .. Allele count in genotypes\n"
+ "INFO/NS Number:1 Type:Integer .. Number of samples with data\n"
+ "INFO/AC_Hom Number:A Type:Integer .. Allele counts in homozygous genotypes\n"
+ "INFO/AC_Het Number:A Type:Integer .. Allele counts in heterozygous genotypes\n"
+ "INFO/AC_Hemi Number:A Type:Integer .. Allele counts in hemizygous genotypes\n"
+ "INFO/AF Number:A Type:Float .. Allele frequency\n"
+ "INFO/MAF Number:A Type:Float .. Minor Allele frequency\n"
+ "INFO/HWE Number:A Type:Float .. HWE test (PMID:15789306)\n"
+ "INFO/ExcHet Number:A Type:Float .. Probability of excess heterozygosity\n"
+ );
+}
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+ args = (args_t*) calloc(1,sizeof(args_t));
+ args->in_hdr = in;
+ args->out_hdr = out;
+ char *samples_fname = NULL;
+ static struct option loptions[] =
+ {
+ {"list-tags",0,0,'l'},
+ {"drop-missing",0,0,'d'},
+ {"tags",1,0,'t'},
+ {"samples-file",1,0,'S'},
+ {0,0,0,0}
+ };
+ int c;
+ while ((c = getopt_long(argc, argv, "?ht:dS:l",loptions,NULL)) >= 0)
+ {
+ switch (c)
+ {
+ case 'l': list_tags(); break;
+ case 'd': args->drop_missing = 1; break;
+ case 't': args->tags |= parse_tags(args,optarg); break;
+ case 'S': samples_fname = optarg; break;
+ case 'h':
+ case '?':
+ default: error("%s", usage()); break;
+ }
+ }
+
+ if ( optind != argc ) error("%s",usage());
+
+ args->gt_id = bcf_hdr_id2int(args->in_hdr,BCF_DT_ID,"GT");
+ if ( args->gt_id<0 ) error("Error: GT field is not present\n");
+
+ if ( !args->tags )
+ for (c=0; c<=9; c++) args->tags |= 1<<c; // by default all tags will be filled
+
+ if ( samples_fname ) parse_samples(args, samples_fname);
+ init_pops(args);
+
+ if ( args->tags & SET_AN ) hdr_append(args, "##INFO=<ID=AN%s,Number=1,Type=Integer,Description=\"Total number of alleles in called genotypes%s%s\">");
+ if ( args->tags & SET_AC ) hdr_append(args, "##INFO=<ID=AC%s,Number=A,Type=Integer,Description=\"Allele count in genotypes%s%s\">");
+ if ( args->tags & SET_NS ) hdr_append(args, "##INFO=<ID=NS%s,Number=1,Type=Integer,Description=\"Number of samples with data%s%s\">");
+ if ( args->tags & SET_AC_Hom ) hdr_append(args, "##INFO=<ID=AC_Hom%s,Number=A,Type=Integer,Description=\"Allele counts in homozygous genotypes%s%s\">");
+ if ( args->tags & SET_AC_Het ) hdr_append(args, "##INFO=<ID=AC_Het%s,Number=A,Type=Integer,Description=\"Allele counts in heterozygous genotypes%s%s\">");
+ if ( args->tags & SET_AC_Hemi ) hdr_append(args, "##INFO=<ID=AC_Hemi%s,Number=A,Type=Integer,Description=\"Allele counts in hemizygous genotypes%s%s\">");
+ if ( args->tags & SET_AF ) hdr_append(args, "##INFO=<ID=AF%s,Number=A,Type=Float,Description=\"Allele frequency%s%s\">");
+ if ( args->tags & SET_MAF ) hdr_append(args, "##INFO=<ID=MAF%s,Number=A,Type=Float,Description=\"Minor Allele frequency%s%s\">");
+ if ( args->tags & SET_HWE ) hdr_append(args, "##INFO=<ID=HWE%s,Number=A,Type=Float,Description=\"HWE test%s%s (PMID:15789306)\">");
+ if ( args->tags & SET_ExcHet ) hdr_append(args, "##INFO=<ID=ExcHet%s,Number=A,Type=Float,Description=\"Probability of excess heterozygosity%s%s\">");
+
+ return 0;
+}
+
+/*
+ Wigginton 2005, PMID: 15789306
+
+ nref .. number of reference alleles
+ nalt .. number of alt alleles
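+ nhet .. number of het genotypes, assuming number of genotypes = (nref+nalt)/2
+
+ p_hwe .. output: summed probability of all het counts that are no more likely than nhet
+ p_exc_het .. output: probability of observing nhet or more hets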
+
+*/
+
+void calc_hwe(args_t *args, int nref, int nalt, int nhet, float *p_hwe, float *p_exc_het)
+{
+ int ngt = (nref+nalt) / 2;
+ int nrare = nref < nalt ? nref : nalt;
+
+ // sanity check: there is odd/even number of rare alleles iff there is odd/even number of hets
+ if ( (nrare & 1) ^ (nhet & 1) ) error("nrare/nhet should be both odd or even: nrare=%d nref=%d nalt=%d nhet=%d\n",nrare,nref,nalt,nhet);
+ if ( nrare < nhet ) error("Fewer rare alleles than hets? nrare=%d nref=%d nalt=%d nhet=%d\n",nrare,nref,nalt,nhet);
+ if ( (nref+nalt) & 1 ) error("Expected diploid genotypes: nref=%d nalt=%d\n",nref,nalt);
+
+ // initialize het probs
+ hts_expand(double,nrare+1,args->mhwe_probs,args->hwe_probs);
+ memset(args->hwe_probs, 0, sizeof(*args->hwe_probs)*(nrare+1));
+ double *probs = args->hwe_probs;
+
+ // start at midpoint
+ int mid = nrare * (nref + nalt - nrare) / (nref + nalt);
+
+ // check to ensure that midpoint and rare alleles have same parity
+ if ( (nrare & 1) ^ (mid & 1) ) mid++;
+
+ int het = mid;
+ int hom_r = (nrare - mid) / 2;
+ int hom_c = ngt - het - hom_r;
+ double sum = probs[mid] = 1.0;
+
+ for (het = mid; het > 1; het -= 2)
+ {
+ probs[het - 2] = probs[het] * het * (het - 1.0) / (4.0 * (hom_r + 1.0) * (hom_c + 1.0));
+ sum += probs[het - 2];
+
+ // 2 fewer heterozygotes for next iteration -> add one rare, one common homozygote
+ hom_r++;
+ hom_c++;
+ }
+
+ het = mid;
+ hom_r = (nrare - mid) / 2;
+ hom_c = ngt - het - hom_r;
+ for (het = mid; het <= nrare - 2; het += 2)
+ {
+ probs[het + 2] = probs[het] * 4.0 * hom_r * hom_c / ((het + 2.0) * (het + 1.0));
+ sum += probs[het + 2];
+
+ // add 2 heterozygotes for next iteration -> subtract one rare, one common homozygote
+ hom_r--;
+ hom_c--;
+ }
+
+ for (het=0; het<nrare+1; het++) probs[het] /= sum;
+
+ double prob = probs[nhet];
+ for (het = nhet + 1; het <= nrare; het++) prob += probs[het];
+ *p_exc_het = prob;
+
+ prob = 0;
+ for (het=0; het <= nrare; het++)
+ {
+ if ( probs[het] > probs[nhet]) continue;
+ prob += probs[het];
+ }
+ if ( prob > 1 ) prob = 1;
+ *p_hwe = prob;
+}
+
+static inline void set_counts(pop_t *pop, int is_half, int is_hom, int is_hemi, int als)
+{
+ int ial;
+ for (ial=0; als; ial++)
+ {
+ if ( als&1 )
+ {
+ if ( is_half ) pop->counts[ial].nac++;
+ else if ( !is_hom ) pop->counts[ial].nhet++;
+ else if ( !is_hemi ) pop->counts[ial].nhom += 2;
+ else pop->counts[ial].nhemi++;
+ }
+ als >>= 1;
+ }
+ pop->ns++;
+}
+static void clean_counts(pop_t *pop, int nals)
+{
+ pop->ns = 0;
+ memset(pop->counts,0,sizeof(counts_t)*nals);
+}
+
+bcf1_t *process(bcf1_t *rec)
+{
+ int i,j, nsmpl = bcf_hdr_nsamples(args->in_hdr);
+
+ bcf_unpack(rec, BCF_UN_FMT);
+ bcf_fmt_t *fmt_gt = NULL;
+ for (i=0; i<rec->n_fmt; i++)
+ if ( rec->d.fmt[i].id==args->gt_id ) { fmt_gt = &rec->d.fmt[i]; break; }
+ if ( !fmt_gt ) return rec; // no GT tag
+
+ hts_expand(int32_t,rec->n_allele, args->miarr, args->iarr);
+ hts_expand(float,rec->n_allele*2, args->mfarr, args->farr);
+ for (i=0; i<args->npop; i++)
+ hts_expand(counts_t,rec->n_allele,args->pop[i].mcounts, args->pop[i].counts);
+
+ for (i=0; i<args->npop; i++)
+ clean_counts(&args->pop[i], rec->n_allele);
+
+ assert( rec->n_allele < 8*sizeof(int) );
+
+ #define BRANCH_INT(type_t,vector_end) \
+ { \
+ for (i=0; i<nsmpl; i++) \
+ { \
+ type_t *p = (type_t*) (fmt_gt->p + i*fmt_gt->size); \
+ int ial, als = 0, nals = 0, is_half, is_hom, is_hemi; \
+ for (ial=0; ial<fmt_gt->n; ial++) \
+ { \
+ if ( p[ial]==vector_end ) break; /* smaller ploidy */ \
+ if ( bcf_gt_is_missing(p[ial]) ) continue; /* missing allele */ \
+ int idx = bcf_gt_allele(p[ial]); \
+ nals++; \
+ \
+ if ( idx >= rec->n_allele ) \
+ error("Incorrect allele (\"%d\") in %s at %s:%d\n",idx,args->in_hdr->samples[i],bcf_seqname(args->in_hdr,rec),rec->pos+1); \
+ als |= (1<<idx); /* this breaks with too many alleles */ \
+ } \
+ if ( nals==0 ) continue; /* missing genotype */ \
+ is_hom = als && !(als & (als-1)); /* only one bit is set */ \
+ if ( nals!=ial ) \
+ { \
+ if ( args->drop_missing ) is_hemi = 0, is_half = 1; \
+ else is_hemi = 1, is_half = 0; \
+ } \
+ else if ( nals==1 ) is_hemi = 1, is_half = 0; \
+ else is_hemi = 0, is_half = 0; \
+ pop_t **pop = &args->smpl2pop[i*(args->npop+1)]; \
+ while ( *pop ) { set_counts(*pop,is_half,is_hom,is_hemi,als); pop++; }\
+ } \
+ }
+ switch (fmt_gt->type) {
+ case BCF_BT_INT8: BRANCH_INT(int8_t, bcf_int8_vector_end); break;
+ case BCF_BT_INT16: BRANCH_INT(int16_t, bcf_int16_vector_end); break;
+ case BCF_BT_INT32: BRANCH_INT(int32_t, bcf_int32_vector_end); break;
+ default: error("The GT type is not recognised: %d at %s:%d\n",fmt_gt->type, bcf_seqname(args->in_hdr,rec),rec->pos+1); break;
+ }
+ #undef BRANCH_INT
+
+ if ( args->tags & SET_NS )
+ {
+ for (i=0; i<args->npop; i++)
+ {
+ args->str.l = 0;
+ ksprintf(&args->str, "NS%s", args->pop[i].suffix);
+ if ( bcf_update_info_int32(args->out_hdr,rec,args->str.s,&args->pop[i].ns,1)!=0 )
+ error("Error occurred while updating %s at %s:%d\n", args->str.s,bcf_seqname(args->in_hdr,rec),rec->pos+1);
+ }
+ }
+ if ( args->tags & SET_AN )
+ {
+ for (i=0; i<args->npop; i++)
+ {
+ pop_t *pop = &args->pop[i];
+ int32_t an = 0;
+ for (j=0; j<rec->n_allele; j++)
+ an += pop->counts[j].nhet + pop->counts[j].nhom + pop->counts[j].nhemi + pop->counts[j].nac;
+
+ args->str.l = 0;
+ ksprintf(&args->str, "AN%s", args->pop[i].suffix);
+ if ( bcf_update_info_int32(args->out_hdr,rec,args->str.s,&an,1)!=0 )
+ error("Error occurred while updating %s at %s:%d\n", args->str.s,bcf_seqname(args->in_hdr,rec),rec->pos+1);
+ }
+ }
+ if ( args->tags & (SET_AF | SET_MAF) )
+ {
+ for (i=0; i<args->npop; i++)
+ {
+ int32_t an = 0;
+ if ( rec->n_allele > 1 )
+ {
+ pop_t *pop = &args->pop[i];
+ memset(args->farr, 0, sizeof(*args->farr)*(rec->n_allele-1));
+ for (j=1; j<rec->n_allele; j++)
+ args->farr[j-1] += pop->counts[j].nhet + pop->counts[j].nhom + pop->counts[j].nhemi + pop->counts[j].nac;
+ an = pop->counts[0].nhet + pop->counts[0].nhom + pop->counts[0].nhemi + pop->counts[0].nac;
+ for (j=1; j<rec->n_allele; j++) an += args->farr[j-1];
+ if ( !an ) continue;
+ for (j=1; j<rec->n_allele; j++) args->farr[j-1] /= an;
+ }
+ if ( args->tags & SET_AF )
+ {
+ args->str.l = 0;
+ ksprintf(&args->str, "AF%s", args->pop[i].suffix);
+ if ( bcf_update_info_float(args->out_hdr,rec,args->str.s,args->farr,rec->n_allele-1)!=0 )
+ error("Error occurred while updating %s at %s:%d\n", args->str.s,bcf_seqname(args->in_hdr,rec),rec->pos+1);
+ }
+ if ( args->tags & SET_MAF )
+ {
+ if ( !an ) continue;
+ for (j=1; j<rec->n_allele; j++)
+ if ( args->farr[j-1] > 0.5 ) args->farr[j-1] = 1 - args->farr[j-1]; // todo: this is incorrect for multiallelic sites
+ args->str.l = 0;
+ ksprintf(&args->str, "MAF%s", args->pop[i].suffix);
+ if ( bcf_update_info_float(args->out_hdr,rec,args->str.s,args->farr,rec->n_allele-1)!=0 )
+ error("Error occurred while updating %s at %s:%d\n", args->str.s,bcf_seqname(args->in_hdr,rec),rec->pos+1);
+ }
+ }
+ }
+ if ( args->tags & SET_AC )
+ {
+ for (i=0; i<args->npop; i++)
+ {
+ if ( rec->n_allele > 1 )
+ {
+ pop_t *pop = &args->pop[i];
+ memset(args->iarr, 0, sizeof(*args->iarr)*(rec->n_allele-1));
+ for (j=1; j<rec->n_allele; j++)
+ args->iarr[j-1] += pop->counts[j].nhet + pop->counts[j].nhom + pop->counts[j].nhemi + pop->counts[j].nac;
+ }
+ args->str.l = 0;
+ ksprintf(&args->str, "AC%s", args->pop[i].suffix);
+ if ( bcf_update_info_int32(args->out_hdr,rec,args->str.s,args->iarr,rec->n_allele-1)!=0 )
+ error("Error occurred while updating %s at %s:%d\n", args->str.s,bcf_seqname(args->in_hdr,rec),rec->pos+1);
+ }
+ }
+ if ( args->tags & SET_AC_Het )
+ {
+ for (i=0; i<args->npop; i++)
+ {
+ if ( rec->n_allele > 1 )
+ {
+ pop_t *pop = &args->pop[i];
+ memset(args->iarr, 0, sizeof(*args->iarr)*(rec->n_allele-1));
+ for (j=1; j<rec->n_allele; j++)
+ args->iarr[j-1] += pop->counts[j].nhet;
+ }
+ args->str.l = 0;
+ ksprintf(&args->str, "AC_Het%s", args->pop[i].suffix);
+ if ( bcf_update_info_int32(args->out_hdr,rec,args->str.s,args->iarr,rec->n_allele-1)!=0 )
+ error("Error occurred while updating %s at %s:%d\n", args->str.s,bcf_seqname(args->in_hdr,rec),rec->pos+1);
+ }
+ }
+ if ( args->tags & SET_AC_Hom )
+ {
+ for (i=0; i<args->npop; i++)
+ {
+ if ( rec->n_allele > 1 )
+ {
+ pop_t *pop = &args->pop[i];
+ memset(args->iarr, 0, sizeof(*args->iarr)*(rec->n_allele-1));
+ for (j=1; j<rec->n_allele; j++)
+ args->iarr[j-1] += pop->counts[j].nhom;
+ }
+ args->str.l = 0;
+ ksprintf(&args->str, "AC_Hom%s", args->pop[i].suffix);
+ if ( bcf_update_info_int32(args->out_hdr,rec,args->str.s,args->iarr,rec->n_allele-1)!=0 )
+ error("Error occurred while updating %s at %s:%d\n", args->str.s,bcf_seqname(args->in_hdr,rec),rec->pos+1);
+ }
+ }
+ if ( (args->tags & SET_AC_Hemi) && rec->n_allele > 1 )
+ {
+ for (i=0; i<args->npop; i++)
+ {
+ if ( rec->n_allele > 1 )
+ {
+ pop_t *pop = &args->pop[i];
+ memset(args->iarr, 0, sizeof(*args->iarr)*(rec->n_allele-1));
+ for (j=1; j<rec->n_allele; j++)
+ args->iarr[j-1] += pop->counts[j].nhemi;
+ }
+ args->str.l = 0;
+ ksprintf(&args->str, "AC_Hemi%s", args->pop[i].suffix);
+ if ( bcf_update_info_int32(args->out_hdr,rec,args->str.s,args->iarr,rec->n_allele-1)!=0 )
+ error("Error occurred while updating %s at %s:%d\n", args->str.s,bcf_seqname(args->in_hdr,rec),rec->pos+1);
+ }
+ }
+ if ( args->tags & (SET_HWE|SET_ExcHet) )
+ {
+ for (i=0; i<args->npop; i++)
+ {
+ float *fhwe = args->farr;
+ float *fexc_het = args->farr + rec->n_allele;
+ if ( rec->n_allele > 1 )
+ {
+ pop_t *pop = &args->pop[i];
+ memset(args->farr, 0, sizeof(*args->farr)*(2*rec->n_allele));
+ int nref_tot = pop->counts[0].nhom;
+ for (j=0; j<rec->n_allele; j++) nref_tot += pop->counts[j].nhet; // NB this neglects multiallelic genotypes
+ for (j=1; j<rec->n_allele; j++)
+ {
+ int nref = nref_tot - pop->counts[j].nhet;
+ int nalt = pop->counts[j].nhet + pop->counts[j].nhom;
+ int nhet = pop->counts[j].nhet;
+ if ( nref>0 && nalt>0 )
+ calc_hwe(args, nref, nalt, nhet, &fhwe[j-1], &fexc_het[j-1]);
+ else
+ fhwe[j-1] = fexc_het[j-1] = 1;
+ }
+ }
+ if ( args->tags & SET_HWE )
+ {
+ args->str.l = 0;
+ ksprintf(&args->str, "HWE%s", args->pop[i].suffix);
+ if ( bcf_update_info_float(args->out_hdr,rec,args->str.s,fhwe,rec->n_allele-1)!=0 )
+ error("Error occurred while updating %s at %s:%d\n", args->str.s,bcf_seqname(args->in_hdr,rec),rec->pos+1);
+ }
+ if ( args->tags & SET_ExcHet )
+ {
+ args->str.l = 0;
+ ksprintf(&args->str, "ExcHet%s", args->pop[i].suffix);
+ if ( bcf_update_info_float(args->out_hdr,rec,args->str.s,fexc_het,rec->n_allele-1)!=0 )
+ error("Error occurred while updating %s at %s:%d\n", args->str.s,bcf_seqname(args->in_hdr,rec),rec->pos+1);
+ }
+ }
+ }
+
+ return rec;
+}
+
+void destroy(void)
+{
+ int i;
+ for (i=0; i<args->npop; i++)
+ {
+ free(args->pop[i].name);
+ free(args->pop[i].suffix);
+ free(args->pop[i].smpl);
+ free(args->pop[i].counts);
+ }
+ free(args->str.s);
+ free(args->pop);
+ free(args->smpl2pop);
+ free(args->iarr);
+ free(args->farr);
+ free(args->hwe_probs);
+ free(args);
+}
+
+
+
--- /dev/null
+#include "bcftools.pysam.h"
+
+/* The MIT License
+
+ Copyright (c) 2015 Genome Research Ltd.
+
+ Author: Petr Danecek <pd3@sanger.ac.uk>
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in
+ all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ THE SOFTWARE.
+
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <strings.h>
+#include <ctype.h>
+#include <assert.h>
+#include <getopt.h>
+#include <math.h>
+#include <htslib/hts.h>
+#include <htslib/kseq.h>
+#include <htslib/vcf.h>
+#include <htslib/khash_str2int.h>
+#include "bcftools.h"
+
+#define SET_AN (1<<0)
+#define SET_AC (1<<1)
+#define SET_AC_Hom (1<<2)
+#define SET_AC_Het (1<<3)
+#define SET_AC_Hemi (1<<4)
+#define SET_AF (1<<5)
+#define SET_NS (1<<6)
+#define SET_MAF (1<<7)
+#define SET_HWE (1<<8)
+#define SET_ExcHet (1<<9)
+
+typedef struct
+{
+ int nhom, nhet, nhemi, nac;
+}
+counts_t;
+
+typedef struct
+{
+ int ns;
+ int ncounts, mcounts;
+ counts_t *counts;
+ char *name, *suffix;
+ int nsmpl, *smpl;
+}
+pop_t;
+
+typedef struct
+{
+ bcf_hdr_t *in_hdr, *out_hdr;
+ int npop, tags, drop_missing, gt_id;
+ pop_t *pop, **smpl2pop;
+ float *farr;
+ int32_t *iarr, niarr, miarr, nfarr, mfarr;
+ double *hwe_probs;
+ int mhwe_probs;
+ kstring_t str;
+}
+args_t;
+
+static args_t *args;
+
+const char *about(void)
+{
+ return "Set INFO tags AF, AC, AC_Hemi, AC_Hom, AC_Het, AN, ExcHet, HWE, MAF, NS.\n";
+}
+
+const char *usage(void)
+{
+ return
+ "\n"
+ "About: Set INFO tags AF, AC, AC_Hemi, AC_Hom, AC_Het, AN, ExcHet, HWE, MAF, NS.\n"
+ "Usage: bcftools +fill-tags [General Options] -- [Plugin Options]\n"
+ "Options:\n"
+ " run \"bcftools plugin\" for a list of common options\n"
+ "\n"
+ "Plugin options:\n"
+ " -d, --drop-missing do not count half-missing genotypes \"./1\" as hemizygous\n"
+ " -l, --list-tags list available tags with description\n"
+ " -t, --tags LIST list of output tags. By default, all tags are filled.\n"
+ " -S, --samples-file FILE list of samples (first column) and comma-separated list of populations (second column)\n"
+ "\n"
+ "Example:\n"
+ " bcftools +fill-tags in.bcf -Ob -o out.bcf\n"
+ " bcftools +fill-tags in.bcf -Ob -o out.bcf -- -t AN,AC\n"
+ " bcftools +fill-tags in.bcf -Ob -o out.bcf -- -d\n"
+ " bcftools +fill-tags in.bcf -Ob -o out.bcf -- -S sample-group.txt -t HWE\n"
+ "\n";
+}
+
+void parse_samples(args_t *args, char *fname)
+{
+ htsFile *fp = hts_open(fname, "r");
+ if ( !fp ) error("Could not read: %s\n", fname);
+
+ void *pop2i = khash_str2int_init();
+ void *smpli = khash_str2int_init();
+ kstring_t str = {0,0,0};
+
+ int moff = 0, *off = NULL, nsmpl = 0;
+ while ( hts_getline(fp, KS_SEP_LINE, &str)>=0 )
+ {
+ // NA12400 GRP1
+ // NA18507 GRP1,GRP2
+ char *pop_names = str.s + str.l - 1;
+ while ( pop_names >= str.s && isspace(*pop_names) ) pop_names--;
+ if ( pop_names <= str.s ) error("Could not parse the file: %s\n", str.s);
+ pop_names[1] = 0; // trailing spaces
+ while ( pop_names >= str.s && !isspace(*pop_names) ) pop_names--;
+ if ( pop_names <= str.s ) error("Could not parse the file: %s\n", str.s);
+
+ char *smpl = pop_names++;
+ while ( smpl >= str.s && isspace(*smpl) ) smpl--;
+ if ( smpl < str.s ) error("Could not parse the file: %s\n", str.s); // smpl points at the last character of the sample name; one- and two-character names are valid
+ smpl[1] = 0;
+ smpl = str.s;
+
+ int ismpl = bcf_hdr_id2int(args->in_hdr,BCF_DT_SAMPLE,smpl);
+ if ( ismpl<0 )
+ {
+ fprintf(bcftools_stderr,"Warning: The sample is not present in the VCF: %s\n",smpl);
+ continue;
+ }
+ if ( khash_str2int_has_key(smpli,smpl) )
+ {
+ fprintf(bcftools_stderr,"Warning: The sample is listed twice in %s: %s\n",fname,smpl);
+ continue;
+ }
+ khash_str2int_inc(smpli,strdup(smpl));
+
+ int i,npops = ksplit_core(pop_names,',',&moff,&off);
+ for (i=0; i<npops; i++)
+ {
+ char *pop_name = &pop_names[off[i]];
+ if ( !khash_str2int_has_key(pop2i,pop_name) )
+ {
+ pop_name = strdup(pop_name);
+ khash_str2int_set(pop2i,pop_name,args->npop);
+ args->npop++;
+ args->pop = (pop_t*) realloc(args->pop,args->npop*sizeof(*args->pop));
+ memset(args->pop+args->npop-1,0,sizeof(*args->pop));
+ args->pop[args->npop-1].name = pop_name;
+ args->pop[args->npop-1].suffix = (char*)malloc(strlen(pop_name)+2);
+ memcpy(args->pop[args->npop-1].suffix+1,pop_name,strlen(pop_name)+1);
+ args->pop[args->npop-1].suffix[0] = '_';
+ }
+ int ipop = 0;
+ khash_str2int_get(pop2i,pop_name,&ipop);
+ pop_t *pop = &args->pop[ipop];
+ pop->nsmpl++;
+ pop->smpl = (int*) realloc(pop->smpl,pop->nsmpl*sizeof(*pop->smpl));
+ pop->smpl[pop->nsmpl-1] = ismpl;
+ }
+ nsmpl++;
+ }
+
+ if ( nsmpl != bcf_hdr_nsamples(args->in_hdr) )
+ fprintf(bcftools_stderr,"Warning: %d samples in the list, %d samples in the VCF.\n", nsmpl,bcf_hdr_nsamples(args->in_hdr));
+
+ if ( !args->npop ) error("No populations given?\n");
+
+ khash_str2int_destroy(pop2i);
+ khash_str2int_destroy_free(smpli);
+ free(str.s);
+ free(off);
+ hts_close(fp);
+}
+
+void init_pops(args_t *args)
+{
+ int i,j, nsmpl;
+
+ // add the summary population covering all samples; its name and suffix are empty,
+ // so its tags are written without a suffix (e.g. AN rather than AN_GRP1)
+ args->npop++;
+ args->pop = (pop_t*) realloc(args->pop,args->npop*sizeof(*args->pop));
+ memset(args->pop+args->npop-1,0,sizeof(*args->pop));
+ args->pop[args->npop-1].name = strdup("");
+ args->pop[args->npop-1].suffix = strdup("");
+
+ nsmpl = bcf_hdr_nsamples(args->in_hdr);
+ // each sample gets npop+1 slots: pointers to its populations followed by a NULL terminator
+ args->smpl2pop = (pop_t**) calloc(nsmpl*(args->npop+1),sizeof(pop_t*));
+ for (i=0; i<nsmpl; i++)
+ args->smpl2pop[i*(args->npop+1)] = &args->pop[args->npop-1];
+
+ for (i=0; i<args->npop; i++)
+ {
+ for (j=0; j<args->pop[i].nsmpl; j++)
+ {
+ int ismpl = args->pop[i].smpl[j];
+ pop_t **smpl2pop = &args->smpl2pop[ismpl*(args->npop+1)];
+ while (*smpl2pop) smpl2pop++;
+ *smpl2pop = &args->pop[i];
+ }
+ }
+}
+
+int parse_tags(args_t *args, const char *str)
+{
+ int i, flag = 0, n_tags;
+ char **tags = hts_readlist(str, 0, &n_tags);
+ for(i=0; i<n_tags; i++)
+ {
+ if ( !strcasecmp(tags[i],"AN") ) flag |= SET_AN;
+ else if ( !strcasecmp(tags[i],"AC") ) flag |= SET_AC;
+ else if ( !strcasecmp(tags[i],"NS") ) flag |= SET_NS;
+ else if ( !strcasecmp(tags[i],"AC_Hom") ) flag |= SET_AC_Hom;
+ else if ( !strcasecmp(tags[i],"AC_Het") ) flag |= SET_AC_Het;
+ else if ( !strcasecmp(tags[i],"AC_Hemi") ) flag |= SET_AC_Hemi;
+ else if ( !strcasecmp(tags[i],"AF") ) flag |= SET_AF;
+ else if ( !strcasecmp(tags[i],"MAF") ) flag |= SET_MAF;
+ else if ( !strcasecmp(tags[i],"HWE") ) flag |= SET_HWE;
+ else if ( !strcasecmp(tags[i],"ExcHet") ) flag |= SET_ExcHet;
+ else
+ {
+ fprintf(bcftools_stderr,"Error parsing \"--tags %s\": the tag \"%s\" is not supported\n", str,tags[i]);
+ exit(1);
+ }
+ free(tags[i]);
+ }
+ if (n_tags) free(tags);
+ return flag;
+}
+
+void hdr_append(args_t *args, char *fmt)
+{
+ int i;
+ for (i=0; i<args->npop; i++)
+ bcf_hdr_printf(args->out_hdr, fmt, args->pop[i].suffix,*args->pop[i].name ? " in " : "",args->pop[i].name);
+}
+
+void list_tags(void)
+{
+ error(
+ "INFO/AN Number:1 Type:Integer .. Total number of alleles in called genotypes\n"
+ "INFO/AC Number:A Type:Integer .. Allele count in genotypes\n"
+ "INFO/NS Number:1 Type:Integer .. Number of samples with data\n"
+ "INFO/AC_Hom Number:A Type:Integer .. Allele counts in homozygous genotypes\n"
+ "INFO/AC_Het Number:A Type:Integer .. Allele counts in heterozygous genotypes\n"
+ "INFO/AC_Hemi Number:A Type:Integer .. Allele counts in hemizygous genotypes\n"
+ "INFO/AF Number:A Type:Float .. Allele frequency\n"
+ "INFO/MAF Number:A Type:Float .. Minor Allele frequency\n"
+ "INFO/HWE Number:A Type:Float .. HWE test (PMID:15789306)\n"
+ "INFO/ExcHet Number:A Type:Float .. Probability of excess heterozygosity\n"
+ );
+}
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+ args = (args_t*) calloc(1,sizeof(args_t));
+ args->in_hdr = in;
+ args->out_hdr = out;
+ char *samples_fname = NULL;
+ static struct option loptions[] =
+ {
+ {"list-tags",0,0,'l'},
+ {"drop-missing",0,0,'d'},
+ {"tags",1,0,'t'},
+ {"samples-file",1,0,'S'},
+ {0,0,0,0}
+ };
+ int c;
+ while ((c = getopt_long(argc, argv, "?ht:dS:l",loptions,NULL)) >= 0)
+ {
+ switch (c)
+ {
+ case 'l': list_tags(); break;
+ case 'd': args->drop_missing = 1; break;
+ case 't': args->tags |= parse_tags(args,optarg); break;
+ case 'S': samples_fname = optarg; break;
+ case 'h':
+ case '?':
+ default: error("%s", usage()); break;
+ }
+ }
+
+ if ( optind != argc ) error("%s",usage());
+
+ args->gt_id = bcf_hdr_id2int(args->in_hdr,BCF_DT_ID,"GT");
+ if ( args->gt_id<0 ) error("Error: GT field is not present\n");
+
+ if ( !args->tags )
+ for (c=0; c<=9; c++) args->tags |= 1<<c; // by default all tags will be filled
+
+ if ( samples_fname ) parse_samples(args, samples_fname);
+ init_pops(args);
+
+ if ( args->tags & SET_AN ) hdr_append(args, "##INFO=<ID=AN%s,Number=1,Type=Integer,Description=\"Total number of alleles in called genotypes%s%s\">");
+ if ( args->tags & SET_AC ) hdr_append(args, "##INFO=<ID=AC%s,Number=A,Type=Integer,Description=\"Allele count in genotypes%s%s\">");
+ if ( args->tags & SET_NS ) hdr_append(args, "##INFO=<ID=NS%s,Number=1,Type=Integer,Description=\"Number of samples with data%s%s\">");
+ if ( args->tags & SET_AC_Hom ) hdr_append(args, "##INFO=<ID=AC_Hom%s,Number=A,Type=Integer,Description=\"Allele counts in homozygous genotypes%s%s\">");
+ if ( args->tags & SET_AC_Het ) hdr_append(args, "##INFO=<ID=AC_Het%s,Number=A,Type=Integer,Description=\"Allele counts in heterozygous genotypes%s%s\">");
+ if ( args->tags & SET_AC_Hemi ) hdr_append(args, "##INFO=<ID=AC_Hemi%s,Number=A,Type=Integer,Description=\"Allele counts in hemizygous genotypes%s%s\">");
+ if ( args->tags & SET_AF ) hdr_append(args, "##INFO=<ID=AF%s,Number=A,Type=Float,Description=\"Allele frequency%s%s\">");
+ if ( args->tags & SET_MAF ) hdr_append(args, "##INFO=<ID=MAF%s,Number=A,Type=Float,Description=\"Minor Allele frequency%s%s\">");
+ if ( args->tags & SET_HWE ) hdr_append(args, "##INFO=<ID=HWE%s,Number=A,Type=Float,Description=\"HWE test%s%s (PMID:15789306)\">");
+ if ( args->tags & SET_ExcHet ) hdr_append(args, "##INFO=<ID=ExcHet%s,Number=A,Type=Float,Description=\"Probability of excess heterozygosity%s%s\">");
+
+ return 0;
+}
+
+/*
+ Wigginton 2005, PMID: 15789306
+
+ nref .. number of reference alleles
+ nalt .. number of alt alleles
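+ nhet .. number of het genotypes, assuming number of genotypes = (nref+nalt)/2
+
+ p_hwe .. output: summed probability of all het counts that are no more likely than nhet
+ p_exc_het .. output: probability of observing nhet or more hets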
+
+*/
+
+void calc_hwe(args_t *args, int nref, int nalt, int nhet, float *p_hwe, float *p_exc_het)
+{
+ int ngt = (nref+nalt) / 2;
+ int nrare = nref < nalt ? nref : nalt;
+
+ // sanity check: there is odd/even number of rare alleles iff there is odd/even number of hets
+ if ( (nrare & 1) ^ (nhet & 1) ) error("nrare/nhet should be both odd or even: nrare=%d nref=%d nalt=%d nhet=%d\n",nrare,nref,nalt,nhet);
+ if ( nrare < nhet ) error("Fewer rare alleles than hets? nrare=%d nref=%d nalt=%d nhet=%d\n",nrare,nref,nalt,nhet);
+ if ( (nref+nalt) & 1 ) error("Expected diploid genotypes: nref=%d nalt=%d\n",nref,nalt);
+
+ // initialize het probs
+ hts_expand(double,nrare+1,args->mhwe_probs,args->hwe_probs);
+ memset(args->hwe_probs, 0, sizeof(*args->hwe_probs)*(nrare+1));
+ double *probs = args->hwe_probs;
+
+ // start at midpoint
+ int mid = nrare * (nref + nalt - nrare) / (nref + nalt);
+
+ // check to ensure that midpoint and rare alleles have same parity
+ if ( (nrare & 1) ^ (mid & 1) ) mid++;
+
+ int het = mid;
+ int hom_r = (nrare - mid) / 2;
+ int hom_c = ngt - het - hom_r;
+ double sum = probs[mid] = 1.0;
+
+ for (het = mid; het > 1; het -= 2)
+ {
+ probs[het - 2] = probs[het] * het * (het - 1.0) / (4.0 * (hom_r + 1.0) * (hom_c + 1.0));
+ sum += probs[het - 2];
+
+ // 2 fewer heterozygotes for next iteration -> add one rare, one common homozygote
+ hom_r++;
+ hom_c++;
+ }
+
+ het = mid;
+ hom_r = (nrare - mid) / 2;
+ hom_c = ngt - het - hom_r;
+ for (het = mid; het <= nrare - 2; het += 2)
+ {
+ probs[het + 2] = probs[het] * 4.0 * hom_r * hom_c / ((het + 2.0) * (het + 1.0));
+ sum += probs[het + 2];
+
+ // add 2 heterozygotes for next iteration -> subtract one rare, one common homozygote
+ hom_r--;
+ hom_c--;
+ }
+
+ for (het=0; het<nrare+1; het++) probs[het] /= sum;
+
+ double prob = probs[nhet];
+ for (het = nhet + 1; het <= nrare; het++) prob += probs[het];
+ *p_exc_het = prob;
+
+ prob = 0;
+ for (het=0; het <= nrare; het++)
+ {
+ if ( probs[het] > probs[nhet]) continue;
+ prob += probs[het];
+ }
+ if ( prob > 1 ) prob = 1;
+ *p_hwe = prob;
+}
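The recurrence in calc_hwe can be tried outside bcftools. The following is a minimal standalone sketch, not the plugin's code: the name `hwe_exact` is hypothetical, the htslib buffer handling (`hts_expand`) is replaced by `calloc`, and inconsistent input is reported through a return value instead of `error()`.

```c
#include <assert.h>
#include <stdlib.h>

/* Sketch of the exact HWE test (Wigginton 2005, PMID 15789306).
   nref, nalt: allele counts; nhet: number of heterozygous genotypes.
   Returns 0 on success, -1 on inconsistent input or allocation failure. */
int hwe_exact(int nref, int nalt, int nhet, double *p_hwe, double *p_exc_het)
{
    assert( p_hwe && p_exc_het );
    if ( nref < 0 || nalt < 0 || nref + nalt == 0 ) return -1;
    if ( (nref + nalt) & 1 ) return -1;             /* diploid genotypes only */

    int ngt   = (nref + nalt) / 2;                  /* number of genotypes */
    int nrare = nref < nalt ? nref : nalt;          /* rare allele count */
    if ( nhet < 0 || nhet > nrare ) return -1;
    if ( (nrare & 1) ^ (nhet & 1) ) return -1;      /* parity must agree */

    double *probs = calloc(nrare + 1, sizeof(*probs));
    if ( !probs ) return -1;

    /* start near the expected het count, with the same parity as nrare */
    int mid = (int)((long long)nrare * (nref + nalt - nrare) / (nref + nalt));
    if ( (nrare & 1) ^ (mid & 1) ) mid++;

    int het, hom_r = (nrare - mid) / 2, hom_c = ngt - mid - hom_r;
    double sum = probs[mid] = 1.0;

    /* walk down: two fewer hets -> one more rare and one more common homozygote */
    for (het = mid; het > 1; het -= 2)
    {
        probs[het-2] = probs[het] * het * (het - 1.0) / (4.0 * (hom_r + 1.0) * (hom_c + 1.0));
        sum += probs[het-2];
        hom_r++; hom_c++;
    }

    /* walk up: two more hets -> one fewer rare and one fewer common homozygote */
    hom_r = (nrare - mid) / 2;
    hom_c = ngt - mid - hom_r;
    for (het = mid; het <= nrare - 2; het += 2)
    {
        probs[het+2] = probs[het] * 4.0 * hom_r * hom_c / ((het + 2.0) * (het + 1.0));
        sum += probs[het+2];
        hom_r--; hom_c--;
    }
    for (het = 0; het <= nrare; het++) probs[het] /= sum;

    /* excess heterozygosity: P(het count >= observed) */
    double exc = 0, hwe = 0;
    for (het = nhet; het <= nrare; het++) exc += probs[het];

    /* HWE p-value: total probability of outcomes no more likely than the observed one */
    for (het = 0; het <= nrare; het++)
        if ( probs[het] <= probs[nhet] ) hwe += probs[het];
    if ( hwe > 1 ) hwe = 1;

    *p_exc_het = exc;
    *p_hwe = hwe;
    free(probs);
    return 0;
}
```

With nref=2 and nalt=2, the het counts 0, 1 and 2 receive probabilities 1/3, 0 and 2/3, so an observed nhet=0 gives an HWE p-value of 1/3, while nhet=2 gives an excess-heterozygosity probability of 2/3.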
+
+static inline void set_counts(pop_t *pop, int is_half, int is_hom, int is_hemi, int als)
+{
+ int ial;
+ for (ial=0; als; ial++)
+ {
+ if ( als&1 )
+ {
+ if ( is_half ) pop->counts[ial].nac++;
+ else if ( !is_hom ) pop->counts[ial].nhet++;
+ else if ( !is_hemi ) pop->counts[ial].nhom += 2;
+ else pop->counts[ial].nhemi++;
+ }
+ als >>= 1;
+ }
+ pop->ns++;
+}
+static void clean_counts(pop_t *pop, int nals)
+{
+ pop->ns = 0;
+ memset(pop->counts,0,sizeof(counts_t)*nals);
+}
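The counting code relies on an allele bitmask: each distinct allele index seen in a genotype sets one bit, and a genotype is homozygous when exactly one bit is set. A small sketch of that idea follows; the helper names `allele_mask` and `is_single_allele` are hypothetical and do not exist in the plugin.

```c
#include <assert.h>

/* Set one bit per distinct allele index observed in a genotype. */
int allele_mask(const int *alleles, int n)
{
    int i, mask = 0;
    for (i = 0; i < n; i++)
    {
        /* the index must fit into the value bits of an int */
        assert( alleles[i] >= 0 && alleles[i] < (int)(8*sizeof(int)) - 1 );
        mask |= 1 << alleles[i];
    }
    return mask;
}

/* Non-zero iff exactly one bit is set, i.e. all observed alleles are identical. */
int is_single_allele(int mask)
{
    return mask && !(mask & (mask - 1));
}
```

The expression `mask & (mask - 1)` clears the lowest set bit, so it is zero only when at most one bit was set. This is also why the record-level assert on `rec->n_allele` is needed: an int-sized mask cannot represent more alleles than it has bits.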
+
+bcf1_t *process(bcf1_t *rec)
+{
+ int i,j, nsmpl = bcf_hdr_nsamples(args->in_hdr);
+
+ bcf_unpack(rec, BCF_UN_FMT);
+ bcf_fmt_t *fmt_gt = NULL;
+ for (i=0; i<rec->n_fmt; i++)
+ if ( rec->d.fmt[i].id==args->gt_id ) { fmt_gt = &rec->d.fmt[i]; break; }
+ if ( !fmt_gt ) return rec; // no GT tag
+
+ hts_expand(int32_t,rec->n_allele, args->miarr, args->iarr);
+ hts_expand(float,rec->n_allele*2, args->mfarr, args->farr);
+ for (i=0; i<args->npop; i++)
+ hts_expand(counts_t,rec->n_allele,args->pop[i].mcounts, args->pop[i].counts);
+
+ for (i=0; i<args->npop; i++)
+ clean_counts(&args->pop[i], rec->n_allele);
+
+ assert( rec->n_allele < 8*sizeof(int) );
+
+ #define BRANCH_INT(type_t,vector_end) \
+ { \
+ for (i=0; i<nsmpl; i++) \
+ { \
+ type_t *p = (type_t*) (fmt_gt->p + i*fmt_gt->size); \
+ int ial, als = 0, nals = 0, is_half, is_hom, is_hemi; \
+ for (ial=0; ial<fmt_gt->n; ial++) \
+ { \
+ if ( p[ial]==vector_end ) break; /* smaller ploidy */ \
+ if ( bcf_gt_is_missing(p[ial]) ) continue; /* missing allele */ \
+ int idx = bcf_gt_allele(p[ial]); \
+ nals++; \
+ \
+ if ( idx >= rec->n_allele ) \
+ error("Incorrect allele (\"%d\") in %s at %s:%d\n",idx,args->in_hdr->samples[i],bcf_seqname(args->in_hdr,rec),rec->pos+1); \
+ als |= (1<<idx); /* this breaks with too many alleles */ \
+ } \
+ if ( nals==0 ) continue; /* missing genotype */ \
+ is_hom = als && !(als & (als-1)); /* only one bit is set */ \
+ if ( nals!=ial ) \
+ { \
+ if ( args->drop_missing ) is_hemi = 0, is_half = 1; \
+ else is_hemi = 1, is_half = 0; \
+ } \
+ else if ( nals==1 ) is_hemi = 1, is_half = 0; \
+ else is_hemi = 0, is_half = 0; \
+ pop_t **pop = &args->smpl2pop[i*(args->npop+1)]; \
+ while ( *pop ) { set_counts(*pop,is_half,is_hom,is_hemi,als); pop++; }\
+ } \
+ }
+ switch (fmt_gt->type) {
+ case BCF_BT_INT8: BRANCH_INT(int8_t, bcf_int8_vector_end); break;
+ case BCF_BT_INT16: BRANCH_INT(int16_t, bcf_int16_vector_end); break;
+ case BCF_BT_INT32: BRANCH_INT(int32_t, bcf_int32_vector_end); break;
+ default: error("The GT type is not recognised: %d at %s:%d\n",fmt_gt->type, bcf_seqname(args->in_hdr,rec),rec->pos+1); break;
+ }
+ #undef BRANCH_INT
+
+ if ( args->tags & SET_NS )
+ {
+ for (i=0; i<args->npop; i++)
+ {
+ args->str.l = 0;
+ ksprintf(&args->str, "NS%s", args->pop[i].suffix);
+ if ( bcf_update_info_int32(args->out_hdr,rec,args->str.s,&args->pop[i].ns,1)!=0 )
+ error("Error occurred while updating %s at %s:%d\n", args->str.s,bcf_seqname(args->in_hdr,rec),rec->pos+1);
+ }
+ }
+ if ( args->tags & SET_AN )
+ {
+ for (i=0; i<args->npop; i++)
+ {
+ pop_t *pop = &args->pop[i];
+ int32_t an = 0;
+ for (j=0; j<rec->n_allele; j++)
+ an += pop->counts[j].nhet + pop->counts[j].nhom + pop->counts[j].nhemi + pop->counts[j].nac;
+
+ args->str.l = 0;
+ ksprintf(&args->str, "AN%s", args->pop[i].suffix);
+ if ( bcf_update_info_int32(args->out_hdr,rec,args->str.s,&an,1)!=0 )
+ error("Error occurred while updating %s at %s:%d\n", args->str.s,bcf_seqname(args->in_hdr,rec),rec->pos+1);
+ }
+ }
+ if ( args->tags & (SET_AF | SET_MAF) )
+ {
+ for (i=0; i<args->npop; i++)
+ {
+ int32_t an = 0;
+ if ( rec->n_allele > 1 )
+ {
+ pop_t *pop = &args->pop[i];
+ memset(args->farr, 0, sizeof(*args->farr)*(rec->n_allele-1));
+ for (j=1; j<rec->n_allele; j++)
+ args->farr[j-1] += pop->counts[j].nhet + pop->counts[j].nhom + pop->counts[j].nhemi + pop->counts[j].nac;
+ an = pop->counts[0].nhet + pop->counts[0].nhom + pop->counts[0].nhemi + pop->counts[0].nac;
+ for (j=1; j<rec->n_allele; j++) an += args->farr[j-1];
+ if ( !an ) continue;
+ for (j=1; j<rec->n_allele; j++) args->farr[j-1] /= an;
+ }
+ if ( args->tags & SET_AF )
+ {
+ args->str.l = 0;
+ ksprintf(&args->str, "AF%s", args->pop[i].suffix);
+ if ( bcf_update_info_float(args->out_hdr,rec,args->str.s,args->farr,rec->n_allele-1)!=0 )
+ error("Error occurred while updating %s at %s:%d\n", args->str.s,bcf_seqname(args->in_hdr,rec),rec->pos+1);
+ }
+ if ( args->tags & SET_MAF )
+ {
+ if ( !an ) continue;
+ for (j=1; j<rec->n_allele; j++)
+ if ( args->farr[j-1] > 0.5 ) args->farr[j-1] = 1 - args->farr[j-1]; // todo: this is incorrect for multiallelic sites
+ args->str.l = 0;
+ ksprintf(&args->str, "MAF%s", args->pop[i].suffix);
+ if ( bcf_update_info_float(args->out_hdr,rec,args->str.s,args->farr,rec->n_allele-1)!=0 )
+ error("Error occurred while updating %s at %s:%d\n", args->str.s,bcf_seqname(args->in_hdr,rec),rec->pos+1);
+ }
+ }
+ }
+ if ( args->tags & SET_AC )
+ {
+ for (i=0; i<args->npop; i++)
+ {
+ if ( rec->n_allele > 1 )
+ {
+ pop_t *pop = &args->pop[i];
+ memset(args->iarr, 0, sizeof(*args->iarr)*(rec->n_allele-1));
+ for (j=1; j<rec->n_allele; j++)
+ args->iarr[j-1] += pop->counts[j].nhet + pop->counts[j].nhom + pop->counts[j].nhemi + pop->counts[j].nac;
+ }
+ args->str.l = 0;
+ ksprintf(&args->str, "AC%s", args->pop[i].suffix);
+ if ( bcf_update_info_int32(args->out_hdr,rec,args->str.s,args->iarr,rec->n_allele-1)!=0 )
+ error("Error occurred while updating %s at %s:%d\n", args->str.s,bcf_seqname(args->in_hdr,rec),rec->pos+1);
+ }
+ }
+ if ( args->tags & SET_AC_Het )
+ {
+ for (i=0; i<args->npop; i++)
+ {
+ if ( rec->n_allele > 1 )
+ {
+ pop_t *pop = &args->pop[i];
+ memset(args->iarr, 0, sizeof(*args->iarr)*(rec->n_allele-1));
+ for (j=1; j<rec->n_allele; j++)
+ args->iarr[j-1] += pop->counts[j].nhet;
+ }
+ args->str.l = 0;
+ ksprintf(&args->str, "AC_Het%s", args->pop[i].suffix);
+ if ( bcf_update_info_int32(args->out_hdr,rec,args->str.s,args->iarr,rec->n_allele-1)!=0 )
+ error("Error occurred while updating %s at %s:%d\n", args->str.s,bcf_seqname(args->in_hdr,rec),rec->pos+1);
+ }
+ }
+ if ( args->tags & SET_AC_Hom )
+ {
+ for (i=0; i<args->npop; i++)
+ {
+ if ( rec->n_allele > 1 )
+ {
+ pop_t *pop = &args->pop[i];
+ memset(args->iarr, 0, sizeof(*args->iarr)*(rec->n_allele-1));
+ for (j=1; j<rec->n_allele; j++)
+ args->iarr[j-1] += pop->counts[j].nhom;
+ }
+ args->str.l = 0;
+ ksprintf(&args->str, "AC_Hom%s", args->pop[i].suffix);
+ if ( bcf_update_info_int32(args->out_hdr,rec,args->str.s,args->iarr,rec->n_allele-1)!=0 )
+ error("Error occurred while updating %s at %s:%d\n", args->str.s,bcf_seqname(args->in_hdr,rec),rec->pos+1);
+ }
+ }
+ if ( args->tags & SET_AC_Hemi && rec->n_allele > 1 )
+ {
+ for (i=0; i<args->npop; i++)
+ {
+ if ( rec->n_allele > 1 )
+ {
+ pop_t *pop = &args->pop[i];
+ memset(args->iarr, 0, sizeof(*args->iarr)*(rec->n_allele-1));
+ for (j=1; j<rec->n_allele; j++)
+ args->iarr[j-1] += pop->counts[j].nhemi;
+ }
+ args->str.l = 0;
+ ksprintf(&args->str, "AC_Hemi%s", args->pop[i].suffix);
+ if ( bcf_update_info_int32(args->out_hdr,rec,args->str.s,args->iarr,rec->n_allele-1)!=0 )
+ error("Error occurred while updating %s at %s:%d\n", args->str.s,bcf_seqname(args->in_hdr,rec),rec->pos+1);
+ }
+ }
+ if ( args->tags & (SET_HWE|SET_ExcHet) )
+ {
+ for (i=0; i<args->npop; i++)
+ {
+ float *fhwe = args->farr;
+ float *fexc_het = args->farr + rec->n_allele;
+ if ( rec->n_allele > 1 )
+ {
+ pop_t *pop = &args->pop[i];
+ memset(args->farr, 0, sizeof(*args->farr)*(2*rec->n_allele));
+ int nref_tot = pop->counts[0].nhom;
+ for (j=0; j<rec->n_allele; j++) nref_tot += pop->counts[j].nhet; // NB this neglects multiallelic genotypes
+ for (j=1; j<rec->n_allele; j++)
+ {
+ int nref = nref_tot - pop->counts[j].nhet;
+ int nalt = pop->counts[j].nhet + pop->counts[j].nhom;
+ int nhet = pop->counts[j].nhet;
+ if ( nref>0 && nalt>0 )
+ calc_hwe(args, nref, nalt, nhet, &fhwe[j-1], &fexc_het[j-1]);
+ else
+ fhwe[j-1] = fexc_het[j-1] = 1;
+ }
+ }
+ if ( args->tags & SET_HWE )
+ {
+ args->str.l = 0;
+ ksprintf(&args->str, "HWE%s", args->pop[i].suffix);
+ if ( bcf_update_info_float(args->out_hdr,rec,args->str.s,fhwe,rec->n_allele-1)!=0 )
+ error("Error occurred while updating %s at %s:%d\n", args->str.s,bcf_seqname(args->in_hdr,rec),rec->pos+1);
+ }
+ if ( args->tags & SET_ExcHet )
+ {
+ args->str.l = 0;
+ ksprintf(&args->str, "ExcHet%s", args->pop[i].suffix);
+ if ( bcf_update_info_float(args->out_hdr,rec,args->str.s,fexc_het,rec->n_allele-1)!=0 )
+ error("Error occurred while updating %s at %s:%d\n", args->str.s,bcf_seqname(args->in_hdr,rec),rec->pos+1);
+ }
+ }
+ }
+
+ return rec;
+}
+
+void destroy(void)
+{
+ int i;
+ for (i=0; i<args->npop; i++)
+ {
+ free(args->pop[i].name);
+ free(args->pop[i].suffix);
+ free(args->pop[i].smpl);
+ free(args->pop[i].counts);
+ }
+ free(args->str.s);
+ free(args->pop);
+ free(args->smpl2pop);
+ free(args->iarr);
+ free(args->farr);
+ free(args->hwe_probs);
+ free(args);
+}
+
+
+
--- /dev/null
+/*
+ Copyright (C) 2014 Genome Research Ltd.
+
+ Author: Petr Danecek <pd3@sanger.ac.uk>
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in
+ all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ THE SOFTWARE.
+*/
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <strings.h>
+#include <getopt.h>
+#include <stdarg.h>
+#include <ctype.h>
+#include <htslib/vcf.h>
+#include <htslib/kseq.h>
+#include "bcftools.h"
+#include "ploidy.h"
+
+static bcf_hdr_t *in_hdr = NULL, *out_hdr = NULL;
+static int *sample2sex = NULL;
+static int n_sample = 0, nsex = 0, *sex2ploidy = NULL;
+static int32_t ngt_arr = 0, *gt_arr = NULL, *gt_arr2 = NULL, ngt_arr2 = 0;
+static ploidy_t *ploidy = NULL;
+static int force_ploidy = -1;
+
+const char *about(void)
+{
+ return "Fix ploidy.\n";
+}
+
+const char *usage(void)
+{
+ return
+ "\n"
+ "About: Fix ploidy\n"
+ "Usage: bcftools +fixploidy [General Options] -- [Plugin Options]\n"
+ "Options:\n"
+ " run \"bcftools plugin\" for a list of common options\n"
+ "\n"
+ "Plugin options:\n"
+ " -d, --default-ploidy <int> default ploidy for regions unlisted in -p [2]\n"
+ " -f, --force-ploidy <int> ignore -p, set the same ploidy for all genotypes\n"
+ " -p, --ploidy <file> space/tab-delimited list of CHROM,FROM,TO,SEX,PLOIDY\n"
+ " -s, --sex <file> list of samples, \"NAME SEX\"\n"
+ " -t, --tags <list> VCF tags to fix [GT]\n"
+ "\n"
+ "Example:\n"
+ " # Default ploidy, if -p not given. Unlisted regions have ploidy 2\n"
+ " X 1 60000 M 1\n"
+ " X 2699521 154931043 M 1\n"
+ " Y 1 59373566 M 1\n"
+ " Y 1 59373566 F 0\n"
+ " MT 1 16569 M 1\n"
+ " MT 1 16569 F 1\n"
+ " \n"
+ " # Example of -s file, sex of unlisted samples is \"F\"\n"
+ " sampleName1 M\n"
+ " \n"
+ " bcftools +fixploidy in.vcf -- -s samples.txt\n"
+ "\n";
+}
+
+void set_samples(char *fname, bcf_hdr_t *hdr, ploidy_t *ploidy, int *sample2sex)
+{
+ kstring_t tmp = {0,0,0};
+
+ htsFile *fp = hts_open(fname, "r");
+ if ( !fp ) error("Could not read: %s\n", fname);
+ while ( hts_getline(fp, KS_SEP_LINE, &tmp) > 0 )
+ {
+ char *ss = tmp.s;
+ while ( *ss && isspace(*ss) ) ss++;
+ if ( !*ss ) error("Could not parse: %s\n", tmp.s);
+ if ( *ss=='#' ) continue;
+ char *se = ss;
+ while ( *se && !isspace(*se) ) se++;
+ char x = *se; *se = 0;
+
+ int ismpl = bcf_hdr_id2int(hdr, BCF_DT_SAMPLE, ss);
+ if ( ismpl < 0 ) { fprintf(stderr,"Warning: No such sample in the VCF: %s\n",ss); continue; }
+
+ *se = x;
+ ss = se+1;
+ while ( *ss && isspace(*ss) ) ss++;
+ if ( !*ss ) error("Could not parse: %s\n", tmp.s);
+ se = ss;
+ while ( *se && !isspace(*se) ) se++;
+ if ( se==ss ) error("Could not parse: %s\n", tmp.s);
+
+ sample2sex[ismpl] = ploidy_add_sex(ploidy, ss);
+ }
+ if ( hts_close(fp) ) error("Close failed: %s\n", fname);
+ free(tmp.s);
+}
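The "NAME SEX" parsing in set_samples can be illustrated in isolation. The sketch below is an assumption-laden variant rather than the plugin's code: the function name `split_name_sex` is hypothetical, it returns status codes instead of calling `error()`, it terminates the SEX field at the first trailing whitespace, and it rejects a line with a single field before stepping past the terminator.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Split "NAME SEX" in place.
   Returns 0 on success, 1 for a comment line, -1 on a parse error. */
int split_name_sex(char *line, char **name, char **sex)
{
    assert( line && name && sex );

    char *ss = line;
    while ( *ss && isspace((unsigned char)*ss) ) ss++;   /* skip leading blanks */
    if ( !*ss ) return -1;
    if ( *ss=='#' ) return 1;

    char *se = ss;
    while ( *se && !isspace((unsigned char)*se) ) se++;  /* end of NAME */
    if ( !*se ) return -1;                               /* only one field */
    *se = 0;
    *name = ss;

    ss = se + 1;
    while ( *ss && isspace((unsigned char)*ss) ) ss++;   /* skip separator */
    if ( !*ss ) return -1;
    se = ss;
    while ( *se && !isspace((unsigned char)*se) ) se++;  /* end of SEX */
    *se = 0;
    *sex = ss;
    return 0;
}
```

The cast to `unsigned char` before `isspace` avoids undefined behaviour for bytes with the high bit set on platforms where `char` is signed.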
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+ int c, default_ploidy = 2;
+ char *tags_str = "GT";
+ char *ploidy_fname = NULL, *sex_fname = NULL;
+
+ static struct option loptions[] =
+ {
+ {"default-ploidy",1,0,'d'},
+ {"force-ploidy",1,0,'f'},
+ {"ploidy",1,0,'p'},
+ {"sex",1,0,'s'},
+ {"tags",1,0,'t'},
+ {0,0,0,0}
+ };
+ char *tmp;
+ while ((c = getopt_long(argc, argv, "?ht:s:p:d:f:",loptions,NULL)) >= 0)
+ {
+ switch (c)
+ {
+ case 'd':
+ default_ploidy = strtod(optarg,&tmp);
+ if ( *tmp ) error("Could not parse: -d %s\n", optarg);
+ break;
+ case 'f':
+ force_ploidy = strtod(optarg,&tmp);
+ if ( *tmp ) error("Could not parse: -f %s\n", optarg);
+ break;
+ case 'p': ploidy_fname = optarg; break;
+ case 's': sex_fname = optarg; break;
+ case 't': tags_str = optarg; break;
+ case 'h':
+ case '?':
+ default: error("%s", usage()); break;
+ }
+ }
+ if ( strcasecmp("GT",tags_str) ) error("Only -t GT is currently supported, sorry\n");
+
+ n_sample = bcf_hdr_nsamples(in);
+ sample2sex = (int*) calloc(n_sample,sizeof(int));
+ in_hdr = in;
+ out_hdr = out;
+
+ if ( ploidy_fname )
+ ploidy = ploidy_init(ploidy_fname, default_ploidy);
+ else if ( force_ploidy==-1 )
+ {
+ ploidy = ploidy_init_string(
+ "X 1 60000 M 1\n"
+ "X 2699521 154931043 M 1\n"
+ "Y 1 59373566 M 1\n"
+ "Y 1 59373566 F 0\n"
+ "MT 1 16569 M 1\n"
+ "MT 1 16569 F 1\n", 2);
+ }
+ if ( force_ploidy==-1 )
+ {
+ if ( !ploidy ) return -1;
+
+ // add default sex in case it was not included
+ int i, dflt_sex_id = ploidy_add_sex(ploidy, "F");
+ for (i=0; i<n_sample; i++) sample2sex[i] = dflt_sex_id; // by default all are F
+ if ( sex_fname ) set_samples(sex_fname, in, ploidy, sample2sex);
+ nsex = ploidy_nsex(ploidy);
+ sex2ploidy = (int*) malloc(sizeof(int)*nsex);
+ }
+
+ return 0;
+}
+
+
+bcf1_t *process(bcf1_t *rec)
+{
+ int i,j, max_ploidy;
+
+ int ngts = bcf_get_genotypes(in_hdr, rec, &gt_arr, &ngt_arr);
+ if ( ngts<0 )
+ return rec; // GT field not present
+
+ if ( ngts % n_sample )
+ error("Error at %s:%d: wrong number of GT fields\n",bcf_seqname(in_hdr,rec),rec->pos+1);
+
+ if ( force_ploidy==-1 )
+ ploidy_query(ploidy, (char*)bcf_seqname(in_hdr,rec), rec->pos, sex2ploidy,NULL,&max_ploidy);
+ else
+ max_ploidy = force_ploidy;
+
+ ngts /= n_sample;
+ if ( ngts < max_ploidy )
+ {
+ hts_expand(int32_t,max_ploidy*n_sample,ngt_arr2,gt_arr2);
+ for (i=0; i<n_sample; i++)
+ {
+ int ploidy = force_ploidy!=-1 ? force_ploidy : sex2ploidy[ sample2sex[i] ];
+ int32_t *src = &gt_arr[i*ngts];
+ int32_t *dst = &gt_arr2[i*max_ploidy];
+ j = 0;
+ if ( !ploidy ) { dst[j] = bcf_gt_missing; j++; }
+ else
+ while ( j<ngts && j<ploidy && src[j]!=bcf_int32_vector_end ) { dst[j] = src[j]; j++; }
+ assert( j );
+ while ( j<ploidy ) { dst[j] = dst[j-1]; j++; } // expand "." to "./." and "0" to "0/0"
+ while ( j<max_ploidy ) { dst[j] = bcf_int32_vector_end; j++; }
+ }
+ if ( bcf_update_genotypes(out_hdr,rec,gt_arr2,n_sample*max_ploidy) )
+ error("Could not update GT field at %s:%d\n", bcf_seqname(in_hdr,rec),rec->pos+1);
+ }
+ else if ( ngts!=1 || max_ploidy!=1 )
+ {
+ for (i=0; i<n_sample; i++)
+ {
+ int ploidy = force_ploidy!=-1 ? force_ploidy : sex2ploidy[ sample2sex[i] ];
+ int32_t *gts = &gt_arr[i*ngts];
+ j = 0;
+ if ( !ploidy ) { gts[j] = bcf_gt_missing; j++; }
+ else
+ while ( j<ngts && j<ploidy && gts[j]!=bcf_int32_vector_end ) j++;
+ assert( j );
+ while ( j<ploidy ) { gts[j] = gts[j-1]; j++; } // expand "." to "./." and "0" to "0/0"
+ while ( j<ngts ) { gts[j] = bcf_int32_vector_end; j++; }
+ }
+ if ( bcf_update_genotypes(out_hdr,rec,gt_arr,n_sample*ngts) )
+ error("Could not update GT field at %s:%d\n", bcf_seqname(in_hdr,rec),rec->pos+1);
+ }
+ return rec;
+}
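The in-place branch of process() can be reduced to a per-sample routine. The sketch below assumes stand-in constants `GT_MISSING` and `VECTOR_END` in place of htslib's `bcf_gt_missing` and `bcf_int32_vector_end` so that it compiles without htslib; the function name `fix_sample_ploidy` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for bcf_gt_missing and bcf_int32_vector_end; in real code
   these come from <htslib/vcf.h>. */
#define GT_MISSING  0
#define VECTOR_END  (INT32_MIN + 1)

/* Rewrite one sample's genotype vector in place so that it carries exactly
   `ploidy` alleles. `n` is the number of slots per sample, with
   n >= 1 and 0 <= ploidy <= n. */
void fix_sample_ploidy(int32_t *gts, int n, int ploidy)
{
    int j = 0;
    assert( n >= 1 && ploidy >= 0 && ploidy <= n );

    if ( !ploidy ) gts[j++] = GT_MISSING;            /* ploidy 0 is written as "." */
    else
        while ( j < n && j < ploidy && gts[j] != VECTOR_END ) j++;   /* keep existing alleles */

    assert( j );                                     /* at least one allele slot must be filled */
    while ( j < ploidy ) { gts[j] = gts[j-1]; j++; } /* expand "0" to "0/0" */
    while ( j < n ) gts[j++] = VECTOR_END;           /* truncate "0/1" to "0" */
}
```

Alleles are stored in the encoded form produced by `bcf_gt_unphased()`, so allele 0 appears as the value 2 and allele 1 as 4 in the arrays passed to this routine.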
+
+
+void destroy(void)
+{
+ free(gt_arr);
+ free(gt_arr2);
+ free(sample2sex);
+ free(sex2ploidy);
+ if ( ploidy ) ploidy_destroy(ploidy);
+}
+
+
--- /dev/null
+#include "bcftools.pysam.h"
+
+/*
+ Copyright (C) 2014 Genome Research Ltd.
+
+ Author: Petr Danecek <pd3@sanger.ac.uk>
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in
+ all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ THE SOFTWARE.
+*/
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <strings.h>
+#include <getopt.h>
+#include <stdarg.h>
+#include <ctype.h>
+#include <htslib/vcf.h>
+#include <htslib/kseq.h>
+#include "bcftools.h"
+#include "ploidy.h"
+
+static bcf_hdr_t *in_hdr = NULL, *out_hdr = NULL;
+static int *sample2sex = NULL;
+static int n_sample = 0, nsex = 0, *sex2ploidy = NULL;
+static int32_t ngt_arr = 0, *gt_arr = NULL, *gt_arr2 = NULL, ngt_arr2 = 0;
+static ploidy_t *ploidy = NULL;
+static int force_ploidy = -1;
+
+const char *about(void)
+{
+ return "Fix ploidy.\n";
+}
+
+const char *usage(void)
+{
+ return
+ "\n"
+ "About: Fix ploidy\n"
+ "Usage: bcftools +fixploidy [General Options] -- [Plugin Options]\n"
+ "Options:\n"
+ " run \"bcftools plugin\" for a list of common options\n"
+ "\n"
+ "Plugin options:\n"
+ " -d, --default-ploidy <int> default ploidy for regions unlisted in -p [2]\n"
+ " -f, --force-ploidy <int> ignore -p, set the same ploidy for all genotypes\n"
+ " -p, --ploidy <file> space/tab-delimited list of CHROM,FROM,TO,SEX,PLOIDY\n"
+ " -s, --sex <file> list of samples, \"NAME SEX\"\n"
+ " -t, --tags <list> VCF tags to fix [GT]\n"
+ "\n"
+ "Example:\n"
+ " # Default ploidy, if -p not given. Unlisted regions have ploidy 2\n"
+ " X 1 60000 M 1\n"
+ " X 2699521 154931043 M 1\n"
+ " Y 1 59373566 M 1\n"
+ " Y 1 59373566 F 0\n"
+ " MT 1 16569 M 1\n"
+ " MT 1 16569 F 1\n"
+ " \n"
+ " # Example of -s file, sex of unlisted samples is \"F\"\n"
+ " sampleName1 M\n"
+ " \n"
+ " bcftools +fixploidy in.vcf -- -s samples.txt\n"
+ "\n";
+}
+
+void set_samples(char *fname, bcf_hdr_t *hdr, ploidy_t *ploidy, int *sample2sex)
+{
+ kstring_t tmp = {0,0,0};
+
+ htsFile *fp = hts_open(fname, "r");
+ if ( !fp ) error("Could not read: %s\n", fname);
+ while ( hts_getline(fp, KS_SEP_LINE, &tmp) > 0 )
+ {
+ char *ss = tmp.s;
+ while ( *ss && isspace(*ss) ) ss++;
+ if ( !*ss ) error("Could not parse: %s\n", tmp.s);
+ if ( *ss=='#' ) continue;
+ char *se = ss;
+ while ( *se && !isspace(*se) ) se++;
+ char x = *se; *se = 0;
+
+ int ismpl = bcf_hdr_id2int(hdr, BCF_DT_SAMPLE, ss);
+ if ( ismpl < 0 ) { fprintf(bcftools_stderr,"Warning: No such sample in the VCF: %s\n",ss); continue; }
+
+ *se = x;
+ ss = se+1;
+ while ( *ss && isspace(*ss) ) ss++;
+ if ( !*ss ) error("Could not parse: %s\n", tmp.s);
+ se = ss;
+ while ( *se && !isspace(*se) ) se++;
+ if ( se==ss ) error("Could not parse: %s\n", tmp.s);
+
+ sample2sex[ismpl] = ploidy_add_sex(ploidy, ss);
+ }
+ if ( hts_close(fp) ) error("Close failed: %s\n", fname);
+ free(tmp.s);
+}
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+ int c, default_ploidy = 2;
+ char *tags_str = "GT";
+ char *ploidy_fname = NULL, *sex_fname = NULL;
+
+ static struct option loptions[] =
+ {
+ {"default-ploidy",1,0,'d'},
+ {"force-ploidy",1,0,'f'},
+ {"ploidy",1,0,'p'},
+ {"sex",1,0,'s'},
+ {"tags",1,0,'t'},
+ {0,0,0,0}
+ };
+ char *tmp;
+ while ((c = getopt_long(argc, argv, "?ht:s:p:d:f:",loptions,NULL)) >= 0)
+ {
+ switch (c)
+ {
+ case 'd':
+ default_ploidy = strtod(optarg,&tmp);
+ if ( *tmp ) error("Could not parse: -d %s\n", optarg);
+ break;
+ case 'f':
+ force_ploidy = strtod(optarg,&tmp);
+ if ( *tmp ) error("Could not parse: -f %s\n", optarg);
+ break;
+ case 'p': ploidy_fname = optarg; break;
+ case 's': sex_fname = optarg; break;
+ case 't': tags_str = optarg; break;
+ case 'h':
+ case '?':
+ default: error("%s", usage()); break;
+ }
+ }
+ if ( strcasecmp("GT",tags_str) ) error("Only -t GT is currently supported, sorry\n");
+
+ n_sample = bcf_hdr_nsamples(in);
+ sample2sex = (int*) calloc(n_sample,sizeof(int));
+ in_hdr = in;
+ out_hdr = out;
+
+ if ( ploidy_fname )
+ ploidy = ploidy_init(ploidy_fname, default_ploidy);
+ else if ( force_ploidy==-1 )
+ {
+ ploidy = ploidy_init_string(
+ "X 1 60000 M 1\n"
+ "X 2699521 154931043 M 1\n"
+ "Y 1 59373566 M 1\n"
+ "Y 1 59373566 F 0\n"
+ "MT 1 16569 M 1\n"
+ "MT 1 16569 F 1\n", 2);
+ }
+ if ( force_ploidy==-1 )
+ {
+ if ( !ploidy ) return -1;
+
+ // add default sex in case it was not included
+ int i, dflt_sex_id = ploidy_add_sex(ploidy, "F");
+ for (i=0; i<n_sample; i++) sample2sex[i] = dflt_sex_id; // by default all are F
+ if ( sex_fname ) set_samples(sex_fname, in, ploidy, sample2sex);
+ nsex = ploidy_nsex(ploidy);
+ sex2ploidy = (int*) malloc(sizeof(int)*nsex);
+ }
+
+ return 0;
+}
+
+
+bcf1_t *process(bcf1_t *rec)
+{
+ int i,j, max_ploidy;
+
+ int ngts = bcf_get_genotypes(in_hdr, rec, &gt_arr, &ngt_arr);
+ if ( ngts<0 )
+ return rec; // GT field not present
+
+ if ( ngts % n_sample )
+ error("Error at %s:%d: wrong number of GT fields\n",bcf_seqname(in_hdr,rec),rec->pos+1);
+
+ if ( force_ploidy==-1 )
+ ploidy_query(ploidy, (char*)bcf_seqname(in_hdr,rec), rec->pos, sex2ploidy,NULL,&max_ploidy);
+ else
+ max_ploidy = force_ploidy;
+
+ ngts /= n_sample;
+ if ( ngts < max_ploidy )
+ {
+ hts_expand(int32_t,max_ploidy*n_sample,ngt_arr2,gt_arr2);
+ for (i=0; i<n_sample; i++)
+ {
+ int ploidy = force_ploidy!=-1 ? force_ploidy : sex2ploidy[ sample2sex[i] ];
+ int32_t *src = &gt_arr[i*ngts];
+ int32_t *dst = &gt_arr2[i*max_ploidy];
+ j = 0;
+ if ( !ploidy ) { dst[j] = bcf_gt_missing; j++; }
+ else
+ while ( j<ngts && j<ploidy && src[j]!=bcf_int32_vector_end ) { dst[j] = src[j]; j++; }
+ assert( j );
+ while ( j<ploidy ) { dst[j] = dst[j-1]; j++; } // expand "." to "./." and "0" to "0/0"
+ while ( j<max_ploidy ) { dst[j] = bcf_int32_vector_end; j++; }
+ }
+ if ( bcf_update_genotypes(out_hdr,rec,gt_arr2,n_sample*max_ploidy) )
+ error("Could not update GT field at %s:%d\n", bcf_seqname(in_hdr,rec),rec->pos+1);
+ }
+ else if ( ngts!=1 || max_ploidy!=1 )
+ {
+ for (i=0; i<n_sample; i++)
+ {
+ int ploidy = force_ploidy!=-1 ? force_ploidy : sex2ploidy[ sample2sex[i] ];
+ int32_t *gts = &gt_arr[i*ngts];
+ j = 0;
+ if ( !ploidy ) { gts[j] = bcf_gt_missing; j++; }
+ else
+ while ( j<ngts && j<ploidy && gts[j]!=bcf_int32_vector_end ) j++;
+ assert( j );
+ while ( j<ploidy ) { gts[j] = gts[j-1]; j++; } // expand "." to "./." and "0" to "0/0"
+ while ( j<ngts ) { gts[j] = bcf_int32_vector_end; j++; }
+ }
+ if ( bcf_update_genotypes(out_hdr,rec,gt_arr,n_sample*ngts) )
+ error("Could not update GT field at %s:%d\n", bcf_seqname(in_hdr,rec),rec->pos+1);
+ }
+ return rec;
+}
+
+
+void destroy(void)
+{
+ free(gt_arr);
+ free(gt_arr2);
+ free(sample2sex);
+ free(sex2ploidy);
+ if ( ploidy ) ploidy_destroy(ploidy);
+}
+
+
--- /dev/null
+/* The MIT License
+
+ Copyright (c) 2016 Genome Research Ltd.
+
+ Author: Petr Danecek <pd3@sanger.ac.uk>
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in
+ all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ THE SOFTWARE.
+
+ */
+
+/*
+ Illumina TOP/BOT strand convention causes a lot of pain. This tool
+ attempts to determine the strand convention and convert it to the
+ forward reference strand.
+
+ On TOP strand, we can encounter
+ unambiguous SNPs:
+ A/G
+ A/C
+ ambiguous (context-dependent) SNPs:
+ C/G
+ A/T
+
+ On BOT strand:
+ unambiguous SNPs:
+ T/G
+ T/C
+ ambiguous (context-dependent) SNPs:
+ T/A
+ G/C
+
+
+ For unambiguous pairs (A/C, A/G, T/C, T/G), the knowledge of reference base
+ at the SNP position is enough to determine the strand:
+
+ TOP REF -> ALLELES TOP_ON_STRAND
+ -------------------------------------------
+ A/C A or C A/C 1
+ " T or G T/G -1
+ A/G A or G A/G 1
+ " T or C T/C -1
+
+
+ For ambiguous pairs (A/T, C/G), a sequence walking must be performed
+ (simultaneously upstream and downstream) until the first unambiguous pair
+ is encountered. The 5' base determines the strand:
+
+ TOP 5'REF_BASE -> ALLELES TOP_ON_STRAND
+ ------------------------------------------------
+ A/T A or T A/T 1
+ " C or G T/A -1
+ C/G A or T C/G 1
+ " C or G G/C -1
+
+ */
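The first table above (unambiguous pairs) translates into a few comparisons. This is a hedged sketch of that rule only, with hypothetical names `complement_nt` and `top_on_strand`; it does not implement the sequence walking needed for A/T and C/G, and it is not the plugin's own routine.

```c
#include <assert.h>
#include <ctype.h>

/* Complement of a nucleotide, or 'N' for anything outside ACGT. */
char complement_nt(char nt)
{
    switch ( toupper((unsigned char)nt) )
    {
        case 'A': return 'T';
        case 'C': return 'G';
        case 'G': return 'C';
        case 'T': return 'A';
        default:  return 'N';
    }
}

/* For TOP alleles a/b and the reference base at the SNP position:
   returns  1 if the alleles are on the forward reference strand,
           -1 if they are on the reverse strand,
            0 if the pair is ambiguous (A/T, C/G), not a SNP, or
              cannot be reconciled with the reference base. */
int top_on_strand(char a, char b, char ref)
{
    a   = (char)toupper((unsigned char)a);
    b   = (char)toupper((unsigned char)b);
    ref = (char)toupper((unsigned char)ref);

    char ca = complement_nt(a), cb = complement_nt(b);
    if ( ca=='N' || cb=='N' ) return 0;      /* non-ACGT allele */
    if ( a==b || a==cb ) return 0;           /* not a SNP, or A/T or C/G */

    if ( ref==a || ref==b ) return 1;        /* e.g. A/C with REF A or C */
    if ( ref==ca || ref==cb ) return -1;     /* e.g. A/C with REF T or G */
    return 0;
}
```

For an unambiguous pair the allele set and its complement set are disjoint, so the reference base can match only one of them, and a single lookup decides the strand.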
+
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <strings.h>
+#include <getopt.h>
+#include <math.h>
+#include <htslib/hts.h>
+#include <htslib/vcf.h>
+#include <htslib/kstring.h>
+#include <htslib/kseq.h>
+#include <htslib/kfunc.h>
+#include <htslib/faidx.h>
+#include <htslib/khash.h>
+#include <htslib/synced_bcf_reader.h>
+#include "bcftools.h"
+
+#define MODE_STATS 1
+#define MODE_TOP2FWD 2
+#define MODE_FLIP2FWD 3
+#define MODE_USE_ID 4
+
+typedef struct
+{
+ uint32_t pos;
+ uint8_t ref;
+}
+marker_t;
+
+KHASH_MAP_INIT_INT(i2m, marker_t)
+typedef khash_t(i2m) i2m_t;
+
+typedef struct
+{
+ char *dbsnp_fname;
+ int mode, discard;
+ bcf_hdr_t *hdr;
+ faidx_t *fai;
+ int rid, skip_rid;
+ i2m_t *i2m;
+ int32_t *gts, ngts, pos;
+ uint32_t nsite,nok,nflip,nunresolved,nswap,nflip_swap,nonSNP,nonACGT,nonbiallelic;
+ uint32_t count[4][4], npos_err, unsorted;
+}
+args_t;
+
+args_t args;
+
+const char *about(void)
+{
+ return "Fix reference strand orientation, e.g. from Illumina/TOP to fwd.\n";
+}
+
+const char *usage(void)
+{
+ return
+ "\n"
+ "About: This tool helps to determine and fix strand orientation.\n"
+ " Currently the following modes are recognised:\n"
+ " flip .. flips non-ambiguous SNPs and ignores the rest\n"
+ " id .. swap REF/ALT and GTs using the ID column to determine the REF allele\n"
+ " stats .. collect and print stats\n"
+ " top .. converts from Illumina TOP strand to fwd\n"
+ "\n"
+ " WARNING: Do not use the program blindly, make an effort to\n"
+ " understand what strand convention your data uses! Make sure\n"
+ " the reason for mismatching REF alleles is not a different\n"
+ " reference build!!\n"
+ "\n"
+ "Usage: bcftools +fixref [General Options] -- [Plugin Options]\n"
+ "Options:\n"
+ " run \"bcftools plugin\" for a list of common options\n"
+ "\n"
+ "Plugin options:\n"
+ " -d, --discard Discard sites which could not be resolved\n"
+ " -f, --fasta-ref <file.fa> Reference sequence\n"
+ " -i, --use-id <file.vcf> Swap REF/ALT using the ID column to determine the REF allele, implies -m id.\n"
+ " Download the dbSNP file from\n"
+ " https://www.ncbi.nlm.nih.gov/variation/docs/human_variation_vcf\n"
+ " -m, --mode <string> Collect stats (\"stats\") or convert (\"flip\", \"id\", \"top\") [stats]\n"
+ "\n"
+ "Examples:\n"
+ " # run stats\n"
+ " bcftools +fixref file.bcf -- -f ref.fa\n"
+ "\n"
+ " # convert from TOP to fwd\n"
+ " bcftools +fixref file.bcf -Ob -o out.bcf -- -f ref.fa -m top\n"
+ "\n"
+ " # match the REF/ALT alleles based on the ID column, discard unknown sites\n"
+ " bcftools +fixref file.bcf -Ob -o out.bcf -- -d -f ref.fa -i All_20151104.vcf.gz\n"
+ "\n"
+ " # assuming the reference build is correct, just flip to fwd, discarding the rest\n"
+ " bcftools +fixref file.bcf -Ob -o out.bcf -- -d -f ref.fa -m flip\n"
+ "\n";
+}
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+ memset(&args,0,sizeof(args_t));
+ args.skip_rid = -1;
+ args.hdr = in;
+ args.mode = MODE_STATS;
+ char *ref_fname = NULL;
+ static struct option loptions[] =
+ {
+ {"mode",required_argument,NULL,'m'},
+ {"discard",no_argument,NULL,'d'},
+ {"fasta-ref",required_argument,NULL,'f'},
+ {"use-id",required_argument,NULL,'i'},
+ {NULL,0,NULL,0}
+ };
+ int c;
+ while ((c = getopt_long(argc, argv, "?hf:m:di:",loptions,NULL)) >= 0)
+ {
+ switch (c)
+ {
+ case 'm':
+ if ( !strcasecmp(optarg,"top") ) args.mode = MODE_TOP2FWD;
+ else if ( !strcasecmp(optarg,"flip") ) args.mode = MODE_FLIP2FWD;
+ else if ( !strcasecmp(optarg,"id") ) args.mode = MODE_USE_ID;
+ else if ( !strcasecmp(optarg,"stats") ) args.mode = MODE_STATS;
+ else error("The source strand convention not recognised: %s\n", optarg);
+ break;
+ case 'i': args.dbsnp_fname = optarg; args.mode = MODE_USE_ID; break;
+ case 'd': args.discard = 1; break;
+ case 'f': ref_fname = optarg; break;
+ case 'h':
+ case '?':
+ default: error("%s", usage()); break;
+ }
+ }
+ if ( !ref_fname ) error("Expected the -f option\n");
+ args.fai = fai_load(ref_fname);
+ if ( !args.fai ) error("Failed to load the fai index: %s\n", ref_fname);
+
+ if ( args.mode==MODE_STATS ) return 1;
+ return 0;
+}
+
+static bcf1_t *set_ref_alt(args_t *args, bcf1_t *rec, const char ref, const char alt, int swap)
+{
+ rec->d.allele[0][0] = ref;
+ rec->d.allele[1][0] = alt;
+ rec->d.shared_dirty |= BCF1_DIRTY_ALS;
+
+ if ( !swap ) return rec; // only fix the alleles, leaving GTs unchanged
+
+ int ngts = bcf_get_genotypes(args->hdr, rec, &args->gts, &args->ngts);
+ int i, j, nsmpl = bcf_hdr_nsamples(args->hdr);
+ ngts /= nsmpl;
+ for (i=0; i<nsmpl; i++)
+ {
+ int32_t *ptr = args->gts + i*ngts;
+ for (j=0; j<ngts; j++)
+ {
+ if ( ptr[j]==bcf_gt_unphased(0) ) ptr[j] = bcf_gt_unphased(1);
+ else if ( ptr[j]==bcf_gt_phased(0) ) ptr[j] = bcf_gt_phased(1);
+ else if ( ptr[j]==bcf_gt_unphased(1) ) ptr[j] = bcf_gt_unphased(0);
+ else if ( ptr[j]==bcf_gt_phased(1) ) ptr[j] = bcf_gt_phased(0);
+ }
+ }
+ bcf_update_genotypes(args->hdr,rec,args->gts,args->ngts);
+
+ return rec;
+}
+
+static inline int nt2int(char nt)
+{
+ nt = toupper(nt);
+ if ( nt=='A' ) return 0;
+ if ( nt=='C' ) return 1;
+ if ( nt=='G' ) return 2;
+ if ( nt=='T' ) return 3;
+ return -1;
+}
+#define int2nt(x) "ACGT"[x]
+#define revint(x) ("3210"[x]-'0')
+
+static inline uint32_t parse_rsid(char *name)
+{
+ if ( name[0]!='r' || name[1]!='s' )
+ {
+ name = strstr(name, "rs");
+ if ( !name ) return 0;
+ }
+ char *tmp;
+ name += 2;
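+ uint64_t id = strtoull(name, &tmp, 10); // parse as unsigned 64-bit so the UINT32_MAX check below is meaningful even where long is 32 bits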
+ if ( tmp==name || *tmp ) return 0;
+ if ( id > UINT32_MAX ) error("FIXME: the ID is too big for uint32_t: %s\n", name-2);
+ return id;
+}
+
+static int fetch_ref(args_t *args, bcf1_t *rec)
+{
+ // Get the reference allele
+ int len;
+ char *ref = faidx_fetch_seq(args->fai, (char*)bcf_seqname(args->hdr,rec), rec->pos, rec->pos, &len);
+ if ( !ref )
+ {
+ if ( faidx_has_seq(args->fai, bcf_seqname(args->hdr,rec))==0 )
+ {
+ fprintf(stderr,"Ignoring sequence \"%s\"\n", bcf_seqname(args->hdr,rec));
+ args->skip_rid = rec->rid;
+ return -2;
+ }
+ error("faidx_fetch_seq failed at %s:%d\n", bcf_seqname(args->hdr,rec),rec->pos+1);
+ }
+ int ir = nt2int(*ref);
+ free(ref);
+ return ir;
+}
+
+static void dbsnp_init(args_t *args, const char *chr)
+{
+ if ( args->i2m ) kh_destroy(i2m, args->i2m);
+ args->i2m = kh_init(i2m);
+ bcf_srs_t *sr = bcf_sr_init();
+ if ( bcf_sr_set_regions(sr, chr, 0) != 0 ) goto done;
+ if ( !bcf_sr_add_reader(sr,args->dbsnp_fname) ) error("Failed to open %s: %s\n", args->dbsnp_fname,bcf_sr_strerror(sr->errnum));
+ while ( bcf_sr_next_line(sr) )
+ {
+ bcf1_t *rec = bcf_sr_get_line(sr, 0);
+ if ( rec->d.allele[0][1]!=0 || rec->d.allele[1][1]!=0 ) continue; // skip non-snps
+
+ int ref = nt2int(rec->d.allele[0][0]);
+ if ( ref<0 ) continue; // non-[ACGT] base
+
+ uint32_t id = parse_rsid(rec->d.id);
+ if ( !id ) continue;
+
+ int ret, k;
+ k = kh_put(i2m, args->i2m, id, &ret);
+ if ( ret<0 ) error("An error occurred while inserting the key %u\n", id);
+ if ( ret==0 ) continue; // skip ambiguous id
+ kh_val(args->i2m, k).pos = (uint32_t)rec->pos;
+ kh_val(args->i2m, k).ref = ref;
+ }
+done:
+ bcf_sr_destroy(sr);
+}
+
+static bcf1_t *dbsnp_check(args_t *args, bcf1_t *rec, int ir, int ia, int ib)
+{
+ int k, ref,pos;
+ uint32_t id = parse_rsid(rec->d.id);
+ if ( !id ) goto no_info;
+
+ k = kh_get(i2m, args->i2m, id);
+ if ( k==kh_end(args->i2m) ) goto no_info;
+
+ pos = (int)kh_val(args->i2m, k).pos;
+ if ( pos != rec->pos )
+ {
+ rec->pos = pos;
+ ir = fetch_ref(args, rec);
+ args->npos_err++;
+ }
+
+ ref = kh_val(args->i2m, k).ref;
+ if ( ref!=ir )
+ error("Reference base mismatch at %s:%d .. %c vs %c\n",bcf_seqname(args->hdr,rec),rec->pos+1,int2nt(ref),int2nt(ir));
+
+ if ( ia==ref ) return rec;
+ if ( ib==ref ) { args->nswap++; return set_ref_alt(args,rec,int2nt(ib),int2nt(ia),1); }
+
+no_info:
+ args->nunresolved++;
+ return args->discard ? NULL : rec;
+}
+
+bcf1_t *process(bcf1_t *rec)
+{
+ if ( rec->rid == args.skip_rid ) return NULL;
+
+ bcf1_t *ret = args.mode==MODE_STATS ? NULL : rec;
+ args.nsite++;
+
+ // Skip non-SNPs
+ if ( bcf_get_variant_types(rec)!=VCF_SNP )
+ {
+ args.nonSNP++;
+ return args.discard ? NULL : ret;
+ }
+
+ // Get the reference allele
+ int ir = fetch_ref(&args, rec);
+ if ( ir==-2 ) return NULL;
+ if ( ir==-1 )
+ {
+ args.nonACGT++;
+ return args.discard ? NULL : ret; // not A,C,G,T
+ }
+
+ if ( rec->n_allele!=2 )
+ {
+ // not a biallelic site
+ args.nonbiallelic++;
+ return args.discard ? NULL : ret;
+ }
+
+ int ia = nt2int(rec->d.allele[0][0]);
+ if ( ia<0 )
+ {
+ // not A,C,G,T
+ args.nonACGT++;
+ return args.discard ? NULL : ret;
+ }
+
+ int ib = nt2int(rec->d.allele[1][0]);
+ if ( ib<0 )
+ {
+ // not A,C,G,T
+ args.nonACGT++;
+ return args.discard ? NULL : ret;
+ }
+
+ if ( ia==ib )
+ {
+ // should not happen in well-formed VCF
+ args.nonSNP++;
+ return args.discard ? NULL : ret;
+ }
+ args.count[ia][ib]++;
+
+ if ( ir==ia ) args.nok++;
+
+ if ( args.mode==MODE_USE_ID )
+ {
+ if ( !args.i2m || args.rid!=rec->rid )
+ {
+ args.pos = 0;
+ args.rid = rec->rid;
+ dbsnp_init(&args,bcf_seqname(args.hdr,rec));
+ }
+ ret = dbsnp_check(&args, rec, ir,ia,ib);
+ if ( !args.unsorted && args.pos > rec->pos )
+ {
+ fprintf(stderr,
+ "Warning: corrected position(s) result in an unsorted VCF, for example %s:%d comes after %s:%d\n"
+ " The standard unix `sort` or `vcf-sort` from vcftools can be used to fix the order.\n",
+ bcf_seqname(args.hdr,rec),rec->pos+1,bcf_seqname(args.hdr,rec),args.pos+1);
+ args.unsorted = 1;
+ }
+ args.pos = rec->pos;
+ return ret;
+ }
+ else if ( args.mode==MODE_FLIP2FWD )
+ {
+ int pair = 1 << ia | 1 << ib;
+ if ( pair==0x9 || pair==0x6 ) // skip ambiguous pairs: A/T or C/G
+ {
+ args.nunresolved++;
+ return args.discard ? NULL : ret;
+ }
+ if ( ir==ia ) return ret;
+ if ( ir==ib ) { args.nswap++; return set_ref_alt(&args,rec,int2nt(ib),int2nt(ia),1); }
+ if ( ir==revint(ia) ) { args.nflip++; return set_ref_alt(&args,rec,int2nt(revint(ia)),int2nt(revint(ib)),0); }
+ if ( ir==revint(ib) ) { args.nflip_swap++; return set_ref_alt(&args,rec,int2nt(revint(ib)),int2nt(revint(ia)),1); }
+ error("FIXME: this should not happen %s:%d\n", bcf_seqname(args.hdr,rec),rec->pos+1);
+ }
+ else if ( args.mode==MODE_TOP2FWD )
+ {
+ int pair = 1 << ia | 1 << ib;
+ if ( pair != 0x9 && pair != 0x6 ) // unambiguous pair: A/C or A/G
+ {
+ if ( ir==ia ) return ret;
+
+ int ia_rev = revint(ia);
+ if ( ir==ia_rev ) // vcfref is A, faref is T, flip
+ {
+ args.nflip++;
+ return set_ref_alt(&args,rec,int2nt(ia_rev),int2nt(revint(ib)),0);
+ }
+ if ( ir==ib ) // vcfalt is faref, swap
+ {
+ args.nswap++;
+ return set_ref_alt(&args,rec,int2nt(ib),int2nt(ia),1);
+ }
+ assert( ib==revint(ir) );
+
+ args.nflip_swap++;
+ return set_ref_alt(&args,rec,int2nt(revint(ib)),int2nt(revint(ia)),1);
+ }
+ else // ambiguous pair, sequence walking must be performed
+ {
+ int len, win = rec->pos > 100 ? 100 : rec->pos, beg = rec->pos - win, end = rec->pos + win;
+ char *ref = faidx_fetch_seq(args.fai, (char*)bcf_seqname(args.hdr,rec), beg,end, &len);
+ if ( !ref ) error("faidx_fetch_seq failed at %s:%d\n", bcf_seqname(args.hdr,rec),rec->pos+1);
+ if ( end - beg + 1 != len ) error("FIXME: check win=%d,len=%d at %s:%d (%d %d)\n", win,len, bcf_seqname(args.hdr,rec),rec->pos+1, end,beg);
+
+ int i, mid = rec->pos - beg, strand = 0;
+ for (i=1; i<=win; i++)
+ {
+ int ra = nt2int(ref[mid-i]);
+ int rb = nt2int(ref[mid+i]);
+ if ( ra<0 || rb<0 || ra==rb ) continue; // skip N's and non-informative pairs: A/A, C/C, G/G, T/T
+ pair = 1 << ra | 1 << rb;
+ if ( pair==0x9 || pair==0x6 ) continue; // skip ambiguous pairs: A/T or C/G
+ strand = 1 << ra & 0x9 ? 1 : -1;
+ break;
+ }
+ free(ref);
+
+ if ( strand==1 )
+ {
+ if ( ir==ia ) return ret;
+ if ( ir==ib )
+ {
+ args.nswap++;
+ return set_ref_alt(&args,rec,int2nt(ib),int2nt(ia),1);
+ }
+ }
+ else if ( strand==-1 )
+ {
+ int ia_rev = revint(ia);
+ int ib_rev = revint(ib);
+ if ( ir==ia_rev )
+ {
+ args.nflip++;
+ return set_ref_alt(&args,rec,int2nt(ia_rev),int2nt(ib_rev),0);
+ }
+ if ( ir==ib_rev )
+ {
+ args.nflip_swap++;
+ return set_ref_alt(&args,rec,int2nt(ib_rev),int2nt(ia_rev),1);
+ }
+ }
+
+ args.nunresolved++;
+ return args.discard ? NULL : ret;
+ }
+ }
+ return ret;
+}
+
+int top_mask[4][4] =
+{
+ {0,1,1,1},
+ {0,0,1,0},
+ {0,0,0,0},
+ {0,0,0,0},
+};
+int bot_mask[4][4] =
+{
+ {0,0,0,0},
+ {0,0,0,0},
+ {0,1,0,0},
+ {1,1,1,0},
+};
+
+void destroy(void)
+{
+ uint32_t i,j,tot = 0;
+ uint32_t top_err = 0, bot_err = 0;
+ for (i=0; i<4; i++)
+ {
+ for (j=0; j<4; j++)
+ {
+ tot += args.count[i][j];
+ if ( !top_mask[i][j] && args.count[i][j] ) top_err++;
+ if ( !bot_mask[i][j] && args.count[i][j] ) bot_err++;
+ }
+ }
+ uint32_t nskip = args.nonACGT+args.nonSNP+args.nonbiallelic;
+ uint32_t ncmp = args.nsite - nskip;
+
+ fprintf(stderr,"# SC, guessed strand convention\n");
+ fprintf(stderr,"SC\tTOP-compatible\t%d\n",top_err?0:1);
+ fprintf(stderr,"SC\tBOT-compatible\t%d\n",bot_err?0:1);
+
+ fprintf(stderr,"# ST, substitution types\n");
+ for (i=0; i<4; i++)
+ {
+ for (j=0; j<4; j++)
+ {
+ if ( i==j ) continue;
+ fprintf(stderr,"ST\t%c>%c\t%u\t%.1f%%\n", int2nt(i),int2nt(j),args.count[i][j], args.count[i][j]*100./tot);
+ }
+ }
+ fprintf(stderr,"# NS, Number of sites:\n");
+ fprintf(stderr,"NS\ttotal \t%u\n", args.nsite);
+ fprintf(stderr,"NS\tref match \t%u\t%.1f%%\n", args.nok,100.*args.nok/ncmp);
+ fprintf(stderr,"NS\tref mismatch \t%u\t%.1f%%\n", ncmp-args.nok,100.*(ncmp-args.nok)/ncmp);
+ if ( args.mode!=MODE_STATS )
+ {
+ fprintf(stderr,"NS\tflipped \t%u\t%.1f%%\n", args.nflip,100.*args.nflip/(args.nsite-nskip));
+ fprintf(stderr,"NS\tswapped \t%u\t%.1f%%\n", args.nswap,100.*args.nswap/(args.nsite-nskip));
+ fprintf(stderr,"NS\tflip+swap \t%u\t%.1f%%\n", args.nflip_swap,100.*args.nflip_swap/(args.nsite-nskip));
+ fprintf(stderr,"NS\tunresolved \t%u\t%.1f%%\n", args.nunresolved,100.*args.nunresolved/(args.nsite-nskip));
+ fprintf(stderr,"NS\tfixed pos \t%u\t%.1f%%\n", args.npos_err,100.*args.npos_err/(args.nsite-nskip));
+ }
+ fprintf(stderr,"NS\tskipped \t%u\n", nskip);
+ fprintf(stderr,"NS\tnon-ACGT \t%u\n", args.nonACGT);
+ fprintf(stderr,"NS\tnon-SNP \t%u\n", args.nonSNP);
+ fprintf(stderr,"NS\tnon-biallelic\t%u\n", args.nonbiallelic);
+
+ free(args.gts);
+ if ( args.fai ) fai_destroy(args.fai);
+ if ( args.i2m ) kh_destroy(i2m, args.i2m);
+}
--- /dev/null
+#include "bcftools.pysam.h"
+
+/* The MIT License
+
+ Copyright (c) 2016 Genome Research Ltd.
+
+ Author: Petr Danecek <pd3@sanger.ac.uk>
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in
+ all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ THE SOFTWARE.
+
+ */
+
+/*
+ Illumina TOP/BOT strand convention causes a lot of pain. This tool
+ attempts to determine the strand convention and convert it to the
+ forward reference strand.
+
+ On TOP strand, we can encounter
+ unambiguous SNPs:
+ A/G
+ A/C
+ ambiguous (context-dependent) SNPs:
+ C/G
+ A/T
+
+ On BOT strand:
+ unambiguous SNPs:
+ T/G
+ T/C
+ ambiguous (context-dependent) SNPs:
+ T/A
+ G/C
+
+
+ For unambiguous pairs (A/C, A/G, T/C, T/G), the knowledge of reference base
+ at the SNP position is enough to determine the strand:
+
+ TOP REF -> ALLELES TOP_ON_STRAND
+ -------------------------------------------
+ A/C A or C A/C 1
+ " T or G T/G -1
+ A/G A or G A/G 1
+ " T or C T/C -1
+
+
+ For ambiguous pairs (A/T, C/G), sequence walking must be performed
+ (simultaneously upstream and downstream) until the first unambiguous pair
+ of flanking bases is encountered. The 5' base of that pair determines the strand:
+
+ TOP 5'REF_BASE -> ALLELES TOP_ON_STRAND
+ ------------------------------------------------
+ A/T A or T A/T 1
+ " C or G T/A -1
+ C/G A or T C/G 1
+ " C or G G/C -1
+
+ */
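Review note: the two tables in the comment above can be read as a small decision procedure. What follows is a minimal standalone sketch of that procedure, not the plugin's own code. The helpers `complement`, `is_ambiguous`, `strand_from_ref` and `strand_from_context` are hypothetical names introduced here; bases are encoded 0..3 for A, C, G, T as in the plugin's `nt2int`.

```c
#include <assert.h>
#include <ctype.h>

/* Encode a nucleotide as 0..3 (A,C,G,T); anything else is -1. */
static int nt2int(char nt)
{
    switch (toupper((unsigned char)nt))
    {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default:  return -1;
    }
}

/* Complement of an encoded base: A<->T, C<->G. */
static int complement(int x)
{
    assert(x >= 0 && x < 4);
    return 3 - x;
}

/* A/T (bits 0 and 3 -> 0x9) and C/G (bits 1 and 2 -> 0x6) are ambiguous. */
static int is_ambiguous(int a, int b)
{
    assert(a >= 0 && a < 4 && b >= 0 && b < 4);
    int pair = (1 << a) | (1 << b);
    return pair == 0x9 || pair == 0x6;
}

/* Unambiguous pair: 1 if the reference base is one of the alleles,
   -1 if it is the complement of one of them, 0 if undecidable here. */
static int strand_from_ref(int a, int b, int r)
{
    if (is_ambiguous(a, b)) return 0;
    if (r == a || r == b) return 1;
    if (r == complement(a) || r == complement(b)) return -1;
    return 0;
}

/* Ambiguous pair: walk outwards from seq[mid] until the flanking bases
   form an unambiguous pair; A or T on the 5' side gives 1, C or G gives -1.
   Returns 0 if no informative pair is found inside the sequence. */
static int strand_from_context(const char *seq, int len, int mid)
{
    for (int i = 1; mid - i >= 0 && mid + i < len; i++)
    {
        int ra = nt2int(seq[mid - i]);
        int rb = nt2int(seq[mid + i]);
        if (ra < 0 || rb < 0 || ra == rb) continue;  /* N's, identical bases */
        if (is_ambiguous(ra, rb)) continue;          /* A/T or C/G */
        return ((1 << ra) & 0x9) ? 1 : -1;
    }
    return 0;
}
```

For example, with the context `ACGTA` centred on the `G`, the first flanking pair is C/T, so the 5' base is C and the sketch reports strand -1, matching the second row of each ambiguous entry in the table.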
+
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <strings.h>
+#include <ctype.h>
+#include <assert.h>
+#include <getopt.h>
+#include <math.h>
+#include <htslib/hts.h>
+#include <htslib/vcf.h>
+#include <htslib/kstring.h>
+#include <htslib/kseq.h>
+#include <htslib/kfunc.h>
+#include <htslib/faidx.h>
+#include <htslib/khash.h>
+#include <htslib/synced_bcf_reader.h>
+#include "bcftools.h"
+
+#define MODE_STATS 1
+#define MODE_TOP2FWD 2
+#define MODE_FLIP2FWD 3
+#define MODE_USE_ID 4
+
+typedef struct
+{
+ uint32_t pos;
+ uint8_t ref;
+}
+marker_t;
+
+KHASH_MAP_INIT_INT(i2m, marker_t)
+typedef khash_t(i2m) i2m_t;
+
+typedef struct
+{
+ char *dbsnp_fname;
+ int mode, discard;
+ bcf_hdr_t *hdr;
+ faidx_t *fai;
+ int rid, skip_rid;
+ i2m_t *i2m;
+ int32_t *gts, ngts, pos;
+ uint32_t nsite,nok,nflip,nunresolved,nswap,nflip_swap,nonSNP,nonACGT,nonbiallelic;
+ uint32_t count[4][4], npos_err, unsorted;
+}
+args_t;
+
+args_t args;
+
+const char *about(void)
+{
+ return "Fix reference strand orientation, e.g. from Illumina/TOP to fwd.\n";
+}
+
+const char *usage(void)
+{
+ return
+ "\n"
+ "About: This tool helps to determine and fix strand orientation.\n"
+ " Currently the following modes are recognised:\n"
+ " flip .. flips non-ambiguous SNPs and ignores the rest\n"
+ " id .. swap REF/ALT and GTs using the ID column to determine the REF allele\n"
+ " stats .. collect and print stats\n"
+ " top .. converts from Illumina TOP strand to fwd\n"
+ "\n"
+ " WARNING: Do not use the program blindly, make an effort to\n"
+ " understand what strand convention your data uses! Make sure\n"
+ " the reason for mismatching REF alleles is not a different\n"
+ " reference build!!\n"
+ "\n"
+ "Usage: bcftools +fixref [General Options] -- [Plugin Options]\n"
+ "Options:\n"
+ " run \"bcftools plugin\" for a list of common options\n"
+ "\n"
+ "Plugin options:\n"
+ " -d, --discard Discard sites which could not be resolved\n"
+ " -f, --fasta-ref <file.fa> Reference sequence\n"
+ " -i, --use-id <file.vcf> Swap REF/ALT using the ID column to determine the REF allele, implies -m id.\n"
+ " Download the dbSNP file from\n"
+ " https://www.ncbi.nlm.nih.gov/variation/docs/human_variation_vcf\n"
+ " -m, --mode <string> Collect stats (\"stats\") or convert (\"flip\", \"id\", \"top\") [stats]\n"
+ "\n"
+ "Examples:\n"
+ " # run stats\n"
+ " bcftools +fixref file.bcf -- -f ref.fa\n"
+ "\n"
+ " # convert from TOP to fwd\n"
+ " bcftools +fixref file.bcf -Ob -o out.bcf -- -f ref.fa -m top\n"
+ "\n"
+ " # match the REF/ALT alleles based on the ID column, discard unknown sites\n"
+ " bcftools +fixref file.bcf -Ob -o out.bcf -- -d -f ref.fa -i All_20151104.vcf.gz\n"
+ "\n"
+ " # assuming the reference build is correct, just flip to fwd, discarding the rest\n"
+ " bcftools +fixref file.bcf -Ob -o out.bcf -- -d -f ref.fa -m flip\n"
+ "\n";
+}
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+ memset(&args,0,sizeof(args_t));
+ args.skip_rid = -1;
+ args.hdr = in;
+ args.mode = MODE_STATS;
+ char *ref_fname = NULL;
+ static struct option loptions[] =
+ {
+ {"mode",required_argument,NULL,'m'},
+ {"discard",no_argument,NULL,'d'},
+ {"fasta-ref",required_argument,NULL,'f'},
+ {"use-id",required_argument,NULL,'i'},
+ {NULL,0,NULL,0}
+ };
+ int c;
+ while ((c = getopt_long(argc, argv, "?hf:m:di:",loptions,NULL)) >= 0)
+ {
+ switch (c)
+ {
+ case 'm':
+ if ( !strcasecmp(optarg,"top") ) args.mode = MODE_TOP2FWD;
+ else if ( !strcasecmp(optarg,"flip") ) args.mode = MODE_FLIP2FWD;
+ else if ( !strcasecmp(optarg,"id") ) args.mode = MODE_USE_ID;
+ else if ( !strcasecmp(optarg,"stats") ) args.mode = MODE_STATS;
+ else error("The source strand convention not recognised: %s\n", optarg);
+ break;
+ case 'i': args.dbsnp_fname = optarg; args.mode = MODE_USE_ID; break;
+ case 'd': args.discard = 1; break;
+ case 'f': ref_fname = optarg; break;
+ case 'h':
+ case '?':
+ default: error("%s", usage()); break;
+ }
+ }
+ if ( !ref_fname ) error("Expected the -f option\n");
+ args.fai = fai_load(ref_fname);
+ if ( !args.fai ) error("Failed to load the fai index: %s\n", ref_fname);
+
+ if ( args.mode==MODE_STATS ) return 1;
+ return 0;
+}
+
+static bcf1_t *set_ref_alt(args_t *args, bcf1_t *rec, const char ref, const char alt, int swap)
+{
+ rec->d.allele[0][0] = ref;
+ rec->d.allele[1][0] = alt;
+ rec->d.shared_dirty |= BCF1_DIRTY_ALS;
+
+ if ( !swap ) return rec; // only fix the alleles, leaving GTs unchanged
+
+ int ngts = bcf_get_genotypes(args->hdr, rec, &args->gts, &args->ngts);
+ int i, j, nsmpl = bcf_hdr_nsamples(args->hdr);
+ ngts /= nsmpl;
+ for (i=0; i<nsmpl; i++)
+ {
+ int32_t *ptr = args->gts + i*ngts;
+ for (j=0; j<ngts; j++)
+ {
+ if ( ptr[j]==bcf_gt_unphased(0) ) ptr[j] = bcf_gt_unphased(1);
+ else if ( ptr[j]==bcf_gt_phased(0) ) ptr[j] = bcf_gt_phased(1);
+ else if ( ptr[j]==bcf_gt_unphased(1) ) ptr[j] = bcf_gt_unphased(0);
+ else if ( ptr[j]==bcf_gt_phased(1) ) ptr[j] = bcf_gt_phased(0);
+ }
+ }
+ bcf_update_genotypes(args->hdr,rec,args->gts,args->ngts);
+
+ return rec;
+}
+
+static inline int nt2int(char nt)
+{
+ nt = toupper(nt);
+ if ( nt=='A' ) return 0;
+ if ( nt=='C' ) return 1;
+ if ( nt=='G' ) return 2;
+ if ( nt=='T' ) return 3;
+ return -1;
+}
+#define int2nt(x) "ACGT"[x]
+#define revint(x) ("3210"[x]-'0')
+
+static inline uint32_t parse_rsid(char *name)
+{
+ if ( name[0]!='r' || name[1]!='s' )
+ {
+ name = strstr(name, "rs");
+ if ( !name ) return 0;
+ }
+ char *tmp;
+ name += 2;
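+ uint64_t id = strtoull(name, &tmp, 10); // parse as unsigned 64-bit so the UINT32_MAX check below is meaningful even where long is 32 bits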
+ if ( tmp==name || *tmp ) return 0;
+ if ( id > UINT32_MAX ) error("FIXME: the ID is too big for uint32_t: %s\n", name-2);
+ return id;
+}
+
+static int fetch_ref(args_t *args, bcf1_t *rec)
+{
+ // Get the reference allele
+ int len;
+ char *ref = faidx_fetch_seq(args->fai, (char*)bcf_seqname(args->hdr,rec), rec->pos, rec->pos, &len);
+ if ( !ref )
+ {
+ if ( faidx_has_seq(args->fai, bcf_seqname(args->hdr,rec))==0 )
+ {
+ fprintf(bcftools_stderr,"Ignoring sequence \"%s\"\n", bcf_seqname(args->hdr,rec));
+ args->skip_rid = rec->rid;
+ return -2;
+ }
+ error("faidx_fetch_seq failed at %s:%d\n", bcf_seqname(args->hdr,rec),rec->pos+1);
+ }
+ int ir = nt2int(*ref);
+ free(ref);
+ return ir;
+}
+
+static void dbsnp_init(args_t *args, const char *chr)
+{
+ if ( args->i2m ) kh_destroy(i2m, args->i2m);
+ args->i2m = kh_init(i2m);
+ bcf_srs_t *sr = bcf_sr_init();
+ if ( bcf_sr_set_regions(sr, chr, 0) != 0 ) goto done;
+ if ( !bcf_sr_add_reader(sr,args->dbsnp_fname) ) error("Failed to open %s: %s\n", args->dbsnp_fname,bcf_sr_strerror(sr->errnum));
+ while ( bcf_sr_next_line(sr) )
+ {
+ bcf1_t *rec = bcf_sr_get_line(sr, 0);
+ if ( rec->d.allele[0][1]!=0 || rec->d.allele[1][1]!=0 ) continue; // skip non-snps
+
+ int ref = nt2int(rec->d.allele[0][0]);
+ if ( ref<0 ) continue; // non-[ACGT] base
+
+ uint32_t id = parse_rsid(rec->d.id);
+ if ( !id ) continue;
+
+ int ret, k;
+ k = kh_put(i2m, args->i2m, id, &ret);
+ if ( ret<0 ) error("An error occurred while inserting the key %u\n", id);
+ if ( ret==0 ) continue; // skip ambiguous id
+ kh_val(args->i2m, k).pos = (uint32_t)rec->pos;
+ kh_val(args->i2m, k).ref = ref;
+ }
+done:
+ bcf_sr_destroy(sr);
+}
+
+static bcf1_t *dbsnp_check(args_t *args, bcf1_t *rec, int ir, int ia, int ib)
+{
+ int k, ref,pos;
+ uint32_t id = parse_rsid(rec->d.id);
+ if ( !id ) goto no_info;
+
+ k = kh_get(i2m, args->i2m, id);
+ if ( k==kh_end(args->i2m) ) goto no_info;
+
+ pos = (int)kh_val(args->i2m, k).pos;
+ if ( pos != rec->pos )
+ {
+ rec->pos = pos;
+ ir = fetch_ref(args, rec);
+ args->npos_err++;
+ }
+
+ ref = kh_val(args->i2m, k).ref;
+ if ( ref!=ir )
+ error("Reference base mismatch at %s:%d .. %c vs %c\n",bcf_seqname(args->hdr,rec),rec->pos+1,int2nt(ref),int2nt(ir));
+
+ if ( ia==ref ) return rec;
+ if ( ib==ref ) { args->nswap++; return set_ref_alt(args,rec,int2nt(ib),int2nt(ia),1); }
+
+no_info:
+ args->nunresolved++;
+ return args->discard ? NULL : rec;
+}
+
+bcf1_t *process(bcf1_t *rec)
+{
+ if ( rec->rid == args.skip_rid ) return NULL;
+
+ bcf1_t *ret = args.mode==MODE_STATS ? NULL : rec;
+ args.nsite++;
+
+ // Skip non-SNPs
+ if ( bcf_get_variant_types(rec)!=VCF_SNP )
+ {
+ args.nonSNP++;
+ return args.discard ? NULL : ret;
+ }
+
+ // Get the reference allele
+ int ir = fetch_ref(&args, rec);
+ if ( ir==-2 ) return NULL;
+ if ( ir==-1 )
+ {
+ args.nonACGT++;
+ return args.discard ? NULL : ret; // not A,C,G,T
+ }
+
+ if ( rec->n_allele!=2 )
+ {
+ // not a biallelic site
+ args.nonbiallelic++;
+ return args.discard ? NULL : ret;
+ }
+
+ int ia = nt2int(rec->d.allele[0][0]);
+ if ( ia<0 )
+ {
+ // not A,C,G,T
+ args.nonACGT++;
+ return args.discard ? NULL : ret;
+ }
+
+ int ib = nt2int(rec->d.allele[1][0]);
+ if ( ib<0 )
+ {
+ // not A,C,G,T
+ args.nonACGT++;
+ return args.discard ? NULL : ret;
+ }
+
+ if ( ia==ib )
+ {
+ // should not happen in well-formed VCF
+ args.nonSNP++;
+ return args.discard ? NULL : ret;
+ }
+ args.count[ia][ib]++;
+
+ if ( ir==ia ) args.nok++;
+
+ if ( args.mode==MODE_USE_ID )
+ {
+ if ( !args.i2m || args.rid!=rec->rid )
+ {
+ args.pos = 0;
+ args.rid = rec->rid;
+ dbsnp_init(&args,bcf_seqname(args.hdr,rec));
+ }
+ ret = dbsnp_check(&args, rec, ir,ia,ib);
+ if ( !args.unsorted && args.pos > rec->pos )
+ {
+ fprintf(bcftools_stderr,
+ "Warning: corrected position(s) result in an unsorted VCF, for example %s:%d comes after %s:%d\n"
+ " The standard unix `sort` or `vcf-sort` from vcftools can be used to fix the order.\n",
+ bcf_seqname(args.hdr,rec),rec->pos+1,bcf_seqname(args.hdr,rec),args.pos+1);
+ args.unsorted = 1;
+ }
+ args.pos = rec->pos;
+ return ret;
+ }
+ else if ( args.mode==MODE_FLIP2FWD )
+ {
+ int pair = 1 << ia | 1 << ib;
+ if ( pair==0x9 || pair==0x6 ) // skip ambiguous pairs: A/T or C/G
+ {
+ args.nunresolved++;
+ return args.discard ? NULL : ret;
+ }
+ if ( ir==ia ) return ret;
+ if ( ir==ib ) { args.nswap++; return set_ref_alt(&args,rec,int2nt(ib),int2nt(ia),1); }
+ if ( ir==revint(ia) ) { args.nflip++; return set_ref_alt(&args,rec,int2nt(revint(ia)),int2nt(revint(ib)),0); }
+ if ( ir==revint(ib) ) { args.nflip_swap++; return set_ref_alt(&args,rec,int2nt(revint(ib)),int2nt(revint(ia)),1); }
+ error("FIXME: this should not happen %s:%d\n", bcf_seqname(args.hdr,rec),rec->pos+1);
+ }
+ else if ( args.mode==MODE_TOP2FWD )
+ {
+ int pair = 1 << ia | 1 << ib;
+ if ( pair != 0x9 && pair != 0x6 ) // unambiguous pair: A/C or A/G
+ {
+ if ( ir==ia ) return ret;
+
+ int ia_rev = revint(ia);
+ if ( ir==ia_rev ) // vcfref is A, faref is T, flip
+ {
+ args.nflip++;
+ return set_ref_alt(&args,rec,int2nt(ia_rev),int2nt(revint(ib)),0);
+ }
+ if ( ir==ib ) // vcfalt is faref, swap
+ {
+ args.nswap++;
+ return set_ref_alt(&args,rec,int2nt(ib),int2nt(ia),1);
+ }
+ assert( ib==revint(ir) );
+
+ args.nflip_swap++;
+ return set_ref_alt(&args,rec,int2nt(revint(ib)),int2nt(revint(ia)),1);
+ }
+ else // ambiguous pair, sequence walking must be performed
+ {
+ int len, win = rec->pos > 100 ? 100 : rec->pos, beg = rec->pos - win, end = rec->pos + win;
+ char *ref = faidx_fetch_seq(args.fai, (char*)bcf_seqname(args.hdr,rec), beg,end, &len);
+ if ( !ref ) error("faidx_fetch_seq failed at %s:%d\n", bcf_seqname(args.hdr,rec),rec->pos+1);
+ if ( end - beg + 1 != len ) error("FIXME: check win=%d,len=%d at %s:%d (%d %d)\n", win,len, bcf_seqname(args.hdr,rec),rec->pos+1, end,beg);
+
+ int i, mid = rec->pos - beg, strand = 0;
+ for (i=1; i<=win; i++)
+ {
+ int ra = nt2int(ref[mid-i]);
+ int rb = nt2int(ref[mid+i]);
+ if ( ra<0 || rb<0 || ra==rb ) continue; // skip N's and non-informative pairs: A/A, C/C, G/G, T/T
+ pair = 1 << ra | 1 << rb;
+ if ( pair==0x9 || pair==0x6 ) continue; // skip ambiguous pairs: A/T or C/G
+ strand = 1 << ra & 0x9 ? 1 : -1;
+ break;
+ }
+ free(ref);
+
+ if ( strand==1 )
+ {
+ if ( ir==ia ) return ret;
+ if ( ir==ib )
+ {
+ args.nswap++;
+ return set_ref_alt(&args,rec,int2nt(ib),int2nt(ia),1);
+ }
+ }
+ else if ( strand==-1 )
+ {
+ int ia_rev = revint(ia);
+ int ib_rev = revint(ib);
+ if ( ir==ia_rev )
+ {
+ args.nflip++;
+ return set_ref_alt(&args,rec,int2nt(ia_rev),int2nt(ib_rev),0);
+ }
+ if ( ir==ib_rev )
+ {
+ args.nflip_swap++;
+ return set_ref_alt(&args,rec,int2nt(ib_rev),int2nt(ia_rev),1);
+ }
+ }
+
+ args.nunresolved++;
+ return args.discard ? NULL : ret;
+ }
+ }
+ return ret;
+}
+
+int top_mask[4][4] =
+{
+ {0,1,1,1},
+ {0,0,1,0},
+ {0,0,0,0},
+ {0,0,0,0},
+};
+int bot_mask[4][4] =
+{
+ {0,0,0,0},
+ {0,0,0,0},
+ {0,1,0,0},
+ {1,1,1,0},
+};
+
+void destroy(void)
+{
+ uint32_t i,j,tot = 0;
+ uint32_t top_err = 0, bot_err = 0;
+ for (i=0; i<4; i++)
+ {
+ for (j=0; j<4; j++)
+ {
+ tot += args.count[i][j];
+ if ( !top_mask[i][j] && args.count[i][j] ) top_err++;
+ if ( !bot_mask[i][j] && args.count[i][j] ) bot_err++;
+ }
+ }
+ uint32_t nskip = args.nonACGT+args.nonSNP+args.nonbiallelic;
+ uint32_t ncmp = args.nsite - nskip;
+
+ fprintf(bcftools_stderr,"# SC, guessed strand convention\n");
+ fprintf(bcftools_stderr,"SC\tTOP-compatible\t%d\n",top_err?0:1);
+ fprintf(bcftools_stderr,"SC\tBOT-compatible\t%d\n",bot_err?0:1);
+
+ fprintf(bcftools_stderr,"# ST, substitution types\n");
+ for (i=0; i<4; i++)
+ {
+ for (j=0; j<4; j++)
+ {
+ if ( i==j ) continue;
+ fprintf(bcftools_stderr,"ST\t%c>%c\t%u\t%.1f%%\n", int2nt(i),int2nt(j),args.count[i][j], args.count[i][j]*100./tot);
+ }
+ }
+ fprintf(bcftools_stderr,"# NS, Number of sites:\n");
+ fprintf(bcftools_stderr,"NS\ttotal \t%u\n", args.nsite);
+ fprintf(bcftools_stderr,"NS\tref match \t%u\t%.1f%%\n", args.nok,100.*args.nok/ncmp);
+ fprintf(bcftools_stderr,"NS\tref mismatch \t%u\t%.1f%%\n", ncmp-args.nok,100.*(ncmp-args.nok)/ncmp);
+ if ( args.mode!=MODE_STATS )
+ {
+ fprintf(bcftools_stderr,"NS\tflipped \t%u\t%.1f%%\n", args.nflip,100.*args.nflip/(args.nsite-nskip));
+ fprintf(bcftools_stderr,"NS\tswapped \t%u\t%.1f%%\n", args.nswap,100.*args.nswap/(args.nsite-nskip));
+ fprintf(bcftools_stderr,"NS\tflip+swap \t%u\t%.1f%%\n", args.nflip_swap,100.*args.nflip_swap/(args.nsite-nskip));
+ fprintf(bcftools_stderr,"NS\tunresolved \t%u\t%.1f%%\n", args.nunresolved,100.*args.nunresolved/(args.nsite-nskip));
+ fprintf(bcftools_stderr,"NS\tfixed pos \t%u\t%.1f%%\n", args.npos_err,100.*args.npos_err/(args.nsite-nskip));
+ }
+ fprintf(bcftools_stderr,"NS\tskipped \t%u\n", nskip);
+ fprintf(bcftools_stderr,"NS\tnon-ACGT \t%u\n", args.nonACGT);
+ fprintf(bcftools_stderr,"NS\tnon-SNP \t%u\n", args.nonSNP);
+ fprintf(bcftools_stderr,"NS\tnon-biallelic\t%u\n", args.nonbiallelic);
+
+ free(args.gts);
+ if ( args.fai ) fai_destroy(args.fai);
+ if ( args.i2m ) kh_destroy(i2m, args.i2m);
+}
--- /dev/null
+/* plugins/frameshifts.c -- annotates frameshift indels.
+
+ Copyright (C) 2014 Genome Research Ltd.
+
+ Author: Petr Danecek <pd3@sanger.ac.uk>
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE. */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <htslib/vcf.h>
+#include <htslib/synced_bcf_reader.h>
+#include <getopt.h>
+
+bcf_hdr_t *in_hdr, *out_hdr;
+bcf_sr_regions_t *exons;
+int32_t *frm = NULL, nfrm = 0;
+
+const char *about(void)
+{
+ return
+ "Annotate frameshift indels.\n";
+}
+
+const char *usage(void)
+{
+ return
+ "\n"
+ "About: Annotate frameshift indels\n"
+ "Usage: bcftools +frameshifts [General Options] -- [Plugin Options]\n"
+ "Options:\n"
+ " run \"bcftools plugin\" for a list of common options\n"
+ "\n"
+ "Plugin options:\n"
+ " -e, --exons <file> list of exons, see \"--targets-file\" man page entry for details\n"
+ "\n"
+ "Example:\n"
+ " bcftools +frameshifts in.vcf -- -e exons.bed.gz\n"
+ "\n";
+}
+
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+ int c;
+ char *fname = NULL;
+
+ static struct option loptions[] =
+ {
+ {"exons",1,0,'e'},
+ {0,0,0,0}
+ };
+ while ((c = getopt_long(argc, argv, "e:?h",loptions,NULL)) >= 0)
+ {
+ switch (c)
+ {
+ case 'e': fname = optarg; break;
+ case 'h':
+ case '?':
+ default: fprintf(stderr,"%s", usage()); exit(1); break;
+ }
+ }
+ if ( !fname )
+ {
+ fprintf(stderr,"Missing the -e option.\n");
+ return -1;
+ }
+
+ in_hdr = in;
+ out_hdr = out;
+
+ int ret = bcf_hdr_append(out_hdr,"##INFO=<ID=OOF,Number=A,Type=Integer,Description=\"Frameshift Indels: out-of-frame (1), in-frame (0), not-applicable (-1 or missing)\">");
+ if ( ret!=0 )
+ {
+ fprintf(stderr,"Error updating the header\n");
+ return -1;
+ }
+
+ exons = bcf_sr_regions_init(fname,1,0,1,2);
+ if ( !exons )
+ {
+ fprintf(stderr,"Error occurred while reading (was the file compressed with bgzip?): %s\n", fname);
+ return -1;
+ }
+
+ return 0;
+}
+
+bcf1_t *process(bcf1_t *rec)
+{
+ if ( rec->n_allele<2 ) return rec; // not a variant
+
+ int type = bcf_get_variant_types(rec);
+ if ( !(type&VCF_INDEL) ) return rec; // not an indel
+
+ int i, len = 0;
+ for (i=1; i<rec->n_allele; i++)
+ if ( len > rec->d.var[i].n ) len = rec->d.var[i].n;
+
+ int pos_to = len!=0 ? rec->pos - len : rec->pos; // len is negative for deletions, so this extends to the last deleted base
+ if ( bcf_sr_regions_overlap(exons, bcf_seqname(in_hdr,rec),rec->pos,pos_to) ) return rec; // no overlap
+
+ hts_expand(int32_t,rec->n_allele-1,nfrm,frm);
+ for (i=1; i<rec->n_allele; i++)
+ {
+ if ( rec->d.var[i].type!=VCF_INDEL ) { frm[i-1] = -1; continue; }
+
+ int len = rec->d.var[i].n, tlen = 0;
+ if ( len>0 )
+ {
+ // insertion
+ if ( exons->start <= rec->pos && exons->end > rec->pos ) tlen = abs(len);
+ }
+ else if ( exons->start <= rec->pos + abs(len) )
+ {
+ // deletion
+ tlen = abs(len);
+ if ( rec->pos < exons->start ) // trim the beginning
+ tlen -= exons->start - rec->pos + 1;
+ if ( exons->end < rec->pos + abs(len) ) // trim the end
+ tlen -= rec->pos + abs(len) - exons->end;
+ }
+ if ( tlen ) // there are some deleted/inserted bases in the exon
+ {
+ if ( tlen%3 ) frm[i-1] = 1; // out-of-frame
+ else frm[i-1] = 0; // in-frame
+ }
+ else frm[i-1] = -1; // not applicable (is outside)
+ }
+
+ if ( bcf_update_info_int32(out_hdr,rec,"OOF",frm,rec->n_allele-1)<0 ) { fprintf(stderr, "Could not annotate OOF :-/\n"); exit(1); }
+ return rec;
+}
+
+
+void destroy(void)
+{
+ bcf_sr_regions_destroy(exons);
+}
+
+
--- /dev/null
+#include "bcftools.pysam.h"
+
+/* plugins/frameshifts.c -- annotates frameshift indels.
+
+ Copyright (C) 2014 Genome Research Ltd.
+
+ Author: Petr Danecek <pd3@sanger.ac.uk>
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE. */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <htslib/vcf.h>
+#include <htslib/synced_bcf_reader.h>
+#include <getopt.h>
+
+bcf_hdr_t *in_hdr, *out_hdr;
+bcf_sr_regions_t *exons;
+int32_t *frm = NULL, nfrm = 0;
+
+const char *about(void)
+{
+ return
+ "Annotate frameshift indels.\n";
+}
+
+const char *usage(void)
+{
+ return
+ "\n"
+ "About: Annotate frameshift indels\n"
+ "Usage: bcftools +frameshifts [General Options] -- [Plugin Options]\n"
+ "Options:\n"
+ " run \"bcftools plugin\" for a list of common options\n"
+ "\n"
+ "Plugin options:\n"
+ " -e, --exons <file> list of exons, see \"--targets-file\" man page entry for details\n"
+ "\n"
+ "Example:\n"
+ " bcftools +frameshifts in.vcf -- -e exons.bed.gz\n"
+ "\n";
+}
+
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+ int c;
+ char *fname = NULL;
+
+ static struct option loptions[] =
+ {
+ {"exons",1,0,'e'},
+ {0,0,0,0}
+ };
+ while ((c = getopt_long(argc, argv, "e:?h",loptions,NULL)) >= 0)
+ {
+ switch (c)
+ {
+ case 'e': fname = optarg; break;
+ case 'h':
+ case '?':
+ default: fprintf(bcftools_stderr,"%s", usage()); exit(1); break;
+ }
+ }
+ if ( !fname )
+ {
+ fprintf(bcftools_stderr,"Missing the -e option.\n");
+ return -1;
+ }
+
+ in_hdr = in;
+ out_hdr = out;
+
+ int ret = bcf_hdr_append(out_hdr,"##INFO=<ID=OOF,Number=A,Type=Integer,Description=\"Frameshift Indels: out-of-frame (1), in-frame (0), not-applicable (-1 or missing)\">");
+ if ( ret!=0 )
+ {
+ fprintf(bcftools_stderr,"Error updating the header\n");
+ return -1;
+ }
+
+ exons = bcf_sr_regions_init(fname,1,0,1,2);
+ if ( !exons )
+ {
+ fprintf(bcftools_stderr,"Error occurred while reading (was the file compressed with bgzip?): %s\n", fname);
+ return -1;
+ }
+
+ return 0;
+}
+
+bcf1_t *process(bcf1_t *rec)
+{
+ if ( rec->n_allele<2 ) return rec; // not a variant
+
+ int type = bcf_get_variant_types(rec);
+ if ( !(type&VCF_INDEL) ) return rec; // not an indel
+
+ int i, len = 0;
+ for (i=1; i<rec->n_allele; i++)
+ if ( len > rec->d.var[i].n ) len = rec->d.var[i].n;
+
+ int pos_to = len!=0 ? rec->pos - len : rec->pos; // len is negative for deletions, so this extends the end
+ if ( bcf_sr_regions_overlap(exons, bcf_seqname(in_hdr,rec),rec->pos,pos_to) ) return rec; // no overlap
+
+ hts_expand(int32_t,rec->n_allele-1,nfrm,frm);
+ for (i=1; i<rec->n_allele; i++)
+ {
+ if ( rec->d.var[i].type!=VCF_INDEL ) { frm[i-1] = -1; continue; }
+
+ int len = rec->d.var[i].n, tlen = 0;
+ if ( len>0 )
+ {
+ // insertion
+ if ( exons->start <= rec->pos && exons->end > rec->pos ) tlen = abs(len);
+ }
+ else if ( exons->start <= rec->pos + abs(len) )
+ {
+ // deletion
+ tlen = abs(len);
+ if ( rec->pos < exons->start ) // trim the beginning
+ tlen -= exons->start - rec->pos + 1;
+ if ( exons->end < rec->pos + abs(len) ) // trim the end
+ tlen -= rec->pos + abs(len) - exons->end;
+ }
+ if ( tlen ) // there are some deleted/inserted bases in the exon
+ {
+ if ( tlen%3 ) frm[i-1] = 1; // out-of-frame
+ else frm[i-1] = 0; // in-frame
+ }
+ else frm[i-1] = -1; // not applicable (is outside)
+ }
+
+ if ( bcf_update_info_int32(out_hdr,rec,"OOF",frm,rec->n_allele-1)<0 ) { fprintf(bcftools_stderr, "Could not annotate OOF :-/\n"); exit(1); }
+ return rec;
+}
+
+
+void destroy(void)
+{
+ bcf_sr_regions_destroy(exons);
+}
+
+
--- /dev/null
+/*
+ Copyright (C) 2016 Genome Research Ltd.
+
+ Author: Petr Danecek <pd3@sanger.ac.uk>
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in
+ all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ THE SOFTWARE.
+*/
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <strings.h>
+#include <getopt.h>
+#include <stdarg.h>
+#include <stdint.h>
+#include <htslib/vcf.h>
+#include <htslib/synced_bcf_reader.h>
+#include <htslib/vcfutils.h>
+#include <inttypes.h>
+#include <unistd.h>
+#include "bcftools.h"
+#include "filter.h"
+
+// Logic of the filters: include or exclude sites which match the filters?
+#define FLT_INCLUDE 1
+#define FLT_EXCLUDE 2
+
+#define GUESS_GT 1
+#define GUESS_PL 2
+#define GUESS_GL 4
+
+typedef struct
+{
+ uint64_t ncount;
+ double phap, pdip;
+}
+count_t;
+
+typedef struct
+{
+ char *chr;
+ uint32_t start, end;
+ count_t *counts; // per-sample counts: counts[isample]
+}
+stats_t;
+
+typedef struct
+{
+ int argc;
+ char **argv, *af_tag;
+ double af_dflt;
+ stats_t stats;
+ filter_t *filter;
+ char *filter_str;
+ int filter_logic; // include or exclude sites which match the filters? One of FLT_INCLUDE/FLT_EXCLUDE
+ const uint8_t *smpl_pass;
+ int nsample, verbose, tag, include_indels;
+ int *counts, ncounts; // number of observed GTs with given ploidy, used when -g is not given
+ double *tmpf, *pl2p, gt_err_prob;
+ float *af;
+ int maf;
+ int32_t *arr, narr, nfarr;
+ float *farr;
+ bcf_srs_t *sr;
+ bcf_hdr_t *hdr;
+}
+args_t;
+
+const char *about(void)
+{
+ return "Determine sample sex by checking genotype likelihoods in haploid regions.\n";
+}
+
+static const char *usage_text(void)
+{
+ return
+ "\n"
+ "About: Determine sample sex by checking genotype likelihoods (GL,PL) or genotypes (GT)\n"
+ " in the non-PAR region of chrX. The HWE is assumed, so given the alternate allele\n"
+ " frequency fA and the genotype likelihoods pRR,pRA,pAA, the probabilities are\n"
+ " calculated as\n"
+ " P(dip) = pRR*(1-fA)^2 + pAA*fA^2 + 2*pRA*(1-fA)*fA\n"
+ " P(hap) = pRR*(1-fA) + pAA*fA\n"
+ " When genotype likelihoods are not available, the -e option is used to account\n"
+ " for genotyping errors with -t GT. The alternate allele frequency fA is estimated\n"
+ " directly from the data (the default) or can be provided by an INFO tag.\n"
+ " The results can be visualized using the accompanying guess-ploidy.py script.\n"
+ " Note that this plugin is intended to replace the former vcf2sex plugin.\n"
+ "\n"
+ "Usage: bcftools +guess-ploidy <file.vcf.gz> [Plugin Options]\n"
+ "Plugin options:\n"
+ " --AF-dflt <float> the default alternate allele frequency [0.5]\n"
+ " --AF-tag <TAG> use TAG for allele frequency\n"
+ " -e, --error-rate <float> probability of GT being wrong (with -t GT) [1e-3]\n"
+ " --exclude <expr> exclude sites for which the expression is true\n"
+ " -i, --include-indels do not skip indel sites\n"
+ " --include <expr> include only sites for which the expression is true\n"
+ " -g, --genome <str> shortcut to select nonPAR region for common genomes b37|hg19|b38|hg38\n"
+ " -r, --regions <chr:beg-end> restrict to comma-separated list of regions\n"
+ " -R, --regions-file <file> restrict to regions listed in a file\n"
+ " -t, --tag <tag> genotype or genotype likelihoods: GT, PL, GL [PL]\n"
+ " -v, --verbose verbose output (specify twice to increase verbosity)\n"
+ "\n"
+ "Region shortcuts:\n"
+ " b37 .. -r X:2699521-154931043 # GRCh37 no-chr prefix\n"
+ " b38 .. -r X:2781480-155701381 # GRCh38 no-chr prefix\n"
+ " hg19 .. -r chrX:2699521-154931043 # GRCh37 chr prefix\n"
+ " hg38 .. -r chrX:2781480-155701381 # GRCh38 chr prefix\n"
+ "\n"
+ "Examples:\n"
+ " bcftools +guess-ploidy -g b37 in.vcf.gz\n"
+ " bcftools +guess-ploidy in.vcf.gz -t GL -r chrX:2699521-154931043\n"
+ " bcftools view file.vcf.gz -r chrX:2699521-154931043 | bcftools +guess-ploidy\n"
+ " bcftools +guess-ploidy in.bcf -v > ploidy.txt && guess-ploidy.py ploidy.txt img\n"
+ "\n";
+}
+
+static inline int smpl_pass(args_t *args, int ismpl)
+{
+ if ( !args->smpl_pass ) return 1;
+ int pass = args->smpl_pass[ismpl];
+ if ( args->filter_logic & FLT_EXCLUDE ) pass = pass ? 0 : 1;
+ if ( pass ) return 1;
+ return 0;
+}
+
+void process_region_guess(args_t *args)
+{
+ while ( bcf_sr_next_line(args->sr) )
+ {
+ bcf1_t *rec = bcf_sr_get_line(args->sr,0);
+ if ( rec->n_allele==1 ) continue;
+ if ( !args->include_indels && !(bcf_get_variant_types(rec)&VCF_SNP) ) continue;
+
+ if ( args->filter )
+ {
+ int pass = filter_test(args->filter, rec, &args->smpl_pass);
+ if ( args->filter_logic & FLT_EXCLUDE ) pass = pass ? 0 : 1;
+ if ( !args->smpl_pass && !pass ) continue; // site-level filtering, not per-sample filtering
+ }
+
+ double freq[2] = {0,0}, sum;
+ int ismpl,i;
+ if ( args->tag & GUESS_GT ) // use GTs to guess the ploidy, considering only one ALT
+ {
+ int ngt = bcf_get_genotypes(args->hdr,rec,&args->arr,&args->narr);
+ if ( ngt<=0 ) continue;
+ ngt /= args->nsample;
+ for (ismpl=0; ismpl<args->nsample; ismpl++)
+ {
+ if ( !smpl_pass(args,ismpl) ) continue;
+ int32_t *ptr = args->arr + ismpl*ngt;
+ double *tmp = args->tmpf + ismpl*3;
+
+ if ( ptr[0]==bcf_gt_missing )
+ {
+ tmp[0] = -1;
+ continue;
+ }
+ if ( ptr[1]==bcf_int32_vector_end )
+ {
+ if ( bcf_gt_allele(ptr[0])==0 ) // haploid R
+ {
+ tmp[0] = 1 - 2*args->gt_err_prob;
+ tmp[1] = tmp[2] = args->gt_err_prob;
+ }
+ else // haploid A
+ {
+ tmp[0] = tmp[1] = args->gt_err_prob;
+ tmp[2] = 1 - 2*args->gt_err_prob;
+ }
+ continue;
+ }
+ if ( bcf_gt_allele(ptr[0])==0 && bcf_gt_allele(ptr[1])==0 ) // RR
+ {
+ tmp[0] = 1 - 2*args->gt_err_prob;
+ tmp[1] = tmp[2] = args->gt_err_prob;
+ }
+ else if ( bcf_gt_allele(ptr[0])==bcf_gt_allele(ptr[1]) ) // AA
+ {
+ tmp[0] = tmp[1] = args->gt_err_prob;
+ tmp[2] = 1 - 2*args->gt_err_prob;
+ }
+ else // RA or hetAA, treating as RA
+ {
+ tmp[1] = 1 - 2*args->gt_err_prob;
+ tmp[0] = tmp[2] = args->gt_err_prob;
+ }
+ freq[0] += 2*tmp[0]+tmp[1];
+ freq[1] += tmp[1]+2*tmp[2];
+ }
+ }
+ else if ( args->tag & GUESS_PL ) // use PL to guess the ploidy, restrict to first ALT allele
+ {
+ int npl = bcf_get_format_int32(args->hdr,rec,"PL",&args->arr,&args->narr);
+ if ( npl<=0 ) continue;
+ npl /= args->nsample;
+ int ndip_gt = rec->n_allele*(rec->n_allele+1)/2;
+ if ( npl==ndip_gt ) // diploid
+ {
+ for (ismpl=0; ismpl<args->nsample; ismpl++)
+ {
+ if ( !smpl_pass(args,ismpl) ) continue;
+ int32_t *ptr = args->arr + ismpl*npl;
+ double *tmp = args->tmpf + ismpl*3;
+
+ // restrict to first ALT
+ if ( ptr[0]==bcf_int32_missing || ptr[1]==bcf_int32_missing || ptr[2]==bcf_int32_missing )
+ {
+ tmp[0] = -1;
+ continue;
+ }
+ if ( ptr[0]==ptr[1] && ptr[0]==ptr[2] ) // non-informative
+ {
+ tmp[0] = -1;
+ continue;
+ }
+ if ( ptr[2]==bcf_int32_vector_end )
+ {
+ tmp[0] = (ptr[0]<0 || ptr[0]>=256) ? args->pl2p[255] : args->pl2p[ptr[0]];
+ tmp[1] = args->pl2p[255];
+ tmp[2] = (ptr[1]<0 || ptr[1]>=256) ? args->pl2p[255] : args->pl2p[ptr[1]];
+ }
+ else
+ for (i=0; i<3; i++)
+ tmp[i] = (ptr[i]<0 || ptr[i]>=256) ? args->pl2p[255] : args->pl2p[ptr[i]];
+
+ sum = 0;
+ for (i=0; i<3; i++) sum += tmp[i];
+ for (i=0; i<3; i++) tmp[i] /= sum;
+
+ if ( ptr[2]==bcf_int32_vector_end )
+ {
+ freq[0] += tmp[0];
+ freq[1] += tmp[2];
+ }
+ else
+ {
+ freq[0] += 2*tmp[0]+tmp[1];
+ freq[1] += tmp[1]+2*tmp[2];
+ }
+ }
+ }
+ else if ( npl==rec->n_allele ) // all samples haploid
+ {
+ for (ismpl=0; ismpl<args->nsample; ismpl++)
+ {
+ if ( !smpl_pass(args,ismpl) ) continue;
+ int32_t *ptr = args->arr + ismpl*npl;
+ double *tmp = args->tmpf + ismpl*3;
+
+ // restrict to first ALT
+ if ( ptr[0]==bcf_int32_missing || ptr[1]==bcf_int32_missing )
+ {
+ tmp[0] = -1;
+ continue;
+ }
+ tmp[0] = (ptr[0]<0 || ptr[0]>=256) ? args->pl2p[255] : args->pl2p[ptr[0]];
+ tmp[1] = args->pl2p[255];
+ tmp[2] = (ptr[1]<0 || ptr[1]>=256) ? args->pl2p[255] : args->pl2p[ptr[1]];
+
+ sum = 0;
+ for (i=0; i<3; i++) sum += tmp[i];
+ for (i=0; i<3; i++) tmp[i] /= sum;
+
+ freq[0] += tmp[0];
+ freq[1] += tmp[2];
+ }
+ }
+ else
+ continue; // neither diploid nor haploid
+ }
+ else // use GL
+ {
+ int ngl = bcf_get_format_float(args->hdr,rec,"GL",&args->farr,&args->nfarr);
+ if ( ngl<=0 ) continue;
+ ngl /= args->nsample;
+ int ndip_gt = rec->n_allele*(rec->n_allele+1)/2;
+ if ( ngl==ndip_gt ) // diploid
+ {
+ for (ismpl=0; ismpl<args->nsample; ismpl++)
+ {
+ if ( !smpl_pass(args,ismpl) ) continue;
+ float *ptr = args->farr + ismpl*ngl;
+ double *tmp = args->tmpf + ismpl*3;
+
+ // restrict to first ALT
+ if ( bcf_float_is_missing(ptr[0]) || bcf_float_is_missing(ptr[1]) || bcf_float_is_missing(ptr[2]) )
+ {
+ tmp[0] = -1;
+ continue;
+ }
+ if ( ptr[0]==ptr[1] && ptr[0]==ptr[2] ) // non-informative
+ {
+ tmp[0] = -1;
+ continue;
+ }
+ if ( bcf_float_is_vector_end(ptr[2]) )
+ {
+ tmp[0] = pow(10.,ptr[0]);
+ tmp[1] = 1e-26; // arbitrary small value for a het
+ tmp[2] = pow(10.,ptr[1]);
+ }
+ else
+ for (i=0; i<3; i++)
+ tmp[i] = pow(10.,ptr[i]);
+
+ sum = 0;
+ for (i=0; i<3; i++) sum += tmp[i];
+ for (i=0; i<3; i++) tmp[i] /= sum;
+
+ if ( bcf_float_is_vector_end(ptr[2]) )
+ {
+ freq[0] += tmp[0];
+ freq[1] += tmp[2];
+ }
+ else
+ {
+ freq[0] += 2*tmp[0]+tmp[1];
+ freq[1] += tmp[1]+2*tmp[2];
+ }
+ }
+ }
+ else if ( ngl==rec->n_allele ) // all samples haploid
+ {
+ for (ismpl=0; ismpl<args->nsample; ismpl++)
+ {
+ if ( !smpl_pass(args,ismpl) ) continue;
+ float *ptr = args->farr + ismpl*ngl;
+ double *tmp = args->tmpf + ismpl*3;
+
+ // restrict to first ALT
+ if ( bcf_float_is_missing(ptr[0]) || bcf_float_is_missing(ptr[1]) )
+ {
+ tmp[0] = -1;
+ continue;
+ }
+ tmp[0] = pow(10.,ptr[0]);
+ tmp[1] = 1e-26;
+ tmp[2] = pow(10.,ptr[1]);
+
+ sum = 0;
+ for (i=0; i<3; i++) sum += tmp[i];
+ for (i=0; i<3; i++) tmp[i] /= sum;
+
+ freq[0] += tmp[0];
+ freq[1] += tmp[2];
+ }
+ }
+ else
+ continue; // neither diploid nor haploid
+ }
+ if ( args->af_tag )
+ {
+ int ret = bcf_get_info_float(args->hdr,rec,args->af_tag,&args->af, &args->maf);
+ if ( ret>0 ) { freq[0] = 1 - args->af[0]; freq[1] = args->af[0]; }
+ }
+
+ if ( !freq[0] && !freq[1] ) { freq[0] = 1 - args->af_dflt; freq[1] = args->af_dflt; }
+ sum = freq[0] + freq[1];
+ freq[0] /= sum;
+ freq[1] /= sum;
+ for (ismpl=0; ismpl<args->nsample; ismpl++)
+ {
+ if ( !smpl_pass(args,ismpl) ) continue;
+ count_t *counts = &args->stats.counts[ismpl];
+ double *tmp = args->tmpf + ismpl*3;
+ if ( tmp[0] < 0 ) continue;
+ double phap = freq[0]*tmp[0] + freq[1]*tmp[2];
+ double pdip = freq[0]*freq[0]*tmp[0] + 2*freq[0]*freq[1]*tmp[1] + freq[1]*freq[1]*tmp[2];
+ counts->phap += log(phap);
+ counts->pdip += log(pdip);
+ counts->ncount++;
+ if ( args->verbose>1 )
+ printf("DBG\t%s\t%d\t%s\t%e\t%e\t%e\t%e\t%e\t%e\n", bcf_seqname(args->hdr,rec),rec->pos+1,bcf_hdr_int2id(args->hdr,BCF_DT_SAMPLE,ismpl),
+ freq[1],tmp[0],tmp[1],tmp[2],phap,pdip);
+ }
+ }
+}
+
+int run(int argc, char **argv)
+{
+ args_t *args = (args_t*) calloc(1,sizeof(args_t));
+ args->tag = GUESS_PL;
+ args->argc = argc; args->argv = argv;
+ args->gt_err_prob = 1e-3;
+ args->af_dflt = 0.5;
+ char *region = NULL;
+ int region_is_file = 0;
+ static struct option loptions[] =
+ {
+ {"AF-tag",required_argument,NULL,0},
+ {"AF-dflt",required_argument,NULL,1},
+ {"exclude",required_argument,NULL,2},
+ {"include",required_argument,NULL,3},
+ {"verbose",no_argument,NULL,'v'},
+ {"include-indels",no_argument,NULL,'i'},
+ {"error-rate",required_argument,NULL,'e'},
+ {"tag",required_argument,NULL,'t'},
+ {"genome",required_argument,NULL,'g'},
+ {"regions",required_argument,NULL,'r'},
+ {"regions-file",required_argument,NULL,'R'},
+ {"background",required_argument,NULL,'b'},
+ {NULL,0,NULL,0}
+ };
+ int c;
+ char *tmp;
+ while ((c = getopt_long(argc, argv, "vr:R:t:e:ig:",loptions,NULL)) >= 0)
+ {
+ switch (c)
+ {
+ case 0: args->af_tag = optarg; break;
+ case 1:
+ args->af_dflt = strtod(optarg,&tmp);
+ if ( *tmp ) error("Could not parse: --AF-dflt %s\n", optarg);
+ break;
+ case 2: args->filter_str = optarg; args->filter_logic |= FLT_EXCLUDE; break;
+ case 3: args->filter_str = optarg; args->filter_logic |= FLT_INCLUDE; break;
+ case 'i': args->include_indels = 1; break;
+ case 'e':
+ args->gt_err_prob = strtod(optarg,&tmp);
+ if ( *tmp ) error("Could not parse: -e %s\n", optarg);
+ if ( args->gt_err_prob<0 || args->gt_err_prob>1 ) error("Expected value from the interval [0,1]: -e %s\n", optarg);
+ break;
+ case 'g':
+ if ( !strcasecmp(optarg,"b37") ) region = "X:2699521-154931043";
+ else if ( !strcasecmp(optarg,"b38") ) region = "X:2781480-155701381";
+ else if ( !strcasecmp(optarg,"hg19") ) region = "chrX:2699521-154931043";
+ else if ( !strcasecmp(optarg,"hg38") ) region = "chrX:2781480-155701381";
+ else error("The argument not recognised, expected --genome b37, b38, hg19 or hg38: %s\n", optarg);
+ break;
+ case 'R': region_is_file = 1; // intentional fall-through: -R also sets the region below
+ case 'r': region = optarg; break;
+ case 'v': args->verbose++; break;
+ case 't':
+ if ( !strcasecmp(optarg,"GT") ) args->tag = GUESS_GT;
+ else if ( !strcasecmp(optarg,"PL") ) args->tag = GUESS_PL;
+ else if ( !strcasecmp(optarg,"GL") ) args->tag = GUESS_GL;
+ else error("The argument not recognised, expected --tag GT, PL or GL: %s\n", optarg);
+ break;
+ case 'h':
+ case '?':
+ default: error("%s", usage_text()); break;
+ }
+ }
+ if ( args->filter_logic == (FLT_EXCLUDE|FLT_INCLUDE) ) error("Only one of --include or --exclude can be given.\n");
+
+ char *fname = NULL;
+ if ( optind==argc )
+ {
+ if ( !isatty(fileno((FILE *)stdin)) ) fname = "-"; // reading from stdin
+ else { error("%s",usage_text()); }
+ }
+ else if ( optind+1!=argc ) error("%s",usage_text());
+ else fname = argv[optind];
+
+ args->sr = bcf_sr_init();
+ if ( strcmp("-",fname) )
+ {
+ if ( region )
+ {
+ args->sr->require_index = 1;
+ if ( bcf_sr_set_regions(args->sr, region, region_is_file)<0 )
+ error("Failed to read the regions: %s\n",region);
+ }
+ }
+ else
+ {
+ if ( region )
+ {
+ if ( bcf_sr_set_targets(args->sr, region, region_is_file, 0)<0 )
+ error("Failed to read the targets: %s\n",region);
+ }
+ }
+ if ( !bcf_sr_add_reader(args->sr,fname) ) error("Error: %s\n", bcf_sr_strerror(args->sr->errnum));
+ args->hdr = args->sr->readers[0].header;
+ args->nsample = bcf_hdr_nsamples(args->hdr);
+ args->stats.counts = (count_t*) calloc(args->nsample,sizeof(count_t));
+
+ if ( args->filter_str )
+ args->filter = filter_init(args->hdr, args->filter_str);
+
+ if ( args->af_tag && !bcf_hdr_idinfo_exists(args->hdr,BCF_HL_INFO,bcf_hdr_id2int(args->hdr,BCF_DT_ID,args->af_tag)) )
+ error("No such INFO tag: %s\n", args->af_tag);
+
+ if ( args->tag&GUESS_PL && bcf_hdr_id2int(args->hdr, BCF_DT_ID, "PL")<0 )
+ {
+ fprintf(stderr, "Warning: PL tag not found in header, switching to GL\n");
+ args->tag = GUESS_GL;
+ }
+
+ if ( args->tag&GUESS_GL && bcf_hdr_id2int(args->hdr, BCF_DT_ID, "GL")<0 )
+ {
+ fprintf(stderr, "Warning: GL tag not found in header, switching to GT\n");
+ args->tag = GUESS_GT;
+ }
+
+ if ( args->tag&GUESS_GT && bcf_hdr_id2int(args->hdr, BCF_DT_ID, "GT")<0 )
+ error("Error: GT tag not found in header\n");
+
+ int i;
+ if ( args->tag&GUESS_PL )
+ {
+ args->pl2p = (double*) calloc(256,sizeof(double));
+ for (i=0; i<256; i++) args->pl2p[i] = pow(10., -i/10.);
+ }
+ if ( args->tag&GUESS_PL || args->tag&GUESS_GL || args->tag&GUESS_GT )
+ args->tmpf = (double*) malloc(sizeof(*args->tmpf)*3*args->nsample);
+
+ if ( args->verbose )
+ {
+ printf("# This file was produced by: bcftools +guess-ploidy(%s+htslib-%s)\n", bcftools_version(),hts_version());
+ printf("# The command line was:\tbcftools +%s", args->argv[0]);
+ for (i=1; i<args->argc; i++)
+ printf(" %s",args->argv[i]);
+ printf("\n");
+ printf("# [1]SEX\t[2]Sample\t[3]Predicted sex\t[4]log P(Haploid)/nSites\t[5]log P(Diploid)/nSites\t[6]nSites\t[7]Score: F < 0 < M ($4-$5)\n");
+ if ( args->verbose>1 )
+ printf("# [1]DBG\t[2]Chr\t[3]Pos\t[4]Sample\t[5]AF\t[6]pRR\t[7]pRA\t[8]pAA\t[9]P(Haploid)\t[10]P(Diploid)\n");
+ }
+
+ process_region_guess(args);
+
+ for (i=0; i<args->nsample; i++)
+ {
+ double phap = args->stats.counts[i].ncount ? args->stats.counts[i].phap / args->stats.counts[i].ncount : 0.5;
+ double pdip = args->stats.counts[i].ncount ? args->stats.counts[i].pdip / args->stats.counts[i].ncount : 0.5;
+ char predicted_sex = 'U';
+ if (phap>pdip) predicted_sex = 'M';
+ else if (phap<pdip) predicted_sex = 'F';
+ if ( args->verbose )
+ {
+ printf("SEX\t%s\t%c\t%f\t%f\t%"PRId64"\t%f\n", args->hdr->samples[i],predicted_sex,
+ phap,pdip,args->stats.counts[i].ncount,phap-pdip);
+ }
+ else
+ printf("%s\t%c\n", args->hdr->samples[i],predicted_sex);
+ }
+
+ if ( args->filter )
+ filter_destroy(args->filter);
+
+ bcf_sr_destroy(args->sr);
+ free(args->pl2p);
+ free(args->tmpf);
+ free(args->counts);
+ free(args->stats.counts);
+ free(args->arr);
+ free(args->farr);
+ free(args->af);
+ free(args);
+ return 0;
+}
--- /dev/null
+#include "bcftools.pysam.h"
+
+/*
+ Copyright (C) 2016 Genome Research Ltd.
+
+ Author: Petr Danecek <pd3@sanger.ac.uk>
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in
+ all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ THE SOFTWARE.
+*/
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <strings.h>
+#include <getopt.h>
+#include <stdarg.h>
+#include <stdint.h>
+#include <htslib/vcf.h>
+#include <htslib/synced_bcf_reader.h>
+#include <htslib/vcfutils.h>
+#include <inttypes.h>
+#include <unistd.h>
+#include "bcftools.h"
+#include "filter.h"
+
+// Logic of the filters: include or exclude sites which match the filters?
+#define FLT_INCLUDE 1
+#define FLT_EXCLUDE 2
+
+#define GUESS_GT 1
+#define GUESS_PL 2
+#define GUESS_GL 4
+
+typedef struct
+{
+ uint64_t ncount;
+ double phap, pdip;
+}
+count_t;
+
+typedef struct
+{
+ char *chr;
+ uint32_t start, end;
+ count_t *counts; // per-sample counts: counts[isample]
+}
+stats_t;
+
+typedef struct
+{
+ int argc;
+ char **argv, *af_tag;
+ double af_dflt;
+ stats_t stats;
+ filter_t *filter;
+ char *filter_str;
+ int filter_logic; // include or exclude sites which match the filters? One of FLT_INCLUDE/FLT_EXCLUDE
+ const uint8_t *smpl_pass;
+ int nsample, verbose, tag, include_indels;
+ int *counts, ncounts; // number of observed GTs with given ploidy, used when -g is not given
+ double *tmpf, *pl2p, gt_err_prob;
+ float *af;
+ int maf;
+ int32_t *arr, narr, nfarr;
+ float *farr;
+ bcf_srs_t *sr;
+ bcf_hdr_t *hdr;
+}
+args_t;
+
+const char *about(void)
+{
+ return "Determine sample sex by checking genotype likelihoods in haploid regions.\n";
+}
+
+static const char *usage_text(void)
+{
+ return
+ "\n"
+ "About: Determine sample sex by checking genotype likelihoods (GL,PL) or genotypes (GT)\n"
+ " in the non-PAR region of chrX. The HWE is assumed, so given the alternate allele\n"
+ " frequency fA and the genotype likelihoods pRR,pRA,pAA, the probabilities are\n"
+ " calculated as\n"
+ " P(dip) = pRR*(1-fA)^2 + pAA*fA^2 + 2*pRA*(1-fA)*fA\n"
+ " P(hap) = pRR*(1-fA) + pAA*fA\n"
+ " When genotype likelihoods are not available, the -e option is used to account\n"
+ " for genotyping errors with -t GT. The alternate allele frequency fA is estimated\n"
+ " directly from the data (the default) or can be provided by an INFO tag.\n"
+ " The results can be visualized using the accompanying guess-ploidy.py script.\n"
+ " Note that this plugin is intended to replace the former vcf2sex plugin.\n"
+ "\n"
+ "Usage: bcftools +guess-ploidy <file.vcf.gz> [Plugin Options]\n"
+ "Plugin options:\n"
+ " --AF-dflt <float> the default alternate allele frequency [0.5]\n"
+ " --AF-tag <TAG> use TAG for allele frequency\n"
+ " -e, --error-rate <float> probability of GT being wrong (with -t GT) [1e-3]\n"
+ " --exclude <expr> exclude sites for which the expression is true\n"
+ " -i, --include-indels do not skip indel sites\n"
+ " --include <expr> include only sites for which the expression is true\n"
+ " -g, --genome <str> shortcut to select nonPAR region for common genomes b37|hg19|b38|hg38\n"
+ " -r, --regions <chr:beg-end> restrict to comma-separated list of regions\n"
+ " -R, --regions-file <file> restrict to regions listed in a file\n"
+ " -t, --tag <tag> genotype or genotype likelihoods: GT, PL, GL [PL]\n"
+ " -v, --verbose verbose output (specify twice to increase verbosity)\n"
+ "\n"
+ "Region shortcuts:\n"
+ " b37 .. -r X:2699521-154931043 # GRCh37 no-chr prefix\n"
+ " b38 .. -r X:2781480-155701381 # GRCh38 no-chr prefix\n"
+ " hg19 .. -r chrX:2699521-154931043 # GRCh37 chr prefix\n"
+ " hg38 .. -r chrX:2781480-155701381 # GRCh38 chr prefix\n"
+ "\n"
+ "Examples:\n"
+ " bcftools +guess-ploidy -g b37 in.vcf.gz\n"
+ " bcftools +guess-ploidy in.vcf.gz -t GL -r chrX:2699521-154931043\n"
+ " bcftools view file.vcf.gz -r chrX:2699521-154931043 | bcftools +guess-ploidy\n"
+ " bcftools +guess-ploidy in.bcf -v > ploidy.txt && guess-ploidy.py ploidy.txt img\n"
+ "\n";
+}
+
+static inline int smpl_pass(args_t *args, int ismpl)
+{
+ if ( !args->smpl_pass ) return 1;
+ int pass = args->smpl_pass[ismpl];
+ if ( args->filter_logic & FLT_EXCLUDE ) pass = pass ? 0 : 1;
+ if ( pass ) return 1;
+ return 0;
+}
+
+void process_region_guess(args_t *args)
+{
+ while ( bcf_sr_next_line(args->sr) )
+ {
+ bcf1_t *rec = bcf_sr_get_line(args->sr,0);
+ if ( rec->n_allele==1 ) continue;
+ if ( !args->include_indels && !(bcf_get_variant_types(rec)&VCF_SNP) ) continue;
+
+ if ( args->filter )
+ {
+ int pass = filter_test(args->filter, rec, &args->smpl_pass);
+ if ( args->filter_logic & FLT_EXCLUDE ) pass = pass ? 0 : 1;
+ if ( !args->smpl_pass && !pass ) continue; // site-level filtering, not per-sample filtering
+ }
+
+ double freq[2] = {0,0}, sum;
+ int ismpl,i;
+ if ( args->tag & GUESS_GT ) // use GTs to guess the ploidy, considering only one ALT
+ {
+ int ngt = bcf_get_genotypes(args->hdr,rec,&args->arr,&args->narr);
+ if ( ngt<=0 ) continue;
+ ngt /= args->nsample;
+ for (ismpl=0; ismpl<args->nsample; ismpl++)
+ {
+ if ( !smpl_pass(args,ismpl) ) continue;
+ int32_t *ptr = args->arr + ismpl*ngt;
+ double *tmp = args->tmpf + ismpl*3;
+
+ if ( ptr[0]==bcf_gt_missing )
+ {
+ tmp[0] = -1;
+ continue;
+ }
+ if ( ptr[1]==bcf_int32_vector_end )
+ {
+ if ( bcf_gt_allele(ptr[0])==0 ) // haploid R
+ {
+ tmp[0] = 1 - 2*args->gt_err_prob;
+ tmp[1] = tmp[2] = args->gt_err_prob;
+ }
+ else // haploid A
+ {
+ tmp[0] = tmp[1] = args->gt_err_prob;
+ tmp[2] = 1 - 2*args->gt_err_prob;
+ }
+ continue;
+ }
+ if ( bcf_gt_allele(ptr[0])==0 && bcf_gt_allele(ptr[1])==0 ) // RR
+ {
+ tmp[0] = 1 - 2*args->gt_err_prob;
+ tmp[1] = tmp[2] = args->gt_err_prob;
+ }
+ else if ( bcf_gt_allele(ptr[0])==bcf_gt_allele(ptr[1]) ) // AA
+ {
+ tmp[0] = tmp[1] = args->gt_err_prob;
+ tmp[2] = 1 - 2*args->gt_err_prob;
+ }
+ else // RA or hetAA, treating as RA
+ {
+ tmp[1] = 1 - 2*args->gt_err_prob;
+ tmp[0] = tmp[2] = args->gt_err_prob;
+ }
+ freq[0] += 2*tmp[0]+tmp[1];
+ freq[1] += tmp[1]+2*tmp[2];
+ }
+ }
+ else if ( args->tag & GUESS_PL ) // use PL to guess the ploidy, restrict to first ALT allele
+ {
+ int npl = bcf_get_format_int32(args->hdr,rec,"PL",&args->arr,&args->narr);
+ if ( npl<=0 ) continue;
+ npl /= args->nsample;
+ int ndip_gt = rec->n_allele*(rec->n_allele+1)/2;
+ if ( npl==ndip_gt ) // diploid
+ {
+ for (ismpl=0; ismpl<args->nsample; ismpl++)
+ {
+ if ( !smpl_pass(args,ismpl) ) continue;
+ int32_t *ptr = args->arr + ismpl*npl;
+ double *tmp = args->tmpf + ismpl*3;
+
+ // restrict to first ALT
+ if ( ptr[0]==bcf_int32_missing || ptr[1]==bcf_int32_missing || ptr[2]==bcf_int32_missing )
+ {
+ tmp[0] = -1;
+ continue;
+ }
+ if ( ptr[0]==ptr[1] && ptr[0]==ptr[2] ) // non-informative
+ {
+ tmp[0] = -1;
+ continue;
+ }
+ if ( ptr[2]==bcf_int32_vector_end )
+ {
+ tmp[0] = (ptr[0]<0 || ptr[0]>=256) ? args->pl2p[255] : args->pl2p[ptr[0]];
+ tmp[1] = args->pl2p[255];
+ tmp[2] = (ptr[1]<0 || ptr[1]>=256) ? args->pl2p[255] : args->pl2p[ptr[1]];
+ }
+ else
+ for (i=0; i<3; i++)
+ tmp[i] = (ptr[i]<0 || ptr[i]>=256) ? args->pl2p[255] : args->pl2p[ptr[i]];
+
+ sum = 0;
+ for (i=0; i<3; i++) sum += tmp[i];
+ for (i=0; i<3; i++) tmp[i] /= sum;
+
+ if ( ptr[2]==bcf_int32_vector_end )
+ {
+ freq[0] += tmp[0];
+ freq[1] += tmp[2];
+ }
+ else
+ {
+ freq[0] += 2*tmp[0]+tmp[1];
+ freq[1] += tmp[1]+2*tmp[2];
+ }
+ }
+ }
+ else if ( npl==rec->n_allele ) // all samples haploid
+ {
+ for (ismpl=0; ismpl<args->nsample; ismpl++)
+ {
+ if ( !smpl_pass(args,ismpl) ) continue;
+ int32_t *ptr = args->arr + ismpl*npl;
+ double *tmp = args->tmpf + ismpl*3;
+
+ // restrict to first ALT
+ if ( ptr[0]==bcf_int32_missing || ptr[1]==bcf_int32_missing )
+ {
+ tmp[0] = -1;
+ continue;
+ }
+ tmp[0] = (ptr[0]<0 || ptr[0]>=256) ? args->pl2p[255] : args->pl2p[ptr[0]];
+ tmp[1] = args->pl2p[255];
+ tmp[2] = (ptr[1]<0 || ptr[1]>=256) ? args->pl2p[255] : args->pl2p[ptr[1]];
+
+ sum = 0;
+ for (i=0; i<3; i++) sum += tmp[i];
+ for (i=0; i<3; i++) tmp[i] /= sum;
+
+ freq[0] += tmp[0];
+ freq[1] += tmp[2];
+ }
+ }
+ else
+ continue; // neither diploid nor haploid
+ }
+ else // use GL
+ {
+ int ngl = bcf_get_format_float(args->hdr,rec,"GL",&args->farr,&args->nfarr);
+ if ( ngl<=0 ) continue;
+ ngl /= args->nsample;
+ int ndip_gt = rec->n_allele*(rec->n_allele+1)/2;
+ if ( ngl==ndip_gt ) // diploid
+ {
+ for (ismpl=0; ismpl<args->nsample; ismpl++)
+ {
+ if ( !smpl_pass(args,ismpl) ) continue;
+ float *ptr = args->farr + ismpl*ngl;
+ double *tmp = args->tmpf + ismpl*3;
+
+ // restrict to first ALT
+ if ( bcf_float_is_missing(ptr[0]) || bcf_float_is_missing(ptr[1]) || bcf_float_is_missing(ptr[2]) )
+ {
+ tmp[0] = -1;
+ continue;
+ }
+ if ( ptr[0]==ptr[1] && ptr[0]==ptr[2] ) // non-informative
+ {
+ tmp[0] = -1;
+ continue;
+ }
+ if ( bcf_float_is_vector_end(ptr[2]) )
+ {
+ tmp[0] = pow(10.,ptr[0]);
+ tmp[1] = 1e-26; // arbitrary small value for a het
+ tmp[2] = pow(10.,ptr[1]);
+ }
+ else
+ for (i=0; i<3; i++)
+ tmp[i] = pow(10.,ptr[i]);
+
+ sum = 0;
+ for (i=0; i<3; i++) sum += tmp[i];
+ for (i=0; i<3; i++) tmp[i] /= sum;
+
+ if ( bcf_float_is_vector_end(ptr[2]) )
+ {
+ freq[0] += tmp[0];
+ freq[1] += tmp[2];
+ }
+ else
+ {
+ freq[0] += 2*tmp[0]+tmp[1];
+ freq[1] += tmp[1]+2*tmp[2];
+ }
+ }
+ }
+ else if ( ngl==rec->n_allele ) // all samples haploid
+ {
+ for (ismpl=0; ismpl<args->nsample; ismpl++)
+ {
+ if ( !smpl_pass(args,ismpl) ) continue;
+ float *ptr = args->farr + ismpl*ngl;
+ double *tmp = args->tmpf + ismpl*3;
+
+ // restrict to first ALT
+ if ( bcf_float_is_missing(ptr[0]) || bcf_float_is_missing(ptr[1]) )
+ {
+ tmp[0] = -1;
+ continue;
+ }
+ tmp[0] = pow(10.,ptr[0]);
+ tmp[1] = 1e-26;
+ tmp[2] = pow(10.,ptr[1]);
+
+ sum = 0;
+ for (i=0; i<3; i++) sum += tmp[i];
+ for (i=0; i<3; i++) tmp[i] /= sum;
+
+ freq[0] += tmp[0];
+ freq[1] += tmp[2];
+ }
+ }
+ else
+ continue; // neither diploid nor haploid
+ }
+ if ( args->af_tag )
+ {
+ int ret = bcf_get_info_float(args->hdr,rec,args->af_tag,&args->af, &args->maf);
+ if ( ret>0 ) { freq[0] = 1 - args->af[0]; freq[1] = args->af[0]; }
+ }
+
+ if ( !freq[0] && !freq[1] ) { freq[0] = 1 - args->af_dflt; freq[1] = args->af_dflt; }
+ sum = freq[0] + freq[1];
+ freq[0] /= sum;
+ freq[1] /= sum;
+ for (ismpl=0; ismpl<args->nsample; ismpl++)
+ {
+ if ( !smpl_pass(args,ismpl) ) continue;
+ count_t *counts = &args->stats.counts[ismpl];
+ double *tmp = args->tmpf + ismpl*3;
+ if ( tmp[0] < 0 ) continue;
+ double phap = freq[0]*tmp[0] + freq[1]*tmp[2];
+ double pdip = freq[0]*freq[0]*tmp[0] + 2*freq[0]*freq[1]*tmp[1] + freq[1]*freq[1]*tmp[2];
+ counts->phap += log(phap);
+ counts->pdip += log(pdip);
+ counts->ncount++;
+ if ( args->verbose>1 )
+ fprintf(bcftools_stdout, "DBG\t%s\t%d\t%s\t%e\t%e\t%e\t%e\t%e\t%e\n", bcf_seqname(args->hdr,rec),rec->pos+1,bcf_hdr_int2id(args->hdr,BCF_DT_SAMPLE,ismpl),
+ freq[1],tmp[0],tmp[1],tmp[2],phap,pdip);
+ }
+ }
+}
+
+int run(int argc, char **argv)
+{
+ args_t *args = (args_t*) calloc(1,sizeof(args_t));
+ args->tag = GUESS_PL;
+ args->argc = argc; args->argv = argv;
+ args->gt_err_prob = 1e-3;
+ args->af_dflt = 0.5;
+ char *region = NULL;
+ int region_is_file = 0;
+ static struct option loptions[] =
+ {
+ {"AF-tag",required_argument,NULL,0},
+ {"AF-dflt",required_argument,NULL,1},
+ {"exclude",required_argument,NULL,2},
+ {"include",required_argument,NULL,3},
+ {"verbose",no_argument,NULL,'v'},
+ {"include-indels",no_argument,NULL,'i'},
+ {"error-rate",required_argument,NULL,'e'},
+ {"tag",required_argument,NULL,'t'},
+ {"genome",required_argument,NULL,'g'},
+ {"regions",required_argument,NULL,'r'},
+ {"regions-file",required_argument,NULL,'R'},
+ {"background",required_argument,NULL,'b'},
+ {NULL,0,NULL,0}
+ };
+ int c;
+ char *tmp;
+ while ((c = getopt_long(argc, argv, "vr:R:t:e:ig:",loptions,NULL)) >= 0)
+ {
+ switch (c)
+ {
+ case 0: args->af_tag = optarg; break;
+ case 1:
+ args->af_dflt = strtod(optarg,&tmp);
+ if ( *tmp ) error("Could not parse: --AF-dflt %s\n", optarg);
+ break;
+ case 2: args->filter_str = optarg; args->filter_logic |= FLT_EXCLUDE; break;
+ case 3: args->filter_str = optarg; args->filter_logic |= FLT_INCLUDE; break;
+ case 'i': args->include_indels = 1; break;
+ case 'e':
+ args->gt_err_prob = strtod(optarg,&tmp);
+ if ( *tmp ) error("Could not parse: -e %s\n", optarg);
+ if ( args->gt_err_prob<0 || args->gt_err_prob>1 ) error("Expected value from the interval [0,1]: -e %s\n", optarg);
+ break;
+ case 'g':
+ if ( !strcasecmp(optarg,"b37") ) region = "X:2699521-154931043";
+ else if ( !strcasecmp(optarg,"b38") ) region = "X:2781480-155701381";
+ else if ( !strcasecmp(optarg,"hg19") ) region = "chrX:2699521-154931043";
+ else if ( !strcasecmp(optarg,"hg38") ) region = "chrX:2781480-155701381";
+ else error("Argument not recognised, expected --genome b37, b38, hg19 or hg38: %s\n", optarg);
+ break;
+ case 'R': region_is_file = 1; // fall through: -R also sets the region
+ case 'r': region = optarg; break;
+ case 'v': args->verbose++; break;
+ case 't':
+ if ( !strcasecmp(optarg,"GT") ) args->tag = GUESS_GT;
+ else if ( !strcasecmp(optarg,"PL") ) args->tag = GUESS_PL;
+ else if ( !strcasecmp(optarg,"GL") ) args->tag = GUESS_GL;
+ else error("Argument not recognised, expected --tag GT, PL or GL: %s\n", optarg);
+ break;
+ case 'h':
+ case '?':
+ default: error("%s", usage_text()); break;
+ }
+ }
+ if ( args->filter_logic == (FLT_EXCLUDE|FLT_INCLUDE) ) error("Only one of --include or --exclude can be given.\n");
+
+ char *fname = NULL;
+ if ( optind==argc )
+ {
+ if ( !isatty(fileno((FILE *)stdin)) ) fname = "-"; // reading from stdin
+ else { error("%s",usage_text()); }
+ }
+ else if ( optind+1!=argc ) error("%s",usage_text());
+ else fname = argv[optind];
+
+ args->sr = bcf_sr_init();
+ if ( strcmp("-",fname) )
+ {
+ if ( region )
+ {
+ args->sr->require_index = 1;
+ if ( bcf_sr_set_regions(args->sr, region, region_is_file)<0 )
+ error("Failed to read the regions: %s\n",region);
+ }
+ }
+ else
+ {
+ if ( region )
+ {
+ if ( bcf_sr_set_targets(args->sr, region, region_is_file, 0)<0 )
+ error("Failed to read the targets: %s\n",region);
+ }
+ }
+ if ( !bcf_sr_add_reader(args->sr,fname) ) error("Error: %s\n", bcf_sr_strerror(args->sr->errnum));
+ args->hdr = args->sr->readers[0].header;
+ args->nsample = bcf_hdr_nsamples(args->hdr);
+ args->stats.counts = (count_t*) calloc(args->nsample,sizeof(count_t));
+
+ if ( args->filter_str )
+ args->filter = filter_init(args->hdr, args->filter_str);
+
+ if ( args->af_tag && !bcf_hdr_idinfo_exists(args->hdr,BCF_HL_INFO,bcf_hdr_id2int(args->hdr,BCF_DT_ID,args->af_tag)) )
+ error("No such INFO tag: %s\n", args->af_tag);
+
+ if ( args->tag&GUESS_PL && bcf_hdr_id2int(args->hdr, BCF_DT_ID, "PL")<0 )
+ {
+ fprintf(bcftools_stderr, "Warning: PL tag not found in header, switching to GL\n");
+ args->tag = GUESS_GL;
+ }
+
+ if ( args->tag&GUESS_GL && bcf_hdr_id2int(args->hdr, BCF_DT_ID, "GL")<0 )
+ {
+ fprintf(bcftools_stderr, "Warning: GL tag not found in header, switching to GT\n");
+ args->tag = GUESS_GT;
+ }
+
+ if ( args->tag&GUESS_GT && bcf_hdr_id2int(args->hdr, BCF_DT_ID, "GT")<0 )
+ error("Error: GT tag not found in header\n");
+
+ int i;
+ if ( args->tag&GUESS_PL )
+ {
+ args->pl2p = (double*) calloc(256,sizeof(double));
+ for (i=0; i<256; i++) args->pl2p[i] = pow(10., -i/10.);
+ }
+ if ( args->tag&GUESS_PL || args->tag&GUESS_GL || args->tag&GUESS_GT )
+ args->tmpf = (double*) malloc(sizeof(*args->tmpf)*3*args->nsample);
+
+ if ( args->verbose )
+ {
+ fprintf(bcftools_stdout, "# This file was produced by: bcftools +guess-ploidy(%s+htslib-%s)\n", bcftools_version(),hts_version());
+ fprintf(bcftools_stdout, "# The command line was:\tbcftools +%s", args->argv[0]);
+ for (i=1; i<args->argc; i++)
+ fprintf(bcftools_stdout, " %s",args->argv[i]);
+ fprintf(bcftools_stdout, "\n");
+ fprintf(bcftools_stdout, "# [1]SEX\t[2]Sample\t[3]Predicted sex\t[4]log P(Haploid)/nSites\t[5]log P(Diploid)/nSites\t[6]nSites\t[7]Score: F < 0 < M ($4-$5)\n");
+ if ( args->verbose>1 )
+ fprintf(bcftools_stdout, "# [1]DBG\t[2]Chr\t[3]Pos\t[4]Sample\t[5]AF\t[6]pRR\t[7]pRA\t[8]pAA\t[9]P(Haploid)\t[10]P(Diploid)\n");
+ }
+
+ process_region_guess(args);
+
+ for (i=0; i<args->nsample; i++)
+ {
+ double phap = args->stats.counts[i].ncount ? args->stats.counts[i].phap / args->stats.counts[i].ncount : 0.5;
+ double pdip = args->stats.counts[i].ncount ? args->stats.counts[i].pdip / args->stats.counts[i].ncount : 0.5;
+ char predicted_sex = 'U';
+ if (phap>pdip) predicted_sex = 'M';
+ else if (phap<pdip) predicted_sex = 'F';
+ if ( args->verbose )
+ {
+ fprintf(bcftools_stdout, "SEX\t%s\t%c\t%f\t%f\t%"PRId64"\t%f\n", args->hdr->samples[i],predicted_sex,
+ phap,pdip,args->stats.counts[i].ncount,phap-pdip);
+ }
+ else
+ fprintf(bcftools_stdout, "%s\t%c\n", args->hdr->samples[i],predicted_sex);
+ }
+
+ if ( args->filter )
+ filter_destroy(args->filter);
+
+ bcf_sr_destroy(args->sr);
+ free(args->pl2p);
+ free(args->tmpf);
+ free(args->counts);
+ free(args->stats.counts);
+ free(args->arr);
+ free(args->farr);
+ free(args->af);
+ free(args);
+ return 0;
+}
--- /dev/null
+/* plugins/impute-info.c -- adds info metrics to a VCF file.
+
+ Copyright (C) 2015-2016 Genome Research Ltd.
+
+ Author: Shane McCarthy <sm15@sanger.ac.uk>
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE. */
+
+
+/*
+
+Marchini & Howie, Nature Reviews Genetics, 11 (July 2010)
+
+Let G_ij in {0,1,2} be the genotype of the ith individual at the
+jth SNP in a study cohort of N samples. Let
+
+ p_ijk = P(G_ij = k | H,G)
+
+be the probability that the genotype at the jth SNP of the ith
+individual is k.
+
+Let the expected allele dosage for the genotype at the jth SNP
+of the ith individual be
+
+ e_ij = p_ij1 + 2 * p_ij2
+
+and define
+
+ f_ij = p_ij1 + 4 * p_ij2
+
+Let theta_j denote the (unknown) population allele frequency of the jth SNP
+with:
+
+ theta_j = SUM[i=1..N] e_ij / (2 * N)
+
+The IMPUTE2 information measure is then:
+
+ if theta_j in (0,1):
+ I(theta_j) = 1 - SUM[i=1..N](f_ij - e_ij^2) / (2 * N * theta_j * (1 - theta_j))
+ else:
+ I(theta_j) = 1
+
+*/
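+
+/*
+
+Worked example of the measure above. The numbers are illustrative only,
+they are not taken from the paper or from real data. Take N = 2 samples
+at one biallelic site with genotype probabilities (p_0, p_1, p_2):
+
+ sample 1: GP = (0.9, 0.1, 0.0)   e = 0.1   f = 0.1
+ sample 2: GP = (0.0, 0.2, 0.8)   e = 1.8   f = 3.4
+
+ theta        = (0.1 + 1.8) / (2 * 2)               = 0.475
+ SUM(f - e^2) = (0.1 - 0.01) + (3.4 - 3.24)         = 0.25
+ I(theta)     = 1 - 0.25 / (2 * 2 * 0.475 * 0.525)  ~ 0.749
+
+process() below accumulates the same quantities as esum (1.9),
+fsum (3.5) and e2sum (3.25), so fsum - e2sum = 0.25.
+
+*/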
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <htslib/vcf.h>
+#include <math.h>
+#include <getopt.h>
+
+const char *about(void)
+{
+ return "Add imputation information metrics to the INFO field based on selected FORMAT tags.\n";
+}
+
+const char *usage(void)
+{
+ return
+ "\n"
+ "About: Add imputation information metrics to the INFO field based\n"
+ " on selected FORMAT tags. Only the IMPUTE2 INFO metric from\n"
+ " FORMAT/GP tags is currently available.\n"
+ "Usage: bcftools +impute-info [General Options] -- [Plugin Options]\n"
+ "Options:\n"
+ " run \"bcftools plugin\" for a list of common options\n"
+ "\n"
+ // "Plugin options:\n"
+ // " -i, --info <list> information metrics to add [INFO]\n" // [BEAGLE_R2,MACH_R2]
+ // " -t, --tags <tag> VCF tags to determine the information from [GP]\n"
+ // "\n"
+ "Example:\n"
+ " bcftools +impute-info in.vcf\n"
+ "\n";
+}
+
+bcf_hdr_t *in_hdr = NULL, *out_hdr = NULL;
+int gp_type = BCF_HT_REAL;
+uint8_t *buf = NULL;
+int nbuf = 0; // NB: number of elements, not bytes
+int nrec = 0, nskip_gp = 0, nskip_dip = 0;
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+ in_hdr = in;
+ out_hdr = out;
+ bcf_hdr_append(out_hdr,"##INFO=<ID=INFO,Number=1,Type=Float,Description=\"IMPUTE2 info score\">");
+ return 0;
+}
+
+bcf1_t *process(bcf1_t *rec)
+{
+ int nval = 0, i, j, nret = bcf_get_format_values(in_hdr,rec,"GP",(void**)&buf,&nbuf,gp_type);
+ if ( nret<0 )
+ {
+ if (!nskip_gp) fprintf(stderr, "[impute-info.c] Warning: info tag not added to sites without GP tag\n");
+ nskip_gp++;
+ return rec; // require FORMAT/GP tag, return site unchanged
+ }
+
+ nret /= rec->n_sample;
+ if ( nret != 3 )
+ {
+ if (!nskip_dip) fprintf(stderr, "[impute-info.c] Warning: info tag not added to sites that are not biallelic diploid\n");
+ nskip_dip++;
+ return rec; // require biallelic diploid, return site unchanged
+ }
+
+ double esum = 0, e2sum = 0, fsum = 0;
+ #define BRANCH(type_t,is_missing,is_vector_end) \
+ { \
+ type_t *ptr = (type_t*) buf; \
+ for (i=0; i<rec->n_sample; i++) \
+ { \
+ double vals[3] = {0,0,0}; \
+ for (j=0; j<nret; j++) \
+ { \
+ if ( is_missing || is_vector_end ) break; \
+ vals[j] = ptr[j]; \
+ } \
+ double norm = vals[0]+vals[1]+vals[2]; \
+ if ( norm ) for (j=0; j<3; j++) vals[j] /= norm; \
+ esum += vals[1] + 2*vals[2]; \
+ e2sum += (vals[1] + 2*vals[2]) * (vals[1] + 2*vals[2]); \
+ fsum += vals[1] + 4*vals[2]; \
+ ptr += nret; \
+ nval++; \
+ } \
+ }
+ switch (gp_type)
+ {
+ case BCF_HT_INT: BRANCH(int32_t,ptr[j]==bcf_int32_missing,ptr[j]==bcf_int32_vector_end); break;
+ case BCF_HT_REAL: BRANCH(float,bcf_float_is_missing(ptr[j]),bcf_float_is_vector_end(ptr[j])); break;
+ }
+ #undef BRANCH
+
+ double theta = esum / (2 * (double)nval);
+ float info = (theta>0 && theta<1) ? (float)(1 - (fsum - e2sum) / (2 * (double)nval * theta * (1.0 - theta))) : 1;
+
+ bcf_update_info_float(out_hdr, rec, "INFO", &info, 1);
+ nrec++;
+ return rec;
+}
+
+
+void destroy(void)
+{
+ fprintf(stderr,"Lines total/info-added/unchanged-no-tag/unchanged-not-biallelic-diploid:\t%d/%d/%d/%d\n", nrec+nskip_gp+nskip_dip, nrec, nskip_gp, nskip_dip);
+ free(buf);
+}
--- /dev/null
+#include "bcftools.pysam.h"
+
+/* plugins/impute-info.c -- adds info metrics to a VCF file.
+
+ Copyright (C) 2015-2016 Genome Research Ltd.
+
+ Author: Shane McCarthy <sm15@sanger.ac.uk>
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE. */
+
+
+/*
+
+Marchini & Howie, Nature Reviews Genetics, 11 (July 2010)
+
+Let G_ij in {0,1,2} be the genotype of the ith individual at the
+jth SNP in a study cohort of N samples. Let
+
+ p_ijk = P(G_ij = k | H,G)
+
+be the probability that the genotype at the jth SNP of the ith
+individual is k.
+
+Let the expected allele dosage for the genotype at the jth SNP
+of the ith individual be
+
+ e_ij = p_ij1 + 2 * p_ij2
+
+and define
+
+ f_ij = p_ij1 + 4 * p_ij2
+
+Let theta_j denote the (unknown) population allele frequency of the jth SNP
+with:
+
+ theta_j = SUM[i=1..N] e_ij / (2 * N)
+
+The IMPUTE2 information measure is then:
+
+ if theta_j in (0,1):
+ I(theta_j) = 1 - SUM[i=1..N](f_ij - e_ij^2) / (2 * N * theta_j * (1 - theta_j))
+ else:
+ I(theta_j) = 1
+
+*/
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <htslib/vcf.h>
+#include <math.h>
+#include <getopt.h>
+
+const char *about(void)
+{
+ return "Add imputation information metrics to the INFO field based on selected FORMAT tags.\n";
+}
+
+const char *usage(void)
+{
+ return
+ "\n"
+ "About: Add imputation information metrics to the INFO field based\n"
+ " on selected FORMAT tags. Only the IMPUTE2 INFO metric from\n"
+ " FORMAT/GP tags is currently available.\n"
+ "Usage: bcftools +impute-info [General Options] -- [Plugin Options]\n"
+ "Options:\n"
+ " run \"bcftools plugin\" for a list of common options\n"
+ "\n"
+ // "Plugin options:\n"
+ // " -i, --info <list> information metrics to add [INFO]\n" // [BEAGLE_R2,MACH_R2]
+ // " -t, --tags <tag> VCF tags to determine the information from [GP]\n"
+ // "\n"
+ "Example:\n"
+ " bcftools +impute-info in.vcf\n"
+ "\n";
+}
+
+bcf_hdr_t *in_hdr = NULL, *out_hdr = NULL;
+int gp_type = BCF_HT_REAL;
+uint8_t *buf = NULL;
+int nbuf = 0; // NB: number of elements, not bytes
+int nrec = 0, nskip_gp = 0, nskip_dip = 0;
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+ in_hdr = in;
+ out_hdr = out;
+ bcf_hdr_append(out_hdr,"##INFO=<ID=INFO,Number=1,Type=Float,Description=\"IMPUTE2 info score\">");
+ return 0;
+}
+
+bcf1_t *process(bcf1_t *rec)
+{
+ int nval = 0, i, j, nret = bcf_get_format_values(in_hdr,rec,"GP",(void**)&buf,&nbuf,gp_type);
+ if ( nret<0 )
+ {
+ if (!nskip_gp) fprintf(bcftools_stderr, "[impute-info.c] Warning: info tag not added to sites without GP tag\n");
+ nskip_gp++;
+ return rec; // require FORMAT/GP tag, return site unchanged
+ }
+
+ nret /= rec->n_sample;
+ if ( nret != 3 )
+ {
+ if (!nskip_dip) fprintf(bcftools_stderr, "[impute-info.c] Warning: info tag not added to sites that are not biallelic diploid\n");
+ nskip_dip++;
+ return rec; // require biallelic diploid, return site unchanged
+ }
+
+ double esum = 0, e2sum = 0, fsum = 0;
+ #define BRANCH(type_t,is_missing,is_vector_end) \
+ { \
+ type_t *ptr = (type_t*) buf; \
+ for (i=0; i<rec->n_sample; i++) \
+ { \
+ double vals[3] = {0,0,0}; \
+ for (j=0; j<nret; j++) \
+ { \
+ if ( is_missing || is_vector_end ) break; \
+ vals[j] = ptr[j]; \
+ } \
+ double norm = vals[0]+vals[1]+vals[2]; \
+ if ( norm ) for (j=0; j<3; j++) vals[j] /= norm; \
+ esum += vals[1] + 2*vals[2]; \
+ e2sum += (vals[1] + 2*vals[2]) * (vals[1] + 2*vals[2]); \
+ fsum += vals[1] + 4*vals[2]; \
+ ptr += nret; \
+ nval++; \
+ } \
+ }
+ switch (gp_type)
+ {
+ case BCF_HT_INT: BRANCH(int32_t,ptr[j]==bcf_int32_missing,ptr[j]==bcf_int32_vector_end); break;
+ case BCF_HT_REAL: BRANCH(float,bcf_float_is_missing(ptr[j]),bcf_float_is_vector_end(ptr[j])); break;
+ }
+ #undef BRANCH
+
+ double theta = esum / (2 * (double)nval);
+ float info = (theta>0 && theta<1) ? (float)(1 - (fsum - e2sum) / (2 * (double)nval * theta * (1.0 - theta))) : 1;
+
+ bcf_update_info_float(out_hdr, rec, "INFO", &info, 1);
+ nrec++;
+ return rec;
+}
+
+
+void destroy(void)
+{
+ fprintf(bcftools_stderr,"Lines total/info-added/unchanged-no-tag/unchanged-not-biallelic-diploid:\t%d/%d/%d/%d\n", nrec+nskip_gp+nskip_dip, nrec, nskip_gp, nskip_dip);
+ free(buf);
+}
--- /dev/null
+/*
+ Copyright (C) 2016 Genome Research Ltd.
+
+ Author: Petr Danecek <pd3@sanger.ac.uk>
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in
+ all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ THE SOFTWARE.
+*/
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <getopt.h>
+#include <stdarg.h>
+#include <stdint.h>
+#include <htslib/vcf.h>
+#include <htslib/synced_bcf_reader.h>
+#include <htslib/vcfutils.h>
+#include <inttypes.h>
+#include <unistd.h>
+#include <errno.h>
+#include "bcftools.h"
+#include "smpl_ilist.h"
+
+typedef struct
+{
+ int argc, output_type, regions_is_file, targets_is_file;
+ char **argv, *output_fname, *regions_list, *targets_list;
+ int32_t *arr_a, narr_a, *arr_b, narr_b;
+ bcf_srs_t *sr;
+ bcf_hdr_t *hdr_a, *hdr_b;
+ htsFile *out_fh;
+}
+args_t;
+
+const char *about(void)
+{
+ return "Compare two files and set non-identical genotypes to missing.\n";
+}
+
+static const char *usage_text(void)
+{
+ return
+ "\n"
+ "About: Compare two files and set non-identical genotypes in the first file to missing.\n"
+ "\n"
+ "Usage: bcftools +isecGT <A.bcf> <B.bcf> [Plugin Options]\n"
+ "Plugin options:\n"
+ " -o, --output <file> write output to a file [standard output]\n"
+ " -O, --output-type <b|u|z|v> 'b' compressed BCF; 'u' uncompressed BCF; 'z' compressed VCF; 'v' uncompressed VCF [v]\n"
+ " -r, --regions <region> restrict to comma-separated list of regions\n"
+ " -R, --regions-file <file> restrict to regions listed in a file\n"
+ " -t, --targets <region> similar to -r but streams rather than index-jumps\n"
+ " -T, --targets-file <file> similar to -R but streams rather than index-jumps\n"
+ "\n";
+}
+
+int run(int argc, char **argv)
+{
+ args_t *args = (args_t*) calloc(1,sizeof(args_t));
+ args->output_fname = "-";
+ args->output_type = FT_VCF;
+ static struct option loptions[] =
+ {
+ {"regions",required_argument,NULL,'r'},
+ {"regions-file",required_argument,NULL,'R'},
+ {"targets",required_argument,NULL,'t'},
+ {"targets-file",required_argument,NULL,'T'},
+ {"output",required_argument,NULL,'o'},
+ {"output-type",required_argument,NULL,'O'},
+ {NULL,0,NULL,0}
+ };
+ int c;
+ while ((c = getopt_long(argc, argv, "o:O:r:R:t:T:",loptions,NULL)) >= 0)
+ {
+ switch (c)
+ {
+ case 'o': args->output_fname = optarg; break;
+ case 'O':
+ switch (optarg[0]) {
+ case 'b': args->output_type = FT_BCF_GZ; break;
+ case 'u': args->output_type = FT_BCF; break;
+ case 'z': args->output_type = FT_VCF_GZ; break;
+ case 'v': args->output_type = FT_VCF; break;
+ default: error("The output type \"%s\" not recognised\n", optarg);
+ }
+ break;
+ case 'r': args->regions_list = optarg; break;
+ case 'R': args->regions_list = optarg; args->regions_is_file = 1; break;
+ case 't': args->targets_list = optarg; break;
+ case 'T': args->targets_list = optarg; args->targets_is_file = 1; break;
+ case 'h':
+ case '?':
+ default: error("%s", usage_text()); break;
+ }
+ }
+
+ if ( optind+2!=argc ) error("%s",usage_text());
+
+ args->sr = bcf_sr_init();
+ args->sr->require_index = 1;
+ if ( args->regions_list )
+ {
+ if ( bcf_sr_set_regions(args->sr, args->regions_list, args->regions_is_file)<0 )
+ error("Failed to read the regions: %s\n", args->regions_list);
+ }
+ if ( args->targets_list )
+ {
+ if ( bcf_sr_set_targets(args->sr, args->targets_list, args->targets_is_file, 0)<0 )
+ error("Failed to read the targets: %s\n", args->targets_list);
+ args->sr->collapse |= COLLAPSE_BOTH;
+ }
+ if ( !bcf_sr_add_reader(args->sr,argv[optind]) ) error("Error opening %s: %s\n", argv[optind],bcf_sr_strerror(args->sr->errnum));
+ if ( !bcf_sr_add_reader(args->sr,argv[optind+1]) ) error("Error opening %s: %s\n", argv[optind+1],bcf_sr_strerror(args->sr->errnum));
+ args->hdr_a = bcf_sr_get_header(args->sr,0);
+ args->hdr_b = bcf_sr_get_header(args->sr,1);
+ smpl_ilist_t *smpl = smpl_ilist_map(args->hdr_a, args->hdr_b, SMPL_STRICT);
+ args->out_fh = hts_open(args->output_fname, hts_bcf_wmode(args->output_type));
+ if ( args->out_fh == NULL ) error("Can't write to \"%s\": %s\n", args->output_fname, strerror(errno));
+ bcf_hdr_write(args->out_fh, args->hdr_a);
+
+ while ( bcf_sr_next_line(args->sr) )
+ {
+ if ( !bcf_sr_has_line(args->sr,0) ) continue;
+ if ( !bcf_sr_has_line(args->sr,1) )
+ {
+ bcf_write(args->out_fh, args->hdr_a, bcf_sr_get_line(args->sr,0));
+ continue;
+ }
+
+ bcf1_t *line_a = bcf_sr_get_line(args->sr,0);
+ bcf1_t *line_b = bcf_sr_get_line(args->sr,1);
+ int ngt_a = bcf_get_genotypes(args->hdr_a, line_a, &args->arr_a, &args->narr_a);
+ int ngt_b = bcf_get_genotypes(args->hdr_b, line_b, &args->arr_b, &args->narr_b);
+ assert( ngt_a==ngt_b ); // todo
+ ngt_a /= smpl->n;
+ ngt_b /= smpl->n;
+ int i, j, dirty = 0;
+ for (i=0; i<smpl->n; i++)
+ {
+ int32_t *a = args->arr_a + i*ngt_a;
+ int32_t *b = args->arr_b + smpl->idx[i]*ngt_b;
+ for (j=0; j<ngt_a; j++)
+ if ( a[j]!=b[j] ) break;
+ if ( j<ngt_a )
+ {
+ dirty = 1;
+ for (j=0; j<ngt_a; j++) a[j] = bcf_gt_missing;
+ }
+ }
+ if ( dirty ) bcf_update_genotypes(args->hdr_a, line_a, args->arr_a, ngt_a*smpl->n);
+ bcf_write(args->out_fh, args->hdr_a, line_a);
+ }
+
+ if ( hts_close(args->out_fh)!=0 ) error("Close failed: %s\n",args->output_fname);
+ smpl_ilist_destroy(smpl);
+ bcf_sr_destroy(args->sr);
+ free(args->arr_a);
+ free(args->arr_b);
+ free(args);
+ return 0;
+}
+
--- /dev/null
+#include "bcftools.pysam.h"
+
+/*
+ Copyright (C) 2016 Genome Research Ltd.
+
+ Author: Petr Danecek <pd3@sanger.ac.uk>
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in
+ all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ THE SOFTWARE.
+*/
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <getopt.h>
+#include <stdarg.h>
+#include <stdint.h>
+#include <htslib/vcf.h>
+#include <htslib/synced_bcf_reader.h>
+#include <htslib/vcfutils.h>
+#include <inttypes.h>
+#include <unistd.h>
+#include <errno.h>
+#include "bcftools.h"
+#include "smpl_ilist.h"
+
+typedef struct
+{
+ int argc, output_type, regions_is_file, targets_is_file;
+ char **argv, *output_fname, *regions_list, *targets_list;
+ int32_t *arr_a, narr_a, *arr_b, narr_b;
+ bcf_srs_t *sr;
+ bcf_hdr_t *hdr_a, *hdr_b;
+ htsFile *out_fh;
+}
+args_t;
+
+const char *about(void)
+{
+ return "Compare two files and set non-identical genotypes to missing.\n";
+}
+
+static const char *usage_text(void)
+{
+ return
+ "\n"
+ "About: Compare two files and set non-identical genotypes in the first file to missing.\n"
+ "\n"
+ "Usage: bcftools +isecGT <A.bcf> <B.bcf> [Plugin Options]\n"
+ "Plugin options:\n"
+ " -o, --output <file> write output to a file [standard output]\n"
+ " -O, --output-type <b|u|z|v> 'b' compressed BCF; 'u' uncompressed BCF; 'z' compressed VCF; 'v' uncompressed VCF [v]\n"
+ " -r, --regions <region> restrict to comma-separated list of regions\n"
+ " -R, --regions-file <file> restrict to regions listed in a file\n"
+ " -t, --targets <region> similar to -r but streams rather than index-jumps\n"
+ " -T, --targets-file <file> similar to -R but streams rather than index-jumps\n"
+ "\n";
+}
+
+int run(int argc, char **argv)
+{
+ args_t *args = (args_t*) calloc(1,sizeof(args_t));
+ args->output_fname = "-";
+ args->output_type = FT_VCF;
+ static struct option loptions[] =
+ {
+ {"regions",required_argument,NULL,'r'},
+ {"regions-file",required_argument,NULL,'R'},
+ {"targets",required_argument,NULL,'t'},
+ {"targets-file",required_argument,NULL,'T'},
+ {"output",required_argument,NULL,'o'},
+ {"output-type",required_argument,NULL,'O'},
+ {NULL,0,NULL,0}
+ };
+ int c;
+ while ((c = getopt_long(argc, argv, "o:O:r:R:t:T:",loptions,NULL)) >= 0)
+ {
+ switch (c)
+ {
+ case 'o': args->output_fname = optarg; break;
+ case 'O':
+ switch (optarg[0]) {
+ case 'b': args->output_type = FT_BCF_GZ; break;
+ case 'u': args->output_type = FT_BCF; break;
+ case 'z': args->output_type = FT_VCF_GZ; break;
+ case 'v': args->output_type = FT_VCF; break;
+ default: error("The output type \"%s\" not recognised\n", optarg);
+ }
+ break;
+ case 'r': args->regions_list = optarg; break;
+ case 'R': args->regions_list = optarg; args->regions_is_file = 1; break;
+ case 't': args->targets_list = optarg; break;
+ case 'T': args->targets_list = optarg; args->targets_is_file = 1; break;
+ case 'h':
+ case '?':
+ default: error("%s", usage_text()); break;
+ }
+ }
+
+ if ( optind+2!=argc ) error("%s",usage_text());
+
+ args->sr = bcf_sr_init();
+ args->sr->require_index = 1;
+ if ( args->regions_list )
+ {
+ if ( bcf_sr_set_regions(args->sr, args->regions_list, args->regions_is_file)<0 )
+ error("Failed to read the regions: %s\n", args->regions_list);
+ }
+ if ( args->targets_list )
+ {
+ if ( bcf_sr_set_targets(args->sr, args->targets_list, args->targets_is_file, 0)<0 )
+ error("Failed to read the targets: %s\n", args->targets_list);
+ args->sr->collapse |= COLLAPSE_BOTH;
+ }
+ if ( !bcf_sr_add_reader(args->sr,argv[optind]) ) error("Error opening %s: %s\n", argv[optind],bcf_sr_strerror(args->sr->errnum));
+ if ( !bcf_sr_add_reader(args->sr,argv[optind+1]) ) error("Error opening %s: %s\n", argv[optind+1],bcf_sr_strerror(args->sr->errnum));
+ args->hdr_a = bcf_sr_get_header(args->sr,0);
+ args->hdr_b = bcf_sr_get_header(args->sr,1);
+ smpl_ilist_t *smpl = smpl_ilist_map(args->hdr_a, args->hdr_b, SMPL_STRICT);
+ args->out_fh = hts_open(args->output_fname, hts_bcf_wmode(args->output_type));
+ if ( args->out_fh == NULL ) error("Can't write to \"%s\": %s\n", args->output_fname, strerror(errno));
+ bcf_hdr_write(args->out_fh, args->hdr_a);
+
+ while ( bcf_sr_next_line(args->sr) )
+ {
+ if ( !bcf_sr_has_line(args->sr,0) ) continue;
+ if ( !bcf_sr_has_line(args->sr,1) )
+ {
+ bcf_write(args->out_fh, args->hdr_a, bcf_sr_get_line(args->sr,0));
+ continue;
+ }
+
+ bcf1_t *line_a = bcf_sr_get_line(args->sr,0);
+ bcf1_t *line_b = bcf_sr_get_line(args->sr,1);
+ int ngt_a = bcf_get_genotypes(args->hdr_a, line_a, &args->arr_a, &args->narr_a);
+ int ngt_b = bcf_get_genotypes(args->hdr_b, line_b, &args->arr_b, &args->narr_b);
+ assert( ngt_a==ngt_b ); // todo
+ ngt_a /= smpl->n;
+ ngt_b /= smpl->n;
+ int i, j, dirty = 0;
+ for (i=0; i<smpl->n; i++)
+ {
+ int32_t *a = args->arr_a + i*ngt_a;
+ int32_t *b = args->arr_b + smpl->idx[i]*ngt_b;
+ for (j=0; j<ngt_a; j++)
+ if ( a[j]!=b[j] ) break;
+ if ( j<ngt_a )
+ {
+ dirty = 1;
+ for (j=0; j<ngt_a; j++) a[j] = bcf_gt_missing;
+ }
+ }
+ if ( dirty ) bcf_update_genotypes(args->hdr_a, line_a, args->arr_a, ngt_a*smpl->n);
+ bcf_write(args->out_fh, args->hdr_a, line_a);
+ }
+
+ if ( hts_close(args->out_fh)!=0 ) error("Close failed: %s\n",args->output_fname);
+ smpl_ilist_destroy(smpl);
+ bcf_sr_destroy(args->sr);
+ free(args->arr_a);
+ free(args->arr_b);
+ free(args);
+ return 0;
+}
+
--- /dev/null
+/* The MIT License
+
+ Copyright (c) 2015 Genome Research Ltd.
+
+ Author: Petr Danecek <pd3@sanger.ac.uk>
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in
+ all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ THE SOFTWARE.
+
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <getopt.h>
+#include <math.h>
+#include <htslib/hts.h>
+#include <htslib/vcf.h>
+#include <htslib/synced_bcf_reader.h>
+#include <errno.h>
+#include <ctype.h>
+#include <unistd.h> // for isatty
+#include "bcftools.h"
+#include "regidx.h"
+
+#define MODE_COUNT 1
+#define MODE_LIST_GOOD 2
+#define MODE_LIST_BAD 4
+#define MODE_DELETE 8
+
+typedef struct
+{
+ int nok, nbad;
+ int imother,ifather,ichild;
+}
+trio_t;
+
+typedef struct
+{
+ int mpl, fpl, cpl; // ploidies - mother, father, child
+ int mal, fal; // expect an allele from mother and father
+}
+rule_t;
+
+typedef struct _args_t
+{
+ regidx_t *rules;
+ regitr_t *itr, *itr_ori;
+ bcf_hdr_t *hdr;
+ htsFile *out_fh;
+ int32_t *gt_arr;
+ int mode;
+ int ngt_arr, nrec;
+ trio_t *trios;
+ int ntrios;
+ int output_type;
+ char *output_fname;
+ bcf_srs_t *sr;
+}
+args_t;
+
+static args_t args;
+static int parse_rules(const char *line, char **chr_beg, char **chr_end, uint32_t *beg, uint32_t *end, void *payload, void *usr);
+static bcf1_t *process(bcf1_t *rec);
+
+const char *about(void)
+{
+ return "Count Mendelian consistent / inconsistent genotypes.\n";
+}
+
+typedef struct
+{
+ const char *alias, *about, *rules;
+}
+rules_predef_t;
+
+static rules_predef_t rules_predefs[] =
+{
+ { .alias = "GRCh37",
+ .about = "Human Genome reference assembly GRCh37 / hg19, both chr naming conventions",
+ .rules =
+ " X:1-60000 M/M + F > M\n"
+ " X:1-60000 M/M + F > M/F\n"
+ " X:2699521-154931043 M/M + F > M\n"
+ " X:2699521-154931043 M/M + F > M/F\n"
+ " Y:1-59373566 . + F > F\n"
+ " MT:1-16569 M + F > M\n"
+ "\n"
+ " chrX:1-60000 M/M + F > M\n"
+ " chrX:1-60000 M/M + F > M/F\n"
+ " chrX:2699521-154931043 M/M + F > M\n"
+ " chrX:2699521-154931043 M/M + F > M/F\n"
+ " chrY:1-59373566 . + F > F\n"
+ " chrM:1-16569 M + F > M\n"
+ },
+ { .alias = "GRCh38",
+ .about = "Human Genome reference assembly GRCh38 / hg38, both chr naming conventions",
+ .rules =
+ " X:1-9999 M/M + F > M\n"
+ " X:1-9999 M/M + F > M/F\n"
+ " X:2781480-155701381 M/M + F > M\n"
+ " X:2781480-155701381 M/M + F > M/F\n"
+ " Y:1-57227415 . + F > F\n"
+ " MT:1-16569 M + F > M\n"
+ "\n"
+ " chrX:1-9999 M/M + F > M\n"
+ " chrX:1-9999 M/M + F > M/F\n"
+ " chrX:2781480-155701381 M/M + F > M\n"
+ " chrX:2781480-155701381 M/M + F > M/F\n"
+ " chrY:1-57227415 . + F > F\n"
+ " chrM:1-16569 M + F > M\n"
+ },
+ {
+ .alias = NULL,
+ .about = NULL,
+ .rules = NULL,
+ }
+};
+
+
+const char *usage(void)
+{
+ return
+ "\n"
+ "About: Count Mendelian consistent / inconsistent genotypes.\n"
+ "Usage: bcftools +mendelian [Options]\n"
+ "Options:\n"
+ " -c, --count count the number of consistent sites\n"
+ " -d, --delete delete inconsistent genotypes (set to \"./.\")\n"
+ " -l, --list [+x] list consistent (+) or inconsistent (x) sites\n"
+ " -o, --output <file> write output to a file [standard output]\n"
+ " -O, --output-type <type> 'b' compressed BCF; 'u' uncompressed BCF; 'z' compressed VCF; 'v' uncompressed VCF [v]\n"
+ " -r, --rules <assembly>[?] predefined rules, 'list' to print available settings, append '?' for details\n"
+ " -R, --rules-file <file> inheritance rules, see example below\n"
+ " -t, --trio <m,f,c> names of mother, father and the child\n"
+ " -T, --trio-file <file> list of trios, one per line\n"
+ "\n"
+ "Example:\n"
+ " # Default inheritance patterns, override with -r\n"
+ " # region maternal_ploidy + paternal > offspring\n"
+ " X:1-60000 M/M + F > M\n"
+ " X:1-60000 M/M + F > M/F\n"
+ " X:2699521-154931043 M/M + F > M\n"
+ " X:2699521-154931043 M/M + F > M/F\n"
+ " Y:1-59373566 . + F > F\n"
+ " MT:1-16569 M + F > M\n"
+ "\n"
+ " bcftools +mendelian in.vcf -t Mother,Father,Child -c\n"
+ "\n";
+}
+
+regidx_t *init_rules(args_t *args, char *alias)
+{
+ const rules_predef_t *rules = rules_predefs;
+ if ( !alias ) alias = "GRCh37";
+
+ int detailed = 0, len = strlen(alias);
+ if ( alias[len-1]=='?' ) { detailed = 1; alias[len-1] = 0; }
+
+ while ( rules->alias && strcasecmp(alias,rules->alias) ) rules++;
+
+ if ( !rules->alias )
+ {
+ fprintf(stderr,"\nPRE-DEFINED INHERITANCE RULES\n\n");
+ fprintf(stderr," * Columns are: CHROM:BEG-END MATERNAL_PLOIDY + PATERNAL_PLOIDY > OFFSPRING\n");
+ fprintf(stderr," * Coordinates are 1-based inclusive.\n\n");
+ rules = rules_predefs;
+ while ( rules->alias )
+ {
+ fprintf(stderr,"%s\n .. %s\n\n", rules->alias,rules->about);
+ if ( detailed )
+ fprintf(stderr,"%s\n", rules->rules);
+ rules++;
+ }
+ fprintf(stderr,"Run as --rules <alias> (e.g. --rules GRCh37).\n");
+ fprintf(stderr,"To see the detailed ploidy definition, append a question mark (e.g. --rules GRCh37?).\n");
+ fprintf(stderr,"\n");
+ exit(-1);
+ }
+ else if ( detailed )
+ {
+ fprintf(stderr,"%s", rules->rules);
+ exit(-1);
+ }
+ return regidx_init_string(rules->rules, parse_rules, NULL, sizeof(rule_t), args);
+}
+
+static int parse_rules(const char *line, char **chr_beg, char **chr_end, uint32_t *beg, uint32_t *end, void *payload, void *usr)
+{
+ // e.g. "Y:1-59373566 . + F > . # daughter"
+
+ // eat any leading spaces
+ char *ss = (char*) line;
+ while ( *ss && isspace(*ss) ) ss++;
+ if ( !*ss ) return -1; // skip empty lines
+
+ // chromosome name, beg, end
+ char *tmp, *se = ss;
+ while ( se[1] && !isspace(se[1]) ) se++;
+ while ( se > ss && isdigit(*se) ) se--;
+ if ( *se!='-' ) error("Could not parse the region: %s\n", line);
+ *end = strtol(se+1, &tmp, 10) - 1;
+ if ( tmp==se+1 ) error("Could not parse the region:%s\n",line);
+ while ( se > ss && *se!=':' ) se--;
+ *beg = strtol(se+1, &tmp, 10) - 1;
+ if ( tmp==se+1 ) error("Could not parse the region:%s\n",line);
+
+ *chr_beg = ss;
+ *chr_end = se-1;
+
+ // skip region
+ while ( *ss && !isspace(*ss) ) ss++;
+ while ( *ss && isspace(*ss) ) ss++;
+
+ rule_t *rule = (rule_t*) payload;
+ memset(rule, 0, sizeof(rule_t));
+
+ // maternal ploidy
+ se = ss;
+ while ( *se && !isspace(*se) ) se++;
+ int err = 0;
+ if ( se - ss == 1 )
+ {
+ if ( *ss=='M' ) rule->mpl = 1;
+ else if ( *ss=='.' ) rule->mpl = 0;
+ else err = 1;
+ }
+ else if ( se - ss == 3 )
+ {
+ if ( !strncmp(ss,"M/M",3) ) rule->mpl = 2;
+ else err = 1;
+ }
+ else err = 1;
+ if ( err ) error("Could not parse the maternal ploidy, only \"M\", \"M/M\" and \".\" are currently supported: %s\n",line);
+
+ // skip "+"
+ while ( *se && isspace(*se) ) se++;
+ if ( *se != '+' ) error("Could not parse the line: %s\n",line);
+ se++;
+ while ( *se && isspace(*se) ) se++;
+
+ // paternal ploidy
+ ss = se;
+ while ( *se && !isspace(*se) ) se++;
+ if ( se - ss == 1 )
+ {
+ if ( *ss=='F' ) rule->fpl = 1;
+ else err = 1;
+ }
+ else err = 1;
+ if ( err ) error("Could not parse the paternal ploidy, only \"F\" is currently supported: %s [%s]\n",line, ss);
+
+ // skip ">"
+ while ( *se && isspace(*se) ) se++;
+ if ( *se != '>' ) error("Could not parse the line: %s\n",line);
+ se++;
+ while ( *se && isspace(*se) ) se++;
+
+ // ploidy in offspring
+ ss = se;
+ while ( *se && !isspace(*se) ) se++;
+ if ( se - ss == 3 )
+ {
+ if ( !strncmp(ss,"M/F",3) ) { rule->cpl = 2; rule->fal = 1; rule->mal = 1; }
+ else err = 1;
+ }
+ else if ( se - ss == 1 )
+ {
+ if ( *ss=='F' ) { rule->cpl = 1; rule->fal = 1; }
+ else if ( *ss=='M' ) { rule->cpl = 1; rule->mal = 1; }
+ else err = 1;
+ }
+ else err = 1;
+ if ( err ) error("Could not parse the offspring's ploidy, only \"M\", \"F\" or \"M/F\" is currently supported: %s\n",line);
+
+ return 0;
+}
+
+int run(int argc, char **argv)
+{
+ char *trio_samples = NULL, *trio_file = NULL, *rules_fname = NULL, *rules_string = NULL;
+ memset(&args,0,sizeof(args_t));
+ args.mode = 0;
+ args.output_fname = "-";
+
+ static struct option loptions[] =
+ {
+ {"trio",1,0,'t'},
+ {"trio-file",1,0,'T'},
+ {"delete",0,0,'d'},
+ {"list",1,0,'l'},
+ {"count",0,0,'c'},
+ {"rules",1,0,'r'},
+ {"rules-file",1,0,'R'},
+ {"output",required_argument,NULL,'o'},
+ {"output-type",required_argument,NULL,'O'},
+ {0,0,0,0}
+ };
+ int c;
+ while ((c = getopt_long(argc, argv, "?ht:T:l:cdr:R:o:O:",loptions,NULL)) >= 0)
+ {
+ switch (c)
+ {
+ case 'o': args.output_fname = optarg; break;
+ case 'O':
+ switch (optarg[0]) {
+ case 'b': args.output_type = FT_BCF_GZ; break;
+ case 'u': args.output_type = FT_BCF; break;
+ case 'z': args.output_type = FT_VCF_GZ; break;
+ case 'v': args.output_type = FT_VCF; break;
+ default: error("The output type \"%s\" not recognised\n", optarg);
+ };
+ break;
+ case 'R': rules_fname = optarg; break;
+ case 'r': rules_string = optarg; break;
+ case 'd': args.mode |= MODE_DELETE; break;
+ case 'c': args.mode |= MODE_COUNT; break;
+ case 'l':
+ if ( !strcmp("+",optarg) ) args.mode |= MODE_LIST_GOOD;
+ else if ( !strcmp("x",optarg) ) args.mode |= MODE_LIST_BAD;
+ else error("The argument not recognised: --list %s\n", optarg);
+ break;
+ case 't': trio_samples = optarg; break;
+ case 'T': trio_file = optarg; break;
+ case 'h':
+ case '?':
+ default: error("%s",usage()); break;
+ }
+ }
+ if ( rules_fname )
+ args.rules = regidx_init(rules_fname, parse_rules, NULL, sizeof(rule_t), &args);
+ else
+ args.rules = init_rules(&args, rules_string);
+ if ( !args.rules ) return -1;
+ args.itr = regitr_init(args.rules);
+ args.itr_ori = regitr_init(args.rules);
+
+ char *fname = NULL;
+ if ( optind>=argc || argv[optind][0]=='-' )
+ {
+ if ( !isatty(fileno((FILE *)stdin)) ) fname = "-"; // reading from stdin
+ else error("%s",usage());
+ }
+ else
+ fname = argv[optind];
+
+ if ( !trio_samples && !trio_file ) error("Expected the -t/T option\n");
+ if ( !args.mode ) error("Expected one of the -c, -d or -l options\n");
+ if ( args.mode&MODE_DELETE && !(args.mode&(MODE_LIST_GOOD|MODE_LIST_BAD)) ) args.mode |= MODE_LIST_GOOD|MODE_LIST_BAD;
+
+ args.sr = bcf_sr_init();
+ if ( !bcf_sr_add_reader(args.sr, fname) ) error("Failed to open %s: %s\n", fname,bcf_sr_strerror(args.sr->errnum));
+ args.hdr = bcf_sr_get_header(args.sr, 0);
+ args.out_fh = hts_open(args.output_fname,hts_bcf_wmode(args.output_type));
+ if ( args.out_fh == NULL ) error("Can't write to \"%s\": %s\n", args.output_fname, strerror(errno));
+ bcf_hdr_write(args.out_fh, args.hdr);
+
+
+ int i, n = 0;
+ char **list;
+ if ( trio_samples )
+ {
+ args.ntrios = 1;
+ args.trios = (trio_t*) calloc(1,sizeof(trio_t));
+ list = hts_readlist(trio_samples, 0, &n);
+ if ( n!=3 ) error("Expected three sample names with -t\n");
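+ args.trios[0].imother = bcf_hdr_id2int(args.hdr, BCF_DT_SAMPLE, list[0]);
+ if ( args.trios[0].imother<0 ) error("No such sample: \"%s\"\n", list[0]);
+ args.trios[0].ifather = bcf_hdr_id2int(args.hdr, BCF_DT_SAMPLE, list[1]);
+ if ( args.trios[0].ifather<0 ) error("No such sample: \"%s\"\n", list[1]);
+ args.trios[0].ichild = bcf_hdr_id2int(args.hdr, BCF_DT_SAMPLE, list[2]);
+ if ( args.trios[0].ichild<0 ) error("No such sample: \"%s\"\n", list[2]);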
+ for (i=0; i<n; i++) free(list[i]);
+ free(list);
+ }
+ if ( trio_file )
+ {
+ list = hts_readlist(trio_file, 1, &n);
+ args.ntrios = n;
+ args.trios = (trio_t*) calloc(n,sizeof(trio_t));
+ for (i=0; i<n; i++)
+ {
+ char *ss = list[i], *se;
+ se = strchr(ss, ',');
+ if ( !se ) error("Could not parse %s: %s\n",trio_file, ss);
+ *se = 0;
+ args.trios[i].imother = bcf_hdr_id2int(args.hdr, BCF_DT_SAMPLE, ss);
+ if ( args.trios[i].imother<0 ) error("No such sample: \"%s\"\n", ss);
+ ss = ++se;
+ se = strchr(ss, ',');
+ if ( !se ) error("Could not parse %s\n",trio_file);
+ *se = 0;
+ args.trios[i].ifather = bcf_hdr_id2int(args.hdr, BCF_DT_SAMPLE, ss);
+ if ( args.trios[i].ifather<0 ) error("No such sample: \"%s\"\n", ss);
+ ss = ++se;
+ if ( *ss=='\0' ) error("Could not parse %s\n",trio_file);
+ args.trios[i].ichild = bcf_hdr_id2int(args.hdr, BCF_DT_SAMPLE, ss);
+ if ( args.trios[i].ichild<0 ) error("No such sample: \"%s\"\n", ss);
+ free(list[i]);
+ }
+ free(list);
+ }
+
+ while ( bcf_sr_next_line(args.sr) )
+ {
+ bcf1_t *line = bcf_sr_get_line(args.sr,0);
+ line = process(line);
+ if ( line )
+ {
+ if ( line->errcode ) error("TODO: Unchecked error (%d), exiting\n",line->errcode);
+ bcf_write1(args.out_fh, args.hdr, line);
+ }
+ }
+
+
+ fprintf(stderr,"# [1]nOK\t[2]nBad\t[3]nSkipped\t[4]Trio\n");
+ for (i=0; i<args.ntrios; i++)
+ {
+ trio_t *trio = &args.trios[i];
+ fprintf(stderr,"%d\t%d\t%d\t%s,%s,%s\n",
+ trio->nok,trio->nbad,args.nrec-(trio->nok+trio->nbad),
+ bcf_hdr_int2id(args.hdr, BCF_DT_SAMPLE, trio->imother),
+ bcf_hdr_int2id(args.hdr, BCF_DT_SAMPLE, trio->ifather),
+ bcf_hdr_int2id(args.hdr, BCF_DT_SAMPLE, trio->ichild)
+ );
+ }
+ free(args.gt_arr);
+ free(args.trios);
+ regitr_destroy(args.itr);
+ regitr_destroy(args.itr_ori);
+ regidx_destroy(args.rules);
+ bcf_sr_destroy(args.sr);
+ if ( hts_close(args.out_fh)!=0 ) error("Error: close failed\n");
+ return 0;
+}
+
+static void warn_ploidy(bcf1_t *rec)
+{
+ static int warned = 0;
+ if ( warned ) return;
+ fprintf(stderr,"Incorrect ploidy at %s:%d, skipping the trio. (This warning is printed only once.)\n", bcf_seqname(args.hdr,rec),rec->pos+1);
+ warned = 1;
+}
+
+bcf1_t *process(bcf1_t *rec)
+{
+ bcf1_t *dflt = args.mode&MODE_LIST_GOOD ? rec : NULL;
+ args.nrec++;
+
+ if ( rec->n_allele > 63 ) return dflt; // we use 64bit bitmask below
+
+ int ngt = bcf_get_genotypes(args.hdr, rec, &args.gt_arr, &args.ngt_arr);
+ if ( ngt<0 ) return dflt;
+ if ( ngt!=2*bcf_hdr_nsamples(args.hdr) && ngt!=bcf_hdr_nsamples(args.hdr) ) return dflt;
+ ngt /= bcf_hdr_nsamples(args.hdr);
+
+ int itr_set = regidx_overlap(args.rules, bcf_seqname(args.hdr,rec),rec->pos,rec->pos, args.itr_ori);
+
+ int i, has_bad = 0, needs_update = 0;
+ for (i=0; i<args.ntrios; i++)
+ {
+ int32_t a,b,c,d,e,f;
+ trio_t *trio = &args.trios[i];
+
+ a = args.gt_arr[ngt*trio->imother];
+ b = ngt==2 ? args.gt_arr[ngt*trio->imother+1] : bcf_int32_vector_end;
+ c = args.gt_arr[ngt*trio->ifather];
+ d = ngt==2 ? args.gt_arr[ngt*trio->ifather+1] : bcf_int32_vector_end;
+ e = args.gt_arr[ngt*trio->ichild];
+ f = ngt==2 ? args.gt_arr[ngt*trio->ichild+1] : bcf_int32_vector_end;
+
+ // skip sites with missing data in child
+ if ( bcf_gt_is_missing(e) || bcf_gt_is_missing(f) ) continue;
+
+ uint64_t mother = 0, father = 0,child1,child2;
+
+ int is_ok = 0;
+ if ( !itr_set )
+ {
+ if ( f==bcf_int32_vector_end ) { warn_ploidy(rec); continue; }
+
+ // All M,F,C genotypes are diploid. Missing data are considered consistent.
+ child1 = 1ULL<<bcf_gt_allele(e); // 64-bit shift: allele indexes can go up to 62
+ child2 = 1ULL<<bcf_gt_allele(f);
+ mother = bcf_gt_is_missing(a) ? child1|child2 : 1ULL<<bcf_gt_allele(a);
+ mother |= bcf_gt_is_missing(b) || b==bcf_int32_vector_end ? child1|child2 : 1ULL<<bcf_gt_allele(b);
+ father = bcf_gt_is_missing(c) ? child1|child2 : 1ULL<<bcf_gt_allele(c);
+ father |= bcf_gt_is_missing(d) || d==bcf_int32_vector_end ? child1|child2 : 1ULL<<bcf_gt_allele(d);
+
+ if ( (mother&child1 && father&child2) || (mother&child2 && father&child1) ) is_ok = 1;
+ }
+ else
+ {
+ child1 = 1ULL<<bcf_gt_allele(e); // 64-bit shift: allele indexes can go up to 62
+ child2 = bcf_gt_is_missing(f) || f==bcf_int32_vector_end ? 0 : 1ULL<<bcf_gt_allele(f);
+ mother |= bcf_gt_is_missing(a) ? 0 : 1ULL<<bcf_gt_allele(a);
+ mother |= bcf_gt_is_missing(b) || b==bcf_int32_vector_end ? 0 : 1ULL<<bcf_gt_allele(b);
+ father |= bcf_gt_is_missing(c) ? 0 : 1ULL<<bcf_gt_allele(c);
+ father |= bcf_gt_is_missing(d) || d==bcf_int32_vector_end ? 0 : 1ULL<<bcf_gt_allele(d);
+
+ regitr_copy(args.itr, args.itr_ori);
+ while ( !is_ok && regitr_overlap(args.itr) )
+ {
+ rule_t *rule = &regitr_payload(args.itr,rule_t);
+ if ( child1 && child2 )
+ {
+ if ( !rule->mal || !rule->fal ) continue; // wrong rule (haploid), but this is a diploid GT
+ if ( !mother ) mother = child1|child2;
+ if ( !father ) father = child1|child2;
+ if ( (mother&child1 && father&child2) || (mother&child2 && father&child1) ) is_ok = 1;
+ continue;
+ }
+ if ( rule->mal )
+ {
+ if ( mother && !(child1&mother) ) continue;
+ }
+ if ( rule->fal )
+ {
+ if ( father && !(child1&father) ) continue;
+ }
+ is_ok = 1;
+ }
+ }
+ if ( is_ok )
+ {
+ trio->nok++;
+ }
+ else
+ {
+ trio->nbad++;
+ has_bad = 1;
+ if ( args.mode&MODE_DELETE )
+ {
+ args.gt_arr[ngt*trio->imother] = bcf_gt_missing;
+ if ( b!=bcf_int32_vector_end ) args.gt_arr[ngt*trio->imother+1] = bcf_gt_missing; // should be always true
+ args.gt_arr[ngt*trio->ifather] = bcf_gt_missing;
+ if ( d!=bcf_int32_vector_end ) args.gt_arr[ngt*trio->ifather+1] = bcf_gt_missing;
+ args.gt_arr[ngt*trio->ichild] = bcf_gt_missing;
+ if ( f!=bcf_int32_vector_end ) args.gt_arr[ngt*trio->ichild+1] = bcf_gt_missing;
+ needs_update = 1;
+ }
+ }
+ }
+
+ if ( needs_update && bcf_update_genotypes(args.hdr,rec,args.gt_arr,ngt*bcf_hdr_nsamples(args.hdr)) )
+ error("Could not update GT field at %s:%d\n", bcf_seqname(args.hdr,rec),rec->pos+1);
+
+ if ( args.mode&MODE_DELETE ) return rec;
+ if ( args.mode&MODE_LIST_GOOD ) return has_bad ? NULL : rec;
+ if ( args.mode&MODE_LIST_BAD ) return has_bad ? rec : NULL;
+
+ return NULL;
+}
+
--- /dev/null
+#include "bcftools.pysam.h"
+
+/* The MIT License
+
+ Copyright (c) 2015 Genome Research Ltd.
+
+ Author: Petr Danecek <pd3@sanger.ac.uk>
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in
+ all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ THE SOFTWARE.
+
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <getopt.h>
+#include <math.h>
+#include <htslib/hts.h>
+#include <htslib/vcf.h>
+#include <htslib/synced_bcf_reader.h>
+#include <errno.h>
+#include <ctype.h>
+#include <unistd.h> // for isatty
+#include "bcftools.h"
+#include "regidx.h"
+
+#define MODE_COUNT 1
+#define MODE_LIST_GOOD 2
+#define MODE_LIST_BAD 4
+#define MODE_DELETE 8
+
+typedef struct
+{
+ int nok, nbad;
+ int imother,ifather,ichild;
+}
+trio_t;
+
+typedef struct
+{
+ int mpl, fpl, cpl; // ploidies - mother, father, child
+ int mal, fal; // expect an allele from mother and father
+}
+rule_t;
+
+typedef struct _args_t
+{
+ regidx_t *rules;
+ regitr_t *itr, *itr_ori;
+ bcf_hdr_t *hdr;
+ htsFile *out_fh;
+ int32_t *gt_arr;
+ int mode;
+ int ngt_arr, nrec;
+ trio_t *trios;
+ int ntrios;
+ int output_type;
+ char *output_fname;
+ bcf_srs_t *sr;
+}
+args_t;
+
+static args_t args;
+static int parse_rules(const char *line, char **chr_beg, char **chr_end, uint32_t *beg, uint32_t *end, void *payload, void *usr);
+static bcf1_t *process(bcf1_t *rec);
+
+const char *about(void)
+{
+ return "Count Mendelian consistent / inconsistent genotypes.\n";
+}
+
+typedef struct
+{
+ const char *alias, *about, *rules;
+}
+rules_predef_t;
+
+static rules_predef_t rules_predefs[] =
+{
+ { .alias = "GRCh37",
+ .about = "Human Genome reference assembly GRCh37 / hg19, both chr naming conventions",
+ .rules =
+ " X:1-60000 M/M + F > M\n"
+ " X:1-60000 M/M + F > M/F\n"
+ " X:2699521-154931043 M/M + F > M\n"
+ " X:2699521-154931043 M/M + F > M/F\n"
+ " Y:1-59373566 . + F > F\n"
+ " MT:1-16569 M + F > M\n"
+ "\n"
+ " chrX:1-60000 M/M + F > M\n"
+ " chrX:1-60000 M/M + F > M/F\n"
+ " chrX:2699521-154931043 M/M + F > M\n"
+ " chrX:2699521-154931043 M/M + F > M/F\n"
+ " chrY:1-59373566 . + F > F\n"
+ " chrM:1-16569 M + F > M\n"
+ },
+ { .alias = "GRCh38",
+ .about = "Human Genome reference assembly GRCh38 / hg38, both chr naming conventions",
+ .rules =
+ " X:1-9999 M/M + F > M\n"
+ " X:1-9999 M/M + F > M/F\n"
+ " X:2781480-155701381 M/M + F > M\n"
+ " X:2781480-155701381 M/M + F > M/F\n"
+ " Y:1-57227415 . + F > F\n"
+ " MT:1-16569 M + F > M\n"
+ "\n"
+ " chrX:1-9999 M/M + F > M\n"
+ " chrX:1-9999 M/M + F > M/F\n"
+ " chrX:2781480-155701381 M/M + F > M\n"
+ " chrX:2781480-155701381 M/M + F > M/F\n"
+ " chrY:1-57227415 . + F > F\n"
+ " chrM:1-16569 M + F > M\n"
+ },
+ {
+ .alias = NULL,
+ .about = NULL,
+ .rules = NULL,
+ }
+};
+
+
+const char *usage(void)
+{
+ return
+ "\n"
+ "About: Count Mendelian consistent / inconsistent genotypes.\n"
+ "Usage: bcftools +mendelian [Options]\n"
+ "Options:\n"
+ " -c, --count count the number of consistent sites\n"
+ " -d, --delete delete inconsistent genotypes (set to \"./.\")\n"
+ " -l, --list [+x] list consistent (+) or inconsistent (x) sites\n"
+ " -o, --output <file> write output to a file [standard output]\n"
+ " -O, --output-type <type> 'b' compressed BCF; 'u' uncompressed BCF; 'z' compressed VCF; 'v' uncompressed VCF [v]\n"
+ " -r, --rules <assembly>[?] predefined rules, 'list' to print available settings, append '?' for details\n"
+ " -R, --rules-file <file> inheritance rules, see example below\n"
+ " -t, --trio <m,f,c> names of mother, father and the child\n"
+ " -T, --trio-file <file> list of trios, one per line\n"
+ "\n"
+ "Example:\n"
+ " # Default inheritance patterns, override with -r\n"
+ " # region maternal_ploidy + paternal > offspring\n"
+ " X:1-60000 M/M + F > M\n"
+ " X:1-60000 M/M + F > M/F\n"
+ " X:2699521-154931043 M/M + F > M\n"
+ " X:2699521-154931043 M/M + F > M/F\n"
+ " Y:1-59373566 . + F > F\n"
+ " MT:1-16569 M + F > M\n"
+ "\n"
+ " bcftools +mendelian in.vcf -t Mother,Father,Child -c\n"
+ "\n";
+}
+
+regidx_t *init_rules(args_t *args, char *alias)
+{
+ const rules_predef_t *rules = rules_predefs;
+ if ( !alias ) alias = "GRCh37";
+
+ int detailed = 0, len = strlen(alias);
+ if ( alias[len-1]=='?' ) { detailed = 1; alias[len-1] = 0; }
+
+ while ( rules->alias && strcasecmp(alias,rules->alias) ) rules++;
+
+ if ( !rules->alias )
+ {
+ fprintf(bcftools_stderr,"\nPRE-DEFINED INHERITANCE RULES\n\n");
+ fprintf(bcftools_stderr," * Columns are: CHROM:BEG-END MATERNAL_PLOIDY + PATERNAL_PLOIDY > OFFSPRING\n");
+ fprintf(bcftools_stderr," * Coordinates are 1-based inclusive.\n\n");
+ rules = rules_predefs;
+ while ( rules->alias )
+ {
+ fprintf(bcftools_stderr,"%s\n .. %s\n\n", rules->alias,rules->about);
+ if ( detailed )
+ fprintf(bcftools_stderr,"%s\n", rules->rules);
+ rules++;
+ }
+ fprintf(bcftools_stderr,"Run as --rules <alias> (e.g. --rules GRCh37).\n");
+ fprintf(bcftools_stderr,"To see the detailed ploidy definition, append a question mark (e.g. --rules GRCh37?).\n");
+ fprintf(bcftools_stderr,"\n");
+ exit(-1);
+ }
+ else if ( detailed )
+ {
+ fprintf(bcftools_stderr,"%s", rules->rules);
+ exit(-1);
+ }
+ return regidx_init_string(rules->rules, parse_rules, NULL, sizeof(rule_t), args);
+}
+
+static int parse_rules(const char *line, char **chr_beg, char **chr_end, uint32_t *beg, uint32_t *end, void *payload, void *usr)
+{
+ // e.g. "Y:1-59373566 . + F > . # daughter"
+
+ // eat any leading spaces
+ char *ss = (char*) line;
+ while ( *ss && isspace(*ss) ) ss++;
+ if ( !*ss ) return -1; // skip empty lines
+
+ // chromosome name, beg, end
+ char *tmp, *se = ss;
+ while ( se[1] && !isspace(se[1]) ) se++;
+ while ( se > ss && isdigit(*se) ) se--;
+ if ( *se!='-' ) error("Could not parse the region: %s\n", line);
+ *end = strtol(se+1, &tmp, 10) - 1;
+ if ( tmp==se+1 ) error("Could not parse the region:%s\n",line);
+ while ( se > ss && *se!=':' ) se--;
+ *beg = strtol(se+1, &tmp, 10) - 1;
+ if ( tmp==se+1 ) error("Could not parse the region:%s\n",line);
+
+ *chr_beg = ss;
+ *chr_end = se-1;
+
+ // skip region
+ while ( *ss && !isspace(*ss) ) ss++;
+ while ( *ss && isspace(*ss) ) ss++;
+
+ rule_t *rule = (rule_t*) payload;
+ memset(rule, 0, sizeof(rule_t));
+
+ // maternal ploidy
+ se = ss;
+ while ( *se && !isspace(*se) ) se++;
+ int err = 0;
+ if ( se - ss == 1 )
+ {
+ if ( *ss=='M' ) rule->mpl = 1;
+ else if ( *ss=='.' ) rule->mpl = 0;
+ else err = 1;
+ }
+ else if ( se - ss == 3 )
+ {
+ if ( !strncmp(ss,"M/M",3) ) rule->mpl = 2;
+ else err = 1;
+ }
+ else err = 1;
+ if ( err ) error("Could not parse the maternal ploidy, only \"M\", \"M/M\" and \".\" are currently supported: %s\n",line);
+
+ // skip "+"
+ while ( *se && isspace(*se) ) se++;
+ if ( *se != '+' ) error("Could not parse the line: %s\n",line);
+ se++;
+ while ( *se && isspace(*se) ) se++;
+
+ // paternal ploidy
+ ss = se;
+ while ( *se && !isspace(*se) ) se++;
+ if ( se - ss == 1 )
+ {
+ if ( *ss=='F' ) rule->fpl = 1;
+ else err = 1;
+ }
+ else err = 1;
+ if ( err ) error("Could not parse the paternal ploidy, only \"F\" is currently supported: %s [%s]\n",line, ss);
+
+ // skip ">"
+ while ( *se && isspace(*se) ) se++;
+ if ( *se != '>' ) error("Could not parse the line: %s\n",line);
+ se++;
+ while ( *se && isspace(*se) ) se++;
+
+ // ploidy in offspring
+ ss = se;
+ while ( *se && !isspace(*se) ) se++;
+ if ( se - ss == 3 )
+ {
+ if ( !strncmp(ss,"M/F",3) ) { rule->cpl = 2; rule->fal = 1; rule->mal = 1; }
+ else err = 1;
+ }
+ else if ( se - ss == 1 )
+ {
+ if ( *ss=='F' ) { rule->cpl = 1; rule->fal = 1; }
+ else if ( *ss=='M' ) { rule->cpl = 1; rule->mal = 1; }
+ else err = 1;
+ }
+ else err = 1;
+ if ( err ) error("Could not parse the offspring's ploidy, only \"M\", \"F\" or \"M/F\" is currently supported: %s\n",line);
+
+ return 0;
+}
+
+int run(int argc, char **argv)
+{
+ char *trio_samples = NULL, *trio_file = NULL, *rules_fname = NULL, *rules_string = NULL;
+ memset(&args,0,sizeof(args_t));
+ args.mode = 0;
+ args.output_fname = "-";
+
+ static struct option loptions[] =
+ {
+ {"trio",1,0,'t'},
+ {"trio-file",1,0,'T'},
+ {"delete",0,0,'d'},
+ {"list",1,0,'l'},
+ {"count",0,0,'c'},
+ {"rules",1,0,'r'},
+ {"rules-file",1,0,'R'},
+ {"output",required_argument,NULL,'o'},
+ {"output-type",required_argument,NULL,'O'},
+ {0,0,0,0}
+ };
+ int c;
+ while ((c = getopt_long(argc, argv, "?ht:T:l:cdr:R:o:O:",loptions,NULL)) >= 0)
+ {
+ switch (c)
+ {
+ case 'o': args.output_fname = optarg; break;
+ case 'O':
+ switch (optarg[0]) {
+ case 'b': args.output_type = FT_BCF_GZ; break;
+ case 'u': args.output_type = FT_BCF; break;
+ case 'z': args.output_type = FT_VCF_GZ; break;
+ case 'v': args.output_type = FT_VCF; break;
+ default: error("The output type \"%s\" not recognised\n", optarg);
+ };
+ break;
+ case 'R': rules_fname = optarg; break;
+ case 'r': rules_string = optarg; break;
+ case 'd': args.mode |= MODE_DELETE; break;
+ case 'c': args.mode |= MODE_COUNT; break;
+ case 'l':
+ if ( !strcmp("+",optarg) ) args.mode |= MODE_LIST_GOOD;
+ else if ( !strcmp("x",optarg) ) args.mode |= MODE_LIST_BAD;
+ else error("The argument not recognised: --list %s\n", optarg);
+ break;
+ case 't': trio_samples = optarg; break;
+ case 'T': trio_file = optarg; break;
+ case 'h':
+ case '?':
+ default: error("%s",usage()); break;
+ }
+ }
+ if ( rules_fname )
+ args.rules = regidx_init(rules_fname, parse_rules, NULL, sizeof(rule_t), &args);
+ else
+ args.rules = init_rules(&args, rules_string);
+ if ( !args.rules ) return -1;
+ args.itr = regitr_init(args.rules);
+ args.itr_ori = regitr_init(args.rules);
+
+ char *fname = NULL;
+ if ( optind>=argc || argv[optind][0]=='-' )
+ {
+ if ( !isatty(fileno((FILE *)stdin)) ) fname = "-"; // reading from stdin
+ else error("%s",usage());
+ }
+ else
+ fname = argv[optind];
+
+ if ( !trio_samples && !trio_file ) error("Expected the -t/T option\n");
+ if ( !args.mode ) error("Expected one of the -c, -d or -l options\n");
+ if ( args.mode&MODE_DELETE && !(args.mode&(MODE_LIST_GOOD|MODE_LIST_BAD)) ) args.mode |= MODE_LIST_GOOD|MODE_LIST_BAD;
+
+ args.sr = bcf_sr_init();
+ if ( !bcf_sr_add_reader(args.sr, fname) ) error("Failed to open %s: %s\n", fname,bcf_sr_strerror(args.sr->errnum));
+ args.hdr = bcf_sr_get_header(args.sr, 0);
+ args.out_fh = hts_open(args.output_fname,hts_bcf_wmode(args.output_type));
+ if ( args.out_fh == NULL ) error("Can't write to \"%s\": %s\n", args.output_fname, strerror(errno));
+ bcf_hdr_write(args.out_fh, args.hdr);
+
+
+ int i, n = 0;
+ char **list;
+ if ( trio_samples )
+ {
+ args.ntrios = 1;
+ args.trios = (trio_t*) calloc(1,sizeof(trio_t));
+ list = hts_readlist(trio_samples, 0, &n);
+ if ( n!=3 ) error("Expected three sample names with -t\n");
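+ args.trios[0].imother = bcf_hdr_id2int(args.hdr, BCF_DT_SAMPLE, list[0]);
+ if ( args.trios[0].imother<0 ) error("No such sample: \"%s\"\n", list[0]);
+ args.trios[0].ifather = bcf_hdr_id2int(args.hdr, BCF_DT_SAMPLE, list[1]);
+ if ( args.trios[0].ifather<0 ) error("No such sample: \"%s\"\n", list[1]);
+ args.trios[0].ichild = bcf_hdr_id2int(args.hdr, BCF_DT_SAMPLE, list[2]);
+ if ( args.trios[0].ichild<0 ) error("No such sample: \"%s\"\n", list[2]);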
+ for (i=0; i<n; i++) free(list[i]);
+ free(list);
+ }
+ if ( trio_file )
+ {
+ list = hts_readlist(trio_file, 1, &n);
+ args.ntrios = n;
+ args.trios = (trio_t*) calloc(n,sizeof(trio_t));
+ for (i=0; i<n; i++)
+ {
+ char *ss = list[i], *se;
+ se = strchr(ss, ',');
+ if ( !se ) error("Could not parse %s: %s\n",trio_file, ss);
+ *se = 0;
+ args.trios[i].imother = bcf_hdr_id2int(args.hdr, BCF_DT_SAMPLE, ss);
+ if ( args.trios[i].imother<0 ) error("No such sample: \"%s\"\n", ss);
+ ss = ++se;
+ se = strchr(ss, ',');
+ if ( !se ) error("Could not parse %s\n",trio_file);
+ *se = 0;
+ args.trios[i].ifather = bcf_hdr_id2int(args.hdr, BCF_DT_SAMPLE, ss);
+ if ( args.trios[i].ifather<0 ) error("No such sample: \"%s\"\n", ss);
+ ss = ++se;
+ if ( *ss=='\0' ) error("Could not parse %s\n",trio_file);
+ args.trios[i].ichild = bcf_hdr_id2int(args.hdr, BCF_DT_SAMPLE, ss);
+ if ( args.trios[i].ichild<0 ) error("No such sample: \"%s\"\n", ss);
+ free(list[i]);
+ }
+ free(list);
+ }
+
+ while ( bcf_sr_next_line(args.sr) )
+ {
+ bcf1_t *line = bcf_sr_get_line(args.sr,0);
+ line = process(line);
+ if ( line )
+ {
+ if ( line->errcode ) error("TODO: Unchecked error (%d), exiting\n",line->errcode);
+ bcf_write1(args.out_fh, args.hdr, line);
+ }
+ }
+
+
+ fprintf(bcftools_stderr,"# [1]nOK\t[2]nBad\t[3]nSkipped\t[4]Trio\n");
+ for (i=0; i<args.ntrios; i++)
+ {
+ trio_t *trio = &args.trios[i];
+ fprintf(bcftools_stderr,"%d\t%d\t%d\t%s,%s,%s\n",
+ trio->nok,trio->nbad,args.nrec-(trio->nok+trio->nbad),
+ bcf_hdr_int2id(args.hdr, BCF_DT_SAMPLE, trio->imother),
+ bcf_hdr_int2id(args.hdr, BCF_DT_SAMPLE, trio->ifather),
+ bcf_hdr_int2id(args.hdr, BCF_DT_SAMPLE, trio->ichild)
+ );
+ }
+ free(args.gt_arr);
+ free(args.trios);
+ regitr_destroy(args.itr);
+ regitr_destroy(args.itr_ori);
+ regidx_destroy(args.rules);
+ bcf_sr_destroy(args.sr);
+ if ( hts_close(args.out_fh)!=0 ) error("Error: close failed\n");
+ return 0;
+}
+
+static void warn_ploidy(bcf1_t *rec)
+{
+ static int warned = 0;
+ if ( warned ) return;
+ fprintf(bcftools_stderr,"Incorrect ploidy at %s:%d, skipping the trio. (This warning is printed only once.)\n", bcf_seqname(args.hdr,rec),rec->pos+1);
+ warned = 1;
+}
+
+bcf1_t *process(bcf1_t *rec)
+{
+ bcf1_t *dflt = args.mode&MODE_LIST_GOOD ? rec : NULL;
+ args.nrec++;
+
+ if ( rec->n_allele > 63 ) return dflt; // we use 64bit bitmask below
+
+ int ngt = bcf_get_genotypes(args.hdr, rec, &args.gt_arr, &args.ngt_arr);
+ if ( ngt<0 ) return dflt;
+ if ( ngt!=2*bcf_hdr_nsamples(args.hdr) && ngt!=bcf_hdr_nsamples(args.hdr) ) return dflt;
+ ngt /= bcf_hdr_nsamples(args.hdr);
+
+ int itr_set = regidx_overlap(args.rules, bcf_seqname(args.hdr,rec),rec->pos,rec->pos, args.itr_ori);
+
+ int i, has_bad = 0, needs_update = 0;
+ for (i=0; i<args.ntrios; i++)
+ {
+ int32_t a,b,c,d,e,f;
+ trio_t *trio = &args.trios[i];
+
+ a = args.gt_arr[ngt*trio->imother];
+ b = ngt==2 ? args.gt_arr[ngt*trio->imother+1] : bcf_int32_vector_end;
+ c = args.gt_arr[ngt*trio->ifather];
+ d = ngt==2 ? args.gt_arr[ngt*trio->ifather+1] : bcf_int32_vector_end;
+ e = args.gt_arr[ngt*trio->ichild];
+ f = ngt==2 ? args.gt_arr[ngt*trio->ichild+1] : bcf_int32_vector_end;
+
+ // skip sites with missing data in child
+ if ( bcf_gt_is_missing(e) || bcf_gt_is_missing(f) ) continue;
+
+ uint64_t mother = 0, father = 0,child1,child2;
+
+ int is_ok = 0;
+ if ( !itr_set )
+ {
+ if ( f==bcf_int32_vector_end ) { warn_ploidy(rec); continue; }
+
+ // All M,F,C genotypes are diploid. Missing data are considered consistent.
+ child1 = 1ULL<<bcf_gt_allele(e); // 64-bit shift: allele indexes can go up to 62
+ child2 = 1ULL<<bcf_gt_allele(f);
+ mother = bcf_gt_is_missing(a) ? child1|child2 : 1ULL<<bcf_gt_allele(a);
+ mother |= bcf_gt_is_missing(b) || b==bcf_int32_vector_end ? child1|child2 : 1ULL<<bcf_gt_allele(b);
+ father = bcf_gt_is_missing(c) ? child1|child2 : 1ULL<<bcf_gt_allele(c);
+ father |= bcf_gt_is_missing(d) || d==bcf_int32_vector_end ? child1|child2 : 1ULL<<bcf_gt_allele(d);
+
+ if ( (mother&child1 && father&child2) || (mother&child2 && father&child1) ) is_ok = 1;
+ }
+ else
+ {
+ child1 = 1ULL<<bcf_gt_allele(e); // 64-bit shift: allele indexes can go up to 62
+ child2 = bcf_gt_is_missing(f) || f==bcf_int32_vector_end ? 0 : 1ULL<<bcf_gt_allele(f);
+ mother |= bcf_gt_is_missing(a) ? 0 : 1ULL<<bcf_gt_allele(a);
+ mother |= bcf_gt_is_missing(b) || b==bcf_int32_vector_end ? 0 : 1ULL<<bcf_gt_allele(b);
+ father |= bcf_gt_is_missing(c) ? 0 : 1ULL<<bcf_gt_allele(c);
+ father |= bcf_gt_is_missing(d) || d==bcf_int32_vector_end ? 0 : 1ULL<<bcf_gt_allele(d);
+
+ regitr_copy(args.itr, args.itr_ori);
+ while ( !is_ok && regitr_overlap(args.itr) )
+ {
+ rule_t *rule = &regitr_payload(args.itr,rule_t);
+ if ( child1 && child2 )
+ {
+ if ( !rule->mal || !rule->fal ) continue; // wrong rule (haploid), but this is a diploid GT
+ if ( !mother ) mother = child1|child2;
+ if ( !father ) father = child1|child2;
+ if ( (mother&child1 && father&child2) || (mother&child2 && father&child1) ) is_ok = 1;
+ continue;
+ }
+ if ( rule->mal )
+ {
+ if ( mother && !(child1&mother) ) continue;
+ }
+ if ( rule->fal )
+ {
+ if ( father && !(child1&father) ) continue;
+ }
+ is_ok = 1;
+ }
+ }
+ if ( is_ok )
+ {
+ trio->nok++;
+ }
+ else
+ {
+ trio->nbad++;
+ has_bad = 1;
+ if ( args.mode&MODE_DELETE )
+ {
+ args.gt_arr[ngt*trio->imother] = bcf_gt_missing;
+ if ( b!=bcf_int32_vector_end ) args.gt_arr[ngt*trio->imother+1] = bcf_gt_missing; // should be always true
+ args.gt_arr[ngt*trio->ifather] = bcf_gt_missing;
+ if ( d!=bcf_int32_vector_end ) args.gt_arr[ngt*trio->ifather+1] = bcf_gt_missing;
+ args.gt_arr[ngt*trio->ichild] = bcf_gt_missing;
+ if ( f!=bcf_int32_vector_end ) args.gt_arr[ngt*trio->ichild+1] = bcf_gt_missing;
+ needs_update = 1;
+ }
+ }
+ }
+
+ if ( needs_update && bcf_update_genotypes(args.hdr,rec,args.gt_arr,ngt*bcf_hdr_nsamples(args.hdr)) )
+ error("Could not update GT field at %s:%d\n", bcf_seqname(args.hdr,rec),rec->pos+1);
+
+ if ( args.mode&MODE_DELETE ) return rec;
+ if ( args.mode&MODE_LIST_GOOD ) return has_bad ? NULL : rec;
+ if ( args.mode&MODE_LIST_BAD ) return has_bad ? rec : NULL;
+
+ return NULL;
+}
+
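Note on the diploid check in the `!itr_set` branch above: the test can be read as a bitmask intersection between the child's alleles and each parent's alleles. The block below is a minimal, self-contained sketch of that idea, not the plugin's code. The names `trio_is_consistent` and `allele_mask` are hypothetical, alleles are passed as plain 0-based indices with a negative value standing in for a missing allele, and a missing child allele is simply reported as consistent (the plugin skips such trios instead).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: one bit per allele index; a negative index means
   "missing" and yields the caller-supplied fallback mask instead. */
static uint64_t allele_mask(int allele, uint64_t if_missing)
{
    assert(allele < 64);                 /* the mask has room for 64 alleles only */
    return allele < 0 ? if_missing : (uint64_t)1 << allele;
}

/* Returns 1 if the diploid child genotype (c1,c2) can be explained by one
   allele from the mother (m1,m2) and one from the father (f1,f2), else 0. */
int trio_is_consistent(int m1, int m2, int f1, int f2, int c1, int c2)
{
    if ( c1 < 0 || c2 < 0 ) return 1;    /* assumption: nothing to test without a full child GT */

    uint64_t child1 = allele_mask(c1, 0);
    uint64_t child2 = allele_mask(c2, 0);

    /* A missing parental allele is treated as compatible with either child allele */
    uint64_t mother = allele_mask(m1, child1|child2) | allele_mask(m2, child1|child2);
    uint64_t father = allele_mask(f1, child1|child2) | allele_mask(f2, child1|child2);

    return ((mother & child1) && (father & child2)) ||
           ((mother & child2) && (father & child1));
}
```

For example, `trio_is_consistent(0,0, 1,1, 0,1)` evaluates to 1, while `trio_is_consistent(0,0, 0,0, 0,1)` evaluates to 0 because neither parent carries allele 1.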
--- /dev/null
+/* plugins/missing2ref.c -- sets missing genotypes to reference allele.
+
+ Copyright (C) 2014 Genome Research Ltd.
+
+ Author: Petr Danecek <pd3@sanger.ac.uk>
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE. */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <htslib/vcf.h>
+#include <htslib/vcfutils.h>
+#include <inttypes.h>
+#include <getopt.h>
+
+bcf_hdr_t *in_hdr, *out_hdr;
+int32_t *gts = NULL, mgts = 0;
+int *arr = NULL, marr = 0;
+uint64_t nchanged = 0;
+int new_gt = bcf_gt_unphased(0);
+int use_major = 0;
+
+const char *about(void)
+{
+ return "Set missing genotypes (\"./.\") to ref or major allele (\"0/0\" or \"0|0\").\n";
+}
+
+const char *usage(void)
+{
+ return
+ "\n"
+ "About: Set missing genotypes\n"
+ "Usage: bcftools +missing2ref [General Options] -- [Plugin Options]\n"
+ "Options:\n"
+ " run \"bcftools plugin\" for a list of common options\n"
+ "\n"
+ "Plugin options:\n"
+ " -p, --phased Set to \"0|0\" \n"
+ " -m, --major Set to major allele \n"
+ "\n"
+ "Example:\n"
+ " bcftools +missing2ref in.vcf -- -p\n"
+ " bcftools +missing2ref in.vcf -- -p -m\n"
+ "\n";
+}
+
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+ int c;
+ static struct option loptions[] =
+ {
+ {"phased",0,0,'p'},
+ {"major",0,0,'m'},
+ {0,0,0,0}
+ };
+ while ((c = getopt_long(argc, argv, "mp?h",loptions,NULL)) >= 0)
+ {
+ switch (c)
+ {
+ case 'p': new_gt = bcf_gt_phased(0); break;
+ case 'm': use_major = 1; break;
+ case 'h':
+ case '?':
+ default: fprintf(stderr,"%s", usage()); exit(1); break;
+ }
+ }
+ in_hdr = in;
+ out_hdr = out;
+ return 0;
+}
+
+bcf1_t *process(bcf1_t *rec)
+{
+ int ngts = bcf_get_genotypes(in_hdr, rec, &gts, &mgts);
+ int i, changed = 0;
+
+ // Calculating allele frequency for each allele and determining major allele
+ // only do this if use_major is true
+ int majorAllele = -1;
+ int maxAC = -1;
+ int an = 0;
+ if(use_major == 1){
+ hts_expand(int,rec->n_allele,marr,arr);
+ int ret = bcf_calc_ac(in_hdr,rec,arr,BCF_UN_FMT);
+ if(ret > 0){
+ for(i=0; i < rec->n_allele; ++i){
+ an += arr[i];
+ if(*(arr+i) > maxAC){
+ maxAC = *(arr+i);
+ majorAllele = i;
+ }
+ }
+ }
+ else{
+ fprintf(stderr,"Warning: Could not calculate allele count at position %d\n", rec->pos);
+ exit(1);
+ }
+
+ // replacing new_gt by major allele
+ if(bcf_gt_is_phased(new_gt))
+ new_gt = bcf_gt_phased(majorAllele);
+ else
+ new_gt = bcf_gt_unphased(majorAllele);
+ }
+
+ // replace gts
+ for (i=0; i<ngts; i++)
+ {
+ if ( gts[i]==bcf_gt_missing )
+ {
+ gts[i] = new_gt;
+ changed++;
+ }
+ }
+ nchanged += changed;
+ if ( changed ) bcf_update_genotypes(out_hdr, rec, gts, ngts);
+ return rec;
+}
+
+void destroy(void)
+{
+ free(arr);
+ fprintf(stderr,"Filled %"PRIu64" REF alleles\n", nchanged);
+ free(gts);
+}
+
+
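Note on `process()` in missing2ref.c above: with `-m`, the plugin picks the allele with the highest count and writes it into every missing genotype slot. The block below is a rough, self-contained sketch of those two steps without htslib. The `GT_*` macros are local stand-ins that mirror htslib's genotype encoding, `major_allele` and `fill_missing` are hypothetical names, and the allele counting is a simplified replacement for `bcf_calc_ac` that assumes at most 64 alleles.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins mirroring the htslib GT encoding: (allele+1)<<1 | phased */
#define GT_MISSING        0
#define GT_UNPHASED(idx)  (((idx)+1)<<1)
#define GT_PHASED(idx)    ((((idx)+1)<<1)|1)
#define GT_IS_MISSING(v)  (((v)>>1) ? 0 : 1)
#define GT_ALLELE(v)      (((v)>>1)-1)

/* Count called alleles and return the index of the most frequent one.
   Ties go to the lowest index, so the reference wins when nothing is called. */
int major_allele(const int32_t *gts, int ngts, int n_allele)
{
    assert(n_allele > 0 && n_allele <= 64);
    int counts[64] = {0};
    int i, best = 0;
    for (i=0; i<ngts; i++)
    {
        if ( GT_IS_MISSING(gts[i]) ) continue;
        int al = GT_ALLELE(gts[i]);
        if ( al < 0 || al >= n_allele ) continue;   /* padding or out-of-range value */
        counts[al]++;
    }
    for (i=1; i<n_allele; i++)
        if ( counts[i] > counts[best] ) best = i;
    return best;
}

/* Replace every unphased-missing slot with new_gt; returns the number of changes */
int fill_missing(int32_t *gts, int ngts, int32_t new_gt)
{
    int i, changed = 0;
    for (i=0; i<ngts; i++)
    {
        if ( gts[i]!=GT_MISSING ) continue;
        gts[i] = new_gt;
        changed++;
    }
    return changed;
}
```

A caller would first compute `int major = major_allele(gts, ngts, n_allele);` and then call `fill_missing(gts, ngts, GT_UNPHASED(major))`, or use `GT_PHASED(major)` for the `-p` behaviour.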
--- /dev/null
+#include "bcftools.pysam.h"
+
+/* plugins/missing2ref.c -- sets missing genotypes to reference allele.
+
+ Copyright (C) 2014 Genome Research Ltd.
+
+ Author: Petr Danecek <pd3@sanger.ac.uk>
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE. */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <htslib/vcf.h>
+#include <htslib/vcfutils.h>
+#include <inttypes.h>
+#include <getopt.h>
+
+bcf_hdr_t *in_hdr, *out_hdr;
+int32_t *gts = NULL, mgts = 0;
+int *arr = NULL, marr = 0;
+uint64_t nchanged = 0;
+int new_gt = bcf_gt_unphased(0);
+int use_major = 0;
+
+const char *about(void)
+{
+ return "Set missing genotypes (\"./.\") to ref or major allele (\"0/0\" or \"0|0\").\n";
+}
+
+const char *usage(void)
+{
+ return
+ "\n"
+ "About: Set missing genotypes\n"
+ "Usage: bcftools +missing2ref [General Options] -- [Plugin Options]\n"
+ "Options:\n"
+ " run \"bcftools plugin\" for a list of common options\n"
+ "\n"
+ "Plugin options:\n"
+ " -p, --phased Set to \"0|0\" \n"
+ " -m, --major Set to major allele \n"
+ "\n"
+ "Example:\n"
+ " bcftools +missing2ref in.vcf -- -p\n"
+ " bcftools +missing2ref in.vcf -- -p -m\n"
+ "\n";
+}
+
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+ int c;
+ static struct option loptions[] =
+ {
+ {"phased",0,0,'p'},
+ {"major",0,0,'m'},
+ {0,0,0,0}
+ };
+ while ((c = getopt_long(argc, argv, "mp?h",loptions,NULL)) >= 0)
+ {
+ switch (c)
+ {
+ case 'p': new_gt = bcf_gt_phased(0); break;
+ case 'm': use_major = 1; break;
+ case 'h':
+ case '?':
+ default: fprintf(bcftools_stderr,"%s", usage()); exit(1); break;
+ }
+ }
+ in_hdr = in;
+ out_hdr = out;
+ return 0;
+}
+
+bcf1_t *process(bcf1_t *rec)
+{
+ int ngts = bcf_get_genotypes(in_hdr, rec, &gts, &mgts);
+ int i, changed = 0;
+
+ // Calculating allele frequency for each allele and determining major allele
+ // only do this if use_major is true
+ int majorAllele = -1;
+ int maxAC = -1;
+ int an = 0;
+ if(use_major == 1){
+ hts_expand(int,rec->n_allele,marr,arr);
+ int ret = bcf_calc_ac(in_hdr,rec,arr,BCF_UN_FMT);
+ if(ret > 0){
+ for(i=0; i < rec->n_allele; ++i){
+ an += arr[i];
+ if(*(arr+i) > maxAC){
+ maxAC = *(arr+i);
+ majorAllele = i;
+ }
+ }
+ }
+ else{
+ fprintf(bcftools_stderr,"Warning: Could not calculate allele count at position %d\n", rec->pos);
+ exit(1);
+ }
+
+ // replacing new_gt by major allele
+ if(bcf_gt_is_phased(new_gt))
+ new_gt = bcf_gt_phased(majorAllele);
+ else
+ new_gt = bcf_gt_unphased(majorAllele);
+ }
+
+ // replace gts
+ for (i=0; i<ngts; i++)
+ {
+ if ( gts[i]==bcf_gt_missing )
+ {
+ gts[i] = new_gt;
+ changed++;
+ }
+ }
+ nchanged += changed;
+ if ( changed ) bcf_update_genotypes(out_hdr, rec, gts, ngts);
+ return rec;
+}
+
+void destroy(void)
+{
+ free(arr);
+ fprintf(bcftools_stderr,"Filled %"PRIu64" REF alleles\n", nchanged);
+ free(gts);
+}
+
+
--- /dev/null
+/*
+ Copyright (C) 2017 Genome Research Ltd.
+
+ Author: Petr Danecek <pd3@sanger.ac.uk>
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in
+ all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ THE SOFTWARE.
+*/
+/*
+ Prune sites by missingness, LD
+
+ See calc_ld() in vcfbuf.c for the actual LD calculation
+
+*/
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <strings.h>
+#include <getopt.h>
+#include <stdarg.h>
+#include <unistd.h>
+#include <stdint.h>
+#include <errno.h>
+#include <htslib/vcf.h>
+#include <htslib/synced_bcf_reader.h>
+#include <htslib/vcfutils.h>
+#include "bcftools.h"
+#include "vcfbuf.h"
+#include "filter.h"
+
+#define FLT_INCLUDE 1
+#define FLT_EXCLUDE 2
+
+typedef struct
+{
+ filter_t *filter;
+ char *filter_str, *af_tag;
+ int filter_logic; // one of FLT_INCLUDE/FLT_EXCLUDE (-i or -e)
+ vcfbuf_t *vcfbuf;
+ int argc, region_is_file, target_is_file, output_type, filter_r2_id, rand_missing, nsites, ld_win;
+ char **argv, *region, *target, *fname, *output_fname, *info_pos, *info_r2, *filter_r2;
+ htsFile *out_fh;
+ bcf_hdr_t *hdr;
+ bcf_srs_t *sr;
+ double max_ld;
+}
+args_t;
+
+const char *about(void)
+{
+ return "Prune sites by missingness, linkage disequilibrium\n";
+}
+
+static const char *usage_text(void)
+{
+ return
+ "\n"
+ "About: Prune sites by missingness or linkage disequilibrium.\n"
+ "\n"
+ "Usage: bcftools +prune [Options]\n"
+ "Plugin options:\n"
+ " --AF-tag STR use this tag with -n to determine allele frequency\n"
+ " -a, --annotate-info STR add INFO/STR_POS and STR_R2 annotation: an upstream site with the biggest r2 value\n"
+ " -e, --exclude EXPR exclude sites for which the expression is true\n"
+ " -f, --set-filter STR annotate FILTER column with STR instead of discarding the site\n"
+ " -i, --include EXPR include only sites for which the expression is true\n"
+ " -l, --max-LD R2 remove sites with r2 bigger than R2 within the -w window\n"
+ " -n, --nsites-per-win N keep at most N sites in the -w window, removing sites with small AF first\n"
+ " -o, --output FILE write output to the FILE [standard output]\n"
+ " -O, --output-type b|u|z|v b: compressed BCF, u: uncompressed BCF, z: compressed VCF, v: uncompressed VCF [v]\n"
+ " --randomize-missing replace missing data with randomly assigned genotype based on site's allele frequency\n"
+ " -r, --regions REGION restrict to comma-separated list of regions\n"
+ " -R, --regions-file FILE restrict to regions listed in a file\n"
+ " -t, --targets REGION similar to -r but streams rather than index-jumps\n"
+ " -T, --targets-file FILE similar to -R but streams rather than index-jumps\n"
+ " -w, --window INT[bp|kb] the window size of INT sites/bp/kb for the -n/-l options [100kb]\n"
+ "Examples:\n"
+ " # Discard records with r2 bigger than 0.6 in a window of 1000 sites\n"
+ " bcftools +prune -l 0.6 -w 1000 input.bcf -Ob -o output.bcf\n"
+ "\n"
+ " # Set FILTER (but do not discard) records with r2 bigger than 0.4 in the default window of 100kb\n"
+ " bcftools +prune -l 0.4 -f MAX_R2 input.bcf -Ob -o output.bcf\n"
+ "\n"
+ " # Annotate INFO field of all records with maximum r2 in a window of 1000 sites\n"
+ " bcftools +prune -l 0.6 -w 1000 -f MAX_R2 input.bcf -Ob -o output.bcf\n"
+ "\n"
+ " # Discard records with r2 bigger than 0.6, first removing records with more than 2% of genotypes missing\n"
+ " bcftools +prune -l 0.6 -e'F_MISSING>=0.02' input.bcf -Ob -o output.bcf\n"
+ "\n";
+}
+
+static void init_data(args_t *args)
+{
+ args->sr = bcf_sr_init();
+ if ( args->region )
+ {
+ args->sr->require_index = 1;
+ if ( bcf_sr_set_regions(args->sr, args->region, args->region_is_file)<0 ) error("Failed to read the regions: %s\n",args->region);
+ }
+ if ( args->target && bcf_sr_set_targets(args->sr, args->target, args->target_is_file, 0)<0 ) error("Failed to read the targets: %s\n",args->target);
+ if ( !bcf_sr_add_reader(args->sr,args->fname) ) error("Error: %s\n", bcf_sr_strerror(args->sr->errnum));
+ args->hdr = bcf_sr_get_header(args->sr,0);
+
+ args->out_fh = hts_open(args->output_fname,hts_bcf_wmode(args->output_type));
+ if ( args->out_fh == NULL ) error("Can't write to \"%s\": %s\n", args->output_fname, strerror(errno));
+ if ( args->filter_r2 )
+ {
+ bcf_hdr_printf(args->hdr,"##FILTER=<ID=%s,Description=\"A site with r2>%e upstream within %d%s\">",args->filter_r2,args->max_ld,
+ args->ld_win < 0 ? -args->ld_win/1000 : args->ld_win,
+ args->ld_win < 0 ? "kb" : " sites");
+ }
+ if ( args->info_r2 )
+ {
+ bcf_hdr_printf(args->hdr,"##INFO=<ID=%s,Number=1,Type=Integer,Description=\"A site with r2>%e upstream\">",args->info_pos,args->max_ld);
+ bcf_hdr_printf(args->hdr,"##INFO=<ID=%s,Number=1,Type=Float,Description=\"A site with r2>%e upstream\">",args->info_r2,args->max_ld);
+ }
+ bcf_hdr_write(args->out_fh, args->hdr);
+ if ( args->filter_r2 )
+ args->filter_r2_id = bcf_hdr_id2int(args->hdr, BCF_DT_ID, args->filter_r2);
+
+ args->vcfbuf = vcfbuf_init(args->hdr, args->ld_win);
+ vcfbuf_set_opt(args->vcfbuf,double,VCFBUF_LD_MAX,args->max_ld);
+ if ( args->nsites ) vcfbuf_set_opt(args->vcfbuf,int,VCFBUF_NSITES,args->nsites);
+ if ( args->af_tag ) vcfbuf_set_opt(args->vcfbuf,char*,VCFBUF_AF_TAG,args->af_tag);
+ if ( args->rand_missing ) vcfbuf_set_opt(args->vcfbuf,int,VCFBUF_RAND_MISSING,1);
+ vcfbuf_set_opt(args->vcfbuf,int,VCFBUF_SKIP_FILTER,args->filter_r2 ? 1 : 0);
+
+ if ( args->filter_str )
+ args->filter = filter_init(args->hdr, args->filter_str);
+}
+static void destroy_data(args_t *args)
+{
+ if ( args->filter )
+ filter_destroy(args->filter);
+ hts_close(args->out_fh);
+ vcfbuf_destroy(args->vcfbuf);
+ bcf_sr_destroy(args->sr);
+ free(args->info_pos);
+ free(args->info_r2);
+ free(args);
+}
+static void flush(args_t *args, int flush_all)
+{
+ bcf1_t *rec;
+ while ( (rec = vcfbuf_flush(args->vcfbuf, flush_all)) )
+ bcf_write1(args->out_fh, args->hdr, rec);
+}
+static void process(args_t *args)
+{
+ bcf1_t *rec = bcf_sr_get_line(args->sr,0);
+ if ( args->filter )
+ {
+ int ret = filter_test(args->filter, rec, NULL);
+ if ( args->filter_logic==FLT_INCLUDE ) { if ( !ret ) return; }
+ else if ( ret ) return;
+ }
+ bcf_sr_t *sr = bcf_sr_get_reader(args->sr, 0);
+ if ( args->max_ld )
+ {
+ double ld_val;
+ bcf1_t *ld_rec = vcfbuf_max_ld(args->vcfbuf, rec, &ld_val);
+ if ( ld_rec && ld_val > args->max_ld )
+ {
+ if ( !args->filter_r2 ) return;
+ bcf_add_filter(args->hdr, rec, args->filter_r2_id);
+ }
+ if ( ld_rec && args->info_r2 )
+ {
+ float tmp = ld_val;
+ int32_t tmp_pos = ld_rec->pos + 1;
+ bcf_update_info_float(args->hdr, rec, args->info_r2, &tmp, 1);
+ bcf_update_info_int32(args->hdr, rec, args->info_pos, &tmp_pos, 1);
+ }
+ }
+ sr->buffer[0] = vcfbuf_push(args->vcfbuf, rec, 1);
+ flush(args,0);
+}
+
+int run(int argc, char **argv)
+{
+ args_t *args = (args_t*) calloc(1,sizeof(args_t));
+ args->argc = argc; args->argv = argv;
+ args->output_type = FT_VCF;
+ args->output_fname = "-";
+ args->ld_win = -100e3;
+ static struct option loptions[] =
+ {
+ {"randomize-missing",no_argument,NULL,1},
+ {"AF-tag",required_argument,NULL,2},
+ {"exclude",required_argument,NULL,'e'},
+ {"include",required_argument,NULL,'i'},
+ {"annotate-info",required_argument,NULL,'a'},
+ {"set-filter",required_argument,NULL,'f'},
+ {"max-LD",required_argument,NULL,'l'},
+ {"regions",required_argument,NULL,'r'},
+ {"regions-file",required_argument,NULL,'R'},
+ {"targets",required_argument,NULL,'t'},
+ {"targets-file",required_argument,NULL,'T'},
+ {"output",required_argument,NULL,'o'},
+ {"output-type",required_argument,NULL,'O'},
+ {"nsites-per-win",required_argument,NULL,'n'},
+ {"window",required_argument,NULL,'w'},
+ {NULL,0,NULL,0}
+ };
+ int c;
+ char *tmp;
+ while ((c = getopt_long(argc, argv, "vr:R:t:T:l:o:O:a:f:i:e:n:w:",loptions,NULL)) >= 0)
+ {
+ switch (c)
+ {
+ case 1 : args->rand_missing = 1; break;
+ case 2 : args->af_tag = optarg; break;
+ case 'e': args->filter_str = optarg; args->filter_logic |= FLT_EXCLUDE; break;
+ case 'i': args->filter_str = optarg; args->filter_logic |= FLT_INCLUDE; break;
+ case 'a':
+ {
+ int l = strlen(optarg);
+ args->info_pos = (char*)malloc(l+5);
+ args->info_r2 = (char*)malloc(l+5);
+ sprintf(args->info_pos,"%s_POS", optarg);
+ sprintf(args->info_r2,"%s_R2", optarg);
+ }
+ break;
+ case 'f': args->filter_r2 = optarg; break;
+ case 'n':
+ args->nsites = strtod(optarg,&tmp);
+ if ( tmp==optarg || *tmp ) error("Could not parse: --nsites-per-win %s\n", optarg);
+ break;
+ case 'l':
+ args->max_ld = strtod(optarg,&tmp);
+ if ( tmp==optarg || *tmp ) error("Could not parse: --max-LD %s\n", optarg);
+ break;
+ case 'w':
+ args->ld_win = strtod(optarg,&tmp);
+ if ( !*tmp ) break;
+ if ( tmp==optarg ) error("Could not parse: --window %s\n", optarg);
+ else if ( !strcasecmp("bp",tmp) ) args->ld_win *= -1;
+ else if ( !strcasecmp("kb",tmp) ) args->ld_win *= -1000;
+ else error("Could not parse: --window %s\n", optarg);
+ break;
+ case 'T': args->target_is_file = 1;
+ case 't': args->target = optarg; break;
+ case 'R': args->region_is_file = 1;
+ case 'r': args->region = optarg; break;
+ case 'o': args->output_fname = optarg; break;
+ case 'O':
+ switch (optarg[0]) {
+ case 'b': args->output_type = FT_BCF_GZ; break;
+ case 'u': args->output_type = FT_BCF; break;
+ case 'z': args->output_type = FT_VCF_GZ; break;
+ case 'v': args->output_type = FT_VCF; break;
+ default: error("The output type \"%s\" not recognised\n", optarg);
+ }
+ break;
+ case 'h':
+ case '?':
+ default: error("%s", usage_text()); break;
+ }
+ }
+ if ( args->filter_logic == (FLT_EXCLUDE|FLT_INCLUDE) ) error("Only one of -i or -e can be given.\n");
+ if ( !args->max_ld && !args->nsites ) error("%sError: Expected --max-LD, --nsites-per-win or both\n\n", usage_text());
+
+ if ( optind==argc )
+ {
+ if ( !isatty(fileno((FILE *)stdin)) ) args->fname = "-"; // reading from stdin
+ else { error("%s",usage_text()); }
+ }
+ else if ( optind+1!=argc ) error("%s",usage_text());
+ else args->fname = argv[optind];
+
+ init_data(args);
+
+ while ( bcf_sr_next_line(args->sr) ) process(args);
+ flush(args,1);
+
+ destroy_data(args);
+ return 0;
+}
+
+
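Note on the `-w, --window` handling in `run()` above: the window is stored as a single signed integer, where a positive value is a number of sites and a negative value is a size in base pairs. The block below is a small, self-contained sketch of that convention. `parse_window` is a hypothetical name, and it returns an error code where the plugin calls `error()`. Like the plugin, it truncates the number to an integer before applying the `kb` multiplier.

```c
#include <assert.h>
#include <stdlib.h>
#include <strings.h>

/* Parse "INT", "INTbp" or "INTkb" into *win.
   Positive result: window of that many sites.
   Negative result: window of that many base pairs.
   Returns 0 on success, -1 if the string cannot be parsed. */
int parse_window(const char *str, int *win)
{
    assert(str && win);
    char *end;
    int val = (int) strtod(str, &end);          /* truncated before scaling */
    if ( end==str ) return -1;                  /* no leading number */
    if ( !*end ) { *win = val; return 0; }      /* bare number: sites */
    if ( !strcasecmp("bp",end) ) { *win = -val; return 0; }
    if ( !strcasecmp("kb",end) ) { *win = -1000*val; return 0; }
    return -1;                                  /* unknown suffix */
}
```

Under this convention `"1000"` gives 1000 (sites), `"500bp"` gives -500 and `"100kb"` gives -100000, which matches the plugin's default of `-100e3` for a 100kb window.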
--- /dev/null
+#include "bcftools.pysam.h"
+
+/*
+ Copyright (C) 2017 Genome Research Ltd.
+
+ Author: Petr Danecek <pd3@sanger.ac.uk>
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in
+ all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ THE SOFTWARE.
+*/
+/*
+ Prune sites by missingness, LD
+
+ See calc_ld() in vcfbuf.c for the actual LD calculation
+
+*/
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <strings.h>
+#include <getopt.h>
+#include <stdarg.h>
+#include <unistd.h>
+#include <stdint.h>
+#include <errno.h>
+#include <htslib/vcf.h>
+#include <htslib/synced_bcf_reader.h>
+#include <htslib/vcfutils.h>
+#include "bcftools.h"
+#include "vcfbuf.h"
+#include "filter.h"
+
+#define FLT_INCLUDE 1
+#define FLT_EXCLUDE 2
+
+typedef struct
+{
+ filter_t *filter;
+ char *filter_str, *af_tag;
+ int filter_logic; // one of FLT_INCLUDE/FLT_EXCLUDE (-i or -e)
+ vcfbuf_t *vcfbuf;
+ int argc, region_is_file, target_is_file, output_type, filter_r2_id, rand_missing, nsites, ld_win;
+ char **argv, *region, *target, *fname, *output_fname, *info_pos, *info_r2, *filter_r2;
+ htsFile *out_fh;
+ bcf_hdr_t *hdr;
+ bcf_srs_t *sr;
+ double max_ld;
+}
+args_t;
+
+const char *about(void)
+{
+ return "Prune sites by missingness, linkage disequilibrium\n";
+}
+
+static const char *usage_text(void)
+{
+ return
+ "\n"
+ "About: Prune sites by missingness or linkage disequilibrium.\n"
+ "\n"
+ "Usage: bcftools +prune [Options]\n"
+ "Plugin options:\n"
+ " --AF-tag STR use this tag with -n to determine allele frequency\n"
+ " -a, --annotate-info STR add INFO/STR_POS and STR_R2 annotation: an upstream site with the biggest r2 value\n"
+ " -e, --exclude EXPR exclude sites for which the expression is true\n"
+ " -f, --set-filter STR annotate FILTER column with STR instead of discarding the site\n"
+ " -i, --include EXPR include only sites for which the expression is true\n"
+ " -l, --max-LD R2 remove sites with r2 bigger than R2 within the -w window\n"
+ " -n, --nsites-per-win N keep at most N sites in the -w window, removing sites with small AF first\n"
+ " -o, --output FILE write output to the FILE [standard output]\n"
+ " -O, --output-type b|u|z|v b: compressed BCF, u: uncompressed BCF, z: compressed VCF, v: uncompressed VCF [v]\n"
+ " --randomize-missing replace missing data with randomly assigned genotype based on site's allele frequency\n"
+ " -r, --regions REGION restrict to comma-separated list of regions\n"
+ " -R, --regions-file FILE restrict to regions listed in a file\n"
+ " -t, --targets REGION similar to -r but streams rather than index-jumps\n"
+ " -T, --targets-file FILE similar to -R but streams rather than index-jumps\n"
+ " -w, --window INT[bp|kb] the window size of INT sites/bp/kb for the -n/-l options [100kb]\n"
+ "Examples:\n"
+ " # Discard records with r2 bigger than 0.6 in a window of 1000 sites\n"
+ " bcftools +prune -l 0.6 -w 1000 input.bcf -Ob -o output.bcf\n"
+ "\n"
+ " # Set FILTER (but do not discard) records with r2 bigger than 0.4 in the default window of 100kb\n"
+ " bcftools +prune -l 0.4 -f MAX_R2 input.bcf -Ob -o output.bcf\n"
+ "\n"
+ " # Annotate INFO field of all records with maximum r2 in a window of 1000 sites\n"
+ " bcftools +prune -l 0.6 -w 1000 -f MAX_R2 input.bcf -Ob -o output.bcf\n"
+ "\n"
+ " # Discard records with r2 bigger than 0.6, first removing records with more than 2% of genotypes missing\n"
+ " bcftools +prune -l 0.6 -e'F_MISSING>=0.02' input.bcf -Ob -o output.bcf\n"
+ "\n";
+}
+
+static void init_data(args_t *args)
+{
+ args->sr = bcf_sr_init();
+ if ( args->region )
+ {
+ args->sr->require_index = 1;
+ if ( bcf_sr_set_regions(args->sr, args->region, args->region_is_file)<0 ) error("Failed to read the regions: %s\n",args->region);
+ }
+ if ( args->target && bcf_sr_set_targets(args->sr, args->target, args->target_is_file, 0)<0 ) error("Failed to read the targets: %s\n",args->target);
+ if ( !bcf_sr_add_reader(args->sr,args->fname) ) error("Error: %s\n", bcf_sr_strerror(args->sr->errnum));
+ args->hdr = bcf_sr_get_header(args->sr,0);
+
+ args->out_fh = hts_open(args->output_fname,hts_bcf_wmode(args->output_type));
+ if ( args->out_fh == NULL ) error("Can't write to \"%s\": %s\n", args->output_fname, strerror(errno));
+ if ( args->filter_r2 )
+ {
+ bcf_hdr_printf(args->hdr,"##FILTER=<ID=%s,Description=\"A site with r2>%e upstream within %d%s\">",args->filter_r2,args->max_ld,
+ args->ld_win < 0 ? -args->ld_win/1000 : args->ld_win,
+ args->ld_win < 0 ? "kb" : " sites");
+ }
+ if ( args->info_r2 )
+ {
+ bcf_hdr_printf(args->hdr,"##INFO=<ID=%s,Number=1,Type=Integer,Description=\"A site with r2>%e upstream\">",args->info_pos,args->max_ld);
+ bcf_hdr_printf(args->hdr,"##INFO=<ID=%s,Number=1,Type=Float,Description=\"A site with r2>%e upstream\">",args->info_r2,args->max_ld);
+ }
+ bcf_hdr_write(args->out_fh, args->hdr);
+ if ( args->filter_r2 )
+ args->filter_r2_id = bcf_hdr_id2int(args->hdr, BCF_DT_ID, args->filter_r2);
+
+ args->vcfbuf = vcfbuf_init(args->hdr, args->ld_win);
+ vcfbuf_set_opt(args->vcfbuf,double,VCFBUF_LD_MAX,args->max_ld);
+ if ( args->nsites ) vcfbuf_set_opt(args->vcfbuf,int,VCFBUF_NSITES,args->nsites);
+ if ( args->af_tag ) vcfbuf_set_opt(args->vcfbuf,char*,VCFBUF_AF_TAG,args->af_tag);
+ if ( args->rand_missing ) vcfbuf_set_opt(args->vcfbuf,int,VCFBUF_RAND_MISSING,1);
+ vcfbuf_set_opt(args->vcfbuf,int,VCFBUF_SKIP_FILTER,args->filter_r2 ? 1 : 0);
+
+ if ( args->filter_str )
+ args->filter = filter_init(args->hdr, args->filter_str);
+}
+static void destroy_data(args_t *args)
+{
+ if ( args->filter )
+ filter_destroy(args->filter);
+ hts_close(args->out_fh);
+ vcfbuf_destroy(args->vcfbuf);
+ bcf_sr_destroy(args->sr);
+ free(args->info_pos);
+ free(args->info_r2);
+ free(args);
+}
+static void flush(args_t *args, int flush_all)
+{
+ bcf1_t *rec;
+ while ( (rec = vcfbuf_flush(args->vcfbuf, flush_all)) )
+ bcf_write1(args->out_fh, args->hdr, rec);
+}
+static void process(args_t *args)
+{
+ bcf1_t *rec = bcf_sr_get_line(args->sr,0);
+ if ( args->filter )
+ {
+ int ret = filter_test(args->filter, rec, NULL);
+ if ( args->filter_logic==FLT_INCLUDE ) { if ( !ret ) return; }
+ else if ( ret ) return;
+ }
+ bcf_sr_t *sr = bcf_sr_get_reader(args->sr, 0);
+ if ( args->max_ld )
+ {
+ double ld_val;
+ bcf1_t *ld_rec = vcfbuf_max_ld(args->vcfbuf, rec, &ld_val);
+ if ( ld_rec && ld_val > args->max_ld )
+ {
+ if ( !args->filter_r2 ) return;
+ bcf_add_filter(args->hdr, rec, args->filter_r2_id);
+ }
+ if ( ld_rec && args->info_r2 )
+ {
+ float tmp = ld_val;
+ int32_t tmp_pos = ld_rec->pos + 1;
+ bcf_update_info_float(args->hdr, rec, args->info_r2, &tmp, 1);
+ bcf_update_info_int32(args->hdr, rec, args->info_pos, &tmp_pos, 1);
+ }
+ }
+ sr->buffer[0] = vcfbuf_push(args->vcfbuf, rec, 1);
+ flush(args,0);
+}
+
+int run(int argc, char **argv)
+{
+ args_t *args = (args_t*) calloc(1,sizeof(args_t));
+ args->argc = argc; args->argv = argv;
+ args->output_type = FT_VCF;
+ args->output_fname = "-";
+ args->ld_win = -100e3;
+ static struct option loptions[] =
+ {
+ {"randomize-missing",no_argument,NULL,1},
+ {"AF-tag",required_argument,NULL,2},
+ {"exclude",required_argument,NULL,'e'},
+ {"include",required_argument,NULL,'i'},
+ {"annotate-info",required_argument,NULL,'a'},
+ {"set-filter",required_argument,NULL,'f'},
+ {"max-LD",required_argument,NULL,'l'},
+ {"regions",required_argument,NULL,'r'},
+ {"regions-file",required_argument,NULL,'R'},
+ {"targets",required_argument,NULL,'t'},
+ {"targets-file",required_argument,NULL,'T'},
+ {"output",required_argument,NULL,'o'},
+ {"output-type",required_argument,NULL,'O'},
+ {"nsites-per-win",required_argument,NULL,'n'},
+ {"window",required_argument,NULL,'w'},
+ {NULL,0,NULL,0}
+ };
+ int c;
+ char *tmp;
+ while ((c = getopt_long(argc, argv, "vr:R:t:T:l:o:O:a:f:i:e:n:w:",loptions,NULL)) >= 0)
+ {
+ switch (c)
+ {
+ case 1 : args->rand_missing = 1; break;
+ case 2 : args->af_tag = optarg; break;
+ case 'e': args->filter_str = optarg; args->filter_logic |= FLT_EXCLUDE; break;
+ case 'i': args->filter_str = optarg; args->filter_logic |= FLT_INCLUDE; break;
+ case 'a':
+ {
+ int l = strlen(optarg);
+ args->info_pos = (char*)malloc(l+5);
+ args->info_r2 = (char*)malloc(l+5);
+ sprintf(args->info_pos,"%s_POS", optarg);
+ sprintf(args->info_r2,"%s_R2", optarg);
+ }
+ break;
+ case 'f': args->filter_r2 = optarg; break;
+ case 'n':
+ args->nsites = strtod(optarg,&tmp);
+ if ( tmp==optarg || *tmp ) error("Could not parse: --nsites-per-win %s\n", optarg);
+ break;
+ case 'l':
+ args->max_ld = strtod(optarg,&tmp);
+ if ( tmp==optarg || *tmp ) error("Could not parse: --max-LD %s\n", optarg);
+ break;
+ case 'w':
+ args->ld_win = strtod(optarg,&tmp);
+ if ( !*tmp ) break;
+ if ( tmp==optarg ) error("Could not parse: --window %s\n", optarg);
+ else if ( !strcasecmp("bp",tmp) ) args->ld_win *= -1;
+ else if ( !strcasecmp("kb",tmp) ) args->ld_win *= -1000;
+ else error("Could not parse: --window %s\n", optarg);
+ break;
+ case 'T': args->target_is_file = 1;
+ case 't': args->target = optarg; break;
+ case 'R': args->region_is_file = 1;
+ case 'r': args->region = optarg; break;
+ case 'o': args->output_fname = optarg; break;
+ case 'O':
+ switch (optarg[0]) {
+ case 'b': args->output_type = FT_BCF_GZ; break;
+ case 'u': args->output_type = FT_BCF; break;
+ case 'z': args->output_type = FT_VCF_GZ; break;
+ case 'v': args->output_type = FT_VCF; break;
+ default: error("The output type \"%s\" not recognised\n", optarg);
+ }
+ break;
+ case 'h':
+ case '?':
+ default: error("%s", usage_text()); break;
+ }
+ }
+ if ( args->filter_logic == (FLT_EXCLUDE|FLT_INCLUDE) ) error("Only one of -i or -e can be given.\n");
+ if ( !args->max_ld && !args->nsites ) error("%sError: Expected --max-LD, --nsites-per-win or both\n\n", usage_text());
+
+ if ( optind==argc )
+ {
+ if ( !isatty(fileno((FILE *)stdin)) ) args->fname = "-"; // reading from stdin
+ else { error("%s",usage_text()); }
+ }
+ else if ( optind+1!=argc ) error("%s",usage_text());
+ else args->fname = argv[optind];
+
+ init_data(args);
+
+ while ( bcf_sr_next_line(args->sr) ) process(args);
+ flush(args,1);
+
+ destroy_data(args);
+ return 0;
+}
+
+
--- /dev/null
+/* plugins/setGT.c -- set genotypes to given values
+
+ Copyright (C) 2015-2017 Genome Research Ltd.
+
+ Author: Petr Danecek <pd3@sanger.ac.uk>
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE. */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <htslib/vcf.h>
+#include <htslib/vcfutils.h>
+#include <htslib/kfunc.h>
+#include <inttypes.h>
+#include <getopt.h>
+#include <ctype.h>
+#include "bcftools.h"
+#include "filter.h"
+
+// Logic of the filters: include or exclude sites which match the filters?
+#define FLT_INCLUDE 1
+#define FLT_EXCLUDE 2
+
+typedef int (*cmp_f)(double a, double b);
+
+static int cmp_eq(double a, double b) { return a==b ? 1 : 0; }
+static int cmp_le(double a, double b) { return a<=b ? 1 : 0; }
+static int cmp_ge(double a, double b) { return a>=b ? 1 : 0; }
+static int cmp_lt(double a, double b) { return a<b ? 1 : 0; }
+static int cmp_gt(double a, double b) { return a>b ? 1 : 0; }
+
+typedef struct
+{
+ bcf_hdr_t *in_hdr, *out_hdr;
+ int32_t *gts, mgts, *iarr, miarr;
+ int *arr, marr;
+ uint64_t nchanged;
+ int tgt_mask, new_mask, new_gt;
+ filter_t *filter;
+ char *filter_str;
+ int filter_logic;
+ uint8_t *smpl_pass;
+ double binom_val;
+ char *binom_tag;
+ cmp_f binom_cmp;
+}
+args_t;
+
+args_t *args = NULL;
+
+#define GT_MISSING 1
+#define GT_PARTIAL (1<<1)
+#define GT_REF (1<<2)
+#define GT_MAJOR (1<<3)
+#define GT_PHASED (1<<4)
+#define GT_UNPHASED (1<<5)
+#define GT_ALL (1<<6)
+#define GT_QUERY (1<<7)
+#define GT_BINOM (1<<8)
+
+const char *about(void)
+{
+ return "Set genotypes: partially missing to missing, missing to ref/major allele, etc.\n";
+}
+
+const char *usage(void)
+{
+ return
+ "About: Sets genotypes. The target genotypes can be specified as:\n"
+ " ./. .. completely missing (\".\" or \"./.\", depending on ploidy)\n"
+ " ./x .. partially missing (e.g., \"./0\" or \".|1\" but not \"./.\")\n"
+ " . .. partially or completely missing\n"
+ " a .. all genotypes\n"
+ " b .. heterozygous genotypes failing two-tailed binomial test (example below)\n"
+ " q .. select genotypes using -i/-e options\n"
+ " and the new genotype can be one of:\n"
+ " . .. missing (\".\" or \"./.\", keeps ploidy)\n"
+ " 0 .. reference allele\n"
+ " M .. major allele\n"
+ " p .. phased genotype\n"
+ " u .. unphase genotype and sort by allele (1|0 becomes 0/1)\n"
+ "Usage: bcftools +setGT [General Options] -- [Plugin Options]\n"
+ "Options:\n"
+ " run \"bcftools plugin\" for a list of common options\n"
+ "\n"
+ "Plugin options:\n"
+ " -e, --exclude <expr> Exclude a genotype if true (requires -t q)\n"
+ " -i, --include <expr> include a genotype if true (requires -t q)\n"
+ " -n, --new-gt <type> Genotypes to set, see above\n"
+ " -t, --target-gt <type> Genotypes to change, see above\n"
+ "\n"
+ "Example:\n"
+ " # set missing genotypes (\"./.\") to phased ref genotypes (\"0|0\")\n"
+ " bcftools +setGT in.vcf -- -t . -n 0p\n"
+ "\n"
+ " # set missing genotypes with DP>0 and GQ>20 to ref genotypes (\"0/0\")\n"
+ " bcftools +setGT in.vcf -- -t q -n 0 -i 'GT=\".\" && FMT/DP>0 && GQ>20'\n"
+ "\n"
+ " # set partially missing genotypes to completely missing\n"
+ " bcftools +setGT in.vcf -- -t ./x -n .\n"
+ "\n"
+ " # set heterozygous genotypes to 0/0 if binom.test(nAlt,nRef+nAlt,0.5)<1e-3\n"
+ " bcftools +setGT in.vcf -- -t \"b:AD<1e-3\" -n 0\n" // todo: make -i/-e recognise something like is_het or gt="het" so that this can be generalized?
+ "\n";
+}
+
+static void _parse_binom_expr_error(char *str)
+{
+ error(
+ "Error parsing the expression: %s\n"
+ "Expected TAG CMP VAL, where\n"
+ " TAG .. one of the format tags\n"
+ " CMP .. operator, one of <, <=, >, >=, =, ==\n"
+ " VAL .. value\n"
+ "For example:\n"
+ " bcftools +setGT in.vcf -- -t \"b:AD>1e-3\" -n 0\n"
+ "\n", str
+ );
+}
+void parse_binom_expr(args_t *args, char *str)
+{
+ if ( str[1]!=':' ) _parse_binom_expr_error(str);
+
+ char *beg = str+2;
+ while ( *beg && isspace(*beg) ) beg++;
+ if ( !*beg ) _parse_binom_expr_error(str);
+ char *end = beg;
+ while ( *end )
+ {
+ if ( isspace(*end) || *end=='<' || *end=='=' || *end=='>' ) break;
+ end++;
+ }
+ if ( !*end ) _parse_binom_expr_error(str);
+ args->binom_tag = (char*) calloc(1,end-beg+1);
+ memcpy(args->binom_tag,beg,end-beg);
+ int tag_id = bcf_hdr_id2int(args->in_hdr,BCF_DT_ID,args->binom_tag);
+ if ( !bcf_hdr_idinfo_exists(args->in_hdr,BCF_HL_FMT,tag_id) ) error("The FORMAT tag \"%s\" is not present in the VCF\n", args->binom_tag);
+
+ while ( *end && isspace(*end) ) end++;
+ if ( !*end ) _parse_binom_expr_error(str);
+
+ if ( !strncmp(end,"<=",2) ) { args->binom_cmp = cmp_le; beg = end+2; }
+ else if ( !strncmp(end,">=",2) ) { args->binom_cmp = cmp_ge; beg = end+2; }
+ else if ( !strncmp(end,"==",2) ) { args->binom_cmp = cmp_eq; beg = end+2; }
+ else if ( !strncmp(end,"<",1) ) { args->binom_cmp = cmp_lt; beg = end+1; }
+ else if ( !strncmp(end,">",1) ) { args->binom_cmp = cmp_gt; beg = end+1; }
+ else if ( !strncmp(end,"=",1) ) { args->binom_cmp = cmp_eq; beg = end+1; }
+ else _parse_binom_expr_error(str);
+
+ while ( *beg && isspace(*beg) ) beg++;
+ if ( !*beg ) _parse_binom_expr_error(str);
+
+ args->binom_val = strtod(beg, &end);
+ while ( *end && isspace(*end) ) end++;
+ if ( *end ) _parse_binom_expr_error(str);
+
+ args->tgt_mask |= GT_BINOM;
+ return;
+}
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+ args = (args_t*) calloc(1,sizeof(args_t));
+ args->in_hdr = in;
+ args->out_hdr = out;
+
+ int c;
+ static struct option loptions[] =
+ {
+ {"include",required_argument,NULL,'i'},
+ {"exclude",required_argument,NULL,'e'},
+ {"new-gt",required_argument,NULL,'n'},
+ {"target-gt",required_argument,NULL,'t'},
+ {NULL,0,NULL,0}
+ };
+ while ((c = getopt_long(argc, argv, "?hn:t:i:e:",loptions,NULL)) >= 0)
+ {
+ switch (c)
+ {
+ case 'i': args->filter_str = optarg; args->filter_logic = FLT_INCLUDE; break;
+ case 'e': args->filter_str = optarg; args->filter_logic = FLT_EXCLUDE; break;
+ case 'n': args->new_mask = 0;
+ if ( strchr(optarg,'.') ) args->new_mask |= GT_MISSING;
+ if ( strchr(optarg,'0') ) args->new_mask |= GT_REF;
+ if ( strchr(optarg,'M') ) args->new_mask |= GT_MAJOR;
+ if ( strchr(optarg,'p') ) args->new_mask |= GT_PHASED;
+ if ( strchr(optarg,'u') ) args->new_mask |= GT_UNPHASED;
+ if ( args->new_mask==0 ) error("Unknown parameter to --new-gt: %s\n", optarg);
+ break;
+ case 't':
+ if ( !strcmp(optarg,".") ) args->tgt_mask |= GT_MISSING|GT_PARTIAL;
+ if ( !strcmp(optarg,"./x") ) args->tgt_mask |= GT_PARTIAL;
+ if ( !strcmp(optarg,"./.") ) args->tgt_mask |= GT_MISSING;
+ if ( !strcmp(optarg,"a") ) args->tgt_mask |= GT_ALL;
+ if ( !strcmp(optarg,"q") ) args->tgt_mask |= GT_QUERY;
+ if ( !strcmp(optarg,"?") ) args->tgt_mask |= GT_QUERY; // for backward compatibility
+ if ( strchr(optarg,'b') ) parse_binom_expr(args, strchr(optarg,'b'));
+ if ( args->tgt_mask==0 ) error("Unknown parameter to --target-gt: %s\n", optarg);
+ break;
+ case 'h':
+ case '?':
+ default: fprintf(stderr,"%s", usage()); exit(1); break;
+ }
+ }
+
+ if ( !args->new_mask ) error("Expected -n option\n");
+ if ( !args->tgt_mask ) error("Expected -t option\n");
+
+ if ( args->new_mask & GT_MISSING ) args->new_gt = bcf_gt_missing;
+ if ( args->new_mask & GT_REF ) args->new_gt = args->new_mask&GT_PHASED ? bcf_gt_phased(0) : bcf_gt_unphased(0);
+
+ if ( args->filter_str && !(args->tgt_mask&GT_QUERY) ) error("Expected -tq with -i/-e\n");
+ if ( !args->filter_str && args->tgt_mask&GT_QUERY ) error("Expected -i/-e with -tq\n");
+ if ( args->filter_str ) args->filter = filter_init(in,args->filter_str);
+
+ return 0;
+}
+
+static inline int phase_gt(int32_t *ptr, int ngts)
+{
+ int j, changed = 0;
+ for (j=0; j<ngts; j++)
+ {
+ if ( ptr[j]==bcf_int32_vector_end ) break;
+ if ( bcf_gt_is_phased(ptr[j]) ) continue;
+ ptr[j] = bcf_gt_phased(bcf_gt_allele(ptr[j])); // add phasing; this may need a fix, I think the flag should be set for one allele only?
+ changed++;
+ }
+ return changed;
+}
+
+static inline int unphase_gt(int32_t *ptr, int ngts)
+{
+ int j, changed = 0;
+ for (j=0; j<ngts; j++)
+ {
+ if ( ptr[j]==bcf_int32_vector_end ) break;
+ if ( !bcf_gt_is_phased(ptr[j]) ) continue;
+ ptr[j] = bcf_gt_unphased(bcf_gt_allele(ptr[j])); // remove phasing
+ changed++;
+ }
+
+ // insertion sort
+ int k, l;
+ for (k=1; k<j; k++)
+ {
+ int32_t x = ptr[k];
+ l = k;
+ while ( l>0 && ptr[l-1]>x )
+ {
+ ptr[l] = ptr[l-1];
+ l--;
+ }
+ ptr[l] = x;
+ }
+ return changed;
+}
+static inline int set_gt(int32_t *ptr, int ngts, int gt)
+{
+ int j, changed = 0;
+ for (j=0; j<ngts; j++)
+ {
+ if ( ptr[j]==bcf_int32_vector_end ) break;
+ if ( ptr[j] != gt ) changed++;
+ ptr[j] = gt;
+ }
+ return changed;
+}
+
+static inline double calc_binom(int na, int nb)
+{
+ if ( na + nb == 0 ) return 1;
+
+ /*
+ kfunc.h implements kf_betai, which is the regularized incomplete beta function I_x(a,b) = P(X<=a/(a+b))
+ */
+ double prob = na > nb ? 2*kf_betai(na, nb + 1, 0.5) : 2*kf_betai(nb, na + 1, 0.5);
+ if ( prob > 1 ) prob = 1;
+
+ return prob;
+}
+
+bcf1_t *process(bcf1_t *rec)
+{
+ if ( !rec->n_sample ) return rec;
+
+ int ngts = bcf_get_genotypes(args->in_hdr, rec, &args->gts, &args->mgts);
+ ngts /= rec->n_sample;
+ int i, j, changed = 0;
+
+ int nbinom = 0;
+ if ( args->tgt_mask & GT_BINOM )
+ {
+ nbinom = bcf_get_format_int32(args->in_hdr, rec, args->binom_tag, &args->iarr, &args->miarr);
+ if ( nbinom<0 ) nbinom = 0;
+ nbinom /= rec->n_sample;
+ }
+
+ // Calculate the allele count for each allele and determine the major allele;
+ // only do this when the major allele was requested (-n M)
+ int an = 0, maxAC = -1, majorAllele = -1;
+ if ( args->new_mask & GT_MAJOR )
+ {
+ hts_expand(int,rec->n_allele,args->marr,args->arr);
+ int ret = bcf_calc_ac(args->in_hdr,rec,args->arr,BCF_UN_FMT);
+ if ( ret<= 0 )
+ error("Could not calculate allele count at %s:%d\n", bcf_seqname(args->in_hdr,rec),rec->pos+1);
+
+ for(i=0; i < rec->n_allele; ++i)
+ {
+ an += args->arr[i];
+ if (args->arr[i] > maxAC)
+ {
+ maxAC = args->arr[i];
+ majorAllele = i;
+ }
+ }
+
+ // replacing new_gt by major allele
+ args->new_gt = args->new_mask & GT_PHASED ? bcf_gt_phased(majorAllele) : bcf_gt_unphased(majorAllele);
+ }
+
+ // replace gts
+ if ( nbinom && ngts>=2 ) // only diploid genotypes are considered: higher ploidy ignored further, haploid here
+ {
+ if ( args->filter ) filter_test(args->filter,rec,(const uint8_t **)&args->smpl_pass);
+ for (i=0; i<rec->n_sample; i++)
+ {
+ if ( args->smpl_pass )
+ {
+ if ( !args->smpl_pass[i] && args->filter_logic==FLT_INCLUDE ) continue;
+ if ( args->smpl_pass[i] && args->filter_logic==FLT_EXCLUDE ) continue;
+ }
+ int32_t *ptr = args->gts + i*ngts;
+ if ( bcf_gt_is_missing(ptr[0]) || bcf_gt_is_missing(ptr[1]) || ptr[1]==bcf_int32_vector_end ) continue;
+ if ( ptr[0]==ptr[1] ) continue; // a hom
+ int ia = bcf_gt_allele(ptr[0]);
+ int ib = bcf_gt_allele(ptr[1]);
+ if ( ia>=nbinom || ib>=nbinom )
+ error("The sample %s has incorrect number of %s fields at %s:%d\n",
+ args->in_hdr->samples[i],args->binom_tag,bcf_seqname(args->in_hdr,rec),rec->pos+1);
+
+ double prob = calc_binom(args->iarr[i*nbinom+ia],args->iarr[i*nbinom+ib]);
+ if ( !args->binom_cmp(prob,args->binom_val) ) continue;
+
+ if ( args->new_mask&GT_UNPHASED )
+ changed += unphase_gt(ptr, ngts);
+ else if ( args->new_mask==GT_PHASED )
+ changed += phase_gt(ptr, ngts);
+ else
+ changed += set_gt(ptr, ngts, args->new_gt);
+ }
+ }
+ else if ( args->tgt_mask&GT_QUERY )
+ {
+ int pass_site = filter_test(args->filter,rec,(const uint8_t **)&args->smpl_pass);
+ if ( pass_site && args->filter_logic==FLT_EXCLUDE )
+ {
+ // -i can include a site but exclude a sample, -e exclude a site but include a sample
+ if ( pass_site )
+ {
+ if ( !args->smpl_pass ) return rec;
+ pass_site = 0;
+ for (i=0; i<rec->n_sample; i++)
+ {
+ if ( args->smpl_pass[i] ) args->smpl_pass[i] = 0;
+ else { args->smpl_pass[i] = 1; pass_site = 1; }
+ }
+ if ( !pass_site ) return rec;
+ }
+ else if ( args->smpl_pass )
+ for (i=0; i<rec->n_sample; i++) args->smpl_pass[i] = 1;
+ }
+ else if ( !pass_site ) return rec;
+
+ for (i=0; i<rec->n_sample; i++)
+ {
+ if ( !args->smpl_pass[i] ) continue;
+ if ( args->new_mask&GT_UNPHASED )
+ changed += unphase_gt(args->gts + i*ngts, ngts);
+ else if ( args->new_mask==GT_PHASED )
+ changed += phase_gt(args->gts + i*ngts, ngts);
+ else
+ changed += set_gt(args->gts + i*ngts, ngts, args->new_gt);
+ }
+ }
+ else
+ {
+ for (i=0; i<rec->n_sample; i++)
+ {
+ int ploidy = 0, nmiss = 0;
+ int32_t *ptr = args->gts + i*ngts;
+ for (j=0; j<ngts; j++)
+ {
+ if ( ptr[j]==bcf_int32_vector_end ) break;
+ ploidy++;
+ if ( ptr[j]==bcf_gt_missing ) nmiss++;
+ }
+
+ int do_set = 0;
+ if ( args->tgt_mask&GT_ALL ) do_set = 1;
+ else if ( args->tgt_mask&GT_PARTIAL && nmiss ) do_set = 1;
+ else if ( args->tgt_mask&GT_MISSING && ploidy==nmiss ) do_set = 1;
+
+ if ( !do_set ) continue;
+
+ if ( args->new_mask&GT_UNPHASED )
+ changed += unphase_gt(ptr, ngts);
+ else if ( args->new_mask==GT_PHASED )
+ changed += phase_gt(ptr, ngts);
+ else
+ changed += set_gt(ptr, ngts, args->new_gt);
+ }
+ }
+ args->nchanged += changed;
+ if ( changed ) bcf_update_genotypes(args->out_hdr, rec, args->gts, ngts*rec->n_sample);
+ return rec;
+}
+
+void destroy(void)
+{
+ fprintf(stderr,"Filled %"PRIu64" alleles\n", args->nchanged);
+ free(args->binom_tag);
+ if ( args->filter ) filter_destroy(args->filter);
+ free(args->arr);
+ free(args->iarr);
+ free(args->gts);
+ free(args);
+}
+
+
--- /dev/null
+#include "bcftools.pysam.h"
+
+/* plugins/setGT.c -- set genotypes to given values
+
+ Copyright (C) 2015-2017 Genome Research Ltd.
+
+ Author: Petr Danecek <pd3@sanger.ac.uk>
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE. */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <htslib/vcf.h>
+#include <htslib/vcfutils.h>
+#include <htslib/kfunc.h>
+#include <inttypes.h>
+#include <getopt.h>
+#include <ctype.h>
+#include "bcftools.h"
+#include "filter.h"
+
+// Logic of the filters: include or exclude sites which match the filters?
+#define FLT_INCLUDE 1
+#define FLT_EXCLUDE 2
+
+typedef int (*cmp_f)(double a, double b);
+
+static int cmp_eq(double a, double b) { return a==b ? 1 : 0; }
+static int cmp_le(double a, double b) { return a<=b ? 1 : 0; }
+static int cmp_ge(double a, double b) { return a>=b ? 1 : 0; }
+static int cmp_lt(double a, double b) { return a<b ? 1 : 0; }
+static int cmp_gt(double a, double b) { return a>b ? 1 : 0; }
+
+typedef struct
+{
+ bcf_hdr_t *in_hdr, *out_hdr;
+ int32_t *gts, mgts, *iarr, miarr;
+ int *arr, marr;
+ uint64_t nchanged;
+ int tgt_mask, new_mask, new_gt;
+ filter_t *filter;
+ char *filter_str;
+ int filter_logic;
+ uint8_t *smpl_pass;
+ double binom_val;
+ char *binom_tag;
+ cmp_f binom_cmp;
+}
+args_t;
+
+args_t *args = NULL;
+
+#define GT_MISSING 1
+#define GT_PARTIAL (1<<1)
+#define GT_REF (1<<2)
+#define GT_MAJOR (1<<3)
+#define GT_PHASED (1<<4)
+#define GT_UNPHASED (1<<5)
+#define GT_ALL (1<<6)
+#define GT_QUERY (1<<7)
+#define GT_BINOM (1<<8)
+
+const char *about(void)
+{
+ return "Set genotypes: partially missing to missing, missing to ref/major allele, etc.\n";
+}
+
+const char *usage(void)
+{
+ return
+ "About: Sets genotypes. The target genotypes can be specified as:\n"
+ " ./. .. completely missing (\".\" or \"./.\", depending on ploidy)\n"
+ " ./x .. partially missing (e.g., \"./0\" or \".|1\" but not \"./.\")\n"
+ " . .. partially or completely missing\n"
+ " a .. all genotypes\n"
+ " b .. heterozygous genotypes failing two-tailed binomial test (example below)\n"
+ " q .. select genotypes using -i/-e options\n"
+ " and the new genotype can be one of:\n"
+ " . .. missing (\".\" or \"./.\", keeps ploidy)\n"
+ " 0 .. reference allele\n"
+ " M .. major allele\n"
+ " p .. phased genotype\n"
+ " u .. unphase genotype and sort by allele (1|0 becomes 0/1)\n"
+ "Usage: bcftools +setGT [General Options] -- [Plugin Options]\n"
+ "Options:\n"
+ " run \"bcftools plugin\" for a list of common options\n"
+ "\n"
+ "Plugin options:\n"
+ " -e, --exclude <expr> Exclude a genotype if true (requires -t q)\n"
+ " -i, --include <expr> include a genotype if true (requires -t q)\n"
+ " -n, --new-gt <type> Genotypes to set, see above\n"
+ " -t, --target-gt <type> Genotypes to change, see above\n"
+ "\n"
+ "Example:\n"
+ " # set missing genotypes (\"./.\") to phased ref genotypes (\"0|0\")\n"
+ " bcftools +setGT in.vcf -- -t . -n 0p\n"
+ "\n"
+ " # set missing genotypes with DP>0 and GQ>20 to ref genotypes (\"0/0\")\n"
+ " bcftools +setGT in.vcf -- -t q -n 0 -i 'GT=\".\" && FMT/DP>0 && GQ>20'\n"
+ "\n"
+ " # set partially missing genotypes to completely missing\n"
+ " bcftools +setGT in.vcf -- -t ./x -n .\n"
+ "\n"
+ " # set heterozygous genotypes to 0/0 if binom.test(nAlt,nRef+nAlt,0.5)<1e-3\n"
+ " bcftools +setGT in.vcf -- -t \"b:AD<1e-3\" -n 0\n" // todo: make -i/-e recognise something like is_het or gt="het" so that this can be generalized?
+ "\n";
+}
+
+static void _parse_binom_expr_error(char *str)
+{
+ error(
+ "Error parsing the expression: %s\n"
+ "Expected TAG CMP VAL, where\n"
+ " TAG .. one of the format tags\n"
+ " CMP .. operator, one of <, <=, >, >=, =, ==\n"
+ " VAL .. value\n"
+ "For example:\n"
+ " bcftools +setGT in.vcf -- -t \"b:AD>1e-3\" -n 0\n"
+ "\n", str
+ );
+}
+void parse_binom_expr(args_t *args, char *str)
+{
+ if ( str[1]!=':' ) _parse_binom_expr_error(str);
+
+ char *beg = str+2;
+ while ( *beg && isspace(*beg) ) beg++;
+ if ( !*beg ) _parse_binom_expr_error(str);
+ char *end = beg;
+ while ( *end )
+ {
+ if ( isspace(*end) || *end=='<' || *end=='=' || *end=='>' ) break;
+ end++;
+ }
+ if ( !*end ) _parse_binom_expr_error(str);
+ args->binom_tag = (char*) calloc(1,end-beg+1);
+ memcpy(args->binom_tag,beg,end-beg);
+ int tag_id = bcf_hdr_id2int(args->in_hdr,BCF_DT_ID,args->binom_tag);
+ if ( !bcf_hdr_idinfo_exists(args->in_hdr,BCF_HL_FMT,tag_id) ) error("The FORMAT tag \"%s\" is not present in the VCF\n", args->binom_tag);
+
+ while ( *end && isspace(*end) ) end++;
+ if ( !*end ) _parse_binom_expr_error(str);
+
+ if ( !strncmp(end,"<=",2) ) { args->binom_cmp = cmp_le; beg = end+2; }
+ else if ( !strncmp(end,">=",2) ) { args->binom_cmp = cmp_ge; beg = end+2; }
+ else if ( !strncmp(end,"==",2) ) { args->binom_cmp = cmp_eq; beg = end+2; }
+ else if ( !strncmp(end,"<",1) ) { args->binom_cmp = cmp_lt; beg = end+1; }
+ else if ( !strncmp(end,">",1) ) { args->binom_cmp = cmp_gt; beg = end+1; }
+ else if ( !strncmp(end,"=",1) ) { args->binom_cmp = cmp_eq; beg = end+1; }
+ else _parse_binom_expr_error(str);
+
+ while ( *beg && isspace(*beg) ) beg++;
+ if ( !*beg ) _parse_binom_expr_error(str);
+
+ args->binom_val = strtod(beg, &end);
+ while ( *end && isspace(*end) ) end++;
+ if ( *end ) _parse_binom_expr_error(str);
+
+ args->tgt_mask |= GT_BINOM;
+ return;
+}
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+ args = (args_t*) calloc(1,sizeof(args_t));
+ args->in_hdr = in;
+ args->out_hdr = out;
+
+ int c;
+ static struct option loptions[] =
+ {
+ {"include",required_argument,NULL,'i'},
+ {"exclude",required_argument,NULL,'e'},
+ {"new-gt",required_argument,NULL,'n'},
+ {"target-gt",required_argument,NULL,'t'},
+ {NULL,0,NULL,0}
+ };
+ while ((c = getopt_long(argc, argv, "?hn:t:i:e:",loptions,NULL)) >= 0)
+ {
+ switch (c)
+ {
+ case 'i': args->filter_str = optarg; args->filter_logic = FLT_INCLUDE; break;
+ case 'e': args->filter_str = optarg; args->filter_logic = FLT_EXCLUDE; break;
+ case 'n': args->new_mask = 0;
+ if ( strchr(optarg,'.') ) args->new_mask |= GT_MISSING;
+ if ( strchr(optarg,'0') ) args->new_mask |= GT_REF;
+ if ( strchr(optarg,'M') ) args->new_mask |= GT_MAJOR;
+ if ( strchr(optarg,'p') ) args->new_mask |= GT_PHASED;
+ if ( strchr(optarg,'u') ) args->new_mask |= GT_UNPHASED;
+ if ( args->new_mask==0 ) error("Unknown parameter to --new-gt: %s\n", optarg);
+ break;
+ case 't':
+ if ( !strcmp(optarg,".") ) args->tgt_mask |= GT_MISSING|GT_PARTIAL;
+ if ( !strcmp(optarg,"./x") ) args->tgt_mask |= GT_PARTIAL;
+ if ( !strcmp(optarg,"./.") ) args->tgt_mask |= GT_MISSING;
+ if ( !strcmp(optarg,"a") ) args->tgt_mask |= GT_ALL;
+ if ( !strcmp(optarg,"q") ) args->tgt_mask |= GT_QUERY;
+ if ( !strcmp(optarg,"?") ) args->tgt_mask |= GT_QUERY; // for backward compatibility
+ if ( strchr(optarg,'b') ) parse_binom_expr(args, strchr(optarg,'b'));
+ if ( args->tgt_mask==0 ) error("Unknown parameter to --target-gt: %s\n", optarg);
+ break;
+ case 'h':
+ case '?':
+ default: fprintf(bcftools_stderr,"%s", usage()); exit(1); break;
+ }
+ }
+
+ if ( !args->new_mask ) error("Expected -n option\n");
+ if ( !args->tgt_mask ) error("Expected -t option\n");
+
+ if ( args->new_mask & GT_MISSING ) args->new_gt = bcf_gt_missing;
+ if ( args->new_mask & GT_REF ) args->new_gt = args->new_mask&GT_PHASED ? bcf_gt_phased(0) : bcf_gt_unphased(0);
+
+ if ( args->filter_str && !(args->tgt_mask&GT_QUERY) ) error("Expected -tq with -i/-e\n");
+ if ( !args->filter_str && args->tgt_mask&GT_QUERY ) error("Expected -i/-e with -tq\n");
+ if ( args->filter_str ) args->filter = filter_init(in,args->filter_str);
+
+ return 0;
+}
+
+static inline int phase_gt(int32_t *ptr, int ngts)
+{
+ int j, changed = 0;
+ for (j=0; j<ngts; j++)
+ {
+ if ( ptr[j]==bcf_int32_vector_end ) break;
+ if ( bcf_gt_is_phased(ptr[j]) ) continue;
+ ptr[j] = bcf_gt_phased(bcf_gt_allele(ptr[j])); // add phasing; this may need a fix, I think the flag should be set for one allele only?
+ changed++;
+ }
+ return changed;
+}
+
+static inline int unphase_gt(int32_t *ptr, int ngts)
+{
+ int j, changed = 0;
+ for (j=0; j<ngts; j++)
+ {
+ if ( ptr[j]==bcf_int32_vector_end ) break;
+ if ( !bcf_gt_is_phased(ptr[j]) ) continue;
+ ptr[j] = bcf_gt_unphased(bcf_gt_allele(ptr[j])); // remove phasing
+ changed++;
+ }
+
+ // insertion sort
+ int k, l;
+ for (k=1; k<j; k++)
+ {
+ int32_t x = ptr[k];
+ l = k;
+ while ( l>0 && ptr[l-1]>x )
+ {
+ ptr[l] = ptr[l-1];
+ l--;
+ }
+ ptr[l] = x;
+ }
+ return changed;
+}
+static inline int set_gt(int32_t *ptr, int ngts, int gt)
+{
+ int j, changed = 0;
+ for (j=0; j<ngts; j++)
+ {
+ if ( ptr[j]==bcf_int32_vector_end ) break;
+ if ( ptr[j] != gt ) changed++;
+ ptr[j] = gt;
+ }
+ return changed;
+}
+
+static inline double calc_binom(int na, int nb)
+{
+ if ( na + nb == 0 ) return 1;
+
+ /*
+ kfunc.h implements kf_betai, which is the regularized incomplete beta function I_x(a,b) = P(X<=a/(a+b))
+ */
+ double prob = na > nb ? 2*kf_betai(na, nb + 1, 0.5) : 2*kf_betai(nb, na + 1, 0.5);
+ if ( prob > 1 ) prob = 1;
+
+ return prob;
+}
+
+bcf1_t *process(bcf1_t *rec)
+{
+ if ( !rec->n_sample ) return rec;
+
+ int ngts = bcf_get_genotypes(args->in_hdr, rec, &args->gts, &args->mgts);
+ ngts /= rec->n_sample;
+ int i, j, changed = 0;
+
+ int nbinom = 0;
+ if ( args->tgt_mask & GT_BINOM )
+ {
+ nbinom = bcf_get_format_int32(args->in_hdr, rec, args->binom_tag, &args->iarr, &args->miarr);
+ if ( nbinom<0 ) nbinom = 0;
+ nbinom /= rec->n_sample;
+ }
+
+ // Calculate the allele count for each allele and determine the major allele;
+ // only do this when the major allele was requested (-n M)
+ int an = 0, maxAC = -1, majorAllele = -1;
+ if ( args->new_mask & GT_MAJOR )
+ {
+ hts_expand(int,rec->n_allele,args->marr,args->arr);
+ int ret = bcf_calc_ac(args->in_hdr,rec,args->arr,BCF_UN_FMT);
+ if ( ret<= 0 )
+ error("Could not calculate allele count at %s:%d\n", bcf_seqname(args->in_hdr,rec),rec->pos+1);
+
+ for(i=0; i < rec->n_allele; ++i)
+ {
+ an += args->arr[i];
+ if (args->arr[i] > maxAC)
+ {
+ maxAC = args->arr[i];
+ majorAllele = i;
+ }
+ }
+
+ // replacing new_gt by major allele
+ args->new_gt = args->new_mask & GT_PHASED ? bcf_gt_phased(majorAllele) : bcf_gt_unphased(majorAllele);
+ }
+
+ // replace gts
+ if ( nbinom && ngts>=2 ) // only diploid genotypes are considered: higher ploidy ignored further, haploid here
+ {
+ if ( args->filter ) filter_test(args->filter,rec,(const uint8_t **)&args->smpl_pass);
+ for (i=0; i<rec->n_sample; i++)
+ {
+ if ( args->smpl_pass )
+ {
+ if ( !args->smpl_pass[i] && args->filter_logic==FLT_INCLUDE ) continue;
+ if ( args->smpl_pass[i] && args->filter_logic==FLT_EXCLUDE ) continue;
+ }
+ int32_t *ptr = args->gts + i*ngts;
+ if ( bcf_gt_is_missing(ptr[0]) || bcf_gt_is_missing(ptr[1]) || ptr[1]==bcf_int32_vector_end ) continue;
+ if ( ptr[0]==ptr[1] ) continue; // a hom
+ int ia = bcf_gt_allele(ptr[0]);
+ int ib = bcf_gt_allele(ptr[1]);
+ if ( ia>=nbinom || ib>=nbinom )
+ error("The sample %s has incorrect number of %s fields at %s:%d\n",
+ args->in_hdr->samples[i],args->binom_tag,bcf_seqname(args->in_hdr,rec),rec->pos+1);
+
+ double prob = calc_binom(args->iarr[i*nbinom+ia],args->iarr[i*nbinom+ib]);
+ if ( !args->binom_cmp(prob,args->binom_val) ) continue;
+
+ if ( args->new_mask&GT_UNPHASED )
+ changed += unphase_gt(ptr, ngts);
+ else if ( args->new_mask==GT_PHASED )
+ changed += phase_gt(ptr, ngts);
+ else
+ changed += set_gt(ptr, ngts, args->new_gt);
+ }
+ }
+ else if ( args->tgt_mask&GT_QUERY )
+ {
+ int pass_site = filter_test(args->filter,rec,(const uint8_t **)&args->smpl_pass);
+ if ( pass_site && args->filter_logic==FLT_EXCLUDE )
+ {
+ // -i can include a site but exclude a sample, -e exclude a site but include a sample
+ if ( pass_site )
+ {
+ if ( !args->smpl_pass ) return rec;
+ pass_site = 0;
+ for (i=0; i<rec->n_sample; i++)
+ {
+ if ( args->smpl_pass[i] ) args->smpl_pass[i] = 0;
+ else { args->smpl_pass[i] = 1; pass_site = 1; }
+ }
+ if ( !pass_site ) return rec;
+ }
+ else if ( args->smpl_pass )
+ for (i=0; i<rec->n_sample; i++) args->smpl_pass[i] = 1;
+ }
+ else if ( !pass_site ) return rec;
+
+ for (i=0; i<rec->n_sample; i++)
+ {
+ if ( !args->smpl_pass[i] ) continue;
+ if ( args->new_mask&GT_UNPHASED )
+ changed += unphase_gt(args->gts + i*ngts, ngts);
+ else if ( args->new_mask==GT_PHASED )
+ changed += phase_gt(args->gts + i*ngts, ngts);
+ else
+ changed += set_gt(args->gts + i*ngts, ngts, args->new_gt);
+ }
+ }
+ else
+ {
+ for (i=0; i<rec->n_sample; i++)
+ {
+ int ploidy = 0, nmiss = 0;
+ int32_t *ptr = args->gts + i*ngts;
+ for (j=0; j<ngts; j++)
+ {
+ if ( ptr[j]==bcf_int32_vector_end ) break;
+ ploidy++;
+ if ( ptr[j]==bcf_gt_missing ) nmiss++;
+ }
+
+ int do_set = 0;
+ if ( args->tgt_mask&GT_ALL ) do_set = 1;
+ else if ( args->tgt_mask&GT_PARTIAL && nmiss ) do_set = 1;
+ else if ( args->tgt_mask&GT_MISSING && ploidy==nmiss ) do_set = 1;
+
+ if ( !do_set ) continue;
+
+ if ( args->new_mask&GT_UNPHASED )
+ changed += unphase_gt(ptr, ngts);
+ else if ( args->new_mask==GT_PHASED )
+ changed += phase_gt(ptr, ngts);
+ else
+ changed += set_gt(ptr, ngts, args->new_gt);
+ }
+ }
+ args->nchanged += changed;
+ if ( changed ) bcf_update_genotypes(args->out_hdr, rec, args->gts, ngts*rec->n_sample);
+ return rec;
+}
+
+void destroy(void)
+{
+ fprintf(bcftools_stderr,"Filled %"PRIu64" alleles\n", args->nchanged);
+ free(args->binom_tag);
+ if ( args->filter ) filter_destroy(args->filter);
+ free(args->arr);
+ free(args->iarr);
+ free(args->gts);
+ free(args);
+}
+
+
--- /dev/null
+/* The MIT License
+
+ Copyright (c) 2018 Genome Research Ltd.
+
+ Author: Petr Danecek <pd3@sanger.ac.uk>
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in
+ all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ THE SOFTWARE.
+
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <getopt.h>
+#include <unistd.h> // for isatty
+#include <htslib/hts.h>
+#include <htslib/vcf.h>
+#include <htslib/kstring.h>
+#include <htslib/kseq.h>
+#include <htslib/synced_bcf_reader.h>
+#include <htslib/vcfutils.h>
+#include "bcftools.h"
+#include "filter.h"
+
+
+// Logic of the filters: include or exclude sites which match the filters?
+#define FLT_INCLUDE 1
+#define FLT_EXCLUDE 2
+
+typedef struct
+{
+ uint32_t
+ npass, // number of genotypes passing the filter
+ nnon_ref, // number of non-reference genotypes
+ nhomRR,
+ nhomAA,
+ nhemi,
+ nhet,
+ nSNV,
+ nIndel,
+ nmissing,
+ nsingleton, // het different from everyone else
+ nts, ntv; // number of transitions and transversions
+}
+stats_t;
+
+typedef struct
+{
+ stats_t *stats, site_stats;
+ filter_t *filter;
+ char *expr;
+}
+flt_stats_t;
+
+typedef struct
+{
+ int argc, filter_logic, regions_is_file, targets_is_file;
+ int nflt_str;
+ char *filter_str, **flt_str;
+ char **argv, *output_fname, *fname, *regions, *targets;
+ bcf_srs_t *sr;
+ bcf_hdr_t *hdr;
+ flt_stats_t *filters;
+ int nfilters, nsmpl;
+ int32_t *gt_arr, *ac;
+ int mgt_arr, mac;
+}
+args_t;
+
+args_t args;
+
+const char *about(void)
+{
+ return "Calculate basic per-sample stats scanning over a range of thresholds simultaneously.\n";
+}
+
+static const char *usage_text(void)
+{
+ return
+ "\n"
+ "About: Calculates basic per-sample stats. Use curly brackets to scan a range of values simultaneously\n"
+ "Usage: bcftools +smpl-stats [Plugin Options]\n"
+ "Plugin options:\n"
+ " -e, --exclude EXPR exclude sites and samples for which the expression is true\n"
+ " -i, --include EXPR include sites and samples for which the expression is true\n"
+ " -o, --output FILE output file name [stdout]\n"
+ " -r, --regions REG restrict to comma-separated list of regions\n"
+ " -R, --regions-file FILE restrict to regions listed in a file\n"
+ " -t, --targets REG similar to -r but streams rather than index-jumps\n"
+ " -T, --targets-file FILE similar to -R but streams rather than index-jumps\n"
+ "\n"
+ "Example:\n"
+ " bcftools +smpl-stats -i 'GQ>{10,20,30,40,50}' file.bcf\n"
+ "\n";
+}
+
+static void parse_filters(args_t *args)
+{
+ if ( !args->filter_str ) return;
+ int mflt = 1;
+ args->nflt_str = 1;
+ args->flt_str = (char**) malloc(sizeof(char*));
+ args->flt_str[0] = strdup(args->filter_str);
+ while (1)
+ {
+ int i, expanded = 0;
+ for (i=args->nflt_str-1; i>=0; i--)
+ {
+ char *exp_beg = strchr(args->flt_str[i], '{');
+ if ( !exp_beg ) continue;
+ char *exp_end = strchr(exp_beg+1, '}');
+ if ( !exp_end ) error("Could not parse the expression: %s\n", args->filter_str);
+ char *beg = exp_beg+1, *mid = beg;
+ while ( mid<exp_end )
+ {
+ while ( mid<exp_end && *mid!=',' ) mid++;
+ kstring_t tmp = {0,0,0};
+ kputsn(args->flt_str[i], exp_beg - args->flt_str[i], &tmp);
+ kputsn(beg, mid - beg, &tmp);
+ kputs(exp_end+1, &tmp);
+ args->nflt_str++;
+ hts_expand(char*, args->nflt_str, mflt, args->flt_str);
+ args->flt_str[args->nflt_str-1] = tmp.s;
+ beg = ++mid;
+ }
+ expanded = 1;
+ free(args->flt_str[i]);
+ memmove(&args->flt_str[i], &args->flt_str[i+1], (args->nflt_str-i-1)*sizeof(*args->flt_str));
+ args->nflt_str--;
+ args->flt_str[args->nflt_str] = NULL;
+ }
+ if ( !expanded ) break;
+ }
+
+ fprintf(stderr,"Collecting data for %d filtering expressions\n", args->nflt_str);
+}
+
+static void init_data(args_t *args)
+{
+ args->sr = bcf_sr_init();
+ if ( args->regions )
+ {
+ args->sr->require_index = 1;
+ if ( bcf_sr_set_regions(args->sr, args->regions, args->regions_is_file)<0 ) error("Failed to read the regions: %s\n",args->regions);
+ }
+ if ( args->targets && bcf_sr_set_targets(args->sr, args->targets, args->targets_is_file, 0)<0 ) error("Failed to read the targets: %s\n",args->targets);
+ if ( !bcf_sr_add_reader(args->sr,args->fname) ) error("Error: %s\n", bcf_sr_strerror(args->sr->errnum));
+ args->hdr = bcf_sr_get_header(args->sr,0);
+
+ parse_filters(args);
+
+ int i;
+ if ( !args->nflt_str )
+ {
+ args->filters = (flt_stats_t*) calloc(1, sizeof(flt_stats_t));
+ args->nfilters = 1;
+ args->filters[0].expr = strdup("all");
+ }
+ else
+ {
+ args->nfilters = args->nflt_str;
+ args->filters = (flt_stats_t*) calloc(args->nfilters, sizeof(flt_stats_t));
+ for (i=0; i<args->nfilters; i++)
+ {
+ args->filters[i].filter = filter_init(args->hdr, args->flt_str[i]);
+ args->filters[i].expr = strdup(args->flt_str[i]);
+
+ // replace tabs with spaces so that the output stays parsable
+ char *tmp = args->filters[i].expr;
+ while ( *tmp )
+ {
+ if ( *tmp=='\t' ) *tmp = ' ';
+ tmp++;
+ }
+ }
+ }
+ args->nsmpl = bcf_hdr_nsamples(args->hdr);
+ for (i=0; i<args->nfilters; i++)
+ args->filters[i].stats = (stats_t*) calloc(args->nsmpl,sizeof(stats_t));
+}
+static void destroy_data(args_t *args)
+{
+ int i;
+ for (i=0; i<args->nfilters; i++)
+ {
+ if ( args->filters[i].filter ) filter_destroy(args->filters[i].filter);
+ free(args->filters[i].stats);
+ free(args->filters[i].expr);
+ }
+ free(args->filters);
+ for (i=0; i<args->nflt_str; i++) free(args->flt_str[i]);
+ free(args->flt_str);
+ bcf_sr_destroy(args->sr);
+ free(args->ac);
+ free(args->gt_arr);
+ free(args);
+}
+static void report_stats(args_t *args)
+{
+ int i = 0,j;
+ FILE *fh = !args->output_fname || !strcmp("-",args->output_fname) ? stdout : fopen(args->output_fname,"w");
+ if ( !fh ) error("Could not open the file for writing: %s\n", args->output_fname);
+ fprintf(fh,"# CMD line shows the command line used to generate this output\n");
+ fprintf(fh,"# DEF lines define expressions for all tested thresholds\n");
+ fprintf(fh,"# FLT* lines report numbers for every threshold and every sample:\n");
+ fprintf(fh,"# %d) filter id\n", ++i);
+ fprintf(fh,"# %d) sample\n", ++i);
+ fprintf(fh,"# %d) number of genotypes which pass the filter\n", ++i);
+ fprintf(fh,"# %d) number of non-reference genotypes\n", ++i);
+ fprintf(fh,"# %d) number of homozygous ref genotypes (0/0 or 0)\n", ++i);
+ fprintf(fh,"# %d) number of homozygous alt genotypes (1/1, 2/2, etc)\n", ++i);
+ fprintf(fh,"# %d) number of heterozygous genotypes (0/1, 1/2, etc)\n", ++i);
+ fprintf(fh,"# %d) number of hemizygous genotypes (0, 1, etc)\n", ++i);
+ fprintf(fh,"# %d) number of SNVs\n", ++i);
+ fprintf(fh,"# %d) number of indels\n", ++i);
+ fprintf(fh,"# %d) number of singletons\n", ++i);
+ fprintf(fh,"# %d) number of missing genotypes (./., ., ./0, etc)\n", ++i);
+ fprintf(fh,"# %d) number of transitions (genotypes such as \"1/2\" are counted twice)\n", ++i);
+ fprintf(fh,"# %d) number of transversions (genotypes such as \"1/2\" are counted twice)\n", ++i);
+ fprintf(fh,"# %d) overall ts/tv\n", ++i);
+ i = 0;
+ fprintf(fh,"# SITE* lines report numbers for every threshold and site:\n");
+ fprintf(fh,"# %d) filter id\n", ++i);
+ fprintf(fh,"# %d) number of sites which pass the filter\n", ++i);
+ fprintf(fh,"# %d) number of SNVs\n", ++i);
+ fprintf(fh,"# %d) number of indels\n", ++i);
+ fprintf(fh,"# %d) number of singletons\n", ++i);
+ fprintf(fh,"# %d) number of transitions (counted at most once at multiallelic sites)\n", ++i);
+ fprintf(fh,"# %d) number of transversions (counted at most once at multiallelic sites)\n", ++i);
+ fprintf(fh,"# %d) overall ts/tv\n", ++i);
+ fprintf(fh, "CMD\t%s", args->argv[0]);
+ for (i=1; i<args->argc; i++) fprintf(fh, " %s",args->argv[i]);
+ fprintf(fh, "\n");
+ for (i=0; i<args->nfilters; i++)
+ {
+ flt_stats_t *flt = &args->filters[i];
+ fprintf(fh,"DEF\tFLT%d\t%s\n", i, flt->expr);
+ }
+ for (i=0; i<args->nfilters; i++)
+ {
+ flt_stats_t *flt = &args->filters[i];
+ for (j=0; j<args->nsmpl; j++)
+ {
+ fprintf(fh,"FLT%d", i);
+ fprintf(fh,"\t%s",args->hdr->samples[j]);
+ stats_t *stats = &flt->stats[j];
+ fprintf(fh,"\t%d", stats->npass);
+ fprintf(fh,"\t%d", stats->nnon_ref);
+ fprintf(fh,"\t%d", stats->nhomRR);
+ fprintf(fh,"\t%d", stats->nhomAA);
+ fprintf(fh,"\t%d", stats->nhet);
+ fprintf(fh,"\t%d", stats->nhemi);
+ fprintf(fh,"\t%d", stats->nSNV);
+ fprintf(fh,"\t%d", stats->nIndel);
+ fprintf(fh,"\t%d", stats->nsingleton);
+ fprintf(fh,"\t%d", stats->nmissing);
+ fprintf(fh,"\t%d", stats->nts);
+ fprintf(fh,"\t%d", stats->ntv);
+ fprintf(fh,"\t%.2f", stats->ntv ? (float)stats->nts/stats->ntv : INFINITY);
+ fprintf(fh,"\n");
+ }
+ fprintf(fh,"SITE%d", i);
+ stats_t *stats = &flt->site_stats;
+ fprintf(fh,"\t%d", stats->npass);
+ fprintf(fh,"\t%d", stats->nSNV);
+ fprintf(fh,"\t%d", stats->nIndel);
+ fprintf(fh,"\t%d", stats->nsingleton);
+ fprintf(fh,"\t%d", stats->nts);
+ fprintf(fh,"\t%d", stats->ntv);
+ fprintf(fh,"\t%.2f", stats->ntv ? (float)stats->nts/stats->ntv : INFINITY);
+ fprintf(fh,"\n");
+ }
+ if ( fclose(fh)!=0 ) error("Close failed: %s\n", (!args->output_fname || !strcmp("-",args->output_fname)) ? "stdout" : args->output_fname);
+}
+
+// Read the genotype of sample idx into als[]. Returns 0 for a diploid genotype,
+// -2 for a haploid genotype and -1 if an allele is missing.
+static inline int parse_genotype(int32_t *arr, int ngt1, int idx, int als[2])
+{
+ int32_t *ptr = arr + ngt1 * idx;
+ if ( bcf_gt_is_missing(ptr[0]) ) return -1;
+ als[0] = bcf_gt_allele(ptr[0]);
+
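+ // haploid: mirror the allele into als[1]; writing ptr[1] would overrun this sample's slot when ngt1==1
+ if ( ngt1==1 || ptr[1]==bcf_int32_vector_end ) { als[1] = als[0]; return -2; }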
+
+ if ( bcf_gt_is_missing(ptr[1]) ) return -1;
+ als[1] = bcf_gt_allele(ptr[1]);
+
+ return 0;
+}
+
+static void process_record(args_t *args, bcf1_t *rec, flt_stats_t *flt)
+{
+ int i,j;
+ uint8_t *smpl_pass = NULL;
+
+ // Find out which samples pass and if the site passes
+ if ( flt->filter )
+ {
+ int pass_site = filter_test(flt->filter, rec, (const uint8_t**) &smpl_pass);
+ if ( args->filter_logic & FLT_EXCLUDE )
+ {
+ if ( pass_site )
+ {
+ if ( !smpl_pass ) return;
+ pass_site = 0;
+ for (i=0; i<args->nsmpl; i++)
+ {
+ if ( smpl_pass[i] ) smpl_pass[i] = 0;
+ else { smpl_pass[i] = 1; pass_site = 1; }
+ }
+ if ( !pass_site ) return;
+ }
+ else
+ for (i=0; i<args->nsmpl; i++) smpl_pass[i] = 1;
+ }
+ else if ( !pass_site ) return;
+ }
+
+ // Find out the allele counts. Try to use INFO/AC, if not present, determine from the genotypes
+ hts_expand(int, rec->n_allele, args->mac, args->ac);
+ if ( !bcf_calc_ac(args->hdr, rec, args->ac, BCF_UN_INFO|BCF_UN_FMT) ) return;
+
+ // Get the genotypes
+ int ngt = bcf_get_genotypes(args->hdr, rec, &args->gt_arr, &args->mgt_arr);
+ if ( ngt<0 ) return;
+ int ngt1 = ngt / rec->n_sample;
+
+
+ // For ts/tv: numeric code of the reference allele, -1 if REF is not a single A, C, G or T base
+ int ref = !rec->d.allele[0][1] ? bcf_acgt2int(*rec->d.allele[0]) : -1;
+
+ int star_allele = -1;
+ for (i=1; i<rec->n_allele; i++)
+ if ( !rec->d.allele[i][1] && rec->d.allele[i][0]=='*' ) { star_allele = i; break; }
+
+ // Run the stats
+ int site_pass = 0;
+ int site_SNV = 0;
+ int site_Indel = 0;
+ int site_has_ts = 0;
+ int site_has_tv = 0;
+ int site_singleton = 0;
+ for (i=0; i<args->nsmpl; i++)
+ {
+ if ( smpl_pass && !smpl_pass[i] ) continue;
+ stats_t *stats = &flt->stats[i];
+
+ // Determine the alternate allele and the genotypes, skip if any of the alleles is missing.
+ int als[2];
+ int ret = parse_genotype(args->gt_arr, ngt1, i, als);
+ if ( ret==-1 ) { stats->nmissing++; continue; } // missing allele
+ if ( ret==-2 ) stats->nhemi++;
+ else if ( als[0]!=als[1] ) stats->nhet++;
+ else if ( als[0]==0 ) stats->nhomRR++;
+ else stats->nhomAA++;
+
+ stats->npass++;
+ site_pass = 1;
+
+ // Is there an alternate allele other than *?
+ int has_nonref = 0;
+ for (j=0; j<2; j++)
+ {
+ if ( als[j]==star_allele ) continue;
+ if ( als[j]==0 ) continue;
+ has_nonref = 1;
+ }
+ if ( !has_nonref ) continue; // only ref or * in this genotype
+
+ stats->nnon_ref++;
+
+ // Calculate ts/tv and count SNVs and indels. Genotypes with two different ALT alleles (HetAA) are handled as well
+ {
+ int has_ts = 0, has_tv = 0, has_snv = 0, has_indel = 0;
+ for (j=0; j<2; j++)
+ {
+ if ( als[j]==0 || als[j]==star_allele ) continue;
+ if ( als[j] >= rec->n_allele )
+ error("The GT index is out of range at %s:%d in %s\n", bcf_seqname(args->hdr,rec),rec->pos+1,args->hdr->samples[i]);
+
+ if ( args->ac[als[j]]==1 ) { stats->nsingleton++; site_singleton = 1; }
+
+ int var_type = bcf_get_variant_type(rec, als[j]);
+ if ( var_type==VCF_SNP || var_type==VCF_MNP )
+ {
+ int k = 0;
+ while ( rec->d.allele[0][k] && rec->d.allele[als[j]][k] )
+ {
+ if ( rec->d.allele[0][k]==rec->d.allele[als[j]][k] ) { k++; continue; }
+
+ int alt = bcf_acgt2int(rec->d.allele[als[j]][k]);
+ if ( abs(ref-alt)==2 ) has_ts = 1;
+ else has_tv = 1;
+ has_snv = 1;
+
+ k++;
+ }
+ }
+ else if ( var_type==VCF_INDEL ) has_indel = 1;
+ }
+ if ( has_ts ) { stats->nts++; site_has_ts = 1; }
+ if ( has_tv ) { stats->ntv++; site_has_tv = 1; }
+ if ( has_snv ) { stats->nSNV++; site_SNV = 1; }
+ if ( has_indel ) { stats->nIndel++; site_Indel = 1; }
+ }
+ }
+ flt->site_stats.npass += site_pass;
+ flt->site_stats.nSNV += site_SNV;
+ flt->site_stats.nIndel += site_Indel;
+ flt->site_stats.nts += site_has_ts;
+ flt->site_stats.ntv += site_has_tv;
+ flt->site_stats.nsingleton += site_singleton;
+}
+
+int run(int argc, char **argv)
+{
+ args_t *args = (args_t*) calloc(1,sizeof(args_t));
+ args->argc = argc; args->argv = argv;
+ args->output_fname = "-";
+ static struct option loptions[] =
+ {
+ {"include",required_argument,0,'i'},
+ {"exclude",required_argument,0,'e'},
+ {"output",required_argument,NULL,'o'},
+ {"regions",1,0,'r'},
+ {"regions-file",1,0,'R'},
+ {"targets",1,0,'t'},
+ {"targets-file",1,0,'T'},
+ {NULL,0,NULL,0}
+ };
+ int c, i;
+ while ((c = getopt_long(argc, argv, "ho:i:e:r:R:t:T:",loptions,NULL)) >= 0)
+ {
+ switch (c)
+ {
+ case 'e': args->filter_str = optarg; args->filter_logic |= FLT_EXCLUDE; break;
+ case 'i': args->filter_str = optarg; args->filter_logic |= FLT_INCLUDE; break;
+ case 't': args->targets = optarg; break;
+ case 'T': args->targets = optarg; args->targets_is_file = 1; break;
+ case 'r': args->regions = optarg; break;
+ case 'R': args->regions = optarg; args->regions_is_file = 1; break;
+ case 'o': args->output_fname = optarg; break;
+ case 'h':
+ case '?':
+ default: error("%s", usage_text()); break;
+ }
+ }
+ if ( optind==argc )
+ {
+ if ( !isatty(fileno((FILE *)stdin)) ) args->fname = "-"; // reading from stdin
+ else { error("%s",usage_text()); }
+ }
+ else if ( optind+1!=argc ) error("%s",usage_text());
+ else args->fname = argv[optind];
+
+ init_data(args);
+
+ while ( bcf_sr_next_line(args->sr) )
+ {
+ bcf1_t *rec = bcf_sr_get_line(args->sr,0);
+ for (i=0; i<args->nfilters; i++)
+ process_record(args, rec, &args->filters[i]);
+ }
+
+ report_stats(args);
+ destroy_data(args);
+
+ return 0;
+}
--- /dev/null
+#include "bcftools.pysam.h"
+
+/* The MIT License
+
+ Copyright (c) 2018 Genome Research Ltd.
+
+ Author: Petr Danecek <pd3@sanger.ac.uk>
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in
+ all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ THE SOFTWARE.
+
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h> // strdup, strchr, strcmp, memmove
+#include <getopt.h>
+#include <unistd.h> // for isatty
+#include <htslib/hts.h>
+#include <htslib/vcf.h>
+#include <htslib/kstring.h>
+#include <htslib/kseq.h>
+#include <htslib/synced_bcf_reader.h>
+#include <htslib/vcfutils.h>
+#include "bcftools.h"
+#include "filter.h"
+
+
+// Logic of the filters: include or exclude sites which match the filters?
+#define FLT_INCLUDE 1
+#define FLT_EXCLUDE 2
+
+typedef struct
+{
+ uint32_t
+ npass, // number of genotypes passing the filter
+ nnon_ref, // number of non-reference genotypes
+ nhomRR,
+ nhomAA,
+ nhemi,
+ nhet,
+ nSNV,
+ nIndel,
+ nmissing,
+ nsingleton, // het different from everyone else
+ nts, ntv; // number of transitions and transversions
+}
+stats_t;
+
+typedef struct
+{
+ stats_t *stats, site_stats;
+ filter_t *filter;
+ char *expr;
+}
+flt_stats_t;
+
+typedef struct
+{
+ int argc, filter_logic, regions_is_file, targets_is_file;
+ int nflt_str;
+ char *filter_str, **flt_str;
+ char **argv, *output_fname, *fname, *regions, *targets;
+ bcf_srs_t *sr;
+ bcf_hdr_t *hdr;
+ flt_stats_t *filters;
+ int nfilters, nsmpl;
+ int32_t *gt_arr, *ac;
+ int mgt_arr, mac;
+}
+args_t;
+
+args_t args;
+
+const char *about(void)
+{
+ return "Calculate basic per-sample stats scanning over a range of thresholds simultaneously.\n";
+}
+
+static const char *usage_text(void)
+{
+ return
+ "\n"
+ "About: Calculates basic per-sample stats. Use curly brackets to scan a range of values simultaneously\n"
+ "Usage: bcftools +smpl-stats [Plugin Options]\n"
+ "Plugin options:\n"
+ " -e, --exclude EXPR exclude sites and samples for which the expression is true\n"
+ " -i, --include EXPR include sites and samples for which the expression is true\n"
+ " -o, --output FILE output file name [bcftools_stdout]\n"
+ " -r, --regions REG restrict to comma-separated list of regions\n"
+ " -R, --regions-file FILE restrict to regions listed in a file\n"
+ " -t, --targets REG similar to -r but streams rather than index-jumps\n"
+ " -T, --targets-file FILE similar to -R but streams rather than index-jumps\n"
+ "\n"
+ "Example:\n"
+ " bcftools +smpl-stats -i 'GQ>{10,20,30,40,50}' file.bcf\n"
+ "\n";
+}
+
+// Expand curly-bracket lists in the filter expression into separate expressions,
+// e.g. 'GQ>{10,20}' becomes 'GQ>10' and 'GQ>20'. Several bracket groups in one
+// expression expand to all combinations of their values.
+static void parse_filters(args_t *args)
+{
+ if ( !args->filter_str ) return;
+ int mflt = 1;
+ args->nflt_str = 1;
+ args->flt_str = (char**) malloc(sizeof(char*));
+ args->flt_str[0] = strdup(args->filter_str);
+ while (1)
+ {
+ int i, expanded = 0;
+ for (i=args->nflt_str-1; i>=0; i--)
+ {
+ char *exp_beg = strchr(args->flt_str[i], '{');
+ if ( !exp_beg ) continue;
+ char *exp_end = strchr(exp_beg+1, '}');
+ if ( !exp_end ) error("Could not parse the expression: %s\n", args->filter_str);
+ char *beg = exp_beg+1, *mid = beg;
+ while ( mid<exp_end )
+ {
+ while ( mid<exp_end && *mid!=',' ) mid++;
+ kstring_t tmp = {0,0,0};
+ kputsn(args->flt_str[i], exp_beg - args->flt_str[i], &tmp);
+ kputsn(beg, mid - beg, &tmp);
+ kputs(exp_end+1, &tmp);
+ args->nflt_str++;
+ hts_expand(char*, args->nflt_str, mflt, args->flt_str);
+ args->flt_str[args->nflt_str-1] = tmp.s;
+ beg = ++mid;
+ }
+ expanded = 1;
+ free(args->flt_str[i]);
+ memmove(&args->flt_str[i], &args->flt_str[i+1], (args->nflt_str-i-1)*sizeof(*args->flt_str));
+ args->nflt_str--;
+ args->flt_str[args->nflt_str] = NULL;
+ }
+ if ( !expanded ) break;
+ }
+
+ fprintf(bcftools_stderr,"Collecting data for %d filtering expressions\n", args->nflt_str);
+}
+
+static void init_data(args_t *args)
+{
+ args->sr = bcf_sr_init();
+ if ( args->regions )
+ {
+ args->sr->require_index = 1;
+ if ( bcf_sr_set_regions(args->sr, args->regions, args->regions_is_file)<0 ) error("Failed to read the regions: %s\n",args->regions);
+ }
+ if ( args->targets && bcf_sr_set_targets(args->sr, args->targets, args->targets_is_file, 0)<0 ) error("Failed to read the targets: %s\n",args->targets);
+ if ( !bcf_sr_add_reader(args->sr,args->fname) ) error("Error: %s\n", bcf_sr_strerror(args->sr->errnum));
+ args->hdr = bcf_sr_get_header(args->sr,0);
+
+ parse_filters(args);
+
+ int i;
+ if ( !args->nflt_str )
+ {
+ args->filters = (flt_stats_t*) calloc(1, sizeof(flt_stats_t));
+ args->nfilters = 1;
+ args->filters[0].expr = strdup("all");
+ }
+ else
+ {
+ args->nfilters = args->nflt_str;
+ args->filters = (flt_stats_t*) calloc(args->nfilters, sizeof(flt_stats_t));
+ for (i=0; i<args->nfilters; i++)
+ {
+ args->filters[i].filter = filter_init(args->hdr, args->flt_str[i]);
+ args->filters[i].expr = strdup(args->flt_str[i]);
+
+ // replace tabs with spaces so that the output stays parsable
+ char *tmp = args->filters[i].expr;
+ while ( *tmp )
+ {
+ if ( *tmp=='\t' ) *tmp = ' ';
+ tmp++;
+ }
+ }
+ }
+ args->nsmpl = bcf_hdr_nsamples(args->hdr);
+ for (i=0; i<args->nfilters; i++)
+ args->filters[i].stats = (stats_t*) calloc(args->nsmpl,sizeof(stats_t));
+}
+static void destroy_data(args_t *args)
+{
+ int i;
+ for (i=0; i<args->nfilters; i++)
+ {
+ if ( args->filters[i].filter ) filter_destroy(args->filters[i].filter);
+ free(args->filters[i].stats);
+ free(args->filters[i].expr);
+ }
+ free(args->filters);
+ for (i=0; i<args->nflt_str; i++) free(args->flt_str[i]);
+ free(args->flt_str);
+ bcf_sr_destroy(args->sr);
+ free(args->ac);
+ free(args->gt_arr);
+ free(args);
+}
+static void report_stats(args_t *args)
+{
+ int i = 0,j;
+ FILE *fh = !args->output_fname || !strcmp("-",args->output_fname) ? bcftools_stdout : fopen(args->output_fname,"w");
+ if ( !fh ) error("Could not open the file for writing: %s\n", args->output_fname);
+ fprintf(fh,"# CMD line shows the command line used to generate this output\n");
+ fprintf(fh,"# DEF lines define expressions for all tested thresholds\n");
+ fprintf(fh,"# FLT* lines report numbers for every threshold and every sample:\n");
+ fprintf(fh,"# %d) filter id\n", ++i);
+ fprintf(fh,"# %d) sample\n", ++i);
+ fprintf(fh,"# %d) number of genotypes which pass the filter\n", ++i);
+ fprintf(fh,"# %d) number of non-reference genotypes\n", ++i);
+ fprintf(fh,"# %d) number of homozygous ref genotypes (0/0 or 0)\n", ++i);
+ fprintf(fh,"# %d) number of homozygous alt genotypes (1/1, 2/2, etc)\n", ++i);
+ fprintf(fh,"# %d) number of heterozygous genotypes (0/1, 1/2, etc)\n", ++i);
+ fprintf(fh,"# %d) number of hemizygous genotypes (0, 1, etc)\n", ++i);
+ fprintf(fh,"# %d) number of SNVs\n", ++i);
+ fprintf(fh,"# %d) number of indels\n", ++i);
+ fprintf(fh,"# %d) number of singletons\n", ++i);
+ fprintf(fh,"# %d) number of missing genotypes (./., ., ./0, etc)\n", ++i);
+ fprintf(fh,"# %d) number of transitions (genotypes such as \"1/2\" are counted twice)\n", ++i);
+ fprintf(fh,"# %d) number of transversions (genotypes such as \"1/2\" are counted twice)\n", ++i);
+ fprintf(fh,"# %d) overall ts/tv\n", ++i);
+ i = 0;
+ fprintf(fh,"# SITE* lines report numbers for every threshold and site:\n");
+ fprintf(fh,"# %d) filter id\n", ++i);
+ fprintf(fh,"# %d) number of sites which pass the filter\n", ++i);
+ fprintf(fh,"# %d) number of SNVs\n", ++i);
+ fprintf(fh,"# %d) number of indels\n", ++i);
+ fprintf(fh,"# %d) number of singletons\n", ++i);
+ fprintf(fh,"# %d) number of transitions (counted at most once at multiallelic sites)\n", ++i);
+ fprintf(fh,"# %d) number of transversions (counted at most once at multiallelic sites)\n", ++i);
+ fprintf(fh,"# %d) overall ts/tv\n", ++i);
+ fprintf(fh, "CMD\t%s", args->argv[0]);
+ for (i=1; i<args->argc; i++) fprintf(fh, " %s",args->argv[i]);
+ fprintf(fh, "\n");
+ for (i=0; i<args->nfilters; i++)
+ {
+ flt_stats_t *flt = &args->filters[i];
+ fprintf(fh,"DEF\tFLT%d\t%s\n", i, flt->expr);
+ }
+ for (i=0; i<args->nfilters; i++)
+ {
+ flt_stats_t *flt = &args->filters[i];
+ for (j=0; j<args->nsmpl; j++)
+ {
+ fprintf(fh,"FLT%d", i);
+ fprintf(fh,"\t%s",args->hdr->samples[j]);
+ stats_t *stats = &flt->stats[j];
+ fprintf(fh,"\t%d", stats->npass);
+ fprintf(fh,"\t%d", stats->nnon_ref);
+ fprintf(fh,"\t%d", stats->nhomRR);
+ fprintf(fh,"\t%d", stats->nhomAA);
+ fprintf(fh,"\t%d", stats->nhet);
+ fprintf(fh,"\t%d", stats->nhemi);
+ fprintf(fh,"\t%d", stats->nSNV);
+ fprintf(fh,"\t%d", stats->nIndel);
+ fprintf(fh,"\t%d", stats->nsingleton);
+ fprintf(fh,"\t%d", stats->nmissing);
+ fprintf(fh,"\t%d", stats->nts);
+ fprintf(fh,"\t%d", stats->ntv);
+ fprintf(fh,"\t%.2f", stats->ntv ? (float)stats->nts/stats->ntv : INFINITY);
+ fprintf(fh,"\n");
+ }
+ fprintf(fh,"SITE%d", i);
+ stats_t *stats = &flt->site_stats;
+ fprintf(fh,"\t%d", stats->npass);
+ fprintf(fh,"\t%d", stats->nSNV);
+ fprintf(fh,"\t%d", stats->nIndel);
+ fprintf(fh,"\t%d", stats->nsingleton);
+ fprintf(fh,"\t%d", stats->nts);
+ fprintf(fh,"\t%d", stats->ntv);
+ fprintf(fh,"\t%.2f", stats->ntv ? (float)stats->nts/stats->ntv : INFINITY);
+ fprintf(fh,"\n");
+ }
+ if ( fclose(fh)!=0 ) error("Close failed: %s\n", (!args->output_fname || !strcmp("-",args->output_fname)) ? "bcftools_stdout" : args->output_fname);
+}
+
+// Read the genotype of sample idx into als[]. Returns 0 for a diploid genotype,
+// -2 for a haploid genotype and -1 if an allele is missing.
+static inline int parse_genotype(int32_t *arr, int ngt1, int idx, int als[2])
+{
+ int32_t *ptr = arr + ngt1 * idx;
+ if ( bcf_gt_is_missing(ptr[0]) ) return -1;
+ als[0] = bcf_gt_allele(ptr[0]);
+
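+ // haploid: mirror the allele into als[1]; writing ptr[1] would overrun this sample's slot when ngt1==1
+ if ( ngt1==1 || ptr[1]==bcf_int32_vector_end ) { als[1] = als[0]; return -2; }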
+
+ if ( bcf_gt_is_missing(ptr[1]) ) return -1;
+ als[1] = bcf_gt_allele(ptr[1]);
+
+ return 0;
+}
+
+static void process_record(args_t *args, bcf1_t *rec, flt_stats_t *flt)
+{
+ int i,j;
+ uint8_t *smpl_pass = NULL;
+
+ // Find out which samples pass and if the site passes
+ if ( flt->filter )
+ {
+ int pass_site = filter_test(flt->filter, rec, (const uint8_t**) &smpl_pass);
+ if ( args->filter_logic & FLT_EXCLUDE )
+ {
+ if ( pass_site )
+ {
+ if ( !smpl_pass ) return;
+ pass_site = 0;
+ for (i=0; i<args->nsmpl; i++)
+ {
+ if ( smpl_pass[i] ) smpl_pass[i] = 0;
+ else { smpl_pass[i] = 1; pass_site = 1; }
+ }
+ if ( !pass_site ) return;
+ }
+ else
+ for (i=0; i<args->nsmpl; i++) smpl_pass[i] = 1;
+ }
+ else if ( !pass_site ) return;
+ }
+
+ // Find out the allele counts. Try to use INFO/AC, if not present, determine from the genotypes
+ hts_expand(int, rec->n_allele, args->mac, args->ac);
+ if ( !bcf_calc_ac(args->hdr, rec, args->ac, BCF_UN_INFO|BCF_UN_FMT) ) return;
+
+ // Get the genotypes
+ int ngt = bcf_get_genotypes(args->hdr, rec, &args->gt_arr, &args->mgt_arr);
+ if ( ngt<0 ) return;
+ int ngt1 = ngt / rec->n_sample;
+
+
+ // For ts/tv: numeric code of the reference allele, -1 if REF is not a single A, C, G or T base
+ int ref = !rec->d.allele[0][1] ? bcf_acgt2int(*rec->d.allele[0]) : -1;
+
+ int star_allele = -1;
+ for (i=1; i<rec->n_allele; i++)
+ if ( !rec->d.allele[i][1] && rec->d.allele[i][0]=='*' ) { star_allele = i; break; }
+
+ // Run the stats
+ int site_pass = 0;
+ int site_SNV = 0;
+ int site_Indel = 0;
+ int site_has_ts = 0;
+ int site_has_tv = 0;
+ int site_singleton = 0;
+ for (i=0; i<args->nsmpl; i++)
+ {
+ if ( smpl_pass && !smpl_pass[i] ) continue;
+ stats_t *stats = &flt->stats[i];
+
+ // Determine the alternate allele and the genotypes, skip if any of the alleles is missing.
+ int als[2];
+ int ret = parse_genotype(args->gt_arr, ngt1, i, als);
+ if ( ret==-1 ) { stats->nmissing++; continue; } // missing allele
+ if ( ret==-2 ) stats->nhemi++;
+ else if ( als[0]!=als[1] ) stats->nhet++;
+ else if ( als[0]==0 ) stats->nhomRR++;
+ else stats->nhomAA++;
+
+ stats->npass++;
+ site_pass = 1;
+
+ // Is there an alternate allele other than *?
+ int has_nonref = 0;
+ for (j=0; j<2; j++)
+ {
+ if ( als[j]==star_allele ) continue;
+ if ( als[j]==0 ) continue;
+ has_nonref = 1;
+ }
+ if ( !has_nonref ) continue; // only ref or * in this genotype
+
+ stats->nnon_ref++;
+
+ // Calculate ts/tv and count SNVs and indels. Genotypes with two different ALT alleles (HetAA) are handled as well
+ {
+ int has_ts = 0, has_tv = 0, has_snv = 0, has_indel = 0;
+ for (j=0; j<2; j++)
+ {
+ if ( als[j]==0 || als[j]==star_allele ) continue;
+ if ( als[j] >= rec->n_allele )
+ error("The GT index is out of range at %s:%d in %s\n", bcf_seqname(args->hdr,rec),rec->pos+1,args->hdr->samples[i]);
+
+ if ( args->ac[als[j]]==1 ) { stats->nsingleton++; site_singleton = 1; }
+
+ int var_type = bcf_get_variant_type(rec, als[j]);
+ if ( var_type==VCF_SNP || var_type==VCF_MNP )
+ {
+ int k = 0;
+ while ( rec->d.allele[0][k] && rec->d.allele[als[j]][k] )
+ {
+ if ( rec->d.allele[0][k]==rec->d.allele[als[j]][k] ) { k++; continue; }
+
+ int alt = bcf_acgt2int(rec->d.allele[als[j]][k]);
+ if ( abs(ref-alt)==2 ) has_ts = 1;
+ else has_tv = 1;
+ has_snv = 1;
+
+ k++;
+ }
+ }
+ else if ( var_type==VCF_INDEL ) has_indel = 1;
+ }
+ if ( has_ts ) { stats->nts++; site_has_ts = 1; }
+ if ( has_tv ) { stats->ntv++; site_has_tv = 1; }
+ if ( has_snv ) { stats->nSNV++; site_SNV = 1; }
+ if ( has_indel ) { stats->nIndel++; site_Indel = 1; }
+ }
+ }
+ flt->site_stats.npass += site_pass;
+ flt->site_stats.nSNV += site_SNV;
+ flt->site_stats.nIndel += site_Indel;
+ flt->site_stats.nts += site_has_ts;
+ flt->site_stats.ntv += site_has_tv;
+ flt->site_stats.nsingleton += site_singleton;
+}
+
+int run(int argc, char **argv)
+{
+ args_t *args = (args_t*) calloc(1,sizeof(args_t));
+ args->argc = argc; args->argv = argv;
+ args->output_fname = "-";
+ static struct option loptions[] =
+ {
+ {"include",required_argument,0,'i'},
+ {"exclude",required_argument,0,'e'},
+ {"output",required_argument,NULL,'o'},
+ {"regions",1,0,'r'},
+ {"regions-file",1,0,'R'},
+ {"targets",1,0,'t'},
+ {"targets-file",1,0,'T'},
+ {NULL,0,NULL,0}
+ };
+ int c, i;
+ while ((c = getopt_long(argc, argv, "ho:i:e:r:R:t:T:",loptions,NULL)) >= 0)
+ {
+ switch (c)
+ {
+ case 'e': args->filter_str = optarg; args->filter_logic |= FLT_EXCLUDE; break;
+ case 'i': args->filter_str = optarg; args->filter_logic |= FLT_INCLUDE; break;
+ case 't': args->targets = optarg; break;
+ case 'T': args->targets = optarg; args->targets_is_file = 1; break;
+ case 'r': args->regions = optarg; break;
+ case 'R': args->regions = optarg; args->regions_is_file = 1; break;
+ case 'o': args->output_fname = optarg; break;
+ case 'h':
+ case '?':
+ default: error("%s", usage_text()); break;
+ }
+ }
+ if ( optind==argc )
+ {
+ if ( !isatty(fileno((FILE *)stdin)) ) args->fname = "-"; // reading from stdin
+ else { error("%s",usage_text()); }
+ }
+ else if ( optind+1!=argc ) error("%s",usage_text());
+ else args->fname = argv[optind];
+
+ init_data(args);
+
+ while ( bcf_sr_next_line(args->sr) )
+ {
+ bcf1_t *rec = bcf_sr_get_line(args->sr,0);
+ for (i=0; i<args->nfilters; i++)
+ process_record(args, rec, &args->filters[i]);
+ }
+
+ report_stats(args);
+ destroy_data(args);
+
+ return 0;
+}
--- /dev/null
+/*
+ Copyright (C) 2017 Genome Research Ltd.
+
+ Author: Petr Danecek <pd3@sanger.ac.uk>
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in
+ all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ THE SOFTWARE.
+*/
+/*
+ Split VCF by sample(s)
+
+*/
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h> // strdup, strerror
+#include <strings.h>
+#include <getopt.h>
+#include <stdarg.h>
+#include <unistd.h>
+#include <stdint.h>
+#include <errno.h>
+#include <ctype.h>
+#include <htslib/vcf.h>
+#include <htslib/synced_bcf_reader.h>
+#include "bcftools.h"
+#include "filter.h"
+
+#define FLT_INCLUDE 1
+#define FLT_EXCLUDE 2
+
+typedef struct
+{
+ htsFile **fh;
+ filter_t *filter;
+ char *filter_str;
+ int filter_logic; // one of FLT_INCLUDE/FLT_EXCLUDE (-i or -e)
+ uint8_t *info_tags, *fmt_tags;
+ int ninfo_tags, minfo_tags, nfmt_tags, mfmt_tags, keep_info, keep_fmt;
+ int argc, region_is_file, target_is_file, output_type;
+ char **argv, *region, *target, *fname, *output_dir, *keep_tags, **bnames, *samples_fname;
+ bcf_hdr_t *hdr_in, *hdr_out;
+ bcf_srs_t *sr;
+}
+args_t;
+
+const char *about(void)
+{
+ return "Split VCF by sample creating single-sample VCFs\n";
+}
+
+static const char *usage_text(void)
+{
+ return
+ "\n"
+ "About: Split VCF by sample, creating single-sample VCFs.\n"
+ "\n"
+ "Usage: bcftools +split [Options]\n"
+ "Plugin options:\n"
+ " -e, --exclude EXPR exclude sites for which the expression is true (applied on the outputs)\n"
+ " -i, --include EXPR include only sites for which the expression is true (applied on the outputs)\n"
+ " -k, --keep-tags LIST list of tags to keep. By default all tags are preserved\n"
+ " -o, --output DIR write output to the directory DIR\n"
+ " -O, --output-type b|u|z|v b: compressed BCF, u: uncompressed BCF, z: compressed VCF, v: uncompressed VCF [v]\n"
+ " -r, --regions REGION restrict to comma-separated list of regions\n"
+ " -R, --regions-file FILE restrict to regions listed in a file\n"
+ " -S, --samples-file FILE list of samples to keep with second (optional) column for basename of the new file\n"
+ " -t, --targets REGION similar to -r but streams rather than index-jumps\n"
+ " -T, --targets-file FILE similar to -R but streams rather than index-jumps\n"
+ "Examples:\n"
+ " # Split a VCF file\n"
+ " bcftools +split input.bcf -Ob -o dir\n"
+ "\n"
+ " # Exclude sites with missing or hom-ref genotypes\n"
+ " bcftools +split input.bcf -Ob -o dir -i'GT=\"alt\"'\n"
+ "\n"
+ " # Keep all INFO tags but only GT and PL in FORMAT\n"
+ " bcftools +split input.bcf -Ob -o dir -k INFO,FMT/GT,PL\n"
+ "\n"
+ " # Keep all FORMAT tags but drop all INFO tags\n"
+ " bcftools +split input.bcf -Ob -o dir -k FMT\n"
+ "\n";
+}
+
+void mkdir_p(const char *fmt, ...);
+
+char **set_file_base_names(args_t *args)
+{
+ int i, nsmpl = bcf_hdr_nsamples(args->hdr_in);
+ char **fnames = (char**) calloc(nsmpl,sizeof(char*));
+ if ( args->samples_fname )
+ {
+ kstring_t str = {0,0,0};
+ int nsamples = 0;
+ char **samples = NULL;
+ samples = hts_readlines(args->samples_fname, &nsamples);
+ for (i=0; i<nsamples; i++)
+ {
+ str.l = 0;
+ int escaped = 0;
+ char *ptr = samples[i];
+ while ( *ptr )
+ {
+ if ( *ptr=='\\' && !escaped ) { escaped = 1; ptr++; continue; }
+ if ( isspace(*ptr) && !escaped ) break;
+ kputc(*ptr, &str);
+ escaped = 0;
+ ptr++;
+ }
+ int idx = bcf_hdr_id2int(args->hdr_in, BCF_DT_SAMPLE, str.s);
+ if ( idx<0 )
+ {
+ fprintf(stderr,"Warning: The sample \"%s\" is not present in %s\n", str.s,args->fname);
+ continue;
+ }
+ while ( *ptr && isspace(*ptr) ) ptr++;
+ if ( !*ptr )
+ {
+ fnames[idx] = strdup(str.s);
+ continue;
+ }
+ str.l = 0;
+ while ( *ptr )
+ {
+ if ( *ptr=='\\' && !escaped ) { escaped = 1; ptr++; continue; }
+ if ( isspace(*ptr) && !escaped ) break;
+ kputc(*ptr, &str);
+ escaped = 0;
+ ptr++;
+ }
+ fnames[idx] = strdup(str.s);
+ }
+ for (i=0; i<nsamples; i++) free(samples[i]);
+ free(samples);
+ free(str.s);
+ }
+ else
+ {
+ for (i=0; i<nsmpl; i++)
+ fnames[i] = strdup(args->hdr_in->samples[i]);
+ }
+ return fnames;
+}
+
+static void init_data(args_t *args)
+{
+ args->sr = bcf_sr_init();
+ if ( args->region )
+ {
+ args->sr->require_index = 1;
+ if ( bcf_sr_set_regions(args->sr, args->region, args->region_is_file)<0 ) error("Failed to read the regions: %s\n",args->region);
+ }
+ if ( args->target && bcf_sr_set_targets(args->sr, args->target, args->target_is_file, 0)<0 ) error("Failed to read the targets: %s\n",args->target);
+ if ( !bcf_sr_add_reader(args->sr,args->fname) ) error("Error: %s\n", bcf_sr_strerror(args->sr->errnum));
+ args->hdr_in = bcf_sr_get_header(args->sr,0);
+ args->hdr_out = bcf_hdr_dup(args->hdr_in);
+
+ if ( args->filter_str )
+ args->filter = filter_init(args->hdr_in, args->filter_str);
+
+ mkdir_p("%s/",args->output_dir);
+
+ int i, nsmpl = bcf_hdr_nsamples(args->hdr_in);
+ if ( !nsmpl ) error("No samples to split: %s\n", args->fname);
+ args->fh = (htsFile**)calloc(nsmpl,sizeof(*args->fh));
+ args->bnames = set_file_base_names(args);
+ kstring_t str = {0,0,0};
+ for (i=0; i<nsmpl; i++)
+ {
+ if ( !args->bnames[i] ) continue;
+ str.l = 0;
+ kputs(args->output_dir, &str);
+ if ( str.s[str.l-1] != '/' ) kputc('/', &str);
+ int k, l = str.l;
+ kputs(args->bnames[i], &str);
+ for (k=l; k<str.l; k++) if ( isspace(str.s[k]) ) str.s[k] = '_';
+ if ( args->output_type & FT_BCF ) kputs(".bcf", &str);
+ else if ( args->output_type & FT_GZ ) kputs(".vcf.gz", &str);
+ else kputs(".vcf", &str);
+ args->fh[i] = hts_open(str.s, hts_bcf_wmode(args->output_type));
+ if ( args->fh[i] == NULL ) error("Can't write to \"%s\": %s\n", str.s, strerror(errno));
+ bcf_hdr_nsamples(args->hdr_out) = 1;
+ args->hdr_out->samples[0] = args->bnames[i];
+ if ( bcf_hdr_write(args->fh[i], args->hdr_out)!=0 ) error("Failed to write the header to \"%s\"\n", str.s);
+ }
+ free(str.s);
+
+ // parse tags
+ int is_info = 0, is_fmt = 0;
+ char *beg = args->keep_tags;
+ while ( beg && *beg )
+ {
+ if ( !strncasecmp("INFO/",beg,5) ) { is_info = 1; is_fmt = 0; beg += 5; }
+ else if ( !strcasecmp("INFO",beg) ) { args->keep_info = 1; break; }
+ else if ( !strncasecmp("INFO,",beg,5) ) { args->keep_info = 1; beg += 5; continue; }
+ else if ( !strncasecmp("FMT/",beg,4) ) { is_info = 0; is_fmt = 1; beg += 4; }
+ else if ( !strncasecmp("FORMAT/",beg,7) ) { is_info = 0; is_fmt = 1; beg += 7; }
+ else if ( !strcasecmp("FMT",beg) ) { args->keep_fmt = 1; break; }
+ else if ( !strcasecmp("FORMAT",beg) ) { args->keep_fmt = 1; break; }
+ else if ( !strncasecmp("FMT,",beg,4) ) { args->keep_fmt = 1; beg += 4; continue; }
+ else if ( !strncasecmp("FORMAT,",beg,7) ) { args->keep_fmt = 1; beg += 7; continue; }
+ char *end = beg;
+ while ( *end && *end!=',' ) end++;
+ char tmp = *end; *end = 0;
+ int id = bcf_hdr_id2int(args->hdr_in, BCF_DT_ID, beg);
+ beg = tmp ? end + 1 : end;
+ if ( is_info && bcf_hdr_idinfo_exists(args->hdr_in,BCF_HL_INFO,id) )
+ {
+ if ( id >= args->ninfo_tags ) args->ninfo_tags = id + 1;
+ hts_expand0(uint8_t, args->ninfo_tags, args->minfo_tags, args->info_tags);
+ args->info_tags[id] = 1;
+ }
+ if ( is_fmt && bcf_hdr_idinfo_exists(args->hdr_in,BCF_HL_FMT,id) )
+ {
+ if ( id >= args->nfmt_tags ) args->nfmt_tags = id + 1;
+ hts_expand0(uint8_t, args->nfmt_tags, args->mfmt_tags, args->fmt_tags);
+ args->fmt_tags[id] = 1;
+ }
+ }
+ if ( !args->keep_info && !args->keep_fmt && !args->ninfo_tags && !args->nfmt_tags )
+ {
+ args->keep_info = args->keep_fmt = 1;
+ }
+}
+static void destroy_data(args_t *args)
+{
+ free(args->info_tags);
+ free(args->fmt_tags);
+ if ( args->filter )
+ filter_destroy(args->filter);
+ int i, nsmpl = bcf_hdr_nsamples(args->hdr_in);
+ for (i=0; i<nsmpl; i++)
+ {
+ if ( args->fh[i] && hts_close(args->fh[i])!=0 ) error("Error: close failed!\n");
+ free(args->bnames[i]);
+ }
+ free(args->bnames);
+ free(args->fh);
+ bcf_sr_destroy(args->sr);
+ bcf_hdr_destroy(args->hdr_out);
+ free(args);
+}
+
+static bcf1_t *rec_set_info(args_t *args, bcf1_t *rec)
+{
+ bcf1_t *out = bcf_init1();
+ out->rid = rec->rid;
+ out->pos = rec->pos;
+ out->rlen = rec->rlen;
+ out->qual = rec->qual;
+ out->n_allele = rec->n_allele;
+ out->n_sample = 1;
+ if ( args->keep_info )
+ {
+ out->n_info = rec->n_info;
+ out->shared.m = out->shared.l = rec->shared.l;
+ out->shared.s = (char*) malloc(out->shared.l);
+ memcpy(out->shared.s,rec->shared.s,out->shared.l);
+ return out;
+ }
+
+ // build the BCF record
+ kstring_t tmp = {0,0,0};
+ char *ptr = rec->shared.s;
+ kputsn_(ptr, rec->unpack_size[0], &tmp); ptr += rec->unpack_size[0]; // ID
+ kputsn_(ptr, rec->unpack_size[1], &tmp); ptr += rec->unpack_size[1]; // REF+ALT
+ kputsn_(ptr, rec->unpack_size[2], &tmp); // FILTER
+ if ( args->ninfo_tags )
+ {
+ int i;
+ for (i=0; i<rec->n_info; i++)
+ {
+ bcf_info_t *info = &rec->d.info[i];
+ int id = info->key;
+ if ( id >= args->ninfo_tags || !args->info_tags[id] ) continue; // ids beyond the requested range were never selected
+ kputsn_(info->vptr - info->vptr_off, info->vptr_len + info->vptr_off, &tmp);
+ out->n_info++;
+ }
+ }
+ out->shared.m = tmp.m;
+ out->shared.s = tmp.s;
+ out->shared.l = tmp.l;
+ out->unpacked = 0;
+ return out;
+}
+
+static bcf1_t *rec_set_format(args_t *args, bcf1_t *src, int ismpl, bcf1_t *dst)
+{
+ dst->n_fmt = 0;
+ kstring_t tmp = dst->indiv; tmp.l = 0;
+ int i;
+ for (i=0; i<src->n_fmt; i++)
+ {
+ bcf_fmt_t *fmt = &src->d.fmt[i];
+ int id = fmt->id;
+ if ( !args->keep_fmt && ( id >= args->nfmt_tags || !args->fmt_tags[id] ) ) continue; // fmt_tags is NULL when only INFO tags were requested
+
+ bcf_enc_int1(&tmp, id);
+ bcf_enc_size(&tmp, fmt->n, fmt->type);
+ kputsn_(fmt->p + ismpl*fmt->size, fmt->size, &tmp);
+
+ dst->n_fmt++;
+ }
+ dst->indiv = tmp;
+ return dst;
+}
+
+static void process(args_t *args)
+{
+ bcf1_t *rec = bcf_sr_get_line(args->sr,0);
+ bcf_unpack(rec, BCF_UN_ALL);
+
+ int i, site_pass = 1;
+ const uint8_t *smpl_pass = NULL;
+ if ( args->filter )
+ {
+ site_pass = filter_test(args->filter, rec, &smpl_pass);
+ if ( args->filter_logic & FLT_EXCLUDE ) site_pass = site_pass ? 0 : 1;
+ }
+ bcf1_t *out = NULL;
+ for (i=0; i<rec->n_sample; i++)
+ {
+ if ( !args->fh[i] ) continue;
+ if ( !smpl_pass && !site_pass ) continue;
+ if ( smpl_pass )
+ {
+ int pass = args->filter_logic & FLT_EXCLUDE ? ( smpl_pass[i] ? 0 : 1) : smpl_pass[i];
+ if ( !pass ) continue;
+ }
+ if ( !out ) out = rec_set_info(args, rec);
+ rec_set_format(args, rec, i, out);
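+ if ( bcf_write(args->fh[i], args->hdr_out, out)!=0 ) error("Failed to write the record at %s:%d\n", bcf_seqname(args->hdr_in,rec), rec->pos+1);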
+ }
+ if ( out ) bcf_destroy(out);
+}
+
+int run(int argc, char **argv)
+{
+ args_t *args = (args_t*) calloc(1,sizeof(args_t));
+ args->argc = argc; args->argv = argv;
+ args->output_type = FT_VCF;
+ static struct option loptions[] =
+ {
+ {"keep-tags",required_argument,NULL,'k'},
+ {"exclude",required_argument,NULL,'e'},
+ {"include",required_argument,NULL,'i'},
+ {"regions",required_argument,NULL,'r'},
+ {"regions-file",required_argument,NULL,'R'},
+ {"targets",required_argument,NULL,'t'},
+ {"targets-file",required_argument,NULL,'T'},
+ {"samples-file",required_argument,NULL,'S'},
+ {"output",required_argument,NULL,'o'},
+ {"output-type",required_argument,NULL,'O'},
+ {NULL,0,NULL,0}
+ };
+ int c;
+ while ((c = getopt_long(argc, argv, "hr:R:t:T:o:O:i:e:k:S:",loptions,NULL)) >= 0)
+ {
+ switch (c)
+ {
+ case 'k': args->keep_tags = optarg; break;
+ case 'e': args->filter_str = optarg; args->filter_logic |= FLT_EXCLUDE; break;
+ case 'i': args->filter_str = optarg; args->filter_logic |= FLT_INCLUDE; break;
+ case 'T': args->target = optarg; args->target_is_file = 1; break;
+ case 't': args->target = optarg; break;
+ case 'R': args->region = optarg; args->region_is_file = 1; break;
+ case 'S': args->samples_fname = optarg; break;
+ case 'r': args->region = optarg; break;
+ case 'o': args->output_dir = optarg; break;
+ case 'O':
+ switch (optarg[0]) {
+ case 'b': args->output_type = FT_BCF_GZ; break;
+ case 'u': args->output_type = FT_BCF; break;
+ case 'z': args->output_type = FT_VCF_GZ; break;
+ case 'v': args->output_type = FT_VCF; break;
+ default: error("The output type \"%s\" not recognised\n", optarg);
+ }
+ break;
+ case 'h':
+ case '?':
+ default: error("%s", usage_text()); break;
+ }
+ }
+ if ( optind==argc )
+ {
+ if ( !isatty(fileno((FILE *)stdin)) ) args->fname = "-"; // reading from stdin
+ else { error("%s", usage_text()); }
+ }
+ else if ( optind+1!=argc ) error("%s", usage_text());
+ else args->fname = argv[optind];
+
+ if ( !args->output_dir ) error("Missing the -o option\n");
+ if ( args->filter_logic == (FLT_EXCLUDE|FLT_INCLUDE) ) error("Only one of -i or -e can be given.\n");
+
+ init_data(args);
+
+ while ( bcf_sr_next_line(args->sr) ) process(args);
+
+ destroy_data(args);
+ return 0;
+}
+
+
--- /dev/null
+#include "bcftools.pysam.h"
+
+/*
+ Copyright (C) 2017 Genome Research Ltd.
+
+ Author: Petr Danecek <pd3@sanger.ac.uk>
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in
+ all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ THE SOFTWARE.
+*/
+/*
+ Split VCF by sample(s)
+
+*/
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <strings.h>
+#include <getopt.h>
+#include <stdarg.h>
+#include <unistd.h>
+#include <stdint.h>
+#include <errno.h>
+#include <ctype.h>
+#include <htslib/vcf.h>
+#include <htslib/synced_bcf_reader.h>
+#include "bcftools.h"
+#include "filter.h"
+
+#define FLT_INCLUDE 1
+#define FLT_EXCLUDE 2
+
+typedef struct
+{
+ htsFile **fh;
+ filter_t *filter;
+ char *filter_str;
+ int filter_logic; // one of FLT_INCLUDE/FLT_EXCLUDE (-i or -e)
+ uint8_t *info_tags, *fmt_tags;
+ int ninfo_tags, minfo_tags, nfmt_tags, mfmt_tags, keep_info, keep_fmt;
+ int argc, region_is_file, target_is_file, output_type;
+ char **argv, *region, *target, *fname, *output_dir, *keep_tags, **bnames, *samples_fname;
+ bcf_hdr_t *hdr_in, *hdr_out;
+ bcf_srs_t *sr;
+}
+args_t;
+
+const char *about(void)
+{
+ return "Split VCF by sample creating single-sample VCFs\n";
+}
+
+static const char *usage_text(void)
+{
+ return
+ "\n"
+ "About: Split VCF by sample, creating single-sample VCFs.\n"
+ "\n"
+ "Usage: bcftools +split [Options]\n"
+ "Plugin options:\n"
+ " -e, --exclude EXPR exclude sites for which the expression is true (applied on the outputs)\n"
+ " -i, --include EXPR include only sites for which the expression is true (applied on the outputs)\n"
+ " -k, --keep-tags LIST list of tags to keep. By default all tags are preserved\n"
+ " -o, --output DIR write output to the directory DIR\n"
+ " -O, --output-type b|u|z|v b: compressed BCF, u: uncompressed BCF, z: compressed VCF, v: uncompressed VCF [v]\n"
+ " -r, --regions REGION restrict to comma-separated list of regions\n"
+ " -R, --regions-file FILE restrict to regions listed in a file\n"
+ " -S, --samples-file FILE list of samples to keep with second (optional) column for basename of the new file\n"
+ " -t, --targets REGION similar to -r but streams rather than index-jumps\n"
+ " -T, --targets-file FILE similar to -R but streams rather than index-jumps\n"
+ "Examples:\n"
+ " # Split a VCF file\n"
+ " bcftools +split input.bcf -Ob -o dir\n"
+ "\n"
+ " # Exclude sites with missing or hom-ref genotypes\n"
+ " bcftools +split input.bcf -Ob -o dir -i'GT=\"alt\"'\n"
+ "\n"
+ " # Keep all INFO tags but only GT and PL in FORMAT\n"
+ " bcftools +split input.bcf -Ob -o dir -k INFO,FMT/GT,PL\n"
+ "\n"
+ " # Keep all FORMAT tags but drop all INFO tags\n"
+ " bcftools +split input.bcf -Ob -o dir -k FMT\n"
+ "\n";
+}
+
+void mkdir_p(const char *fmt, ...);
+
+char **set_file_base_names(args_t *args)
+{
+ int i, nsmpl = bcf_hdr_nsamples(args->hdr_in);
+ char **fnames = (char**) calloc(nsmpl,sizeof(char*));
+ if ( args->samples_fname )
+ {
+ kstring_t str = {0,0,0};
+ int nsamples = 0;
+ char **samples = NULL;
+ samples = hts_readlines(args->samples_fname, &nsamples);
+ for (i=0; i<nsamples; i++)
+ {
+ str.l = 0;
+ int escaped = 0;
+ char *ptr = samples[i];
+ while ( *ptr )
+ {
+ if ( *ptr=='\\' && !escaped ) { escaped = 1; ptr++; continue; }
+ if ( isspace(*ptr) && !escaped ) break;
+ kputc(*ptr, &str);
+ escaped = 0;
+ ptr++;
+ }
+ int idx = bcf_hdr_id2int(args->hdr_in, BCF_DT_SAMPLE, str.s);
+ if ( idx<0 )
+ {
+ fprintf(bcftools_stderr,"Warning: The sample \"%s\" is not present in %s\n", str.s,args->fname);
+ continue;
+ }
+ while ( *ptr && isspace(*ptr) ) ptr++;
+ if ( !*ptr )
+ {
+ fnames[idx] = strdup(str.s);
+ continue;
+ }
+ str.l = 0;
+ while ( *ptr )
+ {
+ if ( *ptr=='\\' && !escaped ) { escaped = 1; ptr++; continue; }
+ if ( isspace(*ptr) && !escaped ) break;
+ kputc(*ptr, &str);
+ escaped = 0;
+ ptr++;
+ }
+ fnames[idx] = strdup(str.s);
+ }
+ for (i=0; i<nsamples; i++) free(samples[i]);
+ free(samples);
+ free(str.s);
+ }
+ else
+ {
+ for (i=0; i<nsmpl; i++)
+ fnames[i] = strdup(args->hdr_in->samples[i]);
+ }
+ return fnames;
+}
+
+static void init_data(args_t *args)
+{
+ args->sr = bcf_sr_init();
+ if ( args->region )
+ {
+ args->sr->require_index = 1;
+ if ( bcf_sr_set_regions(args->sr, args->region, args->region_is_file)<0 ) error("Failed to read the regions: %s\n",args->region);
+ }
+ if ( args->target && bcf_sr_set_targets(args->sr, args->target, args->target_is_file, 0)<0 ) error("Failed to read the targets: %s\n",args->target);
+ if ( !bcf_sr_add_reader(args->sr,args->fname) ) error("Error: %s\n", bcf_sr_strerror(args->sr->errnum));
+ args->hdr_in = bcf_sr_get_header(args->sr,0);
+ args->hdr_out = bcf_hdr_dup(args->hdr_in);
+
+ if ( args->filter_str )
+ args->filter = filter_init(args->hdr_in, args->filter_str);
+
+ mkdir_p("%s/",args->output_dir);
+
+ int i, nsmpl = bcf_hdr_nsamples(args->hdr_in);
+ if ( !nsmpl ) error("No samples to split: %s\n", args->fname);
+ args->fh = (htsFile**)calloc(nsmpl,sizeof(*args->fh));
+ args->bnames = set_file_base_names(args);
+ kstring_t str = {0,0,0};
+ for (i=0; i<nsmpl; i++)
+ {
+ if ( !args->bnames[i] ) continue;
+ str.l = 0;
+ kputs(args->output_dir, &str);
+ if ( str.s[str.l-1] != '/' ) kputc('/', &str);
+ int k, l = str.l;
+ kputs(args->bnames[i], &str);
+ for (k=l; k<str.l; k++) if ( isspace(str.s[k]) ) str.s[k] = '_';
+ if ( args->output_type & FT_BCF ) kputs(".bcf", &str);
+ else if ( args->output_type & FT_GZ ) kputs(".vcf.gz", &str);
+ else kputs(".vcf", &str);
+ args->fh[i] = hts_open(str.s, hts_bcf_wmode(args->output_type));
+ if ( args->fh[i] == NULL ) error("Can't write to \"%s\": %s\n", str.s, strerror(errno));
+ bcf_hdr_nsamples(args->hdr_out) = 1;
+ args->hdr_out->samples[0] = args->bnames[i];
+ if ( bcf_hdr_write(args->fh[i], args->hdr_out)!=0 ) error("Failed to write the header to \"%s\"\n", str.s);
+ }
+ free(str.s);
+
+ // parse tags
+ int is_info = 0, is_fmt = 0;
+ char *beg = args->keep_tags;
+ while ( beg && *beg )
+ {
+ if ( !strncasecmp("INFO/",beg,5) ) { is_info = 1; is_fmt = 0; beg += 5; }
+ else if ( !strcasecmp("INFO",beg) ) { args->keep_info = 1; break; }
+ else if ( !strncasecmp("INFO,",beg,5) ) { args->keep_info = 1; beg += 5; continue; }
+ else if ( !strncasecmp("FMT/",beg,4) ) { is_info = 0; is_fmt = 1; beg += 4; }
+ else if ( !strncasecmp("FORMAT/",beg,7) ) { is_info = 0; is_fmt = 1; beg += 7; }
+ else if ( !strcasecmp("FMT",beg) ) { args->keep_fmt = 1; break; }
+ else if ( !strcasecmp("FORMAT",beg) ) { args->keep_fmt = 1; break; }
+ else if ( !strncasecmp("FMT,",beg,4) ) { args->keep_fmt = 1; beg += 4; continue; }
+ else if ( !strncasecmp("FORMAT,",beg,7) ) { args->keep_fmt = 1; beg += 7; continue; }
+ char *end = beg;
+ while ( *end && *end!=',' ) end++;
+ char tmp = *end; *end = 0;
+ int id = bcf_hdr_id2int(args->hdr_in, BCF_DT_ID, beg);
+ beg = tmp ? end + 1 : end;
+ if ( is_info && bcf_hdr_idinfo_exists(args->hdr_in,BCF_HL_INFO,id) )
+ {
+ if ( id >= args->ninfo_tags ) args->ninfo_tags = id + 1;
+ hts_expand0(uint8_t, args->ninfo_tags, args->minfo_tags, args->info_tags);
+ args->info_tags[id] = 1;
+ }
+ if ( is_fmt && bcf_hdr_idinfo_exists(args->hdr_in,BCF_HL_FMT,id) )
+ {
+ if ( id >= args->nfmt_tags ) args->nfmt_tags = id + 1;
+ hts_expand0(uint8_t, args->nfmt_tags, args->mfmt_tags, args->fmt_tags);
+ args->fmt_tags[id] = 1;
+ }
+ }
+ if ( !args->keep_info && !args->keep_fmt && !args->ninfo_tags && !args->nfmt_tags )
+ {
+ args->keep_info = args->keep_fmt = 1;
+ }
+}
+static void destroy_data(args_t *args)
+{
+ free(args->info_tags);
+ free(args->fmt_tags);
+ if ( args->filter )
+ filter_destroy(args->filter);
+ int i, nsmpl = bcf_hdr_nsamples(args->hdr_in);
+ for (i=0; i<nsmpl; i++)
+ {
+ if ( args->fh[i] && hts_close(args->fh[i])!=0 ) error("Error: close failed!\n");
+ free(args->bnames[i]);
+ }
+ free(args->bnames);
+ free(args->fh);
+ bcf_sr_destroy(args->sr);
+ bcf_hdr_destroy(args->hdr_out);
+ free(args);
+}
+
+static bcf1_t *rec_set_info(args_t *args, bcf1_t *rec)
+{
+ bcf1_t *out = bcf_init1();
+ out->rid = rec->rid;
+ out->pos = rec->pos;
+ out->rlen = rec->rlen;
+ out->qual = rec->qual;
+ out->n_allele = rec->n_allele;
+ out->n_sample = 1;
+ if ( args->keep_info )
+ {
+ out->n_info = rec->n_info;
+ out->shared.m = out->shared.l = rec->shared.l;
+ out->shared.s = (char*) malloc(out->shared.l);
+ memcpy(out->shared.s,rec->shared.s,out->shared.l);
+ return out;
+ }
+
+ // build the BCF record
+ kstring_t tmp = {0,0,0};
+ char *ptr = rec->shared.s;
+ kputsn_(ptr, rec->unpack_size[0], &tmp); ptr += rec->unpack_size[0]; // ID
+ kputsn_(ptr, rec->unpack_size[1], &tmp); ptr += rec->unpack_size[1]; // REF+ALT
+ kputsn_(ptr, rec->unpack_size[2], &tmp); // FILTER
+ if ( args->ninfo_tags )
+ {
+ int i;
+ for (i=0; i<rec->n_info; i++)
+ {
+ bcf_info_t *info = &rec->d.info[i];
+ int id = info->key;
+ if ( id >= args->ninfo_tags || !args->info_tags[id] ) continue; // ids beyond the requested range were never selected
+ kputsn_(info->vptr - info->vptr_off, info->vptr_len + info->vptr_off, &tmp);
+ out->n_info++;
+ }
+ }
+ out->shared.m = tmp.m;
+ out->shared.s = tmp.s;
+ out->shared.l = tmp.l;
+ out->unpacked = 0;
+ return out;
+}
+
+static bcf1_t *rec_set_format(args_t *args, bcf1_t *src, int ismpl, bcf1_t *dst)
+{
+ dst->n_fmt = 0;
+ kstring_t tmp = dst->indiv; tmp.l = 0;
+ int i;
+ for (i=0; i<src->n_fmt; i++)
+ {
+ bcf_fmt_t *fmt = &src->d.fmt[i];
+ int id = fmt->id;
+ if ( !args->keep_fmt && ( id >= args->nfmt_tags || !args->fmt_tags[id] ) ) continue; // fmt_tags is NULL when only INFO tags were requested
+
+ bcf_enc_int1(&tmp, id);
+ bcf_enc_size(&tmp, fmt->n, fmt->type);
+ kputsn_(fmt->p + ismpl*fmt->size, fmt->size, &tmp);
+
+ dst->n_fmt++;
+ }
+ dst->indiv = tmp;
+ return dst;
+}
+
+static void process(args_t *args)
+{
+ bcf1_t *rec = bcf_sr_get_line(args->sr,0);
+ bcf_unpack(rec, BCF_UN_ALL);
+
+ int i, site_pass = 1;
+ const uint8_t *smpl_pass = NULL;
+ if ( args->filter )
+ {
+ site_pass = filter_test(args->filter, rec, &smpl_pass);
+ if ( args->filter_logic & FLT_EXCLUDE ) site_pass = site_pass ? 0 : 1;
+ }
+ bcf1_t *out = NULL;
+ for (i=0; i<rec->n_sample; i++)
+ {
+ if ( !args->fh[i] ) continue;
+ if ( !smpl_pass && !site_pass ) continue;
+ if ( smpl_pass )
+ {
+ int pass = args->filter_logic & FLT_EXCLUDE ? ( smpl_pass[i] ? 0 : 1) : smpl_pass[i];
+ if ( !pass ) continue;
+ }
+ if ( !out ) out = rec_set_info(args, rec);
+ rec_set_format(args, rec, i, out);
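+ if ( bcf_write(args->fh[i], args->hdr_out, out)!=0 ) error("Failed to write the record at %s:%d\n", bcf_seqname(args->hdr_in,rec), rec->pos+1);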
+ }
+ if ( out ) bcf_destroy(out);
+}
+
+int run(int argc, char **argv)
+{
+ args_t *args = (args_t*) calloc(1,sizeof(args_t));
+ args->argc = argc; args->argv = argv;
+ args->output_type = FT_VCF;
+ static struct option loptions[] =
+ {
+ {"keep-tags",required_argument,NULL,'k'},
+ {"exclude",required_argument,NULL,'e'},
+ {"include",required_argument,NULL,'i'},
+ {"regions",required_argument,NULL,'r'},
+ {"regions-file",required_argument,NULL,'R'},
+ {"targets",required_argument,NULL,'t'},
+ {"targets-file",required_argument,NULL,'T'},
+ {"samples-file",required_argument,NULL,'S'},
+ {"output",required_argument,NULL,'o'},
+ {"output-type",required_argument,NULL,'O'},
+ {NULL,0,NULL,0}
+ };
+ int c;
+ while ((c = getopt_long(argc, argv, "hr:R:t:T:o:O:i:e:k:S:",loptions,NULL)) >= 0)
+ {
+ switch (c)
+ {
+ case 'k': args->keep_tags = optarg; break;
+ case 'e': args->filter_str = optarg; args->filter_logic |= FLT_EXCLUDE; break;
+ case 'i': args->filter_str = optarg; args->filter_logic |= FLT_INCLUDE; break;
+ case 'T': args->target = optarg; args->target_is_file = 1; break;
+ case 't': args->target = optarg; break;
+ case 'R': args->region = optarg; args->region_is_file = 1; break;
+ case 'S': args->samples_fname = optarg; break;
+ case 'r': args->region = optarg; break;
+ case 'o': args->output_dir = optarg; break;
+ case 'O':
+ switch (optarg[0]) {
+ case 'b': args->output_type = FT_BCF_GZ; break;
+ case 'u': args->output_type = FT_BCF; break;
+ case 'z': args->output_type = FT_VCF_GZ; break;
+ case 'v': args->output_type = FT_VCF; break;
+ default: error("The output type \"%s\" not recognised\n", optarg);
+ }
+ break;
+ case 'h':
+ case '?':
+ default: error("%s", usage_text()); break;
+ }
+ }
+ if ( optind==argc )
+ {
+ if ( !isatty(fileno((FILE *)stdin)) ) args->fname = "-"; // reading from stdin
+ else { error("%s", usage_text()); }
+ }
+ else if ( optind+1!=argc ) error("%s", usage_text());
+ else args->fname = argv[optind];
+
+ if ( !args->output_dir ) error("Missing the -o option\n");
+ if ( args->filter_logic == (FLT_EXCLUDE|FLT_INCLUDE) ) error("Only one of -i or -e can be given.\n");
+
+ init_data(args);
+
+ while ( bcf_sr_next_line(args->sr) ) process(args);
+
+ destroy_data(args);
+ return 0;
+}
+
+
--- /dev/null
+/* plugins/tag2tag.c -- convert between similar tags
+
+ Copyright (C) 2014-2016 Genome Research Ltd.
+
+ Author: Petr Danecek <pd3@sanger.ac.uk>
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE. */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <getopt.h>
+#include <math.h>
+#include <htslib/hts.h>
+#include <htslib/vcf.h>
+#include "bcftools.h"
+
+
+#define GP_TO_GL 1
+#define GL_TO_PL 2
+#define GP_TO_GT 3
+#define PL_TO_GL 4
+
+static int mode = 0, drop_source_tag = 0;
+static bcf_hdr_t *in_hdr, *out_hdr;
+static float *farr = NULL, thresh = 0.1;
+static int32_t *iarr = NULL;
+static int mfarr = 0, miarr = 0;
+
+const char *about(void)
+{
+ return "Convert between similar tags, such as GL, PL and GP.\n";
+}
+
+const char *usage(void)
+{
+ return
+ "\n"
+ "About: Convert between similar tags, such as GL, PL and GP.\n"
+ "Usage: bcftools +tag2tag [General Options] -- [Plugin Options]\n"
+ "Options:\n"
+ " run \"bcftools plugin\" for a list of common options\n"
+ "\n"
+ "Plugin options:\n"
+ " --gp-to-gl convert FORMAT/GP to FORMAT/GL\n"
+ " --gp-to-gt convert FORMAT/GP to FORMAT/GT by taking argmax of GP\n"
+ " --gl-to-pl convert FORMAT/GL to FORMAT/PL\n"
+ " --pl-to-gl convert FORMAT/PL to FORMAT/GL\n"
+ " -r, --replace drop the source tag\n"
+ " -t, --threshold <float> threshold for GP to GT hard-call [0.1]\n"
+ "\n"
+ "Example:\n"
+ " bcftools +tag2tag in.vcf -- -r --gp-to-gl\n"
+ "\n";
+}
+
+
+static void init_header(bcf_hdr_t *hdr, const char *ori, int ori_type, const char *new_hdr_line)
+{
+ if ( ori )
+ bcf_hdr_remove(hdr,ori_type,ori);
+
+ bcf_hdr_append(hdr, new_hdr_line);
+}
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+ static struct option loptions[] =
+ {
+ {"replace",no_argument,NULL,'r'},
+ {"gp-to-gl",no_argument,NULL,1},
+ {"gl-to-pl",no_argument,NULL,2},
+ {"gp-to-gt",no_argument,NULL,3},
+ {"pl-to-gl",no_argument,NULL,4},
+ {"threshold",required_argument,NULL,'t'},
+ {NULL,0,NULL,0}
+ };
+ int c;
+ char *src_tag = "GP";
+ while ((c = getopt_long(argc, argv, "?hrt:",loptions,NULL)) >= 0)
+ {
+ switch (c)
+ {
+ case 1 : src_tag = "GP"; mode = GP_TO_GL; break;
+ case 2 : src_tag = "GL"; mode = GL_TO_PL; break;
+ case 3 : src_tag = "GP"; mode = GP_TO_GT; break;
+ case 4 : src_tag = "PL"; mode = PL_TO_GL; break;
+ case 'r': drop_source_tag = 1; break;
+ case 't': thresh = atof(optarg); break;
+ case 'h':
+ case '?':
+ default: error("%s", usage()); break;
+ }
+ }
+ if ( !mode ) mode = GP_TO_GL;
+
+ in_hdr = in;
+ out_hdr = out;
+
+ if ( mode==GP_TO_GL )
+ init_header(out_hdr,drop_source_tag?"GP":NULL,BCF_HL_FMT,"##FORMAT=<ID=GL,Number=G,Type=Float,Description=\"Genotype Likelihoods\">");
+ else if ( mode==GL_TO_PL )
+ init_header(out_hdr,drop_source_tag?"GL":NULL,BCF_HL_FMT,"##FORMAT=<ID=PL,Number=G,Type=Integer,Description=\"Phred-scaled genotype likelihoods\">");
+ else if ( mode==PL_TO_GL )
+ init_header(out_hdr,drop_source_tag?"PL":NULL,BCF_HL_FMT,"##FORMAT=<ID=GL,Number=G,Type=Float,Description=\"Genotype likelihoods\">");
+ else if ( mode==GP_TO_GT ) {
+ if (thresh<0||thresh>1) error("--threshold must be in the range [0,1]: %f\n", thresh);
+ init_header(out_hdr,drop_source_tag?"GP":NULL,BCF_HL_FMT,"##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">");
+ }
+
+ int tag_id;
+ if ( (tag_id=bcf_hdr_id2int(in_hdr,BCF_DT_ID,src_tag))<0 || !bcf_hdr_idinfo_exists(in_hdr,BCF_HL_FMT,tag_id) )
+ error("The source tag does not exist: %s\n", src_tag);
+
+ return 0;
+}
+
+bcf1_t *process(bcf1_t *rec)
+{
+ int i, n;
+ if ( mode==GP_TO_GL )
+ {
+ n = bcf_get_format_float(in_hdr,rec,"GP",&farr,&mfarr);
+ if ( n<=0 ) return rec;
+ for (i=0; i<n; i++)
+ {
+ if ( bcf_float_is_missing(farr[i]) || bcf_float_is_vector_end(farr[i]) ) continue;
+ farr[i] = farr[i] ? log10(farr[i]) : -99;
+ }
+ bcf_update_format_float(out_hdr,rec,"GL",farr,n);
+ if ( drop_source_tag )
+ bcf_update_format_float(out_hdr,rec,"GP",NULL,0);
+ }
+ else if ( mode==PL_TO_GL )
+ {
+ n = bcf_get_format_int32(in_hdr,rec,"PL",&iarr,&miarr);
+ if ( n<=0 ) return rec;
+ hts_expand(float, n, mfarr, farr);
+ for (i=0; i<n; i++)
+ {
+ if ( iarr[i]==bcf_int32_missing )
+ bcf_float_set_missing(farr[i]);
+ else if ( iarr[i]==bcf_int32_vector_end )
+ bcf_float_set_vector_end(farr[i]);
+ else
+ farr[i] = -0.1 * iarr[i];
+ }
+ bcf_update_format_float(out_hdr,rec,"GL",farr,n);
+ if ( drop_source_tag )
+ bcf_update_format_int32(out_hdr,rec,"PL",NULL,0);
+ }
+ else if ( mode==GL_TO_PL )
+ {
+ n = bcf_get_format_float(in_hdr,rec,"GL",&farr,&mfarr);
+ if ( n<=0 ) return rec;
+ hts_expand(int32_t, n, miarr, iarr);
+ for (i=0; i<n; i++)
+ {
+ if ( bcf_float_is_missing(farr[i]) )
+ iarr[i] = bcf_int32_missing;
+ else if ( bcf_float_is_vector_end(farr[i]) )
+ iarr[i] = bcf_int32_vector_end;
+ else
+ iarr[i] = lroundf(-10 * farr[i]);
+ }
+ bcf_update_format_int32(out_hdr,rec,"PL",iarr,n);
+ if ( drop_source_tag )
+ bcf_update_format_float(out_hdr,rec,"GL",NULL,0);
+ }
+ else if ( mode==GP_TO_GT )
+ {
+ int nals = rec->n_allele;
+ int nsmpl = bcf_hdr_nsamples(in_hdr);
+ hts_expand(int32_t,nsmpl*2,miarr,iarr);
+
+ n = bcf_get_format_float(in_hdr,rec,"GP",&farr,&mfarr);
+ if ( n<=0 ) return rec;
+
+ n /= nsmpl;
+ for (i=0; i<nsmpl; i++)
+ {
+ float *ptr = farr + i*n;
+ if ( bcf_float_is_missing(ptr[0]) )
+ {
+ iarr[2*i] = iarr[2*i+1] = bcf_gt_missing;
+ continue;
+ }
+
+ int j, jmax = 0;
+ for (j=1; j<n; j++)
+ {
+ if ( bcf_float_is_missing(ptr[j]) || bcf_float_is_vector_end(ptr[j]) ) break;
+ if ( ptr[j] > ptr[jmax] ) jmax = j;
+ }
+
+ // haploid genotype
+ if ( j==nals )
+ {
+ iarr[2*i] = ptr[jmax] < 1-thresh ? bcf_gt_missing : bcf_gt_unphased(jmax);
+ iarr[2*i+1] = bcf_int32_vector_end;
+ continue;
+ }
+
+ if ( j!=nals*(nals+1)/2 )
+ error("Wrong number of GP values for diploid genotype at %s:%d, expected %d, found %d\n",
+ bcf_seqname(in_hdr,rec),rec->pos+1, nals*(nals+1)/2,j);
+
+ if (ptr[jmax] < 1-thresh)
+ {
+ iarr[2*i] = iarr[2*i+1] = bcf_gt_missing;
+ continue;
+ }
+
+ // most common case: RR
+ if ( jmax==0 )
+ {
+ iarr[2*i] = iarr[2*i+1] = bcf_gt_unphased(0);
+ continue;
+ }
+
+ int a,b;
+ bcf_gt2alleles(jmax,&a,&b);
+ iarr[2*i] = bcf_gt_unphased(a);
+ iarr[2*i+1] = bcf_gt_unphased(b);
+ }
+ bcf_update_genotypes(out_hdr,rec,iarr,nsmpl*2);
+ if ( drop_source_tag )
+ bcf_update_format_float(out_hdr,rec,"GP",NULL,0);
+ }
+ return rec;
+}
+
+void destroy(void)
+{
+ free(farr);
+ free(iarr);
+}
+
+
--- /dev/null
+#include "bcftools.pysam.h"
+
+/* plugins/tag2tag.c -- convert between similar tags
+
+ Copyright (C) 2014-2016 Genome Research Ltd.
+
+ Author: Petr Danecek <pd3@sanger.ac.uk>
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE. */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <getopt.h>
+#include <math.h>
+#include <htslib/hts.h>
+#include <htslib/vcf.h>
+#include "bcftools.h"
+
+
+#define GP_TO_GL 1
+#define GL_TO_PL 2
+#define GP_TO_GT 3
+#define PL_TO_GL 4
+
+static int mode = 0, drop_source_tag = 0;
+static bcf_hdr_t *in_hdr, *out_hdr;
+static float *farr = NULL, thresh = 0.1;
+static int32_t *iarr = NULL;
+static int mfarr = 0, miarr = 0;
+
+const char *about(void)
+{
+ return "Convert between similar tags, such as GL, PL and GP.\n";
+}
+
+const char *usage(void)
+{
+ return
+ "\n"
+ "About: Convert between similar tags, such as GL, PL and GP.\n"
+ "Usage: bcftools +tag2tag [General Options] -- [Plugin Options]\n"
+ "Options:\n"
+ " run \"bcftools plugin\" for a list of common options\n"
+ "\n"
+ "Plugin options:\n"
+ " --gp-to-gl convert FORMAT/GP to FORMAT/GL\n"
+ " --gp-to-gt convert FORMAT/GP to FORMAT/GT by taking argmax of GP\n"
+ " --gl-to-pl convert FORMAT/GL to FORMAT/PL\n"
+ " --pl-to-gl convert FORMAT/PL to FORMAT/GL\n"
+ " -r, --replace drop the source tag\n"
+ " -t, --threshold <float> threshold for GP to GT hard-call [0.1]\n"
+ "\n"
+ "Example:\n"
+ " bcftools +tag2tag in.vcf -- -r --gp-to-gl\n"
+ "\n";
+}
+
+
+static void init_header(bcf_hdr_t *hdr, const char *ori, int ori_type, const char *new_hdr_line)
+{
+ if ( ori )
+ bcf_hdr_remove(hdr,ori_type,ori);
+
+ bcf_hdr_append(hdr, new_hdr_line);
+}
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+ static struct option loptions[] =
+ {
+ {"replace",no_argument,NULL,'r'},
+ {"gp-to-gl",no_argument,NULL,1},
+ {"gl-to-pl",no_argument,NULL,2},
+ {"gp-to-gt",no_argument,NULL,3},
+ {"pl-to-gl",no_argument,NULL,4},
+ {"threshold",required_argument,NULL,'t'},
+ {NULL,0,NULL,0}
+ };
+ int c;
+ char *src_tag = "GP";
+ while ((c = getopt_long(argc, argv, "?hrt:",loptions,NULL)) >= 0)
+ {
+ switch (c)
+ {
+ case 1 : src_tag = "GP"; mode = GP_TO_GL; break;
+ case 2 : src_tag = "GL"; mode = GL_TO_PL; break;
+ case 3 : src_tag = "GP"; mode = GP_TO_GT; break;
+ case 4 : src_tag = "PL"; mode = PL_TO_GL; break;
+ case 'r': drop_source_tag = 1; break;
+ case 't': thresh = atof(optarg); break;
+ case 'h':
+ case '?':
+ default: error("%s", usage()); break;
+ }
+ }
+ if ( !mode ) mode = GP_TO_GL;
+
+ in_hdr = in;
+ out_hdr = out;
+
+ if ( mode==GP_TO_GL )
+ init_header(out_hdr,drop_source_tag?"GP":NULL,BCF_HL_FMT,"##FORMAT=<ID=GL,Number=G,Type=Float,Description=\"Genotype Likelihoods\">");
+ else if ( mode==GL_TO_PL )
+ init_header(out_hdr,drop_source_tag?"GL":NULL,BCF_HL_FMT,"##FORMAT=<ID=PL,Number=G,Type=Integer,Description=\"Phred-scaled genotype likelihoods\">");
+ else if ( mode==PL_TO_GL )
+ init_header(out_hdr,drop_source_tag?"PL":NULL,BCF_HL_FMT,"##FORMAT=<ID=GL,Number=G,Type=Float,Description=\"Genotype likelihoods\">");
+ else if ( mode==GP_TO_GT ) {
+ if (thresh<0||thresh>1) error("--threshold must be in the range [0,1]: %f\n", thresh);
+ init_header(out_hdr,drop_source_tag?"GP":NULL,BCF_HL_FMT,"##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">");
+ }
+
+ int tag_id;
+ if ( (tag_id=bcf_hdr_id2int(in_hdr,BCF_DT_ID,src_tag))<0 || !bcf_hdr_idinfo_exists(in_hdr,BCF_HL_FMT,tag_id) )
+ error("The source tag does not exist: %s\n", src_tag);
+
+ return 0;
+}
+
+bcf1_t *process(bcf1_t *rec)
+{
+ int i, n;
+ if ( mode==GP_TO_GL )
+ {
+ n = bcf_get_format_float(in_hdr,rec,"GP",&farr,&mfarr);
+ if ( n<=0 ) return rec;
+ for (i=0; i<n; i++)
+ {
+ if ( bcf_float_is_missing(farr[i]) || bcf_float_is_vector_end(farr[i]) ) continue;
+ // GL = log10(GP); a zero probability has no finite logarithm, so it is set to -99
+ farr[i] = farr[i] ? log10(farr[i]) : -99;
+ }
+ bcf_update_format_float(out_hdr,rec,"GL",farr,n);
+ if ( drop_source_tag )
+ bcf_update_format_float(out_hdr,rec,"GP",NULL,0);
+ }
+ else if ( mode==PL_TO_GL )
+ {
+ n = bcf_get_format_int32(in_hdr,rec,"PL",&iarr,&miarr);
+ if ( n<=0 ) return rec;
+ hts_expand(float, n, mfarr, farr);
+ for (i=0; i<n; i++)
+ {
+ if ( iarr[i]==bcf_int32_missing )
+ bcf_float_set_missing(farr[i]);
+ else if ( iarr[i]==bcf_int32_vector_end )
+ bcf_float_set_vector_end(farr[i]);
+ else
+ // PL is phred-scaled, PL = -10*GL, hence GL = -PL/10
+ farr[i] = -0.1 * iarr[i];
+ }
+ bcf_update_format_float(out_hdr,rec,"GL",farr,n);
+ if ( drop_source_tag )
+ bcf_update_format_int32(out_hdr,rec,"PL",NULL,0);
+ }
+ else if ( mode==GL_TO_PL )
+ {
+ n = bcf_get_format_float(in_hdr,rec,"GL",&farr,&mfarr);
+ if ( n<=0 ) return rec;
+ hts_expand(int32_t, n, miarr, iarr);
+ for (i=0; i<n; i++)
+ {
+ if ( bcf_float_is_missing(farr[i]) )
+ iarr[i] = bcf_int32_missing;
+ else if ( bcf_float_is_vector_end(farr[i]) )
+ iarr[i] = bcf_int32_vector_end;
+ else
+ // PL = -10*GL, rounded to the nearest integer
+ iarr[i] = lroundf(-10 * farr[i]);
+ }
+ bcf_update_format_int32(out_hdr,rec,"PL",iarr,n);
+ if ( drop_source_tag )
+ bcf_update_format_float(out_hdr,rec,"GL",NULL,0);
+ }
+ else if ( mode==GP_TO_GT )
+ {
+ int nals = rec->n_allele;
+ int nsmpl = bcf_hdr_nsamples(in_hdr);
+ hts_expand(int32_t,nsmpl*2,miarr,iarr);
+
+ n = bcf_get_format_float(in_hdr,rec,"GP",&farr,&mfarr);
+ if ( n<=0 ) return rec;
+
+ n /= nsmpl;
+ for (i=0; i<nsmpl; i++)
+ {
+ float *ptr = farr + i*n;
+ if ( bcf_float_is_missing(ptr[0]) )
+ {
+ iarr[2*i] = iarr[2*i+1] = bcf_gt_missing;
+ continue;
+ }
+
+ int j, jmax = 0;
+ for (j=1; j<n; j++)
+ {
+ if ( bcf_float_is_missing(ptr[j]) || bcf_float_is_vector_end(ptr[j]) ) break;
+ if ( ptr[j] > ptr[jmax] ) jmax = j;
+ }
+
+ // haploid genotype
+ if ( j==nals )
+ {
+ iarr[2*i] = ptr[jmax] < 1-thresh ? bcf_gt_missing : bcf_gt_unphased(jmax);
+ iarr[2*i+1] = bcf_int32_vector_end;
+ continue;
+ }
+
+ if ( j!=nals*(nals+1)/2 )
+ error("Wrong number of GP values for diploid genotype at %s:%d, expected %d, found %d\n",
+ bcf_seqname(in_hdr,rec),rec->pos+1, nals*(nals+1)/2,j);
+
+ // leave the genotype uncalled unless the most likely GP reaches 1-threshold
+ if (ptr[jmax] < 1-thresh)
+ {
+ iarr[2*i] = iarr[2*i+1] = bcf_gt_missing;
+ continue;
+ }
+
+ // most common case: RR
+ if ( jmax==0 )
+ {
+ iarr[2*i] = iarr[2*i+1] = bcf_gt_unphased(0);
+ continue;
+ }
+
+ // translate the index of the most likely genotype into its two allele indexes
+ int a,b;
+ bcf_gt2alleles(jmax,&a,&b);
+ iarr[2*i] = bcf_gt_unphased(a);
+ iarr[2*i+1] = bcf_gt_unphased(b);
+ }
+ bcf_update_genotypes(out_hdr,rec,iarr,nsmpl*2);
+ if ( drop_source_tag )
+ bcf_update_format_float(out_hdr,rec,"GP",NULL,0);
+ }
+ return rec;
+}
+
+void destroy(void)
+{
+ free(farr);
+ free(iarr);
+}
+
+
--- /dev/null
+/* The MIT License
+
+ Copyright (c) 2018 Genome Research Ltd.
+
+ Author: Petr Danecek <pd3@sanger.ac.uk>
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in
+ all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ THE SOFTWARE.
+
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <getopt.h>
+#include <unistd.h> // for isatty
+#include <htslib/hts.h>
+#include <htslib/vcf.h>
+#include <htslib/kstring.h>
+#include <htslib/kseq.h>
+#include <htslib/synced_bcf_reader.h>
+#include <htslib/vcfutils.h>
+#include "bcftools.h"
+#include "filter.h"
+
+
+// Logic of the filters: include or exclude sites which match the filters?
+#define FLT_INCLUDE 1
+#define FLT_EXCLUDE 2
+
+#define iCHILD 0
+#define iFATHER 1
+#define iMOTHER 2
+
+typedef struct
+{
+ int idx[3]; // VCF sample indexes, ordered as iCHILD, iFATHER, iMOTHER
+ int pass; // do all three pass the filters?
+}
+trio_t;
+
+typedef struct
+{
+ uint32_t
+ npass, // number of genotypes passing the filter
+ nnon_ref, // number of non-reference genotypes
+ nmendel_err, // number of mendelian errors
+ nnovel, // a singleton allele, but observed only in the child. Counted as mendel_err as well.
+ nsingleton, // het mother or father different from everyone else
+ ndoubleton, // het mother+child or father+child different from everyone else
+ nts, ntv; // number of transitions and transversions
+}
+trio_stats_t;
+
+typedef struct
+{
+ trio_stats_t *stats;
+ filter_t *filter;
+ char *expr;
+}
+flt_stats_t;
+
+typedef struct
+{
+ int argc, filter_logic, regions_is_file, targets_is_file;
+ int nflt_str;
+ char *filter_str, **flt_str;
+ char **argv, *ped_fname, *output_fname, *fname, *regions, *targets;
+ bcf_srs_t *sr;
+ bcf_hdr_t *hdr;
+ trio_t *trio;
+ int ntrio, mtrio;
+ flt_stats_t *filters;
+ int nfilters;
+ int32_t *gt_arr, *ac, *ac_trio;
+ int mgt_arr, mac, mac_trio;
+}
+args_t;
+
+args_t args;
+
+const char *about(void)
+{
+ return "Calculate transmission rate and other stats in trio children.\n";
+}
+
+static const char *usage_text(void)
+{
+ return
+ "\n"
+ "About: Calculate transmission rate in trio children. Use curly brackets to scan\n"
+ " a range of values simultaneously\n"
+ "Usage: bcftools +trio-stats [Plugin Options]\n"
+ "Plugin options:\n"
+ " -e, --exclude EXPR exclude sites and samples for which the expression is true\n"
+ " -i, --include EXPR include sites and samples for which the expression is true\n"
+ " -o, --output FILE output file name [stdout]\n"
+ " -p, --ped FILE PED file\n"
+ " -r, --regions REG restrict to comma-separated list of regions\n"
+ " -R, --regions-file FILE restrict to regions listed in a file\n"
+ " -t, --targets REG similar to -r but streams rather than index-jumps\n"
+ " -T, --targets-file FILE similar to -R but streams rather than index-jumps\n"
+ "\n"
+ "Example:\n"
+ " bcftools +trio-stats -p file.ped -i 'GQ>{10,20,30,40,50}' file.bcf\n"
+ "\n";
+}
+
+static int cmp_trios(const void *_a, const void *_b)
+{
+ trio_t *a = (trio_t *) _a;
+ trio_t *b = (trio_t *) _b;
+ int i;
+ int amin = a->idx[0];
+ for (i=1; i<3; i++)
+ if ( amin > a->idx[i] ) amin = a->idx[i];
+ int bmin = b->idx[0];
+ for (i=1; i<3; i++)
+ if ( bmin > b->idx[i] ) bmin = b->idx[i];
+ if ( amin < bmin ) return -1;
+ if ( amin > bmin ) return 1;
+ return 0;
+}
+
+static void parse_ped(args_t *args, char *fname)
+{
+ htsFile *fp = hts_open(fname, "r");
+ if ( !fp ) error("Could not read: %s\n", fname);
+
+ kstring_t str = {0,0,0};
+ if ( hts_getline(fp, KS_SEP_LINE, &str) <= 0 ) error("Empty file: %s\n", fname);
+
+ int moff = 0, *off = NULL;
+ do
+ {
+ // familyID sampleID paternalID maternalID sex phenotype population relationship siblings secondOrder thirdOrder children comment
+ // BB03 HG01884 HG01885 HG01956 2 0 ACB child 0 0 0 0
+ int ncols = ksplit_core(str.s,0,&moff,&off);
+ if ( ncols<4 ) error("Could not parse the ped file: %s\n", str.s);
+
+ int father = bcf_hdr_id2int(args->hdr,BCF_DT_SAMPLE,&str.s[off[2]]);
+ if ( father<0 ) continue;
+ int mother = bcf_hdr_id2int(args->hdr,BCF_DT_SAMPLE,&str.s[off[3]]);
+ if ( mother<0 ) continue;
+ int child = bcf_hdr_id2int(args->hdr,BCF_DT_SAMPLE,&str.s[off[1]]);
+ if ( child<0 ) continue;
+
+ args->ntrio++;
+ hts_expand0(trio_t,args->ntrio,args->mtrio,args->trio);
+ trio_t *trio = &args->trio[args->ntrio-1];
+ trio->idx[iFATHER] = father;
+ trio->idx[iMOTHER] = mother;
+ trio->idx[iCHILD] = child;
+ }
+ while ( hts_getline(fp, KS_SEP_LINE, &str)>=0 );
+
+ fprintf(stderr,"Identified %d complete trios in the VCF file\n", args->ntrio);
+
+ // sort the trios by their lowest sample index so that samples are accessed more or less sequentially
+ qsort(args->trio,args->ntrio,sizeof(trio_t),cmp_trios);
+
+ free(str.s);
+ free(off);
+ hts_close(fp);
+}
+
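+// Expand curly-bracket lists in the filter expression into separate expressions,
+// e.g. 'GQ>{10,20}' becomes 'GQ>10' and 'GQ>20'. Several bracket groups are
+// expanded into all combinations.
+static void parse_filters(args_t *args)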
+{
+ if ( !args->filter_str ) return;
+ int mflt = 1;
+ args->nflt_str = 1;
+ args->flt_str = (char**) malloc(sizeof(char*));
+ args->flt_str[0] = strdup(args->filter_str);
+ while (1)
+ {
+ int i, expanded = 0;
+ for (i=args->nflt_str-1; i>=0; i--)
+ {
+ char *exp_beg = strchr(args->flt_str[i], '{');
+ if ( !exp_beg ) continue;
+ char *exp_end = strchr(exp_beg+1, '}');
+ if ( !exp_end ) error("Could not parse the expression: %s\n", args->filter_str);
+ char *beg = exp_beg+1, *mid = beg;
+ while ( mid<exp_end )
+ {
+ while ( mid<exp_end && *mid!=',' ) mid++;
+ kstring_t tmp = {0,0,0};
+ kputsn(args->flt_str[i], exp_beg - args->flt_str[i], &tmp);
+ kputsn(beg, mid - beg, &tmp);
+ kputs(exp_end+1, &tmp);
+ args->nflt_str++;
+ hts_expand(char*, args->nflt_str, mflt, args->flt_str);
+ args->flt_str[args->nflt_str-1] = tmp.s;
+ beg = ++mid;
+ }
+ expanded = 1;
+ free(args->flt_str[i]);
+ memmove(&args->flt_str[i], &args->flt_str[i+1], (args->nflt_str-i-1)*sizeof(*args->flt_str));
+ args->nflt_str--;
+ args->flt_str[args->nflt_str] = NULL;
+ }
+ if ( !expanded ) break;
+ }
+
+ fprintf(stderr,"Collecting data for %d filtering expressions\n", args->nflt_str);
+}
+
+static void init_data(args_t *args)
+{
+ args->sr = bcf_sr_init();
+ if ( args->regions )
+ {
+ args->sr->require_index = 1;
+ if ( bcf_sr_set_regions(args->sr, args->regions, args->regions_is_file)<0 ) error("Failed to read the regions: %s\n",args->regions);
+ }
+ if ( args->targets && bcf_sr_set_targets(args->sr, args->targets, args->targets_is_file, 0)<0 ) error("Failed to read the targets: %s\n",args->targets);
+ if ( !bcf_sr_add_reader(args->sr,args->fname) ) error("Error: %s\n", bcf_sr_strerror(args->sr->errnum));
+ args->hdr = bcf_sr_get_header(args->sr,0);
+
+ parse_ped(args, args->ped_fname);
+ parse_filters(args);
+
+ int i;
+ if ( !args->nflt_str )
+ {
+ args->filters = (flt_stats_t*) calloc(1, sizeof(flt_stats_t));
+ args->nfilters = 1;
+ args->filters[0].expr = strdup("all");
+ }
+ else
+ {
+ args->nfilters = args->nflt_str;
+ args->filters = (flt_stats_t*) calloc(args->nfilters, sizeof(flt_stats_t));
+ for (i=0; i<args->nfilters; i++)
+ {
+ args->filters[i].filter = filter_init(args->hdr, args->flt_str[i]);
+ args->filters[i].expr = strdup(args->flt_str[i]);
+
+ // replace tabs with spaces so that the output stays parsable
+ char *tmp = args->filters[i].expr;
+ while ( *tmp )
+ {
+ if ( *tmp=='\t' ) *tmp = ' ';
+ tmp++;
+ }
+ }
+ }
+ for (i=0; i<args->nfilters; i++)
+ args->filters[i].stats = (trio_stats_t*) calloc(args->ntrio,sizeof(trio_stats_t));
+}
+static void destroy_data(args_t *args)
+{
+ int i;
+ for (i=0; i<args->nfilters; i++)
+ {
+ if ( args->filters[i].filter ) filter_destroy(args->filters[i].filter);
+ free(args->filters[i].stats);
+ free(args->filters[i].expr);
+ }
+ free(args->filters);
+ for (i=0; i<args->nflt_str; i++) free(args->flt_str[i]);
+ free(args->flt_str);
+ bcf_sr_destroy(args->sr);
+ free(args->trio);
+ free(args->ac);
+ free(args->ac_trio);
+ free(args->gt_arr);
+ free(args);
+}
+static void report_stats(args_t *args)
+{
+ int i = 0,j;
+ FILE *fh = !args->output_fname || !strcmp("-",args->output_fname) ? stdout : fopen(args->output_fname,"w");
+ if ( !fh ) error("Could not open the file for writing: %s\n", args->output_fname);
+ fprintf(fh,"# CMD line shows the command line used to generate this output\n");
+ fprintf(fh,"# DEF lines define expressions for all tested thresholds\n");
+ fprintf(fh,"# FLT* lines report numbers for every threshold and every trio:\n");
+ fprintf(fh,"# %d) filter id\n", ++i);
+ fprintf(fh,"# %d) child\n", ++i);
+ fprintf(fh,"# %d) father\n", ++i);
+ fprintf(fh,"# %d) mother\n", ++i);
+ fprintf(fh,"# %d) number of valid trio genotypes (all trio members pass filters, all non-missing)\n", ++i);
+ fprintf(fh,"# %d) number of non-reference trio GTs (at least one trio member carries an alternate allele)\n", ++i);
+ fprintf(fh,"# %d) number of Mendelian errors\n", ++i);
+ fprintf(fh,"# %d) number of novel singleton alleles in the child (counted also as a Mendelian error)\n", ++i);
+ fprintf(fh,"# %d) number of untransmitted singletons, present only in one parent\n", ++i);
+ fprintf(fh,"# %d) number of transmitted singletons, present only in one parent and the child\n", ++i);
+ fprintf(fh,"# %d) number of transitions, all ALT alleles present in the trio are considered\n", ++i);
+ fprintf(fh,"# %d) number of transversions, all ALT alleles present in the trio are considered\n", ++i);
+ fprintf(fh,"# %d) overall ts/tv, all ALT alleles present in the trio are considered\n", ++i);
+ fprintf(fh, "CMD\t%s", args->argv[0]);
+ for (i=1; i<args->argc; i++) fprintf(fh, " %s",args->argv[i]);
+ fprintf(fh, "\n");
+ for (i=0; i<args->nfilters; i++)
+ {
+ flt_stats_t *flt = &args->filters[i];
+ fprintf(fh,"DEF\tFLT%d\t%s\n", i, flt->expr);
+ }
+ for (i=0; i<args->nfilters; i++)
+ {
+ flt_stats_t *flt = &args->filters[i];
+ for (j=0; j<args->ntrio; j++)
+ {
+ fprintf(fh,"FLT%d", i);
+ fprintf(fh,"\t%s",args->hdr->samples[args->trio[j].idx[iCHILD]]);
+ fprintf(fh,"\t%s",args->hdr->samples[args->trio[j].idx[iFATHER]]);
+ fprintf(fh,"\t%s",args->hdr->samples[args->trio[j].idx[iMOTHER]]);
+ trio_stats_t *stats = &flt->stats[j];
+ fprintf(fh,"\t%u", stats->npass);
+ fprintf(fh,"\t%u", stats->nnon_ref);
+ fprintf(fh,"\t%u", stats->nmendel_err);
+ fprintf(fh,"\t%u", stats->nnovel);
+ fprintf(fh,"\t%u", stats->nsingleton);
+ fprintf(fh,"\t%u", stats->ndoubleton);
+ fprintf(fh,"\t%u", stats->nts);
+ fprintf(fh,"\t%u", stats->ntv);
+ fprintf(fh,"\t%.2f", stats->ntv ? (float)stats->nts/stats->ntv : INFINITY);
+ fprintf(fh,"\n");
+ }
+ }
+ if ( fclose(fh)!=0 ) error("Close failed: %s\n", (!args->output_fname || !strcmp("-",args->output_fname)) ? "stdout" : args->output_fname);
+}
+
+static inline int parse_genotype(int32_t *arr, int ngt1, int idx, int als[2])
+{
+ int32_t *ptr = arr + ngt1 * idx;
+ if ( bcf_gt_is_missing(ptr[0]) ) return -1;
+ als[0] = bcf_gt_allele(ptr[0]);
+
+ // treat haploid GTs as homozygous diploid
+ if ( ngt1==1 || ptr[1]==bcf_int32_vector_end ) { als[1] = als[0]; return 0; }
+
+ if ( bcf_gt_is_missing(ptr[1]) ) return -1;
+ als[1] = bcf_gt_allele(ptr[1]);
+
+ return 0;
+}
+
+static void process_record(args_t *args, bcf1_t *rec, flt_stats_t *flt)
+{
+ int i,j;
+
+ // Find out which trios pass and if the site passes
+ if ( flt->filter )
+ {
+ uint8_t *smpl_pass = NULL;
+ int pass_site = filter_test(flt->filter, rec, (const uint8_t**) &smpl_pass);
+ if ( args->filter_logic & FLT_EXCLUDE )
+ {
+ if ( pass_site )
+ {
+ if ( !smpl_pass ) return;
+ pass_site = 0;
+ for (i=0; i<args->ntrio; i++)
+ {
+ int pass_trio = 1;
+ for (j=0; j<3; j++)
+ {
+ int idx = args->trio[i].idx[j];
+ if ( smpl_pass[idx] ) { pass_trio = 0; break; }
+ }
+ args->trio[i].pass = pass_trio;
+ if ( pass_trio ) pass_site = 1;
+ }
+ if ( !pass_site ) return;
+ }
+ else
+ for (i=0; i<args->ntrio; i++) args->trio[i].pass = 1;
+ }
+ else if ( !pass_site ) return;
+ else if ( smpl_pass )
+ {
+ pass_site = 0;
+ for (i=0; i<args->ntrio; i++)
+ {
+ int pass_trio = 1;
+ for (j=0; j<3; j++)
+ {
+ int idx = args->trio[i].idx[j];
+ if ( !smpl_pass[idx] ) { pass_trio = 0; break; }
+ }
+ args->trio[i].pass = pass_trio;
+ if ( pass_trio ) pass_site = 1;
+ }
+ if ( !pass_site ) return;
+ }
+ else
+ for (i=0; i<args->ntrio; i++) args->trio[i].pass = 1;
+ }
+
+ // Find out the allele counts. Try to use INFO/AC, if not present, determine from the genotypes
+ hts_expand(int, rec->n_allele, args->mac, args->ac);
+ if ( !bcf_calc_ac(args->hdr, rec, args->ac, BCF_UN_INFO|BCF_UN_FMT) ) return;
+ hts_expand(int, rec->n_allele, args->mac_trio, args->ac_trio);
+
+ // Get the genotypes
+ int ngt = bcf_get_genotypes(args->hdr, rec, &args->gt_arr, &args->mgt_arr);
+ if ( ngt<0 ) return;
+ int ngt1 = ngt / rec->n_sample;
+
+
+ // For ts/tv: numeric code of the reference allele, -1 if REF is not a single A, C, G or T base
+ int ref = !rec->d.allele[0][1] ? bcf_acgt2int(*rec->d.allele[0]) : -1;
+
+ int star_allele = -1;
+ for (i=1; i<rec->n_allele; i++)
+ if ( !rec->d.allele[i][1] && rec->d.allele[i][0]=='*' ) { star_allele = i; break; }
+
+ // Run the stats
+ for (i=0; i<args->ntrio; i++)
+ {
+ if ( flt->filter && !args->trio[i].pass ) continue;
+ trio_stats_t *stats = &flt->stats[i];
+
+ // Determine the alternate allele and the genotypes, skip if any of the alleles is missing.
+ // the order is: child, father, mother
+ int als[6], *als_child = als, *als_father = als+2, *als_mother = als+4;
+ if ( parse_genotype(args->gt_arr, ngt1, args->trio[i].idx[iCHILD], als_child) < 0 ) continue;
+ if ( parse_genotype(args->gt_arr, ngt1, args->trio[i].idx[iFATHER], als_father) < 0 ) continue;
+ if ( parse_genotype(args->gt_arr, ngt1, args->trio[i].idx[iMOTHER], als_mother) < 0 ) continue;
+
+ stats->npass++;
+
+ // Does the trio carry an alternate allele other than *?
+ int has_star_allele = 0, has_nonref = 0;
+ memset(args->ac_trio,0,rec->n_allele*sizeof(*args->ac_trio));
+ for (j=0; j<6; j++)
+ {
+ if ( als[j]==star_allele ) { has_star_allele = 1; continue; }
+ if ( als[j]==0 ) continue;
+ has_nonref = 1;
+ args->ac_trio[ als[j] ]++;
+ }
+ if ( !has_nonref ) continue; // only ref or * in this trio
+
+ stats->nnon_ref++;
+
+ // Calculate ts/tv. Each trio adds at most one transition and one transversion per site, which also handles HetAA genotypes
+ if ( ref != -1 )
+ {
+ int has_ts = 0, has_tv = 0;
+ for (j=0; j<6; j++)
+ {
+ if ( als[j]==0 || als[j]==star_allele ) continue;
+ if ( als[j] >= rec->n_allele )
+ error("The GT index is out of range at %s:%d in %s\n", bcf_seqname(args->hdr,rec),rec->pos+1,args->hdr->samples[args->trio[i].idx[j/2]]);
+ if ( rec->d.allele[als[j]][1] ) continue;
+
+ int alt = bcf_acgt2int(rec->d.allele[als[j]][0]);
+ if ( abs(ref-alt)==2 ) has_ts = 1;
+ else has_tv = 1;
+ }
+ if ( has_ts ) stats->nts++;
+ if ( has_tv ) stats->ntv++;
+ }
+
+ // Skip the remaining stats if the star allele is present: it was already checked at the primary record and we do not want to count the same
+ // thing multiple times. There can be other alternate alleles, but we ignore that for simplicity.
+ if ( has_star_allele ) continue;
+
+ // Detect mendelian errors
+ int mendel_ok = (als_child[0]==als_father[0] || als_child[0]==als_father[1]) && (als_child[1]==als_mother[0] || als_child[1]==als_mother[1]) ? 1 : 0;
+ if ( !mendel_ok ) mendel_ok = (als_child[1]==als_father[0] || als_child[1]==als_father[1]) && (als_child[0]==als_mother[0] || als_child[0]==als_mother[1]) ? 1 : 0;
+ if ( !mendel_ok ) stats->nmendel_err++;
+
+ // Is this a singleton, doubleton, neither?
+ for (j=1; j<rec->n_allele; j++)
+ {
+ if ( args->ac_trio[j]==1 && args->ac[j]==1 ) // singleton (in parent) or novel (in child)
+ {
+ if ( als_child[0]==j || als_child[1]==j ) stats->nnovel++;
+ else stats->nsingleton++;
+ }
+ else if ( args->ac_trio[j]==2 && args->ac[j]==2 ) // possibly a doubleton
+ {
+ if ( (als_child[0]==j || als_child[1]==j) && (als_child[0]!=j || als_child[1]!=j) ) stats->ndoubleton++;
+ }
+ }
+ }
+}
+
+int run(int argc, char **argv)
+{
+ args_t *args = (args_t*) calloc(1,sizeof(args_t));
+ args->argc = argc; args->argv = argv;
+ args->output_fname = "-";
+ static struct option loptions[] =
+ {
+ {"include",required_argument,0,'i'},
+ {"exclude",required_argument,0,'e'},
+ {"output",required_argument,NULL,'o'},
+ {"ped",required_argument,NULL,'p'},
+ {"regions",1,0,'r'},
+ {"regions-file",1,0,'R'},
+ {"targets",1,0,'t'},
+ {"targets-file",1,0,'T'},
+ {NULL,0,NULL,0}
+ };
+ int c, i;
+ while ((c = getopt_long(argc, argv, "?hp:o:i:e:r:R:t:T:",loptions,NULL)) >= 0)
+ {
+ switch (c)
+ {
+ case 'e': args->filter_str = optarg; args->filter_logic |= FLT_EXCLUDE; break;
+ case 'i': args->filter_str = optarg; args->filter_logic |= FLT_INCLUDE; break;
+ case 't': args->targets = optarg; break;
+ case 'T': args->targets = optarg; args->targets_is_file = 1; break;
+ case 'r': args->regions = optarg; break;
+ case 'R': args->regions = optarg; args->regions_is_file = 1; break;
+ case 'o': args->output_fname = optarg; break;
+ case 'p': args->ped_fname = optarg; break;
+ case 'h':
+ case '?':
+ default: error("%s", usage_text()); break;
+ }
+ }
+ if ( optind==argc )
+ {
+ if ( !isatty(fileno((FILE *)stdin)) ) args->fname = "-"; // reading from stdin
+ else { error("%s", usage_text()); }
+ }
+ else if ( optind+1!=argc ) error("%s", usage_text());
+ else args->fname = argv[optind];
+
+ if ( !args->ped_fname ) error("Missing the -p, --ped option\n");
+
+ init_data(args);
+
+ while ( bcf_sr_next_line(args->sr) )
+ {
+ bcf1_t *rec = bcf_sr_get_line(args->sr,0);
+ for (i=0; i<args->nfilters; i++)
+ process_record(args, rec, &args->filters[i]);
+ }
+
+ report_stats(args);
+ destroy_data(args);
+
+ return 0;
+}
--- /dev/null
+#include "bcftools.pysam.h"
+
+/* The MIT License
+
+ Copyright (c) 2018 Genome Research Ltd.
+
+ Author: Petr Danecek <pd3@sanger.ac.uk>
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in
+ all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ THE SOFTWARE.
+
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <getopt.h>
+#include <unistd.h> // for isatty
+#include <htslib/hts.h>
+#include <htslib/vcf.h>
+#include <htslib/kstring.h>
+#include <htslib/kseq.h>
+#include <htslib/synced_bcf_reader.h>
+#include <htslib/vcfutils.h>
+#include "bcftools.h"
+#include "filter.h"
+
+
+// Logic of the filters: include or exclude sites which match the filters?
+#define FLT_INCLUDE 1
+#define FLT_EXCLUDE 2
+
+#define iCHILD 0
+#define iFATHER 1
+#define iMOTHER 2
+
+typedef struct
+{
+ int idx[3]; // VCF sample indexes, ordered as iCHILD, iFATHER, iMOTHER
+ int pass; // do all three pass the filters?
+}
+trio_t;
+
+typedef struct
+{
+ uint32_t
+ npass, // number of genotypes passing the filter
+ nnon_ref, // number of non-reference genotypes
+ nmendel_err, // number of mendelian errors
+ nnovel, // a singleton allele, but observed only in the child. Counted as mendel_err as well.
+ nsingleton, // het mother or father different from everyone else
+ ndoubleton, // het mother+child or father+child different from everyone else
+ nts, ntv; // number of transitions and transversions
+}
+trio_stats_t;
+
+typedef struct
+{
+ trio_stats_t *stats;
+ filter_t *filter;
+ char *expr;
+}
+flt_stats_t;
+
+typedef struct
+{
+ int argc, filter_logic, regions_is_file, targets_is_file;
+ int nflt_str;
+ char *filter_str, **flt_str;
+ char **argv, *ped_fname, *output_fname, *fname, *regions, *targets;
+ bcf_srs_t *sr;
+ bcf_hdr_t *hdr;
+ trio_t *trio;
+ int ntrio, mtrio;
+ flt_stats_t *filters;
+ int nfilters;
+ int32_t *gt_arr, *ac, *ac_trio;
+ int mgt_arr, mac, mac_trio;
+}
+args_t;
+
+args_t args;
+
+const char *about(void)
+{
+ return "Calculate transmission rate and other stats in trio children.\n";
+}
+
+static const char *usage_text(void)
+{
+ return
+ "\n"
+ "About: Calculate transmission rate in trio children. Use curly brackets to scan\n"
+ " a range of values simultaneously\n"
+ "Usage: bcftools +trio-stats [Plugin Options]\n"
+ "Plugin options:\n"
+ " -e, --exclude EXPR exclude sites and samples for which the expression is true\n"
+ " -i, --include EXPR include sites and samples for which the expression is true\n"
+ " -o, --output FILE output file name [bcftools_stdout]\n"
+ " -p, --ped FILE PED file\n"
+ " -r, --regions REG restrict to comma-separated list of regions\n"
+ " -R, --regions-file FILE restrict to regions listed in a file\n"
+ " -t, --targets REG similar to -r but streams rather than index-jumps\n"
+ " -T, --targets-file FILE similar to -R but streams rather than index-jumps\n"
+ "\n"
+ "Example:\n"
+ " bcftools +trio-stats -p file.ped -i 'GQ>{10,20,30,40,50}' file.bcf\n"
+ "\n";
+}
+
+static int cmp_trios(const void *_a, const void *_b)
+{
+ trio_t *a = (trio_t *) _a;
+ trio_t *b = (trio_t *) _b;
+ int i;
+ int amin = a->idx[0];
+ for (i=1; i<3; i++)
+ if ( amin > a->idx[i] ) amin = a->idx[i];
+ int bmin = b->idx[0];
+ for (i=1; i<3; i++)
+ if ( bmin > b->idx[i] ) bmin = b->idx[i];
+ if ( amin < bmin ) return -1;
+ if ( amin > bmin ) return 1;
+ return 0;
+}
+
+static void parse_ped(args_t *args, char *fname)
+{
+ htsFile *fp = hts_open(fname, "r");
+ if ( !fp ) error("Could not read: %s\n", fname);
+
+ kstring_t str = {0,0,0};
+ if ( hts_getline(fp, KS_SEP_LINE, &str) <= 0 ) error("Empty file: %s\n", fname);
+
+ int moff = 0, *off = NULL;
+ do
+ {
+ // familyID sampleID paternalID maternalID sex phenotype population relationship siblings secondOrder thirdOrder children comment
+ // BB03 HG01884 HG01885 HG01956 2 0 ACB child 0 0 0 0
+ int ncols = ksplit_core(str.s,0,&moff,&off);
+ if ( ncols<4 ) error("Could not parse the ped file: %s\n", str.s);
+
+ int father = bcf_hdr_id2int(args->hdr,BCF_DT_SAMPLE,&str.s[off[2]]);
+ if ( father<0 ) continue;
+ int mother = bcf_hdr_id2int(args->hdr,BCF_DT_SAMPLE,&str.s[off[3]]);
+ if ( mother<0 ) continue;
+ int child = bcf_hdr_id2int(args->hdr,BCF_DT_SAMPLE,&str.s[off[1]]);
+ if ( child<0 ) continue;
+
+ args->ntrio++;
+ hts_expand0(trio_t,args->ntrio,args->mtrio,args->trio);
+ trio_t *trio = &args->trio[args->ntrio-1];
+ trio->idx[iFATHER] = father;
+ trio->idx[iMOTHER] = mother;
+ trio->idx[iCHILD] = child;
+ }
+ while ( hts_getline(fp, KS_SEP_LINE, &str)>=0 );
+
+ fprintf(bcftools_stderr,"Identified %d complete trios in the VCF file\n", args->ntrio);
+
+ // sort the trios by their lowest sample index so that samples are accessed more or less sequentially
+ qsort(args->trio,args->ntrio,sizeof(trio_t),cmp_trios);
+
+ free(str.s);
+ free(off);
+ hts_close(fp);
+}
+
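+// Expand curly-bracket lists in the filter expression into separate expressions,
+// e.g. 'GQ>{10,20}' becomes 'GQ>10' and 'GQ>20'. Several bracket groups are
+// expanded into all combinations.
+static void parse_filters(args_t *args)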
+{
+ if ( !args->filter_str ) return;
+ int mflt = 1;
+ args->nflt_str = 1;
+ args->flt_str = (char**) malloc(sizeof(char*));
+ args->flt_str[0] = strdup(args->filter_str);
+ while (1)
+ {
+ int i, expanded = 0;
+ for (i=args->nflt_str-1; i>=0; i--)
+ {
+ char *exp_beg = strchr(args->flt_str[i], '{');
+ if ( !exp_beg ) continue;
+ char *exp_end = strchr(exp_beg+1, '}');
+ if ( !exp_end ) error("Could not parse the expression: %s\n", args->filter_str);
+ char *beg = exp_beg+1, *mid = beg;
+ while ( mid<exp_end )
+ {
+ while ( mid<exp_end && *mid!=',' ) mid++;
+ kstring_t tmp = {0,0,0};
+ kputsn(args->flt_str[i], exp_beg - args->flt_str[i], &tmp);
+ kputsn(beg, mid - beg, &tmp);
+ kputs(exp_end+1, &tmp);
+ args->nflt_str++;
+ hts_expand(char*, args->nflt_str, mflt, args->flt_str);
+ args->flt_str[args->nflt_str-1] = tmp.s;
+ beg = ++mid;
+ }
+ expanded = 1;
+ free(args->flt_str[i]);
+ memmove(&args->flt_str[i], &args->flt_str[i+1], (args->nflt_str-i-1)*sizeof(*args->flt_str));
+ args->nflt_str--;
+ args->flt_str[args->nflt_str] = NULL;
+ }
+ if ( !expanded ) break;
+ }
+
+ fprintf(bcftools_stderr,"Collecting data for %d filtering expressions\n", args->nflt_str);
+}
+
+static void init_data(args_t *args)
+{
+ args->sr = bcf_sr_init();
+ if ( args->regions )
+ {
+ args->sr->require_index = 1;
+ if ( bcf_sr_set_regions(args->sr, args->regions, args->regions_is_file)<0 ) error("Failed to read the regions: %s\n",args->regions);
+ }
+ if ( args->targets && bcf_sr_set_targets(args->sr, args->targets, args->targets_is_file, 0)<0 ) error("Failed to read the targets: %s\n",args->targets);
+ if ( !bcf_sr_add_reader(args->sr,args->fname) ) error("Error: %s\n", bcf_sr_strerror(args->sr->errnum));
+ args->hdr = bcf_sr_get_header(args->sr,0);
+
+ parse_ped(args, args->ped_fname);
+ parse_filters(args);
+
+ int i;
+ if ( !args->nflt_str )
+ {
+ args->filters = (flt_stats_t*) calloc(1, sizeof(flt_stats_t));
+ args->nfilters = 1;
+ args->filters[0].expr = strdup("all");
+ }
+ else
+ {
+ args->nfilters = args->nflt_str;
+ args->filters = (flt_stats_t*) calloc(args->nfilters, sizeof(flt_stats_t));
+ for (i=0; i<args->nfilters; i++)
+ {
+ args->filters[i].filter = filter_init(args->hdr, args->flt_str[i]);
+ args->filters[i].expr = strdup(args->flt_str[i]);
+
+ // replace tabs with spaces so that the output stays parsable
+ char *tmp = args->filters[i].expr;
+ while ( *tmp )
+ {
+ if ( *tmp=='\t' ) *tmp = ' ';
+ tmp++;
+ }
+ }
+ }
+ for (i=0; i<args->nfilters; i++)
+ args->filters[i].stats = (trio_stats_t*) calloc(args->ntrio,sizeof(trio_stats_t));
+}
+static void destroy_data(args_t *args)
+{
+ int i;
+ for (i=0; i<args->nfilters; i++)
+ {
+ if ( args->filters[i].filter ) filter_destroy(args->filters[i].filter);
+ free(args->filters[i].stats);
+ free(args->filters[i].expr);
+ }
+ free(args->filters);
+ for (i=0; i<args->nflt_str; i++) free(args->flt_str[i]);
+ free(args->flt_str);
+ bcf_sr_destroy(args->sr);
+ free(args->trio);
+ free(args->ac);
+ free(args->ac_trio);
+ free(args->gt_arr);
+ free(args);
+}
+static void report_stats(args_t *args)
+{
+ int i = 0,j;
+ FILE *fh = !args->output_fname || !strcmp("-",args->output_fname) ? bcftools_stdout : fopen(args->output_fname,"w");
+ if ( !fh ) error("Could not open the file for writing: %s\n", args->output_fname);
+ fprintf(fh,"# CMD line shows the command line used to generate this output\n");
+ fprintf(fh,"# DEF lines define expressions for all tested thresholds\n");
+ fprintf(fh,"# FLT* lines report numbers for every threshold and every trio:\n");
+ fprintf(fh,"# %d) filter id\n", ++i);
+ fprintf(fh,"# %d) child\n", ++i);
+ fprintf(fh,"# %d) father\n", ++i);
+ fprintf(fh,"# %d) mother\n", ++i);
+ fprintf(fh,"# %d) number of valid trio genotypes (all trio members pass filters, all non-missing)\n", ++i);
+ fprintf(fh,"# %d) number of non-reference trio GTs (at least one trio member carries an alternate allele)\n", ++i);
+ fprintf(fh,"# %d) number of Mendelian errors\n", ++i);
+ fprintf(fh,"# %d) number of novel singleton alleles in the child (counted also as a Mendelian error)\n", ++i);
+ fprintf(fh,"# %d) number of untransmitted singletons, present only in one parent\n", ++i);
+ fprintf(fh,"# %d) number of transmitted singletons, present only in one parent and the child\n", ++i);
+ fprintf(fh,"# %d) number of transitions, all ALT alleles present in the trio are considered\n", ++i);
+ fprintf(fh,"# %d) number of transversions, all ALT alleles present in the trio are considered\n", ++i);
+ fprintf(fh,"# %d) overall ts/tv, all ALT alleles present in the trio are considered\n", ++i);
+ fprintf(fh, "CMD\t%s", args->argv[0]);
+ for (i=1; i<args->argc; i++) fprintf(fh, " %s",args->argv[i]);
+ fprintf(fh, "\n");
+ for (i=0; i<args->nfilters; i++)
+ {
+ flt_stats_t *flt = &args->filters[i];
+ fprintf(fh,"DEF\tFLT%d\t%s\n", i, flt->expr);
+ }
+ for (i=0; i<args->nfilters; i++)
+ {
+ flt_stats_t *flt = &args->filters[i];
+ for (j=0; j<args->ntrio; j++)
+ {
+ fprintf(fh,"FLT%d", i);
+ fprintf(fh,"\t%s",args->hdr->samples[args->trio[j].idx[iCHILD]]);
+ fprintf(fh,"\t%s",args->hdr->samples[args->trio[j].idx[iFATHER]]);
+ fprintf(fh,"\t%s",args->hdr->samples[args->trio[j].idx[iMOTHER]]);
+ trio_stats_t *stats = &flt->stats[j];
+ fprintf(fh,"\t%d", stats->npass);
+ fprintf(fh,"\t%d", stats->nnon_ref);
+ fprintf(fh,"\t%d", stats->nmendel_err);
+ fprintf(fh,"\t%d", stats->nnovel);
+ fprintf(fh,"\t%d", stats->nsingleton);
+ fprintf(fh,"\t%d", stats->ndoubleton);
+ fprintf(fh,"\t%d", stats->nts);
+ fprintf(fh,"\t%d", stats->ntv);
+ fprintf(fh,"\t%.2f", stats->ntv ? (float)stats->nts/stats->ntv : INFINITY);
+ fprintf(fh,"\n");
+ }
+ }
+ if ( fclose(fh)!=0 ) error("Close failed: %s\n", (!args->output_fname || !strcmp("-",args->output_fname)) ? "bcftools_stdout" : args->output_fname);
+}
+
+static inline int parse_genotype(int32_t *arr, int ngt1, int idx, int als[2])
+{
+ int32_t *ptr = arr + ngt1 * idx;
+ if ( bcf_gt_is_missing(ptr[0]) ) return -1;
+ als[0] = bcf_gt_allele(ptr[0]);
+
+ // treat haploid GTs as homozygous diploid
+ if ( ngt1==1 || ptr[1]==bcf_int32_vector_end ) { als[1] = als[0]; return 0; }
+
+ if ( bcf_gt_is_missing(ptr[1]) ) return -1;
+ als[1] = bcf_gt_allele(ptr[1]);
+
+ return 0;
+}
+
+static void process_record(args_t *args, bcf1_t *rec, flt_stats_t *flt)
+{
+ int i,j;
+
+ // Find out which trios pass and if the site passes
+ if ( flt->filter )
+ {
+ uint8_t *smpl_pass = NULL;
+ int pass_site = filter_test(flt->filter, rec, (const uint8_t**) &smpl_pass);
+ if ( args->filter_logic & FLT_EXCLUDE )
+ {
+ if ( pass_site )
+ {
+ if ( !smpl_pass ) return;
+ pass_site = 0;
+ for (i=0; i<args->ntrio; i++)
+ {
+ int pass_trio = 1;
+ for (j=0; j<3; j++)
+ {
+ int idx = args->trio[i].idx[j];
+ if ( smpl_pass[idx] ) { pass_trio = 0; break; }
+ }
+ args->trio[i].pass = pass_trio;
+ if ( pass_trio ) pass_site = 1;
+ }
+ if ( !pass_site ) return;
+ }
+ else
+ for (i=0; i<args->ntrio; i++) args->trio[i].pass = 1;
+ }
+ else if ( !pass_site ) return;
+ else if ( smpl_pass )
+ {
+ pass_site = 0;
+ for (i=0; i<args->ntrio; i++)
+ {
+ int pass_trio = 1;
+ for (j=0; j<3; j++)
+ {
+ int idx = args->trio[i].idx[j];
+ if ( !smpl_pass[idx] ) { pass_trio = 0; break; }
+ }
+ args->trio[i].pass = pass_trio;
+ if ( pass_trio ) pass_site = 1;
+ }
+ if ( !pass_site ) return;
+ }
+ else
+ for (i=0; i<args->ntrio; i++) args->trio[i].pass = 1;
+ }
+
+ // Find out the allele counts. Try to use INFO/AC, if not present, determine from the genotypes
+ hts_expand(int, rec->n_allele, args->mac, args->ac);
+ if ( !bcf_calc_ac(args->hdr, rec, args->ac, BCF_UN_INFO|BCF_UN_FMT) ) return;
+ hts_expand(int, rec->n_allele, args->mac_trio, args->ac_trio);
+
+ // Get the genotypes
+ int ngt = bcf_get_genotypes(args->hdr, rec, &args->gt_arr, &args->mgt_arr);
+ if ( ngt<0 ) return;
+ int ngt1 = ngt / rec->n_sample;
+
+
+ // For ts/tv: numeric code of the reference allele, -1 if REF is longer than one base
+ int ref = !rec->d.allele[0][1] ? bcf_acgt2int(*rec->d.allele[0]) : -1;
+
+ int star_allele = -1;
+ for (i=1; i<rec->n_allele; i++)
+ if ( !rec->d.allele[i][1] && rec->d.allele[i][0]=='*' ) { star_allele = i; break; }
+
+ // Run the stats
+ for (i=0; i<args->ntrio; i++)
+ {
+ if ( flt->filter && !args->trio[i].pass ) continue;
+ trio_stats_t *stats = &flt->stats[i];
+
+ // Determine the alternate allele and the genotypes, skip if any of the alleles is missing.
+ // the order is: child, father, mother
+ int als[6], *als_child = als, *als_father = als+2, *als_mother = als+4;
+ if ( parse_genotype(args->gt_arr, ngt1, args->trio[i].idx[iCHILD], als_child) < 0 ) continue;
+ if ( parse_genotype(args->gt_arr, ngt1, args->trio[i].idx[iFATHER], als_father) < 0 ) continue;
+ if ( parse_genotype(args->gt_arr, ngt1, args->trio[i].idx[iMOTHER], als_mother) < 0 ) continue;
+
+ stats->npass++;
+
+ // Does the trio have an alternate allele other than *?
+ int has_star_allele = 0, has_nonref = 0;
+ memset(args->ac_trio,0,rec->n_allele*sizeof(*args->ac_trio));
+ for (j=0; j<6; j++)
+ {
+ if ( als[j]==star_allele ) { has_star_allele = 1; continue; }
+ if ( als[j]==0 ) continue;
+ has_nonref = 1;
+ args->ac_trio[ als[j] ]++;
+ }
+ if ( !has_nonref ) continue; // only ref or * in this trio
+
+ stats->nnon_ref++;
+
+ // Calculate ts/tv. Every allele is checked separately, so HetAA genotypes are handled as well
+ if ( ref != -1 )
+ {
+ int has_ts = 0, has_tv = 0;
+ for (j=0; j<6; j++)
+ {
+ if ( als[j]==0 || als[j]==star_allele ) continue;
+ if ( als[j] >= rec->n_allele )
+ error("The GT index is out of range at %s:%d in %s\n", bcf_seqname(args->hdr,rec),rec->pos+1,args->hdr->samples[args->trio[i].idx[j/2]]);
+ if ( rec->d.allele[als[j]][1] ) continue;
+
+ int alt = bcf_acgt2int(rec->d.allele[als[j]][0]);
+ if ( abs(ref-alt)==2 ) has_ts = 1;
+ else has_tv = 1;
+ }
+ if ( has_ts ) stats->nts++;
+ if ( has_tv ) stats->ntv++;
+ }
+
+ // Skip some stats if the star allele is present: it was already checked at the primary record and we do not want to count the same
+ // thing multiple times. There can be other alternate alleles, but we ignore that for simplicity.
+ if ( has_star_allele ) continue;
+
+ // Detect mendelian errors
+ int mendel_ok = (als_child[0]==als_father[0] || als_child[0]==als_father[1]) && (als_child[1]==als_mother[0] || als_child[1]==als_mother[1]) ? 1 : 0;
+ if ( !mendel_ok ) mendel_ok = (als_child[1]==als_father[0] || als_child[1]==als_father[1]) && (als_child[0]==als_mother[0] || als_child[0]==als_mother[1]) ? 1 : 0;
+ if ( !mendel_ok ) stats->nmendel_err++;
+
+ // Is this a singleton, doubleton, neither?
+ for (j=1; j<rec->n_allele; j++)
+ {
+ if ( args->ac_trio[j]==1 && args->ac[j]==1 ) // singleton (in parent) or novel (in child)
+ {
+ if ( als_child[0]==j || als_child[1]==j ) stats->nnovel++;
+ else stats->nsingleton++;
+ }
+ else if ( args->ac_trio[j]==2 && args->ac[j]==2 ) // possibly a doubleton
+ {
+ if ( (als_child[0]==j || als_child[1]==j) && (als_child[0]!=j || als_child[1]!=j) ) stats->ndoubleton++;
+ }
+ }
+ }
+}
+
+int run(int argc, char **argv)
+{
+ args_t *args = (args_t*) calloc(1,sizeof(args_t));
+ args->argc = argc; args->argv = argv;
+ args->output_fname = "-";
+ static struct option loptions[] =
+ {
+ {"include",required_argument,0,'i'},
+ {"exclude",required_argument,0,'e'},
+ {"output",required_argument,NULL,'o'},
+ {"ped",required_argument,NULL,'p'},
+ {"regions",1,0,'r'},
+ {"regions-file",1,0,'R'},
+ {"targets",1,0,'t'},
+ {"targets-file",1,0,'T'},
+ {NULL,0,NULL,0}
+ };
+ int c, i;
+ while ((c = getopt_long(argc, argv, "p:o:s:i:e:r:R:t:T:",loptions,NULL)) >= 0)
+ {
+ switch (c)
+ {
+ case 'e': args->filter_str = optarg; args->filter_logic |= FLT_EXCLUDE; break;
+ case 'i': args->filter_str = optarg; args->filter_logic |= FLT_INCLUDE; break;
+ case 't': args->targets = optarg; break;
+ case 'T': args->targets = optarg; args->targets_is_file = 1; break;
+ case 'r': args->regions = optarg; break;
+ case 'R': args->regions = optarg; args->regions_is_file = 1; break;
+ case 'o': args->output_fname = optarg; break;
+ case 'p': args->ped_fname = optarg; break;
+ case 'h':
+ case '?':
+ default: error("%s", usage_text()); break;
+ }
+ }
+ if ( optind==argc )
+ {
+ if ( !isatty(fileno((FILE *)stdin)) ) args->fname = "-"; // reading from stdin
+ else { error("%s", usage_text()); }
+ }
+ else if ( optind+1!=argc ) error("%s", usage_text());
+ else args->fname = argv[optind];
+
+ if ( !args->ped_fname ) error("Missing the -p, --ped option\n");
+
+ init_data(args);
+
+ while ( bcf_sr_next_line(args->sr) )
+ {
+ bcf1_t *rec = bcf_sr_get_line(args->sr,0);
+ for (i=0; i<args->nfilters; i++)
+ process_record(args, rec, &args->filters[i]);
+ }
+
+ report_stats(args);
+ destroy_data(args);
+
+ return 0;
+}
--- /dev/null
+/* The MIT License
+
+ Copyright (c) 2016 Genome Research Ltd.
+
+ Author: Petr Danecek <pd3@sanger.ac.uk>
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in
+ all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ THE SOFTWARE.
+
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <getopt.h>
+#include <math.h>
+#include <htslib/hts.h>
+#include <htslib/vcf.h>
+#include <htslib/kstring.h>
+#include <htslib/kseq.h>
+#include <htslib/khash_str2int.h>
+#include "bcftools.h"
+
+typedef struct
+{
+ int father, mother, child; // VCF sample index
+ int prev, ipop;
+ uint32_t err, nswitch, ntest;
+}
+trio_t;
+
+typedef struct
+{
+ char *name;
+ uint32_t err, nswitch, ntest, ntrio;
+ float pswitch;
+}
+pop_t;
+
+typedef struct
+{
+ int argc;
+ char **argv;
+ bcf_hdr_t *hdr;
+ trio_t *trio;
+ int ntrio, mtrio;
+ int32_t *gt_arr;
+ int npop;
+ pop_t *pop;
+ int mgt_arr, prev_rid;
+}
+args_t;
+
+args_t args;
+
+const char *about(void)
+{
+ return "Calculate phase switch rate in trio samples, children samples must have phased GTs.\n";
+}
+
+const char *usage(void)
+{
+ return
+ "\n"
+ "About: Calculate phase switch rate in trio children.\n"
+ "Usage: bcftools +trio-switch-rate [General Options] -- [Plugin Options]\n"
+ "Options:\n"
+ " run \"bcftools plugin\" for a list of common options\n"
+ "\n"
+ "Plugin options:\n"
+ " -p, --ped <file> PED file with optional 7th column to group\n"
+ " results by population\n"
+ "\n"
+ "Example:\n"
+ " bcftools +trio-switch-rate file.bcf -- -p file.ped\n"
+ "\n";
+}
+
+void parse_ped(args_t *args, char *fname)
+{
+ htsFile *fp = hts_open(fname, "r");
+ if ( !fp ) error("Could not read: %s\n", fname);
+
+ kstring_t str = {0,0,0};
+ if ( hts_getline(fp, KS_SEP_LINE, &str) <= 0 ) error("Empty file: %s\n", fname);
+
+ void *pop2i = khash_str2int_init();
+
+ int moff = 0, *off = NULL;
+ do
+ {
+ // familyID sampleID paternalID maternalID sex phenotype population relationship siblings secondOrder thirdOrder children comment
+ // BB03 HG01884 HG01885 HG01956 2 0 ACB child 0 0 0 0
+ int ncols = ksplit_core(str.s,0,&moff,&off);
+ if ( ncols<4 ) error("Could not parse the ped file: %s\n", str.s);
+
+ int father = bcf_hdr_id2int(args->hdr,BCF_DT_SAMPLE,&str.s[off[2]]);
+ if ( father<0 ) continue;
+ int mother = bcf_hdr_id2int(args->hdr,BCF_DT_SAMPLE,&str.s[off[3]]);
+ if ( mother<0 ) continue;
+ int child = bcf_hdr_id2int(args->hdr,BCF_DT_SAMPLE,&str.s[off[1]]);
+ if ( child<0 ) continue;
+
+ args->ntrio++;
+ hts_expand0(trio_t,args->ntrio,args->mtrio,args->trio);
+ trio_t *trio = &args->trio[args->ntrio-1];
+ trio->father = father;
+ trio->mother = mother;
+ trio->child = child;
+
+ if (ncols>6) {
+ char *pop_name = &str.s[off[6]];
+ if ( !khash_str2int_has_key(pop2i,pop_name) )
+ {
+ pop_name = strdup(&str.s[off[6]]);
+ khash_str2int_set(pop2i,pop_name,args->npop);
+ args->npop++;
+ args->pop = (pop_t*) realloc(args->pop,args->npop*sizeof(*args->pop));
+ memset(args->pop+args->npop-1,0,sizeof(*args->pop));
+ args->pop[args->npop-1].name = pop_name;
+ }
+ khash_str2int_get(pop2i,pop_name,&trio->ipop);
+ args->pop[trio->ipop].ntrio++;
+ }
+ } while ( hts_getline(fp, KS_SEP_LINE, &str)>=0 );
+
+ khash_str2int_destroy(pop2i);
+ free(str.s);
+ free(off);
+ hts_close(fp);
+}
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+ memset(&args,0,sizeof(args_t));
+ args.argc = argc; args.argv = argv;
+ args.prev_rid = -1;
+ args.hdr = in;
+ char *ped_fname = NULL;
+ static struct option loptions[] =
+ {
+ {"ped",required_argument,NULL,'p'},
+ {0,0,0,0}
+ };
+ int c;
+ while ((c = getopt_long(argc, argv, "?hp:",loptions,NULL)) >= 0)
+ {
+ switch (c)
+ {
+ case 'p': ped_fname = optarg; break;
+ case 'h':
+ case '?':
+ default: error("%s", usage()); break;
+ }
+ }
+ if ( !ped_fname ) error("Expected the -p option\n");
+ parse_ped(&args, ped_fname);
+ return 1;
+}
+
+typedef struct
+{
+ int a, b, phased;
+}
+gt_t;
+
+int parse_genotype(gt_t *gt, int32_t *ptr);
+
+inline int parse_genotype(gt_t *gt, int32_t *ptr)
+{
+ if ( ptr[0]==bcf_gt_missing ) return 0;
+ if ( ptr[1]==bcf_gt_missing ) return 0;
+ if ( ptr[1]==bcf_int32_vector_end ) return 0;
+ gt->phased = bcf_gt_is_phased(ptr[1]) ? 1 : 0;
+ gt->a = bcf_gt_allele(ptr[0]); if ( gt->a > 1 ) return 0; // consider only the first two alleles at biallelic sites
+ gt->b = bcf_gt_allele(ptr[1]); if ( gt->b > 1 ) return 0;
+ return 1;
+}
+
+bcf1_t *process(bcf1_t *rec)
+{
+ int ngt = bcf_get_genotypes(args.hdr, rec, &args.gt_arr, &args.mgt_arr);
+ if ( ngt<0 ) return NULL;
+ ngt /= bcf_hdr_nsamples(args.hdr);
+ if ( ngt!=2 ) return NULL;
+
+ int i;
+ if ( rec->rid!=args.prev_rid )
+ {
+ args.prev_rid = rec->rid;
+ for (i=0; i<args.ntrio; i++) args.trio[i].prev = 0;
+ }
+
+ gt_t child, father, mother;
+ for (i=0; i<args.ntrio; i++)
+ {
+ trio_t *trio = &args.trio[i];
+
+ if ( !parse_genotype(&child, args.gt_arr + ngt*trio->child) ) continue;
+ if ( !child.phased ) continue;
+ if ( child.a+child.b != 1 ) continue; // child is not a het
+
+ if ( !parse_genotype(&father, args.gt_arr + ngt*trio->father) ) continue;
+ if ( !parse_genotype(&mother, args.gt_arr + ngt*trio->mother) ) continue;
+ if ( father.a+father.b == 1 && mother.a+mother.b == 1 ) continue; // both parents are hets
+ if ( father.a+father.b == mother.a+mother.b ) { trio->err++; continue; } // mendelian error
+
+ int test_phase = 0;
+ if ( father.a==father.b ) test_phase = 1 + (child.a==father.a);
+ else if ( mother.a==mother.b ) test_phase = 1 + (child.b==mother.a);
+ if ( trio->prev > 0 )
+ {
+ if ( trio->prev!=test_phase ) trio->nswitch++;
+ }
+ trio->ntest++;
+ trio->prev = test_phase;
+ }
+ return NULL;
+}
+
+void destroy(void)
+{
+ int i;
+ printf("# This file was produced by: bcftools +trio-switch-rate(%s+htslib-%s)\n", bcftools_version(),hts_version());
+ printf("# The command line was:\tbcftools +trio-switch-rate %s", args.argv[0]);
+ for (i=1; i<args.argc; i++) printf(" %s",args.argv[i]);
+ printf("\n#\n");
+ printf("# TRIO\t[2]Father\t[3]Mother\t[4]Child\t[5]nTested\t[6]nMendelian Errors\t[7]nSwitch\t[8]nSwitch (%%)\n");
+ for (i=0; i<args.ntrio; i++)
+ {
+ trio_t *trio = &args.trio[i];
+ printf("TRIO\t%s\t%s\t%s\t%d\t%d\t%d\t%.2f\n",
+ bcf_hdr_int2id(args.hdr,BCF_DT_SAMPLE,trio->father),
+ bcf_hdr_int2id(args.hdr,BCF_DT_SAMPLE,trio->mother),
+ bcf_hdr_int2id(args.hdr,BCF_DT_SAMPLE,trio->child),
+ trio->ntest, trio->err, trio->nswitch, trio->ntest ? trio->nswitch*100./trio->ntest : 0
+ );
+ if (args.npop) {
+ pop_t *pop = &args.pop[trio->ipop];
+ pop->err += trio->err;
+ pop->nswitch += trio->nswitch;
+ pop->ntest += trio->ntest;
+ pop->pswitch += trio->ntest ? trio->nswitch*100./trio->ntest : 0;
+ }
+ }
+ printf("# POP\tpopulation or other grouping defined by an optional 7-th column of the PED file\n");
+ printf("# POP\t[2]Name\t[3]Number of trios\t[4]avgTested\t[5]avgMendelian Errors\t[6]avgSwitch\t[7]avgSwitch (%%)\n");
+ for (i=0; i<args.npop; i++)
+ {
+ pop_t *pop = &args.pop[i];
+ printf("POP\t%s\t%d\t%.0f\t%.0f\t%.0f\t%.2f\n", pop->name,pop->ntrio,
+ (float)pop->ntest/pop->ntrio,(float)pop->err/pop->ntrio,(float)pop->nswitch/pop->ntrio,
+ pop->pswitch/pop->ntrio);
+ }
+ for (i=0; i<args.npop; i++) free(args.pop[i].name);
+ free(args.pop);
+ free(args.trio);
+ free(args.gt_arr);
+}
--- /dev/null
+#include "bcftools.pysam.h"
+
+/* The MIT License
+
+ Copyright (c) 2016 Genome Research Ltd.
+
+ Author: Petr Danecek <pd3@sanger.ac.uk>
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in
+ all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ THE SOFTWARE.
+
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <getopt.h>
+#include <math.h>
+#include <htslib/hts.h>
+#include <htslib/vcf.h>
+#include <htslib/kstring.h>
+#include <htslib/kseq.h>
+#include <htslib/khash_str2int.h>
+#include "bcftools.h"
+
+typedef struct
+{
+ int father, mother, child; // VCF sample index
+ int prev, ipop;
+ uint32_t err, nswitch, ntest;
+}
+trio_t;
+
+typedef struct
+{
+ char *name;
+ uint32_t err, nswitch, ntest, ntrio;
+ float pswitch;
+}
+pop_t;
+
+typedef struct
+{
+ int argc;
+ char **argv;
+ bcf_hdr_t *hdr;
+ trio_t *trio;
+ int ntrio, mtrio;
+ int32_t *gt_arr;
+ int npop;
+ pop_t *pop;
+ int mgt_arr, prev_rid;
+}
+args_t;
+
+args_t args;
+
+const char *about(void)
+{
+ return "Calculate phase switch rate in trio samples, children samples must have phased GTs.\n";
+}
+
+const char *usage(void)
+{
+ return
+ "\n"
+ "About: Calculate phase switch rate in trio children.\n"
+ "Usage: bcftools +trio-switch-rate [General Options] -- [Plugin Options]\n"
+ "Options:\n"
+ " run \"bcftools plugin\" for a list of common options\n"
+ "\n"
+ "Plugin options:\n"
+ " -p, --ped <file> PED file with optional 7th column to group\n"
+ " results by population\n"
+ "\n"
+ "Example:\n"
+ " bcftools +trio-switch-rate file.bcf -- -p file.ped\n"
+ "\n";
+}
+
+void parse_ped(args_t *args, char *fname)
+{
+ htsFile *fp = hts_open(fname, "r");
+ if ( !fp ) error("Could not read: %s\n", fname);
+
+ kstring_t str = {0,0,0};
+ if ( hts_getline(fp, KS_SEP_LINE, &str) <= 0 ) error("Empty file: %s\n", fname);
+
+ void *pop2i = khash_str2int_init();
+
+ int moff = 0, *off = NULL;
+ do
+ {
+ // familyID sampleID paternalID maternalID sex phenotype population relationship siblings secondOrder thirdOrder children comment
+ // BB03 HG01884 HG01885 HG01956 2 0 ACB child 0 0 0 0
+ int ncols = ksplit_core(str.s,0,&moff,&off);
+ if ( ncols<4 ) error("Could not parse the ped file: %s\n", str.s);
+
+ int father = bcf_hdr_id2int(args->hdr,BCF_DT_SAMPLE,&str.s[off[2]]);
+ if ( father<0 ) continue;
+ int mother = bcf_hdr_id2int(args->hdr,BCF_DT_SAMPLE,&str.s[off[3]]);
+ if ( mother<0 ) continue;
+ int child = bcf_hdr_id2int(args->hdr,BCF_DT_SAMPLE,&str.s[off[1]]);
+ if ( child<0 ) continue;
+
+ args->ntrio++;
+ hts_expand0(trio_t,args->ntrio,args->mtrio,args->trio);
+ trio_t *trio = &args->trio[args->ntrio-1];
+ trio->father = father;
+ trio->mother = mother;
+ trio->child = child;
+
+ if (ncols>6) {
+ char *pop_name = &str.s[off[6]];
+ if ( !khash_str2int_has_key(pop2i,pop_name) )
+ {
+ pop_name = strdup(&str.s[off[6]]);
+ khash_str2int_set(pop2i,pop_name,args->npop);
+ args->npop++;
+ args->pop = (pop_t*) realloc(args->pop,args->npop*sizeof(*args->pop));
+ memset(args->pop+args->npop-1,0,sizeof(*args->pop));
+ args->pop[args->npop-1].name = pop_name;
+ }
+ khash_str2int_get(pop2i,pop_name,&trio->ipop);
+ args->pop[trio->ipop].ntrio++;
+ }
+ } while ( hts_getline(fp, KS_SEP_LINE, &str)>=0 );
+
+ khash_str2int_destroy(pop2i);
+ free(str.s);
+ free(off);
+ hts_close(fp);
+}
+
+int init(int argc, char **argv, bcf_hdr_t *in, bcf_hdr_t *out)
+{
+ memset(&args,0,sizeof(args_t));
+ args.argc = argc; args.argv = argv;
+ args.prev_rid = -1;
+ args.hdr = in;
+ char *ped_fname = NULL;
+ static struct option loptions[] =
+ {
+ {"ped",required_argument,NULL,'p'},
+ {0,0,0,0}
+ };
+ int c;
+ while ((c = getopt_long(argc, argv, "?hp:",loptions,NULL)) >= 0)
+ {
+ switch (c)
+ {
+ case 'p': ped_fname = optarg; break;
+ case 'h':
+ case '?':
+ default: error("%s", usage()); break;
+ }
+ }
+ if ( !ped_fname ) error("Expected the -p option\n");
+ parse_ped(&args, ped_fname);
+ return 1;
+}
+
+typedef struct
+{
+ int a, b, phased;
+}
+gt_t;
+
+int parse_genotype(gt_t *gt, int32_t *ptr);
+
+inline int parse_genotype(gt_t *gt, int32_t *ptr)
+{
+ if ( ptr[0]==bcf_gt_missing ) return 0;
+ if ( ptr[1]==bcf_gt_missing ) return 0;
+ if ( ptr[1]==bcf_int32_vector_end ) return 0;
+ gt->phased = bcf_gt_is_phased(ptr[1]) ? 1 : 0;
+ gt->a = bcf_gt_allele(ptr[0]); if ( gt->a > 1 ) return 0; // consider only the first two alleles at biallelic sites
+ gt->b = bcf_gt_allele(ptr[1]); if ( gt->b > 1 ) return 0;
+ return 1;
+}
+
+bcf1_t *process(bcf1_t *rec)
+{
+ int ngt = bcf_get_genotypes(args.hdr, rec, &args.gt_arr, &args.mgt_arr);
+ if ( ngt<0 ) return NULL;
+ ngt /= bcf_hdr_nsamples(args.hdr);
+ if ( ngt!=2 ) return NULL;
+
+ int i;
+ if ( rec->rid!=args.prev_rid )
+ {
+ args.prev_rid = rec->rid;
+ for (i=0; i<args.ntrio; i++) args.trio[i].prev = 0;
+ }
+
+ gt_t child, father, mother;
+ for (i=0; i<args.ntrio; i++)
+ {
+ trio_t *trio = &args.trio[i];
+
+ if ( !parse_genotype(&child, args.gt_arr + ngt*trio->child) ) continue;
+ if ( !child.phased ) continue;
+ if ( child.a+child.b != 1 ) continue; // child is not a het
+
+ if ( !parse_genotype(&father, args.gt_arr + ngt*trio->father) ) continue;
+ if ( !parse_genotype(&mother, args.gt_arr + ngt*trio->mother) ) continue;
+ if ( father.a+father.b == 1 && mother.a+mother.b == 1 ) continue; // both parents are hets
+ if ( father.a+father.b == mother.a+mother.b ) { trio->err++; continue; } // mendelian error
+
+ int test_phase = 0;
+ if ( father.a==father.b ) test_phase = 1 + (child.a==father.a);
+ else if ( mother.a==mother.b ) test_phase = 1 + (child.b==mother.a);
+ if ( trio->prev > 0 )
+ {
+ if ( trio->prev!=test_phase ) trio->nswitch++;
+ }
+ trio->ntest++;
+ trio->prev = test_phase;
+ }
+ return NULL;
+}
+
+void destroy(void)
+{
+ int i;
+ fprintf(bcftools_stdout, "# This file was produced by: bcftools +trio-switch-rate(%s+htslib-%s)\n", bcftools_version(),hts_version());
+ fprintf(bcftools_stdout, "# The command line was:\tbcftools +trio-switch-rate %s", args.argv[0]);
+ for (i=1; i<args.argc; i++) fprintf(bcftools_stdout, " %s",args.argv[i]);
+ fprintf(bcftools_stdout, "\n#\n");
+ fprintf(bcftools_stdout, "# TRIO\t[2]Father\t[3]Mother\t[4]Child\t[5]nTested\t[6]nMendelian Errors\t[7]nSwitch\t[8]nSwitch (%%)\n");
+ for (i=0; i<args.ntrio; i++)
+ {
+ trio_t *trio = &args.trio[i];
+ fprintf(bcftools_stdout, "TRIO\t%s\t%s\t%s\t%d\t%d\t%d\t%.2f\n",
+ bcf_hdr_int2id(args.hdr,BCF_DT_SAMPLE,trio->father),
+ bcf_hdr_int2id(args.hdr,BCF_DT_SAMPLE,trio->mother),
+ bcf_hdr_int2id(args.hdr,BCF_DT_SAMPLE,trio->child),
+ trio->ntest, trio->err, trio->nswitch, trio->ntest ? trio->nswitch*100./trio->ntest : 0
+ );
+ if (args.npop) {
+ pop_t *pop = &args.pop[trio->ipop];
+ pop->err += trio->err;
+ pop->nswitch += trio->nswitch;
+ pop->ntest += trio->ntest;
+ pop->pswitch += trio->ntest ? trio->nswitch*100./trio->ntest : 0;
+ }
+ }
+ fprintf(bcftools_stdout, "# POP\tpopulation or other grouping defined by an optional 7-th column of the PED file\n");
+ fprintf(bcftools_stdout, "# POP\t[2]Name\t[3]Number of trios\t[4]avgTested\t[5]avgMendelian Errors\t[6]avgSwitch\t[7]avgSwitch (%%)\n");
+ for (i=0; i<args.npop; i++)
+ {
+ pop_t *pop = &args.pop[i];
+ fprintf(bcftools_stdout, "POP\t%s\t%d\t%.0f\t%.0f\t%.0f\t%.2f\n", pop->name,pop->ntrio,
+ (float)pop->ntest/pop->ntrio,(float)pop->err/pop->ntrio,(float)pop->nswitch/pop->ntrio,
+ pop->pswitch/pop->ntrio);
+ }
+ for (i=0; i<args.npop; i++) free(args.pop[i].name);
+ free(args.pop);
+ free(args.trio);
+ free(args.gt_arr);
+}
parser = regidx_parse_bed;
else if ( len>=4 && !strcasecmp(".bed",fname+len-4) )
parser = regidx_parse_bed;
+ else if ( len>=4 && !strcasecmp(".vcf",fname+len-4) )
+ parser = regidx_parse_vcf;
+ else if ( len>=7 && !strcasecmp(".vcf.gz",fname+len-7) )
+ parser = regidx_parse_vcf;
else
parser = regidx_parse_tab;
}
{
ss = se+1;
*end = strtod(ss, &se);
- if ( ss==se ) *end = *beg;
+ if ( ss==se || (*se && !isspace(*se)) ) *end = *beg;
else if ( *end==0 ) { fprintf(stderr,"Could not parse tab line, expected 1-based coordinate: %s\n", line); return -2; }
else (*end)--;
}
return 0;
}
+int regidx_parse_vcf(const char *line, char **chr_beg, char **chr_end, uint32_t *beg, uint32_t *end, void *payload, void *usr)
+{
+ int ret = regidx_parse_tab(line, chr_beg, chr_end, beg, end, payload, usr);
+ if ( !ret ) *end = *beg;
+ return ret;
+}
+
int regidx_parse_reg(const char *line, char **chr_beg, char **chr_end, uint32_t *beg, uint32_t *end, void *payload, void *usr)
{
char *ss = (char*) line;
parser = regidx_parse_bed;
else if ( len>=4 && !strcasecmp(".bed",fname+len-4) )
parser = regidx_parse_bed;
+ else if ( len>=4 && !strcasecmp(".vcf",fname+len-4) )
+ parser = regidx_parse_vcf;
+ else if ( len>=7 && !strcasecmp(".vcf.gz",fname+len-7) )
+ parser = regidx_parse_vcf;
else
parser = regidx_parse_tab;
}
{
ss = se+1;
*end = strtod(ss, &se);
- if ( ss==se ) *end = *beg;
+ if ( ss==se || (*se && !isspace(*se)) ) *end = *beg;
else if ( *end==0 ) { fprintf(bcftools_stderr,"Could not parse tab line, expected 1-based coordinate: %s\n", line); return -2; }
else (*end)--;
}
return 0;
}
+int regidx_parse_vcf(const char *line, char **chr_beg, char **chr_end, uint32_t *beg, uint32_t *end, void *payload, void *usr)
+{
+ int ret = regidx_parse_tab(line, chr_beg, chr_end, beg, end, payload, usr);
+ if ( !ret ) *end = *beg;
+ return ret;
+}
+
int regidx_parse_reg(const char *line, char **chr_beg, char **chr_end, uint32_t *beg, uint32_t *end, void *payload, void *usr)
{
char *ss = (char*) line;
int regidx_parse_bed(const char*,char**,char**,uint32_t*,uint32_t*,void*,void*); // CHROM or whitespace-separated CHROM,FROM,TO (0-based,right-open)
int regidx_parse_tab(const char*,char**,char**,uint32_t*,uint32_t*,void*,void*); // CHROM or whitespace-separated CHROM,POS (1-based, inclusive)
int regidx_parse_reg(const char*,char**,char**,uint32_t*,uint32_t*,void*,void*); // CHROM, CHROM:POS, CHROM:FROM-TO, CHROM:FROM- (1-based, inclusive)
+int regidx_parse_vcf(const char*,char**,char**,uint32_t*,uint32_t*,void*,void*); // VCF line: CHROM,POS (1-based); the region is the single position POS
/*
* regidx_init() - creates new index
/* reheader.c -- reheader subcommand.
- Copyright (C) 2014-2017 Genome Research Ltd.
+ Copyright (C) 2014-2018 Genome Research Ltd.
Author: Petr Danecek <pd3@sanger.ac.uk>
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>
+#include <inttypes.h>
#include <fcntl.h>
#include <math.h>
#include <htslib/vcf.h>
#include <htslib/bgzf.h>
#include <htslib/tbx.h> // for hts_get_bgzfp()
#include <htslib/kseq.h>
+#include <htslib/thread_pool.h>
#include "bcftools.h"
#include "khash_str2str.h"
char **argv, *fname, *samples_fname, *header_fname, *output_fname;
htsFile *fp;
htsFormat type;
- int argc;
+ htsThreadPool *threads;
+ int argc, n_threads;
}
args_t;
int out = args->output_fname ? open(args->output_fname, O_WRONLY|O_CREAT|O_TRUNC, 0666) : STDOUT_FILENO;
if ( out==-1 ) error("%s: %s\n", args->output_fname,strerror(errno));
- if ( write(out, hdr.s, hdr.l)!=hdr.l ) error("Failed to write %d bytes\n", hdr.l);
+ if ( write(out, hdr.s, hdr.l)!=hdr.l ) error("Failed to write %"PRIu64" bytes\n", (uint64_t)hdr.l);
free(hdr.s);
if ( fp->line.l )
{
- if ( write(out, fp->line.s, fp->line.l)!=fp->line.l ) error("Failed to write %d bytes\n", fp->line.l);
+ if ( write(out, fp->line.s, fp->line.l)!=fp->line.l ) error("Failed to write %"PRIu64" bytes\n", (uint64_t)fp->line.l);
}
while ( hts_getline(fp, KS_SEP_LINE, &fp->line) >=0 ) // uncompressed file implies small size, we don't worry about speed
{
kputc('\n',&fp->line);
- if ( write(out, fp->line.s, fp->line.l)!=fp->line.l ) error("Failed to write %d bytes\n", fp->line.l);
+ if ( write(out, fp->line.s, fp->line.l)!=fp->line.l ) error("Failed to write %"PRIu64" bytes\n", (uint64_t)fp->line.l);
}
hts_close(fp);
close(out);
static void reheader_bcf(args_t *args, int is_compressed)
{
htsFile *fp = args->fp;
+
+ if ( args->n_threads > 0 )
+ {
+ args->threads = calloc(1, sizeof(*args->threads));
+ if ( !args->threads ) error("Could not allocate memory\n");
+ if ( !(args->threads->pool = hts_tpool_init(args->n_threads)) ) error("Could not initialize threading\n");
+ BGZF *bgzf = hts_get_bgzfp(fp);
+ if ( bgzf ) bgzf_thread_pool(bgzf, args->threads->pool, args->threads->qsize);
+ }
+
bcf_hdr_t *hdr = bcf_hdr_read(fp); if ( !hdr ) error("Failed to read the header: %s\n", args->fname);
kstring_t htxt = {0,0,0};
bcf_hdr_format(hdr, 1, &htxt);
// write the header and the body
htsFile *fp_out = hts_open(args->output_fname ? args->output_fname : "-",is_compressed ? "wb" : "wbu");
if ( !fp_out ) error("%s: %s\n", args->output_fname ? args->output_fname : "-", strerror(errno));
+ if ( args->threads )
+ {
+ BGZF *bgzf = hts_get_bgzfp(fp_out);
+ if ( bgzf ) bgzf_thread_pool(bgzf, args->threads->pool, args->threads->qsize);
+ }
bcf_hdr_write(fp_out, hdr_out);
bcf1_t *rec = bcf_init();
hts_close(fp);
bcf_hdr_destroy(hdr_out);
bcf_hdr_destroy(hdr);
+ if ( args->threads )
+ {
+ hts_tpool_destroy(args->threads->pool);
+ free(args->threads);
+ }
}
fprintf(stderr, " -h, --header <file> new header\n");
fprintf(stderr, " -o, --output <file> write output to a file [standard output]\n");
fprintf(stderr, " -s, --samples <file> new sample names\n");
+ fprintf(stderr, " --threads <int> number of extra compression threads (BCF only) [0]\n");
fprintf(stderr, "\n");
exit(1);
}
int c;
args_t *args = (args_t*) calloc(1,sizeof(args_t));
args->argc = argc; args->argv = argv;
-
+
static struct option loptions[] =
{
{"output",1,0,'o'},
{"header",1,0,'h'},
{"samples",1,0,'s'},
+ {"threads",1,NULL,1},
{0,0,0,0}
};
while ((c = getopt_long(argc, argv, "s:h:o:",loptions,NULL)) >= 0)
{
switch (c)
{
+ case 1 : args->n_threads = strtol(optarg, 0, 0); break;
case 'o': args->output_fname = optarg; break;
case 's': args->samples_fname = optarg; break;
case 'h': args->header_fname = optarg; break;
/* reheader.c -- reheader subcommand.
- Copyright (C) 2014-2017 Genome Research Ltd.
+ Copyright (C) 2014-2018 Genome Research Ltd.
Author: Petr Danecek <pd3@sanger.ac.uk>
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>
+#include <inttypes.h>
#include <fcntl.h>
#include <math.h>
#include <htslib/vcf.h>
#include <htslib/bgzf.h>
#include <htslib/tbx.h> // for hts_get_bgzfp()
#include <htslib/kseq.h>
+#include <htslib/thread_pool.h>
#include "bcftools.h"
#include "khash_str2str.h"
char **argv, *fname, *samples_fname, *header_fname, *output_fname;
htsFile *fp;
htsFormat type;
- int argc;
+ htsThreadPool *threads;
+ int argc, n_threads;
}
args_t;
int out = args->output_fname ? open(args->output_fname, O_WRONLY|O_CREAT|O_TRUNC, 0666) : STDOUT_FILENO;
if ( out==-1 ) error("%s: %s\n", args->output_fname,strerror(errno));
- if ( write(out, hdr.s, hdr.l)!=hdr.l ) error("Failed to write %d bytes\n", hdr.l);
+ if ( write(out, hdr.s, hdr.l)!=hdr.l ) error("Failed to write %"PRIu64" bytes\n", (uint64_t)hdr.l);
free(hdr.s);
if ( fp->line.l )
{
- if ( write(out, fp->line.s, fp->line.l)!=fp->line.l ) error("Failed to write %d bytes\n", fp->line.l);
+ if ( write(out, fp->line.s, fp->line.l)!=fp->line.l ) error("Failed to write %"PRIu64" bytes\n", (uint64_t)fp->line.l);
}
while ( hts_getline(fp, KS_SEP_LINE, &fp->line) >=0 ) // uncompressed file implies small size, we don't worry about speed
{
kputc('\n',&fp->line);
- if ( write(out, fp->line.s, fp->line.l)!=fp->line.l ) error("Failed to write %d bytes\n", fp->line.l);
+ if ( write(out, fp->line.s, fp->line.l)!=fp->line.l ) error("Failed to write %"PRIu64" bytes\n", (uint64_t)fp->line.l);
}
hts_close(fp);
close(out);
static void reheader_bcf(args_t *args, int is_compressed)
{
htsFile *fp = args->fp;
+
+ if ( args->n_threads > 0 )
+ {
+ args->threads = calloc(1, sizeof(*args->threads));
+ if ( !args->threads ) error("Could not allocate memory\n");
+ if ( !(args->threads->pool = hts_tpool_init(args->n_threads)) ) error("Could not initialize threading\n");
+ BGZF *bgzf = hts_get_bgzfp(fp);
+ if ( bgzf ) bgzf_thread_pool(bgzf, args->threads->pool, args->threads->qsize);
+ }
+
bcf_hdr_t *hdr = bcf_hdr_read(fp); if ( !hdr ) error("Failed to read the header: %s\n", args->fname);
kstring_t htxt = {0,0,0};
bcf_hdr_format(hdr, 1, &htxt);
// write the header and the body
htsFile *fp_out = hts_open(args->output_fname ? args->output_fname : "-",is_compressed ? "wb" : "wbu");
if ( !fp_out ) error("%s: %s\n", args->output_fname ? args->output_fname : "-", strerror(errno));
+ if ( args->threads )
+ {
+ BGZF *bgzf = hts_get_bgzfp(fp_out);
+ if ( bgzf ) bgzf_thread_pool(bgzf, args->threads->pool, args->threads->qsize);
+ }
bcf_hdr_write(fp_out, hdr_out);
bcf1_t *rec = bcf_init();
hts_close(fp);
bcf_hdr_destroy(hdr_out);
bcf_hdr_destroy(hdr);
+ if ( args->threads )
+ {
+ hts_tpool_destroy(args->threads->pool);
+ free(args->threads);
+ }
}
fprintf(bcftools_stderr, " -h, --header <file> new header\n");
fprintf(bcftools_stderr, " -o, --output <file> write output to a file [standard output]\n");
fprintf(bcftools_stderr, " -s, --samples <file> new sample names\n");
+ fprintf(bcftools_stderr, " --threads <int> number of extra compression threads (BCF only) [0]\n");
fprintf(bcftools_stderr, "\n");
exit(1);
}
int c;
args_t *args = (args_t*) calloc(1,sizeof(args_t));
args->argc = argc; args->argv = argv;
-
+
static struct option loptions[] =
{
{"output",1,0,'o'},
{"header",1,0,'h'},
{"samples",1,0,'s'},
+ {"threads",1,NULL,1},
{0,0,0,0}
};
while ((c = getopt_long(argc, argv, "s:h:o:",loptions,NULL)) >= 0)
{
switch (c)
{
+ case 1 : args->n_threads = strtol(optarg, 0, 0); break;
case 'o': args->output_fname = optarg; break;
case 's': args->samples_fname = optarg; break;
case 'h': args->header_fname = optarg; break;
BGZF *fp;
s.l = s.m = 0; s.s = 0;
fp = bgzf_open(argv[optind], "r");
- while (bgzf_getline(fp, '\n', &s) >= 0) fputs(s.s, bcftools_stdout) & fputc('\n', bcftools_stdout);
+ while (bgzf_getline(fp, '\n', &s) >= 0) bcftools_puts(s.s);
bgzf_close(fp);
free(s.s);
} else if (optind + 2 > argc) { // create index
for (i = optind + 1; i < argc; ++i) {
hts_itr_t *itr;
if ((itr = tbx_itr_querys(tbx, argv[i])) == 0) continue;
- while (tbx_bgzf_itr_next(fp, tbx, itr, &s) >= 0) fputs(s.s, bcftools_stdout) & fputc('\n', bcftools_stdout);
+ while (tbx_bgzf_itr_next(fp, tbx, itr, &s) >= 0) bcftools_puts(s.s);
tbx_itr_destroy(itr);
}
free(s.s);
--- /dev/null
+/* test/test-rbuf.c -- rbuf_t test harness.
+
+ Copyright (C) 2014 Genome Research Ltd.
+
+ Author: Petr Danecek <pd3@sanger.ac.uk>
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE. */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include "rbuf.h"
+
+void debug_print(rbuf_t *rbuf, int *dat)
+{
+ int i;
+ for (i=-1; rbuf_next(rbuf, &i); ) printf(" %2d", i);
+ printf("\n");
+ for (i=-1; rbuf_next(rbuf, &i); ) printf(" %2d", dat[i]);
+ printf("\n");
+}
+
+int main(int argc, char **argv)
+{
+ int i, j, *dat = (int*)calloc(10,sizeof(int));
+ rbuf_t rbuf;
+ rbuf_init(&rbuf,10);
+
+ rbuf.f = 5; // force wrapping
+ for (i=0; i<9; i++)
+ {
+ j = rbuf_append(&rbuf);
+ dat[j] = i+1;
+ }
+ printf("Inserted 1-9 starting at offset 5:\n");
+ debug_print(&rbuf, dat);
+
+ i = rbuf_kth(&rbuf, 3);
+ printf("4th is %d\n", dat[i]);
+
+ printf("Deleting 1-2:\n");
+ rbuf_shift_n(&rbuf, 2);
+ debug_print(&rbuf, dat);
+
+ printf("Prepending 0-8:\n");
+ for (i=0; i<9; i++)
+ {
+ j = rbuf_prepend(&rbuf);
+ dat[j] = i;
+ }
+ debug_print(&rbuf, dat);
+
+ printf("Expanding:\n");
+ rbuf_expand0(&rbuf,int,rbuf.n+1,dat);
+ debug_print(&rbuf, dat);
+
+ free(dat);
+ return 0;
+}
+
--- /dev/null
+#include "bcftools.pysam.h"
+
+/* test/test-rbuf.c -- rbuf_t test harness.
+
+ Copyright (C) 2014 Genome Research Ltd.
+
+ Author: Petr Danecek <pd3@sanger.ac.uk>
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE. */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include "rbuf.h"
+
+void debug_print(rbuf_t *rbuf, int *dat)
+{
+ int i;
+ for (i=-1; rbuf_next(rbuf, &i); ) fprintf(bcftools_stdout, " %2d", i);
+ fprintf(bcftools_stdout, "\n");
+ for (i=-1; rbuf_next(rbuf, &i); ) fprintf(bcftools_stdout, " %2d", dat[i]);
+ fprintf(bcftools_stdout, "\n");
+}
+
+int bcftools_test_rbuf_main(int argc, char **argv)
+{
+ int i, j, *dat = (int*)calloc(10,sizeof(int));
+ rbuf_t rbuf;
+ rbuf_init(&rbuf,10);
+
+ rbuf.f = 5; // force wrapping
+ for (i=0; i<9; i++)
+ {
+ j = rbuf_append(&rbuf);
+ dat[j] = i+1;
+ }
+ fprintf(bcftools_stdout, "Inserted 1-9 starting at offset 5:\n");
+ debug_print(&rbuf, dat);
+
+ i = rbuf_kth(&rbuf, 3);
+ fprintf(bcftools_stdout, "4th is %d\n", dat[i]);
+
+ fprintf(bcftools_stdout, "Deleting 1-2:\n");
+ rbuf_shift_n(&rbuf, 2);
+ debug_print(&rbuf, dat);
+
+ fprintf(bcftools_stdout, "Prepending 0-8:\n");
+ for (i=0; i<9; i++)
+ {
+ j = rbuf_prepend(&rbuf);
+ dat[j] = i;
+ }
+ debug_print(&rbuf, dat);
+
+ fprintf(bcftools_stdout, "Expanding:\n");
+ rbuf_expand0(&rbuf,int,rbuf.n+1,dat);
+ debug_print(&rbuf, dat);
+
+ free(dat);
+ return 0;
+}
+
--- /dev/null
+/* test/test-regidx.c -- Regions index test harness.
+
+ gcc -g -Wall -O0 -I. -I../htslib/ -L../htslib regidx.c -o test-regidx test-regidx.c -lhts
+
+ Copyright (C) 2014 Genome Research Ltd.
+
+ Author: Petr Danecek <pd3@sanger.ac.uk>
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in
+ all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ THE SOFTWARE.
+*/
+
+#include <stdarg.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <ctype.h>
+#include <string.h>
+#include <getopt.h>
+#include <htslib/kstring.h>
+#include <time.h>
+#include "regidx.h"
+
+static int verbose = 0;
+
+void debug(const char *format, ...)
+{
+ if ( verbose<2 ) return;
+ va_list ap;
+ va_start(ap, format);
+ vfprintf(stderr, format, ap);
+ va_end(ap);
+}
+void info(const char *format, ...)
+{
+ if ( verbose<1 ) return;
+ va_list ap;
+ va_start(ap, format);
+ vfprintf(stderr, format, ap);
+ va_end(ap);
+}
+void error(const char *format, ...)
+{
+ va_list ap;
+ va_start(ap, format);
+ vfprintf(stderr, format, ap);
+ va_end(ap);
+ exit(-1);
+}
+
+int custom_parse(const char *line, char **chr_beg, char **chr_end, uint32_t *beg, uint32_t *end, void *payload, void *usr)
+{
+ // Use the standard parser for CHROM,FROM,TO
+ int i, ret = regidx_parse_tab(line,chr_beg,chr_end,beg,end,NULL,NULL);
+ if ( ret!=0 ) return ret;
+
+ // Skip the fields that were parsed above
+ char *ss = (char*) line;
+ while ( *ss && isspace(*ss) ) ss++;
+ for (i=0; i<3; i++)
+ {
+ while ( *ss && !isspace(*ss) ) ss++;
+ if ( !*ss ) return -2; // wrong number of fields
+ while ( *ss && isspace(*ss) ) ss++;
+ }
+ if ( !*ss ) return -2;
+
+ // Parse the payload
+ char *se = ss;
+ while ( *se && !isspace(*se) ) se++;
+ char **dat = (char**) payload;
+ *dat = (char*) malloc(se-ss+1);
+ memcpy(*dat,ss,se-ss+1);
+ (*dat)[se-ss] = 0;
+ return 0;
+}
+void custom_free(void *payload)
+{
+ char **dat = (char**)payload;
+ free(*dat);
+}
+
+void test_sequential_access(void)
+{
+ // Init index with no file name, we will insert the regions manually
+ regidx_t *idx = regidx_init(NULL,custom_parse,custom_free,sizeof(char*),NULL);
+ if ( !idx ) error("init failed\n");
+
+ // Insert regions
+ kstring_t str = {0,0,0};
+ int i, n = 10;
+ for (i=0; i<n; i++)
+ {
+ int beg = 10*(i+1);
+ str.l = 0;
+ ksprintf(&str,"1\t%d\t%d\t%d",beg,beg,beg);
+ if ( regidx_insert(idx,str.s)!=0 ) error("insert failed: %s\n",str.s);
+ }
+
+ // Test
+ regitr_t *itr = regitr_init(idx);
+ i = 0;
+ while ( regitr_loop(itr) )
+ {
+ if ( itr->beg!=itr->end || itr->beg+1!=10*(i+1) ) error("listing failed, expected %d, found %d\n",10*(i+1),itr->beg+1);
+ str.l = 0;
+ ksprintf(&str,"%d",itr->beg+1);
+ if ( strcmp(regitr_payload(itr,char*),str.s) ) error("listing failed, expected payload \"%s\", found \"%s\"\n",str.s,regitr_payload(itr,char*));
+ i++;
+ }
+ if ( i!=n ) error("Expected %d regions, listed %d\n", n,i);
+ debug("ok: listed %d regions\n", n);
+
+ // Clean up
+ regitr_destroy(itr);
+ regidx_destroy(idx);
+ free(str.s);
+}
+
+void test_custom_payload(void)
+{
+ // Init index with no file name, we will insert the regions manually
+ regidx_t *idx = regidx_init(NULL,custom_parse,custom_free,sizeof(char*),NULL);
+ if ( !idx ) error("init failed\n");
+
+ // Insert regions
+ char *line;
+ line = "1 10000000 10000000 1:10000000-10000000"; if ( regidx_insert(idx,line)!=0 ) error("insert failed: %s\n", line);
+ line = "1 20000000 20000001 1:20000000-20000001"; if ( regidx_insert(idx,line)!=0 ) error("insert failed: %s\n", line);
+ line = "1 20000002 20000002 1:20000002-20000002"; if ( regidx_insert(idx,line)!=0 ) error("insert failed: %s\n", line);
+ line = "1 30000000 30000000 1:30000000-30000000"; if ( regidx_insert(idx,line)!=0 ) error("insert failed: %s\n", line);
+
+ // Test
+ regitr_t *itr = regitr_init(idx);
+ int from, to;
+
+ from = to = 10000000;
+ if ( !regidx_overlap(idx,"1",from-1,to-1,itr) ) error("query failed: 1:%d-%d\n",from,to);
+ if ( strcmp("1:10000000-10000000",regitr_payload(itr,char*)) ) error("query failed: 1:%d-%d vs %s\n", from,to,regitr_payload(itr,char*));
+ if ( !regidx_overlap(idx,"1",from-2,to-1,itr) ) error("query failed: 1:%d-%d\n",from-1,to);
+ if ( !regidx_overlap(idx,"1",from-2,to+3,itr) ) error("query failed: 1:%d-%d\n",from-1,to+2);
+ if ( regidx_overlap(idx,"1",from-2,to-2,itr) ) error("query failed: 1:%d-%d\n",from-1,to-1);
+
+ from = to = 20000000;
+ if ( !regidx_overlap(idx,"1",from-1,to-1,itr) ) error("query failed: 1:%d-%d\n",from,to);
+
+ from = to = 20000002;
+ if ( !regidx_overlap(idx,"1",from-1,to-1,itr) ) error("query failed: 1:%d-%d\n",from,to);
+
+ from = to = 30000000;
+ if ( !regidx_overlap(idx,"1",from-1,to-1,itr) ) error("query failed: 1:%d-%d\n",from,to);
+
+ // Clean up
+ regitr_destroy(itr);
+ regidx_destroy(idx);
+}
+
+void get_random_region(uint32_t min, uint32_t max, uint32_t *beg, uint32_t *end)
+{
+ long int b = random(), e = random();
+ *beg = min + (float)b * (max-min) / RAND_MAX;
+ *end = *beg + (float)e * (max-*beg) / RAND_MAX;
+}
+
+void test_random(int nregs, uint32_t min, uint32_t max)
+{
+ min--;
+ max--;
+
+ // Init index with no file name, we will insert the regions manually
+ regidx_t *idx = regidx_init(NULL,custom_parse,custom_free,sizeof(char*),NULL);
+ if ( !idx ) error("init failed\n");
+
+ // Test region
+ uint32_t beg,end;
+ get_random_region(min,max,&beg,&end);
+
+ // Insert regions
+ int i, nexp = 0;
+ kstring_t str = {0,0,0};
+ for (i=0; i<nregs; i++)
+ {
+ uint32_t b,e;
+ get_random_region(min,max,&b,&e);
+ str.l = 0;
+ ksprintf(&str,"1\t%"PRIu32"\t%"PRIu32"\t1:%"PRIu32"-%"PRIu32"",b+1,e+1,b+1,e+1);
+ if ( regidx_insert(idx,str.s)!=0 ) error("insert failed: %s\n", str.s);
+ if ( e>=beg && b<=end ) nexp++;
+ }
+
+ // Test
+ regitr_t *itr = regitr_init(idx);
+ int nhit = 0, ret = regidx_overlap(idx,"1",beg,end,itr);
+ if ( nexp && !ret ) error("query failed, expected %d overlap(s), found none: %d-%d\n", nexp,beg+1,end+1);
+ if ( !nexp && ret ) error("query failed, expected no overlaps, found some: %d-%d\n", beg+1,end+1);
+ while ( ret && regitr_overlap(itr) )
+ {
+ str.l = 0;
+ ksprintf(&str,"1:%"PRIu32"-%"PRIu32"",itr->beg+1,itr->end+1);
+ if ( strcmp(str.s,regitr_payload(itr,char*)) )
+ error("query failed, incorrect payload: %s vs %s (%d-%d)\n",str.s,regitr_payload(itr,char*),beg+1,end+1);
+ if ( itr->beg > end || itr->end < beg )
+ error("query failed, incorrect hit: %d-%d vs %d-%d, payload %s\n", beg+1,end+1,itr->beg+1,itr->end+1,regitr_payload(itr,char*));
+ nhit++;
+ }
+ if ( nexp!=nhit ) error("query failed, expected %d overlap(s), found %d: %d-%d\n",nexp,nhit,beg+1,end+1);
+ debug("ok: found %d overlaps\n", nexp);
+
+ // Clean up
+ regitr_destroy(itr);
+ regidx_destroy(idx);
+ free(str.s);
+}
+
+void create_line_bed(char *line, char *chr, int start, int end)
+{
+ sprintf(line,"%s\t%d\t%d\n",chr,start-1,end);
+}
+void create_line_tab(char *line, char *chr, int start, int end)
+{
+ sprintf(line,"%s\t%d\t%d\n",chr,start,end);
+}
+void create_line_reg(char *line, char *chr, int start, int end)
+{
+ sprintf(line,"%s:%d-%d\n",chr,start,end);
+}
+
+typedef void (*set_line_f)(char *line, char *chr, int start, int end);
+
+void test(set_line_f set_line, regidx_parse_f parse)
+{
+ regidx_t *idx = regidx_init(NULL,parse,NULL,0,NULL);
+ if ( !idx ) error("init failed\n");
+
+ char line[250], *chr = "1";
+ int i, n = 10, start, end, nhit;
+ for (i=1; i<n; i++)
+ {
+ start = end = 10*i;
+ set_line(line,chr,start,end);
+ debug("insert: %s", line);
+ if ( regidx_insert(idx,line)!=0 ) error("insert failed: %s\n", line);
+
+ start = end = 10*i + 1;
+ set_line(line,chr,start,end);
+ debug("insert: %s", line);
+ if ( regidx_insert(idx,line)!=0 ) error("insert failed: %s\n", line);
+ }
+
+ regitr_t *itr = regitr_init(idx);
+ for (i=1; i<n; i++)
+ {
+ // no hit
+ start = end = 10*i - 1;
+ if ( regidx_overlap(idx,chr,start-1,end-1,itr) ) error("query failed, there should be no hit: %s:%d-%d\n",chr,start,end);
+ debug("ok: no overlap found for %s:%d-%d\n",chr,start,end);
+
+
+ // one hit
+ start = end = 10*i;
+ if ( !regidx_overlap(idx,chr,start-1,end-1,itr) ) error("query failed, there should be a hit: %s:%d-%d\n",chr,start,end);
+ debug("ok: overlap(s) found for %s:%d-%d\n",chr,start,end);
+ nhit = 0;
+ while ( regitr_overlap(itr) )
+ {
+ if ( itr->beg > end-1 || itr->end < start-1 ) error("query failed, incorrect region: %d-%d for %d-%d\n",itr->beg+1,itr->end+1,start,end);
+ debug("\t %d-%d\n",itr->beg+1,itr->end+1);
+ nhit++;
+ }
+ if ( nhit!=1 ) error("query failed, expected one hit, found %d: %s:%d-%d\n",nhit,chr,start,end);
+
+
+ // one hit
+ start = end = 10*i+1;
+ if ( !regidx_overlap(idx,chr,start-1,end-1,itr) ) error("query failed, there should be a hit: %s:%d-%d\n",chr,start,end);
+ debug("ok: overlap(s) found for %s:%d-%d\n",chr,start,end);
+ nhit = 0;
+ while ( regitr_overlap(itr) )
+ {
+ if ( itr->beg > end-1 || itr->end < start-1 ) error("query failed, incorrect region: %d-%d for %d-%d\n",itr->beg+1,itr->end+1,start,end);
+ debug("\t %d-%d\n",itr->beg+1,itr->end+1);
+ nhit++;
+ }
+ if ( nhit!=1 ) error("query failed, expected one hit, found %d: %s:%d-%d\n",nhit,chr,start,end);
+
+
+ // two hits
+ start = 10*i; end = start+1;
+ if ( !regidx_overlap(idx,chr,start-1,end-1,itr) ) error("query failed, there should be a hit: %s:%d-%d\n",chr,start,end);
+ debug("ok: overlap(s) found for %s:%d-%d\n",chr,start,end);
+ nhit = 0;
+ while ( regitr_overlap(itr) )
+ {
+ if ( itr->beg > end-1 || itr->end < start-1 ) error("query failed, incorrect region: %d-%d for %d-%d\n",itr->beg+1,itr->end+1,start,end);
+ debug("\t %d-%d\n",itr->beg+1,itr->end+1);
+ nhit++;
+ }
+ if ( nhit!=2 ) error("query failed, expected two hits, found %d: %s:%d-%d\n",nhit,chr,start,end);
+
+ }
+ regitr_destroy(itr);
+ regidx_destroy(idx);
+}
+
+static void usage(void)
+{
+ fprintf(stderr, "Usage: test-regidx [OPTIONS]\n");
+ fprintf(stderr, "Options:\n");
+ fprintf(stderr, " -h, --help this help message\n");
+ fprintf(stderr, " -s, --seed <int> random seed\n");
+ fprintf(stderr, " -v, --verbose increase verbosity (may be given multiple times)\n");
+
+ exit(1);
+}
+
+int main(int argc, char **argv)
+{
+ static struct option loptions[] =
+ {
+ {"help",0,0,'h'},
+ {"verbose",0,0,'v'},
+ {"seed",1,0,'s'},
+ {0,0,0,0}
+ };
+ int c;
+ int seed = (int)time(NULL);
+ while ((c = getopt_long(argc, argv, "hvs:",loptions,NULL)) >= 0)
+ {
+ switch (c)
+ {
+ case 's': seed = atoi(optarg); break;
+ case 'v': verbose++; break;
+ default: usage(); break;
+ }
+ }
+
+ info("Testing sequential access\n");
+ test_sequential_access();
+
+ info("Testing TAB\n");
+ test(create_line_tab,regidx_parse_tab);
+
+ info("Testing REG\n");
+ test(create_line_reg,regidx_parse_reg);
+
+ info("Testing BED\n");
+ test(create_line_bed,regidx_parse_bed);
+
+ info("Testing custom payload\n");
+ test_custom_payload();
+
+ int i, ntest = 1000, nreg = 50;
+ srandom(seed);
+ info("%d randomized tests, %d regions per test. Random seed is %d\n", ntest,nreg,seed);
+ for (i=0; i<ntest; i++) test_random(nreg,1,1000);
+
+ return 0;
+}
+
+
--- /dev/null
+#include "bcftools.pysam.h"
+
+/* test/test-regidx.c -- Regions index test harness.
+
+ gcc -g -Wall -O0 -I. -I../htslib/ -L../htslib regidx.c -o test-regidx test-regidx.c -lhts
+
+ Copyright (C) 2014 Genome Research Ltd.
+
+ Author: Petr Danecek <pd3@sanger.ac.uk>
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in
+ all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ THE SOFTWARE.
+*/
+
+#include <stdarg.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <ctype.h>
+#include <string.h>
+#include <getopt.h>
+#include <htslib/kstring.h>
+#include <time.h>
+#include "regidx.h"
+
+static int verbose = 0;
+
+void debug(const char *format, ...)
+{
+ if ( verbose<2 ) return;
+ va_list ap;
+ va_start(ap, format);
+ vfprintf(bcftools_stderr, format, ap);
+ va_end(ap);
+}
+void info(const char *format, ...)
+{
+ if ( verbose<1 ) return;
+ va_list ap;
+ va_start(ap, format);
+ vfprintf(bcftools_stderr, format, ap);
+ va_end(ap);
+}
+void error(const char *format, ...)
+{
+ va_list ap;
+ va_start(ap, format);
+ vfprintf(bcftools_stderr, format, ap);
+ va_end(ap);
+ exit(-1);
+}
+
+int custom_parse(const char *line, char **chr_beg, char **chr_end, uint32_t *beg, uint32_t *end, void *payload, void *usr)
+{
+ // Use the standard parser for CHROM,FROM,TO
+ int i, ret = regidx_parse_tab(line,chr_beg,chr_end,beg,end,NULL,NULL);
+ if ( ret!=0 ) return ret;
+
+ // Skip the fields that were parsed above
+ char *ss = (char*) line;
+ while ( *ss && isspace(*ss) ) ss++;
+ for (i=0; i<3; i++)
+ {
+ while ( *ss && !isspace(*ss) ) ss++;
+ if ( !*ss ) return -2; // wrong number of fields
+ while ( *ss && isspace(*ss) ) ss++;
+ }
+ if ( !*ss ) return -2;
+
+ // Parse the payload
+ char *se = ss;
+ while ( *se && !isspace(*se) ) se++;
+ char **dat = (char**) payload;
+ *dat = (char*) malloc(se-ss+1);
+ memcpy(*dat,ss,se-ss+1);
+ (*dat)[se-ss] = 0;
+ return 0;
+}
+void custom_free(void *payload)
+{
+ char **dat = (char**)payload;
+ free(*dat);
+}
+
+void test_sequential_access(void)
+{
+ // Init index with no file name, we will insert the regions manually
+ regidx_t *idx = regidx_init(NULL,custom_parse,custom_free,sizeof(char*),NULL);
+ if ( !idx ) error("init failed\n");
+
+ // Insert regions
+ kstring_t str = {0,0,0};
+ int i, n = 10;
+ for (i=0; i<n; i++)
+ {
+ int beg = 10*(i+1);
+ str.l = 0;
+ ksprintf(&str,"1\t%d\t%d\t%d",beg,beg,beg);
+ if ( regidx_insert(idx,str.s)!=0 ) error("insert failed: %s\n",str.s);
+ }
+
+ // Test
+ regitr_t *itr = regitr_init(idx);
+ i = 0;
+ while ( regitr_loop(itr) )
+ {
+ if ( itr->beg!=itr->end || itr->beg+1!=10*(i+1) ) error("listing failed, expected %d, found %d\n",10*(i+1),itr->beg+1);
+ str.l = 0;
+ ksprintf(&str,"%d",itr->beg+1);
+ if ( strcmp(regitr_payload(itr,char*),str.s) ) error("listing failed, expected payload \"%s\", found \"%s\"\n",str.s,regitr_payload(itr,char*));
+ i++;
+ }
+ if ( i!=n ) error("Expected %d regions, listed %d\n", n,i);
+ debug("ok: listed %d regions\n", n);
+
+ // Clean up
+ regitr_destroy(itr);
+ regidx_destroy(idx);
+ free(str.s);
+}
+
+void test_custom_payload(void)
+{
+ // Init index with no file name, we will insert the regions manually
+ regidx_t *idx = regidx_init(NULL,custom_parse,custom_free,sizeof(char*),NULL);
+ if ( !idx ) error("init failed\n");
+
+ // Insert regions
+ char *line;
+ line = "1 10000000 10000000 1:10000000-10000000"; if ( regidx_insert(idx,line)!=0 ) error("insert failed: %s\n", line);
+ line = "1 20000000 20000001 1:20000000-20000001"; if ( regidx_insert(idx,line)!=0 ) error("insert failed: %s\n", line);
+ line = "1 20000002 20000002 1:20000002-20000002"; if ( regidx_insert(idx,line)!=0 ) error("insert failed: %s\n", line);
+ line = "1 30000000 30000000 1:30000000-30000000"; if ( regidx_insert(idx,line)!=0 ) error("insert failed: %s\n", line);
+
+ // Test
+ regitr_t *itr = regitr_init(idx);
+ int from, to;
+
+ from = to = 10000000;
+ if ( !regidx_overlap(idx,"1",from-1,to-1,itr) ) error("query failed: 1:%d-%d\n",from,to);
+ if ( strcmp("1:10000000-10000000",regitr_payload(itr,char*)) ) error("query failed: 1:%d-%d vs %s\n", from,to,regitr_payload(itr,char*));
+ if ( !regidx_overlap(idx,"1",from-2,to-1,itr) ) error("query failed: 1:%d-%d\n",from-1,to);
+ if ( !regidx_overlap(idx,"1",from-2,to+3,itr) ) error("query failed: 1:%d-%d\n",from-1,to+2);
+ if ( regidx_overlap(idx,"1",from-2,to-2,itr) ) error("query failed: 1:%d-%d\n",from-1,to-1);
+
+ from = to = 20000000;
+ if ( !regidx_overlap(idx,"1",from-1,to-1,itr) ) error("query failed: 1:%d-%d\n",from,to);
+
+ from = to = 20000002;
+ if ( !regidx_overlap(idx,"1",from-1,to-1,itr) ) error("query failed: 1:%d-%d\n",from,to);
+
+ from = to = 30000000;
+ if ( !regidx_overlap(idx,"1",from-1,to-1,itr) ) error("query failed: 1:%d-%d\n",from,to);
+
+ // Clean up
+ regitr_destroy(itr);
+ regidx_destroy(idx);
+}
+
+void get_random_region(uint32_t min, uint32_t max, uint32_t *beg, uint32_t *end)
+{
+ long int b = random(), e = random();
+ *beg = min + (float)b * (max-min) / RAND_MAX;
+ *end = *beg + (float)e * (max-*beg) / RAND_MAX;
+}
+
+void test_random(int nregs, uint32_t min, uint32_t max)
+{
+ min--;
+ max--;
+
+ // Init index with no file name, we will insert the regions manually
+ regidx_t *idx = regidx_init(NULL,custom_parse,custom_free,sizeof(char*),NULL);
+ if ( !idx ) error("init failed\n");
+
+ // Test region
+ uint32_t beg,end;
+ get_random_region(min,max,&beg,&end);
+
+ // Insert regions
+ int i, nexp = 0;
+ kstring_t str = {0,0,0};
+ for (i=0; i<nregs; i++)
+ {
+ uint32_t b,e;
+ get_random_region(min,max,&b,&e);
+ str.l = 0;
+ ksprintf(&str,"1\t%"PRIu32"\t%"PRIu32"\t1:%"PRIu32"-%"PRIu32"",b+1,e+1,b+1,e+1);
+ if ( regidx_insert(idx,str.s)!=0 ) error("insert failed: %s\n", str.s);
+ if ( e>=beg && b<=end ) nexp++;
+ }
+
+ // Test
+ regitr_t *itr = regitr_init(idx);
+ int nhit = 0, ret = regidx_overlap(idx,"1",beg,end,itr);
+ if ( nexp && !ret ) error("query failed, expected %d overlap(s), found none: %d-%d\n", nexp,beg+1,end+1);
+ if ( !nexp && ret ) error("query failed, expected no overlaps, found some: %d-%d\n", beg+1,end+1);
+ while ( ret && regitr_overlap(itr) )
+ {
+ str.l = 0;
+ ksprintf(&str,"1:%"PRIu32"-%"PRIu32"",itr->beg+1,itr->end+1);
+ if ( strcmp(str.s,regitr_payload(itr,char*)) )
+ error("query failed, incorrect payload: %s vs %s (%d-%d)\n",str.s,regitr_payload(itr,char*),beg+1,end+1);
+ if ( itr->beg > end || itr->end < beg )
+ error("query failed, incorrect hit: %d-%d vs %d-%d, payload %s\n", beg+1,end+1,itr->beg+1,itr->end+1,regitr_payload(itr,char*));
+ nhit++;
+ }
+ if ( nexp!=nhit ) error("query failed, expected %d overlap(s), found %d: %d-%d\n",nexp,nhit,beg+1,end+1);
+ debug("ok: found %d overlaps\n", nexp);
+
+ // Clean up
+ regitr_destroy(itr);
+ regidx_destroy(idx);
+ free(str.s);
+}
+
+void create_line_bed(char *line, char *chr, int start, int end)
+{
+ sprintf(line,"%s\t%d\t%d\n",chr,start-1,end);
+}
+void create_line_tab(char *line, char *chr, int start, int end)
+{
+ sprintf(line,"%s\t%d\t%d\n",chr,start,end);
+}
+void create_line_reg(char *line, char *chr, int start, int end)
+{
+ sprintf(line,"%s:%d-%d\n",chr,start,end);
+}
+
+typedef void (*set_line_f)(char *line, char *chr, int start, int end);
+
+void test(set_line_f set_line, regidx_parse_f parse)
+{
+ regidx_t *idx = regidx_init(NULL,parse,NULL,0,NULL);
+ if ( !idx ) error("init failed\n");
+
+ char line[250], *chr = "1";
+ int i, n = 10, start, end, nhit;
+ for (i=1; i<n; i++)
+ {
+ start = end = 10*i;
+ set_line(line,chr,start,end);
+ debug("insert: %s", line);
+ if ( regidx_insert(idx,line)!=0 ) error("insert failed: %s\n", line);
+
+ start = end = 10*i + 1;
+ set_line(line,chr,start,end);
+ debug("insert: %s", line);
+ if ( regidx_insert(idx,line)!=0 ) error("insert failed: %s\n", line);
+ }
+
+ regitr_t *itr = regitr_init(idx);
+ for (i=1; i<n; i++)
+ {
+ // no hit
+ start = end = 10*i - 1;
+ if ( regidx_overlap(idx,chr,start-1,end-1,itr) ) error("query failed, there should be no hit: %s:%d-%d\n",chr,start,end);
+ debug("ok: no overlap found for %s:%d-%d\n",chr,start,end);
+
+
+ // one hit
+ start = end = 10*i;
+ if ( !regidx_overlap(idx,chr,start-1,end-1,itr) ) error("query failed, there should be a hit: %s:%d-%d\n",chr,start,end);
+ debug("ok: overlap(s) found for %s:%d-%d\n",chr,start,end);
+ nhit = 0;
+ while ( regitr_overlap(itr) )
+ {
+ if ( itr->beg > end-1 || itr->end < start-1 ) error("query failed, incorrect region: %d-%d for %d-%d\n",itr->beg+1,itr->end+1,start,end);
+ debug("\t %d-%d\n",itr->beg+1,itr->end+1);
+ nhit++;
+ }
+ if ( nhit!=1 ) error("query failed, expected one hit, found %d: %s:%d-%d\n",nhit,chr,start,end);
+
+
+ // one hit
+ start = end = 10*i+1;
+ if ( !regidx_overlap(idx,chr,start-1,end-1,itr) ) error("query failed, there should be a hit: %s:%d-%d\n",chr,start,end);
+ debug("ok: overlap(s) found for %s:%d-%d\n",chr,start,end);
+ nhit = 0;
+ while ( regitr_overlap(itr) )
+ {
+ if ( itr->beg > end-1 || itr->end < start-1 ) error("query failed, incorrect region: %d-%d for %d-%d\n",itr->beg+1,itr->end+1,start,end);
+ debug("\t %d-%d\n",itr->beg+1,itr->end+1);
+ nhit++;
+ }
+ if ( nhit!=1 ) error("query failed, expected one hit, found %d: %s:%d-%d\n",nhit,chr,start,end);
+
+
+ // two hits
+ start = 10*i; end = start+1;
+ if ( !regidx_overlap(idx,chr,start-1,end-1,itr) ) error("query failed, there should be a hit: %s:%d-%d\n",chr,start,end);
+ debug("ok: overlap(s) found for %s:%d-%d\n",chr,start,end);
+ nhit = 0;
+ while ( regitr_overlap(itr) )
+ {
+ if ( itr->beg > end-1 || itr->end < start-1 ) error("query failed, incorrect region: %d-%d for %d-%d\n",itr->beg+1,itr->end+1,start,end);
+ debug("\t %d-%d\n",itr->beg+1,itr->end+1);
+ nhit++;
+ }
+ if ( nhit!=2 ) error("query failed, expected two hits, found %d: %s:%d-%d\n",nhit,chr,start,end);
+
+ }
+ regitr_destroy(itr);
+ regidx_destroy(idx);
+}
+
+static void usage(void)
+{
+ fprintf(bcftools_stderr, "Usage: test-regidx [OPTIONS]\n");
+ fprintf(bcftools_stderr, "Options:\n");
+ fprintf(bcftools_stderr, " -h, --help this help message\n");
+ fprintf(bcftools_stderr, " -s, --seed <int> random seed\n");
+ fprintf(bcftools_stderr, " -v, --verbose increase verbosity (may be given multiple times)\n");
+
+ exit(1);
+}
+
+int bcftools_test_regidx_main(int argc, char **argv)
+{
+ static struct option loptions[] =
+ {
+ {"help",0,0,'h'},
+ {"verbose",0,0,'v'},
+ {"seed",1,0,'s'},
+ {0,0,0,0}
+ };
+ int c;
+ int seed = (int)time(NULL);
+ while ((c = getopt_long(argc, argv, "hvs:",loptions,NULL)) >= 0)
+ {
+ switch (c)
+ {
+ case 's': seed = atoi(optarg); break;
+ case 'v': verbose++; break;
+ default: usage(); break;
+ }
+ }
+
+ info("Testing sequential access\n");
+ test_sequential_access();
+
+ info("Testing TAB\n");
+ test(create_line_tab,regidx_parse_tab);
+
+ info("Testing REG\n");
+ test(create_line_reg,regidx_parse_reg);
+
+ info("Testing BED\n");
+ test(create_line_bed,regidx_parse_bed);
+
+ info("Testing custom payload\n");
+ test_custom_payload();
+
+ int i, ntest = 1000, nreg = 50;
+ srandom(seed);
+ info("%d randomized tests, %d regions per test. Random seed is %d\n", ntest,nreg,seed);
+ for (i=0; i<ntest; i++) test_random(nreg,1,1000);
+
+ return 0;
+}
+
+
int nsmpl_annot;
int *sample_map, nsample_map, sample_is_file; // map[idst] -> isrc
+ uint8_t *src_smpl_pld, *dst_smpl_pld; // for Number=G format fields
int mtmpi, mtmpf, mtmps;
int mtmpi2, mtmpf2, mtmps2;
int mtmpi3, mtmpf3, mtmps3;
return bcf_update_id(args->hdr_out,line,rec->d.id);
return 0;
}
+static int vcf_setter_ref(args_t *args, bcf1_t *line, annot_col_t *col, void *data)
+{
+ bcf1_t *rec = (bcf1_t*) data;
+ if ( !strcmp(rec->d.allele[0],line->d.allele[0]) ) return 0; // no update necessary
+ const char **als = (const char**) malloc(sizeof(char*)*line->n_allele);
+ als[0] = rec->d.allele[0];
+ int i;
+ for (i=1; i<line->n_allele; i++) als[i] = line->d.allele[i];
+ bcf_update_alleles(args->hdr_out, line, als, line->n_allele);
+ free(als);
+ return 0;
+}
+static int vcf_setter_alt(args_t *args, bcf1_t *line, annot_col_t *col, void *data)
+{
+ bcf1_t *rec = (bcf1_t*) data;
+ int i;
+ if ( rec->n_allele==line->n_allele )
+ {
+ for (i=1; i<rec->n_allele; i++) if ( strcmp(rec->d.allele[i],line->d.allele[i]) ) break;
+ if ( i==rec->n_allele ) return 0; // no update necessary
+ }
+ const char **als = (const char**) malloc(sizeof(char*)*rec->n_allele);
+ als[0] = line->d.allele[0];
+ for (i=1; i<rec->n_allele; i++) als[i] = rec->d.allele[i];
+ bcf_update_alleles(args->hdr_out, line, als, rec->n_allele);
+ free(als);
+ return 0;
+}
static int setter_qual(args_t *args, bcf1_t *line, annot_col_t *col, void *data)
{
annot_line_t *tab = (annot_line_t*) data;
if ( str[0]=='1' && str[1]==0 ) return bcf_update_info_flag(args->hdr_out,line,col->hdr_key_dst,NULL,1);
if ( str[0]=='0' && str[1]==0 ) return bcf_update_info_flag(args->hdr_out,line,col->hdr_key_dst,NULL,0);
- error("Could not parse %s at %s:%d .. [%s]\n", bcf_seqname(args->hdr,line),line->pos+1,tab->cols[col->icol]);
+ error("Could not parse %s at %s:%d .. [%s]\n", col->hdr_key_src,bcf_seqname(args->hdr,line),line->pos+1,tab->cols[col->icol]);
return -1;
}
static int vcf_setter_info_flag(args_t *args, bcf1_t *line, annot_col_t *col, void *data)
int ndst = col->number==BCF_VL_A ? line->n_allele - 1 : line->n_allele;
int *map = vcmp_map_ARvalues(args->vcmp,ndst,nals,als,line->n_allele,line->d.allele);
- if ( !map ) error("REF alleles not compatible at %s:%d\n");
+ if ( !map ) error("REF alleles not compatible at %s:%d\n", bcf_seqname(args->hdr,line),line->pos+1);
// fill in any missing values in the target VCF (or all, if not present)
int ntmpi2 = bcf_get_info_float(args->hdr, line, col->hdr_key_dst, &args->tmpi2, &args->mtmpi2);
{
int val = strtol(str, &end, 10);
if ( end==str )
- error("Could not parse %s at %s:%d .. [%s]\n", bcf_seqname(args->hdr,line),line->pos+1,tab->cols[col->icol]);
+ error("Could not parse %s at %s:%d .. [%s]\n", col->hdr_key_src,bcf_seqname(args->hdr,line),line->pos+1,tab->cols[col->icol]);
ntmpi++;
hts_expand(int32_t,ntmpi,args->mtmpi,args->tmpi);
args->tmpi[ntmpi-1] = val;
int ndst = col->number==BCF_VL_A ? line->n_allele - 1 : line->n_allele;
int *map = vcmp_map_ARvalues(args->vcmp,ndst,nals,als,line->n_allele,line->d.allele);
- if ( !map ) error("REF alleles not compatible at %s:%d\n");
+ if ( !map ) error("REF alleles not compatible at %s:%d\n", bcf_seqname(args->hdr,line),line->pos+1);
// fill in any missing values in the target VCF (or all, if not present)
int ntmpf2 = bcf_get_info_float(args->hdr, line, col->hdr_key_dst, &args->tmpf2, &args->mtmpf2);
{
double val = strtod(str, &end);
if ( end==str )
- error("Could not parse %s at %s:%d .. [%s]\n", bcf_seqname(args->hdr,line),line->pos+1,tab->cols[col->icol]);
+ error("Could not parse %s at %s:%d .. [%s]\n", col->hdr_key_src,bcf_seqname(args->hdr,line),line->pos+1,tab->cols[col->icol]);
ntmpf++;
hts_expand(float,ntmpf,args->mtmpf,args->tmpf);
args->tmpf[ntmpf-1] = val;
int ndst = col->number==BCF_VL_A ? line->n_allele - 1 : line->n_allele;
int *map = vcmp_map_ARvalues(args->vcmp,ndst,nals,als,line->n_allele,line->d.allele);
- if ( !map ) error("REF alleles not compatible at %s:%d\n");
+ if ( !map ) error("REF alleles not compatible at %s:%d\n", bcf_seqname(args->hdr,line),line->pos+1);
// fill in any missing values in the target VCF (or all, if not present)
int i, empty = 0, nstr, mstr = args->tmpks.m;
return core_setter_format_str(args,line,col,args->tmpp);
}
+static int determine_ploidy(int nals, int *vals, int nvals1, uint8_t *smpl, int nsmpl)
+{
+ int i, j, ndip = nals*(nals+1)/2, max_ploidy = 0;
+ for (i=0; i<nsmpl; i++)
+ {
+ int *ptr = vals + i*nvals1;
+ int has_value = 0;
+ for (j=0; j<nvals1; j++)
+ {
+ if ( ptr[j]==bcf_int32_vector_end ) break;
+ if ( ptr[j]!=bcf_int32_missing ) has_value = 1;
+ }
+ if ( has_value )
+ {
+ if ( j==ndip )
+ {
+ smpl[i] = 2;
+ max_ploidy = 2;
+ }
+ else if ( j==nals )
+ {
+ smpl[i] = 1;
+ if ( !max_ploidy ) max_ploidy = 1;
+ }
+ else return -j;
+ }
+ else smpl[i] = 0;
+ }
+ return max_ploidy;
+}
static int vcf_setter_format_int(args_t *args, bcf1_t *line, annot_col_t *col, void *data)
{
bcf1_t *rec = (bcf1_t*) data;
int nsrc = bcf_get_format_int32(args->files->readers[1].header,rec,col->hdr_key_src,&args->tmpi,&args->mtmpi);
if ( nsrc==-3 ) return 0; // the tag is not present
if ( nsrc<=0 ) return 1; // error
- return core_setter_format_int(args,line,col,args->tmpi,nsrc/bcf_hdr_nsamples(args->files->readers[1].header));
+ int nsmpl_src = bcf_hdr_nsamples(args->files->readers[1].header);
+ int nsrc1 = nsrc / nsmpl_src;
+ if ( col->number!=BCF_VL_G && col->number!=BCF_VL_R && col->number!=BCF_VL_A )
+ return core_setter_format_int(args,line,col,args->tmpi,nsrc1);
+
+ // create mapping from src to dst genotypes, haploid and diploid version
+ int nmap_hap = col->number==BCF_VL_G || col->number==BCF_VL_R ? rec->n_allele : rec->n_allele - 1;
+ int *map_hap = vcmp_map_ARvalues(args->vcmp,nmap_hap,line->n_allele,line->d.allele,rec->n_allele,rec->d.allele);
+ if ( !map_hap ) error("REF alleles not compatible at %s:%d\n", bcf_seqname(args->hdr,line),line->pos+1);
+
+ int i, j;
+ if ( rec->n_allele==line->n_allele )
+ {
+ // alleles unchanged?
+ for (i=0; i<rec->n_allele; i++) if ( map_hap[i]!=i ) break;
+ if ( i==rec->n_allele )
+ return core_setter_format_int(args,line,col,args->tmpi,nsrc1);
+ }
+
+ int nsmpl_dst = rec->n_sample;
+ int ndst = bcf_get_format_int32(args->hdr,line,col->hdr_key_dst,&args->tmpi2,&args->mtmpi2);
+ int ndst1 = ndst / nsmpl_dst;
+ if ( ndst <= 0 )
+ {
+ if ( col->replace==REPLACE_NON_MISSING ) return 0; // overwrite only if present
+ if ( col->number==BCF_VL_G )
+ ndst1 = line->n_allele*(line->n_allele+1)/2;
+ else
+ ndst1 = col->number==BCF_VL_A ? line->n_allele - 1 : line->n_allele;
+ hts_expand(int, ndst1*nsmpl_dst, args->mtmpi2, args->tmpi2);
+ for (i=0; i<nsmpl_dst; i++)
+ {
+ int32_t *dst = args->tmpi2 + i*ndst1;
+ for (j=0; j<ndst1; j++) dst[j] = bcf_int32_missing;
+ }
+ }
+
+ int nmap_dip = 0, *map_dip = NULL;
+ if ( col->number==BCF_VL_G )
+ {
+ map_dip = vcmp_map_dipGvalues(args->vcmp, &nmap_dip);
+ if ( !args->src_smpl_pld )
+ {
+ args->src_smpl_pld = (uint8_t*) malloc(nsmpl_src);
+ args->dst_smpl_pld = (uint8_t*) malloc(nsmpl_dst);
+ }
+ int pld_src = determine_ploidy(rec->n_allele, args->tmpi, nsrc1, args->src_smpl_pld, nsmpl_src);
+ if ( pld_src<0 )
+ error("Unexpected number of %s values (%d) for %d alleles at %s:%d\n", col->hdr_key_src,-pld_src, rec->n_allele, bcf_seqname(bcf_sr_get_header(args->files,1),rec),rec->pos+1);
+ int pld_dst = determine_ploidy(line->n_allele, args->tmpi2, ndst1, args->dst_smpl_pld, nsmpl_dst);
+ if ( pld_dst<0 )
+ error("Unexpected number of %s values (%d) for %d alleles at %s:%d\n", col->hdr_key_src,-pld_dst, line->n_allele, bcf_seqname(args->hdr,line),line->pos+1);
+
+ int ndst1_new = pld_dst==1 ? line->n_allele : line->n_allele*(line->n_allele+1)/2;
+ if ( ndst1_new != ndst1 )
+ {
+ if ( ndst1 ) error("todo: %s ndst1!=ndst .. %d %d at %s:%d\n",col->hdr_key_src,ndst1_new,ndst1,bcf_seqname(args->hdr,line),line->pos+1);
+ ndst1 = ndst1_new;
+ hts_expand(int32_t, ndst1*nsmpl_dst, args->mtmpi2, args->tmpi2);
+ }
+ }
+ else if ( !ndst1 )
+ {
+ ndst1 = col->number==BCF_VL_A ? line->n_allele - 1 : line->n_allele;
+ hts_expand(int32_t, ndst1*nsmpl_dst, args->mtmpi2, args->tmpi2);
+ }
+
+ for (i=0; i<nsmpl_dst; i++)
+ {
+ int ii = args->sample_map ? args->sample_map[i] : i;
+ int32_t *ptr_src = args->tmpi + i*nsrc1;
+ int32_t *ptr_dst = args->tmpi2 + ii*ndst1;
+
+ if ( col->number==BCF_VL_G )
+ {
+ if ( args->src_smpl_pld[ii] > 0 && args->dst_smpl_pld[i] > 0 && args->src_smpl_pld[ii]!=args->dst_smpl_pld[i] )
+ error("Sample ploidy differs at %s:%d\n", bcf_seqname(args->hdr,line),line->pos+1);
+ if ( !args->dst_smpl_pld[i] )
+ for (j=0; j<ndst1; j++) ptr_dst[j] = bcf_int32_missing;
+ }
+ if ( col->number!=BCF_VL_G || args->src_smpl_pld[i]==1 )
+ {
+ for (j=0; j<nmap_hap; j++)
+ {
+ int k = map_hap[j];
+ if ( k>=0 ) ptr_dst[k] = ptr_src[j];
+ }
+ if ( col->number==BCF_VL_G )
+ for (j=line->n_allele; j<ndst1; j++) ptr_dst[j] = bcf_int32_vector_end;
+ }
+ else
+ {
+ for (j=0; j<nmap_dip; j++)
+ {
+ int k = map_dip[j];
+ if ( k>=0 ) ptr_dst[k] = ptr_src[j];
+ }
+ }
+ }
+ return bcf_update_format_int32(args->hdr_out,line,col->hdr_key_dst,args->tmpi2,nsmpl_dst*ndst1);
}
static int vcf_setter_format_real(args_t *args, bcf1_t *line, annot_col_t *col, void *data)
{
+
bcf1_t *rec = (bcf1_t*) data;
int nsrc = bcf_get_format_float(args->files->readers[1].header,rec,col->hdr_key_src,&args->tmpf,&args->mtmpf);
if ( nsrc==-3 ) return 0; // the tag is not present
if ( nsrc<=0 ) return 1; // error
- return core_setter_format_real(args,line,col,args->tmpf,nsrc/bcf_hdr_nsamples(args->files->readers[1].header));
+ int nsmpl_src = bcf_hdr_nsamples(args->files->readers[1].header);
+ int nsrc1 = nsrc / nsmpl_src;
+ if ( col->number!=BCF_VL_G && col->number!=BCF_VL_R && col->number!=BCF_VL_A )
+ return core_setter_format_real(args,line,col,args->tmpf,nsrc1);
+
+ // create mapping from src to dst genotypes, haploid and diploid version
+ int nmap_hap = col->number==BCF_VL_G || col->number==BCF_VL_R ? rec->n_allele : rec->n_allele - 1;
+ int *map_hap = vcmp_map_ARvalues(args->vcmp,nmap_hap,line->n_allele,line->d.allele,rec->n_allele,rec->d.allele);
+ if ( !map_hap ) error("REF alleles not compatible at %s:%d\n", bcf_seqname(args->hdr,line),line->pos+1);
+
+ int i, j;
+ if ( rec->n_allele==line->n_allele )
+ {
+ // alleles unchanged?
+ for (i=0; i<rec->n_allele; i++) if ( map_hap[i]!=i ) break;
+ if ( i==rec->n_allele )
+ return core_setter_format_real(args,line,col,args->tmpf,nsrc1);
+ }
+
+ int nsmpl_dst = rec->n_sample;
+ int ndst = bcf_get_format_float(args->hdr,line,col->hdr_key_dst,&args->tmpf2,&args->mtmpf2);
+ int ndst1 = ndst / nsmpl_dst;
+ if ( ndst <= 0 )
+ {
+ if ( col->replace==REPLACE_NON_MISSING ) return 0; // overwrite only if present
+ if ( col->number==BCF_VL_G )
+ ndst1 = line->n_allele*(line->n_allele+1)/2;
+ else
+ ndst1 = col->number==BCF_VL_A ? line->n_allele - 1 : line->n_allele;
+ hts_expand(float, ndst1*nsmpl_dst, args->mtmpf2, args->tmpf2);
+ for (i=0; i<nsmpl_dst; i++)
+ {
+ float *dst = args->tmpf2 + i*ndst1;
+ for (j=0; j<ndst1; j++) bcf_float_set_missing(dst[j]);
+ }
+ }
+
+ int nmap_dip = 0, *map_dip = NULL;
+ if ( col->number==BCF_VL_G )
+ {
+ map_dip = vcmp_map_dipGvalues(args->vcmp, &nmap_dip);
+ if ( !args->src_smpl_pld )
+ {
+ args->src_smpl_pld = (uint8_t*) malloc(nsmpl_src);
+ args->dst_smpl_pld = (uint8_t*) malloc(nsmpl_dst);
+ }
+ int pld_src = determine_ploidy(rec->n_allele, args->tmpi, nsrc1, args->src_smpl_pld, nsmpl_src);
+ if ( pld_src<0 )
+ error("Unexpected number of %s values (%d) for %d alleles at %s:%d\n", col->hdr_key_src,-pld_src, rec->n_allele, bcf_seqname(bcf_sr_get_header(args->files,1),rec),rec->pos+1);
+ int pld_dst = determine_ploidy(line->n_allele, args->tmpi2, ndst1, args->dst_smpl_pld, nsmpl_dst);
+ if ( pld_dst<0 )
+ error("Unexpected number of %s values (%d) for %d alleles at %s:%d\n", col->hdr_key_src,-pld_dst, line->n_allele, bcf_seqname(args->hdr,line),line->pos+1);
+
+ int ndst1_new = pld_dst==1 ? line->n_allele : line->n_allele*(line->n_allele+1)/2;
+ if ( ndst1_new != ndst1 )
+ {
+ if ( ndst1 ) error("todo: %s ndst1!=ndst .. %d %d at %s:%d\n",col->hdr_key_src,ndst1_new,ndst1,bcf_seqname(args->hdr,line),line->pos+1);
+ ndst1 = ndst1_new;
+ hts_expand(float, ndst1*nsmpl_dst, args->mtmpf2, args->tmpf2);
+ }
+ }
+ else if ( !ndst1 )
+ {
+ ndst1 = col->number==BCF_VL_A ? line->n_allele - 1 : line->n_allele;
+ hts_expand(float, ndst1*nsmpl_dst, args->mtmpf2, args->tmpf2);
+ }
+
+ for (i=0; i<nsmpl_dst; i++)
+ {
+ int ii = args->sample_map ? args->sample_map[i] : i;
+ float *ptr_src = args->tmpf + i*nsrc1;
+ float *ptr_dst = args->tmpf2 + ii*ndst1;
+
+ if ( col->number==BCF_VL_G )
+ {
+ if ( args->src_smpl_pld[ii] > 0 && args->dst_smpl_pld[i] > 0 && args->src_smpl_pld[ii]!=args->dst_smpl_pld[i] )
+ error("Sample ploidy differs at %s:%d\n", bcf_seqname(args->hdr,line),line->pos+1);
+ if ( !args->dst_smpl_pld[i] )
+ for (j=0; j<ndst1; j++) bcf_float_set_missing(ptr_dst[j]);
+ }
+ if ( col->number!=BCF_VL_G || args->src_smpl_pld[i]==1 )
+ {
+ for (j=0; j<nmap_hap; j++)
+ {
+ int k = map_hap[j];
+ if ( k>=0 )
+ {
+ if ( bcf_float_is_missing(ptr_src[j]) ) bcf_float_set_missing(ptr_dst[k]);
+ else if ( bcf_float_is_vector_end(ptr_src[j]) ) bcf_float_set_vector_end(ptr_dst[k]);
+ else ptr_dst[k] = ptr_src[j];
+ }
+ }
+ if ( col->number==BCF_VL_G )
+ for (j=line->n_allele; j<ndst1; j++) bcf_float_set_vector_end(ptr_dst[j]);
+ }
+ else
+ {
+ for (j=0; j<nmap_dip; j++)
+ {
+ int k = map_dip[j];
+ if ( k>=0 )
+ {
+ if ( bcf_float_is_missing(ptr_src[j]) ) bcf_float_set_missing(ptr_dst[k]);
+ else if ( bcf_float_is_vector_end(ptr_src[j]) ) bcf_float_set_vector_end(ptr_dst[k]);
+ else ptr_dst[k] = ptr_src[j];
+ }
+ }
+ }
+ }
+ return bcf_update_format_float(args->hdr_out,line,col->hdr_key_dst,args->tmpf2,nsmpl_dst*ndst1);
+
}
static int vcf_setter_format_str(args_t *args, bcf1_t *line, annot_col_t *col, void *data)
else if ( !strcasecmp("POS",str.s) ) args->from_idx = icol;
else if ( !strcasecmp("FROM",str.s) ) args->from_idx = icol;
else if ( !strcasecmp("TO",str.s) ) args->to_idx = icol;
- else if ( !strcasecmp("REF",str.s) ) args->ref_idx = icol;
- else if ( !strcasecmp("ALT",str.s) ) args->alt_idx = icol;
+ else if ( !strcasecmp("REF",str.s) )
+ {
+ if ( args->tgts_is_vcf )
+ {
+ args->ncols++; args->cols = (annot_col_t*) realloc(args->cols,sizeof(annot_col_t)*args->ncols);
+ annot_col_t *col = &args->cols[args->ncols-1];
+ col->setter = vcf_setter_ref;
+ col->hdr_key_src = strdup(str.s);
+ col->hdr_key_dst = strdup(str.s);
+ }
+ else args->ref_idx = icol;
+ }
+ else if ( !strcasecmp("ALT",str.s) )
+ {
+ if ( args->tgts_is_vcf )
+ {
+ args->ncols++; args->cols = (annot_col_t*) realloc(args->cols,sizeof(annot_col_t)*args->ncols);
+ annot_col_t *col = &args->cols[args->ncols-1];
+ col->setter = vcf_setter_alt;
+ col->hdr_key_src = strdup(str.s);
+ col->hdr_key_dst = strdup(str.s);
+ }
+ else args->alt_idx = icol;
+ }
else if ( !strcasecmp("ID",str.s) )
{
if ( replace==REPLACE_NON_MISSING ) error("Apologies, the -ID feature has not been implemented yet.\n");
case BCF_HT_STR: col->setter = vcf_setter_format_str; has_fmt_str = 1; break;
default: error("The type of %s not recognised (%d)\n", str.s,bcf_hdr_id2type(args->hdr_out,BCF_HL_FMT,hdr_id));
}
+ hdr_id = bcf_hdr_id2int(tgts_hdr, BCF_DT_ID, hrec->vals[k]);
+ col->number = bcf_hdr_id2length(tgts_hdr,BCF_HL_FMT,hdr_id);
}
}
else if ( !strncasecmp("FORMAT/",str.s, 7) || !strncasecmp("FMT/",str.s,4) )
if ( args->tgts_is_vcf )
{
bcf_hrec_t *hrec = bcf_hdr_get_hrec(args->files->readers[1].header, BCF_HL_FMT, "ID", key_src, NULL);
+ if ( !hrec ) error("No such annotation \"%s\" in %s\n", key_src,args->targets_fname);
tmp.l = 0;
bcf_hrec_format_rename(hrec, key_dst, &tmp);
bcf_hdr_append(args->hdr_out, tmp.s);
case BCF_HT_STR: col->setter = args->tgts_is_vcf ? vcf_setter_format_str : setter_format_str; has_fmt_str = 1; break;
default: error("The type of %s not recognised (%d)\n", str.s,bcf_hdr_id2type(args->hdr_out,BCF_HL_FMT,hdr_id));
}
+ if ( args->tgts_is_vcf )
+ {
+ bcf_hdr_t *tgts_hdr = args->files->readers[1].header;
+ hdr_id = bcf_hdr_id2int(tgts_hdr, BCF_DT_ID, col->hdr_key_src);
+ col->number = bcf_hdr_id2length(tgts_hdr,BCF_HL_FMT,hdr_id);
+ }
}
else
{
free(args->tmpp2);
free(args->tmpi3);
free(args->tmpf3);
+ free(args->src_smpl_pld);
+ free(args->dst_smpl_pld);
if ( args->set_ids )
convert_destroy(args->set_ids);
if ( args->filter )
int nsmpl_annot;
int *sample_map, nsample_map, sample_is_file; // map[idst] -> isrc
+ uint8_t *src_smpl_pld, *dst_smpl_pld; // for Number=G format fields
int mtmpi, mtmpf, mtmps;
int mtmpi2, mtmpf2, mtmps2;
int mtmpi3, mtmpf3, mtmps3;
return bcf_update_id(args->hdr_out,line,rec->d.id);
return 0;
}
+static int vcf_setter_ref(args_t *args, bcf1_t *line, annot_col_t *col, void *data)
+{
+ bcf1_t *rec = (bcf1_t*) data;
+ if ( !strcmp(rec->d.allele[0],line->d.allele[0]) ) return 0; // no update necessary
+ const char **als = (const char**) malloc(sizeof(char*)*line->n_allele);
+ als[0] = rec->d.allele[0];
+ int i;
+ for (i=1; i<line->n_allele; i++) als[i] = line->d.allele[i];
+ bcf_update_alleles(args->hdr_out, line, als, line->n_allele);
+ free(als);
+ return 0;
+}
+static int vcf_setter_alt(args_t *args, bcf1_t *line, annot_col_t *col, void *data)
+{
+ bcf1_t *rec = (bcf1_t*) data;
+ int i;
+ if ( rec->n_allele==line->n_allele )
+ {
+ for (i=1; i<rec->n_allele; i++) if ( strcmp(rec->d.allele[i],line->d.allele[i]) ) break;
+ if ( i==rec->n_allele ) return 0; // no update necessary
+ }
+ const char **als = (const char**) malloc(sizeof(char*)*rec->n_allele);
+ als[0] = line->d.allele[0];
+ for (i=1; i<rec->n_allele; i++) als[i] = rec->d.allele[i];
+ bcf_update_alleles(args->hdr_out, line, als, rec->n_allele);
+ free(als);
+ return 0;
+}
static int setter_qual(args_t *args, bcf1_t *line, annot_col_t *col, void *data)
{
annot_line_t *tab = (annot_line_t*) data;
if ( str[0]=='1' && str[1]==0 ) return bcf_update_info_flag(args->hdr_out,line,col->hdr_key_dst,NULL,1);
if ( str[0]=='0' && str[1]==0 ) return bcf_update_info_flag(args->hdr_out,line,col->hdr_key_dst,NULL,0);
- error("Could not parse %s at %s:%d .. [%s]\n", bcf_seqname(args->hdr,line),line->pos+1,tab->cols[col->icol]);
+ error("Could not parse %s at %s:%d .. [%s]\n", col->hdr_key_src,bcf_seqname(args->hdr,line),line->pos+1,tab->cols[col->icol]);
return -1;
}
static int vcf_setter_info_flag(args_t *args, bcf1_t *line, annot_col_t *col, void *data)
int ndst = col->number==BCF_VL_A ? line->n_allele - 1 : line->n_allele;
int *map = vcmp_map_ARvalues(args->vcmp,ndst,nals,als,line->n_allele,line->d.allele);
- if ( !map ) error("REF alleles not compatible at %s:%d\n");
+ if ( !map ) error("REF alleles not compatible at %s:%d\n", bcf_seqname(args->hdr,line),line->pos+1);
// fill in any missing values in the target VCF (or all, if not present)
int ntmpi2 = bcf_get_info_float(args->hdr, line, col->hdr_key_dst, &args->tmpi2, &args->mtmpi2);
{
int val = strtol(str, &end, 10);
if ( end==str )
- error("Could not parse %s at %s:%d .. [%s]\n", bcf_seqname(args->hdr,line),line->pos+1,tab->cols[col->icol]);
+ error("Could not parse %s at %s:%d .. [%s]\n", col->hdr_key_src,bcf_seqname(args->hdr,line),line->pos+1,tab->cols[col->icol]);
ntmpi++;
hts_expand(int32_t,ntmpi,args->mtmpi,args->tmpi);
args->tmpi[ntmpi-1] = val;
int ndst = col->number==BCF_VL_A ? line->n_allele - 1 : line->n_allele;
int *map = vcmp_map_ARvalues(args->vcmp,ndst,nals,als,line->n_allele,line->d.allele);
- if ( !map ) error("REF alleles not compatible at %s:%d\n");
+ if ( !map ) error("REF alleles not compatible at %s:%d\n", bcf_seqname(args->hdr,line),line->pos+1);
// fill in any missing values in the target VCF (or all, if not present)
int ntmpf2 = bcf_get_info_float(args->hdr, line, col->hdr_key_dst, &args->tmpf2, &args->mtmpf2);
{
double val = strtod(str, &end);
if ( end==str )
- error("Could not parse %s at %s:%d .. [%s]\n", bcf_seqname(args->hdr,line),line->pos+1,tab->cols[col->icol]);
+ error("Could not parse %s at %s:%d .. [%s]\n", col->hdr_key_src,bcf_seqname(args->hdr,line),line->pos+1,tab->cols[col->icol]);
ntmpf++;
hts_expand(float,ntmpf,args->mtmpf,args->tmpf);
args->tmpf[ntmpf-1] = val;
int ndst = col->number==BCF_VL_A ? line->n_allele - 1 : line->n_allele;
int *map = vcmp_map_ARvalues(args->vcmp,ndst,nals,als,line->n_allele,line->d.allele);
- if ( !map ) error("REF alleles not compatible at %s:%d\n");
+ if ( !map ) error("REF alleles not compatible at %s:%d\n", bcf_seqname(args->hdr,line),line->pos+1);
// fill in any missing values in the target VCF (or all, if not present)
int i, empty = 0, nstr, mstr = args->tmpks.m;
return core_setter_format_str(args,line,col,args->tmpp);
}
+static int determine_ploidy(int nals, int *vals, int nvals1, uint8_t *smpl, int nsmpl)
+{
+ int i, j, ndip = nals*(nals+1)/2, max_ploidy = 0;
+ for (i=0; i<nsmpl; i++)
+ {
+ int *ptr = vals + i*nvals1;
+ int has_value = 0;
+ for (j=0; j<nvals1; j++)
+ {
+ if ( ptr[j]==bcf_int32_vector_end ) break;
+ if ( ptr[j]!=bcf_int32_missing ) has_value = 1;
+ }
+ if ( has_value )
+ {
+ if ( j==ndip )
+ {
+ smpl[i] = 2;
+ max_ploidy = 2;
+ }
+ else if ( j==nals )
+ {
+ smpl[i] = 1;
+ if ( !max_ploidy ) max_ploidy = 1;
+ }
+ else return -j;
+ }
+ else smpl[i] = 0;
+ }
+ return max_ploidy;
+}
static int vcf_setter_format_int(args_t *args, bcf1_t *line, annot_col_t *col, void *data)
{
bcf1_t *rec = (bcf1_t*) data;
int nsrc = bcf_get_format_int32(args->files->readers[1].header,rec,col->hdr_key_src,&args->tmpi,&args->mtmpi);
if ( nsrc==-3 ) return 0; // the tag is not present
if ( nsrc<=0 ) return 1; // error
- return core_setter_format_int(args,line,col,args->tmpi,nsrc/bcf_hdr_nsamples(args->files->readers[1].header));
+ int nsmpl_src = bcf_hdr_nsamples(args->files->readers[1].header);
+ int nsrc1 = nsrc / nsmpl_src;
+ if ( col->number!=BCF_VL_G && col->number!=BCF_VL_R && col->number!=BCF_VL_A )
+ return core_setter_format_int(args,line,col,args->tmpi,nsrc1);
+
+ // create mapping from src to dst genotypes, haploid and diploid version
+ int nmap_hap = col->number==BCF_VL_G || col->number==BCF_VL_R ? rec->n_allele : rec->n_allele - 1;
+ int *map_hap = vcmp_map_ARvalues(args->vcmp,nmap_hap,line->n_allele,line->d.allele,rec->n_allele,rec->d.allele);
+ if ( !map_hap ) error("REF alleles not compatible at %s:%d\n", bcf_seqname(args->hdr,line),line->pos+1);
+
+ int i, j;
+ if ( rec->n_allele==line->n_allele )
+ {
+ // alleles unchanged?
+ for (i=0; i<rec->n_allele; i++) if ( map_hap[i]!=i ) break;
+ if ( i==rec->n_allele )
+ return core_setter_format_int(args,line,col,args->tmpi,nsrc1);
+ }
+
+ int nsmpl_dst = rec->n_sample;
+ int ndst = bcf_get_format_int32(args->hdr,line,col->hdr_key_dst,&args->tmpi2,&args->mtmpi2);
+ int ndst1 = ndst / nsmpl_dst;
+ if ( ndst <= 0 )
+ {
+ if ( col->replace==REPLACE_NON_MISSING ) return 0; // overwrite only if present
+ if ( col->number==BCF_VL_G )
+ ndst1 = line->n_allele*(line->n_allele+1)/2;
+ else
+ ndst1 = col->number==BCF_VL_A ? line->n_allele - 1 : line->n_allele;
+ hts_expand(int, ndst1*nsmpl_dst, args->mtmpi2, args->tmpi2);
+ for (i=0; i<nsmpl_dst; i++)
+ {
+ int32_t *dst = args->tmpi2 + i*ndst1;
+ for (j=0; j<ndst1; j++) dst[j] = bcf_int32_missing;
+ }
+ }
+
+ int nmap_dip = 0, *map_dip = NULL;
+ if ( col->number==BCF_VL_G )
+ {
+ map_dip = vcmp_map_dipGvalues(args->vcmp, &nmap_dip);
+ if ( !args->src_smpl_pld )
+ {
+ args->src_smpl_pld = (uint8_t*) malloc(nsmpl_src);
+ args->dst_smpl_pld = (uint8_t*) malloc(nsmpl_dst);
+ }
+ int pld_src = determine_ploidy(rec->n_allele, args->tmpi, nsrc1, args->src_smpl_pld, nsmpl_src);
+ if ( pld_src<0 )
+ error("Unexpected number of %s values (%d) for %d alleles at %s:%d\n", col->hdr_key_src,-pld_src, rec->n_allele, bcf_seqname(bcf_sr_get_header(args->files,1),rec),rec->pos+1);
+ int pld_dst = determine_ploidy(line->n_allele, args->tmpi2, ndst1, args->dst_smpl_pld, nsmpl_dst);
+ if ( pld_dst<0 )
+ error("Unexpected number of %s values (%d) for %d alleles at %s:%d\n", col->hdr_key_src,-pld_dst, line->n_allele, bcf_seqname(args->hdr,line),line->pos+1);
+
+ int ndst1_new = pld_dst==1 ? line->n_allele : line->n_allele*(line->n_allele+1)/2;
+ if ( ndst1_new != ndst1 )
+ {
+ if ( ndst1 ) error("todo: %s ndst1!=ndst .. %d %d at %s:%d\n",col->hdr_key_src,ndst1_new,ndst1,bcf_seqname(args->hdr,line),line->pos+1);
+ ndst1 = ndst1_new;
+ hts_expand(int32_t, ndst1*nsmpl_dst, args->mtmpi2, args->tmpi2);
+ }
+ }
+ else if ( !ndst1 )
+ {
+ ndst1 = col->number==BCF_VL_A ? line->n_allele - 1 : line->n_allele;
+ hts_expand(int32_t, ndst1*nsmpl_dst, args->mtmpi2, args->tmpi2);
+ }
+
+ for (i=0; i<nsmpl_dst; i++)
+ {
+ int ii = args->sample_map ? args->sample_map[i] : i;
+ int32_t *ptr_src = args->tmpi + i*nsrc1;
+ int32_t *ptr_dst = args->tmpi2 + ii*ndst1;
+
+ if ( col->number==BCF_VL_G )
+ {
+ if ( args->src_smpl_pld[ii] > 0 && args->dst_smpl_pld[i] > 0 && args->src_smpl_pld[ii]!=args->dst_smpl_pld[i] )
+ error("Sample ploidy differs at %s:%d\n", bcf_seqname(args->hdr,line),line->pos+1);
+ if ( !args->dst_smpl_pld[i] )
+ for (j=0; j<ndst1; j++) ptr_dst[j] = bcf_int32_missing;
+ }
+ if ( col->number!=BCF_VL_G || args->src_smpl_pld[i]==1 )
+ {
+ for (j=0; j<nmap_hap; j++)
+ {
+ int k = map_hap[j];
+ if ( k>=0 ) ptr_dst[k] = ptr_src[j];
+ }
+ if ( col->number==BCF_VL_G )
+ for (j=line->n_allele; j<ndst1; j++) ptr_dst[j] = bcf_int32_vector_end;
+ }
+ else
+ {
+ for (j=0; j<nmap_dip; j++)
+ {
+ int k = map_dip[j];
+ if ( k>=0 ) ptr_dst[k] = ptr_src[j];
+ }
+ }
+ }
+ return bcf_update_format_int32(args->hdr_out,line,col->hdr_key_dst,args->tmpi2,nsmpl_dst*ndst1);
}
static int vcf_setter_format_real(args_t *args, bcf1_t *line, annot_col_t *col, void *data)
{
+
bcf1_t *rec = (bcf1_t*) data;
int nsrc = bcf_get_format_float(args->files->readers[1].header,rec,col->hdr_key_src,&args->tmpf,&args->mtmpf);
if ( nsrc==-3 ) return 0; // the tag is not present
if ( nsrc<=0 ) return 1; // error
- return core_setter_format_real(args,line,col,args->tmpf,nsrc/bcf_hdr_nsamples(args->files->readers[1].header));
+ int nsmpl_src = bcf_hdr_nsamples(args->files->readers[1].header);
+ int nsrc1 = nsrc / nsmpl_src;
+ if ( col->number!=BCF_VL_G && col->number!=BCF_VL_R && col->number!=BCF_VL_A )
+ return core_setter_format_real(args,line,col,args->tmpf,nsrc1);
+
+ // create mapping from src to dst genotypes, haploid and diploid version
+ int nmap_hap = col->number==BCF_VL_G || col->number==BCF_VL_R ? rec->n_allele : rec->n_allele - 1;
+ int *map_hap = vcmp_map_ARvalues(args->vcmp,nmap_hap,line->n_allele,line->d.allele,rec->n_allele,rec->d.allele);
+ if ( !map_hap ) error("REF alleles not compatible at %s:%d\n", bcf_seqname(args->hdr,line),line->pos+1);
+
+ int i, j;
+ if ( rec->n_allele==line->n_allele )
+ {
+ // alleles unchanged?
+ for (i=0; i<rec->n_allele; i++) if ( map_hap[i]!=i ) break;
+ if ( i==rec->n_allele )
+ return core_setter_format_real(args,line,col,args->tmpf,nsrc1);
+ }
+
+ int nsmpl_dst = rec->n_sample;
+ int ndst = bcf_get_format_float(args->hdr,line,col->hdr_key_dst,&args->tmpf2,&args->mtmpf2);
+ int ndst1 = ndst / nsmpl_dst;
+ if ( ndst <= 0 )
+ {
+ if ( col->replace==REPLACE_NON_MISSING ) return 0; // overwrite only if present
+ if ( col->number==BCF_VL_G )
+ ndst1 = line->n_allele*(line->n_allele+1)/2;
+ else
+ ndst1 = col->number==BCF_VL_A ? line->n_allele - 1 : line->n_allele;
+ hts_expand(float, ndst1*nsmpl_dst, args->mtmpf2, args->tmpf2);
+ for (i=0; i<nsmpl_dst; i++)
+ {
+ float *dst = args->tmpf2 + i*ndst1;
+ for (j=0; j<ndst1; j++) bcf_float_set_missing(dst[j]);
+ }
+ }
+
+ int nmap_dip = 0, *map_dip = NULL;
+ if ( col->number==BCF_VL_G )
+ {
+ map_dip = vcmp_map_dipGvalues(args->vcmp, &nmap_dip);
+ if ( !args->src_smpl_pld )
+ {
+ args->src_smpl_pld = (uint8_t*) malloc(nsmpl_src);
+ args->dst_smpl_pld = (uint8_t*) malloc(nsmpl_dst);
+ }
+ int pld_src = determine_ploidy(rec->n_allele, args->tmpi, nsrc1, args->src_smpl_pld, nsmpl_src);
+ if ( pld_src<0 )
+ error("Unexpected number of %s values (%d) for %d alleles at %s:%d\n", col->hdr_key_src,-pld_src, rec->n_allele, bcf_seqname(bcf_sr_get_header(args->files,1),rec),rec->pos+1);
+ int pld_dst = determine_ploidy(line->n_allele, args->tmpi2, ndst1, args->dst_smpl_pld, nsmpl_dst);
+ if ( pld_dst<0 )
+ error("Unexpected number of %s values (%d) for %d alleles at %s:%d\n", col->hdr_key_src,-pld_dst, line->n_allele, bcf_seqname(args->hdr,line),line->pos+1);
+
+ int ndst1_new = pld_dst==1 ? line->n_allele : line->n_allele*(line->n_allele+1)/2;
+ if ( ndst1_new != ndst1 )
+ {
+ if ( ndst1 ) error("todo: %s ndst1!=ndst .. %d %d at %s:%d\n",col->hdr_key_src,ndst1_new,ndst1,bcf_seqname(args->hdr,line),line->pos+1);
+ ndst1 = ndst1_new;
+ hts_expand(float, ndst1*nsmpl_dst, args->mtmpf2, args->tmpf2);
+ }
+ }
+ else if ( !ndst1 )
+ {
+ ndst1 = col->number==BCF_VL_A ? line->n_allele - 1 : line->n_allele;
+ hts_expand(float, ndst1*nsmpl_dst, args->mtmpf2, args->tmpf2);
+ }
+
+ for (i=0; i<nsmpl_dst; i++)
+ {
+ int ii = args->sample_map ? args->sample_map[i] : i;
+ float *ptr_src = args->tmpf + i*nsrc1;
+ float *ptr_dst = args->tmpf2 + ii*ndst1;
+
+ if ( col->number==BCF_VL_G )
+ {
+ if ( args->src_smpl_pld[ii] > 0 && args->dst_smpl_pld[i] > 0 && args->src_smpl_pld[ii]!=args->dst_smpl_pld[i] )
+ error("Sample ploidy differs at %s:%d\n", bcf_seqname(args->hdr,line),line->pos+1);
+ if ( !args->dst_smpl_pld[i] )
+ for (j=0; j<ndst1; j++) bcf_float_set_missing(ptr_dst[j]);
+ }
+ if ( col->number!=BCF_VL_G || args->src_smpl_pld[i]==1 )
+ {
+ for (j=0; j<nmap_hap; j++)
+ {
+ int k = map_hap[j];
+ if ( k>=0 )
+ {
+ if ( bcf_float_is_missing(ptr_src[j]) ) bcf_float_set_missing(ptr_dst[k]);
+ else if ( bcf_float_is_vector_end(ptr_src[j]) ) bcf_float_set_vector_end(ptr_dst[k]);
+ else ptr_dst[k] = ptr_src[j];
+ }
+ }
+ if ( col->number==BCF_VL_G )
+ for (j=line->n_allele; j<ndst1; j++) bcf_float_set_vector_end(ptr_dst[j]);
+ }
+ else
+ {
+ for (j=0; j<nmap_dip; j++)
+ {
+ int k = map_dip[j];
+ if ( k>=0 )
+ {
+ if ( bcf_float_is_missing(ptr_src[j]) ) bcf_float_set_missing(ptr_dst[k]);
+ else if ( bcf_float_is_vector_end(ptr_src[j]) ) bcf_float_set_vector_end(ptr_dst[k]);
+ else ptr_dst[k] = ptr_src[j];
+ }
+ }
+ }
+ }
+ return bcf_update_format_float(args->hdr_out,line,col->hdr_key_dst,args->tmpf2,nsmpl_dst*ndst1);
+
}
static int vcf_setter_format_str(args_t *args, bcf1_t *line, annot_col_t *col, void *data)
else if ( !strcasecmp("POS",str.s) ) args->from_idx = icol;
else if ( !strcasecmp("FROM",str.s) ) args->from_idx = icol;
else if ( !strcasecmp("TO",str.s) ) args->to_idx = icol;
- else if ( !strcasecmp("REF",str.s) ) args->ref_idx = icol;
- else if ( !strcasecmp("ALT",str.s) ) args->alt_idx = icol;
+ else if ( !strcasecmp("REF",str.s) )
+ {
+ if ( args->tgts_is_vcf )
+ {
+ args->ncols++; args->cols = (annot_col_t*) realloc(args->cols,sizeof(annot_col_t)*args->ncols);
+ annot_col_t *col = &args->cols[args->ncols-1];
+ col->setter = vcf_setter_ref;
+ col->hdr_key_src = strdup(str.s);
+ col->hdr_key_dst = strdup(str.s);
+ }
+ else args->ref_idx = icol;
+ }
+ else if ( !strcasecmp("ALT",str.s) )
+ {
+ if ( args->tgts_is_vcf )
+ {
+ args->ncols++; args->cols = (annot_col_t*) realloc(args->cols,sizeof(annot_col_t)*args->ncols);
+ annot_col_t *col = &args->cols[args->ncols-1];
+ col->setter = vcf_setter_alt;
+ col->hdr_key_src = strdup(str.s);
+ col->hdr_key_dst = strdup(str.s);
+ }
+ else args->alt_idx = icol;
+ }
else if ( !strcasecmp("ID",str.s) )
{
if ( replace==REPLACE_NON_MISSING ) error("Apologies, the -ID feature has not been implemented yet.\n");
case BCF_HT_STR: col->setter = vcf_setter_format_str; has_fmt_str = 1; break;
default: error("The type of %s not recognised (%d)\n", str.s,bcf_hdr_id2type(args->hdr_out,BCF_HL_FMT,hdr_id));
}
+ hdr_id = bcf_hdr_id2int(tgts_hdr, BCF_DT_ID, hrec->vals[k]);
+ col->number = bcf_hdr_id2length(tgts_hdr,BCF_HL_FMT,hdr_id);
}
}
else if ( !strncasecmp("FORMAT/",str.s, 7) || !strncasecmp("FMT/",str.s,4) )
if ( args->tgts_is_vcf )
{
bcf_hrec_t *hrec = bcf_hdr_get_hrec(args->files->readers[1].header, BCF_HL_FMT, "ID", key_src, NULL);
+ if ( !hrec ) error("No such annotation \"%s\" in %s\n", key_src,args->targets_fname);
tmp.l = 0;
bcf_hrec_format_rename(hrec, key_dst, &tmp);
bcf_hdr_append(args->hdr_out, tmp.s);
case BCF_HT_STR: col->setter = args->tgts_is_vcf ? vcf_setter_format_str : setter_format_str; has_fmt_str = 1; break;
default: error("The type of %s not recognised (%d)\n", str.s,bcf_hdr_id2type(args->hdr_out,BCF_HL_FMT,hdr_id));
}
+ if ( args->tgts_is_vcf )
+ {
+ bcf_hdr_t *tgts_hdr = args->files->readers[1].header;
+ hdr_id = bcf_hdr_id2int(tgts_hdr, BCF_DT_ID, col->hdr_key_src);
+ col->number = bcf_hdr_id2length(tgts_hdr,BCF_HL_FMT,hdr_id);
+ }
}
else
{
free(args->tmpp2);
free(args->tmpi3);
free(args->tmpf3);
+ free(args->src_smpl_pld);
+ free(args->dst_smpl_pld);
if ( args->set_ids )
convert_destroy(args->set_ids);
if ( args->filter )
#include <string.h>
#include <errno.h>
#include <math.h>
+#include <inttypes.h>
#include <htslib/vcf.h>
#include <htslib/synced_bcf_reader.h>
#include <htslib/kseq.h>
args->seen_seq[chr_id] = 1;
prev_chr_id = chr_id;
- if ( vcf_write_line(args->out_fh, &fp->line)!=0 ) error("Failed to write %d bytes\n", fp->line.l);
+ if ( vcf_write_line(args->out_fh, &fp->line)!=0 ) error("Failed to write %"PRIu64" bytes\n", (uint64_t)fp->line.l);
}
}
else
}
if ( print_header )
{
- if ( bgzf_write(bgzf_out,tmp->s,tmp->l) != tmp->l ) error("Failed to write %d bytes\n", tmp->l);
+ if ( bgzf_write(bgzf_out,tmp->s,tmp->l) != tmp->l ) error("Failed to write %"PRIu64" bytes\n", (uint64_t)tmp->l);
tmp->l = 0;
}
return nskip;
{
if ( bgzf_write(bgzf_out, "BCF\2\2", 5) !=5 ) error("Failed to write %d bytes to %s\n", 5,args->output_fname);
if ( bgzf_write(bgzf_out, &tmp.l, 4) !=4 ) error("Failed to write %d bytes to %s\n", 4,args->output_fname);
- if ( bgzf_write(bgzf_out, tmp.s, tmp.l) != tmp.l) error("Failed to write %d bytes to %s\n", tmp.l,args->output_fname);
+ if ( bgzf_write(bgzf_out, tmp.s, tmp.l) != tmp.l) error("Failed to write %"PRIu64" bytes to %s\n", (uint64_t)tmp.l,args->output_fname);
}
nskip = fp->block_offset;
}
nblock = unpackInt16(buf+16) + 1;
assert( nblock <= page_size && nblock >= nheader );
nread += bgzf_raw_read(fp, buf+nheader, nblock - nheader);
- if ( nread!=nblock ) error("Could not read %d bytes: %s\n",nblock,args->fnames[i]);
+ if ( nread!=nblock ) error("Could not read %"PRIu64" bytes: %s\n",(uint64_t)nblock,args->fnames[i]);
if ( nread==neof && !memcmp(buf,eof,neof) ) continue;
nwr = bgzf_raw_write(bgzf_out, buf, nread);
- if ( nwr != nread ) error("Write failed, wrote %d instead of %d bytes.\n", nwr,(int)nread);
+ if ( nwr != nread ) error("Write failed, wrote %"PRId64" instead of %d bytes.\n", (int64_t)nwr,(int)nread);
}
if (hts_close(hts_fp)) error("Close failed: %s\n",args->fnames[i]);
}
#include <string.h>
#include <errno.h>
#include <math.h>
+#include <inttypes.h>
#include <htslib/vcf.h>
#include <htslib/synced_bcf_reader.h>
#include <htslib/kseq.h>
args->seen_seq[chr_id] = 1;
prev_chr_id = chr_id;
- if ( vcf_write_line(args->out_fh, &fp->line)!=0 ) error("Failed to write %d bytes\n", fp->line.l);
+ if ( vcf_write_line(args->out_fh, &fp->line)!=0 ) error("Failed to write %"PRIu64" bytes\n", (uint64_t)fp->line.l);
}
}
else
}
if ( print_header )
{
- if ( bgzf_write(bgzf_out,tmp->s,tmp->l) != tmp->l ) error("Failed to write %d bytes\n", tmp->l);
+ if ( bgzf_write(bgzf_out,tmp->s,tmp->l) != tmp->l ) error("Failed to write %"PRIu64" bytes\n", (uint64_t)tmp->l);
tmp->l = 0;
}
return nskip;
{
if ( bgzf_write(bgzf_out, "BCF\2\2", 5) !=5 ) error("Failed to write %d bytes to %s\n", 5,args->output_fname);
if ( bgzf_write(bgzf_out, &tmp.l, 4) !=4 ) error("Failed to write %d bytes to %s\n", 4,args->output_fname);
- if ( bgzf_write(bgzf_out, tmp.s, tmp.l) != tmp.l) error("Failed to write %d bytes to %s\n", tmp.l,args->output_fname);
+ if ( bgzf_write(bgzf_out, tmp.s, tmp.l) != tmp.l) error("Failed to write %"PRIu64" bytes to %s\n", (uint64_t)tmp.l,args->output_fname);
}
nskip = fp->block_offset;
}
nblock = unpackInt16(buf+16) + 1;
assert( nblock <= page_size && nblock >= nheader );
nread += bgzf_raw_read(fp, buf+nheader, nblock - nheader);
- if ( nread!=nblock ) error("Could not read %d bytes: %s\n",nblock,args->fnames[i]);
+ if ( nread!=nblock ) error("Could not read %"PRIu64" bytes: %s\n",(uint64_t)nblock,args->fnames[i]);
if ( nread==neof && !memcmp(buf,eof,neof) ) continue;
nwr = bgzf_raw_write(bgzf_out, buf, nread);
- if ( nwr != nread ) error("Write failed, wrote %d instead of %d bytes.\n", nwr,(int)nread);
+ if ( nwr != nread ) error("Write failed, wrote %"PRId64" instead of %d bytes.\n", (int64_t)nwr,(int)nread);
}
if (hts_close(hts_fp)) error("Close failed: %s\n",args->fnames[i]);
}
{
// missing GT
gts[0] = bcf_gt_missing;
- gts[1] = bcf_int32_vector_end;
+ gts[1] = bcf_gt_missing;
args->n.missing++;
return 0;
}
{
int pass = filter_test(args->filter, line, NULL);
if ( args->filter_logic & FLT_EXCLUDE ) pass = pass ? 0 : 1;
- if ( !pass ) continue;
- }
-
- if (!bcf_has_filter(hdr,line,"PASS"))
- {
- bcf_write(out_fh,hdr,line);
- continue;
+ if ( !pass )
+ {
+ bcf_write(out_fh,hdr,line);
+ continue;
+ }
}
// check if alleles compatible with being a gVCF record
{
// missing GT
gts[0] = bcf_gt_missing;
- gts[1] = bcf_int32_vector_end;
+ gts[1] = bcf_gt_missing;
args->n.missing++;
return 0;
}
{
int pass = filter_test(args->filter, line, NULL);
if ( args->filter_logic & FLT_EXCLUDE ) pass = pass ? 0 : 1;
- if ( !pass ) continue;
- }
-
- if (!bcf_has_filter(hdr,line,"PASS"))
- {
- bcf_write(out_fh,hdr,line);
- continue;
+ if ( !pass )
+ {
+ bcf_write(out_fh,hdr,line);
+ continue;
+ }
}
// check if alleles compatible with being a gVCF record
// check for truncated files, allow only with -f
BGZF *fp = bgzf_open(fname, "r");
if ( !fp ) error("index: failed to open %s\n", fname);
+ if ( bgzf_compression(fp)!=2 ) error("index: the file is not BGZF compressed, cannot index: %s\n", fname);
if ( bgzf_check_EOF(fp)!=1 ) error("index: the input is probably truncated, use -f to index anyway: %s\n", fname);
if ( bgzf_close(fp)!=0 ) error("index: close failed: %s\n", fname);
}
// check for truncated files, allow only with -f
BGZF *fp = bgzf_open(fname, "r");
if ( !fp ) error("index: failed to open %s\n", fname);
+ if ( bgzf_compression(fp)!=2 ) error("index: the file is not BGZF compressed, cannot index: %s\n", fname);
if ( bgzf_check_EOF(fp)!=1 ) error("index: the input is probably truncated, use -f to index anyway: %s\n", fname);
if ( bgzf_close(fp)!=0 ) error("index: close failed: %s\n", fname);
}
if ( !args->targets_list )
{
if ( argc-optind<2 ) error("Expected multiple files or the --targets option\n");
- if ( !args->isec_op ) error("Expected two file names or one of the options --complement, --nfiles or --targets\n");
+ if ( !args->isec_op ) error("One of the options --complement, --nfiles or --targets must be given with more than two files\n");
}
args->files->require_index = 1;
while (optind<argc)
if ( !args->targets_list )
{
if ( argc-optind<2 ) error("Expected multiple files or the --targets option\n");
- if ( !args->isec_op ) error("Expected two file names or one of the options --complement, --nfiles or --targets\n");
+ if ( !args->isec_op ) error("One of the options --complement, --nfiles or --targets must be given with more than two files\n");
}
args->files->require_index = 1;
while (optind<argc)
free(tmp);
}
+static inline int max_used_gt_ploidy(bcf_fmt_t *fmt, int nsmpl)
+{
+ int i,j, max_ploidy = 0;
+
+ #define BRANCH(type_t, vector_end) { \
+ type_t *ptr = (type_t*) fmt->p; \
+ for (i=0; i<nsmpl; i++) \
+ { \
+ for (j=0; j<fmt->n; j++) \
+ if ( ptr[j]==vector_end ) break; \
+ if ( j==fmt->n ) \
+ { \
+ /* all fields were used */ \
+ if ( max_ploidy < j ) max_ploidy = j; \
+ break; \
+ } \
+ if ( max_ploidy < j ) max_ploidy = j; \
+ ptr += fmt->n; \
+ } \
+ }
+ switch (fmt->type)
+ {
+ case BCF_BT_INT8: BRANCH(int8_t, bcf_int8_vector_end); break;
+ case BCF_BT_INT16: BRANCH(int16_t, bcf_int16_vector_end); break;
+ case BCF_BT_INT32: BRANCH(int32_t, bcf_int32_vector_end); break;
+ default: error("Unexpected case: %d\n", fmt->type);
+ }
+ #undef BRANCH
+ return max_ploidy;
+}
+
void merge_GT(args_t *args, bcf_fmt_t **fmt_map, bcf1_t *out)
{
bcf_srs_t *files = args->files;
int nsize = 0, msize = sizeof(int32_t);
for (i=0; i<files->nreaders; i++)
{
- if ( !fmt_map[i] ) continue;
- if ( fmt_map[i]->n > nsize ) nsize = fmt_map[i]->n;
+ bcf_fmt_t *fmt = fmt_map[i];
+ if ( !fmt ) continue;
+ int pld = max_used_gt_ploidy(fmt, bcf_hdr_nsamples(bcf_sr_get_header(args->files,i)));
+ if ( nsize < pld ) nsize = pld;
}
+ if ( nsize==0 ) nsize = 1;
if ( ma->ntmp_arr < nsamples*nsize*msize )
{
int32_t *tmp = (int32_t *) ma->tmp_arr + ismpl*nsize;
int irec = ma->buf[i].cur;
- int j, k;
+ int j,k;
if ( !fmt_ori )
{
// missing values: assume maximum ploidy
{
// if all fields are missing then n==1 is valid
if ( fmt_ori->n!=1 && fmt_ori->n != nals_ori*(nals_ori+1)/2 && fmt_map[i]->n != nals_ori )
- error("Incorrect number of %s fields (%d) at %s:%d, cannot merge.\n", key,fmt_ori->n,bcf_seqname(args->out_hdr,out),out->pos+1);
+ error("Incorrect number of FORMAT/%s values at %s:%d, cannot merge. The tag is defined as Number=G, but found\n"
+ "%d values and %d alleles. See also http://samtools.github.io/bcftools/howtos/FAQ.html#incorrect-nfields\n",
+ key,bcf_seqname(args->out_hdr,out),out->pos+1,fmt_ori->n,nals_ori);
}
else if ( length==BCF_VL_A )
{
if ( fmt_ori->n!=1 && fmt_ori->n != nals_ori-1 )
- error("Incorrect number of %s fields (%d) at %s:%d, cannot merge.\n", key,fmt_ori->n,bcf_seqname(args->out_hdr,out),out->pos+1);
+ error("Incorrect number of FORMAT/%s values at %s:%d, cannot merge. The tag is defined as Number=A, but found\n"
+ "%d values and %d alleles. See also http://samtools.github.io/bcftools/howtos/FAQ.html#incorrect-nfields\n",
+ key,bcf_seqname(args->out_hdr,out),out->pos+1,fmt_ori->n,nals_ori);
}
else if ( length==BCF_VL_R )
{
if ( fmt_ori->n!=1 && fmt_ori->n != nals_ori )
- error("Incorrect number of %s fields (%d) at %s:%d, cannot merge.\n", key,fmt_ori->n,bcf_seqname(args->out_hdr,out),out->pos+1);
+ error("Incorrect number of FORMAT/%s values at %s:%d, cannot merge. The tag is defined as Number=R, but found\n"
+ "%d values and %d alleles. See also http://samtools.github.io/bcftools/howtos/FAQ.html#incorrect-nfields\n",
+ key,bcf_seqname(args->out_hdr,out),out->pos+1,fmt_ori->n,nals_ori);
}
}
}
// normalize alleles
maux->als = merge_alleles(line->d.allele, line->n_allele, buf->rec[j].map, maux->als, &maux->nals, &maux->mals);
- if ( !maux->als ) error("Failed to merge alleles at %s:%d in %s\n",bcf_seqname(args->out_hdr,line),line->pos+1,reader->fname);
+ if ( !maux->als ) error("Failed to merge alleles at %s:%d in %s\n",maux->chr,line->pos+1,reader->fname);
hts_expand0(int, maux->nals, maux->ncnt, maux->cnt);
for (k=1; k<line->n_allele; k++)
maux->cnt[ buf->rec[j].map[k] ]++; // how many times an allele appears in the files
free(tmp);
}
+static inline int max_used_gt_ploidy(bcf_fmt_t *fmt, int nsmpl)
+{
+ int i,j, max_ploidy = 0;
+
+ #define BRANCH(type_t, vector_end) { \
+ type_t *ptr = (type_t*) fmt->p; \
+ for (i=0; i<nsmpl; i++) \
+ { \
+ for (j=0; j<fmt->n; j++) \
+ if ( ptr[j]==vector_end ) break; \
+ if ( j==fmt->n ) \
+ { \
+ /* all fields were used */ \
+ if ( max_ploidy < j ) max_ploidy = j; \
+ break; \
+ } \
+ if ( max_ploidy < j ) max_ploidy = j; \
+ ptr += fmt->n; \
+ } \
+ }
+ switch (fmt->type)
+ {
+ case BCF_BT_INT8: BRANCH(int8_t, bcf_int8_vector_end); break;
+ case BCF_BT_INT16: BRANCH(int16_t, bcf_int16_vector_end); break;
+ case BCF_BT_INT32: BRANCH(int32_t, bcf_int32_vector_end); break;
+ default: error("Unexpected case: %d\n", fmt->type);
+ }
+ #undef BRANCH
+ return max_ploidy;
+}
+
void merge_GT(args_t *args, bcf_fmt_t **fmt_map, bcf1_t *out)
{
bcf_srs_t *files = args->files;
int nsize = 0, msize = sizeof(int32_t);
for (i=0; i<files->nreaders; i++)
{
- if ( !fmt_map[i] ) continue;
- if ( fmt_map[i]->n > nsize ) nsize = fmt_map[i]->n;
+ bcf_fmt_t *fmt = fmt_map[i];
+ if ( !fmt ) continue;
+ int pld = max_used_gt_ploidy(fmt, bcf_hdr_nsamples(bcf_sr_get_header(args->files,i)));
+ if ( nsize < pld ) nsize = pld;
}
+ if ( nsize==0 ) nsize = 1;
if ( ma->ntmp_arr < nsamples*nsize*msize )
{
int32_t *tmp = (int32_t *) ma->tmp_arr + ismpl*nsize;
int irec = ma->buf[i].cur;
- int j, k;
+ int j,k;
if ( !fmt_ori )
{
// missing values: assume maximum ploidy
{
// if all fields are missing then n==1 is valid
if ( fmt_ori->n!=1 && fmt_ori->n != nals_ori*(nals_ori+1)/2 && fmt_map[i]->n != nals_ori )
- error("Incorrect number of %s fields (%d) at %s:%d, cannot merge.\n", key,fmt_ori->n,bcf_seqname(args->out_hdr,out),out->pos+1);
+ error("Incorrect number of FORMAT/%s values at %s:%d, cannot merge. The tag is defined as Number=G, but found\n"
+ "%d values and %d alleles. See also http://samtools.github.io/bcftools/howtos/FAQ.html#incorrect-nfields\n",
+ key,bcf_seqname(args->out_hdr,out),out->pos+1,fmt_ori->n,nals_ori);
}
else if ( length==BCF_VL_A )
{
if ( fmt_ori->n!=1 && fmt_ori->n != nals_ori-1 )
- error("Incorrect number of %s fields (%d) at %s:%d, cannot merge.\n", key,fmt_ori->n,bcf_seqname(args->out_hdr,out),out->pos+1);
+ error("Incorrect number of FORMAT/%s values at %s:%d, cannot merge. The tag is defined as Number=A, but found\n"
+ "%d values and %d alleles. See also http://samtools.github.io/bcftools/howtos/FAQ.html#incorrect-nfields\n",
+ key,bcf_seqname(args->out_hdr,out),out->pos+1,fmt_ori->n,nals_ori);
}
else if ( length==BCF_VL_R )
{
if ( fmt_ori->n!=1 && fmt_ori->n != nals_ori )
- error("Incorrect number of %s fields (%d) at %s:%d, cannot merge.\n", key,fmt_ori->n,bcf_seqname(args->out_hdr,out),out->pos+1);
+ error("Incorrect number of FORMAT/%s values at %s:%d, cannot merge. The tag is defined as Number=R, but found\n"
+ "%d values and %d alleles. See also http://samtools.github.io/bcftools/howtos/FAQ.html#incorrect-nfields\n",
+ key,bcf_seqname(args->out_hdr,out),out->pos+1,fmt_ori->n,nals_ori);
}
}
}
// normalize alleles
maux->als = merge_alleles(line->d.allele, line->n_allele, buf->rec[j].map, maux->als, &maux->nals, &maux->mals);
- if ( !maux->als ) error("Failed to merge alleles at %s:%d in %s\n",bcf_seqname(args->out_hdr,line),line->pos+1,reader->fname);
+ if ( !maux->als ) error("Failed to merge alleles at %s:%d in %s\n",maux->chr,line->pos+1,reader->fname);
hts_expand0(int, maux->nals, maux->ncnt, maux->cnt);
for (k=1; k<line->n_allele; k++)
maux->cnt[ buf->rec[j].map[k] ]++; // how many times an allele appears in the files
char **als;
int mmaps, nals, mals;
uint8_t *tmp_arr1, *tmp_arr2, *diploid;
- int ntmp_arr1, ntmp_arr2;
+ int32_t *int32_arr;
+ int ntmp_arr1, ntmp_arr2, nint32_arr;
kstring_t *tmp_str;
kstring_t *tmp_als, tmp_als_str;
int ntmp_als;
bcf_update_alleles_str(args->hdr,line,args->tmp_als_str.s);
args->nchanged++;
+ // Update INFO/END if necessary
+ int new_reflen = strlen(line->d.allele[0]);
+ if ( (ori_pos!=line->pos || reflen!=new_reflen) && bcf_get_info_int32(args->hdr, line, "END", &args->int32_arr, &args->nint32_arr)==1 )
+ {
+ // bcf_update_alleles_str() messed up rlen because line->pos changed. This will be fixed by bcf_update_info_int32()
+ args->int32_arr[0] = line->pos + new_reflen;
+ bcf_update_info_int32(args->hdr, line, "END", args->int32_arr, 1);
+ }
+
return ERR_OK;
}
}
static void merge_format_genotype(args_t *args, bcf1_t **lines, int nlines, bcf_fmt_t *fmt, bcf1_t *dst)
{
+ // reusing uint8_t arrays as int32_t arrays
int ntmp = args->ntmp_arr1 / 4;
int ngts = bcf_get_genotypes(args->hdr,lines[0],&args->tmp_arr1,&ntmp);
args->ntmp_arr1 = ntmp * 4;
}
free(args->maps);
free(args->als);
+ free(args->int32_arr);
free(args->tmp_arr1);
free(args->tmp_arr2);
free(args->diploid);
break;
case 'o': args->output_fname = optarg; break;
case 'D':
- fprintf(stderr,"Warning: `-D` is functional but deprecated, replaced by `-d both`.\n");
+ fprintf(stderr,"Warning: `-D` is functional but deprecated; it is an alias of, and replaced by, `-d none`.\n");
args->rmdup = BCF_SR_PAIR_EXACT;
break;
case 's': args->strict_filter = 1; break;
default: error("Unknown argument: %s\n", optarg);
}
}
- if ( argc>optind+1 ) usage();
- if ( !args->ref_fname && !args->mrows_op && !args->rmdup ) usage();
- if ( !args->ref_fname && args->check_ref&CHECK_REF_FIX ) error("Expected --fasta-ref with --check-ref s\n");
+
char *fname = NULL;
if ( optind>=argc )
{
}
else fname = argv[optind];
+ if ( !args->ref_fname && !args->mrows_op && !args->rmdup ) error("Expected -f, -m, -D or -d option\n");
+ if ( !args->ref_fname && args->check_ref&CHECK_REF_FIX ) error("Expected --fasta-ref with --check-ref s\n");
+
if ( args->region )
{
if ( bcf_sr_set_regions(args->files, args->region,region_is_file)<0 )
char **als;
int mmaps, nals, mals;
uint8_t *tmp_arr1, *tmp_arr2, *diploid;
- int ntmp_arr1, ntmp_arr2;
+ int32_t *int32_arr;
+ int ntmp_arr1, ntmp_arr2, nint32_arr;
kstring_t *tmp_str;
kstring_t *tmp_als, tmp_als_str;
int ntmp_als;
bcf_update_alleles_str(args->hdr,line,args->tmp_als_str.s);
args->nchanged++;
+ // Update INFO/END if necessary
+ int new_reflen = strlen(line->d.allele[0]);
+ if ( (ori_pos!=line->pos || reflen!=new_reflen) && bcf_get_info_int32(args->hdr, line, "END", &args->int32_arr, &args->nint32_arr)==1 )
+ {
+ // bcf_update_alleles_str() messed up rlen because line->pos changed. This will be fixed by bcf_update_info_int32()
+ args->int32_arr[0] = line->pos + new_reflen;
+ bcf_update_info_int32(args->hdr, line, "END", args->int32_arr, 1);
+ }
+
return ERR_OK;
}
}
static void merge_format_genotype(args_t *args, bcf1_t **lines, int nlines, bcf_fmt_t *fmt, bcf1_t *dst)
{
+ // reusing uint8_t arrays as int32_t arrays
int ntmp = args->ntmp_arr1 / 4;
int ngts = bcf_get_genotypes(args->hdr,lines[0],&args->tmp_arr1,&ntmp);
args->ntmp_arr1 = ntmp * 4;
}
free(args->maps);
free(args->als);
+ free(args->int32_arr);
free(args->tmp_arr1);
free(args->tmp_arr2);
free(args->diploid);
break;
case 'o': args->output_fname = optarg; break;
case 'D':
- fprintf(bcftools_stderr,"Warning: `-D` is functional but deprecated, replaced by `-d both`.\n");
+ fprintf(bcftools_stderr,"Warning: `-D` is functional but deprecated; it is an alias of, and replaced by, `-d none`.\n");
args->rmdup = BCF_SR_PAIR_EXACT;
break;
case 's': args->strict_filter = 1; break;
default: error("Unknown argument: %s\n", optarg);
}
}
- if ( argc>optind+1 ) usage();
- if ( !args->ref_fname && !args->mrows_op && !args->rmdup ) usage();
- if ( !args->ref_fname && args->check_ref&CHECK_REF_FIX ) error("Expected --fasta-ref with --check-ref s\n");
+
char *fname = NULL;
if ( optind>=argc )
{
}
else fname = argv[optind];
+ if ( !args->ref_fname && !args->mrows_op && !args->rmdup ) error("Expected -f, -m, -D or -d option\n");
+ if ( !args->ref_fname && args->check_ref&CHECK_REF_FIX ) error("Expected --fasta-ref with --check-ref s\n");
+
if ( args->region )
{
if ( bcf_sr_set_regions(args->files, args->region,region_is_file)<0 )
case 'H': args->print_header = 1; break;
case 'v': args->vcf_list = optarg; break;
case 'c':
- error("The --collapse option is obsolete, pipe through `bcftools norm -c` instead.\n", optarg);
+ error("The --collapse option is obsolete, pipe through `bcftools norm -c` instead.\n");
break;
case 'a':
{
case 'H': args->print_header = 1; break;
case 'v': args->vcf_list = optarg; break;
case 'c':
- error("The --collapse option is obsolete, pipe through `bcftools norm -c` instead.\n", optarg);
+ error("The --collapse option is obsolete, pipe through `bcftools norm -c` instead.\n");
break;
case 'a':
{
/* vcfroh.c -- HMM model for detecting runs of autozygosity.
- Copyright (C) 2013-2017 Genome Research Ltd.
+ Copyright (C) 2013-2018 Genome Research Ltd.
Author: Petr Danecek <pd3@sanger.ac.uk>
#include <unistd.h>
#include <getopt.h>
#include <math.h>
+#include <inttypes.h>
#include <htslib/vcf.h>
#include <htslib/synced_bcf_reader.h>
#include <htslib/kstring.h>
#include "bcftools.h"
#include "HMM.h"
#include "smpl_ilist.h"
+#include "filter.h"
#define STATE_HW 0 // normal state, follows Hardy-Weinberg allele frequencies
#define STATE_AZ 1 // autozygous state
#define OUTPUT_RG (1<<2)
#define OUTPUT_GZ (1<<3)
+// Logic of the filters: include or exclude sites which match the filters?
+#define FLT_INCLUDE 1
+#define FLT_EXCLUDE 2
+
+
/** Genetic map */
typedef struct
{
int32_t skip_rid, prev_rid, prev_pos;
int ntot; // some stats to detect if things didn't go wrong
+ int nno_af; // number of sites rejected because AF could not be determined
+ int nfiltered; // .. because of filters
+ int nnot_biallelic, ndup;
smpl_t *smpl; // HMM data for each sample
smpl_ilist_t *af_smpl; // list of samples to estimate AF from (--estimate-AF)
smpl_ilist_t *roh_smpl; // list of samples to analyze (--samples, --samples-file)
int argc, fake_PLs, snps_only, vi_training, samples_is_file, output_type, skip_homref, n_threads;
BGZF *out;
kstring_t str;
+
+ int filter_logic;
+ filter_t *filter;
+ char *filter_str;
}
args_t;
if ( !bcf_hdr_nsamples(args->hdr) ) error("No samples in the VCF?\n");
+ if ( args->filter_str )
+ args->filter = filter_init(args->hdr, args->filter_str);
+
if ( !args->fake_PLs )
{
args->pl_hdr_id = bcf_hdr_id2int(args->hdr, BCF_DT_ID, "PL");
static void destroy_data(args_t *args)
{
+ if ( args->filter ) filter_destroy(args->filter);
if ( bgzf_close(args->out)!=0 ) error("Error: close failed .. %s\n", args->output_fname);
int i;
for (i=0; i<args->roh_smpl->n; i++)
hts_getline(fp, KS_SEP_LINE, &str);
if ( strcmp(str.s,"position COMBINED_rate(cM/Mb) Genetic_Map(cM)") )
- error("Unexpected header, found:\n\t[%s], but expected:\n\t[position COMBINED_rate(cM/Mb) Genetic_Map(cM)]\n", fname, str.s);
+ error("Unexpected header in %s, found:\n\t[%s], but expected:\n\t[position COMBINED_rate(cM/Mb) Genetic_Map(cM)]\n", fname, str.s);
args->ngenmap = args->igenmap = 0;
while ( hts_getline(fp, KS_SEP_LINE, &str) > 0 )
if ( p[irr]<0 || p[ira]<0 || p[iaa]<0 ) continue; /* missing value */ \
if ( p[irr]==p[ira] && p[irr]==p[iaa] ) continue; /* all values are the same */ \
double prob[3], norm = 0; \
- prob[0] = p[irr] < (type_t)256 ? args->pl2p[ p[irr] ] : args->pl2p[255]; \
- prob[1] = p[ira] < (type_t)256 ? args->pl2p[ p[ira] ] : args->pl2p[255]; \
- prob[2] = p[iaa] < (type_t)256 ? args->pl2p[ p[iaa] ] : args->pl2p[255]; \
+ prob[0] = p[irr] < 256 ? args->pl2p[ p[irr] ] : args->pl2p[255]; \
+ prob[1] = p[ira] < 256 ? args->pl2p[ p[ira] ] : args->pl2p[255]; \
+ prob[2] = p[iaa] < 256 ? args->pl2p[ p[iaa] ] : args->pl2p[255]; \
for (j=0; j<3; j++) norm += prob[j]; \
for (j=0; j<3; j++) prob[j] /= norm; \
af += 0.5*prob[1] + prob[2]; \
if ( p[irr]<0 || p[ira]<0 || p[iaa]<0 ) continue; /* missing value */ \
if ( p[irr]==p[ira] && p[irr]==p[iaa] ) continue; /* all values are the same */ \
double prob[3], norm = 0; \
- prob[0] = p[irr] < (type_t)256 ? args->pl2p[ p[irr] ] : args->pl2p[255]; \
- prob[1] = p[ira] < (type_t)256 ? args->pl2p[ p[ira] ] : args->pl2p[255]; \
- prob[2] = p[iaa] < (type_t)256 ? args->pl2p[ p[iaa] ] : args->pl2p[255]; \
+ prob[0] = p[irr] < 256 ? args->pl2p[ p[irr] ] : args->pl2p[255]; \
+ prob[1] = p[ira] < 256 ? args->pl2p[ p[ira] ] : args->pl2p[255]; \
+ prob[2] = p[iaa] < 256 ? args->pl2p[ p[iaa] ] : args->pl2p[255]; \
for (j=0; j<3; j++) norm += prob[j]; \
for (j=0; j<3; j++) prob[j] /= norm; \
af += 0.5*prob[1] + prob[2]; \
alt_freq = (double) AC/AN;
}
- if ( ret<0 ) return ret;
- if ( alt_freq==0.0 ) return -1;
+ if ( args->dflt_AF>0 && (ret<0 || alt_freq==0.0) ) alt_freq = args->dflt_AF;
+ else if ( ret<0 ) { args->nno_af++; return ret; }
+ else if ( alt_freq==0.0 ) { args->nno_af++; return -1; }
int irr = bcf_alleles2gt(0,0), ira = bcf_alleles2gt(0,ial), iaa = bcf_alleles2gt(ial,ial);
if ( args->fake_PLs )
type_t *p = (type_t*)fmt_pl->p + fmt_pl->n*ismpl; \
if ( p[irr]<0 || p[ira]<0 || p[iaa]<0 ) continue; /* missing value */ \
if ( p[irr]==p[ira] && p[irr]==p[iaa] ) continue; /* all values are the same */ \
- pdg[0] = p[irr] < (type_t)256 ? args->pl2p[ p[irr] ] : args->pl2p[255]; \
- pdg[1] = p[ira] < (type_t)256 ? args->pl2p[ p[ira] ] : args->pl2p[255]; \
- pdg[2] = p[iaa] < (type_t)256 ? args->pl2p[ p[iaa] ] : args->pl2p[255]; \
+ pdg[0] = p[irr] < 256 ? args->pl2p[ p[irr] ] : args->pl2p[255]; \
+ pdg[1] = p[ira] < 256 ? args->pl2p[ p[ira] ] : args->pl2p[255]; \
+ pdg[2] = p[iaa] < 256 ? args->pl2p[ p[iaa] ] : args->pl2p[255]; \
}
switch (fmt_pl->type) {
case BCF_BT_INT8: BRANCH(int8_t); break;
{
hts_expand(uint32_t,smpl->nsites+1,smpl->msites,smpl->sites);
smpl->eprob = (double*) realloc(smpl->eprob,sizeof(*smpl->eprob)*smpl->msites*2);
- if ( !smpl->eprob ) error("Error: failed to alloc %d bytes\n", sizeof(*smpl->eprob)*smpl->msites*2);
+ if ( !smpl->eprob ) error("Error: failed to alloc %"PRIu64" bytes\n", (uint64_t)(sizeof(*smpl->eprob)*smpl->msites*2));
}
// Calculate emission probabilities P(D|AZ) and P(D|HW)
for (i=0; i<args->roh_smpl->n; i++) flush_viterbi(args, i);
return;
}
- args->ntot++;
// Skip unwanted lines, for simplicity we consider only biallelic sites
if ( line->rid == args->skip_rid ) return;
- if ( line->n_allele==1 ) return; // no ALT allele
- if ( line->n_allele > 3 ) return; // cannot be bi-allelic, even with <*>
+ if ( line->n_allele==1 ) { args->nnot_biallelic++; return; } // no ALT allele
+ if ( line->n_allele > 3 ) { args->nnot_biallelic++; return; } // cannot be bi-allelic, even with <*>
// This can be raw callable VCF with the symbolic unseen allele <*>
int ial = 0;
if ( !strcmp("<*>",line->d.allele[i]) ) { ial = i; break; }
if ( ial==0 ) // normal VCF, the symbolic allele is not present
{
- if ( line->n_allele!=2 ) return; // not biallelic
+ if ( line->n_allele!=2 ) { args->nnot_biallelic++; return; } // not biallelic
ial = 1;
}
else
args->prev_pos = line->pos;
skip_rid = load_genmap(args, bcf_seqname(args->hdr,line));
}
- else if ( args->prev_pos == line->pos ) return; // skip duplicate positions
+ else if ( args->prev_pos == line->pos )
+ {
+ args->ndup++;
+ return; // skip duplicate positions
+ }
if ( skip_rid )
{
fprintf(stderr, "General Options:\n");
fprintf(stderr, " --AF-dflt <float> if AF is not known, use this allele frequency [skip]\n");
fprintf(stderr, " --AF-tag <TAG> use TAG for allele frequency\n");
- fprintf(stderr, " --AF-file <file> read allele frequencies from file (CHR\\tPOS\\tREF,ALT\\tAF)\n");
+ fprintf(stderr, " --AF-file <file> read allele frequencies from file (CHR\\tPOS\\tREF\\tALT\\tAF)\n");
fprintf(stderr, " -b --buffer-size <int[,int]> buffer size and the number of overlapping sites, 0 for unlimited [0]\n");
fprintf(stderr, " If the first number is negative, it is interpreted as the maximum memory to\n");
fprintf(stderr, " use, in MB. The default overlap is set to roughly 1%% of the buffer size.\n");
fprintf(stderr, " -e, --estimate-AF [TAG],<file> estimate AF from FORMAT/TAG (GT or PL) of all samples (\"-\") or samples listed\n");
fprintf(stderr, " in <file>. If TAG is not given, the frequency is estimated from GT by default\n");
+ fprintf(stderr, " --exclude <expr> exclude sites for which the expression is true\n");
fprintf(stderr, " -G, --GTs-only <float> use GTs and ignore PLs, instead using <float> for PL of the two least likely genotypes.\n");
fprintf(stderr, " Safe value to use is 30 to account for GT errors.\n");
+ fprintf(stderr, " --include <expr> select sites for which the expression is true\n");
fprintf(stderr, " -i, --ignore-homref skip hom-ref genotypes (0/0)\n");
fprintf(stderr, " -I, --skip-indels skip indels as their genotypes are enriched for errors\n");
fprintf(stderr, " -m, --genetic-map <file> genetic map in IMPUTE2 format, single file or mask, where string \"{CHROM}\"\n");
{"AF-tag",1,0,0},
{"AF-file",1,0,1},
{"AF-dflt",1,0,2},
+ {"include",1,0,3},
+ {"exclude",1,0,4},
{"buffer-size",1,0,'b'},
{"ignore-homref",0,0,'i'},
{"estimate-AF",1,0,'e'},
args->dflt_AF = strtod(optarg,&tmp);
if ( *tmp ) error("Could not parse: --AF-dflt %s\n", optarg);
break;
+ case 3: args->filter_str = optarg; args->filter_logic = FLT_INCLUDE; break;
+ case 4: args->filter_str = optarg; args->filter_logic = FLT_EXCLUDE; break;
case 'o': args->output_fname = optarg; break;
case 'O':
if ( strchr(optarg,'s') || strchr(optarg,'S') ) args->output_type |= OUTPUT_ST;
else fname = argv[optind];
if ( args->vi_training && args->buffer_size ) error("Error: cannot use -b with -V\n");
- if ( args->t2AZ<0 || args->t2AZ>1 ) error("Error: The parameter --hw-to-az is not in [0,1]\n", args->t2AZ);
- if ( args->t2HW<0 || args->t2HW>1 ) error("Error: The parameter --az-to-hw is not in [0,1]\n", args->t2HW);
+ if ( args->t2AZ<0 || args->t2AZ>1 ) error("Error: The parameter --hw-to-az is not in [0,1] .. %e\n", args->t2AZ);
+ if ( args->t2HW<0 || args->t2HW>1 ) error("Error: The parameter --az-to-hw is not in [0,1] .. %e\n", args->t2HW);
if ( naf_opts>1 ) error("Error: The options --AF-tag, --AF-file and -e are mutually exclusive\n");
if ( args->af_fname && args->targets_list ) error("Error: The options --AF-file and -t are mutually exclusive\n");
if ( args->regions_list )
init_data(args);
while ( bcf_sr_next_line(args->files) )
{
- vcfroh(args, args->files->readers[0].buffer[0]);
+ args->ntot++;
+ bcf1_t *line = bcf_sr_get_line(args->files,0);
+ if ( args->filter )
+ {
+ int pass = filter_test(args->filter, line, NULL);
+ if ( args->filter_logic & FLT_EXCLUDE ) pass = pass ? 0 : 1;
+ if ( !pass ) { args->nfiltered++; continue; }
+ }
+ vcfroh(args, line);
}
vcfroh(args, NULL);
int i, nmin = 0;
for (i=0; i<args->roh_smpl->n; i++)
if ( !i || args->smpl[i].nused < nmin ) nmin = args->smpl[i].nused;
- fprintf(stderr,"Number of lines total/processed: %d/%d\n", args->ntot,nmin);
+ if ( args->af_fname )
+ fprintf(stderr,"Number of lines overlapping with --AF-file/processed: %d/%d\n", args->ntot,nmin);
+ else
+ fprintf(stderr,"Number of lines total/processed: %d/%d\n", args->ntot,nmin);
+ fprintf(stderr,"Number of lines filtered/no AF/not biallelic/dup: %d/%d/%d/%d\n", args->nfiltered,args->nno_af,args->nnot_biallelic,args->ndup);
if ( nmin==0 )
{
- fprintf(stderr,"No usable sites were found.");
+ fprintf(stderr,"No usable sites were found.\n");
if ( !naf_opts && !args->dflt_AF ) fprintf(stderr, " Consider using one of the AF options.\n");
}
destroy_data(args);
/* vcfroh.c -- HMM model for detecting runs of autozygosity.
- Copyright (C) 2013-2017 Genome Research Ltd.
+ Copyright (C) 2013-2018 Genome Research Ltd.
Author: Petr Danecek <pd3@sanger.ac.uk>
#include <unistd.h>
#include <getopt.h>
#include <math.h>
+#include <inttypes.h>
#include <htslib/vcf.h>
#include <htslib/synced_bcf_reader.h>
#include <htslib/kstring.h>
#include "bcftools.h"
#include "HMM.h"
#include "smpl_ilist.h"
+#include "filter.h"
#define STATE_HW 0 // normal state, follows Hardy-Weinberg allele frequencies
#define STATE_AZ 1 // autozygous state
#define OUTPUT_RG (1<<2)
#define OUTPUT_GZ (1<<3)
+// Logic of the filters: include or exclude sites which match the filters?
+#define FLT_INCLUDE 1
+#define FLT_EXCLUDE 2
+
+
/** Genetic map */
typedef struct
{
int32_t skip_rid, prev_rid, prev_pos;
int ntot; // some stats to detect if things didn't go wrong
+ int nno_af; // number of sites rejected because AF could not be determined
+ int nfiltered; // .. because of filters
+ int nnot_biallelic, ndup;
smpl_t *smpl; // HMM data for each sample
smpl_ilist_t *af_smpl; // list of samples to estimate AF from (--estimate-AF)
smpl_ilist_t *roh_smpl; // list of samples to analyze (--samples, --samples-file)
int argc, fake_PLs, snps_only, vi_training, samples_is_file, output_type, skip_homref, n_threads;
BGZF *out;
kstring_t str;
+
+ int filter_logic;
+ filter_t *filter;
+ char *filter_str;
}
args_t;
if ( !bcf_hdr_nsamples(args->hdr) ) error("No samples in the VCF?\n");
+ if ( args->filter_str )
+ args->filter = filter_init(args->hdr, args->filter_str);
+
if ( !args->fake_PLs )
{
args->pl_hdr_id = bcf_hdr_id2int(args->hdr, BCF_DT_ID, "PL");
static void destroy_data(args_t *args)
{
+ if ( args->filter ) filter_destroy(args->filter);
if ( bgzf_close(args->out)!=0 ) error("Error: close failed .. %s\n", args->output_fname);
int i;
for (i=0; i<args->roh_smpl->n; i++)
hts_getline(fp, KS_SEP_LINE, &str);
if ( strcmp(str.s,"position COMBINED_rate(cM/Mb) Genetic_Map(cM)") )
- error("Unexpected header, found:\n\t[%s], but expected:\n\t[position COMBINED_rate(cM/Mb) Genetic_Map(cM)]\n", fname, str.s);
+ error("Unexpected header in %s, found:\n\t[%s], but expected:\n\t[position COMBINED_rate(cM/Mb) Genetic_Map(cM)]\n", fname, str.s);
args->ngenmap = args->igenmap = 0;
while ( hts_getline(fp, KS_SEP_LINE, &str) > 0 )
if ( p[irr]<0 || p[ira]<0 || p[iaa]<0 ) continue; /* missing value */ \
if ( p[irr]==p[ira] && p[irr]==p[iaa] ) continue; /* all values are the same */ \
double prob[3], norm = 0; \
- prob[0] = p[irr] < (type_t)256 ? args->pl2p[ p[irr] ] : args->pl2p[255]; \
- prob[1] = p[ira] < (type_t)256 ? args->pl2p[ p[ira] ] : args->pl2p[255]; \
- prob[2] = p[iaa] < (type_t)256 ? args->pl2p[ p[iaa] ] : args->pl2p[255]; \
+ prob[0] = p[irr] < 256 ? args->pl2p[ p[irr] ] : args->pl2p[255]; \
+ prob[1] = p[ira] < 256 ? args->pl2p[ p[ira] ] : args->pl2p[255]; \
+ prob[2] = p[iaa] < 256 ? args->pl2p[ p[iaa] ] : args->pl2p[255]; \
for (j=0; j<3; j++) norm += prob[j]; \
for (j=0; j<3; j++) prob[j] /= norm; \
af += 0.5*prob[1] + prob[2]; \
if ( p[irr]<0 || p[ira]<0 || p[iaa]<0 ) continue; /* missing value */ \
if ( p[irr]==p[ira] && p[irr]==p[iaa] ) continue; /* all values are the same */ \
double prob[3], norm = 0; \
- prob[0] = p[irr] < (type_t)256 ? args->pl2p[ p[irr] ] : args->pl2p[255]; \
- prob[1] = p[ira] < (type_t)256 ? args->pl2p[ p[ira] ] : args->pl2p[255]; \
- prob[2] = p[iaa] < (type_t)256 ? args->pl2p[ p[iaa] ] : args->pl2p[255]; \
+ prob[0] = p[irr] < 256 ? args->pl2p[ p[irr] ] : args->pl2p[255]; \
+ prob[1] = p[ira] < 256 ? args->pl2p[ p[ira] ] : args->pl2p[255]; \
+ prob[2] = p[iaa] < 256 ? args->pl2p[ p[iaa] ] : args->pl2p[255]; \
for (j=0; j<3; j++) norm += prob[j]; \
for (j=0; j<3; j++) prob[j] /= norm; \
af += 0.5*prob[1] + prob[2]; \
alt_freq = (double) AC/AN;
}
- if ( ret<0 ) return ret;
- if ( alt_freq==0.0 ) return -1;
+ if ( args->dflt_AF>0 && (ret<0 || alt_freq==0.0) ) alt_freq = args->dflt_AF;
+ else if ( ret<0 ) { args->nno_af++; return ret; }
+ else if ( alt_freq==0.0 ) { args->nno_af++; return -1; }
int irr = bcf_alleles2gt(0,0), ira = bcf_alleles2gt(0,ial), iaa = bcf_alleles2gt(ial,ial);
if ( args->fake_PLs )
type_t *p = (type_t*)fmt_pl->p + fmt_pl->n*ismpl; \
if ( p[irr]<0 || p[ira]<0 || p[iaa]<0 ) continue; /* missing value */ \
if ( p[irr]==p[ira] && p[irr]==p[iaa] ) continue; /* all values are the same */ \
- pdg[0] = p[irr] < (type_t)256 ? args->pl2p[ p[irr] ] : args->pl2p[255]; \
- pdg[1] = p[ira] < (type_t)256 ? args->pl2p[ p[ira] ] : args->pl2p[255]; \
- pdg[2] = p[iaa] < (type_t)256 ? args->pl2p[ p[iaa] ] : args->pl2p[255]; \
+ pdg[0] = p[irr] < 256 ? args->pl2p[ p[irr] ] : args->pl2p[255]; \
+ pdg[1] = p[ira] < 256 ? args->pl2p[ p[ira] ] : args->pl2p[255]; \
+ pdg[2] = p[iaa] < 256 ? args->pl2p[ p[iaa] ] : args->pl2p[255]; \
}
switch (fmt_pl->type) {
case BCF_BT_INT8: BRANCH(int8_t); break;
{
hts_expand(uint32_t,smpl->nsites+1,smpl->msites,smpl->sites);
smpl->eprob = (double*) realloc(smpl->eprob,sizeof(*smpl->eprob)*smpl->msites*2);
- if ( !smpl->eprob ) error("Error: failed to alloc %d bytes\n", sizeof(*smpl->eprob)*smpl->msites*2);
+ if ( !smpl->eprob ) error("Error: failed to alloc %"PRIu64" bytes\n", (uint64_t)(sizeof(*smpl->eprob)*smpl->msites*2));
}
// Calculate emission probabilities P(D|AZ) and P(D|HW)
for (i=0; i<args->roh_smpl->n; i++) flush_viterbi(args, i);
return;
}
- args->ntot++;
// Skip unwanted lines, for simplicity we consider only biallelic sites
if ( line->rid == args->skip_rid ) return;
- if ( line->n_allele==1 ) return; // no ALT allele
- if ( line->n_allele > 3 ) return; // cannot be bi-allelic, even with <*>
+ if ( line->n_allele==1 ) { args->nnot_biallelic++; return; } // no ALT allele
+ if ( line->n_allele > 3 ) { args->nnot_biallelic++; return; } // cannot be bi-allelic, even with <*>
// This can be raw callable VCF with the symbolic unseen allele <*>
int ial = 0;
if ( !strcmp("<*>",line->d.allele[i]) ) { ial = i; break; }
if ( ial==0 ) // normal VCF, the symbolic allele is not present
{
- if ( line->n_allele!=2 ) return; // not biallelic
+ if ( line->n_allele!=2 ) { args->nnot_biallelic++; return; } // not biallelic
ial = 1;
}
else
args->prev_pos = line->pos;
skip_rid = load_genmap(args, bcf_seqname(args->hdr,line));
}
- else if ( args->prev_pos == line->pos ) return; // skip duplicate positions
+ else if ( args->prev_pos == line->pos )
+ {
+ args->ndup++;
+ return; // skip duplicate positions
+ }
if ( skip_rid )
{
fprintf(bcftools_stderr, "General Options:\n");
fprintf(bcftools_stderr, " --AF-dflt <float> if AF is not known, use this allele frequency [skip]\n");
fprintf(bcftools_stderr, " --AF-tag <TAG> use TAG for allele frequency\n");
- fprintf(bcftools_stderr, " --AF-file <file> read allele frequencies from file (CHR\\tPOS\\tREF,ALT\\tAF)\n");
+ fprintf(bcftools_stderr, " --AF-file <file> read allele frequencies from file (CHR\\tPOS\\tREF\\tALT\\tAF)\n");
fprintf(bcftools_stderr, " -b --buffer-size <int[,int]> buffer size and the number of overlapping sites, 0 for unlimited [0]\n");
fprintf(bcftools_stderr, " If the first number is negative, it is interpreted as the maximum memory to\n");
fprintf(bcftools_stderr, " use, in MB. The default overlap is set to roughly 1%% of the buffer size.\n");
fprintf(bcftools_stderr, " -e, --estimate-AF [TAG],<file> estimate AF from FORMAT/TAG (GT or PL) of all samples (\"-\") or samples listed\n");
fprintf(bcftools_stderr, " in <file>. If TAG is not given, the frequency is estimated from GT by default\n");
+ fprintf(bcftools_stderr, " --exclude <expr> exclude sites for which the expression is true\n");
fprintf(bcftools_stderr, " -G, --GTs-only <float> use GTs and ignore PLs, instead using <float> for PL of the two least likely genotypes.\n");
fprintf(bcftools_stderr, " Safe value to use is 30 to account for GT errors.\n");
+ fprintf(bcftools_stderr, " --include <expr> select sites for which the expression is true\n");
fprintf(bcftools_stderr, " -i, --ignore-homref skip hom-ref genotypes (0/0)\n");
fprintf(bcftools_stderr, " -I, --skip-indels skip indels as their genotypes are enriched for errors\n");
fprintf(bcftools_stderr, " -m, --genetic-map <file> genetic map in IMPUTE2 format, single file or mask, where string \"{CHROM}\"\n");
{"AF-tag",1,0,0},
{"AF-file",1,0,1},
{"AF-dflt",1,0,2},
+ {"include",1,0,3},
+ {"exclude",1,0,4},
{"buffer-size",1,0,'b'},
{"ignore-homref",0,0,'i'},
{"estimate-AF",1,0,'e'},
args->dflt_AF = strtod(optarg,&tmp);
if ( *tmp ) error("Could not parse: --AF-dflt %s\n", optarg);
break;
+ case 3: args->filter_str = optarg; args->filter_logic = FLT_INCLUDE; break;
+ case 4: args->filter_str = optarg; args->filter_logic = FLT_EXCLUDE; break;
case 'o': args->output_fname = optarg; break;
case 'O':
if ( strchr(optarg,'s') || strchr(optarg,'S') ) args->output_type |= OUTPUT_ST;
else fname = argv[optind];
if ( args->vi_training && args->buffer_size ) error("Error: cannot use -b with -V\n");
- if ( args->t2AZ<0 || args->t2AZ>1 ) error("Error: The parameter --hw-to-az is not in [0,1]\n", args->t2AZ);
- if ( args->t2HW<0 || args->t2HW>1 ) error("Error: The parameter --az-to-hw is not in [0,1]\n", args->t2HW);
+ if ( args->t2AZ<0 || args->t2AZ>1 ) error("Error: The parameter --hw-to-az is not in [0,1] .. %e\n", args->t2AZ);
+ if ( args->t2HW<0 || args->t2HW>1 ) error("Error: The parameter --az-to-hw is not in [0,1] .. %e\n", args->t2HW);
if ( naf_opts>1 ) error("Error: The options --AF-tag, --AF-file and -e are mutually exclusive\n");
if ( args->af_fname && args->targets_list ) error("Error: The options --AF-file and -t are mutually exclusive\n");
if ( args->regions_list )
init_data(args);
while ( bcf_sr_next_line(args->files) )
{
- vcfroh(args, args->files->readers[0].buffer[0]);
+ args->ntot++;
+ bcf1_t *line = bcf_sr_get_line(args->files,0);
+ if ( args->filter )
+ {
+ int pass = filter_test(args->filter, line, NULL);
+ if ( args->filter_logic & FLT_EXCLUDE ) pass = pass ? 0 : 1;
+ if ( !pass ) { args->nfiltered++; continue; }
+ }
+ vcfroh(args, line);
}
vcfroh(args, NULL);
int i, nmin = 0;
for (i=0; i<args->roh_smpl->n; i++)
if ( !i || args->smpl[i].nused < nmin ) nmin = args->smpl[i].nused;
- fprintf(bcftools_stderr,"Number of lines total/processed: %d/%d\n", args->ntot,nmin);
+ if ( args->af_fname )
+ fprintf(bcftools_stderr,"Number of lines overlapping with --AF-file/processed: %d/%d\n", args->ntot,nmin);
+ else
+ fprintf(bcftools_stderr,"Number of lines total/processed: %d/%d\n", args->ntot,nmin);
+ fprintf(bcftools_stderr,"Number of lines filtered/no AF/not biallelic/dup: %d/%d/%d/%d\n", args->nfiltered,args->nno_af,args->nnot_biallelic,args->ndup);
if ( nmin==0 )
{
- fprintf(bcftools_stderr,"No usable sites were found.");
+ fprintf(bcftools_stderr,"No usable sites were found.\n");
if ( !naf_opts && !args->dflt_AF ) fprintf(bcftools_stderr, " Consider using one of the AF options.\n");
}
destroy_data(args);
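
The --include/--exclude handling added to the read loop above amounts to inverting the filter result when the exclude logic is selected. The following is a minimal sketch of that decision only, assuming the expression has already been evaluated to an int (in the patch the value comes from filter_test). The helper name site_passes is hypothetical and does not appear in bcftools.

```c
/* Same constants as introduced by the patch. */
#define FLT_INCLUDE 1
#define FLT_EXCLUDE 2

/* Hypothetical helper: expr_true is the outcome of the filter expression
   for one site, filter_logic is FLT_INCLUDE or FLT_EXCLUDE.
   Returns 1 if the site should be processed, 0 if it should be skipped. */
static int site_passes(int expr_true, int filter_logic)
{
    int pass = expr_true ? 1 : 0;
    if ( filter_logic & FLT_EXCLUDE ) pass = pass ? 0 : 1;   /* invert for --exclude */
    return pass;
}
```

A site that returns 0 here corresponds to the branch that increments nfiltered and continues with the next line.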
som->bmu_th = args->bmu_th;
som->size = pow(som->nbin,som->ndim);
som->w = (double*) malloc(sizeof(double)*som->size*som->kdim);
- if ( !som->w ) error("Could not alloc %d bytes [nbin=%d ndim=%d kdim=%d]\n", sizeof(double)*som->size*som->kdim,som->nbin,som->ndim,som->kdim);
+ if ( !som->w ) error("Could not alloc %"PRIu64" bytes [nbin=%d ndim=%d kdim=%d]\n", (uint64_t)(sizeof(double)*som->size*som->kdim),som->nbin,som->ndim,som->kdim);
som->c = (double*) calloc(som->size,sizeof(double));
- if ( !som->w ) error("Could not alloc %d bytes [nbin=%d ndim=%d]\n", sizeof(double)*som->size,som->nbin,som->ndim);
+ if ( !som->c ) error("Could not alloc %"PRIu64" bytes [nbin=%d ndim=%d]\n", (uint64_t)(sizeof(double)*som->size),som->nbin,som->ndim);
int i;
for (i=0; i<som->size*som->kdim; i++)
som->w[i] = (double)random()/RAND_MAX;
som->bmu_th = args->bmu_th;
som->size = pow(som->nbin,som->ndim);
som->w = (double*) malloc(sizeof(double)*som->size*som->kdim);
- if ( !som->w ) error("Could not alloc %d bytes [nbin=%d ndim=%d kdim=%d]\n", sizeof(double)*som->size*som->kdim,som->nbin,som->ndim,som->kdim);
+ if ( !som->w ) error("Could not alloc %"PRIu64" bytes [nbin=%d ndim=%d kdim=%d]\n", (uint64_t)(sizeof(double)*som->size*som->kdim),som->nbin,som->ndim,som->kdim);
som->c = (double*) calloc(som->size,sizeof(double));
- if ( !som->w ) error("Could not alloc %d bytes [nbin=%d ndim=%d]\n", sizeof(double)*som->size,som->nbin,som->ndim);
+ if ( !som->c ) error("Could not alloc %"PRIu64" bytes [nbin=%d ndim=%d]\n", (uint64_t)(sizeof(double)*som->size),som->nbin,som->ndim);
int i;
for (i=0; i<som->size*som->kdim; i++)
som->w[i] = (double)random()/RAND_MAX;
if ( a->rid > b->rid ) return 1;
if ( a->pos < b->pos ) return -1;
if ( a->pos > b->pos ) return 1;
+
+ // Sort the same chr:pos records lexicographically by ref,alt.
+ // This will be called rarely so should not slow the sorting down
+ // noticeably.
+
+ if ( !a->unpacked ) bcf_unpack(a, BCF_UN_STR);
+ if ( !b->unpacked ) bcf_unpack(b, BCF_UN_STR);
+ int i;
+ for (i=0; i<a->n_allele; i++)
+ {
+ if ( i >= b->n_allele ) return 1;
+ int ret = strcasecmp(a->d.allele[i],b->d.allele[i]);
+ if ( ret ) return ret;
+ }
+ if ( a->n_allele < b->n_allele ) return -1;
return 0;
}
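
The tie-break added to the comparator above orders records that share chromosome and position by their REF/ALT strings, case-insensitively, with the shorter allele list first when one is a prefix of the other. The sketch below illustrates the same ordering on a simplified record type. The names rec_t and cmp_rec are hypothetical stand-ins, and unlike the patched function it sorts an array of structs directly and needs no bcf_unpack call.

```c
#include <stdlib.h>
#include <strings.h>   /* strcasecmp (POSIX) */

/* Hypothetical, simplified stand-in for bcf1_t. */
typedef struct
{
    int rid, pos, n_allele;
    const char **allele;    /* allele[0] is REF, the rest are ALT */
}
rec_t;

/* qsort comparator: by rid, then pos, then alleles lexicographically. */
static int cmp_rec(const void *aptr, const void *bptr)
{
    const rec_t *a = (const rec_t*) aptr;
    const rec_t *b = (const rec_t*) bptr;
    if ( a->rid < b->rid ) return -1;
    if ( a->rid > b->rid ) return 1;
    if ( a->pos < b->pos ) return -1;
    if ( a->pos > b->pos ) return 1;

    int i;
    for (i=0; i<a->n_allele; i++)
    {
        if ( i >= b->n_allele ) return 1;    /* b is a proper prefix of a */
        int ret = strcasecmp(a->allele[i], b->allele[i]);
        if ( ret ) return ret;
    }
    if ( a->n_allele < b->n_allele ) return -1;  /* a is a proper prefix of b */
    return 0;
}
```

Because the allele comparison is reached only when rid and pos are equal, it costs nothing for the common case of distinct positions.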
{
blk_t *a = *aptr;
blk_t *b = *bptr;
- if ( a->rec->rid < b->rec->rid ) return 1;
- if ( a->rec->rid > b->rec->rid ) return 0;
- if ( a->rec->pos < b->rec->pos ) return 1;
+ int ret = cmp_bcf_pos(&a->rec, &b->rec);
+ if ( ret < 0 ) return 1;
return 0;
}
KHEAP_INIT(blk, blk_t*, blk_is_smaller)
if ( a->rid > b->rid ) return 1;
if ( a->pos < b->pos ) return -1;
if ( a->pos > b->pos ) return 1;
+
+ // Sort the same chr:pos records lexicographically by ref,alt.
+ // This will be called rarely so should not slow the sorting down
+ // noticeably.
+
+ if ( !a->unpacked ) bcf_unpack(a, BCF_UN_STR);
+ if ( !b->unpacked ) bcf_unpack(b, BCF_UN_STR);
+ int i;
+ for (i=0; i<a->n_allele; i++)
+ {
+ if ( i >= b->n_allele ) return 1;
+ int ret = strcasecmp(a->d.allele[i],b->d.allele[i]);
+ if ( ret ) return ret;
+ }
+ if ( a->n_allele < b->n_allele ) return -1;
return 0;
}
{
blk_t *a = *aptr;
blk_t *b = *bptr;
- if ( a->rec->rid < b->rec->rid ) return 1;
- if ( a->rec->rid > b->rec->rid ) return 0;
- if ( a->rec->pos < b->rec->pos ) return 1;
+ int ret = cmp_bcf_pos(&a->rec, &b->rec);
+ if ( ret < 0 ) return 1;
return 0;
}
KHEAP_INIT(blk, blk_t*, blk_is_smaller)
int id = bcf_hdr_id2int(hdr,BCF_DT_ID,usr->tag);
if ( !bcf_hdr_idinfo_exists(hdr,BCF_HL_INFO,id) ) error("The INFO tag \"%s\" is not defined in the header\n", usr->tag);
usr->type = bcf_hdr_id2type(hdr,BCF_HL_INFO,id);
- if ( usr->type!=BCF_HT_REAL && usr->type!=BCF_HT_INT ) error("The INFO tag \"%s\" is not of Float or Integer type (%d)\n", usr->type);
+ if ( usr->type!=BCF_HT_REAL && usr->type!=BCF_HT_INT ) error("The INFO tag \"%s\" is not of Float or Integer type (%d)\n", usr->tag, usr->type);
}
}
static void init_stats(args_t *args)
int id = bcf_hdr_id2int(hdr,BCF_DT_ID,usr->tag);
if ( !bcf_hdr_idinfo_exists(hdr,BCF_HL_INFO,id) ) error("The INFO tag \"%s\" is not defined in the header\n", usr->tag);
usr->type = bcf_hdr_id2type(hdr,BCF_HL_INFO,id);
- if ( usr->type!=BCF_HT_REAL && usr->type!=BCF_HT_INT ) error("The INFO tag \"%s\" is not of Float or Integer type (%d)\n", usr->type);
+ if ( usr->type!=BCF_HT_REAL && usr->type!=BCF_HT_INT ) error("The INFO tag \"%s\" is not of Float or Integer type (%d)\n", usr->tag, usr->type);
}
}
static void init_stats(args_t *args)
char *tmp;
while ((c = getopt_long(argc, argv, "l:t:T:r:R:o:O:s:S:Gf:knv:V:m:M:auUhHc:C:Ii:e:xXpPq:Q:g:",loptions,NULL)) >= 0)
{
- char allele_type[8] = "nref";
+ char allele_type[9] = "nref";
switch (c)
{
case 'O':
case 'c':
{
args->min_ac_type = ALLELE_NONREF;
- if ( sscanf(optarg,"%d:%s",&args->min_ac, allele_type)!=2 && sscanf(optarg,"%d",&args->min_ac)!=1 )
+ if ( sscanf(optarg,"%d:%8s",&args->min_ac, allele_type)!=2 && sscanf(optarg,"%d",&args->min_ac)!=1 )
error("Error: Could not parse --min-ac %s\n", optarg);
set_allele_type(&args->min_ac_type, allele_type);
args->calc_ac = 1;
case 'C':
{
args->max_ac_type = ALLELE_NONREF;
- if ( sscanf(optarg,"%d:%s",&args->max_ac, allele_type)!=2 && sscanf(optarg,"%d",&args->max_ac)!=1 )
+ if ( sscanf(optarg,"%d:%8s",&args->max_ac, allele_type)!=2 && sscanf(optarg,"%d",&args->max_ac)!=1 )
error("Error: Could not parse --max-ac %s\n", optarg);
set_allele_type(&args->max_ac_type, allele_type);
args->calc_ac = 1;
case 'q':
{
args->min_af_type = ALLELE_NONREF;
- if ( sscanf(optarg,"%f:%s",&args->min_af, allele_type)!=2 && sscanf(optarg,"%f",&args->min_af)!=1 )
- error("Error: Could not parse --min_af %s\n", optarg);
+ if ( sscanf(optarg,"%f:%8s",&args->min_af, allele_type)!=2 && sscanf(optarg,"%f",&args->min_af)!=1 )
+ error("Error: Could not parse --min-af %s\n", optarg);
set_allele_type(&args->min_af_type, allele_type);
args->calc_ac = 1;
break;
case 'Q':
{
args->max_af_type = ALLELE_NONREF;
- if ( sscanf(optarg,"%f:%s",&args->max_af, allele_type)!=2 && sscanf(optarg,"%f",&args->max_af)!=1 )
- error("Error: Could not parse --min_af %s\n", optarg);
+ if ( sscanf(optarg,"%f:%8s",&args->max_af, allele_type)!=2 && sscanf(optarg,"%f",&args->max_af)!=1 )
+ error("Error: Could not parse --max-af %s\n", optarg);
set_allele_type(&args->max_af_type, allele_type);
args->calc_ac = 1;
break;
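
The change from "%s" to "%8s" with a 9-byte buffer in the hunk above bounds what sscanf may write: at most 8 characters plus the terminating NUL. A minimal sketch of the same parsing pattern, assuming the "<int>[:<type>]" argument form used by --min-ac and --max-ac; the function name parse_count is hypothetical:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: parse "<int>" or "<int>:<type>".
   type must point to at least 9 bytes and defaults to "nref".
   Returns 0 on success, -1 if no integer could be parsed. */
static int parse_count(const char *arg, int *count, char *type)
{
    strcpy(type, "nref");
    /* %8s stores at most 8 characters and then a NUL, so 9 bytes suffice */
    if ( sscanf(arg, "%d:%8s", count, type) == 2 ) return 0;
    if ( sscanf(arg, "%d", count) == 1 ) return 0;
    return -1;
}
```

An over-long type string is silently truncated to 8 characters rather than overflowing the buffer, so the caller still has to validate the resulting name.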
char *tmp;
while ((c = getopt_long(argc, argv, "l:t:T:r:R:o:O:s:S:Gf:knv:V:m:M:auUhHc:C:Ii:e:xXpPq:Q:g:",loptions,NULL)) >= 0)
{
- char allele_type[8] = "nref";
+ char allele_type[9] = "nref";
switch (c)
{
case 'O':
case 'c':
{
args->min_ac_type = ALLELE_NONREF;
- if ( sscanf(optarg,"%d:%s",&args->min_ac, allele_type)!=2 && sscanf(optarg,"%d",&args->min_ac)!=1 )
+ if ( sscanf(optarg,"%d:%8s",&args->min_ac, allele_type)!=2 && sscanf(optarg,"%d",&args->min_ac)!=1 )
error("Error: Could not parse --min-ac %s\n", optarg);
set_allele_type(&args->min_ac_type, allele_type);
args->calc_ac = 1;
case 'C':
{
args->max_ac_type = ALLELE_NONREF;
- if ( sscanf(optarg,"%d:%s",&args->max_ac, allele_type)!=2 && sscanf(optarg,"%d",&args->max_ac)!=1 )
+ if ( sscanf(optarg,"%d:%8s",&args->max_ac, allele_type)!=2 && sscanf(optarg,"%d",&args->max_ac)!=1 )
error("Error: Could not parse --max-ac %s\n", optarg);
set_allele_type(&args->max_ac_type, allele_type);
args->calc_ac = 1;
case 'q':
{
args->min_af_type = ALLELE_NONREF;
- if ( sscanf(optarg,"%f:%s",&args->min_af, allele_type)!=2 && sscanf(optarg,"%f",&args->min_af)!=1 )
- error("Error: Could not parse --min_af %s\n", optarg);
+ if ( sscanf(optarg,"%f:%8s",&args->min_af, allele_type)!=2 && sscanf(optarg,"%f",&args->min_af)!=1 )
+ error("Error: Could not parse --min-af %s\n", optarg);
set_allele_type(&args->min_af_type, allele_type);
args->calc_ac = 1;
break;
case 'Q':
{
args->max_af_type = ALLELE_NONREF;
- if ( sscanf(optarg,"%f:%s",&args->max_af, allele_type)!=2 && sscanf(optarg,"%f",&args->max_af)!=1 )
- error("Error: Could not parse --min_af %s\n", optarg);
+ if ( sscanf(optarg,"%f:%8s",&args->max_af, allele_type)!=2 && sscanf(optarg,"%f",&args->max_af)!=1 )
+ error("Error: Could not parse --max-af %s\n", optarg);
set_allele_type(&args->max_af_type, allele_type);
args->calc_ac = 1;
break;
#include <string.h>
#include <stdlib.h>
#include <htslib/hts.h>
+#include <htslib/vcf.h>
#include <ctype.h>
#include "vcmp.h"
char *dref;
int ndref, mdref; // ndref: positive when ref1 longer, negative when ref2 is longer
int nmatch;
- int *map, mmap;
+ int *map, mmap, nmap;
+ int *map_dip, mmap_dip, nmap_dip;
};
vcmp_t *vcmp_init()
void vcmp_destroy(vcmp_t *vcmp)
{
+ free(vcmp->map_dip);
free(vcmp->map);
free(vcmp->dref);
free(vcmp);
{
if ( vcmp_set_ref(vcmp,als1[0],als2[0]) < 0 ) return NULL;
- vcmp->map = (int*) realloc(vcmp->map,sizeof(int)*n);
+ vcmp->nmap = n;
+ hts_expand(int, vcmp->nmap, vcmp->mmap, vcmp->map);
int i, ifrom = n==nals2 ? 0 : 1;
for (i=ifrom; i<nals2; i++)
return vcmp->map;
}
+int *vcmp_map_dipGvalues(vcmp_t *vcmp, int *nmap)
+{
+ vcmp->nmap_dip = vcmp->nmap*(vcmp->nmap+1)/2;
+ hts_expand(int, vcmp->nmap_dip, vcmp->mmap_dip, vcmp->map_dip);
+
+ int i, j, k = 0;
+ for (i=0; i<vcmp->nmap; i++)
+ {
+ for (j=0; j<=i; j++)
+ {
+ vcmp->map_dip[k] = vcmp->map[i]>=0 && vcmp->map[j]>=0 ? bcf_alleles2gt(vcmp->map[i],vcmp->map[j]) : -1;
+ k++;
+ }
+ }
+ *nmap = k;
+ return vcmp->map_dip;
+}
+
+
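
The new vcmp_map_dipGvalues function above expands a per-allele map into a per-genotype map for diploid Number=G fields, where the genotype a/b with a<=b is stored at index b*(b+1)/2+a. The sketch below reproduces that loop without htslib, under two assumptions: alleles2gt mirrors the arithmetic of htslib's bcf_alleles2gt macro, and map_diploid is a hypothetical free-standing version that takes the allele map as a plain array.

```c
/* Index of the unordered diploid genotype a/b in Number=G ordering. */
static int alleles2gt(int a, int b)
{
    return a > b ? a*(a+1)/2 + b : b*(b+1)/2 + a;
}

/* Hypothetical stand-alone version of the mapping loop.
   allele_map[i] is the index of allele i in the other record, or -1 if
   the allele is absent there. gt_map must hold n*(n+1)/2 ints.
   Returns the number of genotype entries written. */
static int map_diploid(const int *allele_map, int n, int *gt_map)
{
    int i, j, k = 0;
    for (i=0; i<n; i++)
    {
        for (j=0; j<=i; j++)
        {
            gt_map[k] = allele_map[i]>=0 && allele_map[j]>=0
                      ? alleles2gt(allele_map[i], allele_map[j]) : -1;
            k++;
        }
    }
    return k;
}
```

For three alleles whose second and third entries are swapped in the other record (allele map 0,2,1), the six genotype slots map to 0,3,5,1,4,2.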
#include <string.h>
#include <stdlib.h>
#include <htslib/hts.h>
+#include <htslib/vcf.h>
#include <ctype.h>
#include "vcmp.h"
char *dref;
int ndref, mdref; // ndref: positive when ref1 longer, negative when ref2 is longer
int nmatch;
- int *map, mmap;
+ int *map, mmap, nmap;
+ int *map_dip, mmap_dip, nmap_dip;
};
vcmp_t *vcmp_init()
void vcmp_destroy(vcmp_t *vcmp)
{
+ free(vcmp->map_dip);
free(vcmp->map);
free(vcmp->dref);
free(vcmp);
{
if ( vcmp_set_ref(vcmp,als1[0],als2[0]) < 0 ) return NULL;
- vcmp->map = (int*) realloc(vcmp->map,sizeof(int)*n);
+ vcmp->nmap = n;
+ hts_expand(int, vcmp->nmap, vcmp->mmap, vcmp->map);
int i, ifrom = n==nals2 ? 0 : 1;
for (i=ifrom; i<nals2; i++)
return vcmp->map;
}
+int *vcmp_map_dipGvalues(vcmp_t *vcmp, int *nmap)
+{
+ vcmp->nmap_dip = vcmp->nmap*(vcmp->nmap+1)/2;
+ hts_expand(int, vcmp->nmap_dip, vcmp->mmap_dip, vcmp->map_dip);
+
+ int i, j, k = 0;
+ for (i=0; i<vcmp->nmap; i++)
+ {
+ for (j=0; j<=i; j++)
+ {
+ vcmp->map_dip[k] = vcmp->map[i]>=0 && vcmp->map[j]>=0 ? bcf_alleles2gt(vcmp->map[i],vcmp->map[j]) : -1;
+ k++;
+ }
+ }
+ *nmap = k;
+ return vcmp->map_dip;
+}
+
+
*/
int *vcmp_map_ARvalues(vcmp_t *vcmp, int number, int nals1, char **als1, int nals2, char **als2);
+int *vcmp_map_dipGvalues(vcmp_t *vcmp, int *nmap);
#endif
-#define BCFTOOLS_VERSION "1.7"
+#define BCFTOOLS_VERSION "1.9"
+++ /dev/null
-"""compute number of reads/alignments from BAM file
-===================================================
-
-This is a benchmarking utility script with limited functionality.
-
-Compute simple flag stats on a BAM-file using
-the pysam cython interface.
-
-"""
-
-import sys
-import pysam
-import pyximport
-pyximport.install()
-import _cython_flagstat
-
-assert len(sys.argv) == 2, "USAGE: {} filename.bam".format(sys.argv[0])
-
-is_paired, is_proper = _cython_flagstat.count(
- pysam.AlignmentFile(sys.argv[1], "rb"))
-
-print ("there are alignments of %i paired reads" % is_paired)
-print ("there are %i proper paired alignments" % is_proper)
+++ /dev/null
-"""compute number of reads/alignments from BAM file
-===================================================
-
-This is a benchmarking utility script with limited functionality.
-
-Compute simple flag stats on a BAM-file using
-the pysam python interface.
-"""
-
-import sys
-import pysam
-
-assert len(sys.argv) == 2, "USAGE: {} filename.bam".format(sys.argv[0])
-
-is_paired = 0
-is_proper = 0
-
-for read in pysam.AlignmentFile(sys.argv[1], "rb"):
- is_paired += read.is_paired
- is_proper += read.is_proper_pair
-
-print ("there are alignments of %i paired reads" % is_paired)
-print ("there are %i proper paired alignments" % is_proper)
+++ /dev/null
-#!/bin/bash
-#
-# Build manylinux1 wheels for pysam. Based on the example at
-# <https://github.com/pypa/python-manylinux-demo>
-#
-# It is best to run this in a fresh clone of the repository!
-#
-# Run this within the repository root:
-# docker run --rm -v $(pwd):/io quay.io/pypa/manylinux1_x86_64 /io/buildwheels.sh
-#
-# The wheels will be put into the wheelhouse/ subdirectory.
-#
-# For interactive tests:
-# docker run -it -v $(pwd):/io quay.io/pypa/manylinux1_x86_64 /bin/bash
-
-set -xeuo pipefail
-
-# For convenience, if this script is called from outside of a docker container,
-# it starts a container and runs itself inside of it.
-if ! grep -q docker /proc/1/cgroup; then
- # We are not inside a container
- exec docker run --rm -v $(pwd):/io quay.io/pypa/manylinux1_x86_64 /io/$0
-fi
-
-yum install -y zlib-devel bzip2-devel xz-devel
-
-# Python 2.6 is not supported
-rm -r /opt/python/cp26*
-
-# Python 3.3 builds fail with:
-# /opt/rh/devtoolset-2/root/usr/libexec/gcc/x86_64-CentOS-linux/4.8.2/ld: cannot find -lchtslib
-rm -r /opt/python/cp33*
-
-# Without libcurl support, htslib can open files from HTTP and FTP URLs.
-# With libcurl support, it also supports HTTPS and S3 URLs, but libcurl needs a
-# current version of OpenSSL, and we do not want to be responsible for
-# updating the wheels as soon as there are any security issues. So disable
-# libcurl for now.
-# See also <https://github.com/pypa/manylinux/issues/74>.
-#
-export HTSLIB_CONFIGURE_OPTIONS="--disable-libcurl"
-
-PYBINS="/opt/python/*/bin"
-for PYBIN in ${PYBINS}; do
- ${PYBIN}/pip install -r /io/requirements.txt
- ${PYBIN}/pip wheel /io/ -w wheelhouse/
-done
-
-# Bundle external shared libraries into the wheels
-#
-# The '-L .' option is a workaround. By default, auditwheel puts all external
-# libraries (.so files) into a .libs directory and sets the RUNPATH to $ORIGIN/.libs.
-# When HTSLIB_MODE is 'shared' (now the default), then all so libraries part of
-# pysam require that RUNPATH is set to $ORIGIN (without the .libs). It seems
-# auditwheel overwrites $ORIGIN with $ORIGIN/.libs. This workaround makes
-# auditwheel set the RUNPATH to "$ORIGIN/." and it will work as desired.
-#
-for whl in wheelhouse/*.whl; do
- auditwheel repair -L . $whl -w /io/wheelhouse/
-done
-
-# Created files are owned by root, so fix permissions.
-chown -R --reference=/io/setup.py /io/wheelhouse/
-
-# TODO Install packages and test them
-#for PYBIN in ${PYBINS}; do
-# ${PYBIN}/pip install pysam --no-index -f /io/wheelhouse
-# (cd $HOME; ${PYBIN}/nosetests ...)
-#done
+++ /dev/null
-#!/bin/bash
-
-# Use internal htslib
-chmod a+x ./htslib/configure
-export CFLAGS="-I${PREFIX}/include/curl/ -I${PREFIX}/include -L${PREFIX}/lib"
-export HTSLIB_CONFIGURE_OPTIONS="--disable-libcurl"
-
-$PYTHON setup.py install
+++ /dev/null
-package:
- name: pysam
- version: 0.8.5
-
-source:
- path: ../../
-
-build:
- number: 0
-
-requirements:
- build:
- - python
- - setuptools
- - zlib
- - cython
-
- run:
- - python
- - zlib
-
-test:
- imports:
- - pysam
-
-about:
- home: https://github.com/pysam-developers/pysam
- license: MIT
- summary: Pysam is a python module for reading and manipulating Samfiles. It's a lightweight wrapper of the samtools C-API. Pysam also includes an interface for tabix.
+++ /dev/null
-#!/usr/bin/env bash
-
-# function to detect the Operating System
-detect_os(){
-
-if [ -f /etc/os-release ]; then
-
- OS=$(cat /etc/os-release | awk '/VERSION_ID/ {sub("="," "); print $2;}' | sed 's/\"//g' | awk '{sub("\\."," "); print $1;}')
- if [ "$OS" != "12" ] ; then
-
- echo
- echo " Ubuntu version not supported "
- echo
- echo " Only Ubuntu 12.x has been tested so far "
- echo
- exit 1;
-
- fi
-
- OS="ubuntu"
-
-elif [ -f /etc/system-release ]; then
-
- OS=$(cat /etc/system-release | awk ' {print $4;}' | awk '{sub("\\."," "); print $1;}')
- if [ "$OS" != "6" ] ; then
- echo
- echo " Scientific Linux version not supported "
- echo
- echo " Only 6.x Scientific Linux has been tested so far "
- echo
- exit 1;
- fi
-
- OS="sl"
-
-else
-
- echo
- echo " Operating system not supported "
- echo
- echo " Exiting installation "
- echo
- exit 1;
-
-fi
-} # detect_os
-
-# message to display when the OS is not correct
-sanity_check_os() {
- echo
- echo " Unsupported operating system: $OS "
- echo " Installation aborted "
- echo
- exit 1;
-} # sanity_check_os
-
-# function to install operating system dependencies
-install_os_packages() {
-
-if [ "$OS" == "ubuntu" -o "$OS" == "travis" ] ; then
-
- echo
- echo " Installing packages for Ubuntu "
- echo
-
- apt-get install -y gcc g++
-
-elif [ "$OS" == "sl" ] ; then
-
- echo
- echo " Installing packages for Scientific Linux "
- echo
-
- yum -y install gcc zlib-devel gcc-c++
-
-else
-
- sanity_check_os
-
-fi # if-OS
-} # install_os_packages
-
-# funcion to install Python dependencies
-install_python_deps() {
-
-if [ "$OS" == "ubuntu" -o "$OS" == "sl" ] ; then
-
- echo
- echo " Installing Python dependencies for $1 "
- echo
-
- # Create virtual environment
- cd
- mkdir CGAT
- cd CGAT
- wget --no-check-certificate https://pypi.python.org/packages/source/v/virtualenv/virtualenv-1.10.1.tar.gz
- tar xvfz virtualenv-1.10.1.tar.gz
- rm virtualenv-1.10.1.tar.gz
- cd virtualenv-1.10.1
- python virtualenv.py cgat-venv
- source cgat-venv/bin/activate
-
- # Install Python prerequisites
- pip install cython
-
-elif [ "$OS" == "travis" ] ; then
- # Travis-CI provides a virtualenv with Python 2.7
- echo
- echo " Installing Python dependencies in travis "
- echo
-
- # Install Python prerequisites
- pip install cython
- pip install nose
-
-else
-
- sanity_check_os
-
-fi # if-OS
-} # install_python_deps
-
-# common set of tasks to prepare external dependencies
-nosetests_external_deps() {
-echo
-echo " Running nosetests for $1 "
-echo
-
-pushd .
-
-# create a new folder to store external tools
-mkdir -p $HOME/CGAT/external-tools
-
-# install samtools
-cd $HOME/CGAT/external-tools
-curl -L http://downloads.sourceforge.net/project/samtools/samtools/1.3/samtools-1.3.tar.bz2 > samtools-1.3.tar.bz2
-tar xjf samtools-1.3.tar.bz2
-cd samtools-1.3
-make
-PATH=$PATH:$HOME/CGAT/external-tools/samtools-1.3
-
-echo "installed samtools"
-samtools --version
-
-if [ $? != 0 ]; then
- exit 1
-fi
-
-# install bcftools
-cd $HOME/CGAT/external-tools
-curl -L https://github.com/samtools/bcftools/releases/download/1.3/bcftools-1.3.tar.bz2 > bcftools-1.3.tar.bz2
-tar xjf bcftools-1.3.tar.bz2
-cd bcftools-1.3
-make
-PATH=$PATH:$HOME/CGAT/external-tools/bcftools-1.3
-
-echo "installed bcftools"
-bcftools --version
-
-if [ $? != 0 ]; then
- exit 1
-fi
-
-popd
-
-} # nosetests_external_deps
-
-
-# function to run nosetests
-run_nosetests() {
-
-echo
-echo " Running nosetests for $1 "
-echo
-
-# prepare external dependencies
-nosetests_external_deps $OS
-
-# install code
-python setup.py install
-
-# change into tests directory. Otherwise,
-# 'import pysam' will import the repository,
-# not the installed version. This causes
-# problems in the compilation test.
-cd tests
-
-# create auxilliary data
-echo
-echo 'building test data'
-echo
-make -C pysam_data all
-make -C cbcf_data all
-
-# run nosetests
-# -s: do not capture stdout, conflicts with pysam.dispatch
-# -v: verbose output
-nosetests -s -v
-
-} # run_nosetests
-
-# function to display help message
-help_message() {
-echo
-echo " Use this script as follows: "
-echo
-echo " 1) Become root and install the operating system* packages: "
-echo " ./install-CGAT-tools.sh --install-os-packages"
-echo
-echo " 2) Now, as a normal user (non root), install the Python dependencies**: "
-echo " ./install-CGAT-tools.sh --install-python-deps"
-echo
-echo " At this stage the CGAT Code Collection is ready to go and you do not need further steps. Please type the following for more information:"
-echo " source $HOME/CGAT/virtualenv-1.10.1/cgat-venv/bin/activate"
-echo " cgat --help "
-echo
-echo " The CGAT Code Collection tests the software with nosetests. If you are interested in running those, please continue with the following steps:"
-echo
-echo " 3) Become root to install external tools and set up the environment: "
-echo " ./install-CGAT-tools.sh --install-nosetests-deps"
-echo
-echo " 4) Then, back as a normal user (non root), run nosetests as follows:"
-echo " ./install-CGAT-tools.sh --run-nosetests"
-echo
-echo " NOTES: "
-echo " * Supported operating systems: Ubuntu 12.x and Scientific Linux 6.x "
-echo " ** An isolated virtual environment will be created to install Python dependencies "
-echo
-exit 1;
-}
-
-
-# the main script starts here
-
-if [ $# -eq 0 -o $# -gt 1 ] ; then
-
- help_message
-
-else
-
- if [ "$1" == "--help" ] ; then
-
- help_message
-
- elif [ "$1" == "--travis" ] ; then
-
- OS="travis"
- install_os_packages
- install_python_deps
- run_nosetests
-
- elif [ "$1" == "--install-os-packages" ] ; then
-
- detect_os
- install_os_packages
-
- elif [ "$1" == "--install-python-deps" ] ; then
-
- detect_os
- install_python_deps
-
- elif [ "$1" == "--install-nosetests-deps" ] ; then
-
- detect_os
- install_nosetests_deps
-
- elif [ "$1" == "--run-nosetests" ] ; then
-
- detect_os
- run_nosetests
-
- else
-
- echo
- echo " Incorrect input parameter: $1 "
- help_message
-
- fi # if argument 1
-
-fi # if number of input parameters
-
--- /dev/null
+#!/bin/bash
+#
+# Build manylinux1 wheels for pysam. Based on the example at
+# <https://github.com/pypa/python-manylinux-demo>
+#
+# It is best to run this in a fresh clone of the repository!
+#
+# Run this within the repository root:
+# docker run --rm -v $(pwd):/io quay.io/pypa/manylinux1_x86_64 /io/buildwheels.sh
+#
+# The wheels will be put into the wheelhouse/ subdirectory.
+#
+# For interactive tests:
+# docker run -it -v $(pwd):/io quay.io/pypa/manylinux1_x86_64 /bin/bash
+
+set -xeuo pipefail
+
+# For convenience, if this script is called from outside of a docker container,
+# it starts a container and runs itself inside of it.
+if ! grep -q docker /proc/1/cgroup; then
+ # We are not inside a container
+ exec docker run --rm -v $(pwd):/io quay.io/pypa/manylinux1_x86_64 /io/$0
+fi
+
+yum install -y zlib-devel bzip2-devel xz-devel
+
+# Python 2.6 is not supported
+rm -r /opt/python/cp26*
+
+# Python 3.3 builds fail with:
+# /opt/rh/devtoolset-2/root/usr/libexec/gcc/x86_64-CentOS-linux/4.8.2/ld: cannot find -lchtslib
+rm -r /opt/python/cp33*
+
+# Without libcurl support, htslib can open files from HTTP and FTP URLs.
+# With libcurl support, it also supports HTTPS and S3 URLs, but libcurl needs a
+# current version of OpenSSL, and we do not want to be responsible for
+# updating the wheels as soon as there are any security issues. So disable
+# libcurl for now.
+# See also <https://github.com/pypa/manylinux/issues/74>.
+#
+export HTSLIB_CONFIGURE_OPTIONS="--disable-libcurl"
+
+PYBINS="/opt/python/*/bin"
+for PYBIN in ${PYBINS}; do
+ ${PYBIN}/pip install -r /io/requirements.txt
+ ${PYBIN}/pip wheel /io/ -w wheelhouse/
+done
+
+# Bundle external shared libraries into the wheels
+#
+# The '-L .' option is a workaround. By default, auditwheel puts all external
+# libraries (.so files) into a .libs directory and sets the RUNPATH to $ORIGIN/.libs.
+# When HTSLIB_MODE is 'shared' (now the default), all .so libraries that are part of
+# pysam require that RUNPATH is set to $ORIGIN (without the .libs). It seems
+# auditwheel overwrites $ORIGIN with $ORIGIN/.libs. This workaround makes
+# auditwheel set the RUNPATH to "$ORIGIN/." and it will work as desired.
+#
+for whl in wheelhouse/*.whl; do
+ auditwheel repair -L . $whl -w /io/wheelhouse/
+done
+
+# Created files are owned by root, so fix permissions.
+chown -R --reference=/io/setup.py /io/wheelhouse/
+
+# TODO Install packages and test them
+#for PYBIN in ${PYBINS}; do
+# ${PYBIN}/pip install pysam --no-index -f /io/wheelhouse
+# (cd $HOME; ${PYBIN}/nosetests ...)
+#done
--- /dev/null
+#!/bin/bash
+
+# Use internal htslib
+chmod a+x ./htslib/configure
+export CFLAGS="-I${PREFIX}/include/curl/ -I${PREFIX}/include -L${PREFIX}/lib"
+export HTSLIB_CONFIGURE_OPTIONS="--disable-libcurl"
+
+$PYTHON setup.py install
--- /dev/null
+package:
+ name: pysam
+ version: 0.8.5
+
+source:
+ path: ../../
+
+build:
+ number: 0
+
+requirements:
+ build:
+ - python
+ - setuptools
+ - zlib
+ - cython
+
+ run:
+ - python
+ - zlib
+
+test:
+ imports:
+ - pysam
+
+about:
+ home: https://github.com/pysam-developers/pysam
+ license: MIT
+ summary: Pysam is a python module for reading and manipulating Samfiles. It's a lightweight wrapper of the samtools C-API. Pysam also includes an interface for tabix.
--- /dev/null
+#################################################################
+# Importing samtools and htslib
+#
+# For htslib, simply copy the whole release tar-ball
+# into the directory "htslib" and recreate the file version.h
+#
+# rm -rf htslib
+# mv download/htslib htslib
+# git checkout -- htslib/version.h
+# Edit the file htslib/version.h to set the right version number.
+#
+# For samtools, type:
+# rm -rf samtools
+# python import.py samtools download/samtools
+# git checkout -- samtools/version.h
+#
+# Manually, then:
+# modify config.h to set compatibility flags
+#
+# For bcftools, type:
+# rm -rf bcftools
+# python import.py bcftools download/bcftools
+# git checkout -- bcftools/version.h
+# rm -rf bcftools/test bcftools/plugins
+
+import fnmatch
+import os
+import re
+import itertools
+import shutil
+import sys
+import hashlib
+
+
+EXCLUDE = {
+ "samtools": (
+ "razip.c",
+ "bgzip.c",
+ "main.c",
+ "calDepth.c",
+ "bam2bed.c",
+ "wgsim.c",
+ "bam_tview.c",
+ "bam_tview.h",
+ "bam_tview_html.c",
+ "bam_tview_curses.c",
+ "md5fa.c",
+ "md5sum-lite.c",
+ "maq2sam.c",
+ "bamcheck.c",
+ "chk_indel.c",
+ "vcf-miniview.c",
+ "hfile_irods.c", # requires irods library
+ ),
+ "bcftools": (
+ "test", "plugins", "peakfit.c",
+ "peakfit.h",
+ # needs to be renamed, name conflict with samtools reheader
+ # "reheader.c",
+ "polysomy.c"),
+ "htslib": (
+ 'htslib/tabix.c', 'htslib/bgzip.c',
+ 'htslib/htsfile.c', 'htslib/hfile_irods.c'),
+}
+
+
+MAIN = {
+ "samtools": "bamtk",
+ "bcftools": "main"
+}
+
+
+
+def locate(pattern, root=os.curdir):
+ '''Locate all files matching supplied filename pattern in and below
+ supplied root directory.
+ '''
+ for path, dirs, files in os.walk(os.path.abspath(root)):
+ for filename in fnmatch.filter(files, pattern):
+ yield os.path.join(path, filename)
+
+
+def _update_pysam_files(cf, destdir):
+ '''update pysam files applying redirection of output'''
+ basename = os.path.basename(destdir)
+ for filename in cf:
+ if not filename:
+ continue
+ dest = filename + ".pysam.c"
+ with open(filename, encoding="utf-8") as infile:
+ lines = "".join(infile.readlines())
+
+ with open(dest, "w", encoding="utf-8") as outfile:
+ outfile.write('#include "{}.pysam.h"\n\n'.format(basename))
+ subname, _ = os.path.splitext(os.path.basename(filename))
+ if subname in MAIN.get(basename, []):
+ lines = re.sub(r"int main\(", "int {}_main(".format(
+ basename), lines)
+ else:
+ lines = re.sub(r"int main\(", "int {}_{}_main(".format(
+ basename, subname), lines)
+ lines = re.sub("stderr", "{}_stderr".format(basename), lines)
+ lines = re.sub("stdout", "{}_stdout".format(basename), lines)
+ lines = re.sub(r" printf\(", " fprintf({}_stdout, ".format(basename), lines)
+ lines = re.sub(r"([^kf])puts\(", r"\1{}_puts(".format(basename), lines)
+ lines = re.sub(r"putchar\(([^)]+)\)",
+ r"fputc(\1, {}_stdout)".format(basename), lines)
+
+ fn = os.path.basename(filename)
+ # some specific fixes:
+ SPECIFIC_SUBSTITUTIONS = {
+ "bam_md.c": (
+ 'sam_open_format("-", mode_w',
+ 'sam_open_format({}_stdout_fn, mode_w'.format(basename)),
+ "phase.c": (
+ 'putc("ACGT"[f->seq[j] == 1? (c&3, {}_stdout) : (c>>16&3)]);'.format(basename),
+ 'putc("ACGT"[f->seq[j] == 1? (c&3) : (c>>16&3)], {}_stdout);'.format(basename)),
+ "cut_target.c": (
+ 'putc(33 + (cns[j]>>8>>2, {}_stdout));'.format(basename),
+ 'putc(33 + (cns[j]>>8>>2), {}_stdout);'.format(basename))
+ }
+ if fn in SPECIFIC_SUBSTITUTIONS:
+ lines = lines.replace(
+ SPECIFIC_SUBSTITUTIONS[fn][0],
+ SPECIFIC_SUBSTITUTIONS[fn][1])
+ outfile.write(lines)
+
+ with open(os.path.join("import", "pysam.h")) as inf, \
+ open(os.path.join(destdir, "{}.pysam.h".format(basename)), "w") as outf:
+ outf.write(re.sub("@pysam@", basename, inf.read()))
+
+ with open(os.path.join("import", "pysam.c")) as inf, \
+ open(os.path.join(destdir, "{}.pysam.c".format(basename)), "w") as outf:
+ outf.write(re.sub("@pysam@", basename, inf.read()))
+
+
+if len(sys.argv) >= 1:
+ if len(sys.argv) != 3:
+ raise ValueError("import requires dest src")
+
+ dest, srcdir = sys.argv[1:3]
+ if dest not in EXCLUDE:
+ raise ValueError("import expected one of %s" %
+ ",".join(EXCLUDE.keys()))
+ exclude = EXCLUDE[dest]
+ destdir = os.path.abspath(dest)
+ srcdir = os.path.abspath(srcdir)
+ if not os.path.exists(srcdir):
+ raise IOError(
+ "source directory `%s` does not exist." % srcdir)
+
+ cfiles = locate("*.c", srcdir)
+ hfiles = locate("*.h", srcdir)
+ mfiles = itertools.chain(locate("README", srcdir), locate("LICENSE", srcdir))
+
+ # remove unwanted files and htslib subdirectory.
+ cfiles = [x for x in cfiles if os.path.basename(x) not in exclude
+ and not re.search("htslib-", x)]
+
+ hfiles = [x for x in hfiles if os.path.basename(x) not in exclude
+ and not re.search("htslib-", x)]
+
+ ncopied = 0
+
+ def _compareAndCopy(src, srcdir, destdir, exclude):
+
+ d, f = os.path.split(src)
+ common_prefix = os.path.commonprefix((d, srcdir))
+ subdir = re.sub(common_prefix, "", d)[1:]
+ targetdir = os.path.join(destdir, subdir)
+ if not os.path.exists(targetdir):
+ os.makedirs(targetdir)
+ old_file = os.path.join(targetdir, f)
+ if os.path.exists(old_file):
+ md5_old = hashlib.md5(
+ "".join(open(old_file, "r", encoding="utf-8").readlines()).encode()).digest()
+ md5_new = hashlib.md5(
+ "".join(open(src, "r", encoding="utf-8").readlines()).encode()).digest()
+ if md5_old != md5_new:
+ raise ValueError(
+ "incompatible files for %s and %s" %
+ (old_file, src))
+
+ shutil.copy(src, targetdir)
+ return old_file
+
+ for src_file in hfiles:
+ _compareAndCopy(src_file, srcdir, destdir, exclude)
+ ncopied += 1
+
+ for src_file in mfiles:
+ _compareAndCopy(src_file, srcdir, destdir, exclude)
+ ncopied += 1
+
+ cf = []
+ for src_file in cfiles:
+ cf.append(_compareAndCopy(src_file,
+ srcdir,
+ destdir,
+ exclude))
+ ncopied += 1
+
+ sys.stdout.write(
+ "installed latest source code from %s: "
+ "%i files copied\n" % (srcdir, ncopied))
+ # redirect stderr to pysamerr and replace bam.h with a stub.
+ sys.stdout.write("applying stderr redirection\n")
+
+ _update_pysam_files(cf, destdir)
+
+ sys.exit(0)
+
+
+# if len(sys.argv) >= 2 and sys.argv[1] == "refresh":
+# sys.stdout.write("refreshing latest source code from .c to .pysam.c")
+# # redirect stderr to pysamerr and replace bam.h with a stub.
+# sys.stdout.write("applying stderr redirection")
+# for destdir in ('samtools', ):
+# pysamcfiles = locate("*.pysam.c", destdir)
+# for f in pysamcfiles:
+# os.remove(f)
+# cfiles = locate("*.c", destdir)
+# _update_pysam_files(cfiles, destdir)
+
+# sys.exit(0)
+
--- /dev/null
+#!/usr/bin/env bash
+
+# function to detect the Operating System
+detect_os(){
+
+if [ -f /etc/os-release ]; then
+
+ OS=$(cat /etc/os-release | awk '/VERSION_ID/ {sub("="," "); print $2;}' | sed 's/\"//g' | awk '{sub("\\."," "); print $1;}')
+ if [ "$OS" != "12" ] ; then
+
+ echo
+ echo " Ubuntu version not supported "
+ echo
+ echo " Only Ubuntu 12.x has been tested so far "
+ echo
+ exit 1;
+
+ fi
+
+ OS="ubuntu"
+
+elif [ -f /etc/system-release ]; then
+
+ OS=$(cat /etc/system-release | awk ' {print $4;}' | awk '{sub("\\."," "); print $1;}')
+ if [ "$OS" != "6" ] ; then
+ echo
+ echo " Scientific Linux version not supported "
+ echo
+ echo " Only 6.x Scientific Linux has been tested so far "
+ echo
+ exit 1;
+ fi
+
+ OS="sl"
+
+else
+
+ echo
+ echo " Operating system not supported "
+ echo
+ echo " Exiting installation "
+ echo
+ exit 1;
+
+fi
+} # detect_os
+
+# message to display when the OS is not correct
+sanity_check_os() {
+ echo
+ echo " Unsupported operating system: $OS "
+ echo " Installation aborted "
+ echo
+ exit 1;
+} # sanity_check_os
+
+# function to install operating system dependencies
+install_os_packages() {
+
+if [ "$OS" == "ubuntu" -o "$OS" == "travis" ] ; then
+
+ echo
+ echo " Installing packages for Ubuntu "
+ echo
+
+ apt-get install -y gcc g++
+
+elif [ "$OS" == "sl" ] ; then
+
+ echo
+ echo " Installing packages for Scientific Linux "
+ echo
+
+ yum -y install gcc zlib-devel gcc-c++
+
+else
+
+ sanity_check_os
+
+fi # if-OS
+} # install_os_packages
+
+# function to install Python dependencies
+install_python_deps() {
+
+if [ "$OS" == "ubuntu" -o "$OS" == "sl" ] ; then
+
+ echo
+ echo " Installing Python dependencies for $1 "
+ echo
+
+ # Create virtual environment
+ cd
+ mkdir CGAT
+ cd CGAT
+ wget --no-check-certificate https://pypi.python.org/packages/source/v/virtualenv/virtualenv-1.10.1.tar.gz
+ tar xvfz virtualenv-1.10.1.tar.gz
+ rm virtualenv-1.10.1.tar.gz
+ cd virtualenv-1.10.1
+ python virtualenv.py cgat-venv
+ source cgat-venv/bin/activate
+
+ # Install Python prerequisites
+ pip install cython
+
+elif [ "$OS" == "travis" ] ; then
+ # Travis-CI provides a virtualenv with Python 2.7
+ echo
+ echo " Installing Python dependencies in travis "
+ echo
+
+ # Install Python prerequisites
+ pip install cython
+ pip install nose
+
+else
+
+ sanity_check_os
+
+fi # if-OS
+} # install_python_deps
+
+# common set of tasks to prepare external dependencies
+nosetests_external_deps() {
+echo
+echo " Running nosetests for $1 "
+echo
+
+pushd .
+
+# create a new folder to store external tools
+mkdir -p $HOME/CGAT/external-tools
+
+# install samtools
+cd $HOME/CGAT/external-tools
+curl -L http://downloads.sourceforge.net/project/samtools/samtools/1.3/samtools-1.3.tar.bz2 > samtools-1.3.tar.bz2
+tar xjf samtools-1.3.tar.bz2
+cd samtools-1.3
+make
+PATH=$PATH:$HOME/CGAT/external-tools/samtools-1.3
+
+echo "installed samtools"
+samtools --version
+
+if [ $? != 0 ]; then
+ exit 1
+fi
+
+# install bcftools
+cd $HOME/CGAT/external-tools
+curl -L https://github.com/samtools/bcftools/releases/download/1.3/bcftools-1.3.tar.bz2 > bcftools-1.3.tar.bz2
+tar xjf bcftools-1.3.tar.bz2
+cd bcftools-1.3
+make
+PATH=$PATH:$HOME/CGAT/external-tools/bcftools-1.3
+
+echo "installed bcftools"
+bcftools --version
+
+if [ $? != 0 ]; then
+ exit 1
+fi
+
+popd
+
+} # nosetests_external_deps
+
+
+# function to run nosetests
+run_nosetests() {
+
+echo
+echo " Running nosetests for $1 "
+echo
+
+# prepare external dependencies
+nosetests_external_deps $OS
+
+# install code
+python setup.py install
+
+# change into tests directory. Otherwise,
+# 'import pysam' will import the repository,
+# not the installed version. This causes
+# problems in the compilation test.
+cd tests
+
+# create auxiliary data
+echo
+echo 'building test data'
+echo
+make -C pysam_data all
+make -C cbcf_data all
+
+# run nosetests
+# -s: do not capture stdout, conflicts with pysam.dispatch
+# -v: verbose output
+nosetests -s -v
+
+} # run_nosetests
+
+# function to display help message
+help_message() {
+echo
+echo " Use this script as follows: "
+echo
+echo " 1) Become root and install the operating system* packages: "
+echo " ./install-CGAT-tools.sh --install-os-packages"
+echo
+echo " 2) Now, as a normal user (non root), install the Python dependencies**: "
+echo " ./install-CGAT-tools.sh --install-python-deps"
+echo
+echo " At this stage the CGAT Code Collection is ready to go and you do not need further steps. Please type the following for more information:"
+echo " source $HOME/CGAT/virtualenv-1.10.1/cgat-venv/bin/activate"
+echo " cgat --help "
+echo
+echo " The CGAT Code Collection tests the software with nosetests. If you are interested in running those, please continue with the following steps:"
+echo
+echo " 3) Become root to install external tools and set up the environment: "
+echo " ./install-CGAT-tools.sh --install-nosetests-deps"
+echo
+echo " 4) Then, back as a normal user (non root), run nosetests as follows:"
+echo " ./install-CGAT-tools.sh --run-nosetests"
+echo
+echo " NOTES: "
+echo " * Supported operating systems: Ubuntu 12.x and Scientific Linux 6.x "
+echo " ** An isolated virtual environment will be created to install Python dependencies "
+echo
+exit 1;
+}
+
+
+# the main script starts here
+
+if [ $# -eq 0 -o $# -gt 1 ] ; then
+
+ help_message
+
+else
+
+ if [ "$1" == "--help" ] ; then
+
+ help_message
+
+ elif [ "$1" == "--travis" ] ; then
+
+ OS="travis"
+ install_os_packages
+ install_python_deps
+ run_nosetests
+
+ elif [ "$1" == "--install-os-packages" ] ; then
+
+ detect_os
+ install_os_packages
+
+ elif [ "$1" == "--install-python-deps" ] ; then
+
+ detect_os
+ install_python_deps
+
+ elif [ "$1" == "--install-nosetests-deps" ] ; then
+
+ detect_os
+ install_nosetests_deps
+
+ elif [ "$1" == "--run-nosetests" ] ; then
+
+ detect_os
+ run_nosetests
+
+ else
+
+ echo
+ echo " Incorrect input parameter: $1 "
+ help_message
+
+ fi # if argument 1
+
+fi # if number of input parameters
+
--- /dev/null
+#!/usr/bin/env bash
+
+# test script for pysam.
+# The script performs the following tasks:
+# 1. Setup a conda environment and install dependencies via conda
+# 2. Build pysam via the conda recipe
+# 3. Build pysam via setup.py from repository
+# 4. Run tests on the setup.py version
+# 5. Additional build tests
+# 5.1 pip install with cython
+# 5.2 pip install without cython
+# 5.3 pip install without cython and without configure options
+
+pushd .
+
+WORKDIR=`pwd`
+
+#Install miniconda python
+if [ "$TRAVIS_OS_NAME" == "osx" ]; then
+ wget https://repo.continuum.io/miniconda/Miniconda3-latest-MacOSX-x86_64.sh -O Miniconda3.sh
+else
+ wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O Miniconda3.sh --no-check-certificate # Default OS versions are old and have SSL / CERT issues
+fi
+
+bash Miniconda3.sh -b
+
+# Create a new conda environment with the target python version
+~/miniconda3/bin/conda install conda-build -y
+~/miniconda3/bin/conda create -q -y --name testenv python=$CONDA_PY cython numpy pytest psutil pip
+
+# activate testenv environment
+source ~/miniconda3/bin/activate testenv
+
+conda config --add channels r
+conda config --add channels defaults
+conda config --add channels conda-forge
+conda config --add channels bioconda
+
+# pin versions, so that tests do not fail when pysam/htslib out of step
+# add htslib dependencies
+conda install -y "samtools=1.7" "bcftools=1.6" "htslib=1.7" xz curl bzip2
+
+# Need to make C compiler and linker use the anaconda includes and libraries:
+export PREFIX=~/miniconda3/
+export CFLAGS="-I${PREFIX}/include -L${PREFIX}/lib"
+export HTSLIB_CONFIGURE_OPTIONS="--disable-libcurl --disable-lzma"
+
+samtools --version
+htslib --version
+bcftools --version
+
+# Try building conda recipe first
+~/miniconda3/bin/conda-build devtools/conda-recipe/ --python=$CONDA_PY
+
+# install code from the repository via setup.py
+echo "installing via setup.py from repository"
+python setup.py install
+
+# create auxiliary data
+echo
+echo 'building test data'
+echo
+make -C tests/pysam_data
+make -C tests/cbcf_data
+
+# echo any limits that are in place
+ulimit -a
+
+# run tests
+pytest
+
+if [ $? != 0 ]; then
+ exit 1
+fi
+
+# build source tar-ball. Make sure to run 'build' target so that .pyx
+# files are cythonized.
+python setup.py build sdist
+
+if [ $? != 0 ]; then
+ exit 1
+fi
+
+# check for presence of config.h files
+echo "checking for presence of config.h files in tar-ball"
+tar -tvzf dist/pysam-*.tar.gz | grep "config.h$"
+
+if [ $? != 1 ]; then
+ echo "ERROR: found config.h in tar-ball"
+ tar -tvzf dist/pysam-*.tar.gz | grep "config.h$"
+ exit 1
+fi
+
+# test pip installation from tar-ball with cython
+echo "pip installing with cython"
+pip install --verbose --no-deps --no-binary=:all: dist/pysam-*.tar.gz
+
+if [ $? != 0 ]; then
+ exit 1
+fi
+
+# attempt pip installation without cython
+echo "pip installing without cython"
+~/miniconda3/bin/conda remove -y cython
+~/miniconda3/bin/conda list
+echo "python is" `which python`
+pip install --verbose --no-deps --no-binary=:all: --force-reinstall --upgrade dist/pysam-*.tar.gz
+
+if [ $? != 0 ]; then
+ exit 1
+fi
+
+# attempt pip installation without cython and without
+# command line options
+echo "pip installing without cython and no configure options"
+export HTSLIB_CONFIGURE_OPTIONS=""
+pip install --verbose --no-deps --no-binary=:all: --force-reinstall --upgrade dist/pysam-*.tar.gz
+
+if [ $? != 0 ]; then
+ exit 1
+fi
API
===
-SAM/BAM files
--------------
+SAM/BAM/CRAM files
+-------------------
Objects of type :class:`~pysam.AlignmentFile` allow working with
BAM/SAM formatted files.
.. autoclass:: pysam.VariantHeaderRecord
:members:
+
+HTSFile
+-------
+
+HTSFile is the base class for :class:`pysam.AlignmentFile` and
+:class:`pysam.VariantFile`.
+
+.. autoclass:: pysam.HTSFile
+ :members:
- ========
+========
Glossary
========
Release notes
=============
+Release 0.15.0
+==============
+
+This release wraps htslib/samtools/bcftools version 1.9.0.
+
+* [#673] permit dash in chromosome name of region string
+* [#656] Support `text` when opening a SAM file for writing
+* [#658] return None in get_forward_sequence if sequence not in record
+* [#683] allow lower case bases in MD tags
+* Ensure that = and X CIGAR ops are treated the same as M
+
Release 0.14.1
==============
+++ /dev/null
-#################################################################
-# Importing samtools and htslib
-#
-# For htslib, simply copy the whole release tar-ball
-# into the directory "htslib" and recreate the file version.h
-#
-# rm -rf htslib
-# mv download/htslib htslib
-# git checkout -- htslib/version.h
-# Edit the file htslib/version.h to set the right version number.
-#
-# For samtools, type:
-# rm -rf samtools
-# python import.py samtools download/samtools
-# git checkout -- samtools/version.h
-#
-# Manually, then:
-# modify config.h to set compatibility flags
-#
-# For bcftools, type:
-# rm -rf bcftools
-# python import.py bcftools download/bedtools
-# git checkout -- bcftools/version.h
-# rm -rf bedtools/test bedtools/plugins
-
-import fnmatch
-import os
-import re
-import itertools
-import shutil
-import sys
-import hashlib
-
-
-EXCLUDE = {
- "samtools": (
- "razip.c",
- "bgzip.c",
- "main.c",
- "calDepth.c",
- "bam2bed.c",
- "wgsim.c",
- "bam_tview.c",
- "bam_tview.h",
- "bam_tview_html.c",
- "bam_tview_curses.c",
- "md5fa.c",
- "md5sum-lite.c",
- "maq2sam.c",
- "bamcheck.c",
- "chk_indel.c",
- "vcf-miniview.c",
- "hfile_irods.c", # requires irods library
- ),
- "bcftools": (
- "test", "plugins", "peakfit.c",
- "peakfit.h",
- # needs to renamed, name conflict with samtools reheader
- # "reheader.c",
- "polysomy.c"),
- "htslib": (
- 'htslib/tabix.c', 'htslib/bgzip.c',
- 'htslib/htsfile.c', 'htslib/hfile_irods.c'),
-}
-
-
-MAIN = {
- "samtools": "bamtk",
- "bcftools": "main"
-}
-
-
-
-def locate(pattern, root=os.curdir):
- '''Locate all files matching supplied filename pattern in and below
- supplied root directory.
- '''
- for path, dirs, files in os.walk(os.path.abspath(root)):
- for filename in fnmatch.filter(files, pattern):
- yield os.path.join(path, filename)
-
-
-def _update_pysam_files(cf, destdir):
- '''update pysam files applying redirection of ouput'''
- basename = os.path.basename(destdir)
- for filename in cf:
- if not filename:
- continue
- dest = filename + ".pysam.c"
- with open(filename, encoding="utf-8") as infile:
- lines = "".join(infile.readlines())
-
- with open(dest, "w", encoding="utf-8") as outfile:
- outfile.write('#include "{}.pysam.h"\n\n'.format(basename))
- subname, _ = os.path.splitext(os.path.basename(filename))
- if subname in MAIN.get(basename, []):
- lines = re.sub("int main\(", "int {}_main(".format(
- basename), lines)
- else:
- lines = re.sub("int main\(", "int {}_{}_main(".format(
- basename, subname), lines)
- lines = re.sub("stderr", "{}_stderr".format(basename), lines)
- lines = re.sub("stdout", "{}_stdout".format(basename), lines)
- lines = re.sub(" printf\(", " fprintf({}_stdout, ".format(basename), lines)
- lines = re.sub("([^kf])puts\(([^)]+)\)",
- r"\1fputs(\2, {}_stdout) & fputc('\\n', {}_stdout)".format(basename, basename),
- lines)
- lines = re.sub("putchar\(([^)]+)\)",
- r"fputc(\1, {}_stdout)".format(basename), lines)
-
- fn = os.path.basename(filename)
- # some specific fixes:
- SPECIFIC_SUBSTITUTIONS = {
- "bam_md.c": (
- 'sam_open_format("-", mode_w',
- 'sam_open_format({}_stdout_fn, mode_w'.format(basename)),
- "phase.c": (
- 'putc("ACGT"[f->seq[j] == 1? (c&3, {}_stdout) : (c>>16&3)]);'.format(basename),
- 'putc("ACGT"[f->seq[j] == 1? (c&3) : (c>>16&3)], {}_stdout);'.format(basename)),
- "cut_target.c": (
- 'putc(33 + (cns[j]>>8>>2, {}_stdout));'.format(basename),
- 'putc(33 + (cns[j]>>8>>2), {}_stdout);'.format(basename))
- }
- if fn in SPECIFIC_SUBSTITUTIONS:
- lines = lines.replace(
- SPECIFIC_SUBSTITUTIONS[fn][0],
- SPECIFIC_SUBSTITUTIONS[fn][1])
- outfile.write(lines)
-
- with open(os.path.join("import", "pysam.h")) as inf, \
- open(os.path.join(destdir, "{}.pysam.h".format(basename)), "w") as outf:
- outf.write(re.sub("@pysam@", basename, inf.read()))
-
- with open(os.path.join("import", "pysam.c")) as inf, \
- open(os.path.join(destdir, "{}.pysam.c".format(basename)), "w") as outf:
- outf.write(re.sub("@pysam@", basename, inf.read()))
-
-
-if len(sys.argv) >= 1:
- if len(sys.argv) != 3:
- raise ValueError("import requires dest src")
-
- dest, srcdir = sys.argv[1:3]
- if dest not in EXCLUDE:
- raise ValueError("import expected one of %s" %
- ",".join(EXCLUDE.keys()))
- exclude = EXCLUDE[dest]
- destdir = os.path.abspath(dest)
- srcdir = os.path.abspath(srcdir)
- if not os.path.exists(srcdir):
- raise IOError(
- "source directory `%s` does not exist." % srcdir)
-
- cfiles = locate("*.c", srcdir)
- hfiles = locate("*.h", srcdir)
- mfiles = itertools.chain(locate("README", srcdir), locate("LICENSE", srcdir))
-
- # remove unwanted files and htslib subdirectory.
- cfiles = [x for x in cfiles if os.path.basename(x) not in exclude
- and not re.search("htslib-", x)]
-
- hfiles = [x for x in hfiles if os.path.basename(x) not in exclude
- and not re.search("htslib-", x)]
-
- ncopied = 0
-
- def _compareAndCopy(src, srcdir, destdir, exclude):
-
- d, f = os.path.split(src)
- common_prefix = os.path.commonprefix((d, srcdir))
- subdir = re.sub(common_prefix, "", d)[1:]
- targetdir = os.path.join(destdir, subdir)
- if not os.path.exists(targetdir):
- os.makedirs(targetdir)
- old_file = os.path.join(targetdir, f)
- if os.path.exists(old_file):
- md5_old = hashlib.md5(
- "".join(open(old_file, "r", encoding="utf-8").readlines()).encode()).digest()
- md5_new = hashlib.md5(
- "".join(open(src, "r", encoding="utf-8").readlines()).encode()).digest()
- if md5_old != md5_new:
- raise ValueError(
- "incompatible files for %s and %s" %
- (old_file, src))
-
- shutil.copy(src, targetdir)
- return old_file
-
- for src_file in hfiles:
- _compareAndCopy(src_file, srcdir, destdir, exclude)
- ncopied += 1
-
- for src_file in mfiles:
- _compareAndCopy(src_file, srcdir, destdir, exclude)
- ncopied += 1
-
- cf = []
- for src_file in cfiles:
- cf.append(_compareAndCopy(src_file,
- srcdir,
- destdir,
- exclude))
- ncopied += 1
-
- sys.stdout.write(
- "installed latest source code from %s: "
- "%i files copied\n" % (srcdir, ncopied))
- # redirect stderr to pysamerr and replace bam.h with a stub.
- sys.stdout.write("applying stderr redirection\n")
-
- _update_pysam_files(cf, destdir)
-
- sys.exit(0)
-
-
-# if len(sys.argv) >= 2 and sys.argv[1] == "refresh":
-# sys.stdout.write("refreshing latest source code from .c to .pysam.c")
-# # redirect stderr to pysamerr and replace bam.h with a stub.
-# sys.stdout.write("applying stderr redirection")
-# for destdir in ('samtools', ):
-# pysamcfiles = locate("*.pysam.c", destdir)
-# for f in pysamcfiles:
-# os.remove(f)
-# cfiles = locate("*.c", destdir)
-# _update_pysam_files(cfiles, destdir)
-
-# sys.exit(0)
-
@pysam@_stdout_fileno = STDOUT_FILENO;
}
+int @pysam@_puts(const char *s)
+{
+ if (fputs(s, @pysam@_stdout) == EOF) return EOF;
+ return putc('\n', @pysam@_stdout);
+}
+
void @pysam@_set_optind(int val)
{
// setting this in cython via
*/
void @pysam@_unset_stdout(void);
+int @pysam@_puts(const char *s);
+
int @pysam@_dispatch(int argc, char *argv[]);
void @pysam@_set_optind(int);
cdef char * md_tag = <char*>bam_aux2Z(md_tag_ptr)
cdef int md_idx = 0
+ cdef char c
s_idx = 0
# Check if MD tag is valid by matching CIGAR length to MD tag defined length
s_idx += 1
md_idx += 1
else:
- # save mismatch and change to lower case
- s[s_idx] = md_tag[md_idx] + 32
+ # save mismatch
+ # enforce lower case
+ c = md_tag[md_idx]
+ if c <= 90:
+ c += 32
+ s[s_idx] = c
s_idx += 1
r_idx += 1
md_idx += 1
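The case handling in this hunk can be sketched in plain Python. The helper name `to_lower_ascii` is hypothetical and not part of pysam; like the `c <= 90` test above, it assumes the value is the ASCII code of a letter.

```python
def to_lower_ascii(code):
    """Return the lower-case ASCII code for a letter; lower case is left untouched."""
    # 'A'..'Z' are 65..90; adding 32 maps them onto 'a'..'z' (97..122).
    # Assumption: the input is an ASCII letter, as in an MD tag mismatch.
    if code <= 90:
        code += 32
    return code


# An upper-case and a lower-case mismatch base map to the same character.
print(chr(to_lower_ascii(ord("G"))), chr(to_lower_ascii(ord("g"))))
```

The old code added 32 unconditionally, which would push an already lower-case base outside the letter range.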
if _full:
for i from 0 <= i < l:
result.append(None)
- elif op == BAM_CMATCH:
+ elif op == BAM_CMATCH or op == BAM_CEQUAL or op == BAM_CDIFF:
for i from pos <= i < pos + l:
result.append(i)
pos += l
Reads mapping to the reverse strand will be reverse
complemented.
+
+ Returns None if the record has no query sequence.
"""
+ if self.query_sequence is None:
+ return None
s = force_str(self.query_sequence)
if self.is_reverse:
s = s.translate(maketrans("ACGTacgtNnXx", "TGCAtgcaNnXx"))[::-1]
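A minimal sketch of the behaviour after this change, written as a free function so it runs without pysam. The name `forward_sequence` and its two parameters are hypothetical stand-ins for the record's `query_sequence` and `is_reverse` attributes.

```python
def forward_sequence(query_sequence, is_reverse):
    """Return the read as sequenced, or None if the record stores no sequence."""
    if query_sequence is None:
        return None
    if is_reverse:
        # Complement each base, then reverse the string.
        table = str.maketrans("ACGTacgtNnXx", "TGCAtgcaNnXx")
        return query_sequence.translate(table)[::-1]
    return query_sequence


print(forward_sequence("AACG", True))
print(forward_sequence(None, True))
```

Callers that previously relied on an exception for records without a sequence would need to test for `None` instead.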
for k from 0 <= k < pysam_get_n_cigar(src):
op = cigar_p[k] & BAM_CIGAR_MASK
l = cigar_p[k] >> BAM_CIGAR_SHIFT
- if op == BAM_CMATCH:
+ if op == BAM_CMATCH or op == BAM_CEQUAL or op == BAM_CDIFF:
result.append((pos, pos + l))
pos += l
elif op == BAM_CDEL or op == BAM_CREF_SKIP:
op = cigar_p[k] & BAM_CIGAR_MASK
l = cigar_p[k] >> BAM_CIGAR_SHIFT
- if op == BAM_CMATCH:
+ if op == BAM_CMATCH or op == BAM_CEQUAL or op == BAM_CDIFF:
o = min( pos + l, end) - max( pos, start )
if o > 0: overlap += o
- if op == BAM_CMATCH or op == BAM_CDEL or op == BAM_CREF_SKIP:
+ if op == BAM_CMATCH or op == BAM_CDEL or op == BAM_CREF_SKIP or op == BAM_CEQUAL or op == BAM_CDIFF:
pos += l
return overlap
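The overlap computation with `=` and `X` treated like `M` can be sketched in plain Python. The function name `overlap` and its argument layout are assumptions for illustration; the numeric operation codes follow the BAM encoding (M=0, D=2, N=3, `=`=7, X=8).

```python
BAM_CMATCH, BAM_CDEL, BAM_CREF_SKIP = 0, 2, 3
BAM_CEQUAL, BAM_CDIFF = 7, 8
ALIGNED = (BAM_CMATCH, BAM_CEQUAL, BAM_CDIFF)


def overlap(cigartuples, reference_start, start, end):
    """Count aligned reference bases of a read that fall within [start, end)."""
    pos = reference_start
    total = 0
    for op, length in cigartuples:
        if op in ALIGNED:
            o = min(pos + length, end) - max(pos, start)
            if o > 0:
                total += o
        # Aligned, deleted and skipped bases all advance the reference position.
        if op in ALIGNED or op in (BAM_CDEL, BAM_CREF_SKIP):
            pos += length
    return total


# 5M 3D 5X starting at reference position 0: ten aligned bases in total.
print(overlap([(BAM_CMATCH, 5), (BAM_CDEL, 3), (BAM_CDIFF, 5)], 0, 0, 100))
```

Before this change, a read whose CIGAR used only `=` and `X` would have contributed no overlap and no position advance.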
header=None, add_sq_text=False, check_header=True, check_sq=True,
reference_filename=None, filename=None, index_filename=None,
filepath_index=None, require_index=False, duplicate_filehandle=True,
- ignore_truncation=False)
+ ignore_truncation=False, threads=1)
A :term:`SAM`/:term:`BAM`/:term:`CRAM` formatted file.
format_options: list
A list of key=value strings, as accepted by --input-fmt-option and
--output-fmt-option in samtools.
+ threads: integer
+ Number of threads to use for compressing/decompressing BAM/CRAM files.
+ Setting threads to > 1 cannot be combined with `ignore_truncation`.
+ (Default=1)
"""
def __cinit__(self, *args, **kwargs):
self.htsfile = NULL
self.filename = None
self.mode = None
+ self.threads = 1
self.is_stream = False
self.is_remote = False
self.index = NULL
referencelengths=None,
duplicate_filehandle=True,
ignore_truncation=False,
- format_options=None):
+ format_options=None,
+ threads=1):
'''open a sam, bam or cram formatted file.
If _open is called on an existing file, the current file
cdef char *cindexname = NULL
cdef char *cmode = NULL
cdef bam_hdr_t * hdr = NULL
-
+
+ if threads > 1 and ignore_truncation:
+ # This won't raise errors if reaching a truncated alignment,
+ # because bgzf_mt_reader in htslib does not deal with
+ # bgzf_mt_read_block returning non-zero values, contrary
+ # to bgzf_read (https://github.com/samtools/htslib/blob/1.7/bgzf.c#L888)
+ # Better to avoid this (for now) than to produce seemingly correct results.
+ raise ValueError('Cannot add extra threads when "ignore_truncation" is True')
+ self.threads = threads
+
# for backwards compatibility:
if referencenames is not None:
reference_names = referencenames
if mode[0] == 'w':
# open file for writing
- if not (template or header or reference_names):
+ if not (template or header or text or (reference_names and reference_lengths)):
raise ValueError(
- "either supply options `template`, `header` or both `reference_names` "
+ "either supply options `template`, `header`, `text` or both `reference_names` "
"and `reference_lengths` for writing")
if template:
else:
raise ValueError("not enough information to construct header. Please provide template, "
"header, text or reference_names/reference_lengths")
-
self.htsfile = self._open_htsfile()
if self.htsfile == NULL:
# is given, the CRAM reference arrays will be built from
# the @SQ header in the header
if "c" in mode and reference_filename:
- # note that fn_aux takes ownership, so create a copy
- self.htsfile.fn_aux = strdup(self.reference_filename)
+ if (hts_set_fai_filename(self.htsfile, self.reference_filename) != 0):
+ raise ValueError("failure when setting reference filename")
# write header to htsfile
if "b" in mode or "c" in mode or "h" in mode:
end=None):
"""fetch reads aligned in a :term:`region`.
- See :meth:`AlignmentFile.parse_region` for more information
- on genomic regions. :term:`reference` and `end` are also accepted for
- backward compatiblity as synonyms for :term:`contig` and `stop`,
- respectively.
+ See :meth:`~pysam.HTSFile.parse_region` for more information
+ on how genomic regions can be specified. :term:`reference` and
+ `end` are also accepted for backward compatibility as synonyms
+ for :term:`contig` and `stop`, respectively.
Without a `contig` or `region` all mapped reads in the file
will be fetched. The reads will be returned ordered by reference
file. This mode of iteration still requires an index. If there is
no index, use `until_eof=True`.
- If only `reference` is set, all reads aligned to `reference`
+ If only `contig` is set, all reads aligned to `contig`
will be fetched.
A :term:`SAM` file does not allow random access. If `region`
or `contig` are given, an exception is raised.
- :class:`~pysam.FastaFile`
- :class:`~pysam.IteratorRow`
- :class:`~pysam.IteratorRow`
- :class:`~IteratorRow`
- :class:`IteratorRow`
-
Parameters
----------
cdef class VariantFile(HTSFile):
"""*(filename, mode=None, index_filename=None, header=None, drop_samples=False,
- duplicate_filehandle=True, ignore_truncation=False)*
+ duplicate_filehandle=True, ignore_truncation=False, threads=1)*
A :term:`VCF`/:term:`BCF` formatted file. The file is automatically
opened.
appears to be truncated due to a missing EOF marker. Only applies
to bgzipped formats. (Default=False)
+ threads: integer
+ Number of threads to use for compressing/decompressing VCF/BCF files.
+ Setting threads to > 1 cannot be combined with `ignore_truncation`.
+ (Default=1)
+
"""
def __cinit__(self, *args, **kwargs):
self.htsfile = NULL
self.index = None
self.filename = None
self.mode = None
+ self.threads = 1
self.index_filename = None
self.is_stream = False
self.is_remote = False
vars.filename = self.filename
vars.mode = self.mode
+ vars.threads = self.threads
vars.index_filename = self.index_filename
vars.drop_samples = self.drop_samples
vars.is_stream = self.is_stream
VariantHeader header=None,
drop_samples=False,
duplicate_filehandle=True,
- ignore_truncation=False):
+ ignore_truncation=False,
+ threads=1):
"""open a vcf/bcf file.
If open is called on an existing VariantFile, the current file will be
cdef char *cindex_filename = NULL
cdef char *cmode
+ if threads > 1 and ignore_truncation:
+ # This won't raise errors if reaching a truncated alignment,
+ # because bgzf_mt_reader in htslib does not deal with
+ # bgzf_mt_read_block returning non-zero values, contrary
+ # to bgzf_read (https://github.com/samtools/htslib/blob/1.7/bgzf.c#L888)
+ # Better to avoid this (for now) than to produce seemingly correct results.
+ raise ValueError('Cannot add extra threads when "ignore_truncation" is True')
+ self.threads = threads
+
# close a previously opened file
if self.is_open:
self.close()
cdef readonly object filename # filename as supplied by user
cdef readonly object mode # file opening mode
+ cdef readonly object threads # number of threads to use
cdef readonly object index_filename # filename of index, if supplied by user
cdef readonly bint is_stream # Is htsfile a non-seekable stream
"""
Base class for HTS file types
"""
+
def __cinit__(self, *args, **kwargs):
self.htsfile = NULL
+ self.threads = 1
self.duplicate_filehandle = True
def close(self):
cdef htsFile *_open_htsfile(self) except? NULL:
cdef char *cfilename
cdef char *cmode = self.mode
- cdef int fd, dup_fd
+ cdef int fd, dup_fd, threads
+ threads = self.threads - 1
if isinstance(self.filename, bytes):
cfilename = self.filename
with nogil:
- return hts_open(cfilename, cmode)
+ htsfile = hts_open(cfilename, cmode)
+ if htsfile != NULL:
+ hts_set_threads(htsfile, threads)
+ return htsfile
else:
if isinstance(self.filename, int):
fd = self.filename
filename = encode_filename(filename)
cfilename = filename
with nogil:
- return hts_hopen(hfile, cfilename, cmode)
+ htsfile = hts_hopen(hfile, cfilename, cmode)
+ if htsfile != NULL:
+ hts_set_threads(htsfile, threads)
+ return htsfile
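
The thread handling added here follows two small rules. The value passed to `hts_set_threads` is `self.threads - 1`, and `VariantFile.open` rejects `threads > 1` when `ignore_truncation` is set. The following is a minimal pure-Python sketch of those rules, not pysam code: `extra_threads` is a hypothetical helper name, and reading `threads - 1` as "additional worker threads beyond the calling one" is an assumption drawn from this diff.

```python
def extra_threads(threads=1, ignore_truncation=False):
    """Return the number of additional threads to request (hypothetical helper)."""
    if threads > 1 and ignore_truncation:
        # Mirrors the guard added to VariantFile.open in this change.
        raise ValueError('Cannot add extra threads when "ignore_truncation" is True')
    # Mirrors `threads = self.threads - 1` in HTSFile._open_htsfile.
    return threads - 1
```

With the default `threads=1` the sketch returns 0, which corresponds to requesting no extra threads.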
def add_hts_options(self, format_options=None):
"""Given a list of key=value format option strings, add them to an open htsFile
raise RuntimeError('An error occured while applying the requested format options')
hts_opt_free(opts)
- def parse_region(self, contig=None, start=None, stop=None, region=None,tid=None,
+ def parse_region(self, contig=None, start=None, stop=None,
+ region=None, tid=None,
reference=None, end=None):
"""parse alternative ways to specify a genomic region. A region can
either be specified by :term:`contig`, `start` and
`stop`. `start` and `stop` denote 0-based, half-open
- intervals. :term:`reference` and `end` are also accepted for
+ intervals. :term:`reference` and `end` are also accepted for
backward compatiblity as synonyms for :term:`contig` and
`stop`, respectively.
if region:
region = force_str(region)
- parts = re.split('[:-]', region)
- contig = parts[0]
- if len(parts) >= 2:
- rstart = int(parts[1]) - 1
- if len(parts) >= 3:
- rstop = int(parts[2])
+ if ":" in region:
+ contig, coord = region.split(":")
+ parts = coord.split("-")
+ rstart = int(parts[0]) - 1
+ if len(parts) >= 2:
+ rstop = int(parts[1])
+ else:
+ contig = region
if tid is not None:
if not self.is_valid_tid(tid):
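
The region parsing above splits on `:` first and on `-` only within the coordinate part, which is what permits dashes in contig names. Below is a minimal standalone sketch of that approach, assuming 1-based inclusive input coordinates converted to a 0-based half-open start. `split_region` is a hypothetical name and is not part of pysam. The sketch checks that a second coordinate is present before reading the stop.

```python
def split_region(region):
    """Split 'contig[:start[-stop]]' into (contig, start, stop).

    start is returned 0-based; stop is None when the string gives no stop.
    """
    start, stop = 0, None
    if ":" in region:
        # Split off the coordinates first so that '-' in the contig name is kept.
        contig, coord = region.split(":")
        parts = coord.split("-")
        start = int(parts[0]) - 1
        if len(parts) >= 2:
            stop = int(parts[1])
    else:
        contig = region
    return contig, start, stop
```

Like the code in the diff, this sketch raises `ValueError` for contig names that themselves contain a colon.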
The encoding passed to the parser
+ threads: integer
+ Number of threads to use for decompressing Tabix files.
+ (Default=1)
+
+
Raises
------
parser=None,
index=None,
encoding="ascii",
+ threads=1,
*args,
**kwargs ):
self.is_remote = False
self.is_stream = False
self.parser = parser
+ self.threads = threads
self._open(filename, mode, index, *args, **kwargs)
self.encoding = encoding
- def _open( self,
+ def _open( self,
filename,
mode='r',
index=None,
+ threads=1,
):
'''open a :term:`tabix file` for reading.'''
if self.htsfile != NULL:
self.close()
self.htsfile = NULL
+ self.threads = threads
filename_index = index or (filename + ".tbi")
# encode all the strings to pass to tabix
The file is being re-opened.
'''
return TabixFile(self.filename,
- mode="r",
+ mode="r",
+ threads=self.threads,
parser=self.parser,
index=self.filename_index,
encoding=self.encoding)
cimport cython
from cpython cimport array as c_array
-cpdef parse_region(reference=*, start=*, end=*, region=*)
+cpdef parse_region(contig=*, start=*, stop=*, region=*, reference=*, end=*)
#########################################################################
# Utility functions for quality string conversions
return s
-cpdef parse_region(reference=None,
+cpdef parse_region(contig=None,
start=None,
- end=None,
- region=None):
+ stop=None,
+ region=None,
+ reference=None,
+ end=None):
"""parse alternative ways to specify a genomic region. A region can
either be specified by :term:`reference`, `start` and
`end`. `start` and `end` denote 0-based, half-open intervals.
+
+ :term:`reference` and `end` are also accepted for backward
+ compatibility as synonyms for :term:`contig` and `stop`,
+ respectively.
Alternatively, a samtools :term:`region` string can be supplied.
cdef long long rstart
cdef long long rend
+ if contig is not None:
+ reference = contig
+ if stop is not None:
+ end = stop
+
rstart = 0
rend = MAX_POS
if start != None:
if region:
region = force_str(region)
- parts = re.split("[:-]", region)
- reference = parts[0]
- if len(parts) >= 2:
- rstart = int(parts[1]) - 1
- if len(parts) >= 3:
- rend = int(parts[2])
+ if ":" in region:
+ reference, coord = region.split(":")
+ parts = coord.split("-")
+ rstart = int(parts[0]) - 1
+ if len(parts) >= 2:
+ rend = int(parts[1])
+ else:
+ reference = region
if not reference:
return None, 0, 0
# pysam versioning information
-__version__ = "0.14.1"
+__version__ = "0.15.0"
# TODO: upgrade number
-__samtools_version__ = "1.7"
+__samtools_version__ = "1.9"
# TODO: upgrade code and number
-__bcftools_version__ = "1.6"
+__bcftools_version__ = "1.9"
-__htslib_version__ = "1.7"
+__htslib_version__ = "1.9"
+++ /dev/null
-#!/usr/bin/env bash
-
-# test script for pysam.
-# The script performs the following tasks:
-# 1. Setup a conda environment and install dependencies via conda
-# 2. Build pysam via the conda recipe
-# 3. Build pysam via setup.py from repository
-# 4. Run tests on the setup.py version
-# 5. Additional build tests
-# 5.1 pip install with cython
-# 5.2 pip install without cython
-# 5.3 pip install without cython and without configure options
-
-pushd .
-
-WORKDIR=`pwd`
-
-#Install miniconda python
-if [ $TRAVIS_OS_NAME == "osx" ]; then
- wget https://repo.continuum.io/miniconda/Miniconda3-latest-MacOSX-x86_64.sh -O Miniconda3.sh
-else
- wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O Miniconda3.sh --no-check-certificate # Default OS versions are old and have SSL / CERT issues
-fi
-
-bash Miniconda3.sh -b
-
-# Create a new conda environment with the target python version
-~/miniconda3/bin/conda install conda-build -y
-~/miniconda3/bin/conda create -q -y --name testenv python=$CONDA_PY cython numpy pytest psutil pip
-
-# activate testenv environment
-source ~/miniconda3/bin/activate testenv
-
-conda config --add channels r
-conda config --add channels defaults
-conda config --add channels conda-forge
-conda config --add channels bioconda
-
-# pin versions, so that tests do not fail when pysam/htslib out of step
-# add htslib dependencies
-conda install -y "samtools=1.7" "bcftools=1.6" "htslib=1.7" xz curl bzip2
-
-# Need to make C compiler and linker use the anaconda includes and libraries:
-export PREFIX=~/miniconda3/
-export CFLAGS="-I${PREFIX}/include -L${PREFIX}/lib"
-export HTSLIB_CONFIGURE_OPTIONS="--disable-libcurl --disable-lzma"
-
-samtools --version
-htslib --version
-bcftools --version
-
-# Try building conda recipe first
-~/miniconda3/bin/conda-build ci/conda-recipe/ --python=$CONDA_PY
-
-# install code from the repository via setup.py
-echo "installing via setup.py from repository"
-python setup.py install
-
-# create auxilliary data
-echo
-echo 'building test data'
-echo
-make -C tests/pysam_data
-make -C tests/cbcf_data
-
-# echo any limits that are in place
-ulimit -a
-
-# run tests
-pytest
-
-if [ $? != 0 ]; then
- exit 1
-fi
-
-# build source tar-ball. Make sure to run 'build' target so that .pyx
-# files are cythonized.
-python setup.py build sdist
-
-if [ $? != 0 ]; then
- exit 1
-fi
-
-# check for presence of config.h files
-echo "checking for presence of config.h files in tar-ball"
-tar -tvzf dist/pysam-*.tar.gz | grep "config.h$"
-
-if [ $? != 1 ]; then
- echo "ERROR: found config.h in tar-ball"
- tar -tvzf dist/pysam-*.tar.gz | grep "config.h%"
- exit 1
-fi
-
-# test pip installation from tar-ball with cython
-echo "pip installing with cython"
-pip install --verbose --no-deps --no-binary=:all: dist/pysam-*.tar.gz
-
-if [ $? != 0 ]; then
- exit 1
-fi
-
-# attempt pip installation without cython
-echo "pip installing without cython"
-~/miniconda3/bin/conda remove -y cython
-~/miniconda3/bin/conda list
-echo "python is" `which python`
-pip install --verbose --no-deps --no-binary=:all: --force-reinstall --upgrade dist/pysam-*.tar.gz
-
-if [ $? != 0 ]; then
- exit 1
-fi
-
-# attempt pip installation without cython and without
-# command line options
-echo "pip installing without cython and no configure options"
-export HTSLIB_CONFIGURE_OPTIONS=""
-pip install --verbose --no-deps --no-binary=:all: --force-reinstall --upgrade dist/pysam-*.tar.gz
-
-if [ $? != 0 ]; then
- exit 1
-fi
The MIT/Expat License
-Copyright (C) 2008-2014 Genome Research Ltd.
+Copyright (C) 2008-2018 Genome Research Ltd.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
The typical simple case of building Samtools using the HTSlib bundled within
this Samtools release tarball is done as follows:
- cd .../samtools-1.7 # Within the unpacked release directory
+ cd .../samtools-1.9 # Within the unpacked release directory
./configure
make
installation using the HTSlib bundled within this Samtools release tarball,
and building the various HTSlib utilities such as bgzip is done as follows:
- cd .../samtools-1.7 # Within the unpacked release directory
+ cd .../samtools-1.9 # Within the unpacked release directory
./configure --prefix=/path/to/location
make all all-htslib
make install install-htslib
See INSTALL for full building and installation instructions and details.
+Building with HTSlib plug-in support
+====================================
+
+Enabling plug-ins causes some parts of HTSlib to be built as separate modules.
+There are two advantages to this:
+
+ * The static library libhts.a has fewer dependencies, which makes linking
+ third-party code against it easier.
+
+ * It is possible to build extra plug-ins in addition to the ones that are
+ bundled with HTSlib. For example, the htslib-plugins repository
+ <https://github.com/samtools/htslib-plugins> includes a module that
+ allows direct access to files stored in an iRODS data management
+ repository (see <https://irods.org/>).
+
+To build with plug-ins, you need to use the --enable-plugins configure option
+as follows:
+
+ cd .../samtools-1.9 # Within the unpacked release directory
+ ./configure --enable-plugins --prefix=/path/to/location
+ make all all-htslib
+ make install install-htslib
+
+There are two other configure options that affect plug-ins. These are:
+ --with-plugin-dir=DIR plug-in installation location
+ --with-plugin-path=PATH default plug-in search path
+
+The default for --with-plugin-dir is <prefix>/libexec/htslib.
+--with-plugin-path sets the built-in search path used to find the plug-ins. By
+default this is the directory set by the --with-plugin-dir option. Multiple
+directories should be separated by colons.
+
+Setting --with-plugin-path is useful if you want to run directly from
+the source distribution instead of installing the package. In that case
+you can use:
+
+ cd .../samtools-1.9 # Within the unpacked release directory
+ ./configure --enable-plugins --with-plugin-path=$PWD/htslib-1.9
+ make all all-htslib
+
+It is possible to override the built-in search path using the HTS_PATH
+environment variable. Directories should be separated by colons. To
+include the built-in path, add an empty entry to HTS_PATH:
+
+ export HTS_PATH=:/my/path # Search built-in path first
+ export HTS_PATH=/my/path: # Search built-in path last
+ export HTS_PATH=/my/path1::/my/path2 # Search built-in path between others
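
The empty-entry rule for HTS_PATH can be illustrated with a short sketch. This is not HTSlib's implementation. It only models the substitution described above, in which an empty entry in the colon-separated list stands for the built-in search path. `expand_hts_path` and the example built-in directory are hypothetical.

```python
def expand_hts_path(hts_path, builtin_path):
    """Expand a colon-separated HTS_PATH, replacing empty entries by the built-in path."""
    builtin_dirs = [d for d in builtin_path.split(":") if d]
    expanded = []
    for entry in hts_path.split(":"):
        if entry == "":
            # An empty entry marks where the built-in search path is inserted.
            expanded.extend(builtin_dirs)
        else:
            expanded.append(entry)
    return expanded
```

For example, `expand_hts_path("/my/path:", "/usr/libexec/htslib")` lists `/my/path` first and the built-in directory last, matching the second `export` line above.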
Using an optimised zlib library
===============================
char *s = bam_format1(header, b);
int ret = -1;
if (!s) return -1;
- if (fputs(s, samtools_stdout) & fputc('\n', samtools_stdout) != EOF) ret = 0;
+ if (samtools_puts(s) != EOF) ret = 0;
free(s);
return ret;
}
@copyright Genome Research Ltd.
*/
-#define BAM_VERSION "1.7"
+#define BAM_VERSION "1.9"
#include <stdint.h>
#include <stdlib.h>
fprintf(stderr, " -b <bed> list of positions or regions\n");
fprintf(stderr, " -f <list> list of input BAM filenames, one per line [null]\n");
fprintf(stderr, " -l <int> read length threshold (ignore reads shorter than <int>) [0]\n");
- fprintf(stderr, " -d/-m <int> maximum coverage depth [8000]\n"); // the htslib's default
+ fprintf(stderr, " -d/-m <int> maximum coverage depth [8000]. If 0, depth is set to the maximum\n"
+ " integer value, effectively removing any depth limit.\n"); // the htslib's default
fprintf(stderr, " -q <int> base quality threshold [0]\n");
fprintf(stderr, " -Q <int> mapping quality threshold [0]\n");
fprintf(stderr, " -r <chr:from-to> region\n");
mplp = bam_mplp_init(n, read_bam, (void**)data); // initialization
if (0 < max_depth)
bam_mplp_set_maxcnt(mplp,max_depth); // set maximum coverage depth
+ else if (!max_depth)
+ bam_mplp_set_maxcnt(mplp,INT_MAX);
n_plp = calloc(n, sizeof(int)); // n_plp[i] is the number of covering reads from the i-th BAM
plp = calloc(n, sizeof(bam_pileup1_t*)); // plp[i] points to the array of covering reads (internal in mplp)
while ((ret=bam_mplp_auto(mplp, &tid, &pos, n_plp, plp)) > 0) { // come to the next covered position
for (j = 0; j < n_plp[i]; ++j) {
const bam_pileup1_t *p = plp[i] + j; // DON'T modfity plp[][] unless you really know
if (p->is_del || p->is_refskip) ++m; // having dels or refskips at tid:pos
- else if (bam_get_qual(p->b)[p->qpos] < baseQ) ++m; // low base quality
+ else if (p->qpos < p->b->core.l_qseq &&
+ bam_get_qual(p->b)[p->qpos] < baseQ) ++m; // low base quality
}
printf("\t%d", n_plp[i] - m); // this the depth to output
}
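
The new `-d/-m 0` behaviour amounts to a three-way choice of the depth cap. The sketch below restates it in Python under stated assumptions: `effective_max_depth` is a hypothetical name, `INT_MAX` is taken to be 2**31 - 1, and the value 8000 is the htslib default quoted in the usage text.

```python
INT_MAX = 2**31 - 1  # assumed value of C's INT_MAX on common platforms


def effective_max_depth(max_depth, library_default=8000):
    """Return the depth cap implied by the -d/-m option value."""
    if max_depth > 0:
        return max_depth        # explicit cap requested
    if max_depth == 0:
        return INT_MAX          # 0 effectively removes the limit
    return library_default      # negative: leave the library default in place
```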
fprintf(samtools_stderr, " -b <bed> list of positions or regions\n");
fprintf(samtools_stderr, " -f <list> list of input BAM filenames, one per line [null]\n");
fprintf(samtools_stderr, " -l <int> read length threshold (ignore reads shorter than <int>) [0]\n");
- fprintf(samtools_stderr, " -d/-m <int> maximum coverage depth [8000]\n"); // the htslib's default
+ fprintf(samtools_stderr, " -d/-m <int> maximum coverage depth [8000]. If 0, depth is set to the maximum\n"
+ " integer value, effectively removing any depth limit.\n"); // the htslib's default
fprintf(samtools_stderr, " -q <int> base quality threshold [0]\n");
fprintf(samtools_stderr, " -Q <int> mapping quality threshold [0]\n");
fprintf(samtools_stderr, " -r <chr:from-to> region\n");
mplp = bam_mplp_init(n, read_bam, (void**)data); // initialization
if (0 < max_depth)
bam_mplp_set_maxcnt(mplp,max_depth); // set maximum coverage depth
+ else if (!max_depth)
+ bam_mplp_set_maxcnt(mplp,INT_MAX);
n_plp = calloc(n, sizeof(int)); // n_plp[i] is the number of covering reads from the i-th BAM
plp = calloc(n, sizeof(bam_pileup1_t*)); // plp[i] points to the array of covering reads (internal in mplp)
while ((ret=bam_mplp_auto(mplp, &tid, &pos, n_plp, plp)) > 0) { // come to the next covered position
for (j = 0; j < n_plp[i]; ++j) {
const bam_pileup1_t *p = plp[i] + j; // DON'T modfity plp[][] unless you really know
if (p->is_del || p->is_refskip) ++m; // having dels or refskips at tid:pos
- else if (bam_get_qual(p->b)[p->qpos] < baseQ) ++m; // low base quality
+ else if (p->qpos < p->b->core.l_qseq &&
+ bam_get_qual(p->b)[p->qpos] < baseQ) ++m; // low base quality
}
fprintf(samtools_stdout, "\t%d", n_plp[i] - m); // this the depth to output
}
}
static bool init(const parsed_opts_t* opts, state_t** state_out) {
+ char output_mode[8] = "w";
state_t* retval = (state_t*) calloc(1, sizeof(state_t));
+
if (retval == NULL) {
fprintf(stderr, "[init] Out of memory allocating state struct.\n");
return false;
retval->input_header = sam_hdr_read(retval->input_file);
retval->output_header = bam_hdr_dup(retval->input_header);
- retval->output_file = sam_open_format(opts->output_name == NULL?"-":opts->output_name, "w", &opts->ga.out);
+ if (opts->output_name) // File format auto-detection
+ sam_open_mode(output_mode + 1, opts->output_name, NULL);
+ retval->output_file = sam_open_format(opts->output_name == NULL?"-":opts->output_name, output_mode, &opts->ga.out);
if (retval->output_file == NULL) {
print_error_errno("addreplacerg", "could not create \"%s\"", opts->output_name);
}
static bool init(const parsed_opts_t* opts, state_t** state_out) {
+ char output_mode[8] = "w";
state_t* retval = (state_t*) calloc(1, sizeof(state_t));
+
if (retval == NULL) {
fprintf(samtools_stderr, "[init] Out of memory allocating state struct.\n");
return false;
retval->input_header = sam_hdr_read(retval->input_file);
retval->output_header = bam_hdr_dup(retval->input_header);
- retval->output_file = sam_open_format(opts->output_name == NULL?"-":opts->output_name, "w", &opts->ga.out);
+ if (opts->output_name) // File format auto-detection
+ sam_open_mode(output_mode + 1, opts->output_name, NULL);
+ retval->output_file = sam_open_format(opts->output_name == NULL?"-":opts->output_name, output_mode, &opts->ga.out);
if (retval->output_file == NULL) {
print_error_errno("addreplacerg", "could not create \"%s\"", opts->output_name);
case 'h': {
samFile *fph = sam_open(optarg, "r");
if (fph == 0) {
- fprintf(stderr, "[%s] ERROR: fail to read the header from '%s'.\n", __func__, argv[1]);
+ fprintf(stderr, "[%s] ERROR: fail to read the header from '%s'.\n", __func__, optarg);
return 1;
}
h = sam_hdr_read(fph);
if (h == NULL) {
fprintf(stderr,
- "[%s] ERROR: failed to read the header for '%s'.\n",
- __func__, argv[1]);
+ "[%s] ERROR: failed to read the header from '%s'.\n",
+ __func__, optarg);
return 1;
}
sam_close(fph);
case 'h': {
samFile *fph = sam_open(optarg, "r");
if (fph == 0) {
- fprintf(samtools_stderr, "[%s] ERROR: fail to read the header from '%s'.\n", __func__, argv[1]);
+ fprintf(samtools_stderr, "[%s] ERROR: fail to read the header from '%s'.\n", __func__, optarg);
return 1;
}
h = sam_hdr_read(fph);
if (h == NULL) {
fprintf(samtools_stderr,
- "[%s] ERROR: failed to read the header for '%s'.\n",
- __func__, argv[1]);
+ "[%s] ERROR: failed to read the header from '%s'.\n",
+ __func__, optarg);
return 1;
}
sam_close(fph);
#define __STDC_FORMAT_MACROS
#include <inttypes.h>
#include <unistd.h>
+#include <getopt.h>
#include "samtools.h"
+#include "sam_opts.h"
#define BAM_LIDX_SHIFT 14
return EXIT_FAILURE;
}
+/*
+ * Cram indices do not contain mapped/unmapped record counts, so we have to
+ * decode each record and count. However we can speed this up as much as
+ * possible by using the required fields parameter.
+ *
+ * This prints the stats to stdout in the same manner as the BAM function
+ * does.
+ *
+ * Returns 0 on success,
+ * -1 on failure.
+ */
+int slow_idxstats(samFile *fp, bam_hdr_t *header) {
+ int ret, last_tid = -2;
+ bam1_t *b = bam_init1();
+
+ if (hts_set_opt(fp, CRAM_OPT_REQUIRED_FIELDS, SAM_RNAME | SAM_FLAG))
+ return -1;
+
+ uint64_t (*count0)[2] = calloc(header->n_targets+1, sizeof(*count0));
+ uint64_t (*counts)[2] = count0+1;
+ if (!count0)
+ return -1;
+
+ while ((ret = sam_read1(fp, header, b)) >= 0) {
+ if (b->core.tid >= header->n_targets || b->core.tid < -1) {
+ free(count0);
+ return -1;
+ }
+
+ if (b->core.tid != last_tid) {
+ if (last_tid >= -1) {
+ if (counts[b->core.tid][0] + counts[b->core.tid][1]) {
+ print_error("idxstats", "file is not position sorted");
+ free(count0);
+ return -1;
+ }
+ }
+ last_tid = b->core.tid;
+ }
+
+ counts[b->core.tid][(b->core.flag & BAM_FUNMAP) ? 1 : 0]++;
+ }
+
+ if (ret == -1) {
+ int i;
+ for (i = 0; i < header->n_targets; i++) {
+ printf("%s\t%d\t%"PRIu64"\t%"PRIu64"\n",
+ header->target_name[i],
+ header->target_len[i],
+ counts[i][0], counts[i][1]);
+ }
+ printf("*\t0\t%"PRIu64"\t%"PRIu64"\n", counts[-1][0], counts[-1][1]);
+ }
+
+ free(count0);
+
+ bam_destroy1(b);
+
+ return (ret == -1) ? 0 : -1;
+}
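
The counting scheme in `slow_idxstats` can be sketched in Python as follows. This is an illustration rather than a translation. Records are modelled as hypothetical `(tid, is_unmapped)` tuples, and slot 0 of the table plays the role of `counts[-1]` in the C code, holding reads without a reference (`tid == -1`).

```python
def count_records(records, n_targets):
    """Count mapped/unmapped records per reference from (tid, is_unmapped) tuples.

    Returns (per_reference_counts, no_reference_counts), each entry being
    [mapped, unmapped].
    """
    counts = [[0, 0] for _ in range(n_targets + 1)]  # slot 0 is tid == -1
    last_tid = -2
    for tid, is_unmapped in records:
        if tid < -1 or tid >= n_targets:
            raise ValueError("reference id out of range")
        if tid != last_tid:
            # Seeing a reference again after leaving it means the input is unsorted.
            if last_tid >= -1 and sum(counts[tid + 1]):
                raise ValueError("file is not position sorted")
            last_tid = tid
        counts[tid + 1][1 if is_unmapped else 0] += 1
    return counts[1:], counts[0]
```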
+
+static void usage_exit(FILE *fp, int exit_status)
+{
+ fprintf(fp, "Usage: samtools idxstats [options] <in.bam>\n");
+ sam_global_opt_help(fp, "-.---@");
+ exit(exit_status);
+}
+
int bam_idxstats(int argc, char *argv[])
{
hts_idx_t* idx;
bam_hdr_t* header;
samFile* fp;
+ int c;
- if (argc < 2) {
- fprintf(stderr, "Usage: samtools idxstats <in.bam>\n");
- return 1;
+ sam_global_args ga = SAM_GLOBAL_ARGS_INIT;
+ static const struct option lopts[] = {
+ SAM_OPT_GLOBAL_OPTIONS('-', 0, '-', '-', '-', '@'),
+ {NULL, 0, NULL, 0}
+ };
+
+ while ((c = getopt_long(argc, argv, "@:", lopts, NULL)) >= 0) {
+ switch (c) {
+ default: if (parse_sam_global_opt(c, optarg, lopts, &ga) == 0) break;
+ /* else fall-through */
+ case '?':
+ usage_exit(stderr, EXIT_FAILURE);
+ }
}
- fp = sam_open(argv[1], "r");
+
+ if (argc != optind+1) {
+ if (argc == optind) usage_exit(stdout, EXIT_SUCCESS);
+ else usage_exit(stderr, EXIT_FAILURE);
+ }
+
+ fp = sam_open_format(argv[optind], "r", &ga.in);
if (fp == NULL) {
- print_error_errno("idxstats", "failed to open \"%s\"", argv[1]);
+ print_error_errno("idxstats", "failed to open \"%s\"", argv[optind]);
return 1;
}
header = sam_hdr_read(fp);
if (header == NULL) {
- print_error("idxstats", "failed to read header for \"%s\"", argv[1]);
- return 1;
- }
- idx = sam_index_load(fp, argv[1]);
- if (idx == NULL) {
- print_error("idxstats", "fail to load index for \"%s\"", argv[1]);
+ print_error("idxstats", "failed to read header for \"%s\"", argv[optind]);
return 1;
}
- int i;
- for (i = 0; i < header->n_targets; ++i) {
- // Print out contig name and length
- printf("%s\t%d", header->target_name[i], header->target_len[i]);
- // Now fetch info about it from the meta bin
- uint64_t u, v;
- hts_idx_get_stat(idx, i, &u, &v);
- printf("\t%" PRIu64 "\t%" PRIu64 "\n", u, v);
+ if (hts_get_format(fp)->format != bam) {
+ slow_method:
+ if (ga.nthreads)
+ hts_set_threads(fp, ga.nthreads);
+
+ if (slow_idxstats(fp, header) < 0) {
+ print_error("idxstats", "failed to process \"%s\"", argv[optind]);
+ return 1;
+ }
+ } else {
+ idx = sam_index_load(fp, argv[optind]);
+ if (idx == NULL) {
+ print_error("idxstats", "failed to load index for \"%s\", "
+ "reverting to slow method", argv[optind]);
+ goto slow_method;
+ }
+
+ int i;
+ for (i = 0; i < header->n_targets; ++i) {
+ // Print out contig name and length
+ printf("%s\t%d", header->target_name[i], header->target_len[i]);
+ // Now fetch info about it from the meta bin
+ uint64_t u, v;
+ hts_idx_get_stat(idx, i, &u, &v);
+ printf("\t%" PRIu64 "\t%" PRIu64 "\n", u, v);
+ }
+ // Dump information about unmapped reads
+ printf("*\t0\t0\t%" PRIu64 "\n", hts_idx_get_n_no_coor(idx));
+ hts_idx_destroy(idx);
}
- // Dump information about unmapped reads
- printf("*\t0\t0\t%" PRIu64 "\n", hts_idx_get_n_no_coor(idx));
+
bam_hdr_destroy(header);
- hts_idx_destroy(idx);
sam_close(fp);
return 0;
}
#define __STDC_FORMAT_MACROS
#include <inttypes.h>
#include <unistd.h>
+#include <getopt.h>
#include "samtools.h"
+#include "sam_opts.h"
#define BAM_LIDX_SHIFT 14
return EXIT_FAILURE;
}
+/*
+ * Cram indices do not contain mapped/unmapped record counts, so we have to
+ * decode each record and count. However we can speed this up as much as
+ * possible by using the required fields parameter.
+ *
+ * This prints the stats to samtools_stdout in the same manner as the BAM function
+ * does.
+ *
+ * Returns 0 on success,
+ * -1 on failure.
+ */
+int slow_idxstats(samFile *fp, bam_hdr_t *header) {
+ int ret, last_tid = -2;
+ bam1_t *b = bam_init1();
+
+ if (hts_set_opt(fp, CRAM_OPT_REQUIRED_FIELDS, SAM_RNAME | SAM_FLAG))
+ return -1;
+
+ uint64_t (*count0)[2] = calloc(header->n_targets+1, sizeof(*count0));
+ uint64_t (*counts)[2] = count0+1;
+ if (!count0)
+ return -1;
+
+ while ((ret = sam_read1(fp, header, b)) >= 0) {
+ if (b->core.tid >= header->n_targets || b->core.tid < -1) {
+ free(count0);
+ return -1;
+ }
+
+ if (b->core.tid != last_tid) {
+ if (last_tid >= -1) {
+ if (counts[b->core.tid][0] + counts[b->core.tid][1]) {
+ print_error("idxstats", "file is not position sorted");
+ free(count0);
+ return -1;
+ }
+ }
+ last_tid = b->core.tid;
+ }
+
+ counts[b->core.tid][(b->core.flag & BAM_FUNMAP) ? 1 : 0]++;
+ }
+
+ if (ret == -1) {
+ int i;
+ for (i = 0; i < header->n_targets; i++) {
+ fprintf(samtools_stdout, "%s\t%d\t%"PRIu64"\t%"PRIu64"\n",
+ header->target_name[i],
+ header->target_len[i],
+ counts[i][0], counts[i][1]);
+ }
+ fprintf(samtools_stdout, "*\t0\t%"PRIu64"\t%"PRIu64"\n", counts[-1][0], counts[-1][1]);
+ }
+
+ free(count0);
+
+ bam_destroy1(b);
+
+ return (ret == -1) ? 0 : -1;
+}
+
+static void usage_exit(FILE *fp, int exit_status)
+{
+ fprintf(fp, "Usage: samtools idxstats [options] <in.bam>\n");
+ sam_global_opt_help(fp, "-.---@");
+ exit(exit_status);
+}
+
int bam_idxstats(int argc, char *argv[])
{
hts_idx_t* idx;
bam_hdr_t* header;
samFile* fp;
+ int c;
- if (argc < 2) {
- fprintf(samtools_stderr, "Usage: samtools idxstats <in.bam>\n");
- return 1;
+ sam_global_args ga = SAM_GLOBAL_ARGS_INIT;
+ static const struct option lopts[] = {
+ SAM_OPT_GLOBAL_OPTIONS('-', 0, '-', '-', '-', '@'),
+ {NULL, 0, NULL, 0}
+ };
+
+ while ((c = getopt_long(argc, argv, "@:", lopts, NULL)) >= 0) {
+ switch (c) {
+ default: if (parse_sam_global_opt(c, optarg, lopts, &ga) == 0) break;
+ /* else fall-through */
+ case '?':
+ usage_exit(samtools_stderr, EXIT_FAILURE);
+ }
}
- fp = sam_open(argv[1], "r");
+
+ if (argc != optind+1) {
+ if (argc == optind) usage_exit(samtools_stdout, EXIT_SUCCESS);
+ else usage_exit(samtools_stderr, EXIT_FAILURE);
+ }
+
+ fp = sam_open_format(argv[optind], "r", &ga.in);
if (fp == NULL) {
- print_error_errno("idxstats", "failed to open \"%s\"", argv[1]);
+ print_error_errno("idxstats", "failed to open \"%s\"", argv[optind]);
return 1;
}
header = sam_hdr_read(fp);
if (header == NULL) {
- print_error("idxstats", "failed to read header for \"%s\"", argv[1]);
- return 1;
- }
- idx = sam_index_load(fp, argv[1]);
- if (idx == NULL) {
- print_error("idxstats", "fail to load index for \"%s\"", argv[1]);
+ print_error("idxstats", "failed to read header for \"%s\"", argv[optind]);
return 1;
}
- int i;
- for (i = 0; i < header->n_targets; ++i) {
- // Print out contig name and length
- fprintf(samtools_stdout, "%s\t%d", header->target_name[i], header->target_len[i]);
- // Now fetch info about it from the meta bin
- uint64_t u, v;
- hts_idx_get_stat(idx, i, &u, &v);
- fprintf(samtools_stdout, "\t%" PRIu64 "\t%" PRIu64 "\n", u, v);
+ if (hts_get_format(fp)->format != bam) {
+ slow_method:
+ if (ga.nthreads)
+ hts_set_threads(fp, ga.nthreads);
+
+ if (slow_idxstats(fp, header) < 0) {
+ print_error("idxstats", "failed to process \"%s\"", argv[optind]);
+ return 1;
+ }
+ } else {
+ idx = sam_index_load(fp, argv[optind]);
+ if (idx == NULL) {
+ print_error("idxstats", "failed to load index for \"%s\", "
+ "reverting to slow method", argv[optind]);
+ goto slow_method;
+ }
+
+ int i;
+ for (i = 0; i < header->n_targets; ++i) {
+ // Print out contig name and length
+ fprintf(samtools_stdout, "%s\t%d", header->target_name[i], header->target_len[i]);
+ // Now fetch info about it from the meta bin
+ uint64_t u, v;
+ hts_idx_get_stat(idx, i, &u, &v);
+ fprintf(samtools_stdout, "\t%" PRIu64 "\t%" PRIu64 "\n", u, v);
+ }
+ // Dump information about unmapped reads
+ fprintf(samtools_stdout, "*\t0\t0\t%" PRIu64 "\n", hts_idx_get_n_no_coor(idx));
+ hts_idx_destroy(idx);
}
- // Dump information about unmapped reads
- fprintf(samtools_stdout, "*\t0\t0\t%" PRIu64 "\n", hts_idx_get_n_no_coor(idx));
+
bam_hdr_destroy(header);
- hts_idx_destroy(idx);
sam_close(fp);
return 0;
}
#include <assert.h>
#include "bam_plbuf.h"
#include "bam_lpileup.h"
-#include "samtools.h"
#include <htslib/ksort.h>
#define TV_GAP 2
#include <assert.h>
#include "bam_plbuf.h"
#include "bam_lpileup.h"
-#include "samtools.h"
#include <htslib/ksort.h>
#define TV_GAP 2
/* bam_markdup.c -- Mark duplicates from a coord sorted file that has gone
through fixmates with the mate scoring option on.
- Copyright (C) 2017 Genome Research Ltd.
+ Copyright (C) 2017-18 Genome Research Ltd.
Author: Andrew Whitwham <aw7@sanger.ac.uk>
if ((data = bam_aux_get(b, "ms"))) {
score = bam_aux2i(data);
} else {
- fprintf(stderr, "[markdup] error: no ms score tag.\n");
+ fprintf(stderr, "[markdup] error: no ms score tag. Please run samtools fixmate on file first.\n");
return -1;
}
other_end = unclipped_other_end(bam->core.mpos, cig);
other_coord = unclipped_other_start(bam->core.mpos, cig);
} else {
- fprintf(stderr, "[markdup] error: no MC tag.\n");
+ fprintf(stderr, "[markdup] error: no MC tag. Please run samtools fixmate on file first.\n");
return 1;
}
bp = &kh_val(pair_hash, k);
if ((mate_tmp = get_mate_score(bp->p)) == -1) {
- fprintf(stderr, "[markdup] error: no ms score tag.\n");
+ fprintf(stderr, "[markdup] error: no ms score tag. Please run samtools fixmate on file first.\n");
return 1;
} else {
old_score = calc_score(bp->p) + mate_tmp;
}
if ((mate_tmp = get_mate_score(in_read->b)) == -1) {
- fprintf(stderr, "[markdup] error: no ms score tag.\n");
+ fprintf(stderr, "[markdup] error: no ms score tag. Please run samtools fixmate on file first.\n");
return 1;
} else {
new_score = calc_score(in_read->b) + mate_tmp;
/* bam_markdup.c -- Mark duplicates from a coord sorted file that has gone
through fixmates with the mate scoring option on.
- Copyright (C) 2017 Genome Research Ltd.
+ Copyright (C) 2017-18 Genome Research Ltd.
Author: Andrew Whitwham <aw7@sanger.ac.uk>
if ((data = bam_aux_get(b, "ms"))) {
score = bam_aux2i(data);
} else {
- fprintf(samtools_stderr, "[markdup] error: no ms score tag.\n");
+ fprintf(samtools_stderr, "[markdup] error: no ms score tag. Please run samtools fixmate on file first.\n");
return -1;
}
other_end = unclipped_other_end(bam->core.mpos, cig);
other_coord = unclipped_other_start(bam->core.mpos, cig);
} else {
- fprintf(samtools_stderr, "[markdup] error: no MC tag.\n");
+ fprintf(samtools_stderr, "[markdup] error: no MC tag. Please run samtools fixmate on file first.\n");
return 1;
}
bp = &kh_val(pair_hash, k);
if ((mate_tmp = get_mate_score(bp->p)) == -1) {
- fprintf(samtools_stderr, "[markdup] error: no ms score tag.\n");
+ fprintf(samtools_stderr, "[markdup] error: no ms score tag. Please run samtools fixmate on file first.\n");
return 1;
} else {
old_score = calc_score(bp->p) + mate_tmp;
}
if ((mate_tmp = get_mate_score(in_read->b)) == -1) {
- fprintf(samtools_stderr, "[markdup] error: no ms score tag.\n");
+ fprintf(samtools_stderr, "[markdup] error: no ms score tag. Please run samtools fixmate on file first.\n");
return 1;
} else {
new_score = calc_score(in_read->b) + mate_tmp;
{
bam_hdr_t *header;
bam1_t *b[2] = { NULL, NULL };
- int curr, has_prev, pre_end = 0, cur_end = 0;
+ int curr, has_prev, pre_end = 0, cur_end = 0, result;
kstring_t str;
str.l = str.m = 0; str.s = 0;
b[0] = bam_init1();
b[1] = bam_init1();
curr = 0; has_prev = 0;
- while (sam_read1(in, header, b[curr]) >= 0) {
+ while ((result = sam_read1(in, header, b[curr])) >= 0) {
bam1_t *cur = b[curr], *pre = b[1-curr];
if (cur->core.flag & BAM_FSECONDARY)
{
curr = 1 - curr;
pre_end = cur_end;
}
+ if (result < -1) goto fail;
if (has_prev && !remove_reads) { // If we still have a BAM in the buffer it must be unpaired
bam1_t *pre = b[1-curr];
if (pre->core.tid < 0 || pre->core.pos < 0 || pre->core.flag&BAM_FUNMAP) { // If unmapped
{
bam_hdr_t *header;
bam1_t *b[2] = { NULL, NULL };
- int curr, has_prev, pre_end = 0, cur_end = 0;
+ int curr, has_prev, pre_end = 0, cur_end = 0, result;
kstring_t str;
str.l = str.m = 0; str.s = 0;
b[0] = bam_init1();
b[1] = bam_init1();
curr = 0; has_prev = 0;
- while (sam_read1(in, header, b[curr]) >= 0) {
+ while ((result = sam_read1(in, header, b[curr])) >= 0) {
bam1_t *cur = b[curr], *pre = b[1-curr];
if (cur->core.flag & BAM_FSECONDARY)
{
curr = 1 - curr;
pre_end = cur_end;
}
+ if (result < -1) goto fail;
if (has_prev && !remove_reads) { // If we still have a BAM in the buffer it must be unpaired
bam1_t *pre = b[1-curr];
if (pre->core.tid < 0 || pre->core.pos < 0 || pre->core.flag&BAM_FUNMAP) { // If unmapped
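Note on the read-loop change above: the loop now keeps the return value of `sam_read1` so that a clean end of stream (-1) can be told apart from a read error (below -1). A minimal sketch of that pattern follows, assuming a hypothetical callback-based reader; `read_fn`, `drain_records`, `struct script` and `scripted_read` are illustrative names that do not appear in the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reader callback: returns >= 0 for a record, -1 at end of
 * input, and a value below -1 on a read error. */
typedef int (*read_fn)(void *state);

/* Consume records until the reader stops.
 * Returns 0 on clean end of input, -1 if the reader reported an error. */
static int drain_records(read_fn next, void *state, size_t *n_read)
{
    int result;

    assert(next != NULL && n_read != NULL);
    *n_read = 0;
    while ((result = next(state)) >= 0)
        ++*n_read;
    /* The loop also ends on -1 (EOF); only values below -1 are failures. */
    return result < -1 ? -1 : 0;
}

/* Toy reader for demonstration: replays a fixed list of return codes. */
struct script { const int *codes; size_t pos; };

static int scripted_read(void *state)
{
    struct script *s = state;
    return s->codes[s->pos++];
}
```

Checking the saved result after the loop, rather than inside it, keeps the per-record path unchanged and adds a single branch at the end.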
int bam_aux_drop_other(bam1_t *b, uint8_t *s);
-void bam_fillmd1_core(bam1_t *b, char *ref, int ref_len, int flag, int max_nm)
+void bam_fillmd1_core(bam1_t *b, char *ref, int ref_len, int flag, int max_nm, int quiet_mode)
{
uint8_t *seq = bam_get_seq(b);
uint32_t *cigar = bam_get_cigar(b);
if (old_nm) old_nm_i = bam_aux2i(old_nm);
if (!old_nm) bam_aux_append(b, "NM", 'i', 4, (uint8_t*)&nm);
else if (nm != old_nm_i) {
- fprintf(stderr, "[bam_fillmd1] different NM for read '%s': %d -> %d\n", bam_get_qname(b), old_nm_i, nm);
+ if (!quiet_mode) {
+ fprintf(stderr, "[bam_fillmd1] different NM for read '%s': %d -> %d\n", bam_get_qname(b), old_nm_i, nm);
+ }
bam_aux_del(b, old_nm);
bam_aux_append(b, "NM", 'i', 4, (uint8_t*)&nm);
}
if (i < str->l) is_diff = 1;
} else is_diff = 1;
if (is_diff) {
- fprintf(stderr, "[bam_fillmd1] different MD for read '%s': '%s' -> '%s'\n", bam_get_qname(b), old_md+1, str->s);
+ if (!quiet_mode) {
+ fprintf(stderr, "[bam_fillmd1] different MD for read '%s': '%s' -> '%s'\n", bam_get_qname(b), old_md+1, str->s);
+ }
bam_aux_del(b, old_md);
bam_aux_append(b, "MD", 'Z', str->l + 1, (uint8_t*)str->s);
}
free(str->s); free(str);
}
-void bam_fillmd1(bam1_t *b, char *ref, int flag)
+void bam_fillmd1(bam1_t *b, char *ref, int flag, int quiet_mode)
{
- bam_fillmd1_core(b, ref, INT_MAX, flag, 0);
+ bam_fillmd1_core(b, ref, INT_MAX, flag, 0, quiet_mode);
}
int calmd_usage() {
fprintf(stderr,
-"Usage: samtools calmd [-eubrAES] <aln.bam> <ref.fasta>\n"
+"Usage: samtools calmd [-eubrAESQ] <aln.bam> <ref.fasta>\n"
"Options:\n"
" -e change identical bases to '='\n"
" -u uncompressed BAM output (for piping)\n"
" -b compressed BAM output\n"
" -S ignored (input format is auto-detected)\n"
" -A modify the quality string\n"
+" -Q use quiet mode to output less debug info to stderr\n"
" -r compute the BQ tag (without -A) or cap baseQ by BAQ (with -A)\n"
" -E extended BAQ for better sensitivity but lower specificity\n");
int bam_fillmd(int argc, char *argv[])
{
- int c, flt_flag, tid = -2, ret, len, is_bam_out, is_uncompressed, max_nm, is_realn, capQ, baq_flag;
+ int c, flt_flag, tid = -2, ret, len, is_bam_out, is_uncompressed, max_nm, is_realn, capQ, baq_flag, quiet_mode;
htsThreadPool p = {NULL, 0};
samFile *fp = NULL, *fpout = NULL;
bam_hdr_t *header = NULL;
};
flt_flag = UPDATE_NM | UPDATE_MD;
- is_bam_out = is_uncompressed = is_realn = max_nm = capQ = baq_flag = 0;
+ is_bam_out = is_uncompressed = is_realn = max_nm = capQ = baq_flag = quiet_mode = 0;
strcpy(mode_w, "w");
- while ((c = getopt_long(argc, argv, "EqreuNhbSC:n:Ad@:", lopts, NULL)) >= 0) {
+ while ((c = getopt_long(argc, argv, "EqQreuNhbSC:n:Ad@:", lopts, NULL)) >= 0) {
switch (c) {
case 'r': is_realn = 1; break;
case 'e': flt_flag |= USE_EQUAL; break;
case 'C': capQ = atoi(optarg); break;
case 'A': baq_flag |= 1; break;
case 'E': baq_flag |= 2; break;
+ case 'Q': quiet_mode = 1; break;
default: if (parse_sam_global_opt(c, optarg, lopts, &ga) == 0) break;
fprintf(stderr, "[bam_fillmd] unrecognized option '-%c'\n\n", c);
/* else fall-through */
int q = sam_cap_mapq(b, ref, len, capQ);
if (b->core.qual > q) b->core.qual = q;
}
- if (ref) bam_fillmd1_core(b, ref, len, flt_flag, max_nm);
+ if (ref) bam_fillmd1_core(b, ref, len, flt_flag, max_nm, quiet_mode);
}
if (sam_write1(fpout, header, b) < 0) {
print_error_errno("calmd", "failed to write to output file");
int bam_aux_drop_other(bam1_t *b, uint8_t *s);
-void bam_fillmd1_core(bam1_t *b, char *ref, int ref_len, int flag, int max_nm)
+void bam_fillmd1_core(bam1_t *b, char *ref, int ref_len, int flag, int max_nm, int quiet_mode)
{
uint8_t *seq = bam_get_seq(b);
uint32_t *cigar = bam_get_cigar(b);
if (old_nm) old_nm_i = bam_aux2i(old_nm);
if (!old_nm) bam_aux_append(b, "NM", 'i', 4, (uint8_t*)&nm);
else if (nm != old_nm_i) {
- fprintf(samtools_stderr, "[bam_fillmd1] different NM for read '%s': %d -> %d\n", bam_get_qname(b), old_nm_i, nm);
+ if (!quiet_mode) {
+ fprintf(samtools_stderr, "[bam_fillmd1] different NM for read '%s': %d -> %d\n", bam_get_qname(b), old_nm_i, nm);
+ }
bam_aux_del(b, old_nm);
bam_aux_append(b, "NM", 'i', 4, (uint8_t*)&nm);
}
if (i < str->l) is_diff = 1;
} else is_diff = 1;
if (is_diff) {
- fprintf(samtools_stderr, "[bam_fillmd1] different MD for read '%s': '%s' -> '%s'\n", bam_get_qname(b), old_md+1, str->s);
+ if (!quiet_mode) {
+ fprintf(samtools_stderr, "[bam_fillmd1] different MD for read '%s': '%s' -> '%s'\n", bam_get_qname(b), old_md+1, str->s);
+ }
bam_aux_del(b, old_md);
bam_aux_append(b, "MD", 'Z', str->l + 1, (uint8_t*)str->s);
}
free(str->s); free(str);
}
-void bam_fillmd1(bam1_t *b, char *ref, int flag)
+void bam_fillmd1(bam1_t *b, char *ref, int flag, int quiet_mode)
{
- bam_fillmd1_core(b, ref, INT_MAX, flag, 0);
+ bam_fillmd1_core(b, ref, INT_MAX, flag, 0, quiet_mode);
}
int calmd_usage() {
fprintf(samtools_stderr,
-"Usage: samtools calmd [-eubrAES] <aln.bam> <ref.fasta>\n"
+"Usage: samtools calmd [-eubrAESQ] <aln.bam> <ref.fasta>\n"
"Options:\n"
" -e change identical bases to '='\n"
" -u uncompressed BAM output (for piping)\n"
" -b compressed BAM output\n"
" -S ignored (input format is auto-detected)\n"
" -A modify the quality string\n"
+" -Q use quiet mode to output less debug info to stderr\n"
" -r compute the BQ tag (without -A) or cap baseQ by BAQ (with -A)\n"
" -E extended BAQ for better sensitivity but lower specificity\n");
int bam_fillmd(int argc, char *argv[])
{
- int c, flt_flag, tid = -2, ret, len, is_bam_out, is_uncompressed, max_nm, is_realn, capQ, baq_flag;
+ int c, flt_flag, tid = -2, ret, len, is_bam_out, is_uncompressed, max_nm, is_realn, capQ, baq_flag, quiet_mode;
htsThreadPool p = {NULL, 0};
samFile *fp = NULL, *fpout = NULL;
bam_hdr_t *header = NULL;
};
flt_flag = UPDATE_NM | UPDATE_MD;
- is_bam_out = is_uncompressed = is_realn = max_nm = capQ = baq_flag = 0;
+ is_bam_out = is_uncompressed = is_realn = max_nm = capQ = baq_flag = quiet_mode = 0;
strcpy(mode_w, "w");
- while ((c = getopt_long(argc, argv, "EqreuNhbSC:n:Ad@:", lopts, NULL)) >= 0) {
+ while ((c = getopt_long(argc, argv, "EqQreuNhbSC:n:Ad@:", lopts, NULL)) >= 0) {
switch (c) {
case 'r': is_realn = 1; break;
case 'e': flt_flag |= USE_EQUAL; break;
case 'C': capQ = atoi(optarg); break;
case 'A': baq_flag |= 1; break;
case 'E': baq_flag |= 2; break;
+ case 'Q': quiet_mode = 1; break;
default: if (parse_sam_global_opt(c, optarg, lopts, &ga) == 0) break;
fprintf(samtools_stderr, "[bam_fillmd] unrecognized option '-%c'\n\n", c);
/* else fall-through */
int q = sam_cap_mapq(b, ref, len, capQ);
if (b->core.qual > q) b->core.qual = q;
}
- if (ref) bam_fillmd1_core(b, ref, len, flt_flag, max_nm);
+ if (ref) bam_fillmd1_core(b, ref, len, flt_flag, max_nm, quiet_mode);
}
if (sam_write1(fpout, header, b) < 0) {
print_error_errno("calmd", "failed to write to output file");
#define MPLP_SMART_OVERLAPS (1<<12)
#define MPLP_PRINT_QNAME (1<<13)
+#define MPLP_MAX_DEPTH 8000
+#define MPLP_MAX_INDEL_DEPTH 250
+
void *bed_read(const char *fn);
void bed_destroy(void *_h);
int bed_overlap(const void *_h, const char *chr, int beg, int end);
exit(EXIT_FAILURE);
}
bam_smpl_add(sm, fn[i], (conf->flag&MPLP_IGNORE_RG)? 0 : h_tmp->text);
- // Collect read group IDs with PL (platform) listed in pl_list (note: fragile, strstr search)
- rghash = bcf_call_add_rg(rghash, h_tmp->text, conf->pl_list);
+ if (conf->flag & MPLP_BCF) {
+ // Collect read group IDs with PL (platform) listed in pl_list (note: fragile, strstr search)
+ rghash = bcf_call_add_rg(rghash, h_tmp->text, conf->pl_list);
+ }
if (conf->reg) {
hts_idx_t *idx = sam_index_load(data[i]->fp, fn[i]);
if (idx == NULL) {
data[i]->h = h;
}
}
- // allocate data storage proportionate to number of samples being studied sm->n
- gplp.n = sm->n;
- gplp.n_plp = calloc(sm->n, sizeof(int));
- gplp.m_plp = calloc(sm->n, sizeof(int));
- gplp.plp = calloc(sm->n, sizeof(bam_pileup1_t*));
-
fprintf(stderr, "[%s] %d samples in %d input files\n", __func__, sm->n, n);
- // write the VCF header
if (conf->flag & MPLP_BCF)
{
const char *mode;
+ // allocate data storage proportionate to number of samples being studied sm->n
+ gplp.n = sm->n;
+ gplp.n_plp = calloc(sm->n, sizeof(int));
+ gplp.m_plp = calloc(sm->n, sizeof(int));
+ gplp.plp = calloc(sm->n, sizeof(bam_pileup1_t*));
+
+ // write the VCF header
+
if ( conf->flag & MPLP_VCF )
mode = (conf->flag&MPLP_NO_COMP)? "wu" : "wz"; // uncompressed VCF or compressed VCF
else
// init pileup
iter = bam_mplp_init(n, mplp_func, (void**)data);
if ( conf->flag & MPLP_SMART_OVERLAPS ) bam_mplp_init_overlaps(iter);
- max_depth = conf->max_depth;
- if (max_depth * sm->n > 1<<20)
- fprintf(stderr, "(%s) Max depth is above 1M. Potential memory hog!\n", __func__);
- if (max_depth * sm->n < 8000) {
- max_depth = 8000 / sm->n;
- fprintf(stderr, "<%s> Set max per-file depth to %d\n", __func__, max_depth);
+ if ( !conf->max_depth ) {
+ max_depth = INT_MAX;
+ fprintf(stderr, "[%s] Max depth set to maximum value (%d)\n", __func__, INT_MAX);
+ } else {
+ max_depth = conf->max_depth;
+ if ( max_depth * n > 1<<20 )
+ fprintf(stderr, "[%s] Combined max depth is above 1M. Potential memory hog!\n", __func__);
}
+
+ // Only used when writing BCF
max_indel_depth = conf->max_indel_depth * sm->n;
bam_mplp_set_maxcnt(iter, max_depth);
bcf1_t *bcf_rec = bcf_init1();
putc('\t', pileup_fp);
for (j = 0; j < n_plp[i]; ++j) {
const bam_pileup1_t *p = plp[i] + j;
- int c = bam_get_qual(p->b)[p->qpos];
+ int c = p->qpos < p->b->core.l_qseq
+ ? bam_get_qual(p->b)[p->qpos]
+ : 0;
if ( c < conf->min_baseQ ) continue;
c = plp[i][j].b->core.qual + 33;
if (c > 126) c = 126;
putc('\t', pileup_fp);
for (j = 0; j < n_plp[i]; ++j) {
const bam_pileup1_t *p = plp[i] + j;
- int c = bam_get_qual(p->b)[p->qpos];
+ int c = p->qpos < p->b->core.l_qseq
+ ? bam_get_qual(p->b)[p->qpos]
+ : 0;
if ( c < conf->min_baseQ ) continue;
if (n > 0) putc(',', pileup_fp);
putc('\t', pileup_fp);
for (j = 0; j < n_plp[i]; ++j) {
const bam_pileup1_t *p = &plp[i][j];
- int c = bam_get_qual(p->b)[p->qpos];
+ int c = p->qpos < p->b->core.l_qseq
+ ? bam_get_qual(p->b)[p->qpos]
+ : 0;
if ( c < conf->min_baseQ ) continue;
if (n > 0) putc(',', pileup_fp);
"\n"
"Output options:\n"
" -o, --output FILE write output to FILE [standard output]\n"
-" -g, --BCF generate genotype likelihoods in BCF format\n"
-" -v, --VCF generate genotype likelihoods in VCF format\n"
-"\n"
-"Output options for mpileup format (without -g/-v):\n"
" -O, --output-BP output base positions on reads\n"
" -s, --output-MQ output mapping quality\n"
" --output-QNAME output read names\n"
" -a output all positions (including zero depth)\n"
" -a -a (or -aa) output absolutely all positions, including unused ref. sequences\n"
"\n"
-"Output options for genotype likelihoods (when -g/-v is used):\n"
-" -t, --output-tags LIST optional tags to output:\n"
-" DP,AD,ADF,ADR,SP,INFO/AD,INFO/ADF,INFO/ADR []\n"
-" -u, --uncompressed generate uncompressed VCF/BCF output\n"
-"\n"
-"SNP/INDEL genotype likelihoods options (effective with -g/-v):\n"
-" -e, --ext-prob INT Phred-scaled gap extension seq error probability [%d]\n", mplp->extQ);
- fprintf(fp,
-" -F, --gap-frac FLOAT minimum fraction of gapped reads [%g]\n", mplp->min_frac);
- fprintf(fp,
-" -h, --tandem-qual INT coefficient for homopolymer errors [%d]\n", mplp->tandemQ);
- fprintf(fp,
-" -I, --skip-indels do not perform indel calling\n"
-" -L, --max-idepth INT maximum per-file depth for INDEL calling [%d]\n", mplp->max_indel_depth);
- fprintf(fp,
-" -m, --min-ireads INT minimum number gapped reads for indel candidates [%d]\n", mplp->min_support);
- fprintf(fp,
-" -o, --open-prob INT Phred-scaled gap open seq error probability [%d]\n", mplp->openQ);
- fprintf(fp,
-" -p, --per-sample-mF apply -m and -F per-sample for increased sensitivity\n"
-" -P, --platforms STR comma separated list of platforms for indels [all]\n");
+"Generic options:\n");
sam_global_opt_help(fp, "-.--.-");
- fprintf(fp,
-"\n"
-"Notes: Assuming diploid individuals.\n");
+
+ fprintf(fp, "\n"
+"Note that using \"samtools mpileup\" to generate BCF or VCF files is now\n"
+"deprecated. To output these formats, please use \"bcftools mpileup\" instead.\n");
free(tmp_require);
free(tmp_filter);
}
+static void deprecated(char opt) {
+ fprintf(stderr, "[warning] samtools mpileup option `%c` is functional, "
+ "but deprecated. Please switch to using bcftools mpileup in future.\n", opt);
+}
+
int bam_mpileup(int argc, char *argv[])
{
int c;
memset(&mplp, 0, sizeof(mplp_conf_t));
mplp.min_baseQ = 13;
mplp.capQ_thres = 0;
- mplp.max_depth = 250; mplp.max_indel_depth = 250;
+ mplp.max_depth = MPLP_MAX_DEPTH;
+ mplp.max_indel_depth = MPLP_MAX_INDEL_DEPTH;
mplp.openQ = 40; mplp.extQ = 20; mplp.tandemQ = 100;
mplp.min_frac = 0.002; mplp.min_support = 1;
mplp.flag = MPLP_NO_ORPHAN | MPLP_REALN | MPLP_SMART_OVERLAPS;
mplp.bed = bed_read(optarg);
if (!mplp.bed) { print_error_errno("mpileup", "Could not read file \"%s\"", optarg); return 1; }
break;
- case 'P': mplp.pl_list = strdup(optarg); break;
- case 'p': mplp.flag |= MPLP_PER_SAMPLE; break;
- case 'g': mplp.flag |= MPLP_BCF; break;
- case 'v': mplp.flag |= MPLP_BCF | MPLP_VCF; break;
- case 'u': mplp.flag |= MPLP_NO_COMP | MPLP_BCF; break;
+ case 'P': mplp.pl_list = strdup(optarg); deprecated(c); break;
+ case 'p': mplp.flag |= MPLP_PER_SAMPLE; deprecated(c); break;
+ case 'g': mplp.flag |= MPLP_BCF; deprecated(c); break;
+ case 'v': mplp.flag |= MPLP_BCF | MPLP_VCF; deprecated(c); break;
+ case 'u': mplp.flag |= MPLP_NO_COMP | MPLP_BCF; deprecated(c); break;
case 'B': mplp.flag &= ~MPLP_REALN; break;
- case 'D': mplp.fmt_flag |= B2B_FMT_DP; fprintf(stderr, "[warning] samtools mpileup option `-D` is functional, but deprecated. Please switch to `-t DP` in future.\n"); break;
- case 'S': mplp.fmt_flag |= B2B_FMT_SP; fprintf(stderr, "[warning] samtools mpileup option `-S` is functional, but deprecated. Please switch to `-t SP` in future.\n"); break;
- case 'V': mplp.fmt_flag |= B2B_FMT_DV; fprintf(stderr, "[warning] samtools mpileup option `-V` is functional, but deprecated. Please switch to `-t DV` in future.\n"); break;
- case 'I': mplp.flag |= MPLP_NO_INDEL; break;
+ case 'D': mplp.fmt_flag |= B2B_FMT_DP; deprecated(c); break;
+ case 'S': mplp.fmt_flag |= B2B_FMT_SP; deprecated(c); break;
+ case 'V': mplp.fmt_flag |= B2B_FMT_DV; deprecated(c); break;
+ case 'I': mplp.flag |= MPLP_NO_INDEL; deprecated(c); break;
case 'E': mplp.flag |= MPLP_REDO_BAQ; break;
case '6': mplp.flag |= MPLP_ILLUMINA13; break;
case 'R': mplp.flag |= MPLP_IGNORE_RG; break;
char *end;
long value = strtol(optarg, &end, 10);
// Distinguish between -o INT and -o FILE (a bit of a hack!)
- if (*end == '\0') mplp.openQ = value;
- else mplp.output_fname = optarg;
+ if (*end == '\0') {
+ mplp.openQ = value;
+ fprintf(stderr, "[warning] samtools mpileup option "
+ "'--open-prob INT' is functional, but deprecated. "
+ "Please switch to using bcftools mpileup in future.\n");
+ } else {
+ mplp.output_fname = optarg;
+ }
}
break;
- case 'e': mplp.extQ = atoi(optarg); break;
- case 'h': mplp.tandemQ = atoi(optarg); break;
+ case 'e': mplp.extQ = atoi(optarg); deprecated(c); break;
+ case 'h': mplp.tandemQ = atoi(optarg); deprecated(c); break;
case 'A': use_orphan = 1; break;
- case 'F': mplp.min_frac = atof(optarg); break;
- case 'm': mplp.min_support = atoi(optarg); break;
- case 'L': mplp.max_indel_depth = atoi(optarg); break;
+ case 'F': mplp.min_frac = atof(optarg); deprecated(c); break;
+ case 'm': mplp.min_support = atoi(optarg); deprecated(c); break;
+ case 'L': mplp.max_indel_depth = atoi(optarg); deprecated(c); break;
case 'G': {
FILE *fp_rg;
char buf[1024];
mplp.rghash = khash_str2int_init();
if ((fp_rg = fopen(optarg, "r")) == NULL)
- fprintf(stderr, "(%s) Fail to open file %s. Continue anyway.\n", __func__, optarg);
+ fprintf(stderr, "[%s] Fail to open file %s. Continue anyway.\n", __func__, optarg);
while (!feof(fp_rg) && fscanf(fp_rg, "%s", buf) > 0) // this is not a good style, but forgive me...
khash_str2int_inc(mplp.rghash, strdup(buf));
fclose(fp_rg);
}
break;
- case 't': mplp.fmt_flag |= parse_format_flag(optarg); break;
+ case 't': mplp.fmt_flag |= parse_format_flag(optarg); deprecated(c); break;
case 'a': mplp.all++; break;
default:
if (parse_sam_global_opt(c, optarg, lopts, &mplp.ga) == 0) break;
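Note on the max-depth change above: a requested depth of 0 now means "no limit" (`INT_MAX`), and the memory warning is based on the depth multiplied by the number of input files. The sketch below restates that decision as a standalone function. `effective_max_depth` and its `warn` output are hypothetical names, and the 64-bit multiplication is an assumption added here to avoid overflow; the patch itself multiplies plain `int` values.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Return the per-file depth cap to apply.
 * requested == 0 is treated as "unlimited".
 * *warn is set to 1 when requested * n_files exceeds 1<<20 (about 1M). */
static int effective_max_depth(int requested, int n_files, int *warn)
{
    assert(warn != NULL);
    *warn = 0;
    if (requested == 0)
        return INT_MAX;
    /* Widen before multiplying so large inputs cannot overflow int. */
    if ((long long)requested * n_files > (1LL << 20))
        *warn = 1;
    return requested;
}
```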
#define MPLP_SMART_OVERLAPS (1<<12)
#define MPLP_PRINT_QNAME (1<<13)
+#define MPLP_MAX_DEPTH 8000
+#define MPLP_MAX_INDEL_DEPTH 250
+
void *bed_read(const char *fn);
void bed_destroy(void *_h);
int bed_overlap(const void *_h, const char *chr, int beg, int end);
exit(EXIT_FAILURE);
}
bam_smpl_add(sm, fn[i], (conf->flag&MPLP_IGNORE_RG)? 0 : h_tmp->text);
- // Collect read group IDs with PL (platform) listed in pl_list (note: fragile, strstr search)
- rghash = bcf_call_add_rg(rghash, h_tmp->text, conf->pl_list);
+ if (conf->flag & MPLP_BCF) {
+ // Collect read group IDs with PL (platform) listed in pl_list (note: fragile, strstr search)
+ rghash = bcf_call_add_rg(rghash, h_tmp->text, conf->pl_list);
+ }
if (conf->reg) {
hts_idx_t *idx = sam_index_load(data[i]->fp, fn[i]);
if (idx == NULL) {
data[i]->h = h;
}
}
- // allocate data storage proportionate to number of samples being studied sm->n
- gplp.n = sm->n;
- gplp.n_plp = calloc(sm->n, sizeof(int));
- gplp.m_plp = calloc(sm->n, sizeof(int));
- gplp.plp = calloc(sm->n, sizeof(bam_pileup1_t*));
-
fprintf(samtools_stderr, "[%s] %d samples in %d input files\n", __func__, sm->n, n);
- // write the VCF header
if (conf->flag & MPLP_BCF)
{
const char *mode;
+ // allocate data storage proportionate to number of samples being studied sm->n
+ gplp.n = sm->n;
+ gplp.n_plp = calloc(sm->n, sizeof(int));
+ gplp.m_plp = calloc(sm->n, sizeof(int));
+ gplp.plp = calloc(sm->n, sizeof(bam_pileup1_t*));
+
+ // write the VCF header
+
if ( conf->flag & MPLP_VCF )
mode = (conf->flag&MPLP_NO_COMP)? "wu" : "wz"; // uncompressed VCF or compressed VCF
else
// init pileup
iter = bam_mplp_init(n, mplp_func, (void**)data);
if ( conf->flag & MPLP_SMART_OVERLAPS ) bam_mplp_init_overlaps(iter);
- max_depth = conf->max_depth;
- if (max_depth * sm->n > 1<<20)
- fprintf(samtools_stderr, "(%s) Max depth is above 1M. Potential memory hog!\n", __func__);
- if (max_depth * sm->n < 8000) {
- max_depth = 8000 / sm->n;
- fprintf(samtools_stderr, "<%s> Set max per-file depth to %d\n", __func__, max_depth);
+ if ( !conf->max_depth ) {
+ max_depth = INT_MAX;
+ fprintf(samtools_stderr, "[%s] Max depth set to maximum value (%d)\n", __func__, INT_MAX);
+ } else {
+ max_depth = conf->max_depth;
+ if ( max_depth * n > 1<<20 )
+ fprintf(samtools_stderr, "[%s] Combined max depth is above 1M. Potential memory hog!\n", __func__);
}
+
+ // Only used when writing BCF
max_indel_depth = conf->max_indel_depth * sm->n;
bam_mplp_set_maxcnt(iter, max_depth);
bcf1_t *bcf_rec = bcf_init1();
putc('\t', pileup_fp);
for (j = 0; j < n_plp[i]; ++j) {
const bam_pileup1_t *p = plp[i] + j;
- int c = bam_get_qual(p->b)[p->qpos];
+ int c = p->qpos < p->b->core.l_qseq
+ ? bam_get_qual(p->b)[p->qpos]
+ : 0;
if ( c < conf->min_baseQ ) continue;
c = plp[i][j].b->core.qual + 33;
if (c > 126) c = 126;
putc('\t', pileup_fp);
for (j = 0; j < n_plp[i]; ++j) {
const bam_pileup1_t *p = plp[i] + j;
- int c = bam_get_qual(p->b)[p->qpos];
+ int c = p->qpos < p->b->core.l_qseq
+ ? bam_get_qual(p->b)[p->qpos]
+ : 0;
if ( c < conf->min_baseQ ) continue;
if (n > 0) putc(',', pileup_fp);
putc('\t', pileup_fp);
for (j = 0; j < n_plp[i]; ++j) {
const bam_pileup1_t *p = &plp[i][j];
- int c = bam_get_qual(p->b)[p->qpos];
+ int c = p->qpos < p->b->core.l_qseq
+ ? bam_get_qual(p->b)[p->qpos]
+ : 0;
if ( c < conf->min_baseQ ) continue;
if (n > 0) putc(',', pileup_fp);
"\n"
"Output options:\n"
" -o, --output FILE write output to FILE [standard output]\n"
-" -g, --BCF generate genotype likelihoods in BCF format\n"
-" -v, --VCF generate genotype likelihoods in VCF format\n"
-"\n"
-"Output options for mpileup format (without -g/-v):\n"
" -O, --output-BP output base positions on reads\n"
" -s, --output-MQ output mapping quality\n"
" --output-QNAME output read names\n"
" -a output all positions (including zero depth)\n"
" -a -a (or -aa) output absolutely all positions, including unused ref. sequences\n"
"\n"
-"Output options for genotype likelihoods (when -g/-v is used):\n"
-" -t, --output-tags LIST optional tags to output:\n"
-" DP,AD,ADF,ADR,SP,INFO/AD,INFO/ADF,INFO/ADR []\n"
-" -u, --uncompressed generate uncompressed VCF/BCF output\n"
-"\n"
-"SNP/INDEL genotype likelihoods options (effective with -g/-v):\n"
-" -e, --ext-prob INT Phred-scaled gap extension seq error probability [%d]\n", mplp->extQ);
- fprintf(fp,
-" -F, --gap-frac FLOAT minimum fraction of gapped reads [%g]\n", mplp->min_frac);
- fprintf(fp,
-" -h, --tandem-qual INT coefficient for homopolymer errors [%d]\n", mplp->tandemQ);
- fprintf(fp,
-" -I, --skip-indels do not perform indel calling\n"
-" -L, --max-idepth INT maximum per-file depth for INDEL calling [%d]\n", mplp->max_indel_depth);
- fprintf(fp,
-" -m, --min-ireads INT minimum number gapped reads for indel candidates [%d]\n", mplp->min_support);
- fprintf(fp,
-" -o, --open-prob INT Phred-scaled gap open seq error probability [%d]\n", mplp->openQ);
- fprintf(fp,
-" -p, --per-sample-mF apply -m and -F per-sample for increased sensitivity\n"
-" -P, --platforms STR comma separated list of platforms for indels [all]\n");
+"Generic options:\n");
sam_global_opt_help(fp, "-.--.-");
- fprintf(fp,
-"\n"
-"Notes: Assuming diploid individuals.\n");
+
+ fprintf(fp, "\n"
+"Note that using \"samtools mpileup\" to generate BCF or VCF files is now\n"
+"deprecated. To output these formats, please use \"bcftools mpileup\" instead.\n");
free(tmp_require);
free(tmp_filter);
}
+static void deprecated(char opt) {
+ fprintf(samtools_stderr, "[warning] samtools mpileup option `%c` is functional, "
+ "but deprecated. Please switch to using bcftools mpileup in future.\n", opt);
+}
+
int bam_mpileup(int argc, char *argv[])
{
int c;
memset(&mplp, 0, sizeof(mplp_conf_t));
mplp.min_baseQ = 13;
mplp.capQ_thres = 0;
- mplp.max_depth = 250; mplp.max_indel_depth = 250;
+ mplp.max_depth = MPLP_MAX_DEPTH;
+ mplp.max_indel_depth = MPLP_MAX_INDEL_DEPTH;
mplp.openQ = 40; mplp.extQ = 20; mplp.tandemQ = 100;
mplp.min_frac = 0.002; mplp.min_support = 1;
mplp.flag = MPLP_NO_ORPHAN | MPLP_REALN | MPLP_SMART_OVERLAPS;
mplp.bed = bed_read(optarg);
if (!mplp.bed) { print_error_errno("mpileup", "Could not read file \"%s\"", optarg); return 1; }
break;
- case 'P': mplp.pl_list = strdup(optarg); break;
- case 'p': mplp.flag |= MPLP_PER_SAMPLE; break;
- case 'g': mplp.flag |= MPLP_BCF; break;
- case 'v': mplp.flag |= MPLP_BCF | MPLP_VCF; break;
- case 'u': mplp.flag |= MPLP_NO_COMP | MPLP_BCF; break;
+ case 'P': mplp.pl_list = strdup(optarg); deprecated(c); break;
+ case 'p': mplp.flag |= MPLP_PER_SAMPLE; deprecated(c); break;
+ case 'g': mplp.flag |= MPLP_BCF; deprecated(c); break;
+ case 'v': mplp.flag |= MPLP_BCF | MPLP_VCF; deprecated(c); break;
+ case 'u': mplp.flag |= MPLP_NO_COMP | MPLP_BCF; deprecated(c); break;
case 'B': mplp.flag &= ~MPLP_REALN; break;
- case 'D': mplp.fmt_flag |= B2B_FMT_DP; fprintf(samtools_stderr, "[warning] samtools mpileup option `-D` is functional, but deprecated. Please switch to `-t DP` in future.\n"); break;
- case 'S': mplp.fmt_flag |= B2B_FMT_SP; fprintf(samtools_stderr, "[warning] samtools mpileup option `-S` is functional, but deprecated. Please switch to `-t SP` in future.\n"); break;
- case 'V': mplp.fmt_flag |= B2B_FMT_DV; fprintf(samtools_stderr, "[warning] samtools mpileup option `-V` is functional, but deprecated. Please switch to `-t DV` in future.\n"); break;
- case 'I': mplp.flag |= MPLP_NO_INDEL; break;
+ case 'D': mplp.fmt_flag |= B2B_FMT_DP; deprecated(c); break;
+ case 'S': mplp.fmt_flag |= B2B_FMT_SP; deprecated(c); break;
+ case 'V': mplp.fmt_flag |= B2B_FMT_DV; deprecated(c); break;
+ case 'I': mplp.flag |= MPLP_NO_INDEL; deprecated(c); break;
case 'E': mplp.flag |= MPLP_REDO_BAQ; break;
case '6': mplp.flag |= MPLP_ILLUMINA13; break;
case 'R': mplp.flag |= MPLP_IGNORE_RG; break;
char *end;
long value = strtol(optarg, &end, 10);
// Distinguish between -o INT and -o FILE (a bit of a hack!)
- if (*end == '\0') mplp.openQ = value;
- else mplp.output_fname = optarg;
+ if (*end == '\0') {
+ mplp.openQ = value;
+ fprintf(samtools_stderr, "[warning] samtools mpileup option "
+ "'--open-prob INT' is functional, but deprecated. "
+ "Please switch to using bcftools mpileup in future.\n");
+ } else {
+ mplp.output_fname = optarg;
+ }
}
break;
- case 'e': mplp.extQ = atoi(optarg); break;
- case 'h': mplp.tandemQ = atoi(optarg); break;
+ case 'e': mplp.extQ = atoi(optarg); deprecated(c); break;
+ case 'h': mplp.tandemQ = atoi(optarg); deprecated(c); break;
case 'A': use_orphan = 1; break;
- case 'F': mplp.min_frac = atof(optarg); break;
- case 'm': mplp.min_support = atoi(optarg); break;
- case 'L': mplp.max_indel_depth = atoi(optarg); break;
+ case 'F': mplp.min_frac = atof(optarg); deprecated(c); break;
+ case 'm': mplp.min_support = atoi(optarg); deprecated(c); break;
+ case 'L': mplp.max_indel_depth = atoi(optarg); deprecated(c); break;
case 'G': {
FILE *fp_rg;
char buf[1024];
mplp.rghash = khash_str2int_init();
if ((fp_rg = fopen(optarg, "r")) == NULL)
- fprintf(samtools_stderr, "(%s) Fail to open file %s. Continue anyway.\n", __func__, optarg);
+ fprintf(samtools_stderr, "[%s] Fail to open file %s. Continue anyway.\n", __func__, optarg);
while (!feof(fp_rg) && fscanf(fp_rg, "%s", buf) > 0) // this is not a good style, but forgive me...
khash_str2int_inc(mplp.rghash, strdup(buf));
fclose(fp_rg);
}
break;
- case 't': mplp.fmt_flag |= parse_format_flag(optarg); break;
+ case 't': mplp.fmt_flag |= parse_format_flag(optarg); deprecated(c); break;
case 'a': mplp.all++; break;
default:
if (parse_sam_global_opt(c, optarg, lopts, &mplp.ga) == 0) break;
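Note on the quality lookup change above: the pileup loops now read a base quality only when `qpos` lies inside the read and fall back to 0 otherwise. A minimal sketch of that guard is shown below as a helper. `base_quality_or_zero` is a hypothetical name, and the extra `qpos >= 0` check is an assumption that the patch does not include.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return qual[qpos] when qpos indexes a base of the read, else 0.
 * qual points to l_qseq quality values. */
static int base_quality_or_zero(const uint8_t *qual, int32_t l_qseq, int32_t qpos)
{
    assert(qual != NULL);
    return (qpos >= 0 && qpos < l_qseq) ? qual[qpos] : 0;
}
```

A fallback of 0 means such positions are skipped whenever the minimum base quality threshold is above 0.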
#include <getopt.h>
#include <assert.h>
#include <pthread.h>
-#include "htslib/bgzf.h"
#include "htslib/ksort.h"
#include "htslib/hts_os.h"
#include "htslib/khash.h"
#include "samtools.h"
-// Struct which contains the a record, and the pointer to the sort tag (if any)
-// Used to speed up sort-by-tag.
+// Struct which contains a record, and the pointer to the sort tag (if any) or
+// a combined ref / position / strand.
+// Used to speed up tag and position sorts.
typedef struct bam1_tag {
bam1_t *bam_record;
- const uint8_t *tag;
+ union {
+ const uint8_t *tag;
+ uint64_t pos;
+ } u;
} bam1_tag;
/* Minimum memory required in megabytes before sort will attempt to run. This
typedef struct {
int i;
+ uint32_t rev;
uint64_t pos, idx;
bam1_tag entry;
} heap1_t;
if (fa != fb) return fa > fb;
} else {
if (a.pos != b.pos) return a.pos > b.pos;
+ if (a.rev != b.rev) return a.rev > b.rev;
}
// This compares by position in the input file(s)
if (a.i != b.i) return a.i > b.i;
int res;
h->i = i;
h->entry.bam_record = bam_init1();
- h->entry.tag = NULL;
+ h->entry.u.tag = NULL;
if (!h->entry.bam_record) goto mem_fail;
res = iter[i] ? sam_itr_next(fp[i], iter[i], h->entry.bam_record) : sam_read1(fp[i], hdr[i], h->entry.bam_record);
if (res >= 0) {
bam_translate(h->entry.bam_record, translation_tbl + i);
- h->pos = ((uint64_t)h->entry.bam_record->core.tid<<32) | (uint32_t)((int32_t)h->entry.bam_record->core.pos+1)<<1 | bam_is_rev(h->entry.bam_record);
+ h->pos = ((uint64_t)h->entry.bam_record->core.tid<<32) | (uint32_t)((int32_t)h->entry.bam_record->core.pos+1);
+ h->rev = bam_is_rev(h->entry.bam_record);
h->idx = idx++;
if (g_is_by_tag) {
- h->entry.tag = bam_aux_get(h->entry.bam_record, g_sort_tag);
+ h->entry.u.tag = bam_aux_get(h->entry.bam_record, g_sort_tag);
} else {
- h->entry.tag = NULL;
+ h->entry.u.tag = NULL;
}
}
else if (res == -1 && (!iter[i] || iter[i]->finished)) {
h->pos = HEAP_EMPTY;
bam_destroy1(h->entry.bam_record);
h->entry.bam_record = NULL;
- h->entry.tag = NULL;
+ h->entry.u.tag = NULL;
} else {
print_error(cmd, "failed to read first record from \"%s\"", fn[i]);
goto fail;
}
if ((j = (iter[heap->i]? sam_itr_next(fp[heap->i], iter[heap->i], b) : sam_read1(fp[heap->i], hdr[heap->i], b))) >= 0) {
bam_translate(b, translation_tbl + heap->i);
- heap->pos = ((uint64_t)b->core.tid<<32) | (uint32_t)((int)b->core.pos+1)<<1 | bam_is_rev(b);
+ heap->pos = ((uint64_t)b->core.tid<<32) | (uint32_t)((int)b->core.pos+1);
+ heap->rev = bam_is_rev(b);
heap->idx = idx++;
if (g_is_by_tag) {
- heap->entry.tag = bam_aux_get(heap->entry.bam_record, g_sort_tag);
+ heap->entry.u.tag = bam_aux_get(heap->entry.bam_record, g_sort_tag);
} else {
- heap->entry.tag = NULL;
+ heap->entry.u.tag = NULL;
}
} else if (j == -1 && (!iter[heap->i] || iter[heap->i]->finished)) {
heap->pos = HEAP_EMPTY;
bam_destroy1(heap->entry.bam_record);
heap->entry.bam_record = NULL;
- heap->entry.tag = NULL;
+ heap->entry.u.tag = NULL;
} else {
print_error(cmd, "\"%s\" is truncated", fn[heap->i]);
goto fail;
}
if (res >= 0) {
heap->pos = (((uint64_t)heap->entry.bam_record->core.tid<<32)
- | (uint32_t)((int32_t)heap->entry.bam_record->core.pos+1)<<1
- | bam_is_rev(heap->entry.bam_record));
+ | (uint32_t)((int32_t)heap->entry.bam_record->core.pos+1));
+ heap->rev = bam_is_rev(heap->entry.bam_record);
heap->idx = (*idx)++;
if (g_is_by_tag) {
- heap->entry.tag = bam_aux_get(heap->entry.bam_record, g_sort_tag);
+ heap->entry.u.tag = bam_aux_get(heap->entry.bam_record, g_sort_tag);
} else {
- heap->entry.tag = NULL;
+ heap->entry.u.tag = NULL;
}
} else if (res == -1) {
heap->pos = HEAP_EMPTY;
if (i < nfiles) bam_destroy1(heap->entry.bam_record);
heap->entry.bam_record = NULL;
- heap->entry.tag = NULL;
+ heap->entry.u.tag = NULL;
} else {
return -1;
}
// Get a read into the heap
h->i = i;
- h->entry.tag = NULL;
+ h->entry.u.tag = NULL;
if (i < n) {
h->entry.bam_record = bam_init1();
if (!h->entry.bam_record) goto mem_fail;
return -1;
}
- hts_set_threads(fpout, n_threads);
+ if (n_threads > 1) hts_set_threads(fpout, n_threads);
if (sam_hdr_write(fpout, hout) != 0) {
print_error_errno(cmd, "failed to write header to \"%s\"", out);
if (t != 0) return t;
return (int) (a.bam_record->core.flag&0xc0) - (int) (b.bam_record->core.flag&0xc0);
} else {
- pa = (uint64_t)a.bam_record->core.tid<<32|(a.bam_record->core.pos+1)<<1|bam_is_rev(a.bam_record);
- pb = (uint64_t)b.bam_record->core.tid<<32|(b.bam_record->core.pos+1)<<1|bam_is_rev(b.bam_record);
+ pa = (uint64_t)a.bam_record->core.tid<<32|(a.bam_record->core.pos+1);
+ pb = (uint64_t)b.bam_record->core.tid<<32|(b.bam_record->core.pos+1);
+
+ if (pa == pb) {
+ pa = bam_is_rev(a.bam_record);
+ pb = bam_is_rev(b.bam_record);
+ }
+
return pa < pb ? -1 : (pa > pb ? 1 : 0);
}
}
// equal to or greater than b, respectively.
static inline int bam1_cmp_by_tag(const bam1_tag a, const bam1_tag b)
{
- const uint8_t* aux_a = a.tag;
- const uint8_t* aux_b = b.tag;
+ const uint8_t* aux_a = a.u.tag;
+ const uint8_t* aux_b = b.u.tag;
if (aux_a == NULL && aux_b != NULL) {
return -1;
return -1;
}
+#define NUMBASE 256
+#define STEP 8
+
+static int ks_radixsort(size_t n, bam1_tag *buf, const bam_hdr_t *h)
+{
+ int curr = 0, ret = -1;
+ ssize_t i;
+ bam1_tag *buf_ar2[2], *bam_a, *bam_b;
+ uint64_t max_pos = 0, max_digit = 0, shift = 0;
+
+ for (i = 0; i < n; i++) {
+ bam1_t *b = buf[i].bam_record;
+ int32_t tid = b->core.tid == -1 ? h->n_targets : b->core.tid;
+ buf[i].u.pos = (uint64_t)tid<<32 | (uint64_t)(uint32_t)(b->core.pos+1)<<1 | bam_is_rev(b);
+ if (max_pos < buf[i].u.pos)
+ max_pos = buf[i].u.pos;
+ }
+
+ while (max_pos) {
+ ++max_digit;
+ max_pos = max_pos >> 1;
+ }
+
+ buf_ar2[0] = buf;
+ buf_ar2[1] = (bam1_tag *)malloc(sizeof(bam1_tag) * n);
+ if (buf_ar2[1] == NULL) {
+ print_error("sort", "couldn't allocate memory for temporary buf");
+ goto err;
+ }
+
+ while (shift < max_digit){
+ size_t remainders[NUMBASE] = { 0 };
+ bam_a = buf_ar2[curr]; bam_b = buf_ar2[1-curr];
+ for (i = 0; i < n; ++i)
+ remainders[(bam_a[i].u.pos >> shift) % NUMBASE]++;
+ for (i = 1; i < NUMBASE; ++i)
+ remainders[i] += remainders[i - 1];
+ for (i = n - 1; i >= 0; i--) {
+ size_t j = --remainders[(bam_a[i].u.pos >> shift) % NUMBASE];
+ bam_b[j] = bam_a[i];
+ }
+ shift += STEP;
+ curr = 1 - curr;
+ }
+ if (curr == 1) {
+ bam1_tag *end = buf + n;
+ bam_a = buf_ar2[0]; bam_b = buf_ar2[1];
+ while (bam_a < end) *bam_a++ = *bam_b++;
+ }
+
+ ret = 0;
+err:
+ free(buf_ar2[1]);
+ return ret;
+}
+
static void *worker(void *data)
{
worker_t *w = (worker_t*)data;
char *name;
w->error = 0;
- ks_mergesort(sort, w->buf_len, w->buf, 0);
+
+ if (!g_is_by_qname && !g_is_by_tag) {
+ if (ks_radixsort(w->buf_len, w->buf, w->h) < 0) {
+ w->error = errno;
+ return NULL;
+ }
+ } else {
+ ks_mergesort(sort, w->buf_len, w->buf, 0);
+ }
if (w->no_save)
return 0;
mem_full = 1;
}
- // Pull out the pointer to the sort tag if applicable
+ // Pull out the value of the position
+ // or the pointer to the sort tag if applicable
if (g_is_by_tag) {
- buf[k].tag = bam_aux_get(buf[k].bam_record, g_sort_tag);
+ buf[k].u.tag = bam_aux_get(buf[k].bam_record, g_sort_tag);
} else {
- buf[k].tag = NULL;
+ buf[k].u.tag = NULL;
}
++k;
in_mem = calloc(n_threads > 0 ? n_threads : 1, sizeof(in_mem[0]));
if (!in_mem) goto err;
num_in_mem = sort_blocks(n_files, k, buf, prefix, header, n_threads,
- in_mem);
+ in_mem);
if (num_in_mem < 0) goto err;
} else {
num_in_mem = 0;
// write the final output
if (n_files == 0 && num_in_mem < 2) { // a single block
- ks_mergesort(sort, k, buf, 0);
if (write_buffer(fnout, modeout, k, buf, header, n_threads, out_fmt) != 0) {
print_error_errno("sort", "failed to create \"%s\"", fnout);
goto err;
bam_destroy1(b);
free(buf);
free(bam_mem);
+ free(in_mem);
bam_hdr_destroy(header);
if (fp) sam_close(fp);
return ret;
#include <getopt.h>
#include <assert.h>
#include <pthread.h>
-#include "htslib/bgzf.h"
#include "htslib/ksort.h"
#include "htslib/hts_os.h"
#include "htslib/khash.h"
#include "samtools.h"
-// Struct which contains the a record, and the pointer to the sort tag (if any)
-// Used to speed up sort-by-tag.
+// Struct which contains a record, and the pointer to the sort tag (if any) or
+// a combined ref / position / strand.
+// Used to speed up tag and position sorts.
typedef struct bam1_tag {
bam1_t *bam_record;
- const uint8_t *tag;
+ union {
+ const uint8_t *tag;
+ uint64_t pos;
+ } u;
} bam1_tag;
/* Minimum memory required in megabytes before sort will attempt to run. This
typedef struct {
int i;
+ uint32_t rev;
uint64_t pos, idx;
bam1_tag entry;
} heap1_t;
if (fa != fb) return fa > fb;
} else {
if (a.pos != b.pos) return a.pos > b.pos;
+ if (a.rev != b.rev) return a.rev > b.rev;
}
// This compares by position in the input file(s)
if (a.i != b.i) return a.i > b.i;
int res;
h->i = i;
h->entry.bam_record = bam_init1();
- h->entry.tag = NULL;
+ h->entry.u.tag = NULL;
if (!h->entry.bam_record) goto mem_fail;
res = iter[i] ? sam_itr_next(fp[i], iter[i], h->entry.bam_record) : sam_read1(fp[i], hdr[i], h->entry.bam_record);
if (res >= 0) {
bam_translate(h->entry.bam_record, translation_tbl + i);
- h->pos = ((uint64_t)h->entry.bam_record->core.tid<<32) | (uint32_t)((int32_t)h->entry.bam_record->core.pos+1)<<1 | bam_is_rev(h->entry.bam_record);
+ h->pos = ((uint64_t)h->entry.bam_record->core.tid<<32) | (uint32_t)((int32_t)h->entry.bam_record->core.pos+1);
+ h->rev = bam_is_rev(h->entry.bam_record);
h->idx = idx++;
if (g_is_by_tag) {
- h->entry.tag = bam_aux_get(h->entry.bam_record, g_sort_tag);
+ h->entry.u.tag = bam_aux_get(h->entry.bam_record, g_sort_tag);
} else {
- h->entry.tag = NULL;
+ h->entry.u.tag = NULL;
}
}
else if (res == -1 && (!iter[i] || iter[i]->finished)) {
h->pos = HEAP_EMPTY;
bam_destroy1(h->entry.bam_record);
h->entry.bam_record = NULL;
- h->entry.tag = NULL;
+ h->entry.u.tag = NULL;
} else {
print_error(cmd, "failed to read first record from \"%s\"", fn[i]);
goto fail;
}
if ((j = (iter[heap->i]? sam_itr_next(fp[heap->i], iter[heap->i], b) : sam_read1(fp[heap->i], hdr[heap->i], b))) >= 0) {
bam_translate(b, translation_tbl + heap->i);
- heap->pos = ((uint64_t)b->core.tid<<32) | (uint32_t)((int)b->core.pos+1)<<1 | bam_is_rev(b);
+ heap->pos = ((uint64_t)b->core.tid<<32) | (uint32_t)((int)b->core.pos+1);
+ heap->rev = bam_is_rev(b);
heap->idx = idx++;
if (g_is_by_tag) {
- heap->entry.tag = bam_aux_get(heap->entry.bam_record, g_sort_tag);
+ heap->entry.u.tag = bam_aux_get(heap->entry.bam_record, g_sort_tag);
} else {
- heap->entry.tag = NULL;
+ heap->entry.u.tag = NULL;
}
} else if (j == -1 && (!iter[heap->i] || iter[heap->i]->finished)) {
heap->pos = HEAP_EMPTY;
bam_destroy1(heap->entry.bam_record);
heap->entry.bam_record = NULL;
- heap->entry.tag = NULL;
+ heap->entry.u.tag = NULL;
} else {
print_error(cmd, "\"%s\" is truncated", fn[heap->i]);
goto fail;
}
if (res >= 0) {
heap->pos = (((uint64_t)heap->entry.bam_record->core.tid<<32)
- | (uint32_t)((int32_t)heap->entry.bam_record->core.pos+1)<<1
- | bam_is_rev(heap->entry.bam_record));
+ | (uint32_t)((int32_t)heap->entry.bam_record->core.pos+1));
+ heap->rev = bam_is_rev(heap->entry.bam_record);
heap->idx = (*idx)++;
if (g_is_by_tag) {
- heap->entry.tag = bam_aux_get(heap->entry.bam_record, g_sort_tag);
+ heap->entry.u.tag = bam_aux_get(heap->entry.bam_record, g_sort_tag);
} else {
- heap->entry.tag = NULL;
+ heap->entry.u.tag = NULL;
}
} else if (res == -1) {
heap->pos = HEAP_EMPTY;
if (i < nfiles) bam_destroy1(heap->entry.bam_record);
heap->entry.bam_record = NULL;
- heap->entry.tag = NULL;
+ heap->entry.u.tag = NULL;
} else {
return -1;
}
// Get a read into the heap
h->i = i;
- h->entry.tag = NULL;
+ h->entry.u.tag = NULL;
if (i < n) {
h->entry.bam_record = bam_init1();
if (!h->entry.bam_record) goto mem_fail;
return -1;
}
- hts_set_threads(fpout, n_threads);
+ if (n_threads > 1) hts_set_threads(fpout, n_threads);
if (sam_hdr_write(fpout, hout) != 0) {
print_error_errno(cmd, "failed to write header to \"%s\"", out);
if (t != 0) return t;
return (int) (a.bam_record->core.flag&0xc0) - (int) (b.bam_record->core.flag&0xc0);
} else {
- pa = (uint64_t)a.bam_record->core.tid<<32|(a.bam_record->core.pos+1)<<1|bam_is_rev(a.bam_record);
- pb = (uint64_t)b.bam_record->core.tid<<32|(b.bam_record->core.pos+1)<<1|bam_is_rev(b.bam_record);
+ pa = (uint64_t)a.bam_record->core.tid<<32|(a.bam_record->core.pos+1);
+ pb = (uint64_t)b.bam_record->core.tid<<32|(b.bam_record->core.pos+1);
+
+ if (pa == pb) {
+ pa = bam_is_rev(a.bam_record);
+ pb = bam_is_rev(b.bam_record);
+ }
+
return pa < pb ? -1 : (pa > pb ? 1 : 0);
}
}
// equal to or greater than b, respectively.
static inline int bam1_cmp_by_tag(const bam1_tag a, const bam1_tag b)
{
- const uint8_t* aux_a = a.tag;
- const uint8_t* aux_b = b.tag;
+ const uint8_t* aux_a = a.u.tag;
+ const uint8_t* aux_b = b.u.tag;
if (aux_a == NULL && aux_b != NULL) {
return -1;
return -1;
}
+#define NUMBASE 256
+#define STEP 8
+
+static int ks_radixsort(size_t n, bam1_tag *buf, const bam_hdr_t *h)
+{
+ int curr = 0, ret = -1;
+ ssize_t i;
+ bam1_tag *buf_ar2[2], *bam_a, *bam_b;
+ uint64_t max_pos = 0, max_digit = 0, shift = 0;
+
+ for (i = 0; i < n; i++) {
+ bam1_t *b = buf[i].bam_record;
+ int32_t tid = b->core.tid == -1 ? h->n_targets : b->core.tid;
+ buf[i].u.pos = (uint64_t)tid<<32 | (uint64_t)(uint32_t)(b->core.pos+1)<<1 | bam_is_rev(b);
+ if (max_pos < buf[i].u.pos)
+ max_pos = buf[i].u.pos;
+ }
+
+ while (max_pos) {
+ ++max_digit;
+ max_pos = max_pos >> 1;
+ }
+
+ buf_ar2[0] = buf;
+ buf_ar2[1] = (bam1_tag *)malloc(sizeof(bam1_tag) * n);
+ if (buf_ar2[1] == NULL) {
+ print_error("sort", "couldn't allocate memory for temporary buf");
+ goto err;
+ }
+
+ while (shift < max_digit){
+ size_t remainders[NUMBASE] = { 0 };
+ bam_a = buf_ar2[curr]; bam_b = buf_ar2[1-curr];
+ for (i = 0; i < n; ++i)
+ remainders[(bam_a[i].u.pos >> shift) % NUMBASE]++;
+ for (i = 1; i < NUMBASE; ++i)
+ remainders[i] += remainders[i - 1];
+ for (i = n - 1; i >= 0; i--) {
+ size_t j = --remainders[(bam_a[i].u.pos >> shift) % NUMBASE];
+ bam_b[j] = bam_a[i];
+ }
+ shift += STEP;
+ curr = 1 - curr;
+ }
+ if (curr == 1) {
+ bam1_tag *end = buf + n;
+ bam_a = buf_ar2[0]; bam_b = buf_ar2[1];
+ while (bam_a < end) *bam_a++ = *bam_b++;
+ }
+
+ ret = 0;
+err:
+ free(buf_ar2[1]);
+ return ret;
+}
+
static void *worker(void *data)
{
worker_t *w = (worker_t*)data;
char *name;
w->error = 0;
- ks_mergesort(sort, w->buf_len, w->buf, 0);
+
+ if (!g_is_by_qname && !g_is_by_tag) {
+ if (ks_radixsort(w->buf_len, w->buf, w->h) < 0) {
+ w->error = errno;
+ return NULL;
+ }
+ } else {
+ ks_mergesort(sort, w->buf_len, w->buf, 0);
+ }
if (w->no_save)
return 0;
mem_full = 1;
}
- // Pull out the pointer to the sort tag if applicable
+ // Pull out the value of the position
+ // or the pointer to the sort tag if applicable
if (g_is_by_tag) {
- buf[k].tag = bam_aux_get(buf[k].bam_record, g_sort_tag);
+ buf[k].u.tag = bam_aux_get(buf[k].bam_record, g_sort_tag);
} else {
- buf[k].tag = NULL;
+ buf[k].u.tag = NULL;
}
++k;
in_mem = calloc(n_threads > 0 ? n_threads : 1, sizeof(in_mem[0]));
if (!in_mem) goto err;
num_in_mem = sort_blocks(n_files, k, buf, prefix, header, n_threads,
- in_mem);
+ in_mem);
if (num_in_mem < 0) goto err;
} else {
num_in_mem = 0;
// write the final output
if (n_files == 0 && num_in_mem < 2) { // a single block
- ks_mergesort(sort, k, buf, 0);
if (write_buffer(fnout, modeout, k, buf, header, n_threads, out_fmt) != 0) {
print_error_errno("sort", "failed to create \"%s\"", fnout);
goto err;
bam_destroy1(b);
free(buf);
free(bam_mem);
+ free(in_mem);
bam_hdr_destroy(header);
if (fp) sam_close(fp);
return ret;
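
The position sort introduced above packs reference id, position and strand into one 64-bit key and then runs a least-significant-digit radix sort, 8 bits per pass, stopping once the highest set bit of the largest key has been covered. The following is a minimal standalone sketch of that technique, not the samtools implementation: it sorts bare `uint64_t` keys instead of `bam1_tag` entries, and the names `pack_key` and `radix_sort_u64` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define RADIX_BITS 8
#define RADIX_SIZE (1u << RADIX_BITS)

/* Pack reference id, 0-based position and strand into one sortable key:
   tid in the high 32 bits, (pos + 1) shifted left by one, strand in bit 0.
   The casts keep the shift in unsigned 64-bit arithmetic. */
static uint64_t pack_key(uint32_t tid, int32_t pos, int is_rev)
{
    return (uint64_t)tid << 32
         | (uint64_t)(uint32_t)(pos + 1) << 1
         | (uint64_t)(is_rev != 0);
}

/* LSD radix sort of n keys, one byte per pass.
   Returns 0 on success, -1 if the scratch buffer cannot be allocated. */
static int radix_sort_u64(uint64_t *keys, size_t n)
{
    uint64_t *tmp, *src = keys, *dst, max_key = 0;
    unsigned shift, max_bits = 0;
    size_t i;

    assert(keys != NULL || n == 0);
    if (n < 2) return 0;

    /* Count the significant bits of the largest key to bound the passes. */
    for (i = 0; i < n; i++)
        if (keys[i] > max_key) max_key = keys[i];
    while (max_key) { max_bits++; max_key >>= 1; }

    tmp = malloc(n * sizeof(*tmp));
    if (!tmp) return -1;
    dst = tmp;

    for (shift = 0; shift < max_bits; shift += RADIX_BITS) {
        size_t count[RADIX_SIZE] = { 0 };
        uint64_t *swap;

        /* Histogram of the current digit, then prefix sums. */
        for (i = 0; i < n; i++)
            count[(src[i] >> shift) & (RADIX_SIZE - 1)]++;
        for (i = 1; i < RADIX_SIZE; i++)
            count[i] += count[i - 1];

        /* Walk backwards so that equal digits keep their order (stable). */
        for (i = n; i-- > 0; )
            dst[--count[(src[i] >> shift) & (RADIX_SIZE - 1)]] = src[i];

        swap = src; src = dst; dst = swap;
    }

    /* After an odd number of passes the sorted data sits in tmp. */
    if (src != keys) memcpy(keys, src, n * sizeof(*keys));
    free(tmp);
    return 0;
}
```

Calling `radix_sort_u64` on an array filled with `pack_key(tid, pos, is_rev)` values orders the keys by reference, then position, then strand. Stability of each pass is what lets the byte-wise passes compose into a full 64-bit ordering.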
/* bamshuf.c -- collate subcommand.
Copyright (C) 2012 Broad Institute.
- Copyright (C) 2013, 2015 Genome Research Ltd.
+ Copyright (C) 2013, 2015, 2018 Genome Research Ltd.
Author: Heng Li <lh3@sanger.ac.uk>
#include <stdlib.h>
#include <string.h>
#include <assert.h>
+#include <errno.h>
+#ifdef _WIN32
+# define WIN32_LEAN_AND_MEAN
+# include <windows.h>
+#endif
#include "htslib/sam.h"
#include "htslib/hts.h"
#include "htslib/ksort.h"
#include "samtools.h"
#include "htslib/thread_pool.h"
#include "sam_opts.h"
+#include "htslib/khash.h"
#define DEF_CLEVEL 1
KSORT_INIT(bamshuf, elem_t, elem_lt)
+
+typedef struct {
+ int written;
+ bam1_t *b;
+} bam_item_t;
+
+typedef struct {
+ bam1_t *bam_pool;
+ bam_item_t *items;
+ size_t size;
+ size_t index;
+} bam_list_t;
+
+typedef struct {
+ bam_item_t *bi;
+} store_item_t;
+
+KHASH_MAP_INIT_STR(bam_store, store_item_t)
+
+
+static bam_item_t *store_bam(bam_list_t *list) {
+ size_t old_index = list->index;
+
+ list->items[list->index++].written = 0;
+
+ if (list->index >= list->size)
+ list->index = 0;
+
+ return &list->items[old_index];
+}
+
+
+static int write_bam_needed(bam_list_t *list) {
+ return !list->items[list->index].written;
+}
+
+
+static void mark_bam_as_written(bam_list_t *list) {
+ list->items[list->index].written = 1;
+}
+
+
+static int create_bam_list(bam_list_t *list, size_t max_size) {
+ size_t i;
+
+ list->size = list->index = 0;
+ list->items = NULL;
+ list->bam_pool = NULL;
+
+ if ((list->items = malloc(max_size * sizeof(bam_item_t))) == NULL) {
+ return 1;
+ }
+
+ if ((list->bam_pool = calloc(max_size, sizeof(bam1_t))) == NULL) {
+ return 1;
+ }
+
+ for (i = 0; i < max_size; i++) {
+ list->items[i].b = &list->bam_pool[i];
+ list->items[i].written = 1;
+ }
+
+ list->size = max_size;
+ list->index = 0;
+
+ return 0;
+}
+
+
+static void destroy_bam_list(bam_list_t *list) {
+ size_t i;
+
+ for (i = 0; i < list->size; i++) {
+ free(list->bam_pool[i].data);
+ }
+
+ free(list->bam_pool);
+ free(list->items);
+}
+
+
+static inline int write_to_bin_file(bam1_t *bam, int64_t *count, samFile **bin_files, char **names, bam_hdr_t *header, int files) {
+ uint32_t x;
+
+ x = hash_X31_Wang(bam_get_qname(bam)) % files;
+
+ if (sam_write1(bin_files[x], header, bam) < 0) {
+ print_error_errno("collate", "Couldn't write to intermediate file \"%s\"", names[x]);
+ return 1;
+ }
+
+ ++count[x];
+
+ return 0;
+}
+
+
static int bamshuf(const char *fn, int n_files, const char *pre, int clevel,
- int is_stdout, sam_global_args *ga)
+ int is_stdout, const char *output_file, int fast, int store_max, sam_global_args *ga)
{
samFile *fp, *fpw = NULL, **fpt = NULL;
char **fnt = NULL, modew[8];
bam1_t *b = NULL;
- int i, l, r;
+ int i, counter, l, r;
bam_hdr_t *h = NULL;
int64_t j, max_cnt = 0, *cnt = NULL;
elem_t *a = NULL;
goto fail;
}
+ // open final output file
+ l = strlen(pre);
+
+ sprintf(modew, "wb%d", (clevel >= 0 && clevel <= 9)? clevel : DEF_CLEVEL);
+
+ if (!is_stdout && !output_file) { // output to a file (name based on prefix)
+ char *fnw = (char*)calloc(l + 5, 1);
+ if (!fnw) goto mem_fail;
+ if (ga->out.format == unknown_format)
+ sprintf(fnw, "%s.bam", pre); // "wb" above makes BAM the default
+ else
+ sprintf(fnw, "%s.%s", pre, hts_format_file_extension(&ga->out));
+ fpw = sam_open_format(fnw, modew, &ga->out);
+ free(fnw);
+ } else if (output_file) { // output to a given file
+ modew[0] = 'w'; modew[1] = '\0';
+ sam_open_mode(modew + 1, output_file, NULL);
+ j = strlen(modew);
+ snprintf(modew + j, sizeof(modew) - j, "%d",
+ (clevel >= 0 && clevel <= 9)? clevel : DEF_CLEVEL);
+ fpw = sam_open_format(output_file, modew, &ga->out);
+ } else fpw = sam_open_format("-", modew, &ga->out); // output to stdout
+ if (fpw == NULL) {
+ if (is_stdout) print_error_errno("collate", "Cannot open standard output");
+ else print_error_errno("collate", "Cannot open output file \"%s.bam\"", pre);
+ goto fail;
+ }
+ if (p.pool) hts_set_opt(fpw, HTS_OPT_THREAD_POOL, &p);
+
+ if (sam_hdr_write(fpw, h) < 0) {
+ print_error_errno("collate", "Couldn't write header");
+ goto fail;
+ }
+
fnt = (char**)calloc(n_files, sizeof(char*));
if (!fnt) goto mem_fail;
fpt = (samFile**)calloc(n_files, sizeof(samFile*));
cnt = (int64_t*)calloc(n_files, 8);
if (!cnt) goto mem_fail;
- l = strlen(pre);
-
- for (i = 0; i < n_files; ++i) {
- fnt[i] = (char*)calloc(l + 10, 1);
+ for (i = counter = 0; i < n_files; ++i) {
+ fnt[i] = (char*)calloc(l + 20, 1);
if (!fnt[i]) goto mem_fail;
- sprintf(fnt[i], "%s.%.4d.bam", pre, i);
- fpt[i] = sam_open(fnt[i], "wb1");
+ do {
+ sprintf(fnt[i], "%s.%04d.bam", pre, counter++);
+ fpt[i] = sam_open(fnt[i], "wxb1");
+ } while (!fpt[i] && errno == EEXIST);
if (fpt[i] == NULL) {
print_error_errno("collate", "Cannot open intermediate file \"%s\"", fnt[i]);
goto fail;
}
+ if (p.pool) hts_set_opt(fpt[i], HTS_OPT_THREAD_POOL, &p);
if (sam_hdr_write(fpt[i], h) < 0) {
print_error_errno("collate", "Couldn't write header to intermediate file \"%s\"", fnt[i]);
goto fail;
}
}
- b = bam_init1();
- if (!b) goto mem_fail;
- while ((r = sam_read1(fp, h, b)) >= 0) {
- uint32_t x;
- x = hash_X31_Wang(bam_get_qname(b)) % n_files;
- if (sam_write1(fpt[x], h, b) < 0) {
- print_error_errno("collate", "Couldn't write to intermediate file \"%s\"", fnt[x]);
+
+ if (fast) {
+ khash_t(bam_store) *stored = kh_init(bam_store);
+ khiter_t itr;
+ bam_list_t list;
+ int err = 0;
+ if (!stored) goto mem_fail;
+
+ if (store_max < 2) store_max = 2;
+
+ if (create_bam_list(&list, store_max)) {
+ fprintf(stderr, "[collate] ERROR: unable to create bam list.\n");
+ err = 1;
+ goto fast_fail;
+ }
+
+ while ((r = sam_read1(fp, h, list.items[list.index].b)) >= 0) {
+ int ret;
+ bam1_t *b = list.items[list.index].b;
+ int readflag = b->core.flag & (BAM_FREAD1 | BAM_FREAD2);
+
+ // strictly paired reads only
+ if (!(b->core.flag & (BAM_FSECONDARY | BAM_FSUPPLEMENTARY)) && (readflag == BAM_FREAD1 || readflag == BAM_FREAD2)) {
+
+ itr = kh_get(bam_store, stored, bam_get_qname(b));
+
+ if (itr == kh_end(stored)) {
+ // new read
+ itr = kh_put(bam_store, stored, bam_get_qname(b), &ret);
+
+ if (ret > 0) { // okay to go ahead store it
+ kh_value(stored, itr).bi = store_bam(&list);
+
+ // see if the next one on the list needs to be written out
+ if (write_bam_needed(&list)) {
+ if (write_to_bin_file(list.items[list.index].b, cnt, fpt, fnt, h, n_files)) { // returns 1 on failure, never a negative value
+ fprintf(stderr, "[collate] ERROR: could not write line.\n");
+ err = 1;
+ goto fast_fail;
+ } else {
+ mark_bam_as_written(&list);
+
+ itr = kh_get(bam_store, stored, bam_get_qname(list.items[list.index].b));
+
+ if (itr != kh_end(stored)) {
+ kh_del(bam_store, stored, itr);
+ } else {
+ fprintf(stderr, "[collate] ERROR: stored value not in hash.\n");
+ err = 1;
+ goto fast_fail;
+ }
+ }
+ }
+ } else if (ret == 0) {
+ fprintf(stderr, "[collate] ERROR: value already in hash.\n");
+ err = 1;
+ goto fast_fail;
+ } else {
+ fprintf(stderr, "[collate] ERROR: unable to store in hash.\n");
+ err = 1;
+ goto fast_fail;
+ }
+ } else { // we have a match
+ // write out the reads in R1 R2 order
+ bam1_t *r1, *r2;
+
+ if (b->core.flag & BAM_FREAD1) {
+ r1 = b;
+ r2 = kh_value(stored, itr).bi->b;
+ } else {
+ r1 = kh_value(stored, itr).bi->b;
+ r2 = b;
+ }
+
+ if (sam_write1(fpw, h, r1) < 0) {
+ fprintf(stderr, "[collate] ERROR: could not write r1 alignment.\n");
+ err = 1;
+ goto fast_fail;
+ }
+
+ if (sam_write1(fpw, h, r2) < 0) {
+ fprintf(stderr, "[collate] ERROR: could not write r2 alignment.\n");
+ err = 1;
+ goto fast_fail;
+ }
+
+ mark_bam_as_written(&list);
+
+ // remove stored read
+ kh_value(stored, itr).bi->written = 1;
+ kh_del(bam_store, stored, itr);
+ }
+ }
+ }
+
+ for (list.index = 0; list.index < list.size; list.index++) {
+ if (write_bam_needed(&list)) {
+ bam1_t *b = list.items[list.index].b;
+
+ if (write_to_bin_file(b, cnt, fpt, fnt, h, n_files)) {
+ err = 1;
+ goto fast_fail;
+ } else {
+ itr = kh_get(bam_store, stored, bam_get_qname(b));
+ kh_del(bam_store, stored, itr);
+ }
+ }
+ }
+
+ fast_fail:
+ if (err) {
+ for (itr = kh_begin(stored); itr != kh_end(stored); ++itr) {
+ if (kh_exist(stored, itr)) {
+ kh_del(bam_store, stored, itr);
+ }
+ }
+
+ kh_destroy(bam_store, stored);
+ destroy_bam_list(&list);
goto fail;
+ } else {
+ kh_destroy(bam_store, stored);
+ destroy_bam_list(&list);
}
- ++cnt[x];
+
+ } else {
+ b = bam_init1();
+ if (!b) goto mem_fail;
+
+ while ((r = sam_read1(fp, h, b)) >= 0) {
+ if (write_to_bin_file(b, cnt, fpt, fnt, h, n_files)) {
+ bam_destroy1(b);
+ goto fail;
+ }
+ }
+
+ bam_destroy1(b);
}
- bam_destroy1(b);
- b = NULL;
+
if (r < -1) {
fprintf(stderr, "Error reading input file\n");
goto fail;
fpt = NULL;
sam_close(fp);
fp = NULL;
- // merge
- sprintf(modew, "wb%d", (clevel >= 0 && clevel <= 9)? clevel : DEF_CLEVEL);
- if (!is_stdout) { // output to a file
- char *fnw = (char*)calloc(l + 5, 1);
- if (!fnw) goto mem_fail;
- if (ga->out.format == unknown_format)
- sprintf(fnw, "%s.bam", pre); // "wb" above makes BAM the default
- else
- sprintf(fnw, "%s.%s", pre, hts_format_file_extension(&ga->out));
- fpw = sam_open_format(fnw, modew, &ga->out);
- free(fnw);
- } else fpw = sam_open_format("-", modew, &ga->out); // output to stdout
- if (fpw == NULL) {
- if (is_stdout) print_error_errno("collate", "Cannot open standard output");
- else print_error_errno("collate", "Cannot open output file \"%s.bam\"", pre);
- goto fail;
- }
- if (p.pool) hts_set_opt(fpw, HTS_OPT_THREAD_POOL, &p);
-
- if (sam_hdr_write(fpw, h) < 0) {
- print_error_errno("collate", "Couldn't write header");
- goto fail;
- }
+ // merge
a = malloc(max_cnt * sizeof(elem_t));
if (!a) goto mem_fail;
for (j = 0; j < max_cnt; ++j) {
if (fp) sam_close(fp);
if (fpw) sam_close(fpw);
if (h) bam_hdr_destroy(h);
- if (b) bam_destroy1(b);
for (i = 0; i < n_files; ++i) {
if (fnt) free(fnt[i]);
if (fpt && fpt[i]) sam_close(fpt[i]);
return 1;
}
-static int usage(FILE *fp, int n_files) {
+static int usage(FILE *fp, int n_files, int reads_store) {
fprintf(fp,
- "Usage: samtools collate [-Ou] [-n nFiles] [-l cLevel] <in.bam> <out.prefix>\n\n"
+ "Usage: samtools collate [-Ou] [-o <name>] [-n nFiles] [-l cLevel] <in.bam> [<prefix>]\n\n"
"Options:\n"
" -O output to stdout\n"
+ " -o output file name (use prefix if not set)\n"
" -u uncompressed BAM output\n"
+ " -f fast (only primary alignments)\n"
+ " -r working reads stored (with -f) [%d]\n" // reads_store
" -l INT compression level [%d]\n" // DEF_CLEVEL
" -n INT number of temporary files [%d]\n", // n_files
- DEF_CLEVEL, n_files);
+ reads_store, DEF_CLEVEL, n_files);
sam_global_opt_help(fp, "-....@");
+ fprintf(fp,
+ " <prefix> is required unless the -o or -O options are used.\n");
return 1;
}
+char * generate_prefix() {
+ char *prefix;
+ unsigned int pid = getpid();
+#ifdef _WIN32
+# define PREFIX_LEN (MAX_PATH + 16)
+ DWORD ret;
+ prefix = calloc(PREFIX_LEN, sizeof(*prefix));
+ if (!prefix) {
+ perror("collate");
+ return NULL;
+ }
+ ret = GetTempPathA(MAX_PATH, prefix);
+ if (ret > MAX_PATH || ret == 0) {
+ fprintf(stderr,
+ "[E::collate] Couldn't get path for temporary files.\n");
+ free(prefix);
+ return NULL;
+ }
+ snprintf(prefix + ret, PREFIX_LEN - ret, "\\%x", pid);
+ return prefix;
+#else
+# define PREFIX_LEN 64
+ prefix = malloc(PREFIX_LEN);
+ if (!prefix) {
+ perror("collate");
+ return NULL;
+ }
+ snprintf(prefix, PREFIX_LEN, "/tmp/collate%x", pid);
+ return prefix;
+#endif
+}
+
int main_bamshuf(int argc, char *argv[])
{
- int c, n_files = 64, clevel = DEF_CLEVEL, is_stdout = 0, is_un = 0;
+ int c, n_files = 64, clevel = DEF_CLEVEL, is_stdout = 0, is_un = 0, fast_coll = 0, reads_store = 10000, ret, pre_mem = 0;
+ const char *output_file = NULL;
+ char *prefix = NULL;
sam_global_args ga = SAM_GLOBAL_ARGS_INIT;
static const struct option lopts[] = {
SAM_OPT_GLOBAL_OPTIONS('-', 0, 0, 0, 0, '@'),
{ NULL, 0, NULL, 0 }
};
- while ((c = getopt_long(argc, argv, "n:l:uO@:", lopts, NULL)) >= 0) {
+ while ((c = getopt_long(argc, argv, "n:l:uOo:@:fr:", lopts, NULL)) >= 0) {
switch (c) {
case 'n': n_files = atoi(optarg); break;
case 'l': clevel = atoi(optarg); break;
case 'u': is_un = 1; break;
case 'O': is_stdout = 1; break;
+ case 'o': output_file = optarg; break;
+ case 'f': fast_coll = 1; break;
+ case 'r': reads_store = atoi(optarg); break;
default: if (parse_sam_global_opt(c, optarg, lopts, &ga) == 0) break;
/* else fall-through */
- case '?': return usage(stderr, n_files);
+ case '?': return usage(stderr, n_files, reads_store);
}
}
if (is_un) clevel = 0;
- if (optind + 2 > argc)
- return usage(stderr, n_files);
+ if (argc >= optind + 2) prefix = argv[optind+1];
+ if (!(prefix || is_stdout || output_file))
+ return usage(stderr, n_files, reads_store);
+ if (is_stdout && output_file) {
+ fprintf(stderr, "collate: -o and -O options cannot be used together.\n");
+ return usage(stderr, n_files, reads_store);
+ }
+ if (!prefix) {
+ prefix = generate_prefix();
+ pre_mem = 1;
+ }
+
+ if (!prefix) return EXIT_FAILURE;
+
+ ret = bamshuf(argv[optind], n_files, prefix, clevel, is_stdout,
+ output_file, fast_coll, reads_store, &ga);
+
+ if (pre_mem) free(prefix);
- return bamshuf(argv[optind], n_files, argv[optind+1], clevel, is_stdout, &ga);
+ return ret;
}
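
The fast collate path above keeps a bounded ring of recently seen reads: a read whose mate is still in the ring is written out as a pair straight away, while a read that is about to be overwritten without having met its mate is spilled to the temporary bin files. The sketch below illustrates that bookkeeping only, under simplifying assumptions: it tracks read names rather than BAM records, replaces the khash lookup with a linear search, evicts at the moment a slot is reused rather than one step ahead, and counts events instead of writing files. All names (`ring_t`, `ring_add`, `ring_flush`) are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define RING_SIZE 4
#define NAME_LEN 32

typedef struct {
    char name[NAME_LEN];
    int pending;            /* 1 while the read still waits for its mate */
} slot_t;

typedef struct {
    slot_t slots[RING_SIZE];
    size_t next;            /* slot that will be reused next */
    int paired;             /* pairs matched while both mates were in memory */
    int spilled;            /* reads evicted or flushed without a mate */
} ring_t;

static void ring_init(ring_t *r)
{
    assert(r != NULL);
    memset(r, 0, sizeof(*r));
}

/* Linear search; stands in for the hash lookup by query name. */
static slot_t *ring_find(ring_t *r, const char *name)
{
    size_t i;
    for (i = 0; i < RING_SIZE; i++)
        if (r->slots[i].pending && strcmp(r->slots[i].name, name) == 0)
            return &r->slots[i];
    return NULL;
}

/* Feed one read name. Returns 1 if it completed a pair, 0 if it was stored. */
static int ring_add(ring_t *r, const char *name)
{
    slot_t *mate = ring_find(r, name);
    slot_t *s;

    if (mate) {                     /* mate is in memory: emit the pair */
        mate->pending = 0;
        r->paired++;
        return 1;
    }

    s = &r->slots[r->next];
    if (s->pending) r->spilled++;   /* oldest unpaired read falls out */
    strncpy(s->name, name, NAME_LEN - 1);
    s->name[NAME_LEN - 1] = '\0';
    s->pending = 1;
    r->next = (r->next + 1) % RING_SIZE;
    return 0;
}

/* At end of input every read still pending is spilled. Returns that count. */
static int ring_flush(ring_t *r)
{
    size_t i;
    int n = 0;
    for (i = 0; i < RING_SIZE; i++) {
        if (r->slots[i].pending) {
            r->slots[i].pending = 0;
            n++;
        }
    }
    r->spilled += n;
    return n;
}
```

A larger ring (the `-r` option in the patch) raises the chance that both mates are in memory at the same time, so fewer reads take the slower route through the temporary files.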
/* bamshuf.c -- collate subcommand.
Copyright (C) 2012 Broad Institute.
- Copyright (C) 2013, 2015 Genome Research Ltd.
+ Copyright (C) 2013, 2015, 2018 Genome Research Ltd.
Author: Heng Li <lh3@sanger.ac.uk>
#include <stdlib.h>
#include <string.h>
#include <assert.h>
+#include <errno.h>
+#ifdef _WIN32
+# define WIN32_LEAN_AND_MEAN
+# include <windows.h>
+#endif
#include "htslib/sam.h"
#include "htslib/hts.h"
#include "htslib/ksort.h"
#include "samtools.h"
#include "htslib/thread_pool.h"
#include "sam_opts.h"
+#include "htslib/khash.h"
#define DEF_CLEVEL 1
KSORT_INIT(bamshuf, elem_t, elem_lt)
+
+typedef struct {
+ int written;
+ bam1_t *b;
+} bam_item_t;
+
+typedef struct {
+ bam1_t *bam_pool;
+ bam_item_t *items;
+ size_t size;
+ size_t index;
+} bam_list_t;
+
+typedef struct {
+ bam_item_t *bi;
+} store_item_t;
+
+KHASH_MAP_INIT_STR(bam_store, store_item_t)
+
+
+static bam_item_t *store_bam(bam_list_t *list) {
+ size_t old_index = list->index;
+
+ list->items[list->index++].written = 0;
+
+ if (list->index >= list->size)
+ list->index = 0;
+
+ return &list->items[old_index];
+}
+
+
+static int write_bam_needed(bam_list_t *list) {
+ return !list->items[list->index].written;
+}
+
+
+static void mark_bam_as_written(bam_list_t *list) {
+ list->items[list->index].written = 1;
+}
+
+
+static int create_bam_list(bam_list_t *list, size_t max_size) {
+ size_t i;
+
+ list->size = list->index = 0;
+ list->items = NULL;
+ list->bam_pool = NULL;
+
+ if ((list->items = malloc(max_size * sizeof(bam_item_t))) == NULL) {
+ return 1;
+ }
+
+ if ((list->bam_pool = calloc(max_size, sizeof(bam1_t))) == NULL) {
+ return 1;
+ }
+
+ for (i = 0; i < max_size; i++) {
+ list->items[i].b = &list->bam_pool[i];
+ list->items[i].written = 1;
+ }
+
+ list->size = max_size;
+ list->index = 0;
+
+ return 0;
+}
+
+
+static void destroy_bam_list(bam_list_t *list) {
+ size_t i;
+
+ for (i = 0; i < list->size; i++) {
+ free(list->bam_pool[i].data);
+ }
+
+ free(list->bam_pool);
+ free(list->items);
+}
+
+
+static inline int write_to_bin_file(bam1_t *bam, int64_t *count, samFile **bin_files, char **names, bam_hdr_t *header, int files) {
+ uint32_t x;
+
+ x = hash_X31_Wang(bam_get_qname(bam)) % files;
+
+ if (sam_write1(bin_files[x], header, bam) < 0) {
+ print_error_errno("collate", "Couldn't write to intermediate file \"%s\"", names[x]);
+ return 1;
+ }
+
+ ++count[x];
+
+ return 0;
+}
+
+
static int bamshuf(const char *fn, int n_files, const char *pre, int clevel,
- int is_samtools_stdout, sam_global_args *ga)
+ int is_samtools_stdout, const char *output_file, int fast, int store_max, sam_global_args *ga)
{
samFile *fp, *fpw = NULL, **fpt = NULL;
char **fnt = NULL, modew[8];
bam1_t *b = NULL;
- int i, l, r;
+ int i, counter, l, r;
bam_hdr_t *h = NULL;
int64_t j, max_cnt = 0, *cnt = NULL;
elem_t *a = NULL;
goto fail;
}
+ // open final output file
+ l = strlen(pre);
+
+ sprintf(modew, "wb%d", (clevel >= 0 && clevel <= 9)? clevel : DEF_CLEVEL);
+
+ if (!is_samtools_stdout && !output_file) { // output to a file (name based on prefix)
+ char *fnw = (char*)calloc(l + 5, 1);
+ if (!fnw) goto mem_fail;
+ if (ga->out.format == unknown_format)
+ sprintf(fnw, "%s.bam", pre); // "wb" above makes BAM the default
+ else
+ sprintf(fnw, "%s.%s", pre, hts_format_file_extension(&ga->out));
+ fpw = sam_open_format(fnw, modew, &ga->out);
+ free(fnw);
+ } else if (output_file) { // output to a given file
+ modew[0] = 'w'; modew[1] = '\0';
+ sam_open_mode(modew + 1, output_file, NULL);
+ j = strlen(modew);
+ snprintf(modew + j, sizeof(modew) - j, "%d",
+ (clevel >= 0 && clevel <= 9)? clevel : DEF_CLEVEL);
+ fpw = sam_open_format(output_file, modew, &ga->out);
+ } else fpw = sam_open_format("-", modew, &ga->out); // output to samtools_stdout
+ if (fpw == NULL) {
+ if (is_samtools_stdout) print_error_errno("collate", "Cannot open standard output");
+ else print_error_errno("collate", "Cannot open output file \"%s.bam\"", pre);
+ goto fail;
+ }
+ if (p.pool) hts_set_opt(fpw, HTS_OPT_THREAD_POOL, &p);
+
+ if (sam_hdr_write(fpw, h) < 0) {
+ print_error_errno("collate", "Couldn't write header");
+ goto fail;
+ }
+
fnt = (char**)calloc(n_files, sizeof(char*));
if (!fnt) goto mem_fail;
fpt = (samFile**)calloc(n_files, sizeof(samFile*));
cnt = (int64_t*)calloc(n_files, 8);
if (!cnt) goto mem_fail;
- l = strlen(pre);
-
- for (i = 0; i < n_files; ++i) {
- fnt[i] = (char*)calloc(l + 10, 1);
+ for (i = counter = 0; i < n_files; ++i) {
+ fnt[i] = (char*)calloc(l + 20, 1);
if (!fnt[i]) goto mem_fail;
- sprintf(fnt[i], "%s.%.4d.bam", pre, i);
- fpt[i] = sam_open(fnt[i], "wb1");
+ do {
+ sprintf(fnt[i], "%s.%04d.bam", pre, counter++);
+ fpt[i] = sam_open(fnt[i], "wxb1");
+ } while (!fpt[i] && errno == EEXIST);
if (fpt[i] == NULL) {
print_error_errno("collate", "Cannot open intermediate file \"%s\"", fnt[i]);
goto fail;
}
+ if (p.pool) hts_set_opt(fpt[i], HTS_OPT_THREAD_POOL, &p);
if (sam_hdr_write(fpt[i], h) < 0) {
print_error_errno("collate", "Couldn't write header to intermediate file \"%s\"", fnt[i]);
goto fail;
}
}
- b = bam_init1();
- if (!b) goto mem_fail;
- while ((r = sam_read1(fp, h, b)) >= 0) {
- uint32_t x;
- x = hash_X31_Wang(bam_get_qname(b)) % n_files;
- if (sam_write1(fpt[x], h, b) < 0) {
- print_error_errno("collate", "Couldn't write to intermediate file \"%s\"", fnt[x]);
+
+ if (fast) {
+ khash_t(bam_store) *stored = kh_init(bam_store);
+ khiter_t itr;
+ bam_list_t list;
+ int err = 0;
+ if (!stored) goto mem_fail;
+
+ if (store_max < 2) store_max = 2;
+
+ if (create_bam_list(&list, store_max)) {
+ fprintf(samtools_stderr, "[collate] ERROR: unable to create bam list.\n");
+ err = 1;
+ goto fast_fail;
+ }
+
+ while ((r = sam_read1(fp, h, list.items[list.index].b)) >= 0) {
+ int ret;
+ bam1_t *b = list.items[list.index].b;
+ int readflag = b->core.flag & (BAM_FREAD1 | BAM_FREAD2);
+
+ // strictly paired reads only
+ if (!(b->core.flag & (BAM_FSECONDARY | BAM_FSUPPLEMENTARY)) && (readflag == BAM_FREAD1 || readflag == BAM_FREAD2)) {
+
+ itr = kh_get(bam_store, stored, bam_get_qname(b));
+
+ if (itr == kh_end(stored)) {
+ // new read
+ itr = kh_put(bam_store, stored, bam_get_qname(b), &ret);
+
+ if (ret > 0) { // okay to go ahead store it
+ kh_value(stored, itr).bi = store_bam(&list);
+
+ // see if the next one on the list needs to be written out
+ if (write_bam_needed(&list)) {
+ if (write_to_bin_file(list.items[list.index].b, cnt, fpt, fnt, h, n_files)) { // returns 1 on failure, never a negative value
+ fprintf(samtools_stderr, "[collate] ERROR: could not write line.\n");
+ err = 1;
+ goto fast_fail;
+ } else {
+ mark_bam_as_written(&list);
+
+ itr = kh_get(bam_store, stored, bam_get_qname(list.items[list.index].b));
+
+ if (itr != kh_end(stored)) {
+ kh_del(bam_store, stored, itr);
+ } else {
+ fprintf(samtools_stderr, "[collate] ERROR: stored value not in hash.\n");
+ err = 1;
+ goto fast_fail;
+ }
+ }
+ }
+ } else if (ret == 0) {
+ fprintf(samtools_stderr, "[collate] ERROR: value already in hash.\n");
+ err = 1;
+ goto fast_fail;
+ } else {
+ fprintf(samtools_stderr, "[collate] ERROR: unable to store in hash.\n");
+ err = 1;
+ goto fast_fail;
+ }
+ } else { // we have a match
+ // write out the reads in R1 R2 order
+ bam1_t *r1, *r2;
+
+ if (b->core.flag & BAM_FREAD1) {
+ r1 = b;
+ r2 = kh_value(stored, itr).bi->b;
+ } else {
+ r1 = kh_value(stored, itr).bi->b;
+ r2 = b;
+ }
+
+ if (sam_write1(fpw, h, r1) < 0) {
+ fprintf(samtools_stderr, "[collate] ERROR: could not write r1 alignment.\n");
+ err = 1;
+ goto fast_fail;
+ }
+
+ if (sam_write1(fpw, h, r2) < 0) {
+ fprintf(samtools_stderr, "[collate] ERROR: could not write r2 alignment.\n");
+ err = 1;
+ goto fast_fail;
+ }
+
+ mark_bam_as_written(&list);
+
+ // remove stored read
+ kh_value(stored, itr).bi->written = 1;
+ kh_del(bam_store, stored, itr);
+ }
+ }
+ }
+
+ for (list.index = 0; list.index < list.size; list.index++) {
+ if (write_bam_needed(&list)) {
+ bam1_t *b = list.items[list.index].b;
+
+ if (write_to_bin_file(b, cnt, fpt, fnt, h, n_files)) {
+ err = 1;
+ goto fast_fail;
+ } else {
+ itr = kh_get(bam_store, stored, bam_get_qname(b));
+ kh_del(bam_store, stored, itr);
+ }
+ }
+ }
+
+ fast_fail:
+ if (err) {
+ for (itr = kh_begin(stored); itr != kh_end(stored); ++itr) {
+ if (kh_exist(stored, itr)) {
+ kh_del(bam_store, stored, itr);
+ }
+ }
+
+ kh_destroy(bam_store, stored);
+ destroy_bam_list(&list);
goto fail;
+ } else {
+ kh_destroy(bam_store, stored);
+ destroy_bam_list(&list);
}
- ++cnt[x];
+
+ } else {
+ b = bam_init1();
+ if (!b) goto mem_fail;
+
+ while ((r = sam_read1(fp, h, b)) >= 0) {
+ if (write_to_bin_file(b, cnt, fpt, fnt, h, n_files)) {
+ bam_destroy1(b);
+ goto fail;
+ }
+ }
+
+ bam_destroy1(b);
}
- bam_destroy1(b);
- b = NULL;
+
if (r < -1) {
fprintf(samtools_stderr, "Error reading input file\n");
goto fail;
fpt = NULL;
sam_close(fp);
fp = NULL;
- // merge
- sprintf(modew, "wb%d", (clevel >= 0 && clevel <= 9)? clevel : DEF_CLEVEL);
- if (!is_samtools_stdout) { // output to a file
- char *fnw = (char*)calloc(l + 5, 1);
- if (!fnw) goto mem_fail;
- if (ga->out.format == unknown_format)
- sprintf(fnw, "%s.bam", pre); // "wb" above makes BAM the default
- else
- sprintf(fnw, "%s.%s", pre, hts_format_file_extension(&ga->out));
- fpw = sam_open_format(fnw, modew, &ga->out);
- free(fnw);
- } else fpw = sam_open_format("-", modew, &ga->out); // output to samtools_stdout
- if (fpw == NULL) {
- if (is_samtools_stdout) print_error_errno("collate", "Cannot open standard output");
- else print_error_errno("collate", "Cannot open output file \"%s.bam\"", pre);
- goto fail;
- }
- if (p.pool) hts_set_opt(fpw, HTS_OPT_THREAD_POOL, &p);
-
- if (sam_hdr_write(fpw, h) < 0) {
- print_error_errno("collate", "Couldn't write header");
- goto fail;
- }
+ // merge
a = malloc(max_cnt * sizeof(elem_t));
if (!a) goto mem_fail;
for (j = 0; j < max_cnt; ++j) {
if (fp) sam_close(fp);
if (fpw) sam_close(fpw);
if (h) bam_hdr_destroy(h);
- if (b) bam_destroy1(b);
for (i = 0; i < n_files; ++i) {
if (fnt) free(fnt[i]);
if (fpt && fpt[i]) sam_close(fpt[i]);
return 1;
}
-static int usage(FILE *fp, int n_files) {
+static int usage(FILE *fp, int n_files, int reads_store) {
fprintf(fp,
- "Usage: samtools collate [-Ou] [-n nFiles] [-l cLevel] <in.bam> <out.prefix>\n\n"
+ "Usage: samtools collate [-Ou] [-o <name>] [-n nFiles] [-l cLevel] <in.bam> [<prefix>]\n\n"
"Options:\n"
" -O output to samtools_stdout\n"
+ " -o output file name (use prefix if not set)\n"
" -u uncompressed BAM output\n"
+ " -f fast (only primary alignments)\n"
+ " -r working reads stored (with -f) [%d]\n" // reads_store
" -l INT compression level [%d]\n" // DEF_CLEVEL
" -n INT number of temporary files [%d]\n", // n_files
- DEF_CLEVEL, n_files);
+ reads_store, DEF_CLEVEL, n_files);
sam_global_opt_help(fp, "-....@");
+ fprintf(fp,
+ " <prefix> is required unless the -o or -O options are used.\n");
return 1;
}
+char * generate_prefix() {
+ char *prefix;
+ unsigned int pid = getpid();
+#ifdef _WIN32
+# define PREFIX_LEN (MAX_PATH + 16)
+ DWORD ret;
+ prefix = calloc(PREFIX_LEN, sizeof(*prefix));
+ if (!prefix) {
+ perror("collate");
+ return NULL;
+ }
+ ret = GetTempPathA(MAX_PATH, prefix);
+ if (ret > MAX_PATH || ret == 0) {
+ fprintf(samtools_stderr,
+ "[E::collate] Couldn't get path for temporary files.\n");
+ free(prefix);
+ return NULL;
+ }
+ snprintf(prefix + ret, PREFIX_LEN - ret, "\\%x", pid);
+ return prefix;
+#else
+# define PREFIX_LEN 64
+ prefix = malloc(PREFIX_LEN);
+ if (!prefix) {
+ perror("collate");
+ return NULL;
+ }
+ snprintf(prefix, PREFIX_LEN, "/tmp/collate%x", pid);
+ return prefix;
+#endif
+}
+
int main_bamshuf(int argc, char *argv[])
{
- int c, n_files = 64, clevel = DEF_CLEVEL, is_samtools_stdout = 0, is_un = 0;
+ int c, n_files = 64, clevel = DEF_CLEVEL, is_samtools_stdout = 0, is_un = 0, fast_coll = 0, reads_store = 10000, ret, pre_mem = 0;
+ const char *output_file = NULL;
+ char *prefix = NULL;
sam_global_args ga = SAM_GLOBAL_ARGS_INIT;
static const struct option lopts[] = {
SAM_OPT_GLOBAL_OPTIONS('-', 0, 0, 0, 0, '@'),
{ NULL, 0, NULL, 0 }
};
- while ((c = getopt_long(argc, argv, "n:l:uO@:", lopts, NULL)) >= 0) {
+ while ((c = getopt_long(argc, argv, "n:l:uOo:@:fr:", lopts, NULL)) >= 0) {
switch (c) {
case 'n': n_files = atoi(optarg); break;
case 'l': clevel = atoi(optarg); break;
case 'u': is_un = 1; break;
case 'O': is_samtools_stdout = 1; break;
+ case 'o': output_file = optarg; break;
+ case 'f': fast_coll = 1; break;
+ case 'r': reads_store = atoi(optarg); break;
default: if (parse_sam_global_opt(c, optarg, lopts, &ga) == 0) break;
/* else fall-through */
- case '?': return usage(samtools_stderr, n_files);
+ case '?': return usage(samtools_stderr, n_files, reads_store);
}
}
if (is_un) clevel = 0;
- if (optind + 2 > argc)
- return usage(samtools_stderr, n_files);
+ if (argc >= optind + 2) prefix = argv[optind+1];
+ if (!(prefix || is_samtools_stdout || output_file))
+ return usage(samtools_stderr, n_files, reads_store);
+ if (is_samtools_stdout && output_file) {
+ fprintf(samtools_stderr, "collate: -o and -O options cannot be used together.\n");
+ return usage(samtools_stderr, n_files, reads_store);
+ }
+ if (!prefix) {
+ prefix = generate_prefix();
+ pre_mem = 1;
+ }
+
+ if (!prefix) return EXIT_FAILURE;
+
+ ret = bamshuf(argv[optind], n_files, prefix, clevel, is_samtools_stdout,
+ output_file, fast_coll, reads_store, &ga);
+
+ if (pre_mem) free(prefix);
- return bamshuf(argv[optind], n_files, argv[optind+1], clevel, is_samtools_stdout, &ga);
+ return ret;
}
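The temporary prefix built by generate_prefix() on non-Windows systems can be looked at in isolation. What follows is a minimal sketch, assuming a POSIX-style /tmp directory. `make_collate_prefix` is a hypothetical helper name, not part of samtools, and it takes the process id as a parameter so that the result is deterministic.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper (not part of samtools): format a per-process
 * temporary prefix into buf, following the "/tmp/collate%x" pattern.
 * Returns 0 on success, -1 if buf is too small or formatting fails. */
static int make_collate_prefix(char *buf, size_t buf_len, unsigned int pid)
{
    int n = snprintf(buf, buf_len, "/tmp/collate%x", pid);
    if (n < 0 || (size_t)n >= buf_len)
        return -1;   /* the output would have been truncated */
    return 0;
}
```

Checking the snprintf return value against the buffer size makes truncation an explicit error instead of silently producing a shortened prefix.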
#include "htslib/hts.h"
#include "samtools.h"
+#include "version.h"
int bam_taf2baf(int argc, char *argv[]);
int bam_mpileup(int argc, char *argv[]);
int main_addreplacerg(int argc, char *argv[]);
int faidx_main(int argc, char *argv[]);
int dict_main(int argc, char *argv[]);
+int fqidx_main(int argc, char *argv[]);
+const char *samtools_version()
+{
+ return SAMTOOLS_VERSION;
+}
static void usage(FILE *fp)
{
" -- Indexing\n"
" dict create a sequence dictionary file\n"
" faidx index/extract FASTA\n"
+" fqidx index/extract FASTQ\n"
" index index alignment\n"
"\n"
" -- Editing\n"
else if (strcmp(argv[1], "index") == 0) ret = bam_index(argc-1, argv+1);
else if (strcmp(argv[1], "idxstats") == 0) ret = bam_idxstats(argc-1, argv+1);
else if (strcmp(argv[1], "faidx") == 0) ret = faidx_main(argc-1, argv+1);
+ else if (strcmp(argv[1], "fqidx") == 0) ret = fqidx_main(argc-1, argv+1);
else if (strcmp(argv[1], "dict") == 0) ret = dict_main(argc-1, argv+1);
else if (strcmp(argv[1], "fixmate") == 0) ret = bam_mating(argc-1, argv+1);
else if (strcmp(argv[1], "rmdup") == 0) ret = bam_rmdup(argc-1, argv+1);
#include "htslib/hts.h"
#include "samtools.h"
+#include "version.h"
int bam_taf2baf(int argc, char *argv[]);
int bam_mpileup(int argc, char *argv[]);
int bam_merge(int argc, char *argv[]);
int bam_index(int argc, char *argv[]);
int bam_sort(int argc, char *argv[]);
-// int bam_tview_main(int argc, char *argv[]);
+//int bam_tview_main(int argc, char *argv[]);
int bam_mating(int argc, char *argv[]);
int bam_rmdup(int argc, char *argv[]);
int bam_flagstat(int argc, char *argv[]);
int main_addreplacerg(int argc, char *argv[]);
int faidx_main(int argc, char *argv[]);
int dict_main(int argc, char *argv[]);
+int fqidx_main(int argc, char *argv[]);
+const char *samtools_version()
+{
+ return SAMTOOLS_VERSION;
+}
static void usage(FILE *fp)
{
" -- Indexing\n"
" dict create a sequence dictionary file\n"
" faidx index/extract FASTA\n"
+" fqidx index/extract FASTQ\n"
" index index alignment\n"
"\n"
" -- Editing\n"
else if (strcmp(argv[1], "index") == 0) ret = bam_index(argc-1, argv+1);
else if (strcmp(argv[1], "idxstats") == 0) ret = bam_idxstats(argc-1, argv+1);
else if (strcmp(argv[1], "faidx") == 0) ret = faidx_main(argc-1, argv+1);
+ else if (strcmp(argv[1], "fqidx") == 0) ret = fqidx_main(argc-1, argv+1);
else if (strcmp(argv[1], "dict") == 0) ret = dict_main(argc-1, argv+1);
else if (strcmp(argv[1], "fixmate") == 0) ret = bam_mating(argc-1, argv+1);
else if (strcmp(argv[1], "rmdup") == 0) ret = bam_rmdup(argc-1, argv+1);
fprintf(samtools_stderr, "[main] The `pileup' command has been removed. Please use `mpileup' instead.\n");
return 1;
}
- // else if (strcmp(argv[1], "tview") == 0) ret = bam_tview_main(argc-1, argv+1);
+ //else if (strcmp(argv[1], "tview") == 0) ret = bam_tview_main(argc-1, argv+1);
else if (strcmp(argv[1], "--version") == 0) {
- fprintf(samtools_stdout,
+ fprintf(samtools_stdout,
"samtools %s\n"
"Using htslib %s\n"
"Copyright (C) 2018 Genome Research Ltd.\n",
kstream_t *ks;
hts_idx_t **idx;
aux_t **aux;
- int *n_plp, dret, i, n, c, min_mapQ = 0;
+ int *n_plp, dret, i, j, m, n, c, min_mapQ = 0, skip_DN = 0;
int64_t *cnt;
const bam_pileup1_t **plp;
int usage = 0;
{ NULL, 0, NULL, 0 }
};
- while ((c = getopt_long(argc, argv, "Q:", lopts, NULL)) >= 0) {
+ while ((c = getopt_long(argc, argv, "Q:j", lopts, NULL)) >= 0) {
switch (c) {
case 'Q': min_mapQ = atoi(optarg); break;
+ case 'j': skip_DN = 1; break;
default: if (parse_sam_global_opt(c, optarg, lopts, &ga) == 0) break;
/* else fall-through */
case '?': usage = 1; break;
if (usage || optind + 2 > argc) {
fprintf(stderr, "Usage: samtools bedcov [options] <in.bed> <in1.bam> [...]\n\n");
fprintf(stderr, "Options:\n");
- fprintf(stderr, " -Q <int> mapping quality threshold [0]\n");
+ fprintf(stderr, " -Q <int> mapping quality threshold [0]\n");
+ fprintf(stderr, " -j do not include deletions (D) and ref skips (N) in bedcov computation\n");
sam_global_opt_help(stderr, "-.--.-");
return 1;
}
bam_mplp_set_maxcnt(mplp, 64000);
memset(cnt, 0, 8 * n);
while (bam_mplp_auto(mplp, &tid, &pos, n_plp, plp) > 0)
- if (pos >= beg && pos < end)
- for (i = 0; i < n; ++i) cnt[i] += n_plp[i];
+ if (pos >= beg && pos < end) {
+ for (i = 0; i < n; ++i) {
+ m = 0; // count skipped bases separately for each input file
+ if (skip_DN)
+ for (j = 0; j < n_plp[i]; ++j) {
+ const bam_pileup1_t *pi = plp[i] + j;
+ if (pi->is_del || pi->is_refskip) ++m;
+ }
+ cnt[i] += n_plp[i] - m;
+ }
+ }
for (i = 0; i < n; ++i) {
kputc('\t', &str);
kputl(cnt[i], &str);
kstream_t *ks;
hts_idx_t **idx;
aux_t **aux;
- int *n_plp, dret, i, n, c, min_mapQ = 0;
+ int *n_plp, dret, i, j, m, n, c, min_mapQ = 0, skip_DN = 0;
int64_t *cnt;
const bam_pileup1_t **plp;
int usage = 0;
{ NULL, 0, NULL, 0 }
};
- while ((c = getopt_long(argc, argv, "Q:", lopts, NULL)) >= 0) {
+ while ((c = getopt_long(argc, argv, "Q:j", lopts, NULL)) >= 0) {
switch (c) {
case 'Q': min_mapQ = atoi(optarg); break;
+ case 'j': skip_DN = 1; break;
default: if (parse_sam_global_opt(c, optarg, lopts, &ga) == 0) break;
/* else fall-through */
case '?': usage = 1; break;
if (usage || optind + 2 > argc) {
fprintf(samtools_stderr, "Usage: samtools bedcov [options] <in.bed> <in1.bam> [...]\n\n");
fprintf(samtools_stderr, "Options:\n");
- fprintf(samtools_stderr, " -Q <int> mapping quality threshold [0]\n");
+ fprintf(samtools_stderr, " -Q <int> mapping quality threshold [0]\n");
+ fprintf(samtools_stderr, " -j do not include deletions (D) and ref skips (N) in bedcov computation\n");
sam_global_opt_help(samtools_stderr, "-.--.-");
return 1;
}
bam_mplp_set_maxcnt(mplp, 64000);
memset(cnt, 0, 8 * n);
while (bam_mplp_auto(mplp, &tid, &pos, n_plp, plp) > 0)
- if (pos >= beg && pos < end)
- for (i = 0; i < n; ++i) cnt[i] += n_plp[i];
+ if (pos >= beg && pos < end) {
+ for (i = 0; i < n; ++i) {
+ m = 0; // count skipped bases separately for each input file
+ if (skip_DN)
+ for (j = 0; j < n_plp[i]; ++j) {
+ const bam_pileup1_t *pi = plp[i] + j;
+ if (pi->is_del || pi->is_refskip) ++m;
+ }
+ cnt[i] += n_plp[i] - m;
+ }
+ }
for (i = 0; i < n; ++i) {
kputc('\t', &str);
kputl(cnt[i], &str);
}
- fputs(str.s, samtools_stdout) & fputc('\n', samtools_stdout);
+ samtools_puts(str.s);
bam_mplp_destroy(mplp);
continue;
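The effect of the bedcov -j option on a single pileup position can be sketched on its own. This is a simplified illustration, not the samtools implementation: `pile_entry_t` is a hypothetical stand-in for the two bam_pileup1_t flags used above, and `position_depth` is a hypothetical helper name.

```c
#include <stddef.h>

/* Hypothetical stand-in for the two bam_pileup1_t flags used here. */
typedef struct {
    int is_del;      /* read has a deletion (D) at this position */
    int is_refskip;  /* read has a reference skip (N) at this position */
} pile_entry_t;

/* Depth contributed by one input file at one position. With skip_dn set,
 * entries that are deletions or reference skips are not counted. */
static int position_depth(const pile_entry_t *entries, int n_entries, int skip_dn)
{
    int skipped = 0;   /* local, so it starts at zero on every call */
    int j;

    if (skip_dn) {
        for (j = 0; j < n_entries; ++j)
            if (entries[j].is_del || entries[j].is_refskip)
                ++skipped;
    }
    return n_entries - skipped;
}
```

Keeping the skipped count local to one file and one position means the number subtracted never includes entries from another input file.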
/* bedidx.c -- BED file indexing.
Copyright (C) 2011 Broad Institute.
- Copyright (C) 2014 Genome Research Ltd.
+ Copyright (C) 2014,2017 Genome Research Ltd.
Author: Heng Li <lh3@sanger.ac.uk>
* @param reg_hash the region hash table with interval lists as values
*/
-static void bed_unify(void *reg_hash) {
+void bed_unify(void *reg_hash) {
int i, j, new_n;
reghash_t *h;
gzFile fp;
kstream_t *ks = NULL;
int dret;
- unsigned int line = 0;
+ unsigned int line = 0, save_errno;
kstring_t str = { 0, 0, NULL };
if (NULL == h) return NULL;
// has called their reference "browser" or "track".
if (0 == strcmp(ref, "browser")) continue;
if (0 == strcmp(ref, "track")) continue;
- fprintf(stderr, "[bed_read] Parse error reading %s at line %u\n",
- fn, line);
- goto fail_no_msg;
+ if (num < 1) {
+ fprintf(stderr,
+ "[bed_read] Parse error reading \"%s\" at line %u\n",
+ fn, line);
+ } else {
+ fprintf(stderr,
+ "[bed_read] Parse error reading \"%s\" at line %u : "
+ "end (%u) must not be less than start (%u)\n",
+ fn, line, end, beg);
+ }
+ errno = 0; // Prevent caller from printing misleading error messages
+ goto fail;
}
// Put reg in the hash table if not already there
//bed_unify(h);
return h;
fail:
- fprintf(stderr, "[bed_read] Error reading %s : %s\n", fn, strerror(errno));
- fail_no_msg:
+ save_errno = errno;
if (ks) ks_destroy(ks);
if (fp) gzclose(fp);
free(str.s);
bed_destroy(h);
+ errno = save_errno;
return NULL;
}
/* bedidx.c -- BED file indexing.
Copyright (C) 2011 Broad Institute.
- Copyright (C) 2014 Genome Research Ltd.
+ Copyright (C) 2014,2017 Genome Research Ltd.
Author: Heng Li <lh3@sanger.ac.uk>
* @param reg_hash the region hash table with interval lists as values
*/
-static void bed_unify(void *reg_hash) {
+void bed_unify(void *reg_hash) {
int i, j, new_n;
reghash_t *h;
gzFile fp;
kstream_t *ks = NULL;
int dret;
- unsigned int line = 0;
+ unsigned int line = 0, save_errno;
kstring_t str = { 0, 0, NULL };
if (NULL == h) return NULL;
// has called their reference "browser" or "track".
if (0 == strcmp(ref, "browser")) continue;
if (0 == strcmp(ref, "track")) continue;
- fprintf(samtools_stderr, "[bed_read] Parse error reading %s at line %u\n",
- fn, line);
- goto fail_no_msg;
+ if (num < 1) {
+ fprintf(samtools_stderr,
+ "[bed_read] Parse error reading \"%s\" at line %u\n",
+ fn, line);
+ } else {
+ fprintf(samtools_stderr,
+ "[bed_read] Parse error reading \"%s\" at line %u : "
+ "end (%u) must not be less than start (%u)\n",
+ fn, line, end, beg);
+ }
+ errno = 0; // Prevent caller from printing misleading error messages
+ goto fail;
}
// Put reg in the hash table if not already there
//bed_unify(h);
return h;
fail:
- fprintf(samtools_stderr, "[bed_read] Error reading %s : %s\n", fn, strerror(errno));
- fail_no_msg:
+ save_errno = errno;
if (ks) ks_destroy(ks);
if (fp) gzclose(fp);
free(str.s);
bed_destroy(h);
+ errno = save_errno;
return NULL;
}
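The fail path of bed_read above saves errno before cleanup and restores it afterwards, so that the caller sees the value set by the original failure. A minimal sketch of that pattern follows. `noisy_cleanup` and `cleanup_preserving_errno` are hypothetical names used only for illustration.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical cleanup step that, like many library calls, may overwrite errno. */
static void noisy_cleanup(void *p)
{
    free(p);
    errno = EINVAL;   /* simulate a cleanup call clobbering errno */
}

/* Run the cleanup, then hand the caller the errno value from before it. */
static void cleanup_preserving_errno(void *p)
{
    int saved = errno;
    noisy_cleanup(p);
    errno = saved;
}
```

The saved value is held in an int here because errno is an int in standard C.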
+/* bedidx.h -- BED file indexing header file.
+
+ Copyright (C) 2017 Genome Research Ltd.
+
+ Author: Valeriu Ohan <vo2@sanger.ac.uk>
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE. */
+
#ifndef BEDIDX_H
#define BEDIDX_H
void *bed_hash_regions(void *reg_hash, char **regs, int first, int last, int *op);
const char* bed_get(void *reg_hash, int index, int filter);
hts_reglist_t *bed_reglist(void *reg_hash, int filter, int *count_regs);
+void bed_unify(void *_h);
#endif
/* faidx.c -- faidx subcommand.
- Copyright (C) 2008, 2009, 2013, 2016 Genome Research Ltd.
+ Copyright (C) 2008, 2009, 2013, 2016, 2018 Genome Research Ltd.
Portions copyright (C) 2011 Broad Institute.
Author: Heng Li <lh3@sanger.ac.uk>
THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
-DEALINGS IN THE SOFTWARE. */
+DEALINGS IN THE SOFTWARE.
+
+History:
+
+ * 2016-01-12: Pierre Lindenbaum @yokofakun : added options -o -n
+
+*/
#include <config.h>
+#include <ctype.h>
+#include <string.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
-
+#include <stdarg.h>
+#include <errno.h>
+#include <getopt.h>
+#include <limits.h>
#include <htslib/faidx.h>
+#include <htslib/hts.h>
+#include <htslib/hfile.h>
+#include <htslib/kstring.h>
#include "samtools.h"
-static int usage(FILE *fp, int exit_status)
+#define DEFAULT_FASTA_LINE_LEN 60
+
+static unsigned char comp_base[256] = {
+ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
+ 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
+ 32, '!', '"', '#', '$', '%', '&', '\'','(', ')', '*', '+', ',', '-', '.', '/',
+'0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<', '=', '>', '?',
+'@', 'T', 'V', 'G', 'H', 'E', 'F', 'C', 'D', 'I', 'J', 'M', 'L', 'K', 'N', 'O',
+'P', 'Q', 'Y', 'S', 'A', 'A', 'B', 'W', 'X', 'R', 'Z', '[', '\\',']', '^', '_',
+'`', 't', 'v', 'g', 'h', 'e', 'f', 'c', 'd', 'i', 'j', 'm', 'l', 'k', 'n', 'o',
+'p', 'q', 'y', 's', 'a', 'a', 'b', 'w', 'x', 'r', 'z', '{', '|', '}', '~', 127,
+128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143,
+144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159,
+160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175,
+176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191,
+192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207,
+208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223,
+224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239,
+240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255,
+};
+
+static void reverse_complement(char *str, int len) {
+ char c;
+ int i = 0, j = len - 1;
+
+ while (i <= j) {
+ c = str[i];
+ str[i] = comp_base[(unsigned char)str[j]];
+ str[j] = comp_base[(unsigned char)c];
+ i++;
+ j--;
+ }
+}
+
+
+static void reverse(char *str, int len) {
+ char c;
+ int i = 0, j = len - 1;
+
+ while (i < j) {
+ c = str[i];
+ str[i] = str[j];
+ str[j] = c;
+ i++;
+ j--;
+ }
+}
+
+
+static int write_line(FILE *file, const char *line, const char *name, const int ignore,
+ const int length, const int seq_len) {
+ int beg, end;
+
+ if (seq_len < 0) {
+ fprintf(stderr, "[faidx] Failed to fetch sequence in %s\n", name);
+
+ if (ignore && seq_len == -2) {
+ return EXIT_SUCCESS;
+ } else {
+ return EXIT_FAILURE;
+ }
+ } else if (seq_len == 0) {
+ fprintf(stderr, "[faidx] Zero length sequence: %s\n", name);
+ } else if (hts_parse_reg(name, &beg, &end) && (end < INT_MAX) && (seq_len != end - beg)) {
+ fprintf(stderr, "[faidx] Truncated sequence: %s\n", name);
+ }
+
+ size_t i, seq_sz = seq_len;
+
+ for (i = 0; i < seq_sz; i += length)
+ {
+ size_t len = i + length < seq_sz ? length : seq_sz - i;
+ if (fwrite(line + i, 1, len, file) < len ||
+ fputc('\n', file) == EOF) {
+ print_error_errno("faidx", "failed to write output");
+ return EXIT_FAILURE;
+ }
+ }
+
+ return EXIT_SUCCESS;
+}
+
+
+static int write_output(faidx_t *faid, FILE *file, const char *name, const int ignore,
+ const int length, const int rev,
+ const char *pos_strand_name, const char *neg_strand_name,
+ enum fai_format_options format) {
+ int seq_len;
+ char *seq = fai_fetch(faid, name, &seq_len);
+
+ if (format == FAI_FASTA) {
+ fprintf(file, ">%s%s\n", name, rev ? neg_strand_name : pos_strand_name);
+ } else {
+ fprintf(file, "@%s%s\n", name, rev ? neg_strand_name : pos_strand_name);
+ }
+
+ if (rev && seq_len > 0) {
+ reverse_complement(seq, seq_len);
+ }
+
+ if (write_line(file, seq, name, ignore, length, seq_len) == EXIT_FAILURE) {
+ free(seq);
+ return EXIT_FAILURE;
+ }
+
+ free(seq);
+
+ if (format == FAI_FASTQ) {
+ fprintf(file, "+\n");
+
+ char *qual = fai_fetchqual(faid, name, &seq_len);
+
+ if (rev && seq_len > 0) {
+ reverse(qual, seq_len);
+ }
+
+ if (write_line(file, qual, name, ignore, length, seq_len) == EXIT_FAILURE) {
+ free(qual);
+ return EXIT_FAILURE;
+ }
+
+ free(qual);
+ }
+
+ return EXIT_SUCCESS;
+}
+
+
+static int read_regions_from_file(faidx_t *faid, hFILE *in_file, FILE *file, const int ignore,
+ const int length, const int rev,
+ const char *pos_strand_name,
+ const char *neg_strand_name,
+ enum fai_format_options format) {
+ kstring_t line = {0, 0, NULL};
+ int ret = EXIT_FAILURE;
+
+ while (line.l = 0, kgetline(&line, (kgets_func *)hgets, in_file) >= 0) {
+ if ((ret = write_output(faid, file, line.s, ignore, length, rev, pos_strand_name, neg_strand_name, format)) == EXIT_FAILURE) {
+ break;
+ }
+ }
+
+ free(line.s);
+
+ return ret;
+}
+
+static int usage(FILE *fp, enum fai_format_options format, int exit_status)
{
- fprintf(fp, "Usage: samtools faidx <file.fa|file.fa.gz> [<reg> [...]]\n");
+ char *tool, *file_type;
+
+ if (format == FAI_FASTA) {
+ tool = "faidx <file.fa|file.fa.gz>";
+ file_type = "FASTA";
+ } else {
+ tool = "fqidx <file.fq|file.fq.gz>";
+ file_type = "FASTQ";
+ }
+
+ fprintf(fp, "Usage: samtools %s [<reg> [...]]\n", tool);
+ fprintf(fp, "Options:\n"
+ " -o, --output FILE Write %s to file.\n"
+ " -n, --length INT Length of %s sequence line. [60]\n"
+ " -c, --continue Continue after trying to retrieve missing region.\n"
+ " -r, --region-file FILE File of regions. Format is chr:from-to. One per line.\n"
+ " -i, --reverse-complement Reverse complement sequences.\n"
+ " --mark-strand TYPE Add strand indicator to sequence name\n"
+ " TYPE = rc for /rc on negative strand (default)\n"
+ " no for no strand indicator\n"
+ " sign for (+) / (-)\n"
+ " custom,<pos>,<neg> for custom indicator\n",
+ file_type, file_type);
+
+
+ if (format == FAI_FASTA) {
+ fprintf(fp, " -f, --fastq File and index in FASTQ format.\n");
+ }
+
+ fprintf(fp, " -h, --help This message.\n");
+
return exit_status;
}
-int faidx_main(int argc, char *argv[])
+int faidx_core(int argc, char *argv[], enum fai_format_options format)
{
- int c;
- while((c = getopt(argc, argv, "h")) >= 0)
- {
- switch(c)
- {
- case 'h':
- return usage(stdout, EXIT_SUCCESS);
+ int c, ignore_error = 0, rev = 0;
+ int line_len = DEFAULT_FASTA_LINE_LEN ;/* fasta line len */
+ char* output_file = NULL; /* output file (default is stdout ) */
+ char *region_file = NULL; // list of regions from file, one per line
+ char *pos_strand_name = ""; // Extension to add to name for +ve strand
+ char *neg_strand_name = "/rc"; // Extension to add to name for -ve strand
+ char *strand_names = NULL; // Used for custom strand annotation
+ FILE* file_out = stdout;/* output stream */
+
+ static const struct option lopts[] = {
+ { "output", required_argument, NULL, 'o' },
+ { "help", no_argument, NULL, 'h' },
+ { "length", required_argument, NULL, 'n' },
+ { "continue", no_argument, NULL, 'c' },
+ { "region-file", required_argument, NULL, 'r' },
+ { "fastq", no_argument, NULL, 'f' },
+ { "reverse-complement", no_argument, NULL, 'i' },
+ { "mark-strand", required_argument, NULL, 1000 },
+ { NULL, 0, NULL, 0 }
+ };
- default:
- return usage(stderr, EXIT_FAILURE);
+ while ((c = getopt_long(argc, argv, "ho:n:cr:fi", lopts, NULL)) >= 0) {
+ switch (c) {
+ case 'o': output_file = optarg; break;
+ case 'n': line_len = atoi(optarg);
+ if(line_len<1) {
+ fprintf(stderr,"[faidx] bad line length '%s', using default:%d\n",optarg,DEFAULT_FASTA_LINE_LEN);
+ line_len= DEFAULT_FASTA_LINE_LEN ;
+ }
+ break;
+ case 'c': ignore_error = 1; break;
+ case 'r': region_file = optarg; break;
+ case 'f': format = FAI_FASTQ; break;
+ case 'i': rev = 1; break;
+ case '?': return usage(stderr, format, EXIT_FAILURE);
+ case 'h': return usage(stdout, format, EXIT_SUCCESS);
+ case 1000:
+ if (strcmp(optarg, "no") == 0) {
+ pos_strand_name = neg_strand_name = "";
+ } else if (strcmp(optarg, "sign") == 0) {
+ pos_strand_name = "(+)";
+ neg_strand_name = "(-)";
+ } else if (strcmp(optarg, "rc") == 0) {
+ pos_strand_name = "";
+ neg_strand_name = "/rc";
+ } else if (strncmp(optarg, "custom,", 7) == 0) {
+ size_t len = strlen(optarg + 7);
+ size_t comma = strcspn(optarg + 7, ",");
+ free(strand_names);
+ strand_names = pos_strand_name = malloc(len + 2);
+ if (!strand_names) {
+ fprintf(stderr, "[faidx] Out of memory\n");
+ return EXIT_FAILURE;
+ }
+ neg_strand_name = pos_strand_name + comma + 1;
+ memcpy(pos_strand_name, optarg + 7, comma);
+ pos_strand_name[comma] = '\0';
+ if (comma < len)
+ memcpy(neg_strand_name, optarg + 7 + comma + 1,
+ len - comma);
+ neg_strand_name[len - comma] = '\0';
+ } else {
+ fprintf(stderr, "[faidx] Unknown --mark-strand option \"%s\"\n", optarg);
+ return usage(stderr, format, EXIT_FAILURE);
+ }
+ break;
+ default: break;
}
}
+
if ( argc==optind )
- return usage(stdout, EXIT_SUCCESS);
- if ( argc==2 )
+ return usage(stdout, format, EXIT_SUCCESS);
+
+ if ( optind+1 == argc && !region_file)
{
if (fai_build(argv[optind]) != 0) {
- fprintf(stderr, "Could not build fai index %s.fai\n", argv[optind]);
+ fprintf(stderr, "[faidx] Could not build fai index %s.fai\n", argv[optind]);
return EXIT_FAILURE;
}
return 0;
}
- faidx_t *fai = fai_load(argv[optind]);
+ faidx_t *fai = fai_load_format(argv[optind], format);
+
if ( !fai ) {
- fprintf(stderr, "Could not load fai index of %s\n", argv[optind]);
+ fprintf(stderr, "[faidx] Could not load fai index of %s\n", argv[optind]);
return EXIT_FAILURE;
}
- int exit_status = EXIT_SUCCESS;
+ /** output file provided by user */
+ if( output_file != NULL ) {
+ if( strcmp( output_file, argv[optind] ) == 0 ) {
+ fprintf(stderr,"[faidx] Same input/output : %s\n", output_file);
+ return EXIT_FAILURE;
+ }
- while ( ++optind<argc && exit_status == EXIT_SUCCESS)
- {
- printf(">%s\n", argv[optind]);
- int seq_len;
- char *seq = fai_fetch(fai, argv[optind], &seq_len);
- if ( seq_len < 0 ) {
- fprintf(stderr, "Failed to fetch sequence in %s\n", argv[optind]);
- exit_status = EXIT_FAILURE;
- break;
+ file_out = fopen( output_file, "w" );
+
+ if( file_out == NULL) {
+ fprintf(stderr,"[faidx] Cannot open \"%s\" for writing :%s.\n", output_file, strerror(errno) );
+ return EXIT_FAILURE;
}
- size_t i, seq_sz = seq_len;
- for (i=0; i<seq_sz; i+=60)
- {
- size_t len = i + 60 < seq_sz ? 60 : seq_sz - i;
- if (fwrite(seq + i, 1, len, stdout) < len ||
- putchar('\n') == EOF) {
- print_error_errno("faidx", "failed to write output");
- exit_status = EXIT_FAILURE;
- break;
+ }
+
+ int exit_status = EXIT_SUCCESS;
+
+ if (region_file) {
+ hFILE *rf;
+
+ if ((rf = hopen(region_file, "r"))) {
+ exit_status = read_regions_from_file(fai, rf, file_out, ignore_error, line_len, rev, pos_strand_name, neg_strand_name, format);
+
+ if (hclose(rf) != 0) {
+ fprintf(stderr, "[faidx] Warning: failed to close %s\n", region_file);
}
+ } else {
+ fprintf(stderr, "[faidx] Failed to open \"%s\" for reading.\n", region_file);
+ exit_status = EXIT_FAILURE;
}
- free(seq);
}
+
+ while ( ++optind<argc && exit_status == EXIT_SUCCESS) {
+ exit_status = write_output(fai, file_out, argv[optind], ignore_error, line_len, rev, pos_strand_name, neg_strand_name, format);
+ }
+
fai_destroy(fai);
- if (fflush(stdout) == EOF) {
+ if (fflush(file_out) == EOF) {
print_error_errno("faidx", "failed to flush output");
exit_status = EXIT_FAILURE;
}
+ if( output_file != NULL) fclose(file_out);
+ free(strand_names);
+
return exit_status;
}
+
+
+int faidx_main(int argc, char *argv[]) {
+ return faidx_core(argc, argv, FAI_FASTA);
+}
+
+
+int fqidx_main(int argc, char *argv[]) {
+ return faidx_core(argc, argv, FAI_FASTQ);
+}
+
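The in-place reverse complement used by the -i option can be sketched without the 256-entry lookup table. This is a simplified illustration under a stated assumption: only A, C, G and T (in either case) are complemented and every other character is left unchanged, whereas the table above also covers IUPAC ambiguity codes. `complement_base` and `revcomp_in_place` are hypothetical names.

```c
#include <stddef.h>

/* Complement of one base over a deliberately reduced alphabet (assumption):
 * A/C/G/T in either case; any other character is returned unchanged. */
static char complement_base(char b)
{
    switch (b) {
    case 'A': return 'T';  case 'a': return 't';
    case 'C': return 'G';  case 'c': return 'g';
    case 'G': return 'C';  case 'g': return 'c';
    case 'T': return 'A';  case 't': return 'a';
    default:  return b;
    }
}

/* Reverse-complement s[0..len) in place by swapping from both ends. */
static void revcomp_in_place(char *s, size_t len)
{
    size_t i, j;

    if (len == 0) return;
    for (i = 0, j = len - 1; i < j; ++i, --j) {
        char tmp = complement_base(s[i]);
        s[i] = complement_base(s[j]);
        s[j] = tmp;
    }
    if (i == j)   /* odd length: the middle base is complemented on its own */
        s[i] = complement_base(s[i]);
}
```

For quality strings only the order changes, which is why the code above uses a plain reverse() for them.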
/* faidx.c -- faidx subcommand.
- Copyright (C) 2008, 2009, 2013, 2016 Genome Research Ltd.
+ Copyright (C) 2008, 2009, 2013, 2016, 2018 Genome Research Ltd.
Portions copyright (C) 2011 Broad Institute.
Author: Heng Li <lh3@sanger.ac.uk>
THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
-DEALINGS IN THE SOFTWARE. */
+DEALINGS IN THE SOFTWARE.
+
+History:
+
+ * 2016-01-12: Pierre Lindenbaum @yokofakun : added options -o -n
+
+*/
#include <config.h>
+#include <ctype.h>
+#include <string.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
-
+#include <stdarg.h>
+#include <errno.h>
+#include <getopt.h>
+#include <limits.h>
#include <htslib/faidx.h>
+#include <htslib/hts.h>
+#include <htslib/hfile.h>
+#include <htslib/kstring.h>
#include "samtools.h"
-static int usage(FILE *fp, int exit_status)
+#define DEFAULT_FASTA_LINE_LEN 60
+
+static unsigned char comp_base[256] = {
+ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
+ 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
+ 32, '!', '"', '#', '$', '%', '&', '\'','(', ')', '*', '+', ',', '-', '.', '/',
+'0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<', '=', '>', '?',
+'@', 'T', 'V', 'G', 'H', 'E', 'F', 'C', 'D', 'I', 'J', 'M', 'L', 'K', 'N', 'O',
+'P', 'Q', 'Y', 'S', 'A', 'A', 'B', 'W', 'X', 'R', 'Z', '[', '\\',']', '^', '_',
+'`', 't', 'v', 'g', 'h', 'e', 'f', 'c', 'd', 'i', 'j', 'm', 'l', 'k', 'n', 'o',
+'p', 'q', 'y', 's', 'a', 'a', 'b', 'w', 'x', 'r', 'z', '{', '|', '}', '~', 127,
+128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143,
+144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159,
+160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175,
+176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191,
+192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207,
+208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223,
+224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239,
+240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255,
+};
+
+static void reverse_complement(char *str, int len) {
+ char c;
+ int i = 0, j = len - 1;
+
+ while (i <= j) {
+ c = str[i];
+ str[i] = comp_base[(unsigned char)str[j]];
+ str[j] = comp_base[(unsigned char)c];
+ i++;
+ j--;
+ }
+}
+
+
+static void reverse(char *str, int len) {
+ char c;
+ int i = 0, j = len - 1;
+
+ while (i < j) {
+ c = str[i];
+ str[i] = str[j];
+ str[j] = c;
+ i++;
+ j--;
+ }
+}
+
+
+static int write_line(FILE *file, const char *line, const char *name, const int ignore,
+ const int length, const int seq_len) {
+ int beg, end;
+
+ if (seq_len < 0) {
+ fprintf(samtools_stderr, "[faidx] Failed to fetch sequence in %s\n", name);
+
+ if (ignore && seq_len == -2) {
+ return EXIT_SUCCESS;
+ } else {
+ return EXIT_FAILURE;
+ }
+ } else if (seq_len == 0) {
+ fprintf(samtools_stderr, "[faidx] Zero length sequence: %s\n", name);
+ } else if (hts_parse_reg(name, &beg, &end) && (end < INT_MAX) && (seq_len != end - beg)) {
+ fprintf(samtools_stderr, "[faidx] Truncated sequence: %s\n", name);
+ }
+
+ size_t i, seq_sz = seq_len;
+
+ for (i = 0; i < seq_sz; i += length)
+ {
+ size_t len = i + length < seq_sz ? length : seq_sz - i;
+ if (fwrite(line + i, 1, len, file) < len ||
+ fputc('\n', file) == EOF) {
+ print_error_errno("faidx", "failed to write output");
+ return EXIT_FAILURE;
+ }
+ }
+
+ return EXIT_SUCCESS;
+}
+
+
+static int write_output(faidx_t *faid, FILE *file, const char *name, const int ignore,
+ const int length, const int rev,
+ const char *pos_strand_name, const char *neg_strand_name,
+ enum fai_format_options format) {
+ int seq_len;
+ char *seq = fai_fetch(faid, name, &seq_len);
+
+ if (format == FAI_FASTA) {
+ fprintf(file, ">%s%s\n", name, rev ? neg_strand_name : pos_strand_name);
+ } else {
+ fprintf(file, "@%s%s\n", name, rev ? neg_strand_name : pos_strand_name);
+ }
+
+ if (rev && seq_len > 0) {
+ reverse_complement(seq, seq_len);
+ }
+
+ if (write_line(file, seq, name, ignore, length, seq_len) == EXIT_FAILURE) {
+ free(seq);
+ return EXIT_FAILURE;
+ }
+
+ free(seq);
+
+ if (format == FAI_FASTQ) {
+ fprintf(file, "+\n");
+
+ char *qual = fai_fetchqual(faid, name, &seq_len);
+
+ if (rev && seq_len > 0) {
+ reverse(qual, seq_len);
+ }
+
+ if (write_line(file, qual, name, ignore, length, seq_len) == EXIT_FAILURE) {
+ free(qual);
+ return EXIT_FAILURE;
+ }
+
+ free(qual);
+ }
+
+ return EXIT_SUCCESS;
+}
+
+
+static int read_regions_from_file(faidx_t *faid, hFILE *in_file, FILE *file, const int ignore,
+ const int length, const int rev,
+ const char *pos_strand_name,
+ const char *neg_strand_name,
+ enum fai_format_options format) {
+ kstring_t line = {0, 0, NULL};
+ int ret = EXIT_FAILURE;
+
+ while (line.l = 0, kgetline(&line, (kgets_func *)hgets, in_file) >= 0) {
+ if ((ret = write_output(faid, file, line.s, ignore, length, rev, pos_strand_name, neg_strand_name, format)) == EXIT_FAILURE) {
+ break;
+ }
+ }
+
+ free(line.s);
+
+ return ret;
+}
+
+static int usage(FILE *fp, enum fai_format_options format, int exit_status)
{
- fprintf(fp, "Usage: samtools faidx <file.fa|file.fa.gz> [<reg> [...]]\n");
+ char *tool, *file_type;
+
+ if (format == FAI_FASTA) {
+ tool = "faidx <file.fa|file.fa.gz>";
+ file_type = "FASTA";
+ } else {
+ tool = "fqidx <file.fq|file.fq.gz>";
+ file_type = "FASTQ";
+ }
+
+ fprintf(fp, "Usage: samtools %s [<reg> [...]]\n", tool);
+ fprintf(fp, "Options:\n"
+ " -o, --output FILE Write %s to file.\n"
+ " -n, --length INT Length of %s sequence line. [60]\n"
+ " -c, --continue Continue after trying to retrieve missing region.\n"
+ " -r, --region-file FILE File of regions. Format is chr:from-to. One per line.\n"
+ " -i, --reverse-complement Reverse complement sequences.\n"
+ " --mark-strand TYPE Add strand indicator to sequence name\n"
+ " TYPE = rc for /rc on negative strand (default)\n"
+ " no for no strand indicator\n"
+ " sign for (+) / (-)\n"
+ " custom,<pos>,<neg> for custom indicator\n",
+ file_type, file_type);
+
+
+ if (format == FAI_FASTA) {
+ fprintf(fp, " -f, --fastq File and index in FASTQ format.\n");
+ }
+
+ fprintf(fp, " -h, --help This message.\n");
+
return exit_status;
}
-int faidx_main(int argc, char *argv[])
+int faidx_core(int argc, char *argv[], enum fai_format_options format)
{
- int c;
- while((c = getopt(argc, argv, "h")) >= 0)
- {
- switch(c)
- {
- case 'h':
- return usage(samtools_stdout, EXIT_SUCCESS);
+ int c, ignore_error = 0, rev = 0;
+ int line_len = DEFAULT_FASTA_LINE_LEN ;/* fasta line len */
+ char* output_file = NULL; /* output file (default is samtools_stdout ) */
+ char *region_file = NULL; // list of regions from file, one per line
+ char *pos_strand_name = ""; // Extension to add to name for +ve strand
+ char *neg_strand_name = "/rc"; // Extension to add to name for -ve strand
+ char *strand_names = NULL; // Used for custom strand annotation
+ FILE* file_out = samtools_stdout;/* output stream */
+
+ static const struct option lopts[] = {
+ { "output", required_argument, NULL, 'o' },
+ { "help", no_argument, NULL, 'h' },
+ { "length", required_argument, NULL, 'n' },
+ { "continue", no_argument, NULL, 'c' },
+ { "region-file", required_argument, NULL, 'r' },
+ { "fastq", no_argument, NULL, 'f' },
+ { "reverse-complement", no_argument, NULL, 'i' },
+ { "mark-strand", required_argument, NULL, 1000 },
+ { NULL, 0, NULL, 0 }
+ };
- default:
- return usage(samtools_stderr, EXIT_FAILURE);
+ while ((c = getopt_long(argc, argv, "ho:n:cr:fi", lopts, NULL)) >= 0) {
+ switch (c) {
+ case 'o': output_file = optarg; break;
+ case 'n': line_len = atoi(optarg);
+ if(line_len<1) {
+ fprintf(samtools_stderr,"[faidx] bad line length '%s', using default:%d\n",optarg,DEFAULT_FASTA_LINE_LEN);
+ line_len= DEFAULT_FASTA_LINE_LEN ;
+ }
+ break;
+ case 'c': ignore_error = 1; break;
+ case 'r': region_file = optarg; break;
+ case 'f': format = FAI_FASTQ; break;
+ case 'i': rev = 1; break;
+ case '?': return usage(samtools_stderr, format, EXIT_FAILURE);
+ case 'h': return usage(samtools_stdout, format, EXIT_SUCCESS);
+ case 1000:
+ if (strcmp(optarg, "no") == 0) {
+ pos_strand_name = neg_strand_name = "";
+ } else if (strcmp(optarg, "sign") == 0) {
+ pos_strand_name = "(+)";
+ neg_strand_name = "(-)";
+ } else if (strcmp(optarg, "rc") == 0) {
+ pos_strand_name = "";
+ neg_strand_name = "/rc";
+ } else if (strncmp(optarg, "custom,", 7) == 0) {
+ size_t len = strlen(optarg + 7);
+ size_t comma = strcspn(optarg + 7, ",");
+ free(strand_names);
+ strand_names = pos_strand_name = malloc(len + 2);
+ if (!strand_names) {
+ fprintf(samtools_stderr, "[faidx] Out of memory\n");
+ return EXIT_FAILURE;
+ }
+ neg_strand_name = pos_strand_name + comma + 1;
+ memcpy(pos_strand_name, optarg + 7, comma);
+ pos_strand_name[comma] = '\0';
+ if (comma < len)
+ memcpy(neg_strand_name, optarg + 7 + comma + 1,
+ len - comma);
+ neg_strand_name[len - comma] = '\0';
+ } else {
+ fprintf(samtools_stderr, "[faidx] Unknown --mark-strand option \"%s\"\n", optarg);
+ return usage(samtools_stderr, format, EXIT_FAILURE);
+ }
+ break;
+ default: break;
}
}
+
if ( argc==optind )
- return usage(samtools_stdout, EXIT_SUCCESS);
- if ( argc==2 )
+ return usage(samtools_stdout, format, EXIT_SUCCESS);
+
+ if ( optind+1 == argc && !region_file)
{
if (fai_build(argv[optind]) != 0) {
- fprintf(samtools_stderr, "Could not build fai index %s.fai\n", argv[optind]);
+ fprintf(samtools_stderr, "[faidx] Could not build fai index %s.fai\n", argv[optind]);
return EXIT_FAILURE;
}
return 0;
}
- faidx_t *fai = fai_load(argv[optind]);
+ faidx_t *fai = fai_load_format(argv[optind], format);
+
if ( !fai ) {
- fprintf(samtools_stderr, "Could not load fai index of %s\n", argv[optind]);
+ fprintf(samtools_stderr, "[faidx] Could not load fai index of %s\n", argv[optind]);
return EXIT_FAILURE;
}
- int exit_status = EXIT_SUCCESS;
+ /** output file provided by user */
+ if( output_file != NULL ) {
+ if( strcmp( output_file, argv[optind] ) == 0 ) {
+ fprintf(samtools_stderr,"[faidx] Same input/output : %s\n", output_file);
+ return EXIT_FAILURE;
+ }
- while ( ++optind<argc && exit_status == EXIT_SUCCESS)
- {
- fprintf(samtools_stdout, ">%s\n", argv[optind]);
- int seq_len;
- char *seq = fai_fetch(fai, argv[optind], &seq_len);
- if ( seq_len < 0 ) {
- fprintf(samtools_stderr, "Failed to fetch sequence in %s\n", argv[optind]);
- exit_status = EXIT_FAILURE;
- break;
+ file_out = fopen( output_file, "w" );
+
+ if( file_out == NULL) {
+ fprintf(samtools_stderr,"[faidx] Cannot open \"%s\" for writing: %s.\n", output_file, strerror(errno) );
+ return EXIT_FAILURE;
}
- size_t i, seq_sz = seq_len;
- for (i=0; i<seq_sz; i+=60)
- {
- size_t len = i + 60 < seq_sz ? 60 : seq_sz - i;
- if (fwrite(seq + i, 1, len, samtools_stdout) < len ||
- fputc('\n', samtools_stdout) == EOF) {
- print_error_errno("faidx", "failed to write output");
- exit_status = EXIT_FAILURE;
- break;
+ }
+
+ int exit_status = EXIT_SUCCESS;
+
+ if (region_file) {
+ hFILE *rf;
+
+ if ((rf = hopen(region_file, "r"))) {
+ exit_status = read_regions_from_file(fai, rf, file_out, ignore_error, line_len, rev, pos_strand_name, neg_strand_name, format);
+
+ if (hclose(rf) != 0) {
+ fprintf(samtools_stderr, "[faidx] Warning: failed to close %s\n", region_file);
}
+ } else {
+ fprintf(samtools_stderr, "[faidx] Failed to open \"%s\" for reading.\n", region_file);
+ exit_status = EXIT_FAILURE;
}
- free(seq);
}
+
+ while ( ++optind<argc && exit_status == EXIT_SUCCESS) {
+ exit_status = write_output(fai, file_out, argv[optind], ignore_error, line_len, rev, pos_strand_name, neg_strand_name, format);
+ }
+
fai_destroy(fai);
- if (fflush(samtools_stdout) == EOF) {
+ if (fflush(file_out) == EOF) {
print_error_errno("faidx", "failed to flush output");
exit_status = EXIT_FAILURE;
}
+ if( output_file != NULL) fclose(file_out);
+ free(strand_names);
+
return exit_status;
}
+
+
+int faidx_main(int argc, char *argv[]) {
+ return faidx_core(argc, argv, FAI_FASTA);
+}
+
+
+int fqidx_main(int argc, char *argv[]) {
+ return faidx_core(argc, argv, FAI_FASTQ);
+}
+
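Note on the faidx changes above: the new output path optionally reverse-complements the fetched sequence in place and then emits it in fixed-width lines. The following is a minimal standalone sketch of those two steps, not the patch's code. The names `complement_base`, `revcomp_in_place` and `wrap_lines` are hypothetical, a small `switch` stands in for the patch's 256-entry `comp_base` table (IUPAC ambiguity codes are passed through unchanged here), and the wrapped text is written to a caller-supplied buffer rather than a `FILE *` so that it is easy to check.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a complement lookup table: only A/C/G/T
 * (either case) are complemented; every other byte is returned as-is. */
static char complement_base(char b)
{
    switch (b) {
    case 'A': return 'T';  case 'T': return 'A';
    case 'C': return 'G';  case 'G': return 'C';
    case 'a': return 't';  case 't': return 'a';
    case 'c': return 'g';  case 'g': return 'c';
    default:  return b;
    }
}

/* Reverse-complement s[0..len) in place. The two indices walk towards
 * each other; for odd lengths the middle base is visited once (i == j)
 * and ends up complemented exactly once. */
static void revcomp_in_place(char *s, size_t len)
{
    size_t i = 0, j = len;

    assert(s != NULL || len == 0);
    while (i < j) {
        char c;
        j--;
        c = s[i];
        s[i] = complement_base(s[j]);
        s[j] = complement_base(c);
        i++;
    }
}

/* Copy seq into out as lines of at most `width` characters, each ending
 * in '\n', and NUL-terminate the result. Returns 0 on success, or -1 if
 * width is 0 or out is too small. */
static int wrap_lines(const char *seq, size_t seq_len, size_t width,
                      char *out, size_t out_cap)
{
    size_t i, n = 0;

    if (width == 0 || out_cap == 0)
        return -1;

    for (i = 0; i < seq_len; i += width) {
        /* The last line may be shorter than width. */
        size_t len = seq_len - i < width ? seq_len - i : width;

        if (n + len + 2 > out_cap)   /* line + '\n' + final NUL */
            return -1;
        memcpy(out + n, seq + i, len);
        n += len;
        out[n++] = '\n';
    }
    out[n] = '\0';
    return 0;
}
```

With these helpers, reverse-complementing the five-base string ACGTN gives NACGT, and wrapping a ten-base sequence at width 4 gives two full lines followed by a two-base line. Using a half-open `[i, j)` range with unsigned indices avoids the `len - 1` underflow that a signed-index version has to guard against when the sequence is empty.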
--- /dev/null
+[Files in this distribution outwith the cram/ subdirectory are distributed
+according to the terms of the following MIT/Expat license.]
+
+The MIT/Expat License
+
+Copyright (C) 2012-2018 Genome Research Ltd.
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+DEALINGS IN THE SOFTWARE.
+
+
+[Files within the cram/ subdirectory in this distribution are distributed
+according to the terms of the following Modified 3-Clause BSD license.]
+
+The Modified-BSD License
+
+Copyright (C) 2012-2018 Genome Research Ltd.
+
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions are met:
+
+1. Redistributions of source code must retain the above copyright notice,
+ this list of conditions and the following disclaimer.
+
+2. Redistributions in binary form must reproduce the above copyright notice,
+ this list of conditions and the following disclaimer in the documentation
+ and/or other materials provided with the distribution.
+
+3. Neither the names Genome Research Ltd and Wellcome Trust Sanger Institute
+ nor the names of its contributors may be used to endorse or promote products
+ derived from this software without specific prior written permission.
+
+THIS SOFTWARE IS PROVIDED BY GENOME RESEARCH LTD AND CONTRIBUTORS "AS IS"
+AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+DISCLAIMED. IN NO EVENT SHALL GENOME RESEARCH LTD OR ITS CONTRIBUTORS BE LIABLE
+FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+
+[The use of a range of years within a copyright notice in this distribution
+should be interpreted as being equivalent to a list of years including the
+first and last year specified and all consecutive years between them.
+
+For example, a copyright notice that reads "Copyright (C) 2005, 2007-2009,
+2011-2012" should be interpreted as being identical to a notice that reads
+"Copyright (C) 2005, 2007, 2008, 2009, 2011, 2012" and a copyright notice
+that reads "Copyright (C) 2005-2012" should be interpreted as being identical
+to a notice that reads "Copyright (C) 2005, 2006, 2007, 2008, 2009, 2010,
+2011, 2012".]
--- /dev/null
+HTSlib is an implementation of a unified C library for accessing common file
+formats, such as SAM, CRAM, VCF, and BCF, used for high-throughput sequencing
+data. It is the core library used by samtools and bcftools.
+
+See INSTALL for building and installation instructions.
if (dret != '\n') ks_getuntil(ks, '\n', &s, &dret);
ks_getuntil(ks, '\n', &s, &dret); // skip the empty line
if (write_cns) {
- if (t[4].l) fputs(t[4].s, samtools_stdout) & fputc('\n', samtools_stdout);
+ if (t[4].l) samtools_puts(t[4].s);
t[4].l = 0;
}
} else if (strcmp(s.s, "AF") == 0) { // padded read position
int reversed, neg, pos;
if (t[0].l == 0) fatal("come to 'AF' before reading 'CO'");
if (write_cns) {
- if (t[4].l) fputs(t[4].s, samtools_stdout) & fputc('\n', samtools_stdout);
+ if (t[4].l) samtools_puts(t[4].s);
t[4].l = 0;
}
ks_getuntil(ks, 0, &s, &dret); // read name
kputs("\t*\t0\t0\t", &t[4]); // empty MRNM, MPOS and TLEN
kputsn(t[3].s, t[3].l, &t[4]); // unpadded SEQ
kputs("\t*", &t[4]); // QUAL
- fputs(t[4].s, samtools_stdout) & fputc('\n', samtools_stdout); // print to samtools_stdout
+ samtools_puts(t[4].s); // print to samtools_stdout
++af_i;
} else if (dret != '\n') ks_getuntil(ks, '\n', &s, &dret);
}
memset(&g, 0, sizeof(phaseg_t));
g.flag = FLAG_FIX_CHIMERA;
g.min_varLOD = 37; g.k = 13; g.min_baseQ = 13; g.max_depth = 256;
- while ((c = getopt_long(argc, argv, "Q:eFq:k:b:l:D:A:", lopts, NULL)) >= 0) {
+ while ((c = getopt_long(argc, argv, "Q:eFq:k:b:l:D:A", lopts, NULL)) >= 0) {
switch (c) {
case 'D': g.max_depth = atoi(optarg); break;
case 'q': g.min_varLOD = atoi(optarg); break;
memset(&g, 0, sizeof(phaseg_t));
g.flag = FLAG_FIX_CHIMERA;
g.min_varLOD = 37; g.k = 13; g.min_baseQ = 13; g.max_depth = 256;
- while ((c = getopt_long(argc, argv, "Q:eFq:k:b:l:D:A:", lopts, NULL)) >= 0) {
+ while ((c = getopt_long(argc, argv, "Q:eFq:k:b:l:D:A", lopts, NULL)) >= 0) {
switch (c) {
case 'D': g.max_depth = atoi(optarg); break;
case 'q': g.min_varLOD = atoi(optarg); break;
#include <stdarg.h>
#include <string.h>
#include <errno.h>
-#include <stdlib.h>
#include "samtools.h"
-#include "version.h"
static void vprint_error_core(const char *subcommand, const char *format, va_list args, const char *extra)
{
vprint_error_core(subcommand, format, args, err? strerror(err) : NULL);
va_end(args);
}
-
-const char *samtools_version()
-{
- return SAMTOOLS_VERSION;
-}
-
-const char *samtools_version_short()
-{
- char *sv, *hyph, *v;
- int len;
-
- v = SAMTOOLS_VERSION;
- hyph = strchr(v, '-');
- if (!hyph)
- return strdup(v);
-
- len = hyph - v;
- sv = (char *)malloc(len+1);
- if (!sv)
- return NULL;
-
- strncpy(sv, v, len);
- sv[len] = '\0';
-
- return (const char*)sv;
-}
#include <stdarg.h>
#include <string.h>
#include <errno.h>
-#include <stdlib.h>
#include "samtools.h"
-#include "version.h"
static void vprint_error_core(const char *subcommand, const char *format, va_list args, const char *extra)
{
vprint_error_core(subcommand, format, args, err? strerror(err) : NULL);
va_end(args);
}
-
-const char *samtools_version()
-{
- return SAMTOOLS_VERSION;
-}
-
-const char *samtools_version_short()
-{
- char *sv, *hyph, *v;
- int len;
-
- v = SAMTOOLS_VERSION;
- hyph = strchr(v, '-');
- if (!hyph)
- return strdup(v);
-
- len = hyph - v;
- sv = (char *)malloc(len+1);
- if (!sv)
- return NULL;
-
- strncpy(sv, v, len);
- sv[len] = '\0';
-
- return (const char*)sv;
-}
int main_samview(int argc, char *argv[])
{
int c, is_header = 0, is_header_only = 0, ret = 0, compress_level = -1, is_count = 0;
- int is_long_help = 0;
int64_t count = 0;
samFile *in = 0, *out = 0, *un_out=0;
FILE *fp_out = NULL;
/* parse command-line options */
strcpy(out_mode, "w");
strcpy(out_un_mode, "w");
+ if (argc == 1 && isatty(STDIN_FILENO))
+ return usage(stdout, EXIT_SUCCESS, 0);
+
+ // Suppress complaints about '?' being an unrecognised option. Without
+ // this we have to put '?' in the options list, which makes it hard to
+ // tell a bad long option from the use of '-?' (both return '?' and
+ // set optopt to '\0').
+ opterr = 0;
+
while ((c = getopt_long(argc, argv,
- "SbBcCt:h1Ho:O:q:f:F:G:ul:r:?T:R:L:s:@:m:x:U:M",
+ "SbBcCt:h1Ho:O:q:f:F:G:ul:r:T:R:L:s:@:m:x:U:M",
lopts, NULL)) >= 0) {
switch (c) {
case 's':
srand(settings.subsam_seed);
settings.subsam_seed = rand();
}
- settings.subsam_frac = strtod(q, &q);
+
+ if (q && *q == '.') {
+ settings.subsam_frac = strtod(q, &q);
+ if (*q) ret = 1;
+ } else {
+ ret = 1;
+ }
+
+ if (ret == 1) {
+ print_error("view", "Incorrect sampling argument \"%s\"", optarg);
+ goto view_end;
+ }
break;
case 'm': settings.min_qlen = atoi(optarg); break;
case 'c': is_count = 1; break;
//case 'x': out_format = "x"; break;
//case 'X': out_format = "X"; break;
*/
- case '?': is_long_help = 1; break;
+ case '?':
+ if (optopt == '?') { // '-?' appeared on command line
+ return usage(stdout, EXIT_SUCCESS, 1);
+ } else {
+ if (optopt) { // Bad short option
+ print_error("view", "invalid option -- '%c'", optopt);
+ } else { // Bad long option
+ // Do our best. There is no good solution to finding
+ // out what the bad option was.
+ // See, e.g. https://stackoverflow.com/questions/2723888/where-does-getopt-long-store-an-unrecognized-option
+ if (optind > 0 && strncmp(argv[optind - 1], "--", 2) == 0) {
+ print_error("view", "unrecognised option '%s'",
+ argv[optind - 1]);
+ }
+ }
+ return usage(stderr, EXIT_FAILURE, 0);
+ }
case 'B': settings.remove_B = 1; break;
case 'x':
{
if (strlen(optarg) != 2) {
fprintf(stderr, "main_samview: Error parsing -x auxiliary tags should be exactly two characters long.\n");
- return usage(stderr, EXIT_FAILURE, is_long_help);
+ return usage(stderr, EXIT_FAILURE, 0);
}
settings.remove_aux = (char**)realloc(settings.remove_aux, sizeof(char*) * (++settings.remove_aux_len));
settings.remove_aux[settings.remove_aux_len-1] = optarg;
case 'M': settings.multi_region = 1; break;
default:
if (parse_sam_global_opt(c, optarg, lopts, &ga) != 0)
- return usage(stderr, EXIT_FAILURE, is_long_help);
+ return usage(stderr, EXIT_FAILURE, 0);
break;
}
}
strcat(out_mode, tmp);
strcat(out_un_mode, tmp);
}
- if (argc == optind && isatty(STDIN_FILENO)) return usage(stdout, EXIT_SUCCESS, is_long_help); // potential memory leak...
+ if (argc == optind && isatty(STDIN_FILENO)) {
+ print_error("view", "No input provided or missing option argument.");
+ return usage(stderr, EXIT_FAILURE, 0); // potential memory leak...
+ }
fn_in = (optind < argc)? argv[optind] : "-";
// generate the fn_list if necessary
settings.bed = bed_hash_regions(settings.bed, argv, optind+1, argc, &filter_op); //insert(1) or filter out(0) the regions from the command line in the same hash table as the bed file
if (!filter_op)
filter_state = FILTERED;
+ } else {
+ bed_unify(settings.bed);
}
bam1_t *b = bam_init1();
if (settings.bed == NULL) { // index is unavailable or no regions have been specified
- while ((result = sam_read1(in, header, b)) >= 0) { // read one alignment from `in'
- if (!process_aln(header, b, &settings)) {
- if (!is_count) { if (check_sam_write1(out, header, b, fn_out, &ret) < 0) break; }
- count++;
- } else {
- if (un_out) { if (check_sam_write1(un_out, header, b, fn_un_out, &ret) < 0) break; }
- }
- }
- if (result < -1) {
- fprintf(stderr, "[main_samview] truncated file.\n");
- ret = 1;
- }
+ fprintf(stderr, "[main_samview] no regions or BED file have been provided. Aborting.\n");
} else {
hts_idx_t *idx = sam_index_load(in, fn_in); // load index
if (idx != NULL) {
"Usage: samtools %s [options...] <in.bam>\n", command);
fprintf(to,
"Options:\n"
-" -0 FILE write paired reads flagged both or neither READ1 and READ2 to FILE\n"
-" -1 FILE write paired reads flagged READ1 to FILE\n"
-" -2 FILE write paired reads flagged READ2 to FILE\n"
+" -0 FILE write reads designated READ_OTHER to FILE\n"
+" -1 FILE write reads designated READ1 to FILE\n"
+" -2 FILE write reads designated READ2 to FILE\n"
+" note: if a singleton file is specified with -s, only\n"
+" paired reads will be written to the -1 and -2 files.\n"
" -f INT only include reads with all of the FLAGs in INT present [0]\n" // F&x == x
" -F INT only include reads with none of the FLAGS in INT present [0]\n" // F&x == 0
" -G INT only EXCLUDE reads with all of the FLAGs in INT present [0]\n" // !(F&x == x)
if (fq) fprintf(to,
" -O output quality in the OQ tag if present\n");
fprintf(to,
-" -s FILE write singleton reads to FILE [assume single-end]\n"
+" -s FILE write singleton reads designated READ1 or READ2 to FILE\n"
" -t copy RG, BC and QT tags to the %s header line\n",
fq ? "FASTQ" : "FASTA");
fprintf(to,
" --index-format STR How to parse barcode and quality tags\n\n");
sam_global_opt_help(to, "-.--.@");
fprintf(to,
-" \n"
-" The index-format string describes how to parse the barcode and quality tags, for example:\n"
+"\n"
+"Reads are designated READ1 if FLAG READ1 is set and READ2 is not set.\n"
+"Reads are designated READ2 if FLAG READ1 is not set and READ2 is set.\n"
+"Reads are designated READ_OTHER if FLAGs READ1 and READ2 are either both set\n"
+"or both unset.\n"
+"Run 'samtools flags' for more information on flag codes and meanings.\n");
+ fprintf(to,
+"\n"
+"The index-format string describes how to parse the barcode and quality tags, for example:\n"
" i14i8 the first 14 characters are index 1, the next 8 characters are index 2\n"
" n8i14 ignore the first 8 characters, and use the next 14 characters for index 1\n"
-" If the tag contains a separator, then the numeric part can be replaced with '*' to mean\n"
-" 'read until the separator or end of tag', for example:\n"
+"If the tag contains a separator, then the numeric part can be replaced with '*' to mean\n"
+"'read until the separator or end of tag', for example:\n"
" n*i* ignore the left part of the tag until the separator, then use the second part\n"
" of the tag as index 1\n");
+ fprintf(to,
+"\n"
+"Examples:\n"
+" To get just the paired reads in separate files, use:\n"
+" samtools %s -1 paired1.%s -2 paired2.%s -0 /dev/null -s /dev/null -n -F 0x900 in.bam\n"
+"\n To get all non-supplementary/secondary reads in a single file, redirect the output:\n"
+" samtools %s -F 0x900 in.bam > all_reads.%s\n",
+ command, fq ? "fq" : "fa", fq ? "fq" : "fa",
+ command, fq ? "fq" : "fa");
}
typedef enum { READ_UNKNOWN = 0, READ_1 = 1, READ_2 = 2 } readpart;
state->filetype = opts->filetype;
state->def_qual = opts->def_qual;
state->index_sequence = NULL;
- state->hstdout = bgzf_dopen(fileno(stdout), "wu");
+ state->hstdout = NULL;
state->compression_level = opts->compression_level;
state->taglist = kl_init(ktaglist);
}
}
+ if (opts->ga.reference) {
+ if (hts_set_fai_filename(state->fp, opts->ga.reference) != 0) {
+ print_error_errno("bam2fq", "cannot load reference \"%s\"", opts->ga.reference);
+ free(state);
+ return false;
+ }
+ }
+
int i;
for (i = 0; i < 3; ++i) {
if (opts->fnr[i]) {
return false;
}
} else {
+ if (!state->hstdout) {
+ state->hstdout = bgzf_dopen(fileno(stdout), "wu");
+ if (!state->hstdout) {
+ print_error_errno("bam2fq", "Cannot open STDOUT");
+ free(state);
+ return false;
+ }
+ }
state->fpr[i] = state->hstdout;
}
}
if (state->fpse && bgzf_close(state->fpse)) { print_error_errno("bam2fq", "Error closing singleton file \"%s\"", opts->fnse); valid = false; }
int i;
for (i = 0; i < 3; ++i) {
- if (state->fpr[i] == state->hstdout) {
- if (i==0 && bgzf_close(state->fpr[i])) { print_error_errno("bam2fq", "Error closing STDOUT"); valid = false; }
- } else {
+ if (state->fpr[i] != state->hstdout) {
if (bgzf_close(state->fpr[i])) { print_error_errno("bam2fq", "Error closing r%d file \"%s\"", i, opts->fnr[i]); valid = false; }
}
}
+ if (state->hstdout) {
+ if (bgzf_close(state->hstdout)) {
+ print_error_errno("bam2fq", "Error closing STDOUT");
+ valid = false;
+ }
+ }
for (i = 0; i < 2; i++) {
if (state->fpi[i] && bgzf_close(state->fpi[i])) {
print_error_errno("bam2fq", "Error closing i%d file \"%s\"", i+1, opts->index_file[i]);
int main_samview(int argc, char *argv[])
{
int c, is_header = 0, is_header_only = 0, ret = 0, compress_level = -1, is_count = 0;
- int is_long_help = 0;
int64_t count = 0;
samFile *in = 0, *out = 0, *un_out=0;
FILE *fp_out = NULL;
/* parse command-line options */
strcpy(out_mode, "w");
strcpy(out_un_mode, "w");
+ if (argc == 1 && isatty(STDIN_FILENO))
+ return usage(samtools_stdout, EXIT_SUCCESS, 0);
+
+ // Suppress complaints about '?' being an unrecognised option. Without
+ // this we have to put '?' in the options list, which makes it hard to
+ // tell a bad long option from the use of '-?' (both return '?' and
+ // set optopt to '\0').
+ opterr = 0;
+
while ((c = getopt_long(argc, argv,
- "SbBcCt:h1Ho:O:q:f:F:G:ul:r:?T:R:L:s:@:m:x:U:M",
+ "SbBcCt:h1Ho:O:q:f:F:G:ul:r:T:R:L:s:@:m:x:U:M",
lopts, NULL)) >= 0) {
switch (c) {
case 's':
srand(settings.subsam_seed);
settings.subsam_seed = rand();
}
- settings.subsam_frac = strtod(q, &q);
+
+ if (q && *q == '.') {
+ settings.subsam_frac = strtod(q, &q);
+ if (*q) ret = 1;
+ } else {
+ ret = 1;
+ }
+
+ if (ret == 1) {
+ print_error("view", "Incorrect sampling argument \"%s\"", optarg);
+ goto view_end;
+ }
break;
case 'm': settings.min_qlen = atoi(optarg); break;
case 'c': is_count = 1; break;
//case 'x': out_format = "x"; break;
//case 'X': out_format = "X"; break;
*/
- case '?': is_long_help = 1; break;
+ case '?':
+ if (optopt == '?') { // '-?' appeared on command line
+ return usage(samtools_stdout, EXIT_SUCCESS, 1);
+ } else {
+ if (optopt) { // Bad short option
+ print_error("view", "invalid option -- '%c'", optopt);
+ } else { // Bad long option
+ // Do our best. There is no good solution to finding
+ // out what the bad option was.
+ // See, e.g. https://stackoverflow.com/questions/2723888/where-does-getopt-long-store-an-unrecognized-option
+ if (optind > 0 && strncmp(argv[optind - 1], "--", 2) == 0) {
+ print_error("view", "unrecognised option '%s'",
+ argv[optind - 1]);
+ }
+ }
+ return usage(samtools_stderr, EXIT_FAILURE, 0);
+ }
case 'B': settings.remove_B = 1; break;
case 'x':
{
if (strlen(optarg) != 2) {
fprintf(samtools_stderr, "main_samview: Error parsing -x auxiliary tags should be exactly two characters long.\n");
- return usage(samtools_stderr, EXIT_FAILURE, is_long_help);
+ return usage(samtools_stderr, EXIT_FAILURE, 0);
}
settings.remove_aux = (char**)realloc(settings.remove_aux, sizeof(char*) * (++settings.remove_aux_len));
settings.remove_aux[settings.remove_aux_len-1] = optarg;
case 'M': settings.multi_region = 1; break;
default:
if (parse_sam_global_opt(c, optarg, lopts, &ga) != 0)
- return usage(samtools_stderr, EXIT_FAILURE, is_long_help);
+ return usage(samtools_stderr, EXIT_FAILURE, 0);
break;
}
}
strcat(out_mode, tmp);
strcat(out_un_mode, tmp);
}
- if (argc == optind && isatty(STDIN_FILENO)) return usage(samtools_stdout, EXIT_SUCCESS, is_long_help); // potential memory leak...
+ if (argc == optind && isatty(STDIN_FILENO)) {
+ print_error("view", "No input provided or missing option argument.");
+ return usage(samtools_stderr, EXIT_FAILURE, 0); // potential memory leak...
+ }
fn_in = (optind < argc)? argv[optind] : "-";
// generate the fn_list if necessary
settings.bed = bed_hash_regions(settings.bed, argv, optind+1, argc, &filter_op); //insert(1) or filter out(0) the regions from the command line in the same hash table as the bed file
if (!filter_op)
filter_state = FILTERED;
+ } else {
+ bed_unify(settings.bed);
}
bam1_t *b = bam_init1();
if (settings.bed == NULL) { // index is unavailable or no regions have been specified
- while ((result = sam_read1(in, header, b)) >= 0) { // read one alignment from `in'
- if (!process_aln(header, b, &settings)) {
- if (!is_count) { if (check_sam_write1(out, header, b, fn_out, &ret) < 0) break; }
- count++;
- } else {
- if (un_out) { if (check_sam_write1(un_out, header, b, fn_un_out, &ret) < 0) break; }
- }
- }
- if (result < -1) {
- fprintf(samtools_stderr, "[main_samview] truncated file.\n");
- ret = 1;
- }
+ fprintf(samtools_stderr, "[main_samview] no regions or BED file have been provided. Aborting.\n");
} else {
hts_idx_t *idx = sam_index_load(in, fn_in); // load index
if (idx != NULL) {
"Usage: samtools %s [options...] <in.bam>\n", command);
fprintf(to,
"Options:\n"
-" -0 FILE write paired reads flagged both or neither READ1 and READ2 to FILE\n"
-" -1 FILE write paired reads flagged READ1 to FILE\n"
-" -2 FILE write paired reads flagged READ2 to FILE\n"
+" -0 FILE write reads designated READ_OTHER to FILE\n"
+" -1 FILE write reads designated READ1 to FILE\n"
+" -2 FILE write reads designated READ2 to FILE\n"
+" note: if a singleton file is specified with -s, only\n"
+" paired reads will be written to the -1 and -2 files.\n"
" -f INT only include reads with all of the FLAGs in INT present [0]\n" // F&x == x
" -F INT only include reads with none of the FLAGS in INT present [0]\n" // F&x == 0
" -G INT only EXCLUDE reads with all of the FLAGs in INT present [0]\n" // !(F&x == x)
if (fq) fprintf(to,
" -O output quality in the OQ tag if present\n");
fprintf(to,
-" -s FILE write singleton reads to FILE [assume single-end]\n"
+" -s FILE write singleton reads designated READ1 or READ2 to FILE\n"
" -t copy RG, BC and QT tags to the %s header line\n",
fq ? "FASTQ" : "FASTA");
fprintf(to,
" --index-format STR How to parse barcode and quality tags\n\n");
sam_global_opt_help(to, "-.--.@");
fprintf(to,
-" \n"
-" The index-format string describes how to parse the barcode and quality tags, for example:\n"
+"\n"
+"Reads are designated READ1 if FLAG READ1 is set and READ2 is not set.\n"
+"Reads are designated READ2 if FLAG READ1 is not set and READ2 is set.\n"
+"Reads are designated READ_OTHER if FLAGs READ1 and READ2 are either both set\n"
+"or both unset.\n"
+"Run 'samtools flags' for more information on flag codes and meanings.\n");
+ fprintf(to,
+"\n"
+"The index-format string describes how to parse the barcode and quality tags, for example:\n"
" i14i8 the first 14 characters are index 1, the next 8 characters are index 2\n"
" n8i14 ignore the first 8 characters, and use the next 14 characters for index 1\n"
-" If the tag contains a separator, then the numeric part can be replaced with '*' to mean\n"
-" 'read until the separator or end of tag', for example:\n"
+"If the tag contains a separator, then the numeric part can be replaced with '*' to mean\n"
+"'read until the separator or end of tag', for example:\n"
" n*i* ignore the left part of the tag until the separator, then use the second part\n"
" of the tag as index 1\n");
+ fprintf(to,
+"\n"
+"Examples:\n"
+" To get just the paired reads in separate files, use:\n"
+" samtools %s -1 paired1.%s -2 paired2.%s -0 /dev/null -s /dev/null -n -F 0x900 in.bam\n"
+"\n To get all non-supplementary/secondary reads in a single file, redirect the output:\n"
+" samtools %s -F 0x900 in.bam > all_reads.%s\n",
+ command, fq ? "fq" : "fa", fq ? "fq" : "fa",
+ command, fq ? "fq" : "fa");
}
typedef enum { READ_UNKNOWN = 0, READ_1 = 1, READ_2 = 2 } readpart;
state->filetype = opts->filetype;
state->def_qual = opts->def_qual;
state->index_sequence = NULL;
- state->hsamtools_stdout = bgzf_dopen(fileno(samtools_stdout), "wu");
+ state->hsamtools_stdout = NULL;
state->compression_level = opts->compression_level;
state->taglist = kl_init(ktaglist);
}
}
+ if (opts->ga.reference) {
+ if (hts_set_fai_filename(state->fp, opts->ga.reference) != 0) {
+ print_error_errno("bam2fq", "cannot load reference \"%s\"", opts->ga.reference);
+ free(state);
+ return false;
+ }
+ }
+
int i;
for (i = 0; i < 3; ++i) {
if (opts->fnr[i]) {
return false;
}
} else {
+ if (!state->hsamtools_stdout) {
+ state->hsamtools_stdout = bgzf_dopen(fileno(samtools_stdout), "wu");
+ if (!state->hsamtools_stdout) {
+ print_error_errno("bam2fq", "Cannot open STDOUT");
+ free(state);
+ return false;
+ }
+ }
state->fpr[i] = state->hsamtools_stdout;
}
}
if (state->fpse && bgzf_close(state->fpse)) { print_error_errno("bam2fq", "Error closing singleton file \"%s\"", opts->fnse); valid = false; }
int i;
for (i = 0; i < 3; ++i) {
- if (state->fpr[i] == state->hsamtools_stdout) {
- if (i==0 && bgzf_close(state->fpr[i])) { print_error_errno("bam2fq", "Error closing STDOUT"); valid = false; }
- } else {
+ if (state->fpr[i] != state->hsamtools_stdout) {
if (bgzf_close(state->fpr[i])) { print_error_errno("bam2fq", "Error closing r%d file \"%s\"", i, opts->fnr[i]); valid = false; }
}
}
+ if (state->hsamtools_stdout) {
+ if (bgzf_close(state->hsamtools_stdout)) {
+ print_error_errno("bam2fq", "Error closing STDOUT");
+ valid = false;
+ }
+ }
for (i = 0; i < 2; i++) {
if (state->fpi[i] && bgzf_close(state->fpi[i])) {
print_error_errno("bam2fq", "Error closing i%d file \"%s\"", i+1, opts->index_file[i]);
#define SAMTOOLS_H
const char *samtools_version(void);
-const char *samtools_version_short(void);
#if defined __GNUC__ && __GNUC__ >= 2
#define CHECK_PRINTF(fmt,args) __attribute__ ((format (printf, fmt, args)))
samtools_stdout_fileno = STDOUT_FILENO;
}
+int samtools_puts(const char *s)
+{
+ if (fputs(s, samtools_stdout) == EOF) return EOF;
+ return putc('\n', samtools_stdout);
+}
+
void samtools_set_optind(int val)
{
// setting this in cython via
*/
void samtools_unset_stdout(void);
+int samtools_puts(const char *s);
+
int samtools_dispatch(int argc, char *argv[]);
void samtools_set_optind(int);
#include <htslib/kstring.h>
#include "stats_isize.h"
#include "sam_opts.h"
+#include "bedidx.h"
#define BWA_MIN_RDLEN 35
+#define DEFAULT_CHUNK_NO 8
+#define DEFAULT_PAIR_MAX 10000
// From the spec
// If 0x4 is set, no assumptions can be made about RNAME, POS, CIGAR, MAPQ, bits 0x2, 0x10, 0x100 and 0x800, and the bit 0x20 of the previous read in the template.
#define IS_PAIRED_AND_MAPPED(bam) (((bam)->core.flag&BAM_FPAIRED) && !((bam)->core.flag&BAM_FUNMAP) && !((bam)->core.flag&BAM_FMUNMAP))
// Misc
char *split_tag; // Tag on which to perform stats splitting
char *split_prefix; // Path or string prefix for filenames created when splitting
+ int remove_overlaps;
+ int cov_threshold;
}
stats_info_t;
// Arrays for the histogram data
uint64_t *quals_1st, *quals_2nd;
uint64_t *gc_1st, *gc_2nd;
- acgtno_count_t *acgtno_cycles;
- uint64_t *read_lengths;
+ acgtno_count_t *acgtno_cycles_1st;
+ acgtno_count_t *acgtno_cycles_2nd;
+ uint64_t *read_lengths, *read_lengths_1st, *read_lengths_2nd;
uint64_t *insertions, *deletions;
uint64_t *ins_cycles_1st, *ins_cycles_2nd, *del_cycles_1st, *del_cycles_2nd;
isize_t *isize;
// The extremes encountered
int max_len; // Maximum read length
+ int max_len_1st; // Maximum read length for first fragments (READ1)
+ int max_len_2nd; // Maximum read length for last fragments (READ2)
int max_qual; // Maximum quality
int is_sorted;
// Summary numbers
uint64_t total_len;
+ uint64_t total_len_1st;
+ uint64_t total_len_2nd;
uint64_t total_len_dup;
uint64_t nreads_1st;
uint64_t nreads_2nd;
uint64_t *mpc_buf; // Mismatches per cycle
// Target regions
- int nregions, reg_from,reg_to;
+ int nregions, reg_from, reg_to;
regions_t *regions;
// Auxiliary data
char* split_name;
stats_info_t* info; // Pointer to options and settings struct
+ pos_t *chunks;
+ uint32_t nchunks;
+ uint32_t pair_count; // Number of active pairs in the pairing hash table
+ uint32_t target_count; // Number of bases covered by the target file
+ uint32_t last_pair_tid;
+ uint32_t last_read_flush;
}
stats_t;
KHASH_MAP_INIT_STR(c2stats, stats_t*)
+typedef struct {
+ uint32_t first; // 1 - first read, 2 - second read
+ uint32_t n, m; // number of chunks, allocated chunks
+ pos_t *chunks; // chunk array of size m
+} pair_t;
+KHASH_MAP_INIT_STR(qn2pair, pair_t*)
+
+
static void error(const char *format, ...);
int is_in_regions(bam1_t *bam_line, stats_t *stats);
void realloc_buffers(stats_t *stats, int seq_len);
+static int regions_lt(const void *r1, const void *r2) {
+ int64_t from_diff = (int64_t)((pos_t *)r1)->from - (int64_t)((pos_t *)r2)->from;
+ int64_t to_diff = (int64_t)((pos_t *)r1)->to - (int64_t)((pos_t *)r2)->to;
+
+ return from_diff > 0 ? 1 : from_diff < 0 ? -1 : to_diff > 0 ? 1 : to_diff < 0 ? -1 : 0;
+}
// Coverage distribution methods
static inline int coverage_idx(int min, int max, int n, int step, int depth)
memset(stats->mpc_buf + stats->nbases*stats->nquals, 0, (n-stats->nbases)*stats->nquals*sizeof(uint64_t));
}
- stats->acgtno_cycles = realloc(stats->acgtno_cycles, n*sizeof(acgtno_count_t));
- if ( !stats->acgtno_cycles )
+ stats->acgtno_cycles_1st = realloc(stats->acgtno_cycles_1st, n*sizeof(acgtno_count_t));
+ if ( !stats->acgtno_cycles_1st )
error("Could not realloc buffers, the sequence too long: %d (%ld)\n", seq_len, n*sizeof(acgtno_count_t));
- memset(stats->acgtno_cycles + stats->nbases, 0, (n-stats->nbases)*sizeof(acgtno_count_t));
+ memset(stats->acgtno_cycles_1st + stats->nbases, 0, (n-stats->nbases)*sizeof(acgtno_count_t));
+
+ stats->acgtno_cycles_2nd = realloc(stats->acgtno_cycles_2nd, n*sizeof(acgtno_count_t));
+ if ( !stats->acgtno_cycles_2nd )
+ error("Could not realloc buffers, the sequence too long: %d (%ld)\n", seq_len, n*sizeof(acgtno_count_t));
+ memset(stats->acgtno_cycles_2nd + stats->nbases, 0, (n-stats->nbases)*sizeof(acgtno_count_t));
stats->read_lengths = realloc(stats->read_lengths, n*sizeof(uint64_t));
if ( !stats->read_lengths )
error("Could not realloc buffers, the sequence too long: %d (%ld)\n", seq_len,n*sizeof(uint64_t));
memset(stats->read_lengths + stats->nbases, 0, (n-stats->nbases)*sizeof(uint64_t));
+ stats->read_lengths_1st = realloc(stats->read_lengths_1st, n*sizeof(uint64_t));
+ if ( !stats->read_lengths_1st )
+ error("Could not realloc buffers, the sequence too long: %d (%ld)\n", seq_len,n*sizeof(uint64_t));
+ memset(stats->read_lengths_1st + stats->nbases, 0, (n-stats->nbases)*sizeof(uint64_t));
+
+ stats->read_lengths_2nd = realloc(stats->read_lengths_2nd, n*sizeof(uint64_t));
+ if ( !stats->read_lengths_2nd )
+ error("Could not realloc buffers, the sequence too long: %d (%ld)\n", seq_len,n*sizeof(uint64_t));
+ memset(stats->read_lengths_2nd + stats->nbases, 0, (n-stats->nbases)*sizeof(uint64_t));
+
stats->insertions = realloc(stats->insertions, n*sizeof(uint64_t));
if ( !stats->insertions )
error("Could not realloc buffers, the sequence too long: %d (%ld)\n", seq_len,n*sizeof(uint64_t));
// Count GC and ACGT per cycle. Note that cycle is approximate, clipping is ignored
uint8_t *seq = bam_get_seq(bam_line);
- int i, read_cycle, gc_count = 0, reverse = IS_REVERSE(bam_line);
+ int i, read_cycle, gc_count = 0, reverse = IS_REVERSE(bam_line), is_first = IS_READ1(bam_line);
for (i=0; i<seq_len; i++)
{
// Read cycle for current index
// =ACMGRSVTWYHKDBN
switch (bam_seqi(seq, i)) {
case 1:
- stats->acgtno_cycles[ read_cycle ].a++;
+ is_first ? stats->acgtno_cycles_1st[ read_cycle ].a++ : stats->acgtno_cycles_2nd[ read_cycle ].a++;
break;
case 2:
- stats->acgtno_cycles[ read_cycle ].c++;
+ is_first ? stats->acgtno_cycles_1st[ read_cycle ].c++ : stats->acgtno_cycles_2nd[ read_cycle ].c++;
gc_count++;
break;
case 4:
- stats->acgtno_cycles[ read_cycle ].g++;
+ is_first ? stats->acgtno_cycles_1st[ read_cycle ].g++ : stats->acgtno_cycles_2nd[ read_cycle ].g++;
gc_count++;
break;
case 8:
- stats->acgtno_cycles[ read_cycle ].t++;
+ is_first ? stats->acgtno_cycles_1st[ read_cycle ].t++ : stats->acgtno_cycles_2nd[ read_cycle ].t++;
break;
case 15:
- stats->acgtno_cycles[ read_cycle ].n++;
+ is_first ? stats->acgtno_cycles_1st[ read_cycle ].n++ : stats->acgtno_cycles_2nd[ read_cycle ].n++;
break;
default:
/*
* count "=" sequences in "other" along
* with MRSVWYHKDB ambiguity codes
*/
- stats->acgtno_cycles[ read_cycle ].other++;
+ is_first ? stats->acgtno_cycles_1st[ read_cycle ].other++ : stats->acgtno_cycles_2nd[ read_cycle ].other++;
break;
}
}
// fill GC histogram
uint64_t *quals;
uint8_t *bam_quals = bam_get_qual(bam_line);
- if ( bam_line->core.flag&BAM_FREAD2 )
+ if ( IS_READ2(bam_line) )
{
quals = stats->quals_2nd;
stats->nreads_2nd++;
+ stats->total_len_2nd += seq_len;
for (i=gc_idx_min; i<gc_idx_max; i++)
stats->gc_2nd[i]++;
}
{
quals = stats->quals_1st;
stats->nreads_1st++;
+ stats->total_len_1st += seq_len;
for (i=gc_idx_min; i<gc_idx_max; i++)
stats->gc_1st[i]++;
}
*gc_count_out = gc_count;
}
-void collect_stats(bam1_t *bam_line, stats_t *stats)
+static int cleanup_overlaps(khash_t(qn2pair) *read_pairs, int max) {
+ if ( !read_pairs )
+ return 0;
+
+ int count = 0;
+ khint_t k;
+ for (k = kh_begin(read_pairs); k < kh_end(read_pairs); k++) {
+ if ( kh_exist(read_pairs, k) ) {
+ char *key = (char *)kh_key(read_pairs, k);
+ pair_t *val = kh_val(read_pairs, k);
+ if ( val && val->chunks ) {
+ if ( val->chunks[val->n-1].to < max ) {
+ free(val->chunks);
+ free(val);
+ free(key);
+ kh_del(qn2pair, read_pairs, k);
+ count++;
+ }
+ } else {
+ free(key);
+ kh_del(qn2pair, read_pairs, k);
+ count++;
+ }
+ }
+ }
+ if ( max == INT_MAX )
+ kh_destroy(qn2pair, read_pairs);
+
+ return count;
+}
+
+static void remove_overlaps(bam1_t *bam_line, khash_t(qn2pair) *read_pairs, stats_t *stats, int pmin, int pmax) {
+ if ( !bam_line || !read_pairs || !stats )
+ return;
+
+ uint32_t first = (IS_READ1(bam_line) > 0 ? 1 : 0) + (IS_READ2(bam_line) > 0 ? 2 : 0) ;
+ if ( !(bam_line->core.flag & BAM_FPAIRED) ||
+ (bam_line->core.flag & BAM_FMUNMAP) ||
+ (abs(bam_line->core.isize) >= 2*bam_line->core.l_qseq) ||
+ (first != 1 && first != 2) ) {
+ if ( pmin >= 0 )
+ round_buffer_insert_read(&(stats->cov_rbuf), pmin, pmax-1);
+ return;
+ }
+
+ char *qname = bam_get_qname(bam_line);
+ if ( !qname ) {
+ fprintf(stderr, "Error retrieving qname for line starting at pos %d\n", bam_line->core.pos);
+ return;
+ }
+
+ khint_t k = kh_get(qn2pair, read_pairs, qname);
+ if ( k == kh_end(read_pairs) ) { //first chunk from this template
+ if ( pmin == -1 )
+ return;
+
+ int ret;
+ char *s = strdup(qname);
+ if ( !s ) {
+ fprintf(stderr, "Error allocating memory\n");
+ return;
+ }
+
+ k = kh_put(qn2pair, read_pairs, s, &ret);
+ if ( -1 == ret ) {
+ fprintf(stderr, "Error inserting read '%s' in pair hash table\n", qname);
+ free(s); // the key was not stored in the table, so release it here
+ return;
+ }
+
+ pair_t *pc = calloc(1, sizeof(pair_t));
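+ if ( !pc ) {
+ fprintf(stderr, "Error allocating memory\n");
+ kh_del(qn2pair, read_pairs, k); // do not leave an entry with an unset value
+ free(s);
+ return;
+ }
+
+ pc->m = DEFAULT_CHUNK_NO;
+ pc->chunks = calloc(pc->m, sizeof(pos_t));
+ if ( !pc->chunks ) {
+ fprintf(stderr, "Error allocating memory\n");
+ kh_del(qn2pair, read_pairs, k); // do not leave an entry with an unset value
+ free(pc);
+ free(s);
+ return;
+ }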
+
+ pc->chunks[0].from = pmin;
+ pc->chunks[0].to = pmax;
+ pc->n = 1;
+ pc->first = first;
+
+ kh_val(read_pairs, k) = pc;
+ stats->pair_count++;
+ } else { //template already present
+ pair_t *pc = kh_val(read_pairs, k);
+ if ( !pc ) {
+ fprintf(stderr, "Invalid hash table entry\n");
+ return;
+ }
+
+ if ( first == pc->first ) { //chunk from an existing line
+ if ( pmin == -1 )
+ return;
+
+ if ( pc->n == pc->m ) {
+ pos_t *tmp = realloc(pc->chunks, (pc->m<<1)*sizeof(pos_t));
+ if ( !tmp ) {
+ fprintf(stderr, "Error allocating memory\n");
+ return;
+ }
+ pc->chunks = tmp;
+ pc->m<<=1;
+ }
+
+ pc->chunks[pc->n].from = pmin;
+ pc->chunks[pc->n].to = pmax;
+ pc->n++;
+ } else { //the other line, check for overlapping
+ if ( pmin == -1 && kh_exist(read_pairs, k) ) { //job done, delete entry
+ char *key = (char *)kh_key(read_pairs, k);
+ pair_t *val = kh_val(read_pairs, k);
+ if ( val) {
+ free(val->chunks);
+ free(val);
+ }
+ free(key);
+ kh_del(qn2pair, read_pairs, k);
+ stats->pair_count--;
+ return;
+ }
+
+ int i;
+ for (i=0; i<pc->n; i++) {
+ if ( pmin >= pc->chunks[i].to )
+ continue;
+
+ if ( pmax <= pc->chunks[i].from ) //no overlap
+ break;
+
+ if ( pmin < pc->chunks[i].from ) { //overlap at the beginning
+ round_buffer_insert_read(&(stats->cov_rbuf), pmin, pc->chunks[i].from-1);
+ pmin = pc->chunks[i].from;
+ }
+
+ if ( pmax <= pc->chunks[i].to ) { //completely contained
+ stats->nbases_mapped_cigar -= (pmax - pmin);
+ return;
+ } else { //overlap at the end
+ stats->nbases_mapped_cigar -= (pc->chunks[i].to - pmin);
+ pmin = pc->chunks[i].to;
+ }
+ }
+ }
+ }
+ round_buffer_insert_read(&(stats->cov_rbuf), pmin, pmax-1);
+}
+
+void collect_stats(bam1_t *bam_line, stats_t *stats, khash_t(qn2pair) *read_pairs)
{
if ( stats->rg_hash )
{
// Update max_len observed
if ( stats->max_len<read_len )
stats->max_len = read_len;
+ if ( IS_READ1(bam_line) && stats->max_len_1st < read_len )
+ stats->max_len_1st = read_len;
+ if ( IS_READ2(bam_line) && stats->max_len_2nd < read_len )
+ stats->max_len_2nd = read_len;
+
int i;
int gc_count = 0;
if ( IS_ORIGINAL(bam_line) )
{
stats->read_lengths[read_len]++;
+ if ( IS_READ1(bam_line) ) stats->read_lengths_1st[read_len]++;
+ if ( IS_READ2(bam_line) ) stats->read_lengths_2nd[read_len]++;
collect_orig_read_stats(bam_line, stats, &gc_count);
}
if ( is_fwd*is_mfwd>0 )
stats->isize->inc_other(stats->isize->data, isize);
- else if ( is_fst*pos_fst>0 )
+ else if ( is_fst*pos_fst>=0 )
{
if ( is_fst*is_fwd>0 )
stats->isize->inc_inward(stats->isize->data, isize);
int ncig = bam_cigar_oplen(bam_get_cigar(bam_line)[i]);
if ( !ncig ) continue; // curiously, this can happen: 0D
if ( cig==BAM_CDEL ) readlen += ncig;
- else if ( cig==BAM_CMATCH )
+ else if ( cig==BAM_CMATCH || cig==BAM_CEQUAL || cig==BAM_CDIFF )
{
if ( iref < stats->reg_from ) ncig -= stats->reg_from-iref;
else if ( iref+ncig-1 > stats->reg_to ) ncig -= iref+ncig-1 - stats->reg_to;
// Count the whole read
for (i=0; i<bam_line->core.n_cigar; i++)
{
- if ( bam_cigar_op(bam_get_cigar(bam_line)[i])==BAM_CMATCH || bam_cigar_op(bam_get_cigar(bam_line)[i])==BAM_CINS )
+ int cig = bam_cigar_op(bam_get_cigar(bam_line)[i]);
+ if ( cig==BAM_CMATCH || cig==BAM_CINS || cig==BAM_CEQUAL || cig==BAM_CDIFF )
stats->nbases_mapped_cigar += bam_cigar_oplen(bam_get_cigar(bam_line)[i]);
- if ( bam_cigar_op(bam_get_cigar(bam_line)[i])==BAM_CDEL )
+ if ( cig==BAM_CDEL )
readlen += bam_cigar_oplen(bam_get_cigar(bam_line)[i]);
}
}
if ( stats->is_sorted )
{
- if ( stats->tid==-1 || stats->tid!=bam_line->core.tid )
+ if ( stats->tid==-1 || stats->tid!=bam_line->core.tid ) {
round_buffer_flush(stats, -1);
+ }
+
+ //cleanup the pair hash table to free memory
+ stats->last_read_flush++;
+ if ( stats->pair_count > DEFAULT_PAIR_MAX && stats->last_read_flush > DEFAULT_PAIR_MAX) {
+ stats->pair_count -= cleanup_overlaps(read_pairs, bam_line->core.pos);
+ stats->last_read_flush = 0;
+ }
+
+ if ( stats->last_pair_tid != bam_line->core.tid) {
+ stats->pair_count -= cleanup_overlaps(read_pairs, INT_MAX-1);
+ stats->last_pair_tid = bam_line->core.tid;
+ stats->last_read_flush = 0;
+ }
// Mismatches per cycle and GC-depth graph. For simplicity, reads overlapping GCD bins
// are not split, which results in up to seq_len-1 overlaps. The default bin size is
// Coverage distribution graph
round_buffer_flush(stats,bam_line->core.pos);
- round_buffer_insert_read(&(stats->cov_rbuf),bam_line->core.pos,bam_line->core.pos+seq_len-1);
+ if ( stats->regions ) {
+ uint32_t p = bam_line->core.pos, pnew, pmin, pmax, j;
+ pmin = pmax = i = j = 0;
+ while ( j < bam_line->core.n_cigar && i < stats->nchunks ) {
+ int op = bam_cigar_op(bam_get_cigar(bam_line)[j]);
+ int oplen = bam_cigar_oplen(bam_get_cigar(bam_line)[j]);
+ switch(op) {
+ case BAM_CMATCH:
+ case BAM_CEQUAL:
+ case BAM_CDIFF:
+ pmin = MAX(p, stats->chunks[i].from-1);
+ pmax = MIN(p+oplen, stats->chunks[i].to);
+ if ( pmax >= pmin ) {
+ if ( stats->info->remove_overlaps )
+ remove_overlaps(bam_line, read_pairs, stats, pmin, pmax);
+ else
+ round_buffer_insert_read(&(stats->cov_rbuf), pmin, pmax-1);
+ }
+ break;
+ case BAM_CDEL:
+ break;
+ }
+ pnew = p + (bam_cigar_type(op)&2 ? oplen : 0); // consumes reference
+
+ if ( pnew >= stats->chunks[i].to ) {
+ // go to the next chunk
+ i++;
+ } else {
+ // go to the next CIGAR op
+ j++;
+ p = pnew;
+ }
+ }
+ } else {
+ uint32_t p = bam_line->core.pos, j;
+ for (j = 0; j < bam_line->core.n_cigar; j++) {
+ int op = bam_cigar_op(bam_get_cigar(bam_line)[j]);
+ int oplen = bam_cigar_oplen(bam_get_cigar(bam_line)[j]);
+ switch(op) {
+ case BAM_CMATCH:
+ case BAM_CEQUAL:
+ case BAM_CDIFF:
+ if ( stats->info->remove_overlaps )
+ remove_overlaps(bam_line, read_pairs, stats, p, p+oplen);
+ else
+ round_buffer_insert_read(&(stats->cov_rbuf), p, p+oplen-1);
+ break;
+ case BAM_CDEL:
+ break;
+ }
+ p += bam_cigar_type(op)&2 ? oplen : 0; // consumes reference
+ }
+ }
+ if ( stats->info->remove_overlaps )
+ remove_overlaps(bam_line, read_pairs, stats, -1, -1); //remove the line from the hash table
}
}
void output_stats(FILE *to, stats_t *stats, int sparse)
{
// Calculate average insert size and standard deviation (from the main bulk data only)
- int isize, ibulk=0;
- uint64_t nisize=0, nisize_inward=0, nisize_outward=0, nisize_other=0;
+ int isize, ibulk=0, icov;
+ uint64_t nisize=0, nisize_inward=0, nisize_outward=0, nisize_other=0, cov_sum=0;
+ double bulk=0, avg_isize=0, sd_isize=0;
for (isize=0; isize<stats->isize->nitems(stats->isize->data); isize++)
{
// Each pair was counted twice
nisize += stats->isize->inward(stats->isize->data, isize) + stats->isize->outward(stats->isize->data, isize) + stats->isize->other(stats->isize->data, isize);
}
- double bulk=0, avg_isize=0, sd_isize=0;
for (isize=0; isize<stats->isize->nitems(stats->isize->data); isize++)
{
- bulk += stats->isize->inward(stats->isize->data, isize) + stats->isize->outward(stats->isize->data, isize) + stats->isize->other(stats->isize->data, isize);
+ uint64_t num = stats->isize->inward(stats->isize->data, isize) + stats->isize->outward(stats->isize->data, isize) + stats->isize->other(stats->isize->data, isize);
+ if (num > 0) ibulk = isize + 1;
+ bulk += num;
avg_isize += isize * (stats->isize->inward(stats->isize->data, isize) + stats->isize->outward(stats->isize->data, isize) + stats->isize->other(stats->isize->data, isize));
if ( bulk/nisize > stats->info->isize_main_bulk )
}
avg_isize /= nisize ? nisize : 1;
for (isize=1; isize<ibulk; isize++)
- sd_isize += (stats->isize->inward(stats->isize->data, isize) + stats->isize->outward(stats->isize->data, isize) +stats->isize->other(stats->isize->data, isize)) * (isize-avg_isize)*(isize-avg_isize) / nisize;
+ sd_isize += (stats->isize->inward(stats->isize->data, isize) + stats->isize->outward(stats->isize->data, isize) +stats->isize->other(stats->isize->data, isize)) * (isize-avg_isize)*(isize-avg_isize) / (nisize ? nisize : 1);
sd_isize = sqrt(sd_isize);
-
fprintf(to, "# This file was produced by samtools stats (%s+htslib-%s) and can be plotted using plot-bamstats\n", samtools_version(), hts_version());
if( stats->split_name != NULL ){
fprintf(to, "# This file contains statistics only for reads with tag: %s=%s\n", stats->info->split_tag, stats->split_name);
fprintf(to, "SN\treads QC failed:\t%ld\n", (long)stats->nreads_QCfailed);
fprintf(to, "SN\tnon-primary alignments:\t%ld\n", (long)stats->nreads_secondary);
fprintf(to, "SN\ttotal length:\t%ld\t# ignores clipping\n", (long)stats->total_len);
+ fprintf(to, "SN\ttotal first fragment length:\t%ld\t# ignores clipping\n", (long)stats->total_len_1st);
+ fprintf(to, "SN\ttotal last fragment length:\t%ld\t# ignores clipping\n", (long)stats->total_len_2nd);
fprintf(to, "SN\tbases mapped:\t%ld\t# ignores clipping\n", (long)stats->nbases_mapped); // the length of the whole read goes here, including soft-clips etc.
fprintf(to, "SN\tbases mapped (cigar):\t%ld\t# more accurate\n", (long)stats->nbases_mapped_cigar); // only matched and inserted bases are counted here
fprintf(to, "SN\tbases trimmed:\t%ld\n", (long)stats->nbases_trimmed);
fprintf(to, "SN\terror rate:\t%e\t# mismatches / bases mapped (cigar)\n", stats->nbases_mapped_cigar ? (float)stats->nmismatches/stats->nbases_mapped_cigar : 0);
float avg_read_length = (stats->nreads_1st+stats->nreads_2nd)?stats->total_len/(stats->nreads_1st+stats->nreads_2nd):0;
fprintf(to, "SN\taverage length:\t%.0f\n", avg_read_length);
+ fprintf(to, "SN\taverage first fragment length:\t%.0f\n", stats->nreads_1st? (float)stats->total_len_1st/stats->nreads_1st:0);
+ fprintf(to, "SN\taverage last fragment length:\t%.0f\n", stats->nreads_2nd? (float)stats->total_len_2nd/stats->nreads_2nd:0);
fprintf(to, "SN\tmaximum length:\t%d\n", stats->max_len);
+ fprintf(to, "SN\tmaximum first fragment length:\t%d\n", stats->max_len_1st);
+ fprintf(to, "SN\tmaximum last fragment length:\t%d\n", stats->max_len_2nd);
fprintf(to, "SN\taverage quality:\t%.1f\n", stats->total_len?stats->sum_qual/stats->total_len:0);
fprintf(to, "SN\tinsert size average:\t%.1f\n", avg_isize);
fprintf(to, "SN\tinsert size standard deviation:\t%.1f\n", sd_isize);
fprintf(to, "SN\toutward oriented pairs:\t%ld\n", (long)nisize_outward);
fprintf(to, "SN\tpairs with other orientation:\t%ld\n", (long)nisize_other);
fprintf(to, "SN\tpairs on different chromosomes:\t%ld\n", (long)stats->nreads_anomalous/2);
+ fprintf(to, "SN\tpercentage of properly paired reads (%%):\t%.1f\n", (stats->nreads_1st+stats->nreads_2nd)? (float)(100*stats->nreads_properly_paired)/(stats->nreads_1st+stats->nreads_2nd):0);
+ if ( stats->target_count ) {
+ fprintf(to, "SN\tbases inside the target:\t%u\n", stats->target_count);
+ for (icov=stats->info->cov_threshold+1; icov<stats->ncov; icov++)
+ cov_sum += stats->cov[icov];
+ fprintf(to, "SN\tpercentage of target genome with coverage > %d (%%):\t%.2f\n", stats->info->cov_threshold, (float)(100*cov_sum)/stats->target_count);
+ }
int ibase,iqual;
if ( stats->max_len<stats->nbases ) stats->max_len++;
if ( stats->max_qual+1<stats->nquals ) stats->max_qual++;
- fprintf(to, "# First Fragment Qualitites. Use `grep ^FFQ | cut -f 2-` to extract this part.\n");
+ fprintf(to, "# First Fragment Qualities. Use `grep ^FFQ | cut -f 2-` to extract this part.\n");
fprintf(to, "# Columns correspond to qualities and rows to cycles. First column is the cycle number.\n");
- for (ibase=0; ibase<stats->max_len; ibase++)
+ for (ibase=0; ibase<stats->max_len_1st; ibase++)
{
fprintf(to, "FFQ\t%d",ibase+1);
for (iqual=0; iqual<=stats->max_qual; iqual++)
}
fprintf(to, "\n");
}
- fprintf(to, "# Last Fragment Qualitites. Use `grep ^LFQ | cut -f 2-` to extract this part.\n");
+ fprintf(to, "# Last Fragment Qualities. Use `grep ^LFQ | cut -f 2-` to extract this part.\n");
fprintf(to, "# Columns correspond to qualities and rows to cycles. First column is the cycle number.\n");
- for (ibase=0; ibase<stats->max_len; ibase++)
+ for (ibase=0; ibase<stats->max_len_2nd; ibase++)
{
fprintf(to, "LFQ\t%d",ibase+1);
for (iqual=0; iqual<=stats->max_qual; iqual++)
fprintf(to, "# ACGT content per cycle. Use `grep ^GCC | cut -f 2-` to extract this part. The columns are: cycle; A,C,G,T base counts as a percentage of all A/C/G/T bases [%%]; and N and O counts as a percentage of all A/C/G/T bases [%%]\n");
for (ibase=0; ibase<stats->max_len; ibase++)
{
- acgtno_count_t *acgtno_count = &(stats->acgtno_cycles[ibase]);
- uint64_t acgt_sum = acgtno_count->a + acgtno_count->c + acgtno_count->g + acgtno_count->t;
+ acgtno_count_t *acgtno_count_1st = &(stats->acgtno_cycles_1st[ibase]);
+ acgtno_count_t *acgtno_count_2nd = &(stats->acgtno_cycles_2nd[ibase]);
+ uint64_t acgt_sum = acgtno_count_1st->a + acgtno_count_1st->c + acgtno_count_1st->g + acgtno_count_1st->t +
+ acgtno_count_2nd->a + acgtno_count_2nd->c + acgtno_count_2nd->g + acgtno_count_2nd->t;
if ( ! acgt_sum ) continue;
- fprintf(to, "GCC\t%d\t%.2f\t%.2f\t%.2f\t%.2f\t%.2f\t%.2f\n", ibase+1, 100.*acgtno_count->a/acgt_sum, 100.*acgtno_count->c/acgt_sum, 100.*acgtno_count->g/acgt_sum, 100.*acgtno_count->t/acgt_sum, 100.*acgtno_count->n/acgt_sum, 100.*acgtno_count->other/acgt_sum);
+ fprintf(to, "GCC\t%d\t%.2f\t%.2f\t%.2f\t%.2f\t%.2f\t%.2f\n", ibase+1,
+ 100.*(acgtno_count_1st->a + acgtno_count_2nd->a)/acgt_sum,
+ 100.*(acgtno_count_1st->c + acgtno_count_2nd->c)/acgt_sum,
+ 100.*(acgtno_count_1st->g + acgtno_count_2nd->g)/acgt_sum,
+ 100.*(acgtno_count_1st->t + acgtno_count_2nd->t)/acgt_sum,
+ 100.*(acgtno_count_1st->n + acgtno_count_2nd->n)/acgt_sum,
+ 100.*(acgtno_count_1st->other + acgtno_count_2nd->other)/acgt_sum);
+
+ }
+ fprintf(to, "# ACGT content per cycle for first fragments. Use `grep ^FBC | cut -f 2-` to extract this part. The columns are: cycle; A,C,G,T base counts as a percentage of all A/C/G/T bases [%%]; and N and O counts as a percentage of all A/C/G/T bases [%%]\n");
+ for (ibase=0; ibase<stats->max_len; ibase++)
+ {
+ acgtno_count_t *acgtno_count_1st = &(stats->acgtno_cycles_1st[ibase]);
+ uint64_t acgt_sum_1st = acgtno_count_1st->a + acgtno_count_1st->c + acgtno_count_1st->g + acgtno_count_1st->t;
+
+ if ( acgt_sum_1st )
+ fprintf(to, "FBC\t%d\t%.2f\t%.2f\t%.2f\t%.2f\t%.2f\t%.2f\n", ibase+1,
+ 100.*acgtno_count_1st->a/acgt_sum_1st,
+ 100.*acgtno_count_1st->c/acgt_sum_1st,
+ 100.*acgtno_count_1st->g/acgt_sum_1st,
+ 100.*acgtno_count_1st->t/acgt_sum_1st,
+ 100.*acgtno_count_1st->n/acgt_sum_1st,
+ 100.*acgtno_count_1st->other/acgt_sum_1st);
+
+ }
+ fprintf(to, "# ACGT content per cycle for last fragments. Use `grep ^LBC | cut -f 2-` to extract this part. The columns are: cycle; A,C,G,T base counts as a percentage of all A/C/G/T bases [%%]; and N and O counts as a percentage of all A/C/G/T bases [%%]\n");
+ for (ibase=0; ibase<stats->max_len; ibase++)
+ {
+ acgtno_count_t *acgtno_count_2nd = &(stats->acgtno_cycles_2nd[ibase]);
+ uint64_t acgt_sum_2nd = acgtno_count_2nd->a + acgtno_count_2nd->c + acgtno_count_2nd->g + acgtno_count_2nd->t;
+
+ if ( acgt_sum_2nd )
+ fprintf(to, "LBC\t%d\t%.2f\t%.2f\t%.2f\t%.2f\t%.2f\t%.2f\n", ibase+1,
+ 100.*acgtno_count_2nd->a/acgt_sum_2nd,
+ 100.*acgtno_count_2nd->c/acgt_sum_2nd,
+ 100.*acgtno_count_2nd->g/acgt_sum_2nd,
+ 100.*acgtno_count_2nd->t/acgt_sum_2nd,
+ 100.*acgtno_count_2nd->n/acgt_sum_2nd,
+ 100.*acgtno_count_2nd->other/acgt_sum_2nd);
+
}
fprintf(to, "# Insert sizes. Use `grep ^IS | cut -f 2-` to extract this part. The columns are: insert size, pairs total, inward oriented pairs, outward oriented pairs, other pairs\n");
for (isize=0; isize<ibulk; isize++) {
int ilen;
for (ilen=0; ilen<stats->max_len; ilen++)
{
- if ( stats->read_lengths[ilen]>0 )
- fprintf(to, "RL\t%d\t%ld\n", ilen, (long)stats->read_lengths[ilen]);
+ if ( stats->read_lengths[ilen+1]>0 )
+ fprintf(to, "RL\t%d\t%ld\n", ilen+1, (long)stats->read_lengths[ilen+1]);
+ }
+
+ fprintf(to, "# Read lengths - first fragments. Use `grep ^FRL | cut -f 2-` to extract this part. The columns are: read length, count\n");
+ for (ilen=0; ilen<stats->max_len_1st; ilen++)
+ {
+ if ( stats->read_lengths_1st[ilen+1]>0 )
+ fprintf(to, "FRL\t%d\t%ld\n", ilen+1, (long)stats->read_lengths_1st[ilen+1]);
+ }
+
+ fprintf(to, "# Read lengths - last fragments. Use `grep ^LRL | cut -f 2-` to extract this part. The columns are: read length, count\n");
+ for (ilen=0; ilen<stats->max_len_2nd; ilen++)
+ {
+ if ( stats->read_lengths_2nd[ilen+1]>0 )
+ fprintf(to, "LRL\t%d\t%ld\n", ilen+1, (long)stats->read_lengths_2nd[ilen+1]);
}
fprintf(to, "# Indel distribution. Use `grep ^ID | cut -f 2-` to extract this part. The columns are: length, number of insertions, number of deletions\n");
+
for (ilen=0; ilen<stats->nindels; ilen++)
{
if ( stats->insertions[ilen]>0 || stats->deletions[ilen]>0 )
fprintf(to, "# Coverage distribution. Use `grep ^COV | cut -f 2-` to extract this part.\n");
if ( stats->cov[0] )
fprintf(to, "COV\t[<%d]\t%d\t%ld\n",stats->info->cov_min,stats->info->cov_min-1, (long)stats->cov[0]);
- int icov;
for (icov=1; icov<stats->ncov-1; icov++)
if ( stats->cov[icov] )
fprintf(to, "COV\t[%d-%d]\t%d\t%ld\n",stats->info->cov_min + (icov-1)*stats->info->cov_step, stats->info->cov_min + icov*stats->info->cov_step-1,stats->info->cov_min + icov*stats->info->cov_step-1, (long)stats->cov[icov]);
if ( !fp ) error("%s: %s\n",file,strerror(errno));
kstring_t line = { 0, 0, NULL };
- int warned = 0;
+ int warned = 0, r, p, new_p;
int prev_tid=-1, prev_pos=-1;
while (line.l = 0, kgetline(&line, (kgets_func *)fgets, fp) >= 0)
{
if ( prev_pos>stats->regions[tid].pos[npos].from )
error("The positions are not in chromosomal order (%s:%d comes after %d)\n", line.s,stats->regions[tid].pos[npos].from,prev_pos);
stats->regions[tid].npos++;
+ if ( stats->regions[tid].npos > stats->nchunks )
+ stats->nchunks = stats->regions[tid].npos;
}
free(line.s);
if ( !stats->regions ) error("Unable to map the -t sequences to the BAM sequences.\n");
fclose(fp);
+
+ // sort region intervals, then merge duplicate and overlapping ones
+ for (r = 0; r < stats->nregions; r++) {
+ regions_t *reg = &stats->regions[r];
+ if ( reg->npos > 1 ) {
+ qsort(reg->pos, reg->npos, sizeof(pos_t), regions_lt);
+ for (new_p = 0, p = 1; p < reg->npos; p++) {
+ if ( reg->pos[new_p].to < reg->pos[p].from )
+ reg->pos[++new_p] = reg->pos[p];
+ else if ( reg->pos[new_p].to < reg->pos[p].to )
+ reg->pos[new_p].to = reg->pos[p].to;
+ }
+ reg->npos = ++new_p;
+ }
+ for (p = 0; p < reg->npos; p++)
+ stats->target_count += (reg->pos[p].to - reg->pos[p].from + 1);
+ }
+
+ stats->chunks = calloc(stats->nchunks, sizeof(pos_t));
}
void destroy_regions(stats_t *stats)
free(stats->regions[i].pos);
}
if ( stats->regions ) free(stats->regions);
+ if ( stats->chunks ) free(stats->chunks);
}
void reset_regions(stats_t *stats)
int i = reg->cpos;
while ( i<reg->npos && reg->pos[i].to<=bam_line->core.pos ) i++;
if ( i>=reg->npos ) { reg->cpos = reg->npos; return 0; }
- if ( bam_line->core.pos + bam_line->core.l_qseq + 1 < reg->pos[i].from ) return 0;
+ int64_t endpos = bam_endpos(bam_line);
+ if ( endpos < reg->pos[i].from ) return 0;
+
+ //found a read overlapping a region
reg->cpos = i;
stats->reg_from = reg->pos[i].from;
stats->reg_to = reg->pos[i].to;
+ //now find all the overlapping chunks
+ stats->nchunks = 0;
+ while (i < reg->npos) {
+ if (bam_line->core.pos < reg->pos[i].to && endpos >= reg->pos[i].from) {
+ stats->chunks[stats->nchunks].from = MAX(bam_line->core.pos+1, reg->pos[i].from);
+ stats->chunks[stats->nchunks].to = MIN(endpos, reg->pos[i].to);
+ stats->nchunks++;
+ }
+ i++;
+ }
+
return 1;
}
+int replicate_regions(stats_t *stats, hts_itr_multi_t *iter) {
+ if ( !stats || !iter)
+ return 1;
+
+ int i, j, tid;
+ stats->nregions = iter->n_reg;
+ stats->regions = calloc(stats->nregions, sizeof(regions_t));
+ stats->chunks = calloc(stats->nchunks, sizeof(pos_t));
+ if ( !stats->regions || !stats->chunks )
+ return 1;
+
+ for (i = 0; i < iter->n_reg; i++) {
+ tid = iter->reg_list[i].tid;
+ if ( tid < 0 )
+ continue;
+
+ if ( tid >= stats->nregions ) {
+ regions_t *tmp = realloc(stats->regions, (tid+10) * sizeof(regions_t));
+ if ( !tmp )
+ return 1;
+ stats->regions = tmp;
+ memset(stats->regions + stats->nregions, 0,
+ (tid+10-stats->nregions) * sizeof(regions_t));
+ stats->nregions = tid+10;
+ }
+
+ stats->regions[tid].mpos = stats->regions[tid].npos = iter->reg_list[i].count;
+ stats->regions[tid].pos = calloc(stats->regions[tid].mpos, sizeof(pos_t));
+ if ( !stats->regions[tid].pos )
+ return 1;
+
+ for (j = 0; j < stats->regions[tid].npos; j++) {
+ stats->regions[tid].pos[j].from = iter->reg_list[i].intervals[j].beg+1;
+ stats->regions[tid].pos[j].to = iter->reg_list[i].intervals[j].end;
+
+ stats->target_count += (stats->regions[tid].pos[j].to - stats->regions[tid].pos[j].from + 1);
+ }
+ }
+
+ return 0;
+}
+
void init_group_id(stats_t *stats, const char *id)
{
#if 0
printf(" -S, --split <tag> Also write statistics to separate files split by tagged field.\n");
printf(" -t, --target-regions <file> Do stats in these regions only. Tab-delimited file chr,from,to, 1-based, inclusive.\n");
printf(" -x, --sparse Suppress outputting IS rows where there are no insertions.\n");
+ printf(" -p, --remove-overlaps Remove overlaps of paired-end reads from coverage and base count computations.\n");
+ printf(" -g, --cov-threshold <int> Only bases with coverage above this value will be included in the target percentage computation.\n");
sam_global_opt_help(stdout, "-.--.@");
printf("\n");
}
free(stats->gcd);
free(stats->rseq_buf);
free(stats->mpc_buf);
- free(stats->acgtno_cycles);
+ free(stats->acgtno_cycles_1st);
+ free(stats->acgtno_cycles_2nd);
free(stats->read_lengths);
+ free(stats->read_lengths_1st);
+ free(stats->read_lengths_2nd);
free(stats->insertions);
free(stats->deletions);
free(stats->ins_cycles_1st);
stats_t *curr_stats = NULL;
for(i = kh_begin(split_hash); i != kh_end(split_hash); ++i){
if(!kh_exist(split_hash, i)) continue;
- curr_stats = kh_value(split_hash, i);
- cleanup_stats(curr_stats);
+ curr_stats = kh_value(split_hash, i);
+ cleanup_stats(curr_stats);
}
kh_destroy(c2stats, split_hash);
}
info->filter_readlen = -1;
info->argc = argc;
info->argv = argv;
+ info->remove_overlaps = 0;
+ info->cov_threshold = 0;
return info;
}
stats->ngc = 200;
stats->nquals = 256;
stats->nbases = 300;
- stats->max_len = 30;
- stats->max_qual = 40;
stats->rseq_pos = -1;
stats->tid = stats->gcd_pos = -1;
stats->igcd = 0;
stats->is_sorted = 1;
stats->nindels = stats->nbases;
stats->split_name = NULL;
+ stats->nchunks = 0;
+ stats->pair_count = 0;
+ stats->last_pair_tid = -2;
+ stats->last_read_flush = 0;
+ stats->target_count = 0;
return stats;
}
stats->isize = init_isize_t(info->nisize ?info->nisize+1 :0);
stats->gcd = calloc(stats->ngcd,sizeof(gc_depth_t));
stats->mpc_buf = info->fai ? calloc(stats->nquals*stats->nbases,sizeof(uint64_t)) : NULL;
- stats->acgtno_cycles = calloc(stats->nbases,sizeof(acgtno_count_t));
+ stats->acgtno_cycles_1st = calloc(stats->nbases,sizeof(acgtno_count_t));
+ stats->acgtno_cycles_2nd = calloc(stats->nbases,sizeof(acgtno_count_t));
stats->read_lengths = calloc(stats->nbases,sizeof(uint64_t));
+ stats->read_lengths_1st = calloc(stats->nbases,sizeof(uint64_t));
+ stats->read_lengths_2nd = calloc(stats->nbases,sizeof(uint64_t));
stats->insertions = calloc(stats->nbases,sizeof(uint64_t));
stats->deletions = calloc(stats->nbases,sizeof(uint64_t));
stats->ins_cycles_1st = calloc(stats->nbases+1,sizeof(uint64_t));
{"sparse", no_argument, NULL, 'x'},
{"split", required_argument, NULL, 'S'},
{"split-prefix", required_argument, NULL, 'P'},
+ {"remove-overlaps", no_argument, NULL, 'p'},
+ {"cov-threshold", required_argument, NULL, 'g'},
{NULL, 0, NULL, 0}
};
int opt;
- while ( (opt=getopt_long(argc,argv,"?hdsxr:c:l:i:t:m:q:f:F:I:1:S:P:@:",loptions,NULL))>0 )
+ while ( (opt=getopt_long(argc,argv,"?hdsxpr:c:l:i:t:m:q:f:F:g:I:1:S:P:@:",loptions,NULL))>0 )
{
switch (opt)
{
case 'f': info->flag_require = bam_str2flag(optarg); break;
- case 'F': info->flag_filter = bam_str2flag(optarg); break;
+ case 'F': info->flag_filter |= bam_str2flag(optarg); break;
case 'd': info->flag_filter |= BAM_FDUP; break;
case 's': break;
case 'r': info->fai = fai_load(optarg);
case 'x': sparse = 1; break;
case 'S': info->split_tag = optarg; break;
case 'P': info->split_prefix = optarg; break;
+ case 'p': info->remove_overlaps = 1; break;
+ case 'g': info->cov_threshold = atoi(optarg);
+ if ( info->cov_threshold < 0 || info->cov_threshold == INT_MAX )
+ error("Unsupported value for coverage threshold %d\n", info->cov_threshold);
+ break;
case '?':
case 'h': error(NULL);
default:
bam_fname = "-";
}
- if (init_stat_info_fname(info, bam_fname, &ga.in)) return 1;
+ if (init_stat_info_fname(info, bam_fname, &ga.in)) {
+ free(info);
+ return 1;
+ }
if (ga.nthreads > 0)
hts_set_threads(info->sam, ga.nthreads);
// .. hash
khash_t(c2stats)* split_hash = kh_init(c2stats);
+ khash_t(qn2pair)* read_pairs = kh_init(qn2pair);
+
// Collect statistics
bam1_t *bam_line = bam_init1();
if ( optind<argc )
{
- // Collect stats in selected regions only
- hts_idx_t *bam_idx = sam_index_load(info->sam,bam_fname);
- if (bam_idx == 0)
- error("Random alignment retrieval only works for indexed BAM files.\n");
-
- int i;
- for (i=optind; i<argc; i++)
- {
- hts_itr_t* iter = bam_itr_querys(bam_idx, info->sam_header, argv[i]);
- while (sam_itr_next(info->sam, iter, bam_line) >= 0) {
- if (info->split_tag) {
- curr_stats = get_curr_split_stats(bam_line, split_hash, info, targets);
- collect_stats(bam_line, curr_stats);
+ int filter = 1;
+ // Prepare the region hash table for the multi-region iterator
+ void *region_hash = bed_hash_regions(NULL, argv, optind, argc, &filter);
+ if (region_hash) {
+
+ // Collect stats in selected regions only
+ hts_idx_t *bam_idx = sam_index_load(info->sam,bam_fname);
+ if (bam_idx) {
+
+ int regcount = 0;
+ hts_reglist_t *reglist = bed_reglist(region_hash, ALL, &regcount);
+ if (reglist) {
+
+ hts_itr_multi_t *iter = sam_itr_regions(bam_idx, info->sam_header, reglist, regcount);
+ if (iter) {
+
+ if (!targets) {
+ all_stats->nchunks = argc-optind;
+ if ( replicate_regions(all_stats, iter) )
+ fprintf(stderr, "Replication of the regions failed.\n");
+ }
+
+ if ( all_stats->nregions && all_stats->regions ) {
+ while (sam_itr_multi_next(info->sam, iter, bam_line) >= 0) {
+ if (info->split_tag) {
+ curr_stats = get_curr_split_stats(bam_line, split_hash, info, targets);
+ collect_stats(bam_line, curr_stats, read_pairs);
+ }
+ collect_stats(bam_line, all_stats, read_pairs);
+ }
+ }
+
+ hts_itr_multi_destroy(iter);
+ } else {
+ fprintf(stderr, "Creation of the region iterator failed.\n");
+ hts_reglist_free(reglist, regcount);
+ }
+ } else {
+ fprintf(stderr, "Creation of the region list failed.\n");
}
- collect_stats(bam_line, all_stats);
+
+ hts_idx_destroy(bam_idx);
+ } else {
+ fprintf(stderr, "Random alignment retrieval only works for indexed BAM files.\n");
}
- reset_regions(all_stats);
- bam_itr_destroy(iter);
+
+ bed_destroy(region_hash);
+ } else {
+ fprintf(stderr, "Creation of the region hash table failed.\n");
}
- hts_idx_destroy(bam_idx);
}
else
{
+ if ( info->cov_threshold > 0 && !targets ) {
+ fprintf(stderr, "Coverage percentage calculation requires a list of target regions\n");
+ goto cleanup;
+ }
+
// Stream through the entire BAM ignoring off-target regions if -t is given
int ret;
while ((ret = sam_read1(info->sam, info->sam_header, bam_line)) >= 0) {
if (info->split_tag) {
curr_stats = get_curr_split_stats(bam_line, split_hash, info, targets);
- collect_stats(bam_line, curr_stats);
+ collect_stats(bam_line, curr_stats, read_pairs);
}
- collect_stats(bam_line, all_stats);
+ collect_stats(bam_line, all_stats, read_pairs);
}
if (ret < -1) {
if (info->split_tag)
output_split_stats(split_hash, bam_fname, sparse);
+cleanup:
bam_destroy1(bam_line);
bam_hdr_destroy(info->sam_header);
sam_global_args_free(&ga);
cleanup_stats(all_stats);
cleanup_stats_info(info);
destroy_split_stats(split_hash);
+ cleanup_overlaps(read_pairs, INT_MAX);
return 0;
}
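The `-g/--cov-threshold` option above is parsed with `atoi`, which cannot report overflow or trailing garbage, so the `== INT_MAX` check may not catch every bad value. A minimal sketch of a stricter parse with `strtol` follows. The helper name `parse_nonneg_int` is hypothetical and not part of samtools; like the patch, it rejects negative values and `INT_MAX`.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper (not in samtools): parse a decimal integer in
 * [0, INT_MAX). Returns 0 and stores the value in *out on success,
 * or -1 if the string is empty, malformed, or out of range. */
static int parse_nonneg_int(const char *s, int *out)
{
    char *end;
    long v;

    if (!s || !*s) return -1;
    errno = 0;
    v = strtol(s, &end, 10);
    if (errno == ERANGE) return -1;            /* does not fit in a long */
    if (end == s || *end != '\0') return -1;   /* no digits, or trailing junk */
    if (v < 0 || v >= INT_MAX) return -1;      /* same limits as the patch */
    *out = (int)v;
    return 0;
}
```

A caller would report the raw `optarg` string on failure rather than the parsed value, since the parsed value is meaningless when the input is rejected.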
#include <htslib/kstring.h>
#include "stats_isize.h"
#include "sam_opts.h"
+#include "bedidx.h"
#define BWA_MIN_RDLEN 35
+#define DEFAULT_CHUNK_NO 8
+#define DEFAULT_PAIR_MAX 10000
// From the spec
// If 0x4 is set, no assumptions can be made about RNAME, POS, CIGAR, MAPQ, bits 0x2, 0x10, 0x100 and 0x800, and the bit 0x20 of the previous read in the template.
#define IS_PAIRED_AND_MAPPED(bam) (((bam)->core.flag&BAM_FPAIRED) && !((bam)->core.flag&BAM_FUNMAP) && !((bam)->core.flag&BAM_FMUNMAP))
// Misc
char *split_tag; // Tag on which to perform stats splitting
char *split_prefix; // Path or string prefix for filenames created when splitting
+ int remove_overlaps;
+ int cov_threshold;
}
stats_info_t;
// Arrays for the histogram data
uint64_t *quals_1st, *quals_2nd;
uint64_t *gc_1st, *gc_2nd;
- acgtno_count_t *acgtno_cycles;
- uint64_t *read_lengths;
+ acgtno_count_t *acgtno_cycles_1st;
+ acgtno_count_t *acgtno_cycles_2nd;
+ uint64_t *read_lengths, *read_lengths_1st, *read_lengths_2nd;
uint64_t *insertions, *deletions;
uint64_t *ins_cycles_1st, *ins_cycles_2nd, *del_cycles_1st, *del_cycles_2nd;
isize_t *isize;
// The extremes encountered
int max_len; // Maximum read length
+ int max_len_1st; // Maximum read length for first fragments (READ1)
+ int max_len_2nd; // Maximum read length for last fragments (READ2)
int max_qual; // Maximum quality
int is_sorted;
// Summary numbers
uint64_t total_len;
+ uint64_t total_len_1st;
+ uint64_t total_len_2nd;
uint64_t total_len_dup;
uint64_t nreads_1st;
uint64_t nreads_2nd;
uint64_t *mpc_buf; // Mismatches per cycle
// Target regions
- int nregions, reg_from,reg_to;
+ int nregions, reg_from, reg_to;
regions_t *regions;
// Auxiliary data
char* split_name;
stats_info_t* info; // Pointer to options and settings struct
+ pos_t *chunks;
+ uint32_t nchunks;
+ uint32_t pair_count; // Number of active pairs in the pairing hash table
+ uint32_t target_count; // Number of bases covered by the target file
+ uint32_t last_pair_tid;
+ uint32_t last_read_flush;
}
stats_t;
KHASH_MAP_INIT_STR(c2stats, stats_t*)
+typedef struct {
+ uint32_t first; // 1 - first read, 2 - second read
+ uint32_t n, m; // number of chunks, allocated chunks
+ pos_t *chunks; // chunk array of size m
+} pair_t;
+KHASH_MAP_INIT_STR(qn2pair, pair_t*)
+
+
static void error(const char *format, ...);
int is_in_regions(bam1_t *bam_line, stats_t *stats);
void realloc_buffers(stats_t *stats, int seq_len);
+static int regions_lt(const void *r1, const void *r2) {
+ int64_t from_diff = (int64_t)((pos_t *)r1)->from - (int64_t)((pos_t *)r2)->from;
+ int64_t to_diff = (int64_t)((pos_t *)r1)->to - (int64_t)((pos_t *)r2)->to;
+
+ return from_diff > 0 ? 1 : from_diff < 0 ? -1 : to_diff > 0 ? 1 : to_diff < 0 ? -1 : 0;
+}
// Coverage distribution methods
static inline int coverage_idx(int min, int max, int n, int step, int depth)
memset(stats->mpc_buf + stats->nbases*stats->nquals, 0, (n-stats->nbases)*stats->nquals*sizeof(uint64_t));
}
- stats->acgtno_cycles = realloc(stats->acgtno_cycles, n*sizeof(acgtno_count_t));
- if ( !stats->acgtno_cycles )
+ stats->acgtno_cycles_1st = realloc(stats->acgtno_cycles_1st, n*sizeof(acgtno_count_t));
+ if ( !stats->acgtno_cycles_1st )
error("Could not realloc buffers, the sequence too long: %d (%ld)\n", seq_len, n*sizeof(acgtno_count_t));
- memset(stats->acgtno_cycles + stats->nbases, 0, (n-stats->nbases)*sizeof(acgtno_count_t));
+ memset(stats->acgtno_cycles_1st + stats->nbases, 0, (n-stats->nbases)*sizeof(acgtno_count_t));
+
+ stats->acgtno_cycles_2nd = realloc(stats->acgtno_cycles_2nd, n*sizeof(acgtno_count_t));
+ if ( !stats->acgtno_cycles_2nd )
+ error("Could not realloc buffers, the sequence too long: %d (%ld)\n", seq_len, n*sizeof(acgtno_count_t));
+ memset(stats->acgtno_cycles_2nd + stats->nbases, 0, (n-stats->nbases)*sizeof(acgtno_count_t));
stats->read_lengths = realloc(stats->read_lengths, n*sizeof(uint64_t));
if ( !stats->read_lengths )
error("Could not realloc buffers, the sequence too long: %d (%ld)\n", seq_len,n*sizeof(uint64_t));
memset(stats->read_lengths + stats->nbases, 0, (n-stats->nbases)*sizeof(uint64_t));
+ stats->read_lengths_1st = realloc(stats->read_lengths_1st, n*sizeof(uint64_t));
+ if ( !stats->read_lengths_1st )
+ error("Could not realloc buffers, the sequence too long: %d (%ld)\n", seq_len,n*sizeof(uint64_t));
+ memset(stats->read_lengths_1st + stats->nbases, 0, (n-stats->nbases)*sizeof(uint64_t));
+
+ stats->read_lengths_2nd = realloc(stats->read_lengths_2nd, n*sizeof(uint64_t));
+ if ( !stats->read_lengths_2nd )
+ error("Could not realloc buffers, the sequence too long: %d (%ld)\n", seq_len,n*sizeof(uint64_t));
+ memset(stats->read_lengths_2nd + stats->nbases, 0, (n-stats->nbases)*sizeof(uint64_t));
+
stats->insertions = realloc(stats->insertions, n*sizeof(uint64_t));
if ( !stats->insertions )
error("Could not realloc buffers, the sequence too long: %d (%ld)\n", seq_len,n*sizeof(uint64_t));
// Count GC and ACGT per cycle. Note that cycle is approximate, clipping is ignored
uint8_t *seq = bam_get_seq(bam_line);
- int i, read_cycle, gc_count = 0, reverse = IS_REVERSE(bam_line);
+ int i, read_cycle, gc_count = 0, reverse = IS_REVERSE(bam_line), is_first = IS_READ1(bam_line);
for (i=0; i<seq_len; i++)
{
// Read cycle for current index
// =ACMGRSVTWYHKDBN
switch (bam_seqi(seq, i)) {
case 1:
- stats->acgtno_cycles[ read_cycle ].a++;
+ is_first ? stats->acgtno_cycles_1st[ read_cycle ].a++ : stats->acgtno_cycles_2nd[ read_cycle ].a++;
break;
case 2:
- stats->acgtno_cycles[ read_cycle ].c++;
+ is_first ? stats->acgtno_cycles_1st[ read_cycle ].c++ : stats->acgtno_cycles_2nd[ read_cycle ].c++;
gc_count++;
break;
case 4:
- stats->acgtno_cycles[ read_cycle ].g++;
+ is_first ? stats->acgtno_cycles_1st[ read_cycle ].g++ : stats->acgtno_cycles_2nd[ read_cycle ].g++;
gc_count++;
break;
case 8:
- stats->acgtno_cycles[ read_cycle ].t++;
+ is_first ? stats->acgtno_cycles_1st[ read_cycle ].t++ : stats->acgtno_cycles_2nd[ read_cycle ].t++;
break;
case 15:
- stats->acgtno_cycles[ read_cycle ].n++;
+ is_first ? stats->acgtno_cycles_1st[ read_cycle ].n++ : stats->acgtno_cycles_2nd[ read_cycle ].n++;
break;
default:
/*
* count "=" sequences in "other" along
* with MRSVWYHKDB ambiguity codes
*/
- stats->acgtno_cycles[ read_cycle ].other++;
+ is_first ? stats->acgtno_cycles_1st[ read_cycle ].other++ : stats->acgtno_cycles_2nd[ read_cycle ].other++;
break;
}
}
// fill GC histogram
uint64_t *quals;
uint8_t *bam_quals = bam_get_qual(bam_line);
- if ( bam_line->core.flag&BAM_FREAD2 )
+ if ( IS_READ2(bam_line) )
{
quals = stats->quals_2nd;
stats->nreads_2nd++;
+ stats->total_len_2nd += seq_len;
for (i=gc_idx_min; i<gc_idx_max; i++)
stats->gc_2nd[i]++;
}
{
quals = stats->quals_1st;
stats->nreads_1st++;
+ stats->total_len_1st += seq_len;
for (i=gc_idx_min; i<gc_idx_max; i++)
stats->gc_1st[i]++;
}
*gc_count_out = gc_count;
}
-void collect_stats(bam1_t *bam_line, stats_t *stats)
+static int cleanup_overlaps(khash_t(qn2pair) *read_pairs, int max) {
+ if ( !read_pairs )
+ return 0;
+
+ int count = 0;
+ khint_t k;
+ for (k = kh_begin(read_pairs); k < kh_end(read_pairs); k++) {
+ if ( kh_exist(read_pairs, k) ) {
+ char *key = (char *)kh_key(read_pairs, k);
+ pair_t *val = kh_val(read_pairs, k);
+ if ( val && val->chunks ) {
+ if ( val->chunks[val->n-1].to < max ) {
+ free(val->chunks);
+ free(val);
+ free(key);
+ kh_del(qn2pair, read_pairs, k);
+ count++;
+ }
+ } else {
+ free(key);
+ kh_del(qn2pair, read_pairs, k);
+ count++;
+ }
+ }
+ }
+ if ( max == INT_MAX )
+ kh_destroy(qn2pair, read_pairs);
+
+ return count;
+}
+
+static void remove_overlaps(bam1_t *bam_line, khash_t(qn2pair) *read_pairs, stats_t *stats, int pmin, int pmax) {
+ if ( !bam_line || !read_pairs || !stats )
+ return;
+
+ uint32_t first = (IS_READ1(bam_line) > 0 ? 1 : 0) + (IS_READ2(bam_line) > 0 ? 2 : 0) ;
+ if ( !(bam_line->core.flag & BAM_FPAIRED) ||
+ (bam_line->core.flag & BAM_FMUNMAP) ||
+ (abs(bam_line->core.isize) >= 2*bam_line->core.l_qseq) ||
+ (first != 1 && first != 2) ) {
+ if ( pmin >= 0 )
+ round_buffer_insert_read(&(stats->cov_rbuf), pmin, pmax-1);
+ return;
+ }
+
+ char *qname = bam_get_qname(bam_line);
+ if ( !qname ) {
+ fprintf(samtools_stderr, "Error retrieving qname for line starting at pos %d\n", bam_line->core.pos);
+ return;
+ }
+
+ khint_t k = kh_get(qn2pair, read_pairs, qname);
+ if ( k == kh_end(read_pairs) ) { //first chunk from this template
+ if ( pmin == -1 )
+ return;
+
+ int ret;
+ char *s = strdup(qname);
+ if ( !s ) {
+ fprintf(samtools_stderr, "Error allocating memory\n");
+ return;
+ }
+
+ k = kh_put(qn2pair, read_pairs, s, &ret);
+ if ( -1 == ret ) {
+ fprintf(samtools_stderr, "Error inserting read '%s' in pair hash table\n", qname);
+ free(s); // the key was never stored, so release the copy
+ return;
+ }
+
+ pair_t *pc = calloc(1, sizeof(pair_t));
+ if ( !pc ) {
+ fprintf(samtools_stderr, "Error allocating memory\n");
+ return;
+ }
+
+ pc->m = DEFAULT_CHUNK_NO;
+ pc->chunks = calloc(pc->m, sizeof(pos_t));
+ if ( !pc->chunks ) {
+ fprintf(samtools_stderr, "Error allocating memory\n");
+ return;
+ }
+
+ pc->chunks[0].from = pmin;
+ pc->chunks[0].to = pmax;
+ pc->n = 1;
+ pc->first = first;
+
+ kh_val(read_pairs, k) = pc;
+ stats->pair_count++;
+ } else { //template already present
+ pair_t *pc = kh_val(read_pairs, k);
+ if ( !pc ) {
+ fprintf(samtools_stderr, "Invalid hash table entry\n");
+ return;
+ }
+
+ if ( first == pc->first ) { //chunk from an existing line
+ if ( pmin == -1 )
+ return;
+
+ if ( pc->n == pc->m ) {
+ pos_t *tmp = realloc(pc->chunks, (pc->m<<1)*sizeof(pos_t));
+ if ( !tmp ) {
+ fprintf(samtools_stderr, "Error allocating memory\n");
+ return;
+ }
+ pc->chunks = tmp;
+ pc->m<<=1;
+ }
+
+ pc->chunks[pc->n].from = pmin;
+ pc->chunks[pc->n].to = pmax;
+ pc->n++;
+ } else { //the other line, check for overlapping
+ if ( pmin == -1 && kh_exist(read_pairs, k) ) { //job done, delete entry
+ char *key = (char *)kh_key(read_pairs, k);
+ pair_t *val = kh_val(read_pairs, k);
+ if ( val) {
+ free(val->chunks);
+ free(val);
+ }
+ free(key);
+ kh_del(qn2pair, read_pairs, k);
+ stats->pair_count--;
+ return;
+ }
+
+ int i;
+ for (i=0; i<pc->n; i++) {
+ if ( pmin >= pc->chunks[i].to )
+ continue;
+
+ if ( pmax <= pc->chunks[i].from ) //no overlap
+ break;
+
+ if ( pmin < pc->chunks[i].from ) { //overlap at the beginning
+ round_buffer_insert_read(&(stats->cov_rbuf), pmin, pc->chunks[i].from-1);
+ pmin = pc->chunks[i].from;
+ }
+
+ if ( pmax <= pc->chunks[i].to ) { //completely contained
+ stats->nbases_mapped_cigar -= (pmax - pmin);
+ return;
+ } else { //overlap at the end
+ stats->nbases_mapped_cigar -= (pc->chunks[i].to - pmin);
+ pmin = pc->chunks[i].to;
+ }
+ }
+ }
+ }
+ round_buffer_insert_read(&(stats->cov_rbuf), pmin, pmax-1);
+}
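The overlap test in `remove_overlaps` can be read as clipping a new aligned block against the chunks already recorded for the mate. The following is a simplified sketch of that idea, not the patch's code. It assumes half-open `[from, to)` intervals and sorted, non-overlapping chunks, and only counts the bases that are not yet covered. The names `chunk_t` and `count_new_bases` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a mate chunk: half-open interval [from, to). */
typedef struct { int64_t from, to; } chunk_t;

/* Return how many bases of [beg, end) are NOT covered by any mate chunk.
 * Assumes chunks[] is sorted by position and the chunks do not overlap. */
static int64_t count_new_bases(const chunk_t *chunks, size_t n,
                               int64_t beg, int64_t end)
{
    int64_t fresh = 0;
    size_t i;

    for (i = 0; i < n && beg < end; i++) {
        if (beg >= chunks[i].to) continue;     /* chunk ends before the block */
        if (end <= chunks[i].from) break;      /* chunk starts after the block */
        if (beg < chunks[i].from)
            fresh += chunks[i].from - beg;     /* uncovered part before the chunk */
        if (end <= chunks[i].to) return fresh; /* rest of the block is covered */
        beg = chunks[i].to;                    /* continue after this chunk */
    }
    if (beg < end) fresh += end - beg;         /* uncovered tail */
    return fresh;
}
```

For mate chunks `[100,150)` and `[200,250)`, a block `[120,220)` contributes only the 50 bases in `[150,200)`; the remainder would be double-counted coverage.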
+
+void collect_stats(bam1_t *bam_line, stats_t *stats, khash_t(qn2pair) *read_pairs)
{
if ( stats->rg_hash )
{
// Update max_len observed
if ( stats->max_len<read_len )
stats->max_len = read_len;
+ if ( IS_READ1(bam_line) && stats->max_len_1st < read_len )
+ stats->max_len_1st = read_len;
+ if ( IS_READ2(bam_line) && stats->max_len_2nd < read_len )
+ stats->max_len_2nd = read_len;
+
int i;
int gc_count = 0;
if ( IS_ORIGINAL(bam_line) )
{
stats->read_lengths[read_len]++;
+ if ( IS_READ1(bam_line) ) stats->read_lengths_1st[read_len]++;
+ if ( IS_READ2(bam_line) ) stats->read_lengths_2nd[read_len]++;
collect_orig_read_stats(bam_line, stats, &gc_count);
}
if ( is_fwd*is_mfwd>0 )
stats->isize->inc_other(stats->isize->data, isize);
- else if ( is_fst*pos_fst>0 )
+ else if ( is_fst*pos_fst>=0 )
{
if ( is_fst*is_fwd>0 )
stats->isize->inc_inward(stats->isize->data, isize);
int ncig = bam_cigar_oplen(bam_get_cigar(bam_line)[i]);
if ( !ncig ) continue; // curiously, this can happen: 0D
if ( cig==BAM_CDEL ) readlen += ncig;
- else if ( cig==BAM_CMATCH )
+ else if ( cig==BAM_CMATCH || cig==BAM_CEQUAL || cig==BAM_CDIFF )
{
if ( iref < stats->reg_from ) ncig -= stats->reg_from-iref;
else if ( iref+ncig-1 > stats->reg_to ) ncig -= iref+ncig-1 - stats->reg_to;
// Count the whole read
for (i=0; i<bam_line->core.n_cigar; i++)
{
- if ( bam_cigar_op(bam_get_cigar(bam_line)[i])==BAM_CMATCH || bam_cigar_op(bam_get_cigar(bam_line)[i])==BAM_CINS )
+ int cig = bam_cigar_op(bam_get_cigar(bam_line)[i]);
+ if ( cig==BAM_CMATCH || cig==BAM_CINS || cig==BAM_CEQUAL || cig==BAM_CDIFF )
stats->nbases_mapped_cigar += bam_cigar_oplen(bam_get_cigar(bam_line)[i]);
- if ( bam_cigar_op(bam_get_cigar(bam_line)[i])==BAM_CDEL )
+ if ( cig==BAM_CDEL )
readlen += bam_cigar_oplen(bam_get_cigar(bam_line)[i]);
}
}
if ( stats->is_sorted )
{
- if ( stats->tid==-1 || stats->tid!=bam_line->core.tid )
+ if ( stats->tid==-1 || stats->tid!=bam_line->core.tid ) {
round_buffer_flush(stats, -1);
+ }
+
+ //cleanup the pair hash table to free memory
+ stats->last_read_flush++;
+ if ( stats->pair_count > DEFAULT_PAIR_MAX && stats->last_read_flush > DEFAULT_PAIR_MAX) {
+ stats->pair_count -= cleanup_overlaps(read_pairs, bam_line->core.pos);
+ stats->last_read_flush = 0;
+ }
+
+ if ( stats->last_pair_tid != bam_line->core.tid) {
+ stats->pair_count -= cleanup_overlaps(read_pairs, INT_MAX-1);
+ stats->last_pair_tid = bam_line->core.tid;
+ stats->last_read_flush = 0;
+ }
// Mismatches per cycle and GC-depth graph. For simplicity, reads overlapping GCD bins
// are not splitted which results in up to seq_len-1 overlaps. The default bin size is
// Coverage distribution graph
round_buffer_flush(stats,bam_line->core.pos);
- round_buffer_insert_read(&(stats->cov_rbuf),bam_line->core.pos,bam_line->core.pos+seq_len-1);
+ if ( stats->regions ) {
+ uint32_t p = bam_line->core.pos, pnew, pmin, pmax, j;
+ pmin = pmax = i = j = 0;
+ while ( j < bam_line->core.n_cigar && i < stats->nchunks ) {
+ int op = bam_cigar_op(bam_get_cigar(bam_line)[j]);
+ int oplen = bam_cigar_oplen(bam_get_cigar(bam_line)[j]);
+ switch(op) {
+ case BAM_CMATCH:
+ case BAM_CEQUAL:
+ case BAM_CDIFF:
+ pmin = MAX(p, stats->chunks[i].from-1);
+ pmax = MIN(p+oplen, stats->chunks[i].to);
+ if ( pmax >= pmin ) {
+ if ( stats->info->remove_overlaps )
+ remove_overlaps(bam_line, read_pairs, stats, pmin, pmax);
+ else
+ round_buffer_insert_read(&(stats->cov_rbuf), pmin, pmax-1);
+ }
+ break;
+ case BAM_CDEL:
+ break;
+ }
+ pnew = p + (bam_cigar_type(op)&2 ? oplen : 0); // consumes reference
+
+ if ( pnew >= stats->chunks[i].to ) {
+ // go to the next chunk
+ i++;
+ } else {
+ // go to the next CIGAR op
+ j++;
+ p = pnew;
+ }
+ }
+ } else {
+ uint32_t p = bam_line->core.pos, j;
+ for (j = 0; j < bam_line->core.n_cigar; j++) {
+ int op = bam_cigar_op(bam_get_cigar(bam_line)[j]);
+ int oplen = bam_cigar_oplen(bam_get_cigar(bam_line)[j]);
+ switch(op) {
+ case BAM_CMATCH:
+ case BAM_CEQUAL:
+ case BAM_CDIFF:
+ if ( stats->info->remove_overlaps )
+ remove_overlaps(bam_line, read_pairs, stats, p, p+oplen);
+ else
+ round_buffer_insert_read(&(stats->cov_rbuf), p, p+oplen-1);
+ break;
+ case BAM_CDEL:
+ break;
+ }
+ p += bam_cigar_type(op)&2 ? oplen : 0; // consumes reference
+ }
+ }
+ if ( stats->info->remove_overlaps )
+ remove_overlaps(bam_line, read_pairs, stats, -1, -1); //remove the line from the hash table
}
}
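The coverage code above now walks the CIGAR and inserts only `M`, `=` and `X` blocks into the coverage buffer, advancing the reference position for every reference-consuming operation. A self-contained sketch of that walk is shown below. It does not use htslib: the `OP_*` constants, `cigar_op_t`, `span_t` and `aligned_spans` are hypothetical names, with the numeric op values following the BAM encoding (`MIDNSHP=X`).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical op codes; the numeric values follow the BAM CIGAR encoding. */
enum { OP_M = 0, OP_I, OP_D, OP_N, OP_S, OP_H, OP_P, OP_EQ, OP_X };

typedef struct { int op; uint32_t len; } cigar_op_t;
typedef struct { int64_t beg, end; } span_t;   /* half-open [beg, end) */

/* M, D, N, = and X advance the reference position. */
static int consumes_ref(int op)
{
    return op == OP_M || op == OP_D || op == OP_N || op == OP_EQ || op == OP_X;
}

/* Only M, = and X place read bases on the reference. */
static int is_aligned(int op)
{
    return op == OP_M || op == OP_EQ || op == OP_X;
}

/* Write the reference spans covered by aligned bases into out[] (at most
 * max_out entries) and return how many spans were written. */
static size_t aligned_spans(int64_t pos, const cigar_op_t *cig, size_t n_cig,
                            span_t *out, size_t max_out)
{
    size_t n = 0, j;
    int64_t p = pos;

    for (j = 0; j < n_cig; j++) {
        if (is_aligned(cig[j].op) && n < max_out) {
            out[n].beg = p;
            out[n].end = p + cig[j].len;
            n++;
        }
        if (consumes_ref(cig[j].op))
            p += cig[j].len;
    }
    return n;
}
```

With a read at position 100 and CIGAR `5S10M2I3D20=`, the sketch yields `[100,110)` and `[113,133)`: the soft clip and insertion do not move the reference position, while the deletion moves it without adding coverage.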
void output_stats(FILE *to, stats_t *stats, int sparse)
{
// Calculate average insert size and standard deviation (from the main bulk data only)
- int isize, ibulk=0;
- uint64_t nisize=0, nisize_inward=0, nisize_outward=0, nisize_other=0;
+ int isize, ibulk=0, icov;
+ uint64_t nisize=0, nisize_inward=0, nisize_outward=0, nisize_other=0, cov_sum=0;
+ double bulk=0, avg_isize=0, sd_isize=0;
for (isize=0; isize<stats->isize->nitems(stats->isize->data); isize++)
{
// Each pair was counted twice
nisize += stats->isize->inward(stats->isize->data, isize) + stats->isize->outward(stats->isize->data, isize) + stats->isize->other(stats->isize->data, isize);
}
- double bulk=0, avg_isize=0, sd_isize=0;
for (isize=0; isize<stats->isize->nitems(stats->isize->data); isize++)
{
- bulk += stats->isize->inward(stats->isize->data, isize) + stats->isize->outward(stats->isize->data, isize) + stats->isize->other(stats->isize->data, isize);
+ uint64_t num = stats->isize->inward(stats->isize->data, isize) + stats->isize->outward(stats->isize->data, isize) + stats->isize->other(stats->isize->data, isize);
+ if (num > 0) ibulk = isize + 1;
+ bulk += num;
avg_isize += isize * (stats->isize->inward(stats->isize->data, isize) + stats->isize->outward(stats->isize->data, isize) + stats->isize->other(stats->isize->data, isize));
if ( bulk/nisize > stats->info->isize_main_bulk )
}
avg_isize /= nisize ? nisize : 1;
for (isize=1; isize<ibulk; isize++)
- sd_isize += (stats->isize->inward(stats->isize->data, isize) + stats->isize->outward(stats->isize->data, isize) +stats->isize->other(stats->isize->data, isize)) * (isize-avg_isize)*(isize-avg_isize) / nisize;
+ sd_isize += (stats->isize->inward(stats->isize->data, isize) + stats->isize->outward(stats->isize->data, isize) +stats->isize->other(stats->isize->data, isize)) * (isize-avg_isize)*(isize-avg_isize) / (nisize ? nisize : 1);
sd_isize = sqrt(sd_isize);
-
fprintf(to, "# This file was produced by samtools stats (%s+htslib-%s) and can be plotted using plot-bamstats\n", samtools_version(), hts_version());
if( stats->split_name != NULL ){
fprintf(to, "# This file contains statistics only for reads with tag: %s=%s\n", stats->info->split_tag, stats->split_name);
fprintf(to, "SN\treads QC failed:\t%ld\n", (long)stats->nreads_QCfailed);
fprintf(to, "SN\tnon-primary alignments:\t%ld\n", (long)stats->nreads_secondary);
fprintf(to, "SN\ttotal length:\t%ld\t# ignores clipping\n", (long)stats->total_len);
+ fprintf(to, "SN\ttotal first fragment length:\t%ld\t# ignores clipping\n", (long)stats->total_len_1st);
+ fprintf(to, "SN\ttotal last fragment length:\t%ld\t# ignores clipping\n", (long)stats->total_len_2nd);
fprintf(to, "SN\tbases mapped:\t%ld\t# ignores clipping\n", (long)stats->nbases_mapped); // the length of the whole read goes here, including soft-clips etc.
fprintf(to, "SN\tbases mapped (cigar):\t%ld\t# more accurate\n", (long)stats->nbases_mapped_cigar); // only matched and inserted bases are counted here
fprintf(to, "SN\tbases trimmed:\t%ld\n", (long)stats->nbases_trimmed);
fprintf(to, "SN\terror rate:\t%e\t# mismatches / bases mapped (cigar)\n", stats->nbases_mapped_cigar ? (float)stats->nmismatches/stats->nbases_mapped_cigar : 0);
float avg_read_length = (stats->nreads_1st+stats->nreads_2nd)?stats->total_len/(stats->nreads_1st+stats->nreads_2nd):0;
fprintf(to, "SN\taverage length:\t%.0f\n", avg_read_length);
+ fprintf(to, "SN\taverage first fragment length:\t%.0f\n", stats->nreads_1st? (float)stats->total_len_1st/stats->nreads_1st:0);
+ fprintf(to, "SN\taverage last fragment length:\t%.0f\n", stats->nreads_2nd? (float)stats->total_len_2nd/stats->nreads_2nd:0);
fprintf(to, "SN\tmaximum length:\t%d\n", stats->max_len);
+ fprintf(to, "SN\tmaximum first fragment length:\t%d\n", stats->max_len_1st);
+ fprintf(to, "SN\tmaximum last fragment length:\t%d\n", stats->max_len_2nd);
fprintf(to, "SN\taverage quality:\t%.1f\n", stats->total_len?stats->sum_qual/stats->total_len:0);
fprintf(to, "SN\tinsert size average:\t%.1f\n", avg_isize);
fprintf(to, "SN\tinsert size standard deviation:\t%.1f\n", sd_isize);
fprintf(to, "SN\toutward oriented pairs:\t%ld\n", (long)nisize_outward);
fprintf(to, "SN\tpairs with other orientation:\t%ld\n", (long)nisize_other);
fprintf(to, "SN\tpairs on different chromosomes:\t%ld\n", (long)stats->nreads_anomalous/2);
+ fprintf(to, "SN\tpercentage of properly paired reads (%%):\t%.1f\n", (stats->nreads_1st+stats->nreads_2nd)? (float)(100*stats->nreads_properly_paired)/(stats->nreads_1st+stats->nreads_2nd):0);
+ if ( stats->target_count ) {
+ fprintf(to, "SN\tbases inside the target:\t%u\n", stats->target_count);
+ for (icov=stats->info->cov_threshold+1; icov<stats->ncov; icov++)
+ cov_sum += stats->cov[icov];
+ fprintf(to, "SN\tpercentage of target genome with coverage > %d (%%):\t%.2f\n", stats->info->cov_threshold, (float)(100*cov_sum)/stats->target_count);
+ }
int ibase,iqual;
if ( stats->max_len<stats->nbases ) stats->max_len++;
if ( stats->max_qual+1<stats->nquals ) stats->max_qual++;
- fprintf(to, "# First Fragment Qualitites. Use `grep ^FFQ | cut -f 2-` to extract this part.\n");
+ fprintf(to, "# First Fragment Qualities. Use `grep ^FFQ | cut -f 2-` to extract this part.\n");
fprintf(to, "# Columns correspond to qualities and rows to cycles. First column is the cycle number.\n");
- for (ibase=0; ibase<stats->max_len; ibase++)
+ for (ibase=0; ibase<stats->max_len_1st; ibase++)
{
fprintf(to, "FFQ\t%d",ibase+1);
for (iqual=0; iqual<=stats->max_qual; iqual++)
}
fprintf(to, "\n");
}
- fprintf(to, "# Last Fragment Qualitites. Use `grep ^LFQ | cut -f 2-` to extract this part.\n");
+ fprintf(to, "# Last Fragment Qualities. Use `grep ^LFQ | cut -f 2-` to extract this part.\n");
fprintf(to, "# Columns correspond to qualities and rows to cycles. First column is the cycle number.\n");
- for (ibase=0; ibase<stats->max_len; ibase++)
+ for (ibase=0; ibase<stats->max_len_2nd; ibase++)
{
fprintf(to, "LFQ\t%d",ibase+1);
for (iqual=0; iqual<=stats->max_qual; iqual++)
fprintf(to, "# ACGT content per cycle. Use `grep ^GCC | cut -f 2-` to extract this part. The columns are: cycle; A,C,G,T base counts as a percentage of all A/C/G/T bases [%%]; and N and O counts as a percentage of all A/C/G/T bases [%%]\n");
for (ibase=0; ibase<stats->max_len; ibase++)
{
- acgtno_count_t *acgtno_count = &(stats->acgtno_cycles[ibase]);
- uint64_t acgt_sum = acgtno_count->a + acgtno_count->c + acgtno_count->g + acgtno_count->t;
+ acgtno_count_t *acgtno_count_1st = &(stats->acgtno_cycles_1st[ibase]);
+ acgtno_count_t *acgtno_count_2nd = &(stats->acgtno_cycles_2nd[ibase]);
+ uint64_t acgt_sum = acgtno_count_1st->a + acgtno_count_1st->c + acgtno_count_1st->g + acgtno_count_1st->t +
+ acgtno_count_2nd->a + acgtno_count_2nd->c + acgtno_count_2nd->g + acgtno_count_2nd->t;
if ( ! acgt_sum ) continue;
- fprintf(to, "GCC\t%d\t%.2f\t%.2f\t%.2f\t%.2f\t%.2f\t%.2f\n", ibase+1, 100.*acgtno_count->a/acgt_sum, 100.*acgtno_count->c/acgt_sum, 100.*acgtno_count->g/acgt_sum, 100.*acgtno_count->t/acgt_sum, 100.*acgtno_count->n/acgt_sum, 100.*acgtno_count->other/acgt_sum);
+ fprintf(to, "GCC\t%d\t%.2f\t%.2f\t%.2f\t%.2f\t%.2f\t%.2f\n", ibase+1,
+ 100.*(acgtno_count_1st->a + acgtno_count_2nd->a)/acgt_sum,
+ 100.*(acgtno_count_1st->c + acgtno_count_2nd->c)/acgt_sum,
+ 100.*(acgtno_count_1st->g + acgtno_count_2nd->g)/acgt_sum,
+ 100.*(acgtno_count_1st->t + acgtno_count_2nd->t)/acgt_sum,
+ 100.*(acgtno_count_1st->n + acgtno_count_2nd->n)/acgt_sum,
+ 100.*(acgtno_count_1st->other + acgtno_count_2nd->other)/acgt_sum);
+
+ }
+ fprintf(to, "# ACGT content per cycle for first fragments. Use `grep ^FBC | cut -f 2-` to extract this part. The columns are: cycle; A,C,G,T base counts as a percentage of all A/C/G/T bases [%%]; and N and O counts as a percentage of all A/C/G/T bases [%%]\n");
+ for (ibase=0; ibase<stats->max_len; ibase++)
+ {
+ acgtno_count_t *acgtno_count_1st = &(stats->acgtno_cycles_1st[ibase]);
+ uint64_t acgt_sum_1st = acgtno_count_1st->a + acgtno_count_1st->c + acgtno_count_1st->g + acgtno_count_1st->t;
+
+ if ( acgt_sum_1st )
+ fprintf(to, "FBC\t%d\t%.2f\t%.2f\t%.2f\t%.2f\t%.2f\t%.2f\n", ibase+1,
+ 100.*acgtno_count_1st->a/acgt_sum_1st,
+ 100.*acgtno_count_1st->c/acgt_sum_1st,
+ 100.*acgtno_count_1st->g/acgt_sum_1st,
+ 100.*acgtno_count_1st->t/acgt_sum_1st,
+ 100.*acgtno_count_1st->n/acgt_sum_1st,
+ 100.*acgtno_count_1st->other/acgt_sum_1st);
+
+ }
+ fprintf(to, "# ACGT content per cycle for last fragments. Use `grep ^LBC | cut -f 2-` to extract this part. The columns are: cycle; A,C,G,T base counts as a percentage of all A/C/G/T bases [%%]; and N and O counts as a percentage of all A/C/G/T bases [%%]\n");
+ for (ibase=0; ibase<stats->max_len; ibase++)
+ {
+ acgtno_count_t *acgtno_count_2nd = &(stats->acgtno_cycles_2nd[ibase]);
+ uint64_t acgt_sum_2nd = acgtno_count_2nd->a + acgtno_count_2nd->c + acgtno_count_2nd->g + acgtno_count_2nd->t;
+
+ if ( acgt_sum_2nd )
+ fprintf(to, "LBC\t%d\t%.2f\t%.2f\t%.2f\t%.2f\t%.2f\t%.2f\n", ibase+1,
+ 100.*acgtno_count_2nd->a/acgt_sum_2nd,
+ 100.*acgtno_count_2nd->c/acgt_sum_2nd,
+ 100.*acgtno_count_2nd->g/acgt_sum_2nd,
+ 100.*acgtno_count_2nd->t/acgt_sum_2nd,
+ 100.*acgtno_count_2nd->n/acgt_sum_2nd,
+ 100.*acgtno_count_2nd->other/acgt_sum_2nd);
+
}
fprintf(to, "# Insert sizes. Use `grep ^IS | cut -f 2-` to extract this part. The columns are: insert size, pairs total, inward oriented pairs, outward oriented pairs, other pairs\n");
for (isize=0; isize<ibulk; isize++) {
int ilen;
for (ilen=0; ilen<stats->max_len; ilen++)
{
- if ( stats->read_lengths[ilen]>0 )
- fprintf(to, "RL\t%d\t%ld\n", ilen, (long)stats->read_lengths[ilen]);
+ if ( stats->read_lengths[ilen+1]>0 )
+ fprintf(to, "RL\t%d\t%ld\n", ilen+1, (long)stats->read_lengths[ilen+1]);
+ }
+
+ fprintf(to, "# Read lengths - first fragments. Use `grep ^FRL | cut -f 2-` to extract this part. The columns are: read length, count\n");
+ for (ilen=0; ilen<stats->max_len_1st; ilen++)
+ {
+ if ( stats->read_lengths_1st[ilen+1]>0 )
+ fprintf(to, "FRL\t%d\t%ld\n", ilen+1, (long)stats->read_lengths_1st[ilen+1]);
+ }
+
+ fprintf(to, "# Read lengths - last fragments. Use `grep ^LRL | cut -f 2-` to extract this part. The columns are: read length, count\n");
+ for (ilen=0; ilen<stats->max_len_2nd; ilen++)
+ {
+ if ( stats->read_lengths_2nd[ilen+1]>0 )
+ fprintf(to, "LRL\t%d\t%ld\n", ilen+1, (long)stats->read_lengths_2nd[ilen+1]);
}
fprintf(to, "# Indel distribution. Use `grep ^ID | cut -f 2-` to extract this part. The columns are: length, number of insertions, number of deletions\n");
+
for (ilen=0; ilen<stats->nindels; ilen++)
{
if ( stats->insertions[ilen]>0 || stats->deletions[ilen]>0 )
fprintf(to, "# Coverage distribution. Use `grep ^COV | cut -f 2-` to extract this part.\n");
if ( stats->cov[0] )
fprintf(to, "COV\t[<%d]\t%d\t%ld\n",stats->info->cov_min,stats->info->cov_min-1, (long)stats->cov[0]);
- int icov;
for (icov=1; icov<stats->ncov-1; icov++)
if ( stats->cov[icov] )
fprintf(to, "COV\t[%d-%d]\t%d\t%ld\n",stats->info->cov_min + (icov-1)*stats->info->cov_step, stats->info->cov_min + icov*stats->info->cov_step-1,stats->info->cov_min + icov*stats->info->cov_step-1, (long)stats->cov[icov]);
if ( !fp ) error("%s: %s\n",file,strerror(errno));
kstring_t line = { 0, 0, NULL };
- int warned = 0;
+ int warned = 0, r, p, new_p;
int prev_tid=-1, prev_pos=-1;
while (line.l = 0, kgetline(&line, (kgets_func *)fgets, fp) >= 0)
{
if ( prev_pos>stats->regions[tid].pos[npos].from )
error("The positions are not in chromosomal order (%s:%d comes after %d)\n", line.s,stats->regions[tid].pos[npos].from,prev_pos);
stats->regions[tid].npos++;
+ if ( stats->regions[tid].npos > stats->nchunks )
+ stats->nchunks = stats->regions[tid].npos;
}
free(line.s);
if ( !stats->regions ) error("Unable to map the -t sequences to the BAM sequences.\n");
fclose(fp);
+
+ // sort region intervals and remove duplicates
+ for (r = 0; r < stats->nregions; r++) {
+ regions_t *reg = &stats->regions[r];
+ if ( reg->npos > 1 ) {
+ qsort(reg->pos, reg->npos, sizeof(pos_t), regions_lt);
+ for (new_p = 0, p = 1; p < reg->npos; p++) {
+ if ( reg->pos[new_p].to < reg->pos[p].from )
+ reg->pos[++new_p] = reg->pos[p];
+ else if ( reg->pos[new_p].to < reg->pos[p].to )
+ reg->pos[new_p].to = reg->pos[p].to;
+ }
+ reg->npos = ++new_p;
+ }
+ for (p = 0; p < reg->npos; p++)
+ stats->target_count += (reg->pos[p].to - reg->pos[p].from + 1);
+ }
+
+ stats->chunks = calloc(stats->nchunks, sizeof(pos_t));
}
void destroy_regions(stats_t *stats)
free(stats->regions[i].pos);
}
if ( stats->regions ) free(stats->regions);
+ if ( stats->chunks ) free(stats->chunks);
}
void reset_regions(stats_t *stats)
int i = reg->cpos;
while ( i<reg->npos && reg->pos[i].to<=bam_line->core.pos ) i++;
if ( i>=reg->npos ) { reg->cpos = reg->npos; return 0; }
- if ( bam_line->core.pos + bam_line->core.l_qseq + 1 < reg->pos[i].from ) return 0;
+ int64_t endpos = bam_endpos(bam_line);
+ if ( endpos < reg->pos[i].from ) return 0;
+
+ //found a read overlapping a region
reg->cpos = i;
stats->reg_from = reg->pos[i].from;
stats->reg_to = reg->pos[i].to;
+ //now find all the overlapping chunks
+ stats->nchunks = 0;
+ while (i < reg->npos) {
+ if (bam_line->core.pos < reg->pos[i].to && endpos >= reg->pos[i].from) {
+ stats->chunks[stats->nchunks].from = MAX(bam_line->core.pos+1, reg->pos[i].from);
+ stats->chunks[stats->nchunks].to = MIN(endpos, reg->pos[i].to);
+ stats->nchunks++;
+ }
+ i++;
+ }
+
return 1;
}
+int replicate_regions(stats_t *stats, hts_itr_multi_t *iter) {
+ if ( !stats || !iter)
+ return 1;
+
+ int i, j, tid;
+ stats->nregions = iter->n_reg;
+ stats->regions = calloc(stats->nregions, sizeof(regions_t));
+ stats->chunks = calloc(stats->nchunks, sizeof(pos_t));
+ if ( !stats->regions || !stats->chunks )
+ return 1;
+
+ for (i = 0; i < iter->n_reg; i++) {
+ tid = iter->reg_list[i].tid;
+ if ( tid < 0 )
+ continue;
+
+ if ( tid >= stats->nregions ) {
+ regions_t *tmp = realloc(stats->regions, (tid+10) * sizeof(regions_t));
+ if ( !tmp )
+ return 1;
+ stats->regions = tmp;
+ memset(stats->regions + stats->nregions, 0,
+ (tid+10-stats->nregions) * sizeof(regions_t));
+ stats->nregions = tid+10;
+ }
+
+ stats->regions[tid].mpos = stats->regions[tid].npos = iter->reg_list[i].count;
+ stats->regions[tid].pos = calloc(stats->regions[tid].mpos, sizeof(pos_t));
+ if ( !stats->regions[tid].pos )
+ return 1;
+
+ for (j = 0; j < stats->regions[tid].npos; j++) {
+ stats->regions[tid].pos[j].from = iter->reg_list[i].intervals[j].beg+1;
+ stats->regions[tid].pos[j].to = iter->reg_list[i].intervals[j].end;
+
+ stats->target_count += (stats->regions[tid].pos[j].to - stats->regions[tid].pos[j].from + 1);
+ }
+ }
+
+ return 0;
+}
+
void init_group_id(stats_t *stats, const char *id)
{
#if 0
fprintf(samtools_stdout, " -S, --split <tag> Also write statistics to separate files split by tagged field.\n");
fprintf(samtools_stdout, " -t, --target-regions <file> Do stats in these regions only. Tab-delimited file chr,from,to, 1-based, inclusive.\n");
fprintf(samtools_stdout, " -x, --sparse Suppress outputting IS rows where there are no insertions.\n");
+ fprintf(samtools_stdout, " -p, --remove-overlaps Remove overlaps of paired-end reads from coverage and base count computations.\n");
+ fprintf(samtools_stdout, " -g, --cov-threshold Only bases with coverage above this value will be included in the target percentage computation.\n");
sam_global_opt_help(samtools_stdout, "-.--.@");
fprintf(samtools_stdout, "\n");
}
free(stats->gcd);
free(stats->rseq_buf);
free(stats->mpc_buf);
- free(stats->acgtno_cycles);
+ free(stats->acgtno_cycles_1st);
+ free(stats->acgtno_cycles_2nd);
free(stats->read_lengths);
+ free(stats->read_lengths_1st);
+ free(stats->read_lengths_2nd);
free(stats->insertions);
free(stats->deletions);
free(stats->ins_cycles_1st);
stats_t *curr_stats = NULL;
for(i = kh_begin(split_hash); i != kh_end(split_hash); ++i){
if(!kh_exist(split_hash, i)) continue;
- curr_stats = kh_value(split_hash, i);
- cleanup_stats(curr_stats);
+ curr_stats = kh_value(split_hash, i);
+ cleanup_stats(curr_stats);
}
kh_destroy(c2stats, split_hash);
}
info->filter_readlen = -1;
info->argc = argc;
info->argv = argv;
+ info->remove_overlaps = 0;
+ info->cov_threshold = 0;
return info;
}
stats->ngc = 200;
stats->nquals = 256;
stats->nbases = 300;
- stats->max_len = 30;
- stats->max_qual = 40;
stats->rseq_pos = -1;
stats->tid = stats->gcd_pos = -1;
stats->igcd = 0;
stats->is_sorted = 1;
stats->nindels = stats->nbases;
stats->split_name = NULL;
+ stats->nchunks = 0;
+ stats->pair_count = 0;
+ stats->last_pair_tid = -2;
+ stats->last_read_flush = 0;
+ stats->target_count = 0;
return stats;
}
stats->isize = init_isize_t(info->nisize ?info->nisize+1 :0);
stats->gcd = calloc(stats->ngcd,sizeof(gc_depth_t));
stats->mpc_buf = info->fai ? calloc(stats->nquals*stats->nbases,sizeof(uint64_t)) : NULL;
- stats->acgtno_cycles = calloc(stats->nbases,sizeof(acgtno_count_t));
+ stats->acgtno_cycles_1st = calloc(stats->nbases,sizeof(acgtno_count_t));
+ stats->acgtno_cycles_2nd = calloc(stats->nbases,sizeof(acgtno_count_t));
stats->read_lengths = calloc(stats->nbases,sizeof(uint64_t));
+ stats->read_lengths_1st = calloc(stats->nbases,sizeof(uint64_t));
+ stats->read_lengths_2nd = calloc(stats->nbases,sizeof(uint64_t));
stats->insertions = calloc(stats->nbases,sizeof(uint64_t));
stats->deletions = calloc(stats->nbases,sizeof(uint64_t));
stats->ins_cycles_1st = calloc(stats->nbases+1,sizeof(uint64_t));
{"sparse", no_argument, NULL, 'x'},
{"split", required_argument, NULL, 'S'},
{"split-prefix", required_argument, NULL, 'P'},
+ {"remove-overlaps", no_argument, NULL, 'p'},
+ {"cov-threshold", required_argument, NULL, 'g'},
{NULL, 0, NULL, 0}
};
int opt;
- while ( (opt=getopt_long(argc,argv,"?hdsxr:c:l:i:t:m:q:f:F:I:1:S:P:@:",loptions,NULL))>0 )
+ while ( (opt=getopt_long(argc,argv,"?hdsxpr:c:l:i:t:m:q:f:F:g:I:1:S:P:@:",loptions,NULL))>0 )
{
switch (opt)
{
case 'f': info->flag_require = bam_str2flag(optarg); break;
- case 'F': info->flag_filter = bam_str2flag(optarg); break;
+ case 'F': info->flag_filter |= bam_str2flag(optarg); break;
case 'd': info->flag_filter |= BAM_FDUP; break;
case 's': break;
case 'r': info->fai = fai_load(optarg);
case 'x': sparse = 1; break;
case 'S': info->split_tag = optarg; break;
case 'P': info->split_prefix = optarg; break;
+ case 'p': info->remove_overlaps = 1; break;
+ case 'g': info->cov_threshold = atoi(optarg);
+ if ( info->cov_threshold < 0 || info->cov_threshold == INT_MAX )
+ error("Unsupported value for coverage threshold %d\n", info->cov_threshold);
+ break;
case '?':
case 'h': error(NULL);
default:
bam_fname = "-";
}
- if (init_stat_info_fname(info, bam_fname, &ga.in)) return 1;
+ if (init_stat_info_fname(info, bam_fname, &ga.in)) {
+ free(info);
+ return 1;
+ }
if (ga.nthreads > 0)
hts_set_threads(info->sam, ga.nthreads);
// .. hash
khash_t(c2stats)* split_hash = kh_init(c2stats);
+ khash_t(qn2pair)* read_pairs = kh_init(qn2pair);
+
// Collect statistics
bam1_t *bam_line = bam_init1();
if ( optind<argc )
{
- // Collect stats in selected regions only
- hts_idx_t *bam_idx = sam_index_load(info->sam,bam_fname);
- if (bam_idx == 0)
- error("Random alignment retrieval only works for indexed BAM files.\n");
-
- int i;
- for (i=optind; i<argc; i++)
- {
- hts_itr_t* iter = bam_itr_querys(bam_idx, info->sam_header, argv[i]);
- while (sam_itr_next(info->sam, iter, bam_line) >= 0) {
- if (info->split_tag) {
- curr_stats = get_curr_split_stats(bam_line, split_hash, info, targets);
- collect_stats(bam_line, curr_stats);
+ int filter = 1;
+ // Prepare the region hash table for the multi-region iterator
+ void *region_hash = bed_hash_regions(NULL, argv, optind, argc, &filter);
+ if (region_hash) {
+
+ // Collect stats in selected regions only
+ hts_idx_t *bam_idx = sam_index_load(info->sam,bam_fname);
+ if (bam_idx) {
+
+ int regcount = 0;
+ hts_reglist_t *reglist = bed_reglist(region_hash, ALL, &regcount);
+ if (reglist) {
+
+ hts_itr_multi_t *iter = sam_itr_regions(bam_idx, info->sam_header, reglist, regcount);
+ if (iter) {
+
+ if (!targets) {
+ all_stats->nchunks = argc-optind;
+ if ( replicate_regions(all_stats, iter) )
+ fprintf(samtools_stderr, "Replications of the regions failed.");
+ }
+
+ if ( all_stats->nregions && all_stats->regions ) {
+ while (sam_itr_multi_next(info->sam, iter, bam_line) >= 0) {
+ if (info->split_tag) {
+ curr_stats = get_curr_split_stats(bam_line, split_hash, info, targets);
+ collect_stats(bam_line, curr_stats, read_pairs);
+ }
+ collect_stats(bam_line, all_stats, read_pairs);
+ }
+ }
+
+ hts_itr_multi_destroy(iter);
+ } else {
+ fprintf(samtools_stderr, "Creation of the region iterator failed.");
+ hts_reglist_free(reglist, regcount);
+ }
+ } else {
+ fprintf(samtools_stderr, "Creation of the region list failed.");
}
- collect_stats(bam_line, all_stats);
+
+ hts_idx_destroy(bam_idx);
+ } else {
+ fprintf(samtools_stderr, "Random alignment retrieval only works for indexed BAM files.\n");
}
- reset_regions(all_stats);
- bam_itr_destroy(iter);
+
+ bed_destroy(region_hash);
+ } else {
+ fprintf(samtools_stderr, "Creation of the region hash table failed.\n");
}
- hts_idx_destroy(bam_idx);
}
else
{
+ if ( info->cov_threshold > 0 && !targets ) {
+ fprintf(samtools_stderr, "Coverage percentage calculation requires a list of target regions\n");
+ goto cleanup;
+ }
+
// Stream through the entire BAM ignoring off-target regions if -t is given
int ret;
while ((ret = sam_read1(info->sam, info->sam_header, bam_line)) >= 0) {
if (info->split_tag) {
curr_stats = get_curr_split_stats(bam_line, split_hash, info, targets);
- collect_stats(bam_line, curr_stats);
+ collect_stats(bam_line, curr_stats, read_pairs);
}
- collect_stats(bam_line, all_stats);
+ collect_stats(bam_line, all_stats, read_pairs);
}
if (ret < -1) {
if (info->split_tag)
output_split_stats(split_hash, bam_fname, sparse);
+cleanup:
bam_destroy1(bam_line);
bam_hdr_destroy(info->sam_header);
sam_global_args_free(&ga);
cleanup_stats(all_stats);
cleanup_stats_info(info);
destroy_split_stats(split_hash);
+ cleanup_overlaps(read_pairs, INT_MAX);
return 0;
}
}
bool check_test_1(const bam_hdr_t* hdr) {
- char test1_res[200];
- snprintf(test1_res, 199,
+ const char *test1_res =
"@HD\tVN:1.4\n"
"@SQ\tSN:blah\n"
- "@PG\tID:samtools\tPN:samtools\tVN:%s\tCL:test_filter_header_rg foo bar baz\n", samtools_version());
+ "@PG\tID:samtools\tPN:samtools\tVN:x.y.test\tCL:test_filter_header_rg foo bar baz\n";
if (strcmp(hdr->text, test1_res)) {
return false;
}
bool check_test_2(const bam_hdr_t* hdr) {
- char test2_res[200];
- snprintf(test2_res, 199,
+ const char *test2_res =
"@HD\tVN:1.4\n"
"@SQ\tSN:blah\n"
"@RG\tID:fish\n"
- "@PG\tID:samtools\tPN:samtools\tVN:%s\tCL:test_filter_header_rg foo bar baz\n", samtools_version());
+ "@PG\tID:samtools\tPN:samtools\tVN:x.y.test\tCL:test_filter_header_rg foo bar baz\n";
if (strcmp(hdr->text, test2_res)) {
return false;
bam_hdr_t* hdr1;
const char* id_to_keep_1 = "1#2.3";
setup_test_1(&hdr1);
- if (verbose > 0) {
+ if (verbose > 1) {
printf("hdr1\n");
dump_hdr(hdr1);
}
fclose(stderr);
if (verbose) printf("END RUN test 1\n");
- if (verbose > 0) {
+ if (verbose > 1) {
printf("hdr1\n");
dump_hdr(hdr1);
}
bam_hdr_t* hdr2;
const char* id_to_keep_2 = "fish";
setup_test_2(&hdr2);
- if (verbose > 0) {
+ if (verbose > 1) {
printf("hdr2\n");
dump_hdr(hdr2);
}
fclose(stderr);
if (verbose) printf("END RUN test 2\n");
- if (verbose > 0) {
+ if (verbose > 1) {
printf("hdr2\n");
dump_hdr(hdr2);
}
}
bool check_test_1(const bam_hdr_t* hdr) {
- char test1_res[200];
- snprintf(test1_res, 199,
+ const char *test1_res =
"@HD\tVN:1.4\n"
"@SQ\tSN:blah\n"
- "@PG\tID:samtools\tPN:samtools\tVN:%s\tCL:test_filter_header_rg foo bar baz\n", samtools_version());
+ "@PG\tID:samtools\tPN:samtools\tVN:x.y.test\tCL:test_filter_header_rg foo bar baz\n";
if (strcmp(hdr->text, test1_res)) {
return false;
}
bool check_test_2(const bam_hdr_t* hdr) {
- char test2_res[200];
- snprintf(test2_res, 199,
+ const char *test2_res =
"@HD\tVN:1.4\n"
"@SQ\tSN:blah\n"
"@RG\tID:fish\n"
- "@PG\tID:samtools\tPN:samtools\tVN:%s\tCL:test_filter_header_rg foo bar baz\n", samtools_version());
+ "@PG\tID:samtools\tPN:samtools\tVN:x.y.test\tCL:test_filter_header_rg foo bar baz\n";
if (strcmp(hdr->text, test2_res)) {
return false;
bam_hdr_t* hdr1;
const char* id_to_keep_1 = "1#2.3";
setup_test_1(&hdr1);
- if (verbose > 0) {
+ if (verbose > 1) {
fprintf(samtools_stdout, "hdr1\n");
dump_hdr(hdr1);
}
fclose(samtools_stderr);
if (verbose) fprintf(samtools_stdout, "END RUN test 1\n");
- if (verbose > 0) {
+ if (verbose > 1) {
fprintf(samtools_stdout, "hdr1\n");
dump_hdr(hdr1);
}
bam_hdr_t* hdr2;
const char* id_to_keep_2 = "fish";
setup_test_2(&hdr2);
- if (verbose > 0) {
+ if (verbose > 1) {
fprintf(samtools_stdout, "hdr2\n");
dump_hdr(hdr2);
}
fclose(samtools_stderr);
if (verbose) fprintf(samtools_stdout, "END RUN test 2\n");
- if (verbose > 0) {
+ if (verbose > 1) {
fprintf(samtools_stdout, "hdr2\n");
dump_hdr(hdr2);
}
}
printf("text: \"%s\"\n", hdr->text);
}
+
+// For tests, just return a constant that can be embedded in expected output.
+const char *samtools_version(void)
+{
+ return "x.y.test";
+}
}
fprintf(samtools_stdout, "text: \"%s\"\n", hdr->text);
}
+
+// For tests, just return a constant that can be embedded in expected output.
+const char *samtools_version(void)
+{
+ return "x.y.test";
+}
-#define SAMTOOLS_VERSION "1.7"
+#define SAMTOOLS_VERSION "1.9"
+++ /dev/null
-## This script contains some example code
-## illustrating ways to use the pysam
-## interface to samtools.
-##
-## The unit tests in the script pysam_test.py
-## contain more examples.
-##
-
-import pysam
-
-samfile = pysam.Samfile( "ex1.bam", "rb" )
-
-print "###################"
-# check different ways to iterate
-print len(list(samfile.fetch()))
-print len(list(samfile.fetch( "chr1", 10, 200 )))
-print len(list(samfile.fetch( region="chr1:10-200" )))
-print len(list(samfile.fetch( "chr1" )))
-print len(list(samfile.fetch( region="chr1")))
-print len(list(samfile.fetch( "chr2" )))
-print len(list(samfile.fetch( region="chr2")))
-print len(list(samfile.fetch()))
-print len(list(samfile.fetch( "chr1" )))
-print len(list(samfile.fetch( region="chr1")))
-print len(list(samfile.fetch()))
-
-print len(list(samfile.pileup( "chr1", 10, 200 )))
-print len(list(samfile.pileup( region="chr1:10-200" )))
-print len(list(samfile.pileup( "chr1" )))
-print len(list(samfile.pileup( region="chr1")))
-print len(list(samfile.pileup( "chr2" )))
-print len(list(samfile.pileup( region="chr2")))
-print len(list(samfile.pileup()))
-print len(list(samfile.pileup()))
-
-print "########### fetch with callback ################"
-def my_fetch_callback( alignment ): print str(alignment)
-samfile.fetch( region="chr1:10-200", callback=my_fetch_callback )
-
-print "########## pileup with callback ################"
-def my_pileup_callback( column ): print str(column)
-samfile.pileup( region="chr1:10-200", callback=my_pileup_callback )
-
-
-print "########### Using a callback object ###### "
-
-class Counter:
- mCounts = 0
- def __call__(self, alignment):
- self.mCounts += 1
-
-c = Counter()
-samfile.fetch( region = "chr1:10-200", callback = c )
-print "counts=", c.mCounts
-
-print "########### Calling a samtools command line function ############"
-
-for p in pysam.mpileup( "-c", "ex1.bam" ):
- print str(p)
-
-print pysam.mpileup.getMessages()
-
-print "########### Investigating headers #######################"
-
-# playing around with headers
-samfile = pysam.Samfile( "ex3.sam", "r" )
-print samfile.references
-print samfile.lengths
-print samfile.text
-print samfile.header
-header = samfile.header
-samfile.close()
-
-header["HD"]["SO"] = "unsorted"
-outfile = pysam.Samfile( "out.sam", "wh",
- header = header )
-
-outfile.close()
-
+++ /dev/null
-'''benchmark pysam BAM/SAM access with the samtools commandline tools.
-
-samtools functions are called via the pysam interface to avoid the over-head
-of starting additional processes.
-'''
-
-import pysam
-import timeit
-
-iterations = 10
-
-def runBenchmark( test,
- pysam_way,
- samtools_way = None):
- print test
- print timeit.repeat( pysam_way, number = iterations, setup="from __main__ import pysam" )
- if samtools_way:
- print timeit.repeat( samtools_way, number = iterations, setup="from __main__ import pysam" )
-
-runBenchmark( "Samfile.fetch",
-'''
-f = pysam.Samfile( "ex1.bam", "rb" )
-results = list(f.fetch())
-''',
-'''
-f = pysam.view( "ex1.bam" )
-'''
-)
-
-runBenchmark( "Samfile.pileup",
-'''
-f = pysam.Samfile( "ex1.bam", "rb" )
-results = list(f.pileup())
-''',
-'''
-f = pysam.pileup( "ex1.bam" )
-''')
-
-runBenchmark( "Samfile.pileup with coverage retrieval",
-'''
-f = pysam.Samfile( "ex1.bam", "rb" )
-results = [ x.n for x in f.pileup() ]
-''' )
-
-runBenchmark( "Samfile.pileup with full retrieval",
-'''
-f = pysam.Samfile( "ex1.bam", "rb" )
-results = [ x.pileups for x in f.pileup() ]
-''' )
-
-runBenchmark( "Samfile.pileup - many references",
-'''
-f = pysam.Samfile( "manyrefs.bam", "rb" )
-results = [ x.pileups for x in f.pileup() ]
-''',
-'''
-f = pysam.pileup( "manyrefs.bam" )
-'''
- )
-
-
-
-
+++ /dev/null
-#!/usr/bin/env python
-'''unit testing code for pysam.
-
-Execute in the :file:`tests` directory as it requires the Makefile
-and data files located there.
-'''
-
-import pysam
-import unittest
-import os, re, sys
-import itertools
-import collections
-import subprocess
-import shutil
-import logging
-
-
-if sys.version_info[0] < 3:
- from itertools import izip as zip_longest
-else:
- from itertools import zip_longest
-
-
-SAMTOOLS="samtools"
-WORKDIR="pysam_test_work"
-
-def checkBinaryEqual( filename1, filename2 ):
- '''return true if the two files are binary equal.'''
- if os.path.getsize( filename1 ) != os.path.getsize( filename2 ):
- return False
-
- infile1 = open(filename1, "rb")
- infile2 = open(filename2, "rb")
-
- def chariter( infile ):
- while 1:
- c = infile.read(1)
- if c == b"": break
- yield c
-
- found = False
- for c1,c2 in zip_longest( chariter( infile1), chariter( infile2) ):
- if c1 != c2: break
- else:
- found = True
-
- infile1.close()
- infile2.close()
- return found
-
-def runSamtools( cmd ):
- '''run a samtools command'''
-
- try:
- retcode = subprocess.call(cmd, shell=True)
- if retcode < 0:
- print >>sys.stderr, "Child was terminated by signal", -retcode
- except OSError as e:
- print >>sys.stderr, "Execution failed:", e
-
-def getSamtoolsVersion():
- '''return samtools version'''
-
- pipe = subprocess.Popen(SAMTOOLS, shell=True, stderr=subprocess.PIPE).stderr
- lines = b"".join(pipe.readlines())
- if sys.version_info[0] >= 3:
- lines = lines.decode('ascii')
- return re.search( "Version:\s+(\S+)", lines).groups()[0]
-
-class BinaryTest(unittest.TestCase):
- '''test samtools command line commands and compare
- against pysam commands.
-
- Tests fail, if the output is not binary identical.
- '''
-
- first_time = True
-
- # a list of commands to test
- commands = \
- {
- "view" :
- (
- ("ex1.view", "view ex1.bam > ex1.view"),
- ("pysam_ex1.view", (pysam.view, "ex1.bam" ) ),
- ),
- "view2" :
- (
- ("ex1.view", "view -bT ex1.fa -o ex1.view2 ex1.sam"),
- # note that -o ex1.view2 throws exception.
- ("pysam_ex1.view", (pysam.view, "-bT ex1.fa -oex1.view2 ex1.sam" ) ),
- ),
- "sort" :
- (
- ( "ex1.sort.bam", "sort ex1.bam ex1.sort" ),
- ( "pysam_ex1.sort.bam", (pysam.sort, "ex1.bam pysam_ex1.sort" ) ),
- ),
- "mpileup" :
- (
- ("ex1.pileup", "mpileup ex1.bam > ex1.pileup" ),
- ("pysam_ex1.mpileup", (pysam.mpileup, "ex1.bam" ) ),
- ),
- "depth" :
- (
- ("ex1.depth", "depth ex1.bam > ex1.depth" ),
- ("pysam_ex1.depth", (pysam.depth, "ex1.bam" ) ),
- ),
- "faidx" :
- (
- ("ex1.fa.fai", "faidx ex1.fa"),
- ("pysam_ex1.fa.fai", (pysam.faidx, "ex1.fa") ),
- ),
- "index":
- (
- ("ex1.bam.bai", "index ex1.bam" ),
- ("pysam_ex1.bam.bai", (pysam.index, "pysam_ex1.bam" ) ),
- ),
- "idxstats" :
- (
- ("ex1.idxstats", "idxstats ex1.bam > ex1.idxstats" ),
- ("pysam_ex1.idxstats", (pysam.idxstats, "pysam_ex1.bam" ) ),
- ),
- "fixmate" :
- (
- ("ex1.fixmate", "fixmate ex1.bam ex1.fixmate" ),
- ("pysam_ex1.fixmate", (pysam.fixmate, "pysam_ex1.bam pysam_ex1.fixmate") ),
- ),
- "flagstat" :
- (
- ("ex1.flagstat", "flagstat ex1.bam > ex1.flagstat" ),
- ("pysam_ex1.flagstat", (pysam.flagstat, "pysam_ex1.bam") ),
- ),
- "calmd" :
- (
- ("ex1.calmd", "calmd ex1.bam ex1.fa > ex1.calmd" ),
- ("pysam_ex1.calmd", (pysam.calmd, "pysam_ex1.bam ex1.fa") ),
- ),
- "merge" :
- (
- ("ex1.merge", "merge -f ex1.merge ex1.bam ex1.bam" ),
- # -f option does not work - following command will cause the subsequent
- # command to fail
- ("pysam_ex1.merge", (pysam.merge, "pysam_ex1.merge pysam_ex1.bam pysam_ex1.bam") ),
- ),
- "rmdup" :
- (
- ("ex1.rmdup", "rmdup ex1.bam ex1.rmdup" ),
- ("pysam_ex1.rmdup", (pysam.rmdup, "pysam_ex1.bam pysam_ex1.rmdup" )),
- ),
- "reheader" :
- (
- ( "ex1.reheader", "reheader ex1.bam ex1.bam > ex1.reheader"),
- ( "pysam_ex1.reheader", (pysam.reheader, "ex1.bam ex1.bam" ) ),
- ),
- "cat":
- (
- ( "ex1.cat", "cat ex1.bam ex1.bam > ex1.cat"),
- ( "pysam_ex1.cat", (pysam.cat, "ex1.bam ex1.bam" ) ),
- ),
- "targetcut":
- (
- ("ex1.targetcut", "targetcut ex1.bam > ex1.targetcut" ),
- ("pysam_ex1.targetcut", (pysam.targetcut, "pysam_ex1.bam") ),
- ),
- "phase":
- (
- ("ex1.phase", "phase ex1.bam > ex1.phase" ),
- ("pysam_ex1.phase", (pysam.phase, "pysam_ex1.bam") ),
- ),
- "import" :
- (
- ("ex1.bam", "import ex1.fa.fai ex1.sam.gz ex1.bam" ),
- ("pysam_ex1.bam", (pysam.samimport, "ex1.fa.fai ex1.sam.gz pysam_ex1.bam") ),
- ),
- "bam2fq":
- (
- ("ex1.bam2fq", "bam2fq ex1.bam > ex1.bam2fq" ),
- ("pysam_ex1.bam2fq", (pysam.bam2fq, "pysam_ex1.bam") ),
- ),
- }
-
- # some tests depend on others. The order specifies in which order
- # the samtools commands are executed.
- # The first three (faidx, import, index) need to be in that order,
- # the rest is arbitrary.
- order = ('faidx', 'import', 'index',
- # 'pileup1', 'pileup2', deprecated
- # 'glfview', deprecated
- 'view', 'view2',
- 'sort',
- 'mpileup',
- 'depth',
- 'idxstats',
- 'fixmate',
- 'flagstat',
- # 'calmd',
- 'merge',
- 'rmdup',
- 'reheader',
- 'cat',
- 'targetcut',
- 'phase',
- 'bam2fq',
- )
-
- def setUp( self ):
- '''setup tests.
-
- For setup, all commands will be run before the first test is
- executed. Individual tests will then just compare the output
- files.
- '''
- if BinaryTest.first_time:
-
- # remove previous files
- if os.path.exists( WORKDIR ):
- shutil.rmtree( WORKDIR )
-
- # copy the source files to WORKDIR
- os.makedirs( WORKDIR )
-
- shutil.copy( "ex1.fa", os.path.join( WORKDIR, "pysam_ex1.fa" ) )
- shutil.copy( "ex1.fa", os.path.join( WORKDIR, "ex1.fa" ) )
- shutil.copy( "ex1.sam.gz", os.path.join( WORKDIR, "ex1.sam.gz" ) )
- shutil.copy( "ex1.sam", os.path.join( WORKDIR, "ex1.sam" ) )
-
- # cd to workdir
- savedir = os.getcwd()
- os.chdir( WORKDIR )
-
- for label in self.order:
- command = self.commands[label]
- samtools_target, samtools_command = command[0]
- try:
- pysam_target, pysam_command = command[1]
- except ValueError as msg:
- raise ValueError( "error while setting up %s=%s: %s" %\
- (label, command, msg) )
- runSamtools( " ".join( (SAMTOOLS, samtools_command )))
- pysam_method, pysam_options = pysam_command
- try:
- output = pysam_method( *pysam_options.split(" "), raw=True)
- except pysam.SamtoolsError as msg:
- raise pysam.SamtoolsError( "error while executing %s: options=%s: msg=%s" %\
- (label, pysam_options, msg) )
- if ">" in samtools_command:
- outfile = open( pysam_target, "wb" )
- if sys.version_info[0] < 3:
- for line in output: outfile.write( line )
- else:
- for line in output: outfile.write(line.encode('ascii'))
- outfile.close()
-
- os.chdir( savedir )
- BinaryTest.first_time = False
-
-
-
- samtools_version = getSamtoolsVersion()
-
-
- def _r( s ):
- # patch - remove any of the alpha/beta suffixes, i.e., 0.1.12a -> 0.1.12
- if s.count('-') > 0: s = s[0:s.find('-')]
- return re.sub( "[^0-9.]", "", s )
-
- if _r(samtools_version) != _r( pysam.__samtools_version__):
- raise ValueError("versions of pysam/samtools and samtools differ: %s != %s" % \
- (pysam.__samtools_version__,
- samtools_version ))
-
- def checkCommand( self, command ):
- if command:
- samtools_target, pysam_target = self.commands[command][0][0], self.commands[command][1][0]
- samtools_target = os.path.join( WORKDIR, samtools_target )
- pysam_target = os.path.join( WORKDIR, pysam_target )
- self.assertTrue( checkBinaryEqual( samtools_target, pysam_target ),
- "%s failed: files %s and %s are not the same" % (command, samtools_target, pysam_target) )
-
- def testImport( self ):
- self.checkCommand( "import" )
-
- def testIndex( self ):
- self.checkCommand( "index" )
-
- def testSort( self ):
- self.checkCommand( "sort" )
-
- def testMpileup( self ):
- self.checkCommand( "mpileup" )
-
- def testDepth( self ):
- self.checkCommand( "depth" )
-
- def testIdxstats( self ):
- self.checkCommand( "idxstats" )
-
- def testFixmate( self ):
- self.checkCommand( "fixmate" )
-
- def testFlagstat( self ):
- self.checkCommand( "flagstat" )
-
- def testMerge( self ):
- self.checkCommand( "merge" )
-
- def testRmdup( self ):
- self.checkCommand( "rmdup" )
-
- def testReheader( self ):
- self.checkCommand( "reheader" )
-
- def testCat( self ):
- self.checkCommand( "cat" )
-
- def testTargetcut( self ):
- self.checkCommand( "targetcut" )
-
- def testPhase( self ):
- self.checkCommand( "phase" )
-
- def testBam2fq( self ):
- self.checkCommand( "bam2fq" )
-
- # def testPileup1( self ):
- # self.checkCommand( "pileup1" )
-
- # def testPileup2( self ):
- # self.checkCommand( "pileup2" )
-
- # deprecated
- # def testGLFView( self ):
- # self.checkCommand( "glfview" )
-
- def testView( self ):
- self.checkCommand( "view" )
-
- def testEmptyIndex( self ):
- self.assertRaises( IOError, pysam.index, "exdoesntexist.bam" )
-
- def __del__(self):
- if os.path.exists( WORKDIR ):
- shutil.rmtree( WORKDIR )
-
-class IOTest(unittest.TestCase):
- '''check if reading samfile and writing a samfile are consistent.'''
-
- def checkEcho( self, input_filename, reference_filename,
- output_filename,
- input_mode, output_mode, use_template = True ):
- '''iterate through *input_filename* writing to *output_filename* and
- comparing the output to *reference_filename*.
-
- The files are opened according to the *input_mode* and *output_mode*.
-
- If *use_template* is set, the header is copied from infile using the
- template mechanism, otherwise target names and lengths are passed
- explicitly.
-
- '''
-
- infile = pysam.Samfile( input_filename, input_mode )
- if use_template:
- outfile = pysam.Samfile( output_filename, output_mode, template = infile )
- else:
- outfile = pysam.Samfile( output_filename, output_mode,
- referencenames = infile.references,
- referencelengths = infile.lengths,
- add_sq_text = False )
-
- iter = infile.fetch()
- for x in iter: outfile.write( x )
- infile.close()
- outfile.close()
-
- self.assertTrue( checkBinaryEqual( reference_filename, output_filename),
- "files %s and %s are not the same" % (reference_filename, output_filename) )
-
-
- def testReadWriteBam( self ):
-
- input_filename = "ex1.bam"
- output_filename = "pysam_ex1.bam"
- reference_filename = "ex1.bam"
-
- self.checkEcho( input_filename, reference_filename, output_filename,
- "rb", "wb" )
-
-
- def testReadWriteBamWithTargetNames( self ):
-
- input_filename = "ex1.bam"
- output_filename = "pysam_ex1.bam"
- reference_filename = "ex1.bam"
-
- self.checkEcho( input_filename, reference_filename, output_filename,
- "rb", "wb", use_template = False )
-
- def testReadWriteSamWithHeader( self ):
-
- input_filename = "ex2.sam"
- output_filename = "pysam_ex2.sam"
- reference_filename = "ex2.sam"
-
- self.checkEcho( input_filename, reference_filename, output_filename,
- "r", "wh" )
-
- def testReadWriteSamWithoutHeader( self ):
-
- input_filename = "ex2.sam"
- output_filename = "pysam_ex2.sam"
- reference_filename = "ex1.sam"
-
- self.checkEcho( input_filename, reference_filename, output_filename,
- "r", "w" )
-
- def testReadSamWithoutHeaderWriteSamWithoutHeader( self ):
-
- input_filename = "ex1.sam"
- output_filename = "pysam_ex1.sam"
- reference_filename = "ex1.sam"
-
- # disabled - reading from a samfile without header
- # is not implemented.
-
- # self.checkEcho( input_filename, reference_filename, output_filename,
- # "r", "w" )
-
- def testFetchFromClosedFile( self ):
-
- samfile = pysam.Samfile( "ex1.bam", "rb" )
- samfile.close()
- self.assertRaises( ValueError, samfile.fetch, 'chr1', 100, 120)
-
- def testClosedFile( self ):
- '''test that access to a closed samfile raises ValueError.'''
-
- samfile = pysam.Samfile( "ex1.bam", "rb" )
- samfile.close()
- self.assertRaises( ValueError, samfile.fetch, 'chr1', 100, 120)
- self.assertRaises( ValueError, samfile.pileup, 'chr1', 100, 120)
- self.assertRaises( ValueError, samfile.getrname, 0 )
- self.assertRaises( ValueError, samfile.tell )
- self.assertRaises( ValueError, samfile.seek, 0 )
- self.assertRaises( ValueError, getattr, samfile, "nreferences" )
- self.assertRaises( ValueError, getattr, samfile, "references" )
- self.assertRaises( ValueError, getattr, samfile, "lengths" )
- self.assertRaises( ValueError, getattr, samfile, "text" )
- self.assertRaises( ValueError, getattr, samfile, "header" )
-
- # write on closed file
- self.assertEqual( 0, samfile.write(None) )
-
- def testAutoDetection( self ):
- '''test if autodetection works.'''
-
- samfile = pysam.Samfile( "ex3.sam" )
- self.assertRaises( ValueError, samfile.fetch, 'chr1' )
- samfile.close()
-
- samfile = pysam.Samfile( "ex3.bam" )
- samfile.fetch('chr1')
- samfile.close()
-
- def testReadingFromSamFileWithoutHeader( self ):
- '''read from samfile without header.
- '''
- samfile = pysam.Samfile( "ex7.sam" )
- self.assertRaises( NotImplementedError, samfile.__iter__ )
-
- def testReadingFromFileWithoutIndex( self ):
- '''read from bam file without index.'''
-
- assert not os.path.exists( "ex2.bam.bai" )
- samfile = pysam.Samfile( "ex2.bam", "rb" )
- self.assertRaises( ValueError, samfile.fetch )
- self.assertEqual( len(list( samfile.fetch(until_eof = True) )), 3270 )
-
- def testReadingUniversalFileMode( self ):
- '''read from samfile without header.
- '''
-
- input_filename = "ex2.sam"
- output_filename = "pysam_ex2.sam"
- reference_filename = "ex1.sam"
-
- self.checkEcho( input_filename, reference_filename, output_filename,
- "rU", "w" )
-
-class TestFloatTagBug( unittest.TestCase ):
- '''see issue 71'''
-
- def testFloatTagBug( self ):
- '''a float tag before another exposed a parsing bug in bam_aux_get - expected to fail
-
- This test is expected to fail until samtools is fixed.
- '''
- samfile = pysam.Samfile("tag_bug.bam")
- read = samfile.fetch(until_eof=True).next()
- self.assertTrue( ('XC',1) in read.tags )
- self.assertEqual(read.opt('XC'), 1)
-
-class TestTagParsing( unittest.TestCase ):
- '''tests checking the accuracy of tag setting and retrieval.'''
-
- def makeRead( self ):
- a = pysam.AlignedRead()
- a.qname = "read_12345"
- a.tid = 0
- a.seq="ACGT" * 3
- a.flag = 0
- a.rname = 0
- a.pos = 1
- a.mapq = 20
- a.cigar = ( (0,10), (2,1), (0,25) )
- a.mrnm = 0
- a.mpos=200
- a.isize = 0
- a.qual ="1234" * 3
- # todo: create tags
- return a
-
- def testNegativeIntegers( self ):
- x = -2
- aligned_read = self.makeRead()
- aligned_read.tags = [("XD", int(x) ) ]
- # print (aligned_read.tags)
-
- def testNegativeIntegers2( self ):
- x = -2
- r = self.makeRead()
- r.tags = [("XD", int(x) ) ]
- outfile = pysam.Samfile( "test.bam",
- "wb",
- referencenames = ("chr1",),
- referencelengths = (1000,) )
- outfile.write (r )
- outfile.close()
-
-
-class TestIteratorRow(unittest.TestCase):
-
- def setUp(self):
- self.samfile=pysam.Samfile( "ex1.bam","rb" )
-
- def checkRange( self, rnge ):
- '''compare results from iterator with those from samtools.'''
- ps = list(self.samfile.fetch(region=rnge))
- sa = list(pysam.view( "ex1.bam", rnge, raw = True) )
- self.assertEqual( len(ps), len(sa), "unequal number of results for range %s: %i != %i" % (rnge, len(ps), len(sa) ))
- # check if the same reads are returned and in the same order
- for line, (a, b) in enumerate( zip( ps, sa ) ):
- d = b.split("\t")
- self.assertEqual( a.qname, d[0], "line %i: read id mismatch: %s != %s" % (line, a.qname, d[0]) )
- self.assertEqual( a.pos, int(d[3])-1, "line %i: read position mismatch: %s != %s, \n%s\n%s\n" % \
- (line, a.pos, int(d[3])-1,
- str(a), str(d) ) )
- if sys.version_info[0] < 3:
- qual = d[10]
- else:
- qual = d[10].encode('ascii')
- self.assertEqual( a.qual, qual, "line %i: quality mismatch: %s != %s, \n%s\n%s\n" % \
- (line, a.qual, qual,
- str(a), str(d) ) )
-
- def testIteratePerContig(self):
- '''check random access per contig'''
- for contig in self.samfile.references:
- self.checkRange( contig )
-
- def testIterateRanges(self):
- '''check random access per range'''
- for contig, length in zip(self.samfile.references, self.samfile.lengths):
- for start in range( 1, length, 90):
- self.checkRange( "%s:%i-%i" % (contig, start, start + 90) ) # this includes empty ranges
-
- def tearDown(self):
- self.samfile.close()
-
-class TestIteratorRowAll(unittest.TestCase):
-
- def setUp(self):
- self.samfile=pysam.Samfile( "ex1.bam","rb" )
-
- def testIterate(self):
- '''compare results from iterator with those from samtools.'''
- ps = list(self.samfile.fetch())
- sa = list(pysam.view( "ex1.bam", raw = True) )
- self.assertEqual( len(ps), len(sa), "unequal number of results: %i != %i" % (len(ps), len(sa) ))
- # check if the same reads are returned
- for line, pair in enumerate( zip( ps, sa ) ):
- data = pair[1].split("\t")
- self.assertEqual( pair[0].qname, data[0], "read id mismatch in line %i: %s != %s" % (line, pair[0].qname, data[0]) )
-
- def tearDown(self):
- self.samfile.close()
-
-class TestIteratorColumn(unittest.TestCase):
- '''test iterator column against contents of ex3.bam.'''
-
- # note that samfile contains 1-based coordinates
- # 1D means deletion with respect to reference sequence
- #
- mCoverages = { 'chr1' : [ 0 ] * 20 + [1] * 36 + [0] * (100 - 20 -35 ),
- 'chr2' : [ 0 ] * 20 + [1] * 35 + [0] * (100 - 20 -35 ),
- }
-
- def setUp(self):
- self.samfile=pysam.Samfile( "ex4.bam","rb" )
-
- def checkRange( self, rnge ):
- '''compare results from iterator with those from samtools.'''
- # check if the same reads are returned and in the same order
- for column in self.samfile.pileup(region=rnge):
- thiscov = len(column.pileups)
- refcov = self.mCoverages[self.samfile.getrname(column.tid)][column.pos]
- self.assertEqual( thiscov, refcov, "wrong coverage at pos %s:%i %i should be %i" % (self.samfile.getrname(column.tid), column.pos, thiscov, refcov))
-
- def testIterateAll(self):
- '''check pileup over the whole file'''
- self.checkRange( None )
-
- def testIteratePerContig(self):
- '''check random access per contig'''
- for contig in self.samfile.references:
- self.checkRange( contig )
-
- def testIterateRanges(self):
- '''check random access per range'''
- for contig, length in zip(self.samfile.references, self.samfile.lengths):
- for start in range( 1, length, 90):
- self.checkRange( "%s:%i-%i" % (contig, start, start + 90) ) # this includes empty ranges
-
- def testInverse( self ):
- '''test the inverse, is point-wise pileup accurate.'''
- for contig, refseq in self.mCoverages.items():
- refcolumns = sum(refseq)
- for pos, refcov in enumerate( refseq ):
- columns = list(self.samfile.pileup( contig, pos, pos+1) )
- if refcov == 0:
- # if no read, no coverage
- self.assertEqual( len(columns), refcov, "wrong number of pileup columns returned for position %s:%i, %i should be %i" %(contig,pos,len(columns), refcov) )
- elif refcov == 1:
- # one read, all columns of the read are returned
- self.assertEqual( len(columns), refcolumns, "pileup incomplete at position %i: got %i, expected %i " %\
- (pos, len(columns), refcolumns))
-
-
-
- def tearDown(self):
- self.samfile.close()
-
-class TestAlignedReadFromBam(unittest.TestCase):
-
- def setUp(self):
- self.samfile=pysam.Samfile( "ex3.bam","rb" )
- self.reads=list(self.samfile.fetch())
-
- def testARqname(self):
- self.assertEqual( self.reads[0].qname, "read_28833_29006_6945", "read name mismatch in read 1: %s != %s" % (self.reads[0].qname, "read_28833_29006_6945") )
- self.assertEqual( self.reads[1].qname, "read_28701_28881_323b", "read name mismatch in read 2: %s != %s" % (self.reads[1].qname, "read_28701_28881_323b") )
-
- def testARflag(self):
- self.assertEqual( self.reads[0].flag, 99, "flag mismatch in read 1: %s != %s" % (self.reads[0].flag, 99) )
- self.assertEqual( self.reads[1].flag, 147, "flag mismatch in read 2: %s != %s" % (self.reads[1].flag, 147) )
-
- def testARrname(self):
- self.assertEqual( self.reads[0].rname, 0, "chromosome/target id mismatch in read 1: %s != %s" % (self.reads[0].rname, 0) )
- self.assertEqual( self.reads[1].rname, 1, "chromosome/target id mismatch in read 2: %s != %s" % (self.reads[1].rname, 1) )
-
- def testARpos(self):
- self.assertEqual( self.reads[0].pos, 33-1, "mapping position mismatch in read 1: %s != %s" % (self.reads[0].pos, 33-1) )
- self.assertEqual( self.reads[1].pos, 88-1, "mapping position mismatch in read 2: %s != %s" % (self.reads[1].pos, 88-1) )
-
- def testARmapq(self):
- self.assertEqual( self.reads[0].mapq, 20, "mapping quality mismatch in read 1: %s != %s" % (self.reads[0].mapq, 20) )
- self.assertEqual( self.reads[1].mapq, 30, "mapping quality mismatch in read 2: %s != %s" % (self.reads[1].mapq, 30) )
-
- def testARcigar(self):
- self.assertEqual( self.reads[0].cigar, [(0, 10), (2, 1), (0, 25)], "cigar mismatch in read 1: %s != %s" % (self.reads[0].cigar, [(0, 10), (2, 1), (0, 25)]) )
- self.assertEqual( self.reads[1].cigar, [(0, 35)], "cigar mismatch in read 2: %s != %s" % (self.reads[1].cigar, [(0, 35)]) )
-
- def testARmrnm(self):
- self.assertEqual( self.reads[0].mrnm, 0, "mate reference sequence name mismatch in read 1: %s != %s" % (self.reads[0].mrnm, 0) )
- self.assertEqual( self.reads[1].mrnm, 1, "mate reference sequence name mismatch in read 2: %s != %s" % (self.reads[1].mrnm, 1) )
- self.assertEqual( self.reads[0].rnext, 0, "mate reference sequence name mismatch in read 1: %s != %s" % (self.reads[0].rnext, 0) )
- self.assertEqual( self.reads[1].rnext, 1, "mate reference sequence name mismatch in read 2: %s != %s" % (self.reads[1].rnext, 1) )
-
- def testARmpos(self):
- self.assertEqual( self.reads[0].mpos, 200-1, "mate mapping position mismatch in read 1: %s != %s" % (self.reads[0].mpos, 200-1) )
- self.assertEqual( self.reads[1].mpos, 500-1, "mate mapping position mismatch in read 2: %s != %s" % (self.reads[1].mpos, 500-1) )
- self.assertEqual( self.reads[0].pnext, 200-1, "mate mapping position mismatch in read 1: %s != %s" % (self.reads[0].pnext, 200-1) )
- self.assertEqual( self.reads[1].pnext, 500-1, "mate mapping position mismatch in read 2: %s != %s" % (self.reads[1].pnext, 500-1) )
-
- def testARisize(self):
- self.assertEqual( self.reads[0].isize, 167, "insert size mismatch in read 1: %s != %s" % (self.reads[0].isize, 167) )
- self.assertEqual( self.reads[1].isize, 412, "insert size mismatch in read 2: %s != %s" % (self.reads[1].isize, 412) )
- self.assertEqual( self.reads[0].tlen, 167, "insert size mismatch in read 1: %s != %s" % (self.reads[0].tlen, 167) )
- self.assertEqual( self.reads[1].tlen, 412, "insert size mismatch in read 2: %s != %s" % (self.reads[1].tlen, 412) )
-
- def testARseq(self):
- self.assertEqual( self.reads[0].seq, b"AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG", "sequence mismatch in read 1: %s != %s" % (self.reads[0].seq, b"AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG") )
- self.assertEqual( self.reads[1].seq, b"ACCTATATCTTGGCCTTGGCCGATGCGGCCTTGCA", "sequence mismatch in read 2: %s != %s" % (self.reads[1].seq, b"ACCTATATCTTGGCCTTGGCCGATGCGGCCTTGCA") )
- self.assertEqual( self.reads[3].seq, b"AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG", "sequence mismatch in read 4: %s != %s" % (self.reads[3].seq, b"AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG") )
-
- def testARqual(self):
- self.assertEqual( self.reads[0].qual, b"<<<<<<<<<<<<<<<<<<<<<:<9/,&,22;;<<<", "quality string mismatch in read 1: %s != %s" % (self.reads[0].qual, b"<<<<<<<<<<<<<<<<<<<<<:<9/,&,22;;<<<") )
- self.assertEqual( self.reads[1].qual, b"<<<<<;<<<<7;:<<<6;<<<<<<<<<<<<7<<<<", "quality string mismatch in read 2: %s != %s" % (self.reads[1].qual, b"<<<<<;<<<<7;:<<<6;<<<<<<<<<<<<7<<<<") )
- self.assertEqual( self.reads[3].qual, b"<<<<<<<<<<<<<<<<<<<<<:<9/,&,22;;<<<", "quality string mismatch in read 4: %s != %s" % (self.reads[3].qual, b"<<<<<<<<<<<<<<<<<<<<<:<9/,&,22;;<<<") )
-
- def testARquery(self):
- self.assertEqual( self.reads[0].query, b"AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG", "query mismatch in read 1: %s != %s" % (self.reads[0].query, b"AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG") )
- self.assertEqual( self.reads[1].query, b"ACCTATATCTTGGCCTTGGCCGATGCGGCCTTGCA", "query mismatch in read 2: %s != %s" % (self.reads[1].query, b"ACCTATATCTTGGCCTTGGCCGATGCGGCCTTGCA") )
- self.assertEqual( self.reads[3].query, b"TAGCTAGCTACCTATATCTTGGTCTT", "query mismatch in read 4: %s != %s" % (self.reads[3].query, b"TAGCTAGCTACCTATATCTTGGTCTT") )
-
- def testARqqual(self):
- self.assertEqual( self.reads[0].qqual, b"<<<<<<<<<<<<<<<<<<<<<:<9/,&,22;;<<<", "qquality string mismatch in read 1: %s != %s" % (self.reads[0].qqual, b"<<<<<<<<<<<<<<<<<<<<<:<9/,&,22;;<<<") )
- self.assertEqual( self.reads[1].qqual, b"<<<<<;<<<<7;:<<<6;<<<<<<<<<<<<7<<<<", "qquality string mismatch in read 2: %s != %s" % (self.reads[1].qqual, b"<<<<<;<<<<7;:<<<6;<<<<<<<<<<<<7<<<<") )
- self.assertEqual( self.reads[3].qqual, b"<<<<<<<<<<<<<<<<<:<9/,&,22", "qquality string mismatch in read 4: %s != %s" % (self.reads[3].qqual, b"<<<<<<<<<<<<<<<<<:<9/,&,22") )
-
- def testPresentOptionalFields(self):
- self.assertEqual( self.reads[0].opt('NM'), 1, "optional field mismatch in read 1, NM: %s != %s" % (self.reads[0].opt('NM'), 1) )
- self.assertEqual( self.reads[0].opt('RG'), 'L1', "optional field mismatch in read 1, RG: %s != %s" % (self.reads[0].opt('RG'), 'L1') )
- self.assertEqual( self.reads[1].opt('RG'), 'L2', "optional field mismatch in read 2, RG: %s != %s" % (self.reads[1].opt('RG'), 'L2') )
- self.assertEqual( self.reads[1].opt('MF'), 18, "optional field mismatch in read 2, MF: %s != %s" % (self.reads[1].opt('MF'), 18) )
-
- def testPairedBools(self):
- self.assertEqual( self.reads[0].is_paired, True, "is paired mismatch in read 1: %s != %s" % (self.reads[0].is_paired, True) )
- self.assertEqual( self.reads[1].is_paired, True, "is paired mismatch in read 2: %s != %s" % (self.reads[1].is_paired, True) )
- self.assertEqual( self.reads[0].is_proper_pair, True, "is proper pair mismatch in read 1: %s != %s" % (self.reads[0].is_proper_pair, True) )
- self.assertEqual( self.reads[1].is_proper_pair, True, "is proper pair mismatch in read 2: %s != %s" % (self.reads[1].is_proper_pair, True) )
-
- def testTags( self ):
- self.assertEqual( self.reads[0].tags,
- [('NM', 1), ('RG', 'L1'),
- ('PG', 'P1'), ('XT', 'U')] )
- self.assertEqual( self.reads[1].tags,
- [('MF', 18), ('RG', 'L2'),
- ('PG', 'P2'),('XT', 'R') ] )
-
- def testOpt( self ):
- self.assertEqual( self.reads[0].opt("XT"), "U" )
- self.assertEqual( self.reads[1].opt("XT"), "R" )
-
- def testMissingOpt( self ):
- self.assertRaises( KeyError, self.reads[0].opt, "XP" )
-
- def testEmptyOpt( self ):
- self.assertRaises( KeyError, self.reads[2].opt, "XT" )
-
- def tearDown(self):
- self.samfile.close()
-
-class TestAlignedReadFromSam(TestAlignedReadFromBam):
-
- def setUp(self):
- self.samfile=pysam.Samfile( "ex3.sam","r" )
- self.reads=list(self.samfile.fetch())
-
-# needs to be implemented
-# class TestAlignedReadFromSamWithoutHeader(TestAlignedReadFromBam):
-#
-# def setUp(self):
-# self.samfile=pysam.Samfile( "ex7.sam","r" )
-# self.reads=list(self.samfile.fetch())
-
-class TestHeaderSam(unittest.TestCase):
-
- header = {'SQ': [{'LN': 1575, 'SN': 'chr1'},
- {'LN': 1584, 'SN': 'chr2'}],
- 'RG': [{'LB': 'SC_1', 'ID': 'L1', 'SM': 'NA12891', 'PU': 'SC_1_10', "CN":"name:with:colon"},
- {'LB': 'SC_2', 'ID': 'L2', 'SM': 'NA12891', 'PU': 'SC_2_12', "CN":"name:with:colon"}],
- 'PG': [{'ID': 'P1', 'VN': '1.0'}, {'ID': 'P2', 'VN': '1.1'}],
- 'HD': {'VN': '1.0'},
- 'CO' : [ 'this is a comment', 'this is another comment'],
- }
-
- def compareHeaders( self, a, b ):
- '''compare two headers a and b.'''
- for ak,av in a.items():
- self.assertTrue( ak in b, "key '%s' not in '%s' " % (ak,b) )
- self.assertEqual( av, b[ak] )
-
- def setUp(self):
- self.samfile=pysam.Samfile( "ex3.sam","r" )
-
- def testHeaders(self):
- self.compareHeaders( self.header, self.samfile.header )
- self.compareHeaders( self.samfile.header, self.header )
-
- def testNameMapping( self ):
- for x, y in enumerate( ("chr1", "chr2")):
- tid = self.samfile.gettid( y )
- ref = self.samfile.getrname( x )
- self.assertEqual( tid, x )
- self.assertEqual( ref, y )
-
- self.assertEqual( self.samfile.gettid("chr?"), -1 )
- self.assertRaises( ValueError, self.samfile.getrname, 2 )
-
- def tearDown(self):
- self.samfile.close()
-
-class TestHeaderBam(TestHeaderSam):
-
- def setUp(self):
- self.samfile=pysam.Samfile( "ex3.bam","rb" )
-
-class TestUnmappedReads(unittest.TestCase):
-
- def testSAM(self):
- samfile=pysam.Samfile( "ex5.sam","r" )
- self.assertEqual( len(list(samfile.fetch( until_eof = True))), 2 )
- samfile.close()
-
- def testBAM(self):
- samfile=pysam.Samfile( "ex5.bam","rb" )
- self.assertEqual( len(list(samfile.fetch( until_eof = True))), 2 )
- samfile.close()
-
-class TestPileupObjects(unittest.TestCase):
-
- def setUp(self):
- self.samfile=pysam.Samfile( "ex1.bam","rb" )
-
- def testPileupColumn(self):
- for pcolumn1 in self.samfile.pileup( region="chr1:105" ):
- if pcolumn1.pos == 104:
- self.assertEqual( pcolumn1.tid, 0, "chromosome/target id mismatch in position 1: %s != %s" % (pcolumn1.tid, 0) )
- self.assertEqual( pcolumn1.pos, 105-1, "position mismatch in position 1: %s != %s" % (pcolumn1.pos, 105-1) )
- self.assertEqual( pcolumn1.n, 2, "# reads mismatch in position 1: %s != %s" % (pcolumn1.n, 2) )
- for pcolumn2 in self.samfile.pileup( region="chr2:1480" ):
- if pcolumn2.pos == 1479:
- self.assertEqual( pcolumn2.tid, 1, "chromosome/target id mismatch in position 2: %s != %s" % (pcolumn2.tid, 1) )
- self.assertEqual( pcolumn2.pos, 1480-1, "position mismatch in position 2: %s != %s" % (pcolumn2.pos, 1480-1) )
- self.assertEqual( pcolumn2.n, 12, "# reads mismatch in position 2: %s != %s" % (pcolumn2.n, 12) )
-
- def testPileupRead(self):
- for pcolumn1 in self.samfile.pileup( region="chr1:105" ):
- if pcolumn1.pos == 104:
- self.assertEqual( len(pcolumn1.pileups), 2, "# reads aligned to column mismatch in position 1: %s != %s" % (len(pcolumn1.pileups), 2) )
-# self.assertEqual( pcolumn1.pileups[0] # need to test additional properties here
-
- def tearDown(self):
- self.samfile.close()
-
-class TestContextManager(unittest.TestCase):
-
- def testManager( self ):
- with pysam.Samfile('ex1.bam', 'rb') as samfile:
- samfile.fetch()
- self.assertEqual( samfile._isOpen(), False )
-
-class TestExceptions(unittest.TestCase):
-
- def setUp(self):
- self.samfile=pysam.Samfile( "ex1.bam","rb" )
-
- def testMissingFile(self):
-
- self.assertRaises( IOError, pysam.Samfile, "exdoesntexist.bam", "rb" )
- self.assertRaises( IOError, pysam.Samfile, "exdoesntexist.sam", "r" )
- self.assertRaises( IOError, pysam.Samfile, "exdoesntexist.bam", "r" )
- self.assertRaises( IOError, pysam.Samfile, "exdoesntexist.sam", "rb" )
-
- def testBadContig(self):
- self.assertRaises( ValueError, self.samfile.fetch, "chr88" )
-
- def testMeaninglessCrap(self):
- self.assertRaises( ValueError, self.samfile.fetch, "skljf" )
-
- def testBackwardsOrderNewFormat(self):
- self.assertRaises( ValueError, self.samfile.fetch, 'chr1', 100, 10 )
-
- def testBackwardsOrderOldFormat(self):
- self.assertRaises( ValueError, self.samfile.fetch, region="chr1:100-10")
-
- def testOutOfRangeNegativeNewFormat(self):
- self.assertRaises( ValueError, self.samfile.fetch, "chr1", 5, -10 )
- self.assertRaises( ValueError, self.samfile.fetch, "chr1", 5, 0 )
- self.assertRaises( ValueError, self.samfile.fetch, "chr1", -5, -10 )
-
- self.assertRaises( ValueError, self.samfile.count, "chr1", 5, -10 )
- self.assertRaises( ValueError, self.samfile.count, "chr1", 5, 0 )
- self.assertRaises( ValueError, self.samfile.count, "chr1", -5, -10 )
-
- def testOutOfRangeNegativeOldFormat(self):
- self.assertRaises( ValueError, self.samfile.fetch, region="chr1:-5-10" )
- self.assertRaises( ValueError, self.samfile.fetch, region="chr1:-5-0" )
- self.assertRaises( ValueError, self.samfile.fetch, region="chr1:-5--10" )
-
- self.assertRaises( ValueError, self.samfile.count, region="chr1:-5-10" )
- self.assertRaises( ValueError, self.samfile.count, region="chr1:-5-0" )
- self.assertRaises( ValueError, self.samfile.count, region="chr1:-5--10" )
-
- def testOutOfRangNewFormat(self):
- self.assertRaises( ValueError, self.samfile.fetch, "chr1", 9999999999, 99999999999 )
- self.assertRaises( ValueError, self.samfile.count, "chr1", 9999999999, 99999999999 )
-
- def testOutOfRangeLargeNewFormat(self):
- self.assertRaises( ValueError, self.samfile.fetch, "chr1", 9999999999999999999999999999999, 9999999999999999999999999999999999999999 )
- self.assertRaises( ValueError, self.samfile.count, "chr1", 9999999999999999999999999999999, 9999999999999999999999999999999999999999 )
-
- def testOutOfRangeLargeOldFormat(self):
- self.assertRaises( ValueError, self.samfile.fetch, "chr1:99999999999999999-999999999999999999" )
- self.assertRaises( ValueError, self.samfile.count, "chr1:99999999999999999-999999999999999999" )
-
- def testZeroToZero(self):
- '''see issue 44'''
- self.assertEqual( len(list(self.samfile.fetch('chr1', 0, 0))), 0)
-
- def tearDown(self):
- self.samfile.close()
-
-class TestWrongFormat(unittest.TestCase):
- '''test cases for opening files not in bam/sam format.'''
-
- def testOpenSamAsBam( self ):
- self.assertRaises( ValueError, pysam.Samfile, 'ex1.sam', 'rb' )
-
- def testOpenBamAsSam( self ):
- # test fails, needs to be implemented.
- # sam.fetch() fails on reading, not on opening
- # self.assertRaises( ValueError, pysam.Samfile, 'ex1.bam', 'r' )
- pass
-
- def testOpenFastaAsSam( self ):
- # test fails, needs to be implemented.
- # sam.fetch() fails on reading, not on opening
- # self.assertRaises( ValueError, pysam.Samfile, 'ex1.fa', 'r' )
- pass
-
- def testOpenFastaAsBam( self ):
- self.assertRaises( ValueError, pysam.Samfile, 'ex1.fa', 'rb' )
-
-class TestFastaFile(unittest.TestCase):
-
- mSequences = { 'chr1' :
- b"CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCGTCCATGGCCCAGCATTAGGGAGCTGTGGACCCTGCAGCCTGGCTGTGGGGGCCGCAGTGGCTGAGGGGTGCAGAGCCGAGTCACGGGGTTGCCAGCACAGGGGCTTAACCTCTGGTGACTGCCAGAGCTGCTGGCAAGCTAGAGTCCCATTTGGAGCCCCTCTAAGCCGTTCTATTTGTAATGAAAACTATATTTATGCTATTCAGTTCTAAATATAGAAATTGAAACAGCTGTGTTTAGTGCCTTTGTTCAACCCCCTTGCAACAACCTTGAGAACCCCAGGGAATTTGTCAATGTCAGGGAAGGAGCATTTTGTCAGTTACCAAATGTGTTTATTACCAGAGGGATGGAGGGAAGAGGGACGCTGAAGAACTTTGATGCCCTCTTCTTCCAAAGATGAAACGCGTAACTGCGCTCTCATTCACTCCAGCTCCCTGTCACCCAATGGACCTGTGATATCTGGATTCTGGGAAATTCTTCATCCTGGACCCTGAGAGATTCTGCAGCCCAGCTCCAGATTGCTTGTGGTCTGACAGGCTGCAACTGTGAGCCATCACAATGAACAACAGGAAGAAAAGGTCTTTCAAAAGGTGATGTGTGTTCTCATCAACCTCATACACACACATGGTTTAGGGGTATAATACCTCTACATGGCTGATTATGAAAACAATGTTCCCCAGATACCATCCCTGTCTTACTTCCAGCTCCCCAGAGGGAAAGCTTTCAACGCTTCTAGCCATTTCTTTTGGCATTTGCCTTCAGACCCTACACGAATGCGTCTCTACCACAGGGGGCTGCGCGGTTTCCCATCATGAAGCACTGAACTTCCACGTCTCATCTAGGGGAACAGGGAGGTGCACTAATGCGCTCCACGCCCAAGCCCTTCTCACAGTTTCTGCCCCCAGCATGGTTGTACTGGGCAATACATGAGATTATTAGGAAATGCTTTACTGTCATAACTATGAAGAGACTATTGCCAGATGAACCACACATTAATACTATGTTTCTTATCTGCACATTACTACCCTGCAATTAATATAATTGTGTCCATGTACACACGCTGTCCTATGTACTTATCATGACTCTATCCCAAATTCCCAATTACGTCCTATCTTCTTCTTAGGGAAGAACAGCTTAGGTATCAATTTGGTGTTCTGTGTAAAGTCTCAGGGAGCCGTCCGTGTCCTCCCATCTGGCCTCGTCCACACTGGTTCTCTTGAAAGCTTGGGCTGTAATGATGCCCCTTGGCCATCACCCAGTCCCTGCCCCATCTCTTGTAATCTCTCTCCTTTTTGCTGCATCCCTGTCTTCCTCTGTCTTGATTTACTTGTTGTTGGTTTTCTGTTTCTTTGTTTGATTTGGTGGAAGACATAATCCCACGCTTCCTATGGAAAGGTTGTTGGGAGATTTTTAATGATTCCTCAATGTTAAAATGTCTATTTTTGTCTTGACACCCAACTAATATTTGTCTGAGCAAAACAGTCTAGATGAGAGAGAACTTCCCTGGAGGTCTGATGGCGTTTCTCCCTCGTCTTCTTA",
- 'chr2' :
- b"TTCAAATGAACTTCTGTAATTGAAAAATTCATTTAAGAAATTACAAAATATAGTTGAAAGCTCTAACAATAGACTAAACCAAGCAGAAGAAAGAGGTTCAGAACTTGAAGACAAGTCTCTTATGAATTAACCCAGTCAGACAAAAATAAAGAAAAAAATTTTAAAAATGAACAGAGCTTTCAAGAAGTATGAGATTATGTAAAGTAACTGAACCTATGAGTCACAGGTATTCCTGAGGAAAAAGAAAAAGTGAGAAGTTTGGAAAAACTATTTGAGGAAGTAATTGGGGAAAACCTCTTTAGTCTTGCTAGAGATTTAGACATCTAAATGAAAGAGGCTCAAAGAATGCCAGGAAGATACATTGCAAGACAGACTTCATCAAGATATGTAGTCATCAGACTATCTAAAGTCAACATGAAGGAAAAAAATTCTAAAATCAGCAAGAGAAAAGCATACAGTCATCTATAAAGGAAATCCCATCAGAATAACAATGGGCTTCTCAGCAGAAACCTTACAAGCCAGAAGAGATTGGATCTAATTTTTGGACTTCTTAAAGAAAAAAAAACCTGTCAAACACGAATGTTATGCCCTGCTAAACTAAGCATCATAAATGAAGGGGAAATAAAGTCAAGTCTTTCCTGACAAGCAAATGCTAAGATAATTCATCATCACTAAACCAGTCCTATAAGAAATGCTCAAAAGAATTGTAAAAGTCAAAATTAAAGTTCAATACTCACCATCATAAATACACACAAAAGTACAAAACTCACAGGTTTTATAAAACAATTGAGACTACAGAGCAACTAGGTAAAAAATTAACATTACAACAGGAACAAAACCTCATATATCAATATTAACTTTGAATAAAAAGGGATTAAATTCCCCCACTTAAGAGATATAGATTGGCAGAACAGATTTAAAAACATGAACTAACTATATGCTGTTTACAAGAAACTCATTAATAAAGACATGAGTTCAGGTAAAGGGGTGGAAAAAGATGTTCTACGCAAACAGAAACCAAATGAGAGAAGGAGTAGCTATACTTATATCAGATAAAGCACACTTTAAATCAACAACAGTAAAATAAAACAAAGGAGGTCATCATACAATGATAAAAAGATCAATTCAGCAAGAAGATATAACCATCCTACTAAATACATATGCACCTAACACAAGACTACCCAGATTCATAAAACAAATACTACTAGACCTAAGAGGGATGAGAAATTACCTAATTGGTACAATGTACAATATTCTGATGATGGTTACACTAAAAGCCCATACTTTACTGCTACTCAATATATCCATGTAACAAATCTGCGCTTGTACTTCTAAATCTATAAAAAAATTAAAATTTAACAAAAGTAAATAAAACACATAGCTAAAACTAAAAAAGCAAAAACAAAAACTATGCTAAGTATTGGTAAAGATGTGGGGAAAAAAGTAAACTCTCAAATATTGCTAGTGGGAGTATAAATTGTTTTCCACTTTGGAAAACAATTTGGTAATTTCGTTTTTTTTTTTTTCTTTTCTCTTTTTTTTTTTTTTTTTTTTGCATGCCAGAAAAAAATATTTACAGTAACT",
- }
-
- def setUp(self):
- self.file=pysam.Fastafile( "ex1.fa" )
-
- def testFetch(self):
- for id, seq in self.mSequences.items():
- self.assertEqual( seq, self.file.fetch( id ) )
- for x in range( 0, len(seq), 10):
- self.assertEqual( seq[x:x+10], self.file.fetch( id, x, x+10) )
- # test x:end
- self.assertEqual( seq[x:], self.file.fetch( id, x) )
- # test 0:x
- self.assertEqual( seq[:x], self.file.fetch( id, None, x) )
-
-
- # unknown sequence returns ""
- self.assertEqual( b"", self.file.fetch("chr12") )
-
- def testOutOfRangeAccess( self ):
- '''test out of range access.'''
- # out of range access returns an empty string
- for contig, s in self.mSequences.items():
- self.assertEqual( self.file.fetch( contig, len(s), len(s)+1), b"" )
-
- self.assertEqual( self.file.fetch( "chr3", 0 , 100), b"" )
-
- def testFetchErrors( self ):
- self.assertRaises( ValueError, self.file.fetch )
- self.assertRaises( ValueError, self.file.fetch, "chr1", -1, 10 )
- self.assertRaises( ValueError, self.file.fetch, "chr1", 20, 10 )
-
- def testLength( self ):
- self.assertEqual( len(self.file), 2 )
-
- def tearDown(self):
- self.file.close()
-
-class TestAlignedRead(unittest.TestCase):
- '''tests to check if aligned read can be constructed
- and manipulated.
- '''
-
- def checkFieldEqual( self, read1, read2, exclude = []):
- '''check if two reads are equal by comparing each field.'''
-
- for x in ("qname", "seq", "flag",
- "rname", "pos", "mapq", "cigar",
- "mrnm", "mpos", "isize", "qual",
- "is_paired", "is_proper_pair",
- "is_unmapped", "mate_is_unmapped",
- "is_reverse", "mate_is_reverse",
- "is_read1", "is_read2",
- "is_secondary", "is_qcfail",
- "is_duplicate", "bin"):
- if x in exclude: continue
- self.assertEqual( getattr(read1, x), getattr(read2,x), "attribute mismatch for %s: %s != %s" %
- (x, getattr(read1, x), getattr(read2,x)))
-
- def testEmpty( self ):
- a = pysam.AlignedRead()
- self.assertEqual( a.qname, None )
- self.assertEqual( a.seq, None )
- self.assertEqual( a.qual, None )
- self.assertEqual( a.flag, 0 )
- self.assertEqual( a.rname, 0 )
- self.assertEqual( a.mapq, 0 )
- self.assertEqual( a.cigar, None )
- self.assertEqual( a.tags, [] )
- self.assertEqual( a.mrnm, 0 )
- self.assertEqual( a.mpos, 0 )
- self.assertEqual( a.isize, 0 )
-
- def buildRead( self ):
- '''build an example read.'''
-
- a = pysam.AlignedRead()
- a.qname = "read_12345"
- a.seq="ACGT" * 10
- a.flag = 0
- a.rname = 0
- a.pos = 20
- a.mapq = 20
- a.cigar = ( (0,10), (2,1), (0,9), (1,1), (0,20) )
- a.mrnm = 0
- a.mpos=200
- a.isize=167
- a.qual="1234" * 10
- # todo: create tags
- return a
-
- def testUpdate( self ):
- '''check if updating fields affects other variable length data
- '''
- a = self.buildRead()
- b = self.buildRead()
-
- # check qname
- b.qname = "read_123"
- self.checkFieldEqual( a, b, ("qname",) )
- b.qname = "read_12345678"
- self.checkFieldEqual( a, b, ("qname",) )
- b.qname = "read_12345"
- self.checkFieldEqual( a, b)
-
- # check cigar
- b.cigar = ( (0,10), )
- self.checkFieldEqual( a, b, ("cigar",) )
- b.cigar = ( (0,10), (2,1), (0,10) )
- self.checkFieldEqual( a, b, ("cigar",) )
- b.cigar = ( (0,10), (2,1), (0,9), (1,1), (0,20) )
- self.checkFieldEqual( a, b)
-
- # check seq
- b.seq = "ACGT"
- self.checkFieldEqual( a, b, ("seq", "qual") )
- b.seq = "ACGT" * 3
- self.checkFieldEqual( a, b, ("seq", "qual") )
- b.seq = "ACGT" * 10
- self.checkFieldEqual( a, b, ("qual",))
-
- # reset qual
- b = self.buildRead()
-
- # check flags:
- for x in (
- "is_paired", "is_proper_pair",
- "is_unmapped", "mate_is_unmapped",
- "is_reverse", "mate_is_reverse",
- "is_read1", "is_read2",
- "is_secondary", "is_qcfail",
- "is_duplicate"):
- setattr( b, x, True )
- self.assertEqual( getattr(b, x), True )
- self.checkFieldEqual( a, b, ("flag", x,) )
- setattr( b, x, False )
- self.assertEqual( getattr(b, x), False )
- self.checkFieldEqual( a, b )
-
- def testLargeRead( self ):
- '''build a read with a long sequence (800 bases).'''
-
- a = pysam.AlignedRead()
- a.qname = "read_12345"
- a.seq="ACGT" * 200
- a.flag = 0
- a.rname = 0
- a.pos = 20
- a.mapq = 20
- a.cigar = ( (0, 4 * 200), )
- a.mrnm = 0
- a.mpos=200
- a.isize=167
- a.qual="1234" * 200
-
- return a
-
- def testTagParsing( self ):
- '''test for tag parsing
-
- see http://groups.google.com/group/pysam-user-group/browse_thread/thread/67ca204059ea465a
- '''
- samfile=pysam.Samfile( "ex8.bam","rb" )
-
- for entry in samfile:
- before = entry.tags
- entry.tags = entry.tags
- after = entry.tags
- self.assertEqual( after, before )
-
- def testUpdateTlen( self ):
- '''check if updating tlen works'''
- a = self.buildRead()
- oldlen = a.tlen
- oldlen *= 2
- a.tlen = oldlen
- self.assertEqual( a.tlen, oldlen )
-
- def testPositions( self ):
- a = self.buildRead()
- self.assertEqual( a.positions,
- [20, 21, 22, 23, 24, 25, 26, 27, 28, 29,
- 31, 32, 33, 34, 35, 36, 37, 38, 39,
- 40, 41, 42, 43, 44, 45, 46, 47, 48, 49,
- 50, 51, 52, 53, 54, 55, 56, 57, 58, 59] )
-
- self.assertEqual( a.aligned_pairs,
- [(0, 20), (1, 21), (2, 22), (3, 23), (4, 24),
- (5, 25), (6, 26), (7, 27), (8, 28), (9, 29),
- (None, 30),
- (10, 31), (11, 32), (12, 33), (13, 34), (14, 35),
- (15, 36), (16, 37), (17, 38), (18, 39), (19, None),
- (20, 40), (21, 41), (22, 42), (23, 43), (24, 44),
- (25, 45), (26, 46), (27, 47), (28, 48), (29, 49),
- (30, 50), (31, 51), (32, 52), (33, 53), (34, 54),
- (35, 55), (36, 56), (37, 57), (38, 58), (39, 59)] )
-
- self.assertEqual( a.positions, [x[1] for x in a.aligned_pairs if x[0] != None and x[1] != None] )
- # alen is the length of the aligned read in genome
- self.assertEqual( a.alen, a.aligned_pairs[-1][0] + 1 )
- # aend points to one beyond last aligned base in ref
- self.assertEqual( a.positions[-1], a.aend - 1 )
-
-class TestDeNovoConstruction(unittest.TestCase):
- '''check BAM/SAM file construction using ex3.sam
-
- (note these are 1-based coordinates):
-
- read_28833_29006_6945 99 chr1 33 20 10M1D25M = 200 167 AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG <<<<<<<<<<<<<<<<<<<<<:<9/,&,22;;<<< NM:i:1 RG:Z:L1
- read_28701_28881_323b 147 chr2 88 30 35M = 500 412 ACCTATATCTTGGCCTTGGCCGATGCGGCCTTGCA <<<<<;<<<<7;:<<<6;<<<<<<<<<<<<7<<<< MF:i:18 RG:Z:L2
- '''
-
- header = { 'HD': {'VN': '1.0'},
- 'SQ': [{'LN': 1575, 'SN': 'chr1'},
- {'LN': 1584, 'SN': 'chr2'}], }
-
- bamfile = "ex6.bam"
- samfile = "ex6.sam"
-
- def checkFieldEqual( self, read1, read2, exclude = []):
- '''check if two reads are equal by comparing each field.'''
-
- for x in ("qname", "seq", "flag",
- "rname", "pos", "mapq", "cigar",
- "mrnm", "mpos", "isize", "qual",
- "bin",
- "is_paired", "is_proper_pair",
- "is_unmapped", "mate_is_unmapped",
- "is_reverse", "mate_is_reverse",
- "is_read1", "is_read2",
- "is_secondary", "is_qcfail",
- "is_duplicate"):
- if x in exclude: continue
- self.assertEqual( getattr(read1, x), getattr(read2,x), "attribute mismatch for %s: %s != %s" %
- (x, getattr(read1, x), getattr(read2,x)))
-
- def setUp( self ):
-
-
- a = pysam.AlignedRead()
- a.qname = "read_28833_29006_6945"
- a.seq="AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG"
- a.flag = 99
- a.rname = 0
- a.pos = 32
- a.mapq = 20
- a.cigar = ( (0,10), (2,1), (0,25) )
- a.mrnm = 0
- a.mpos=199
- a.isize=167
- a.qual="<<<<<<<<<<<<<<<<<<<<<:<9/,&,22;;<<<"
- a.tags = ( ("NM", 1),
- ("RG", "L1") )
-
- b = pysam.AlignedRead()
- b.qname = "read_28701_28881_323b"
- b.seq="ACCTATATCTTGGCCTTGGCCGATGCGGCCTTGCA"
- b.flag = 147
- b.rname = 1
- b.pos = 87
- b.mapq = 30
- b.cigar = ( (0,35), )
- b.mrnm = 1
- b.mpos=499
- b.isize=412
- b.qual="<<<<<;<<<<7;:<<<6;<<<<<<<<<<<<7<<<<"
- b.tags = ( ("MF", 18),
- ("RG", "L2") )
-
- self.reads = (a,b)
-
- def testSAMWholeFile( self ):
-
- tmpfilename = "tmp_%i.sam" % id(self)
-
- outfile = pysam.Samfile( tmpfilename, "wh", header = self.header )
-
- for x in self.reads: outfile.write( x )
- outfile.close()
-
- self.assertTrue( checkBinaryEqual( tmpfilename, self.samfile ),
- "mismatch when constructing SAM file, see %s %s" % (tmpfilename, self.samfile))
-
- os.unlink( tmpfilename )
-
- def testBAMPerRead( self ):
- '''check if individual reads are binary equal.'''
- infile = pysam.Samfile( self.bamfile, "rb")
-
- others = list(infile)
- for denovo, other in zip( others, self.reads):
- self.checkFieldEqual( other, denovo )
- self.assertEqual( other.compare( denovo ), 0 )
-
- def testSAMPerRead( self ):
- '''check if individual reads are binary equal.'''
- infile = pysam.Samfile( self.samfile, "r")
-
- others = list(infile)
- for denovo, other in zip( others, self.reads):
- self.checkFieldEqual( other, denovo )
- self.assertEqual( other.compare( denovo), 0 )
-
- def testBAMWholeFile( self ):
-
- tmpfilename = "tmp_%i.bam" % id(self)
-
- outfile = pysam.Samfile( tmpfilename, "wb", header = self.header )
-
- for x in self.reads: outfile.write( x )
- outfile.close()
-
- self.assertTrue( checkBinaryEqual( tmpfilename, self.bamfile ),
- "mismatch when constructing BAM file, see %s %s" % (tmpfilename, self.bamfile))
-
- os.unlink( tmpfilename )
-
-
-class TestDoubleFetch(unittest.TestCase):
- '''check if two iterators on the same bamfile are independent.'''
-
- def testDoubleFetch( self ):
-
- samfile1 = pysam.Samfile('ex1.bam', 'rb')
-
- for a,b in zip(samfile1.fetch(), samfile1.fetch()):
- self.assertEqual( a.compare( b ), 0 )
-
- def testDoubleFetchWithRegion( self ):
-
- samfile1 = pysam.Samfile('ex1.bam', 'rb')
- chr, start, stop = 'chr1', 200, 3000000
- self.assertTrue(len(list(samfile1.fetch ( chr, start, stop))) > 0) #just making sure the test has something to catch
-
- for a,b in zip(samfile1.fetch( chr, start, stop), samfile1.fetch( chr, start, stop)):
- self.assertEqual( a.compare( b ), 0 )
-
- def testDoubleFetchUntilEOF( self ):
-
- samfile1 = pysam.Samfile('ex1.bam', 'rb')
-
- for a,b in zip(samfile1.fetch( until_eof = True),
- samfile1.fetch( until_eof = True )):
- self.assertEqual( a.compare( b), 0 )
-
-class TestRemoteFileFTP(unittest.TestCase):
- '''test remote access.
-
- '''
-
- # Need to find an ftp server without password on standard
- # port.
-
- url = "ftp://ftp.sanger.ac.uk/pub/rd/humanSequences/CV.bam"
- region = "1:1-1000"
-
- def testFTPView( self ):
- return
- result = pysam.view( self.url, self.region )
- self.assertEqual( len(result), 36 )
-
- def testFTPFetch( self ):
- return
- samfile = pysam.Samfile(self.url, "rb")
- result = list(samfile.fetch( region = self.region ))
- self.assertEqual( len(result), 36 )
-
-class TestRemoteFileHTTP( unittest.TestCase):
-
- url = "http://genserv.anat.ox.ac.uk/downloads/pysam/test/ex1.bam"
- region = "chr1:1-1000"
- local = "ex1.bam"
-
- def testView( self ):
- samfile_local = pysam.Samfile(self.local, "rb")
- ref = list(samfile_local.fetch( region = self.region ))
-
- result = pysam.view( self.url, self.region )
- self.assertEqual( len(result), len(ref) )
-
- def testFetch( self ):
- samfile = pysam.Samfile(self.url, "rb")
- result = list(samfile.fetch( region = self.region ))
- samfile_local = pysam.Samfile(self.local, "rb")
- ref = list(samfile_local.fetch( region = self.region ))
-
- self.assertEqual( len(ref), len(result) )
- for x, y in zip(result, ref):
- self.assertEqual( x.compare( y ), 0 )
-
- def testFetchAll( self ):
- samfile = pysam.Samfile(self.url, "rb")
- result = list(samfile.fetch())
- samfile_local = pysam.Samfile(self.local, "rb")
- ref = list(samfile_local.fetch() )
-
- self.assertEqual( len(ref), len(result) )
- for x, y in zip(result, ref):
- self.assertEqual( x.compare( y ), 0 )
-
-class TestLargeOptValues( unittest.TestCase ):
-
- ints = ( 65536, 214748, 2147484, 2147483647 )
- floats = ( 65536.0, 214748.0, 2147484.0 )
-
- def check( self, samfile ):
-
- i = samfile.fetch()
- for exp in self.ints:
- rr = i.next()
- obs = rr.opt("ZP")
- self.assertEqual( exp, obs, "expected %s, got %s\n%s" % (str(exp), str(obs), str(rr)))
-
- for exp in [ -x for x in self.ints ]:
- rr = i.next()
- obs = rr.opt("ZP")
- self.assertEqual( exp, obs, "expected %s, got %s\n%s" % (str(exp), str(obs), str(rr)))
-
- for exp in self.floats:
- rr = i.next()
- obs = rr.opt("ZP")
- self.assertEqual( exp, obs, "expected %s, got %s\n%s" % (str(exp), str(obs), str(rr)))
-
- for exp in [ -x for x in self.floats ]:
- rr = i.next()
- obs = rr.opt("ZP")
- self.assertEqual( exp, obs, "expected %s, got %s\n%s" % (str(exp), str(obs), str(rr)))
-
- def testSAM( self ):
- samfile = pysam.Samfile("ex10.sam", "r")
- self.check( samfile )
-
- def testBAM( self ):
- samfile = pysam.Samfile("ex10.bam", "rb")
- self.check( samfile )
-
-# class TestSNPCalls( unittest.TestCase ):
-# '''test pysam SNP calling ability.'''
-
-# def checkEqual( self, a, b ):
-# for x in ("reference_base",
-# "pos",
-# "genotype",
-# "consensus_quality",
-# "snp_quality",
-# "mapping_quality",
-# "coverage" ):
-# self.assertEqual( getattr(a, x), getattr(b,x), "%s mismatch: %s != %s\n%s\n%s" %
-# (x, getattr(a, x), getattr(b,x), str(a), str(b)))
-
-# def testAllPositionsViaIterator( self ):
-# samfile = pysam.Samfile( "ex1.bam", "rb")
-# fastafile = pysam.Fastafile( "ex1.fa" )
-# try:
-# refs = [ x for x in pysam.pileup( "-c", "-f", "ex1.fa", "ex1.bam" ) if x.reference_base != "*"]
-# except pysam.SamtoolsError:
-# pass
-
-# i = samfile.pileup( stepper = "samtools", fastafile = fastafile )
-# calls = list(pysam.IteratorSNPCalls(i))
-# for x,y in zip( refs, calls ):
-# self.checkEqual( x, y )
-
-# def testPerPositionViaIterator( self ):
-# # test pileup for each position. This is a slow operation
-# # so this test is disabled
-# return
-# samfile = pysam.Samfile( "ex1.bam", "rb")
-# fastafile = pysam.Fastafile( "ex1.fa" )
-# for x in pysam.pileup( "-c", "-f", "ex1.fa", "ex1.bam" ):
-# if x.reference_base == "*": continue
-# i = samfile.pileup( x.chromosome, x.pos, x.pos+1,
-# fastafile = fastafile,
-# stepper = "samtools" )
-# z = [ zz for zz in pysam.IteratorSamtools(i) if zz.pos == x.pos ]
-# self.assertEqual( len(z), 1 )
-# self.checkEqual( x, z[0] )
-
-# def testPerPositionViaCaller( self ):
-# # test pileup for each position. This is a fast operation
-# samfile = pysam.Samfile( "ex1.bam", "rb")
-# fastafile = pysam.Fastafile( "ex1.fa" )
-# i = samfile.pileup( stepper = "samtools", fastafile = fastafile )
-# caller = pysam.SNPCaller( i )
-
-# for x in pysam.pileup( "-c", "-f", "ex1.fa", "ex1.bam" ):
-# if x.reference_base == "*": continue
-# call = caller.call( x.chromosome, x.pos )
-# self.checkEqual( x, call )
-
-# class TestIndelCalls( unittest.TestCase ):
-# '''test pysam indel calling.'''
-
-# def checkEqual( self, a, b ):
-
-# for x in ("pos",
-# "genotype",
-# "consensus_quality",
-# "snp_quality",
-# "mapping_quality",
-# "coverage",
-# "first_allele",
-# "second_allele",
-# "reads_first",
-# "reads_second",
-# "reads_diff"):
-# if b.genotype == "*/*" and x == "second_allele":
-# # ignore test for second allele (positions chr2:439 and chr2:1512)
-# continue
-# self.assertEqual( getattr(a, x), getattr(b,x), "%s mismatch: %s != %s\n%s\n%s" %
-# (x, getattr(a, x), getattr(b,x), str(a), str(b)))
-
-# def testAllPositionsViaIterator( self ):
-
-# samfile = pysam.Samfile( "ex1.bam", "rb")
-# fastafile = pysam.Fastafile( "ex1.fa" )
-# try:
-# refs = [ x for x in pysam.pileup( "-c", "-f", "ex1.fa", "ex1.bam" ) if x.reference_base == "*"]
-# except pysam.SamtoolsError:
-# pass
-
-# i = samfile.pileup( stepper = "samtools", fastafile = fastafile )
-# calls = [ x for x in list(pysam.IteratorIndelCalls(i)) if x != None ]
-# for x,y in zip( refs, calls ):
-# self.checkEqual( x, y )
-
-# def testPerPositionViaCaller( self ):
-# # test pileup for each position. This is a fast operation
-# samfile = pysam.Samfile( "ex1.bam", "rb")
-# fastafile = pysam.Fastafile( "ex1.fa" )
-# i = samfile.pileup( stepper = "samtools", fastafile = fastafile )
-# caller = pysam.IndelCaller( i )
-
-# for x in pysam.pileup( "-c", "-f", "ex1.fa", "ex1.bam" ):
-# if x.reference_base != "*": continue
-# call = caller.call( x.chromosome, x.pos )
-# self.checkEqual( x, call )
-
-class TestLogging( unittest.TestCase ):
- '''test around bug issue 42,
-
- failed in versions < 0.4
- '''
-
- def check( self, bamfile, log ):
-
- if log:
- logger = logging.getLogger('franklin')
- logger.setLevel(logging.INFO)
- formatter = logging.Formatter('%(asctime)s %(levelname)s %(message)s')
- log_hand = logging.FileHandler('log.txt')
- log_hand.setFormatter(formatter)
- logger.addHandler(log_hand)
-
- bam = pysam.Samfile(bamfile, 'rb')
- cols = bam.pileup()
- self.assert_( True )
-
- def testFail1( self ):
- self.check( "ex9_fail.bam", False )
- self.check( "ex9_fail.bam", True )
-
- def testNoFail1( self ):
- self.check( "ex9_nofail.bam", False )
- self.check( "ex9_nofail.bam", True )
-
- def testNoFail2( self ):
- self.check( "ex9_nofail.bam", True )
- self.check( "ex9_nofail.bam", True )
-
-# TODOS
-# 1. finish testing all properties within pileup objects
-# 2. check exceptions and bad input problems (missing files, optional fields that aren't present, etc...)
-# 3. check: presence of sequence
-
-class TestSamfileUtilityFunctions( unittest.TestCase ):
-
- def testCount( self ):
-
- samfile = pysam.Samfile( "ex1.bam", "rb" )
-
- for contig in ("chr1", "chr2" ):
- for start in xrange( 0, 2000, 100 ):
- end = start + 1
- self.assertEqual( len( list( samfile.fetch( contig, start, end ) ) ),
- samfile.count( contig, start, end ) )
-
- # test empty intervals
- self.assertEqual( len( list( samfile.fetch( contig, start, start ) ) ),
- samfile.count( contig, start, start ) )
-
- # test half empty intervals
- self.assertEqual( len( list( samfile.fetch( contig, start ) ) ),
- samfile.count( contig, start ) )
-
- def testMate( self ):
- '''test mate access.'''
-
- readnames = [ x.split(b"\t")[0] for x in open( "ex1.sam", "rb" ).readlines() ]
- if sys.version_info[0] >= 3:
- readnames = [ name.decode('ascii') for name in readnames ]
-
- counts = collections.defaultdict( int )
- for x in readnames: counts[x] += 1
-
- samfile = pysam.Samfile( "ex1.bam", "rb" )
- for read in samfile.fetch():
- if not read.is_paired:
- self.assertRaises( ValueError, samfile.mate, read )
- elif read.mate_is_unmapped:
- self.assertRaises( ValueError, samfile.mate, read )
- else:
- if counts[read.qname] == 1:
- self.assertRaises( ValueError, samfile.mate, read )
- else:
- mate = samfile.mate( read )
- self.assertEqual( read.qname, mate.qname )
- self.assertEqual( read.is_read1, mate.is_read2 )
- self.assertEqual( read.is_read2, mate.is_read1 )
- self.assertEqual( read.pos, mate.mpos )
- self.assertEqual( read.mpos, mate.pos )
-
- def testIndexStats( self ):
- '''test if total number of mapped/unmapped reads is correct.'''
-
- samfile = pysam.Samfile( "ex1.bam", "rb" )
- self.assertEqual( samfile.mapped, 3235 )
- self.assertEqual( samfile.unmapped, 35 )
-
-class TestSamtoolsProxy( unittest.TestCase ):
- '''tests for sanity checking access to samtools functions.'''
-
- def testIndex( self ):
- self.assertRaises( IOError, pysam.index, "missing_file" )
-
- def testView( self ):
- # note that view still echos "open: No such file or directory"
- self.assertRaises( pysam.SamtoolsError, pysam.view, "missing_file" )
-
- def testSort( self ):
- self.assertRaises( pysam.SamtoolsError, pysam.sort, "missing_file" )
-
-class TestSamfileIndex( unittest.TestCase):
-
- def testIndex( self ):
- samfile = pysam.Samfile( "ex1.bam", "rb" )
- index = pysam.IndexedReads( samfile )
- index.build()
-
- reads = collections.defaultdict( int )
-
- for read in samfile: reads[read.qname] += 1
-
- for qname, counts in reads.iteritems():
- found = list(index.find( qname ))
- self.assertEqual( len(found), counts )
- for x in found: self.assertEqual( x.qname, qname )
-
-
-if __name__ == "__main__":
- # build data files
- print ("building data files")
- subprocess.call( "make", shell=True)
- print ("starting tests")
- unittest.main()
- print ("completed tests")
+++ /dev/null
-# usage: cat pysam_ex1.bam | python pysam_test_stdin.pyx
-
-import pysam
-
-samfile = pysam.Samfile( "-", "rb" )
-
-# set up the modifying iterators
-l = list(samfile.fetch( until_eof = True ))
-assert len(l) == 3270
-
+++ /dev/null
-import pp
-import pysam
-
-def sortBam(bam_filepath, out_prefix):
- pysam.sort(bam_filepath, out_prefix)
-
-job_server = pp.Server(ncpus=2, ppservers = (), secret='secret')
-
-job1 = job_server.submit(sortBam, ('ex1.bam', 'tmp1_'), modules=('pysam',))
-job2 = job_server.submit(sortBam, ('ex2.bam', 'tmp2_'), modules=('pysam',))
-
-result1 = job1()
-result2 = job2()
+++ /dev/null
-#!/usr/bin/env python
-'''unit testing code for pysam.'''
-
-import pysam
-import unittest
-import os
-import itertools
-import subprocess
-import shutil
-
-class TestExceptions(unittest.TestCase):
-
- def setUp(self):
- self.samfile=pysam.Samfile( "ex1.bam","rb" )
-
- def testOutOfRangeNegativeNewFormat(self):
- self.assertRaises( ValueError, self.samfile.fetch, "chr1", 5, -10 )
- self.assertRaises( ValueError, self.samfile.fetch, "chr1", 5, 0 )
- self.assertRaises( ValueError, self.samfile.fetch, "chr1", -5, -10 )
-
- def testOutOfRangeNegativeOldFormat(self):
- self.assertRaises( ValueError, self.samfile.fetch, "chr1:-5-10" )
- self.assertRaises( ValueError, self.samfile.fetch, "chr1:-5-0" )
- self.assertRaises( ValueError, self.samfile.fetch, "chr1:-5--10" )
-
- def testOutOfRangeLargeNewFormat(self):
- self.assertRaises( ValueError, self.samfile.fetch, "chr1", 99999999999999999, 999999999999999999 )
-
- def testOutOfRangeLargeOldFormat(self):
- self.assertRaises( ValueError, self.samfile.fetch, "chr1:99999999999999999-999999999999999999" )
-
- def tearDown(self):
- self.samfile.close()
-
-if __name__ == "__main__":
- unittest.main()
-
+++ /dev/null
-#!/usr/bin/env python
-'''unit testing code for pysam.
-
-Execute in the :file:`tests` directory as it requires the Makefile
-and data files located there.
-'''
-
-import sys, os, shutil, gzip
-import pysam
-import unittest
-import itertools
-import subprocess
-
-class TestVCFIterator( unittest.TestCase ):
-
- filename = "example.vcf40.gz"
- columns = ("contig", "pos", "id",
- "ref", "alt", "qual",
- "filter", "info", "format" )
-
- def testRead( self ):
- self.vcf = pysam.VCF()
- self.vcf.connect( self.filename )
-
- for x in self.vcf.fetch():
- print str(x)
- print x.pos
- print x.alt
- print x.id
- print x.qual
- print x.filter
- print x.info
- print x.format
-
- for s in x.samples:
- print s, x[s]
-
-if __name__ == "__main__":
-
- unittest.main()
-
-
-def Test():
-
- vcf33 = """##fileformat=VCFv3.3
-##fileDate=20090805
-##source=myImputationProgramV3.1
-##reference=1000GenomesPilot-NCBI36
-##phasing=partial
-##INFO=NS,1,Integer,"Number of Samples With Data"
-##INFO=DP,1,Integer,"Total Depth"
-##INFO=AF,-1,Float,"Allele Frequency"
-##INFO=AA,1,String,"Ancestral Allele"
-##INFO=DB,0,Flag,"dbSNP membership, build 129"
-##INFO=H2,0,Flag,"HapMap2 membership"
-##FILTER=q10,"Quality below 10"
-##FILTER=s50,"Less than 50% of samples have data"
-##FORMAT=GT,1,String,"Genotype"
-##FORMAT=GQ,1,Integer,"Genotype Quality"
-##FORMAT=DP,1,Integer,"Read Depth"
-##FORMAT=HQ,2,Integer,"Haplotype Quality"
-#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA00001\tNA00002\tNA00003
-20\t14370\trs6054257\tG\tA\t29\t0\tNS=3;DP=14;AF=0.5;DB;H2\tGT:GQ:DP:HQ\t0|0:48:1:51,51\t1|0:48:8:51,51\t1/1:43:5:-1,-1
-17\t17330\t.\tT\tA\t3\tq10\tNS=3;DP=11;AF=0.017\tGT:GQ:DP:HQ\t0|0:49:3:58,50\t0|1:3:5:65,3\t0/0:41:3:-1,-1
-20\t1110696\trs6040355\tA\tG,T\t67\t0\tNS=2;DP=10;AF=0.333,0.667;AA=T;DB\tGT:GQ:DP:HQ\t1|2:21:6:23,27\t2|1:2:0:18,2\t2/2:35:4:-1,-1
-17\t1230237\t.\tT\t.\t47\t0\tNS=3;DP=13;AA=T\tGT:GQ:DP:HQ\t0|0:54:7:56,60\t0|0:48:4:51,51\t0/0:61:2:-1,-1
-20\t1234567\tmicrosat1\tG\tD4,IGA\t50\t0\tNS=3;DP=9;AA=G\tGT:GQ:DP\t0/1:35:4\t0/2:17:2\t1/1:40:3"""
-
- vcf40 = """##fileformat=VCFv4.0
-##fileDate=20090805
-##source=myImputationProgramV3.1
-##reference=1000GenomesPilot-NCBI36
-##phasing=partial
-##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
-##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
-##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
-##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
-##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
-##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
-##FILTER=<ID=q10,Description="Quality below 10">
-##FILTER=<ID=s50,Description="Less than 50% of samples have data">
-##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
-##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
-##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
-##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
-#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA00001\tNA00002\tNA00003
-M\t1230237\t.\tT\t.\t47\tPASS\tNS=3;DP=13;AA=T\tGT:GQ:DP:HQ\t0|0:54:7:56,60\t0|0:48:4:51,51\t0/0:61:2
-20\t1234567\tmicrosat1\tGTCT\tG,GTACT\t50\tPASS\tNS=3;DP=9;AA=G\tGT:GQ:DP\t0/1:35:4\t0/2:17:2\t1/1:40:3
-17\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB;H2\tGT:GQ:DP:HQ\t0|0:48:1:51,51\t1|0:48:8:51,51\t1/1:43:5:.,.
-20\t17330\t.\tT\tA\t3\tq10\tNS=3;DP=11;AF=0.017\tGT:GQ:DP:HQ\t0|0:49:3:58,50\t0|1:3:5:65,3\t0/0:41:3
-20\t1110696\trs6040355\tA\tG,T\t67\tPASS\tNS=2;DP=10;AF=0.333,0.667;AA=T;DB\tGT:GQ:DP:HQ\t1|2:21:6:23,27\t2|1:2:0:18,2\t2/2:35:4"""
-
- if False:
- print "Parsing v3.3 file:"
- print vcf33
- vcf = VCFFile()
- lines = [data for data in vcf.parse( (line+"\n" for line in vcf33.split('\n') ) )]
- print "Writing v3.3 file:"
- vcf.write( sys.stdout, lines )
-
- if False:
- print "Parsing v4.0 file:"
- print vcf40
- vcf = VCFFile()
- lines = [data for data in vcf.parse( (line+"\n" for line in vcf40.split('\n') ) )]
- print "Writing v4.0 file:"
- vcf.write( sys.stdout, lines )
-
- if True:
- print "Parsing v3.3 file:"
- print vcf33
- vcf = sortedVCFFile()
- lines = [data for data in vcf.parse( (line+"\n" for line in vcf33.split('\n') ) )]
- print "Writing v3.3 file:"
- vcf.write( sys.stdout, lines )
-
- if True:
- print "Parsing v4.0 file:"
- print vcf40
- vcf = sortedVCFFile()
- lines = [data for data in vcf.parse( (line+"\n" for line in vcf40.split('\n') ) )]
- print "Writing v4.0 file:"
- vcf.write( sys.stdout, lines )
(6, 27), (7, 28),
(8, 29), (9, 30)])
+ def test_get_aligned_pairs_lowercase_md(self):
+ a = self.build_read()
+ a.query_sequence = "A" * 10
+ a.cigarstring = "10M"
+ a.set_tag("MD", "5g4")
+ self.assertEqual(
+ a.get_aligned_pairs(with_seq=True),
+ [(0, 20, 'A'),
+ (1, 21, 'A'),
+ (2, 22, 'A'),
+ (3, 23, 'A'),
+ (4, 24, 'A'),
+ (5, 25, 'g'),
+ (6, 26, 'A'),
+ (7, 27, 'A'),
+ (8, 28, 'A'),
+ (9, 29, 'A')])
+
+ def test_get_aligned_pairs_uppercase_md(self):
+ a = self.build_read()
+ a.query_sequence = "A" * 10
+ a.cigarstring = "10M"
+ a.set_tag("MD", "5G4")
+ self.assertEqual(
+ a.get_aligned_pairs(with_seq=True),
+ [(0, 20, 'A'),
+ (1, 21, 'A'),
+ (2, 22, 'A'),
+ (3, 23, 'A'),
+ (4, 24, 'A'),
+ (5, 25, 'g'),
+ (6, 26, 'A'),
+ (7, 27, 'A'),
+ (8, 28, 'A'),
+ (9, 29, 'A')])
+
def testNoSequence(self):
'''issue 176: retrieving length without query sequence
with soft-clipping.
def testTruncate(self):
'''see issue 107.'''
# note that ranges in regions start from 1
- p = self.samfile.pileup(region='chr1:170:172', truncate=True)
+ p = self.samfile.pileup(region='chr1:170-172', truncate=True)
columns = [x.reference_pos for x in p]
self.assertEqual(len(columns), 3)
self.assertEqual(columns, [169, 170, 171])
self.assertEqual(samfile.closed, True)
+class TestMultiThread(unittest.TestCase):
+
+ def testSingleThreadEqualsMultithread(self):
+ input_bam = os.path.join(BAM_DATADIR, 'ex1.bam')
+ single_thread_out = get_temp_filename("tmp_single.bam")
+ multi_thread_out = get_temp_filename("tmp_multi.bam")
+ with pysam.AlignmentFile(input_bam,
+ 'rb') as samfile:
+ reads = [r for r in samfile]
+ with pysam.AlignmentFile(single_thread_out,
+ mode='wb',
+ template=samfile,
+ threads=1) as single_out:
+ [single_out.write(r) for r in reads]
+ with pysam.AlignmentFile(multi_thread_out,
+ mode='wb',
+ template=samfile,
+ threads=2) as multi_out:
+ [multi_out.write(r) for r in reads]
+ with pysam.AlignmentFile(input_bam) as inp, \
+ pysam.AlignmentFile(single_thread_out) as single, \
+ pysam.AlignmentFile(multi_thread_out) as multi:
+ for r1, r2, r3 in zip(inp, single, multi):
+ assert r1.to_string() == r2.to_string() == r3.to_string()
+
+ def testNoMultiThreadingWithIgnoreTruncation(self):
+ with self.assertRaises(ValueError):
+ pysam.AlignmentFile(os.path.join(BAM_DATADIR, 'ex1.bam'),
+ threads=2,
+ ignore_truncation=True)
+
+
class TestExceptions(unittest.TestCase):
def setUp(self):
'rb')
+class TestRegionParsing(unittest.TestCase):
+
+ def test_dash_in_chr(self):
+ with pysam.AlignmentFile(
+ os.path.join(BAM_DATADIR, "example_dash_in_chr.bam")) as inf:
+ self.assertEqual(len(list(inf.fetch(contig="chr-1"))), 1)
+ self.assertEqual(len(list(inf.fetch(contig="chr2"))), 1)
+ self.assertEqual(len(list(inf.fetch(region="chr-1"))), 1)
+ self.assertEqual(len(list(inf.fetch(region="chr2"))), 1)
+
+
class TestDeNovoConstruction(unittest.TestCase):
'''check BAM/SAM file construction using ex6.sam
with open(fn_reference, "w") as outf:
outf.write(">chr1\n{seq}\n>chr2\n{seq}\n".format(
seq=s))
-
+
if mode == "bam":
write_mode = "wb"
elif mode == "sam":
write_mode = "w"
elif mode == "cram":
write_mode = "wc"
-
+
with pysam.AlignmentFile(fn, write_mode,
header=self.header,
reference_filename=fn_reference) as outf:
outf.write(read)
-
with pysam.AlignmentFile(fn) as inf:
ref_read = next(inf)
os.unlink(fn)
os.unlink(fn_reference)
-
+
def test_reading_writing_sam(self):
read = self.build_read()
self.check_read(read, mode="sam")
read = self.build_read()
self.check_read(read, mode="bam")
+ @unittest.skip("fails on linux - https issue?")
def test_reading_writing_cram(self):
+ # The following test fails with htslib 1.9, but worked with previous versions.
+ # Error is:
+ # [W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/ac9fac7c3e9c476f74f1d0e47abc8be2": Input/output error
+ # Error can be reproduced using samtools 1.9 command line.
+ # Could be a conda configuration issue, see
+ # https://github.com/bioconda/bioconda-recipes/issues/9056
+ return
read = self.build_read()
self.check_read(read, mode="cram")
sample["GT"] = sample["GT"]
+class TestMultiThreading(unittest.TestCase):
+
+ filename = os.path.join(CBCF_DATADIR, "example_vcf42.vcf.gz")
+
+ def testSingleThreadEqualsMultithreadResult(self):
+ with pysam.VariantFile(self.filename) as inf:
+ header = inf.header
+ single = [r for r in inf]
+ with pysam.VariantFile(self.filename, threads=2) as inf:
+ multi = [r for r in inf]
+ for r1, r2 in zip(single, multi):
+ assert str(r1) == str(r2)
+
+ bcf_out = get_temp_filename(suffix=".bcf")
+ with pysam.VariantFile(bcf_out, mode='wb',
+ header=header,
+ threads=2) as out:
+ for r in single:
+ out.write(r)
+ with pysam.VariantFile(bcf_out) as inf:
+ multi_out = [r for r in inf]
+ for r1, r2 in zip(single, multi_out):
+ assert str(r1) == str(r2)
+
+ def testNoMultiThreadingWithIgnoreTruncation(self):
+ with self.assertRaises(ValueError):
+ pysam.VariantFile(self.filename,
+ threads=2,
+ ignore_truncation=True)
+
+
class TestSubsetting(unittest.TestCase):
filename = "example_vcf42.vcf.gz"
faidx_empty_seq.fq.gz \
ex1.fa.gz ex1.fa.gz.csi \
ex1_csi.bam \
- example_reverse_complement.bam
+ example_reverse_complement.bam \
+ example_dash_in_chr.bam
# ex2.sam - as ex1.sam, but with header
ex2.sam.gz: ex1.bam ex1.bam.bai
--- /dev/null
+@HD VN:1.0
+@SQ SN:chr-1 LN:1575
+@SQ SN:chr2 LN:1584
+@CO this is a comment
+@CO this is another comment
+read_28833_29006_6945 99 chr-1 33 20 10M1D25M = 200 167 AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG <<<<<<<<<<<<<<<<<<<<<:<9/,&,22;;<<< NM:i:1 RG:Z:L1 PG:Z:P1 XT:A:U
+read_28701_28881_323b 147 chr2 88 30 35M = 500 412 ACCTATATCTTGGCCTTGGCCGATGCGGCCTTGCA <<<<<;<<<<7;:<<<6;<<<<<<<<<<<<7<<<< MF:i:18 RG:Z:L2 PG:Z:P2 XT:A:R
self.assertEqual(tabixfile.closed, True)
+class TestMultithreadTabixFile(unittest.TestCase):
+
+ filename = os.path.join(TABIX_DATADIR, "example.gtf.gz")
+
+ def testMultithreadEqualsSinglethread(self):
+ with pysam.TabixFile(self.filename) as tabixfile:
+ single = [r for r in tabixfile.fetch()]
+ with pysam.TabixFile(self.filename, threads=2) as tabixfile:
+ multi = [r for r in tabixfile.fetch()]
+ for r1, r2 in zip(single, multi):
+ assert str(r1) == str(r2)
+
+
if __name__ == "__main__":
subprocess.call("make -C %s" % TABIX_DATADIR, shell=True)
unittest.main()