dhtslib ~hFILE_fix
D bindings for htslib
Overview
D bindings and convenience wrappers for htslib, the most widely-used library for manipulation of high-throughput sequencing data.
Installation
Add dhtslib as a dependency to dub.json:
"dependencies": {
    "dhtslib": "~>0.6.0"
}
(The version number 0.6.0 is an example; see https://dub.pm/package-format-json)
Requirements
Dynamically linking to htslib (default)
A system installation of htslib v1.9 or higher is required.
Statically linking to htslib
libhts.a needs to be added to your project's source files.
Remember to link to all dynamic libraries configured when htslib was built. These may include bz2, lzma, zlib, deflate, crypto, pthreads, and curl.
Finally, if statically linking, the -lhts flag needs to be removed from compilation by selecting source-static as the dub configuration for dhtslib within your own project's dub configuration file:
"subConfigurations": {
"dhtslib": "source-static"
},
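Put together, a static-linking setup might look like the sketch below. The project name `myapp`, the location of libhts.a, and the exact library names under `libs` are assumptions for illustration; the libraries you actually need depend on how your copy of htslib was configured.

```json
{
    "name": "myapp",
    "dependencies": {
        "dhtslib": "~>0.6.0"
    },
    "subConfigurations": {
        "dhtslib": "source-static"
    },
    "sourceFiles": ["libhts.a"],
    "libs": ["bz2", "lzma", "z", "deflate", "crypto", "pthread", "curl"]
}
```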
Usage
D API (OOP Wrappers)
Object-oriented, idiomatic D wrappers are available for:
- BGZF compressed files (dhtslib.bgzf)
- FASTA indexes (dhtslib.faidx)
- SAM/BAM/CRAM files and streams (dhtslib.sam)
- Tabix-indexed files (dhtslib.tabix)
- VCF/BCF files (dhtslib.vcf)
For example, this provides access to BGZF files by line as a consumable InputRange.
Or, for BAM files, the ability to query for a range (e.g. "chr1:1000000-2000000") and obtain an InputRange over the BAM records.
For most file type readers, indexing queries (["coordinates"]) return ranges of records. There are multiple options, including ["chr1", 10_000_000 .. 20_000_000] and ["chr1:10000000-20000000"].
See the documentation for more details.
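For readers unfamiliar with how D supports both indexing forms, here is a minimal sketch using operator overloading. `ToyReader` and `Region` are hypothetical types written for this illustration, not part of dhtslib. The sketch stores coordinates exactly as written and does no coordinate-system conversion or input validation.

```d
import std.algorithm.searching : findSplit;
import std.conv : to;

/// Hypothetical parsed query: contig name plus start and end as written.
struct Region
{
    string contig;
    long start;
    long end;
}

/// Hypothetical reader that only turns an index expression into a Region.
struct ToyReader
{
    /// D rewrites r["chr1", a .. b] into r.opIndex("chr1", r.opSlice!1(a, b)).
    long[2] opSlice(size_t dim)(long start, long end)
    {
        return [start, end];
    }

    /// Handles the ["chr1", 10_000_000 .. 20_000_000] form.
    Region opIndex(string contig, long[2] span)
    {
        return Region(contig, span[0], span[1]);
    }

    /// Handles the ["chr1:10000000-20000000"] form.
    Region opIndex(string region)
    {
        auto byColon = region.findSplit(":");    // contig, ":", "start-end"
        auto byDash = byColon[2].findSplit("-"); // start, "-", end
        return Region(byColon[0], byDash[0].to!long, byDash[2].to!long);
    }
}

unittest
{
    ToyReader r;
    // Both spellings describe the same query in this toy model.
    assert(r["chr1", 10_000_000 .. 20_000_000] == r["chr1:10000000-20000000"]);
}
```

A real reader would return a range of records rather than a `Region`; the point here is only the `opSlice`/`opIndex` mechanism.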
htslib API
Direct bindings to the htslib C API are available as submodules under dhtslib.htslib.
Naming remains the same as the original .h include files.
For example, import dhtslib.htslib.faidx for direct access to the C function calls.
The current compatible versions are 1.7-1.9.
Currently implemented:
- bgzf
- faidx
- hts
- hts_log
- kstring
- regidx
- sam
- tbx
- thread_pool (untested)
- vcf
Missing or work-in-progress:
- Some CRAM-specific functions, although much CRAM functionality works with sam_ functions
- hfile
- kbitset, kfunc, khash, klist, knetfile, kseq, ksort (mostly used internally anyway)
- syncedbcfreader
- vcf_sweep
- vcfutils
FAQ
Q: Why not use bioD?
A:
bioD, as a more general bioinformatics framework, is more comparable to bio-python, bio-ruby, bio-rust, etc.
bioD does have some excellent hts file format (BGZF and SAM) handling, and at one time sambamba, which relied on it, was faster than samtools.
However, the development resources poured into htslib overall are tremendous, and we wish to leverage that rather than writing VCF, tabix, etc. code from scratch.
Q: Why were htslib bindings ported by hand instead of using a C header/bindings translator as in hts-nim or rust-htslib?
A:
Several reasons.
First, this gave the authors of dhtslib a better familiarity with the htslib API including letting us get to know several lesser-known functions.
Second, some elements (particularly #define macros) are difficult or impossible for machines to translate, or to translate into efficient code; here we were sometimes able to replace these macros with smarter equivalents than a direct macro expansion.
Finally, instead of dumping all the bindings into an interface file, we left the structure of the file intact to make it easier for the D developer to read the source file as the htslib authors intended the C headers to be read. In addition, this leaves docstring/documentation comments intact, whereas in other projects the direct API has no comments and the developer must refer to the C headers.
Q: Why am I getting a segfault?
A: It's easy to get a segfault by using the direct C API incorrectly. We have tried to eliminate most of this (use after free, etc.) in the OOP wrappers. If you are getting a segfault you cannot understand when using purely the high-level D API, please post an issue.
Bugs and Warnings
Zero-based versus one-based coordinates. Zero-based coordinates are used internally and also by the API for BCF/VCF and SAM/BAM types.
The faidx C API expects one-based coordinates; we wrap it in a template that lets the user specify the coordinate system.
See documentation for more details.
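As a reminder of what the two conventions mean in practice, here is a minimal, library-independent sketch. It assumes the common pairing of zero-based half-open intervals and one-based closed intervals. The type and function names are hypothetical and are not dhtslib's own API.

```d
/// One-based, closed interval: both start and end are included.
struct OneBasedClosed
{
    long start;
    long end;
}

/// Zero-based, half-open interval: start is included, end is excluded.
struct ZeroBasedHalfOpen
{
    long start;
    long end;
}

/// Shift the start down by one; the end value stays numerically the same.
ZeroBasedHalfOpen toZeroBased(OneBasedClosed r)
{
    assert(r.start >= 1 && r.end >= r.start, "invalid one-based interval");
    return ZeroBasedHalfOpen(r.start - 1, r.end);
}

/// Inverse conversion; an empty half-open interval has no closed equivalent.
OneBasedClosed toOneBased(ZeroBasedHalfOpen r)
{
    assert(r.start >= 0 && r.end > r.start, "invalid or empty zero-based interval");
    return OneBasedClosed(r.start + 1, r.end);
}

/// Number of bases covered by a half-open interval.
long basesCovered(ZeroBasedHalfOpen r)
{
    return r.end - r.start;
}

/// Number of bases covered by a closed interval.
long basesCovered(OneBasedClosed r)
{
    return r.end - r.start + 1;
}

unittest
{
    auto one = OneBasedClosed(101, 200);
    auto zero = one.toZeroBased;
    assert(zero == ZeroBasedHalfOpen(100, 200));
    assert(zero.basesCovered == one.basesCovered); // same bases, different labels
    assert(zero.toOneBased == one);                // conversion round-trips
}
```

Keeping the two conventions as distinct types means an accidental mix-up fails at compile time instead of silently shifting a query by one base.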
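Do not call the hts_log functions (hts_log_*) from a destructor, as they potentially allocate via toStringz, and allocating GC memory while a destructor runs during a collection is an error in D.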
See Also
- gff3d GFF3 record reader/writer
- Registered by James Blachly
- ~hFILE_fix released 5 years ago
- Repository: github.com/blachlylab/dhtslib
- License: MIT
- Dependencies: none
- Versions:
  - 0.14.0+htslib-1.13 (2022-Mar-02)
  - 0.13.3+htslib-1.13 (2021-Oct-01)
  - 0.13.2+htslib-1.13 (2021-Oct-01)
  - 0.13.1+htslib-1.13 (2021-Sep-30)
  - 0.13.0+htslib-1.13 (2021-Sep-30)
- Short URL: dhtslib.dub.pm