DENTIST

DENTIST uses long reads to close assembly gaps at high accuracy.

Long sequencing reads can increase the contiguity and completeness of fragmented, short-read-based genome assemblies by closing assembly gaps, ideally at high accuracy. DENTIST is a sensitive, highly accurate and automated pipeline for closing gaps in short-read assemblies with long reads.

First time here? Head over to [the example](#example) and make sure it works.

Install

Use a Singularity Container (recommended)

Make sure Singularity is installed on your system. You can then use the container like so:

# launch an interactive shell
singularity shell docker://aludi/dentist:stable

# execute a single command inside the container
singularity exec docker://aludi/dentist:stable dentist --version

# run the whole workflow on a cluster using Singularity
snakemake --configfile=snakemake.yml --use-singularity --profile=slurm

The last command is explained in more detail below in the usage section.

Use Pre-Built Binaries

Download the latest pre-built binaries from the releases section and extract the contents. The pre-built binaries are stored in a subfolder called bin. Here are the instructions for v1.0.1:

# download & extract pre-built binaries
wget https://github.com/a-ludi/dentist/releases/download/v1.0.1/dentist.v1.0.1.x86_64.tar.gz
tar -xzf dentist.v1.0.1.x86_64.tar.gz

# make binaries available to your shell
cd dentist.v1.0.1.x86_64
PATH="$PWD/bin:$PATH"

# check installation with
dentist -d
# Expected output:
# 
#daligner (part of `DALIGNER`; see https://github.com/thegenemyers/DALIGNER) [OK]
#damapper (part of `DAMAPPER`; see https://github.com/thegenemyers/DAMAPPER) [OK]
#DAScover (part of `DASCRUBBER`; see https://github.com/thegenemyers/DASCRUBBER) [OK]
#DASqv (part of `DASCRUBBER`; see https://github.com/thegenemyers/DASCRUBBER) [OK]
#DBdump (part of `DAZZ_DB`; see https://github.com/thegenemyers/DAZZ_DB) [OK]
#DBdust (part of `DAZZ_DB`; see https://github.com/thegenemyers/DAZZ_DB) [OK]
#DBrm (part of `DAZZ_DB`; see https://github.com/thegenemyers/DAZZ_DB) [OK]
#DBshow (part of `DAZZ_DB`; see https://github.com/thegenemyers/DAZZ_DB) [OK]
#DBsplit (part of `DAZZ_DB`; see https://github.com/thegenemyers/DAZZ_DB) [OK]
#fasta2DAM (part of `DAZZ_DB`; see https://github.com/thegenemyers/DAZZ_DB) [OK]
#fasta2DB (part of `DAZZ_DB`; see https://github.com/thegenemyers/DAZZ_DB) [OK]
#computeintrinsicqv (part of `daccord`; see https://gitlab.com/german.tischler/daccord) [OK]
#daccord (part of `daccord`; see https://gitlab.com/german.tischler/daccord) [OK]

The tarball additionally contains the Snakemake workflow, example config files and this README. In short, everything you need to run DENTIST.

Build from Source

Be sure to install the D package manager DUB. Then install DENTIST using either

dub install dentist

or

git clone https://github.com/a-ludi/dentist.git
cd dentist
dub build

Runtime Dependencies

The following software packages are required to run dentist:

The Dazzler Data Base (>=2020-07-27)

Manage sequences (reads and assemblies) in 2-bit encoding alongside auxiliary information such as masks or QV tracks.

DALIGNER (=2020-01-15)

Find significant local alignments.

DAMAPPER (>=2020-03-10)

Find alignment chains, i.e. sequences of significant local alignments possibly with unaligned gaps.

DAMASKER (>=2020-01-15)

Discover tandem repeats.

DASCRUBBER (>=2020-07-26)

Estimate coverage and compute QVs.

daccord (>=v0.0.17)

Compute reference-based consensus sequence for gap filling.

Please see their own documentation for installation instructions. Note, the available packages on Bioconda are outdated and should not be used at the moment.

Please use the following versions in your dependencies in case you experience troubles. These should be the same versions used in the Dockerfile:

Usage

Before you start producing wonderful scientific results, you should skip over to the example section and try to run the small example. This will make sure your setup is working as expected.

Quick execution with Snakemake (and Singularity)

TL;DR

# edit dentist.json and snakemake.yml
snakemake --configfile=snakemake.yml --use-singularity --profile=slurm

Install Snakemake version >=5.32.1 and copy these files into your working directory:

./snakemake/Snakefile
./snakemake/snakemake.yml
./snakemake/dentist.json

Next edit snakemake.yml and dentist.json to fit your needs and optionally test your configuration with

snakemake --configfile=snakemake.yml --use-singularity --cores=1 -f -- validate_dentist_config

If no errors occurred, the whole workflow can be executed using

snakemake --configfile=snakemake.yml --use-singularity --cores=all

For small genomes of a few 100 Mbp this should run on a regular workstation. One may use Snakemake's --jobs to run independent jobs in parallel. Larger data sets may require a cluster in which case you can use Snakemake's cloud or cluster facilities.

Executing on a Cluster

To make execution on a cluster easy, DENTIST comes with example files that make Snakemake use SLURM via DRMAA. Please read the documentation of Snakemake if this does not suit your needs. Another good starting point is the Snakemake-Profiles project.

Start by copying these files to your working/home directory:

./snakemake/Snakefile
./snakemake/snakemake.yml
./snakemake/cluster.yml
One of ./snakemake/profile-slurm.*.yml → ~/.config/snakemake/slurm/config.yaml

Next adjust the profile according to your cluster. This should enable Snakemake to submit and track jobs on your cluster. You may use the configuration values specified in cluster.yml to configure job names and resource allocation for each step of the pipeline. Now, submit the workflow to your cluster by

snakemake --configfile=snakemake.yml --profile=slurm --use-singularity

Note that parameters specified in the profile provide default values and can be overridden by specifying a different value on the CLI.
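To illustrate this precedence, here is a minimal sketch that writes a throwaway profile into a temporary directory. The location and the value of 50 jobs are assumptions made for the example, not settings shipped with DENTIST. The keys `jobs` and `use-singularity` are the long names of the corresponding Snakemake command-line options. The final Snakemake call is left as a comment because it needs a working DENTIST setup.

```shell
# Create a throwaway profile directory (hypothetical location, for illustration only).
profile_dir="$(mktemp -d)/slurm"
mkdir -p "$profile_dir"

# Keys in a profile are the long names of Snakemake's command-line options.
cat > "$profile_dir/config.yaml" <<'EOF'
jobs: 50
use-singularity: true
EOF

cat "$profile_dir/config.yaml"

# The profile now supplies --jobs=50 by default. Passing --jobs on the
# command line overrides that default for a single run:
#
#   snakemake --configfile=snakemake.yml --profile="$profile_dir" --jobs=10
```

Passing a directory path to --profile instead of a profile name keeps the experiment out of ~/.config/snakemake.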

Manual execution

Please inspect the Snakemake workflow to get all the details. It might be useful to execute Snakemake with the -p switch which causes Snakemake to print the shell commands. If you plan to write your own workflow management for DENTIST please feel free to contact the maintainer!

Example

After installing Snakemake (5.32.1 or later) and Singularity 3.5.x or later, you may check your installation with this example dataset (182Mb).

If Singularity is not an option for you, please follow the installation instructions above for an alternative.

wget https://github.com/a-ludi/dentist-example/releases/download/v1.0.1-2/dentist-example.tar.gz
tar -xzf dentist-example.tar.gz
cd dentist-example

Local Execution

Execute the entire workflow on your local machine using all cores:

# run the workflow
snakemake --configfile=snakemake.yml --use-singularity --cores=all

# validate the files
md5sum -c checksum.md5

Execution takes approx. 7 minutes and a maximum of 1.7GB memory on my little laptop with an Intel® Core™ i5-5200U CPU @ 2.20GHz.

Cluster Execution

Execute the workflow on a SLURM cluster:

mkdir -p "$HOME/.config/snakemake/slurm"
# select one of the profile-slurm.{drmaa,submit-async,submit-sync}.yml files
cp -v "profile-slurm.sync.yml" "$HOME/.config/snakemake/slurm/config.yaml"
# execute using the cluster profile
snakemake --configfile=snakemake.yml --use-singularity --profile=slurm

# validate the files
md5sum -c checksum.md5

If you want to run with a different cluster manager or in the cloud, please read the advice above. The easiest option is to adjust the srun command in profile-slurm.sync.yml to your cluster, e.g. qsub -sync yes. The command must submit a job to the cluster and wait for it to finish.

Configuration

DENTIST comprises a complex pipeline with many options for tweaking. This section points out some important parameters and their effect on the result or performance.

How to Choose DENTIST Parameters

The following list comprises the important/influential parameters for DENTIST itself. Please keep in mind that the alignments generated by daligner/damapper have immense influence on the performance of DENTIST.

--max-insertion-error: Strong influence on quality and sensitivity. Lower values lead to lower sensitivity but higher quality. The maximum recommended value is 0.05.
--min-anchor-length: Higher values result in higher accuracy but lower sensitivity. In particular, large gaps cannot be closed if the value is too high. Usually the value should be at least 500 and at most 10_000.
--min-reads-per-pile-up: Choosing higher values for the minimum number of reads drastically reduces sensitivity but has little effect on the quality. Small values may be chosen to get the maximum sensitivity in de novo assemblies. Make sure to thoroughly validate the results though.
--min-spanning-reads: Higher values give more confidence in the correctness of closed gaps but reduce sensitivity. The value must be well below the expected coverage.
--allow-single-reads: May be used under careful consideration in combination with --min-spanning-reads=1. This is intended for one of the following scenarios:
1. DENTIST is meant to close as many gaps as possible in a de novo assembly. Then the closed gaps must be validated by other means afterwards.
2. DENTIST is used not with real reads but with an independent assembly.
--existing-gap-bonus: If DENTIST finds evidence to join two contigs that are already consecutive in the input assembly (i.e. joined by Ns), then this join will be preferred over conflicting joins (if present) by this bonus. The default value is rather conservative, i.e. the preferred join almost always wins over other joins in case of a conflict.

--join-policy: Choose according to your needs:

scaffoldGaps: Closes only gaps that are marked by Ns in the assembly. This is the default mode of operation. Use this if you do not want to alter the scaffolding of the assembly. See also --existing-gap-bonus.
scaffolds: Allows whole scaffolds to be joined in addition to the effects of scaffoldGaps. Use this if you have (many) scaffolds that are not yet full chromosome-scale.
contigs: Allows contigs to be rearranged freely. This is especially useful in de novo assemblies before applying any other scaffolding methods as it increases the contiguity, thus increasing the chance that large-scale scaffolding (e.g. Bionano or Hi-C) finds proper joins.
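The numeric recommendations in the parameter list above can be turned into a quick sanity check. The following is only a sketch: check_dentist_params is a hypothetical helper, not part of DENTIST, and it merely compares values against the limits stated above. For --min-spanning-reads it only tests "below the expected coverage"; whether the value is "well below" remains a judgment call.

```shell
# check_dentist_params MAX_INSERTION_ERROR MIN_ANCHOR_LENGTH MIN_SPANNING_READS COVERAGE
# Hypothetical helper: prints a warning for each value outside the ranges
# recommended above and returns non-zero if any warning was printed.
check_dentist_params() {
    awk -v err="$1" -v anchor="$2" -v spanning="$3" -v cov="$4" 'BEGIN {
        bad = 0
        if (err + 0 > 0.05) {
            print "warning: --max-insertion-error above recommended maximum of 0.05"
            bad = 1
        }
        if (anchor + 0 < 500 || anchor + 0 > 10000) {
            print "warning: --min-anchor-length outside the usual range 500..10000"
            bad = 1
        }
        if (spanning + 0 >= cov + 0) {
            print "warning: --min-spanning-reads is not below the expected coverage"
            bad = 1
        }
        exit bad
    }'
}

# Values within the recommended ranges: no warnings.
check_dentist_params 0.05 500 3 30 && echo "parameters look reasonable"

# Two values outside the recommended ranges: two warnings.
check_dentist_params 0.10 200 3 30 || echo "please review the parameters"
```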

Choosing the Read Type

In the examples PacBio long reads are assumed but DENTIST can be run using any kind of long reads. Currently, this is either PacBio or Oxford Nanopore reads. To use non-PacBio reads, the reads_type in snakemake.yml must be set to anything other than PACBIO_SMRT. The recommendation is to use OXFORD_NANOPORE for Oxford Nanopore. These names are borrowed from the NCBI. Further details on the rationale can be found in this issue.
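Switching the read type amounts to a one-line change in snakemake.yml. The sketch below uses a hypothetical helper, set_reads_type, that rewrites whatever line contains the reads_type key. The layout of the demonstration file is made up; the helper only assumes that the key sits on its own line.

```shell
# set_reads_type FILE TYPE
# Hypothetical helper: rewrites the value of the reads_type key in FILE,
# keeping the key's indentation. A backup is written to FILE.bak.
# TYPE must not contain the characters / or &.
set_reads_type() {
    sed -E -i.bak "s/^([[:space:]]*reads_type:[[:space:]]*).*/\\1$2/" "$1"
}

# Demonstration on a throwaway file; the surrounding keys are made up.
demo_config="$(mktemp)"
printf 'inputs:\n    reads: reads.fasta\n    reads_type: PACBIO_SMRT\n' > "$demo_config"

set_reads_type "$demo_config" OXFORD_NANOPORE
cat "$demo_config"
```

On a real project the call would be set_reads_type snakemake.yml OXFORD_NANOPORE. Editing the file by hand works just as well.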

Cluster/Cloud Execution

Cluster job schedulers can become unresponsive or even crash if too many jobs with short running time are submitted to the cluster. It is therefore advisable to adjust the workflow accordingly. We tried to provide a default configuration that works in most cases as is, but the application scenarios can be very diverse and manual adjustments may become necessary. Here is a small guide to the config parameters that influence the number of jobs and the resources they consume.

max_threads: Sets the maximum number of threads/cores a single job may use. A single-threaded job will always allocate a single core but thread-parallel steps, e.g. the sequence alignments, will use up to max_threads if snakemake has been provided enough cores via --cores.
-s<block_size:uint>: The assembly and reads FASTA files are converted into Dazzler DBs. These DBs store the sequence in a 2-bit encoding and have additional features like tracks (similar to BED files). They are also split into blocks of <block_size>Mb. Alignments are calculated on the basis of these blocks, which enables easy distribution onto the cluster. The larger the block size, the longer the alignment jobs run and the more memory they require, but the fewer jobs are submitted. Experience shows that the block size should be between 200Mb and 500Mb.
propagate_batch_size: The repeat masks are homogenized by propagating them from the assembly to the reads and back again. Usually these jobs are very short because the propagation is parallelized over the blocks of the reads DB. To reduce the number of jobs both propagation directions are grouped together and submitted in batches of propagate_batch_size read blocks. Increasing propagate_batch_size reduces the number of submitted jobs and increases the run time per job. It has no effect on the memory requirements.
batch_size: In the collect step DENTIST identifies candidates for gap closing, each consisting of a pile up of reads. From these pile ups consensus sequences are computed and validated in the process step. Each job processes batch_size pile ups. Increasing batch_size reduces the number of submitted jobs and increases the run time per job. It has no effect on the memory requirements.
validation_blocks: The preliminarily closed gaps are validated by analyzing how the reads align to each closed gap. The validation is conducted in independent jobs for validation_blocks many blocks of the gap-closed assembly. Decreasing validation_blocks reduces the number of submitted jobs and increases the run time and memory requirements per job. The memory requirement is proportional to the size of the read alignment blocks.
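The relation between these parameters and the number of submitted jobs can be estimated with simple arithmetic. The sketch below is a rough estimate only: all input values are hypothetical, and the real number of DB blocks can differ slightly because reads are not cut at block boundaries.

```shell
# ceil_div A B: integer division of A by B, rounded up.
ceil_div() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# Hypothetical example values; replace them with your own data.
reads_mb=30000              # total size of the reads in Mb
block_size=400              # value passed as -s<block_size>
propagate_batch_size=50     # read blocks per mask propagation job
pile_ups=12000              # pile ups found by the collect step
batch_size=100              # pile ups per process job

read_blocks=$(ceil_div "$reads_mb" "$block_size")
propagate_jobs=$(ceil_div "$read_blocks" "$propagate_batch_size")
process_jobs=$(ceil_div "$pile_ups" "$batch_size")

echo "read blocks:    $read_blocks"
echo "propagate jobs: $propagate_jobs"
echo "process jobs:   $process_jobs"
```

With these numbers the estimate is 75 read blocks, 2 propagation jobs and 120 process jobs. Doubling batch_size would halve the number of process jobs at the cost of a longer run time per job.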

Troubleshooting

Unexpected `ProtectedOutputException` when running on a single machine

See also: Regular ProtectedOutputException.

When executed on a single machine, snakemake will sometimes quit with a ProtectedOutputException (Snakemake bug report filed). You may try the following snippet to get snakemake back on track:

# make sure workdir exists to avoid errors with chmod
mkdir -p workdir
# keep track of the number of retries to avoid an infinite loop
RETRY=0
# try running snakemake as long as the gap-closed assembly was not created
# and we have retries left
while [[ ! -f "gap-closed.fasta" ]] && (( RETRY++ < 3 )); do
    # allow snakemake to overwrite protected output
    chmod -R u+w workdir
    # try snakemake...
    snakemake --configfile=snakemake.yml --use-singularity --cores=all
done

Regular `ProtectedOutputException`

Snakemake has a built-in facility to protect files from accidental overwrites. This is meant to avoid overwriting precious results that took many CPU hours to produce. If executing a rule would overwrite a protected file, Snakemake raises a ProtectedOutputException, e.g.:

ProtectedOutputException in line 1236 of /tmp/dentist-example/Snakefile:
Write-protected output files for rule collect:
workdir/pile-ups.db
  File "/usr/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 136, in run_jobs
  File "/usr/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 441, in run
  File "/usr/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 230, in _run
  File "/usr/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 155, in _run

Here workdir/pile-ups.db is the protected file that caused the error. If you are sure of what you are doing, you can simply lift the protection with chmod -R +w ./workdir and execute Snakemake again. It will then overwrite any files.
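Before lifting the protection it can help to see which files are affected. The following sketch uses a hypothetical helper, list_protected, and runs on a throwaway directory that stands in for ./workdir. It relies only on the fact that protected files lack the owner's write permission.

```shell
# list_protected DIR
# Hypothetical helper: print all regular files below DIR that the owner
# cannot write to (0200 is the owner's write bit).
list_protected() {
    find "$1" -type f ! -perm -200
}

# Demonstration in a throwaway directory standing in for ./workdir.
demo_dir="$(mktemp -d)"
touch "$demo_dir/pile-ups.db" "$demo_dir/scratch.txt"
chmod a-w "$demo_dir/pile-ups.db"

# Prints the path of pile-ups.db only.
list_protected "$demo_dir"

# Lift the protection for the owner, as described above.
chmod -R u+w "$demo_dir"

# Prints nothing now.
list_protected "$demo_dir"
```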

No internet connection on compute nodes

If you have no internet connection on your compute nodes or even the cluster head node and want to use Singularity for execution, you will need to download the container image manually and put it in a location accessible by all jobs. Assume /path/to/dir is such a location on your cluster. Then download the container image using

# IF internet connection on head node
singularity pull --dir /path/to/dir docker://aludi/dentist:stable

# ELSE (on local machine)
singularity pull docker://aludi/dentist:stable
# copy dentist_stable.sif to cluster
scp dentist_stable.sif cluster:/path/to/dir/dentist_stable.sif

When the image is in place you will need to adjust your configuration in snakemake.yml:

dentist_container: "/path/to/dir/dentist_stable.sif"

Now, you are ready for execution.

Citation

Arne Ludwig, Martin Pippel, Gene Myers, Michael Hiller. DENTIST – using long reads to close assembly gaps at high accuracy. Submitted for peer review. Pre-print at https://doi.org/10.1101/2021.02.26.432990

Maintainer

DENTIST is being developed by Arne Ludwig <ludwig@mpi-cbg.de> at the Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany.

Contributing

Contributions are warmly welcome. Just create an issue or pull request on GitHub. If you submit a pull request please make sure that:

the code compiles on Linux using the current release of dmd,
your code is covered with unit tests (if feasible) and
dub test runs successfully.

It is recommended to install the Git hooks included in the repository to avoid premature pull requests. You can enable all shipped hooks with this command:

git config --local core.hooksPath .githooks/

If you want to enable just a subset, use ln -s .githooks/{hook} .git/hooks. If you want to audit code changes before they get executed on your machine, you can use cp .githooks/{hook} .git/hooks instead.

License

This project is licensed under the MIT License (see LICENSE).

4.0.0	2022-Sep-14
3.0.0	2021-Dec-09
2.0.0	2021-Jun-21
1.0.2	2021-Apr-26
1.0.1	2021-Feb-22

dentist 1.0.2

DENTIST

Table of Contents

Install

Use a Singularity Container (recommended)

Use Pre-Built Binaries

Build from Source

Runtime Dependencies

Usage

Quick execution with Snakemake (and Singularity)

Executing on a Cluster

Manual execution

Example

Local Execution

Cluster Execution

Configuration

How to Choose DENTIST Parameters

Choosing the Read Type

Cluster/Cloud Execution

Troubleshooting

Unexpected `ProtectedOutputException` when running on a single machine

Regular `ProtectedOutputException`

No internet connection on compute nodes

Citation

Maintainer

Contributing

License
