Helping us debug nanopolish¶

Overview¶

Running into errors with nanopolish? To help us debug, we need to be able to reproduce the errors. We can do this by packaging a subset of the files that were used by a nanopolish. We have provided scripts/extract_reads_aligned_to_region.py and this tutorial to help you do exactly this.

Briefly, this script will:

extract reads that align to a given region in the draft genome assembly
rewrite a new BAM, BAI, FASTA file with these reads
extract the FAST5 files associated with these reads
save all these files into a tar.gz file

Workflow¶

Narrow down a problematic region by running nanopolish variants --consensus [...] and changing the -w parameter.
Run the scripts/extract_reads_aligned_to_region.py.
Send the resulting tar.gz file to us by hosting either a dropbox or google drive.

Tutorial - using extraction helper script to create example datsets¶

We extracted a subset of reads for a 2kb region to create the example dataset for the eventalign and consensus tutorial using scripts/extract_reads_aligned_to_region.py. Here is how:

Generated basecalled --reads file:

Basecalled reads with albacore:

read_fast5_basecaller.py -c r94_450bps_linear.cfg -t 8 -i /path/to/raw/fast5/files -s /path/to/albacore-2.0.1/output/directory -o fastq

Merged the different albacore fastq outputs:

cat /path/to/albacore-2.0.1/output/directory/workspace/pass/*.fastq > albacore-2.0.1-merged.fastq

Converted the merged fastq to fasta format:

paste - - - - < albacore-2.0.1-merged.fastq | cut -f 1,2 | sed 's/^@/>/' | tr "\t" "\n" > reads.fasta

Generated --bam file with the draft genome assembly (-g):

Ran canu to create draft genome assembly:

canu \
    -p ecoli -d outdir genomeSize=4.6m \
    -nanopore-raw reads.fasta \

Index draft assembly:

bwa index ecoli.contigs.fasta
samtools faidx ecoli.contigs.fasta

Aligned reads to draft genome assembly with bwa (v0.7.12):

bwa mem -x ont2d ecoli.contigs.fasta reads.fasta | samtools sort -o reads.sorted.bam -T reads.tmp
samtools index reads.sorted.bam

Selected a --window:

Identified the first contig name and chose a random start position:
```
head -3 ecoli.contigs.fasta
```

Output:

>tig00000001 len=4376233 reads=23096 covStat=7751.73 gappedBases=no class=contig suggestRepeat=no suggestCircular=no
AGATGCTTTGAAAGAAACGCAGAATAGATCTCTATGTAATGATATGGAATACTCTGGTATTGTCTGTAAAGATACTAATGGAAAATATTTTGCATCTAAG
GCAGAAACTGATAATTTAAGAAAGGAGTCATATCCTCTGAAAAGAAAATGTCCCACAGGTACAGATAGAGTTGCTGCTTATCATACTCACGGTGCAGATA

As we wanted a 2kb region, we selected a random start position (200000) so our end position was 202000. Therefore our --window was “tig00000001:200000-202000”.

Using the files we created, we ran scripts/extract_reads_aligned_to_region.py, please see usage example below.

Note

Make sure nanopolish still reproduces the same error on this subset before sending it to us.

Usage example¶

python3 extract_reads_aligned_to_region.py \
    --reads reads.fasta \
    --genome ecoli.contigs.fasta \
    --bam reads.sorted.bam \
    --window "tig00000001:200000-202000" \
    --output_prefix ecoli_2kb_region --verbose

Argument name(s)	Req.	Default value	Description
`-b`, `--bam`	Y	NA	Sorted bam file created by aligning reads to the draft genome.
`-g`, `--genome`	Y	NA	Draft genome assembly
`-r`, `--reads`	Y	NA	Fasta, fastq, fasta.gz, or fastq.gz file containing basecalled reads.
`-w`, `--window`	Y	NA	Draft genome assembly coordinate string ex. “contig:start-end”. It is essential that you wrap the coordinates in quotation marks (“).
`-o`, `--output_prefix`	N	reads_subset	Prefix of output tar.gz and log file.
`-v`, `--verbose`	N	False	Use for verbose output with info on progress.

Script overview¶

Parse input files
Assumes readdb file name from input reads file
Validates input
- checks that input bam, readdb, fasta/q, draft genome assembly, draft genome assembly index file exist, are not empy, and are readable
With user input draft genome assembly coordinates, extracts all reads that aligned within these coordinates stores the read_ids. This information can be found from the input BAM.
- uses pysam.AlignmentFile
- uses samfile.fetch(region=draft_ga_coords) to get all reads aligned to region
- if reads map to multiple sections within draft ga it is not added again
Parses through the input readdb file to find the FAST5 files associated with each region that aligned to region
- stores in dictionary region_fast5_files; key = read_id, value = path/to/fast5/file
- path to fast5 file is currently dependent on the user’s directory structure
Make a BAM and BAI file for this specific region
- creates a new BAM file called region.bam
- with pysam.view we rewrite the new bam with reads that aligned to the region…
- with pysam.index we create a new BAI file
Now to make a new FASTA file with this subset of reads
- the new fasta file is called region.fasta
- this first checks what type of sequences file is given { fasta, fastq, fasta.gz, fastq.gz }
- then handles based on type of seq file using SeqIO.parse
- then writes to a new fasta file
Let’s get to tarring
- creates a tar.gz file with the output prefix
- saves the fast5 files in directory output_prefix/fast5_files/
Adds the new fasta, new bam, and new bai file with the subset of reads
Adds the draft genome asssembly and associated fai index file
Performs a check
- the number of reads in the new BAM file, new FASTA file, and the number of files in the fast5_files directory should be equal
Outputs a tar.gz and log file. FIN!