Helping us debug nanopolish¶
Overview¶
Running into errors with nanopolish? To help us debug, we need to be able to reproduce the errors. We can do this by packaging a subset of the files that were used by a nanopolish. We have provided scripts/extract_reads_aligned_to_region.py
and this tutorial to help you do exactly this.
Briefly, this script will:
- extract reads that align to a given region in the draft genome assembly
- rewrite a new BAM, BAI, FASTA file with these reads
- extract the FAST5 files associated with these reads
- save all these files into a tar.gz file
Workflow¶
- Narrow down a problematic region by running
nanopolish variants --consensus [...]
and changing the-w
parameter. - Run the
scripts/extract_reads_aligned_to_region.py
. - Send the resulting
tar.gz
file to us by hosting either a dropbox or google drive.
Tutorial - using extraction helper script to create example datsets¶
We extracted a subset of reads for a 2kb region to create the example dataset for the eventalign and consensus tutorial using scripts/extract_reads_aligned_to_region.py
. Here is how:
Generated basecalled --reads
file:
Basecalled reads with albacore:
read_fast5_basecaller.py -c r94_450bps_linear.cfg -t 8 -i /path/to/raw/fast5/files -s /path/to/albacore-2.0.1/output/directory -o fastq
Merged the different albacore fastq outputs:
cat /path/to/albacore-2.0.1/output/directory/workspace/pass/*.fastq > albacore-2.0.1-merged.fastq
Converted the merged fastq to fasta format:
paste - - - - < albacore-2.0.1-merged.fastq | cut -f 1,2 | sed 's/^@/>/' | tr "\t" "\n" > reads.fasta
Generated --bam
file with the draft genome assembly (-g
):
Ran canu to create draft genome assembly:
canu \ -p ecoli -d outdir genomeSize=4.6m \ -nanopore-raw reads.fasta \
Index draft assembly:
bwa index ecoli.contigs.fasta samtools faidx ecoli.contigs.fasta
Aligned reads to draft genome assembly with bwa (v0.7.12):
bwa mem -x ont2d ecoli.contigs.fasta reads.fasta | samtools sort -o reads.sorted.bam -T reads.tmp samtools index reads.sorted.bam
Selected a --window
:
Identified the first contig name and chose a random start position:
head -3 ecoli.contigs.fasta
Output:
>tig00000001 len=4376233 reads=23096 covStat=7751.73 gappedBases=no class=contig suggestRepeat=no suggestCircular=no
AGATGCTTTGAAAGAAACGCAGAATAGATCTCTATGTAATGATATGGAATACTCTGGTATTGTCTGTAAAGATACTAATGGAAAATATTTTGCATCTAAG
GCAGAAACTGATAATTTAAGAAAGGAGTCATATCCTCTGAAAAGAAAATGTCCCACAGGTACAGATAGAGTTGCTGCTTATCATACTCACGGTGCAGATA
As we wanted a 2kb region, we selected a random start position (200000) so our end position was 202000. Therefore our --window
was “tig00000001:200000-202000”.
Using the files we created, we ran scripts/extract_reads_aligned_to_region.py
, please see usage example below.
Note
Make sure nanopolish still reproduces the same error on this subset before sending it to us.
Usage example¶
python3 extract_reads_aligned_to_region.py \
--reads reads.fasta \
--genome ecoli.contigs.fasta \
--bam reads.sorted.bam \
--window "tig00000001:200000-202000" \
--output_prefix ecoli_2kb_region --verbose
Argument name(s) | Req. | Default value | Description |
---|---|---|---|
-b , --bam |
Y | NA | Sorted bam file created by aligning reads to the draft genome. |
-g , --genome |
Y | NA | Draft genome assembly |
-r , --reads |
Y | NA | Fasta, fastq, fasta.gz, or fastq.gz file containing basecalled reads. |
-w , --window |
Y | NA | Draft genome assembly coordinate string ex. “contig:start-end”. It is essential that you wrap the coordinates in quotation marks (“). |
-o , --output_prefix |
N | reads_subset | Prefix of output tar.gz and log file. |
-v , --verbose |
N | False | Use for verbose output with info on progress. |
Script overview¶
- Parse input files
- Assumes readdb file name from input reads file
- Validates input
- checks that input bam, readdb, fasta/q, draft genome assembly, draft genome assembly index file exist, are not empy, and are readable
- With user input draft genome assembly coordinates, extracts all reads that aligned within these coordinates stores the read_ids. This information can be found from the input BAM.
- uses pysam.AlignmentFile
- uses samfile.fetch(region=draft_ga_coords) to get all reads aligned to region
- if reads map to multiple sections within draft ga it is not added again
- Parses through the input readdb file to find the FAST5 files associated with each region that aligned to region
- stores in dictionary region_fast5_files; key = read_id, value = path/to/fast5/file
- path to fast5 file is currently dependent on the user’s directory structure
- Make a BAM and BAI file for this specific region
- creates a new BAM file called
region.bam
- with pysam.view we rewrite the new bam with reads that aligned to the region…
- with pysam.index we create a new BAI file
- creates a new BAM file called
- Now to make a new FASTA file with this subset of reads
- the new fasta file is called
region.fasta
- this first checks what type of sequences file is given {
fasta
,fastq
,fasta.gz
,fastq.gz
} - then handles based on type of seq file using SeqIO.parse
- then writes to a new fasta file
- the new fasta file is called
- Let’s get to tarring
- creates a
tar.gz
file with the output prefix - saves the fast5 files in directory
output_prefix/fast5_files/
- creates a
- Adds the new fasta, new bam, and new bai file with the subset of reads
- Adds the draft genome asssembly and associated fai index file
- Performs a check
- the number of reads in the new BAM file, new FASTA file, and the number of files in the fast5_files directory should be equal
- Outputs a
tar.gz
andlog
file. FIN!