Quickstart - how to polish a genome assembly

The original purpose of nanopolish was to improve the consensus accuracy of an assembly of Oxford Nanopore Technology sequencing reads. Here we provide a step-by-step tutorial to help you get started.

Requirements:

Download example dataset

You can download the example dataset we will use here:

wget http://s3.climb.ac.uk/nanopolish_tutorial/ecoli_2kb_region.tar.gz
tar -xvf ecoli_2kb_region.tar.gz
cd ecoli_2kb_region

Details:

  • Sample : E. coli str. K-12 substr. MG1655
  • Instrument : MinION sequencing R9.4 chemistry
  • Basecaller : Albacore v2.0.1
  • Region: “tig00000001:200000-202000”
  • Note: Ligation-mediated PCR amplification performed

This is a subset of reads that aligned to a 2kb region in the E. coli draft assembly. To see how we generated these files please refer to the tutorial creating_example_dataset.

You should find the following files:

  • reads.fasta : subset of basecalled reads
  • draft.fa : draft genome assembly
  • draft.fa.fai : draft genome assembly index
  • fast5_files/ : a directory containing FAST5 files
  • ecoli_2kb_region.log : a log file for how the dataset was created with nanopolish helper script (scripts/extract_reads_aligned_to_region.py)

For the evaluation step you will need the reference genome:

curl -o ref.fa https://ftp.ncbi.nih.gov/genomes/archive/old_genbank/Bacteria/Escherichia_coli_K_12_substr__MG1655_uid225/U00096.ffn

Analysis workflow

The pipeline below describes the recommended analysis workflow for larger datasets. In this tutorial, we will run through the basic steps of the pipeline for this smaller (2kb) dataset.

nanopolish-tutorial-workflow

Data preprocessing

nanopolish needs access to the signal-level data measured by the nanopore sequencer. To begin, we need to create an index readdb file that links read ids with their signal-level data in the FAST5 files:

nanopolish index -d fast5_files/ reads.fasta

We get the following files: reads.fasta.index, reads.fasta.index.fai, reads.fasta.index.gzi, and reads.fasta.index.readdb.

Compute the draft genome assembly using canu

As computing the draft genome assembly takes a few hours we have included the pre-assembled data for you (draft.fa). We used the following parameters with canu:

canu \
    -p ecoli -d outdir genomeSize=4.6m \
    -nanopore-raw albacore-2.0.1-merged.fastq

Compute a new consensus sequence for a draft assembly

Now that we have reads.fasta indexed with nanopolish index, and have a draft genome assembly draft.fa, we can begin to improve the assembly with nanopolish. Let us get started!

First, we align the original reads (reads.fasta) to the draft assembly (draft.fa) and sort alignments:

minimap2 -ax map-ont -t 8 draft.fa reads.fasta | samtools sort -o reads.sorted.bam -T reads.tmp
samtools index reads.sorted.bam

Checkpoint: we can do a quick check to see if this step worked. The bam file should not be empty.

samtools view reads.sorted.bam | head

Then we run the consensus algorithm. For larger datasets we use nanopolish_makerange.py to split the draft genome assembly into 50kb segments, so that we can run the consensus algorithm on each segment in parallel. The output would be the polished segments in fasta format. Since our dataset is only covering a 2kb region, we skip this step and use the following command:

nanopolish variants --consensus -o polished.vcf \
    -w "tig00000001:200000-202000" \
    -r reads.fasta \
    -b reads.sorted.bam \
    -g draft.fa

We are left with: polished.vcf.

Note: As of v0.10.1, nanopolish variants --consensus only outputs a VCF file instead of a fasta sequence.

To generate the polished genome in fasta format:

nanopolish vcf2fasta --skip-checks -g draft.fa polished.vcf > polished_genome.fa

We only polished a 2kb region, so let’s pull that out:

samtools faidx polished_genome.fa
samtools faidx polished_genome.fa "tig00000001:200000-202000" > polished.fa

Evaluate the assembly

To analyze how nanopolish performed improving the accuracy we use MUMmer. MUMmer contains “dnadiff”, a program that enables us to see a report on alignment statistics. With dnadiff we can compare the two different assemblies.

mkdir analysis
MUMmer3.23/dnadiff --prefix analysis/draft.dnadiff ref.fa draft.fa
MUMmer3.23/dnadiff --prefix analysis/polished.dnadiff ref.fa polished.fa

This generates draft.dnadiff.report and polished.dnadiff.report along with other files. The metric we are interested in is AvgIdentity under [ Alignments ] 1-to-1, which is a measurement of how similar the genome assemblies are to the reference genome. We expect to see a higher value for the polished assembly than the draft ( 99.90 vs 99.53 ), concluding that the nanopolish consensus algorithm worked successfully.

Note

The example dataset was PCR amplified causing a loss of methylation information. We recommend using the -q dam,dcm with nanopolish variants --consensus if you have data with methylation information to account for known bacterial methyltransferases.