Quickstart - how to estimate poly(A) tail lengths from nanopore native RNA reads

Owing to homopolymer effects and the proximity to the sequencing adapters, the polyadenylated tails of reads obtained from nanopore native RNA sequencing are improperly basecalled, making their lengths difficult to quantify. We developed the polya subprogram to use an alternative statistical model to estimate these tail lengths.

In this quickstart tutorial, we’ll show you how to estimate polyadenylated tail lengths step-by-step, starting from nothing but raw fast5 files. We’ll basecall the fast5 files with Oxford Nanopore Technologies’ albacore basecaller, before aligning the resulting reads with minimap2, indexing the files with nanopolish index, and finally segmenting the reads and calling the tail lengths with nanopolish polya.

We’ll be following the steps taken in our benchmarking analysis workflow that accompanies our publication. In each step below, we’ll generate files in the working directory instead of making subdirectories, except in the case of large datasets.

Software requirements

Download raw fast5 data and basecall

Let’s start by downloading a dataset of fast5 files from the European Nucleotide Archive. We’ll download a tarball of fast5 files containing reads that are known to have polyadenylated tail lengths of 30nt.

mkdir data && mkdir data/fastqs
wget ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR276/ERR2764784/30xpolyA.tar.gz -O 30xpolyA.tar.gz && mv 30xpolyA.tar.gz data/
tar -xzf data/30xpolyA.tar.gz -C data/
guppy_basecaller -c rna_r9.4.1_70bps_hac.cfg -i data/30xpolyA/fast5/pass -s data/fastqs
cat data/fastqs/*.fastq > data/30xpolyA.fastq

In the above, change the value of the -f and -k arguments based on your flow-cell and sequencing kit, as the basecaller’s accuracy is highly dependent upon these settings.

Our directory structure should now look something like this:

(current directory)
|- data/
|-- 30xpolyA.fastq
|-- 30xpolyA.tar.gz
|-- 30xpolyA/
|--- fast5/
|---- pass/
|----- raw_read1.fast5
|----- (... more raw fast5's here ...)
|-- fastqs/
|--- workspace/
|---- pass/

Index with nanopolish index

Let’s construct an index for our reads with nanopolish’s index subprogram. This constructs a fast lookup table that tells our program where to find the raw fast5 file for each basecalled read.

nanopolish index --directory=data/30xpolyA/fast5/pass --sequencing-summary=data/fastqs/sequencing_summary.txt data/30xpolyA.fastq

This should generate a collection of files in the same directory that contains the 30xpolyA.fastq.

Align with minimap2 and format the BAM file

Now we’ll align our basecalled mRNA sequences in the fastq file with our reference. First download a reference fasta:

wget https://raw.githubusercontent.com/paultsw/polya_analysis/master/data/references/enolase_reference.fas
wget https://raw.githubusercontent.com/paultsw/polya_analysis/master/data/references/enolase_reference.fas.fai
mv enolase_reference.* data/

Note that our directory structure should now look like this:

(current directory)
|- data/
|-- enolase_reference.fas
|-- enolase_reference.fai
|-- (... same structure as above ...)

Let’s run minimap2 to align our basecalled reads to this reference and generate a collection of SAM/BAM files with samtools:

minimap2 -a -x map-ont data/enolase_reference.fas data/30xpolyA.fastq | samtools view -b - -o data/30xpolyA.bam
cd data && samtools sort -T tmp -o 30xpolyA.sorted.bam 30xpolyA.bam && samtools index 30xpolyA.sorted.bam && cd ..

Note that we used -x map-ont, which is typically for unspliced reads (e.g. genomic DNA) coming from nanopore sequencers. In typical native mRNA sequencing, you should use -x splice, which is a splice-aware setting (and uses a different gap cost in the alignment). We’re using map-ont due to the fact that our reads come from a control dataset with no splicing.

There should be three more files in the data directory now: 30xpolyA.bam, 30xpolyA.sorted.bam, and 30xpolyA.sorted.bam.bai.

Segmentation and tail length estimation with nanopolish polya

Finally, we can run the polyadenylation estimator:

nanopolish polya --threads=8 --reads=data/30xpolyA.fastq --bam=data/30xpolyA.sorted.bam --genome=data/enolase_reference.fas > polya_results.tsv

Set the –threads flag to the number of parallel threads you want to use. Generally speaking, a larger number of threads tends to lower the compute time, but there are diminishing returns to a higher value and performance can actually decrease if your CPU is incapable of supporting your desired number of parallel threads. The best number of threads to use is highly dependent upon your hardware.

Interpreting the output TSV file

We’ll end this quickstart with a look at the output of the polya program. Let’s look at the top five lines of the polya_results.tsv file we’ve just generated:

head -20 polya_results.tsv | column -t
readname                              contig   position  leader_start  adapter_start  polya_start  transcript_start  read_rate  polya_length  qc_tag
d6f42b79-90c6-4edd-8c8f-8a7ce0ac6ecb  YHR174W  0         54.0          3446.0         7216.0       8211.0            130.96     38.22         PASS
453f3f3e-d22f-4d9c-81a6-8576e23390ed  YHR174W  0         228.0         5542.0         10298.0      11046.0           130.96     27.48         PASS
e02d9858-0c04-4d86-8dba-18a47d9ac005  YHR174W  0         221.0         1812.0         7715.0       8775.0            97.16      29.16         PASS
b588dee2-2c5b-410c-91e1-fe8140f4f837  YHR174W  0         22.0          8338.0         13432.0      14294.0           130.96     32.43         PASS
af9dfee2-1711-4083-b109-487b99895e0a  YHR174W  0         889.0         3679.0         6140.0       7168.0            130.96     39.65         PASS
93f98a86-3b18-48cf-8c4d-15cf277911e2  YHR174W  0         144.0         1464.0         5615.0       6515.0            120.48     30.96         SUFFCLIP
af9dfee2-1711-4083-b109-487b99895e0a  YHR174W  0         889.0         3679.0         6140.0       7168.0            130.96     39.65         SUFFCLIP
93f98a86-3b18-48cf-8c4d-15cf277911e2  YHR174W  0         144.0         1464.0         5615.0       6515.0            120.48     30.96         PASS
ca8d4059-9d82-45ee-aa07-4b8b351618b3  YHR174W  0         1.0           2157.0         4255.0       5862.0            111.56     54.48         PASS
3493c123-78d4-4f7c-add0-cbb249aef00a  YHR174W  0         78.0          1938.0         4829.0       5491.0            136.91     25.05         PASS
f5ff1802-3fdd-479a-8888-c72de01bbaea  YHR174W  0         150.0         3476.0         7233.0       7932.0            130.96     25.35         PASS
bb929728-2ed8-42b0-a5a5-eea4bfd62673  YHR174W  0         91.0          1061.0         6241.0       6910.0            111.56     19.74         PASS
17cf3fef-1acb-4045-8252-e9c00fedfb7c  YHR174W  0         447.0         6004.0         20058.0      20964.0           100.40     25.17         ADAPTER
e3e38de6-8a99-4029-a067-261f470517ca  YHR174W  0         41.0          1588.0         4057.0       5303.0            130.96     49.13         PASS
66f55b56-c22e-4e6d-999e-50687bed6fb7  YHR174W  0         191.0         3160.0         9218.0       10030.0           125.50     28.79         PASS
56c116d7-9286-4b57-8329-e74928b11b38  YHR174W  0         13.0          1516.0         5845.0       6773.0            130.96     35.30         PASS
5ca1392c-c48f-4135-85d3-271bd4ee7a13  YHR174W  0         1.0           1976.0         4854.0       5947.0            136.91     44.64         PASS
66b5a0ef-b8e6-475e-bf20-04b96154a67f  YHR174W  0         98.0          3847.0         7066.0       7925.0            120.48     29.32         PASS
34bf2187-5816-4744-8e6a-3250b5247e02  YHR174W  0         1.0           2897.0         6885.0       7547.0            125.50     22.54         PASS

Each row corresponds to the output for a given read. The columns have the following interpretation: * readname refers to the unique ID associated to this particular read. This string is also used to look up the corresponding fast5 file, e.g. by looking

through the readdb file generated by nanopolish index.
  • contig refers to the reference sequence that this read aligns to, and is taken from the BAM file.

  • position is the 5’ starting position of the alignment to the reference sequence, and also comes from the BAM file.

  • leader_start, adapter_start, polya_start, and transcript_start are the indices of the signal segmentation generated by the underlying model within nanopolish. Briefly, there are four biologically-meaningful regions of the raw sequence of electrical current readings within each fast5 file; these four entries denote the starting index of each of these consecutive regions. The indices start from 0 and are oriented in the 3’-to-5’ direction (due to the inherent orientation of the native RNA nanopore sequencing protocol). A full exposition of this segmentation algorithm is available in the `supplementary note<https://www.biorxiv.org/content/biorxiv/suppl/2018/11/09/459529.DC1/459529-2.pdf>`_ to our associated publication.

  • read_rate is the estimated translocation rate (in units of nucleotides/second) of the read through the pore. The translocation rate is non-uniform during the sequencing process of even a single molecule, so this is ultimately a summary statistic of the dynamic, time-varying rate.

  • polya_length is the estimated polyadenylated tail length, in number of nucleotides. That this value is a float rather than an integer reflects the fact that our estimated tail length is the output of an estimator based on the translocation rate.

  • qc_tag is an additional flag used to indicate the validity of the estimate. Generally speaking, you should only use rows of the output file with this value set to PASS; all other rows with (e.g.) the qc_tag set to SUFFCLIP, ADAPTER, etc. display signs of irregularity that indicate that we believe the estimate to be unreliable. You can easily filter away all rows with the tag set to anything other than PASS using a grep:

    grep 'PASS' polya_results.tsv > polya_results.pass_only.tsv