Kristina Atanasova, Bastien Paré, Martin A. Smith



RAWDAPTER: IMPROVING THE SIGNAL-LEVEL ANALYSIS FOR NANOPORE RNA SEQUENCING

INTRODUCTION

Nanopore raw signals are measured by observing characteristic disruptions of electric current as an RNA molecule transits through a protein nanopore. The resulting signal is typically converted to a string of nucleotides (basecalled) using probabilistic models that are known to introduce errors, which has stigmatised the technology [1]. However, several studies have demonstrated that direct analysis of raw signals can increase the precision and sensitivity of RNA sequencing, including the detection of RNA modifications [2,3]. To improve the signal-based analysis of single-molecule nanopore sequencing data, we present Rawdapter, a computational strategy that quickly distinguishes the biologically informative signal from the uninformative signal in raw nanopore sequencing data, then locates and removes the superfluous signal.

We first developed a bioinformatic pipeline to create positive and negative controls for training a deep learning model (Fig. 3). We created two groups of reads, one for each possible extremity of a cDNA, by filtering the alignments. The filters rely on ‘buffer’ thresholds that accommodate the adapter sequences, which do not align to the reference. For each filter, we extracted the relevant alignments. F5C [7] was used to determine the positions at which each read's signal aligns to a target sequence. SLOW5tools [8] was used to convert FAST5 files into BLOW5 files and to retrieve the raw signal for each read. A custom script extracted the uninformative signal, which forms the positive sample, and the informative signal, which forms the negative sample. Finally, we padded the short signals, applied z-score normalization, and merged both samples into a single array.
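The last step above (padding, z-score normalization, and merging) can be sketched as follows. This is a minimal illustration in NumPy, not the script used for the poster: the function names, the zero padding value, the right-padding direction, and the choice to normalize each signal before padding are all assumptions.

```python
import numpy as np

def zscore(signal):
    """Scale one raw signal to zero mean and unit variance."""
    signal = np.asarray(signal, dtype=np.float32)
    std = signal.std()
    if std == 0:  # constant signal: avoid dividing by zero
        return np.zeros_like(signal)
    return (signal - signal.mean()) / std

def pad_to_length(signal, length, value=0.0):
    """Right-pad (or truncate) a 1D signal to a fixed length (assumed layout)."""
    out = np.full(length, value, dtype=np.float32)
    n = min(len(signal), length)
    out[:n] = signal[:n]
    return out

def build_dataset(positive, negative, length):
    """Normalize, pad, and merge both samples into one array plus labels.

    Label 1 marks the uninformative (positive) signals and label 0 the
    informative (negative) signals, following the poster's convention.
    """
    signals = list(positive) + list(negative)
    X = np.stack([pad_to_length(zscore(s), length) for s in signals])
    y = np.array([1] * len(positive) + [0] * len(negative), dtype=np.int8)
    return X, y
```

Normalizing before padding keeps the padded zeros from shifting the mean and variance of each signal; the reverse order is equally possible and would give different values.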

RESULTS

The preprocessing pipeline produced 1 791 962 unique mappers. We used kernel density estimate (KDE) plots to estimate the ‘buffer’ values, which correspond to the 3rd quartile of each distribution of basecalled adapter signal sizes (Fig. 5). The subsequent preprocessing steps produced 338 446 (5’ +) and 480 641 (5’ -) signal samples for the positive dataset, and 337 804 (5’ +) and 475 750 (5’ -) signal samples for the negative dataset.
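As a small worked example of the buffer estimate, the 3rd quartile of a distribution of alignment coordinates can be computed as below. The function name and the toy coordinates are hypothetical; the poster derived its values from KDE plots made in R.

```python
import numpy as np

def buffer_from_third_quartile(coordinates):
    """Return the 75th percentile (3rd quartile) of a set of coordinates."""
    return float(np.percentile(coordinates, 75))

# Toy coordinates only; these are not values from the poster.
toy_start_positions = [0, 10, 20, 30, 40]
toy_buffer = buffer_from_third_quartile(toy_start_positions)
```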

We produced two 1D CNN models, one for the 5’ + end and one for the 5’ - end, each capable of distinguishing informative from uninformative signals. Both models reached high validation accuracy and low validation loss after a short training time (Fig. 6), and both achieved high accuracy on the testing sets (Fig. 6). These results show that 1D CNNs can successfully classify long cDNA nanopore signals.

ACKNOWLEDGEMENTS

Special thanks to the members of the Smith Lab and the Research Center at CHU Ste-Justine.

REFERENCES

CONCLUSION

Rawdapter is a critical step in raw signal preprocessing: it identifies, locates, and removes adapters, which benefits subsequent analyses by stripping uninformative content from raw signals. Its ability to process the signal faster than current state-of-the-art basecalling and alignment tools could help improve the prediction of RNA extremities and the detection of RNA modifications.

MATERIALS AND METHODS

The data used are synthetic RNA sequences representing spliced mRNA isoforms that align to 78 artificial gene loci encoded on an in silico artificial chromosome [4]. During library preparation, we used the PCS109 sequencing kit to generate cDNA strands by reverse transcription and PCR amplification. Adapters were then ligated at the ends, including one carrying the motor protein that facilitates passage of the cDNA through the pores (Fig. 1).

The reads were sequenced on a MinION, an Oxford Nanopore Technologies (ONT) sequencing device, which generated raw electrical current data (Fig. 2) and basecalled data; the reads were then re-basecalled with Guppy 5.0.7. We used Pychopper [5] to identify full-length cDNA reads and to remove chimeric reads. The reads were aligned to the reference transcriptome with minimap2 [6], and we extracted the unique mappers.

Fig. 1: A cDNA strand after library preparation. The uninformative signal (dotted lines) includes the sequencing adapter (minimal) and the RT-PCR primers (TSO+, TSO-, VNP+, and VNP-), which are entirely basecalled. The motor protein (in green) is attached at the 5’ end of each strand. During nanopore cDNA sequencing, both strands are sequenced separately from 5’ to 3’.

Fig. 3: Preprocessing pipeline. Bioinformatic processes used to create the datasets for training and testing the deep learning model. Steps 8 to 13 were repeated for the two groups of reads, that is, full-length cDNA reads aligned to the plus and minus strands of the cDNA. Steps 15 to 18 were then repeated for the two groups of reads to create two samples of signals for each group: one containing the raw signals that did not align to a reference transcript, and another containing raw informative signals sampled with a fixed window size.

FUTURE WORK

Future work will aim to develop a cDNA adapter signal localizer that returns a position within the signal in less than 200 ms. It will also be important to investigate, with technical replicates, whether the models are overfitted. Such a tool would help implement new interactive sequencing methods for nanopore cDNA runs, which have great potential for clinical applications and biomedical research, particularly in the development of diagnostic tools and the identification of biomarkers.

Fig. 2: An example of a raw nanopore signal. The first 5000 time points of this raw nanopore signal show different types of signals. The two regions of lower signal amplitude result from adapter stalls and poly-A/poly-T homopolymer sequences.

To improve the effectiveness of raw signal comparisons, we used a 1D convolutional neural network (1D CNN). This type of network usually consists of an input layer that receives the raw signal, convolutional layers, pooling layers, dropout layers, a fully connected layer, and an output layer with as many neurons as there are classes (Fig. 4) [9]. We implemented this architecture with the Keras library integrated in TensorFlow 2.6.0 (with GPU support). We split the data 70:10:20 into training, validation, and testing sets, respectively. We trained the models on the positive and negative samples and assessed them by their classification accuracy and loss on the validation set.
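The 70:10:20 split described above can be sketched in NumPy as follows. The function name, the shuffling, and the fixed seed are assumptions for illustration; the poster does not state how the split was implemented.

```python
import numpy as np

def split_70_10_20(X, y, seed=0):
    """Shuffle and split arrays into training, validation, and testing sets."""
    X = np.asarray(X)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))   # shuffled sample indices
    n_train = len(X) * 7 // 10        # 70% for training
    n_val = len(X) // 10              # 10% for validation
    train, val, test = np.split(order, [n_train, n_train + n_val])
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])
```

Shuffling before splitting matters when positive and negative samples are stored one after the other in a merged array, since an unshuffled split would leave some sets with a single class.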

Fig. 4: Architecture of a 1D convolutional neural network (CNN). A typical CNN architecture includes an input layer and convolutional layers, which apply filters to extract potential features. The features are combined by a fully connected (dense) layer, and the output layer finally assigns the input to one of the output classes.
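To make the layer sequence in Fig. 4 concrete, here is a forward pass written in plain NumPy with random, untrained weights. It is not the Keras model used for the poster: the layer sizes, the parameter names, and the single convolutional layer are assumptions, and dropout is omitted because it is inactive at inference time.

```python
import numpy as np

def conv1d(x, kernels, bias):
    """Valid 1D convolution. x: (length, in_ch); kernels: (width, in_ch, out_ch)."""
    width = kernels.shape[0]
    n_out = x.shape[0] - width + 1
    windows = np.stack([x[i:i + width] for i in range(n_out)])  # (n_out, width, in_ch)
    return np.einsum("lwc,wco->lo", windows, kernels) + bias

def max_pool(x, size):
    """Non-overlapping max pooling along the length axis."""
    n = (x.shape[0] // size) * size   # drop the incomplete last window
    return x[:n].reshape(-1, size, x.shape[1]).max(axis=1)

def forward(signal, params):
    """Input -> convolution + ReLU -> pooling -> flatten -> dense -> softmax."""
    x = np.asarray(signal, dtype=np.float64).reshape(-1, 1)
    x = np.maximum(conv1d(x, params["kernels"], params["kernel_bias"]), 0.0)
    x = max_pool(x, params["pool"])
    x = x.reshape(-1)                                   # flatten layer
    logits = x @ params["dense"] + params["dense_bias"]  # one logit per class
    e = np.exp(logits - logits.max())                   # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(0)
toy_params = {
    "kernels": rng.normal(size=(5, 1, 4)),   # 4 filters of width 5
    "kernel_bias": np.zeros(4),
    "pool": 2,
    "dense": rng.normal(size=(32, 2)),       # (20 - 5 + 1) // 2 * 4 = 32 features
    "dense_bias": np.zeros(2),
}
toy_probabilities = forward(rng.normal(size=20), toy_params)
```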

To locate and remove the uninformative signal, we use a sliding window that scans the raw signal from the beginning and finds the boundary of the uninformative signal.
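A minimal sketch of such a sliding-window scan is given below. The function names, the window and step sizes, and the stopping rule (stop at the first window classified as informative) are assumptions, and `is_uninformative` is a hypothetical stand-in for a trained classifier.

```python
import numpy as np

def locate_boundary(signal, is_uninformative, window=500, step=100):
    """Scan from the start and return the index where uninformative signal ends.

    is_uninformative is any callable that takes one window of signal and
    returns True when the window is classified as uninformative.
    """
    signal = np.asarray(signal)
    boundary = 0
    for start in range(0, len(signal) - window + 1, step):
        if is_uninformative(signal[start:start + window]):
            boundary = start + window   # extend the uninformative region
        else:
            break                       # first informative window: stop scanning
    return boundary

def trim(signal, boundary):
    """Remove everything before the boundary."""
    return np.asarray(signal)[boundary:]
```

With overlapping windows, the boundary is resolved to the nearest multiple of `step`, so a smaller step gives a finer position at the cost of more classifier calls.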

[Fig. 4 diagram: convolutional layers, flatten layer, dense layers, output layer]

[Fig. 3 diagram: preprocessing pipeline, steps 1 to 18. Step labels recovered from the figure, with tools where given: Pychopper (fastq); get read IDs (txt); seqtk subseq (fastq); minimap2 -k 14 -t 30 against sequins.fa (SAM); htsbox samview -pS (SAM to PAF); subset SAM file with unique mappers; create KDE plots in R (ggplot2); estimate values of buffer; filter the PAF file; get read IDs (txt); subset SAM file and convert to BAM; sort and index BAM files (Samtools); BAM to FASTQ (htsbox bam2fq); index FAST5 files (F5C index); align the FASTQ reads to the raw signal (F5C eventalign, tsv); convert FAST5 to SLOW5 and merge (SLOW5tools f2s and merge); extract the reads' raw signal (SLOW5tools get); extract the raw uninformative signals (tsv); extract the raw informative signals (tsv); normalization, padding, and merging (npy).]

[Fig. 3 filter conditions on PAF columns (step 8): for 5' + ends, $3 >= $8, $3 <= B1, and $8 <= B2; for 5' - ends, $3 <= B3 and $7-$9 <= B4.]

[Fig. 5 panels: A) Kernel density estimates by group for 5' + reads (x-axis: coordinates, 0 to 200; y-axis: density; groups: start position in read, start position in target). B) Kernel density estimates by group for 5' - reads (x-axis: coordinates, 0 to 200; y-axis: density; groups: start position in read, difference between sequence length and end position in target).]

Fig. 6: Comparison of model accuracy and loss. Two 1D CNN models were trained, one for each group of signals. Each balanced dataset had the same number of positive and negative samples. A) Model accuracy for the 5’ + dataset. B) Model loss for the 5’ + dataset. C) Model accuracy for the 5’ - dataset. D) Model loss for the 5’ - dataset. E) Performance on the validation and testing sets.

Fig. 5: Kernel density estimate (KDE) plots to estimate filtering values. To create two groups of unique cDNA reads, each aligned to one of the 5’ transcript ends, we filtered the full-length cDNA alignments after replacing the ‘buffers’ with the values chosen for each possible end. We set those values to the 3rd quartile of each distribution.
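The filter conditions listed in Fig. 3 (step 8) refer to 1-based PAF columns: $3 is the query start, $7 the target length, $8 the target start, and $9 the target end. They can be sketched in Python as follows. The function names are hypothetical, and the buffer values B1 to B4 are left as parameters because their numeric values are not stated in the text.

```python
def parse_paf_line(line):
    """Extract the PAF fields used by the filters (PAF columns are 1-based)."""
    fields = line.rstrip("\n").split("\t")
    return {
        "query_start": int(fields[2]),     # column 3
        "strand": fields[4],               # column 5
        "target_length": int(fields[6]),   # column 7
        "target_start": int(fields[7]),    # column 8
        "target_end": int(fields[8]),      # column 9
    }

def keep_5p_plus(rec, b1, b2):
    """5' + ends: $3 >= $8, $3 <= B1, and $8 <= B2."""
    return (rec["query_start"] >= rec["target_start"]
            and rec["query_start"] <= b1
            and rec["target_start"] <= b2)

def keep_5p_minus(rec, b3, b4):
    """5' - ends: $3 <= B3 and $7 - $9 <= B4."""
    return (rec["query_start"] <= b3
            and rec["target_length"] - rec["target_end"] <= b4)
```

The two quantities compared against B2 and B4 match the second group in each KDE panel: the start position in the target for 5’ + reads, and the difference between sequence length and end position in the target for 5’ - reads.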

[Fig. 2 region labels: ‘stall’, adapter, poly-A/poly-T, transcript]

1. Rang FJ, Kloosterman WP, de Ridder J. From squiggle to basepair: computational approaches for improving nanopore sequencing read accuracy. Genome Biol. 2018;19(1):90.
2. Kovaka S, Fan Y, Ni B, Timp W, Schatz MC. Targeted nanopore sequencing by real-time mapping of raw electrical signal with UNCALLED. Nature Biotechnology. 2021;39(4):431-41.
3. Payne A, Holmes N, Clarke T, Munro R, Debebe BJ, Loose M. Readfish enables targeted nanopore sequencing of gigabase-sized genomes. Nature Biotechnology. 2021;39(4):442-50.
4. Hardwick SA, Chen WY, Wong T, Deveson IW, Blackburn J, Andersen SB, et al. Spliced synthetic genes as internal controls in RNA sequencing experiments. Nature Methods. 2016;13(9):792-8.
5. https://github.com/nanoporetech/pychopper
6. Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018;34(18):3094-100.
7. Gamaarachchi H, Lam CW, Jayatilaka G, Samarakoon H, Simpson JT, Smith MA, et al. GPU accelerated adaptive banded event alignment for rapid comparative nanopore signal analysis. BMC Bioinformatics. 2020;21(1):343.
8. Gamaarachchi H, Samarakoon H, Jenner SP, Ferguson JM, Amos TG, Hammond JM, et al. SLOW5: a new file format enables massive acceleration of nanopore sequencing data analysis. bioRxiv. 2021:2021.06.29.450255.
9. Kiranyaz S, Avci O, Abdeljaber O, Ince T, Gabbouj M, Inman DJ. 1D convolutional neural networks and applications: A survey. Mechanical Systems and Signal Processing. 2021;151:107398.

CHU Sainte-Justine Research Centre, Montreal, Canada
Department of Biochemistry and Molecular Medicine, Faculty of Medicine, Université de Montréal, Montréal, Canada

contact : [email protected] ; https://www.therealsmithlab.com/

[Fig. 1 diagram labels: primers TSO+, TSO-, VNP+, VNP-; strand ends 5’ +, 3’ -, 3’ +, 5’ -; poly-T and poly-A tracts]