Improved methods for virus detection and discovery in metagenomic sequence data

Amanj Bajalan

Degree project in bioinformatics, 2020
Examensarbete i bioinformatik 45 hp till masterexamen, 2020
Biology Education Centre and Department of Cell and Molecular Biology (CMB), Biomedicum, Karolinska Institute
Supervisor: Björn Andersson, Professor, CMB, Karolinska Institute


Abstract

Our primary goal in this thesis is to improve methods for finding viruses in metagenomic sequence data. The focus is on Torque Teno virus (TTV) sequences, but other data sets are also used. We are interested not only in finding TTV sequences, but also in classifying them further into different TTV types. The second objective of our research is to examine viruses and find previously unknown associations with diseases and human health in general. The thesis is mostly aimed at developing software and utilizing existing bioinformatic tools to extend existing pipelines produced by the Andersson Lab and to create new ones. The methods used include different alignment strategies, a range of outputs, calculations, filtering of unwanted data and selection of sequences. We used Nextflow to build the pipelines, and HTML, charts and tables to visualize data in a more human-readable format for other project members associated with the Andersson Lab. The most useful results are currently based on data from the Discovery pipeline; the other pipelines are fully functional, but need more time for testing data output and validating results. We used Integrative Genomics Viewer (IGV) to validate results produced by the Discovery pipeline, using a reference and read coverage to see how reliable each result is. Much research and further improvement remains to be done, as bioinformatics is a field that is always progressing. We have produced a large number of results based on large clinical samples. Most of the visualizations produced from the pipeline results have proven useful for the various ongoing research projects in the Andersson Lab.


Virus Detection in Next-Generation Sequencing

Popular science summary

Amanj Bajalan

New technologies are improving rapidly and produce more genetic data at lower cost. Next-generation sequencing (NGS) is an umbrella term for most of the modern sequencing technologies, which have increased access to genetic material that was previously difficult to obtain. This has widened the field, changed how research questions are formulated and increased the need for methods in bioinformatics. Bioinformatics is needed to interpret DNA sequencing data so that we can draw conclusions about biology.

NGS technologies have allowed us to study human microbiome communities. The microbiome is all the genetic content of the microorganisms that inhabit our bodies; together, these microorganisms are called the human microbiota. Human diseases and behavioral disorders can be connected to the microbiome in the body, so a deeper understanding of the host-microbiome interaction may lead to improvements in treatment and in public health in general. A typical microbiome contains a diverse and dynamic community composed of bacteria, viruses and fungi. Humans carry microorganisms on most body surfaces that are exposed to the outside world, for example the gut microbiota and the skin microbiota. Every individual has a unique combination of microorganisms, which live in a symbiotic relationship with the host. They protect against foreign pathogens and influence our immune system.

Our primary goal is to identify and further classify known and unknown viruses. We focus on two types of virus: those that infect the host and those that infect bacteria. The viruses that infect bacteria are called bacteriophages and are important because they can affect humans indirectly, mainly through the microbiome. In this thesis, we study Torque Teno virus (TTV). This is a small DNA virus that exists in both healthy and sick individuals, but the amount of virus present in the blood varies depending on the state of the immune system. There are several theories about TTV, and we are interested in seeing whether TTV has any association with chronic diseases such as cancer. If such an association exists, it may become possible to implement treatment that counteracts the development of disease. My task is to develop tools and methods to process this type of data from large clinical samples.



Table of contents

1. Introduction
2. Background
2.1 Torque Teno Virus
3. Goals
4. Material
4.1 Data
4.1.1 TTV data sets
4.1.1.1 Detailed information about TTV Bonk data
4.1.2 T1D ABIS data set
4.1.3 POTS 1 data set
4.1.4 MAARS data sets
4.2 Databases and references
4.3 Hardware
5. Methods
5.1 Pipelines
5.1.1 Pipeline information
5.1.1.1 Softwares in command line
5.1.1.2 Programming languages
5.1.2 Preprocessing pipeline
5.1.3 New version of Discovery pipeline
5.1.3.1 Compiling data and creating table scripts
5.1.3.1.1 Viral parser of reads and table creation
5.1.3.1.2 Viral parser of contigs and table creation
5.1.3.1.2.1 Simplified version of contig table
5.1.4 Development of Extended Discovery pipeline
5.1.5 Development of TTV map pipeline
5.1.5.1 Initial design
5.1.5.2 Final product of pipeline
5.1.5.3 Script for creating visualization based on results from TTV map pipeline
5.1.6 Procedures when running different datasets
6. Results
6.1 Finished table scripts based on Discovery pipeline
6.1.1 TTV Bonk
6.1.1.1 Results from TTV Bonk table
6.1.1.1.1 Other viral sequences
6.1.2 TTV 1b2
6.1.2.1 Results from TTV 1b2 table
6.1.3 T1D ABIS
6.1.3.1 Results from T1D ABIS table
6.1.3.1.1 Other viral sequences
6.2 Finished HTML pie charts based on TTV pipeline
6.2.1 Summary of all data samples for TTV Bonk
6.2.2 Example of one HTML file from the TTV Bonk dataset
6.2.3 T1D ABIS pie chart data
6.2.4 TTV 1b2 pie chart data
7. Discussion
7.1 Implementation and testing
7.2 Goals
7.3 Future work
7.4 Conclusions
8. References
9. Appendix
Appendix A
Appendix B (B1, B2)
Appendix C (C1, C2, C3, C4)
Appendix D (D1, D2, D3, D4)
Appendix E (E1, E2)
Appendix F
Appendix G (G1, G2, G3, G4)
Appendix H (H1, H2)
Appendix I
Appendix J (J1, J2)
Appendix K (K1, K2)
Appendix L (L1, L2)
Appendix M (M1, M2)


Abbreviations

AA: Aplastic anemia
ABIS: Alla barn i sydöstra Sverige (All children from southeast of Sweden)
ALL: Acute lymphoblastic leukemia
AML: Acute myeloid leukemia
B-ALL: B-cell ALL
B-cell: B lymphocytes
bp: base pair
BWA: Burrows-Wheeler Aligner
CNS: Central nervous system
E-utilities: Entrez Programming Utilities
FVE: Fast Virome Explorer
Hg38/GRCh38: Genome Reference Consortium Human Build 38
HL: Hodgkin lymphoma
IGV: Integrative Genomics Viewer
NCBI: National Center for Biotechnology Information
NGS: Next Generation Sequencing
NHL: Non-Hodgkin lymphoma
nr: Non-redundant protein sequences
nt: Partially non-redundant nucleotide sequences
ORFs: Open Reading Frames
POTS: Postural orthostatic tachycardia syndrome
Pre B-ALL: Pancytopenic prodrome B-ALL
T1D: Type 1 Diabetes
T-cell: T lymphocyte
TTV: Torque Teno virus
UML: Unified Modeling Language
UTR: Untranslated Region


1. Introduction

The rapid increase in the capacity of next-generation DNA sequencing technologies (NGS), coupled with improved bioinformatics methods, has enabled large-scale metagenomics studies where microorganisms are characterized using shotgun sequencing of environmental samples without prior cultivation. The ability of viruses and other microorganisms to cause human disease and their importance for our normal well-being are well-known, but we have barely begun to understand the intricacies of the human microflora and how it is regulated, as evidenced by a multitude of new emerging infectious diseases. It is widely recognized that the multitude of viral species and variants has a profound influence on human health and disease. Apart from all the traditional aspects of viral infections, we now know that viruses shape the individual phenotypes of our immune systems, and may be triggers of autoimmunity. However, the human virome is still largely unknown and much work remains.

2. Background

The Andersson lab pioneered the study of the human viral microbiome in 2005 and has since continuously carried out such studies on an increasingly large scale. The projects have gone from the identification of previously unknown human viruses to broad characterization of entire viral microbiomes, the generation of massive amounts of sequence data from clinical samples and the corresponding analysis. The lab now focuses on chronic human viral infections, the identification of new viruses from unknown sequences, disease associations and studies of skin microbiomes in disease. The Andersson lab publications describe novel human viruses and phages, entire viromes, as well as the development of novel bioinformatics methods for the discovery and characterization of viruses from large DNA sequence data sets (Allander et al. 2005, Allander et al. 2007, Lysholm et al. 2012, Barrientos-Somarribas et al. 2018).

The initial pipeline developed by the Andersson lab is composed of two pipelines, independent of each other and both written in Nextflow, which allows the programmer to use multiple programming languages and tools to execute each task in the pipeline by defining it within the scripts. The first pipeline is called “Preprocessing”. It finds human genetic material, using a human reference genome, and discards it from the total amount of DNA retrieved from a sample. After preprocessing, the data are sent into the “Discovery” pipeline, which finds microorganisms by using different mapping strategies. We used Torque Teno virus (TTV) pooled samples as input data for the pipelines.

The main focus of this thesis is to improve and develop the pipelines to offer a better understanding of viral sequences and other microorganisms from a number of types of human specimens. For the pipeline data to be useful, it must be compiled and visualized in a way that is human-readable. For further classification of targeted viral sequences, with characterization and comparison of species, a new pipeline will be developed with a different approach to compiling and visualizing data.

2.1 Torque Teno Virus

Torque Teno virus (TTV) is a small virus with a circular, single-stranded DNA genome of negative polarity (Vignolini et al. 2016). The DNA molecule is about 3.8 kb with at least four open reading frames (Focosi et al. 2016). A special characteristic of the virus is that it can infect both healthy and ill individuals (Vignolini et al. 2016). The prevalence of TTV infection appears to be high, and no pathogenic association with any disease has yet been established, which has sparked interest among researchers. The infections occur worldwide, regardless of health and genetic background (Hazanudin et al. 2019). Vignolini et al. (2016) found a high presence of TTV in healthy controls (60%, 15/25) and diseased patients (80%, 60/77), and the highest presence in transplant recipients. At least 29 major species of TTV have been discovered (Focosi et al. 2016). Each consists of numerous strains, and all are grouped into the genus Alphatorquetenovirus, which belongs to the family Anelloviridae.

According to Focosi et al. (2016), “TTV lacks serology reagents and animal models”, which makes it unculturable. It is insensitive to antiviral drugs. Currently, the only valid way to detect TTV is in plasma or other clinical specimens. Replication appears to occur mainly in T-lymphocytes, but the type of cellular receptor or receptors is unknown (Focosi et al. 2016).

The genome consists of two main parts, the coding and the non-coding region. The non-coding sequence is conserved, is about 1.2 kb in size and regulates replication. This region is also known as the untranslated region (UTR). In addition, there is a GC-rich region of 117 nucleotides, a downstream poly-A sequence and a TATA box. The coding region consists of 3-5 open reading frames (ORFs), is not conserved and has an approximate length of 2.6 kb. ORF1 is the largest and codes for the capsid protein involved in DNA binding during packaging of viral DNA. ORF2 codes for nearly 200 amino acids and is important for protein regulation while the infection is taking place. ORF3 is responsible for cell cycle regulation and suppression of antiviral resistance. This makes TTV a heterogeneous virus, where each species has its own variation in the coding region (Hazanudin et al. 2019). ORF1 is the region with the highest level of heterogeneity and shows distinct variation between genotypes. Between different TTV variants, the ORF1 region differs by about 30%.

TTV thus reproduces in a way that results in a significant level of heterogeneity. Several explanations have been proposed, such as the persistence of the virus in infected humans. It is possible that individuals with high levels of TTV are infected with a diverse set of TTV species because of the heterogeneity and persistence of the virus. This can arise from several points of exposure to the virus, or from mutations during its existence in the host. There are several assumptions about how this could occur, for example through the TTV genome exploiting the host's DNA polymerase, since the virus lacks a gene coding for a polymerase. The high level of heterogeneity indicates that TTV could have existed for a very long time, or that the virus has a faster mutation rate than other DNA viruses (Ball et al. 1999).

One of the goals in this thesis is to compare the level of TTV in different cancer specimens. According to Hazanudin et al. (2019), more than 50% of cancer specimens were positive for TTV. The positive results could be due to inflamed tissue or rapid cell division. There is interest in finding features of neoplastic pathology that are beneficial for the existence and reproduction of TTV, i.e. whether TTV could affect the chronic condition of an infection, or whether such tissue is an ideal environment for its reproduction. To gain a deeper understanding of the impact of TTV on somatic cells, additional research is needed in which tissues with chronic infections are compared with healthy control groups (Hazanudin et al. 2019).

3. Goals

In this project we worked on the initial pipeline developed by the Andersson lab. To further evaluate pooled and individual samples with TTV that could be associated with human disease, the following research objectives were set:

1. Further development of the viral-pipeline. Include more outputs, visualizations and methods.

2. Create categories and methods for identifying types and strains of TTV.

3. Integrate unknown gene family pipeline for protein prediction in the viral-pipeline.

4. Start procedures for skin microbiomes that are associated with skin disease.


4. Material

This section gives background information about the data and hardware used during the thesis.

4.1 Data

The data are based on clinical samples, mostly pooled samples, from patients diagnosed with different diseases. Each data sample contains about 0.5 to 2 million reads or more after genetic material from the host has been removed. Each data set contains both DNA and RNA. All data were produced by shotgun sequencing using Illumina technologies¹, both MiSeq and NovaSeq.

4.1.1 TTV data sets

We have two TTV data sets. The first, TTV Bonk, is based on pooled samples from patients diagnosed with different types of cancer. The diagnoses and metadata for this data set are listed in table 1 in section 4.1.1.1. The TTV data sets are the primary data sets used in this thesis to study the possible association between TTV and various diseases. The second data set, TTV 1b2, corresponds to the row with group number 1b2 in table 1 and is based on individual samples from patients between the ages of 3 and 5 diagnosed with Pre B-cell acute lymphoblastic leukemia.

For the TTV Bonk data set we have 496 individual samples pooled into 23 pools and two control samples, where the samples are taken from the Division of Pediatric Oncology in Uppsala. In total this gives us 24 GenomiPhi data samples and 24 RNA data samples, i.e. 48 data samples. The TTV 1b2 data set consists of single samples: 22 serum samples and two controls. In total, we have 23 GenomiPhi data samples and 23 RNA data samples, i.e. 46 data samples.

1 https://emea.illumina.com/


4.1.1.1 Detailed information about TTV Bonk data

The data samples used in the analysis are displayed in table 1, while table 9 in appendix A provides a translation of some Swedish terminology found in table 1.

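Grupp (Group) | Antal prov/pool (nr samples/pool) | Grupp/Pool nr (Group/Pool nr) | Sample_id | Rör till NGS (pipe to NGS) | Qubit konc (ng/µL)
Pre B-ALL 0-2 år (0-2 years) | 22 | 1a | P12653_1001 | 1a: RNA | 378
Pre B-ALL 0-2 år (0-2 years) | 22 | 1a | P12653_1002 | 1a: GenomiPhi | 39.1
Pre B-ALL 3-5 år (3-5 years) | 22 | 1b1 | P12653_1003 | 1b1: RNA | 890
Pre B-ALL 3-5 år (3-5 years) | 22 | 1b1 | P12653_1004 | 1b1: GenomiPhi | 6.93
Pre B-ALL 3-5 år (3-5 years) | 22 | 1b2 | P12653_1005 | 1b2: RNA | 660
Pre B-ALL 3-5 år (3-5 years) | 22 | 1b2 | P12653_1006 | 1b2: GenomiPhi | 167
B-ALL 6-18 år (6-18 years) | 22 | 1c1 | P12752_1001 | 1c1: RNA | 790
B-ALL 6-18 år (6-18 years) | 22 | 1c1 | P12752_1002 | 1c1: GenomiPhi | 25.8
B-ALL 6-18 år (6-18 years) | 22 | 1c2 | P12752_1003 | 1c2: RNA | 770
B-ALL 6-18 år (6-18 years) | 22 | 1c2 | P12752_1004 | 1c2: GenomiPhi | 3.57
B-ALL 6-18 år (6-18 years) | 22 | 1c3 | P12752_1005 | 1c3: RNA | 810
B-ALL 6-18 år (6-18 years) | 22 | 1c3 | P12752_1006 | 1c3: GenomiPhi | 34.6
T-cells ALL | 15 | 2 | P12752_1007 | 2: RNA | 920
T-cells ALL | 15 | 2 | P12752_1008 | 2: GenomiPhi | 75.3
Akut myeloisk leukemi (AML) | 31 | 3 | P12752_1009 | 3: RNA | 520
Akut myeloisk leukemi (AML) | 31 | 3 | P12752_1010 | 3: GenomiPhi | 6.49
Lymfom + Non Hodgkin lymfom | 22 | 4 | P12752_1011 | 4: RNA | 330
Lymfom + Non Hodgkin lymfom | 22 | 4 | P12752_1012 | 4: GenomiPhi | 0.135
Hodgkins lymfom (HL) | 18 | 5 | P12752_1013 | 5: RNA | 346
Hodgkins lymfom (HL) | 18 | 5 | P12752_1014 | 5: GenomiPhi | 4.74
Solida maligna tumörer 0-2 år (0-2 years) | 17 | 6a | P12752_1015 | 6a: RNA | 360
Solida maligna tumörer 0-2 år (0-2 years) | 17 | 6a | P12752_1016 | 6a: GenomiPhi | 3.49
Solida maligna tumörer 3-5 år (3-5 years) | 25 | 6b | P12752_1017 | 6b: RNA | 401
Solida maligna tumörer 3-5 år (3-5 years) | 25 | 6b | P12752_1018 | 6b: GenomiPhi | 2.99
Solida maligna tumörer 6-18 år (6-18 years) | 28 | 6c1 | P12752_1019 | 6c1: RNA | 257
Solida maligna tumörer 6-18 år (6-18 years) | 28 | 6c1 | P12752_1020 | 6c1: GenomiPhi | 2.97
Solida maligna tumörer 6-18 år (6-18 years) | 28 | 6c2 | P12752_1022 | 6c2: RNA | 356
Solida maligna tumörer 6-18 år (6-18 years) | 28 | 6c2 | P12752_1022 | 6c2: GenomiPhi | 7.24
CNS tumörer 0-5 år (0-5 years) | 15 | 7a | P12752_1023 | 7a: RNA | 317
CNS tumörer 0-5 år (0-5 years) | 15 | 7a | P12752_1024 | 7a: GenomiPhi | 8.83
CNS tumörer 6-18 år (6-18 years) | 23 | 7b1 | P12752_1025 | 7b1: RNA | 97.8
CNS tumörer 6-18 år (6-18 years) | 23 | 7b1 | P12752_1026 | 7b1: GenomiPhi | 1.95
CNS tumörer 6-18 år (6-18 years) | 22 | 7b2 | P12752_1027 | 7b2: RNA | 319
CNS tumörer 6-18 år (6-18 years) | 22 | 7b2 | P12752_1028 | 7b2: GenomiPhi | 1.45
Aplastisk anemi (AA) | 10 | 8 | P12752_1029 | 8: RNA | 303
Aplastisk anemi (AA) | 10 | 8 | P12752_1030 | 8: GenomiPhi | 0.055
Övrig anemi | 10 | 9 | P12752_1031 | 9: RNA | 432
Övrig anemi | 10 | 9 | P12752_1032 | 9: GenomiPhi | 0.734
Histiocytoser | 8 | 10 | P12752_1033 | 10: RNA | 183
Histiocytoser | 8 | 10 | P12752_1034 | 10: GenomiPhi | 2.78
Övriga benigna diagnoser 0-2 år (0-2 years) | 12 | 11a | P12752_1035 | 11a: RNA | 353
Övriga benigna diagnoser 0-2 år (0-2 years) | 12 | 11a | P12752_1036 | 11a: GenomiPhi | 24.1
Övriga benigna diagnoser 3-5 år (3-5 years) | 13 | 11b | P12752_1037 | 11b: RNA | 389
Övriga benigna diagnoser 3-5 år (3-5 years) | 13 | 11b | P12752_1038 | 11b: GenomiPhi | 7.5
Övriga benigna diagnoser 6-18 år (6-18 years) | 47 | 11c | P12752_1039 | 11c: RNA | 181
Övriga benigna diagnoser 6-18 år (6-18 years) | 47 | 11c | P12752_1040 | 11c: GenomiPhi | 2.05
Kontroll TTV BONK (PBS) | 0 | Ktr TTV BONK | P12752_1041 | Ktr TTV BONK: RNA | 108
Kontroll TTV BONK (PBS) | 0 | Ktr TTV BONK | P12752_1042 | Ktr TTV BONK: GenomiPhi | ej mätbart (not measurable)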

Table 1: All data samples and all disease diagnoses. “Group” gives the diagnosis for each set of data samples, “nr samples/pool” is the number of individuals and “Pool nr” is the id of each group. Under “Sample_id” there are two data samples for each diagnosis, one for entire genomes and one for RNA; the type of data for each “Sample_id” is given under “pipe to NGS”. The third row of table 1 contains pool number 1b2, which is the new data set TTV 1b2 (raw library data retrieved 2019-02-12) and is based on individual samples.


4.1.2 T1D ABIS data set

T1D ABIS stands for “Type 1 diabetes all children from southeast of Sweden” (in Swedish: “Typ 1 diabetes alla barn i sydöstra Sverige”). This data set contains 141 individual samples pooled into 11 pools and controls. Groups are based on patients’ ages and controls. There are two separate data sets, P13408 and P13409, each containing 12 GenomiPhi data samples and 12 RNA data samples, i.e. 24 in total for each data set. We used this data set to study the potential association between type 1 diabetes and TTV or picornaviruses.

4.1.3 POTS 1 data set

POTS stands for postural orthostatic tachycardia syndrome (in Swedish: posturalt ortostatiskt takykardisyndrom). The data set is based on 59 serum samples pooled into two pools (34 patients and 25 controls) and divided into three GenomiPhi and three RNA data samples, i.e. 6 data samples in total.

4.1.4 MAARS data sets

MAARS is a large collection of skin microbiome data sets, 8 in total. The samples were collected in Finland, England and Germany for the MAARS study and are divided into three groups: AD: Atopic Dermatitis (Atopiskt Eksem), PSO: Psoriasis, C: Controls. The data sets are called B.Andersson_13_08 (totaling 24 data samples), B.Andersson_14_05 (totaling 108 data samples), B.Andersson_14_06 (totaling 105 data samples), B.Andersson_14_07 (totaling 107 data samples), B.Andersson_14_08 (totaling 108 data samples), B.Andersson_15_01 (totaling 107 data samples), B.Andersson_15_02 (totaling 107 data samples) and B.Andersson_15_04 (totaling 42 data samples).
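
As a quick consistency check on the MAARS collection, the per-data-set counts given in section 4.1.4 can be tallied in a few lines of Python. This is only an illustrative sketch; the variable names are hypothetical and the numbers are copied from the text.

```python
# Data-sample counts per MAARS data set, as listed in section 4.1.4.
maars_counts = {
    "B.Andersson_13_08": 24,
    "B.Andersson_14_05": 108,
    "B.Andersson_14_06": 105,
    "B.Andersson_14_07": 107,
    "B.Andersson_14_08": 108,
    "B.Andersson_15_01": 107,
    "B.Andersson_15_02": 107,
    "B.Andersson_15_04": 42,
}

n_data_sets = len(maars_counts)
total_data_samples = sum(maars_counts.values())

print(f"{n_data_sets} data sets, {total_data_samples} data samples in total")
```

The tally confirms that the list contains the 8 data sets mentioned in the text and gives the overall number of data samples in the collection.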


4.2 Databases and references

During the thesis I used several databases and references to process data. To discard host genetic material from the samples, I used Genome Reference Consortium Human Build 38 (hg38/GRCh38) (Cole et al. 2008). To classify the remaining sequence data I used partially non-redundant nucleotide sequences (nt)², non-redundant protein sequences (nr)³, IMG/VR (DNA viruses and retroviruses built with Kallisto (Bray et al. 2016) indexing for FVE (Tithi et al. 2018)), NCBI virus (built with Kallisto index for FVE) and 181204_bacvirfun (local database for Kraken2 (Wood et al. 2019) built by the Andersson lab, data sources unknown). These were chosen in order to identify viral sequences efficiently. For further classification of sequence data I selected reference genomes of different variants of the targeted virus. When further classifying TTV we used a multifasta with 29 selected TTV sequences (a set that could be extended considerably), and when targeting picornaviruses we used a multifasta with 5033 nucleotide sequences, which are complete DNA and RNA genomes. These sequences are derived from the entire picornavirus lineage⁴.

4.3 Hardware

We had four local servers and one external resource, “Uppmax”. The four local servers are called “Hamlet”, “Othello”, “Henry” and “Duo”. I used Hamlet for most of the primary pipeline runs, Othello to run some scripts and pipelines but mostly for development purposes, Henry to process raw library data and Duo to compile data for the MAARS collection. Uppmax was mostly used for retrieval of finished raw libraries produced by SciLifeLab⁵.

2 https://www.uppmax.uu.se/resources/databases/blast-databases/
3 https://www.uppmax.uu.se/resources/databases/blast-databases/
4 https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=12058&lvl=3&lin=f&keep=1&srchmode=1&unlock
5 https://www.scilifelab.se/


5. Methods

One of the first steps is to create a method for compiling large amounts of data, with up to 48 pooled data samples, each containing about 1 million reads. The samples are based on clinical human specimens. In the initial phase I use two different pipelines developed by the Andersson lab. The first step is to run the raw library data through the “Preprocessing” pipeline, which performs quality control of the fastq reads, trims the data and removes host genetic material based on Hg38 (Cole et al. 2008). This should yield data containing mostly genetic material from the microbial community of the targeted samples, which come from individuals with different chronic diseases. In the next step I run the trimmed data through the “Discovery” pipeline, which uses different alignment strategies and outputs results based on paired-end reads and on assembly data (contigs) created during the pipeline run.

5.1 Pipelines

The pipelines process the data and produce results within one week to two months, depending on the amount of data and the number of samples. In section 5.1.1 I list the methods and technical tools that are applied in the different pipelines. Note that the content of 5.1.1 is collected from all pipelines; no single pipeline uses all the tools listed. In the other sections I describe the different pipelines and scripts used to produce the desired results.


5.1.1 Pipeline information

5.1.1.1 Softwares in command line

The nucleotide-based alignment and classification software used in the pipelines are Blastn (Madden 2013), Magic-BLAST (Boratyn et al. 2019), Burrows-Wheeler Aligner, BWA (Li H 2013), Bowtie2 (Langmead et al. 2009), Kraken2 (classification tool using k-mers) (Wood et al. 2019), Metaphlan2 (profiling tool using clade-specific marker genes) (Truong et al. 2015) and FastViromeExplorer, FVE (tool for finding viral sequences) (Tithi et al. 2018). FVE uses the Kallisto algorithm (Bray et al. 2016) to classify reads. One tool, Diamond Blastx (Buchfink et al. 2015), translates nucleotides to amino acids and aligns them against a protein database. Blastn is used in the Extended Discovery pipeline (see section 5.1.4), Magic-BLAST and BWA in the TTV map pipeline (see section 5.1.5), Bowtie2 in the Preprocessing pipeline (see section 5.1.2), and Diamond Blastx, Kraken2, Metaphlan2 and FVE in the Discovery pipeline (see section 5.1.3). To create databases for the TTV map pipeline we use makeblastdb (Information et al. 2008) and the BWA indexing function.

Several command-line programs are used in the pipelines for filtering, trimming and quality control. Seqtk (Li H 2020) is a toolkit for processing sequences and is used for handling sequence lengths in the Discovery pipeline. Trim Galore (Andrews 2019a) is used for adapter trimming and quality control of sequenced data in the Preprocessing pipeline. BBTools (Bushnell 2020), which contains several tools for processing genomic data, is used for aggressive trimming of low-quality reads after running the Preprocessing pipeline, and also for mapping reads back to contigs (assembly data) in the Discovery pipeline. FastQC (Andrews 2019b) is used for quality control of sequenced reads. To create assembly data in the Discovery pipeline we use two metagenomic assemblers, Megahit (Li D et al. 2015) and MetaSPAdes (Nurk et al. 2017).

For parsing data during the pipeline run we use Bedtools (Quinlan & Hall 2010) and Samtools (Li H et al. 2009). Both have several built-in functions for different genomic operations and are used in all pipelines except the Extended Discovery pipeline. One important tool that is used frequently in the TTV map pipeline to finalize results is Datamash (Gordon 2020), which performs statistical, numerical and textual operations. FragGeneScan (Rho et al. 2010) is part of the Discovery pipeline, but is currently not part of the development and data analysis for this thesis. VirFinder (Ren et al. 2017) is another package in the pipeline that is not used in this thesis. It is a library in R⁶ that predicts whether a sequence is viral or not.

6 https://www.r-project.org/


5.1.1.2 Programming languages

To build the pipeline, Nextflow⁷ is used to create the structure, data channels and processes. Within the processes we use the Unix shell and command language Bash⁸, together with the data-driven scripting language AWK⁹ and the text-processing utility Sed¹⁰, which are important for most of the processes. Another programming language used in the processes is Python3¹¹.

7 https://www.nextflow.io/
8 https://www.gnu.org/software/bash/
9 https://www.gnu.org/software/gawk/manual/gawk.html
10 https://www.gnu.org/software/sed/
11 https://www.python.org/download/releases/3.0/


5.1.2 Preprocessing pipeline

Figure 1: A UML chart displaying the different processes in the Preprocessing pipeline.

Figure 1 shows the Preprocessing pipeline. It is mostly unchanged, apart from an added FastQC run after host removal and conversion of the data from SAM to FASTQ (note that the extra FastQC run is not in the figure). For quality control of the preprocessing results I compile all of the results using MultiQC (Ewels et al. 2016), which gathers results from the different analyses of all samples and produces an HTML report. If there are reads of very poor quality, I run BBDuk (part of BBTools) for quality trimming and filtering.


5.1.3 New version of Discovery pipeline

Figure 2: A UML chart displaying the different processes in the new version of the Discovery pipeline. Yellow processes are part of the initial pipeline, orange processes are also part of the initial pipeline but slightly modified and blue processes are new.

Figure 2 uses three colors: yellow for unchanged processes, orange for processes with slightly modified code and/or output, and blue for new processes.


There are five processes marked in orange in figure 2. The first is asm_filter_contigs, which filters contigs based on their length in base pairs (bp). The minimum length was lowered from 500 bp to 200 bp to find smaller contigs that could be interesting for the project members, such as unknown or similar sequences with lower coverage. The second is tax_reads_kraken2, which was changed to output unmapped/unknown reads for further analysis in the Extended Discovery pipeline (see section 5.1.4). The third is tax_reads_metaphlan2, which was not working and whose syntax I therefore corrected. The last two are tax_contigs_diamond and tax_contigs_kraken2, both of which were changed to output unmapped/unknown sequences that are sent to “tax_contigs_unmapped_merged”.
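The length filter in asm_filter_contigs can be illustrated with a minimal Python sketch. This is not the pipeline's own code: the function names and the two example contigs are made up for illustration, and the real process may be implemented with other tools. It only shows how lowering the threshold from 500 bp to 200 bp changes which contigs are kept.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_contigs(handle, min_length=200):
    """Keep contigs whose length is at least min_length bp."""
    return [(h, s) for h, s in read_fasta(handle) if len(s) >= min_length]


# Two made-up contigs of 150 bp and 240 bp (hypothetical data).
EXAMPLE_FASTA = ">contig_a\n" + "A" * 150 + "\n>contig_b\n" + "ACGT" * 60 + "\n"

kept_200 = filter_contigs(io.StringIO(EXAMPLE_FASTA), min_length=200)
kept_500 = filter_contigs(io.StringIO(EXAMPLE_FASTA), min_length=500)
print([h for h, _ in kept_200], [h for h, _ in kept_500])
```

With the 200 bp threshold the 240 bp contig is kept, whereas the former 500 bp threshold would have discarded both example contigs.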

There are three processes marked in blue in figure 2. tax_contigs_unmapped_merged creates one file of unique unmapped sequences, i.e. sequences that are unmapped in both Kraken2 and Diamond. This file is sent to the Extended Discovery pipeline (see section 5.1.4) for further analysis. The other two processes, tax_contigs_diamond_fetch_organism_names and tax_contigs_diamond_fetch_taxonomic_info, are very large coding-wise in comparison with the other processes. Ideally, they should be divided into several smaller processes whose outputs are joined into a single data channel for a final compilation. The first process (tax_contigs_diamond_fetch_organism_names) retrieves scientific names by using blastdbcmd and matching patterns to find which parts of the text are actual names; in a few instances the patterns are inconsistent, and these results are stored in a different container. The second process (tax_contigs_diamond_fetch_taxonomic_info) uses different NCBI Entrez Programming Utilities (E-utilities) to fetch as much taxonomic information as possible that could be useful for data categorization. The limitation is the number of requests allowed, which is 3 per second, or 10 per second with an API key. This makes it impractical to run parallel scripts that call E-utilities. Between each data gathering instance the system is set to sleep for a few seconds to avoid receiving an error message from NCBI's remote servers.
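The request limit described above can be respected with a small throttling helper. The following Python sketch is an illustration under stated assumptions, not the code used in tax_contigs_diamond_fetch_taxonomic_info: the class and function names are hypothetical, the taxonomy id in the usage note is only an example, and no request is actually sent. The base URL and the parameters db, id, retmode and api_key are those of the E-utilities EFetch endpoint.

```python
import time
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


class RateLimiter:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self.clock = clock
        self.sleep = sleep
        self._last = None

    def wait(self):
        now = self.clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self._last = now


def taxonomy_url(tax_id, api_key=None):
    """Build an EFetch URL for one taxonomy record."""
    params = {"db": "taxonomy", "id": tax_id, "retmode": "xml"}
    if api_key:
        params["api_key"] = api_key
    return EFETCH_URL + "?" + urlencode(params)


def throttled_urls(tax_ids, limiter, api_key=None):
    """Yield one URL per taxonomy id, pausing so the limit is never exceeded."""
    for tax_id in tax_ids:
        limiter.wait()
        yield taxonomy_url(tax_id, api_key)
```

A caller would create `RateLimiter(10 if api_key else 3)` and pass each yielded URL to an HTTP client such as `urllib.request.urlopen`. Because the limit applies to all requests from the same source, a single shared limiter is simpler than coordinating several parallel scripts.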

5.1.3.1 Compiling data and creating table scripts

A completed run of the Discovery pipeline produces a lot of data for each sample. To make the data useful for other project members associated with Andersson Lab I have created an interactive HTML table with several features, such as a filter for selecting viruses based on the taxonomic information, the taxonomic hierarchy (family, genus, species and subspecies), rank, division, sequence id and scientific name, as well as a jQuery search box for filtering the table list and column selection. Note that figures 3 and 4 are simplified versions of the actual process for creating HTML tables. All HTML files are produced using Python3. I have also produced a readme HTML file with an interactive table of contents in the left panel and the information about the data on the right side (see appendix C).


See section 5.1.3.1.1 for read based tables and section 5.1.3.1.2 for contig-based tables.

5.1.3.1.1 Viral parser of reads and table creation

Figure 3: A UML chart displaying the sequential workflow for the read parser script.

Figure 3 shows the sequential process for compiling and producing HTML files. The script collects all data, matches different results, combines different samples, creates a TSV file, translates the TSV into JSON and produces HTML code with the JSON content embedded in it. For the read tables there is no need to fetch taxonomic information about the sequences, because most of it is already present in the alignment results produced by Kraken2, FVE and Metaphlan2.
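The last steps of this workflow (TSV to JSON to HTML) can be sketched in Python as follows. This is a simplified illustration, not the actual table script: the column names and the example row are hypothetical, and the real tables add filtering, a search box and column selection on top of this.

```python
import csv
import html
import io
import json

# Page template with two placeholders that are filled in by records_to_html().
PAGE = """<!DOCTYPE html>
<html><head><meta charset="utf-8"><title>__TITLE__</title></head>
<body>
<table id="hits"></table>
<script>
const rows = __DATA__;
const table = document.getElementById("hits");
const columns = rows.length ? Object.keys(rows[0]) : [];
const head = table.insertRow();
columns.forEach(c => { head.insertCell().textContent = c; });
rows.forEach(r => {
  const tr = table.insertRow();
  columns.forEach(c => { tr.insertCell().textContent = r[c]; });
});
</script>
</body></html>
"""


def tsv_to_records(tsv_text):
    """Parse TSV text with a header line into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))


def records_to_html(records, title):
    """Embed the records as JSON inside a self-contained HTML page."""
    # Escape "</" so the JSON cannot close the <script> element early.
    payload = json.dumps(records).replace("</", "<\\/")
    return PAGE.replace("__DATA__", payload).replace("__TITLE__", html.escape(title))


# Hypothetical one-row example.
example_tsv = "sample_id\tname\treads\nS1\tTorque teno virus\t12\n"
records = tsv_to_records(example_tsv)
page = records_to_html(records, "Read table")
```

Embedding the JSON directly in the page keeps each HTML file self-contained, so it can be shared with project members without a web server.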


5.1.3.1.2 Viral parser of contigs and table creation

Figure 4: A UML chart displaying the sequential workflow for the contig parser script.

Figure 4 shows the sequential process for producing HTML files. For contigs, the Diamond Blastx results do not provide the taxonomic information, so retrieving the information needed to compile the tables is more complex. First, the script collects data from the Diamond alignment results, the taxonomic files and the files with scientific names. During data collection the results are parsed into a single line for each sequence and compiled into one container, while data without scientific names (see the blue processes in section 5.1.3) are stored in another container. The data without scientific names still have the FASTA header retrieved during the pipeline run, which is used instead to display the results. These two containers are later stored in two separate HTML files, one with detailed information about each contig sequence and one with slightly less information.
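The split into two containers can be sketched as follows. The pattern used here is an assumption: it expects the organism name in square brackets at the end of the header, as is common for NCBI protein records, whereas the hardcoded pattern in the actual script is not reproduced in this thesis. The function name and the example headers are made up.

```python
import re

# Assumed header layout: "<accession> <description> [<organism name>]"
NAME_PATTERN = re.compile(r"\[([^\[\]]+)\]\s*$")


def split_by_name(headers):
    """Return (named, unnamed) dictionaries keyed by accession number."""
    named, unnamed = {}, {}
    for header in headers:
        if not header.strip():
            continue
        accession = header.split(maxsplit=1)[0]
        match = NAME_PATTERN.search(header)
        if match:
            named[accession] = match.group(1)
        else:
            # Keep the full header so it can still be displayed.
            unnamed[accession] = header
    return named, unnamed


# Hypothetical headers, not taken from the database.
example_headers = [
    "ABC00001.1 hypothetical protein [Example virus 1]",
    "XYZ00002.1 unnamed protein product",
]
named, unnamed = split_by_name(example_headers)
```

Headers that do not match end up in the second dictionary with the full header text, which mirrors how the tables fall back to the FASTA header when no scientific name is found.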


For the main HTML table (the container with scientific names) the Diamond results are compared with the Kraken2 results and, if they agree, i.e. have the same sequence id and the same scientific name, missing components are added from Kraken2. Some of the Kraken2 results do not match the Diamond results and are listed at the end of each sample run.
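A minimal Python sketch of this comparison is given below. It assumes that both result sets are available as lists of dictionaries with the hypothetical keys seq_id and name; the field names and example values are illustrative only.

```python
def merge_hits(diamond_rows, kraken_rows):
    """Fill gaps in Diamond rows from Kraken2 rows with the same key.

    Returns (merged, unmatched), where unmatched holds the Kraken2 rows
    that have no Diamond row with the same (seq_id, name).
    """
    kraken_by_key = {(r["seq_id"], r["name"]): r for r in kraken_rows}
    merged, matched = [], set()
    for row in diamond_rows:
        key = (row["seq_id"], row["name"])
        combined = dict(row)
        extra = kraken_by_key.get(key)
        if extra is not None:
            matched.add(key)
            for field, value in extra.items():
                # Only add components that are missing or empty.
                if not combined.get(field):
                    combined[field] = value
        merged.append(combined)
    unmatched = [r for k, r in kraken_by_key.items() if k not in matched]
    return merged, unmatched


# Hypothetical example rows.
diamond = [
    {"seq_id": "c1", "name": "virus A", "evalue": "1e-50", "rank": ""},
    {"seq_id": "c2", "name": "virus B", "evalue": "1e-9"},
]
kraken = [
    {"seq_id": "c1", "name": "virus A", "rank": "species", "tax_id": "111"},
    {"seq_id": "c3", "name": "virus C", "rank": "genus", "tax_id": "222"},
]
merged, unmatched = merge_hits(diamond, kraken)
```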

5.1.3.1.2.1 Simplified version of contig table

In the original tables the results are not filtered and several matches are displayed for each sample id and sequence id; some tables have more than 60 000 rows of results. To simplify data analysis for other project members I have created a secondary table called “simplified version”, where I use datamash, awk and sed to keep only the best hit for each sample id and sequence id. Together, sample id and sequence id form a unique identifier for each nucleotide sequence (contig data).
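The selection of the best hit can be illustrated in Python. The actual simplified table is produced with datamash, awk and sed, so the following is only an equivalent sketch. It takes the best hit to be the one with the lowest e-value, which is the criterion stated for the simplified tables in section 6.1.1, and the key names and rows are hypothetical.

```python
def best_hits(rows):
    """Keep the row with the lowest e-value for each (sample_id, seq_id)."""
    best = {}
    for row in rows:
        key = (row["sample_id"], row["seq_id"])
        if key not in best or float(row["evalue"]) < float(best[key]["evalue"]):
            best[key] = row
    return list(best.values())


# Hypothetical rows: contig c1 in sample S1 has two competing hits.
rows = [
    {"sample_id": "S1", "seq_id": "c1", "name": "virus A", "evalue": "1e-5"},
    {"sample_id": "S1", "seq_id": "c1", "name": "virus B", "evalue": "1e-50"},
    {"sample_id": "S1", "seq_id": "c2", "name": "virus C", "evalue": "0.001"},
    {"sample_id": "S2", "seq_id": "c1", "name": "virus D", "evalue": "2e-8"},
]
simplified = best_hits(rows)
```

Note that the same sequence id in two different samples (c1 in S1 and S2) yields two rows, because only the combination of sample id and sequence id is unique.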

5.1.4 Development of Extended Discovery pipeline

Figure 5: A UML chart displaying the different processes in the Extended Discovery pipeline. Yellow shows processes that have been implemented, red shows processes that have been implemented but will be removed and green shows processes that are planned for future implementation.


Figure 5 shows the processes in the Extended Discovery pipeline. I created all the processes listed in the figure. For the time being this pipeline is of low priority and is unlikely to be finished during my thesis. Red marks processes that are unfeasible (the execution takes weeks, so the process should be removed), green marks processes that are planned for the future and yellow marks current processes that will be kept. The idea of the pipeline is to try to classify data that is left unclassified by the Discovery pipeline and, in the end, to conclude whether these are unknown sequences.


5.1.5 Development of TTV map pipeline

Section 5.1.5.1 presents the design phase and section 5.1.5.2 the end product.

5.1.5.1 Initial design

Figure 6: A UML chart displaying the design phase for the development of the TTV map pipeline.


Figure 6 shows the initial Unified Modeling Language (UML) design that I created and used to develop the TTV map pipeline. In the column “reads paired-end” I run two separate data analyses based on short Illumina reads, one using BWA and the other using Magic-BLAST. The column “contigs” shows another separate run, for contigs (assembly data), that also uses Magic-BLAST. The starting point of the UML chart is the column “Data”. Each line with a pointer represents a data channel. A line that points to a diamond-shaped figure represents data divided into two separate data channels. The column “Database” is where all the selected nucleotide references are stored, to be called on when running the different alignment tools. Each yellow box represents a process in the pipeline, except in the columns “Data” and “Database”.


5.1.5.2 Final product of pipeline

Figure 7: A UML chart displaying the end product of the TTV map pipeline.


The UML diagram in figure 7 contains the 23 processes of the pipeline. Note that fetch data is not a process; it is a channel for retrieving data when starting the pipeline. All the arrows represent data channels between processes: a bold underline marks data channels that are combined into one channel, while a diamond marks one data channel that is duplicated into several processes. The 23 processes can be divided into three parallel runs. For assembly data, eight processes are sequential and dependent on each other. For the “read” part, only “read_counter” is shared by the two independent runs, which are BWA with seven processes and Magic-BLAST with seven processes. The efficiency of the parallelism depends on the speed of data distribution and the execution time of the processes. The efficiency of each process depends on the availability of computing resources, the length of the execution time and how well the code is optimized.

Five processes handle the counting of sequences, which is done before and after the alignment run. Note that counting after an alignment run only considers the best hit for each sequence; this step applies different operations and filtering before doing the actual counting. The following processes are used for counting sequences: contig_counter, contig_magicblast_counter, reads_counter, reads_bwa_counter and reads_magicblast_counter.

Three processes execute different alignment tools: contig_magicblast, reads_bwa and reads_magicblast perform local alignment on nucleotide sequences. BWA is one of the standard tools for mapping short Illumina reads to chromosome data, but in our case we use a multi-FASTA file with viral genomes. Magic-BLAST is a fast alignment program that works for both DNA and RNA. It is similar to other BLAST programs, but it can take paired FASTQ reads, works to avoid repeats and uses the cumulative pair score to select the best results (Boratyn et al. 2019). The idea is to compare the results from the standard tool (BWA) with those from Magic-BLAST.

Three processes, contig_magicblast_selection, reads_bwa_selection and reads_magicblast_selection, select the sequences that are shown in the end results, based on the alignment score for Magic-BLAST and on the mapQ value for BWA. Datamash is used for different operations (statistical, numeric and textual), for most of the creation of intermediate files and for compiling end results. AWK and Sed are used together with Datamash for filtering unwanted characters and reoccurring strings and for compiling lists. When fetching names for each result, I pass the accession numbers to blastdbcmd to fetch the header from the local database. To speed up this step I limit the retrieval to a single nucleotide, because blastdbcmd otherwise returns the entire sequence together with the header.
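The selection and counting of BWA reads can be sketched in Python. This is an illustration rather than the code of reads_bwa_selection or reads_bwa_counter: the threshold of 30 is an assumed value (the threshold used in the pipeline is not stated here), the function name is hypothetical and the SAM records are made up. The sketch relies only on the standard SAM layout, where column 2 is the FLAG, column 3 the reference name and column 5 the mapping quality.

```python
from collections import Counter

UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def count_by_reference(sam_lines, min_mapq=30):
    """Count primary alignments with MAPQ >= min_mapq per reference."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag, rname, mapq = int(fields[1]), fields[2], int(fields[4])
        if flag & (UNMAPPED | SECONDARY | SUPPLEMENTARY):
            continue  # keep one (primary) alignment per read
        if mapq >= min_mapq:
            counts[rname] += 1
    return counts


# Hypothetical SAM records; ref_A and ref_B stand in for reference genomes.
example_sam = [
    "@SQ\tSN:ref_A\tLN:3800",
    "r1\t0\tref_A\t1\t60\t10M\t*\t0\t0\tACGTACGTAC\tFFFFFFFFFF",
    "r2\t0\tref_A\t5\t3\t10M\t*\t0\t0\tACGTACGTAC\tFFFFFFFFFF",
    "r3\t256\tref_B\t9\t60\t10M\t*\t0\t0\tACGTACGTAC\tFFFFFFFFFF",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGTAC\tFFFFFFFFFF",
    "r5\t16\tref_B\t7\t42\t10M\t*\t0\t0\tACGTACGTAC\tFFFFFFFFFF",
]
counts = count_by_reference(example_sam)
```

Raising or lowering `min_mapq` changes how many reads are counted per reference, which is one reason why the pie chart results depend on how stringently the alignment options are set.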


The intermediate processes are contigs_magicblast_final_sampleResults (individual samples), reads_bwa_final_sampleResults (individual samples), reads_magicblast_final_sampleResults (individual samples), contigs_magicblast_concat_allResults (creates one file for all samples), reads_bwa_concat_allResults (creates one file for all samples) and reads_magicblast_allResults (creates one file for all samples). These processes help to format the data for the end results. Six processes create the files for the end results: contig_magicblast_final_results (individual samples), reads_bwa_final_results (individual samples), reads_contigs_final_results (individual samples), contig_magicblast_final_allResults (final file for summary of all results), reads_bwa_final_allResults (final file for summary of all results) and contig_magicblast_final_allResults (final file for summary of all results). The final data can be used for creating a graphical representation in HTML and, perhaps in the future, for other types of data visualization. Note that the scripts used for creating visualizations are not part of the pipeline.

5.1.5.3 Script for creating visualization based on results from TTV map pipeline

The first step was to find a suitable way of presenting the data. I used one of Google's interactive pie charts, which uses a data table and JavaScript to produce a pie chart where the user can see the list of mapped genomes, highlight and select results, and view names by hovering over the list. The pie chart displays the percentage and the number of sequences for each mapped result. To mass-produce pie charts I used Python3 to create HTML code, fetch data from different sub-directories, calculate values and produce HTML files.
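How a Python script can generate such a page is shown in the sketch below. It is a reduced illustration and not the thesis script: the function name, the placeholder names and the example counts are hypothetical. The JavaScript calls (the loader script, google.charts.load, arrayToDataTable and PieChart) belong to the public Google Charts library.

```python
import html
import json

# Page template with three placeholders that are filled in by pie_chart_page().
PAGE = """<!DOCTYPE html>
<html><head><meta charset="utf-8"><title>__PAGE_TITLE__</title>
<script src="https://www.gstatic.com/charts/loader.js"></script>
<script>
google.charts.load("current", {packages: ["corechart"]});
google.charts.setOnLoadCallback(function () {
  var data = google.visualization.arrayToDataTable(__ROWS__);
  var chart = new google.visualization.PieChart(document.getElementById("pie"));
  chart.draw(data, {title: __CHART_TITLE__});
});
</script></head>
<body><div id="pie" style="width: 900px; height: 500px;"></div></body></html>
"""


def pie_chart_page(title, counts):
    """Return an HTML page that draws one pie chart from name -> count."""
    rows = [["Reference", "Sequences"]]
    rows += [[name, int(n)] for name, n in counts.items()]
    page = PAGE.replace("__ROWS__", json.dumps(rows).replace("</", "<\\/"))
    page = page.replace("__CHART_TITLE__", json.dumps(title).replace("</", "<\\/"))
    return page.replace("__PAGE_TITLE__", html.escape(title))


# Hypothetical counts for two reference genomes.
page = pie_chart_page("TTV reference occurrences", {"ref_A": 3, "ref_B": 1})
```

Writing `page` to a file gives one HTML file per call, so looping over the sample directories produces the per-sample files. The chart library computes the percentages from the counts, so the script only needs to supply the number of sequences per reference.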


5.1.6 Procedures when running different datasets

In the Preprocessing pipeline we used TTV Bonk, TTV 1b2, T1D ABIS and POS 1 as input data. For the Discovery pipeline we used TTV Bonk, TTV 1b2, T1D ABIS, POS 1 and four of the eight MAARS data sets. In the TTV map pipeline, TTV Bonk and TTV 1b2 were used as input data to further classify TTV sequences. A modified version of the TTV map pipeline was created to target the picornavirus lineages instead of TTV when using T1D ABIS as input data. When running the table scripts for both reads and contigs, all data sets were used, as well as some of the four data sets from the MAARS collection.


6. Results

In this section I clarify what results are produced for each data set and give examples of significant hits, focusing on the TTV data (TTV Bonk and TTV 1b2) and T1D ABIS. Note that at the time of writing not all data sets are completely finished and that these findings are used in ongoing research projects. I have listed some of the other projects in sections 4 and 5.1.6; these are ongoing research and not part of the main goals of this thesis, so I do not display their results here. In the result section for each data set I report the quantity of TTV sequences and other viruses, see sections 6.1.1 (TTV Bonk), 6.1.2 (TTV 1b2) and 6.1.3 (T1D ABIS). In these sections I also display results from other viral sequences in table format and some figures with different TTV variants from the HTML files. The Discovery HTML files (table results) are each several hundred to several thousand pages long, with 100 rows/results on each page.

Section 6.2 gives some examples of data (pie charts) produced by the TTV map pipeline (see section 5.1.5.2). The summary version, which covers all samples, has nine pie charts and each individual sample has six pie charts. Two pie charts are displayed on each HTML page, which gives three pages for an individual sample and five pages for the summary file. In total, we have 49 HTML (pie chart) files, one table used for validating results (not displayed in the results section) and many text-based results (not displayed in the results section) for TTV Bonk. TTV 1b2 has 47 HTML (pie chart) files and T1D ABIS has 49 HTML (pie chart) files.

Note that the data are based on clinical samples without any access to personal information about the patients. The samples are anonymized and pooled, except for TTV 1b2, which is based on individual samples. All the data sets listed in section 5.1.6 have finished results from the different pipelines, except the MAARS data sets, which require a lot more time to finish.


6.1 Finished table scripts based on Discovery pipeline

The scripts that create tables are based on data output from the Discovery pipeline. The tables have different header content in the different HTML files. The table scripts are divided into two parts, reads and contigs (for more information, see sections 5.1.3.1.1 (reads) and 5.1.3.1.2 (contigs)). In all the tables illustrated below, the data are filtered to keep only viruses and bacteriophages. Note that other types of data are also produced in separate HTML files, but they are not presented in this thesis.

6.1.1 TTV Bonk

There are various tables presenting the results from the Discovery pipeline and the scripts that produce tables. The contig-based table has 62 020 results, with several hits on the same contigs. In total, 28 862 records are TTV and 33 158 are other viral sequences. We also created a version where the combination of sample id and sequence id is unique and only one hit per contig is displayed, i.e. each row contains one contig and one result. This simplified version of the contig table was produced from the original contig table by choosing the result with the lowest e-value. In total, it contains 11 675 rows, of which 6731 records are TTV and 4944 are other viral sequences. The read-based version contains 2577 rows of results based on short Illumina reads, of which 1810 rows are TTV and 767 are other viral sequences. Note that each row can represent anything from a few reads up to 1000 or more reads.

See appendix B for an illustration of TTV types in the read-based HTML table. The readme HTML file in appendix C explains the table contents. Appendix D contains a table with different classification categories and stacked bar charts that summarize the content of the read-based tables. Appendix D1 contains a table with a summary of total, TTV, other viruses, bacteria and other sequences. Appendix D2 contains a bar chart with a summary of total, TTV, other viruses, bacteria and other sequences. Appendix D3 contains a bar chart with a summary of total, TTV and other viruses. The final bar chart (see appendix D4) summarizes the total amount of TTV in all samples.

6.1.1.1 Results from TTV Bonk table

All results displayed in this section are data based on the simplified version of the contig table. In appendix E we can see different types of TTV. Note that alignment length is based on amino acid chains and sequence length is based on nucleotide length. To compare “alignment_len” with “seq_len” the value in “alignment_len” needs to be multiplied by three.
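The conversion between the two length columns can be written out as a small Python sketch. The function names are hypothetical; the example values are the alignment and sequence lengths of the Rotavirus A and Astrovirus records shown in tables 3 and 4.

```python
def aligned_nucleotides(alignment_len_aa):
    """Convert an alignment length in amino acids to nucleotides."""
    return alignment_len_aa * 3


def aligned_fraction(alignment_len_aa, seq_len_nt):
    """Fraction of the contig that is covered by the protein alignment."""
    return aligned_nucleotides(alignment_len_aa) / seq_len_nt


# (alignment_len, seq_len) of the Rotavirus A and Astrovirus records.
for alignment_len, seq_len in [(109, 363), (34, 375)]:
    print(aligned_nucleotides(alignment_len), seq_len,
          round(aligned_fraction(alignment_len, seq_len), 3))
```

A fraction close to 1 means that almost the whole contig is covered by the alignment, whereas a small fraction means that only a short part of the contig matches the protein.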


6.1.1.1.1 Other viral sequences

In this section the results displayed are based on other viral sequences retrieved from the simplified version of the contig table. We have one record of a complete polyomavirus genome in table 2, where the alignment length is 512 x 3 = 1536 bp and the length of contig k141_10005 in sample P12752_1024 is 5080 bp.

| sample id | seq id | name | accession number | e-value | alignment length | sequence length |
|---|---|---|---|---|---|---|
| P12752_1024 | k141_10005 | STL_polyomavirus | AMQ77271.1 | 1.3e-197 | 512 | 5080 |

Table 2: Displaying one record of polyomavirus from the TTV Bonk data set P12752_1024.

The data set also contains 83 records of Human parvovirus B19, 14 records of Rhinovirus C, 1539 records of GB virus C (pegivirus) and seven records of Parechovirus. One record of Rotavirus A is shown in table 3.

| sample id | seq id | name | accession number | e-value | alignment length | sequence length |
|---|---|---|---|---|---|---|
| P12752_1011 | k141_3972 | Rotavirus_A | ADK26995.1 | 2.1e-52 | 109 | 363 |

Table 3: Displaying one record of Rotavirus A from the TTV Bonk data set P12752_1011.

One record of Astrovirus is shown in table 4.

| sample id | seq id | name | accession number | e-value | alignment length | sequence length |
|---|---|---|---|---|---|---|
| P12752_1027 | k141_247 | Astrovirus_MLB1 | ACN29691.1 | 1.5e-05 | 34 | 375 |

Table 4: Displaying one record of Astrovirus from the TTV Bonk data set P12752_1027.

Two records of Herpesvirus are shown in table 5.

| sample id | seq id | name | accession number | e-value | alignment length | sequence length |
|---|---|---|---|---|---|---|
| P12752_1013 | k141_1341 | Human_alphaherpesvirus_1_strain_RH2 | BAM73347.1 | 5.0e-126 | 215 | 648 |
| P12752_1033 | k141_3130 | Human_betaherpesvirus_5 | AAL10763.1 | 4.9e-08 | 28 | 337 |

Table 5: Displaying two records of Herpesvirus from the TTV Bonk data set P12752_1013 and P12752_1033.

We also found an unknown virus in sample P12752_1015, based on the following contigs: K141_1466, K141_2208, K141_2271, K141_2276, K141_3265 and K141_3825. This is ongoing research and the data will not be published in this thesis.


6.1.2 TTV 1b2

There are different tables presenting the results from the Discovery pipeline and the scripts that produce the tables. The contig-based table has 45 069 results, with several hits on the same contigs. In total, 19 061 records are TTV and 26 008 are other viral sequences. We also created a version where the combination of sample id and sequence id is unique and only one hit per contig is displayed, i.e. each row contains one contig and one result. This simplified version of the contig table was produced from the original contig table by choosing the result with the lowest e-value. In total, it contains 8717 rows, of which 2784 records are TTV and 5933 are other viral sequences. The read-based version contains 7895 rows of results based on short Illumina reads, of which 2140 rows are TTV and 5755 are other viral sequences. Note that each row can represent anything from a few reads up to 1000 or more reads.

See appendix F for a sample of TTV types in the read-based table. Appendix G contains a table with different classification categories and stacked bar charts that summarize the content of the read-based tables. Appendix G1 contains a table with a summary of total, TTV, other viruses, bacteria and other sequences. Appendix G2 contains a bar chart with a summary of total, TTV, other viruses, bacteria and other sequences. Appendix G3 contains a bar chart with a summary of total, TTV and other viruses. The final bar chart (see appendix G4) summarizes the total amount of TTV in all samples.

6.1.2.1 Results from TTV 1b2 table

All results displayed in this section are based on the simplified version of the contig table. Appendix H shows different types of TTV. Note that alignment length is based on amino acid chains and sequence length is based on nucleotide length. To compare “alignment_len” with “seq_len”, the value in “alignment_len” needs to be multiplied by three. The remaining viral sequences in the simplified version of the contig table have not been analyzed yet, so I cannot display other viral sequences for this data set. Each result must first be examined individually and validated before it is published.


6.1.3 T1D ABIS

There are different tables presenting the results from the Discovery pipeline and the scripts that produce the tables. The contig-based table has 20 131 results, with several hits on the same contigs. In total, 9330 records are TTV and 5236 are other viral sequences. We also created a version where the combination of sample id and sequence id is unique and only one hit per contig is displayed, i.e. each row contains one contig and one result. This simplified version of the contig table was produced from the original contig table by choosing the result with the lowest e-value. In total, it contains 2406 rows, of which 1091 records are TTV and 1315 are other viral sequences. The read-based version contains 8187 rows of results based on short Illumina reads, of which 1443 rows are TTV and 6744 are other viral sequences. Note that each row can represent anything from a few reads up to 1000 or more reads. See appendix I for a sample of TTV types in the read-based table.

6.1.3.1 Results from T1D ABIS table

All results displayed in this section are data based on the simplified version of the contig table. In appendix J we can see different types of TTV. Note that alignment length is based on amino acid chains and sequence length is based on nucleotide length. To compare “alignment_len” with “seq_len” the value in “alignment_len” needs to be multiplied by three.


6.1.3.1.1 Other viral sequences

The results displayed in this section are based on other viral sequences retrieved from the simplified version of the contig table. All the sequences below belong to the picornavirus family. When comparing the alignment length with the sequence length, remember that alignment length is based on protein sequences and sequence length is based on nucleotide sequences. One amino acid corresponds to one codon (three nucleotides). We have one record of Rhinovirus C in table 6.

| sample id | seq id | name | accession number | e-value | alignment length | sequence length |
|---|---|---|---|---|---|---|
| P13409_1007 | K141_3961 | Rhinovirus C | AFD64770.1 | 6.5e-46 | 125 | 629 |

Table 6: Displaying one record of Rhinovirus C from the T1D ABIS data set P13409_1007.

Three records of Human poliovirus 1 are shown in table 7.

| sample id | seq id | name | accession number | e-value | alignment length | sequence length |
|---|---|---|---|---|---|---|
| P13409_1003 | k141_3487 | Human_poliovirus_1 | ANV28233.1 | 4.9e-51 | 130 | 594 |
| P13409_1007 | k141_20415 | Human_poliovirus_1 | CAA24456.1 | 6.7e-71 | 132 | 546 |
| P13409_1007 | k141_5183 | Human_poliovirus_1 | CAB64659.1 | 1.6e-75 | 135 | 583 |

Table 7: Displaying three records of Human poliovirus from the T1D ABIS data set P13409_1007 and P13409_1003.

One record of Enterovirus C is shown in table 8.

| sample id | seq id | name | accession number | e-value | alignment length | sequence length |
|---|---|---|---|---|---|---|
| P13409_1007 | k141_9382 | Enterovirus C | NP_740473.1 | 1.1e-43 | 92 | 526 |

Table 8: Displaying one record of Enterovirus C from the T1D ABIS data set P13409_1007.


6.2 Finished HTML pie charts based on TTV pipeline

The HTML pie charts are generated by a Python script from the output of the TTV map pipeline (see section 5.1.5). All the pie chart results are liable to change in the future depending on how stringently the alignment options are set. We are still working on validating the results from the TTV map pipeline.

6.2.1 Summary of all data samples for TTV Bonk

The scripts generate a summary of all “sample_ids” in the data set. Figure 8 shows the total hits of TTV for all samples in both reads and contigs: the total number of hits and the percentages based on the alignment results, the filtering and the sequences retrieved from the database for both methods (BWA and Magic-BLAST). Figure 9 shows the TTV reference occurrences. Figure 10 shows how much TTV data each “sample_id” provides for both the read and the contig results. The following figures display the different pie charts produced in the summary version of all “sample_ids”. The summary HTML file contains nine pie charts for the different analyses.


Figure 8: This figure shows three pie charts displaying results from three different methods and data. On the left side we can see BWA reads and Magic-BLAST reads, which are based on reads data, and on the right side we can see Magic-BLAST contigs, which are based on contigs data.

Figure 8 shows the pie charts for the different methods and data, based on the alignment results from Magic-BLAST and BWA. These pie charts display how much of the total sequence content is classified as TTV, with the remainder shown as other sequences. For BWA reads, 36.6% of the sequences aligned with the 29 TTV reference genomes in the database. For Magic-BLAST reads, 37.1% of the sequences aligned with the 29 TTV reference genomes in the database. The final pie chart, for contigs data aligned with Magic-BLAST, shows that 8% of the sequence data aligned with the reference genomes in the database.


Figure 9: This figure displays the total amount of each TTV variant in comparison with the other TTV variants for BWA reads, Magic-BLAST reads and Magic-BLAST contigs.

Figure 9 shows pie charts displaying the percentages and the number of sequences aligned to the different types of TTV. Each TTV variant is compared with the total number of other TTV variants.


Figure 10: This figure displays the total amount of TTV from the various sample data in comparison with each other in the different pie charts.

Figure 10 shows pie charts displaying results based on reads aligned using Magic-BLAST and BWA and contigs aligned using Magic-BLAST. In each case the samples are compared with each other to display how much TTV each sample provides.


6.2.2 Example of one HTML file from the TTV Bonk dataset

I have also created HTML files for each “sample_id”. There are 48 “sample_ids” in the TTV Bonk data set. Each HTML file contains six pie charts. For the individual samples I have many results (pie charts), but I only display one example, sample P12653_1002. For more information about sample P12653_1002, see table 1 in section 4.1.1.1. The figures are in the same format as in section 6.2.1, with the exception of the comparison between samples (figure 10), which is not needed for a single “sample_id”. For an example of pie charts based on sample P12653_1002, see appendices K, L and M. Appendix K displays two pie charts for BWA based on reads data, appendix L displays two pie charts for Magic-BLAST based on reads data and appendix M displays two pie charts for Magic-BLAST based on contigs data.

6.2.3 T1D ABIS pie chart data

For T1D ABIS the pipeline is modified to classify picornaviruses instead. Since the content has not yet been fully reviewed and is part of ongoing research, it is not reasonable to display the results.

6.2.4 TTV 1b2 pie chart data

Owing to the size of the raw library data for TTV 1b2 and the long time needed to produce results compared with the much smaller TTV Bonk data set, there was not enough time to review the results.


7. Discussion

We have conducted and obtained results from many different microbiome projects, but the main focus of this thesis is TTV. I exclude many of the results, since it is not reasonable to include them all in this thesis and most of them are part of ongoing research projects. For the TTV Bonk data we have finished some parts, of which I give examples in the results sections. I also give examples of some data from TTV 1b2 and T1D ABIS. The TTV data sets are from patients with different forms of leukemia, tumors and other cancer specimens (see section 4.1.1.1 for a table of the different diseases). In the T1D ABIS data set we focus on TTV sequences and picornaviruses to try to find any association with type 1 diabetes. The results show that we can both discover and classify TTV variants in large volumes of patient data. In the future these results can be used for further research into the way these chronic infections help shape the host immune system and possibly to find associations with disease, including possible connections to autoimmunity.

We have also worked on the identification of previously unknown viruses in these datasets. The Discovery pipeline produces a list of unknown sequences for each data set, but thus far we have not reviewed any of them. This information can be combined with other tools such as VirFinder (Ren et al. 2017) to classify the sequences into two bins, one for unknown viral sequences and the other for unknown non-viral sequences. In section 6.1.1.1.1 we found a new virus that is distantly related to known viruses and is only detectable by protein alignment. This case is uncommon, and most unknown viruses would instead end up in the list of unknown sequences. We can conclude that the pipeline can detect known viral sequences and predict unknown viral sequences. This opens the door for many types of future studies of infectious diseases, including the identification of viruses that may be involved in future pandemics.

When working with the initial pipeline produced in the Andersson Lab, many extensions were added, but the most important part concerned the alignment results for the assembly data (contigs). The alignment output contained too little information about each hit, and each data set could have up to one million hits. The first problem I addressed was gathering the scientific name of each matched sequence. The fastest way to acquire this information was a script that retrieves the header of each FASTA sequence in the local database and decodes the scientific name using a hardcoded pattern. The second step was to fetch all the taxonomic information associated with the scientific name. The taxonomic classification is important because it allows us to filter non-viral sequences out of the end results. Most taxonomic searches are made for small volumes of data using web-based tools. I found that NCBI provides a set of server-side programs called the Entrez Programming Utilities (E-utilities), which I could use from the command line to fetch information from their remote servers. To sort large amounts of data, I have used several filtering and compilation methods when producing the HTML pages, both to remove unnecessary information and to combine results from the different processes in the Discovery pipeline. The metadata retrieved from NCBI, together with the alignment results, filtering methods, compilation methods and data visualization, has made it easier to analyse large volumes of sequence data.
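The header-decoding step described above can be sketched as follows. This is a minimal illustration, not the script used in the pipeline: it assumes headers in the NCBI protein style, where the organism name is given in square brackets at the end of the description. The pattern, the function names `scientific_name_from_header` and `names_from_fasta`, and the layout assumption are all hypothetical.

```python
import re
from typing import Dict, Iterable, Optional

# Assumed header layout: ">ACCESSION description [Organism name]".
# The pattern takes the last bracketed field on the line as the scientific name.
ORGANISM_PATTERN = re.compile(r"\[([^\[\]]+)\]\s*$")


def scientific_name_from_header(header: str) -> Optional[str]:
    """Return the organism name found in a FASTA header, or None if absent."""
    match = ORGANISM_PATTERN.search(header.strip())
    return match.group(1).strip() if match else None


def names_from_fasta(lines: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each sequence identifier to its decoded scientific name."""
    names: Dict[str, Optional[str]] = {}
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence lines carry no metadata
        parts = line[1:].split(None, 1)
        if not parts:
            continue  # skip empty headers
        names[parts[0]] = scientific_name_from_header(line)
    return names
```

Headers that do not match the assumed pattern yield `None`, so such sequences can be collected separately instead of being assigned a wrong name.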

To make the results more reliable, the pipelines compare the output of several methods. In the Discovery pipeline, three methods are used and compared when classifying viral results for short reads. MetaPhlAn2 (Truong et al. 2015) is a nucleotide-based tool for short reads that classifies them against a database of clade-specific marker genes. Kraken2 (Wood et al. 2019) is another nucleotide-based method, which uses k-mers to classify sequence data and is used for both short reads and longer sequences (contigs). The last nucleotide-based tool for short reads is FastViromeExplorer (Tithi et al. 2018), which uses the kallisto algorithm (Bray et al. 2016) to identify viruses. In appendix B1, the Kraken2 results are shown in the third column, FastViromeExplorer in the fourth column and MetaPhlAn2 in the fifth column. If all three methods report the same result, this indicates with high reliability that the virus is present in the sample.
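The agreement check between the three read classifiers can be sketched as follows. This is an illustration under assumptions, not the pipeline's code: each tool's output is assumed to have already been reduced to a set of detected virus names per sample, the function name `support_by_tool` is hypothetical, and the example calls are invented for the illustration.

```python
from collections import Counter
from typing import Dict, Set


def support_by_tool(calls: Dict[str, Set[str]]) -> Dict[str, int]:
    """Count how many tools report each virus name for one sample."""
    counts: Counter = Counter()
    for detected in calls.values():
        counts.update(detected)  # a set gives each tool at most one vote per name
    return dict(counts)


# Invented example calls for a single sample (not taken from the thesis data).
sample_calls = {
    "Kraken2": {"Torque teno virus 1", "Rhinovirus C"},
    "FastViromeExplorer": {"Torque teno virus 1"},
    "MetaPhlAn2": {"Torque teno virus 1", "Rhinovirus C"},
}
support = support_by_tool(sample_calls)
# Names reported by every tool are treated as the most reliable calls.
consensus = sorted(name for name, n in support.items() if n == len(sample_calls))
print(consensus)
```

Keeping the per-name counts, rather than only the intersection, also makes it possible to list results supported by two of the three tools for manual review.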

For the longer sequences, we use assemblers specialized in constructing contigs from metagenomic data. Nucleotide-to-protein alignment was chosen for the contig data because it can classify unknown nucleotide sequences that resemble other viruses only after translation into protein sequences. DIAMOND blastx (Buchfink et al. 2015) was chosen for this step because it is much faster than the standard Blastx and almost as accurate in the number of matches. DIAMOND is therefore less resource demanding, which is important when processing large amounts of clinical data. We also have the nucleotide-based Kraken2 results to compare with the DIAMOND matches, but these carry less weight since the database used with DIAMOND is much larger.

In the TTV map pipeline, we use two different nucleotide alignment methods, Burrows-Wheeler Aligner (Li 2013) and Magic-BLAST (Boratyn et al. 2019), and compare their output. The results and comparisons are shown in the pie charts in section 6.2.

I found two other viral pipelines that use next-generation sequencing (NGS) data with similar ideas but different approaches to classifying viral sequences. VirusSeeker (Zhao et al. 2017) is a viral pipeline that uses Blastx and Blastn (NCBI 2016) to classify reads. Blastn is a nucleotide-based alignment tool. Their first searches use a virus-only database. In all the Blast searches they use the e-value as a threshold for accepting a matched sequence and filter out low-quality results. The first step tries to classify each read with a nucleotide-based Blastn search, and reads without hits are passed to a Blastx search (nucleotide-to-protein alignment). After the Blastx search, sequences are either classified as non-viral or passed to the next process, where the candidate viral reads are compared against bacterial genomes. Reads that match bacteria are also classified as non-viral (false positives). In the following steps, further Blast searches are run against both nucleotide and protein databases, this time against larger NCBI databases, to filter out false positives. In the end they use an NCBI taxonomy database to classify the matched results as virus, phage, ambiguous or non-viral.
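The first two steps of such a cascade, an e-value cut-off followed by forwarding of reads without accepted hits, can be sketched as follows. This is neither VirusSeeker's code nor ours. It assumes BLAST tabular output in the default 12-column layout (`-outfmt 6`), in which the e-value is the eleventh column. The threshold of 1e-5 and the function names `best_hits` and `unresolved` are illustrative assumptions.

```python
from typing import Dict, Iterable, List, Tuple

EVALUE_COLUMN = 10  # zero-based index of the e-value in default BLAST tabular output


def best_hits(rows: Iterable[str], max_evalue: float = 1e-5) -> Dict[str, Tuple[str, float]]:
    """Keep, per query, the subject with the lowest e-value at or below the cut-off."""
    hits: Dict[str, Tuple[str, float]] = {}
    for row in rows:
        if not row.strip() or row.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = row.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        evalue = float(fields[EVALUE_COLUMN])
        if evalue > max_evalue:
            continue  # low-quality match, filtered out
        if query not in hits or evalue < hits[query][1]:
            hits[query] = (subject, evalue)
    return hits


def unresolved(all_queries: Iterable[str], hits: Dict[str, Tuple[str, float]]) -> List[str]:
    """Queries without an accepted nucleotide hit, to be passed on to a protein search."""
    return [query for query in all_queries if query not in hits]
```

The same two functions could be reused for the later, larger database searches by changing only the input file and the cut-off.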

In comparison, the Discovery pipeline tries to classify viral, phage, fungal and bacterial genomes. Depending on the project, the script that produces the visualizations can use the taxonomic information gathered for each result to filter out unwanted results. The approach in VirusSeeker is interesting and could help with the implementation of the Extended Discovery pipeline. To classify unknown sequences faster, we could first search a virus-only database and only afterwards a larger database, which would reduce execution time. For short reads we use methods that are not based on traditional Blast searches and are more diverse and faster in their approach (marker genes, k-mers and kallisto), but for the unresolved reads we plan to use Blastn to classify them further in the Extended Discovery pipeline.

The other pipeline that I have looked into is VirFind (Ho & Tzanetakis 2014), which has a web-based graphical front end. Since it is web-based, I assume that it is not intended for very large volumes of sequence data, whereas in our project we have worked with datasets of up to 0.5 terabytes in raw library format. Like VirusSeeker, VirFind uses both Blastn and Blastx to classify sequences as viral or non-viral. Sequences that remain unresolved after both Blastn and Blastx are translated into protein and then used for conserved domain searches.


7.1 Implementation and testing

During the development of the pipelines, my approach has been to take notes, gather input from project members, test my own ideas and create UML charts for structural purposes. One of the issues during development concerned how to divide the Discovery pipeline into processes. Fetching taxonomic information could be split into several smaller processes, but since there are restrictions on fetching data from NCBI's remote servers, I created one process for all steps. Later during development I also acquired an API key for my NCBI account, which increases the allowed number of requests from 3 to 10 per second. In the future I might request a special API key from NCBI to increase the number of requests even more, or run parallel Discovery pipelines on different servers or computers. Another time-consuming aspect is the number of reads in each data sample (about one million reads), which leads to long hours of processing and debugging. Most data validation has been done with or by the supervisor and to some extent with other project members. For example, we found Rhinovirus C and Poliovirus in the T1D ABIS data set (contig table), from which we extracted reads and contigs to create assemblies. The reads and assemblies were verified against online sources and in Integrative Genomics Viewer (IGV) (Robinson et al. 2011). Below are several examples of IGV views with reads covering the chosen references: figure 11 shows an example with high coverage, figure 12 with good coverage and figure 13 with low coverage.

Figure 11: An example of reads coverage to polyomavirus (high coverage) using IGV.


Figure 12: An example of reads coverage to parechovirus (good coverage) using IGV.

Figure 13: An example of reads coverage to Rhinovirus C (low coverage) using IGV.
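The request limits mentioned in section 7.1 (3 requests per second without an API key, 10 with one) can be respected with a small throttle around each call. The sketch below is an assumption about how this could be done, not the pipeline's implementation. The class name `Throttle`, the function `fetch_all` and the caller-supplied `fetch` function are hypothetical, and the sketch itself makes no network request.

```python
import time
from typing import Callable, Dict, Iterable, Optional


class Throttle:
    """Space out calls so that at most `max_per_second` are started per second."""

    def __init__(self, max_per_second: float):
        self.min_interval = 1.0 / max_per_second
        self._last_call: Optional[float] = None

    def wait(self) -> None:
        """Sleep just long enough to keep the minimum interval between calls."""
        now = time.monotonic()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()


def fetch_all(names: Iterable[str], fetch: Callable[[str], str],
              has_api_key: bool = False) -> Dict[str, str]:
    """Call `fetch` once per name, throttled to the applicable request limit."""
    throttle = Throttle(10 if has_api_key else 3)
    results: Dict[str, str] = {}
    for name in names:
        throttle.wait()
        results[name] = fetch(name)
    return results
```

Because the throttle only tracks time within one process, running several pipelines in parallel with the same API key would still require dividing the limit between them.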


7.2 Goals

The goal of this thesis is to develop new methods, improve and extend current pipelines and create new pipelines. The list below gives the research objectives of this thesis in order of priority, where 1 is the highest priority and 4 the lowest. Below each objective I describe how well it was accomplished.

1. Further development of the viral-pipeline. Include more outputs, visualizations and methods.

In this stage I have created new processes for the Discovery pipeline to make the Diamond Blastx data useful and offer a visualization of data produced from the pipeline. I have also added more outputs, intermediate outputs, final outputs and a list with unknown sequence content. Note that not all of the data has been analyzed in this thesis. The most useful part of the extension is the contigs table created with the viral classifications of all the sequence data. This table version has good data reliability and has been tested with several data sets. Currently, it is the most useful visualization method for the members of the different projects.

2. Create categories and methods for identifying types and strains of TTV.

For this part I have created a new pipeline (see section 5.1.5.2) to identify and classify variants of TTV sequences. Some alignment options have been added, but the pipeline is still under testing and needs more data validation before it can be useful. I might also add new processes for handling repeats if the results contain many sequences with repeats. The pipeline is fully functional, but needs further testing before we can ensure that the data is reliable (see section 6.2 for pie charts).

3. Integrate unknown gene family pipeline for protein prediction in the viral-pipeline.

We are still waiting for the unknown gene family pipeline to finish before we can integrate it with our pipeline.

4. Starting procedures with skin microbiome that are associated with skin disease.

Procedures have started with the skin microbiome (MAARS data), but I will not display any results in this thesis, as the research is ongoing.


7.3 Future work

Many improvements and new developments are possible, both to increase reliability and to add new data outputs. The bullet points below list some possible future work:

● Using Oxford Nanopore sequencing (Lu et al. 2016) to produce reads of complete genomes and increase the reliability of data classified in the pipeline. This will require a new pipeline that compensates for the error rate by using the coverage from the Illumina reads to improve the assembled reads. Later on, the pipeline can be extended to resemble the TTV map pipeline, though perhaps using different alignment tools. Since the reads are longer, the run-time will decrease.

● Possibly creating classifications based on protein sequences and comparing them with the nucleotide-based ones. This will depend on the type of viral sequence and the viral properties we are working with.

● Continued work creating conditions/steps for validating results or changing/adding alignment parameters.

● I will try to obtain a special API key from NCBI to increase pipeline speed and parallelism when running the Discovery and Extended Discovery pipeline.

● Adding a clustering method for similar TTV sequences and more fasta references for TTV map pipeline.

7.4 Conclusions

I have accomplished goals 1 and 2, which are the most crucial for this thesis. For goal 2 we have the TTV map pipeline, which will require more methods to classify TTV types well as the number of sequences in the database increases. Much of the data produced by the pipelines is of good reliability, which was established by studying and validating specific results that the project members were interested in. Goal 3 depends on the protein prediction software, which was not finished within the timeframe of the thesis and therefore could not be integrated into our pipeline. Goal 4 is not completed, since the research is ongoing. The pipeline produced in goal 1 was used for processing the skin microbiome datasets.


8. References

Allander T, Andreasson K, Gupta S, Bjerkner A, Bogdanovic G, Persson MAA, Dalianis T, Ramqvist T, Andersson B. 2007. Identification of a Third Human Polyomavirus. Journal of Virology 81: 4130–4136.

Allander T, Tammi MT, Eriksson M, Bjerkner A, Tiveljung-Lindell A, Andersson B. 2005. Cloning of a human parvovirus by molecular screening of respiratory tract samples. Proceedings of the National Academy of Sciences of the United States of America 102: 12891–12896.

Andrews S. 2019a. Babraham Bioinformatics - Trim Galore! online 2019: https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/. Accessed March 10, 2020.

Andrews S. 2019b. Babraham Bioinformatics - FastQC A Quality Control tool for High Throughput Sequence Data. online 2019: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/. Accessed March 10, 2020.

Ball JK, Curran R, Berridge S, Grabowska AM, Jameson CL, Thomson BJ, Irving WL, Sharp PM. 1999. TT virus sequence heterogeneity in vivo: evidence for co-infection with multiple genetic types. Journal of General Virology 80: 1759–1768.

Barrientos-Somarribas M, Messina DN, Pou C, Lysholm F, Bjerkner A, Allander T, Andersson B, Sonnhammer ELL. 2018. Discovering viral genomes in human metagenomic data by predicting unknown protein families. Scientific Reports 8: 1–12.

Boratyn GM, Thierry-Mieg J, Thierry-Mieg D, Busby B, Madden TL. 2019. Magic-BLAST, an accurate RNA-seq aligner for long and short reads. BMC Bioinformatics 20: 405.

Bray NL, Pimentel H, Melsted P, Pachter L. 2016. Near-optimal probabilistic RNA-seq quantification. Nature Biotechnology 34: 525–527.

Buchfink B, Xie C, Huson DH. 2015. Fast and sensitive protein alignment using DIAMOND. Nature Methods 12: 59–60.

Bushnell B. 2020. BBTools. online 2020: https://jgi.doe.gov/data-and-tools/bbtools/. Accessed March 10, 2020.

Cole CG, McCann OT, Collins JE, Oliver K, Willey D, Gribble SM, Yang F, McLaren K, Rogers J, Ning Z, Beare DM, Dunham I. 2008. Finishing the finished human chromosome 22 sequence. Genome Biology 9: R78.

Ewels P, Magnusson M, Lundin S, Käller M. 2016. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics 32: 3047–3048.

Focosi D, Antonelli G, Pistello M, Maggi F. 2016. Torquetenovirus: the human virome from bench to bedside. Clinical Microbiology and Infection: The Official Publication of the European Society of Clinical Microbiology and Infectious Diseases 22: 589–593.

Gordon A. 2020. gnu.org,Datamash. online 2020: https://www.gnu.org/software/datamash/. Accessed March 10, 2020.

Hazanudin SN, Othman Z, Sekawi Z, Kqueen CY, Rasdi R. 2019. Torque Teno Virus and Hepatitis: A review on correlation. Life Sciences, Medicine and Biomedicine, doi 10.28916/lsmb.3.6.2019.31.

Ho T, Tzanetakis IE. 2014. Development of a virus detection and discovery pipeline using next generation sequencing. Virology 471–473: 54–60.

National Center for Biotechnology Information. 2008. Building a BLAST database with local sequences. National Center for Biotechnology Information (US).


Langmead B, Trapnell C, Pop M, Salzberg SL. 2009. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology 10: R25.

Li D, Liu C-M, Luo R, Sadakane K, Lam T-W. 2015. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics (Oxford, England) 31: 1674–1676.

Li H. 2013. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv:1303.3997 [q-bio]

Li H. 2020. lh3/seqtk. online 2020: https://github.com/lh3/seqtk. Accessed March 10, 2020.

Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup. 2009. The Sequence Alignment/Map format and SAMtools. Bioinformatics (Oxford, England) 25: 2078–2079.

Lu H, Giordano F, Ning Z. 2016. Oxford Nanopore MinION Sequencing and Genome Assembly. Genomics, Proteomics & Bioinformatics 14: 265–279.

Lysholm F, Wetterbom A, Lindau C, Darban H, Bjerkner A, Fahlander K, Lindberg AM, Persson B, Allander T, Andersson B. 2012. Characterization of the viral microbiome in patients with severe lower respiratory tract infections, using metagenomic sequencing. PloS One 7: e30875.

Madden T. 2013. The BLAST Sequence Analysis Tool. National Center for Biotechnology Information (US)

Nurk S, Meleshko D, Korobeynikov A, Pevzner PA. 2017. metaSPAdes: a new versatile metagenomic assembler. Genome Research 27: 824–834.

Quinlan AR, Hall IM. 2010. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26: 841–842.

Ren J, Ahlgren NA, Lu YY, Fuhrman JA, Sun F. 2017. VirFinder: a novel k-mer based tool for identifying viral sequences from assembled metagenomic data. Microbiome 5: 69.

Rho M, Tang H, Ye Y. 2010. FragGeneScan: predicting genes in short and error-prone reads. Nucleic Acids Research 38: e191.

Robinson JT, Thorvaldsdóttir H, Winckler W, Guttman M, Lander ES, Getz G, Mesirov JP. 2011. Integrative genomics viewer. Nature Biotechnology 29: 24–26.

Tithi SS, Aylward FO, Jensen RV, Zhang L. 2018. FastViromeExplorer: a pipeline for virus and phage identification and abundance profiling in metagenomics data. PeerJ 6: e4227.

Truong DT, Franzosa EA, Tickle TL, Scholz M, Weingart G, Pasolli E, Tett A, Huttenhower C, Segata N. 2015. MetaPhlAn2 for enhanced metagenomic taxonomic profiling. Nature Methods 12: 902–903.

Vignolini T, Macera L, Antonelli G, Pistello M, Maggi F, Giannecchini S. 2016. Investigation on torquetenovirus (TTV) microRNA transcriptome in vivo. Virus Research 217: 18–22.

Wood DE, Lu J, Langmead B. 2019. Improved metagenomic analysis with Kraken 2. Genome Biology 20: 257.

Zhao G, Wu G, Lim ES, Droit L, Krishnamurthy S, Barouch DH, Virgin HW, Wang D. 2017. VirusSeeker, a computational pipeline for virus discovery and virome composition analysis. Virology 503: 21–30.

NCBI Resource Coordinators. 2016. Database resources of the National Center for Biotechnology Information. Nucleic Acids Research 44: D7–D19.


9. Appendix

Appendix A

Swedish/English abbreviations  Terminology in Swedish           Terminology in English
ALL                            Akut lymfatisk leukemi           Acute lymphoblastic leukaemia
B-ALL                          ALL av B lymfocyt-karaktär       B lymphocytes (B-cell) acute lymphoblastic leukaemia
T-cells ALL                    T-cells akut lymfatisk leukemi   T lymphocyte (T-cell) acute lymphoblastic leukaemia
Pre B-ALL                      Pancytopeni prodrome B-ALL       Pancytopenic prodrome B-ALL
AML                            Akut myeloisk leukemi            Acute myeloid leukemia
Lymfom/Lymphoma + NHL          Lymfom + Non Hodgkin lymfom      Lymphoma + Non Hodgkin lymphoma
HL                             Hodgkins lymfom                  Hodgkin lymphoma
-                              Solida maligna tumörer           Solid malignant tumors
CNS                            Centrala nervsystemet            Central nervous system
CNS tumörer/tumors             Tumörer i centrala nervsystemet  Central nervous system tumor
AA                             Aplastisk anemi                  Aplastic anemia
-                              Övrig anemi                      Other anemia
-                              Histiocytoser                    Histiocytosis
-                              Övriga benigna diagnoser         Other diagnosis of benign tumors
-                              Kontroll TTV BONK (PBS)          Controls TTV Bonk

Table 9: A table with Swedish terminology translated into English.


Appendix B

Appendix B1

Figure 14: This figure displays the left side of the read based table for TTV Bonk. The right side of the table can be found in figure 15 in appendix B2. All entries listed are TTV sequences.


Appendix B2

Figure 15: This figure displays the right side of the read based table for TTV Bonk. The left side of the table can be found in figure 14 in appendix B1. All entries listed are TTV sequences.


Appendix C

Appendix C1

Figure 16: This figure shows the readme in HTML format from the first page.


Appendix C2

Figure 17: This figure shows the readme in HTML format from the second page.


Appendix C3

Figure 18: This figure shows the readme in HTML format from the third page.


Appendix C4

Figure 19: This figure shows the readme in HTML format from the last page.


Appendix D

Appendix D1

SampleID     Total TTV  Total other viruses  Total bacteria  Total other sequences
P12653_1001  23         150                  1640            238
P12653_1002  1275769    12                   4463            4955
P12653_1003  42         15                   4400            765
P12653_1004  2207126    10045                18773           55876
P12653_1005  119        29789                12760           1914
P12653_1006  1719216    1                    1039            2698
P12752_1001  59         47058                3440            425
P12752_1002  2458177    155                  6453            4985
P12752_1003  6          5224                 1316            180
P12752_1004  78003      32                   21069           40476
P12752_1005  8          4                    2984            419
P12752_1006  677271     4                    2410            1986
P12752_1007  17         38                   2721            1051
P12752_1008  810482     35                   4208            5014
P12752_1009  6          7193                 4825            734
P12752_1010  763000     29                   23720           5619
P12752_1011  27         276049               95074           11827
P12752_1012  1846       1714                 170             469
P12752_1013  53         48                   41984           5989
P12752_1014  139354     27                   5024            17670
P12752_1015  12         154264               30313           3597
P12752_1016  1652363    198                  16990           13347
P12752_1017  64         127                  12413           1373
P12752_1018  1784757    75                   61602           6959
P12752_1019  4          17                   4073            294
P12752_1020  1230829    43                   30991           18207
P12752_1021  8          167588               45611           4507
P12752_1022  104177     106                  11962           12613
P12752_1023  83         33                   27170           4675
P12752_1024  1075061    1536                 3930            11553
P12752_1025  14         445                  15663           3344
P12752_1026  2807476    100                  44159           28252
P12752_1027  10         121                  63497           10087
P12752_1028  844545     8                    21929           31251
P12752_1029  13         132722               210741          36909
P12752_1030  5128       0                    141             976
P12752_1031  46         24238                46537           3655
P12752_1032  480        0                    129             136
P12752_1033  4          1605                 123603          15326
P12752_1034  911434     147                  89617           111558
P12752_1035  51         6421                 75227           15850
P12752_1036  2972853    17                   5921            22400
P12752_1037  206        21                   62589           8432
P12752_1038  2857081    171                  33308           17006
P12752_1039  117        527                  9635            2673
P12752_1040  1488715    16981                47288           32723
P12752_1041  19         6833                 4543848         444756
P12752_1042  27         9                    2281            2656

Table 10: A table displaying total count of all sequences based on the read table for TTV Bonk data sets.
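The share of each category within a sample, which is what a normalized stacked bar chart of these counts would display, can be computed as sketched below. This is an illustration only: the two rows are copied from Table 10, the function name `shares` is hypothetical, and whether the charts in this appendix show raw counts or percentages is not assumed here.

```python
from typing import Dict, Sequence

# Column order follows the header of Table 10.
COLUMNS = ("TTV", "other viruses", "bacteria", "other sequences")

# Two rows copied from Table 10.
COUNTS = {
    "P12653_1001": (23, 150, 1640, 238),
    "P12653_1002": (1275769, 12, 4463, 4955),
}


def shares(counts: Sequence[int]) -> Dict[str, float]:
    """Convert the raw counts of one sample into percentages of the sample total."""
    total = sum(counts)
    if total == 0:
        return {name: 0.0 for name in COLUMNS}  # avoid division by zero
    return {name: 100.0 * value / total for name, value in zip(COLUMNS, counts)}


for sample_id, counts in COUNTS.items():
    percentages = shares(counts)
    print(sample_id, {name: round(p, 2) for name, p in percentages.items()})
```

Normalizing per sample makes samples with very different sequencing depths comparable, at the cost of hiding the absolute counts.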


Appendix D2

Figure 20: This figure shows a stacked bar chart for each data sample based on the data from table 10 in Appendix D1.


Appendix D3

Figure 21: This figure shows a stacked bar chart for each data sample based on the data from table 10 in Appendix D1. In this case the bar chart displays data for TTV and other viral sequences.


Appendix D4

Figure 22: This figure shows a stacked bar chart for each data sample based on the data from table 10 in Appendix D1. In this case the bar chart displays data for TTV sequences only.


Appendix E

Appendix E1

Figure 23: This figure displays the left side of the simplified version of the contig table for TTV Bonk. The right side of the table can be found in figure 24 in appendix E2. All entries listed are TTV sequences.


Appendix E2

Figure 24: This figure displays the right side of the simplified version of the contig table for TTV Bonk. The left side of the table can be found in figure 23 in appendix E1. All entries listed are TTV sequences.


Appendix F

Figure 25: This figure displays the read-based table for TTV 1b2. All entries listed are TTV sequences.


Appendix G

Appendix G1

SampleID     Total TTV  Total other viruses  Total bacteria  Total other sequences
P15003_1001  522        13758                74531           80502
P15003_1002  15819      120846               2682144         1875834
P15003_1003  573        3843                 654010          1260317
P15003_1004  138109     84518                2518300         1019089
P15003_1005  567        991                  1000185         1491514
P15003_1006  641250     281524               4600729         3187611
P15003_1007  1709       74978                361959          482985
P15003_1008  72546      271259               7226265         2964779
P15003_1009  291        10856                353380          396545
P15003_1010  6052       168961               3949857         2227808
P15003_1011  339        1254                 1914180         1060983
P15003_1012  35644      146002               3615364         1960236
P15003_1013  379        2797                 2222369         1517567
P15003_1014  258260     135007               3523737         2369518
P15003_1015  1991       64516                132547          175060
P15003_1016  253262     173377               3639010         2023587
P15003_1017  9834       8932682              546189          456135
P15003_1018  139321     143785               3144195         1476749
P15003_1019  10121      232511               37758           33691
P15003_1020  505031     462112               6160230         3610577
P15003_1021  11322      37444                994191          1424328
P15003_1022  31533      179652               4225487         2025180
P15003_1023  24343      37638929             504637          1295322
P15003_1024  14834268   234636               4757447         2063809
P15003_1025  423        12721                17717           12541
P15003_1026  845        77523                1739555         898920
P15003_1027  475        993                  56545           45182
P15003_1028  4583       314480               4668942         2868741
P15003_1029  476        849                  43210           37017
P15003_1030  1939       106728               3041661         1343873
P15003_1031  2697       78295                37236           63890
P15003_1032  620780     223868               4314763         2695897
P15003_1033  24927      16113                1103258         411624
P15003_1034  147938     95296                1189645         607128
P15003_1035  30984      16782                3606178         2914949
P15003_1036  606450     132272               1531013         3498191
P15003_1037  32654      76798                470470          438597
P15003_1038  41429      59892                1199947         570866
P15003_1039  2474184    12655158             1070205         1017144
P15003_1040  25421083   17253                150772          88350
P15003_1041  205        11321                345468          682421
P15003_1042  941        6367                 260278          114300
P15003_1043  255        1206                 179248          186557
P15003_1044  404703     80232                1691900         1322443
P15003_1045  276        35404                14859246        12730750
P15003_1046  1648       1038064              11319212        8165745

Table 11: A table displaying the total count of all sequences based on the read table for TTV 1b2 data sets.


Appendix G2

Figure 26: This figure shows a stacked bar chart for each data sample based on the data from table 11 in Appendix G1.


Appendix G3

Figure 27: This figure shows a stacked bar chart for each data sample based on the data from table 11 in Appendix G1. In this case the bar chart displays data for TTV and other viral sequences.


Appendix G4

Figure 28: This figure shows a stacked bar chart for each data sample based on the data from table 11 in Appendix G1. In this case the bar chart displays data for TTV sequences only.


Appendix H

Appendix H1

Figure 29: This figure displays the left side of the simplified version of the contig table for TTV 1b2. The right side of the table can be found in figure 30 in appendix H2. All entries listed are TTV sequences.


Appendix H2

Figure 30: This figure displays the right side of the simplified version of the contig table for TTV 1b2. The left side of the table can be found in figure 29 in appendix H1. All entries listed are TTV sequences.


Appendix I

Figure 31: This figure shows the read-based table for T1D ABIS. All entries listed are TTV sequences.


Appendix J

Appendix J1

Figure 32: This figure displays the left side of the simplified version of the contig table for T1D ABIS. The right side of the table can be found in figure 33 in appendix J2. All entries listed are TTV sequences.


Appendix J2

Figure 33: This figure displays the right side of the simplified version of the contig table for T1D ABIS. The left side of the table can be found in figure 32 in appendix J1. All entries listed are TTV sequences.


Appendix K

Appendix K1

Figure 34: An HTML pie chart displaying results based on reads aligned using BWA. 59.1% of the sequences aligned to the 29 TTV reference genomes in the database.


Appendix K2

Figure 35: An HTML pie chart displaying results based on reads aligned using BWA. It shows the percentage and number of sequences aligned to each type of TTV.


Appendix L

Appendix L1

Figure 36: An HTML pie chart displaying results based on reads aligned using Magic-BLAST. 59.7% of the sequences aligned to the 29 TTV reference genomes in the database.


Appendix L2

Figure 37: An HTML pie chart displaying results based on reads aligned using Magic-BLAST. It shows the percentage and number of sequences aligned to each type of TTV.


Appendix M

Appendix M1

Figure 38: An HTML pie chart displaying results based on contigs aligned using Magic-BLAST. 27% of the sequences aligned to the 29 TTV reference genomes in the database.


Appendix M2

Figure 39: An HTML pie chart displaying results based on contigs aligned using Magic-BLAST. It shows the percentage and number of sequences aligned to each type of TTV.
