UPTEC X 04 035 ISSN 1401-2138 SEP 2004 ÅSA INNINGS Bacterial species determination using bioinformatic tools and the Pyrosequencing technology Master’s degree project


Molecular Biotechnology Programme Uppsala University School of Engineering

UPTEC X 04 035

Date of issue: 2004-09

Author: Åsa Innings

Title (English): Bacterial species determination using bioinformatic tools and the Pyrosequencing technology

Title (Swedish):

Abstract

Pyrosequencing is a technology for real-time DNA sequencing. It can be used in a variety of applications, including microbial identification. In this study, methods for bacterial classification using the Pyrosequencing technology were evaluated. The rnpB gene, which codes for the RNA subunit of RNase P, was used as target and members of the Streptococcus genus as model species. Bioinformatic tools were used to find and evaluate discriminatory regions of the target gene and to develop result evaluation tools. The relatively short sequences that can be obtained by Pyrosequencing were found to be discriminatory enough to separate 27 of 48 streptococcal species and subspecies. When a second target region was selected, 42 species could be distinguished. The Pyrosequencing assay designed for streptococcal identification was highly accurate and reproducible.

Keywords Pyrosequencing, DNA sequencing, bacterial identification, Streptococcus, rnpB, RNase P.

Supervisor: Margareta Krabbe, Biosystems, Biotage AB

Scientific reviewer: Björn Herrmann, Department of Clinical Microbiology, Uppsala University Hospital

Project name

Sponsors

Language English

Security

ISSN 1401-2138

Classification

Supplementary bibliographical information Pages 37

Biology Education Centre Biomedical Center Husargatan 3 Uppsala Box 592 S-75124 Uppsala Tel +46 (0)18 4710000 Fax +46 (0)18 555217


Bacterial species determination using bioinformatic tools and the Pyrosequencing technology

Åsa Innings

Summary

The advances made by research in molecular biology and genetics have opened new possibilities and simplified work in a range of fields, for example crime-scene investigations and paternity tests. Analysis of the genetic material, DNA, can have extremely high sensitivity and specificity. To identify bacteria in clinical samples, almost exclusively non-molecular tests are used today. Groups are distinguished by examining morphological and biochemical differences. These methods have a number of limitations that could be reduced by instead analysing differences in the DNA sequence of bacteria. There are many different ways to analyse DNA. DNA sequencing, reading the genetic code, is perhaps the most fundamental but is comparatively technically complicated. In this study a particular method for DNA sequencing was used: Pyrosequencing. The method was applied to the rnpB gene, which is present in all bacteria but whose DNA sequence varies between species. Thanks to this variation, different species can be identified. To evaluate Pyrosequencing and the rnpB gene for bacterial identification, the genus Streptococcus was used, but the method is intended to be applicable to other bacterial groups as well.

Degree project, 20 credits, Molecular Biotechnology Programme, Uppsala University, September 2004


1. INTRODUCTION
2. BACKGROUND
  2.1 BACTERIAL IDENTIFICATION
    2.1.1 Conventional methods
    2.1.2 Molecular approaches
    2.1.3 The target gene
    2.1.4 Streptococcal species identification
  2.2 SEQUENCING
    2.2.1 The Pyrosequencing technology
    2.2.2 Pyrosequencing as a method for bacterial identification
  2.3 AIMS
3. MATERIALS AND METHODS
  3.1 BACTERIAL STRAINS
  3.2 MULTIPLE ALIGNMENT
  3.3 PCR AMPLIFICATION
  3.4 PYROSEQUENCING
  3.5 INFORMATION CONTENT
4. RESULTS
  4.1 VARIABILITY CALCULATIONS
    4.2.1 Program for pair-wise differences
    4.2.2 Selecting the species tag
  4.3 PRIMER DESIGN AND PCR AMPLIFICATION
  4.4 PYROSEQUENCING
    4.4.1 Result evaluation
    4.4.2 Sequence analysis
    4.4.3 Further analysis
    4.4.4 Reproducibility test
    4.4.4 Increasing the discriminatory power of the assay
    4.4.5 PyroStrepID
5. DISCUSSION
  5.1 THE PYROSTREPID ASSAY
    5.1.2 Errors in the sequence interpretation
  5.2 RESULT EVALUATION
    5.2.2 Pyrosequencing adjusted search tool
    5.2.1 Databases
  5.3 RNPB VERSUS 16S
  5.4 DISCRIMINATORY POWER
  5.5 IS A SHORT REGION OF A GENE ENOUGH FOR SPECIES IDENTIFICATION?
6. CONCLUSIONS
7. FUTURE PROSPECTS
8. ACKNOWLEDGEMENTS
9. REFERENCES
APPENDIX


1. INTRODUCTION

Correct classification of bacteria is crucial in clinical diagnostics, as well as in studies of antibiotic resistance and bacterial epidemiology. Rapid identification of the etiological agent of a microbial infection can give both clinical and financial benefits. Phenotypic methods have been, and still are, the most common way of identifying bacterial species. In recent years nucleic acid-based technologies have been developed to complement and improve bacterial classification. The development of accurate and easy-to-use assays is important for the introduction of molecular methods into routine bacterial identification. DNA sequencing has an advantage over other molecular methods since it delivers data that have high information content, are easy to communicate and can be directly related to their origin, i.e. the DNA molecule. Conventional sequencing methods are, however, labour-intensive, expensive and time-consuming. In contrast, the Pyrosequencing technology offers a system that is fast and easy to use and allows large-scale sequencing of relatively short sequences. The technology has the potential to become a widely used method for bacterial identification.

The aim of this project was to develop methods for the design of microorganism identification assays using the Pyrosequencing technology. The Streptococcus genus was used as a model system, with the rnpB gene as target. In a previous study the rnpB gene from 50 Streptococcus species was sequenced by conventional methods (Tapp et al., 2003). Sequence information and DNA samples from that study were used in this project.


2. BACKGROUND

2.1 BACTERIAL IDENTIFICATION

2.1.1 Conventional methods

Conventional phenotypic methods for bacterial detection and identification depend on cultivation of microbial cultures in liquid or plated media followed by morphological and biochemical tests. These approaches are comparatively slow and can be limited in their ability to detect specific species because of unique growth requirements or biochemical inertness. Misidentification can also occur because of phenotypic variation within groups and deviations in the interpretation of the tests.

2.1.2 Molecular approaches

There are a number of molecular methods used to identify bacteria, most of which are based on the amplification of a target sequence fragment. The amplicon is then analysed to get a yes/no answer (DNA hybridisation, species-specific PCR) or a pattern (fragment polymorphism, DNA sequencing) that can be associated with a specific strain, species or group (Maiwald, 2004). Genotyping methods for identifying bacteria have many advantages over phenotypic typing. Both time and labour can be saved since no culture step is needed, which is particularly significant for slow-growing bacteria. Genotyping methods also enable identification of dead bacteria and of fastidious species that are overgrown by other bacteria. Finally, the sensitivity and specificity of molecular approaches are in many cases higher than for phenotypic tests.

DNA sequencing has advantages over other molecular methods. Most methods detect or identify the DNA molecule, but only sequencing generates the primary information, the DNA sequence. A sequence is an objective result, and it has a large information content because it contains both nucleotide composition and position information. A DNA sequence obtained from an organism can be compared to the extensive data available in GenBank and other databases.

2.1.3 The target gene

The most widely used target gene for comparative sequence analysis in bacteria is the 16S rRNA gene, which codes for the structural part of the 30S ribosomal small subunit. DNA sequencing of the 16S rRNA gene is an important tool for phylogenetic studies (Woese et al., 1975), and has also been used for bacterial identification (Jonasson et al., 2002). The 16S gene is rather large (~1500 bases) and contains too little variation for classification of closely related strains (Stackebrandt et al., 1999). In many genera it is a multi-copy gene, which may lead to sequence heterogeneity. Because of these disadvantages, additional target genes have been investigated, including rpoB (Mollet et al., 1997), rnpB (Brown & Pace, 1991) and groEL (Teng et al., 2002). In this study the chosen target gene is the rnpB gene, which codes for ribonuclease P RNA. It has the characteristics required to make it a potential tool for identification of bacteria, and may possibly be even better than the conventional 16S gene for certain genera (Herrmann et al., 2000; Tapp et al., 2003).


Bacterial ribonuclease P (RNase P) is an enzyme involved in tRNA maturation. It consists of a catalytic RNA subunit of approximately 400 nucleotides and a small protein cofactor. The enzyme is essential in all cells that synthesize tRNA but is best understood in bacteria where catalytic activity by the RNA alone has been demonstrated (Kirsebom, 1995 and references therein). The rnpB gene consists of both variable and conserved regions. It is the tertiary structure of the RNA molecule that ensures the enzyme activity. Therefore the secondary structure elements, which consist of stem-loop helices attached to a conserved sequence core (Figure 1), are allowed to vary greatly. The helices vary in number and size between bacterial genera, and both size and nucleotide composition can vary to a great extent even within a genus (Haas & Brown, 1998).

Figure 1. Secondary structure model of the RNase P RNA of Streptococcus pneumoniae, with the P3 and P9 helices labelled. Illustration used with permission from Brown (1999).

2.1.4 Streptococcal species identification

The genus Streptococcus is a diverse group of gram-positive bacteria, both pathogenic and commensal (Figure 2), many of which cause a large variety of human diseases. Two well-known streptococcal species that infect humans are S. pneumoniae, which is a common cause of bacterial pneumonia, and the most pathogenic species in the genus, S. pyogenes. The latter causes a number of diseases and has been referred to in the media as “flesh-eating bacteria” because of its ability to cause infection deep in the tissue, resulting in extensive destruction of muscle and fat (Murray, 1998). Although Streptococcus species can cause life-threatening diseases, many are part of the normal human bacterial flora. Members of the viridans group are normal inhabitants of the oral cavity and some cause dental caries, but they can also cause serious infections such as endocarditis (infection of the heart valves or inner walls of the heart).

Figure 2. Species in the Streptococcus genus are gram-positive, spherical bacteria arranged in pairs or chains. Illustration used with permission from Reaseheath College.

Differentiation of species within the Streptococcus genus is complicated, as classification relies on a combination of complex characters. Different schemes of classification have made group and species definitions unclear (Facklam, 2002). There are three traditional ways of grouping streptococci: (i) Lancefield grouping based on group-specific antigens, (ii) hemolytic patterns depending on the degree to which red blood cells are hemolysed, and (iii) biochemical properties. Each method has its weaknesses. Lancefield groups are not species-specific, and hemolytic activity may vary within species (Murray, 1998). Phenotypic tests for streptococcal identification performed by clinical laboratories today may identify up to 13 % of the strains incorrectly (Kikuchi et al., 1995). Species within the viridans group are especially difficult to differentiate using conventional tests, but correct identification can be crucial as clinical significance and antibiotic susceptibility vary between species (Teng et al., 2002).

The use of molecular methods offers a more precise approach to species identification. Several DNA-based techniques have been developed for the identification of streptococci to the species level. The target genes have included 16S rRNA genes (Bentley et al., 1991; Kawamura et al., 1995), the tRNA gene intergenic spacer (Baele et al., 2000; De Gheldre et al., 1999), 16S-23S rRNA spacers (Schlegel et al., 2003), the gene for D-alanine-D-alanine ligase (Garnier et al., 1997), the rnpB gene (Tapp et al., 2003), the gene for manganese-dependent superoxide dismutase, sodA (Poyart et al., 1998), and the groEL gene (Teng et al., 2002). DNA sequencing of streptococcal genes, especially the 16S rRNA genes but also sodA and rnpB, has contributed largely to phylogenetic studies of the genus, redefined groups and species and added new members (Facklam, 2002). Differentiation methods based on genetic analysis have improved the chances of correct species identification of streptococci, but no method has yet been shown to separate all species in the genus.


2.2 SEQUENCING

The Sanger dideoxy method for DNA sequencing was introduced in the late 1970s (Sanger et al., 1977) and is still the preferred and most widely used sequencing method. The sequence is determined by synthesising DNA fragments of different lengths and separating them by gel electrophoresis or capillary separation methods. The technique has been greatly improved since its introduction, but efforts have also been made to find alternative methods.

2.2.1 The Pyrosequencing technology

The Pyrosequencing technology is a rapid, easy-to-use sequencing-by-synthesis method. No gel electrophoresis, dyes or labels are needed. It allows for large-scale sequencing of short stretches of DNA, 20-50 bases or more in length (Ronaghi, 2001 and references therein).

Figure 3. Principle of the Pyrosequencing technology. As the polymerase incorporates a dispensed nucleotide, pyrophosphate (PPi) is generated. The pyrophosphate triggers a cascade of enzyme reactions that produces light. Apyrase degrades excess nucleotides before the next nucleotide is added. Illustration used with permission from Biotage AB.

The technology is based on the detection of released pyrophosphate during DNA synthesis. As a nucleotide is incorporated a three-enzyme system triggers a reaction cascade resulting in light. If the complementary nucleotide to the template is dispensed to the reaction mixture, it is incorporated by the polymerase and pyrophosphate (PPi) is released. The PPi is in turn converted to ATP by the enzyme ATP-sulfurylase. ATP provides the energy to luciferase to oxidize luciferin, which generates a proportional amount of light. If the dispensed nucleotide is not incorporated, no PPi is produced and no light is emitted. Instead the nucleotide is degraded by apyrase (Figure 3).

Figure 4. Standardised pyrogram showing how peaks are interpreted into a sequence. The x-axis shows the added nucleotides and the y-axis the relative light units (RLU); a single incorporation gives a peak of about 5.5 RLU and a double incorporation a peak of 10.5 ≈ 2 × 5.5 RLU. Sequence read: AGGCAG. Illustration used with permission from Biotage AB.

The emitted light is detected by a CCD camera and visualised as a peak, the height of which is proportional to the amount of light. Each nucleotide incorporation results in a peak, and together the peaks make up a Pyrogram. If there is a homopolymer in the template chain, several nucleotides are incorporated at once; the amount of PPi produced, and hence the height of the peak, is proportional to the number of nucleotides incorporated (Figure 4). The Pyrosequencing software finally translates the Pyrogram into the resulting sequence.
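The relation between dispensed nucleotides, peak heights and the resulting sequence can be illustrated with a small simulation. This is only a sketch of the principle, not the algorithm used by the Pyrosequencing software: the function names, the idealised noise-free peaks and the dispensation order in the example are assumptions made for illustration, and the unit height of 5.5 RLU is borrowed from Figure 4.

```python
def simulate_pyrogram(read, dispensation, unit=1.0):
    """Return one idealised peak height per dispensed nucleotide.

    `read` is the strand being synthesised (5'->3'). A dispensed nucleotide
    that matches the next base is incorporated along the whole homopolymer
    run, so the peak is proportional to the run length; otherwise it is zero.
    """
    peaks = []
    pos = 0
    for nt in dispensation:
        run = 0
        while pos < len(read) and read[pos] == nt:
            run += 1
            pos += 1
        peaks.append(run * unit)
    return peaks


def call_sequence(peaks, dispensation, unit=1.0):
    """Translate peak heights back into a sequence by rounding to whole units."""
    return "".join(nt * round(h / unit) for nt, h in zip(dispensation, peaks))


# Hypothetical dispensation order; the read is the one shown in Figure 4.
peaks = simulate_pyrogram("AGGCAG", "AGCTAG", unit=5.5)
print(peaks)                                      # [5.5, 11.0, 5.5, 0.0, 5.5, 5.5]
print(call_sequence(peaks, "AGCTAG", unit=5.5))   # AGGCAG
```

The double-height peak for the second dispensation corresponds to the GG homopolymer, and the zero peak to a dispensed nucleotide that is not incorporated.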

Figure 5. Example of a Pyrogram. Sequence read: ATA CCG TAT CAG CAA GCT ATT CC. Illustration used with permission from Biotage AB.


2.2.2 Pyrosequencing as a method for bacterial identification

Designing a Pyrosequencing assay for bacterial identification includes several steps. An overview of the strategy is shown in Figure 6. As a first step a target gene is selected and DNA sequences for the species are collected, e.g. from GenBank. A number of DNA sequences are aligned and a short region for sequence analysis, a species tag, is selected. This region should be species-specific for as many species as possible, i.e. have high discriminatory power, and should preferably be flanked by conserved regions. If one region cannot discriminate between all species, a second species tag can be selected. In the next step broad-range primers for PCR amplification are designed. These primers should surround the species tag(s) and be complementary to conserved regions so that they can amplify sequences from all species in question. Broad-range sequencing primers located close to the tag are also designed. PCR and Pyrosequencing analysis are performed and optimised for DNA sequences from all, or a subset, of the selected species. In the final step the resulting sequences are evaluated by comparison to the original sequences, giving the correctness as well as the discriminatory power of the method.

Figure 6. Assay design strategy for bacterial species identification using the Pyrosequencing technology. The steps are, in order: target gene sequences; species tag(s) selection; primer design (PCR and sequencing primers); PCR and Pyrosequencing; result evaluation.
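The notion of a species tag with high discriminatory power can be made concrete with a short sketch that counts how many sequences in an alignment carry a unique subsequence in a candidate region. The function name, the choice to strip gaps before comparing and the toy alignment are all assumptions made for illustration; they are not taken from the study and the sequences are not real rnpB data.

```python
from collections import Counter


def unique_tag_count(aligned, start, end):
    """Count sequences whose tag (alignment columns start..end-1, gaps removed) is unique."""
    tags = {name: seq[start:end].replace("-", "") for name, seq in aligned.items()}
    freq = Counter(tags.values())
    return sum(1 for tag in tags.values() if freq[tag] == 1)


# Hypothetical toy alignment ("-" marks a gap).
toy = {
    "species_A": "GTGCAAT-TTGGA",
    "species_B": "GTGCAATCTTGGA",
    "species_C": "GTGCAAT-TTGGA",
    "species_D": "GTGCAATGATGGA",
}

print(unique_tag_count(toy, 7, 10))  # 2: only species_B and species_D are distinguishable
print(unique_tag_count(toy, 0, 7))   # 0: a fully conserved region separates nothing
```

A conserved region such as columns 0-6 of the toy alignment would be a candidate primer site, while the variable columns that follow it would be a candidate tag.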

2.3 AIMS

The project had two aims. One was to investigate how a bacterial identification panel using the Pyrosequencing technology could be set up, i.e. to develop methods for assay design. The other was to develop such an assay for identification of streptococcal species using the rnpB gene. Several steps in the assay design needed investigation, namely choosing the species tag, designing broad-range primers and evaluating results. At present there are no automatic methods for any of these tasks. In this study the selection of the species tag has been examined in detail, which includes defining and evaluating information content and discriminatory power. Methods for result evaluation have also been developed. The rnpB gene was chosen as the target gene for the streptococcal assay because of its suitable characteristics and the high-quality sequence data available (Tapp et al., 2003). These sequences have been used for species tag selection as well as primer design and result evaluation.


3. MATERIALS AND METHODS

3.1 BACTERIAL STRAINS

For alignment, species tag selection and primer design, rnpB sequences from 50 type strains were used. Strains were originally obtained from CCUG (Culture Collection University of Göteborg). A type strain is a strain of a bacterial species of defined origin, which serves as a reference for other strains. Thirty of the 50 type strains, chosen because they are human pathogens and common species, as well as 22 clinical strains were analysed using the Pyrosequencing technology. Thirteen of the clinical strains had not previously been sequenced. Strains with CCUG numbers are listed in Appendix 1. The database used for comparison contains the rnpB gene sequences from 113 streptococcal strains, of which 49 are type strains. All sequence data are from Tapp et al. (2003).

3.2 MULTIPLE ALIGNMENT

The rnpB sequences from 50 Streptococcus species were aligned using the algorithm Clustal W (Thompson et al., 1994) in the sequence alignment editing program BioEdit. S. pleomorphus was removed as its sequence is clearly different from the others, and it has been suggested not to belong to the Streptococcus genus (Ludwig 1988). Two species, S. waius and S. macedonicus, have identical rnpB sequences and may constitute a single species (Poyart et al., 2002). S. waius has been excluded in the rest of the study. This left 48 sequences. Conserved regions were used as targets for primer design and variable regions as putative species tags. Alignment of 23 Streptococcus species is given in Appendix 2.

3.3 PCR AMPLIFICATION

Primers for PCR amplification were designed using the commercial program OLIGO® (MedProbe, www.medprobe.com) on the rnpB sequence from S. oralis. The top score was obtained for the primer pair A1118FP (forward) and A1120RP (reverse), which generates a 242-254 nucleotide subsequence of the rnpB gene (Table 1). Pyrosequencing requires biotinylated PCR products. The primers named A1119FPB and A1121RPB are 5’-biotinylated versions of A1118FP and A1120RP, respectively. The PCR primer pairs A1118FP+A1121RPB and A1119FPB+A1120RP were used for forward and reverse sequencing, respectively.

Table 1. PCR primers

Name                               Direction  Sequence (5’-3’)      Position1  Length  Tm
A1118FP / A1119FPB (biotinylated)  forward    GTGCAATTTTTGGATAATCG  3-22       20      64.2°C
A1120RP / A1121RPB (biotinylated)  reverse    TGGGTTGCTAGCTTGAGG    233-250    18      66.0°C

1 Position in rnpB sequence for GenBank accession number AJ511692 (S. oralis)

The amplification reaction contained 1 x PCR buffer including 1.5 mM MgCl2 (Qiagen), 0.2 µM primers, 0.20 mM dNTP and 0.2 µM HotStar Taq polymerase (Qiagen). Two microlitres of 100 x diluted prepared DNA (QIAamp DNA Mini kit, Qiagen) were added to the reaction mixture. The PCR was performed in a Thermo Hybaid MBS 0.2 G Thermocycler according to the following set-up: 15 min of enzyme activation at 94°C, 45 cycles of 94°C for 30 s, 58°C for 30 s and 72°C for 30 s, and finally 72°C for 5 min.

3.4 PYROSEQUENCING

Sequencing primers for the P3 region were designed using the Pyrosequencing primer design software (SNP primer design 1.1). The fragments of the rnpB sequences between the PCR primers from 10 Streptococcus species (S. bovis, S. gordonii, S. infantis, S. intermedius, S. mutans, S. oralis, S. pneumoniae, S. pyogenes, S. sobrinus and S. vestibularis) were used as input data. As the software is designed for single nucleotide polymorphism (SNP) assays, the first variable position in the P3 region was indicated as a SNP. The three alternative primers shown in Table 2 and Figure 10 were selected. Sequencing primers for the P9 region were designed in a similar manner and are shown in Table 3 and Figure 14.

Table 2. Sequencing primers for the P3 region

Name     Direction  Sequence (5’-3’)   Position1  Length  Dispensation order
A1122FS  forward    TTTTGGATAATCGC     10-23      14      CGCT(CTGA)×20
A1123FS  forward    CAATTTTTGGATAATCG  6-22       17      TCGA(CTGA)×20
A1124RS  reverse    AGCATGGACTTTCCTC   45-60      16      CATG(CTGA)×20

1 Position in rnpB sequence for GenBank accession number AJ511692 (S. oralis)

Table 3. Sequencing primers for the P9 region

Name     Direction  Sequence (5’-3’)  Position1  Length  Dispensation order
A1166FS  forward    AATAAGCCTAGGG     101-113    13      CGCT(CTGA)×20
A1171FS  forward    AATAAGCCTAGGGA    101-114    14      GCTA(GACT)×20

1 Position in rnpB sequence for GenBank accession number AJ511692 (S. oralis)

Sample preparation was performed using the standard protocol for SQA96MA and VPT (Vacuum Prep Tool). Ten or 25 µl of biotinylated PCR product was used. To reduce background peaks, 1.1 µg SSB (Single Strand Binding protein; Biotage AB) was added after annealing in some reactions. Sequencing was performed in a 96MA instrument with the SQA reagent kit. The nucleotide dispensation orders are given in Tables 2 and 3.

Initial sequence analysis of the DNA from type strains of S. agalactiae, S. anginosus, S. intermedius, S. mutans, S. pneumoniae and S. pyogenes was performed to optimise the sequencing conditions and evaluate the primers. DNA from 23 type strains was then sequenced to further evaluate primers and conditions. Seven additional type strains and 22 clinical strains were analysed using the selected primer and sequencing conditions. DNA from seven species was sequenced to test the reproducibility of the assay. Each species was sequenced in five repeated analyses, with four reactions in each.
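The dispensation orders in Tables 2 and 3 are written in a compact notation. Assuming that a string such as CGCT(CTGA)×20 means four initial dispensations followed by 20 repeats of the cycle in parentheses (an interpretation, not something the tables state explicitly), the full dispensation string can be expanded as in the sketch below. The helper name is hypothetical.

```python
import re


def expand_dispensation(notation):
    """Expand e.g. 'CGCT(CTGA)×20' into the full nucleotide dispensation string."""
    match = re.fullmatch(r"([ACGT]*)\(([ACGT]+)\)[×x](\d+)", notation)
    if match is None:
        raise ValueError(f"unrecognised dispensation notation: {notation!r}")
    prefix, cycle, repeats = match.groups()
    return prefix + cycle * int(repeats)


order = expand_dispensation("CGCT(CTGA)×20")
print(len(order))   # 84 dispensations in total
print(order[:12])   # CGCTCTGACTGA
```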


3.5 INFORMATION CONTENT

The diversity, i.e. the variability, was calculated using the Shannon-Wiener equation, which provides the Shannon-Wiener information index H:

$$H = -\sum_{i=1}^{C} p_i \log_2(p_i)$$

where C is the number of categories (the four nucleotides and possibly gaps) and p_i is the proportion of samples belonging to the i-th category (Shannon, 1949; Wiener, 1949). For each nucleotide site, the S-W index (Shannon-Wiener information index) ranges from 0 (conserved nucleotide positions) to log2(C) (equal proportions of the samples belonging to each category). The index can also be expressed as entropy. The index was calculated for each nucleotide position in the multiple rnpB sequence alignment of the 48 streptococci type strains, identifying conserved as well as highly variable regions. Comparative 16S Shannon-Wiener index analyses were also performed using the GenBank sequences in Tapp et al. (2003).

A method for estimating discriminatory power was developed, based on calculation of pair-wise nucleotide differences. Matlab version 6.1.0.450 release 12.1 was used to implement the program, which was based on the Needleman-Wunsch algorithm (Needleman & Wunsch, 1970). The program was also implemented in a C# environment to make it more user-friendly.
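The per-site Shannon-Wiener index described in this section can be illustrated with a minimal Python sketch. It is not the implementation used in the study; the function names and the toy alignment are hypothetical, and gaps ('-') are assumed to count as a fifth category.

```python
import math
from collections import Counter


def shannon_wiener(column):
    """Return H = -sum(p_i * log2(p_i)) for one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def column_indices(alignment):
    """Compute the index for every column of equal-length aligned sequences."""
    return [shannon_wiener(col) for col in zip(*alignment)]


# Hypothetical toy alignment (not data from the study); '-' marks a gap.
toy_alignment = ["ACGT", "ACGA", "ACCC", "AC-G"]
indices = column_indices(toy_alignment)
```

In this toy input the first two columns are fully conserved and give an index of 0, while the last column contains four different nucleotides in equal proportions and reaches the maximum for four observed categories.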


4. RESULTS

4.1 VARIABILITY CALCULATIONS

Variability calculations using the Shannon-Wiener index throughout the rnpB gene revealed both highly variable and conserved regions (Figure 7). The region corresponding to the P3 helix loop was among the most variable and was flanked by regions with no variability. The variability of the P9 region was also large, but its flanking regions were less conserved. The P3 variable region had an average S-W index of 1.2, whereas the average value for the whole gene is 0.45 (Tapp et al., 2003).

Figure 7. S-W index diagram displaying the index for each site in the rnpB sequences of 48 streptococcal species. The P3 and P9 regions are indicated in blue.

Similar S-W index calculations were performed for 16S. Not all 48 species could be included due to incomplete sequence data. Thirty-two type strains had more or less complete 16S sequence data and the S-W index is shown in Figures 8 b, c and d. For comparison, the same 32 type strains have been used to calculate the S-W index for rnpB (Figure 8 a). The S-W index diagram for 16S was split in three because 16S is approximately three times larger than rnpB. 16S displayed large regions of small or no variability, but also regions with variation of the same magnitude as the P3 loop in rnpB. The whole 16S gene has an average S-W index of 0.15 (Tapp et al., 2003).

Figure 8. Comparative S-W index calculations. 8 a: calculations made on rnpB sequences from 32 streptococcal species. 8 b, c, d: 16S sequences from the same 32 species divided in: b, position 49 to 497; c, position 498 to 956; d, position 957 to 1414. Positions in S. oralis, GenBank accession no. AF003932.

4.2.1 Program for pair-wise differences

The discriminatory power of a specific sequence region can be evaluated by calculating the pair-wise differences of the sequences. This information can be used to select the most informative region. A program called Difference Quantification using Pair-wise alignment (DQuPA) was developed to quantify the pair-wise differences. It is mainly designed for genes containing many gaps, e.g. RNA-coding genes. Programs that make similar calculations are available, but they lack specific features: all programs examined ignore positions with gaps in the difference quantification, and no program was found that uses pair-wise alignment to compare two sequences.

DQuPA takes a multiple alignment of DNA sequences as input data and compares a selected region of the alignment pair-wise for all sequences. The program either compares the aligned sequences directly or makes a new pair-wise alignment between all sequence pairs. A multiple alignment is a compromise between all the sequences, so some sequence pairs may be aligned in a way that produces more differences than necessary. Making a new pair-wise alignment ensures optimal alignment between all pairs. The option to compare the sequences directly from the multiple alignment has been implemented because the pair-wise alignment algorithm makes the program slow for large datasets (>20 sequences). The program makes (n^2 - n)/2 alignments, where n is the number of sequences, and each alignment takes about 0.4 s. For 48 sequences the program needs about 8 minutes to complete the calculations.

An additional option has been implemented, which can be used for regions with large insertions in a subset of the sequences: instead of a region, a starting point and the number of nucleotides to compare are selected. This option can also be used with or without pair-wise alignment.
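The run time quoted for 48 sequences follows directly from the number of sequence pairs and the stated time of about 0.4 s per alignment. As a worked check:

```latex
\frac{n^2 - n}{2}\bigg|_{n=48} = \frac{2304 - 48}{2} = 1128 \text{ alignments},
\qquad 1128 \times 0.4\ \text{s} \approx 451\ \text{s} \approx 7.5\ \text{min}
```

This is consistent with the figure of about 8 minutes given in the text.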
The comparison is quantified as the number of nucleotide differences between the sequence pairs. A gap in one of the sequences is considered to be a difference equal to a nucleotide substitution. The output is as follows: the average number of differences, the difference matrix, pairs of species with identical sequences, pairs of species with one difference, and a difference matrix where gaps are ignored. The user interface is shown in Appendix 3.
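The two comparison modes can be sketched in Python as follows. This is an illustrative reconstruction, not the DQuPA source code: the function names, the unit costs for substitutions and gaps in the realignment, and the toy sequences are all assumptions. The realignment uses Needleman-Wunsch-style dynamic programming to find the smallest number of differences for a pair.

```python
from itertools import combinations


def direct_differences(a, b):
    """Count differing columns of two rows taken from one multiple alignment.

    A gap against a nucleotide counts as one difference; a column where both
    rows have a gap compares equal and is therefore not counted.
    """
    return sum(1 for x, y in zip(a, b) if x != y)


def realigned_differences(a, b):
    """Minimum number of differences over all global alignments of a and b.

    Gaps from the multiple alignment are removed first. Substitutions and
    gaps each cost 1 (assumed costs; the original parameters are not known).
    """
    a, b = a.replace("-", ""), b.replace("-", "")
    prev = list(range(len(b) + 1))  # aligning an empty prefix of a to b
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j - 1] + (x != y),  # match or substitution
                            prev[j] + 1,             # gap in b
                            curr[j - 1] + 1))        # gap in a
        prev = curr
    return prev[-1]


def difference_matrix(seqs, measure):
    """Apply a pair-wise measure to all (n^2 - n)/2 pairs of sequences."""
    return {(p, q): measure(seqs[p], seqs[q])
            for p, q in combinations(sorted(seqs), 2)}


# Hypothetical rows of a multiple alignment (not data from the study).
toy = {"sp1": "ACGT-A", "sp2": "ACGTTA", "sp3": "A-CGTA"}
direct = difference_matrix(toy, direct_differences)
realigned = difference_matrix(toy, realigned_differences)
```

In the toy input, sp1 and sp3 contain the same nucleotides but are placed differently in the multiple alignment, so the direct comparison reports differences that disappear after pair-wise realignment. This mirrors the gap between the averages with and without pair-wise alignment reported for the P3 and P9 regions.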

4.2.2 Selecting the species tag

Hypervariable regions in the rnpB gene found with the S-W index were further investigated using the method described above. The average number of differences in the P3 region was 7.8 with pair-wise alignment and 10.2 without. All pair-wise differences are shown in the form of a difference matrix in Appendix 4. The P3 region separated 27 species, leaving nine pairs and one triplet with identical P3 region sequences. The P9 region had 9.6 differences on average with pair-wise alignment and 11.6 without. Five species pairs and one triplet had identical sequences. Even though the P9 region was slightly more discriminatory, the P3 region was chosen as the primary target because of the conserved region surrounding it, which simplifies primer design. P3 is located in the beginning of the rnpB gene and consists of 11 (S. ferus) to 23 (S. mutans) variable nucleotides (Figure 9).
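The species counts for the P3 region are internally consistent: nine pairs and one triplet account for 9 × 2 + 3 = 21 species, which leaves 48 − 21 = 27 species with a unique P3 sequence. Grouping species by identical tag sequences can be sketched as below; the helper names and the toy tag sequences are hypothetical and are not taken from the study.

```python
from collections import defaultdict


def group_by_tag(tags):
    """Map each distinct tag sequence to the sorted list of species carrying it."""
    groups = defaultdict(list)
    for species, tag in tags.items():
        groups[tag].append(species)
    return {tag: sorted(names) for tag, names in groups.items()}


def count_separated(tags):
    """Number of species whose tag sequence is shared with no other species."""
    return sum(1 for names in group_by_tag(tags).values() if len(names) == 1)


def identical_groups(tags):
    """Groups of two or more species that cannot be told apart by the tag."""
    return sorted(names for names in group_by_tag(tags).values() if len(names) > 1)


# Hypothetical tag sequences for four placeholder species.
toy_tags = {"sp1": "ACGTAC", "sp2": "ACGTAC", "sp3": "ACGGAC", "sp4": "TTGTAC"}
separated = count_separated(toy_tags)
unresolved = identical_groups(toy_tags)
```

Applying the same grouping to a second tag, restricted to the unresolved groups from the first, gives the species that remain inseparable when two regions are combined.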


Figure 9. Sequences of the P3 region from the ten species used for sequence primer design and S. ferus. Surrounding conserved regions shaded in grey.

The P9 region was selected as a complementary target to enable discrimination of species with identical P3 sequences. Together the sequences corresponding to the P3 and P9 loops can separate all but three pairs of species: S. anginosus and S. constellatus, S. macedonicus and S. gallolyticus, S. equinus and S. bovis. These pairs have four, three and one nucleotide difference in the whole gene, respectively. The latter two pairs have been suggested to constitute a single species (Facklam, 2002) and S. anginosus in the former has an atypical type strain (Whiley, 1997).

Table 4. Nine pairs of species and one triplet have identical P3 region sequences

                                                       No. of pair-wise differences
Species identical in P3 region                         P9 region      whole rnpB
S. anginosus / S. constellatus                         0              4
S. salivarius / S. vestibularis                        1              6
S. bovis / S. equinus                                  0              1
S. infant coli / S. infant infant                      1              3
S. gallolyticus / S. macedonicus¹ (S. waius¹)          0              3
S. pyogenes / S. iniae¹                                >4             >10
S. orisratti¹ / S. urinalis                            >4             >10
S. phocae¹ / S. pluranimalium¹                         >4             >10
S. dysgal equisimilis / S. parauberis¹                 >4             >10
S. canis / S. dysgal. dysgal.¹ / S. equi zooepid       2 / 2 / 4      >10

¹ Species never or exceptionally reported as human pathogens


4.3 PRIMER DESIGN AND PCR AMPLIFICATION

The designed PCR primers generated a fragment containing 243-255 nucleotides of the rnpB gene including both the P3 and P9 regions. This fragment was used as template in the Pyrosequencing primer design software. Broad-range sequencing primers for Pyrosequencing were designed by evaluating sequences one at a time. The ten species used for these tests were selected to represent the sequence variation in the Streptococcus genus. The software did not detect template looping, a common problem, in any of the ten template sequences. The forward primer A1123FS was deemed to be the best primer for all species except S. intermedius and S. gordonii, for which the software warned of alternative priming sites and therefore gave a low score. The highest scoring forward primer that all species had in common was A1122FS. A1124RS was the highest scoring primer for reverse sequencing.

The genus specificity of the PCR primers and sequencing primer A1123FS was determined by searching against GenBank. The forward PCR primer sequence gave exclusively streptococci among the species with the highest BLAST score and lowest E-value. The lowest E-value for non-Streptococcus species was about 1000 times higher than that for streptococci. The reverse PCR primer was a little less exclusive. The vast majority of the species that gave the highest BLAST scores with the sequencing primer were Streptococcus species. Together, these analyses indicate that the primers used are highly selective for the Streptococcus genus and that DNA from species of other bacterial genera is not likely to be amplified in the PCR reaction.

Figure 10. The three sequencing primers designed for the P3 region (A1122FS, A1123FS and A1124RS). The template sequence is from S. pneumoniae (GenBank accession number AJ511703):

GTGCAATTTTTGGATAATCGCGTGAGGAGAATTGCTTCTCATGAGGAAAGTCCATGCTAGCACAG

The PCR reaction amplified the fragment of the rnpB gene for all tested streptococcal species to a high yield and specificity, as indicated by clear and distinct bands on an electrophoresis gel (Figure 11).

Figure 11. Gel electrophoresis of the ~250 bp PCR product from 44 Streptococcus strains. The two empty wells are a consequence of laboratory errors.
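The placement of the three P3 sequencing primers on the S. pneumoniae template fragment of Figure 10 can be checked with a short Python sketch. The primer sequences are those of Table 2 and the template is the fragment printed in Figure 10; the function names are hypothetical, and exact string matching is used here as a simplification of real primer annealing.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5'-3'."""
    return seq.translate(COMPLEMENT)[::-1]


def locate_primer(template, primer, direction):
    """Return the 0-based start of the primer site in the template, or -1.

    A forward primer is searched as written; a reverse primer binds the
    shown strand, so its reverse complement is searched instead.
    """
    site = primer if direction == "forward" else reverse_complement(primer)
    return template.find(site)


# Template fragment from Figure 10 (S. pneumoniae) and primers from Table 2.
TEMPLATE = "GTGCAATTTTTGGATAATCGCGTGAGGAGAATTGCTTCTCATGAGGAAAGTCCATGCTAGCACAG"
PRIMERS = {
    "A1122FS": ("TTTTGGATAATCGC", "forward"),
    "A1123FS": ("CAATTTTTGGATAATCG", "forward"),
    "A1124RS": ("AGCATGGACTTTCCTC", "reverse"),
}
positions = {name: locate_primer(TEMPLATE, seq, direction)
             for name, (seq, direction) in PRIMERS.items()}
```

All three primers are found in the fragment, with the two forward primers upstream of the variable region and the reverse primer downstream of it. The offsets refer to this fragment only and differ from the positions in Table 2, which are given for the S. oralis sequence.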


4.4 PYROSEQUENCING

The Pyrosequencing instrument produces a pyrogram for each sequencing reaction. The pyrogram is then interpreted to a DNA sequence by the software. The sequence can in turn be associated with its corresponding species.
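How a dispensation order relates to a pyrogram can be illustrated with an idealised model in which each dispensed nucleotide is incorporated once per matching base and the peak height is proportional to the number of incorporations. The sketch below is an assumption-laden simplification, not the instrument software: real peaks are measured in RLU and are not perfectly linear, and the function names are hypothetical. The example uses the dispensation order of primer A1123FS (Table 2) and the first 15 bases that follow that primer in the Figure 10 fragment.

```python
def expand_dispensation(prefix, cycle, repeats):
    """Build a dispensation order such as TCGA(CTGA)x20 as a flat string."""
    return prefix + cycle * repeats


def ideal_pyrogram(synthesised, dispensation):
    """Return (nucleotide, number incorporated) for each dispensation.

    `synthesised` is the strand built from the primer's 3' end, written 5'-3'.
    The model is error-free: no background and no out-of-phase signal.
    """
    pos = 0
    peaks = []
    for nt in dispensation:
        run = 0
        while pos < len(synthesised) and synthesised[pos] == nt:
            run += 1
            pos += 1
        peaks.append((nt, run))
    return peaks


def call_sequence(peaks):
    """Translate ideal peaks back into the sequence they encode."""
    return "".join(nt * n for nt, n in peaks)


dispensation = expand_dispensation("TCGA", "CTGA", 20)
downstream = "CGTGAGGAGAATTGC"  # bases following A1123FS in the Figure 10 fragment
peaks = ideal_pyrogram(downstream, dispensation)
called = call_sequence(peaks)
```

A dispensation that meets no matching base gives a peak of zero, and a homopolymer gives a single peak whose height equals the number of identical bases, which is why homopolymer lengths must be read from peak height rather than from peak count.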

4.4.1 Result evaluation

Linking the obtained sequence with its species requires a comparison to reference sequences. There is a tool for multiple alignment in the Pyrosequencing software where the obtained sequences can be compared to a master sequence. Sequences from different species can be evaluated with this tool only by evaluating one species at a time. The resulting sequences can also be exported to a sequence alignment/editing program, such as BioEdit, together with reference sequences from a database. This approach enables simultaneous alignment of many sequences. The result is quantified either as the number of correctly read bases until the first error (read length) or as the number of errors in the first X bases, where X is preferably longer than the variable region. Here the read length is presented as well as the errors in the first 30 nucleotides.

Sequences may also be matched with sequences in a database using a search tool, e.g. BLAST. This approach was tested using the local BLAST application in BioEdit and the database sequences mentioned above. A search can also be made against a public nucleotide database. Only one sequence at a time can be evaluated using BLAST. This approach provides additional information, i.e. BLAST score and E-value. It is the preferred evaluation method when unknown samples are analysed.
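The two quantities used to report results can be sketched in Python as follows. The function names and toy sequences are hypothetical, and the comparison is position by position, which is an assumption: an insertion or deletion in the obtained sequence shifts all later positions, so a comparison based on an alignment would be needed to count errors after such an event.

```python
def read_length(obtained, reference):
    """Number of correctly read bases before the first error."""
    n = 0
    for a, b in zip(obtained, reference):
        if a != b:
            break
        n += 1
    return n


def errors_in_first(obtained, reference, x=30):
    """Number of errors among the first x bases, compared position by position.

    Bases missing from a too-short obtained sequence are counted as errors.
    """
    mismatches = sum(1 for a, b in zip(obtained[:x], reference[:x]) if a != b)
    missing = max(0, min(x, len(reference)) - len(obtained[:x]))
    return mismatches + missing


# Hypothetical toy sequences (not data from the study).
reference = "ACGTACGTAC"
obtained = "ACGTAGGTAC"
length = read_length(obtained, reference)
errors = errors_in_first(obtained, reference, x=10)
```

For the toy pair, the single substitution ends the read length at that position while the error count over the first ten bases remains low, which shows why the two measures can rank the same reaction differently.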

4.4.2 Sequence analysis

Initial sequence analysis of the rnpB DNA fragment of six type strains was performed to evaluate the three sequencing primers shown in Table 2 and Figure 10, to assess the effects of SSB and to determine the appropriate amount of PCR product. The standard protocol recommends 25 µl of PCR product, but this amount resulted in peaks that were high (approximately 40 RLU) and wide, which reduced the read length. Lowering the amount of PCR product to 10 µl gave narrower peaks with heights of about 10-20 RLU (data not shown). No background signal was detected for any of the three primers in sequencing reactions with no template. Sequencing without primer was performed for two templates, S. anginosus and S. intermedius. No peaks were higher than 1.6 RLU (data not shown).

Results from sequencing reactions using the three primers and 10 µl of PCR product, with and without the addition of SSB, are shown in Table 5. Using primer A1122FS, long read lengths were achieved, but the addition of SSB dramatically decreased the quality of the pyrograms and reduced the read length. The other two sequencing primers yielded good results and the addition of SSB slightly improved the read lengths, especially for primer A1124RS. The significant decrease in read length observed when using primer A1122FS with added SSB indicates unstable binding between primer and template. This primer was therefore excluded from further analysis.


Table 5. Initial Pyrosequencing analysis. Read lengths for the three primers with and without the addition of SSB

Primer:            A1122FS          A1123FS          A1124RS
                   SSB   No SSB     SSB   No SSB     SSB   No SSB
S. anginosus       0     39         50    35         39    7
S. pyogenes        nd    53         54    54         32    35
S. pneumoniae      0     48         52    52         40    40
S. agalactiae      8     48         27    51         37    32
S. mutans          nd    55         56    56         39    34
S. intermedius     8     40         57    41         41    40

Figure 12. Example pyrogram from the sequencing of the P3 region with primer A1123FS on template from S. pneumoniae. Variable region in coloured box.

Sequence analysis of the PCR fragment from the type strains of 23 human pathogenic streptococci was performed to further evaluate the two primers A1123FS and A1124RS. Results are shown in Appendix 5. Five reactions with single peak heights less than 4 RLU were excluded because the true peaks were difficult to separate from background peaks. The variation in read length and the effect of SSB for the two primers are represented in Figure 12 A, B. SSB had a large impact on the read length for primer A1124RS, whereas the sequencing reaction for primer A1123FS was more or less unaffected. Sequencing S. gordonii and S. intermedius DNA using primer A1123FS resulted in good-quality pyrograms and long read lengths despite the warning for alternative priming sites in the primer design software (Figure 12 A, species 7 and 11). The average number of correctly read bases was much higher for primer A1123FS, both with and without the addition of SSB. This was partly a consequence of the sequence reactions reaching the end of the PCR fragment when primer A1124RS was used. A fairer comparison is the number of errors in the first 30 nucleotides. Primer A1123FS was again superior and was the primer used in subsequent analyses.


Figure 12. Number of correct nucleotides in the DNA sequences (read lengths) obtained from the analysis of 23 streptococci type strains using primer A1123FS (A) and A1124RS (B) with and without the addition of SSB. Both panels plot read length (0 to 50) for each type strain; the series are SSB and No SSB, with an additional No SSB 2nd run series in panel A.

4.4.3 Further analysis

To evaluate the assay further, a larger dataset of additional type strains, clinical strains and strains with unclear identification was sequenced using primer A1123FS, no SSB and 10 µl PCR product. The results are shown in Appendix 6. Two type strains, S. canis and S. gallolyticus, gave poor results in the first analysis, but improved results were obtained in the second experiment. The pyrograms of the clinical strains S. equi zooepidemicus CCUG 43890 and S. equi equi CCUG 27367 could not be interpreted correctly by the software due to false peaks. Addition of SSB to the reactions solved the problem. The unclassified strains had been more or less precisely identified by biochemical methods. For three of the 13 strains, the DNA sequences obtained had no 100 percent match with any of the sequences in the database. For four strains, the identification results from the Pyrosequencing analysis were inconsistent with the group or species previously assigned (Appendix 6).

4.4.4 Reproducibility test

Seven species were selected to test the reproducibility of the assay using the conditions described above. The choice of species was based on different characteristics of the species and their corresponding sequences: S. pneumoniae and S. pyogenes are common human pathogens, S. pyogenes has sequence similarities to S. dysgalactiae equisimilis and S. vestibularis to S. thermophilus. In previous experiments S. vestibularis and S. equi zooepidemicus gave short read lengths, and S. gordonii has a six-nucleotide homopolymer in the P3 region. The species were sequenced four times in five repeat analyses, making 20 sequencing reactions in total for each species. In addition, the four reactions for each species were placed on different positions on the 96-well plate.

All the sequences obtained were correct for more than 25 bases, except for the homopolymer in S. gordonii (Figure 13). In eleven of the 20 sequencing reactions for S. gordonii the homopolymer was interpreted as five nucleotides instead of the correct six, which limited the correctly read S. gordonii sequence to 17 nucleotides. For S. pyogenes and S. thermophilus one reaction gave sequences shorter than 30 nucleotides and, for S. vestibularis, three reactions. All other sequences were correct for 38 nucleotides or more. There was no tendency for variation between the different positions on the plate. The BLAST search approach for result evaluation was used and the correct species received the top score for all sequencing reactions used to test reproducibility.

Figure 13. Result from the reproducibility test. The rnpB amplicon from each species has been sequenced 20 times. The average, as well as the smallest and largest number of correctly read bases, is presented. The data used in the diagram is found in Appendix 7. The chart plots read length (axis 0 to 70) as Max, Average and Min for S. dysgal equi, S. equi zooepid, S. gordonii, S. pneumoniae, S. pyogenes, S. thermophilus and S. vestibularis.


4.4.4 Increasing the discriminatory power of the assay

The P9 region of the rnpB gene was used as a second species tag to enable resolution of the species identical in the P3 region (see Table 4). Forward sequencing primers were designed for the P9 region with the same methods as for the P3 region. Forward sequencing enables the use of the same PCR fragment as for the P3 region. The P9 region is not surrounded by conserved regions to the same extent as the P3 region (Figure 14), which complicates primer design. The first primer designed (A1166FS) had the potential of priming the sequencing reaction for all the P3 inseparable species except S. iniae (the 5’ end mismatch for S. pyogenes is very unlikely to influence the sequencing reaction). This primer binds to itself, producing a primer dimer, which means that the primer primes sequencing on itself. Pyrosequencing with primer A1166FS resulted in pyrograms with high background levels (data not shown). Additional primers were designed and the most promising one was A1171FS (5’-AATAAGCCTAGGGA-3’). In theory, this primer cannot prime the sequencing of eight species, including S. iniae and S. orisratti, neither of which can be discriminated in the P3 region.

Figure 14. Alignment of the P9 region of the rnpB gene for a selection of the species that are identical in the P3 region. The sequencing primers A1166FS and A1171FS are indicated with arrows.

Pyrosequencing was performed on sequences of the species that could not be resolved using the P3 region and that belonged to the group of 30 type strains that had been previously sequenced. Results from reactions with and without the addition of SSB are shown in Table 6.

Table 6. Read lengths from the sequencing of the P9 region with and without the addition of SSB.

                     No SSB   SSB
S. canis             6        60
S. dysgal dysgal     60       60
S. equi zoo          54       54
S. infant coli       51       51
S. infant infant     39       48
S. pyogenes          54       51
S. salivarius        43       56
S. vestibularis      52       52
S. urinalis          43       54

Addition of 1.1 µg SSB was necessary to obtain a satisfactory read length for S. canis and seemed to improve the read length for other species. Further analysis was not performed due to time limitations.


4.4.5 PyroStrepID

The Pyrosequencing assay designed for species identification within the Streptococcus genus was named PyroStrepID. The target regions were the P3 and P9 loops in the rnpB gene. The PCR primers were A1118FP and A1121RPB, used in a 45-cycle PCR reaction with an annealing temperature of 58°C. The sequencing primer for the P3 region was A1123FS and for the P9 region A1171FS. The result quality increases when 1.1 µg SSB is added to the sequencing reactions, but SSB is not essential for adequate quality for most templates. The assay can fully discriminate 42 streptococci type strains. Three pairs of closely related species cannot be resolved from each other. PyroStrepID, therefore, has the potential of identifying clinical Streptococcus strains with high resolution.


5. DISCUSSION

The Pyrosequencing technology is an efficient method for bacterial identification. The assay developed for the Streptococcus genus could, in principle, discriminate all streptococcal species when the two regions, P3 and P9, of the rnpB gene were sequenced. The species that could not be separated have unclear species definitions. In many cases the P3 region gives sufficient discrimination. The sequence analysis of the P3 region resulted in correct sequences of the required lengths for all tested species and the analysis proved to be highly reproducible. The sequencing results from the P9 region looked promising but further experiments are needed.

One aim of this study was to develop strategies for finding variable regions with high discriminatory power from complex datasets. The strategy of choice depends on the amount of data. If only a few sequences are compared, visual inspection is easily done in a multiple alignment. For larger datasets the calculation of the S-W index is a practical method for estimating which gene, or region within a gene, has the largest information content. For evaluation of the discriminatory power of a region, the determination of pair-wise nucleotide differences is the superior quantification method. The developed program DQuPA provides the information needed to select species tags for Pyrosequencing assays.

The Pyrosequencing instrument and software have been developed with a focus on SNP analysis. In identification assays the primers have to target multiple sequences, for which the primer design software is not optimised. The desired concept is a batch analysis using all template sequences as input, generating a broad-range sequencing primer that is the same for all sequences. To improve the efficiency of microorganism identification, additional software features could be added to the instrument, e.g. a result evaluation tool. In this study results have been evaluated by comparing sequences in an alignment and using BLAST to search against a local database. This evaluation strategy is time-consuming and could take hours for a 96-well plate. An integrated search tool in the software that can perform simultaneous searches would be very valuable.

5.1 THE PYROSTREPID ASSAY

The primer A1123FS was found to give the best results of the three primers designed for sequencing of the P3 region. The sequencing reactions with this primer were more or less unaffected by SSB addition. This SSB independence is advantageous, as SSB is rather expensive and adds an extra step to the preparation procedure. The reactions are also not impaired by SSB, which is favourable because SSB might be added to the standard Pyrosequencing kit in the near future.

Primer A1171FS was used for sequencing the P9 region. The sequencing reaction for eight species cannot be primed with this primer because of primer-template mismatch. Two of these species cannot be resolved in the P3 region, but because these species are uncommon or not reported as human pathogens, this limitation was regarded as a minor problem. For most of the DNA sequences tested in the P9 region, good-quality results did not require SSB. Addition of SSB did, however, increase the read length for some species, e.g. S. canis. Using SSB in the assay is probably a good alternative, despite the extra costs and labour.


5.1.2 Errors in the sequence interpretation

Two kinds of sequence reading errors were detected: an incorrect number of nucleotides in a homopolymer stretch, and background peaks. The software had difficulties determining the right number of nucleotides in a homopolymer peak because the relationship between the number of bases incorporated and the light signal emitted is not always linear. The longer the homopolymer is, the larger the deviation from linearity. The rnpB gene of Streptococcus gordonii has a six-thymidine homopolymer in the P3 region. In the reproducibility test this homopolymer was interpreted as five thymidines in about half of the reactions. Errors in shorter homopolymers are a common source of error later in the sequence.

Background peaks usually arise from two sequencing reaction problems: out-of-phase sequencing and template looping. In template looping the 3’-end of the template makes a loop and primes a sequencing reaction on itself. Out-of-phase sequencing occurs because of misincorporation of nucleotides in some of the templates. Both these problems increase as the sequencing reaction continues and contribute to the limitation in read length. As only minor peaks were detected in reactions where the primer had been excluded, template looping cannot be a significant problem in the PyroStrepID assay. To rule out this problem completely, sequence templates from all species should be sequenced without primer. The background peaks that were observed are probably primarily a consequence of out-of-phase sequencing, since the errors mainly occurred late in the sequencing.

For the vast majority of the sequencing reactions with primer A1123FS the sequence of the P3 region was correctly interpreted. Out of a total of 261 sequence analyses, 88% (230 analyses) resulted in read lengths of more than 30 nucleotides. If the homopolymer error in S. gordonii is ignored (this error cannot result in false identification), the number increases to 93%. Thirty bases are enough to cover the P3 region and the P9 region, respectively. Only 1.5% of the reactions resulted in errors in the first 20 bases. Twenty bases provide enough information in the assay except for the discrimination of S. salivarius and S. vestibularis.

5.2 RESULT EVALUATION

The Pyrosequencing software interprets the pyrogram to generate a sequence. It also grades the sequence in quality windows. The first window contains the passed nucleotides. If a problem is encountered, the following nucleotides are graded “check”. Where the pyrogram gets very difficult to interpret, the grading is “failed”. In this study the quality grading has been ignored because there was a poor correlation between grading and read length. Check and even failed nucleotides were in many cases correctly interpreted. The only reactions that were disregarded in the results were the ones with initial single peak heights of less than 4 RLU. Background peaks can be up to 2 RLU and therefore signals of less than 4 RLU were considered to be too weak to give reliable results. Arbitrarily low peak heights could have been due to unsuccessful amplification or mistakes in the sample preparation.

The accuracy of the sequencing reaction in the study has been presented in two ways. One is the read length, which is a good measure of the accuracy of the technique. Another is the number of errors in the first X nucleotides. Sequencing 30 nucleotides covers the variable P3 region for all streptococcal species; the number of errors in the first 30 nucleotides therefore shows how accurately the P3 region was sequenced.

When the obtained sequence originates from a known strain with predetermined sequence data, which has been the case when developing the assay, the fastest way to evaluate the result is to compare the sequence obtained with the predetermined sequence in an alignment. This strategy can also be used for unknown samples when many of the obtained sequences are identical, i.e. many samples are from the same species. Unknown samples are otherwise most efficiently associated with their corresponding species by matching the sequences obtained with reference sequences in a database using a search tool, e.g. BLAST.

5.2.2 Pyrosequencing adjusted search tool

BLAST is not an optimal search tool for matching sequences obtained from a Pyrosequencing reaction. Specific characteristics of the obtained sequence would require an adjusted search tool. The first bases are more likely to be correct than bases later in the sequence, and a homopolymer stretch of the sequence is more likely to contain an error. The bases in the query sequence could be linearly graded, making the first base the most significant in the search. A homopolymer insertion or deletion could give a lower penalty than other mismatches. Additional features could be included, such as using the second best match to calculate how probable it is that the best scoring match is correct. If the second best match has a much lower score than the best, it is very probable that the best match corresponds to the actual species in the sample.
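One way such an adjusted scoring could look is sketched below in Python. This is purely illustrative and was not implemented in the study: the run-based comparison, the weights, the penalty values, the function names and the toy database are all assumptions. Sequences are compared run by run (a run being a homopolymer stretch of one or more identical bases), weights fall linearly along the query, and a difference in run length alone is penalised less than a different base.

```python
from itertools import groupby


def runs(seq):
    """Run-length encode a sequence, e.g. 'AATG' -> [('A', 2), ('T', 1), ('G', 1)]."""
    return [(base, len(list(group))) for base, group in groupby(seq)]


def weighted_score(query, reference, homopolymer_penalty=0.25, mismatch_penalty=1.0):
    """Score a query against a reference, run by run, with decreasing weights."""
    q_runs, r_runs = runs(query), runs(reference)
    n = len(q_runs)
    score = 0.0
    for i, ((q_base, q_len), (r_base, r_len)) in enumerate(zip(q_runs, r_runs)):
        weight = (n - i) / n  # first run weighs 1, later runs linearly less
        if q_base != r_base:
            score -= weight * mismatch_penalty       # different base
        elif q_len != r_len:
            score -= weight * homopolymer_penalty    # homopolymer length error
        else:
            score += weight                          # identical run
    return score


def best_match(query, database):
    """Return the best-scoring name and its margin over the second-best score.

    The database must contain at least two reference sequences.
    """
    ranked = sorted(((weighted_score(query, ref), name)
                     for name, ref in database.items()), reverse=True)
    (best, name), (second, _) = ranked[0], ranked[1]
    return name, best - second


# Hypothetical references; the query has a six-base homopolymer read as five.
toy_database = {"spA": "ACGTTTTTTGA", "spB": "ACCTTGGA"}
name, margin = best_match("ACGTTTTTGA", toy_database)
```

In the toy case the query still matches the reference with the longer homopolymer, and the margin over the second-best score gives a simple indication of how clear the identification is. A real tool would need an alignment step, since a substitution that splits or merges runs shifts the run-by-run comparison.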

5.2.1 Databases

A large database with reference sequences to match the obtained sequences against is a crucial ingredient in an assay to identify microorganisms. The database should ultimately contain all sequence variants of possible species. This is not easily obtained, but as an identification assay is being used its users can continuously upgrade it. When a strain has no 100% match against the database sequences, this strain can be identified by other methods, e.g. full gene sequencing or phenotypic tests, and afterwards be included in the database.

5.3 RNPB VERSUS 16S

The information content in the rnpB gene is larger than that in 16S, as shown by a higher average Shannon-Wiener index, which suggests that rnpB is better suited for identification of streptococci. rnpB consists of highly variable regions as well as regions that are conserved within the Streptococcus genus. On the other hand, 16S also has highly variable regions, e.g. the V1 region, which are as discriminatory as the P3 and P9 regions in the rnpB gene. 16S also contains regions that are very well conserved, even between bacterial genera. These conserved regions enable the use of the same assay for a large variety of bacterial species. However, primers complementary to regions conserved between bacterial genera can be a disadvantage in a single-genus identification assay. In the PyroStrepID assay, DNA from species belonging to other genera will most likely not be amplified, and contamination from, for example, water and reagents will therefore be less of a problem than in a 16S-based assay. The major advantage of the 16S gene as a target for bacterial identification is the extensive sequence data available. For streptococci, rnpB sequence data of guaranteed high quality are available (Tapp et al., 2003), although the number of strains covered is lower than for 16S. Furthermore, rnpB is a single-copy gene, while streptococci carry around five copies of the 16S gene in the genome. The copies of 16S may not be identical (Nubel et al., 1996), which could lead to data that are hard to interpret.
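As an illustration of the comparison above, the average Shannon-Wiener index of an alignment can be computed column by column. The sketch below is a minimal Python version and not the calculation used in the study: the function names are hypothetical, logarithms are taken to base 2, and a gap is counted as an ordinary symbol, all of which are assumptions.

```python
import math
from collections import Counter
from typing import Sequence


def shannon_index(column: Sequence[str]) -> float:
    """Shannon-Wiener index (in bits) of the symbols in one alignment column."""
    counts = Counter(column)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mean_shannon_index(aligned: Sequence[str]) -> float:
    """Average the per-column index over an alignment of equal-length sequences."""
    values = [shannon_index(column) for column in zip(*aligned)]
    return sum(values) / len(values)


# Hypothetical toy alignment: the first column is conserved, the second varies.
print(mean_shannon_index(["AA", "AC"]))  # 0.5
```

A fully conserved column gives 0, and a column in which two symbols are equally common gives 1 bit, so a higher average indicates more variable positions in the aligned gene.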

5.4 DISCRIMINATORY POWER

The value of biological information is dependent on its evolutionary history. The most accurate way to investigate the information content would be to place each position from an alignment in a phylogenetic tree made from the sequences involved. The informative value of the position can then be evaluated on the basis of how the sequences have evolved (Thollesson, 2004). This is not possible with this amount of data and would result in information that is too complex. Calculating the pair-wise differences between sequences is a way of evaluating discriminatory power that suits the Pyrosequencing technology, since the technology has no problem discriminating a sequence that differs from all other sequences at one or more positions. Pair-wise difference calculations can be visualised in a difference matrix. For large datasets the difference matrix contains too much information to appraise the discriminatory power of the region. Therefore the program developed for pair-wise difference quantification reports additional information, i.e. the average number of differences and the sequence pairs with no difference or only one difference. The implemented algorithm for calculating the pair-wise differences is rather slow. One reason is that it was implemented in Matlab, which is not a particularly fast programming language. Translating the program into C# made it faster, but implementing it from the ground up in C# or another fast programming language would probably make it even faster. The pair-wise calculation algorithm itself can probably also be made more efficient.
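The pair-wise difference quantification described above can be sketched in a few lines of Python. This is not the program developed in the study. The function names are hypothetical, and counting a gap against a base as one difference is an assumption. The example sequences are short excerpts (alignment columns 21-40) copied from the rnpB alignment in the appendix.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def pairwise_differences(aligned: Dict[str, str]) -> Dict[Pair, int]:
    """Count differing columns for every pair of equal-length aligned sequences."""
    return {
        (a, b): sum(x != y for x, y in zip(aligned[a], aligned[b]))
        for a, b in combinations(aligned, 2)
    }


def summarise(diffs: Dict[Pair, int]) -> Tuple[float, List[Pair], List[Pair]]:
    """Return the average number of differences and the pairs with 0 or 1 difference."""
    average = sum(diffs.values()) / len(diffs)
    identical = [pair for pair, d in diffs.items() if d == 0]
    single = [pair for pair, d in diffs.items() if d == 1]
    return average, identical, single


region = {
    "S. agalactiae":   "CGTAGTAT------TG-ATA",
    "S. anginosus":    "CGTGAAGA-----GTCGTCT",
    "S. constellatus": "CGTGAAGA-----GTCGTCT",
    "S. mitis":        "CGTGAGGAG----AATTTCT",
    "S. pneumoniae":   "CGTGAGGAG----AATTGCT",
}
average, identical, single = summarise(pairwise_differences(region))
print(identical)  # [('S. anginosus', 'S. constellatus')]
print(single)     # [('S. mitis', 'S. pneumoniae')]
```

The two lists correspond to the additional information mentioned in the text: pairs that cannot be separated in the chosen stretch and pairs separated by a single position only. The number of comparisons grows with the square of the number of sequences, which is one reason such a calculation becomes slow for large datasets.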

5.5 IS A SHORT REGION OF A GENE ENOUGH FOR SPECIES IDENTIFICATION?

Using the PyroStrepId assay would of course involve sequencing the rnpB gene of clinical strains. The discriminatory power of the P3 and P9 regions has, however, been evaluated for type strains. In an ideal situation all strains within a species would have the same sequence as the type strain. In reality, regions with high variability between species will also vary between strains within the same species. An optimal species tag should have low intra-species variability and high inter-species variability. The species concept for prokaryotes is not easily defined, since a species does not represent a natural entity. A species is defined as the taxonomic rank below the genus rank in the hierarchical system (Stackebrandt & Goebel, 1994 and references therein). The 16S gene has been used to define bacterial species. Two strains with an average 16S sequence similarity of less than 97% are considered to be separate species. At a higher similarity the 16S gene cannot be used to define the species, and the strains have to be compared in a DNA-DNA hybridisation assay. If the strains hybridise more than 70% they are considered to belong to a single species. In the PyroStrepId assay about 40 variable base positions in the rnpB gene are sequenced. This relatively small amount of data is not sufficient for phylogenetic analysis, and problems can therefore arise when strains have substitutions compared to the reference strains in the database. One way to increase the accuracy of identification would be to include a large number of reference strains in the database, which would increase the probability of finding an identical sequence. For strains with no identical sequence, more data would have to be collected, i.e. by full gene sequencing, to be able to perform phylogenetic analysis. Once identified with certainty, the strain could be included in the database.
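The two-step species criterion described above (16S similarity first, DNA-DNA hybridisation second) can be written as a small decision rule. The Python sketch below is only illustrative: the function name is hypothetical, and treating a hybridisation of 70% or less as "separate species" is an assumption, since the text only states the outcome above 70%.

```python
from typing import Optional


def same_species(similarity_16s: float,
                 hybridisation: Optional[float] = None) -> Optional[bool]:
    """Apply the 97% 16S similarity and 70% DNA-DNA hybridisation thresholds.

    Both values are percentages. Returns None when 16S alone cannot decide
    and no hybridisation result is available.
    """
    if similarity_16s < 97.0:
        return False            # separate species on 16S similarity alone
    if hybridisation is None:
        return None             # 16S cannot decide; hybridisation is needed
    return hybridisation > 70.0  # assumed: 70% or less means separate species


print(same_species(95.0))        # False
print(same_species(98.5))        # None
print(same_species(98.5, 82.0))  # True
```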
In reality, a strain with only one or a few substitutions compared to a reference strain would probably be considered to belong to the same species as the reference strain, even if this is not proven statistically. Another problem is the reliability of the species classification of the database strains. Strains not included in the database, which had previously been identified by other methods, have been sequenced in the P3 region and compared to the database. In specific cases the sequence comparison to the database gave a different species than the original classification. It could be that the classification of either the sequenced strain or the strain in the database is incorrect. It could also be that the information from the P3 region is not sufficient because the intra-species variation is too large. The magnitude of this problem decreases as more sequence data are collected, i.e. by sequencing a second region, but it cannot be made negligible until a large number of well-defined clinical strains have been tested.

6. CONCLUSIONS

The Pyrosequencing technology has the great advantage that it gives reliable results and can rapidly sequence many templates at a time. A limitation is the short read length. The Pyrosequencing identification assay developed was shown to be accurate in sequencing the P3 as well as the P9 region in the rnpB gene of Streptococcus species. The question that arises is whether the relatively few bases are sufficiently discriminatory to separate all species. The program DQuPA developed in the study delivers valuable information for selecting a sequence stretch with high discriminatory power to analyse in the Pyrosequencing assay. To be able to use the Pyrosequencing instrument for microbial identification in the routine lab, software for result evaluation would be desirable, as well as a large database with microbial species and strains. For primer design, a broad-range primer design program would simplify design towards new genera and new target genes.

7. FUTURE PROSPECTS

Further sequence analyses need to be performed. Pyrosequencing analysis of all type strains in both the P3 and P9 region is necessary for full evaluation of the assay. Evaluating the clinical relevance of the assay requires sequencing of a large number of clinical isolates. These isolates have to be identified by another method for comparison. The sequence results will reveal how large the intra-species variation of the two regions is. Other research groups involved in streptococci identification could run the assay for further evaluation and for building up the database.

An ideal instrument in a routine bacterial identification lab would be one single instrument where the sample is inserted and the final result is presented. Such an instrument would have to include DNA preparation, PCR amplification and sequencing. All sample preparation steps would have to be automated, as well as the result evaluation, and the output would be the determined species with a probability grade.

Real-time PCR is used to detect and quantify the amplification during PCR. It can be used as a species identification method and for quantification of the original amount of DNA. In many bacterial genera one or a few of the species are much more common than the others in clinical samples. S. pneumoniae is the species found in a majority of clinical streptococci samples. Real-time PCR could be carried out as a first step to find out if the sample contains the common species or not. If not, a following sequencing reaction could be performed. Such a combination would in many cases be cheaper and faster.

8. ACKNOWLEDGEMENTS

First and most deeply I would like to thank my supervisor Margareta Krabbe for her support, encouragement and our rewarding discussions. I would also like to thank my scientific reviewer Björn Herrmann for supporting me through the project, as well as supplying me with sequence data and DNA samples, and Erik Bongcam for bioinformatic supervision. During the software development I received invaluable help from Fredrik Björnlund, thank you! Mikael Lundgren has also been helpful with the software development, but I would especially like to thank him for the personal encouragement and supervision he has given me throughout this period. Finally I would like to thank all people in Björn Ingemarsson’s group, Biotage AB, Biosystem division, for answering my questions and contributing with a friendly and encouraging environment during hard times.

9. REFERENCES

Baele, M., Baele, P., Vaneechoutte, M., Storms, V., Butaye, P., Devriese, L. A., Verschraegen, G., Gillis, M. & Haesebrouck, F. (2000). Application of tRNA intergenic spacer PCR for identification of Enterococcus species. J Clin Microbiol 38, 4201-4207.
Bentley, R. W., Leigh, J. A. & Collins, M. D. (1991). Intrageneric structure of Streptococcus based on comparative analysis of small-subunit rRNA sequences. Int J Syst Bacteriol 41, 487-494.
Brown, J. W. (1999). The Ribonuclease P Database. Nucleic Acids Res 27, 314.
Brown, J. W. & Pace, N. R. (1991). Structure and evolution of ribonuclease P RNA. Biochimie 73, 689-697.
De Gheldre, Y., Vandamme, P., Goossens, H. & Struelens, M. J. (1999). Identification of clinically relevant viridans streptococci by analysis of transfer DNA intergenic spacer length polymorphism. Int J Syst Bacteriol 49 Pt 4, 1591-1598.
Facklam, R. (2002). What happened to the streptococci: overview of taxonomic and nomenclature changes. Clin Microbiol Rev 15, 613-630.
Garnier, F., Gerbaud, G., Courvalin, P. & Galimand, M. (1997). Identification of clinically relevant viridans group streptococci to the species level by PCR. J Clin Microbiol 35, 2337-2341.
Haas, E. S. & Brown, J. W. (1998). Evolutionary variation in bacterial RNase P RNAs. Nucleic Acids Res 26, 4093-4099.
Herrmann, B., Pettersson, B., Everett, K. D., Mikkelsen, N. E. & Kirsebom, L. A. (2000). Characterization of the rnpB gene and RNase P RNA in the order Chlamydiales. Int J Syst Evol Microbiol 50 Pt 1, 149-158.
Jonasson, J., Olofsson, M. & Monstein, H. J. (2002). Classification, identification and subtyping of bacteria based on pyrosequencing and signature matching of 16S rDNA fragments. APMIS 110, 263-272.
Kawamura, Y., Hou, X. G., Sultana, F., Miura, H. & Ezaki, T. (1995). Determination of 16S rRNA sequences of Streptococcus mitis and Streptococcus gordonii and phylogenetic relationships among members of the genus Streptococcus. Int J Syst Bacteriol 45, 406-408.
Kikuchi, K., Enari, T., Totsuka, K. & Shimizu, K. (1995). Comparison of phenotypic characteristics, DNA-DNA hybridization results, and results with a commercial rapid biochemical and enzymatic reaction system for identification of viridans group streptococci. J Clin Microbiol 33, 1215-1222.
Kirsebom, L. A. (1995). RNase P--a 'Scarlet Pimpernel'. Mol Microbiol 17, 411-420.
Ludwig, W., Weizenegger, M., Kilpper-Bälz, R. & Schleifer, K. H. (1998). Phylogenetic relationships of anaerobic streptococci. Int J Syst Bacteriol 38, 15-18.
Maiwald, M. (2004). Broad-range PCR for detection and identification of bacteria. Molecular Microbiology: Diagnostic Principles and Practice (ed. David H. Persing et al) ASM Press, Washington, D.C. USA. 379-390.
Mollet, C., Drancourt, M. & Raoult, D. (1997). rpoB sequence analysis as a novel basis for bacterial identification. Mol Microbiol 26, 1005-1011.
Murray, P. R., Rosenthal, K. S., Kobayashi, G. S. & Pfaller, M. A. (1998). Streptococcus. Medical Microbiology, Third edition. Mosby, St. Louis, Missouri, USA, 189-205.
Needleman, S. B. & Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48, 443-453.
Nubel, U., Engelen, B., Felske, A., Snaidr, J., Wieshuber, A., Amann, R. I., Ludwig, W. & Backhaus, H. (1996). Sequence heterogeneities of genes encoding 16S rRNAs in Paenibacillus polymyxa detected by temperature gradient gel electrophoresis. J Bacteriol 178, 5636-5643.
Poyart, C., Quesne, G. & Trieu-Cuot, P. (2002). Taxonomic dissection of the Streptococcus bovis group by analysis of manganese-dependent superoxide dismutase gene (sodA) sequences: reclassification of 'Streptococcus infantarius subsp. coli' as Streptococcus lutetiensis sp. nov. and of Streptococcus bovis biotype 11.2 as Streptococcus pasteurianus sp. nov. Int J Syst Evol Microbiol 52, 1247-1255.
Poyart, C., Quesne, G., Coulon, S., Berche, P. & Trieu-Cuot, P. (1998). Identification of streptococci to species level by sequencing the gene encoding the manganese-dependent superoxide dismutase. J Clin Microbiol 36, 41-47.
Ronaghi, M. (2001). Pyrosequencing sheds light on DNA sequencing. Genome Res 11, 3-11.
Sanger, F., Nicklen, S. & Coulson, A. R. (1977). DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci U S A 74, 5463-5467.
Schlegel, L., Grimont, F., Grimont, P. A. & Bouvet, A. (2003). Identification of major Streptococcal species by rrn-amplified ribosomal DNA restriction analysis. J Clin Microbiol 41, 657-666.
Shannon, C. E. & Weaver, W. (1949). The mathematical theory of communication. Urbana, IL: University of Illinois Press.
Stackebrandt, E., Kramer, I., Swiderski, J. & Hippe, H. (1999). Phylogenetic basis for a taxonomic dissection of the genus Clostridium. FEMS Immunol Med Microbiol 24, 253-258.
Tapp, J., Thollesson, M. & Herrmann, B. (2003). Phylogenetic relationships and genotyping of the genus Streptococcus by sequence determination of the RNase P RNA gene, rnpB. Int J Syst Evol Microbiol 53, 1861-1871.
Teng, L. J., Hsueh, P. R., Tsai, J. C., Chen, P. W., Hsu, J. C., Lai, H. C., Lee, C. N. & Ho, S. W. (2002). groESL sequence determination, phylogenetic analysis, and species differentiation for viridans group streptococci. J Clin Microbiol 40, 3172-3178.
Thollesson, M. (2004). Department of Molecular Evolution, EBC, Uppsala University, Personal communication.
Thompson, J. D., Higgins, D. G. & Gibson, T. J. (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22, 4673-4680.
Whiley, R. A., Hall, L. M. C., Hardie, J. M. & Beighton, D. (1997). Genotypic and phenotypic diversity within Streptococcus anginosus. Int J Syst Bacteriol 47, 645-650.
Woese, C. R., Fox, G. E., Zablen, L., Uchida, T., Bonen, L., Pechman, K., Lewis, B. J. & Stahl, D. (1975). Conservation of primary structure in 16S ribosomal RNA. Nature 254, 83-86.

Appendix I. The streptococci strains used in the study.

23 type strains of human pathogens
S. agalactiae CCUG 4208
S. anginosus CCUG 27298
S. bovis CCUG 17828
S. constellatus CCUG 24889
S. cristatus CCUG 33481
S. dysgal equisimilis CCUG 36637
S. gordonii CCUG 33482
S. infant coli CCUG 43822
S. infant infant CCUG 43820
S. infantis CCUG 39817
S. intermedius CCUG 32759
S. mitis CCUG 31611
S. mutans CCUG 6519
S. oralis CCUG 13229
S. parasanguinis CCUG 30417
S. peroris CCUG 39814
S. pneumoniae CCUG 28588
S. pyogenes CCUG 4207
S. salivarius CCUG 11878
S. sanguinis CCUG 17826
S. sobrinus CCUG 25735
S. urinalis CCUG 41590
S. vestibularis CCUG 24893

22 well defined clinical strains
S. anginosus CCUG 35271
S. bovis CCUG 38214
S. canis CCUG 27492
S. constellatus CCUG 4215
S. cristatus CCUG 43159
S. equi equi CCUG 27367
S. equi zooepid CCUG 43890
S. gallolyticus CCUG 44889
S. gordonii CCUG 39576
S. infantarius coli CCUG 37794
S. infantarius infant CCUG 44960
S. intermedius CCUG 28203
S. mitis CCUG 21026
S. mutans CCUG 39070
S. oralis CCUG 27681
S. peroris CCUG 39815
S. pneumoniae CCUG 28588
S. pyogenes CCUG 38726
S. sanguinis CCUG 35770
S. salivarius CCUG 41462
S. thermophilus CCUG 30577
S. vestibularis CCUG 24683

13 clinical strains with unclear identification
S. bovis 1.15
S. mitis/oralis α-haemolytic 2.41
Pneumococcus 3.56
α-haemolytic 8.13
S. sanguinis
S. oralis
S. sobrinus
S. salivarius
S. mitis
S. vestibularis
S. parasanguinis
S. intermedius
S. constellatus

7 additional type strains
S. canis CCUG 27661
S. cricetus CCUG 27300
S. equi zooepid CCUG 23256
S. gallolyticus CCUG 35224
S. rattus CCUG 27642
S. suis CCUG 7984
S. thermophilus CCUG 21957

10 20 30 40 50 60 70 80 90 100 110 120 130 140. . . . | . . . . | . . . . | . . . . | . . . . | . . . . | . . . . | . . . . | . . . . | . . . . | . . . . | . . . . | . . . . | . . . . | . . . . | . . . . | . . . . | . . . . | . . . . | . . . . | . . . . | . . . . | . . . . | . . . . | . . . . | . . . . | . . . . | . . . . |

S. agalactiae G T G C A A T T T T T G G A T A A T C G C G T A G T A T - - - - - - T G - A T A T A C T A T G A G G A A A G T C C A T G C T A G C A C T G G C T G T G A T G C C A G T A G T G T T T G T G C T A G G C G A A A A A A T A A G C C T A G G G A G A T A G C T A G C T A T C T T A C G G C AS. anginosus G T G C A A T T T T T G G A T A A T C G C G T G A A G A - - - - - G T C G T C T T T T C A T G A G G A A A G T C C A T G C T A G C A C G G G C T G T G A T G C C C G T A G T G T T T G T G C T A G G T G A A A C A A T A A G C C T A G G G A C G A G A A A T - - - - C G T T A C G G C GS. bovis G T G C A A T T T T T G G A T A A T C G C G T G G T A C - - - - - C T T A C - G T A C C A T G A G G A A A G T C C A T G C T A G C A C T G G C T G T G A T G C C G G T A G T G T T T G T G C T A G G T G A A T T A A T A A G C C T A G G G A C A T C T T T T T - G A T G T T A C G G C GS. constellatus G T G C A A T T T T T G G A T A A T C G C G T G A A G A - - - - - G T C G T C T T T T C A T G A G G A A A G T C C A T G C T A G C A C G G G C T G T G A T G C C C G T A G T G T T T G T G C T A G G T G A A A C A A T A A G C C T A G G G A C G A G A A A T - - - - C G T T A C G G C GS. cristatus G T G C A A T T T T T G G A T A A T C G C G T G A A G A T - - - - T T G A T A T T T T C A T G A G G A A A G T C C A T G C T A G C A C C G G C T G T G A T G C T G G T A G T G T T T G T G C T A G G C G A A A C A A T A A G C C T A G G G A C G G A G C A A T - C - C G T T A C G G C GS. dysgal equis G T G C A A T T T T T G G A T A A T C G C G T A G T A T - - - - - - - T T C A A T A C T A T G A G G A A A G T C C A T G C T A G C A C T G G C T G T G A T G C C A G T A G T G A T T G T G C T A G G C G A A C A A A T A A G C C T A G G G A T G T G C T T A - - T A C A T T A C G G C GS. 
gordonii G T G C A A T T T T T G G A T A A T C G C A T G A A A A - - - - - G T T A T - T T T T T A T G A G G A A A G T C C A T G C T A G C A C T G G C T G T G A T G C C G G T A G T G T T T G T G C T A G G C G A A A C A A T A A G C C T A G G G A C G G A T T A T T - C - C G T T A C G G C GS. infant coli G T G C A A T T T T T G G A T A A T C G C G T G G T A T - - - - - T C T A - - A T A C C A T G A G G A A A G T C C A T G C T A G C A C T G G C T G T G A T G C C G G T A G T G T T T G T G C T A G G T G A A T T A A T A A G C C T A G G G A C A T C T T T T - - G A T G T T A C G G C GS. infant infan G T G C A A T T T T T G G A T A A T C G C G T G G T A T - - - - - T C T A - - A T A C C A T G A G G A A A G T C C A T G C T A G C A C T G G C T G T G A T G C C G G T A G T G T T T G T G C T A G G T G A A T T A A T A A G C C T A G G G A C A T C T T G T - - G A T G T T A C G G C GS. infantis G T G C A A T T T T T G G A T A A T C G C G T G G A G A G - - - - T T T A T C T T T T C A T G A G G A A A G T C C A T G C T A G C A C A G G C T G T G A T G C C T G T A G T G T T T G T G C T A G G C G A A A C C A T A A G C C T A G G G A C G A G A A A T - - - - C G T T A C G G C GS. intermedius G T G C A A T T T T T G G A T A A T C G C G T G A G A A T - - - - A T T T T A T T T T C A T G A G G A A A G T C C A T G C T A G C A C G G G C T G T G A T G C C C G T A G T G T T T G T G C T A G G T G A A A C A A T A A G C C T A G G G A C G A G A A A T - - - - C G T T A C G G C GS. mitis G T G C A A T T T T T G G A T A A T C G C G T G A G G A G - - - - A A T T T C T T T T C A T G A G G A A A G T C C A T G C T A G C A C A G G C T G T G A T G C C T G T A G T G T T T G T G C T A G G C G A A A C C A T A A G C C T A G G G A C G A G A G A T - - - - C G T T A C G G C AS. 
mutans G T G C A A T T T T T G G A T A A T C G C G T G G T A A A T A T T G C A A T T T T A T C A T G A G G A A A G T C C A T G C T A G C A C T G G C T G T G A T G C C A G T A G T G T T T G T G C T A G A C A A A A A A A T A A G T C T A G G G A T G T G C T T T - G C G C A T T A C G G C GS. oralis G T G C A A T T T T T G G A T A A T C G C G T G A A G A G - - - - G A T C T C T T T T C A T G A G G A A A G T C C A T G C T A G C A C A G G C T G T G A T G C C T G T A G T G T T T G T G C T A G G C G A A T C C A T A A G C C T A G G G A C G A G A A A T - - - - C G T T A C G G C AS. parasanguini G T G C A A T T T T T G G A T A A T C G C G T G A G A G - - - - - A T T C T T C C C T C A T G A G G A A A G T C C A T G C T A G C A C A G G C T G T G A T G C C T G T A G T G T T T G T G C T A G G T G A A T T C A T A A G C C T A G G G A C T T G A T G T T - C A A G T T A C G G C GS. peroris G T G C A A T T T T T G G A T A A T C G C G T G G A G G G - - - - T T T A T C T T T T C A T G A G G A A A G T C C A T G C T A G C A C A G G C T G T G A T G C C T G T A G T G T T T G T G C T A G G C G A A A C C A T A A G C C T A G G G A C G A G A A A T - - - - C G T T A C G G C GS. pneumoniae G T G C A A T T T T T G G A T A A T C G C G T G A G G A G - - - - A A T T G C T T C T C A T G A G G A A A G T C C A T G C T A G C A C A G G C T G T G A T G C C T G T A G T G T T T G T G C T A G G C G A A A C C A T A A G C C T A G G G A C G A G A A A T - - - - C G T T A C G G C AS. pyogenes G T G C A A T T T T T G G A T A A T C G C G T A G T A T - - - - - - - T T T A A T A C T A T G A G G A A A G T C C A T G C T A G C A C T G G C T G T G A T G C C A G T A G T G A T T G T G C T A G G C G A A C A C A T A A G C C T A G G G A T G T G C A T A - - C A C A T T A C G G C GS. 
salivarius G T G C A A T T T T T G G A T A A T C G C A T G G T T G - - - - - - C T A G T C T T C C A T G A G G A A A G T C C A T G C T A G C A C T G G C T G T G A T G C C A G T A G T G T T T G T G C T A G G T G A A T T A A T A A G C C T A G G G A C T T G A T T T T - C A A G T T A C G G C GS. sanguinis G T G C A A T T T T T G G A T A A T C G C G T G G A A A T - - - - T T T A G A T T T T C A T G A G G A A A G T C C A T G C T A G C A C T G G C T G T G A T G C C G G T A G T G T T T G T G C T A G G C G A A G A A A T A A G C C T A G G G A C G G A T T T T T - C - C G T T A C G G C GS. sobrinus G T G C A A T T T T T G G A T A A T C G C A T G T C G C T T - - - T C T T - G G C G A C A T G A G G A A A G T C C A T G C T A G C A C T G G C T G T G A T G C C A G T A G T G T T T G T G C T A G G T G A A A A A A T A A G C C T A G G G A T G T C T A T T T - G A C A T T A C G G C GS. urinalis G T G C A A T T T T T G G A T A A T C G C G T A G T A T - - - - - - - T C A A A T A C T A T G A G G A A A G T C C A T G C T A G C A C T G G C T G T G A T G C C G G T A G T G T T T G T G C T A G G C G A A C A A A T A A G C C T A G G G A C A T A T A A A T A T A T G T T A C G G C GS. vestibularis G T G C A A T T T T T G G A T A A T C G C A T G G T T G - - - - - - C T A G T C T T C C A T G A G G A A A G T C C A T G C T A G C A C T G G C T G T G A T G C C A G T A G T G T T T G T G C T A G G T G A A T C A A T A A G C C T A G G G A C T T G A T T T T - C A A G T T A C G G C G

150 160 170 180 190 200 210 220 230 240 250 260 270 280. . . . | . . . . | . . . . | . . . . | . . . . | . . . . | . . . . | . . . . | . . . . | . . . . | . . . . | . . . . | . . . . | . . . . | . . . . | . . . . | . . . . | . . . . | . . . . | . . . . | . . . . | . . . . | . . . . | . . . . | . . . . | . . . . | . . . . | . . . . |

S. agalactiae A G C A A A A G G G C T A A G T C T T T G G A T A T G C C T G A A T A G C T T T G A A A - G T G C C A C A G T G A C G T A G T T T C T A G G G A A A T C T A G A A A G T G G A A C G C G G T A A A C C C C T C A A G C T A G C A A C C C A A A C T T T G G T A G G G G C A T G G G A T AS. anginosus G A C G A A A C A G C T A C G T C T C T G G A T A T G C T G G A A T A G T C C T G A A A - G T G C C A C A G T G A C G T A G T T T T T G T G G A A A C A C A A G A A G T G G A A C G C G G T A A A C C C C T C A A G C T A G C A A C C C A A A C T T T G G T C G G G G C A C G G A A T GS. bovis A G T G A A A A A G C T A A G T C T T T T G A T A T G C T T G A A T A G C T C T G A A A A G T G C C A C A G T G A C G T A G T T C T T G G G A A A A C C C G A G A A G T G G A A C G C G G T A A A C C C C T C A A G C T A G C A A C C C A A A C T T T G G T A G G G G C A C G G G A T AS. constellatus G A C G A A A C A G C T A C G T C T C T G G A T A T G C T G G A A T A G T C C T G A A A - G T G C C A C A G T G A C G T A G T T T T T G C G G A A A C G C A A A A A G T G G A A C G C G G T A A A C C C C T C A A G C T A G C A A C C C A A A C T T T G G T C G G G G C A C G G A A T GS. cristatus A C T G A A A A G G C T A A G T C T T T A G A T A G G T C T G A A T A G G T C T G A A A - G T G C C A C A G T G A C G T A G T T C T T C T G G A A A C A G A A G A A G T G G A A C G C G G T A A A C C C C T C A A G C T A G C A A C C C A A A C T T T G G T C G G G G C A C G G A A T GS. dysgal equis A A G A A A A T G G C T A A G T C C T T G G A T A T G C C A A A G T A C T T C T G A A A - G T G C C A C A G T G A C G A A G T T T T T A T G G A A A C G T A A A A A A T G G A A C G C G G T A A A C C C C T C A A G C T A G C A A C C C A A A C T T T G G T A G G G G C A T G G G A T AS. 
gordonii G A T G A A A C A G C T A A G T C T C T T G A T A T G C T G G A G T A G G C C T G A A A - G T G C C A C A G T G A C G T A G T T T T T G T G G A A A C A C A A A A A G T G G A A C G C G G T A A A C C C C T C A A G C T A G C A A C C C A A A C T T T G G T C G G G G C A T G G A A T G
S. infant coli A G T G A A A T G G C T A A G T C T T T - G A T A T G C C T A A A T A G C T C T G A A A A G T G C C A C A G T G A C G T A G T T C T T G A G G A A A C T C G G G A A G T G G A A C G C G G T A A A C C C C T C A A G C T A G C A A C C C A A A C T T T G G T A G G G G C A C G G G A T A
S. infant infan A G T G A A A A G G C T A A G T C T T T - G A T A T G C C T A A A T A G C T C T G A A A A G T G C C A C A G T G A C G T A G T T C T T G A G G A A A C T C G A G A A G T G G A A C G C G G T A A A C C C C T C A A G C T A G C A A C C C A A A C T T T G G T A G G G G C A C G G G A T A
S. infantis A G T G A A T A G A C T A A G T T T T C G G A T A T G T T T T A G T A G C T C T G A A A - G T G C C A C A G T G A C G A A G T C T C T C T G G A A A C A G A G A G A G T G G A A C G C G G T A A A C C C C T C A A G C T A G C A A C C C A A A T T T T G G T C G G G G C A T G G A G T G
S. intermedius G A C G A A A C A G C T A C G T C T C T G G A T A T G C T G G A A T A G T C C T G A A A - G T G C C A C A G T G A C G T A G T T T T T G C A G A A A C G C A A A A A G T G G A A C G C G G T A A A C C C C T C A A G C T A G C A A C C C A A A C T T T G G T C G G G G C A C G G A A T G
S. mitis G T T G A A A T G G C T A A G T C T T C G G A T A G G T C A G A G T A G G C T T G A A A - G T G C C A C A G T G A C G G A G T C T T T C T G G A A A C G G A G A G A G T G G A A C G C G G T A A A C C C C T C A A G C T A G C A A C C C A A A T T T C G G T C G G G G C A T G G A G T G
S. mutans G A T A A A A T G G C T A A G T C T T T G - A T A G G C C G G A G T A A T T C T G A A A - G T G C C A C A G T G A C G T A G C T T T T A T G G A A A C A T A A A A G G T G G A A C G C G G T A A A C C C C T C A A G C T A G C A A C C C A A A C T T T G G T A G G G G C A T G G A A A A
S. oralis G T C G A A A T G G C T A A G T C T T C G G A T A G G T C A A A A T A G G C T T G A A A - G T G C C A C A G T G A C G G A G T C T T T C T G G A A A C A G A G A G A G T G G A A C G C G G T A A A C C C C T C A A G C T A G C A A C C C A A A T T T T G G T C G G G G C A T G G A G T G
S. parasanguini G G T G A A A G A G C T A A G T C T T T G G A T A G G T T C T A G T A A C T C T G A A A - G T G C C A C A G T G A C G T A G T T T T T A G G G A A A C C T A A A A A G T G G A A C G C G G T A A A C C C C T C A A G C T A G C A A C C C A A A C T T T G G T C G G G G C A T G G A A T A
S. peroris A G T G A A T A G A C T A A G T T T T C G G A T A T G T T T T A G T A G C T C T G A A A - G T G C C A C A G T G A C G A A G T C T C T C T G G A A A C A G A G A G A G T G G A A C G C G G T A A A C C C C T C A A G C T A G C A A C C C A A A T T T T G G T C G G G G C A T G G A G T G
S. pneumoniae G T T G A A A T G G C T A A G T C C T T G G A T A G G C C A G A G T A G G C T T G A A A - G T G C C A C A G T G A C G G A G T C T T T C T G G A A A C A G A G A G A G T G G A A C G C G G T A A A C C C C T C A A G C T A G C A A C C C A A A T T T T G G T C G G G G C A T G G A G T A
S. pyogenes A A G G A A A T G G C T A A G T C G T T T G A T A T G C C A A A G T A C T T C T G A A A - G T G C C A C A G T G A C G T A G T T T T T A T G G A A A C G T A A A A A A T G G A A C G C G G T A A A C C C C T C A A G C T A G C A A C C C A A A C T T T G G T A G G G G C A T G G G A T A
S. salivarius A G T G A A C T G G C T A A G T C T T C G G A T A G G T T T T A G T A G C T C T G A A A - G T G C C A C A G T G A C G G A G T T T T T A G G G A A A C C T A A A A A G T G G A A C G C G G T A A A C C C C T C A A G C T A G C A A C C C A A A C T T T G G T A G G G G C A T G G G A T G
S. sanguinis A C T G A A A G G A C T A A G T C T T T G G A T A T G C C T G A T T A G G T C T G A A A - G T G C C A C A G T G A C G T A G T T C T T C T G G A A A C A G A A G A A G T G G A A C G C G G T A A A C C C C T C A A G C T A G C A A C C C A A A C T T T G G T C G G G G C A T G G A G T G
S. sobrinus G A C G A A A G A G C T A A G T C T T T G G A T A A G T T T C - - T A G T C C T G A A A - G T G C C A C A G T G A C G G A G T T C T T G G G G A A A C T C A G G G A G T G G A A C G C G G T A A A C C C C T C A A G C T A G C A A C C C A A A C T T T G G T C G G G G C A C A A G A T G
S. urinalis A G T A A A A T A G C T A A G T C T T T T G A T A T G C T T G A G T A G C T C T G A A A - G T G C C A C A G T A A C G A A G T T T T T A T G G A A A C G T A A A A A A T G G A A C G C G G T A A A C C C C T C A A G C T A G C A A C C C A A A C T T T G G T A G G G G C A T G G G A T A
S. vestibularis A G C G A A C T A G C T A A G T C T T A G G A T A G G T T T T A G T A G C T C T G A A A - G T G C C A C A G T G A C G G A G T T T T T A G G G A A A C C T A A A A A G T G G A A C G C G G T A A A C C C C T C A A G C T A G C A A C C C A A A C T T T G G T A G G G G C A T G G G A T G

Position ruler: 290 300 310 320 330 340 350 360 370 380 390 400

S. agalactiae G T T G G A A T G A G A A C A A T C T A T C C T G A C T G C T T T - - - G C A G - - - - - - - - - - - - - - - - - T A G A C A G A T G A T T A T C G A A A G A A A C A A T T C C T A G T T G T T T C T G G A A C A A A A C A T G G C T T A T A G A A
S. anginosus C G T G G A A A C - G A A C G C A G C A T T C T G A C T G G A A - - - - A C A G - - - - - - - - - - - - - - - - - T A G A C A G A T G A T T A T C G A A G G A A A T G A T A C C T A G T T A T T T C T G G A A C A A A A C A T G G C T T A T A G A A
S. bovis G T A G G A A A C - G A A C G A T T T A T C C T G A C T G G A A - - - - A C A G - - - - - - - - - - - - - - - - - T A G A C A G A T G A T T A T C G A A G G A G A T A A G A C C T A G T T A T C T C T G G A A C A A A A C A T G G C T T A T A G A A
S. constellatus C G T G G A A A C - G A A C A C A G C A T T C T G A C T G G A A - - - - A C A G - - - - - - - - - - - - - - - - - T A G A C A G A T G A T T A T C G A A G G A A A T G A T A C C T A G T T A T T T C T G G A A C A A A A C A T G G C T T A T A G A A
S. cristatus C G T G G A A A C - G A A C A T A G C A T T C T G A C T G G A A - - - - A C A G - - - - - - - - - - - - - - - - - T A G A C A G A T G A T T A T C G A A G G A G A T G G T A C C T A G T C A T T T C T G G A A C A A A A C A T G G C T T A T A G A A
S. dysgal equis G T T G G A A A C - G A A C A A G C T A T C C T G A C T G T T A T T A G A C A G - - - - - - - - - - - - - - - - - T A G A C A G A T G A T T A T C G A A A G A G A T G A G A C C T A G T C A T T T C T G G A A C A A A A C A T G G C T T A T A G A A
S. gordonii C G T G G A A A C - G A A C A T T G C A T T C T G A C T G G A A - - - - A C A G - - - - - - - - - - - - - - - - - T A G A C A G A T G A T T A T C G A A G G A A A T G G T A C C T A G T C A T T T C T G G A A C A A A A C A T G G C T T A T A G A A
S. infant coli G T A G G A A T C C G A A C G A T C T A T C C T G A C T G G A A - - - - A C A G - - - - - - - - - - - - - - - - - T A G A C A G A T G A T T A T C G A A G G A G A T A A G A C C T A G T T A T C T C T G G A A C A A A A C A T G G C T T A T A G A A
S. infant infan G T A G G A A T C C G A A C G A T C T A T C C T G A C T G G A A - - - - A C A G - - - - - - - - - - - - - - - - - T A G A C A G A T G A T T A T C G A A G G A G A T A A G A C C T A G T T A T C T C T G G A A C A A A A C A T G G C T T A T A G A A
S. infantis C G C G G A A A C - G A A C G T A G T A C T C T G A C T G C T A - - - - G C A G C T T C T A G C T G T T A G T G G T A G A C A G A T G A T T A T C G A A A G A A G T G G T - C C T A G T C A C T T C T G G A A C A A A A C A T G G C T T A T A G A A
S. intermedius C G T G G A A A C - G A A C A T A G C A T T C T G A C T G G A A - - - - A C A G - - - - - - - - - - - - - - - - - T A G A C A G A T G A T T A T C G A A G G A A A T G A T A C C T A G T T A T T T C T G G A A C A A A A C A T G G C T T A T A G A A
S. mitis C A C G G A A A C - G A A C G T A G T A C T C T G A C T G C T A - - - - T C A G C T A T A G G C T G T T A G T G G T A G A C A G A T G A T T A T C A A A G G A A G T G G T - C C T A G T C A C T T C T G G A A C A A A A C A T G G C T T A T A G A A
S. mutans G C T G G A A A A A G A A C G G T C T T T T C T G A C T G C A T A A - - G C A - - - - - - - - - - - - - - - - - G T A G A C A G A T G A T T A T C A A A A A A G G T G G T A C C T A G T C A C C T T T G G A A C A A A A C A T G G C T T A T A G A A
S. oralis C A C G G A A A C - G A A C G T A G T A C T C T G A C T G C T A - - - - G C A G A T T T A T G C T G T T A G C G G T A G A C A G A T G A T T A T C G A A G G A A G T G G T - C C T A G T C A C T T C T G G A A C A A A A C A T G G C T T A T A G A A
S. parasanguini C G T G G A A A C - G A A C A T A G T A T T C T G A C T G G A A - - - - A C A G - - - - - - - - - - - - - - - - - T A G A C A G A T G A T T A T C G A A G G A A A T G G T A C C T A G T C A T T T C T G G A A C A A A A C A T G G C T T A T A G A A
S. peroris C G C G G A A A C - G A A C G T A G T A C T C T G A C T G C T A - - - - G C A G C T T C T A G C T G T T A G T G G T A G A C A G A T G A T T A T C G A A A G A A G T G G T - C C T A G T C A C T T C T G G A A C A A A A C A T G G C T T A T A G A A
S. pneumoniae C G C G G A A A C - G A A C G T A G T A T T C T G A C T G C T A - - - - T C A G C T A G A G - C T G T T A G T G G T A G A C A G A T G A T T A T C G A A G G A A G T G G T - C C T A G T C A C T T C T G G A A C A A A A C A T G G C T T A T A G A A
S. pyogenes G T T G G A A A C - G A A C A A G C T A T C C T G A C T G T C A - - - - A C A G A - - - - - - - - - - - - - C G G T A G A C A G A T G A T T A T C G A A G G A A A T A A T T C C T A G T T A T T T C C G G A A C A A A A C A T G G C T T A T A G A A
S. salivarius G C T G G A A A C - G A A C G G A C C A T C C T G A C T G C T T T - - - G C A G - - - - - - - - - - - - - - - - - T A G A C A G A T G A T T A T C G A A G G A A G T G G T A C C T A G T C A C T T C T G G A A C A A A A C A T G G C T T A T A G A A
S. sanguinis C G T G G A A G C - G A A C A T A G C A C T C T G A C T G G A A - - - - A C A G - - - - - - - - - - - - - - - - - T A G A C A G A T G A T T A T C G A A G G A A A T G G T A C C T A G T C A T T T C C G G A A C A A A A C A T G G C T T A T A G A A
S. sobrinus G C T G G A A T T C G A A C G G A C C A T C T T G A C T G C G C A A - - G C A G - - - - - - - - - - - - - - - - - T A G A C A G A T G A T T A T C G A A G G G G A C A G - G C C T A G T - G T C T C T G G A A C A A A A C A T G G C T T A T A G A A
S. urinalis G T T G G A A A C - G A A C A A G C T A T C C T G A C T G T T T G - - - A C A G - - - - - - - - - - - - - - - - - T A G A C A G A T G A T T A T C G A A G G A A A T G A T T C C T A G T T A T T T C C G G A A C A A A A C A T G G C T T A T A G A A
S. vestibularis G C T G G A A A C - G A A C G G A C C A T T C T G A C T G C T A T - - - G C A G - - - - - - - - - - - - - - - - - T A G A C A G A T G A T T A T C G A A G G A A G T G G T A C C T A G T C A C T T C T G G A A C A A A A C A T G G C T T A T A G A A

Appendix II. Alignment of the rnpB gene from 23 streptococci type strains.


Appendix III. The DQuPA interface.



Appendix IV. Pair-wise differences matrix of the P3 region in the rnpB gene from the 49 streptococci type strains calculated using DQuPA.


Each row lists the number of differences between the named strain and each of the strains in the rows below it, in the same order.

S. acidominimus 7 8 10 7 9 10 13 11 12 9 9 9 9 9 6 7 9 7 8 7 7 9 9 11 9 7 9 11 9 8 9 9 9 9 9 8 11 9 7 5 12 10 4 4 9 8 5 7
S. agalactiae 7 10 7 3 10 10 7 13 3 3 2 3 9 8 5 9 5 2 5 5 9 3 8 6 5 11 11 11 2 10 3 9 3 3 10 7 3 8 10 9 13 6 10 2 2 10 5
S. alactolyticus 9 5 9 9 11 7 11 9 8 8 9 11 9 6 11 5 6 6 6 8 8 8 5 6 7 9 8 9 9 8 8 7 7 8 8 8 6 10 8 9 6 8 8 9 10 6
S. anginosus 10 10 0 13 4 12 10 10 10 10 6 11 10 6 10 9 9 9 4 11 7 10 10 5 10 4 10 6 10 5 11 11 6 10 11 7 9 8 12 8 10 10 10 9 10
S. bovis 7 11 11 11 12 7 7 7 7 10 8 4 10 5 7 5 5 8 6 10 6 4 9 11 11 7 9 7 9 7 7 9 9 6 8 8 9 10 4 9 7 7 8 4
S. canis 10 11 9 13 0 1 1 0 11 9 4 11 6 3 3 3 11 2 9 9 4 10 11 12 1 10 1 11 3 3 11 8 2 9 10 9 13 7 10 2 1 10 4
S. constellatus 13 4 12 10 10 10 10 6 11 10 6 10 9 9 9 4 11 7 10 10 5 10 4 10 6 10 5 11 11 6 10 11 7 9 8 12 8 10 10 10 9 10
S. cricetus 11 7 11 10 11 11 9 13 10 9 10 9 11 11 11 10 10 10 10 13 11 14 10 12 10 11 9 9 13 7 10 10 10 9 6 11 11 11 10 10 10
S. cristatus 11 9 8 9 9 6 11 8 6 8 7 9 9 4 8 5 10 8 7 8 6 9 9 8 5 9 9 9 8 8 7 9 4 11 10 10 9 9 9 8
S. downei 13 12 13 13 12 14 11 12 12 12 12 12 10 12 11 10 11 11 14 12 14 12 12 10 12 12 11 10 12 12 9 12 5 10 10 13 14 9 11
S. dysgal_dysgal 1 1 0 11 9 4 11 6 3 3 3 11 2 9 9 4 10 11 12 1 10 1 11 3 3 11 8 2 9 10 9 13 7 10 2 1 10 4
S. dysgal_equisimilis 1 1 10 9 3 10 5 2 4 4 10 1 8 8 3 9 10 11 2 9 0 10 2 2 10 7 1 9 11 8 11 6 10 2 2 11 3
S. equi_equi 1 11 9 4 11 6 3 4 4 11 2 9 7 4 10 10 12 2 10 1 11 3 3 11 8 2 9 11 9 12 7 9 1 2 11 4
S. equi_zooepid 11 9 4 11 6 3 3 3 11 2 9 9 4 10 11 12 1 10 1 11 3 3 11 8 2 9 10 9 13 7 10 2 1 10 4
S. equinus 11 10 0 11 9 11 11 6 10 7 11 10 8 10 7 10 9 10 7 10 10 9 10 10 6 9 7 12 9 10 10 10 9 10
S. ferus 7 11 7 9 7 7 10 9 11 8 7 10 15 11 9 10 9 11 9 9 10 11 9 9 8 11 12 8 8 9 9 8 7
S. gallolyticus 10 3 4 1 1 8 2 6 5 0 9 9 11 4 9 3 9 3 3 9 8 2 6 8 6 11 3 9 3 4 8 0
S. gordonii 11 9 11 11 6 10 7 11 10 8 10 7 10 9 10 7 10 10 9 10 10 6 9 7 12 9 10 10 10 9 10
S. hyointestinalis 5 4 4 8 5 9 5 3 10 11 11 5 9 5 8 4 4 9 7 5 8 7 7 11 5 8 6 5 7 3
S. hyovaginalis 5 5 9 2 7 7 4 10 11 11 3 9 2 9 2 2 10 6 2 8 9 9 10 5 10 2 3 9 4
S. infant_coli 0 9 3 7 6 1 10 10 10 3 8 4 10 4 4 10 9 3 6 7 7 11 4 8 3 3 7 1
S. infant_infant 9 3 7 6 1 10 10 10 3 8 4 10 4 4 10 9 3 6 7 7 11 4 8 3 3 7 1
S. infantis 9 5 10 8 5 9 4 10 7 10 1 10 10 7 9 9 7 9 4 12 6 10 10 10 9 8
S. iniae 7 7 2 10 11 12 2 10 1 10 1 1 9 6 0 8 10 7 12 5 11 1 2 10 2
S. intermedius 9 6 4 8 7 9 7 8 6 7 7 6 7 7 4 11 5 12 8 12 8 9 11 6
S. macacae 6 10 10 11 8 8 8 11 7 7 9 9 8 7 7 9 8 6 8 7 8 7 6
S. macedonicus 9 9 11 4 9 3 9 3 3 9 8 2 6 8 6 11 3 9 3 4 8 0
S. mitis 9 3 11 7 9 6 10 10 2 10 10 7 10 7 13 7 9 11 11 10 9
S. mutans 11 11 10 10 10 11 11 10 14 11 8 12 7 14 10 11 11 11 12 9
S. oralis 12 7 10 5 11 11 5 11 11 7 11 8 13 9 10 12 12 11 11
S. orisratti 9 2 10 3 3 11 8 2 9 10 9 13 7 9 2 0 10 4
S. parasanguinis 9 8 9 9 6 9 9 7 11 9 10 7 12 10 10 11 9
S. parauberis 10 2 2 10 7 1 9 11 8 11 6 10 2 2 11 3
S. peroris 10 10 7 9 10 8 9 5 12 7 10 11 10 9 9
S. phocae 0 8 6 1 8 10 7 11 5 11 2 3 10 3
S. pluranimalium 8 6 1 8 10 7 11 5 11 2 3 10 3
S. pneumoniae 11 9 7 9 8 14 7 9 10 11 9 9
S. porcinus 6 9 9 8 8 9 10 7 7 9 8
S. pyogenes 8 10 7 12 5 11 1 2 10 2
S. rattus 9 7 12 5 10 8 9 9 6
S. salivarius 9 8 6 1 10 11 0 9
S. sanguinis 13 8 10 8 9 9 6
S. sobrinus 9 10 12 12 9 11
S. suis 7 6 7 6 3
S. thermophilus 11 9 1 10
S. uberis 2 10 3
S. urinalis 10 4
S. vestibularis 9
S. waius
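A pair-wise differences matrix of this kind can be illustrated with a short sketch. The code below is not DQuPA and makes its own assumptions: the sequences are already aligned to equal length, every mismatching column (including a gap against a base) counts as one difference, and the function names and toy sequences are hypothetical, not rnpB data.

```python
from itertools import combinations


def count_differences(seq_a, seq_b):
    """Count aligned positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Assumption: a gap ('-') opposite a base counts as one difference.
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


def difference_matrix(aligned):
    """Return {(name_1, name_2): differences} for every unordered pair."""
    return {
        (n1, n2): count_differences(aligned[n1], aligned[n2])
        for n1, n2 in combinations(aligned, 2)
    }


def indistinguishable_pairs(matrix):
    """Pairs with zero differences cannot be separated by this region."""
    return sorted(pair for pair, diff in matrix.items() if diff == 0)


# Toy aligned region (hypothetical sequences, for illustration only).
toy_region = {
    "strain_A": "GATTACA",
    "strain_B": "GATTACA",
    "strain_C": "GAT-ACT",
}

matrix = difference_matrix(toy_region)
for (n1, n2), diff in matrix.items():
    print(n1, n2, diff)
print("indistinguishable:", indistinguishable_pairs(matrix))
```

Pairs reported as indistinguishable correspond to the zero entries in the matrix above, such as S. anginosus and S. constellatus.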


Appendix V. Pyrosequencing of the P3 region in the rnpB gene for 23 type strains using primer A1123FS. The table shows the number of obtained bases in the sequence analysis and the number of errors in the first 30 sequenced nucleotides.

Primer A1123FS
                        SSB 1st run             SSB 2nd run             No SSB 1st run          No SSB 2nd run
Strain                  No. correct Errors[1]   No. correct Errors[1]   No. correct Errors[1]   No. correct Errors[1]
S. anginosus            50          0           45          0           24          1           45          0
S. bovis                52          0           53          0           38          0           nd          -
S. constellatus         45          0           45          0           40          0           nd          -
S. cristatus            58          0           58          0           54          0           nd          -
S. dysgal_equisimilis   51          0           27          2           42          0           nd          -
S. gordonii             17          1           53          0           53          0           nd          -
S. infant_coli          51          0           55          0           55          0           nd          -
S. infant_infant        55          0           42          0           55          0           nd          -
S. infantis             49          0           36          0           37          0           nd          -
S. intermedius          30          0           58          0           40          0           nd          -
S. mitis                49          0           52          0           52          0           nd          -
S. mutans               56          0           50          0           49          0           nd          -
S. oralis               29          1           46          0           34          0           nd          -
S. pneumoniae           52          0           49          0           7           1           46          0
S. parasanguinis        48          0           48          0           48          0           nd          -
S. peroris              54          0           54          0           54          0           nd          -
S. pyogenes             nd          -           54          0           25          1           54          0
S. salivarius           low peaks   -           50          0           52          0           nd          -
S. sanguinis            low peaks   -           59          0           59          0           nd          -
S. sobrinus             low peaks   -           54          0           49          0           nd          -
S. urinalis             low peaks   -           51          0           51          0           nd          -
S. vestibularis         47          0           52          0           49          0           nd          -
S. agalactiae           low peaks   -           46          0           51          0           nd          -

Primer A1124RS
                        SSB 1st run             SSB 2nd run             No SSB 1st run          No SSB 2nd run
Strain                  No. correct Errors[1]   No. correct Errors[1]   No. correct Errors[1]   No. correct Errors[1]
S. anginosus            39          0           39          0           7           2           nd          -
S. bovis                38          0           38          0           38          0           nd          -
S. constellatus         39          0           18          1           7           1           nd          -
S. cristatus            7           1           40          0           1           >4          nd          -
S. dysgal_equisimilis   37          0           37          0           37          0           nd          -
S. gordonii             16          3           38          0           1           3           nd          -
S. infant_coli          37          0           33          0           37          0           nd          -
S. infant_infant        37          0           1           1           37          0           nd          -
S. infantis             35          0           7           1           15          1           1           >4
S. intermedius          7           1           40          0           40          0           nd          -
S. mitis                7           1           40          0           1           2           1           1
S. mutans               23          1           39          0           9           1           9           1
S. oralis               40          0           40          0           35          0           nd          -
S. pneumoniae           40          0           36          0           35          0           40          0
S. parasanguinis        34          0           39          0           34          0           nd          -
S. peroris              7           1           40          0           1           >4          1           >4
S. pyogenes             32          0           37          0           35          0           37          0
S. salivarius           38          0           38          0           1           >4          1           >4
S. sanguinis            7           1           7           1           7           >4          19          3
S. sobrinus             38          0           40          0           35          0           nd          -
S. urinalis             37          0           37          0           1           >4          1           >4
S. vestibularis         34          0           38          0           1           >4          1           >4
S. agalactiae           37          0           37          0           32          0           nd          -

[1] Number of errors in the first 30 sequenced nucleotides
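The error count in the first 30 sequenced nucleotides can be sketched as a position-by-position comparison of a read against its reference. This is only an illustration under stated assumptions: the read and reference start at the same position, only substitutions are counted (no realignment for insertions or deletions), and the function names and example sequences are hypothetical. Treating the number of correct bases as the length of the error-free leading stretch is likewise one possible reading, not the scoring rule used in the tables.

```python
def errors_in_window(read, reference, window=30):
    """Count mismatches between read and reference in the first `window` bases."""
    # Compare only as far as both sequences (and the window) extend.
    n = min(window, len(read), len(reference))
    return sum(1 for i in range(n) if read[i] != reference[i])


def leading_correct(read, reference):
    """Length of the error-free stretch at the start of the read (an assumed metric)."""
    count = 0
    for read_base, ref_base in zip(read, reference):
        if read_base != ref_base:
            break
        count += 1
    return count


# Hypothetical read and reference, for illustration only.
reference = "ACGTTCGA"
read = "ACGTACGT"

print(errors_in_window(read, reference))            # whole read is shorter than 30
print(errors_in_window(read, reference, window=5))  # first five bases only
print(leading_correct(read, reference))
```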



Appendix VI. Pyrosequencing of the P3 region of the rnpB gene from 7 additional type strains and clinical strains using primer A1123FS. The table shows the number of bases obtained in the sequence analysis and the number of errors in the first 30 nucleotides.

                                                1st analysis            2nd analysis
Strain                                          No. correct Errors[1]   No. correct Errors[1]

Type strains
S. canis CCUG 27661                             1           >4          49          0
S. cricetus CCUG 27300                          52          0           nd          -
S. equi zooepid CCUG 23256                      51          0           nd          -
S. gallolyticus CCUG 35224                      5           >4          56          0
S. rattus CCUG 27642                            52          0           nd          -
S. suis CCUG 7984                               53          0           nd          -
S. thermophilus CCUG 21957                      47          0           nd          -

Well defined clinical strains
S. anginosus CCUG 35271                         40          0           nd          -
S. bovis CCUG 38214                             54          0           49          0
S. canis CCUG 27492                             49          0           nd          -
S. constellatus CCUG 4215                       40          0           nd          -
S. cristatus CCUG 43159                         41          0           nd          -
S. equi equi CCUG 27367                         22          >4          51 [SSB]    0
S. equi zooepid CCUG 43890                      25          3           29 [SSB]    1
S. gallolyticus CCUG 44889                      51          0           nd          -
S. gordonii CCUG 39576                          56          0           nd          -
S. infantarius coli CCUG 37794                  50          0           nd          -
S. infantarius infant CCUG 44960                55          0           nd          -
S. intermedius CCUG 28203                       41          0           nd          -
S. mitis CCUG 21026                             43          0           nd          -
S. mutans CCUG 39070                            56          0           nd          -
S. oralis CCUG 27681                            46          0           nd          -
S. peroris CCUG 39815                           52          0           nd          -
S. pneumoniae CCUG 28588                        46          0           nd          -
S. pyogenes CCUG 38726                          54          0           nd          -
S. sanguinis CCUG 35770                         59          0           nd          -
S. salivarius CCUG 41462                        54          0           nd          -
S. thermophilus CCUG 30577                      52          0           nd          -
S. vestibularis CCUG 24683                      Low peaks   -           39          0

Clinical strains with unclear identification
S. bovis 1.15 [2,4] (S. gallolyticus)           42 [3]      0           48 [3]      0
S. mitis/oralis α-haemolytic 2.41 [2] (S. pneumoniae)  43 [3]  0        28 [3]      1
Pneumococcus 3.56 [4]                           47 [3]      0           26 [3]      0
α-haemolytic 8.13 [4]                           43 [3]      0           nd          -
S. sanguinis [2] (S. oralis)                    46 [3]      0           39 [3]      0
S. oralis                                       43 [3]      0           nd          -
S. sobrinus                                     49 [3]      0           nd          -
S. salivarius                                   54 [3]      0           nd          -
S. mitis [2] (S. anginosus / S. constellatus)   40 [3]      0           27 [3]
S. vestibularis                                 27 [3]      1           50 [3]      0
S. parasanguinis                                45 [3]      0           nd          -
S. intermedius                                  52 [3]      0           nd          -
S. constellatus                                 27 [3]      3           19          2

[1] Number of errors in the first 30 sequenced nucleotides.
[2] Pyrosequencing of the strain resulted in a different species than the annotated one. The resulting species is given in parentheses.
[3] The number of correct nucleotides can be estimated because the first error occurred in a conserved region.
[4] Strain with no 100% match to any reference strain in the database.
[SSB] SSB was used in the reaction.



Appendix VII. Reproducibility test result: The table shows the number of obtained bases and the number of errors in the first 30 sequenced nucleotides.

Each cell gives No. correct, followed by / errors[1] where an error count is listed.

Strain            1st analysis  2nd analysis  3rd analysis  4th analysis  5th analysis
S. dysgal equi    49            51            38            51            51
                  49            51            38            51            38
                  51            51            49            51            51
                  51            51            38            51            51
S. equi zooepid   51            51            38            51            51
                  46            51            38            51            46
                  51            51            38            51            41
                  51            51            38            51            51
S. gordonii       59            59            17 / 1        17 / 1        17 / 1
                  58            49            17 / 1        17 / 1        49
                  58            17 / 1        17 / 1        17 / 1        53
                  17 / 1        17 / 1        17 / 1        58            57
S. pneumoniae     50            46            46            51            49
                  52            51            51            49            49
                  52            51            51            49            49
                  50            43            43            51            43
S. pyogenes       54            53            38            53            53
                  54            52            38            53            46
                  52            38            38            25 / 1        46
                  33            54            38            54            54
S. thermophilus   52            52            47            52            52
                  52            47            39            47            27 / 1
                  52            52            44            44            47
                  52            52            47            52            52
S. vestibularis   52            47            39            52            47
                  47            47            47            47            27 / 1
                  29 / 1        47            47            50            27 / 2
                  47            52            47            44            47

[1] Number of errors in the first 30 sequenced nucleotides
