
IDENTIFICATION AND ANNOTATION OF TRANSPOSABLE ELEMENTS

AND

AGENT- AND GIS-BASED MODELING OF PATHOGEN TRANSMISSION

A Dissertation

Submitted to the Graduate School

of the University of Notre Dame

in Partial Fulfillment of the Requirements

for the Degree of

Doctor of Philosophy

by

Ryan C. Kennedy

Gregory R. Madey, Co-Director

Frank H. Collins, Co-Director

Graduate Program in Computer Science and Engineering

Notre Dame, Indiana

January 2011

Abstract

by

Ryan C. Kennedy

The work presented here has two primary components: 1) the identification and annotation of transposable elements (TEs) and 2) a spatially-aware agent-based model of pathogen transmission.

Recent advances in sequencing technology have resulted in an explosion of genomic data. The identification of TEs is an important part of every genome project. This dissertation presents an automated homology-based approach to identify TEs, implemented as TESeeker, that produces consensus TEs up to 98% identical to manually annotated sequences. It also offers a design and implementation plan to allow for the inclusion of TEs in VectorBase’s community annotation pipeline.

Agent-based modeling is well suited to modeling natural phenomena. Coupling geographic information system (GIS) data with agent-based modeling further increases the utility of such simulations. This dissertation presents a GIS aware agent-based model of pathogen transmission as well as methods and recommendations for incorporating GIS data into a simulation. The model, named LiNK, was specifically developed to study the impact of landscape on pathogen transmission.

DEDICATION

To my family and friends


CONTENTS

FIGURES
TABLES
ACKNOWLEDGMENTS

CHAPTER 1: INTRODUCTION
  1.1 Overview
  1.2 Identification and Annotation of Transposable Elements
  1.3 Agent- and GIS-based Modeling of Pathogen Transmission
  1.4 Goals
  1.5 Organization
  1.6 Contributions

CHAPTER 2: TRANSPOSABLE ELEMENT AND BIOINFORMATICS BACKGROUND
  2.1 Introduction
  2.2 Molecular Biology
  2.3 Bioinformatics
    2.3.1 VectorBase
  2.4 Transposable Elements
  2.5 Transposable Element Identification
    2.5.1 De novo Discovery
    2.5.2 Structure-based Discovery
    2.5.3 Comparative Genomic Methods
    2.5.4 Homology-based Discovery
  2.6 Annotation
    2.6.1 DAS
    2.6.2 Ensembl
      2.6.2.1 Ensembl Genebuild
    2.6.3 Chado

    2.6.4 Hibernate
    2.6.5 VectorBase Community Annotation Pipeline
      2.6.5.1 Planned Updates to the VectorBase Community Annotation Pipeline
  2.7 Transposable Element Annotation
    2.7.1 VisualRepbase
  2.8 Summary

CHAPTER 3: AUTOMATED HOMOLOGY-BASED APPROACH FOR THE IDENTIFICATION OF TRANSPOSABLE ELEMENTS
  3.1 Introduction
  3.2 Approach for Identification of Transposable Elements
    3.2.1 Dependencies
      3.2.1.1 Library of Representative Sequences
      3.2.1.2 BLAST
      3.2.1.3 DNASTAR SeqMan II
      3.2.1.4 CAP3
      3.2.1.5 ClustalW2
      3.2.1.6 BioPerl
    3.2.2 General Description of Approach
      3.2.2.1 Identify Coding Region
      3.2.2.2 Encompass Complete Transposable Element
      3.2.2.3 Generate Consensus
      3.2.2.4 Identify Complete Transposable Element
    3.2.3 Implementation
    3.2.4 Advantages
    3.2.5 Limitations
  3.3 Results
    3.3.1 Pediculus humanus humanus
      3.3.1.1 Class I Elements
      3.3.1.2 Class II Elements
    3.3.2 Culex quinquefasciatus
    3.3.3 Anopheles gambiae PEST Genome
    3.3.4 Other Organisms
  3.4 Conclusion

CHAPTER 4: DESIGN AND PROOF-OF-CONCEPT PLAN FOR COMMUNITY ANNOTATION OF TRANSPOSABLE ELEMENTS ON VECTORBASE
  4.1 Introduction
  4.2 Transposable Elements and the VectorBase Community Annotation Pipeline

    4.2.1 Similarities to the VectorBase Community Annotation Pipeline
    4.2.2 Differences from the VectorBase Community Annotation Pipeline
    4.2.3 Transposable Element Representation in Chado
    4.2.4 Proof-of-Concept
  4.3 Design and Implementation Plan
  4.4 Conclusion

CHAPTER 5: SIMULATION AND MODELING BACKGROUND
  5.1 Introduction
  5.2 Simulation and Modeling
    5.2.1 Advantages and Disadvantages
    5.2.2 Building a Simulation Model
    5.2.3 Simulation Model Types
    5.2.4 Agent-based Modeling
    5.2.5 Equation-based Modeling
  5.3 Geographic Information Systems
    5.3.1 Raster Data
    5.3.2 Vector Data
  5.4 Integrating Geographic Information System Data into Agent-based Modeling
  5.5 Summary

CHAPTER 6: A GIS AWARE AGENT-BASED MODEL OF PATHOGEN TRANSMISSION
  6.1 Introduction
  6.2 LiNK Simulation Model
    6.2.1 Model Background
    6.2.2 Conceptual Model
    6.2.3 ODD Protocol Description of LiNK
      6.2.3.1 Purpose
      6.2.3.2 State Variables and Scales
      6.2.3.3 Process Overview and Scheduling
      6.2.3.4 Design Concepts
      6.2.3.5 Initialization
      6.2.3.6 Input
      6.2.3.7 Submodels
    6.2.4 Implementation
    6.2.5 Verification and Validation
  6.3 Geographic Information System Data and Agent-Based Modeling
    6.3.1 Approximating Geographic Information System Data in Simulations

    6.3.2 Raster Queries
    6.3.3 Spatial Queries
      6.3.3.1 Simplified Spatial Queries
    6.3.4 Precalculated Query Matrix
    6.3.5 GIS Aware Agents
      6.3.5.1 Movement
  6.4 Results
    6.4.1 Performance
  6.5 Analyzing Massive Amounts of Simulation Data
    6.5.1 LiNKStat
  6.6 Conclusion

CHAPTER 7: CONCLUSION
  7.1 Overview
  7.2 Automated Homology-based Approach for the Identification of Transposable Elements
    7.2.1 Future Work
  7.3 Community Annotation of Transposable Elements on VectorBase
    7.3.1 Future Work
  7.4 GIS Aware Agent-based Model of Pathogen Transmission
    7.4.1 Future Work
  7.5 Contributions

APPENDIX A: AUTOMATED APPROACH WALKTHROUGH
  A.1 Representative Amino Acid Coding Regions
  A.2 Identify Coding Region
    A.2.1 tblastn Search
    A.2.2 Extract Sequences from the Genome
    A.2.3 CAP3 Assembly
      A.2.3.1 CAP3 Contigs
      A.2.3.2 CAP3 Contigs Quality Scores
  A.3 Encompass Complete Transposable Element
  A.4 Generate Consensus
  A.5 Identify Complete Transposable Element
    A.5.1 CAP3 Assembly
    A.5.2 CAP3 Contigs Quality File
    A.5.3 Trimmed CAP3 Contigs

APPENDIX B: TESeeker WEBSITE

APPENDIX C: TESeeker USER MANUAL
  C.1 Installation
  C.2 Usage
  C.3 Example Search
  C.4 Additional Tools
  C.5 Technology

APPENDIX D: SELECTED AUTOMATED APPROACH SOURCE CODE
  D.1 Combine BLAST Hits
  D.2 Extract Sequences
  D.3 Trim CAP3 Contigs
  D.4 Generate Consensus

APPENDIX E: TRANSPOSABLE ELEMENTS IDENTIFIED
  E.1 P. humanus humanus
    E.1.1 Non-LTRs
      E.1.1.1 Hope-like SART
      E.1.1.2 Dong-like R4
    E.1.2 LTRs
      E.1.2.1 Mdg1 ty3/gypsy
    E.1.3 Transposons
      E.1.3.1 mariner
      E.1.3.2 MITE1
      E.1.3.3 MITE2
  E.2 C. quinquefasciatus
    E.2.1 Non-LTRs
      E.2.1.1 CR1
      E.2.1.2 I
      E.2.1.3 Jockey
      E.2.1.4 L1
      E.2.1.5 L2
      E.2.1.6 LOA
      E.2.1.7 Loner
      E.2.1.8 Outcast
      E.2.1.9 R1
      E.2.1.10 RTE
      E.2.1.11 Unclassified LINE
  E.3 D. melanogaster
    E.3.1 Transposons
      E.3.1.1 mariner

APPENDIX F: SCRIPT USED TO IDENTIFY MITES

REFERENCES

FIGURES

1.1 Dissertation Components
2.1 Central Dogma of Molecular Biology
2.2 Genetic Code
2.3 Global and Local Sequence Alignment
2.4 Typical mariner Class II Transposon Structure
2.5 Transposable Element (TE) Classification Scheme and Structures
2.6 Ensembl Location-based View on VectorBase
2.7 Ensembl Gene-based View on VectorBase
2.8 Ensembl Transcript-based View on VectorBase
2.9 VectorBase Gene Submission Form
2.10 VectorBase Community Annotation Pipeline Data Flow
2.11 VisualRepbase Interface
3.1 Approach Schematic
3.2 Methods of Combination
3.3 P. humanus humanus mariner element
3.4 C. quinquefasciatus Jockey element
4.1 Client-side TE Submission Process
4.2 TE Submission Form
4.3 Entity-Relationship Diagram of Selected Chado Tables
4.4 TE Start and Submit Page
4.5 TE Details Page
4.6 TE Structure Page
4.7 Proof-of-Concept Configuration
5.1 Raster vs. Vector Data

6.1 Female Macaque and Infant
6.2 Uluwatu Temple Site
6.3 Life Cycle Transition Diagram
6.4 LiNK Control Panel
6.5 LiNK Display
6.6 Temple Site Display
6.7 Temporal Relationship of Pathogen Parameters and Related Events
6.8 Pathogen Transition Diagram
6.9 Verification and Validation Techniques for Agent-based Models
6.10 Spatial Data Approximation
6.11 Macaque Movement
6.12 Comparison of Total Number of Infections grouped by Landscape and Population Size at Varying Sites
6.13 Pathogen Spread to Varying Temple Sites
6.14 Performance Comparison of Varying Query Methods
6.15 Scalability with Respect to Initial Number of Dispersed Macaques and Amount of GIS data
6.16 LiNKStat
6.17 LiNKStat Pathogen Transmission Graph
B.1 TESeeker Website
C.1 TESeeker Desktop
C.2 TESeeker Genomes Folder
C.3 TESeeker TELibrary
C.4 TESeeker Documentation
C.5 TESeeker Web Interface
C.6 TESeeker BLAST Interface
C.7 TESeeker Extract Interface
C.8 TESeeker Default Parameters
C.9 Web Interface File Browser
C.10 ClustalX Alignment with Annotated Element

TABLES

3.1 Pediculus humanus humanus NON-LONG TERMINAL REPEAT (NON-LTR) RESULTS
3.2 Culex quinquefasciatus RESULTS
6.1 MOVEMENT VALUES FOR DISPERSING MACAQUES
6.2 STATE VARIABLES
6.3 PERFORMANCE COMPARISON OF GUI LOAD TIME
6.4 PERFORMANCE COMPARISON OF TIME STEPS/S
6.5 SCALABILITY COMPARISON OF TIME STEPS/S
6.6 ADVANTAGES AND DISADVANTAGES (1- POOR; 5- EXCELLENT)

ACKNOWLEDGMENTS

I would like to thank my advisors, Dr. Frank Collins and Dr. Greg Madey, for their direction, patience, and encouragement. I would especially like to thank Dr. Greg Madey, with whom I have collaborated since I was an undergraduate, for his constant support and for providing me the opportunity to pursue this degree.

Thank you also to my committee members: Dr. Scott Emrich for his bioinformatics guidance, Dr. Agustín Fuentes for his direction on the LiNK project, and Dr. Tijana Milenkovic for her valuable contributions to this dissertation.

Additionally, I am grateful to Dr. Nora Besansky for serving as my outside chair and to Dr. Hope Hollocher for her direction on the LiNK simulation model and for serving as the outside chair for my proposal. Thank you to our administrative assistant, Ms. Joyce Yeats, for her invaluable assistance over the years.

Special thanks to Dr. Scott Christley for his unwavering support, insight, and contributions. Thank you also to Maria Unger and Jenica Abrudan for sharing their biological expertise while also contributing to much of this work. Thank you to Kelly Lane for her collaboration on the LiNK simulation model.

Finally, I would like to thank my family and friends, particularly my parents, Bill and Martha, and Carrianne Scheib. This work would not have been possible without them.

This research was supported in part by NIAID/NIH contracts HHSN272200900039C and HHSN266200400039C for “VectorBase: A Bioinformatics Resource Center for Invertebrate Vectors of Human Pathogens” [143] and NSF grants BCS#0639787 and BCS#0629787. Selected bioinformatics simulations were performed on the Notre Dame Biocomplexity Cluster supported in part by NSF MRI Grant No. DBI-0420980. Additional computational resources were provided in part by the Notre Dame Center for Research Computing [142].


CHAPTER 1

INTRODUCTION

1.1 Overview

The work presented in this dissertation consists of two related parts. The first part, Chapters 2, 3, and 4, concerns the discovery and annotation of transposable elements (TEs). The second part, Chapters 5 and 6, involves the development of a simulation model that utilizes agent-based modeling (ABM) and geographic information system (GIS) data to model pathogen spread. Both parts fall within the realm of computational biology, as shown in Figure 1.1. The following sections provide a brief introduction to each part, including our motivations.

1.2 Identification and Annotation of Transposable Elements

Transposable elements (TEs) are a type of repetitive sequence that has been found in nearly all eukaryotic genomes. They have the ability to move about and replicate within a genome and are believed to play a major role in genome evolution [82, 100, 130].

Largely because of their diversity and mobility, TEs are difficult to identify. We present an automated homology-based approach for the identification of TEs. This approach utilizes a comprehensive library of representative sequences as the basis for our search. We make heavy use of common bioinformatics technologies, namely BLAST, ClustalW2, CAP3, and BioPerl. Our approach, implemented as TESeeker, is designed to be easier to use than existing approaches and to produce high-quality consensus TEs, allowing for quicker genome annotation.

[Figure 1.1 diagram labels: Computational Biology; Bioinformatics; Agent-based Modeling; TE Identification; TE Annotation; Model of Pathogen Transmission; Global Health]

Figure 1.1. Dissertation Components. The first part of this dissertation lies within the Bioinformatics realm, while the second is categorized as Agent-based modeling. Each part of this dissertation shares global health implications.

We also present a design and implementation plan for the inclusion of TEs in VectorBase’s community annotation pipeline for genes. Existing TE annotation websites, such as TEfam [134] and Repbase [119], lack the detailed data that is available for genes. Extending VectorBase’s community annotation pipeline to include TEs would fill this gap and give researchers resources and detailed information not available elsewhere.

1.3 Agent- and GIS-based Modeling of Pathogen Transmission

There are numerous advantages to using a simulation for scientific study [10], including the ability to model and predict the behavior of a real-world system without altering the actual system. If a simulation can utilize real-world data, such as geographic data, it has even more potential to be valuable. We present an ABM that utilizes GIS data to simulate pathogen transmission. In particular, we are interested in the effect landscape has on pathogen transmission. Our simulation, named LiNK, has been developed to model pathogen transmission among macaque monkeys on the island of Bali, Indonesia. GIS data has been incorporated into LiNK to allow for spatially aware macaques. We are unaware of any other epidemiological studies that couple ABM with GIS data, or of any work studying efficient ways for mobile agents to interact with their environment. As such, we explore different means to include GIS data in an agent-based simulation and offer suggestions for a variety of applications.

1.4 Goals

This dissertation makes the following contributions:

• Development and implementation of an automated approach to detect transposable elements.

• Design and implementation plan for the incorporation of TEs into the VectorBase community annotation pipeline.

• Development and implementation of a GIS aware agent-based model of pathogen transmission.

1.5 Organization

The remainder of this dissertation is organized as follows. Chapter 2 introduces biological concepts necessary to the understanding of the first part of this dissertation. Chapter 3 describes our approach for the automatic identification of TEs, as well as the implementation of our approach, TESeeker. This chapter also presents the results of our approach, applied to a number of genomes, including comparisons to published data. Chapter 4 describes the community annotation pipeline for genes on VectorBase and how to extend it to allow for TEs. We present a design and implementation plan and describe a preliminary implementation. Chapter 5 introduces agent-based simulations and GIS. Chapter 6 thoroughly describes our model of pathogen transmission, LiNK. Chapter 7 summarizes the contributions of this dissertation and proposes future work. We conclude this document with supplementary information: Appendix A presents a walkthrough of the inner workings of TESeeker, Appendix B presents the TESeeker website, Appendix C presents the TESeeker user manual, Appendix D presents selected source code for TESeeker, Appendix E presents selected transposable elements, and Appendix F presents our script to detect MITEs.

1.6 Contributions

Applications of our approach to identify TEs have been included in the Pediculus humanus humanus and Culex quinquefasciatus genome papers [5, 83]. In particular, we authored all TE-related sections of the P. humanus humanus paper and contributed to the non-LTR section of the TE analysis in the C. quinquefasciatus paper. A paper describing our approach and its implementation is under review [80]. LiNK has been described in detail in an invited journal manuscript [76] and in conference proceedings [79]. A manuscript detailing initial biological implications of the model is in preparation [88].

CHAPTER 2

TRANSPOSABLE ELEMENT AND BIOINFORMATICS BACKGROUND¹

2.1 Introduction

Much of this dissertation relies on a basic understanding of molecular biology in addition to a familiarity with transposable elements and the field of bioinformatics. This chapter introduces these biological and bioinformatics concepts.

2.2 Molecular Biology

Cells form the basic components of life and serve varying purposes, but most have the ability to replicate. Each cell contains enough genetic information and machinery to make a complete copy of itself, a process called replication [69]. Jones and Pevzner liken this to a car factory that gathers the raw materials, prepares the materials, and assembles a copy of itself, all while making cars at the same time [69]. Because cells are the basic reaction vesicles in the body, understanding their inner workings would lead to a greater overall understanding of how the body functions, which would be very valuable to scientists.

Although cells come in a myriad of shapes and sizes, each has three common components: DNA, RNA, and protein molecules. DNA, or deoxyribonucleic acid, is often described as the building block of life, as it contains the genetic material governing how a cell operates. DNA, composed of the nucleotides adenine, guanine, cytosine, and thymine, is a double-stranded, helical molecule. RNA, or ribonucleic acid, on the other hand, is composed of only a single strand of the nucleotides adenine, guanine, cytosine, and uracil. RNA is used to carry copies of pieces of the DNA strand to other locations in the cell. Proteins are molecules made up of amino acids, which we describe later. Many proteins are enzymes, which can be thought of as the laborers of the cell because they perform functions varying from assembling strands of nucleotides to signaling other cells.

¹ Portions of this chapter were previously reported in Kennedy [75].

The double-stranded helical structure of DNA lends itself to replication. To replicate, the two strands of a chromosome’s DNA “unzip” from one another, and then an enzyme called DNA polymerase, which is prevalent throughout the cell, attaches itself to one of the strands. It then moves along the strand of DNA, attracting complementary nucleotides. These nucleotides hydrogen bond to their partners on the exposed strand, and the process continues until the chromosome is copied. The same process happens concurrently on the other original strand, completing replication.
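The base-pairing rule behind replication can be illustrated with a minimal Python sketch. The names `complement_strand` and `replicate` are hypothetical and chosen only for illustration, and the sketch deliberately ignores strand direction and the enzymes involved; it shows only that each separated strand determines its new partner.

```python
# Watson-Crick base pairs: adenine-thymine and guanine-cytosine.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement_strand(template: str) -> str:
    """Return the strand that pairs, base by base, with the template."""
    return "".join(PAIRS[base] for base in template)


def replicate(strand_a: str, strand_b: str):
    """Copy a duplex: each original strand templates one new strand.

    Illustrative only: strands are written position by position, so
    direction (5' to 3') is not modeled.
    """
    daughter_1 = (strand_a, complement_strand(strand_a))
    daughter_2 = (complement_strand(strand_b), strand_b)
    return daughter_1, daughter_2


parent = ("ATGCCA", complement_strand("ATGCCA"))
first, second = replicate(*parent)
print(first, second)
```

Because every base has exactly one partner, both daughter duplexes come out identical to the parent, and each keeps one of the original strands.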

It is also important to understand the Central Dogma of Molecular Biology, which outlines the general process by which proteins are generated from DNA, shown in Figure 2.1. First, DNA “unzips” and then an enzyme called RNA polymerase binds complementary nucleotides to one strand of DNA, starting at the promoter region. This “unzipping” continues until the RNA polymerase reaches the terminator region, at which point the RNA strand breaks off and the DNA strands “zip” back together, completing transcription. The RNA strand produced by transcription is more specifically called messenger RNA, or mRNA. Next, ribosomes translate the mRNA to produce a polypeptide chain. Another type of RNA, transfer RNA, or tRNA, continually floats around within the cell. Ribosomes move along the mRNA strand, reading groups of three nucleotides, called codons, at a time. Each codon encodes an amino acid. There are sixty-four possible combinations of nucleotides for a codon, yet there are only twenty different amino acids, as multiple combinations can code for the same amino acid. Partially for this reason, amino acid sequences are the preferred components of our transposable element library, described in detail in Chapter 3. Figure 2.2 shows which codons refer to which amino acids. As the ribosomes move along the mRNA, they decipher the codons, and tRNA molecules carrying the matching anticodons bring the corresponding amino acids. These amino acids are assembled into a chain, called a polypeptide chain, which is folded to make up a protein. This process proceeds from the start codon and continues until the stop codon is encountered.

DNA --(transcription)--> RNA --(translation)--> protein

Figure 2.1. Central Dogma of Molecular Biology. This is the process by which DNA undergoes transcription to produce RNA, which in turn undergoes translation to produce protein.
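The translation step described above can be sketched in a few lines of Python. This is an illustrative sketch, not code from TESeeker: the names `CODON_TABLE` and `translate` are hypothetical, amino acids are written with their standard one-letter codes, and `*` marks a stop codon. The table is built from the standard genetic code with bases ordered U, C, A, G.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon, with the first base varying
# slowest and the third base fastest over U, C, A, G. "*" is a stop codon.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"  # first base U
    "LLLLPPPPHHQQRRRR"  # first base C
    "IIIMTTTTNNKKSSRR"  # first base A
    "VVVVAAAADDEEGGGG"  # first base G
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Translate from the first start codon (AUG) to the first stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, so nothing is translated
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(len(CODON_TABLE))               # number of distinct codons
print(len(set(AMINO_ACIDS) - {"*"}))  # number of distinct amino acids
print(translate("GGAUGUUUGGCUAAUU"))  # AUG UUU GGC UAA
```

The table makes the redundancy of the code concrete: sixty-four codons map onto twenty amino acids plus the stop signal, so several codons share one amino acid.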

2.3 Bioinformatics

First base U:
  UUU, UUC Phenylalanine; UUA, UUG Leucine
  UCU, UCC, UCA, UCG Serine
  UAU, UAC Tyrosine; UAA, UAG Stop
  UGU, UGC Cysteine; UGA Stop; UGG Tryptophan
First base C:
  CUU, CUC, CUA, CUG Leucine
  CCU, CCC, CCA, CCG Proline
  CAU, CAC Histidine; CAA, CAG Glutamine
  CGU, CGC, CGA, CGG Arginine
First base A:
  AUU, AUC, AUA Isoleucine; AUG Methionine
  ACU, ACC, ACA, ACG Threonine
  AAU, AAC Asparagine; AAA, AAG Lysine
  AGU, AGC Serine; AGA, AGG Arginine
First base G:
  GUU, GUC, GUA, GUG Valine
  GCU, GCC, GCA, GCG Alanine
  GAU, GAC Aspartic acid; GAA, GAG Glutamic acid
  GGU, GGC, GGA, GGG Glycine

Figure 2.2. Genetic Code. A codon is represented by a set of three nucleotides. Each codon codes for an amino acid.

Bioinformatics is the application of computer science techniques to solve biological problems. Bioinformatics aims to develop a better understanding of the function of genes through the use of advanced, yet easy-to-use and often web-based, interfaces. Computer science plays a central role in bioinformatics because the study and analysis of large amounts of genetic data, which is otherwise time-consuming and prone to error, readily lends itself to the computer science discipline. We next briefly describe several relevant bioinformatics research areas, most of which are utilized in Chapters 3 and 4.

Genome Annotation Genome annotation refers to locating genes in a sequence

and then giving biological meaning to those regions. Genome annotation is

often classified into functional and structural annotation. Functional anno-

tation refers to deciphering the function of a gene, such as how a gene is

expressed. Structural annotation identifies characteristics of a gene, such as

where a coding region is located.

Sequence Alignment Sequence alignment is the comparison of two or more
sequences to one another. The goal of sequence alignment is to find similarities
between or among the sequences. There are three types of sequence alignment:
global, local, and semiglobal alignment. Global alignment involves finding
the best alignment of two sequences using every nucleotide or amino acid in
each sequence, while local alignment concentrates on aligning shorter
fragments in highly conserved regions. The Needleman-Wunsch algorithm is
typically used for global alignment, and the Smith-Waterman algorithm is
used for local alignment. Figure 2.3 shows example global and local
alignments for two sequences. Semiglobal alignment aims to align two sequences
such that only one sequence needs to be used in its entirety while only part
of the other sequence is aligned. Aligning more than two sequences together
is called multiple sequence alignment.

Input Sequences

CAATCAGATCTCAT
CAATGATCAT

Global Alignment

CAATCAGATCTCAT
|||| |    ||||
CAATGA----TCAT

Local Alignment

CAATCAGATC
||||  ||||
CAAT--GATC

Figure 2.3. Global and Local Sequence Alignment. Here, we show examples of global and local alignment of the same two sequences. Global alignment aims to match best over the entire sequence, which may result in the insertion of multiple gaps, as shown here. Additionally, both sequences are found in their entirety. While global alignment produces the best alignment that utilizes all nucleotides in each sequence, local alignment does not necessarily utilize all nucleotides and uses shorter segments. Local alignment produces a different alignment because it preferentially aligns shorter fragments.

Sequencing Sequencing is the process used to find the order of nucleotides in
DNA. A common method to perform this is known as Sanger sequencing
[125]. Another common technique is shotgun sequencing, which builds on
Sanger sequencing [102, 125]. In shotgun sequencing, a sequence is

divided into many small, random fragments and then sequenced using the

chain termination method. This method was used in the sequencing of the

human genome [145]. High-throughput techniques, such as 454, Illumina,

and SOLiD sequencing, are now the most commonly used DNA sequencing

methods [96].

Genome Assembly Genome assembly refers to the process of assembling many

short DNA sequences together to form the original chromosome(s) they once

composed. The short sequences are often generated by shotgun sequencing.

Gene Expression The process by which information in a DNA sequence is used to
produce RNA or protein is referred to as gene expression. Studying gene expression helps
elucidate the function of such a sequence, and expression is reflected phenotypically,

meaning the effect is observable in the organism.
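The difference between the global and local alignment described under Sequence Alignment above can be sketched with the two dynamic programming recurrences. This is a minimal sketch under stated assumptions: the scoring values (match +1, mismatch -1, linear gap penalty -1) and the function names are chosen here for illustration, and only the optimal score is computed, not the alignment itself.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch score: every character of both sequences is used."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman score: best-scoring pair of substrings, never below zero."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


# The two input sequences from Figure 2.3.
print(global_score("CAATCAGATCTCAT", "CAATGATCAT"))
print(local_score("CAATCAGATCTCAT", "CAATGATCAT"))
```

The only differences between the two functions are the initialization of the first row and column and the floor of zero in the local recurrence, which lets a local alignment restart wherever the running score would become negative. With other scoring parameters the optimal alignments, and therefore the scores, may differ from those drawn in Figure 2.3.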

2.3.1 VectorBase

VectorBase [91, 92] is a bioinformatics resource that serves as a web-based

facilitator to a wealth of information and tools pertaining to invertebrate vectors

of human pathogens. At VectorBase, researchers can, among other things, browse

genomic data, contribute to community annotation of a genome, run bioinformat-

ics tools such as BLAST, ClustalW, or HMMER, and obtain relevant information

about the vector organisms. VectorBase is an NIAID Bioinformatics Resource

Center (BRC) and serves as a facilitator and motivator for much of the work in

this dissertation.

2.4 Transposable Elements

Transposable elements (TEs) are a type of repetitive sequence that has been

found in nearly all eukaryotic (nucleus containing) genomes. First discovered and

analyzed by McClintock in the 1950s [98], TEs have the ability to move about

and replicate within a genome. Due to their mobile and replicative nature, TEs

often occupy large portions of genomes. TEs are estimated to represent 47% of
the genome of the yellow fever mosquito, Aedes aegypti [105], 35% of the genome of the frog
Xenopus tropicalis [61], and 45% of the human (Homo sapiens) genome [65]. This

prevalence of TEs poses a major difficulty in sequence assembly, as repeat regions

are prone to misassembly [101, 111]. TEs can impact host genomes in a number of

ways. They are believed to play a major role in genome evolution [82, 100, 130], as

they can insert themselves into, mutate, and move genes, thereby influencing gene

expression. In turn, TEs can cause gene variation and transfer genetic material

[12, 28, 129, 138].

The process by which TEs move about a genome is called transposition. TEs
are classified according to their transposition mechanism into Class I and Class II
elements. Class I TEs, or retrotransposons, transpose through an RNA intermediate.
They are transcribed to RNA and then reverse transcribed back into DNA, typically
by a TE-encoded reverse transcriptase enzyme; this is the so-called “copy-and-paste”
mechanism. The presence or absence of long terminal repeats (LTRs) further
classifies Class I TEs into non-LTR and LTR elements. Class II TEs, or transposons,
are DNA-mediated and transpose through the use of a transposase enzyme. Class II
TEs are typically bounded on each end by terminal inverted repeats (TIRs), which
flank and serve as the recognition sequence for the transposase. The transposase
follows a “cut-and-paste” mechanism, as it cuts out the TE from the host DNA and
allows it to insert at a new site in the host DNA. Many TEs have preferential insertion
sites and the method by which TEs move about genomes often produces artifacts
flanking the TEs, called target site duplications (TSDs). Both Class I and II
TEs are further divided into families, each with distinguishing characteristics. We
follow the classification scheme described by Tu [139], summarized in Figure 2.5.

Figure 2.4. Typical mariner Class II Transposon Structure. Mariner transposons are characterized by a single transposase flanked by terminal inverted repeats (TIRs) and a preferential target site duplication (TSD). There are generally 20-30 base pairs (bp) of untranslated region (UTR) flanking the transposase as well. Element order shown in the figure: TSD (TA), TIR, UTR, transposase, UTR, TIR, TSD (TA).

Figure 2.5. TE Classification Scheme and Structures. The figure, adapted from [139], shows the typical division of TE classes, as well as major families within each. PR, IN, RT, RH, and APE refer to enzymes. TSD refers to target site duplications and ORF refers to open reading frame. ORFs are coding sequences without a stop codon. This figure is not drawn to scale. Elements shown:

Class I TEs: LTR Retrotransposon, 5-9 kb; non-LTR Retrotransposon (LINE), 5-8 kb; SINE, < 500 bp.

Class II TEs: “cut-and-paste” DNA Transposon, typically under 5 kb; MITE, typically under 500 bp; Helitron, around 7-9 kb.

Class I RNA-mediated retrotransposons are divided into several main families,
which we next describe.

LTR Retrotransposons– Members of the LTR retrotransposon group are
typically 5-9 kb (kilo base pairs) long and have a 200 to 500 bp (base
pairs) LTR on both ends. LTR retrotransposons encode polymerase (pol) and
group-specific antigen (gag) proteins [22, 139]. Example families of LTR
retrotransposons include Ty1/copia, Ty3/gypsy, and BEL.

Non-LTR Retrotransposons– The non-LTR retrotransposons include long
interspersed nuclear elements (LINEs). LINEs are generally 5-8 kb
long with open reading frames (ORFs) that contain the coding sequence necessary
for transposition [22, 129, 139]. Non-LTRs also have a pol region. L1,
L2, R1, Jockey, and Penelope are several example non-LTR retrotransposons.

SINEs– SINEs, or short interspersed nuclear elements, are similar to LINEs,
but much shorter, with a typical length of under 500 bp [139]. SINEs
do not code for proteins and instead use transposition mechanisms from
other retrotransposons.

Class II Transposons are mediated by DNA transposition. Transposons cut and

paste themselves within genomes without the use of an RNA intermediate.

Descriptions of several families of transposons follow.

mariner– The mariner transposons are characterized by a DDD motif and
generally contain a single-exon transposase. A motif is a conserved pattern of amino
acids, which, in this case, are all aspartic acid residues, each represented
by a ‘D.’ Mariners are widespread in invertebrates and are

well characterized. We show an annotated mariner transposon in Ap-

pendix E.1.3.1.

P– The P element transposons were first discovered in the fruit fly Drosophila

melanogaster genome and were the first transposons to be used as a

gene vector [73, 122]. Since the initial discovery of P elements, they

have been found in several other species, most notably the malaria

mosquito Anopheles gambiae [126].

piggyBac– The piggyBac family of transposons was first discovered in the

cabbage looper moth Trichoplusia ni [23]. This transposon has since

been found in a number of organisms and has proven effective as a gene

vector for a variety of organisms, including Aedes aegypti [95].

Tc1 – Like all transposons, Tc1 transposons are flanked by inverted repeat

sequences. They are valuable gene vectors, extensively used in the

analysis of Caenorhabditis elegans [110]. Tc1 transposons are typically

recognized by their DDE motif.

pogo– These transposable elements are members of the Tc1 superfamily.

They are very similar to Tc1 transposons, but lack the DDE catalytic

motif.

MITEs– MITEs, or miniature inverted-repeat transposable elements, are

small elements often under 500 bp. Their mechanism of transposition

is not entirely known and they do not encode a protein.

Helitrons– Helitrons replicate using a rolling-circle mechanism and encode

helicase-like proteins.

2.5 Transposable Element Identification

There are several difficulties with TE identification. TEs do not adhere to a

universal structure; instead, some families of TEs follow specific structures. An

example would be the TIR-transposase-TIR general structure of a Class II trans-

poson, such as in the mariner element. Additionally, the structure of TEs can

mutate over time. For example, TEs may preferentially insert themselves in sim-

ilar regions of the genome, or even within one another, leading to many nested

and fragmented copies. While autonomous, or active, TEs possess intact reading

frames which serve as mechanisms for transposition, the majority of TEs are non-

autonomous, or not active. Non-autonomous TEs can often still be transposed,

using the transposition machinery of other elements in their class, but they cannot
transpose themselves. For these reasons, a general approach cannot be used

to identify all TEs. Instead, several approaches are used with varying levels of

effectiveness.

The automatic computational identification of TEs is not as robust or mature

as analogous methods currently used for genes [115]. Bergman and Quesneville [13]

describe many TE discovery methods and classify existing TE discovery techniques

into de novo, structure-based, comparative genomic, and homology-based. Saha

et al. and Lerat more recently reviewed approaches to identify TEs [93, 124] and

classify identification techniques into similar groups: ab initio, signature-based,

and library-based techniques. We next describe the approaches according to the

Bergman and Quesneville classification.

2.5.1 De novo Discovery

De novo TE discovery approaches look for similar sequences found at multiple

positions within a genome. Once identified, the sequences are typically clustered,

filtered, and characterized. While computationally expensive, this approach can

identify novel TEs and is most effective in discovering TEs with high prevalence

within a genome. De novo techniques are not as effective in identifying degraded

TEs, meaning TEs with mutated or incomplete structure. Example de novo tools

include PILER [40] and RECON [11].
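The core idea of de novo discovery, finding sequences that recur at multiple positions within a genome, can be illustrated with a toy k-mer index. This is only a sketch of the general idea under simplifying assumptions (exact matches, a fixed word length, a single strand); it is not how PILER or RECON operate, and the function name `repeated_kmers` is hypothetical.

```python
from collections import defaultdict


def repeated_kmers(genome, k, min_copies=2):
    """Map each k-mer seen at least min_copies times to its start positions."""
    positions = defaultdict(list)
    for i in range(len(genome) - k + 1):
        positions[genome[i:i + k]].append(i)
    return {kmer: starts for kmer, starts in positions.items()
            if len(starts) >= min_copies}


genome = "ACGTACGTTTACGT"
for kmer, starts in sorted(repeated_kmers(genome, 4).items()):
    print(kmer, starts)
```

A real de novo tool must additionally tolerate mismatches and gaps, extend and cluster the seeds, and filter out repeats that are not TEs, which is where most of the computational expense arises. The sketch also shows why degraded, low-copy elements are hard to find this way: once copies diverge, exact words shared between them become rare.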

2.5.2 Structure-based Discovery

Structure-based approaches, such as LTR_STRUC [97], typically work well
to identify complete TEs that conform to a defined and conserved structure. In
this case, LTR_STRUC is effective at finding retrotransposons with LTRs at each
end of the element. Structure-based methods are less useful when searching for

degraded TEs or for TEs without a conserved structural characteristic.
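A structure-based check can be illustrated with the two terminal signatures discussed in this chapter: direct repeats at both ends (as with LTRs) and inverted repeats at both ends (as with TIRs). The sketch below is a toy example that assumes exact, ungapped repeats on an already excised candidate element; it is not the LTR_STRUC algorithm, and all function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def terminal_direct_repeat(seq):
    """Length of the longest prefix that reappears as a suffix (LTR-like)."""
    for n in range(len(seq) // 2, 0, -1):
        if seq[:n] == seq[-n:]:
            return n
    return 0


def terminal_inverted_repeat(seq):
    """Length of the longest prefix matching the reverse complement of the suffix (TIR-like)."""
    for n in range(len(seq) // 2, 0, -1):
        if seq[:n] == reverse_complement(seq[-n:]):
            return n
    return 0


print(terminal_direct_repeat("ACGTTTTTACGT"))
print(terminal_inverted_repeat("AAGCTTTTTGCTT"))
```

Because both checks demand intact ends, a single mutation or truncation in either terminus defeats them. This is the weakness of structure-based methods on degraded TEs noted above.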

2.5.3 Comparative Genomic Methods

A comparative genomic discovery method described by Caspi and Pachter

[24] uses multiple sequence alignments of closely related genomes to detect large

changes between the genomes. The idea is that differences in the genomes, called

insertion regions, could be TEs or caused by TEs. Such differences are analyzed

and classified. This approach is useful when related genomes are readily available

and can identify new families of TEs. Common ancestral TEs will likely not be

identified by this approach.
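The notion of an insertion region can be illustrated on a single pairwise alignment: a run of gap characters in one genome opposite sequence in the other marks a candidate insertion. The sketch below assumes the alignment has already been computed and uses a hypothetical function name and an arbitrary minimum length; it is far simpler than the multiple-alignment method of Caspi and Pachter [24].

```python
def insertion_regions(aligned_a, aligned_b, min_length=3):
    """Return (start, end) alignment columns where aligned_a is gapped but aligned_b is not."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    regions = []
    start = None
    for col, (x, y) in enumerate(zip(aligned_a, aligned_b)):
        in_gap = x == "-" and y != "-"
        if in_gap and start is None:
            start = col
        elif not in_gap and start is not None:
            if col - start >= min_length:
                regions.append((start, col))
            start = None
    # Close a gap run that extends to the end of the alignment.
    if start is not None and len(aligned_a) - start >= min_length:
        regions.append((start, len(aligned_a)))
    return regions


print(insertion_regions("ACGT----ACGT", "ACGTTTTTACGT"))
```

An insertion shared by both genomes produces no gap in the alignment, which is why TEs inherited from a common ancestor are likely to be missed by this kind of comparison.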

2.5.4 Homology-based Discovery

Homology-based approaches utilize known TEs as a means to discover similar

TEs in genomes. This is typically done by manually using alignment programs,

such as BLAST [1], to align known TEs to the genome in question and then care-

fully analyzing the results. Biedler and Tu [14] reference a suite of TE-related

programs to identify and characterize TEs that are homology-based, and Ques-

neville et al. offer the BLASTER suite of tools [116] to detect TEs. Although

there are few homology-based tools and despite the fact that they struggle in

identifying TEs unrelated to known elements, they are normally most accurate in

identifying known TEs, as well as detecting degraded TEs. Existing homology-

based approaches also sometimes utilize hidden Markov models (HMMs) [2]. Such

approaches are effective for closely related genomes, but struggle with distantly

related species, as the models tend to capture more irrelevant data when searching

for diverse sequences. Additionally, homology-based approaches currently avail-

able are the fewest in number and least automated. Moreover, many are not

geared to output high-quality consensus sequences. For these reasons, our auto-

mated approach to identify TEs, described in Chapter 3, is homology-based.

2.6 Annotation

The annotation of genomic data refers to giving meaning to such data, namely

describing the functions served by specific regions of DNA. Annotation is most

often applied to finding the structure and function of genes, a process that can be

very time-consuming in the lab. Several computational tools are commonly used in

the automatic annotation of genomes. These tools can be categorized as structure-

or homology-based and are often used in conjunction with other tools. GENSCAN

[20] is a gene identification tool that utilizes the structure of introns (non-coding

regions) and exons (coding regions) in its computation. GeneWise [15] is another

tool that works based on protein sequence similarity, shown to be very effective

in locating known genes [16]. These tools often utilize Hidden Markov Models

(HMMs) [38, 39, 102]. Ensembl’s automatic gene annotation system is one of the

better-known annotation systems; however, it is far from the “gold standard” of

annotation [30]. Such a standard is labor-intensive and takes years to complete.

Genes vary in many ways, making their automatic annotation difficult. Kohany

et al. [84] note that while automatic annotation has the potential for very high

throughput and is not susceptible to user error or bias, it is difficult to reconstruct

sequences, particularly TEs (largely due to their fragmentation), using automated

techniques.

The VectorBase community annotation pipeline eases many of these concerns.

Gene experts have the ability to annotate genes through the community pipeline,

as well as link their scientific publications to their efforts. Their work is visible to

the world through VectorBase, and these experts are publicly recognized. VectorBase’s
current system utilizes the DAS, Ensembl, Chado, and Hibernate technologies,
each of which is used in our design and implementation plan to annotate TEs,

described in Chapter 4. We elaborate upon each technology below.

2.6.1 DAS

DAS, or distributed annotation system [35, 68], is the protocol used to dy-

namically display data in the VectorBase Ensembl genome browser. DAS utilizes

a client-server setup through which a client interacts with multiple servers. Its

main advantage is that the annotation information can be located in multiple

databases and on multiple servers, but can be gathered and used by a single client

(VectorBase).

2.6.2 Ensembl

Ensembl aims to produce and maintain automatic annotations

on selected eukaryotic genomes [44]. Ensembl regularly increases supported con-

tent; as of release 59, Ensembl supports 56 species [45]. For supported organisms,

Ensembl offers detailed information. Researchers can visually browse to locations

within a genome and then view detailed information, such as for a gene. The annotation
information is displayed through the various views of Ensembl’s rich genome browser.
The location-based view shows details for regions of the genome. The

gene-based view shows detailed gene information. Transcript information is also

available in the transcript-based view. This rich genomic annotation data is produced
through DAS tracks. Figures 2.6, 2.7, and 2.8 show the VectorBase implementation
of each of these views.

2.6.2.1 Ensembl Genebuild

Ensembl uses an automated pipeline to annotate genes in newly sequenced

genomes [31]. Their pipeline is largely homology-based and heavily relies on pro-

tein alignments with GeneWise [15]. A major advantage of GeneWise is that it

allows for frameshifts in the coding regions. Frameshifts are regions where the

reading frame has changed due to an insertion or deletion. Other alignment tools

such as exonerate are also utilized. Protein sources are obtained from UniProt

[141] and RefSeq [113]. A detailed description of the Ensembl pipeline is available

in Curwen et al. [31].
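The effect of a frameshift, as defined above, can be seen by splitting a coding sequence into codons before and after a single-base insertion. The sequences and the helper name `codons` below are made up for illustration; the sketch shows only why frameshifts disrupt protein alignments, not how GeneWise handles them.

```python
def codons(seq):
    """Split a sequence into complete, non-overlapping codons starting at position 0."""
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]


original = "ATGGCCAAAGGG"
# Insert a single base after the start codon; every downstream codon changes.
shifted = original[:3] + "T" + original[3:]

print(codons(original))
print(codons(shifted))
```

After the insertion, only the first codon is unchanged; every later codon is read from a shifted position. An aligner that cannot model frameshifts would lose the similarity downstream of that point, which is why tolerance to them matters for degraded coding regions.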

Figure 2.6. Ensembl Location-based View on VectorBase. Here, we show the VectorBase implementation of Ensembl’s location-based view on the 2L chromosome of the Anopheles gambiae genome.

Figure 2.7. Ensembl Gene-based View on VectorBase. The figure shows the VectorBase implementation of Ensembl’s gene-based view, showing gene AGAP003395 in Anopheles gambiae.

Figure 2.8. Ensembl Transcript-based View on VectorBase. The VectorBase implementation of Ensembl’s transcript-based view for Anopheles gambiae gene AGAP003395 is shown.

2.6.3 Chado

Chado is a part of the GMOD (generic model organism database) [53] project.

GMOD is a collection of open source software tools for creating and managing

genome-scale biological databases. Chado is a modular schema that is designed to

allow the addition of new modules for new data types. VectorBase uses Chado to

host its complex biological data, currently stored in more than 250 tables. TEs can
have a more complicated structure than “regular” genes, and the existing Chado
schema is complex and abstract enough to be able to accommodate them. There
is limited built-in support for TEs within Chado; in fact, GMOD
encourages adapting TEs to Chado [27], which prompted us to adapt Chado to
our specific needs, as outlined in Section 4.2.3.

2.6.4 Hibernate

Hibernate [62] is a library for Java that provides a framework for mapping

Java classes to a database using XML. This allows for the automatic updating,

retrieval, creation, and deletion of object data. The use of Hibernate eliminates

the need to write SQL for such operations.

2.6.5 VectorBase Community Annotation Pipeline

VectorBase has implemented a community annotation pipeline (CAP) for genes,

which has the ability to accept four types of annotation information:

1. Gene Models: Users can submit gene sequences to be incorporated into genebuilds.

2. Publications: Sequence data can be linked to the literature.

3. Controlled Vocabulary Terms: Controlled vocabulary terms can be associated with genes.

4. Comments: General comments are available for publication.

Figure 2.9. VectorBase Gene Submission Form. The figure shows a portion of the spreadsheet researchers populate with gene data to submit to VectorBase.

This data is submitted by full-time annotators and community researchers.

Data is collected through spreadsheets, a portion of one of which can be found in

Figure 2.9. The spreadsheets are uploaded through the VectorBase website and

the genome-specific curator is notified of the submission. If approved, the data is

aligned to the genome with exonerate and incorporated into VectorBase, making

it available to the community. Once submitted, links are available for researchers

to associate publications or controlled vocabulary terms with the data, as well as

to add comments. Additionally, community submitted genomic data is displayed

through Ensembl’s browser on VectorBase via DAS.

Community annotation data on VectorBase flows according to the schematic

shown in Figure 2.10, adapted from Bruggner [19], and also briefly described in

Butler [21]. Many technologies are utilized, most of which have been described

previously. VectorBase uses PostgreSQL databases [112], which communicate with

DAS through Hibernate. SOAP [131], an XML-based protocol for exchanging structured
messages, interfaces with Hibernate and the VectorBase web pages.

Figure 2.10. VectorBase Community Annotation Pipeline Data Flow. We show the flow of information on VectorBase for community gene annotation, adapted from Bruggner [19]. Labels in the figure include: Manual Annotation Database; Java Hibernate Chado API; FeatureSubmission Functions; FeatureSubmission Web Service Interface; FeatureSubmissionQuery Functions; FeatureSubmissionQuery Web Service Interface; PostgreSQL; Apache Tomcat & Axis; Community Genomic Annotation SubmissionForm.xls; DAS Server; Apache HTTP Server & PHP; Community Annotation Submission & Review Interface; Genome Browser Contig View; Genome Browser Gene View; Gene DAS Report; SOAP.

2.6.5.1 Planned Updates to the VectorBase Community Annotation Pipeline

VectorBase has been heavily updated since the initial implementation of CAP.

At the time of this work, the original VectorBase CAP has not been restored

to working order. In the long term, and particularly with VectorBase 2.0, CAP

is likely to evolve further. Potential new features to CAP include an improved

interface for the submission and presentation of data, additional means to submit

data, as well as the ability to search for community submitted data. Additional

plans for the new CAP are currently under development and should eventually

include TEs.

2.7 Transposable Element Annotation

There has been much work done on the annotation of genes [16, 20, 30, 84, 87],

but little on the annotation of TEs. TEfam [134], Repbase [119], and WikiPoson

[148] currently serve as the best online TE resources. TEfam currently caters

to TEs from the Anopheles gambiae, Aedes aegypti, and Culex quinquefasciatus

mosquito species, as well as the Ixodes scapularis tick. TEs from the body louse,

Pediculus humanus humanus, are available internally, with plans to make them

public in early 2011. There, users can submit and view representative sequences.

There is no structural display provided, but the TEs are well-annotated. The

submission process is regulated through user accounts. Meanwhile, Repbase has a

large database of consensus TEs from many taxa. Registered users may export

TEs in the EMBL, FASTA, or IG file format and can submit properly formatted

TEs by emailing the editor-in-chief at the site. WikiPoson is a newer site that has

various information about TEs, including descriptions of how TEs are classified,

descriptions of many elements, as well as TE-related news. Stand-alone tools,

such as Apollo [94], also exist, but they largely rely on biological expertise and

their data are often not available online.

2.7.1 VisualRepbase

VisualRepbase [136] offers an interface for the study of TEs available on Rep-

base. VisualRepbase is available as a downloadable Java archive and has several

important features. First, researchers can search for TEs by family and genome

and download sequences in the FASTA format. Second, VisualRepbase can dis-

play the location and orientation of its TEs on the selected genome, as well as

any available annotation data. Third, VisualRepbase can also display the location

of any properly formatted annotation data available for a given genome. Lastly,

VisualRepbase lists the number of occurrences by chromosome for selected TEs.

This mapping was performed with Censor [72] and is stored in a table format that

includes coordinates within the genome. Figure 2.11 shows the distribution of the

MARINERN3_AG transposon on the 2R Chromosome of Anopheles gambiae.

While useful, there are drawbacks to VisualRepbase. The visual display of

TEs lacks structural information. While the sequence orientation is shown, struc-

tural TE features such as TIRs are not shown. VisualRepbase is most useful for

showing the distribution of TEs within a genome, as well as their proximity to

any previously annotated data. Rich genome browsing features, such as those

available in Ensembl’s Genome Browser, would increase the utility of VisualRep-

base. Additionally, while VisualRepbase allows for the entry of new sequences,

it appears that users must also submit the associated annotation information,

including genomic coordinates, which is not always straightforward.

Figure 2.11. VisualRepbase Interface. The figure shows the distribution of the MARINERN3_AG transposon on the 2R chromosome of Anopheles gambiae. As shown in the figure, there are 109 occurrences on the chromosome with at least 50% identity to the consensus.

2.8 Summary

This chapter has presented background information necessary for an under-

standing of Chapters 3 and 4. In particular, we have discussed basic molecular bi-

ology and introduced several bioinformatics research areas, including annotation.

We have described TEs and techniques utilized to detect them within genomes.

Lastly, we have described TE annotation techniques. Chapter 3 uses the infor-

mation this chapter has presented to describe an approach to identify TEs and

Chapter 4 describes a plan for the annotation of TEs on VectorBase.

CHAPTER 3

AUTOMATED HOMOLOGY-BASED APPROACH FOR THE

IDENTIFICATION OF TRANSPOSABLE ELEMENTS¹

3.1 Introduction

The identification of TEs is an important part of every genome project. Unfor-

tunately, the automatic identification of TEs in novel genomes is far from mature.

In particular, there is a lack of automated homology-based approaches that pro-

duce high-quality consensus TEs [13]. As the number of sequenced genomes has

rapidly risen, the identification of TEs has received greater attention from the

scientific community. The ability to identify TEs automatically and effectively

in a manner similar to the methods used for genes is also of increasing impor-

tance. There exist many difficulties in identifying TEs, including their tendency

to degrade over time and that many do not adhere to a conserved structure. In

this chapter, we describe an easy-to-use, automated homology-based approach to

discover high-quality putative TEs. We apply this approach to recently sequenced

arthropod genomes and identify consensus TEs up to 98% identical to manually

annotated TEs. The implementation of our approach, TESeeker, is available for

download as a virtual appliance.

¹ Results of this chapter have appeared in the TE sections of Arensburger et al. [5] and Kirkness et al. [83]. Our approach, with selected results, is under review in Kennedy et al. [80].

3.2 Approach for Identification of TEs

Our approach targets the identification of TEs with homology (similarity) to

known TEs. While our approach has many potential applications, we focus on

characterizing TEs in novel genomes. The approach utilizes a diverse library of

representative TEs as a basis for BLAST searches against the genome in question.

The hits then undergo multiple iterations of processing before we produce a high-

quality consensus TE. This section introduces and describes our approach.

3.2.1 Dependencies

This approach relies upon several notable bioinformatics tools, described be-

low.

3.2.1.1 Library of Representative Sequences

Our modular homology-based approach relies on a thorough and high-quality

library of representative TEs, organized by family. When strong information is

available, amino acid coding regions, reverse transcriptases for Class I TEs and

transposases for Class II TEs, are the preferred components of the library. Nu-

cleotide sequences can also be used, but such sequences do not allow for as much

nucleotide variance during the search. Sequences for our library were chosen man-

ually from TEfam [134], NCBI [104], Repbase [71], and the literature. LTR reverse

transcriptases within the representative library were chosen with the assistance of

Jose Manuel C. Tubío [140]. Sequences with complete amino acid coding regions

were preferentially chosen, and a wide variety of related sequences was assembled

for each family. Currently, the library consists of 475 representative coding re-

gions from a variety of organisms and covering the major TE families. Further

details on the provided library are available within the FASTA files and online

[137]. Because the library consists of sequences in the FASTA format, researchers

can easily modify the library or create their own library for use in the approach.

3.2.1.2 BLAST

We utilize BLAST [17, 102], or basic local alignment search tool, to perform

sequence similarity searches. BLAST is one of the most widely used algorithms

in bioinformatics and works well for our purposes.

3.2.1.3 DNASTAR SeqMan II

While not used in the automated approach, DNASTAR SeqMan II [33] played

a central role in the development of this approach. DNASTAR SeqMan II works

similarly to an assembler in that it allows one to set various parameters, such as

match size, minimum match percentage, or minimum sequence length, to produce

contigs, which are consensus sequences generated from multiple sequences. We

utilized DNASTAR SeqMan II extensively in early versions of this approach.

3.2.1.4 CAP3

CAP3 [64] is a popular and mature sequence assembly tool. In most cases, CAP3 produces better quality contigs than the phrap assembler [57]. Its ability to clip low-quality regions of the input sequences is an additional advantage.

3.2.1.5 ClustalW2

ClustalW2 [89] is the newest version of the most widely used multiple sequence

alignment program. ClustalW2 offers a balance of speed and accuracy, while also

supporting the ability to produce phylogenetic trees.

3.2.1.6 BioPerl

We use Perl, the Practical Extraction and Report Language, and an extension of Perl called BioPerl [133] to perform the majority of our sequence analysis. Perl is an interpreted scripting language that lends itself well to the bioinformatics field largely because of its parsing capabilities. BioPerl provides modules for many common bioinformatics tasks, which makes it very powerful.

3.2.2 General Description of Approach

Our approach varies slightly depending on whether the representative TEs

are amino acid or nucleotide sequences, the main difference being that amino

acid searches require only a translated nucleotide genome search, tblastn, while

nucleotide sequences require translation of both themselves and the host genome,

tblastx. We next describe the approach that starts with an amino acid library

of TEs, shown graphically in Figure 3.1. A walkthrough of our approach for the mariner TE in P. humanus humanus is described in Appendix A.

The approach begins with BLAST searches against the genome using repre-

sentative TEs for the chosen family. Resulting BLAST hits are combined if they

overlap or are very close together, and are then extracted from the genome. We

next assemble the hits with CAP3 in an attempt to get a viable representation of

the coding sequence. We use the CAP3 results to do another BLAST search against

the genome and process the hits in the same manner. However, when extracting

the sequences from the genome, we add flanking regions. The length of the flanking region depends on the type of TE and enables us to capture the entire TE. These extracted results are then aligned and a consensus is generated. We use the consensus to perform a final BLAST search, again combining, extracting, and assembling the sequences. CAP3 then produces the high-quality, full-length consensus TE. We next describe the approach in more detail.

[Figure 3.1 schematic. Diagram labels: TE Library, Genome, BLAST, combine, CAP3, trim, ClustalW2, generate consensus, Consensus TE.]

Figure 3.1. Approach Schematic. The approach is composed of multiple, iterative steps. The general flow is as follows. A TE family is used in a BLAST search against the genome. Hits are then combined, extracted from the genome, and assembled with CAP3. Next, the sequences are trimmed and again used in a BLAST search against the genome. The results are then used to produce a multiple sequence alignment in ClustalW2. We generate a consensus from the alignment, and then perform a final BLAST search against the genome. We again combine, extract, and assemble with CAP3. Finally, the consensus TE is generated from CAP3.

3.2.2.1 Identify Coding Region

The coding region is generally the most conserved region across TEs within a genome,

as it must be complete to produce a functional protein. We begin with local

sequence alignments using BLAST. Nucleotide-based blastn searches are not as

effective in identifying TEs and are not used; the nucleotide sequence for a given

TE may vary considerably, while the translated amino acid sequence is more likely

to be conserved. Instead, tblastn searches are used to identify the coding region.

BLAST produces a set of hits for each TE query against the genome, and we consider hits with an expectation value (e-value) less than 1E-20 for our approach. Lower e-values correspond to more significant hits. This cutoff, which can be manually adjusted, was determined from our empirical data to limit the hits to the most probable TEs while eliminating most false positives. Due to slight sequence variations, BLAST results are often rich in short, nearly adjacent hits.

We process BLAST results such that all hits are combined if they are within a

specified distance of one another, 50 bp by default, and originate from the same

query sequence. Hits with overlapping coordinates are combined as well. These

combinations increase the quality of our hits and the potential to capture more

complete sequences. In the case where there is a gap between sequences, we also

include the intervening sequence data in our hit. Figure 3.2 shows combination

scenarios. Once all possible combinations are performed, hits are extracted from the genome.

[Figure 3.2 diagram. Each scenario shows hit sequences A and B and the combined sequence C.]

Figure 3.2. Methods of Combination. This figure shows the five general combination scenarios used in our scripts. In each case, hit sequences A and B are consolidated into a single sequence C, which represents a section of nucleotides from the genome. We have shown combinations of overlaps, nested sequences, and sequences separated by a short, prespecified distance.
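As an illustration only, the combination step described above might be sketched in Python as follows. This is not the TESeeker Perl implementation. The Hit record, its field names, and the combine_hits function are hypothetical, strand is ignored, and the distance between two hits is assumed to be the start of the later hit minus the end of the earlier one.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Hit:
    """One BLAST hit. Field names are illustrative, not TESeeker's."""
    query: str     # library sequence that produced the hit
    subject: str   # genomic scaffold or chromosome
    start: int     # 1-based, start <= end
    end: int
    evalue: float


def _group_key(hit):
    return (hit.query, hit.subject)


def combine_hits(hits, max_evalue=1e-20, max_gap=50):
    """Merge hits from the same query that overlap, nest, or lie within max_gap bp."""
    kept = sorted(
        (h for h in hits if h.evalue < max_evalue),
        key=lambda h: (h.query, h.subject, h.start),
    )
    merged = []
    for _, group in groupby(kept, key=_group_key):
        current = None
        for h in group:
            if current is not None and h.start - current.end <= max_gap:
                # Overlapping, nested, or nearby: extend the span, which also
                # takes in any intervening genomic sequence between the hits.
                current = Hit(current.query, current.subject, current.start,
                              max(current.end, h.end),
                              min(current.evalue, h.evalue))
            else:
                if current is not None:
                    merged.append(current)
                current = h
        if current is not None:
            merged.append(current)
    return merged
```

Sorting by start position first means a single pass per query is enough, since any hit that can join the current span must begin before or shortly after the span ends.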

At this point, we have a set of possible coding sequences, both complete and

partial, many of which are copies or partial copies of one another. To consolidate

and improve our results, we assemble the sequences with the CAP3 assembly pro-

gram [64]. CAP3 produces contigs and singleton sequences. Singleton sequences

are sequences that did not assemble with other sequences. CAP3 also generates

accompanying quality scores for the contigs, based upon the underlying sequences

that produced the consensus. We use the quality scores to trim the sequences

such that the highest-quality sequence remains. To do this, we iterate through a contig, keeping track of the cumulative sum of quality scores for a given number of consecutive nucleotides, called the sliding window, which is 20 bp by default. When the average quality score per nucleotide in this sliding window exceeds a threshold, typically 18, we consider the corresponding sequence to be high quality. If the average drops below the threshold, the sequence is ignored. Once we have read the entire sequence, there will likely be gaps where there is little commonality among the underlying sequences. In these cases, we keep the low-quality regions only if they are short and have adjacent high-quality sequences. These results are then reassembled in CAP3, trimmed, and considered the best potential complete coding region. If CAP3 produces only singletons, we perform the aforementioned analysis with them. We then aim to extend the sequence to encompass the entire TE. Pseudocode for the steps described in this section of our approach is shown in Algorithm 1.
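The sliding-window trimming described above could look roughly like the following Python sketch. It rests on several assumptions that the text does not fix: a position counts as high quality if any window covering it has a mean score above the threshold, a "short" low-quality region is at most max_gap positions (10 here, a made-up value), and the longest retained stretch is returned. All function names are hypothetical.

```python
def high_quality_mask(quals, window=20, threshold=18):
    """Mark positions covered by a window whose mean quality exceeds threshold."""
    n = len(quals)
    mask = [False] * n
    if n < window:
        return mask
    running = sum(quals[:window])          # cumulative sum for the first window
    for start in range(n - window + 1):
        if start > 0:                      # slide: add the new score, drop the old
            running += quals[start + window - 1] - quals[start - 1]
        if running / window > threshold:
            for i in range(start, start + window):
                mask[i] = True
    return mask


def bridge_short_gaps(mask, max_gap=10):
    """Keep low-quality runs of at most max_gap that are flanked by kept sequence."""
    out = list(mask)
    i, n = 0, len(mask)
    while i < n:
        if mask[i]:
            i += 1
            continue
        j = i
        while j < n and not mask[j]:
            j += 1
        if i > 0 and j < n and j - i <= max_gap:
            for k in range(i, j):
                out[k] = True
        i = j
    return out


def trim_contig(seq, quals, window=20, threshold=18, max_gap=10):
    """Return the longest stretch of seq that survives masking and bridging."""
    mask = bridge_short_gaps(high_quality_mask(quals, window, threshold), max_gap)
    best_start, best_end, start = 0, 0, None
    for i, keep in enumerate(mask + [False]):   # sentinel closes a final run
        if keep and start is None:
            start = i
        elif not keep and start is not None:
            if i - start > best_end - best_start:
                best_start, best_end = start, i
            start = None
    return seq[best_start:best_end]
```

Because the decision is made per window rather than per base, a few low-scoring bases at the edge of a good region are retained whenever the window containing them still averages above the threshold.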

3.2.2.2 Encompass Complete Transposable Element

Once the putative coding region has been identified, we create a consensus for

the complete TE. We perform a blastn search with each contig and singleton

from the previous (CAP3) step to find the instances of the TE within the genome.

We again process these hits as before and extract them from the genome, but

this time we also extract flanking regions on both sides of the viable hits in an

attempt to capture the entire TE. This extracted set of instances can then be used

to generate a consensus sequence.
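A minimal sketch of extracting a hit together with its flanking regions is given below, assuming 1-based inclusive coordinates and a single in-memory sequence for the scaffold. The function name and return shape are hypothetical; the only point illustrated is that flanks must be clamped at the ends of the scaffold.

```python
def extract_with_flanks(genome_seq, start, end, flank):
    """Return (sequence, lo, hi) for a hit extended by flank bp on both sides.

    Coordinates are 1-based and inclusive. The extension is clamped so that
    it never runs past either end of the scaffold.
    """
    lo = max(1, start - flank)
    hi = min(len(genome_seq), end + flank)
    return genome_seq[lo - 1:hi], lo, hi
```

Returning the clamped coordinates alongside the sequence makes it possible to map positions in a later consensus back to the genome.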

Algorithm 1 P = IdentifyPutativeSequences(Q, S, evalue, distance)

    Let Q be the set of representative TEs
    Let S be the genome
    Let P be the set of putative hits
    Let evalue be the maximum e-value of a potential hit
    Let distance be the maximum distance between potential hits

    // Search genome and sort hits according to location
    for all q ∈ Q do
        Hq ← BLAST(q, S)
        Hq ← sort(Hq, position)
    end for

    // Combine overlapping hits
    for all q ∈ Q do
        for all h ∈ Hq do
            if h ≤ evalue then
                for all i ∈ Hq do
                    if i ≤ evalue then
                        if abs(h.location − i.location) ≤ distance then
                            h ← (h + i)
                        end if
                    end if
                end for
            end if
        end for
    end for

    // Extract putative TEs from genome
    for all q ∈ Q do
        for all h ∈ Hq do
            Pq ← extract(h, S)
        end for
    end for

    // Assemble consensus TEs
    for all p ∈ Pq do
        p ← trim(CAP3(p))
    end for
    return P

3.2.2.3 Generate Consensus

The extracted near full-length sequences from the previous step are inherently

very similar on a nucleotide-by-nucleotide basis. To generate a consensus from

this set of sequences, we perform a multiple sequence alignment with ClustalW2

[89]. A consensus sequence from the multiple sequence alignment is generated as

follows. We record counts for each nucleotide at each position in the alignment

file. If a gap is encountered, counts for each nucleotide are incremented. If the

percentage for any nucleotide at a given position exceeds a given threshold, 49%

by default, that nucleotide is used for that position in the consensus. We now

have a consensus sequence for the TE that is the most likely sequence to occur in

the genome and we need to verify that it is complete.
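The column-wise consensus rule described above can be sketched as follows. This is an illustration under stated assumptions rather than the TESeeker code: the percentage is taken relative to the number of aligned sequences, ties are broken in A, C, G, T order, and a position where no nucleotide exceeds the threshold is written as N. The text does not specify any of these three choices.

```python
def consensus_from_alignment(rows, threshold=0.49, gap="-", unresolved="N"):
    """Build a consensus from equal-length aligned sequences.

    A gap increments the count of every nucleotide, as described in the text.
    The denominator, tie-breaking order, and the unresolved symbol are
    assumptions made for this sketch.
    """
    if not rows:
        return ""
    length = len(rows[0])
    if any(len(r) != length for r in rows):
        raise ValueError("aligned sequences must have equal length")
    consensus = []
    for col in range(length):
        counts = {"A": 0, "C": 0, "G": 0, "T": 0}
        for row in rows:
            ch = row[col].upper()
            if ch == gap:
                for base in counts:
                    counts[base] += 1      # a gap votes for every nucleotide
            elif ch in counts:
                counts[ch] += 1            # other symbols are simply not counted
        best = max("ACGT", key=lambda base: counts[base])
        if counts[best] / len(rows) > threshold:
            consensus.append(best)
        else:
            consensus.append(unresolved)
    return "".join(consensus)
```

One consequence of counting gaps toward every nucleotide is that gaps never vote a column out of the consensus: a nucleotide present in a minority of sequences still wins the column if the remaining sequences are gapped there.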

3.2.2.4 Identify Complete Transposable Element

To validate and improve the consensus sequence, we look for similar copies

of it in the genome with a blastn search. We again process the BLAST hits as

previously described and extract them from the genome, generally adding short

flanking sequences. The resulting extracted sequences are again iteratively exam-

ined with CAP3 and trimmed. CAP3 produces a sequence which represents the best

estimate for a representative putative TE in the novel genome. Manual inspection of the putative TE is advisable, both in terms of validity and classification. Once validated, this TE can then be utilized to calculate the density of its particular family within the genome and to find individual instances.
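The text does not define how density is computed. One plausible reading, sketched below purely as an assumption, is the fraction of genomic bases covered by copies of the family, with overlapping copies counted once. The function name is hypothetical, and all intervals are assumed to lie on a single coordinate system.

```python
def family_density(intervals, genome_length):
    """Fraction of the genome covered by (start, end) copies, 1-based inclusive.

    Overlapping or nested copies are merged first so that no base is
    counted twice. This definition of density is an assumption.
    """
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                covered += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start + 1
    return covered / genome_length
```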

3.2.3 Implementation

Our approach is implemented as TESeeker and was purposely designed to be

modular, while relying upon common bioinformatics tools, namely BLAST, CAP3,

and ClustalW2, as well as BioPerl [133]. TESeeker is released as a VirtualBox

[144] virtual appliance. The local web browser interface to TESeeker offers the

main gateway to the core TESeeker functionality; however, TESeeker can also be

run through the command line. A researcher needs only to provide basic parameters, such as TE family, host genome, closeness to combine, minimum BLAST hit length, flank length, CAP3 window size, CAP3 quality score threshold, and the nucleotide percentage threshold for consensus generation. We offer suggested parameters that were determined through extensive testing of the approach on various TE families. These tests were largely performed on arthropod genomes. Suggested parameters include a combine distance of 50 bp for BLAST hits, a CAP3 window size of 20 bp, a quality score threshold of 18, a 49% nucleotide commonality cutoff for consensus generation, and a 70% minimum length cutoff (with respect to the query) for the final BLAST search. Further details on suggested parameters, as well as a means to perform a sample run, are provided with the virtual appliance. Although TESeeker is not parallelized, researchers can easily run multiple instances of TESeeker while varying parameters and TE families, offering scalability.
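For reference, the suggested values quoted in this chapter can be gathered into a single record, as in the hypothetical Python sketch below. The class and field names are ours and do not correspond to TESeeker options; the flank length is left unset because the text says only that it depends on the type of TE.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SuggestedParameters:
    """Suggested values quoted in the text; names are illustrative only."""
    max_evalue: float = 1e-20             # hits must score below this e-value
    combine_distance_bp: int = 50         # merge hits this close together
    cap3_window_bp: int = 20              # sliding window for quality trimming
    quality_threshold: int = 18           # mean quality required in a window
    consensus_threshold: float = 0.49     # nucleotide fraction for consensus
    min_final_hit_fraction: float = 0.70  # final hit length relative to query
    flank_bp: Optional[int] = None        # no default is stated in the text


def passes_length_cutoff(hit_length, query_length, params=SuggestedParameters()):
    """True if a final BLAST hit is long enough relative to its query."""
    return hit_length >= params.min_final_hit_fraction * query_length
```

With the 1276 bp mariner consensus reported later in this chapter, a final hit would need to span at least 894 bp to pass the 70% cutoff.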

3.2.4 Advantages

Our approach offers many advantages to researchers. First, TESeeker allows

for the fast and accurate detection of TEs. As demonstrated in several genomes,

across multiple TE families, TESeeker effectively identifies TEs. In addition to

TE identification, our approach offers opportunities to reexamine and validate

previous research. Second, TESeeker is very easy to use; we provide TESeeker

as a virtual appliance, completely configured. Researchers must only provide a

few parameters to begin searching. Parameters are easily modified and multiple

iterations of the approach can be run simultaneously. Third, TESeeker is general.

While we primarily evaluated our approach on Class II TEs in arthropod genomes,

the parameters can be adjusted to allow for the effective detection of a variety of

TE families in any genome, including genomes that contain only degraded TEs.

Less stringent parameters will be more effective in detecting such degraded TEs,

but will also increase the number of false positives. As mentioned previously, we

have utilized various stages of this approach to identify non-LTR and LTR TEs

in a number of genome projects. Last, our approach eases the burden on expert

annotators, decreasing genome annotation time.

3.2.5 Limitations

While robust, this approach has several limitations. First, results are highly

dependent on the quality of the sequences in the library and whether the novel

genome contains TEs with homology to those in the library. The library must

contain a thorough representation of TEs for a given family, preferably amino

acid coding regions. The provided library has performed well, but extensive test-

ing has not been performed on LTR elements. Additionally, this approach is not

designed to detect TEs without a coding region, such as SINEs or MITEs. Second,

the approach is most effective for TEs that exist in multiple copies throughout

the genome. While TESeeker has been shown to find TEs that have only a single full-length instance, the quality of its output may suffer, and the extra effort required of the researcher to alter the parameters can be time consuming. Last, results

from TESeeker must be closely examined. An ongoing issue with TEs concerns

their classification. If a search is seeded with mariner sequences, it may produce

consensus TEs that are not true mariners, but are rather mariner-like TEs. For

this study, TEs were classified through examination of their amino acid coding

regions.

3.3 Results

Our approach was developed over the course of several TE detection projects

on several arthropod genomes [5, 83], but was not originally automated. DNASTAR

SeqMan II [33] was used in place of CAP3 and ClustalW2. DNASTAR SeqMan II

produced viable results, but it required extensive interaction from a researcher.

Sequences had to be manually examined and trimmed in the program, a process

which took considerable time and required a trained researcher. This manual ap-

proach produced results that we consider a high-quality annotation of TEs. We

used these results to partially validate TESeeker against the Pediculus humanus

humanus genome, described below. For example, running the approach with default parameters for a mariner Class II element in P. humanus humanus and for a Jockey Class I element in Culex quinquefasciatus produced, in each case, a consensus TE that was more than 98% identical to the manually produced element. Additionally, the elements were correctly trimmed. These consensus sequences were generated with amino acid coding sequences: the transposase in the mariner element and both open reading frames of the reverse transcriptase in the Jockey element. Figures 3.3 and 3.4 show alignments of the automated approach’s consensus versus manually annotated elements. We also evaluated our approach against published results from the Anopheles gambiae PEST genome, as well as a number of other genomes. Various stages of our methodology have been applied to a number of genome projects, which we describe next. In all cases, we utilized our library of representative coding regions. If we were searching an annotated genome with representative coding regions already in our library, we removed them before running TESeeker.

    Automated  5’  AA-TATTGGGTTGGCAAATAAGTA...AATATCTTTTGCCAACCCAATA  3’
                      |||||||||||||||||||||   ||||||||||||||||||||||
    Manual     5’     TATTGGGTTGGCAAATAAGTA...AATATCTTTTGCCAACCCAATA  3’

Figure 3.3. P. humanus humanus mariner element. The alignment of the consensus sequence from our approach and the manually annotated element is shown above. As evident from the figure, our approach correctly identifies the element and trims the ends almost perfectly.

    Automated  5’  TTT...TTTTTTTTTTTAATTTATATTTAT...GAAGGTTCGCAAGACACTG    3’
                         ||||||||||||||||||||||||   |||||||||||||||||||
    Manual     5’        TTTTTTTTTTTAATTTATATTTAT...GAAGGTTCGCAAGACACTGAT  3’

Figure 3.4. C. quinquefasciatus Jockey element. The alignment of the consensus sequence from our approach and the manually annotated element is shown above. Extra thymine elements on the 5’ end are typical of the corresponding poly(A) tail, characteristic of Jockey elements.

3.3.1 Pediculus humanus humanus

The body louse, Pediculus humanus humanus, is the primary vector of typhus

and several other diseases [109]. It has the smallest presently sequenced insect genome at roughly 110 Mb (megabase pairs). TESeeker was able to identify all Class I and II TEs, with the exception of MITEs, reported in Kirkness et al. [83]. A separate tool was developed to detect MITEs and is not described here. Unlike in many other arthropod genomes, only 1% of the P. humanus humanus genome is made up of TEs. Our approach’s ability to discover TEs of varying families, across classes, in a genome with so few TEs demonstrates its utility. The following is a description of our results for each class of elements, which are summarized in Table 3.1. Additional detail is provided in the P. humanus humanus genome paper [83].

TABLE 3.1

Pediculus humanus humanus NON-LTR RESULTS

    Class I    Family       Element    Length (bp)  Full-length Copies  Copies  Density
    non-LTR    SART         Hope-like  4655         1                   522     0.18%
               R4           Dong-like  5266         4                   1739    0.45%
    LTR        ty3/gypsy    Mdg1       5395         2                   976     0.28%

    Class II   Family       Element    Length (bp)  Full-length Copies  Copies  Density
               mariner/Tc1  mariner    1276         24                  216     0.09%
               MITE         MITE1      623          4                   39      0.02%
                            MITE2      169          16                  66      0.007%

    TOTAL                                                                       1.027%

3.3.1.1 Class I Elements

LTR Retrotransposons– Only one element of the LTR retrotransposon family is

well-represented in the P. humanus humanus genome. Phylogenetic anal-

ysis shows that it belongs to the Mdg1 lineage of LTR retrotransposons

(Ty3/gypsy clade). No active copies were found; the canonical copy

has point mutations in the gag-like domain. There are only two full length

copies in the genome, which suggests that these genomic insertions are rel-

atively recent and that selective pressure is very efficient in purging func-

tional copies from the genome. The other copies are present in the form of

solo-LTRs and partial to highly deleted proviral copies, demonstrating that

solo-LTR formation (by recombination between the two LTRs of the same

copy) and deletions are important mechanisms in the inactivation and/or

elimination of TEs from this genome. Another characteristic of this element

is that the target site is always ATAT, and many of the copies are located in

poly-AT regions (possible heterochromatin), where recombination rate may

be lower and, therefore, selection pressure is also lower, permitting frag-

mented copies to evolve like pseudogenes over time until selection finally

eliminates them.

Non-LTR Retrotransposons– Two distinct types of non-LTRs were reconstructed

and identified. The longest element is 5266 bp. It has 52% homology to

the A. gambiae Dong reverse transcriptase (R4), possesses a single reading

frame, and has a TAA target site [86]. Four full-length copies and many

partial-length copies were found in the genome.

The second element that was reconstructed was about 4655 bp long; how-

ever, it was difficult to determine the boundaries of the element outside of its

coding region. This transposable element is not represented in the genome

as a full-length copy. However, several copies exist in the genome with in-

terrupted reading frames. The highest homology was found to the Bombyx

mori (50% homology) and the Papilio xuthus (48% homology) Hope reverse

transcriptase protein [85]. Probable loss of target site specificity and a trend

of insertions across all genomic locations was reported by Kojima and Fuji-

wara [85] for Hope-like elements. However, it was observed that some copies

of Hope-like elements in P. humanus humanus have a targeting sequence

similar to 28S rDNA.

3.3.1.2 Class II Elements

mariner/Tc1 – From the many inspected Class II families, only one mariner/Tc1

element was identified. It is a 1276 bp mariner transposon, with 33 bp

TIRs. We show the consensus for this element in Appendix E.1.3.1. Its

transposase has highest homology to Apis mellifera Ammar1 transposase

and Ceratitis capitata Ccmar2 transposase. Reconstruction of the mariner

revealed that the element has two reading frames and that some copies have

a 24 bp deletion in the coding part, which caused further reading frame

interruptions. No autonomous elements were found, but 24 full-length copies

and many deteriorated ones are still present in the P. humanus humanus

genome.

MITE– Two MITEs were identified in the P. humanus humanus genome. The

first is 623 bp long, present in 4 copies, with a 12 bp TIR. The second is

169 bp long, present in 16 copies, with a 20 bp TIR. Dot plot analysis re-

vealed that the 623 bp element consists of 4-5 repeats within itself and that

the 169 bp element has 2 repeats. No homologies with other P. humanus hu-

manus TEs were identified. The MITEs were identified by a separate script

we developed that aims to find inverted repeats within specified distances

within the genome. This script is available in Appendix F.
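To illustrate the general idea of searching for inverted repeats within a specified distance, a simplified Python sketch follows. It is not the script from Appendix F: it finds only exact terminal inverted repeats of a fixed length, and the function names, the 12 bp default (chosen to match the shorter TIR reported above), and the span limits are assumptions.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_inverted_repeats(seq, tir_len=12, min_len=50, max_len=700):
    """Return (start, end) spans, 0-based half-open, bounded by exact TIRs.

    A span is reported when the tir_len bases at its left end are the
    reverse complement of the tir_len bases at its right end and the total
    span length lies between min_len and max_len.
    """
    seq = seq.upper()
    positions = {}
    for i in range(len(seq) - tir_len + 1):
        positions.setdefault(seq[i:i + tir_len], []).append(i)
    spans = []
    for i in range(len(seq) - tir_len + 1):
        partner = reverse_complement(seq[i:i + tir_len])
        for j in positions.get(partner, ()):
            span = j + tir_len - i
            # Require the right repeat to start after the left repeat ends.
            if j >= i + tir_len and min_len <= span <= max_len:
                spans.append((i, j + tir_len))
    return spans
```

A practical tool would also need to tolerate mismatches in the repeats and to filter low-complexity sequence, neither of which is attempted here.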

3.3.2 Culex quinquefasciatus

The Culex quinquefasciatus mosquito is the primary vector of the West Nile

virus and St. Louis encephalitis. We searched its roughly 580 Mb genome for

non-LTR retrotransposons and identified 11 of the 17 known families of non-

LTRs, together occupying 4.4% of the total genome. Among these, full-length

copies of the CR1, I, Jockey, L1, L2, LOA, Loner, and R1 families were found.

The Loner and Outcast families are unique to mosquitoes. There is evidence of

recent activity in the CR1, Jockey, L1, and L2 elements. Our results, also presented in the C. quinquefasciatus genome paper [5], are shown in Table 3.2. Across all TE families, TEs occupy roughly 29% of the genome, comparable to similar mosquito species.

3.3.3 Anopheles gambiae PEST Genome

Anopheles gambiae serves as the main vector of malaria [63]. The PEST strain

consists of roughly 273 Mb and has been extensively studied. Class II P elements

within the genome have been especially closely examined. Sarkar et al. originally

identified 6 distinct P elements [126]. More recently, Oliveira de Carvalho et al.

identified 4 additional P elements [106], while Quesneville et al. described 9 clades

(subfamilies) that are at least 30% divergent at the nucleotide level [117]. In all,

previous research has described 12 clades of P elements in A. gambiae that are more than 30% divergent at the nucleotide level.

TABLE 3.2

Culex quinquefasciatus RESULTS

    Class I    Family             Number of Elements  Copies  Density
    non-LTR    CR1                31                  973     0.28%
               I                  11                  63      0.02%
               Jockey             16                  5028    1.77%
               L1                 57                  662     0.15%
               L2                 9                   1416    0.61%
               LOA                9                   184     0.09%
               Loner              2                   127     0.12%
               Outcast            4                   15      0.00%
               R1                 32                  250     0.14%
               RTE                8                   892     0.38%
               Unclassified LINE  30                  11,117  0.88%

    TOTAL                                                     4.45%

TESeeker detected 11 out of the 12 P elements within A. gambiae, as well as an

additional 2 partial hits that showed strong similarity to P element transposase,

but that were more than 30% divergent at the nucleotide level. The lone element

that TESeeker missed, AgaP14, is the most divergent from the other elements, which

may explain its absence and which also suggests our library does not fully represent

the P element family. Additionally, TESeeker produced consensus sequences with

TIRs on every P element where they had been previously reported.

Searches for additional Class II TE families were also successful. In particular,

of the 13 piggyBac elements identified in Sarkar et al. [127], we identified 10,

including TIRs where previously described. Again, the elements TESeeker missed

were the most divergent from the other sequences. TESeeker did especially well with

mariner elements. TESeeker identified each of the 5 elements at TEfam, each

with complete TIRs and 4 with the expected TSDs.

3.3.4 Other Organisms

TESeeker was also validated on select elements in a variety of organisms. Of

particular note, we detected a previously unreported putative mariner element

in the well-studied Drosophila melanogaster genome. The 1061 bp element has

TIRs 26 bp in length, with 3 mismatches, but with no apparent TSDs. A single full-length copy, as well as several partial hits, exists within the genome. Its transposase has high homology to those of related insects, such as Chymomyza amoena and Cladodiopsis seyrigi. Searches for this element in existing TE annotations for D. melanogaster produced no hits. Please refer to Section E.3.1.1 of Appendix E for an annotated version of this putative element. Further investigation in collaboration with FlyBase [36] is warranted to validate this result.

Additionally, TESeeker was used to search for mariner elements in the human (Homo sapiens), frog (Xenopus tropicalis), and chicken (Gallus gallus) genomes. Mariner elements are known to exist in all three genomes, and TESeeker found them.

3.4 Conclusion

As the number of sequenced genomes rises, the necessity to identify TEs within

them also grows. TEs are an important evolutionary force present in the majority

of these genomes. While there exist mature, effective, and automated gene identi-

fication systems, the tools available for the identification of TEs are not as robust.

Particularly, current homology-based approaches are typically very interactive,

requiring numerous user decisions and many separate tools.

The approach described herein successfully identifies TEs in novel genomes in

an automated and easy-to-use package, offering researchers the ability to quickly

produce high-quality consensus TEs. TESeeker was developed and refined over

the course of several TE identification projects and works best to detect TEs

with homology to known TEs. We are able to generate high-quality putative

TEs as well as characterize the prevalence of TEs in many genomes. We provide

TESeeker as a web-based tool within a VirtualBox virtual appliance, while also

providing our representative TE library both within the virtual appliance and as a

separate download. While its local web interface automates the underlying logic,

each TESeeker step can be manually started through the command line, offering

additional flexibility. Additionally, we provide documentation and test cases to

evaluate the approach with the virtual appliance. The ability to automatically analyze a genome alleviates the exhaustive, error-prone, and time-consuming task of

manually inspecting and manipulating results. The performance of the approach

varies, but is largely dependent on the length of the TE family (longer sequences

take longer to assemble) and its abundance in the genome.

CHAPTER 4

DESIGN AND PROOF-OF-CONCEPT PLAN FOR COMMUNITY

ANNOTATION OF TRANSPOSABLE ELEMENTS ON VECTORBASE

4.1 Introduction

This chapter presents our design and implementation plan for an online com-

munity annotation platform for TEs, designed to work in conjunction with Vec-

torBase [91, 92]. Although TEs often represent a very high percentage of genomic

data within a genome, repositories of TE data are lacking and remain unstandardized, particularly for vectors of human pathogens. Moreover, existing TE repositories typically lack the user-friendliness afforded to other genomic data. For

example, at NCBI [104], there is an extensive amount of information available

for genes; one can search for them in multiple ways, as well as visualize and eas-

ily browse many details. In comparison, the primary online resources for TEs,

namely TEfam [134], RepBase [119], and WikiPoson [148], offer far less informa-

tion. TEfam allows users to submit TEs and some associated information about

them, but only supports four organisms and offers no structural display options

for TEs. However, TEfam does have the capability for users to submit detailed

TE information, such as the amino acid open reading frames (ORFs), terminal

inverted repeats (TIRs), long terminal repeats (LTRs), and target site duplica-

tions (TSDs). RepBase has a much larger database of TEs for many organisms

but simply offers them in the standard FASTA and EMBL file formats, most often only in nucleotide form. Both sites have a review process for TE submission, yet neither offers any kind of structural display. Additionally, neither offers community feedback concerning the hosted TEs. The third site, WikiPoson, offers

user feedback in the form of a MediaWiki [99]. While not standardized and quite

limited, WikiPoson offers researchers the ability to submit information and offers

classification guidelines.

The ability to store and visualize the structure of a TE, as well as allow for

community feedback is important for several reasons. First, it allows researchers

unfamiliar with certain types of TEs to be quickly exposed to other types. Second,

the opportunity for user feedback is critical. Much of the existing TE research

is not moderated. The ability for feedback from outsiders increases credibility.

Lastly, a moderated repository for TE data would encourage more standardized

research of TEs.

Our design and implementation plan utilizes the existing VectorBase [91, 92]

bioinformatics resource center for vectors of human pathogens. VectorBase stores

genomic data for a variety of insects and offers the capability to browse and display

genomic data, run scientific analysis tools, obtain information about the organ-

isms, and to provide community feedback. VectorBase also provides a means for

community annotation of genes. It currently does this through a user-submitted form for the gene-related data. This data is then reviewed and, if approved, added to an online database. Currently, the data is accessible to all researchers.

Additionally, the data can be browsed on VectorBase through Ensembl’s Genome

Browser [43].

Our plan builds upon VectorBase’s manual annotation of genes by adding the ability to describe TEs. We utilize and adapt their core methods and extend them to work with the unique structure of TEs. We utilize the Chado [26] database schema to store the TEs and build a display based on the PHP GD library for TE structural display, as Ensembl does not currently support the display of TEs.

Our primary goal is to complement the existing features of VectorBase with a

mechanism for the submission and display of TEs. This proposed system would fill

many gaps in the aforementioned existing systems and improve upon the quality

and spread of information across the scientific community.

4.2 Transposable Elements and the VectorBase Community Annotation Pipeline

Adding the capability to handle TEs through VectorBase’s community annotation pipeline (CAP) requires several changes. The following sections describe how TEs would fit into the VectorBase CAP.

4.2.1 Similarities to the VectorBase Community Annotation Pipeline

Extending CAP to support TEs requires a number of steps. Fortunately,

many of the technologies developed for CAP can be utilized for TEs. From the

user standpoint, other than a TE-specific submission spreadsheet, the interface

and submission process would be nearly identical to that of CAP, making sub-

mission easier for the user. Submissions would be expert-regulated, and the TE

information would be available in the Ensembl browser via DAS. From the user

standpoint, the following steps (summarized in Figure 4.1) would be followed:

1. Download SubmissionForm.xls

2. Enter TE data

3. Upload SubmissionForm.xls

56

4. Wait for approval from curator

5. If approved by the curator, data goes live; otherwise, it must be modified by the user

The TE-specific submission spreadsheet for Class II TEs is summarized below

(some fields, such as ORF, can have multiple instances). A portion of the spread-

sheet is shown in Figure 4.2. Researchers are able to enter as many TE instances

as they would like for a given submission.

Transposon Symbol: Unique name for the TE.

Family Name: Family to which the TE belongs.

Organism: Organism in which the TE was found.

Transposon Description: Description of the TE, as well as any unique properties.

DNA Transposon: Complete sequence data for the TE.

Target Site Duplication: Target site duplication for the TE.

5’ Start: Genomic location where the transposon starts.

3’ End: Genomic location where the transposon ends.

Strand: Strand on which the TE was found.

5’ TIR: TIR on the 5’ end.

5’ TIR Start: Genomic location where the 5’ TIR starts.

3’ TIR: TIR on the 3’ end.

Figure 4.1. Client-side TE Submission Process. Similar to the VectorBase CAP, researchers download, fill out, and submit a TE-specific submission form. Approved data goes online and is eventually incorporated into the Ensembl browser.

Figure 4.2. TE Submission Form. The figure above shows a snippet of the spreadsheet to submit TEs. Researchers can submit as many TEs at a time as they desire.

3’ TIR Start: Genomic location where the 3’ TIR starts.

ORF: ORF sequence data.

ORF Start: Genomic location where the ORF starts.

There are advantages to using an offline form. First, researchers can keep their

information stored locally, making it easy to edit or add to their data. Second,

our form allows researchers to enter as much data as they have for as many TEs as

they have, eliminating the burden of manually going through an online submission

process multiple times.

Once the form has been submitted, phpExcelReader [108] is used to parse the

data. phpExcelReader is an open-source PHP library that works with Excel files.

The library is used to read the file’s contents and display what it has parsed out. If

the researcher clicks approve, the data is inserted into the Chado database with a

status of “under review.” This is the same basic technique used by the VectorBase

CAP.
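
The parse-then-confirm step described above can be sketched as follows. This is only an illustration in Python, not the PHP code of the proof-of-concept. It assumes the spreadsheet has been exported to CSV; the column subset, the parse_submission helper, and the sample rows are hypothetical.

```python
import csv
import io

# Hypothetical subset of the Class II TE submission fields listed above.
REQUIRED_FIELDS = ["Transposon Symbol", "Family Name", "Organism",
                   "5' Start", "3' End", "Strand"]


def parse_submission(text):
    """Parse CSV text into TE records, each tagged with a review status.

    Rows missing a required field are returned separately so that they can
    be shown to the researcher instead of being inserted.
    """
    accepted, rejected = [], []
    for row in csv.DictReader(io.StringIO(text)):
        missing = [f for f in REQUIRED_FIELDS if not (row.get(f) or "").strip()]
        if missing:
            rejected.append((row, missing))
            continue
        record = dict(row)
        record["5' Start"] = int(record["5' Start"])
        record["3' End"] = int(record["3' End"])
        record["status"] = "under review"  # initial status named in the text
        accepted.append(record)
    return accepted, rejected


# Made-up rows; the second one lacks a start coordinate.
sample = (
    "Transposon Symbol,Family Name,Organism,5' Start,3' End,Strand\n"
    "exampleTE-1,exampleFamily,Example organism,1200,2500,+\n"
    "exampleTE-2,exampleFamily,Example organism,,3100,-\n"
)
accepted, rejected = parse_submission(sample)
```

Returning the rejected rows together with the names of their missing fields mirrors the idea of showing researchers what was parsed before anything is stored.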

4.2.2 Differences from the VectorBase Community Annotation Pipeline

Extending the VectorBase CAP to support TEs also requires several changes.

While the submission process is largely the same from the client-side, thereby

facilitating submission in a user-friendly format, the aforementioned changes

(from the spreadsheet) are made when storing TEs in Chado. Additionally, the

spreadsheet allows for the submission of TE instances, the coordinates of which

are also provided. As a result, and unlike the alignment of genes, exonerate will

not be required to align the TE to the genome. The ability to submit consensus

TEs will also be supported. In this case, an alignment algorithm, such as BLAST,

will be used to generate instances of the consensus TE within the genome. This

data will be generated “on-the-fly” and could be displayed via a different DAS

track than the normally submitted and curated data.
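
One way to turn alignment output into TE instances is sketched below. It assumes the consensus has already been searched against the genome with blastn using tabular output (-outfmt 6), whose twelve default columns are qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, and bitscore. The identity and coverage thresholds, the hits_to_instances helper, and the sample lines are illustrative assumptions, not values from this work.

```python
def hits_to_instances(tabular_text, consensus_length,
                      min_identity=90.0, min_coverage=0.8):
    """Convert blastn tabular hits into genomic TE instance coordinates."""
    instances = []
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        pident = float(cols[2])
        qstart, qend = int(cols[6]), int(cols[7])
        sstart, send = int(cols[8]), int(cols[9])
        # Fraction of the consensus covered by this hit.
        coverage = (qend - qstart + 1) / consensus_length
        if pident < min_identity or coverage < min_coverage:
            continue
        # BLAST reports minus-strand hits with sstart greater than send.
        strand = "+" if sstart <= send else "-"
        instances.append({
            "sequence": cols[1],
            "start": min(sstart, send),
            "end": max(sstart, send),
            "strand": strand,
            "identity": pident,
        })
    return instances


# Made-up hits for a hypothetical 1000 bp consensus.
sample_hits = "\n".join([
    "consensusTE\tscaffold_1\t98.50\t1000\t15\t0\t1\t1000\t5001\t6000\t0.0\t1700",
    "consensusTE\tscaffold_2\t95.00\t900\t45\t0\t51\t950\t8900\t8001\t0.0\t1400",
    "consensusTE\tscaffold_3\t99.00\t200\t2\t0\t1\t200\t100\t299\t1e-90\t350",
])
instances = hits_to_instances(sample_hits, consensus_length=1000)
```

The third hit is dropped because it covers only a fifth of the consensus; how short fragments should be treated is a curation decision that this sketch does not settle.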

4.2.3 Transposable Element Representation in Chado

We utilized Chado’s central module, the Chado sequence module, to store

information about TEs. The fundamental table within this module is the feature

table, which is used for describing biological sequence features. Chado defines

a feature to be a region of a biological polymer, which typically means a DNA,

RNA, or polypeptide molecule. A region can be the entire extent of the molecule

or a junction between two bases. Features can be classified according to ontology,

localized relative to other features, and form part, whole, and other relationships

with other features [54]. We store the model (in our case the consensus) of the

TE and all its functional parts in the feature table. This is accomplished by

identifying the relation of each functional part to the “main” consensus via the

feature_relationship table, as well as the location of the smaller parts within the

[Figure 4.3 diagram. Relations shown with their attributes:

FEATURE (feature_id, dbxref_id, organism_id, name, uniquename, residues, seqlen, md5checksum, type_id, is_analysis, timeaccessioned, timelastmodified)

FEATURELOC (featureloc_id, feature_id, srcfeature_id, fmin, is_fmin_partial, fmax, is_fmax_partial, strand, phase, residue_info, locgroup, rank)

FEATURE_SYNONYM (feature_synonym_id, synonym_id, feature_id, pub_id, is_current, is_internal)

FEATURE_RELATIONSHIP (feature_relationship_id, subject_id, object_id, type_id, rank)

Entities in the diagram: FEATURE, FEATURELOC, FEATURE_SYNONYM, FEATURE_RELATIONSHIP, ORGANISM, DBXREF, CVTERM, FEATURE_CVTERM.

Legend: attribute markers distinguish primary keys, foreign keys which will be linked to a PK in the part of the structure which we are creating, and foreign keys NOT linked to a PK in the part of the structure which we are creating. Boxes with solid borders represent relations which we will build; boxes with dashed borders represent relations which we will reference in the existing VectorBase database.]

Figure 4.3. Entity-Relationship Diagram of Selected Chado Tables. The figure above shows selected changes to the Chado schema to account for TEs.

main part via the featureloc table. We also make a connection to existing data in

Chado. For example, in the feature table, we need to specify the organism where

the particular TE was found, so we utilize foreign keys connecting the TE, in this

case, to the organism, cvterm, and dbxref tables. The featureprop table is utilized

to set the status of the data (for example, “under review”). Figure 4.3 shows a

graphical representation of some of the tables we utilized.
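
The storage layout described above can be illustrated with a small sketch. It uses an in-memory SQLite database and a heavily simplified subset of the feature, featureloc, feature_relationship, and featureprop tables; Chado itself targets PostgreSQL, and its type_id columns reference cvterm rows rather than the plain text labels used here. The part_of label, the part names, and the coordinates are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    uniquename TEXT NOT NULL,
    type TEXT NOT NULL,          -- simplification of type_id -> cvterm
    seqlen INTEGER
);
CREATE TABLE featureloc (
    featureloc_id INTEGER PRIMARY KEY,
    feature_id INTEGER NOT NULL REFERENCES feature(feature_id),
    srcfeature_id INTEGER NOT NULL REFERENCES feature(feature_id),
    fmin INTEGER NOT NULL,
    fmax INTEGER NOT NULL,
    strand INTEGER
);
CREATE TABLE feature_relationship (
    feature_relationship_id INTEGER PRIMARY KEY,
    subject_id INTEGER NOT NULL REFERENCES feature(feature_id),
    object_id INTEGER NOT NULL REFERENCES feature(feature_id),
    type TEXT NOT NULL
);
CREATE TABLE featureprop (
    featureprop_id INTEGER PRIMARY KEY,
    feature_id INTEGER NOT NULL REFERENCES feature(feature_id),
    type TEXT NOT NULL,
    value TEXT
);
""")


def add_feature(uniquename, ftype, seqlen):
    cur = conn.execute(
        "INSERT INTO feature (uniquename, type, seqlen) VALUES (?, ?, ?)",
        (uniquename, ftype, seqlen))
    return cur.lastrowid


def add_part(te_id, uniquename, ftype, fmin, fmax):
    """Store a functional part, its location on the TE, and its relation to it."""
    part_id = add_feature(uniquename, ftype, fmax - fmin)
    conn.execute(
        "INSERT INTO featureloc (feature_id, srcfeature_id, fmin, fmax, strand) "
        "VALUES (?, ?, ?, ?, 1)", (part_id, te_id, fmin, fmax))
    conn.execute(
        "INSERT INTO feature_relationship (subject_id, object_id, type) "
        "VALUES (?, ?, 'part_of')", (part_id, te_id))
    return part_id


# Hypothetical consensus TE with two TIRs and one ORF.
te_id = add_feature("exampleTE-1", "transposable_element", 1300)
add_part(te_id, "exampleTE-1-5TIR", "terminal_inverted_repeat", 0, 30)
add_part(te_id, "exampleTE-1-ORF", "ORF", 150, 1150)
add_part(te_id, "exampleTE-1-3TIR", "terminal_inverted_repeat", 1270, 1300)
conn.execute(
    "INSERT INTO featureprop (feature_id, type, value) "
    "VALUES (?, 'status', 'under review')", (te_id,))

# Retrieve the parts of the TE in order along the consensus.
parts = conn.execute("""
    SELECT f.uniquename, l.fmin, l.fmax
    FROM feature_relationship r
    JOIN feature f ON f.feature_id = r.subject_id
    JOIN featureloc l ON l.feature_id = f.feature_id
    WHERE r.object_id = ?
    ORDER BY l.fmin
""", (te_id,)).fetchall()
```

The final query is the kind of lookup a structural display would need: every part related to the consensus, with its position on that consensus.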

Figure 4.4. TE Start and Submit Page. Here, users can select the TE they wish to view information about or submit a new TE for review.

4.2.4 Proof-of-Concept

A sample interface independent of VectorBase has been developed. This in-

terface allows for the submission of TE instances into a clone of the VectorBase

Chado database. The basic interface allows researchers to view or submit TEs,

shown in Figure 4.4. Basic information about the TE can be displayed, as in

Figure 4.5, including a structural display. The structural display uses the PHP GD

Graphics Library [48] to dynamically create a visualization of the TE, a sample

of which is shown in Figure 4.6. This data would eventually be made available

via a link in the Ensembl browser. Figure 4.7 depicts the configuration.
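
The core of a to-scale structural display is the mapping from base coordinates to drawing coordinates. The sketch below shows that mapping with a one-line text rendering standing in for the image that the proof-of-concept draws with GD. The render_structure helper, the symbols, and the coordinates are hypothetical.

```python
def render_structure(length, parts, width=60):
    """Draw a TE as one line of text, scaled so `length` bases span `width` columns."""
    track = ["-"] * width
    for symbol, start, end in parts:
        first = start * width // length
        # Give every part at least one column so short TIRs stay visible.
        last = max(first + 1, end * width // length)
        for col in range(first, min(last, width)):
            track[col] = symbol
    return "".join(track)


# Hypothetical 1300 bp TE: 5' TIR, ORF, and 3' TIR as (symbol, start, end).
parts = [(">", 0, 30), ("O", 150, 1150), ("<", 1270, 1300)]
picture = render_structure(1300, parts, width=65)
print(picture)
```

A graphical version would apply the same scaling to pixel positions and draw filled rectangles instead of characters.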

Figure 4.5. TE Details Page. This figure shows the information pertinent to each TE. It also has an option to display the TE structure, shown in Figure 4.6.

Figure 4.6. TE Structure Page. Here, we show the display of the TE, to scale. The large center region is the open reading frame. On either end, one can see the terminal inverted repeats (TIRs).

Figure 4.7. Proof-of-Concept Configuration. Here, we show the configuration of the data flow and display, independent of VectorBase.

4.3 Design and Implementation Plan

Work performed independently of VectorBase on a clone of the Chado database

utilizes the general design that has been described in the previous sections.

To implement the ability for the VectorBase CAP to accept TEs, the

following steps should be initiated:

1. Add TE-specific Submission Interface to VectorBase. This can be done through modifications (or edited duplications) of existing files. Namely, files such as UserTools.php, ManualModel.php, and SubmitAnnotation.php could be used as templates to create files to allow for TE submission.

2. Import TEs into Chado. This is largely done through the CAP org.vectorbase.www.cap.importer package. The .java files in this package handle the parsing and insertion of the contents of the spreadsheet into Chado.

3. Export TEs to Ensembl Browser. The CAP org.vectorbase.www.cap.exporter package allows for the display of CAP data in the Ensembl browser. A link to the structural view of the TE would need to be provided through the Ensembl code.

The logic of much of the CAP code would remain unchanged, such as the usage

of Hibernate. However, portions of the existing code that utilize exonerate

would not be necessary, and the mechanisms by which TEs are inserted into

Chado must follow the schema previously described. The VectorBase CAP has

many underlying caveats: its implementation is relatively straightforward, yet it is

initially difficult to follow because of its many interdependencies. Allowing for the

acceptance of TEs into the VectorBase CAP first requires the CAP to be restored

to working order and then edited and extended for TEs. The capability for users

to submit consensus TE sequences is something that should be performed once

the CAP allows for the acceptance and display of TE instances. Once this is

implemented, consensus sequences could be used in blastn searches against the

genome and the results dynamically displayed via a DAS track in the Ensembl

browser. Additionally, TEs would be expert-regulated on an organism-by-organism

basis, much like the CAP.

4.4 Conclusion

This chapter has described common annotation strategies as well as the tech-

nologies used in the VectorBase CAP. We have described the VectorBase CAP

in detail, and offered solutions to extending it to allow for TEs. As a proof-of-

concept, we have cloned the VectorBase Chado database and successfully accepted

user submissions of TEs from the web, while also parsing and inserting them into

Chado. The Chado database has also been used to generate a structural display

of the TE.

Our approach extends the VectorBase CAP to allow for TEs while utilizing

the technologies currently in place. Such an annotation system for TEs has not

been implemented to date, as current systems serve mainly as TE repositories,

offering no structural display or community feedback. The community annotation

of TEs complements the VectorBase CAP for genes while also strengthening the

utility of VectorBase.

CHAPTER 5

SIMULATION AND MODELING BACKGROUND¹

5.1 Introduction

Simulations of real-world phenomena have the potential to be extremely valu-

able to researchers, particularly in the public health realm. Rather than relying

on complex equations that are the basis for many scientific models, agent-based

models (ABMs) rely on more natural behavioral rules [60]. This leads to a more

direct translation from natural phenomena to a simulation model. It is logical to

integrate spatial data into some simulation environments; however, as Gilbert [50]

pointed out, utilizing geographical information system (GIS) data for dynamic

agents is a difficult challenge that has not yet been adequately solved. Although

GIS data has successfully been integrated into ABMs for several years, the ability

to run complex simulations with thousands of GIS aware agents is computation-

ally challenging. This chapter explores simulations, with an emphasis on ABM

and its utilization of GIS data.

5.2 Simulation and Modeling

A simulation is an imitation of a real-world process [10]. This imitation is

usually done with a computer through the use of a conceptual model. A concep-

¹ Portions of this chapter were previously reported in Kennedy [75].

tual model generally refers to the computer representation of the system that a

researcher has chosen to model. The common goal of simulations is to accurately

represent the behavior of a real-world system while providing feedback and in-

sight in a manner that would otherwise be infeasible. For example, an experiment

that would take months to complete in the laboratory may take only hours or

days to complete with a computer simulation. Also, simulations are especially

useful for experiments that would be unethical in real life, such as infecting a population

with a pathogen. It is appropriate to think of simulations as part of the scientific

method: we use them to help us check our assumptions or hypotheses as well as to

possibly predict future behavior. Sharing the same goal as the scientific method,

we utilize simulations to help us acquire new knowledge.

The literature presents a multitude of reasons why computer simulations are

valuable [10, 103, 128], a collection of which are listed below:

1. Simulations allow for the timely study of phenomena that would otherwise be impractical. For example, the evolution of a species over a long period of time can be simulated in far less time than the actual experiments would take to perform.

2. Simulations can model theoretical behavior that cannot be replicated in the laboratory. An example of this would be a simulation model that tracked the historical migration patterns of icebergs or continental drift.

3. Simulation inputs can be modified to determine the outcome or effect on a real-world system without harming the real-world system. This would be applicable if a researcher wanted to simulate the spread of a pathogen across a population without harming the population.

4. Experimentation with simulations can confirm understanding. For instance, a simulation model that mimics the population dynamics of a group of animals could allow researchers to examine particular entities of the model and follow them over time, thus furthering the understanding of the system.

5. Simulations can be used as prototypes for new experiments before real-world implementation. For example, a disaster recovery team could simulate any sort of disaster as well as their response tactics, allowing them to choose the best approach.

6. Modern systems are sometimes so complex that their internal workings can only be studied through simulations. Banks et al. [10] refer to a complex factory system in which the internal interactions are so complex that simulations offer the only solution.

As evidenced above, simulations are a powerful tool to researchers; however,

there are cases where a simulation would not be appropriate. Banks and Gibson

[8] list ten rules for when simulations should not be used. Those most relevant

to our purposes are summarized below:

1. Simulations should not be used when common sense can solve the problem or when the problem can be solved analytically in reasonable time.

2. Simulations should not be used when the cost of developing the simulation model exceeds the cost of experimentation.

3. Simulations are not useful when system behavior is too complex or unknown.

5.2.1 Advantages and Disadvantages

There are many advantages to using simulations for scientific study [10]. Aside

from the fact that simulations allow one to model the behavior of a real-world

system without harming or altering the real-world system, simulations typically

run and produce results faster than the real-world system being studied, if such

a system exists. Additionally, simulations are useful in testing the influence of

different variables both on the system as a whole and in regard to one another.

Furthermore, a simulation is helpful when performing hypothetical tests or when

testing situations that would be unethical or impractical in the real world.

Simulation studies also have some inherent disadvantages. Banks et al. [10]

list four specific disadvantages. Namely, simulation models are difficult both to 1)

build and to 2) interpret. While true to an extent, experienced programmers will

find model-building manageable. In addition, 3) interpreting and analyzing the

results of a simulation may take some time, but, in many cases, this amount of

time will be less than if the scientist had done the actual real-world experiment.

Lastly, 4) simulations are sometimes incorrectly used when analytical solutions are

more practical. Although valid disadvantages to using a simulation exist, building

a simulation can be extremely useful to scientists as long as the simulation fulfills

the requirements previously mentioned.

5.2.2 Building a Simulation Model

We have already described simulations as being built upon a model. In most

cases, scientists start with a conceptual model, or a model with which they intend

to accurately represent the system they are studying. This conceptual model

typically goes through many phases and revisions as the simulation is being built.

Often, scientists will recognize a problem with their conceptual model or discover a

way to improve it and then implement the change. Once the scientist has sufficient

confidence in the conceptual model, it will transition into simply being called the

model, which will be used as the representation of the system the scientists are

studying. This representation is for the study of the system through simulation.

Building a model that exactly matches a real-world phenomenon

is extremely difficult, if not impossible.

Inherent randomness often appears in simulation models. While many factors

cause this randomness, the main cause is that real-world systems are far too com-

plex to accurately and comprehensively represent through a computer simulation

model. First, randomness is included in simulation models to cover our limited

understanding or uncertainty. In many of the systems we model, we have little idea

about the underlying mechanics. We build simulation models to try to help us to

understand these characteristics and to experiment with them. If done properly,

we will learn about real-world systems through our simulation models. Second,

randomness is included for decision-making. If we have a simulation that models

ants foraging for food, we have to give the ants the ability to make decisions.

If the simulation continuously prompted the entities to perform the same action

or encounter the same obstacles each time, nothing would be learned after the

first run; randomness is included for this purpose. In the ants example above,

the introduction of a random walk would add realism to the model. Lastly, mea-

surement error or quantum effects are accounted for by randomness. Simulations

cannot have the precision of real-world systems because of both the limitations

of computers and of our own knowledge. They also cannot represent entities as

accurately as a real-world system, so we include an inherent randomness. These

examples are not meant to be looked at as limitations of simulation models, but as

reasons why simulation models are created the way they are. In fact, this random-

ness is part of what makes simulations unique and powerful, while representing

how the world actually operates too.
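
The random walk mentioned for the foraging ants can be sketched in a few lines. This is a generic illustration rather than code from our model; the random_walk helper, the unit step size, and the seeds are assumptions. Seeding a private generator makes a stochastic run repeatable, which is useful when comparing runs.

```python
import random


def random_walk(steps, seed):
    """Return the path of one ant taking unit steps on a grid."""
    rng = random.Random(seed)  # private generator keeps each run reproducible
    x, y = 0, 0
    path = [(x, y)]
    for _ in range(steps):
        dx, dy = rng.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
        x, y = x + dx, y + dy
        path.append((x, y))
    return path


path_a = random_walk(100, seed=1)
path_b = random_walk(100, seed=1)  # same seed, same path
path_c = random_walk(100, seed=2)  # different seed, different path
```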

5.2.3 Simulation Model Types

The literature [10, 90] has divided simulation models into the following three

overlapping subcategories:

Static vs. Dynamic: Static simulation models are representative of a system at a specific time. An example of a static system is one that solves complex analytical problems that are infeasible with other methods. Dynamic simulation models are representative of a system over time, such as population dynamics.

Deterministic vs. Stochastic: Deterministic simulation models produce results determined by the provided inputs. In such simulation models, probability does not play a role. An example would be a simulation that models a student going to class at a specified time every day. Stochastic simulation models involve random variables and produce different results with each random seed. Our model with the student would be stochastic if we add a certain probability as to when and whether the student will arrive to class.

Continuous vs. Discrete: Continuous simulation models characterize systems constantly over time. An example would be the population dynamics in a predator-prey simulation model. Discrete simulation models characterize systems at specific points in time. An example would be people paying tolls at a toll booth.
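
The student example used for the deterministic and stochastic categories can be written out as a short sketch. The function names, the attendance probability, and the maximum delay are illustrative assumptions.

```python
import random


def deterministic_arrivals(days, class_time=9.0):
    """The student arrives at exactly the same time every day."""
    return [class_time for _ in range(days)]


def stochastic_arrivals(days, seed, class_time=9.0, p_attend=0.9, max_delay=0.25):
    """The student attends with some probability and arrives up to max_delay hours late."""
    rng = random.Random(seed)
    arrivals = []
    for _ in range(days):
        if rng.random() < p_attend:
            arrivals.append(class_time + rng.uniform(0.0, max_delay))
        else:
            arrivals.append(None)  # the student skipped class that day
    return arrivals


fixed = deterministic_arrivals(5)
run_one = stochastic_arrivals(5, seed=42)
run_two = stochastic_arrivals(5, seed=42)
```

The deterministic version depends only on its inputs, while the stochastic version depends on its inputs and on the random seed.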

For the purposes of this study, we further classify simulation models into the

following subcategories:

Agent-based vs. Equation-based: Agent-based simulation models have individual entities, called agents, that drive the simulation. They are good at modeling systems with emergent properties. Equation-based simulation models are adept at modeling mathematically based phenomena. We next elaborate on these two subcategories.

5.2.4 Agent-based Modeling

Agent-based simulations, also known as individual-based simulations, have

recently gained popularity [121] and are proving to be very powerful. In an agent-

based simulation, an agent can be thought of as any acting component in the

system. Each agent is treated as an entity, having its own properties and behav-

iors. These can be influenced by a variety of factors, including the environment

and other agents. The interactions between agents and their environment over

time often lead to emergent properties within the system. Time is typically rep-

resented in the form of time steps; namely, each agent usually has a chance to

change its properties and interact with other agents and the environment once

every time step. A time step can represent any amount of time. An advantage

of agent-based simulations is that they are easily extensible; adding agents to the

model is a well-defined process. Additionally, agent-based simulations are rather

intuitive to code, as they are modeled in the same manner that we tend to think

about systems. Agent-based models have been applied to many areas, including

social network models and models of pathogen spread. Our model, named LiNK,

is an agent-based model.
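
The time-step structure described above can be shown with a minimal, generic agent-based sketch. It is not the LiNK implementation; the grid size, the movement rule, the transmission probability, and all names are assumptions chosen for brevity.

```python
import random


class Agent:
    """An entity with its own properties (position, infection status)."""

    def __init__(self, x, y, infected=False):
        self.x, self.y, self.infected = x, y, infected

    def step(self, rng, size):
        # Behavior rule: move to a random neighboring cell, staying on the grid.
        self.x = min(size - 1, max(0, self.x + rng.choice([-1, 0, 1])))
        self.y = min(size - 1, max(0, self.y + rng.choice([-1, 0, 1])))


def run(num_agents=50, size=10, steps=20, p_transmit=0.5, seed=0):
    """Run the model and return the number of infected agents after each time step."""
    rng = random.Random(seed)
    agents = [Agent(rng.randrange(size), rng.randrange(size))
              for _ in range(num_agents)]
    agents[0].infected = True
    history = []
    for _ in range(steps):
        # Each agent acts once per time step.
        for agent in agents:
            agent.step(rng, size)
        # Interaction rule: infected agents may infect others in the same cell.
        occupied = {(a.x, a.y) for a in agents if a.infected}
        for agent in agents:
            if (not agent.infected and (agent.x, agent.y) in occupied
                    and rng.random() < p_transmit):
                agent.infected = True
        history.append(sum(a.infected for a in agents))
    return history


history = run()
```

Extending such a model usually means adding a property or a rule to the agent class, which is the extensibility advantage noted above.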

5.2.5 Equation-based Modeling

Equation-based simulations are more mature than agent-based models and

are adept at modeling systems governed by underlying mathematical properties

or formulas. This is somewhat of a limitation, as more complex systems that

cannot be approximated by equations are difficult to model. Also, changing overall

properties of an equation-based simulation is often difficult, as it may require a

new mathematical model. However, modifying parameters in an equation-based

simulation is relatively simple. In this respect, equation-based simulations are

rather simple and straightforward. In general, equation-based simulations are

very good at modeling known systems with aggregate behaviors or systems simply

governed by mathematical rules.
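
As an illustration of the equation-based style, the sketch below integrates the classic Lotka-Volterra predator-prey equations with the forward Euler method. The choice of this system, the parameter values, and the step size are assumptions made for the example; they are not taken from this work.

```python
def simulate_predator_prey(prey, predators, alpha, beta, delta, gamma,
                           dt=0.001, steps=10000):
    """Integrate dx/dt = alpha*x - beta*x*y and dy/dt = delta*x*y - gamma*y.

    x is the prey population and y the predator population; forward Euler
    is used for simplicity, so dt must be kept small.
    """
    trajectory = [(prey, predators)]
    for _ in range(steps):
        d_prey = (alpha * prey - beta * prey * predators) * dt
        d_pred = (delta * prey * predators - gamma * predators) * dt
        prey, predators = prey + d_prey, predators + d_pred
        trajectory.append((prey, predators))
    return trajectory


baseline = simulate_predator_prey(10.0, 5.0, alpha=1.0, beta=0.1,
                                  delta=0.075, gamma=1.5)
# Changing a parameter only requires a different argument.
faster_prey = simulate_predator_prey(10.0, 5.0, alpha=1.2, beta=0.1,
                                     delta=0.075, gamma=1.5)
```

Changing a parameter is a one-argument change, whereas adding individual-level behavior (for example, animals that move through a landscape) would require a different set of equations.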

5.3 Geographic Information Systems

A GIS is a system that is used to manipulate and store spatial data. For example,

a GIS could consist of a map of the counties of Michigan and the corresponding

population data. Coupled with the proper software, users could query the data

for counties with a population greater than 10,000 or for counties with an area

larger than 500 square miles. Applications of GIS technology span many fields,

including environmental impact assessment, scientific investigations, urban plan-

ning, and resource management [52]. ArcGIS [4], GRASS GIS [56], and Quantum

GIS [114] are several popular GIS software tools. GIS data is usually stored in

either raster or vector format; next, we elaborate on each format. Figure 5.1

visually compares raster and vector data on a portion of Bali, Indonesia.
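
The attribute queries mentioned above amount to filtering records by their associated data. The sketch below uses placeholder records; the county names and values are invented for illustration and are not real Michigan data.

```python
# Placeholder attribute table; names and numbers are made up.
counties = [
    {"name": "County A", "population": 8500, "area_sq_mi": 620.0},
    {"name": "County B", "population": 152000, "area_sq_mi": 480.0},
    {"name": "County C", "population": 23000, "area_sq_mi": 910.0},
]


def query(records, predicate):
    """Return the names of all records that satisfy the predicate."""
    return [r["name"] for r in records if predicate(r)]


populous = query(counties, lambda c: c["population"] > 10_000)
large = query(counties, lambda c: c["area_sq_mi"] > 500)
```

In a full GIS, each record would also carry its geometry, so the same query could be displayed on the map.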

5.3.1 Raster Data

Raster data is characterized as a collection of pixels, or cells. Many cells make

up a single raster file. These cells are stored in a matrix-like manner, namely in

rows and columns. Each cell has its own attributes and associated data. Raster

files are generally less computationally expensive than vector files. However, they

require more storage space.

5.3.2 Vector Data

Vector data is coordinate-based and usually represents data as points, lines,

or polygons. Vector files can more realistically represent spatial data in smaller

storage space than raster files. GIS data is collected as coordinates, so there is

typically much more precision when compared to raster files. Querying complex

polygon-based vector files can be expensive, so the data is often approximated.
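
The difference in query cost between the two formats can be seen in a small sketch. A raster lookup is an index computation, while a vector query against a polygon must examine the polygon's edges; the ray-casting test below is one standard way to do so. The grid, the polygon, and the function names are hypothetical.

```python
def raster_value(grid, x, y, origin_x, origin_y, cell_size):
    """Constant-time lookup: convert a coordinate to a row and column index."""
    col = int((x - origin_x) // cell_size)
    row = int((y - origin_y) // cell_size)
    return grid[row][col]


def point_in_polygon(x, y, vertices):
    """Ray-casting test; the cost grows with the number of polygon vertices."""
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        # Does this edge cross the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical 3 x 3 land-cover raster; row 0 is the southernmost row.
grid = [["water", "forest", "forest"],
        ["water", "forest", "urban"],
        ["water", "water", "urban"]]
# Hypothetical forest boundary stored as a vector polygon.
forest_polygon = [(1.0, 0.0), (3.0, 0.0), (2.0, 2.0), (1.0, 2.0)]

cell = raster_value(grid, 1.5, 0.5, origin_x=0.0, origin_y=0.0, cell_size=1.0)
in_forest = point_in_polygon(1.5, 0.5, forest_polygon)
```

For polygons with many vertices, the per-query cost of the vector test is what motivates approximating the data, as noted above.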

5.4 Integrating Geographic Information System Data into Agent-based Modeling

There have been several studies in which ABM has been combined with GIS

data [25, 29, 51, 66, 146]. Few of these models have focused on infectious dis-

eases, while even fewer have agents that intelligently move based on their current

environment. Castle et al. [25] mention numerous toolkits and applications for

coupling ABM and GIS yet fail to go beyond the incorporation of GIS data into a

model and into the realm of its effective use. Crooks [29] more deeply describes the

realm of space within ABM and offers example applications but does not specif-

ically address the underlying issue of how agents can most efficiently access GIS

data. Anwar et al. [3] describe a model built upon GIS data, but one that does

not directly query it. Some models imply space, such as NOSOSIM [135], but

few dynamically interact with GIS data. Gimblett [52], Keeling et al. [74], and

Brown et al. [18] describe aspects of the integration of ABMs and GIS data, but

do not go into detail regarding approaches to efficiently create GIS aware agents.

Moreover, standard means of linking agents with GIS data are computationally

expensive and are therefore not feasible for complex, large-scale simulation mod-

els. In many cases, only particular parts of a GIS are necessary for an ABM;

utilizing a feature-rich GIS toolkit, such as ArcGIS [4], at simulation run-time is

(a) Raster Data

(b) Vector Data

Figure 5.1. Panels (a) and (b) show the northwest corner of Bali as represented by a raster and a vector file. Here, the granularity of the raster file is not as precise as the vector file.

not typically advisable. We aim to advance the field through efficient and fast

approaches to dynamically working with GIS data within an ABM.

5.5 Summary

This chapter has introduced simulation and modeling techniques, while focus-

ing on ABM. We have also discussed GIS and its popular data formats, offering

advantages and disadvantages. The difficulties in integrating GIS data with an

ABM have also been described. Chapter 6 describes our simulation that integrates

ABM and GIS data to model pathogen spread.

CHAPTER 6

A GIS AWARE AGENT-BASED MODEL OF PATHOGEN TRANSMISSION¹

6.1 Introduction

In this chapter, we describe an epidemiological model that incorporates spatial

data as an influence on agent behavior and pathogen spread. In particular,

we create an epidemiological model to simulate pathogen spread amongst long-

tailed macaque monkeys, Macaca fascicularis, on the Indonesian island of Bali.

GIS data is incorporated into our simulation, and we offer insight on how to ef-

ficiently integrate GIS data into a model, depending on the model’s complexity

and needs. We note optimizations made along the way and compare our methods

to conventional approaches. We conclude with results for our model. This work

is performed with global public health goals in mind and could also be applied to

model infectious diseases carried by arthropod vectors.

6.2 LiNK Simulation Model

We have created a model, the implementation of which is named LiNK after

its creators (Lane, Niederweiser, and Kennedy) and further described in Lane

[88], to aid in the understanding of pathogen transmission patterns. This model

was designed to simulate the spread of infection amongst long-tailed macaques,

¹ Results from this chapter have appeared in Kennedy et al. [76, 79].

shown in Figure 6.1, on the Indonesian island of Bali. We have coupled detailed

GIS data with a detailed understanding of the macaque population to create

a rich simulation. LiNK is deployed on a computing cluster at the University

of Notre Dame [142]. Development of the model has been performed through

the interdisciplinary collaboration of biologists, anthropologists, and computer

scientists.

6.2.1 Model Background

Several zoonotic diseases have recently emerged on the Asian landscape; macaques

have been implicated as both hosts and reservoirs in these disease emergences in

humans. Anthropogenic landscape changes have increased the incidence of human

to non-human primate interaction, potentially leading to bi-directional pathogen

transmission events [32, 41, 88]. In our model, we evaluate how landscape changes

might influence pathogen transmission patterns, based on the behavior and dis-

persal patterns of long-tailed macaques across the island of Bali. Our long-term

aim is to answer the following research questions:

1. What are potential rates and routes of pathogen transmission in macaques across the island?

2. How do pathogen life history parameters impact this transmission?

3. Do the answers change with the inclusion of humans as a component of the landscape?

Landscape plays a very important role in these questions, necessitating the use

of GIS data in our simulation.

A unique system of temples, one of which is shown in Figure 6.2, has existed

on Bali for centuries; these temples and their associated forests act as refugia for

Figure 6.1. Adult Female Macaque and Infant. Photo courtesy of A. Fuentes.

the large populations of long-tailed macaques [47, 147]. The island itself is fairly

small at 130 km × 80 km, yet it is an ideal size for study. Each of its roughly 40

temple populations consists of between 30 and 400 macaques. Existing behavioral

and preliminary genetic evidence has documented the matrifocal society of the

macaques, resulting in strong female philopatry [42, 46, 47, 88]. Females remain

at their natal (birth) temples throughout their lives, and social status is inherited

maternally. Typically, subdominant and subadult males disperse from their natal

temple at around age seven, traveling to non-natal temple populations. Actual

dispersal distances and rates are unknown.

The ability of long-tailed macaques to coexist with humans has enabled a

number of macaque populations to thrive in areas where other primate species

have become extinct [47]. On Bali, human land-use patterns have resulted in a

mosaic of riparian forest, small forest patches, agricultural lands, and urban areas

across much of the island. The broad distribution of macaque populations on

Bali suggests that the macaques are utilizing the human modified landscape as

it currently exists. Due to the protection and resource availability at temples,

macaques are able to thrive in moderately high densities alongside high density

human populations. This co-existence, particularly surrounding the temples, has

created an ideal study environment for evaluating how both primate behavior and

anthropogenic landscape changes influence pathogen transmission [41].

6.2.2 Conceptual Model

The conceptual model was developed by K.E. Lane, with support from A. Fuentes

and H. Hollocher. This research group has closely studied macaques and an ar-

ray of pathogens for a number of years. The basic model consists of a display of

Figure 6.2. Uluwatu Temple Site. This image shows the southern Balinese temple at Uluwatu.

Bali with temple sites and macaques. Users can also view the contents of a given

temple and provide multiple model and pathogen parameter options. We next

introduce the core components of our model and discuss them in greater detail in

Section 6.2.3.

Agents Our agents are macaques, each with their own properties, such as loca-

tion, sex, age, natal temple, and infection status. Macaques move according

to their surrounding environment, and males have the ability to enter and

leave temples. Our model can support thousands of agents, easily support-

ing the roughly 10,000 macaques on Bali. We show a simplified transition

diagram for the life cycle of our macaques in Figure 6.3.

Behavior Macaques have the ability to move through their environment, inter-

act with other macaques, reproduce, and die. Movement is dictated by their

surrounding environment; macaques query their neighborhood and move

appropriately. Macaques within a temple move randomly, with no GIS in-

fluence. All macaques have the ability to carry pathogens and can transmit

pathogens when within a specified distance of one another. Reproduction is

handled by allowing female macaques to produce offspring, with inherited

traits, after they reach a specified age. As macaques age, they have a higher

probability of dying.

Interface Researchers interact with the model through a simple control panel,

shown in Figure 6.4, that allows them to modify simulation parameters.

Once the parameters are set, the user can begin running the simulation.

The simulation is displayed using OpenMap [107] and is shown in Figure 6.5.

Users can also see macaques within temples, as shown in Figure 6.6.

Figure 6.3. Life Cycle Transition Diagram. Macaques are always born in temple sites. Female macaques spend their entire lives within their natal temple. Mature male macaques disperse throughout the island through varying landscape with the ability to join other, non-natal, temples.

Figure 6.4. LiNK Control Panel. Here, we show the parameters a user can modify when running a simulation. GIS layers can be enabled or disabled, and pathogen parameters can be set.

Pathogens LiNK has the ability to simulate a wide array of pathogens through

the incorporation of several important pathogen parameters. The infectivity

parameter refers to how infectious the pathogen is, while infectiousness is

the maximum distance between two macaques at which the pathogen can be

transmitted. Latency represents how long a macaque takes to

become symptomatic after becoming infected, and virulence represents the

deadliness of the pathogen. Acquired immunity refers to the amount of time a

macaque is immune to contracting a pathogen after having been previously

infected. Clearance time is the amount of time a macaque takes to be

cleared of a pathogen. Finally, natural resistance represents the proportion

of macaques that are immune to a given pathogen. Selected pathogen-

related variables and their temporal relationships are shown in Figure 6.7.

A transition diagram for these variables is shown in Figure 6.8. LiNK models

a single pathogen during a given simulation run.

Space The macaques move about on 2D grids that represent temple sites and

the island. The island grids are extrapolated from GIS data, at a customized

granularity. For our purposes, a grid cell has sides of roughly 100m, leading

to over one million possible locations. Each grid is called a layer; we have a

total of eight landscape layers: cities, forests, lakes, rice fields, rivers, roads,

temples, and the actual island (called coast). We have three additional

layers that serve as buffers that represent the impact of humans and water

on infectivity. These eleven layers are melded together and use the same

coordinate system. The coast and temple layers are required, while the

remaining layers can be turned on or off.

Time One time step in our simulation correlates to 12 real-world hours.
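The components above can be summarized as two record types. The following is a minimal Python sketch for illustration only; LiNK itself is written in Java, and every class, field, and method name here is an assumption rather than a name taken from the LiNK code. The default pathogen values are those listed later in Table 6.2.

```python
from dataclasses import dataclass
from typing import Optional

STEPS_PER_DAY = 2  # one time step corresponds to 12 real-world hours


@dataclass
class Pathogen:
    """Pathogen parameters, with the default values from Table 6.2."""
    infectiousness_cells: int = 1        # transmission distance in grid cells
    infectivity: int = 10                # 0-100 chance that exposure infects
    virulence: int = 80                  # 0-100 deadliness
    latency_steps: int = 4               # steps from infection to symptoms
    clearance_steps: int = 28            # steps needed to clear the pathogen
    acquired_immunity_steps: int = 120   # steps of immunity after clearance
    natural_resistance: float = 0.01     # fraction of macaques that are immune


@dataclass
class Macaque:
    """One agent; the field names are hypothetical."""
    sex: str                             # "F" or "M"
    age_steps: int
    natal_temple: int
    row: int
    col: int
    temple: Optional[int] = None         # None while dispersing
    infected: bool = False
    symptomatic: bool = False
    sick_steps: int = 0

    def can_disperse(self) -> bool:
        # Only males leave temples; the maturity threshold is omitted here.
        return self.sex == "M"
```

Keeping every duration in time steps, as Table 6.2 does, means that a latency of 4 steps corresponds to 2 days.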

Figure 6.5. LiNK Display. The figure above shows the display of our simulation. Here, we have the forests, lakes, and rivers layers enabled, as well as the actual island and the temple sites. Temples are shown as squares on the map. Green temples have no pathogens, while red temples have pathogens present within. Macaques are shown on the island as circles; they are green if they are healthy, pink if they are infected and not symptomatic, and red if infected and symptomatic. This screen capture has one infected temple and several infected macaques.

Figure 6.6. Temple Site Display. Here, we show the interior of a temple site. Male macaques are shown as solid circles and females as hollow circles. Macaques are green if healthy, pink if infected and not symptomatic, and red if infected and symptomatic. The user can choose which temple site to display at run-time.

Figure 6.7. Temporal Relationship of Pathogen Parameters and Related Events. The diagram above shows the relationships of the pathogen parameters in our simulation. Depending on the parameters used, macaques can become permanently immune to the modeled pathogen.

Figure 6.8. Pathogen Transition Diagram. Macaques generally begin as susceptible and then transition to other states after being infected. Macaques with a symptomatic infection can become reinfected and macaques can reinfect themselves (autoinfection). An acquired immunity is gained after most infections, but may be lost after a given amount of time.

6.2.3 ODD Protocol Description of LiNK

Grimm et al. proposed [58] and recently updated [59] a protocol to describe

agent-based models, the ODD protocol, that consists of 1) the model Overview, 2)

Design concepts, and 3) Details. The overview block consists of the purpose, state

variables and scales, and process overview and scheduling elements. The details

block is further divided into the initialization, input, and submodels elements.

This section describes LiNK according to the original ODD protocol.

6.2.3.1 Purpose

The purpose of the LiNK simulation model is to help understand the effect of

landscape on the spread of pathogens among macaque monkeys on Bali, Indonesia.

6.2.3.2 State Variables and Scales

The spatially explicit model consists of agents representing macaque monkeys

on the island of Bali, Indonesia. ESRI shapefiles serve as the backbone for the

GIS in the model. Layers representing landscape include cities, forests, lakes,

rice fields, rivers, and roads, as well as the island of Bali. We also utilize a

layer representing the geographic location of 42 temple sites on the island and 3

additional layers we created that serve as buffers, namely to represent the impact

of humans and water on infectivity. We abstract the shapefiles to a grid-based

system on which movement amongst the layers is probability-based and relies, by

default, on a Moore neighborhood, which is discussed further in Section 6.3.5.1.

At each time step, macaques evaluate potential new positions, noting their current

landscape and directional bias. Each new position is assigned a value, which is

then normalized. The macaque then has the opportunity to move. Each range of

values in Table 6.1 represents a weighted probability that a macaque will move

from one landscape to another.

TABLE 6.1

MOVEMENT VALUES FOR DISPERSING MACAQUES

From \ To     City    Coast   Forest  Lake  Rice Field  River   Road
City          10-30   15-20   40-70   0     10-30       15-45   0-20
Coast         5-20    20-30   40-70   0     10-30       15-45   5-20
Forest        5-20    15-20   10-30   0     10-30       15-45   0-20
Rice Field    5-20    15-20   40-70   0     15-40       15-45   0-20
River         5-20    15-20   40-70   0     10-30       20-55   0-20
Road          5-20    15-30   40-70   0     10-30       15-45   5-30

State variables for LiNK are described in Table 6.2. A time step of 12 hours

was chosen in conjunction with a grid cell size of 111 meters to obtain the appro-

priate level of precision based on our knowledge of macaque behavior. Movement

probabilities were also chosen in accordance with studied macaque behavior.

TABLE 6.2

STATE VARIABLES

Variable                    Value

Model
  Dispersal Deaths per Day  7.14E-4 (2% every 2 weeks)
  Autoinfection             True
  Initial Infected Temples  1
  Natural Resistance        1% of population
  Temples                   Temples populated with realistic numbers
  Time step                 12 hours
  Grid cell size            111 meters

Macaque
  Sex                       Temples: 75% female, 25% male
                            Dispersing: 100% male
  Age                       50% adult (8-18 y male; 8-20 y female),
                            50% juvenile (0-8 y)
  Latitude                  Random within island bounds
  Longitude                 Random within island bounds
  Natal Temple              Random
  Directional Bias          Random
  Current Landscape         Based on latitude and longitude
  Infected                  True if infected
  Sick Steps                Number of time steps infected to date
  Symptomatic               True if symptomatic

Pathogen
  Infectiousness            1 grid cell
  Infectivity               10 (0-100 range)
  Virulence                 80 (0-100 range)
  Clearance Time            28 time steps
  Natural Resistance        1% of population
  Latency                   4 time steps
  Acquired Immunity         120 time steps
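Because most durations in Table 6.2 are expressed in 12-hour time steps, it can help to convert them to days. The short Python sketch below does this; the helper names are hypothetical. It also shows that spreading 2% evenly over the 28 time steps in two weeks gives approximately 7.14E-4, which matches the tabulated dispersal death value. Reading that value as a per-time-step figure is our interpretation and is not stated in the table.

```python
HOURS_PER_STEP = 12  # time step length from Table 6.2


def steps_to_days(steps: float) -> float:
    """Convert a duration in time steps to days."""
    return steps * HOURS_PER_STEP / 24


def per_step_rate(total: float, days: float) -> float:
    """Spread a total probability evenly (linearly) over the steps in `days`."""
    steps = days * 24 / HOURS_PER_STEP
    return total / steps


clearance_days = steps_to_days(28)     # clearance time in days
immunity_days = steps_to_days(120)     # acquired immunity in days
dispersal_rate = per_step_rate(0.02, 14)
```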

6.2.3.3 Process Overview and Scheduling

The LiNK model is event-driven. At each time step, a specified number of

events are scheduled and executed, macaque by macaque. Macaques are handled

in two groups: those dispersing and those within temples.

Dispersing macaques are processed first. We begin by incrementing the macaque’s

age and then allowing the macaque to move according to the movement function.

Next, each infected macaque has the opportunity to transmit infection and to

die from infection. Death is also possible as a result of dispersal deaths per day,

virulence, or macaque age. Finally, each dispersing macaque has the opportunity

to enter a temple, depending on his proximity to it.

Within temples, the process is similar. We begin by aging the macaques

and next remove them if their age or sickness reaches the corresponding threshold. If

a macaque’s previous coordinates exceed those of the temple bounds and if the

macaque is a male of appropriate age, the macaque leaves the temple according to

a given probability. Female macaques have a 25% chance to give birth annually

from 3-13 years of age. Finally, we simulate the pathogen and randomly move

macaques within the temples.
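The text gives the birth chance as an annual figure, while the simulation advances in 12-hour steps. If births were evaluated at every time step (an assumption; the source does not say how often the check is made), the annual 25% chance would need to be converted to an equivalent per-step probability. A minimal Python sketch of that conversion, with hypothetical names:

```python
STEPS_PER_YEAR = 2 * 365  # two 12-hour time steps per day


def per_step_probability(annual_p: float,
                         steps_per_year: int = STEPS_PER_YEAR) -> float:
    """Per-step probability whose compounded yearly effect equals annual_p."""
    return 1.0 - (1.0 - annual_p) ** (1.0 / steps_per_year)


def can_give_birth(sex: str, age_years: float) -> bool:
    """Females are eligible to give birth from 3 to 13 years of age."""
    return sex == "F" and 3 <= age_years <= 13


birth_p = per_step_probability(0.25)
```

Compounding is used instead of dividing 0.25 by 730 so that the probability of at least one birth over a full year is exactly 25%.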

6.2.3.4 Design Concepts

Emergence Influenced by the landscape, patterns of disease spread across Bali

emerge over time.

Sensing Macaques know their current and surrounding landscape, which they

use to make movement decisions.

Interactions Macaques interact with other macaques only to transmit pathogens.

When a macaque is within the ring of infectiousness of an infected macaque,

it has the possibility to become infected.

Stochasticity Survival in the model is stochastic; pathogens and the dispersal

death rate directly affect survival rate. Movement is also probability-based.

Certain landscapes are more desirable than others, and macaques move with

a directional bias, both of which factor into movement decisions. Births are

stochastic such that females between the ages of 3 and 13 have a 25% chance

of giving birth each year. The sex of the offspring has an equal

chance of being male or female. Finally, macaques located within temples

often attempt to move beyond the bounds of the temple. This is permissible

only a small percentage of the time and only for males of a specified age.

Observation Data is collected based on events. Namely, each infection, death,

birth, and transition between a temple and the landscape is recorded in

the output file. The model is observed through its GUI (graphical user

interface) and also through analysis of the output file. We have written a

separate program named LiNKStat (described in Section 6.5.1) that presents

and performs basic analysis of the output.

6.2.3.5 Initialization

Upon initialization, several things are constant. First, the number of macaques

within each temple site is always the same and is based on scientific data. Sec-

ond, the landscape layers available are always the same; however, the number

of landscape layers that are enabled varies. The initial geographic placement of

macaques, both inside the temples and dispersing, is random. The initial values of

the parameters were chosen based upon observation and prior studies. Pathogen

parameters are varied according to the characteristics of a given pathogen.

6.2.3.6 Input

The input to the model includes the GIS shapefiles representing the various

landscape features of Bali. These were collected as part of a dissertation [132].

6.2.3.7 Submodels

Pathogens When a macaque becomes infected, it traverses through a variety of

pathogen-related states. Upon infection, a macaque enters a latent state,

which lasts until the macaque becomes symptomatic. A

latent macaque is also able to transmit the pathogen to other macaques.

After completing the symptomatic phase, a macaque will become free of

infection and clear of the pathogen, meaning it can no longer transmit the

pathogen. The macaque will also enter an acquired immunity phase during

which it will not be able to become infected. Transmission of the pathogen

between macaques depends on infectiousness and infectivity. Infectiousness

refers to the transmission ring which both macaques have to be within to

transfer infection; infectivity is the chance that the infection will take place.

Virulence reflects the deadliness of the pathogen. Figure 6.7 shows the

temporal relationships for selected pathogen-related states.

Movement The higher the virulence of the pathogen infecting a macaque, the smaller the

chance that macaque will move. While movement within temples is ran-

dom, movement amongst dispersing macaques is complex. In its simplest

form, macaques move about a Moore neighborhood, namely the eight imme-

diately surrounding grid locations. At each time step, dispersing macaques

consider their previous movement direction, their current landscape, and the

landscape in their Moore neighborhood to determine their next location. We

utilize the numbers in Table 6.1 to quantify the likelihood of a macaque leaving

one landscape for another. This is combined with the macaque’s current

direction of travel and the new location (if any) is determined stochastically.

The mechanism of movement is independent of the number of layers enabled

for any given simulation run.
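The pathogen submodel above can be read as a small state machine. The Python sketch below is one possible reading and not LiNK's Java code. It assumes that clearance time is counted from the moment of infection, so the symptomatic phase lasts clearance minus latency steps, and it leaves out death, natural resistance, reinfection, and autoinfection. The state and function names are hypothetical; the default durations are those in Table 6.2.

```python
from enum import Enum


class State(Enum):
    SUSCEPTIBLE = 1
    LATENT = 2        # infected and able to transmit, not yet symptomatic
    SYMPTOMATIC = 3
    IMMUNE = 4        # cleared of the pathogen; acquired immunity is active


def advance(state, steps_in_state, latency=4, clearance=28, immunity=120):
    """Return (new_state, new_steps_in_state) after one time step."""
    steps_in_state += 1
    if state is State.LATENT and steps_in_state >= latency:
        return State.SYMPTOMATIC, 0
    # Assumption: clearance is measured from infection, not from symptom onset.
    if state is State.SYMPTOMATIC and steps_in_state >= clearance - latency:
        return State.IMMUNE, 0
    if state is State.IMMUNE and steps_in_state >= immunity:
        return State.SUSCEPTIBLE, 0
    return state, steps_in_state


def is_infectious(state):
    """Latent and symptomatic macaques can both transmit the pathogen."""
    return state in (State.LATENT, State.SYMPTOMATIC)
```

Under these assumptions a newly infected macaque becomes symptomatic after 4 steps, is cleared after 28 steps, and is susceptible again 120 steps later.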

6.2.4 Implementation

There are several tools and technologies utilized in LiNK. The model is coded

in Java [67] with the Repast simulation toolkit [118]. We utilize Repast and

OpenMap [107] to display the model and GeoTools [49] and JTS Topology Suite

[70] to interact with the spatial information. The choice of tools used in this study

was primarily driven by the necessity to process and visualize GIS data and to be

cross-platform and open-source.

6.2.5 Verification and Validation

Simulations are credible only once they have passed some form of verification

and validation analysis. Verification refers to building the model right, meaning

that the simulation model matches the abstract model. Validation refers to building

the right model, meaning the correct abstract model was chosen. ABMs must

undergo and pass several subjective and quantitative verification and validation

techniques to be considered useful models [7, 9, 81, 149]. Figure 6.9 shows common

techniques for verifying and validating ABMs, adapted from Kennedy et al. [75].

The LiNK model was developed in conjunction with domain experts from multiple

fields and has undergone extensive face validation, both through its display and

evaluation of its output. We have also checked for internal validity and traced

entities of the model. Much of this work has been performed through the use

of LiNKStat, which we describe in Section 6.5.1. We are currently collecting

additional real-world data that we will use in conjunction with the existing data

to continue docking LiNK and to examine LiNK’s predictive power.

Figure 6.9. Verification and Validation Techniques for Agent-based Models. Here, we show techniques we used and plan to use for the verification and validation of LiNK.

6.3 GIS Data and Agent-Based Modeling

In this section, we describe common methods to utilize GIS data in an agent-

based simulation environment. We also describe our improvements to these tech-

niques. We conclude this section with details on our spatially aware agents.

6.3.1 Approximating GIS Data in Simulations

When an ABM environment is built upon GIS data, queries can be expensive,

particularly with complex data or movement. As a general rule, the more complex

the GIS data, the more difficult it is to efficiently utilize it within an ABM.

Additionally, the more GIS data that is available, such as multiple landscape

features, the more time-consuming it will be for agents to query. Put simply, at

each time step, an agent needs to query its unknown surroundings and make a

decision regarding its next move. The more GIS data there is, the longer this will

take. A common solution is to approximate GIS data to the level of granularity

required for a given model. As such, the amount of GIS data is decreased while

the integrity of the data required is maintained. We next describe several ways

to access GIS data from a simulation, offering advantages and disadvantages for

each.

6.3.2 Raster Queries

Raster-based (cell-based) spatial queries made through a spatial package can

be costly, as the mechanisms by which agents access this data are typically not

optimized for use in simulations. Additionally, storing and loading potentially

large raster data files is inefficient at simulation run-time, particularly when not

all of the data is necessary. Raster files are also not ideal for representing complex

GIS data where fine-scale granularity is required. An advantage of utilizing raster

data in an ABM is that it easily maps to traditional ABM grid spaces.

6.3.3 Spatial Queries

Spatial queries on vector-based (coordinate) GIS data are the most accu-

rate way an agent can interact with GIS data. Here, an agent simply performs

mathematical-based queries on the loaded GIS data to determine its surroundings.

While very accurate, the cost of performing a spatial query increases as the com-

plexity of the data increases. For example, it may be mathematically simple to

query a rectangle to see whether an agent is contained within it; however, it is very

mathematically expensive to do the same query on a large polygon. Repeatedly

performing such queries is especially expensive, and this problem is exacerbated

as the number of agents and the amount of spatial data increases. While indexing

spatial data alleviates some redundancy, queries are still expensive.
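The cost difference between the two containment queries can be illustrated with a short Python sketch. The rectangle test takes a fixed number of comparisons, while the polygon test (here the standard ray-casting method, chosen for illustration and not necessarily what the GIS libraries used by LiNK do internally) visits every edge, so its cost grows with the number of vertices. Function names are hypothetical.

```python
def in_rectangle(x, y, xmin, ymin, xmax, ymax):
    """Constant-time containment test for an axis-aligned rectangle."""
    return xmin <= x <= xmax and ymin <= y <= ymax


def in_polygon(x, y, vertices):
    """Ray-casting containment test; cost is linear in the vertex count.

    `vertices` is a list of (x, y) tuples describing a simple polygon.
    Points exactly on the boundary are not handled specially.
    """
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        # Does a horizontal ray cast to the right of (x, y) cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside
```

With an outline of roughly 10,000 vertices, each agent query of this kind touches every edge, which is the expense the text refers to.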

6.3.3.1 Simplified Spatial Queries

The performance of spatial queries can be improved if the vector data is ap-

proximated in a manner such that the number of vertices in a line or polygon is

decreased, while maintaining an appropriate level of data integrity. The Douglas-

Peucker algorithm [34] is commonly used to perform such simplifications. This

technique offers a performance gain over traditional spatial queries, but at a cost

of less accurate spatial data. However, repeatedly performing similar or identical

spatial queries is redundant and can be remedied. Figure 6.10 shows a roughly 99%

data simplification that maintains considerable data integrity for Bali’s outline.
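For reference, the Douglas-Peucker simplification can be sketched in a few lines of Python. This is a generic textbook version, not the routine used to produce Figure 6.10; it measures distance to the chord segment between the endpoints, and the function names and tolerance value are illustrative assumptions.

```python
import math


def _point_segment_distance(p, a, b):
    """Distance from point p to the segment a-b (all are (x, y) tuples)."""
    ax, ay = a
    bx, by = b
    px, py = p
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def douglas_peucker(points, tolerance):
    """Return a simplified copy of `points` that keeps both endpoints."""
    if len(points) < 3:
        return list(points)
    # Find the point farthest from the chord joining the two endpoints.
    index, max_dist = 0, 0.0
    for i in range(1, len(points) - 1):
        d = _point_segment_distance(points[i], points[0], points[-1])
        if d > max_dist:
            index, max_dist = i, d
    if max_dist <= tolerance:
        return [points[0], points[-1]]
    # Keep the farthest point and simplify each half recursively.
    left = douglas_peucker(points[: index + 1], tolerance)
    right = douglas_peucker(points[index:], tolerance)
    return left[:-1] + right
```

A larger tolerance removes more vertices; for very long outlines an iterative version avoids deep recursion.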

(a) 10,000 Data Points

(b) 100 Data Points

Figure 6.10. Panels (a) and (b) represent Bali, Indonesia with approximately 10,000 and 100 data points, respectively. Here, we reduce the number of points by roughly 99%, but still retain considerable data integrity.

6.3.4 Precalculated Query Matrix

Recognizing the drawbacks of earlier techniques, we developed and utilized a

technique involving precalculated query matrices to create spatially aware agents.

This technique relies on the advantages of raster data while utilizing the accuracy

of vector data. Here, vector files are used in conjunction with spatial queries

to build arrays of spatial data. Specifically, we iterate through the vector data,

at a specified granularity, and perform spatial queries at each point. The result

of the query is stored in the matrix for that specific layer as a Boolean value

which specifies whether a given landscape is present. This process is shown in

Algorithm 2 and is performed for all available spatial data. The run-time for

Algorithm 2 is O(xyl), where x and y are the number of latitude and longitude

values and l is the number of matrices. The number of matrices refers to the number

of landscape layers in use. While time consuming, the expensive queries only need

to be performed once for a given granularity, prior to simulation run-time. We

utilize serialization to load the arrays into the simulation and agents can access

the data in constant time. The main disadvantage to this method is that arrays

of finer granularity will take longer to build, resulting in larger arrays and slightly

longer query times. The advantages include agents that can more quickly query

their environment and a simulation that scales well, both in terms of the amount

of GIS data available and in the number of agents. Researchers also have the

advantage of choosing a granularity to fit their needs. Currently, we use multiple

precalculated query matrices in LiNK.

Algorithm 2 BuildPrecalculatedQueryMatrix

Let X be the set of latitude values
Let Y be the set of longitude values
Let L be the set of GIS layers
Let M_l be the Precalculated Query Matrix for layer l
for all x ∈ X do
    for all y ∈ Y do
        for all l ∈ L do
            M_l(x, y) ← SpatialQuery(l, x, y)
        end for
    end for
end for
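Algorithm 2 can be sketched in Python as follows. This is an illustration under stated assumptions: the `spatial_query` callback stands in for whatever vector query the GIS library provides, the toy `in_box` query and the layer geometries are invented for the example, and `pickle` is used here only as a stand-in for the serialization step mentioned in the text. The layer loop is placed outermost, which does not change the O(xyl) cost.

```python
import pickle


def build_query_matrices(lat_values, lon_values, layers, spatial_query):
    """Return {layer_name: 2D list of bool}, indexed [lat_index][lon_index]."""
    matrices = {}
    for name, geometry in layers.items():
        matrices[name] = [
            [bool(spatial_query(geometry, lat, lon)) for lon in lon_values]
            for lat in lat_values
        ]
    return matrices


def in_box(box, lat, lon):
    """Toy spatial query: is the point inside an axis-aligned box?"""
    lat_min, lon_min, lat_max, lon_max = box
    return lat_min <= lat <= lat_max and lon_min <= lon <= lon_max


# Hypothetical grid and layers, for demonstration only.
lats = [0.0, 1.0, 2.0, 3.0]
lons = [0.0, 1.0, 2.0]
layers = {"lake": (1.0, 1.0, 2.0, 2.0), "coast": (0.0, 0.0, 3.0, 2.0)}

matrices = build_query_matrices(lats, lons, layers, in_box)

# Build once, serialize, and reload at simulation start.
restored = pickle.loads(pickle.dumps(matrices))
```

Once loaded, a lookup such as `restored["lake"][row][col]` is a constant-time array access and needs no geometry calculation.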

6.3.5 GIS Aware Agents

In traditional ABMs, agents move about a grid-like structure. GIS aware

agents move about the same structure, but in a manner such that each move is

influenced by the surrounding environment, including nearby agents. A simple

example would be allowing agents to move preferentially into one landscape over

another. Previously, we listed ways by which agents can query their environment.

Our agents are able to adequately and efficiently survey their surroundings, mak-

ing use of that data to become spatially aware. We utilize precalculated query

matrices for movement decisions. To display this movement on the native vector

data, we use hash tables to “map” the native GIS latitude and longitude points to

our matrices, and vice versa. This mapping avoids repetitive calculations, while

allowing agents to find their real-world coordinates quickly. This also assists in

enabling agents to move with complex rules, which we next describe.
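A mapping between native coordinates and matrix indices of the kind described above might look like the following Python sketch. The grid spacing of 0.001 degrees is an assumption; it corresponds to roughly 111 m of latitude, which is consistent with the stated cell size but is not confirmed by the source. The origin value and all names are made up for the example.

```python
CELL_DEG = 0.001  # assumed grid spacing in degrees


def build_index_tables(origin, count, cell=CELL_DEG):
    """Build hash tables mapping rounded coordinates to indices and back."""
    to_index, to_coord = {}, {}
    for i in range(count):
        # Round to three decimals (matching the 0.001 spacing) so that the
        # keys do not depend on floating-point accumulation error.
        coord = round(origin + i * cell, 3)
        to_index[coord] = i
        to_coord[i] = coord
    return to_index, to_coord


# Hypothetical southern edge of the grid; not a value taken from LiNK.
lat_to_row, row_to_lat = build_index_tables(origin=-8.900, count=5)
```

Both directions are dictionary lookups, so converting between display coordinates and matrix cells requires no repeated arithmetic at run-time.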

6.3.5.1 Movement

Adding movement to agents in a GIS-based environment is challenging. With

raster data, agents must perform tedious queries through the GIS system to deter-

mine the surrounding landscape. Spatial queries are inefficient too, as the queries

can be redundant and take considerable time. Utilizing precalculated query ma-

trices enables us to create many agents with complex and realistic movements in

rapid time.

In traditional ABM cellular automata spaces, agent behavior is based on a

von Neumann or Moore neighborhood. Specifically, von Neumann neighborhoods

describe the four cells immediately adjacent to the current cell in a traditional

square grid. A Moore neighborhood extends this to the surrounding eight ad-

jacent cells, including those diagonally adjacent. Performing spatial queries on

such spaces would be tedious and inefficient, particularly if the neighborhood was

extended beyond a Moore neighborhood.
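The two neighborhoods, and their extension to a larger radius, can be written down compactly. The Python sketch below is illustrative; the function names are hypothetical and no bounds checking against the edge of the grid is performed.

```python
def von_neumann(row, col, radius=1):
    """Cells within Manhattan distance `radius`, excluding the center."""
    return [(row + dr, col + dc)
            for dr in range(-radius, radius + 1)
            for dc in range(-radius, radius + 1)
            if 0 < abs(dr) + abs(dc) <= radius]


def moore(row, col, radius=1):
    """Cells within Chebyshev distance `radius`, excluding the center."""
    return [(row + dr, col + dc)
            for dr in range(-radius, radius + 1)
            for dc in range(-radius, radius + 1)
            if (dr, dc) != (0, 0)]
```

With a radius of 1 these give 4 and 8 cells respectively; a Moore neighborhood of radius 2 already contains 24 cells, which is why per-cell query cost matters.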

In our model, spatial movement is based on a Moore neighborhood, with al-

lowance for larger neighborhoods. To move intelligently, agents must know the

landscape they are currently in as well as the surrounding landscape. To repre-

sent possible transitions from one cell to another, we use a matrix of probabilistic

movement values. This table consists of values representing the likelihood that an

agent would move from a given landscape to another, shown in Table 6.1. These

values were determined after discussions with domain experts. Calculations are

performed for each of the cells in the Moore neighborhood. A directional bias

is also added to the agents so they are more likely to continue in the same gen-

eral direction. Once the values for the surrounding cells have been calculated,

they are normalized. We then use probabilities to determine the next location

for the agent, if it moves at all. These calculations are performed quickly, as the

look-ups for the surrounding cells can be performed in constant time, allowing

for realistic movement among agents. Figure 6.11 shows a simplified version of

our movement on an example grid and Algorithm 3 describes dispersed move-

ment algorithmically (time-dependent on the number of possible new locations).

WeightedSelectOnAdjust refers to selecting the new location based upon the

normalized probabilities.

Intelligent agents can be classified as simple reflex, model-based reflex, goal-based,

utility-based, or learning [123]. Based on the movement decisions

described previously, our agents could be classified as utility-based, but with

stochastic-based utility functions and decisions. This classification fits our agents

because they make decisions based upon utility - they are more content in certain

landscapes, and their contentment is determined by their previous location and

current landscape.

Algorithm 3 DispersedMovement

Let L_{t+1} be the set of possible locations for the next time step
Let l_{t+1} be the new location
Let b1 be the directional bias
Let b2 be the landscape bias
for each time step t do
    for all l ∈ L_{t+1} do
        l ← b1 + b2
    end for
    l_{t+1} ← WeightedSelectOnAdjust(l ∈ L_{t+1})
end for
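One step of Algorithm 3 might be sketched in Python as below. Several details are assumptions made for the example and are not confirmed by the source: the landscape bias is drawn uniformly from the Table 6.1 range, the directional bias is a fixed bonus (the value 10.0 is invented) for continuing in the current heading, and the option of not moving at all is left out. Only the Forest and River rows of Table 6.1 are included, restricted to three destination landscapes.

```python
import random

# Excerpt of Table 6.1 as {from: {to: (low, high)}}.
MOVE_RANGES = {
    "forest": {"forest": (10, 30), "river": (15, 45), "lake": (0, 0)},
    "river": {"forest": (40, 70), "river": (20, 55), "lake": (0, 0)},
}


def choose_move(current, heading, neighbours, rng, bias=10.0):
    """Pick the next offset for a dispersing macaque.

    `neighbours` maps a (drow, dcol) offset to that cell's landscape name.
    Returns the chosen offset, or None if every candidate has zero weight.
    """
    offsets, weights = [], []
    for offset, landscape in neighbours.items():
        low, high = MOVE_RANGES[current][landscape]
        b2 = rng.uniform(low, high)               # landscape bias
        b1 = bias if offset == heading else 0.0   # directional bias
        # A landscape that can never be entered keeps a weight of zero.
        weight = 0.0 if high == 0 else b1 + b2
        offsets.append(offset)
        weights.append(weight)
    total = sum(weights)
    if total == 0:
        return None
    probabilities = [w / total for w in weights]  # normalize
    return rng.choices(offsets, weights=probabilities, k=1)[0]
```

Passing in a seeded `random.Random` instance makes a run reproducible, which is useful when tracing individual agents during verification.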

Figure 6.11. Macaque Movement. The graphic above shows how a macaque M determines where to move in a landscape consisting of forests (green) and a river (blue). There are movement probabilities associated with landscape features. For example, a macaque would be more likely to enter a forest than a river. Here, we base movement on the immediate surrounding cells; however, it can be based on an arbitrary number of cells in an outward direction.

6.4 Results

LiNK has started to demonstrate the importance of landscape in the scope of

epidemiological modeling [88]. The model has been improved in terms of speed

and scalability through an abstraction of typical GIS data representation. We

have shown the ability to have many agents interact with complex spatial data in

a time frame adequate for a simulation while still addressing the research questions

mentioned in Section 6.2.1 at a high level. Additionally, we have started to

show the impact of landscape on pathogen transmission, as shown in Figures 6.12

and 6.13, which is thus far in accordance with real-world data from Roberts and

Janovy [120]. To date, it appears that virulence is the dominant factor in terms of

pathogen spread. Further sensitivity analysis and additional verification and validation

are ongoing.

[Figure 6.12, panel (a): bar chart titled "Total Infection by Landscape". Y-axis: Number of Infections (0 to 1,200,000,000). Bars: Dispar and Histo at temple sites 1 and 27 (Heterogeneous) and at temple sites 28 and 38 (Homogeneous).]

(a) Total Number of Infections by Landscape

[Figure 6.12, panel (b): bar chart titled "Total Infection by Population Size". Y-axis: Number of Infections (0 to 1,200,000,000). Bars: Dispar and Histo at temple sites 1, 28, 27, and 38, grouped by population size (Large Population, Small Population).]

(b) Total Number of Infections by Population Size

Figure 6.12. Panels (a) and (b) show the total number of infections at four temple sites. Temple sites 1 and 27 are in heterogeneous areas, meaning there are many landscape types present. Temple sites 28 and 38 are in homogeneous areas. Additionally, temple sites 1 and 28 consist of a small population, while temple sites 27 and 38 consist of a large population. Dispar refers to Entamoeba dispar and is an avirulent parasite, while Histo refers to Entamoeba histolytica and is highly virulent. Panel (a) shows that the diversity of landscape in which the pathogen is spread has little effect on the total number of infections. Panel (b) shows the same data grouped by population, showing that population has little effect on the total number of infections. From this data, we conclude that virulence has the highest impact on the total number of infections, while landscape has relatively little impact.

Figure 6.13. Pathogen Spread to Varying Temple Sites. The figure above shows the number of infected macaques that reach temple sites throughout the island after having vacated the temple denoted with the red star. The western part of the island is highly homogeneous, allowing for the pathogen to spread further. The pathogen likely does not spread to the north central part of the island due to its heterogeneous landscape.

6.4.1 Performance

The model has utilized the aforementioned (Sections 6.3.2-6.3.4) techniques to

interact with GIS data. We started with hefty raster-based queries and refined our

method until we achieved the balance of specificity and speed we desired. Table 6.3

shows the initial GUI load time for the model for each technique, and Table 6.4

and Figure 6.14 show the number of time steps simulated per second for each query

mechanism. These tables and figure show averages over multiple simulation runs

with either the coast and lakes or the coast, lakes, and forests layers enabled, all

with the same number of initial agents. Spatial queries were predictably slowest, as

the raw vector files contain an enormous amount of realism, making calculations

expensive. Utilizing raster data offers a significant improvement but with the

drawback of the long initial startup time. Our simplified spatial query greatly

improves upon the traditional spatial query, but performance drops significantly as

more layers are added. Utilizing precalculated query matrices produces the fastest

simulation, with even greater gains when the display is disabled. Table 6.5 and

its corresponding Figure 6.15 show the scalability, in terms of number of agents,

for the raster and precalculated query matrix method. The precalculated query

matrix method scales very well as the amount of GIS data increases and adequately

as the number of agents increases. The precalculated query matrix method offers

the best and most scalable results. All performance tests were run on a single core as

a single thread on a Core 2 Duo 2.0 GHz laptop, highlighting further potential

in scalability. Numbers listed in the figures are averages of 10 simulation runs.

Additionally, LiNK has been adapted to run on a high-performance computing

cluster, making it easy to automate, greatly increasing its utility.

TABLE 6.3

PERFORMANCE COMPARISON OF GUI LOAD TIME

                              GUI Load Time (s)
                              Coast, Lakes    Coast, Lakes, Forests
Spatial Query                 3.5             3.5
Raster Query                  35              42
Simplified Spatial Query      1.8             2.5
Precalculated Query Matrix    1.6             2

TABLE 6.4

PERFORMANCE COMPARISON OF TIME STEPS/S

                                       Time steps/s
                                       Coast, Lakes         Coast, Lakes, Forests
Spatial Query                          1.6                  0.15
Raster Query                           18.5 (11x faster)    19 (126x)
Simplified Spatial Query               39.5 (25x)           15.8 (105x)
Precalculated Query Matrix             126.2 (79x)          124.1 (827x)
Precalculated Query Matrix, non-GUI    669.6 (419x)         650.2 (4335x)

[Figure 6.14: bar chart. X-axis: Spatial, Raster, Simplified Spatial, Precalculated Query Matrix, and Precalculated Query Matrix, non-GUI. Y-axis: Time steps/s, logarithmic scale from 0.1 to 1000.]

Figure 6.14. Performance Comparison of Varying Query Methods. The figure shows that we obtained nearly an order of magnitude performance increase in going from spatial to raster and simplified spatial queries, and then almost another order of magnitude in going from those to precalculated query matrices. Finally, disabling the GUI offers nearly another order of magnitude improvement. It is also notable that enabling more layers in non-GUI mode adds almost no performance hit. We show the figure above with a logarithmic scale.

TABLE 6.5

SCALABILITY COMPARISON OF TIME STEPS/S

                                                  Time steps/s
Number of Initial Dispersed Macaques              10       100      1000
Raster Query (3 Layers)                           51.3     29       19.9
Raster Query (7 Layers)                           33.6     27.6     11
Precalculated Query Matrix (3 Layers)             140.7    131.4    83.8
Precalculated Query Matrix (7 Layers)             137.5    129.5    82.9
Precalculated Query Matrix, non-GUI (3 Layers)    669.8    487.8    154
Precalculated Query Matrix, non-GUI (7 Layers)    680.4    529.6    158.2

10

100

1000

10 100 1000

Number of Initial Dispersed Macaques

Tim

este

ps/s

Raster Query (3 Layers) Raster Query (7 Layers)Precalculated Query Matrix (3 Layers) Precalculated Query Matrix (7 Layers)Precalculated Query Matrix, non-GUI (3 Layers) Precalculated Query Matrix, non-GUI (7 Layers)

Figure 6.15. Scalability with Respect to Initial Number of DispersedMacaques and Amount of GIS data. Here, we show simulations starting

with 10, 100, and 1000 dispersed macaques across different queryingmechanisms. The precalculated query matrix method performs best in

all cases, even better with 1000 agents than other methods with 10agents. The figure is shown on a logarithmic scale.


6.5 Analyzing Massive Amounts of Simulation Data

LiNK is a complex model; as such, it creates enormous amounts of output, up

to terabytes for a given experiment. To glean scientific insight and validation,

LiNK tracks a wide array of events, including infections, births, deaths, and

when a macaque enters or leaves a temple. When simulations are run over a long

period of time, it is not uncommon to have tens of millions of events, or more. We

have created an interactive graphical tool, originally named LiNKStat, to analyze

output from LiNK.

6.5.1 LiNKStat

Written in Perl and Tcl, LiNKStat parses through output files and builds

graphs to gather statistics about the model. Much of the initial analysis and

graph building is done automatically following a simulation run. For example,

LiNKStat allows users to track the route of infection from a given macaque, ob-

taining statistics such as number of macaques directly or indirectly infected. Such

statistics help subject matter experts collect insight from LiNK. A screen capture

of LiNKStat is shown in Figure 6.16 and an example graph from LiNKStat is

shown in Figure 6.17. LiNKStat is efficient, with a run-time mainly dependent on

the number of infection events and their degree of proliferation. The techniques

used in LiNKStat have been generalized and published as P-SAM [6].
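
To illustrate the kind of statistic described above, the following minimal Python sketch counts the macaques directly and indirectly infected by a given source. It is not LiNKStat's code (which is written in Perl and Tcl): the event tuple layout `(time_step, location, infector, infected)` and the function names are assumptions made for illustration, and the traversal ignores the order in which events occurred.

```python
from collections import defaultdict, deque


def build_infection_graph(events):
    """Map each infector to the set of macaques it infected."""
    graph = defaultdict(set)
    for _time_step, _location, infector, infected in events:
        graph[infector].add(infected)
    return graph


def infection_counts(graph, source):
    """Return (direct, indirect) counts of macaques infected from source.

    Direct: infected by the source itself. Indirect: reached only through
    a chain of other macaques. The source is never counted, even when it
    is reinfected or autoinfected.
    """
    direct = set(graph.get(source, ())) - {source}
    seen = {source} | direct
    queue = deque(direct)
    while queue:  # breadth-first walk over the transmission graph
        current = queue.popleft()
        for nxt in graph.get(current, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    indirect = seen - direct - {source}
    return len(direct), len(indirect)
```

Because the work is a single graph traversal, the cost grows with the number of infection events reachable from the chosen macaque, which is consistent with the run-time behavior noted above.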

6.6 Conclusion

When designing an ABM with GIS aware agents, there are a number of factors

that should be considered. Scalability in terms of the number of agents is generally

most important. Other important issues include the complexity of the GIS data


Figure 6.16. LiNKStat. This screen capture shows one of the analysis tabs of LiNKStat. The left column displays an interactive list of macaques in the simulation that updates the middle right panel with specific infection statistics. These statistics form graphs, an example of which is shown in Figure 6.17. LiNKStat has been and will continue to be very helpful in the verification and validation of LiNK.


Figure 6.17. LiNKStat Pathogen Transmission Graph. The graph above allows us to visually track pathogen transmission, helping with validation and interpretation of output. Nodes refer to macaques, with the naming convention being natal temple number concatenated with an id concatenated with a sex identifier. For example, the topmost node would be parsed as a female macaque with temple 27 as its natal temple and 2969 as its id. Transitions are infection events, listed with the time step and location where the infection occurred. Starting at the top, macaque 27.2969.0 infected macaque 27.2775.0 at time step 1, in temple 27. Macaque 27.2775.0 went on to infect four other macaques, and was also reinfected by macaque 27.2870.1. Autoinfection is possible as indicated by nodes 27.2863.0 and 27.2805.1.
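
The node naming convention in the caption above can be decoded mechanically. The Python sketch below is illustrative only: the caption establishes that a trailing `0` denotes a female, while reading `1` as male is an assumption, and the type and function names are hypothetical.

```python
from typing import NamedTuple


class MacaqueId(NamedTuple):
    natal_temple: int
    macaque_id: int
    sex: str


# "0" is female per the caption's example; treating "1" as male is an assumption.
SEX_CODES = {"0": "female", "1": "male"}


def parse_node_label(label):
    """Split a label such as '27.2969.0' into temple, id, and sex."""
    temple, ident, sex_code = label.split(".")
    return MacaqueId(int(temple), int(ident), SEX_CODES[sex_code])
```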


and the amount of GIS data that the model will rely upon. An adept modeler

will utilize the GIS data at a granularity appropriate for the model at hand. In

terms of speed, raster data scales reasonably with increasing GIS complexity, but

not as well with an increase in the number of agents. Spatial queries scale poorly

with an increase in the amount of GIS data and complexity, as well as with an

increase in the number of agents. Regarding accuracy, utilizing vector data via

spatial queries offers the highest accuracy, but at the highest performance cost.

Raster data and our precalculated query matrix method offer varying levels of

accuracy, while offering faster speed. Table 6.6 summarizes general ratings for

each approach. Possible ratings are 1-5, from Poor to Excellent. Accuracy of GIS

data refers to the faithfulness to the original GIS data, while the amount of GIS

data refers to the ability of each technique to handle multiple layers of GIS data.

Our precalculated query matrix method scales best in terms of number of agents

and particularly in the amount of GIS data present.
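
As a rough illustration of why a precalculated lookup can scale well with both the number of agents and the number of layers, the sketch below caches, for every grid cell, a bitmask of the GIS layers covering it, so that each agent query becomes a constant-time array lookup. This is an assumed, simplified design and not LiNK's implementation: the grid resolution, the layer names, and the `covers(x, y)` predicates standing in for expensive spatial queries are all hypothetical.

```python
def build_query_matrix(width, height, layers):
    """Precompute a per-cell bitmask of layer membership.

    layers maps a layer name to a predicate covers(x, y) -> bool, a
    stand-in for a costly spatial query that is run once per cell at
    load time instead of once per agent per time step.
    """
    names = list(layers)
    bit = {name: 1 << i for i, name in enumerate(names)}
    matrix = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            for name in names:
                if layers[name](x, y):
                    matrix[y][x] |= bit[name]
    return matrix, bit


def cell_has(matrix, bit, x, y, layer):
    """Constant-time lookup used by agents during the simulation."""
    return bool(matrix[y][x] & bit[layer])


if __name__ == "__main__":
    toy_layers = {
        "lake": lambda x, y: (x - 5) ** 2 + (y - 5) ** 2 <= 4,  # toy circular lake
        "forest": lambda x, y: x >= 7,                          # toy forest strip
    }
    matrix, bit = build_query_matrix(10, 10, toy_layers)
    print(cell_has(matrix, bit, 5, 5, "lake"))
```

In this sketch, adding a layer increases only the one-time build cost and one bit per cell; the per-query cost during the simulation stays the same.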

We have presented a complex model of pathogen transmission that utilizes GIS

data. This model has started to demonstrate the importance of integrating spatial

data into models of pathogen transmission. We have created an efficient and ef-

fective mechanism to allow our agents to become GIS aware. Future extensions to

the model include adding the ability to model different pathogens simultaneously,

deploying a web-based front end to the model, and allowing for the use of cus-

tom GIS data. We would also like to explore running our simulation on graphics

processing units, as described in D’Souza et al. [37]. Finally, we plan to further

verify and validate the LiNK model through real-world data.


TABLE 6.6

ADVANTAGES AND DISADVANTAGES (1- POOR; 5- EXCELLENT)

                          Raster    Spatial    Simplified       Precalculated
                          Query     Query      Spatial Query    Query Matrix
Accuracy of GIS Data      3         5          4                4
Amount of GIS Data        3         1          2                5
Complexity of GIS Data    2         5          4                4
Load Time                 1         4          4                5
Memory Requirement        2         4          4                5
Number of Agents          4         1          2                4
Time steps/s              4         1          2                5


CHAPTER 7

CONCLUSION

7.1 Overview

This dissertation has described the significance of TEs both in general and

with respect to their detection within newly sequenced genomes. We described an

automated homology-based approach for the identification of high quality TEs.

We next described a design and implementation plan for the annotation of TEs

on VectorBase. We later described a GIS aware agent-based model of pathogen

transmission. Together, we have created numerous approaches and models that

have important public health implications. We elaborate on our conclusions in

the following sections.

7.2 Automated Homology-based Approach for the Identification of Transposable

Elements

Chapter 2 introduced TEs and described strategies to detect them. Chapter 3

described our approach to the identification of TEs. The approach, implemented

as TESeeker, was tested on multiple families of TEs across a variety of organisms.

Overall, results were very good, with resulting consensus TEs as much as 98%

identical to previously annotated elements. TESeeker is available as a download-

able virtual machine. This work has been submitted to BMC Bioinformatics [80]


while results of this approach have appeared in print [5, 83].

7.2.1 Future Work

Due to the nature of TEs, there will likely never be an all-encompassing ap-

proach to discover them. Instead, existing approaches will be used in conjunction

with other approaches. For example, LTR TEs can be detected by both structure-

based and homology-based approaches. The utilization of multiple tools and ap-

proaches to detect TEs produces the most robust results. With TESeeker, several

improvements could be implemented. First, incorporating the capability to de-

tect LTRs in Class I and TIRs in Class II consensus elements would allow us to

more correctly trim our consensus sequences. Second, the ability for TESeeker

to automatically determine the size of flanking sequence could be implemented

on a family by family basis. Last, TESeeker could be extended to allow for the

detection of MITEs.

TESeeker also opens up numerous opportunities to further study TEs within

the organisms hosted on VectorBase. For example, a comparative study of the

mariner elements within the Anopheline mosquito complex could be performed.

Additionally, a comparative study of TEs within Anopheles gambiae, Culex

quinquefasciatus, and Aedes aegypti could be performed. Initial work has been

performed on the comparative study of TEs within the Anopheles gambiae M &

S forms; TESeeker could help validate and complete this study.

7.3 Community Annotation of Transposable Elements on VectorBase

We introduced the VectorBase CAP in Chapter 2. Its technologies and imple-

mentation were also described in Chapter 2. A design and implementation plan for


the community annotation of TEs on VectorBase was presented in Chapter 4, in-

cluding a preliminary version which demonstrated the ability to store TEs within

the Chado database schema and to dynamically create a structural display of

the TE.

7.3.1 Future Work

Once the VectorBase CAP is restored to working order and TE instances are

able to be submitted by the community, work can begin to allow for the submission

and display of consensus TE sequences. The submission and display of consensus

sequences is more complex because consensus sequences need to be aligned against

the genome to determine the location of instances within the genome. This could

be done with a BLAST search and results could be displayed in the Ensembl

genome browser. Future additions could allow for the use of TESeeker to produce

an annotation of TEs that are submitted through the CAP and displayed in the

Ensembl genome browser.

7.4 GIS Aware Agent-based Model of Pathogen Transmission

Chapter 5 introduced simulations and described their applicability. We also in-

troduced GIS and discussed its utility within an agent-based model. We combined

GIS and agent-based modeling to create a simulation model for the transmission of

pathogens amongst long-tailed macaques on Bali, Indonesia, described in Chap-

ter 6. Macaques in our model are GIS aware and utilize their surroundings to

make movement decisions. Performance improvements were made through itera-

tive improvements in accessing GIS data. This work culminated in an invited and

refereed journal publication [76], in refereed conference proceedings [79], and was


also presented at peer-reviewed conferences [77, 78]. Applications of this work are

also expected to appear in Lane et al. [88].

7.4.1 Future Work

The utility of LiNK could be improved with the ability to utilize custom GIS

data. While this can currently be done manually, the ability to do so “on-the-

fly” would be very useful. This could be accompanied by the capability to add

additional agents with custom behavioral rules to the model.

The most significant improvement to the LiNK model would be the incorpo-

ration of a web-based interface to define and submit simulations, as well as to

view simulation results. The advantages of this include the ability to run simula-

tions on the CRC High-Performance Computing Cluster (HPCC) [142] at Notre

Dame and the ability to store simulation results in a database (rather than a text

file). Simulations run much more quickly on the HPCC, and a richer analysis of

the simulation data could be performed through software designed to work with

massive amounts of data.

7.5 Contributions

This dissertation has described the following contributions:

• Development and implementation of an automated approach to detect transposable elements.

• Design and implementation plan for the incorporation of TEs into the VectorBase community annotation pipeline, including a preliminary version implemented independent of VectorBase.

• Development and implementation of a GIS aware agent-based model of pathogen transmission.


Results from this dissertation have appeared in the following refereed publica-

tions:

• Arensburger, P., Megy, K., Waterhouse, R.M., Abrudan, J., Amedeo, P., Antelo, B., Bartholomay, L., Bidwell, S., Caler, E., Camara, F., Campbell, C.L., Campbell, K.S., Casola, C., Castro, M.T., Chandramouliswaran, I., Chapman, S.B., Christley, S., Costas, J., Eisenstadt, E., Feschotte, C., Fraser-Liggett, C., Guigo, R., Haas, B., Hammond, M., Hansson, B.S., Hemingway, J., Hill, S.R., Howarth, C., Ignell, R., Kennedy, R.C. et al., “Sequencing of Culex quinquefasciatus Establishes a Platform for Mosquito Comparative Genomics,” Science, 330(6000):86-88, October 2010.

• Kirkness, E.F., Haas, B.J., Sun, W., Braig, H.R., Perotti, M.A., Clark, J.M., Lee, S.H., Robertson, H.M., Kennedy, R.C. et al., “Genome Sequences of the Human Body Louse and its Primary Endosymbiont: Insights into the permanent parasitic lifestyle,” Proceedings of the National Academy of Sciences, 107(27):12168-12173, July 2010.

• Kennedy, R.C., Lane, K.E., Arifin, S. M. Niaz, Fuentes, A., Hollocher, H., Madey, G.R., “A GIS Aware Agent-Based Model of Pathogen Transmission,” International Journal of Intelligent Control and Systems, 14(1):51-61, March 2009. (invited)

• Nene, V., Wortman, J.R., Lawson, D., Haas, B., Kodira, C., Tu, Z.J., Loftus, B., Xi, Z., Megy, K., Grabherr, M., Ren, Q., Zdobnov, E.M., Lobo, N.F., Campbell, K.S., Brown, S.E., Bonaldo, M.F., Zhu, J., Sinkins, S.P., Hogenkamp, D.G., Amedeo, P., Arensburger, P., Atkinson, P.W., Bidwell, S., Biedler, J., Birney, E., Bruggner, R.V., Costas, J., Coy, M.R., Crabtree, J., Crawford, M., Debruyn, B., Decaprio, D., Eiglmeier, K., Eisenstadt, E., El-Dorry, H., Gelbart, W.M., Gomes, S.L., Hammond, M., Hannick, L.I., Hogan, J.R., Holmes, M.H., Jaffe, D., Johnston, J.S., Kennedy, R.C. et al., “Genome sequence of Aedes aegypti, a major arbovirus vector,” Science, 316(5832):1718-23, June 2007.

• Lawson, D., Arensburger, P., Atkinson, P., Besansky, N.J., Bruggner, R.V., Butler, R., Campbell, K.S., Christophides, G.K., Christley, S., Dialynas, E., Emmert, D., Hammond, M., Hill, C.A., Kennedy, R.C. et al., “VectorBase: a home for invertebrate vectors of human pathogens,” Nucleic Acids Research, 35(D503-505), January 2007.

• Kennedy, R.C., Lane, K.E., Fuentes, A., Hollocher, H., Madey, G., “Spatially Aware Agents: An effective and efficient use of GIS data with an Agent-based Model,” In proceedings of Agent-Directed Simulation (ADS 2009), Spring Simulation Multiconference 2009, San Diego, CA, March 2009.

The following manuscripts are under review or in preparation:

• Kennedy, R.C., Unger, M.F., Christley, S., Collins, F.H., Madey, G.R., “An automated homology-based approach for identifying transposable elements,” BMC Bioinformatics. (Under review)

• Lane, K.E., Kennedy, R.C., Miller, L.A., Madey, G., Hollocher, H., Fuentes, A., “Exploring the use of agent-based models in understanding patterns of pathogen transmission.” (In preparation)


APPENDIX A

AUTOMATED APPROACH WALKTHROUGH

In this appendix, we utilize our approach described in Chapter 3 to identify

the mariner Class II TE from P. humanus humanus. We show iterative results

along the way.

A.1 Representative Amino Acid Coding Regions

We begin with 26 transposase sequences from various mariner elements:

>gi|600840|gb|AAC46948.1| mariner transposase [Chrysoperla plorabunda]MEKKEFRVLIKYCFLKGKNTVEAKTWLDNEFPDSAPGKSTIIDWYAKFKRGEMSTEDGERSGRPKEVVTDENIKKIHKMILNDRKMKLIEIAEALKISKERVGHIIHQYLDMRKLCAKWVPRELTFDQKQQRVDDSERCLQLLTRNTPEFLRRYVTMDETWLHHYTPEFNRQSAEWTATGEPSPKRGKTQKSAGKVMASVFWDAHGIIFIDYLEKGKTINSDYYMALLERLKVEIAAKRPHMKKKKVLFHQDNAPCHKSLRTMAKIHELGFELLPHPPYSPDLAPSDFFLFSDLKRMLAGKKFGCNEEVIAETEAYFEAKPKEYYQNGIKKLEGRYNRCIALEGNYVE>gi|19570323|dbj|BAB86288.1| mariner transposase [Apis cerana]MQDQKEHFRHILLFYYRKGKNAVQARKKLCEIYGEGILTVRQCQNWFSKFRSDNFDIKDAPRSGRPVEADEDKIKALIEANRRITTREIATRLNLSNSTVHDHMKRLGFVSKLDIWVPHVLKEKDLLCRIDICDSLLKREENDPFLKRIVTGDEKWIVYDNIKRELNEPAQRTSKNNIHKKVMLSVWWDFKGVVFFELLPNNCTINSEVYCNQLDKLNNSIKQKRPELINRKGVVFHHDNAKAHMSLMTRQKLLQLGWEVLPHPPYSPDLAPSDYHLFRSLQNSLNDKTFTSNEDVKNYLDQFFANKDQKFYERGIMLLPKRWQYVLDHNGQYVIK>gi|5353885|gb|AAD42284.1| mariner transposase [Bombyx mori]WVPHELSEKNLNDRIIICTSLLAHNKIEPFLDRIITGYEKWITYENIIRKRAFYEPGKPAPSTSKPKLSLNKRMLCIWWNIRRPMHFELLKPNERLNSERHCQQFDKLKTALQEKRPAMFNRKDIILLHDNARPHAALGTRQKAAELG>gi|1698455|gb|AAC52011.1| mariner transposase [Homo sapiens]MNSAKIEARTNIKFMVKLGWKNGEITDALRKVYGDNAPKKSAVYKWITRFKKGRDDVEDEARSGRPSTSICEEKINLVRALIEEDRRLTAETIANTTDISIGSAYTILTEKLKLSKLSTRWVPKPLRPDQLQTRAELSME


ILNKWDQDPEAFLRRIVTGDETWLYQYDPEDKAQSKQWLPRGGSGPVKAKADWSRAKVMATVFWDAQGILLVDFLEGQRTITSAYYESVLRKLAKALAEKRPGKLHQRVLLHHDNAPAHSSHQTRAILREFRWEIIRHPPYSPDLAPSDFFLFPNLKKSLKGTHFSSVNNVKKTALTWLNSQDPQFFRDGLNGWYHRLQKCLELDGAYVEK>gi|1399036|gb|AAB17945.1| mariner transposase [Ceratitis capitata]MDNEKDHMLYEFRKGKTVGAATKDIREVYSDRAPALRTVKKWFAKFRSGDFNLEDRPRSGRPCELDNDVLRISVANNSRISTKEVASELNVNKPTAFRRLKKVGYTLKLDKWVPHQLSEKNKVDRMSTAISLLRRVKNEPFLDRLLTGDEKWILYNNVQRKRTWKQAHEGAEPMSKGGLHPMMVLLCIWWDIRGVIYFELLPAGETITANKYCQQLVELKKAIDEKRPIFANRKGVLFHYDNARPHVAKPTLAKLKEMNWEIMPHSPYSPDIAPSDYHLFRSLQNNLNGKKFKNVEDVKSHLDNFFNEKPRDFYESGIRKLVERWEWIAEHDGEYIID>gi|2564437|gb|AAC28162.1| mariner transposase [Glossina palpalis]NENQKNRRFEVSSSLLLRNNDDPFLNRIVTCDEKWILYDNRRRSAQWLDADEAPQHFPKPKLHQKKIMVTVWWSAVGLIHHSFLNPGETITAEKYCQQIDEMHQRLQQKQPALVNRKGPILLHDNARPHVSMITRQKLYELGYETLDHPP>gi|2564433|gb|AAC28160.1| mariner transposase [Pycnoscelus surinamensis]SDGLKCTRVEWCTEMLKRFNNGDSRRVSDIVTGEETWIYQFDLKTKCQSSVWVFPDEQPPTKVKRQRSVGKKMVATFFSKSGHLATVVLEDQRTVTVKWYTEVWLPQVFSKIQEKRPRTGLRGILLHHDNASSHTANATIAFLEKMPMKLMTHTA>gi|2564426|gb|AAC28158.1| mariner transposase [Plebeia frontalis]NAKNLHDRVTICTSLLARNKNDPFLDRIITGDEKWITYENIVRKRASCEPGQPAPSTFKPSLSLNKRMLCIWWEVQGPIHYVFLKPNEKLNSERYCQQMDDLNKELKKKRPAVFNRKHIILHHDNARPHTAFGTRQMIAELGWEILSHPP>gi|2564423|gb|AAC28157.1| mariner transposase [Stomoxys uruma]TFDQKQQRVDDSEWCLQLLTRNTPEFLHRYVTMDETFLHHYTPESNRQSAEWTAIGEPTPKRGKDQKSAGKVMASVFWDARGIIFIDYLEKGKTINSDYYMALLERLKVEIAAKRPHMKKKKVLFHQDNAPCHKSLRTMVKIHELGFELLPHPP>gi|2564419|gb|AAC28156.1| mariner transposase [Cryptolestes ferrugineus]TEKNMMDRISICEALTKRNKIDPFLKRMATGNEKWITYDNRVRKRSWSKSGEAPQTVVKPGLTARKVLLCIWWDWKGIIYYELLPYGQTLNSDLYCQQLYRLKIAIDHKRPELTNRRGVVFHQDNPRPHTSTVTRQKLRELGWEVLMHPP>gi|2564414|gb|AAC28155.1| mariner transposase [Delphinia picta]KEIHLTNRINACDMHLKRNEFDPFLKRIITGDEKWIVYNNVNRKRSWSKHGEPAQTTSKADIHQKKVMLSVWWDWKGVVYFELLPRNQTINSDVYCQQLDKLNAAIKEKRPELINRKGVIFHQDNARPHTSLMTRQKLGQLGWEVLMHPP>gi|2564412|gb|AAC28154.1| mariner transposase [Culex 
restuans]NDRQMENRKTVCEMLLQRFERKSFLHRIVTGDEKWIYYENPKRKKSWLSPGEAGPSTAKPNHFGRKTMLCVWWDQDGVVYHELLKPGETVNTARYRQQIINLNYALIEKRPEWARRHGKVILQHDNAPSHTAKPVKDALKTLNWEILSHPP>gi|2564407|gb|AAC28153.1| mariner transposase [Mantispa pulchella]TERQMENRKVTCEMLLQRYKRKSFLYRIVTGDEKWIYLENPKRKKSWVSPGEASTSTARPNRFGRKAMLCVWWDQTGVIYFELLKPGETVNAVRYQQQIKDLSRAIAENRPEYQERQKKVILLHDNAPSHKSKVVRDTLEKLQWEVLDHAA>gi|2564401|gb|AAC28151.1| mariner transposase [Bittacus strigosus]NDGQQENRKTTCEMLLARQKRKSFLHRIVTGDEKWIYFVNLKRKRSYVDPGQPAQLSPRPNRFGRKTMLCVFWDQRGVIWYELLKPGETVNGQRYQQQLANLNRALRQKRSEYETRHDKVIFLDDNAPSHRTKQTRELVE


SYSWQPLPHPP>gi|2564397|gb|AAC28150.1| mariner transposase [Tribolium madens]TLDEKKARVNWCKKMLTKFNNGQSNHVFDIVTGDETWIYRYEPETKRQSAQWVFPYEENPTKLKRPKSVGRKMIAAFFSRSGYIATIPLEDRKTVNANWYTSICLPQVFEKVREKRPRSEIILHHDNASSHTAGETLDFLNVSGIKIMTHPP>gi|2564394|gb|AAC28149.1| mariner transposase [Poecilia reticulata]SEANRQMRVDCCVTLLNRHNNEGILNRIITCDEKWILYDNRKRSSQWLNPGEPAKSCPKRKFTKKKLLVSVWWTSAGVVHYSFLKSGQTITADIYCQQLQTMMEKLAAKQPRLVNRSRPLLLQDNARPHTAQRTATKLEELQLECLRHPP>gi|2564369|gb|AAC28140.1| mariner transposase [Andrena erigeniae]SEENKRRRIDTAASLLSRFKRKSFLHKIIAGDEKWVLYDNPKRQKSWVSPGEPSTSMAKPSIHAKKVMLSIWWDFKGVIHYELLVPGKTITADYYQQQLMNLHDELERKRPFTGQGTRHVILQHDNARPHVAQGTRNTIYALGWEVMSHAA>gi|2564360|gb|AAC28136.1| mariner transposase [Atteva punctella]TERNLMNRVLICDSLLRRNETESFLKKLITGDETWITYDKNVRKRSWSKAGQASQTVAKPGLTRNKVMLCAWWDWKGIIHYELLPPGRTIDSELYCEQMMRLKQKAERKRPELINRRGVVFHHDNARPHTSIATQQKLREFGWGVLMHPP>gi|2564392|gb|AAC28148.1| mariner transposase [Nabis sp. HMR-1997a]TPQQSAKRLEICRNLLENPFDLRFCHRIVTCDEKWVYWRNPNTNKQWLDYGQTALPVAARGQFEKKSMLCVFWNFEGVIHHEFVPDGCSINSELYCEQLERLYSKISERYPALINRKGVLLQQDNARPHTSHRTKEKFTELHGFELLPHPP>gi|2564387|gb|AAC28147.1| mariner transposase [Epicauta funebris]SEKNLNDRVVICTSLLARNNVEPFLNRMITGDEKWITYENILRKRAYCESGKPSPSTSKPNLNLNKRMLCIWWDIRGPIHYELLKPNKKLNSEKYCQQLDNLTTAVQEKRPAMFNRRDIILHHDNARPHTALGTRQKIAELGWEILSHPP>gi|2564376|gb|AAC28142.1| mariner transposase [Buenoa sp. 
HMR-1997]TSDQKQQRIDDSEQCLKMFNRNKSEFLRRYVTMDETWLHHFTPESSRQSAEWTAYDEPNPKRAKTQQSAGKVMASVFLDAHGIIFIDYLEKGKTINSDYYIALLERLKDEIAEKRPHLKKKRVLFHQDNAPCHKSMKTMAKLNELGYELLPHPP>gi|1816499|gb|AAC47445.1| mariner transposase [Cymodusa distincta]TTRNLISRIEICDTLLKRNKMDPFLKRLITGDEKWIKYKNVKRKRSWLKPGEVPQTTTKPELTASKVMLSVWWDWKGIVYYEILEPGQTVDSGLYCQQLTRLQEAIQKKRPELVNRKSIEFHHDNARPHTSLMTRQKLTEFGWEILLHPP>gi|520556|gb|AAA20470.1| mariner transposase [Tetranychus urticae]PPGQMEHRVMACRFNLQMHRKTRELIQRTISIDETWVSLYMEPEKEQAKGWYYPDEQPEEVPRQNIHGNKRMLIMGMDYNGIAFFELLPEKTTVDGQTYKGFLERHVRHWLGTRASKHLWLLHDNARPHKHQVVREWLERHEITLWHHPP>gi|3093971|gb|AAC15448.1| mariner transposase [Heliothis subflexa]MLKLYENGTSNNINNIVSGDETWLYYFDVPSKNKNKVWLFENEQTPVQVRKSRSVKKKMIAVFFTRRGILERIVLESQRTVTASWYINDCLPKVFQKLQEIRPNSRMDTWHFHHDNASAHRARDTVEFLNTSGVKVLEHPAYTPDL>gi|520553|gb|AAA20469.1| mariner transposase [Metaseiulus occidentalis]SERQKEVRLTVCRELLSRYKNKSFLYRIITSDEKWIYYDNPGRKRSWVSPGEPAEKSVRRNRFGKKTMLCVWWDQRGVIYHELLKPGETVDTARYQQQLIDLNRAVKEKRPNWDQVRNRVILLHDNAPCHTSKPTQETLSALNWEVLTHPA


>gi|3142710|gb|AAC16889.1| mariner transposase [Heliothis virescens]KWRKKMLHMYENGTSNNINNIVTGDETWLYYFDLPSKNKNKVWLFENEQTPVQVRKSRSVKKKMIAVFFTRRGILERVLLESQRTVTASWYINECLPKVFQRLQEIRPNSRMDTWHFHHDNAPAHRARDTVEFLNSSGVRVLDHPAYPPDLPQ


A.2 Identify Coding Region

A.2.1 tblastn Search

We utilize the library of TEs from the previous section to perform a tblastn

search against the genome. Here, we present a subset of the tblastn hits.

TBLASTN 2.2.23+

Reference: Stephen F. Altschul, Thomas L. Madden, Alejandro A.
Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J.
Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of
protein database search programs", Nucleic Acids Res. 25:3389-3402.

Database: phumanus.CONTIGS-USDA.PhumU1.fa
           8,555 sequences; 108,367,968 total letters

Query= gi|600840|gb|AAC46948.1| mariner transposase [Chrysoperla
plorabunda]
Length=348

                                                             Score     E
Sequences producing significant alignments:                  (Bits)  Value

AAZO01005188.1 Pediculus humanus USDA contig 1103172084976    136    7e-32
AAZO01006015.1 Pediculus humanus USDA contig 1103172085458    132    1e-30
AAZO01003584.1 Pediculus humanus USDA contig 1103172096529    110    1e-28
AAZO01007101.1 Pediculus humanus USDA contig 1103172086085    122    1e-27
AAZO01003080.1 Pediculus humanus USDA contig 1103172096313    120    9e-27
AAZO01005603.1 Pediculus humanus USDA contig 1103172085202    100    3e-26
AAZO01001198.1 Pediculus humanus USDA contig 1103172094763    116    1e-25
AAZO01004437.1 Pediculus humanus USDA contig 1103172096885    114    4e-25
AAZO01001978.1 Pediculus humanus USDA contig 1103172095787    113    8e-25
AAZO01005816.1 Pediculus humanus USDA contig 1103172085330    112    1e-24
AAZO01001816.1 Pediculus humanus USDA contig 1103172095715    94.1   4e-24
AAZO01007534.1 Pediculus humanus USDA contig 1103172086338    111    4e-24
AAZO01007070.1 Pediculus humanus USDA contig 1103172086064    111    4e-24
AAZO01006787.1 Pediculus humanus USDA contig 1103172085910    111    4e-24
AAZO01005899.1 Pediculus humanus USDA contig 1103172085374    111    4e-24
AAZO01007995.1 Pediculus humanus USDA contig 1103172088564    110    9e-24
AAZO01006175.1 Pediculus humanus USDA contig 1103172085571    110    9e-24
AAZO01000215.1 Pediculus humanus USDA contig 1103172094998    110    1e-23
AAZO01003840.1 Pediculus humanus USDA contig 1103172096640    109    1e-23
AAZO01006286.1 Pediculus humanus USDA contig 1103172085643    108    2e-23
AAZO01005421.1 Pediculus humanus USDA contig 1103172085111    108    2e-23
AAZO01000288.1 Pediculus humanus USDA contig 1103172095038    108    2e-23
AAZO01007414.1 Pediculus humanus USDA contig 1103172086274    108    2e-23
AAZO01007386.1 Pediculus humanus USDA contig 1103172086255    108    3e-23
AAZO01006519.1 Pediculus humanus USDA contig 1103172090088    107    3e-23
AAZO01007033.1 Pediculus humanus USDA contig 1103172094563    107    4e-23
AAZO01007487.1 Pediculus humanus USDA contig 1103172086313    107    6e-23
AAZO01004198.1 Pediculus humanus USDA contig 1103172096805    107    6e-23
AAZO01003892.1 Pediculus humanus USDA contig 1103172096659    107    6e-23
AAZO01001012.1 Pediculus humanus USDA contig 1103172095359    106    1e-22
AAZO01001082.1 Pediculus humanus USDA contig 1103172095391    106    1e-22
AAZO01004190.1 Pediculus humanus USDA contig 1103172096798    106    1e-22
AAZO01003375.1 Pediculus humanus USDA contig 1103172096437    105    2e-22
AAZO01006313.1 Pediculus humanus USDA contig 1103172085657    104    3e-22
AAZO01003096.1 Pediculus humanus USDA contig 1103172096321    104    4e-22
AAZO01007094.1 Pediculus humanus USDA contig 1103172086080    104    5e-22
AAZO01008517.1 Pediculus humanus USDA contig 1103172093434    104    5e-22
AAZO01006528.1 Pediculus humanus USDA contig 1103172085776    102    2e-21
AAZO01005218.1 Pediculus humanus USDA contig 1103172084995    100    7e-21

> AAZO01005188.1 Pediculus humanus USDA contig 1103172084976Length=113412

Score = 136 bits (318), Expect = 7e-32, Method: Compositional matrix adjust.Identities = 106/329 (32%), Positives = 153/329 (46%), Gaps = 21/329 (6%)Frame = -1

Query 2 EKKEFRVLIKYCFLKGKNTVEAKTWLDNEFPDSAPGKSTIIDWYAKFKRGEMSTEDGERS 61+ F ++ + F KG N +A L D A +W+AKF+ G+ S ++ ERS

Sbjct 112767 QSEHFLHILLFYF*KGVNASQANKKLWVV*GDEALTERQCQNWFAKFRSGDFSLQNEERS 112588

Query 62 GRPKEVVTDENIKKIHKMILNDRKMKLIEIAEALKISKERVGHIIHQYLDMRKLCAK-WV 120GR EV DE IK +I DR +I L +S V + + +KL A W

Sbjct 112587 GRQLEV-KDEQIKA---LIDYDRYSSTKDIVKKLDVSHTCVKNRLRRLGCQKKLDALLW- 112423

Query 121 PRELTFDQKQQRVDDSERCLQLL--TRNTPEFLRRYVTMDETWLHHYTPEFNRQSAEWTA 178V+++ L+ T+ F R VT DE W + +F R+ + W

Sbjct 112422 ---------GTLVNEATWSLRYAS*TQCK*PFFERMVTGDEKWVVY--DDFLRKRS-WFR 112279

Query 179 TGEPSPKRGKTQKSAGKVMASVFWDAHGIIFIDYLEKGKTINSDYYMALLERLKVEIAAK 238G + + + KV+ S WD GI+ + L + +TINS+ Y+ L L I K

Sbjct 112278 QGNRHQQLLRLTFTKKKVLLSFWWDYKGIVNFELLPRCQTINSEVYIRQLTNLSDTIQEK 112099

Query 239 RPHMKKKK-VLFHQDNAPCHKSLRTMAKIHELGFELLPHPPYSPDLAPSDFFLFSDLKRM 297RP + K + FH+ NA +L T K+ ELG L HPPYSP LAP ++ F LK

Sbjct 112098 RPELANSKGIVFHHHNARPSLTLATGQKLLELGWNVLLHPPYSPKLAPNNYHFFRFLKNF 111919


Query 298 LAGKKFGCNEEVIAETEAYFEAKPKEYYQ 326L G+KF + EV E +F K KE Y+

Sbjct 111918 LNGQKFQNDNEVKTALEQFFAPKTKELYE 111832


Score = 132 bits (308), Expect = 1e-30, Method: Compositional matrix adjust.Identities = 94/288 (32%), Positives = 138/288 (47%), Gaps = 21/288 (7%)Frame = +1

Query 43 DWYAKFKRGEMSTEDGERSGRPKEVVTDENIKKIHKMILNDRKMKLIEIAEALKISKERV 102+W+AKF+ G+ S ++ ERSGR EV DE IK +I DR I L +S+ V

Sbjct 96649 NWFAKFRSGDFSLQNEERSGRQLEV-KDEQIKA---LIDYDRHSSTKYIIKKLDVSRTCV 96816

Query 103 GHIIHQYLDMRKLCAK-WVPRELTFDQKQQRVDDSERCLQLL--TRNTPEFLRRYVTMDE 159+ + +KL A W V+++ L+ T F R VT DE

Sbjct 96817 KNCLRRLECQKKLDALLW----------GTLVNEATWSLRYAS*TECK*PFFERMVTEDE 96966

Query 160 TWLHHYTPEFNRQSAEWTATGEPSPKRGKTQKSAGKVMASVFWDAHGIIFIDYLEKGKTI 219W + +F R+ + + G +P K K+++S WD GI+ + L + +TI

Sbjct 96967 KWVVY--DDFLRKKS*-SRQGKQAPTTSKVDIKQKKILSSFWWDYKGIVNFELLPRCQTI 97137

Query 220 NSDYYMALLERLKVEIAAKRPHMKKKK-VLFHQDNAPCHKSLRTMAKIHELGFELLPHPP 278NS+ Y+ L L I KRP + K + FH+ NA +L T K+ ELG L HP

Sbjct 97138 NSEVYIRQLTNLNDTIQEKRPELANSKGIVFHHHNARPSPTLATGQKLLELGWNVLLHPS 97317

Query 279 YSPDLAPSDFFLFSDLKRMLAGKKFGCNEEVIAETEAYFEAKPKEYYQ 326YSP L P ++ F LK L G+KF + EV + +F K KE+Y+

Sbjct 97318 YSPKLPPNNYHFFRSLKNFLNGQKFQNDNEVKTALDQFFAPKTKEFYE 97461


Score = 110 bits (254), Expect(3) = 1e-28, Method: Compositional matrix adjust.Identities = 55/145 (37%), Positives = 78/145 (53%), Gaps = 1/145 (0%)Frame = -2

Query 183 SPKRGKTQKSAGKVMASVFWDAHGIIFIDYLEKGKTINSDYYMALLERLKVEIAAKRPHM 242+P K K+++S WD GI+ + L + +TIN + Y+ L L I KR +

Sbjct 16786 APTTSKVDIKQKKILSSFWWDYKGIVNFELLPRNQTINLEVYIRQLTNLNDTIQEKRLEL 16607

Query 243 KKKK-VLFHQDNAPCHKSLRTMAKIHELGFELLPHPPYSPDLAPSDFFLFSDLKRMLAGK 301+K + FH+DNA SL T K+ ELG + L HPPYSP LAP ++ F LK L G+

Sbjct 16606 ANRKGIVFHHDNARPSPSLATGQKLLELGWDVLLHPPYSPKLAPNNYHFFRSLKNFLNGQ 16427

Query 302 KFGCNEEVIAETEAYFEAKPKEYYQ 326KF + EV +F K KE+Y+


Sbjct 16426 KFQNDNEVKTALNQFFAPKTKEFYE 16352

Score = 31.8 bits (67), Expect(3) = 1e-28, Method: Compositional matrix adjust.Identities = 21/54 (38%), Positives = 29/54 (53%), Gaps = 4/54 (7%)Frame = -1

Query 45 YAKFKRGEMSTEDGERSGRPKEVVTDENIKKIHKMILNDRKMKLIEIAEALKIS 98+AKF G+ S + E SG EV D++ +K I NDR +IAE L +S

Sbjct 17150 FAKFYSGDFSLKNEECSGCLVEV--DDDQRK--AVIVNDRHSSTRDIAEKLDVS 17001

Score = 25.1 bits (51), Expect(3) = 1e-28, Method: Compositional matrix adjust.Identities = 21/68 (30%), Positives = 32/68 (47%), Gaps = 11/68 (16%)Frame = -3

Query 117 AKWVPRE---LTFDQKQQRVDDSERCLQLLTRNTPE-FLRRYVTMDETWLHHYTPEFNRQ 172W+P+E +T +Q C LL RN + FL+ VT DE W + +F R+

Sbjct 16974 VSWIPKEACCITLGSVRQL----GLCDMLLKRNANDPFLKEMVTGDEKWVVY--DDFLRK 16813

Query 173 SAEWTATG 180+ W+ G

Sbjct 16812 RS-WSRQG 16792

...

Matrix: BLOSUM90
Gap Penalties: Existence: 10, Extension: 1
Neighboring words threshold: 13
Window for multiple hits: 40
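
A search like the one above could be launched from a script. The Python sketch below only assembles a BLAST+ `tblastn` command line using the scoring parameters reported in the output (BLOSUM90, gap existence 10, gap extension 1). The query file name `mariner_transposases.fa` and the output file name are hypothetical, the database name is taken from the output header, and the exact command used for this walkthrough is not given in the text.

```python
import subprocess


def build_tblastn_command(query_fasta, database, out_path):
    """Assemble a tblastn command with the scoring settings shown above."""
    return [
        "tblastn",
        "-query", query_fasta,   # protein queries (transposase library)
        "-db", database,         # nucleotide BLAST database of genome contigs
        "-matrix", "BLOSUM90",
        "-gapopen", "10",
        "-gapextend", "1",
        "-out", out_path,
    ]


def run_tblastn(query_fasta, database, out_path):
    """Run the search; requires BLAST+ to be installed and on the PATH."""
    subprocess.run(build_tblastn_command(query_fasta, database, out_path), check=True)


if __name__ == "__main__":
    print(" ".join(build_tblastn_command(
        "mariner_transposases.fa",           # hypothetical query file name
        "phumanus.CONTIGS-USDA.PhumU1.fa",   # database name from the output header
        "mariner_vs_phumanus.tblastn.txt",   # hypothetical output file name
    )))
```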


A.2.2 Extract Sequences from the Genome

After the tblastn search, we combine hits within 50 bp that originate from

the same query sequence. When extracted from the genome, we include the in-

tervening sequence. We show a subset of the extracted sequences following these

steps below:

>AAZO01000215.1-0 40432 40911 fATGGGGAGCCAAAGCGAGCATTTCCTCCACATTTTACTTTTTTATTTCTGAAAGGGTGTTAATGCTTCCCAGGCCAATAAAATGTTGTGAGCTGTGTAGGGGGATGAAACCTTAATAGAACGGCAGTGTCAAAACTGGTTTGCGAAATTCCGTTCTGGAGATTTTCCTTTAAAAAATGAGGAGCGCTCCGGGCGTCCATTGGAGGTTAACGATGAGCAAATAAAGGCCCTCATTGATTATGATCGGCATAGTTCGACGAAGGACATTGTAAAGAAGCTAGATGTGTCACATACGTGCGTCAAAAACTGTCTGCGGCGTCTTGGGTGCCAAAAAAAGCTTTATGCGTTACTTTGGGGAACGTTAGTTAACGAGGCGACTTGGTTTTTGCGATATGCTTCTTAAACGCAATGCGAATGGCCCTTTTTTGAAAGAATGGTCACCGGAGATGAAAAGTGGGTTGTCTATGATGACTTTTTGAGA>AAZO01000215.1-1 40432 40923 fATGGGGAGCCAAAGCGAGCATTTCCTCCACATTTTACTTTTTTATTTCTGAAAGGGTGTTAATGCTTCCCAGGCCAATAAAATGTTGTGAGCTGTGTAGGGGGATGAAACCTTAATAGAACGGCAGTGTCAAAACTGGTTTGCGAAATTCCGTTCTGGAGATTTTCCTTTAAAAAATGAGGAGCGCTCCGGGCGTCCATTGGAGGTTAACGATGAGCAAATAAAGGCCCTCATTGATTATGATCGGCATAGTTCGACGAAGGACATTGTAAAGAAGCTAGATGTGTCACATACGTGCGTCAAAAACTGTCTGCGGCGTCTTGGGTGCCAAAAAAAGCTTTATGCGTTACTTTGGGGAACGTTAGTTAACGAGGCGACTTGGTTTTTGCGATATGCTTCTTAAACGCAATGCGAATGGCCCTTTTTTGAAAGAATGGTCACCGGAGATGAAAAGTGGGTTGTCTATGATGACTTTTTGAGAAAAAGATCCTGG>AAZO01000215.1-14 40902 41231 fCTTTTTGAGAAAAAGATCCTGGTCTAGGCAAGGAAAGAGGCACCAACAACTTCTTTGAATGACATTCACCAAAAAAAGGTATTGTTATCATTTTGGTGGGATTACAAAGGCATAGTCTACTTTGAGCTGCTGCCACGAAATCAGATCATAAATTCAGAGGTTTACATTCGACAATTGACAAATTTAAATGATACCATCCAAGAAAAACGATCGGAGCTAGCCAATAGCAAAGAAATTGTCTTTCACCACGATAATGCCAGGCCCTCCCCATCTTTAGCCACTGGACAAAAACTACTGGAGCTAGGCTGGAATGTTTTGCTGCACCCTCCA>AAZO01000215.1-15 40929 41231 fGCAAGGAAAGAGGCACCAACAACTTCTTTGAATGACATTCACCAAAAAAAGGTATTGTTATCATTTTGGTGGGATTACAAAGGCATAGTCTACTTTGAGCTGCTGCCACGAAATCAGATCATAAATTCAGAGGTTTACATTCGACAATTGACAAATTTAAATGATACCATCCAAGAAAAACGATCGGAGCTAGCCAATAGCAAAGAAATTGTCTTTCACCACGATAATGCCAGGCCCTCCCCATCTTTAGCCACTGGACAAAAACTACTGGAGCTAGGCTGGAATGTTTTGCTGCACCCTCCA>AAZO01000215.1-16 40929 41441 f


GCAAGGAAAGAGGCACCAACAACTTCTTTGAATGACATTCACCAAAAAAAGGTATTGTTATCATTTTGGTGGGATTACAAAGGCATAGTCTACTTTGAGCTGCTGCCACGAAATCAGATCATAAATTCAGAGGTTTACATTCGACAATTGACAAATTTAAATGATACCATCCAAGAAAAACGATCGGAGCTAGCCAATAGCAAAGAAATTGTCTTTCACCACGATAATGCCAGGCCCTCCCCATCTTTAGCCACTGGACAAAAACTACTGGAGCTAGGCTGGAATGTTTTGCTGCACCCTCCATATAGTCCCAAACTAGCTCCAAATAATTATCATTTTTTCCGATTCCTAAAAAATTTTTTAAACGGAGAAAAATTCCAAAACGACAATGAGGTCAAAACTGCATTGGAGCAGTTTTTTGCTCCTAAAACTAAAGAGTTTTATGAAAAAAGGAAAATGATACTACCCAAAAAATGTCAAAAGGTCACCAATAATAATGGACATAATATAATA>AAZO01000215.1-17 40935 41207 fAAAGAGGCACCAACAACTTCTTTGAATGACATTCACCAAAAAAAGGTATTGTTATCATTTTGGTGGGATTACAAAGGCATAGTCTACTTTGAGCTGCTGCCACGAAATCAGATCATAAATTCAGAGGTTTACATTCGACAATTGACAAATTTAAATGATACCATCCAAGAAAAACGATCGGAGCTAGCCAATAGCAAAGAAATTGTCTTTCACCACGATAATGCCAGGCCCTCCCCATCTTTAGCCACTGGACAAAAACTACTGGAGCTAGGC>AAZO01000215.1-18 40938 41231 fGAGGCACCAACAACTTCTTTGAATGACATTCACCAAAAAAAGGTATTGTTATCATTTTGGTGGGATTACAAAGGCATAGTCTACTTTGAGCTGCTGCCACGAAATCAGATCATAAATTCAGAGGTTTACATTCGACAATTGACAAATTTAAATGATACCATCCAAGAAAAACGATCGGAGCTAGCCAATAGCAAAGAAATTGTCTTTCACCACGATAATGCCAGGCCCTCCCCATCTTTAGCCACTGGACAAAAACTACTGGAGCTAGGCTGGAATGTTTTGCTGCACCCTCCA>AAZO01000215.1-19 40938 41231 fGAGGCACCAACAACTTCTTTGAATGACATTCACCAAAAAAAGGTATTGTTATCATTTTGGTGGGATTACAAAGGCATAGTCTACTTTGAGCTGCTGCCACGAAATCAGATCATAAATTCAGAGGTTTACATTCGACAATTGACAAATTTAAATGATACCATCCAAGAAAAACGATCGGAGCTAGCCAATAGCAAAGAAATTGTCTTTCACCACGATAATGCCAGGCCCTCCCCATCTTTAGCCACTGGACAAAAACTACTGGAGCTAGGCTGGAATGTTTTGCTGCACCCTCCA
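
The merging step described at the start of this section can be sketched in Python as follows. Hits from the same query on the same contig and strand are sorted by position and joined whenever the gap between them is at most 50 bp, and the merged interval keeps the intervening sequence. The tuple layout `(query, contig, start, end, strand)`, the 1-based inclusive coordinates, and the function names are assumptions for illustration; reverse-strand hits would additionally need to be reverse-complemented after extraction, which is not shown.

```python
from collections import defaultdict


def merge_hits(hits, max_gap=50):
    """Merge hits from the same query, contig, and strand that lie within max_gap bp."""
    groups = defaultdict(list)
    for query, contig, start, end, strand in hits:
        groups[(query, contig, strand)].append((start, end))

    merged = []
    for (query, contig, strand), intervals in sorted(groups.items()):
        intervals.sort()
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start - cur_end <= max_gap:
                # Close enough (or overlapping): extend the current interval.
                cur_end = max(cur_end, end)
            else:
                merged.append((query, contig, cur_start, cur_end, strand))
                cur_start, cur_end = start, end
        merged.append((query, contig, cur_start, cur_end, strand))
    return merged


def extract(contig_sequence, start, end):
    """Return the subsequence for 1-based inclusive coordinates."""
    return contig_sequence[start - 1:end]
```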


A.2.3 CAP3 Assembly

The extracted sequences are fed into the CAP3 assembler, which produces contigs
and singletons. The contigs file is accompanied by a quality score file that
gives the quality of each base in each contig. We use the quality scores to trim
each contig so that it encompasses the TE without irrelevant adjacent sequence.
In the following sections, we show the raw CAP3 contigs and their accompanying
quality file. In this case, the quality scores of each contig never stay below
the threshold long enough to trigger trimming; therefore, the contigs are not
trimmed.
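The trimming rule described above can be illustrated with a minimal Python sketch. The text does not give the threshold or the length of the low-quality stretch that triggers trimming, so `min_quality`, `min_run`, and the function name `trim_contig` are hypothetical. The sketch also assumes that trimming applies only at the contig ends: an end is removed only if the contig begins (or ends) with at least `min_run` consecutive scores below `min_quality`.

```python
def trim_contig(sequence, qualities, min_quality=20, min_run=10):
    """Trim low-quality ends from a contig.

    An end is trimmed only when it consists of at least `min_run`
    consecutive scores below `min_quality` (both values are assumptions,
    not TESeeker's documented settings).
    Returns (trimmed_sequence, start, end) with 0-based, end-exclusive bounds.
    """
    if len(sequence) != len(qualities):
        raise ValueError("sequence and quality lengths differ")

    def low_run_length(scores):
        # Count how many leading scores fall below the threshold.
        run = 0
        for q in scores:
            if q >= min_quality:
                break
            run += 1
        return run

    start = low_run_length(qualities)
    if start < min_run:
        start = 0  # leading run too short: keep this end untouched

    tail = low_run_length(reversed(qualities[start:]))
    end = len(qualities) - tail if tail >= min_run else len(qualities)
    return sequence[start:end], start, end
```

Under this rule, isolated low scores inside a contig (such as the occasional values of 4 or 11 in the listings that follow) do not cause trimming, which is consistent with the contigs in this example being left unchanged.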

A.2.3.1 CAP3 Contigs

>Contig1ATGGGGAGCCAGAGCGAGCATTTCCTCCACATTTTACTTTTTTATTTTTGAAAGGGTGTTAATGCTTCCCAGGCCAATAAAAAGTTGTGGGTCGTGTAGGGGGATGAAGCCTTAATAGAACGGCAGTGTCAAAACTGGTTTGCGAAATTCCGTTCTGGAGATTTTTCTTTGCAAAATGAGGAGTGCTCCGGGCGCCAATTGGAGGTTAAAGATGAGCAAATAAAGGCCCTCATTGATTATGATCGGCATAGTTCGACTAAGGACATTGTAAAGAAGCTAGATGTGTCACATACGTGCGTCAAAAACCGTCTGCGGCGTCTTGGGTGCCAAAAGAAGCTTGATGCGTTACTTTGGGGAACGTTAGTTAACGAGGCGACTTGGTCTTTGCGATATGCTTCTTAAACGCAATGCAAATGACCCTTTTTTGAAAGAATGGTCACCGGAGATGAAAAGTGGGTTGTCTATGATGACTTTTTGAGAAAAAGATCCTGGTTTAGGCAAGGAAACAGGCACCAACAACTTCTAAGGCTGACATTCACCAAAAAAAGGTATTGTTATCATTTTGGTGGGATTACAAAGGCATAGTCTACTTTGAGCTGCTGCCACGATCAGACCATAAATTCAGAGGTTTACATTCGACAATTGACTAATTTAAATGATACCATCCAAGAAAAACGACCGGAGCTAGCCAATAGCAAAGGAATTGTCTTTCACCACCATAATGCCAGGCCCTCCCCATCTTTAGCCACTGGACAAAAACTACTGGAGCTAGGCTGGAATGTTTTGCTGCACCCTCCATATAGTCCCAAACTAGCTCCAAATAATTATCATTTTTTCCGATTCCTAAAAAATTTTTTAAACGGACAAAAATTCCAAAACGACAATGAGGTCAAAACTGCATTGGAGCAGTTTTTTGCTCCTAAAACTAAAGAGTTGTATAAAAAAAGGAAAATGATACCACCCGAAAAATGTCAAAAGGTCACTAATAATAATAAACATAATATAATAGAT>Contig2ATGGGGAGCCAAAGCGAGCATTTCCTTCACATTTTGCTTTTTTATTTCTGAAAGGGTGTTAATGCTTCACAAGCCAATAAAAAGTTGTGGGCCGTGTAGGGGGATGAAACCTTAATAGAACGGCAGTGTCAAAACTGGTTTGCGAAATTCCGTTCTGGAGATTTTCCTTTGAAAAATGAGGAGTGCTCCGGGCGTCCATTGGAGGTTAATGATGAGCAAATAAAGGCCCTCATTGATTATGATCGGCATAGTTCGACTAAGGACATTGTAAAGAAGCTAGATGAGTCACATACGTGCGTC


AAAAACCGTCTGCGGCGTCTTGGGTACCAAAAGAAGCTTGATGGGTCTTTGCGATATGCTTCTTAAATGCAATGCGAATGACCCTTTTTTGAAAGAATGGTCACCGGAGATGAAAAGTGGGTTGTTTATGATGACTTTTTGAGAAAAAGATCCTGGTCTAGGCAAGGAAACAGGCACCTTAGAACTTCTAAGGCTGACATTCAGCAAAAAAGTTATTGTTATCATTTTGGTGGGATTACAAAGGCATAGTCAACTTTGAGCTGCTGCCACGAAGTCAGACCATAAATTCAGAGGTTTACATTCGACAATTGACAAATTTAAATGATACCATCCAAGAAAAACGATCGGAGCTAGCCAATAGCAAAGGAATTGTCTTTCACCACGATAATGCCAGGCCCTCCCCATCTTTAGCCACTGGACAAAAACTACTGGAGCTAGGCTGGGATGTTTTGCTGCACCCTCCATATAGTCCCAAACTAGCTCCAAATAATTATCATTTTTTCCGATCCCTAAAAAATTTTTTGAACGGACAAAAATTCCAAAACGACAATGAGGTCAAAACTGCATTGGACCAGTTTTTTGCTCCTAAAACTAAAGAGTTTTATGAAAAAAGGAAAATGATT>Contig3ATGGGGAGCCAAAGCGAGCATTTCCTCCACATTTTGCTTTTTTATTTCTGAAAGGGTGTTAATGCTTCACAAGCCAATAAAAAGTTGTGGGCCGTGTAGGGGGATGAAGCCTTAATAGAACGGCAGTGTCAAAACTGGTTTGCGAAATTCCGTTCTGGAGATTTTCCTTTGAAAAATGAGGAGCGCTCCGGGCGTCCATTGGAGGTTAATGATGAGCAAATAAAGGCCCTCATTGATTATGATCGGCATAGTTCGACAAGGGACATTGTAAAGAAGCTAGATGTGTCACATACGTGCGTCAAAAACCGTCTGCGACGTCTTGGGTACCAAAAGAAGCTTGATGCGTTACTTTGGGGAACGTTAGTTAACGAGGCGACTTGGTCTTTGCGATATGCTTCTTAAACGCAATGCGAATGACCCTTTTTTGAAAGAATGGTCATCGAAGATGAAAAGCGGGTTGTTTATGATGACTTTTTGAGAAAAAGATCCTGGTCTAGGTAAGGAAACAGGCACCTTAGAACTTCTAAGGCTGACATTCACCAAAAAAGTTATTGTTATCATTTTGGTGGGCTTACAAAGGCATAGTCAACTTTGAGCTGCTGCTGCCACGAAGTCAGATCATAAATTCAGAGGTTTACATTCGACAATTGACAAATTTAAATGATACCATCCAAGAAAAACGATCATAGCTAGCCAATAGCAAAGGAATTGTCTTTCACTATGATAATGCCAGGCCCTCCCCATCTTTAGCCACTGGACAAAAACTACTGAAGCTAGGCTGGAATGTTTTGCTGCATCCTCCTTAAAGTCCCGACCTAACTCGAAGTGAGAATCATTTTTCCCGATTCCTAAAAAATTTTTTGAACGGACAAAAATTCCAAAACGACAATGAGGTCAATTGTACTACATTGGACCAGTTTTTTGCT>Contig4ATGGGGAGCCAGAGCGAGCATTTCCTCCACATTTTACTTTTTTATTTTTGAAAGAGTGTTAATGCTTCCCAGGCCAATAAAAAGTTGTGGGTCGTGTAGGGGGATGAAGCTTTAACAGAACGGCAGCGTCAAAACTGGTTTGCGAAATTCCGTTCTGGAGATTTTTCTTTGCAAAATGAGGAGCGCTCCGGGCGCCAATTGGAGGTTAAAGATGAGCAAATAAAGGCCCTCATTGATTATGATCGGCATAGTTCGACGAAGTACATTATAAAGAAGCTAGATGTGTCACGTACGTGCGTCAAAAACTGTCTGCGGCGTCTTGAGTGCCAAAAAAAGCTTGATGCGTTACTTTGGGGAACGTTAGTTAACGAGGCGACTTGGTCTTTGCGATATGCTTCTTAAACGGAGTGCAAATGACCCATTTTTGAAAGAATG
GTCACCGAAGATGAAAAGTGGGTTGTCTATGATGACTTTTTGAGGAAAAAATCCTGATCTAGGCAAGGGAAACAGGCACCAACAACTTCTAAGGTTGACATAAAGCAAAAAAAGATCTTGTCATCATTTTGGTGGGATTACAAAGGCATAGTCAACTTTGAACTGCTGCCACGATGTCAGACCATAAATTCAGAGGTTTACATTCGACAATTGACAAATTTAAATGATACCATCCAAGAAAAACGACCGGAACTAGCCAATAGCAAAGGAATTGTCTTTCACCACCATAATGCCAGGCCCTCCCCAACTTTAGCCACTGGACAAAAACTACTGGAGCTAGGCTGGAATGTTTTGCTGCACCCTTCATATAGTCCCAAACTACCTCCAAATAATTATCATTTTTTCCGATCCCTAAAAAATTTTTTGAACGGACAAAAATTCCAAAACGACAATGAGGTCAAAACTGCATTGGACCAGTTTTTTGCTCCTAAAACTAAAGAGTTTTATGAAAAAAGGAAAATGATA


CTACCCGAAAAATGTCAAAAGGTCACTAATAATAATAAACATAATATAATA>Contig5TTATTTTTGAAAGGGTGTTAATGCTTCACAAGTTCATAAAAAGTTGTGGGCTGTGTATGTACGGTGATAAAGCCTTAATAGAACGGCAGTGTCAAAACTGCTTTGAGAAATACAGTTCTGGAGATTTTCCTTTGAAAAATGAGAACCGCTCCAGGCATCCCGTGGAGGTGAATGTCAGTCATAAATAAAGGTTCTCATTGATTATGATCGGCATAATTCGACTATGGACATTGTAAAGAAGCTAGATGTGTCACATACGTGCGTCAAAAACCGTCTACGGCGTCTTGGGTGCCAAAAAAAGCTTGATGCGTTACTTTGGGGAACGTTAGTTAACGAGGCGACTTGGTCTTTGCGATATGCTTCTTAAACGCAATGCGAATGGCCCTTTTTTGAAAGAATGGTCAGCGGAGATGAAAAGTGCATTGTCTATGATGACTTTTTGAGAAAAAAGATCCTGGTCTAGGCAAGGAAACAGGCACCAACAACTTCTAAGACTGACATTCACCAAAAAAAGGTATTGTTATCATTTTGGTGGGATTACAAAGGCATAGTCAACTTTGAGCTGCTGCCACGAAGTCAGACCATAAATTCAGAGGTTTACATTCGACAATTGACAAATTTAAATGATACCATCCAAGAAAAACGACCGGAACTAGCCAATAGCAAAGGAATTGTCTTTCATTACCATAATGCCAGGCCCTCCCCATCTTTAGCCACTGGACAAAAACTACTGGAGCTAGGCTGGGATGTTTTGCTGCACCCTCCATATAGTCCCAAACTAGCTCCAAATAATTATCATTTTTTCCGATCCCTAAAAAATTTTTTGAACGGACAAAAATTCCAAAACGACAATGAGGTCAAAACTGCATTGGACCAGTTTTTTGCTCCTAAAACTAAAGAGTTTTATGAAAAAAGGAAAATGATACTACCCGAAAAATGTCAAAAGGTCATTAATAATAATGGACATAATATAATAGAT>Contig6ATGGAGAGCCAAAGCGAGCATTTCCTCCACATTTTCGTTTTTTATTTTTGAAAGGGTGTTAATGCTTCACAAGCCAATACAAAGTTGTGGGCTGTGTAGGGTGATGAAGCTTTAATAGAACTGCAGTGTCAAAACTGGTTTGCGAAATTCCGTTCTGGAGATTTTTCTTTGAAAGATAAGGAGCGCTCTGGGGGTCCAGTGGAGGTCGATGATGACCAAATAAAGGCCCTAATTGTTAATGATCGGCATAGTTCGACAAGGGACATTGCAAAGAAGCTAGATGTGTCACATAAGTGCGTCCAAAACCGTCTGCGGCGTCTTGGGTACCAAAAGAAGCTTGATGCATAACCTTGGGGAGTGTGAGTTAACGAGGCGACTTGATCTTTGCGATATGCTTCTTAAACGCAATGCGAATGACCCCTTTTTTGAAAGAATGGTCACCGGAGATGAAAAGTGGGTTGTCTATGACAACATTTTGAGGAAAAGATCCTGGTCTAGGCAAGGGAAACAGGCACCAACAACTTCTAAGGCTGACATTAACCAAAAAAAGGTACTGTTATCAGTTTGGTGGGATTACAAAGGCATAGTCTACTTTGAGCTGCTGCTGCCACGAAATCAGACCATAAATTCAGAGGTTAATATTTGACAATTGAGAAATTTAAATGATACCATCCAAGAAAAACGACCGGAACTAGCCAATAGAAAAGGAAGAATTGTCTTTCAACACGATAATGCCAGGCCCTCCCCATCTTTAGCCACTGGACAAAAACTACTGGAGCTAGGCTGGAATGTTTTGCTGCACCCTCCTTATAGTCCCGACCTAGCTCAAAGCGAGTATCATTATTTCCGATCACTAAAAAATTTTTTGAACGGACAAAAATTCCAA>Contig7CTAGATGTGTCACATACGTACGTCGAAAACCGTCTGCGGCGTCTTGGGTACCAAAAGAAG
CACAGGAGATGAAAAGTGGGTTGTCTATGACAACATTTTAAGGAAAAGATCCTGGTCTAGGCAAGGGAAACAGGCACCAACAACTTTTAAGGCTGACATAAACCAAAAAAAAGTATTGTTATCATTTTGGTGGGATTACAAAGGCATAGTTTACTTTGAGCTGTTCCTACAAAGTCAGACCATAAATTCAGAGGTTAACATTCAACAATTGACGAATTTAAATAATGCCATCCAAGAAAAACGACCGGAGCTAGCCAATAGAAAAGGAATTGTCTTTCATCACAATAATGCCAGGCTCCACACATCTTTAGACATCAGACAAAAACTACTGGAACTAGGCTGGGTTGTTTTGCCGCATCCTTCTTATAGTCCCAACTTAGCTCGAAGTGAGTTACATGTGTTTCGATCACTAAAAAAATTTTTGAACGGACAAAAAATCCAAATGAGGTCAAAACTGCATTGGAGCAGTTTTTTGCTCCT


AAAAATAAAGAGTTTTATAAAACATGGGATGATGATACTACCCGAAAAATGGCGAAAGATCATTGATAACAATGGACATAATATAATG


A.2.3.2 CAP3 Contigs Quality Scores

>Contig197 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 82 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 11 97 97 97 97 8297 97 97 97 97 97 97 97 82 97 97 82 97 82 82 82 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 82 82 97 97 97 97 97 97 9797 97 17 97 97 97 97 97 97 97 97 97 97 97 97 87 97 97 97 9782 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 11 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 82 97 97 97 11 97 97 97 17 97 9797 17 97 97 97 97 97 97 97 97 97 97 17 97 97 97 97 97 97 9797 97 97 17 97 97 97 97 97 97 97 97 17 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 17 97 97 9797 97 97 97 97 97 97 97 97 17 97 97 82 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 4 497 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 4 4 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 82 97 97 9797 82 97 97 97 97 97 97 97 97 97 97 97 97 82 97 97 97 97 9782 82 97 97 97 97 97 82 82 97 97 82 97 97 97 82 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 82 97 97 97 97 97 82 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 82 97 97 97 97 9797 97 97 97 97 97 97 11 17 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 82 82 82 97 97 97 97 97 9782 97 97 97 97 97 97 97 97 97 97 82 97 97 97 97 97 97 97 9797 97 97 97 97 82 97 11 82 97 97 97 97 97 97 97 97 97 82 8282 82 82 82 82 82 82 82 97 97 97 82 97 17 97 97 97 97 97 9797 97 97 82 17 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 82 82 82 97 97 11 97 97 97 97 97 97 97 17 97 82 97 9797 97 97 97 97 82 82 97 97 97 97 97 97 97 97 97 
97 97 17 9782 97 97 11 97 97 97 82 97 97 97 97 97 97 97 82 97 97 97 9782 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 82 17 97 9797 97 97 82 97 97 97 97 97 97 97 97 97 97 4 97 4 97 4 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97


97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 497 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 4 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 95 97 97 97>Contig297 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 30 97 97 4 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 70 97 97 97 97 97 97 50 97 97 97 70 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 70 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 7097 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 90 70 9797 97 97 97 97 97 97 97 97 97 70 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 70 97 97 97 97 97 97 97 97 97 90 97 9797 90 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 80 50 97 97 97 97 97 80 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 90 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 42 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 42 97 9797 97 97 42 97 97 97 97 97 97 97 42 97 42 42 97 97 97 97 9742 97 97 97 97 
97 97 97 97 97 97 97 97 97 97 97 97 97 40 4297 40 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 71 97 97 97 97 97 97 97 97 15 97 97 5 42 97 97 597 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 84 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 74 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 77 77 97 77 97 97 77 97 97 97 97 97 9797 97 97 77 97 97 77 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 30 97 97 97 97 97 97 97 97 97 97 97 97 87 97 97


97 97 97 97 97 97 48 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 77 97 97 97 97 97 97 77 77 77 7777 60 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 77 50 97 97 82 97 9797 97 97 97 97 97 97 97 97 97 97 82 97 62 97 97 97 97 97 9797 97 97 52 90 90 90 65 90 90 90 90 67 38 90 67 90 90 90 5267 90 90 67 90 90 67 90 67 90 67 90 90 90 90 90 90 90 90 9090 90 90 53 90 90 90 90 27 90 90 90 90 90 90 90 90 90 90 9090 90 90 90 90 67 20 90 90 90 90 65 90 90 90 90 90 90 90 8282 82 77 77 55 23 75 75 75 75 75 70 70 70 47 70 70 70 70 4770 70 70 70 70 70 70 70 35 35 70 70 35 70 70 70 70 70 70 7070 45 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 7070 70 70 70 70 70 47 70 65 65 65 40 55 55 55 55 55 55 55 5555 55 35>Contig330 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 3030 5 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 3030 30 30 30 30 30 30 15 30 30 30 30 30 30 30 30 30 30 30 3030 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 3030 30 30 30 30 30 30 30 30 30 30 30 15 30 30 30 30 30 30 3030 15 30 30 30 30 30 30 5 30 15 30 30 30 30 30 35 35 35 3537 37 37 37 37 37 15 37 37 37 37 37 37 37 37 42 42 42 42 2742 42 42 42 42 42 42 42 42 42 27 42 42 42 42 42 42 42 42 2042 42 42 42 42 42 42 42 42 42 42 42 20 42 42 42 42 42 42 4242 42 42 42 27 27 42 42 20 42 20 42 5 42 42 20 42 42 20 4242 42 42 42 42 42 42 5 42 42 42 42 42 42 42 20 20 42 42 4242 42 42 42 42 42 42 42 42 42 42 42 42 42 27 42 42 42 42 2727 42 42 42 42 42 20 42 42 42 42 42 42 20 42 20 42 5 42 542 42 42 42 17 42 42 42 5 42 42 42 42 42 42 42 5 42 42 4227 42 42 20 42 42 5 42 42 42 42 42 27 42 42 27 27 42 42 4242 42 42 42 42 20 27 42 42 42 42 20 42 20 5 27 20 42 42 4242 42 42 42 42 20 42 37 37 15 37 37 37 37 37 15 37 17 37 3737 17 37 42 11 42 11 42 20 11 42 42 42 42 42 47 47 47 25 4747 25 47 47 57 57 57 57 57 42 42 35 35 67 42 67 67 67 67 6767 67 67 67 67 52 67 27 67 67 67 67 67 67 67 67 67 0 67 6767 67 67 67 67 67 67 67 67 67 67 67 67 67 67 67 
67 45 67 6767 67 67 67 67 67 67 65 65 65 60 60 60 60 60 60 60 60 60 1560 35 15 60 60 60 60 60 60 60 60 60 60 15 60 60 60 60 60 6060 45 45 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 6060 60 60 60 60 60 60 45 60 60 60 60 55 55 55 55 55 55 10 5555 55 55 55 55 30 55 55 55 55 55 55 55 55 55 55 55 55 55 5555 60 60 60 60 60 60 60 60 10 60 60 0 0 0 60 65 10 10 1567 45 57 57 57 95 95 95 95 95 95 95 95 95 95 95 95 95 95 9595 95 92 95 95 95 95 95 95 95 50 95 95 95 95 92 92 92 92 9292 92 92 92 92 92 92 40 92 92 92 92 92 92 92 92 92 92 92 92


92 92 40 40 40 92 92 92 92 25 92 92 40 92 92 92 92 92 25 9292 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 9292 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 9292 92 92 92 92 92 92 25 92 25 40 92 92 92 92 92 92 92 92 9292 92 92 40 25 40 25 25 92 92 92 92 92 92 92 92 92 92 92 9292 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 40 4092 40 40 92 92 92 92 92 92 92 92 92 92 92 92 40 25 92 92 4092 92 40 40 40 92 62 62 62 62 62 62 62 62 62 62 62 62 62 6262 62 62 62 62 62 62 62 62 62 10 62 62 10 62 62 62 62 62 6262 62 10 62 62 62 62 62 62 62 62 62 62 62 62 62 10 62 62 6235 50 35 20 20 5 20 20 20 20 20 20 5 20 5 20 20 22 15 1515 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 1515 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 1515 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 1515 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 1515 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 1515 15 15 15 15 15>Contig420 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 2020 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 2020 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 2020 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 2020 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 2020 20 5 20 20 20 20 20 20 20 5 20 20 20 20 5 20 20 20 2020 20 20 20 20 20 5 20 20 20 20 20 25 25 25 25 25 25 25 2525 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 2525 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 2525 25 25 10 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 2525 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 2525 25 25 25 10 25 25 25 25 25 25 25 25 25 25 25 25 25 25 2525 25 25 25 25 25 25 25 25 25 25 25 25 10 25 25 25 10 25 2525 25 25 25 25 25 25 10 25 25 25 25 10 25 25 25 25 25 25 2525 25 25 10 25 25 25 25 25 10 25 25 10 25 25 25 25 25 25 2525 25 25 25 25 25 10 25 25 25 25 25 25 25 25 25 10 25 25 2525 25 10 25 25 25 25 25 25 10 25 25 10 25 25 25 25 25 25 2525 25 25 25 25 25 25 25 25 25 25 
25 25 25 25 25 25 25 25 2525 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 10 105 5 5 5 45 40 45 45 45 45 45 45 45 45 45 45 45 45 45 4545 45 45 45 45 45 45 45 45 50 50 35 50 50 50 50 50 72 97 975 97 97 97 97 97 97 97 97 97 97 97 97 97 93 97 97 97 97 9797 97 82 97 93 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 82 93 97 97 82 97 97 97 93 93 97 97 9797 97 97 97 82 97 97 97 97 97 97 82 97 97 97 97 97 97 97 9793 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 95 95 95


97 85 97 97 97 97 97 97 97 30 97 30 97 97 97 97 30 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 30 97 97 9797 97 97 97 97 97 97 97 97 30 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 30 97 97 97 97 30 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 30 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 5 905 50 50 25 50 50 50 50 50 50 25 50 25 25 50 50 4 50 50 504 50 50 50 50 25 50 25 25 50 50 50 50 50 50 50 50 50 25 5050 25 50 50 25 50 50 50 50 50 50 50 50 50 50 50 50 50 50 5050 50 50 25 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 5050 50 50 50 50 50 50 50 25 50 50 50 50 50 50 50 50 50 50 5050 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 5050 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 5050 50 50 50 50 45 45 45 45 45 45 45 45 45 45 45 45 45 45 4545 45 45 45 45 45 20 45 45 45 45 45 45 20 45 45 45 45 45 4545 45 45 45 45 45 20 45 45 45 45 45 45 45 45 45 20 4 20 4040 40 35 35 10 35 35 35 35 35 35>Contig510 10 10 10 10 10 10 10 15 15 15 15 15 15 15 15 15 15 15 1515 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 1515 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 1515 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 1515 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 1515 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 1515 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 1515 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 1515 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 1515 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 1515 15 15 15 15 15 
15 15 15 15 15 15 15 15 15 15 15 15 15 1515 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 1515 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 1515 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 1515 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 1515 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 1515 15 15 15 15 15 15 15 15 35 35 35 35 35 35 35 35 35 40 4040 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 4040 40 15 40 40 40 40 40 40 40 40 40 40 40 40 40 15 40 40 4040 15 85 85 85 95 95 95 95 95 95 95 95 95 95 95 95 95 95 95


95 95 95 40 70 95 95 95 95 95 95 95 95 95 95 95 95 95 95 9570 70 95 95 95 95 95 95 95 95 95 95 95 95 95 95 95 95 95 9595 95 95 95 95 95 95 95 90 25 90 90 90 90 95 95 95 95 90 9090 90 90 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 45 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 36 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 42 97 97 97 97 97 97 97 36 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 57 97 57 97 97 97 36 97 97 97 9797 36 42 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 42 97 97 42 97 97 97 97 97 97 97 97 97 9797 97 97 97 42 97 97 97 42 42 97 36 97 97 97 97 97 42 97 9797 97 42 97 42 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 42 42 97 97 42 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 42 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 42 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 42 97 97 97 97 97 9797 97 97 97 82 97 97 97 97 97 97 97 97 97 82 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 5 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 82 97 97 5 82 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 17 97 72 97 97 97 97 97 97 97 97 97 97 9797 75 97 77 67 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 37 97 97 97 9797 57 97 97 97 97 97 97 97 97 90 60 85 60 70 85 85 85 85 8585 85 85 85 85 75 75 75 60 60 60 60 60 60 60 60 60 60 60 6060 60 60 60 60 60 60 60 60 60 15 60 60 35 60 60 35 60 60 6060 60 60 60 60 60 60 60 60 35 60 60 60 55 55 55 30 30 25>Contig615 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 1515 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 1515 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 1515 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 1515 15 15 15 15 15 15 15 15 15 15 15 15 20 20 20 
20 20 5 2020 20 20 20 20 20 20 20 20 5 5 20 20 20 20 20 20 20 20 520 5 20 5 20 20 20 22 22 22 22 22 7 22 30 30 22 30 30 307 30 7 22 37 22 37 37 37 37 22 7 37 37 37 37 37 30 37 3737 37 37 37 37 37 22 37 37 37 37 37 37 37 3 37 22 7 37 3037 37 37 37 22 30 37 37 37 37 30 37 7 37 37 22 37 37 37 3737 37 37 37 37 37 30 37 37 37 37 37 22 37 37 22 37 30 37 1537 37 37 37 30 30 37 22 5 37 7 37 37 22 37 7 37 7 37 3737 37 37 37 15 15 7 7 0 37 37 37 37 3 37 37 22 37 7 1537 37 15 37 37 37 37 30 15 30 7 37 22 37 37 37 37 37 37 3737 37 15 15 37 37 37 37 37 37 37 22 5 22 22 22 5 22 22 22


5 22 22 22 22 22 22 5 7 7 37 37 30 37 7 3 22 37 37 3037 37 15 37 37 37 30 22 30 37 37 37 37 37 35 12 35 35 35 350 2 35 35 35 0 12 2 35 0 0 12 35 35 35 15 40 25 25 2510 32 25 32 32 32 25 2 25 32 40 10 2 45 45 45 45 45 45 455 45 45 45 45 45 45 32 45 45 45 45 45 45 45 45 45 32 45 4545 45 15 15 0 15 2 15 15 20 37 20 50 50 50 50 42 2 97 9427 97 97 97 97 97 64 94 97 94 94 97 80 97 97 97 97 94 97 9797 97 97 12 97 94 97 97 94 97 97 97 94 97 97 97 97 94 97 9794 97 94 97 97 94 94 94 97 94 97 97 94 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 63 97 97 97 82 64 97 97 97 42 97 9797 97 42 87 72 94 97 97 97 94 53 97 94 97 97 97 97 97 94 9797 97 97 97 97 97 97 97 97 97 97 97 94 97 97 97 97 97 13 9797 97 97 97 97 97 97 97 97 90 97 97 87 97 97 35 26 97 97 9797 97 80 97 97 97 97 97 97 97 97 97 92 97 97 32 97 97 97 9797 97 97 97 97 97 97 97 70 75 97 97 97 65 97 70 95 97 95 9597 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 5545 97 55 97 97 97 97 97 97 97 77 77 77 25 77 77 77 15 77 2070 70 70 35 70 97 85 85 85 85 55 85 90 45 45 97 97 97 97 9797 97 97 97 20 97 97 20 97 97 97 97 97 50 97 97 97 97 97 9797 97 97 97 97 97 97 77 77 97 20 97 97 97 97 77 97 97 97 9797 97 97 97 97 97 97 97 97 97 77 77 77 97 97 77 97 77 97 7797 97 97 20 97 77 77 97 97 97 97 97 97 97 97 97 97 97 50 9720 50 20 20 50 20 97 97 97 97 97 97 97 97 97 97 97 97 97 7777 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 50 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 77 77 9797 97 80 80 97 97 97 97 27 27 27 27 27 27 27 5 27 5 27 2727 27 27 27 27 27 27 5 27 27 27 5 27 27 5 27 27 27 5 2727 27 5 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 2020 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 2020 20 20 10 10 10>Contig710 10 10 10 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 1515 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 1515 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 1515 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 20 20 9292 92 92 92 70 92 92 70 92 92 70 92 37 92 92 92 92 
92 92 1592 92 92 92 92 92 92 92 92 92 92 92 97 42 97 42 20 42 42 4242 97 97 97 97 42 97 97 97 97 97 97 97 97 97 97 97 75 97 9797 97 97 97 97 97 20 42 97 97 97 75 97 97 97 97 97 97 97 2075 42 97 97 97 97 97 97 97 95 40 80 40 95 95 90 90 90 90 9090 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 9090 90 90 90 90 90 90 90 90 90 35 90 90 90 90 90 90 90 90 9090 90 90 90 90 90 90 90 90 35 35 90 90 90 90 90 90 90 90 9090 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 35 90 35 9090 90 90 35 90 90 90 90 90 90 90 90 90 35 90 90 90 90 90 90


85 85 85 85 85 85 85 85 85 85 85 85 80 80 80 80 80 80 80 8060 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 7575 50 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 5075 75 75 50 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 7575 75 75 75 75 75 75 75 75 75 50 50 50 75 50 75 75 75 75 7575 75 75 75 75 75 75 75 75 75 75 75 75 50 75 75 75 75 75 7570 70 70 45 45 70 70 70 70 70 70 70 70 70 70 70 70 70 70 7070 35 35 35 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 3030 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 3030 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 3030 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 3030 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 3030 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 3030 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 3030 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 3030 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 3030 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 3030 30 30 30 30 25 25 25

A.3 Encompass Complete TE

We now aim to extend the TE beyond the coding sequence and find its instances in
the genome. Using the contigs from the previous step as queries, we perform a
blastn search, then combine and extract the hits. When extracting at this stage,
we add an extra 200 bp to both ends of each sequence. The result is a set of
genomic instances with flanking sequence.
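The combine-and-extract step with 200 bp flanks can be sketched in Python as follows. The rule used to combine hits is not spelled out in this appendix, so the sketch assumes that hits on the same sequence are merged when they overlap or lie within `max_gap` bp of each other. The function names and the `max_gap` parameter are hypothetical; only the 200 bp flank comes from the text.

```python
def merge_hits(hits, max_gap=0):
    """Combine (start, end) hits on one sequence that overlap or lie
    within `max_gap` bp of each other (an assumed combining rule).

    Coordinates are 1-based and inclusive, with start <= end.
    """
    merged = []
    for start, end in sorted(hits):
        if merged and start - merged[-1][1] - 1 <= max_gap:
            # Close enough to the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def add_flanks(interval, seq_length, flank=200):
    """Extend an interval by `flank` bp on both sides, clipped to the sequence."""
    start, end = interval
    return max(1, start - flank), min(seq_length, end + flank)


def extract(sequence, interval):
    """Return the subsequence covered by a 1-based inclusive interval."""
    start, end = interval
    return sequence[start - 1:end]
```

For example, with made-up coordinates, `merge_hits([(100, 200), (150, 300), (500, 600)])` yields two combined hits, and `add_flanks((100, 300), seq_length=1000)` clips the left flank at the start of the sequence.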


A.4 Generate Consensus

Once we have extracted the instances from the genome with flanks, we perform

a multiple sequence alignment using ClustalW2. Using the alignment, we generate

a consensus sequence, shown below:

>ContigTTTT-AGTA-GTTTAC-TTT-TG----A-A-A-A--AACTTTTG-T-AT-TTATTA-TTGGTTCCTTCTG-----A--TTC-GTACCAAAATTTCA-T--A--TCTTA---AAT-GT-CG-GAGCAAAAGA-A-T--AAAC------GGGAGCCAAAGCGAGC--------ATTTCCTCCACATTTT-CTTTT----TTATTTTTGAAAGGG-TGTTAATG-CTTC-CA-GCCAATAAAAAGTTGTGGG--G----TGTAGG------G-GATGAAGCCT-TAATAGAACGGCA-GTGTCAAAA--CTGGTTTGCGAAATTCCGTTCTGGAGATTTT-CTTTGAAAAATGAG---GAG-----GCTCCGGGCGTC-ATTGGAGGTTAATGATGAGCA------AATAAAGGCCCTCATTGATTATGATCGGCATAGTT-CGAC-AAGGACATTGTAAAGAAGCTAGATGTGTCACATACG-TGCG---TCAAAAACCGTCT--GCGGCGTCTTGGGT-----CCAAAAGAAGCTTGATGC-T-AC-TTGGGGA--GT---------AG-C-AC---TTGGTCTTTGCGATATGCTTC---TTAAACGCAA-------CGAATGACCCTTTTTTGAAA---GAATGGTCACCGG----AGATGAAAA---GTGGGTTGTCT-ATGAT--GACTTTTTGAGA-----AAAAG-ATCCT---GGTCTAGGCAA---G-AAACAGGCACCA-ACAACTTCTAAGGCTG-ACATTCACCAAAAAAA------GGTATTGTTA-TCATTTTGGTGGGATTACAAAGGCATAGTC-------ACTTT-GAGC----TGC-GCCACGAAGTCAGACCATAAATTCAGA--GGTTTACATTCGACAATTGACAAAT----TTAAATGATACCATCC---------------AAGAAAAACGACCG-------------------GAGCTAGCCAATAGCA---AAGGAATT------GTCTT-TCACCAC-ATAATG----CCAGGCCCT--------------------------------CCCC----ATCTTTAGCCACTGG--ACAAA-----AACTACTGGAGCTAGGCTGG-ATGTTTTGCTGCACCCTCC---TATAGTCCCAAACTAGCTCCAAATAATTATCA--------TTTTTTCCGAT--CTAAAAAATTTTTT-AA-GGACAAAAATTC----AAAA-GACAATGAGGTC------ACTGCATTGGA-CAGTTTTTT-GCTCCT--AAAACTAAA-GAGTTTTAT-------GAAAAAA-----G-A-AATGATACTACC--CGAAAAAT------------------A-AA--T-A-TAATA-ATA-------ACATAAT-T---ATA-ATAA----A-TAATTT---ATAATTAATAAAT------TTT-TT-TTTT-T-A-AAA-TTC--A-A-T--A----T-T---------AATA
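How a consensus can be derived from a multiple sequence alignment is illustrated by the Python sketch below. The exact rule used to produce the consensus above is not stated, so this is an assumption: each column takes its most common character, and a gap character is emitted when the gap is the most common character or when there is a tie for first place. The function name `consensus` is hypothetical.

```python
from collections import Counter


def consensus(aligned, gap="-"):
    """Column-wise plurality consensus of equal-length aligned sequences.

    The tie-breaking rule (report a gap) is an assumption made for
    illustration, not the documented behaviour of any particular tool.
    """
    if not aligned:
        raise ValueError("no sequences given")
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("aligned sequences must have equal length")

    out = []
    for column in zip(*aligned):
        counts = Counter(base.upper() for base in column)
        top, top_n = counts.most_common(1)[0]
        # More than one character shares the highest count: no clear winner.
        tied = sum(1 for n in counts.values() if n == top_n) > 1
        out.append(gap if tied else top)
    return "".join(out)
```

With this rule, columns dominated by gaps or lacking a clear majority appear as `-` in the output, much like the dashes in the sequence above.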


A.5 Identify Complete TE

We now use the consensus sequence as the query in a blastn search against the
genome to find its instances. The hits are again combined and extracted, this
time with 50 bp flanks. To be considered, a hit must span at least 90% of the
query length before the flanks are added. We then assemble the hits iteratively
in CAP3.
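The length filter and flanking step can be sketched in Python as follows. The 90% cutoff and the 50 bp flank come from the text; the function names are hypothetical, and the sketch assumes the hit length is measured as the span of the combined hit on the genome. The comparison uses integer arithmetic to avoid floating-point rounding at the cutoff.

```python
def passes_length_filter(hit_start, hit_end, query_length, min_percent=90):
    """True if the hit spans at least `min_percent` of the query length.

    Coordinates are 1-based inclusive and may be given in either order;
    the span is measured before any flanking sequence is added.
    """
    hit_length = abs(hit_end - hit_start) + 1
    return hit_length * 100 >= min_percent * query_length


def select_hits(hits, query_length, seq_length, flank=50, min_percent=90):
    """Keep hits that pass the length filter, then add clipped flanks."""
    selected = []
    for start, end in hits:
        if not passes_length_filter(start, end, query_length, min_percent):
            continue
        lo, hi = min(start, end), max(start, end)
        selected.append((max(1, lo - flank), min(seq_length, hi + flank)))
    return selected
```

For a hypothetical 1,000 bp query, a 900 bp hit is kept and an 899 bp hit is discarded; flanks are added only to the hits that remain.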

A.5.1 CAP3 Assembly

The contents of the CAP3 contigs file are shown below.

>Contig1ATTTATTGGGTTGGCCAATAAGTAACTGCGGATTTTACCAACAGATAGTTTGTTTATTTTTTTGAGTACGTTTACGTTTTTGTACAGACATGAACTTTTGATATGTTATTACTTGGTTCCTTCTGTAACATTCGGTACCAAAATTTCATTGAACTCTTAAATAGTACGCGAGCAAAAGACATTTAAACATGGGGAGCCAAAGCGAGCATTTCCTCCACATTTTACTTTTTTATTTTTGAAAGGGTGTTAATGCTTCCCAGGCCAATAAAAAGTTGTGGGCCGTGTAGGGGGATGAAGCCTTAACAGAACGGCAGTGTCAAAACTGGTTTGCGAAATTCCGTTCTGGAGATTTTTCTTTGCAAAATGAGGAGCGCTCCGGGCGCCAATTGGAGGTTAAAGATGAGCAAATAAAGGCCCTCATTGATTATGATCGGCATAGTTCGACGAAGGACATTGTAAAGAAGCTAGATGTGTCACATACGTGCGTCAAAAACCGTCTGCGGCGTCTTGGGTGCCAAAAGAAGCTTGATGCGTTACTTTGGGGAACGTTAGTTAACGAGGCGACTTGGTCTTTGCGATATGCTTCTTAAACGCAATGCAAATGACCCTTTTTTGAAAGAATGGTCACCGGAGATGAAAAGTGGGTTGTCTATGATGACTTTTTGAGAAAAAGATCCTGGTTTAGGCAAGGAAACAGGCACCAACAACTTCTAAGACTGACATTCACCAAAAAAAGGTATTGTTATCATTTTGGTGGGATTACAAAGGCATAGTCAACTTTGAGCTGCTGCCACGAAGTCAGACCATAAATTCAGAGGTTTACATTCGACAATTGACAAATTTAAATGATACCATCCAAGAAAAACGACCGGAACTAGCCAATAGCAAAGGAATTGTCTTTCACCACCATAATGCCAGGCCCTCCCCATCTTTAGCCACTGGACAAAAACTACTGGAGCTAGGCTGGAATGTTTTGCTGCACCCTCCATATAGTCCCAAACTAGCTCCAAATAATTATCATTTTTTCCGATTCCTAAAAAATTTTTTAAACGGACAAAAATTCCAAAACGACAATGAGGTCAAAACTGCATTGGACCAGTTTTTTGCTCCTAAAACTAAAGAGTTTTATGAAAAAAGGAAAATGATACTACCCGAAAAATGTCAAAAGGTCACTAATAATAATGGACATAATATAATAGATAAAAATAATTTTGCATAATTAATAAATCGTTTTTTGTTTTCTTAAAAAATTCGTAAATATCTTTTTGCCAACCCAATA


A.5.2 CAP3 Contigs Quality File

The quality file for the contigs in the previous subsection is shown below.

>Contig135 25 5 72 72 72 72 72 72 72 72 72 72 72 72 37 72 57 72 7272 72 72 72 72 72 72 72 72 72 57 72 72 57 72 72 72 72 65 7257 65 50 57 72 56 72 72 72 72 72 72 72 20 5 42 57 72 72 7272 57 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 4772 72 72 72 72 72 32 72 72 72 72 72 72 72 72 72 72 72 72 7220 72 72 72 65 72 72 72 72 72 72 72 72 72 72 72 72 72 72 5072 72 57 72 72 72 72 72 72 72 72 72 65 37 72 72 72 72 72 7262 72 72 72 72 72 72 72 72 72 35 72 72 20 72 72 47 72 72 7272 72 72 72 72 10 72 72 72 72 72 72 72 72 72 57 72 72 72 1072 21 65 35 72 72 57 72 72 72 72 72 72 72 72 72 72 72 72 072 72 72 72 72 72 72 72 72 57 72 72 72 72 72 50 72 72 72 7272 72 72 20 72 72 72 72 72 72 72 72 72 72 72 37 72 72 72 7272 56 47 72 72 37 72 65 72 72 72 65 72 72 72 72 20 72 72 2072 35 35 35 72 72 72 72 72 72 57 72 72 72 72 72 72 57 72 041 72 72 57 72 72 65 72 72 32 72 72 72 72 65 72 30 72 57 7265 65 65 15 65 72 72 72 47 72 65 72 72 72 57 72 72 72 72 7272 65 72 72 65 72 72 72 65 72 72 72 72 72 72 72 72 72 72 7272 72 72 72 57 72 72 72 72 72 72 72 72 31 57 72 72 72 57 2172 72 72 72 72 72 72 72 72 72 72 65 72 72 72 72 72 72 72 6572 72 37 65 37 72 72 72 72 72 72 72 72 72 72 72 72 25 72 7272 72 72 72 72 72 72 72 72 72 72 57 72 72 72 72 72 72 72 7272 72 72 72 72 72 65 72 72 72 72 72 72 72 20 72 72 65 56 7272 72 72 57 72 20 72 72 72 57 72 72 57 72 72 57 56 72 72 7272 72 72 65 72 72 72 72 72 72 72 57 72 72 72 72 57 57 72 7272 72 72 72 72 72 72 72 72 72 72 72 62 72 20 65 72 72 72 5672 47 57 72 72 72 72 72 72 72 47 65 72 57 72 47 72 72 72 7211 72 72 65 72 72 72 57 72 72 72 72 57 72 72 72 72 72 72 7272 57 72 72 72 65 72 65 72 65 72 72 65 65 72 72 72 65 65 6565 65 50 65 65 72 72 72 72 72 47 72 72 72 72 47 72 72 72 7272 72 72 72 72 50 72 72 72 72 72 72 72 57 72 57 72 72 72 072 72 72 72 37 72 72 72 72 72 72 72 72 72 72 72 72 72 72 7265 72 65 72 57 72 65 65 72 47 40 72 57 72 72 72 72 72 72 7272 72 50 65 65 72 72 72 72 47 72 72 72 72 72 72 72 72 72 7272 72 72 72 72 72 72 47 72 72 72 72 57 72 72 65 
72 57 57 5772 5 72 72 57 72 72 72 72 72 72 72 57 57 42 57 72 57 57 5772 47 57 72 57 72 72 72 72 72 72 72 57 57 72 27 27 72 72 7272 72 72 47 47 72 47 72 72 72 72 72 57 72 50 72 35 72 57 5772 72 72 47 72 72 57 72 57 72 72 72 72 65 72 72 57 72 57 72


72 72 72 57 72 72 72 65 72 72 72 72 72 65 72 25 72 72 72 7272 72 72 57 72 72 72 72 72 72 72 65 72 65 72 72 0 31 72 7272 72 72 37 65 72 72 72 72 72 72 72 72 72 37 72 72 57 72 7272 72 72 72 72 72 72 57 72 72 72 72 72 57 72 72 72 20 72 7272 72 72 72 72 20 72 72 65 72 72 72 72 72 72 72 72 72 72 5772 72 72 72 72 72 72 72 37 57 72 57 72 17 72 72 72 72 72 6572 72 72 72 72 72 65 72 72 72 57 72 72 72 72 72 72 72 72 7272 72 72 35 35 72 57 37 72 72 72 72 72 72 72 72 72 72 72 7257 72 72 72 5 72 20 72 10 72 72 72 65 72 72 72 72 72 72 7272 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 7272 72 72 57 72 72 72 20 72 57 72 72 72 72 72 72 72 72 72 7272 72 72 72 72 47 62 72 72 72 72 72 72 72 72 72 65 72 72 7257 72 72 57 72 72 72 57 72 72 72 72 72 72 72 72 72 72 65 7272 72 72 72 72 72 72 65 57 72 72 10 72 72 72 72 72 72 72 7272 72 72 72 72 72 72 25 72 72 65 65 72 72 47 72 72 72 72 7272 72 72 72 72 72 72 72 72 72 65 72 37 72 72 57 72 72 72 7272 72 72 62 72 72 65 72 72 72 72 72 72 72 72 0 72 72 72 7272 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 7272 72 57 65 72 20 72 72 72 57 72 72 72 65 72 65 72 72 72 7272 72 72 72 72 72 72 72 57 72 72 72 72 37 72 72 72 72 72 7257 57 72 72 72 72 72 72 72 72 72 72 40 30 57 72 72 72 72 7272 72 65 20 20 72 72 72 72 52 67 60 67 67 67 52 67 52 42 6752 67 67 67 67 67 67 67 67 67 67 67 5 5 60 67 52 67 67 6767 67 57 67 67 42 67 67 60 67 15 67 67 67 67 67 67 67 67 6767 67 67 67 67 67 67 67 67 67 67 67 67 60 67 67 67 5 67 6720 42 67 42 67 67 67 67 67 67 67 67 60 67 67 67 67 67 67


A.5.3 Trimmed CAP3 Contigs

Here we list the final contig, which represents the consensus TE. In this case, nothing was trimmed from the previous step, as the quality scores were all above the threshold.

>Contig1-0 0 1280 f
ATTTATTGGGTTGGCCAATAAGTAACTGCGGATTTTACCAACAGATAGTTTGTTTATTTTTTTGAGTACGTTTACGTTTTTGTACAGACATGAACTTTTGATATGTTATTACTTGGTTCCTTCTGTAACATTCGGTACCAAAATTTCATTGAACTCTTAAATAGTACGCGAGCAAAAGACATTTAAACATGGGGAGCCAAAGCGAGCATTTCCTCCACATTTTACTTTTTTATTTTTGAAAGGGTGTTAATGCTTCCCAGGCCAATAAAAAGTTGTGGGCCGTGTAGGGGGATGAAGCCTTAACAGAACGGCAGTGTCAAAACTGGTTTGCGAAATTCCGTTCTGGAGATTTTTCTTTGCAAAATGAGGAGCGCTCCGGGCGCCAATTGGAGGTTAAAGATGAGCAAATAAAGGCCCTCATTGATTATGATCGGCATAGTTCGACGAAGGACATTGTAAAGAAGCTAGATGTGTCACATACGTGCGTCAAAAACCGTCTGCGGCGTCTTGGGTGCCAAAAGAAGCTTGATGCGTTACTTTGGGGAACGTTAGTTAACGAGGCGACTTGGTCTTTGCGATATGCTTCTTAAACGCAATGCAAATGACCCTTTTTTGAAAGAATGGTCACCGGAGATGAAAAGTGGGTTGTCTATGATGACTTTTTGAGAAAAAGATCCTGGTTTAGGCAAGGAAACAGGCACCAACAACTTCTAAGACTGACATTCACCAAAAAAAGGTATTGTTATCATTTTGGTGGGATTACAAAGGCATAGTCAACTTTGAGCTGCTGCCACGAAGTCAGACCATAAATTCAGAGGTTTACATTCGACAATTGACAAATTTAAATGATACCATCCAAGAAAAACGACCGGAACTAGCCAATAGCAAAGGAATTGTCTTTCACCACCATAATGCCAGGCCCTCCCCATCTTTAGCCACTGGACAAAAACTACTGGAGCTAGGCTGGAATGTTTTGCTGCACCCTCCATATAGTCCCAAACTAGCTCCAAATAATTATCATTTTTTCCGATTCCTAAAAAATTTTTTAAACGGACAAAAATTCCAAAACGACAATGAGGTCAAAACTGCATTGGACCAGTTTTTTGCTCCTAAAACTAAAGAGTTTTATGAAAAAAGGAAAATGATACTACCCGAAAAATGTCAAAAGGTCACTAATAATAATGGACATAATATAATAGATAAAAATAATTTTGCATAATTAATAAATCGTTTTTTGTTTTCTTAAAAAATTCGTAAATATCTTTTTGCCAACCCAATA


APPENDIX B

TESeeker WEBSITE

The TESeeker website is located at http://www.nd.edu/~teseeker and includes the virtual appliance, the representative TE library, and documentation. Figure B.1 shows a screen capture of the home page.


Figure B.1. TESeeker Website. The website allows researchers to download the TESeeker virtual appliance, as well as view the documentation. We also provide the library of representative TEs for download.


APPENDIX C

TESeeker USER MANUAL

TESeeker is available as a VirtualBox [144] virtual appliance in the Open Virtualization Format (OVF). TESeeker requires at least 5 GB of free hard disk space and at least 1.5 GB of RAM on the host machine. It can dynamically allocate up to 40 GB of hard disk space for use in the virtual appliance. TESeeker is licensed under the GNU General Public License (GPL) v3 [55].

C.1 Installation

TESeeker can run on any operating system that supports the VirtualBox virtualization software package, currently available for Windows, OS X, Linux, and Solaris.

To install TESeeker, follow these steps:

1. Download and install VirtualBox from http://www.virtualbox.org.

2. Download the two TESeeker virtual appliance files from http://www.nd.edu/~teseeker.

3. Open VirtualBox.

4. Click File then Import Appliance... and complete the wizard, selecting the TESeeker .ovf file as the source. Be sure both downloaded TESeeker files are in the same directory.

C.2 Usage

After installation, start TESeeker by opening VirtualBox, clicking teseeker in the left frame, and then clicking Start. The virtual appliance hosting TESeeker will then boot (see footnote 1). As shown in Figures C.1-C.7, the booted appliance contains 7 desktop items: the Genomes and TELibrary folders, shortcuts to the documentation and web interfaces, and the license. The TESeeker interface is shown in Figure C.5. Hovering the mouse over a parameter name displays a more detailed description. All genome and library files must be placed in the folders on the desktop and must be in FASTA format with a .fa, .fas, or .fasta file extension. We have included the Pediculus humanus humanus genome and our representative TE library within the virtual appliance.
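The input requirements above (FASTA format and one of three file extensions) can be checked before files are copied into the desktop folders. The following is a minimal Python sketch and is not part of TESeeker; the function name looks_like_fasta is hypothetical, and the check assumes that extensions are matched exactly as written in the text.

```python
from pathlib import Path

# Extensions named in the text above; exact (case-sensitive) matching is an assumption.
ALLOWED_EXTENSIONS = {".fa", ".fas", ".fasta"}


def looks_like_fasta(path):
    """Return True if the file has an accepted extension and its first
    non-blank line is a FASTA header (a line starting with '>')."""
    path = Path(path)
    if path.suffix not in ALLOWED_EXTENSIONS:
        return False
    with path.open() as handle:
        for line in handle:
            if line.strip():
                return line.startswith(">")
    return False  # empty file
```

This only inspects the first record header; it does not validate the sequence characters that follow.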

Clicking the TESeeker shortcut on the desktop loads the web interface. Here, researchers can modify the default parameters, most notably the BLAST Query Library, BLAST Database, and the Desktop Output Folder Name. Hovering over a parameter name displays a detailed tooltip description. Once the parameters have been set, clicking submit briefly shows the selected parameters and then starts the search. The browser displays Job X is Running, where X is the job ID number, and refreshes the page until the job completes, at which point the page notifies the user. When the job has finished, researchers can navigate to the specified output folder on the desktop to view the results.

Footnote 1: Some Linux distributions automatically enable the KVM kernel extension. If this is the case, disable it with the command sudo modprobe -r kvm_intel. To restore the KVM kernel extension, run sudo modprobe kvm_intel.

Figure C.1. TESeeker Desktop. This figure shows the desktop.

Figure C.2. TESeeker Genomes Folder. Researchers can place FASTA genome data in this folder.

Figure C.3. TESeeker TELibrary. This figure shows the folder for the representative TEs. Researchers can also place FASTA sequence data in this folder.

Figure C.4. TESeeker Documentation. Here, we show a screen capture of the HTML TESeeker Documentation.

Figure C.5. TESeeker Web Interface. This figure shows the TESeeker web interface. Researchers can alter the default parameters as desired. Library and genome files in the desktop folders are selectable through drop-down menus.

Figure C.6. TESeeker BLAST Interface. Here, we show the BLAST interface. The BLAST Database drop-down menu is populated via the genomes available in the Genomes folder.

Figure C.7. TESeeker Extract Interface. This figure shows the TESeeker extract interface. Researchers can extract specified sequence data from any genome in the Genomes folder.

If the researcher elects to find only the coding region, results are organized as follows within the specified output folder: the codingRegion_files folder contains intermediary output, the output folder contains all the singlets and contigs produced, and the remaining files represent the contigs and singlets produced by CAP3. For example, a file called cap2c_out.fas contains the contig sequences from the second iteration of CAP3, while cap1s_out.fas contains the singlet sequences produced from the first iteration of CAP3.
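The two CAP3 file names given as examples in Section C.2 suggest a regular naming scheme: an iteration number followed by c (contigs) or s (singlets). The Python sketch below decodes names of that shape. It is an illustration only: the pattern is inferred from those two examples (including the underscore, which is assumed), the function name describe_cap3_file is hypothetical, and other files in the output folder may not follow this scheme.

```python
import re

# Pattern inferred from the two example names in the text; an assumption,
# not a documented TESeeker convention.
CAP3_NAME = re.compile(r"^cap(\d+)([cs])_out\.fas$")


def describe_cap3_file(name):
    """Return (iteration, kind) for names like 'cap2c_out.fas', else None."""
    match = CAP3_NAME.match(name)
    if match is None:
        return None
    kind = "contigs" if match.group(2) == "c" else "singlets"
    return int(match.group(1)), kind
```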

If a consensus sequence is desired, the results are organized as follows within the specified output folder: the codingRegion_files folder contains intermediary output from the coding region search, the consen_files folder contains intermediary files from the consensus search, and the output folder contains the contig and singlet sequences produced from each sequence that was fed into the consensus search. Additionally, all contig and singlet sequences are available in single FASTA files in the specified output folder.

C.3 Example Search

TESeeker is distributed with the Pediculus humanus humanus genome as well as our library of representative TEs. The following steps describe how to obtain a high-quality consensus sequence for the Pediculus humanus humanus mariner element once the virtual appliance has been loaded.

1. Launch TESeeker. Double-click the TESeeker shortcut on the desktop.

2. Confirm Parameters. Ensure mariner_ac.fa is selected for the BLAST Query Library and that the phumanus.SUPERCONTIGS-USDA.PhumUA.fa genome is selected for the BLAST Database. Also click Find Consensus? to enable a consensus search. The screen should now look as shown in Figure C.8. The status for TESeeker will be continuously updated through the web interface until the job completes.

3. Inspect Results. When the job is finished, click the link to the specified output folder, louseOut, and inspect the results. The web view of this folder is shown in Figure C.9. As mentioned in the previous section, the main consensus results will be in up to three FASTA files: consensus_contigs.fas, consensus_iter1_singlets.fas, and consensus_singlets.fas. The best hits are generally in the consensus_contigs.fas file, while the least likely candidates are generally in the consensus_iter1_singlets.fas file. In this case, the first contig in consensus_contigs.fas, Contig1-0 6 1309 f, contains a sequence 99% identical to the manually annotated element, differing mainly in its roughly 10 extra nucleotides on both ends. Figure C.10 shows the ends of the aligned sequences.

Figure C.8. TESeeker Default Parameters. This figure shows the TESeeker web interface with the default parameters set for a search for the mariner transposon in P. humanus humanus.
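The header of the top contig in Section C.3, Contig1-0 6 1309 f, follows the layout written by the extraction script in Appendix D.2: a sequence identifier, start and end coordinates, and a strand flag (f or r). The Python sketch below splits such a header into its fields. The class and function names are hypothetical, and the four-field layout is assumed from that script rather than from a formal specification.

```python
from typing import NamedTuple


class ExtractHeader(NamedTuple):
    name: str    # e.g. "Contig1-0" (source sequence id, dash, running number)
    start: int   # start coordinate as printed in the header
    end: int     # end coordinate as printed in the header
    strand: str  # "f" (forward) or "r" (reverse)


def parse_header(line):
    """Parse a header such as '>Contig1-0 6 1309 f' into its four fields."""
    fields = line.lstrip(">").split()
    if len(fields) != 4 or fields[3] not in ("f", "r"):
        raise ValueError(f"unexpected header: {line!r}")
    return ExtractHeader(fields[0], int(fields[1]), int(fields[2]), fields[3])


header = parse_header(">Contig1-0 6 1309 f")
print(header.name, header.start, header.end, header.strand)
```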

C.4 Additional Tools

There are also BLAST and Extract shortcuts on the desktop. These web interfaces offer additional functionality by making it simpler to run a custom BLAST search or sequence extraction using the files in the Genomes folder.


Figure C.9. Web Interface File Browser. This figure shows the contents of the main output folder, louseOut. FASTA sequences are in the .fas files, shown here as consensus_contigs.fas, consensus_iter1_singlets.fas, and consensus_singlets.fas.


(a) 5’ End

(b) 3’ End

Figure C.10. ClustalX Alignment with Annotated Element. Panels (a) and (b) show the 5’ and 3’ ends of the annotated mariner (mariner) and the top consensus sequence produced by TESeeker (Contig1-0) when run with the default parameters. The sequences are 99% identical. The extra sequence on both ends of Contig1-0 can be reduced with stricter parameters.


C.5 Technology

TESeeker utilizes a variety of technologies. The core bioinformatics tools, BLAST, CAP3, ClustalW2, and BioPerl, were described in Section 3.2.3 and are tied together through bash scripts. Researchers interact with TESeeker through a web-based form implemented in HTML/PHP and served by the lighttpd web server. The form interacts with the local scripts and uses a PostgreSQL database and CGI/Perl to notify researchers when a job has completed. TESeeker is installed on Ubuntu 10.04 LTS. The administrative password for user teseeker is teseeker.


APPENDIX D

SELECTED AUTOMATED APPROACH SOURCE CODE

This appendix presents selected BioPerl scripts written for our automated approach.

D.1 Combine BLAST Hits

######################################################################
### This is part of TESeeker, licensed under the GNU General Public
### License (GPL) v3.
###
### Authors: Ryan Kennedy and Scott Christley
###
### Arguments:
### ARGV[0]: closeness in order for segments to be joined
### ARGV[1]: minimum length a segment must be to be joined
### ARGV[2]: closeness segments must be in the query to be joined
### ARGV[3]: maximum length percent of query of combined sequence
###          if joined because of ARGV[2]
### ARGV[4]: length combined sequence must be relative to the query
### ARGV[5]: length of flanks to add to each side
### ARGV[6]: BLAST file to process
######################################################################

use Bio::SeqIO;
use Bio::SearchIO;
use List::Util qw[min max];

$closeLen = $ARGV[0];
$minSegLen = $ARGV[1];
$querySepDist = $ARGV[2];
$queryMaxPerc = $ARGV[3];
$minLenPerc = $ARGV[4];
$flankLen = $ARGV[5];
$blast = $ARGV[6];

$bres = new Bio::SearchIO(-format => 'blast', -file => $blast );

%TEfor=();
$forNum=1;
%TErev=();
$revNum=1;

# collect all HSPs of sufficient length, separated by strand
while(my $result = $bres->next_result) {
    $minLen=$minLenPerc * $result->query_length;
    $queryMax=$queryMaxPerc * $result->query_length;
    while(my $hit = $result->next_hit) {
        while(my $hsp = $hit->next_hsp) {
            $orientation = $hsp->strand('hit');
            if($orientation>0) { #forward strand
                $hitStart=$hsp->start('hit');
                $hitEnd=$hsp->end('hit');
                $QStart=$hsp->start('query');
                $QEnd=$hsp->end('query');
                $overlap=0;
                $QJoin=0;

                if(abs($hitStart-$hitEnd)>=$minSegLen) {
                    #only combine segments of a specified length
                    $TEfor{$hit->name}{$forNum}{"start"} = $hsp->start('hit');
                    $TEfor{$hit->name}{$forNum}{"end"} = $hsp->end('hit');
                    $TEfor{$hit->name}{$forNum}{"qStart"} = $hsp->start('query');
                    $TEfor{$hit->name}{$forNum}{"qEnd"} = $hsp->end('query');
                    $TEfor{$hit->name}{$forNum}{"minLen"} = $minLen;
                    $TEfor{$hit->name}{$forNum}{"query"} =
                        $result->query_accession . $result->query_length;
                    ++$forNum;
                } #end minSegLen check
            } else { #reverse strand
                $hitStart=$hsp->start('hit');
                $hitEnd=$hsp->end('hit');
                $QStart=$hsp->start('query');
                $QEnd=$hsp->end('query');
                $overlap=0;
                $QJoin=0;

                if(abs($hitStart-$hitEnd)>=$minSegLen) {
                    #only combine segments of a specified length
                    $TErev{$hit->name}{$revNum}{"start"} = $hsp->start('hit');
                    $TErev{$hit->name}{$revNum}{"end"} = $hsp->end('hit');
                    $TErev{$hit->name}{$revNum}{"qStart"} = $hsp->start('query');
                    $TErev{$hit->name}{$revNum}{"qEnd"} = $hsp->end('query');
                    $TErev{$hit->name}{$revNum}{"minLen"} = $minLen;
                    $TErev{$hit->name}{$revNum}{"query"} =
                        $result->query_accession . $result->query_length;
                    ++$revNum;
                } #end minSegLen check
            } #end orientation if
        } #end hsp
    } #end hit
} #end result
$bres->close();

#JOIN
for $scaffold ( sort keys %TEfor ) {
    $i=0;
    $#Forscaf= -1;
    $#teForStart=-1;
    $#teForEnd=-1;
    $#teForQStart=-1;
    $#teForQEnd=-1;
    $#teForMinLen=-1;

    # go through each scaffold and collect into arrays
    for $TE ( keys %{ $TEfor{$scaffold} }) {
        $Forscaf[$i]=$scaffold;
        $teForStart[$i]=$TEfor{$scaffold}{$TE}{"start"};
        $teForEnd[$i]=$TEfor{$scaffold}{$TE}{"end"};
        $teForQStart[$i]=$TEfor{$scaffold}{$TE}{"qStart"};
        $teForQEnd[$i]=$TEfor{$scaffold}{$TE}{"qEnd"};
        $teForMinLen[$i]=$TEfor{$scaffold}{$TE}{"minLen"};
        $teForQuery[$i]=$TEfor{$scaffold}{$TE}{"query"};
        $i++;
    }

    # sort by start position and get indexes
    # (numeric <=> rather than string cmp, since the keys are coordinates)
    @list_order = sort { $teForStart[$a] <=> $teForStart[$b] } 0 .. $#teForStart;

    # push sorted stuff onto our final array
    for my $i ( 0 .. $#list_order ) {
        push @joinScaffold, $Forscaf[$list_order[$i]];
        push @joinStart, $teForStart[$list_order[$i]];
        push @joinEnd, $teForEnd[$list_order[$i]];
        push @joinQStart, $teForQStart[$list_order[$i]];
        push @joinQEnd, $teForQEnd[$list_order[$i]];
        push @joinMinLen, $teForMinLen[$list_order[$i]];
        push @joinQuery, $teForQuery[$list_order[$i]];
    }
}

@Forscaf=@joinScaffold;
@teForStart=@joinStart;
@teForEnd=@joinEnd;
@teForMinLen=@joinMinLen;
@teForQuery=@joinQuery;

# merge neighboring forward-strand segments on the same scaffold and query
$joinDist=$closeLen;
for($i=0;$i<=(@Forscaf+0);$i++) {
    if($Forscaf[$i+1]) {
        if(abs($teForStart[$i+1]-$teForEnd[$i])<=$joinDist &&
           $Forscaf[$i] eq $Forscaf[$i+1] && $teForQuery[$i] eq $teForQuery[$i+1])
        {
            $teForEnd[$i]=max($teForEnd[$i],$teForEnd[$i+1]);
            splice @Forscaf, $i+1, 1;
            splice @teForStart, $i+1, 1;
            splice @teForEnd, $i+1, 1;
            splice @teForMinLen, $i+1, 1;
            splice @teForQuery, $i+1, 1;
            $i-=1; #unless $i==0;
        }elsif(abs($teForStart[$i]-$teForEnd[$i+1])<=$joinDist &&
           $Forscaf[$i] eq $Forscaf[$i+1] && $teForQuery[$i] eq $teForQuery[$i+1])
        {
            # (the body of this branch is missing from the source text)
        }elsif(abs($teForStart[$i]-$teForStart[$i+1])<=$joinDist &&
           $Forscaf[$i] eq $Forscaf[$i+1] && $teForQuery[$i] eq $teForQuery[$i+1])
        {
            # (the body of this branch is missing from the source text)
        }elsif(abs($teForEnd[$i]-$teForEnd[$i+1])<=$joinDist &&
           $Forscaf[$i] eq $Forscaf[$i+1] && $teForQuery[$i] eq $teForQuery[$i+1])
        {
            $teForEnd[$i]=max($teForEnd[$i],$teForEnd[$i+1]);
            splice @Forscaf, $i+1, 1;
            splice @teForStart, $i+1, 1;
            splice @teForEnd, $i+1, 1;
            splice @teForMinLen, $i+1, 1;
            splice @teForQuery, $i+1, 1;
            $i-=1; #unless $i==0;
        }
    }
}

$#joinScaffold=-1;
$#joinStart=-1;
$#joinEnd=-1;
$#joinMinLen=-1;
$#joinQuery=-1;
$#list_order=-1;

#REVERSE JOIN
for $scaffold ( sort keys %TErev ) {
    $i=0;
    $#Revscaf= -1;
    $#teRevStart=-1;
    $#teRevEnd=-1;
    $#teRevQStart=-1;
    $#teRevQEnd=-1;
    $#teRevMinLen=-1;

    # go through each scaffold and collect into arrays
    for $TE ( keys %{ $TErev{$scaffold} }) {
        $Revscaf[$i]=$scaffold;
        $teRevStart[$i]=$TErev{$scaffold}{$TE}{"start"};
        $teRevEnd[$i]=$TErev{$scaffold}{$TE}{"end"};
        $teRevQStart[$i]=$TErev{$scaffold}{$TE}{"qStart"};
        $teRevQEnd[$i]=$TErev{$scaffold}{$TE}{"qEnd"};
        $teRevMinLen[$i]=$TErev{$scaffold}{$TE}{"minLen"};
        $teRevQuery[$i]=$TErev{$scaffold}{$TE}{"query"};
        $i++;
    }

    # sort by start position and get indexes
    # (numeric <=> rather than string cmp, since the keys are coordinates)
    @list_order = sort { $teRevStart[$a] <=> $teRevStart[$b] } 0 .. $#teRevStart;

    # push sorted stuff onto our final array
    for my $i ( 0 .. $#list_order ) {
        push @joinScaffold, $Revscaf[$list_order[$i]];
        push @joinStart, $teRevStart[$list_order[$i]];
        push @joinEnd, $teRevEnd[$list_order[$i]];
        push @joinQStart, $teRevQStart[$list_order[$i]];
        push @joinQEnd, $teRevQEnd[$list_order[$i]];
        push @joinMinLen, $teRevMinLen[$list_order[$i]];
        push @joinQuery, $teRevQuery[$list_order[$i]];
    }
}

@Revscaf=@joinScaffold;
@teRevStart=@joinStart;
@teRevEnd=@joinEnd;
@teRevMinLen=@joinMinLen;
@teRevQuery=@joinQuery;

# merge neighboring reverse-strand segments on the same scaffold and query
$joinDist=$closeLen;
for($i=0;$i<=(@Revscaf+0);$i++) {
    if($Revscaf[$i+1]) {
        if(abs($teRevStart[$i+1]-$teRevEnd[$i])<=$joinDist &&
           $Revscaf[$i] eq $Revscaf[$i+1] && $teRevQuery[$i] eq $teRevQuery[$i+1])
        {
            $teRevEnd[$i]=max($teRevEnd[$i],$teRevEnd[$i+1]);
            splice @Revscaf, $i+1, 1;
            splice @teRevStart, $i+1, 1;
            splice @teRevEnd, $i+1, 1;
            splice @teRevMinLen, $i+1, 1;
            splice @teRevQuery, $i+1, 1;
            $i-=1; #unless $i==0;
        }elsif(abs($teRevStart[$i]-$teRevEnd[$i+1])<=$joinDist &&
           $Revscaf[$i] eq $Revscaf[$i+1] && $teRevQuery[$i] eq $teRevQuery[$i+1])
        {
            # (the body of this branch is missing from the source text)
        }elsif(abs($teRevStart[$i]-$teRevStart[$i+1])<=$joinDist &&
           $Revscaf[$i] eq $Revscaf[$i+1] && $teRevQuery[$i] eq $teRevQuery[$i+1])
        {
            # (the body of this branch is missing from the source text)
        }elsif(abs($teRevEnd[$i]-$teRevEnd[$i+1])<=$joinDist &&
           $Revscaf[$i] eq $Revscaf[$i+1] && $teRevQuery[$i] eq $teRevQuery[$i+1])
        {
            $teRevEnd[$i]=max($teRevEnd[$i],$teRevEnd[$i+1]);
            splice @Revscaf, $i+1, 1;
            splice @teRevStart, $i+1, 1;
            splice @teRevEnd, $i+1, 1;
            splice @teRevMinLen, $i+1, 1;
            splice @teRevQuery, $i+1, 1;
            $i-=1; #unless $i==0;
        }
    }
}

#print one get_fasta2.pl command per joined segment that is long enough
$ct=0;
for($i=0;$i<=(@Forscaf+0);$i++) {
    if(($Forscaf[$i] ne "") && (abs($teForStart[$i]-$teForEnd[$i]) >=
        $teForMinLen[$i])) {
        print "perl get_fasta2.pl " . $Forscaf[$i] . " $ct ";
        print $teForStart[$i] . " " . $teForEnd[$i] . " " . $flankLen .
            " f $teForStart[$i] $teForEnd[$i]\n";
    }
    $ct++;
}
for($i=0;$i<=(@Revscaf+0);$i++) {
    if(($Revscaf[$i] ne "") && (abs($teRevStart[$i]-$teRevEnd[$i]) >=
        $teRevMinLen[$i])) {
        print "perl get_fasta2.pl " . $Revscaf[$i] . " $ct ";
        print $teRevStart[$i] . " " . $teRevEnd[$i] . " " . $flankLen .
            " r $teRevStart[$i] $teRevEnd[$i]\n";
    }
    $ct++;
}
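The core idea of the script above, merging BLAST hits that lie close together on the same scaffold and come from the same query, can be illustrated with a compact sketch. The Python version below is a simplified single-pass interval merge written for illustration; it is not a port of the Perl script (it does not reproduce the script's four separate proximity tests or its minimum-length filters), and the function name join_hits and the tuple layout are assumptions.

```python
def join_hits(hits, close_len):
    """Merge hits on the same scaffold and query whose gap is <= close_len.

    hits: iterable of (scaffold, query, start, end) tuples with start <= end.
    Returns a list of merged (scaffold, query, start, end) tuples.
    """
    merged = []
    # Tuples sort by scaffold, then query, then numerically by start.
    for scaffold, query, start, end in sorted(hits):
        if merged:
            last_scaffold, last_query, last_start, last_end = merged[-1]
            same_group = (scaffold, query) == (last_scaffold, last_query)
            if same_group and start - last_end <= close_len:
                # Close enough (or overlapping): extend the previous segment.
                merged[-1] = (scaffold, query, last_start, max(last_end, end))
                continue
        merged.append((scaffold, query, start, end))
    return merged


hits = [
    ("s1", "q", 100, 200),
    ("s1", "q", 250, 400),    # 50 bp after the first hit: joined
    ("s1", "q", 1000, 1100),  # 600 bp after the joined segment: kept apart
    ("s2", "q", 210, 300),    # different scaffold: never joined with s1
]
print(join_hits(hits, close_len=100))
```

Sorting the coordinates numerically before the pass matters: with string ordering, a start of 1000 would sort before 250 and adjacent hits could be missed.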


D.2 Extract Sequences

######################################################################
### This is part of TESeeker, licensed under the GNU General Public
### License (GPL) v3.
###
### Author: Ryan Kennedy
###
### Arguments:
### ARGV[0]: BLAST database (genome)
### ARGV[1]: input file that contains get_fasta queries
######################################################################

use Bio::SeqIO;
use Bio::SearchIO;

# load every sequence of the genome into a hash keyed by sequence id
$GENOMEDB = $ARGV[0];
$in = Bio::SeqIO->new(-file => $GENOMEDB, -format => 'Fasta');
$ii=0;
$eeend=0;
%cpipiens=();
while($seq=$in->next_seq()) {
    $cpipiens{$seq->id()}=$seq;
    #print "scaff: " . $seq->id() . "\n";
    $ii++;
}

# search for a specific scaffold id
$filename=$ARGV[1];
$strand="";
$seq="";
open(INFILE,$filename) or die "Cannot open $filename: $!";
while(<INFILE>) {
    # each input line has the form:
    # perl get_fasta2.pl <scaffold> <number> <start> <end> <flank> <strand>
    ($p,$g,$scaffold,$scaffoldNum,$pos,$end,$flank,$strand)=split();
    $seq=$cpipiens{$scaffold};

    if ($strand eq "r") {
        $revseq = $seq->revcom();
        $olen = $end - $pos+2*$flank;
        $start = $seq->length() - $pos - $olen + $flank;
        if($start<0) { $start=0; }
        if($olen>$seq->length()) { $olen=$seq->length(); }
        $seqstr = substr($revseq->seq(), $start, $olen+1);
        $start=$pos-$flank;
        $eeend=$start+$olen;
        print ">" . $scaffold . "-" . "$scaffoldNum $start $eeend $strand \n";
        print $seqstr . "\n";
    } elsif ($strand eq "f") {
        $olen = $end - $pos + 2*$flank;
        $start = $pos-$flank-1;
        if($start<0){ $start=0; }
        if($olen>$seq->length()) { $olen=$seq->length(); }
        # seq() returns the raw sequence string from the sequence object,
        # which also carries additional information about the sequence
        $seqstr = substr($seq->seq(), $start, $olen+1);
        $eeend=$olen+$start+1;
        $start=$pos-$flank;
        print ">" . $scaffold . "-" . "$scaffoldNum $start $eeend $strand \n";
        print $seqstr . "\n";
    }
}
close(INFILE);
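The extraction step above amounts to cutting a coordinate range plus flanking sequence out of a scaffold and reverse-complementing it for hits on the reverse strand. The Python sketch below shows that operation on a plain string. It is a simplified illustration, not a port of the Perl script: it assumes 1-based inclusive coordinates, clips the window at the sequence ends, handles only unambiguous bases, and uses hypothetical function names.

```python
# Complement table for unambiguous bases only; IUPAC ambiguity codes are
# deliberately not handled in this sketch.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def extract_with_flanks(seq, start, end, flank, strand="f"):
    """Return seq[start..end] (1-based, inclusive) plus `flank` bases per side.

    The window is clipped at the sequence ends. For strand 'r' the
    reverse complement of the window is returned.
    """
    lo = max(start - flank, 1)
    hi = min(end + flank, len(seq))
    region = seq[lo - 1:hi]
    return reverse_complement(region) if strand == "r" else region


scaffold = "AACCGGTT"
print(extract_with_flanks(scaffold, 3, 4, 1))        # forward strand
print(extract_with_flanks(scaffold, 3, 4, 1, "r"))   # reverse strand
```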


D.3 Trim CAP3 Contigs

######################################################################
### This is part of TESeeker, licensed under the GNU General Public
### License (GPL) v3.
###
### Author: Ryan Kennedy
###
### Arguments:
### ARGV[0]: quality score file
### ARGV[1]: minimum distance to perform joins
### ARGV[2]: sliding window size
### ARGV[3]: CAP3 quality baseline
### ARGV[4]: CAP3 threshold quality multiplier
### ARGV[5]: minimum length of processed CAP3 sequence
######################################################################

use Bio::SeqIO;

$qualfile=$ARGV[0];
$joinDist=$ARGV[1];
$windowSize=$ARGV[2];
$quality=$ARGV[3];
$val=$ARGV[4];
$minLen=0;
$minLen=$ARGV[5];

$in = Bio::SeqIO->new(-file => $qualfile, -format => 'qual');

my $sum=0;
my $outside=1;
my $teNum=0;

for($i=0;$i<$windowSize;$i++) { #define array of set size and fill with zeroes
    $window[$i]=0;
}

while($seq=$in->next_seq()) {
    for($i=0;$i<$seq->length();$i++) {
        unshift(@window,$seq->qual()->[$i]); #add new quality score
        $sum+=$seq->qual()->[$i];
        $sum-=pop(@window); #remove oldest quality score and update sum
        if($sum>=($quality * $windowSize * $val) && $outside==1) {
            # window sum reached the threshold: a high-quality region starts
            $outside=0;
            $scaf[$teNum]=$seq->id();
            $te[$teNum]=$teNum;
            $len[$teNum]=$seq->length();
            if($i-$windowSize<0) { $start[$teNum] = 0; }
            else { $start[$teNum]=$i-$windowSize; }
        }
        if($outside==0 && ($sum<($quality*$val*$windowSize)) ) {
            # window sum dropped below the threshold: the region ends
            $end[$teNum]=$i;
            $teNum++;
            $outside=1;
        }
    } #end scaffold
    if($outside==0) { $end[$teNum]=$seq->length(); $teNum++; }
    $outside=1;
    $sum=0;
    for($i=0;$i<$windowSize;$i++) { #define array of set size and fill with zeroes
        $window[$i]=0;
    }
}

#perform joins
for($i=0;$i<=(@scaf+0);$i++) {
    if($start[$i+1]) {
        if(abs($end[$i]-$start[$i+1])<$joinDist && $scaf[$i] eq $scaf[$i+1]) {
            $end[$i]=$end[$i+1];
            splice @start, $i+1, 1;
            splice @end, $i+1, 1;
            splice @scaf, $i+1, 1;
            splice @len, $i+1, 1;
            splice @te, $i+1, 1;
            $i-=1; #unless $i==0;
        }
    }
}

#output for get_fasta2
for($i=0;$i<(@scaf+0);$i++) {
    if(abs($start[$i]-$end[$i]) > $len[$i]*$minLen) {
        print "perl get_fasta2.pl " . $scaf[$i] . " " . $te[$i] . " " .
            $start[$i] . " " . $end[$i] . " 0 f \n";
    }
}
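The trimming script above keeps the stretches of a contig where a sliding-window sum of CAP3 quality scores stays at or above a threshold. The Python sketch below illustrates that windowing logic on a plain list of scores. It is a simplified illustration rather than a port: a single per-base threshold stands in for the product of the baseline and multiplier arguments, the join and minimum-length steps are omitted, and the function name is hypothetical. As in the script, a region is taken to start one window length before the position at which the sum first reaches the threshold.

```python
def high_quality_regions(quals, window_size, threshold):
    """Return (start, end) index pairs, end exclusive, for regions where the
    sum of the last `window_size` scores is >= window_size * threshold."""
    regions = []
    window_sum = 0
    start = None
    for i, score in enumerate(quals):
        window_sum += score
        if i >= window_size:
            window_sum -= quals[i - window_size]  # drop the oldest score
        passing = window_sum >= window_size * threshold
        if passing and start is None:
            start = max(i - window_size, 0)       # region opens
        elif not passing and start is not None:
            regions.append((start, i))            # region closes
            start = None
    if start is not None:
        regions.append((start, len(quals)))       # region runs to the end
    return regions


scores = [0, 0, 72, 72, 72, 72, 0, 0, 0, 0]
print(high_quality_regions(scores, window_size=2, threshold=50))
```

Because the test is on a window sum, an isolated low score inside a long high-quality run does not necessarily end the region, which is why a contig can contain a few low values and still be left untrimmed.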


D.4 Generate Consensus

######################################################################
### This is part of TESeeker, licensed under the GNU General Public
### License (GPL) v3.
###
### Author: Ryan Kennedy
###
### Arguments:
### ARGV[0]: alignment file to create consensus from
### ARGV[1]: percent of nt that must be common for consensus
######################################################################

use Bio::SeqIO;

$alnFile=$ARGV[0]; #file to make consensus from
$percThresh=$ARGV[1]; #percent of nt that must be common to go to consensus
my @A=0; my @T=0; my @G=0; my @C=0;
my @consensus="";
my $numSeqs=0;
my $length=0;

$in = Bio::SeqIO->new(-file => $alnFile, -format => 'fasta');

# count the bases observed in each alignment column
while($seq=$in->next_seq()) {
    $numSeqs++;
    my @seq_array=$seq->seq() =~ /./sg;
    $length=$seq->length();
    for($i=0;$i<$seq->length();$i++) {
        if(uc $seq_array[$i] eq "A") {
            $A[$i]++;
        } elsif (uc $seq_array[$i] eq "T") {
            $T[$i]++;
        } elsif (uc $seq_array[$i] eq "G") {
            $G[$i]++;
        } elsif (uc $seq_array[$i] eq "C") {
            $C[$i]++;
        } elsif (uc $seq_array[$i] eq "N" || $seq_array[$i] eq "-") {
            # N and gap characters count equally toward every base
            $A[$i]++;
            $T[$i]++;
            $G[$i]++;
            $C[$i]++;
        }
    }
}

for($i=0;$i<$length;$i++) { #calculate percentages
    $Ap[$i]=$A[$i]/($A[$i]+$T[$i]+$G[$i]+$C[$i]);
    $Tp[$i]=$T[$i]/($A[$i]+$T[$i]+$G[$i]+$C[$i]);
    $Gp[$i]=$G[$i]/($A[$i]+$T[$i]+$G[$i]+$C[$i]);
    $Cp[$i]=$C[$i]/($A[$i]+$T[$i]+$G[$i]+$C[$i]);
}

for($i=0;$i<$length;$i++) { #find commonalities
    if($Ap[$i] > $percThresh && $Ap[$i] > $Tp[$i] &&
       $Ap[$i] > $Gp[$i] && $Ap[$i] > $Cp[$i]) {
        push(@consensus, "A");
    } elsif ($Tp[$i] > $percThresh && $Tp[$i] > $Ap[$i] &&
             $Tp[$i] > $Gp[$i] && $Tp[$i] > $Cp[$i]) {
        push(@consensus, "T");
    } elsif ($Gp[$i] > $percThresh && $Gp[$i] > $Ap[$i] &&
             $Gp[$i] > $Tp[$i] && $Gp[$i] > $Cp[$i]) {
        push(@consensus, "G");
    } elsif ($Cp[$i] > $percThresh && $Cp[$i] > $Ap[$i] &&
             $Cp[$i] > $Tp[$i] && $Cp[$i] > $Gp[$i]) {
        push(@consensus, "C");
    } else {
        push(@consensus, "-");
    }
}

# trim the 5' end until a run of called positions is reached
$i=0;
until($consensus[$i] ne "-" && $consensus[$i+1] ne "-" &&
      $consensus[$i+2] ne "-" && $consensus[$i+3] ne "-") {
    shift(@consensus);
}

# trim the 3' end in the same way
$i=(@consensus+0);
#print $i;
until($consensus[$i] ne "-" && $consensus[$i-1] ne "-" &&
      $consensus[$i-2] ne "-" && $consensus[$i-3] ne "-") {
    pop(@consensus);
    $i=(@consensus+0)-1;
}

print ">Contig\n";
print @consensus;
print "\n";
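The consensus rule used above can be summarized as follows: for each alignment column, emit the base whose share of the column exceeds the threshold and is strictly the largest; otherwise emit a gap. The Python sketch below illustrates that rule on a list of equal-length aligned strings. It is an illustration under stated assumptions, not a port of the Perl script: it omits the end-trimming step, the function name consensus is hypothetical, and, as in the script, N and gap characters are counted once toward each of the four bases.

```python
from collections import Counter


def consensus(aligned, threshold):
    """Column-wise consensus of equal-length aligned sequences.

    A base is called when its fraction of the column is > threshold and
    strictly greater than that of every other base; otherwise '-' is emitted.
    """
    out = []
    for column in zip(*aligned):
        counts = Counter()
        for ch in column:
            ch = ch.upper()
            if ch in "ACGT":
                counts[ch] += 1
            elif ch in "N-":
                for base in "ACGT":   # ambiguous: counts toward every base
                    counts[base] += 1
        total = sum(counts.values())
        ranked = counts.most_common(2)
        if total and ranked[0][1] / total > threshold and (
                len(ranked) == 1 or ranked[0][1] > ranked[1][1]):
            out.append(ranked[0][0])
        else:
            out.append("-")
    return "".join(out)


alignment = ["ACGT", "ACGA", "ACTA"]
print(consensus(alignment, 0.5))
print(consensus(alignment, 0.7))
```

Raising the threshold turns weakly supported columns into gaps, which is the effect the percent-common argument controls in the script.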


APPENDIX E

TRANSPOSABLE ELEMENTS IDENTIFIED

This appendix presents all full-length consensus TEs we identified in the P. humanus humanus genome and selected sequences from the C. quinquefasciatus genome. We offer a detailed annotation of the mariner element from P. humanus humanus and an annotated version of the putative mariner from D. melanogaster.

E.1 P. humanus humanus

E.1.1 Non-LTRs

E.1.1.1 Hope-like SART

LOCUS hope-like 4655 bp
DEFINITION hope-like, 4655 bases, 1B9 checksum.
ORIGIN

1 GGCCGATGGG GGTCACGGCC TCGTGTCCCG AGAGGCGAGT CGACCTGCGC51 GTGGAAGGTC TTTGCGAGTC RGCGACGTCG TCGATGGTCG CAGCGGAATT101 AGCTAGGGTA GGGGGGTGCT CGTCATCGGA CATCTCTGTG CGCTTCCCGA151 CCGGCGGAGG CCCACATYGT WACTCGTSYG CGTACGCATC CGTCCCGGCG201 GCCGTCGCGT CCGTYCTCTT RAAGGCRGGC GCGGTTGTCR TCGGGTGGRT251 GCSTGCGAGG GTGGTCCTGC TGCCGAAACR CAGGACCAWC TGCGTCAGRT301 GCCTACAYCY GGGGCATGTA GGCTCGCGTT GTCCCAATGT CAGGGAAAGG351 GGAGATACGG CGCTGGACCG CTGCTTCAGG TGCGGGGAGA AGGGGCACCG401 GGCGCGAGAA TACGCAAACA AAATCCGCTG CGCTCCCTGT TCGGAGGCGG451 GCCGGCAGGC TCAACACCGG CTCGGTGGGT CTTCGTGCGG GGCGCCGGCC501 GTCAGGGGTA GGCCCATTCG GTGGGAATCC GATACCAGGA ACACGGGGGG551 AAACGTTAGG GGGGCAGCCT CCGCATGTCA GCCGGGAACG AACGGCTCCG601 CCGGTAGTTC TCTGTGAGCA TGAGGAGGCT TCTCCAAATA AATCTGAATA651 GGTGCAAAGG GGCACAGGAT CTCTTGCTTA ACTTGGTAGG GACGTGGGGC701 GTCGGGATCG CGATGATCTC GGAACCCCAC GTTGTTCCGT CGGGGGAAGG


751 GAGTTGTTCC GCTGGCACTT GGTTCGTATC CGAGGACGGC GGGGCCGCGA801 TCTTCTTCTC GAGCTTGGCG CCGGGTCCGA AACCTCGGGA ATTGTCCCGT851 GGGGTCGACT TCGTTATTGC AGGCTGGGGA AACTCGGTCC TGGTATCGGT901 ATACGCGTCG CCAAATCGTC CATTRMRAGT ATTTGAAGAT CAGCTGCGRG951 CGATTTCGAG GGAGCTGGAC AGGTTAGGGG ACTCGCGTCC GATTCTGATG1001 TCCGGGGACT TCAACGCGAA ACAYRCATCG TGGTCAGGTT ACGCGACTAA1051 CGCGCGCGGC CGATTGTTAA GACAGTGGTT AGACGAGCGT CATCTCATMG1101 TGTGGAACGC AAAAGGAKTC AAGACGTGCG TGCGGTCCAC CGGCGGGTYC1151 AYGATCGATC TCACGATTTC ATCCGGTTCK ACGGCGRCGC GGGTCTCCGA1201 CTGGGAGGTC CTTTCCGACT TCGAGTCCCT CAGCGATCAC GCCTACGTGG1251 CGTTCTSGTT CGGGGATYCT YGGAGCTCGG GGGGGGGCAG CCTCCTCGCC1301 GGACARRCCT CGCGTCGTCY CRCCCCGRTG GAGTCTGCGG TCGTTSGATG1351 ACGAACGGCT CGCCGAATGT GTCGGCTCGC GGCTGAGGTC TGGAGARATG1401 ACGTYGCTCC CGGCCCGAMA AGACGAAACG CACACGATAT GGCGCGGGGA1451 GTTCGGGACG ACTTAMGACG GATCTCGGAC TCCTGCATGA AGAGAATCGG1501 CTCGCGCTCG ACGACTCGGA AACGGSCGAA GTACTGGTRG ACGGACGAGA1551 TAGGAGAATT GTGGCAGGAG TCCTGTCTCG CACGTCGCCR GTTCGTCGGC1601 AAAAAAAGAC AAYTRATCAA ACGGGGAGGA CKTTACGAGG ACATCGCGAA1651 CGACGACGAA CTCGCGCKTT TAAAGGCGGA GTGGAGRGCT TCCCGGGYCC1701 GGGTCCGACG AGCGATATGG AGTTCTAAGG RGACGTGYTG RAAAGAACTC1751 TTGAGCGAAG CGGATTCGGA TCCGTGGGGT CTCCCRTTCC GTCTYGTGAC1801 CAACAAACTR AGRGGTTCGT YGGSACCGCT GACGGCCGGC ATGACGGAGR1851 AGTTTCTCGC CGAAGTGATC GCAGAGCTCT TCCCTCGCGT CGARCCGTTC1901 GCGTTTCCCC CGCCGGATYT CTCGCACGAC GCGAACCGCG ACCACGASAY1951 CGAAGTGACG GAGGGGGAGG TCCGCCTTCT CCTCGACGCG GCCTCGAAGA2001 GACGCTCGGC TCCGGGCMCG AACGGCGTGC ATTACCGCAT TATCGGCAAC2051 TCTGCCGGGG TGCTGTGTGC GCGTCTCTCG GCGCTCTACA CCGCGTGTTT2101 CCGAGAGGCG ACGTTTCCGA CGCCGTGGAA AGAAGCAAAC TTGGTACTGC2151 TGGACAAACC GGGTAGGGAT CCGACGACGC CGAGCGCGTA CCGGCCGATT2201 TGTCTTCTCG ATGTGGARGG CAAGTTGTTC GAGCGCGTGA TAGCGTCYCG2251 GATMGACGAA CACTTACGAT CAGGGGGAGG GAGGAACGAC CTCTCTCCAA2301 ATCAATACGG CTTCCAGACG GGACGTTCGA CGACCGACGC ACTCGACCGC2351 GTGTGCGCCG GAATCCGGKA CACGCTTCAC CGGGGCGGCG TCGCGATCGC2401 CGTCTCCATC GACATCAAAA ACGCGTTCAA CACGGTACCT 
TGGTCCGCGA2451 TAAGGGACGG GCTTAMGTCG AAGTCCGTGM CAGACTACCT CGTGTCGGTG2501 ATAGRGTCCT TCCTCTCGGA GAGGAARATC GCGTACGAGA AACCGGACGG2551 AWCGACGGGA AGGGCGGACG TGTTCTRCGG CGTGCCTCAG GGATCGGTTC2601 TAGGRCCACT CCTGTGGAAC ATCGCCTACG ACCGYGTCCT CACCCGGACC2651 GTTCTTCCCG AAGGCGTWTC GTTGACTTGC TACGCCGACG ACACTCTGCT2701 CCTCGCGACC GGTCGCGGGT GGGCGGAGGT TCGCGAACGY GCCGAAACRG2751 GACTTAACGC RACGGTGGAG SCCATCCGGG ACACCGGTCT CCGGGTGTCT2801 CTGCCGAAAA CGGAAGCTTG CGGGTTTCAC CGTCCKCGGA ATCCGCGTCC2851 TCGCGACTTR TCGATAATCG TGGACAACGT CRGAATCGGA GTGGGTAGTT2901 CACTTAAATA CTTGGGTTTG GTTCTCGAYT CCGGTCTTCA CTTCGGGGAA2951 CATCTGGCRC GCYTCGGACC GARRATCCGA GCCGTCAGYG CGACGTTAGG3001 RCGTCTGATK CCGAACCTGC GYGGMYCGCA GGTCAAGGTC AGRCGATTAT3051 RCRCGACGGT CGTTCACTCC GTGGYCCTGT AYGGAGCTCC GATCTGGGCC3101 GAGTCGSTTT CGGCAGCTCG GCCTCTTCGC GAGAAGGCAA TCCAACTCCA3151 RAAGGCGTCT CTGAACAAGG TGGCGATGGC GTACAGGGAC GTCGYGGCGG3201 AGGTGTCTTG TCTTCTGTYG GGCACTCCCC CGCTYGATCT CCTCGCGATG


3251 GARAGACTCG TYCTCTATCG CGAACGGGAT CGCGGCGGGC CTAGGACAGC3301 TCGTCACCGC ARGGAACTTC GCTCTCAAAC GATCAACACG TGGCAGGCGC3351 GATTCACCGA CGGGCGTAAG CGTTACGGGG GGGAAATCAT CAACGTTCTC3401 GGCCCCCGAG TGGGGGAATG GGTRGGGCGR GGTCACGGAA ACTTGACTTA3451 TCGCCTCACT CAGRTCCTCA CRGGTCACRG CGTGTTCGGC TCGTACTTGG3501 CGCGCATCGG CAGAGAGGAG ACGGCGRAGT GTTGGTTCTG CGGTGCGCYC3551 GAGGACGACG TCGARCACAC GGTCGCGATA TGTYCCAYGY GRGASGTCCA3601 YAGRCAGCGG CTAGTCGAGG TCATYGGACC CGACTTGTYM ATACGCGGTT3651 TAGTGAACGG GTTGCTCCGC GGACCTCGGG AATGGTCCGT GATCTCGCGA3701 TTTTCGGAAA CGGTCCTACG GACGAAAGAG GACCGCGAAC GCGAAAGAGA3751 AAAATCCGGA GTTCGGCGTC GCGCGCTCCA GGAAAAAAGA AAAAGGAAGA3801 AAAACAACGG CGGCAGCGAC GGCGAAAAGG AAACAGCCAA CGCCTGATCA3851 TCCGTGGATG ATGGACGATC GCATCCTCAG GAAACAAGCG AGTTTCCCAT3901 CCCCTTCCCC CAGCGAAGGA CCGGAGGACA CTCCCGAAAG GGACTTGCCG3951 AACGTCCTGG GAAATGCGTT CTCGTCTTCG AGGTGGCAGC AGAACGAAGA4001 CTAGGGCGAA TCTCCTACCA AGTAATCGTG TAGAACGGTA ACGTTCTGGC4051 TTCCCGGAGA GGGGACTGAC TGAAATGTAG GTACAACTCC CACGAACGTC4101 CCCCCCTCTT CATATAAGCC ACGCTATTGC GACATGCTCA GGAGTTTTAG4151 TGGGTATGGT TCCCGGTTAG GCCTTCTTTT CGAATCCCAT ACCGAGCCCC4201 ACACTCCCTC TTCGGAGGGG TTGTGCGTAA CTTGCATTTC TCCTGAGTTA4251 ACAAAAAAAA AAAAAAAAAA TTTTTTTTTT TTAAAAAAGG TTGGCTTGCT4301 CAGAAAGGGT CAGCGACCCG TTCTGAGTCT CTGGCTGGGG AGTCCTCTCC4351 GGGTTCGAGA CCTTGAGCGA TCACGTTGGC GGTGAATGAA TGTATGGATT4401 TCTTTTTTTT TTGGAATGCA CATCACCATC ATTCTTATAA GGTTCTTATA4451 AGGTACACTC CAAGATGAAT ACTGCATTAA ATTATATATT TATATGTATA4501 TGGGTAATTA TGGAAGGCAG AACCGAAAGT AAAACCTCGG GAACGTCTCC4551 CAGATTGTCA TTGGGAGAAA AGAATTCCAA TTAGGTGGAG AGGATCTTTG4601 CACTCCACTG TTCTATCCAC TATTTACATT TATTTCAAGC GGGAATCCAT4651 TGTTA

//


E.1.1.2 Dong-like R4

LOCUS dong-like 5266 bp
DEFINITION dong-like, 5266 bases, 28B checksum.
ORIGIN

1 GAAAARGAGG GCGCTGGTGT TCCATGGTTG ACAAGCCTTT TAGTCATATC51 GTTTTCTTTA ATTAATTTTA ATAGTTTTTT TTTAGGGTTT ATTTATAAGT

101 TATATATCAA AAGTTTTGTA ATTTAGTCTA GTTTACAGTG ATATAATTAG151 TTTTTTAAAT TATTTAGTAA TAACTATATT ATCTCAAGTT TAAATTTACA201 AATTGTTCAA CGATTCCATG TGACAAAAGG AAGCGTGGGC AATAAAAAAC251 CTTAGAAGGC ACCTTAGTAC ATATAAATTA AACGGATAAG TACCTCATTT301 TAAGACTTAA ATTTCAATTG TAAAACGTTA GGCTAACAAA GAAATTGAAC351 ACATAACCAA AAAATTTTTT TTATTTTTAT TTTTTTTttt tATATTTCTT401 TGATTAATTA TAAGTGATTT GACTATTTAT ATCTTTATCT TATTTAACAA451 AATAAGAATT AAAATATTTT TACAATAATA TAGGTTGATA CATGATTTTT501 TCAGGGTGTA ATTTCTATAC CCTTATTGTT AAGTGTTAAC TTGTGATTGT551 ATGTTATTTT AGCCTTGTCT TAAGCTAACT TAATTCATTG TAGTAAAGTT601 TATTAATAAT AACATAGTGT AACTAATACC TTTCTTAAAT CTGAATCAAA651 TTAGGAAAGT ATAGTAGTAG TCTTTATATT AAGTGATATT AGTGTGAAAG701 GGTACAACAG ATTTCTCTTT AACCAGACTT CATTTAATAA ATTTTGCATC751 TTTTGAATTA CTTATAAGAG TTAAATATCA TTGTCATACT AAATAAGAAT801 AGAGGTAGTT CAATATAAAT ACTCATTATT TATTGGTTTA ATGTAATGTA851 TAGCTGGATA TAACATCTCT GTACTCCTGT GTTAATCTAC TGATTGTTTT901 AATACAGAAT ATTTTGCCAT ATACTTTATG CAAATCTCCA TTTTGATCTA951 CATTAACACT AGCAGAGATC AATTTCTTTT ATACTACTTA ATAATAGTAC1001 AGAATTGTAA TTTGTATGTT GTGCATACAG TTAAGTGTAG TCTATGTAAT1051 CTCTTTTATC CTTAAATTAG TTTGTTTTGT TACTGCTTAG TTTATAATCT1101 TGATAGTGTA ATATTGCCTG TAAATTTGAT TCAAGATCAA TTAATTAAGT1151 TATAAATAAG TAGCCTTTTA GTATTTGTTT CTAATGTAAA AATTGTGAAA1201 AAGCATTATA CATTTGAGGG ACCACAACTT AAACAAACCT TATACAACTT1251 AAGTTGAAGT CTAATTGCAT ATTATAACTA AATAGTCTAG CCAAATATCT1301 CTACCATACT CATTGAAAAT AGGGATAGCT TAACAAAATA GGTTAATAAG1351 ATAGTTAGTA ATATCTGATT TAATATATCA ATTGGGTAAA AAATTGGTTC1401 TGTCTAATCC AGTTGTCTGA TAACTAGTGT TCTAATTGTA TAATATTTAA1451 CCTTACATTA AATTGTTTCT GCCTTATTGC AGCATATTTT ACATGATTTA1501 CAGCAAATTG ACTTTTATAA GTATAAGAGA CATAAGTGAT AAGTTTAATT1551 AAAAATTTTA TAATAGTGAA CTCTAAATAC TGAATTAGAA AATGTTTTGT1601 TATAAAGGTG AATGAATTTA TAAATTTATA TATAGGTAAA AAGAGCAAAA1651 TGAGAACAAG ACAAAGTAAG AACCAAAACA AATCTTGTTC TACTGTCGAC1701 CTGAGGCAGC TCGACGAGAA TGTGAATTTC ACTGCTTCTG ACCCTGGCCA1751 CTCCAACGAT GTTAGTCACC GGAGCCCAGT TCAGGAGACA ACTCGTTCCC1801 
ATCGAATAAG ATGGACTCAG GAAGACCTTC AAGAATTGAT GTGGTGCTAT1851 TTTTACTCTC AAAAATTTGG TTCAGGATCG GAAAGTGACA CCTTTAAAAT1901 CTGGAGAGGG AGAAACCCAA ACAGCAGGAA GGACATGACA TCGAAGAAGT1951 TGGCTGCCCA AAGAAGATAT ATAATAAAAA CAAAAAAAAT TGAAAATGAT2001 AAATTAGAAG AAATTAAGAA AAATGTAGAC TCTTCGTGCA GTAATGTTAT2051 ACGAGATAAT GCACAAATAA TAACAGAATT AAATGAGTCT AACCACAAGA2101 GTAAAACCGA TACTGACGAG CAATTGTTGA CTGATGCAGA AATGAAGAGC2151 ATCGAAGAGA GACTAATTGA AGAGATAAAA AAAGTAAAGA TGTGCCCATT

2201 AATAAACAGG GAGCCATTAA GGAAAATTTA CAAAAATAAA AAGGCAACTG2251 AAGTTTTACA TCTTATAGAT AACACCCTCA TAAATGTATT AGAAAAAGTG2301 GTAGACATTA ATTTAACCAC AATTAATGAA ATTATATATG CTGCGGGGGT2351 AGTTGCCACA GATATTATAC TTGGGCCAAG AAAAGAAGCT CGTCATAAAG2401 GAATGGAAAC AAAAAAGTCA ACATCACCAA TATGGATACA AAGAATAGAG2451 GGAAAAATAG AAAGAATAAG ATTACATATC TCCCTGGTAT CCGAAATGAA2501 GAAGAATAAT AATTTAAAGA AGAGGACAAT AAAGAAACTG GATCACCTTA2551 AAAGGATCTA TAAATTGAAG ACCATGGAAG ATATAGAACT TACCATGGAA2601 ACTTTAAAAC AAAAGGTATT GCTCTATTCT CAGAGAATAA GAAGATATAA2651 GAAGAGAGAG CAGTTTTGGC GACAAAATAA ATTGTTTGAA TCGGACCCAA2701 AGAAATTTTA CAGAACAATT AGAGAGCAGA ACATACAAAA TGGATTCAGC2751 ACCTTAAATG TAGAAAAAAT GGCAGACTTT TGGTCAAATA TTTGGGAAAA2801 ATCTCATCCA TTAAATAAGA ACTCTACATG GATGAATAAA GAAAAAGAAG2851 CACATGCCTG GATTGCTTCT TCAACAATGT CGGATGTAAG AATGGCGGAT2901 TTAGAAACAT GCCTAAAAAA CACGGCCAAT TGGAAAAGTC CCGGATTAGA2951 TAGGGTACAA AACTTCTGGA TTAAGAATTT TACAAGTACC CATAAGTATT3001 TACTGGTATC CATAAATAAA TTAATAATGG GACGGCAAGA AATGCCAGAA3051 TGGATAACCA CAGGTAAAAC GTATTTGCTG CCAAAAAAAT CAGGAGCTAC3101 GGAACCGAAA GATTTTCGGC CGATAACCTG TTTGCCCACA ATGTATAAAA3151 TAATAACAGC TATTATTGCT GAAAAGATTT ATGGGCATTT AAGAAAAAAT3201 AACATTTTTC CTCCTGAACA ATATGGATGC AGAAAAGGGT CTTACGGCTG3251 CAAGGAGGTT TTATTAATAA ATAAATTGAT CATGGCCAGT GCAAAACAAA3301 AGAGGAAGAA TTTAAGCATG GCATGGATTG ACTATCAAAA GGCCTTTGAT3351 AGTGTGCCTC ACGAATGGAT TATTGAGGCA TTGAAAATAT ATAAAGTAGA3401 CCCTAATATT ACAGCGTTCT GCGAGAAGAG TATGAAAAAT TGGTGCACCC3451 AGCTGGAAGT GCAAAAATAC TCTTCTAGAA AAATATTTAT AAAAAGAGGA3501 ATTTTTCAGG GAGATTCATT GTCGCCACTT TTATTTTGCA TGTCTTTAAT3551 TCCTCTATCC AGACAGCTTA ATATCAAGGA TCAAGGATAT GAGTTGGTAC3601 CGGGAGGCAG GAAAATTACC CATATGCTAT ATATGGATGA CTTAAAAATT3651 TATGCCAAAA ATGAAGAGGA GTTAAATAAA ATGTTACGGA CGGTTCAAAC3701 CTTTTCCTCT GACATCAACA TGAAATTTGG GTTAGAGAAA TGTGCCAGAA3751 TAAATATTGT CAGAGGAAAG TTAAAACAAA AGCAAAATAT AGAAGACTCC3801 GAAGAAGAAC TTATTAAAGA ATTGGACCCT GGATCATCAT ACAAATATCT3851 GGGGATTGAA GAAAATTTTG GGATAGCCAA CAAGGAAATT 
AAACCTCGAT3901 TGAAAAGAGA ATATTTTAAA AGATTGAGGC TTATATTACA GTCGGAACTG3951 AATGGAAGAA ATAAGATAAC CGCCGTCGGC ACATTAGCAG TTCCTGTAAT4001 AGAATATAGT TTTGGCCTCG TAGACTGGAC GAAAGAAGAA ATCACGCACC4051 TAGATAGAAG GACAAGAAAA ATATTAACCA TGAATGGTGC GTTACATCCA4101 AARGCTGATG TGGATAGATT GTACGTCAGC AGGAAAGATG GAGGAAGAGG4151 ATTACGACAA ATAGAAGCGG CATACCAGAA TGCCATWATT GGAATGGGAA4201 AATACATAGA ATCCCATCRA GAGGACCCYA TCTTAGCCCA AGTTATACAT4251 GCAGAAGAAA AAACTACAAA AAAAGGAGTT CTGAAAAGGG CAAAACAAAT4301 CGTCCAAGAA AATAAAGAAA ACGAGATAAT GGAAGARGGG CAACTTGCAA4351 CTTATAATAG CAAAGCCCAA TCTCAGAAGA AATTAATAGG CAAATGGGAA4401 CAGAAAAAAT TACATGGGCA ATACCTAAAA AGAATAAATG CCGAAGATAT4451 TAATAAGAAG AGCACGCACA ATTGGCTACG ACGTGGAAAA CTTAAAATTG4501 AAACAGAAGC GTTTATTACA GCGGCTCAAG ATCAGGCATT GCGGACCCAT4551 AACTATGAAA AAGTAATTCT CAAAGTCCGC CAAGATGACA AGTGCCGAAT4601 TTGCCAATCW CAATCGGAAA CCATCGATCA TTTAATTTCC GGTTGTCCGA4651 TACTGGCAAA ACATGAATAC TTAGAAAGGC ACAATAAAAT ATGTCAATAT

4701 CTCCATTGGA ATATATGCCG AGAATATGGA ATGGATGGAT TACCCAAGGA4751 GTGGTACAAC CACATTCCAA GCCCGGTTAC GACAGTAGGT CCATGCACAG4801 TTCTATATGA TCAACAAATC CACACTGATA GAACTGTGCC AGCTAACAAA4851 CCGGATATCA TCCTCAGGCA TAATGGGGAA AAATGGTGTA AGTTAATTGA4901 GGTATCCGTG CCGGCAGAAA AAAACACCAC AGCCAAAGAA GCAGACAAAA4951 GGCTGAAATA CAGAAATTTG GAAATTGAAA TAACCAGGAT GTGGGGAACA5001 AAAACTGAAA CGATTCCGGT CATCGTGGGA GCATTGGGAG CCATGCCAMA5051 TTCAATAAAA GGAAATTTAA AGAAGATTAT GAAGAACCTA AAAGAAGAAA5101 CCATCCAGGA AATCGCACTA TGTGGGACGG CCCACATTCT TCGGAAAATA5151 TTATAAATAG CACCACCGAT ATTCGAGTAT TTGTCCCTAA GGAATCGGGT5201 AGAGACCTGG GATAAATGCT TAAAAACCCC GACAAATAAG AGCCTTGTCG5251 TGTTTAAGAG TGACCA

//

E.1.2 LTRs

E.1.2.1 Mdg1 ty3/gypsy

LOCUS mdg1 5395 bp
DEFINITION mdg1, 5395 bases, 9DF checksum.
ORIGIN

1 TGTTATGATC CCGTACTCAA TATTCACATT TTCAAATTTT TATAAACAAA51 AGACTGGCGA CGAACTCGAA TTGTTCTTCT TTAGAAGCGA AGCTTCTAAA

101 GACTCCTTCT TTTATTTTAT TTGCAAGTGT TTGCTAGTTC CCATTGTTGC151 TCGTTCGAAA CAGTTGACTG ACGAAAGACC AAAGTAATAT AGGCAAGCTA201 CACAGCTTGG CGTnGGTTCA TTTGATTTTT AGTCTAGCTG TAATAAGTTC251 TTGTTTATTA ATATTAGTTT TGTtagtGAT AGTTTTAAGT GCTTGTTCAT301 TACTTACAAA AATAAAGAAC GAACCGAATA TAACATATTA AAATTTTGGC351 GCAGTCGACG AAGGATCATT CTGAACGACA AGAAACCGGA GAATTCTTAA401 GTAATTCGTT TCATGTTGGT GGGTTTTAAC TCTTTATCAG AATTTCTTTA451 TACTGTAAGT AGTAGTAATA GGTAACTCAT TTttTGAACA ATGCCTAAAC501 TTGATCAACC TGTTTTCAAC CCATCCCAAA TGTCCCTTAG TTTCTTGTGT551 AGTCTAGTTC CTAATTCCTT TTCCGGAGAA AGGAACCAAC TTAAtGCATT601 TATctcTGAT TGTGATAGTG TTATCGAAAT GTCCTCTGAA GAgAATAAGT651 ATCCACTCTT TAGATTCATT TATTCCAGAA TCACTGGTAA agCCAgGgaT701 CGAATTTCTa TATACCATTT TGATAATTGG GATGACGTAA AAaCGAAACT751 CATAGAATTA TATcAAGACA AGAAACCCCA TAGTCAACTA ATGGAGGAGT801 TGACTAGTTG CAGGCAAAAA TCAAATGAAT CGGTAACGGA ATTTTATGAA851 AGATTGGAGA ATTTGTCCCG TGAAATTGTA TCCAACTTAA AGGTGGATGT901 CAAGGATAAA AGAGCTCACT CAATAAAAAT AGATTACATT AACGAGATTG951 CTTTGGGTCG TTTTATTTAC CATTCAAATT CCGGAATTTC CCAAGCGCTG1001 AGATGGAGGA ATTTCGACAG TATCAACTCA GCATATTCAG CAGCCATTGC1051 TGAAGAAAAA TTTCTAGAAA TGAGAAGGCG TGATAGGTGC TGTAATTGTG1101 ATTCTAGAAA TTCTAATGCT TCAAAACTAT TGTCTCCAGC TCAAATTCCA1151 AAATTCTGCA AGTACTGTAA AAAGTCAGGA CATCTTTTAG ATGAATGTTT1201 CCGAAGAAAG AAAAtTAATG AGCTCCGAAA AAAACAAAGT AAAGCAAGTG1251 TTAATTTAAA CTTACATCAG TCCCCAGTGG TAGACACCGC ACtGGAGGAG1301 TCAaTAGGtC aACTGAAGGT GTCAGAAATA TAAAATTcTT tGTTGTTAAt1351 GATAATtTAA ATTTCATTAA AGTATCGTCT AATAACTCCA AAAATCCgGG1401 TAAGTCgCTG AAATTCATCT TAGATACCGG TTCCAGtGTA AACATtATAA1451 AaGCGTGTAA GTTAACtCCT GAAACTAAAT TTAACTCCaA gGAAAAAaTC1501 AAaCTTCAgG GAATCAGTCA tagTCAgCAA gCcTTGGAAA CAGTCGGTtC1551 tTGcgTAATa CCacTAagAA TcGAGGtaaA AtaatataCC aCAAAATTTC1601 AtATTCTtaA TCAaaCAACT AATATcCCat AtGATGGTcT GCTAGGAAAA1651 GAATTTTTTa TAaAAcAcTC tGCATGCATC GAATATGAAA CCAACACcGT1701 AAAATTGAAA GAAATTTCTA AACCACTAGT TGTGTATTCC GAGGAGACTT1751 CATCTTCAAA CAACCTTCAC CTTAAGGCTA GAACAGAAAC CaTAGTTAAA1801 
ATAAACATCC TCAACCCtGA AATAAAGGAA GGAATAGTGC CAaACACAAA1851 AATTATGGAA GGCGTCTATT TGTCaCGTGC CATCGTTAAA GTAAACAAtA1901 ACAACGAGGC TTACGCCACC ATTTTGAATA CAAGAACGAC TGATcAAACA1951 ATTGAACCAA TCACaGTTcG CCTTGAGAAA GCTTCAAAAC AGTGTTTcCA2001 AATAAAAAGT CTAGAACCAC GACATAAAAG GAAATCAATA ATAAGTAACC

2051 ATTTGAGGTT AGAGCATCTT AATGAAGAAG AAAAAACATC AATAGTTAAA2101 ATTTGTGAAT CTTATTCcGA TATTTTCCAC TTACCCGGAG ATTATtTAAG2151 TTCGACAGAG GCAATTGAGC ATAAAATCAA TACTATTAAT GAAAATCCTA2201 TTTATACTAA AACTTACAGA TATCCnGAAA TACACAAAAG AGAGGTTAAC2251 AAACAGGTGA CCGATATGCT GAAACAAAAC ATTATAAGAC CTTCCAATTC2301 TCCTTGGTCG TCTCCATTAT GGGTTGTTCC AAAAAAATCT GATGCTTCTG2351 GAGAGAAGAA ATGGAGAGTA GTAATTGATT ATCGAAAATT GAATGAAGTT2401 ACAGTAGATG ACAAATATCC TATTCCTAAT ATTGAAGAAA TTTTAGACCA2451 GTTAGGGCGT TCAAAATATT TTACTACTCT AGACTTGGCT TCCGGTTTTC2501 ACCAAATTCC ATTGAATGAC GATGACAGTA AAAAGACAGC TTTTACAACT2551 CCTTTTGGAC ACTACGAGTA CACGCGTATG CCTTTTGGAC TGAAAAATGC2601 CCCAGCTACT TTTCAAAGAC TAATGAATAC CGTTTTATCC GGTTTACAGG2651 GACTACAATG TTTTGTTTAC CTTGACGACA TCGTGATCCA TGCTGCTAGT2701 ATTCAGGAAC ACGAAATTAA GTTAAGGAAA ATATTCGACA GACTCCGACT2751 AAATAATCTC AAATTACAGC CGGACAAGTG TGAGTTCCTA AGAAGAGAAG2801 TCGTTTACTT GGGACATACT ATCAGTGACG TAGGTGTCAG ACCAAACCCA2851 GACAAAGTAC AAGGAATTAA CTCATTCCCT ATCCCGAAAA GTACTAAGGA2901 CATAAAGTCT TTCTTAGGTC TGGTAGGTTA TTATCGTAGG TTTATAAAAG2951 GATTTGCCAA AATTGCAAAG CCCTTGACAA TTCTTCTTAA GAAAAACCAG3001 GACTTCAAAT GGACAAACAA AGAGCAAGAA GCATTTGAAA AATTCAAGGA3051 AATCCTTTCC ACACAACCCA TAnTACAGTA CCCCGATTTT AATAGCGAGT3101 TTGTCTTGAC AACGGATGCA TCGAATTTCG CAATAGGAGC CGTCCTCAGT3151 CAnGGGGAAA TAGGAAAAGA CTTnCCAATT GCCTATGCAT CTAGAACTCT3201 GAATGAGTCC GAATTAAACT ATAGTGTAAT nGAnAAnGAA CTCCTAGCnA3251 TCGTATGGAG CGTCAAACAT TTTCGACCCT ACCTCTTCGG AAGGAAATTC3301 ACTATTGTCA CAGATCATAG ACCTCTAACC TGGTTGTTTA ATTGCAGGGA3351 GCCCAATAGT AGGTTGGTAC GGTGGAGAAT TAAATTAGAA GAATATGATT3401 ATAAGATTAT TTATAAGAAG GGAAGCCTAA ATGCAAATGC CGATGCACTC3451 TCCAGAAATG TCTATATAAA CGTACCATCG TCCGACAAAT CCGAAAGAAT3501 CCATCCCAAA AGTAAGGAAG AGATTAAGAA AATTTTGGTA GAAAACCATG3551 ATTCTAAACT GGCTGGACAT TGTGGATTTT GTAAAACCTA TCAAAGAATC3601 AAACAACGTT ATTACTGGAA AACAATGAAG AATGATATAA AAAACTATAT3651 AAGAAGTTGC AAGTCCTGCC AAGTTAACAA AATTAATTTT AAACCTATTA3701 AGGTGCCCAT GGAAATAACA ACTACCTCTA AGCAAGCTTT 
TGAAAAATTG3751 GCCATAGATG TCATGGGACC ATTGCCAACC ACTAGTGAGG GAAATAAGTT3801 CATTCTGACC ATGCAAGATG ACTTAACCAA GTATTCATAC GCGGAGCCCA3851 TACCGAACCA TGAGGCGAGA ACAATAGCTT CGAAGATCTC CAAATTCGTG3901 ACGTTATTCG GAATCCCTAA ATTTATTCTG ACAGATCAAG GCACTGATTT3951 TACATCGAAC ATAATTAAAA ACTTAATGAA ATTATTTAAA ACAAATCACA4001 TAAAATCTAC TCCATATCAT CCGCAAACTA ATGGAGCATT GGAGAGATCT4051 CATCTGACCC TGAAGGAATA TTTAAAACAT TATATAAACG AGAGACAAGA4101 CGATTGGGAC GAATTTATTC CTTTTGCCAT GTATTCCTTC AATTCGCACA4151 CGCATACTTC CACAGGTTAC ACTCCATATG AGCTTTTATT TGGCAAGAAA4201 CCGTTCATAC CGAATTCATT GATAAGAAAA TCTACACTCA GAAAATTCAT4251 TGATAAACCA TTTAATAAAA ATTACGATGA TTATATTAGT GATTTAAAGG4301 AGAAGATTCA AATCAGTCAA AAATTAGCCA GAGAGAATCT AATAAAACAT4351 AAGGTGAAGT CAAAACAATA TTACGACAAT AAGATCAATA TTCACGATTA4401 TAAGATTGGA GATTTGGTTT ATATAAAAAA CAATTTAACT AAAATAGGAA4451 TCAATAAAAA ACTTAGTCCA AAATTTAAAG GCCCATACGA GATCAAGAAA4501 ATATCTGGGA ATAACGTTTA TTTAAAAATC CGTAACAAAT TAGTTACTTA

4551 TCATGTTAAC AACACAAAAC CGTGTTCTGG GTAGTTGATC TAAAGAGAGA4601 AATCCAAAAT AATAATAATT TAATCCTTTT ATTTTCCCAT ATAGCTAATC4651 ATCATTTATT TTATTATTAA TCCATTTATT CTTTTTCTAA TCATTATTTA4701 TGTTTCCTAT ATACAATTCT TGTATTTTTC AGAACTTATG TCAATTTTGT4751 TTCGAATTTC ATGTCTTTTG ATTTTATTCT TTTATCTTTA CTCTTTATGT4801 AACATTGTTC ACTAACGAGT ATTGCCTAGG AAAGGTTCTG CTACTTGCAG4851 CTATAAAGCT TAAAATTATG TTCCTGATAA GAAGCCAATG TGAAGAGATA4901 CAACAGTCTA CCATACAGAA GACATAGCAC CTACTACTAC TCAAGGAGTC4951 TCAACCTGGA CGGAACCATT CAGAACCCGA CATAGGAAGC TGTCAAATCC5001 GAACAAGAGG AGAAAACAAA TTCTTTTTCA AATAATTGTT TTCCGTCTTA5051 AGGGGGGAGG TGTTATGATC CCGTACTCAA TATTCACATT TTCAAATTTT5101 TATAAACAAA AGACTGGCGA CGAACTCGAA TTGTTCTTCT TTAGAAGCGA5151 AGCTTCTAAA GACTCCTTCT TTTATTTTAT TTGCAAGTGT TTGCTAGTTC5201 CCATTGTTGC TCGTTCGAAA CAGTTGACTG ACGAAAGACC AAAGTAATAT5251 AGGCAAGCTA CACAGCTTGG CGTCGGTTCA TTTGATTTTT AGTCTAGCTG5301 TAATAAGTTC TTGTTTATTA ATATTAGTTT TGTTAGTGAT AGTTTTAAGT5351 GCTTGTTCAT TACTTACAAA AATAAAGAAC GAACCGAATA TAACA

//

E.1.3 Transposons

E.1.3.1 mariner

DEFINITION Pediculus humanus humanus mariner consensus sequence
FEATURES Location/Qualifiers
source 1..1276
    /organism="Pediculus humanus humanus"
    /mol_type="genomic DNA"
    /transposon="mariner transposon"
misc_feature <1..2
    /note="target site duplication"
repeat_region 3..33
    /note="left terminal inverted repeat"
ORF1 285..584
    translation=GDEALIERQCQNWFAKFRSGDFSLQNEECSGRQLEVKDEQIKALIDYDRHSSTKDIVKKLDVSHTCVKNRLRRLGCQKKLDALLWGTLVNEATWSLRYAS
    /product="mariner transposase"
ORF2 683..1213
    translation=ARKQAPTTSKTDIHQKKVLLSFWWDYKGIVNFELLPRCQTINSEVYIRQLTNLNDTIQEKRPELANSKGIVFHHHNARPSPSLATGQKLLELGWNVLLHPPYSPKLAPNNYHFFRFLKNFLNGQKFQNDNEVKTALEQFFAPKTKEFYEKRKMILPEKCQKVTNNNKHNIIDKNNLT
    /product="mariner transposase"
repeat_region 1244..1274
    /note="right terminal inverted repeat"
misc_feature 1274..1276
    /note="target site duplication"
ORIGIN
1 TATTGGGTTG GCAAATAAGT AACTGCGGAT TTTACCAACA GATAGTTTGT
51 TATTTTTTTT GAGTACGTTT ACGTTTTTGT ACAGACATGA ACTTTTGATA
101 TGTTATTACT TGGTTCCTTC TGTAACATTC GGTACCAAAA TTTCATTGAA
151 CTCTTAAATA GTACGCGAGC AAAAGATAAT TAAACATGGG GAGCCAGAGC
201 GAGCATTTCC TCCACATTTT ACTTTTTTAT TTTTGAAAGA GTGTTAATGC
251 TTCCCAGGCC AATAAAAAGT TGTGGGTCGT GTAGGGGGAT GAAGCCTTAA
301 TAGAACGGCA GTGTCAAAAC TGGTTTGCGA AATTCCGTTC TGGAGATTTT
351 TCTTTGCAAA ATGAGGAGTG CTCCGGGCGC CAATTGGAGG TTAAAGATGA
401 GCAAATAAAG GCCCTCATTG ATTATGATCG GCATAGTTCG ACTAAGGACA
451 TTGTAAAGAA GCTAGATGTG TCACATACGT GCGTCAAAAA CCGTCTGCGG
501 CGTCTTGGGT GCCAAAAGAA GCTTGATGCG TTACTTTGGG GAACGTTAGT
551 TAACGAGGCG ACTTGGTCTT TGCGATATGC TTCTTAAACG CAATGCAAAT
601 GACCCTTTTT TGAAAGAATG GTCACCGGAG ATGAAAAGTG GGTTGTCTAT
651 GATGACTTTT TGAGAAAAAG ATCCTGGTTT AGGCAAGGAA ACAGGCACCA
701 ACAACTTCTA AGACTGACAT TCACCAAAAA AAGGTATTGT TATCATTTTG
751 GTGGGATTAC AAAGGCATAG TCAACTTTGA GCTGCTGCCA CGATGTCAGA
801 CCATAAATTC AGAGGTTTAC ATTCGACAAT TGACAAATTT AAATGATACC
851 ATCCAAGAAA AACGACCGGA ACTAGCCAAT AGCAAAGGAA TTGTCTTTCA
901 CCACCATAAT GCCAGGCCCT CCCCATCTTT AGCCACTGGA CAAAAACTAC
951 TGGAGCTAGG CTGGAATGTT TTGCTGCACC CTCCATATAG TCCCAAACTA
1001 GCTCCAAATA ATTATCATTT TTTCCGATTC CTAAAAAATT TTTTAAACGG
1051 ACAAAAATTC CAAAACGACA ATGAGGTCAA AACTGCATTG GAGCAGTTTT
1101 TTGCTCCTAA AACTAAAGAG TTTTATGAAA AAAGGAAAAT GATACTACCC
1151 GAAAAATGTC AAAAGGTCAC TAATAATAAT AAACATAATA TAATAGATAA
1201 AAATAATTTG ACATAATTAA TAAATCGTTT TTTGTTTTCT TAAAAAATTC
1251 GTAAATATCT TTTTGCCAAC CCAATA

//
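The ORF coordinates in the mariner record above can be spot-checked by translating the annotated region and comparing it with the listed translation. The following is a minimal sketch, assuming the standard genetic code and 1-based inclusive feature coordinates; the names `translate` and `feature_slice` are illustrative and are not part of TESeeker. Only positions 251..350 of the record are embedded, so the check covers the first 22 codons of ORF1.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate DNA in frame 1; codons with ambiguity codes become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[pos:pos + 3], "X")
        for pos in range(0, len(dna) - 2, 3)
    )


def feature_slice(sequence, start, end, first_position=1):
    """Return bases start..end (1-based, inclusive).

    first_position is the record coordinate of sequence[0], which allows
    a fragment of a longer record to be sliced with record coordinates.
    """
    return sequence[start - first_position:end - first_position + 1]


# Positions 251..350 of the mariner record (two 50-base lines).
fragment = (
    "TTCCCAGGCCAATAAAAAGTTGTGGGTCGTGTAGGGGGATGAAGCCTTAA"
    "TAGAACGGCAGTGTCAAAACTGGTTTGCGAAATTCCGTTCTGGAGATTTT"
)

# ORF1 is annotated as 285..584; translate the part covered by the fragment.
orf1_prefix = translate(feature_slice(fragment, 285, 350, first_position=251))
print(orf1_prefix)  # GDEALIERQCQNWFAKFRSGDF
```

The printed peptide agrees with the first 22 residues of the ORF1 translation given in the feature table. Applying the same functions to the full sequence would check the remaining residues of both ORFs.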

E.1.3.2 MITE1

LOCUS MITE1 623 bp
DEFINITION MITE1, 623 bases, A64 checksum.
ORIGIN
1 CTCCGACGTC GGAACCCCGC GACTCGATGG GAGCCGCGAT CTCGCGGTTT
51 CGGGTGGAGG TGGGGGAGGG CGCGAAAAAT TTTTACTTTT TTTTTTCGGA
101 ATTTTGACCG CGGTCGGAGA CTCTCCGACG TCGGAaCSCS GaCSMCKCGA
151 YGGGAGCCGC GATCTCGCGG TTTCGGRTGG AGGTGGGGGA GGGCGCGAAA
201 AATTTTTACT TTTTTTTTTT TGAATTTTGA CCGCGGTCCC GGACTCTCCG
251 ACGTCGGAaC SCSGaCSMCK CGAYGGGAGC CGCGATCTCG CGGTTTCGGG
301 TGGAGGTGGG GGAGGGCGCG AAAAATTTTT mTTTTTTTTT TYGGAATTTT
351 GACCGCGGTC GGAGACTCTC CGACGTCGGA ACCCCGCGAC TCGATGGGAG
401 CCGCGATCTC GCGGTTTCGG GTGGAGGTGG GGGAGGGCGC RAAAAAWWWT
451 TacTTTTTTT TTTCGGAATT TTGACCGCGG TCGGAGACTC TCCGACGTCG
501 GAaCSCSGaC SMCKCGAYGG GAGCCGCGMT CTCGCGGTTT CGGRTGTAGR
551 TGGGGGAGGG CGCGAAAAAT TTTTTtTTTT TTTTtCGGAA TTYGACCGCG
601 GTCCGAGACT CTCCGACGTC GGA

//

E.1.3.3 MITE2

LOCUS MITE2 169 bp
DEFINITION MITE2, 169 bases, 2038 checksum.
ORIGIN
1 TAGGTCGACC TTGAATMCAA GGCCATTGGT TTTACATTTC RATTTTGTGA
51 AATTTTTCAT GGTCATGATT TTTCAACCAA GGTCGACCTT GAATACAAGG
101 CCACCAGTTT TAAAATTTTA TTTTGTGAAA ATTTTTCATG GTCACGTTTT
151 TGCATTCAAG GTCGACCTT

//
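Each record in this appendix declares its length on the LOCUS line, so a listing can be sanity-checked by counting the symbols in its ORIGIN block. The following is a minimal sketch that uses the MITE2 record above as input; the function names `count_bases` and `declared_length` are illustrative assumptions, not part of any tool described in this dissertation. Position numbers and whitespace are discarded, and IUPAC ambiguity codes such as M and R count as bases.

```python
import re


def count_bases(origin_text):
    """Count sequence symbols, ignoring position numbers and whitespace."""
    return len(re.sub(r"[\d\s]", "", origin_text))


def declared_length(locus_line):
    """Return the integer that precedes 'bp' on a LOCUS line, or None."""
    match = re.search(r"(\d+)\s*bp", locus_line)
    return int(match.group(1)) if match else None


mite2_locus = "LOCUS MITE2 169 bp"
mite2_origin = """
1 TAGGTCGACC TTGAATMCAA GGCCATTGGT TTTACATTTC RATTTTGTGA
51 AATTTTTCAT GGTCATGATT TTTCAACCAA GGTCGACCTT GAATACAAGG
101 CCACCAGTTT TAAAATTTTA TTTTGTGAAA ATTTTTCATG GTCACGTTTT
151 TGCATTCAAG GTCGACCTT
"""

observed = count_bases(mite2_origin)
expected = declared_length(mite2_locus)
print(observed, expected, observed == expected)
```

Because digits are stripped wherever they occur, the same check also works on listings in which a position number has been run together with the preceding block of bases.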

E.2 C. quinquefasciatus

Because there were more than 100 non-LTR TEs identified in C. quinquefasciatus, we list only one element per family.

E.2.1 Non-LTRs

E.2.1.1 CR1

LOCUS Cp_CR1_Ele11 3116 bp
DEFINITION Cp_CR1_Ele11, 3116 bases, CE7 checksum.
ORIGIN

1 CCACGAATCT GCTGCTTTAY TAYCAGAACG TTGGAGGCAT TAATACCACT51 ATCGCCAACT ACGCCCTYGC AATCTCTTCY GCCTCMTACG ACCTGTACGC

101 AWTMWCTGAR ACGTGGTTGA CYTCWGCTAC TCTRTCTGGT CAAATYTTYG151 GTCCCGAATA YGAAGTATTC CGTGGAGATC GGACMGYCTC GAACAGYWKT201 AAAGRRTCAG GCGGRGGAGT YCTGCTTGCC GTCCGCTCSA AMCTAAAGCC251 ACGCCAAYTR TTCCCWCCAR ATTGTACCGT TCCRGAGCAA GTMTGGGTYG301 CAGTTCCACT CGCTGCATCY ACGATGTTYG TGTGTGTTAT CTACATTCCT351 CCYAAATTTG ACAACGATAA GCCGCTGTTC GATCAGCACA GACATTCTTT401 GACGTGGATA GTCTCCAAAA TGAAAGTGAA CGATAGTGTT ATGGTCCTCG451 GTGACTTCAA CTTCCCAGCC ATTCGCTGGA CGCGSACMCC GACGAACAAA501 CTGMTTCCAA ACYTAGCCCT YACTCCGACC AAYGMGTTAA AGCACAAMCT551 CCTGGATGAS TATTCYACCG CAAACCTTAG CCAACTGAAY GACATGYGCA601 ACAACTCAAR CAACGTTCTY GACCTRTGCT TTGCCAGCTC WGRKACACCG651 ATCAACTWTA CYCTTCTMCC AGCWCCTYTR CCKTTGGTKA AAGACGTGCG701 SCACCACYTK CCRTTTCTYG TWTCSATWTC YTGCACGRYG CTCSMTTTTC751 GTGAWGTYGC TGGYAAYWCK TTYATGGACT AYCGWAAGGG AAACTAYGAT801 GRCATGAACA ACTTCCTGAC CAACATTAAY TGGMACCAAC TWYTGGCCAA851 CCTTGACGCC GACACAGCYG CTGMTACTTG GACAGGTGTT CTGACGGATG901 CCATCAACAC CTTCGTTCCA AGGAAACWGC GCCAGCCTCC AAGAYATCCA951 CCGTGGTCAA CACMTCGAYT GCAGATTYTG AAGTCCAGGA AACGMGCTGC1001 CCTCAAGAAA TWCGCCAAAC AYCCGACAGA TCGATGGAGA AACCATTATA1051 GGTCAAGAAA CCGGAAGTAC AGTATCCTGA ACAAACAACT TTTTCATCGC1101 CACCAACACC GAATCCAAAG CCGATTGAAA CGAGACCCCA AGAAGTTCTG1151 GAAYCAYGTA AACGAGCAGC GGAAAGAGAC AGGTCTRCCA ACTGCGATGA1201 TACTCGACGG TGARGAGGCY ACTTCCACCG AGAGTATAAG CGATCTKTTT1251 CGTCGCCAGT TCAGCAGCGT ATTCACCAAC GAAGCAGTAG RGGAAACGCA1301 TATTGCTAAG GCTGCTAGCA ACGTTYCACT GCGACCTCCC ATYGGACCTC1351 ACCCGGTGGT CACTTCCGAG TCCATCCGTC GTGCCTGCGC CTCTCTCAAA1401 GGTTCTACCA GCTGCGGYCC AGACGGCATC CCMGCGTTTG TGCTMAAAAA1451 GTGTTGYGAT GCACTCGCGG AACCAYTGGC TCAACTYTTC AAYACCTCGC1501 TTGCTACTGG AGTTTTCCCG TGTTGCTGGA AGAAGTCYTW TGTKTTCCCA1551 GTYCACAAGA AGGGCCCMAA ACGTGATGTC CGGAACTATC GYGGAATTGC1601 TGCCCTCTGC GCAGTYAGCA ARCTGTTCGA AGTWATCGTG CTGGATTTYA

1651 TYAAGTTCAA CTGCTGTGAC YATGTCGCCC WGGAACARCA CGGCTTCATG1701 GCGAAACGTT CCACYAACTC YAACTTGGTC TCYTACTCGT CCTTCATTCT1751 WCGAACCATG CAGCAACGGA AGCAGATCGA TGCCATCTAT ACGGACCTAT1801 CAGCGGCYTT CGACAAGCTG AACCACCGYA TYGCYGTTGC KAAACTGGAA1851 CGMCTAGGYT TCGGCGGGCC CATGCTYGAT TGGCTWCGCT CCTATCTCAC1901 TGGMCGTGAA ATGAGCGTYA AAATCGGTGA CGTGATTTCC GCTGCTTTYT1951 CTGTTTTTTC AGGCRTTCCR CAAGGAAGCC ATCTGGGCCC TCTGATCTTC2001 CTCCTCTACA TGAACGACGT GCATCATCTg YTTAGGCTGT CACAAACTGT2051 CGTATGCGGA TGAYATCAAR CTGTTCRYCG TTGTCGAGAA TGATACCGAC2101 TGCCAGWTTC TTCAGGAGCA GCTCRACCKG TTCGCCAACT GGTGCTCCGA2151 MAACAGGATG GTTCTGAACG CTTCCAAGTG CTCGGTYATC TCTTTCACAC2201 GCAAGCGCAA YACMATKTCY TTYSACTACA CACTTTCAAA CACCACCATA2251 CCYAGGACCT CYTGTGTGAA AGATTTAGGT GTGATGCTGG ATAGCAAAAT2301 GACGTTTRCT GACCACATYA CGTATACAGT CTCCAAGGCT TCCAAAACTC2351 TTGGCTTCAT CTTYAGRATA GCTAAAAACT TCCGGGATTT AGGCTGTCTC2401 AAAGCTCTTT ATTGTTCGTT GGTYCGCTCT ACTTTRGAGT AYTGTTGTAY2451 TGTTTGGGCT CCCTTCTACC AGAACGCGAT TCAACGCGTG GAGTCGGTSC2501 ARCGGAAGTT CGTTAAGTAC GCGCAACGTC ACATTATCTG GCCTGATCCC2551 GCCAATCCRC CGAGTTACSC AGAGCGCTGT AAAATGCTTA ATCTCGAACT2601 TCTTACAGTA AGRCGTGACG TTKCCAAGGC GACYTTCGTT GCAGATCTCC2651 TTCGWTCGTC CATCGATTGT CCTGCCGTTT TGCAAMTGGT MAACATAAAC2701 ACTCGCCCTC GCGTACTCCG CAATCACTCR TTCTTGACTG TCCRCAGGGC2751 TCTCACAAAC TAYGGGCAGA ACGAACCGGT TTCWAGTATG TGTCGTGTTT2801 TTAACTTGTG CTCAGATCTG TTTGACTTTG ACATCTCCCG TGACACAATC2851 AAAAAACGAT TCCTTAATCA CCTGAAATCC CATCCCTAAc CtgAcgATAC2901 ACACGTAGAT TTTAGAACTG TGATATTTWT GTTAATTTAT TGAGTTAGTT2951 TTAAGAGTGA ACCCGTCTTG TATCATTTGA GTTTTGTGTA CTTGTTGATG3001 CGATAAGATG AGGTGGTTTT GTGCCTTTTT GAGAAAGTGT CTTKAAYRAT3051 ACCAGACACA GCTCAAGGGG GCTTTTGTCC ACCTCCAATA AAGAAAAAAT3101 AACaAAAATA AATAAA

//

E.2.1.2 I

LOCUS Cp_I_Ele1 3837 bp
DEFINITION Cp_I_Ele1, 3837 bases, 16BF checksum.
ORIGIN

1 TTTTTTTTTT GTATTTATTT AGGACCTTTT AACTATGAGT CTTTCGGGTC51 CTGTTWATGA ATCTTTACAC TTTTCCATAC ATGTCAGTAG CCTTTAAAAA

101 TTTCAACACA TTTTTCGCCA TATCTCGATC ATCAGCCAAG GCTTCCCTGA151 CGTTGGATGG TACTCCAGCT CGGCGTCTCT GCTCTTCAAA CTCAGGGCAT201 TCCGCTAAGA TGTGTTTCAC CGTCAGCGCC AAATTGCACC TTGTGCAACG251 CGGCGCACTG TCCTTCTCCA GCAGATACTG ATGGGTTAGC AAGGTGTGTC301 CTATCCTGAG TCTTGTCAGA ATGACGTCTT CCTTCCTCGA TCCAACAAAT351 ACATCTCGAT AAGGTAGTAC CGAGTTCTTC ACCTCTCGTA GTTTGTTGCC401 CACCTTTCTG CTCCATTCCG CATTCCAGCG CCAAACAGTC CTCTTCTTGA451 TCACCGTCCT GAACTCCTTG AATTCTACGC TTCTGTCCCA GATGTTTCTG501 TCGTTCAATG ACTGCTTCGC TTCTTCATCT GCCTTCTCGT TGCCTGCTAT551 TCCTACATGA CTTTTCACCC ACATGTAGAT TATCTCCGTG CCATTACATT601 CTGCCTGATT GTGTAGGATG TTGATCTCRT CCTTCCATCT GCATTTGGTT651 TTTCGTTTAC CCAAGGCCGT GATGGCACTC AAGGAATCTG TACAGACGAG701 GTAGGTCCCC ACGCGATTCT GACCAATTAT CCACCTGAGC GCTTCAGTGA751 TCGCACCACA CTCCGCAGCA AAGATACTGC TCAGATCACT GATTCTTCTT801 CGAACAACCA AGTCCCCTCT AACCATCGCG TAACCAACCC TGCCATCTTT851 CTTTGATCCA TCGGTGAAAA TTGTTTCACA CAAGCGATAT TCCGTGTTCC901 TTCGGGACAC GAAGAGCTCC TTCAGTTGTG TTGAAGTTGC TCCGGCTCGT951 TCAGCTTCGA GTAATGTTTT GTCTATCCGG ATCCGTCTGC GTTCCCAGGG1001 GGGACACAAT GGTAGTGTGA ATATCTTCAG CTCGGGTAAC GGTAGTTCCA1051 GCTCTTCAAG TATCGCCTTT CCTCTTGTCT CAGCAGTTTC CACTCCACGT1101 AGCGGACCTC GATGCCTAGT CGAATTCCAC TCCTCTCCTG AACTGTTGCT1151 ATCGTAGCTG CTCTCGGTAC TGCTTCCTGC TGATTGTGAG TCGTCTTCTG1201 CTGGCTGGGT TMTGCTCTGT GCTGATATGG CTGCTTTTCT GGCGGCGTAG1251 ATCGCTGTCC GCTGTTCAAA GAACACTCTC AGGCTCGGGA TTCCTGTTTC1301 AGCTTGGAGA CTATCTACAG GGCTGGTACG GAACGCACCA CAGATAGCTC1351 TCAAGCCTGT GTTGTGTGTT GGCTCGAGGA TCTTCAGGAC GTTGTCACTG1401 ACGGCAGCTG TTATAGGTGC TGCGTACAGR ATTTTCTCCA ACACTGTGGC1451 TCTGTACAGC TTGATTAGAG TCTTCCTATC GCCACCCCAT GACCTACATG1501 CTACACAACG GATWAGCTGG ACTCGTTGTC GGCAAGCCGC TTTCACTTCC1551 TCGCAGTGTG TCTTGAACAG TAAATGTTGA TCCAGGATAA CGCCCAAACA1601 TCGGTGCTGC TTTTTGGTTG GGATCAGGGT CCCATTCAGC TCCAGTGCAG1651 CTCTGCTCGG TGGTTTCCTC GTTCCGTAGC TTCTAAAGAT GACMGTAGCG1701 CTTTTCTCCG CAGAAATTTT AAATCCCGTT GAGCTCTGCC AGCATTCCAC1751 CGCCTTCAGT GCAGCTTGCA AGTCGTTCTC AACTTCTTCA ACATYTCGTC1801 
CACTGGCCAG CAGTACAACA TCATCTGCGT ACAAGAGTGT TGTTATGCTG1851 GGCGGCATCC GTGCTACCAA KGTGTTGATA GCCACCAAGA ACAAGGTGAC1901 GCTCAGTACT GATCCTTGGC AGAGGCCTGT TTCCATGATC TTGCTYTGCG1951 ACAGCTGGCC GTTCACAAAA ACTCTAAATG AGCGATTTTC CAGCATTCGG2001 TCCAAGAATT TCAACATCAG ACCTTCTATG TTCCAGTCAC GCAGTTGGTT2051 CAGGACCAGT CTTCTCCAGG TGGTATCGTA CGCCTTGGTM ACATCCAGAA2101 AAATCCCCTG AACGTACTCC TTCTTGTTCC AGGCTGCTCG TACCACTTTC2151 TCGAGCTCAG CCAAGTGGTC AACGGTAGTT TTTCCTTTCC GGAAGGCAAA

2201 CTGGTGAGGA TGCAGTAATC TTCTTGTCTC GATGATGTGA ACTAGACGAT2251 TGTTGACCAT TCGCTCAAAC ACCTTTCCCA GGCAACTGTT YAAGAAGATT2301 GGCCTATAGT TACGAGGATT TGCTTTTTCA CCACAATTTT TGAAGATAGG2351 GATCACCAGC GATTCCGTCC ATTCGGGTGG GTATACTGAA CCGAGCCACA2401 GCCGGTTGTA CGCGTCAAGA AGCTGTCGTT TGCAATCCAA CGGTAACTTT2451 ACGATCATGG AATAGTGGAC AGTGTCAGGG CCTGGTGATG AGCCACGCAA2501 ACCTGCCGTT GCCTCTTCAA ATTCAGTGAA TAAAAAGGGC TTGTTGTACT2551 CAGCACTCGT GTCAGCTGGR ATTGACAGAG GAGTTTGCTC AACCTGATCC2601 TTGAATGCGC GAAACTCGGG GCTGTAAGCA TCGTTACTAG AGATGGAAGC2651 GAAAGARCTA GCCAGCGCCT CAGCAATATC TTCATCTTTA GACACAGTAC2701 CTCCTTCGGC TGTTATTGCA CTGATACGGT TCGTTTTGCR CTTGCCTTGG2751 ATCCGGCGGA AGTTCTCCCA CACTTCTTTA ACAGGTGTTT GCACTGTGAA2801 GTCGTTGACG AAGTTGATCC ACGAGGTCCG TTGTGCTTCC TGGATGACTT2851 TGCGTGCATG MGAGCGGGCT GCTCTGAATT CTGCTGCAAG AGCCTCTTGA2901 TTCTCCAAGT TGTTCCGTTT GGCTGATTGC ARTGCACGCA AGGCTTTCTT2951 CCTGCTCTTG ACAGCGGAGG CCACCTCTTG GTTCCACCAA GGCACCGCCC3001 TCTTGTTCAC CATCCCCGTC GTTTTCGGTA TAGTTTGTTC CGCTGCTTCA3051 AGTATTTTCT GGGTAATGCT GGAGATTTGT TGTGTAGCTG TAGGCATKAG3101 GGGGAACTGC ACTGTTGTCT GGAATGTCTC CCAGTCTGCT TCCTCGATCT3151 TCCATTTGGG TCGCGTTCGG ATGGCATGGG CCGTGTCAGG GAGAGTTAGC3201 AATARTGGTA GATGGTCACT ACCGTAGGMG TCTTGTAGTA CAGAAAAATC3251 TAGTTGGTCT ATTATTTCTG TTGAGCAGCA TGCCACGTCA ATACTCGTGA3301 GAGTTCCGGT CGCAACACTA ATGTGGGTTG GATCTGTTTT ATTCAGGACC3351 ACCAGGTTGC ATTCGTTCAA TACTTCCTCG AACATTAGTC CTCGGGCGCT3401 GAACGTTTCC GATCCCCAYA GTGGGTGGTG GGCGTTTATA TCTCCCACGA3451 GTAGGCGTGG CTGGGGGACT TGGTTTATCA GGTGTACAAT WTCCGAGGCT3501 TGTATGGCTT GCCCCGGTGG CAGATATRTA TTGATAATGG TCAGRTTAAA3551 TGGGGGACCT ACCTTAACAG CGATGGCTTC CAGGTTGGTG TCCAGGTCGA3601 TTTCTTCACT GTCCAGTTCA GGTTTCACKC CAACCAAAAC TCCTCCGGCT3651 GCACGGTCGC CACCARCTCT CGGTCGGTAG TATATGCTGT AGCCGTTGAT3701 GTTCGCCTGG TCCTTAGATG AGCACATGGT TTCTTGGAGG CATAGTACAG3751 ATGGGTTGAT TTTGCTGCAT AGTATTTTCA AATTTGGTTT ACTGGTTCTA3801 AGTCCTTGGG TGTTCCATGA AATAATGTTC GAGTAGT

//

E.2.1.3 Jockey

LOCUS Cp_Jockey_Ele4 4487 bp
DEFINITION Cp_Jockey_Ele4, 4487 bases, 6C9 checksum.
ORIGIN

1 TTTTTTTTTT TAATTTATAT TTATTCAAAT TTCTTTTCCA TGTACATTCA51 TTCAGTTAAA ATATTATTGA GTGTCCAATC ACAAACGATG ACTTTTCACC

101 TCAATTTTAA ATACTAGCAA CTTTCATTTA TTCATGAAAT ATTGTAGCTT151 TCGCTATTCA GTGATTTCAA ATGTAGGAGG TCCTACATGT ACAAAAGGGA201 AAAGGGATAC CTTAAAACTA ACTTATAAAC TATATAAAGA GCGGATCAAT251 GCAGCTGAAG ACTGCAATGA TTTTTGTCGA AATGCATCAA TTATCTTATT301 GGACATAACA TCCAAAGTGT CAACTTCGGC TAATTGATGA AGTTCACTGG351 TGCTGAACCA GGGAGGAAGT TTCAGAATCA TTTTCAGAAT TTTGTTCTGA401 ATCCTCTGAA GTTTTTTCTT CCTGGTTAAG CAACAGCTTG TCCAGATCGG451 CACAGCATAA AGCATGGCAG GTCTGAAAAT TTGTTTATAA ATTAACAGTT501 TATTCTTGAG ACAAAGTCTA GAATTCCTGT TTATAAGTGG ATACAAACAT551 TTAATATATT TGTTACATTT AACCTGKATA CTTTCAATGT GATCCTTGTA601 AGTAAGGTTT TTGTMWtAAA CCAAGTCCAA GATATTTCAC TTGatCCtCC651 CACTTTAAaT TTACMTCATT CATCTTTATA ATGTGATGAC TTTTTGGTTT701 AAGAAAATCA GCCCTTGGTT TGTGAGGGAA AATAATAAGT TGAGTTTTTG751 CAGCATTTGG AGTAATTTTC CATTCTTTCA AATAAGAATT GAAAATATCC801 AAGCTTTTTT GTAATCTTCT TGTGATGACA CGAAGGCTTC TGCCTTTGGC851 GGAGATGCTT GTGTCATCAG CAAAAAGTGA TTTCTGACAT CCTGGGGGCA901 AATCAGGCAA GTCAGAAGTA AAAATATTGT ATAAAATTGG ACCCAAAATG951 CTTCCTTGAG GGAYGCCAGC ACGTACAGGT AGTTGATCAG ATTTGCTATT1001 CTGATAACAT ACCTGCAGAG TACGATCCGT CAAATAATTT TGAATAATTT1051 TCACGATATA AATCGGAAAA TTAAACCTTT TCAATTTCGC AATCAAACCT1101 TTATGCCAAA CACTGTCAAA TGCTTTTTCT ATGTCTAGAA GAGCAGCGCC1151 AGTAGAATAG CCCTCAGATT TGTTGCTTCG AATTAAATTT GAAACTCTCA1201 ACAACTGATG AGTAGTTGAA TGCCCAAGGC GAAATCCAAA CTGCTCATCA1251 GCGAAAATTG AATTTTCATT AATGTGCGTC ATCATTCTAT TAAGAATTAT1301 TCTTTCGAAT AATTTACTAA TAGATGAAAG CAAACTAATG GGCCGATAGC1351 TTGAGGCTTC AGCAGGATTT TTATCCGGTT TCAAAATCGG AATTACTTTG1401 GCATTTTTCC AACTACTGGG AAAATATGCC AAATCAAAAC ATTTGTTGAA1451 AATTTTGACC AAGCWACTTA AAGTTGCTTC TGGTAATTTT TTAATTAAAA1501 TGTAAAAAAT GCCATCCTCA CCAGGGGCTT TCATATTTTT AAATTTTTTG1551 ATAATAGATT TTATTTCATT CAGATCCGTA TTAAAAACTT CATCTGATGA1601 AAATTCTTGT TCAACAATAT TCTGAAATTC TATTGAAATT TGATTTTCAA1651 TAGGACTCAA AACATTCAAG TTGAAATTAT GAGCACTCTC AAACTGCTGA1701 GCAAGTTTTT GAGCTTTTTC CCCATTAGTT AATAGAATAT TATCACCATC1751 TTTTAAAGAA GGGATGGGTT TTTGAGGTTT CTTAAGAACC TTTGAAAGTT1801 
TCCAAAAAGG TTTGGAATAA GGTTTAATTT GTTCGACATC TCTTGCGAAC1851 TTTTCATTTC GCAGGAGAGT GAATCTGTGG TCAATAACCT TTTGCAAATC1901 TTTTTGAATT CGCTTCAGTG CAGGATCACG AGAACGTTGA TACTGTCTTC1951 GGCGAACATT TTTCAGACGA ATCAGAAGCT GAAGATCGTC ATCAATAATG2001 GGAGAATCAA ATTTGACTTG GACTTTAGGA ATAGCAATAT TCCTAGCATC2051 CAAAATTGCA TTAGTTAAAG ATTCCAAGGC TGAATCAATA TCAGCTTTGG2101 TTTCTAAAAC AAAATCATGA TTTAAATTAT TCTCAATATG ATGCTGATAC2151 CTGTCCCAAT TAGCTTTGTG GTAATTAAAC ACAGAACTAT TGGGTCTGGT

2201 AACTGCTTCA TGAGAAAGTG AAAAAGTTAC TGGAAGATGA TCAGAATCAA2251 AATCAGCATG AGTCACTAAA GGACCACAAT ACTGACTTTG ATTTGTCAAM2301 ACCAAATCAA TTGTTGATGG ATTTCTAACA GAAGAAAAGC AAGTTGGCCC2351 ATTCGGGTAT AAAACCGAAT AAAGACCAGA AGTGCAATCT CTGAACAGAA2401 TTTTACCATT GGAATTTACT TTTGAATTAT TCCAAGATTG GTGTTTGGCA2451 TTAAAATCAC CGATGATCAA AAATCGAGAT CTATGCCGAG TAAGTTTATT2501 CAAATCCCCT TTGAAATAAT TTTTATTTTC CCCAGTGCAT TGGAAWGGCA2551 AATATGCAGC TGCAATCATA ATTTTCCCAA AAGAAGTTTC AAGTTCAATG2601 CCCAAACTTT CAATAACTTT TAACTTAAAG TCACGTAACG TGCTATAAGT2651 CATACTACGG TGGATAACTA TTGCAACTCC ACCGCCATTT CGATTCATTC2701 GGTTATTAGT TATAACTTTA TAATCTGGAT CACTTTTCAA ATAAGTGCCA2751 GTTTTTAAAA ATGTTTCGGT TATAACAGCA ACATGCACGT TATGAACTCG2801 TAAAAAGTTG AAAAATTCAT TTTCTTTCGC TTTTAAAGAG CGAGCATTAA2851 AATTCATAAT ATTGATGGAA TTACTTAGAT CCATGATTAA ACTTCAGGGT2901 AAGAACAACA TCATTCGCAA ATTTTAATCC AATCTGGATT GCTTCCATCA2951 TGGATGTAGC ATTACTCATT GTTTGAATCA AACCAAACAG TGAGTTTTGC3001 AAAAAAGTCA TTTTTTCAAA CGTAACATCG CCGAGATCAG AAGATCCCAA3051 AGCGTTGCCA GCAGAAAAAT TTTCAAATGA GATTTGAGGT ACCTGCCCAA3101 TTTCAGAAAG ATTGGTAGAG GATTTAAAAT TCGTGGATGA ACCCGAAACG3151 ACGTTAGCAT AAGAAATGCC ATTGTTGTTA CCTAACTTTT CCACGGTAGG3201 GGTATTTCTA GAATTGTTCG AGTGAGACAG CACGAACGTT TGATTTAAAG3251 ATGCAGGTAC AACCTGACTT TGAGAAAATT TCGGTTTGGA TTTCGGCTGA3301 TGCTTAGCAC GAGAATCCAA AACCTTTTTT CTGATGGGGC AATCCCAGAA3351 ATTTGATTTG TGATTTCCAC CACAATTTGC ACATTTAAAT TGGGTGACTT3401 CTTTCACGGG ACAATTGTCC TTGTCGTGAG AAGAATCCCC GCAAACCATG3451 CATTTTGGAA CCATGGCGCA ATGATCAGTA CCGTGACCGA ATGCCTGGCA3501 ACGCCGGCAC TGGGTCAGAT TCTGGCCATT ACCGCCATGT TTCTTAAAAT3551 GCTCCCACTT TACCCGTACA TGGAACAAAA ACTGAACTTT GTCCAAAAGT3601 TTCAAATTGT TGATTTCATT TCTGTTGAAA TGAATCAGAT AAAATTGTGA3651 AGTCAAACCA AAGCGAGAAA TATTCCCGTT TGATTTTTTC TTCATTGGTA3701 TTACTTGGGA TGGGGCAAAG CCAAGCAACA CCTTAAGTTC GTTTTTGATC3751 TCATCCACCG ACAAGTCGTT GGAGAGACCT TTCAAGACCG CCTTGAATGG3801 CCGAGCATTC TTGGTCTCAT ACGTGTAGAA ATTGTGTTTG TGGTTTTTCA3851 AATAACCAAC AAAAGTTTGG TGATCTTGTA AAGATTCCGT 
CAACAAGCGA3901 CATTCTCCTC TTCGACCAAG CTGGAACGAA ACYTTCAAAT TGCAAGTTTC3951 CTTGCAATTC TTCAGTTGCG TTCGAAAGCT GGCCAAATCG GAGACGGAAG4001 TCACWACAAT TGGCGGAGCC TTTACTCGTT TCTCGACGGC AGAAGGCTCA4051 GTACGAGGAG AAGGTTCCTT GTCATCAGTT TCGGATAAAA CACCGAAACT4101 GTTTGTCAAT GGAATTGGAG GATTGACCTC ACATTCAGAA TCAGAATCAG4151 ACCTCAGAAG AGGCTGTTTT CTTTTTCTGT TTCCGTTAGC CGGTTTAGCG4201 TTCAAACGCT TCACCGACGT AACGACCAAA TCTTCAGAAG ATTTGCGTTT4251 ACCTTTGTTT TGACGCATTT TGCAAGCAAA GTTCTCTTAA AAAGATGGCT4301 TCGTTTGTAA AATACAACAA AATTTCAGGT GGGGTTAGTC TTGAAAAGAC4351 TGTTTAGAAT TTTGGAAAAT AACTCAGGTA GTCTTTAAAA AGACTGTGAT4401 TTTGTTTTGA AATAACTCTG AGCTTAGGTA GTAAAAAATA CCGCAGCTCT4451 AGTGTCCGTT CACCACGAAG GTTCGCAAGA CACTGAT

//

E.2.1.4 L1

LOCUS Cp_L1_Ele39 3228 bp
DEFINITION Cp_L1_Ele39, 3228 bases, D32 checksum.
ORIGIN

1 TACAACGTAG CCACCATCAA CACTAACGCA ATATCCAACG AAAACAAGTT51 AAACGCACTA CGGACACTTG TCCGACTACT CGRCCTTGAC GTAGTGTTGT

101 TGCAAGAGGT CGAGAGCAAC CAATTTTCGA TCCCTGGCTT CAACACCTAC151 ACAAATGTAA ACGAAACTAA AAGAGGAACA GCAATTGCCG TGAAACAACA201 CATGCTGGTG AGCAATGTTC AGCGTAGCCT GGACAGTAGA ATACTCACAC251 TCAAGGTCAA CAACTGCGTT ACGATCTGCA ATGTCTATGC GCCTTCTGGA301 GTCCAAAGCT ATCAGTCACG AGAGAGTATG TTCAACCAAT CTTTGCCTTT351 CTATCTCCAA AACGCGGGGG AGTATGTACT TGTTGGGGGT GACTTCAACT401 GTGTCGTGTC AGCCAGGGAT GCCACGGGTA CAAACAGTCA AAGCATCGCG451 CTGAGAGTCC TTGTACAGAA CATGAACCTG AAGGACACTT GGCAGATTAT501 GAATGGAACT CGGACGGAGT TCAGCTTCAT TCGAGCAAAC TCAATGTCTC551 GTTTGGACAG AATTTACGTG TCGTCAAACA TCTGCTCACA GGTGCGCACA601 ACGTCCTTCC ATGTGAACTC CTTCTCGGAC CACAAAGTCT ACAAAACAAG651 AGTCTGTTTA CCAGATCTTG GAAGGGCAGC TGGCAATGGC TACTGGTCTA701 TGCGAACGCA CACACTCACT GACGAGAACA TAATCGAGTT TGAGCATAAG751 TGGAACTGGT GGACAAGACA AAGACGGGAC TATAACAGCT GGATGAGCTG801 GTGGCTTGAG TACGCAAAAC CGCGCATCAA GACTTTCTTC AAGAGAAAAA851 CCAACGAAGC ATTCCGTGCA TTCAACGCGG AAAATGAGTA CCTGTACGCT901 CAACTGAGGG AGGCATATGA CTGTTTGTAC CTGAACCCGA ATGCTCTTGC951 CGATGTAAAC CACATTAAAG GGAGGATGCT GCGACTGCAA CGTGACTTCT1001 CTTCGAGCTA CCAGCGTCTT AATGATCCTG TCGTTGCCGG GGAGCACATC1051 TCGTCCTTTC AACTTGGAGC CAGGATGAAA AGGAAGAAAA ATTCGTTCAT1101 CTCTAAAATC ACCGACGGAG TCCAGACGCA GCCTTTGGAT GCAGCAGAAA1151 TAGAAGCACA CATTCACCAG TATTTCCAGT CCCTGTACTC TGCTGGAGAC1201 GTAGCTGATC CTGACGGTGC AACAACCAAC CGGGCTATTC CATCTGACTC1251 GGTGCCGAAC GCGCAGGTAA TGGAGGAAAT AACAACTGAG CAGCTGTATA1301 ACATCATCAA AACGAGTGCA TCGCGCAAAG CTCCGGGGAA CGATGGAATA1351 CCCAAAGAGT TTTACGTGCG GACCTTCCAC GTAATACACA GACAACTAAA1401 TCTGGTGATC AATGAAGCGC TGAACGGGAA CATCCCCCAG AAGCTGGTTG1451 AAGGTGTTGT AGTATTGTGC CACAAAAAAG GTGGCAACAA TACTATCAAA1501 TCCTACCGAC CTCTCACGAT GCTCAATTTT GACTACAAAA TCCTAAGCCG1551 AATACTCAAA ACCCGAATTG AGGAGATCAT GGTCCGGCAC GACATCCTCA1601 CACCCTCTCA AAAGTGCTCA AACGGCAAAA GGAATATATT TGAAGCTCTT1651 CTTGCCGTCA AAGATCGAAT TGCCCAGATC AAGCACACAC ACATACAAGG1701 AACGCTCGTA TCATTTGATC TTGATCACGC ATTCGATCGA GTTGAACACT1751 CTTATCTGTT TCGGGTTATG GACGATATGG GCTTCAACAG GGCACTTATA1801 
CAGCTGCTGC GCACTATCAT GGACCACTCA CGCTCTCGTG TGCTAGTAAA1851 CGGGCATTTG TCTCCAGAGT TCGAGATACG GCGCTCGGTT CGGCAAGGGG1901 ATCCGATGAG CATGCATCTC TTCGTTCTCC ATCTGCACCC GCTGCTGGAG1951 AAAATACGCA CACTCTGCAA CGACCAGCTA GACCTCTCCA CCGCATATGC2001 CGACGATATA TCTGTTATCG TGGTTGATAA CACGAAGTTA CCAACACTCA2051 AACAACTCTT CTTCGACTTT GGACGGTATT CTGGGGCCGT CCTCAACCTC2101 GAGAAAACAG TTGCAATGAA CATAGGAAGA AGCAGCGAAA ACCTACCCTG2151 GCCGTCGATG GAAACGCGTG TGAAGATCTT GGGAATCAAT TTCTTCAATG

2201 ACCACAAGCA GATGATACAG TTCAACTGGG ACGAAGTGAT CCGAAAAACT2251 ACGCAGCTAA TGTGGATGTA TAAAGCGCGA AACCTTACGT TGATCCAAAA2301 GGTTACCGTG CTGAACATGT TTGTGACCTC AAAACTGTGG TTCGTGGCAT2351 CTGTGTTGAG CATACGCAAT CAAGACATAG CAAGAATCAC AAGACAACTT2401 GGGTTCTTCC TATGGGGTCG CCAGCTGAGA GTTCCAATGG AGCAAATTTG2451 TCAACCTATT GCAAAGGGAG GGCTGAATCT GCATCTTCCC ATGCACAAAT2501 GCAGAGCACT ACTGGTCAAC CGATTCCTGT GTACGATTGC CGAAACTCCC2551 TTCGCCGAGC ACCTGTACGG CCTGGTTAAC AATGGAGGAT CCCTACCAGC2601 AACATACCCT TGTCTACGGC CGACGTGGAC CACTATTCGA GAACTTCCCC2651 AGCAGCTACG AGACAACCCG TGTTCGAGCA GCATCGAAAG TCATCTTCTG2701 CAAGCTTTGC CAACCCCGAA GGTAGTGGTG AACAACCCCA GAGCATCGTG2751 GAGAAGCGTG TGGCGAAACG TACGAGCAAG GAGTCTCACG TCGTTGGAAA2801 AATCCACGTT CTATCTGCTT GTGAATGGCA AGCTGCCTCA CGCGGCCCTG2851 TTGTTCCGGC AACATCGGAT CAGTAGTGCT TTTTGCATTC ATTGTCCGAA2901 CGAAACAGAA GATCTAGAAC ATAAGCTAAG CAAGTGCCGT AAAATAAGCC2951 ATTTGTGGAA CCACCTTAAA CCAAAATTAG AATCCATTTT GGACCGAAGA3001 GTAGAGTTCA AAAACTTTCA AATCCCTGAA TTCAGGGCAA TAAGAATGGC3051 AAATGTAGAG AGATGTCTAA AATTGTTTAT CAACTACGTA AACTTTATTT3101 TAGATACAAA AAATGATTTT ACGACTCAAG CACTTGATTT TTTACTAAAT3151 TGTAACTGCC CATAATATGT ATCTCTGCAA ACTGTAACAA AACTAAATAA3201 ACGTGTTAAA AAAAAAAAAA AAAAAAAA

//

E.2.1.5 L2

LOCUS Cp_L2_Ele4 2824 bp
DEFINITION Cp_L2_Ele4, 2824 bases, 358 checksum.
ORIGIN

1 TCTAAAAACT AACTATTATT AAAATGCTCT AAACATGCTT TCTTAAATCC51 TATACTCGAA TTTAACAATT GTAGATTAGG CGGCAATTGA TTCCATTCTG

101 CAATACCTCT AACAAAAAAA GAATTGGCAT AGTGAGATGA TCTAAATCTT151 GGCAAGATGA ATTTTCTTCC TCTTCGGCTT CTCATAGGAG TTATTTTTGA201 GGTGATGTAT GGAGGGGACT GATTGACAAT AAATTTATGT AAGAGTAGAA251 CTGATCTAAG CACACCAAAA CGAGAAAACG GACAACCAAT TAGGATATTC301 TGTCTGTGAG TCACGCTTGC AAAACGATTT AAACCAAACA CAAAACGTAC351 ACAACAGTTT AAAGCAACTT TTAATTTATT ATAAGCAAAA GTGGACGATT401 GTTTCAACAC AAAGTCACAG GTTATAAAGT GCGGTAAAAT TAAAGCTTTA451 AATAATTTTA GTTTAATCTT AAAGTCGAGA TGATTTGCTT TGAGACGAAG501 GGATCTTAAA GCAGCATAAA CTTTTCCACA TTGCATCAAA ACAAATTCGT551 CCCACTCGAA TTTACTGGTA ATTTTCACTC CTAAGCTGAT AACTGTATCG601 TTATAATGAA CTAATTGATT ATTTAAAAAT ATGTCGGGTT TAAACACAGG651 TCTTTTTGAT CTTGAAATAC AAATAGCTTG CGTTTTTGAT GCATTAAGTT701 TTAGATTGTT TGAAATAGCC CAATTGGCAA CAATGGTCAT ATTTGTATTT751 ATCATGTTAG ATATTTCAAG TTGTGGTTTA TTGGTACAGT CAAAATATAT801 CTGTACGTCA TCTGCGAACA AGTGAACACC ACATCCTTGT AGGTGTGACG851 GTAGATCATT AATGTATAAT GAGAAAAGTA ATGGACCAAG AACGGATCCT901 TGTGGCACTC CTGAAAGAAC CGGAAGCAAC TGAGAAAATA TTCCGTCGTT951 AAAAACAGTT TGAAATCTAT TTGACAAATA AGATCGTACT AAACGTACTG1001 CATCAGGACA AAAACCAAAA ATCATTTTTA ATTTATCACA CAAAGTTTTA1051 TGACAAATAG TATCRAATGC TTTAGAATAG TCTAAAAGAA TAAGCACAAC1101 GTATCCACCT GAGTCTATCA CTAAACCAAT ATCATCACAT ATCTTTAGCA1151 TTGCTGTTTT AGTACTATGT CTTGGGCGAT ATCCCGACTG GCAAGGTGTT1201 AATAAATCAA AACGAGAAAC AAAATCCGTA ATTTGTTTTT TAATAACTCT1251 TTCAAAAGCT TTAGAAAGCG CAGATAAAAT GCTAATTGGA CGTAAATTGT1301 CCAAACTAAT TTTAGGACCT TTCTTTTTAA TAGGTACAAC TTTCGATATT1351 TTCCAAACTT TGGGGTAAAT ACAAGTTTTA ATTATTTTAT TAAAAATATG1401 TCGAATGGGT TCAACTATGA GGGGAATTAT AGCTTTGCAA AATTTAATTG1451 GAATTTCGTC CAGACCAACT GCATTAGATT TTATTTCATA TATTGCATTA1501 ACAACATGAT ATGATTCAAT AGGTTTAAAA GCAAAAGCCT CAGGTACAGG1551 TGCGATAGCA TTCAATGTTG TGGATGAACT GATCCCAGTT GAAAAATTAC1601 TTGCAAAGAA ATGATTGATT TGATCTGCAT TTAGGTTGTT TTTTTCGTTT1651 TTATTTTTTT TCTTAAATCC AATAGCATTG AGTTTATTCC ATAATTCTTT1701 CGAGTTCAAA TTTTGAGCAA ACATTGTGCT GAAATAATTT TTCTTGGCAT1751 TGTTGATCAT ATTTGTTGCC TTATTTCTAA GCCTTTTATA AAGCGTGAGG1801 
TCATTATCAT GTTTAAATTG TTTCCATTTA GAAAAAGCTA AATCTCTATC1851 AACAATTACT TTAGACAAGT TTGCATTAAA CCAAGGTTTA TTATGGGGTT1901 TTTGGATAAA ATTCCGTAAT GGAACGCAGT TATCATGCAG ATTTTTAATG1951 TTTGTATTGA AAAAATCTAC TAAGATATCT GGATTGTCAA GGCTAAAAAA2001 ATAATTCCAA TCGATCTGAT CAAACATTGT TAAAAGCAGA GACGAATTCA2051 TGTGACCATA ATCACGAAAA CAAATATTTT CATTAACATT CGATACATCA2101 ACATCAACAG ACGCAAAAAT TAAATCATGA TTGGACATAA ACGGTACAGA2151 CACTTGGTTA AAACGCAAAA TAAATTCTTT TTTGTTAGTT AATAGAAGAT


2201 CTATCAAAGA TGCTCCTGTT CTGTGAAAAT GTGTAGGAAT TTCTCCAACA2251 GGTTTCAGAG ACAAACTTTG TAAACACTTT AAAAAACGTC TTGTGTTAGA2301 AGAGTTCAGG TCAAGAAGAT TGGTATTGAA ATCGCCCATG ATAATCATGT2351 GTTCATGCTG TAAATTACTA CGGGAAAGCA AGTCGATCAT CAACTGAGAG2401 CAATCGGAAT TTGGAGTATT GTAGATAAAT CCCATGAGTA AACTTTCATG2451 ATTTACAGAA ACTTCGACGA AAATAAATTC AGTTGGTGCA TTGTCAATAT2501 CATGTGATGA TATGATGACT TTACAATTTA AAAAGTTTTT AACATACAAC2551 AAGATACCAC CTCCAATTCT TCCTTTACGA TCACATCTTA TAATTTTATA2601 ACCATCAATT TCAAGTACGT CGTTTGGTAT TTCCATTTTC AACCATGACT2651 CACAAACACA CATGACGTCG ACTTTGGAAA TGTATGCAAT ATTGCGAAGT2701 TCATCCAATT TAGTTAATTT TCGAGCACAT ATACTTTGGC TATTCATACA2751 GCATACAGAA AGTTTGTTGT GAATGAGAGC AGATTTCATC ACAATACCTG2801 GAATACTCAA ATTGTTAGAG TTAT

//


E.2.1.6 LOA

LOCUS Cp_LOA_Ele7 3751 bp
DEFINITION Cp_LOA_Ele7, 3751 bases, 9C checksum.
ORIGIN

1 TTTTTTTTTT TTTTTTTTTT GTTGAGCCTT ATGACCACTG CGACCAATTA
51 GAGGAATATT GTGGTGTGAC TCTTTTTTTT TGTAGCAAAT CAGGTTAACC

101 CAACACCATG GAACGATGAG AGGCAGCTCC TGTTTACTGC ATGTTGTCCC151 AGTTAGGGTC GACTTGTTTT ATAAAGTCTA TCACCCTACT AGGACGGAAA201 TGGTTGATAT TGGAGGGATG TAAGATCCCT TTCCCAAAAA ATAATATTCT251 AGTTTGAATC AGGGCACCAC AATTACACAA AAGGTGCTCT GATGTTTCAT301 TCTCAATATT ACAAAATCTA CAATTTGCGT TTGGAATTTT CCTCATATTA351 AATAAGTGGT ATTTGCATGG GCAGTGACCA GTTAGTAAAC CAGTTATCAC401 CCTTAAATTT TTCTTATCTA GTTTTAATAG TTCTCGAGAT AGTTTCCCAC451 CTGGAAAGAT AAATCTTTTT GCTTGTCTGG AAGTATCTGT AGAATTCCAG501 ATTGAATTGA TTGTTAATTG TTCCCAACGT TTTAGCTCCA TTTTCAGAGA551 ACAATCTGGA ACACCAAAGA AAGGTTCTGG GCCAATGAAG TCAGTTTCAG601 ATCCTTTCCT TGCCAGACTG TCCGCGTTTT CGTTTCCTAA AATTCCACAA651 TGACCAGGAA CCCAGTAAAG ATTGACCGAG TTCTTTTGGG ACAATTGCTT701 TAGTAATTTT ATTCCTTCCC ATACTAGTCT TGACTGGCAG GTATACCCTT751 TCAGAGCATT TAGTGCTGCC AAGCTGTCGG AAAAGATACA AATATTAGCT801 AATCTATAAT TTCTTTTGAG GCAAATGTGC AAACATTCAA TTATTGCATA851 AACTTCTGCT TGAAAAACTG TTGGAAATTG TCCCATTGGA ACAGATGTCT901 TAATTCCTGG TCCAAACACT CCAGCTCCCG TTTTATTATT CATTCTAGAA951 CCATCCGTGT AGAACAGAAT AGAGCCTGGA CGAAGATTAG GCCCACCATT1001 ATCCCACTCG TAGCGACTTT TAGCGTCCAC TGTGAACGGA ATGTCATAGT1051 TAGCCTTTGT CGGCATCCAG TCTACTATTG AGTTTAACGC CGGATTCAAG1101 GTAAAGTTTT TAAGGACTTC TAGGTGCCCT GTCAAATCTT CATCCTGAAT1151 TCTTGTGGCT CGATTCATTA ATAAAGCACT CTTAAGGGCT TCTAATTCTA1201 TGAATTTATG AAGAGGAAGT AAATTTAACA AGGAGTTTAG TGCAGCAGAT1251 GAAGTGCTGT GCTTAGCTCC AGTTATTGAA ATGGTGGCTA ATCTTTGCAG1301 CTTAGCTAAT TTAATTTGGG CAGTTGATTC TTTAACTTTC GGCCACCACA1351 CAAGTGACGC ATACGTAATT CTTGTTCTAA CTATGCATGT ATATATCCAA1401 TAAATCATTT TAGGGCGGAG ACCCCACTTT TTTCCAAAWG TTTTTTTACT1451 CAGCCATAGA GAATTAATTC CTTTGTTTAA TACATTATTC AGGTGAGCAT1501 TCCAATTAAG TTTCCTATCT AAAGTAATAC CAAGATATTT CACTTCGTTA1551 GAATACTCAA TAGTTGTCCC ACGAATAGCA AGATTAGGTA GTACTAGTGA1601 CCTTCGTCTT GTGAACGGTA TTATAACTGT TTTAGAGGGA TTTATACCAA1651 GACCCTCATT TTTRCACCAT TTTGAGATAA AATTCAAGGC CGATTGCATT1701 CGGTTACAAA TTTCGCTACC TATTTTTCCC CGAACTATTA TACAGATGTC1751 ATCCGCATAG CTGATCAGTT CAAAGCCGAG ACTCGCCAAC TTTTTCAGAA1801 
GATCGTCGAC TACTAATGAC CATAGCAGAG GTGATAGAAC TCCACCTTGT1851 GGACATCCCC TCCGTGGTCT GATTACCAAT CGTTCTCCTC CAAGGTCGGC1901 AGATATCTCC CTTTCAATAA GCATTGTTTC AATCCATTTG ATAATCCATG1951 AACCAAAGCC CCTTGATTCC ATGGCGGAAC GCATAGAGCT GTGAGATGCA2001 TTATCAAAAG CTCCTTCAAT GTCTAGGAAA GCAGCCAGAG CTACTTCCTT2051 GGCTTTGAGG GACTTCTCAA TTTTAGAAAC TAAACAGTTT AAAGCAGTTA2101 TAGTAGATTT ACCACTTTGG TATGCAAATT GGCTATTACT WAGAGGAAAT2151 TTGATCAAGA ACGtTTTTTT TtACAAAATC ATCAAGAATT TTTTCCATAG


2201 TTTTCAATAA AATAGAAGAG AGACTAATTG GACGGAAGGA TTTGGGACTT2251 GTTTTGTCCC TTTTCCCAAC TTTTGGAATA AATATGACCC GTACCTCTCT2301 CCAAGCTTTT GGGATATAGC CAAGAGCGAT ACTGGCTCGA AAAATGTCTG2351 TTAATTTAGC ATTGATAATA TCTTTACCCT GCTGGAGAAG AGCAGGAAAG2401 ATGCCATCCA TCCCTGCGGA TTTAAATGGT TTGAAAGAGC TCACCGCCCA2451 GTCAACCTTG GAACGAGTAA ATATTTCACA GGCTTGATTA TGGGCAAGTA2501 CAGACCAATC TTTTGTCAGT CTTTCGCCCA CGGCCTGATC TATACAACTT2551 ATTACCTTGG TCGAGCCAGG AAAGTGAGAT TCCATCATTA AATCCAGAGT2601 TTCAGATGCA TTTTTTGTAA AACACCCATC TGCTTTTTTA AGATTTCCTA2651 GACCGTTTGA ATGATCTTTG GAAAGAGCTT TTTGTAATCT TGCGACTATT2701 GGAGTTTCCT CGATTTTTTC ACAAGAACGT CTCCAGTCGT TGCGTTTAGA2751 ACGTTTTATT TCTTTATTGT ATTCAGTCAG GGCTTTTTTA TATGAGTCCC2801 AATCTCCAGA TTTTTTTGCT CTGTTGAACA ACCTTCTAGA TTTTTTCCTT2851 AACGCAGCTA ATTTTTCATT CCACCAAGAT ACCTCACGAT TTGAAGATCT2901 TTGCTGGACT GGGCAACTAG CGGAGTAAGA ATTGGTWATA GCAATAATTA2951 ATTCAGACGA AGCCTTCTCC AACTGATCAA CAGTACGGAT TTCGGCTTCG3001 GAATTGAAAT CGTAGTCATT TAGAAGAAGT TTATAGGAGT CCCARTTGGT3051 TTTCTTGGGA TTYCTATAAC TTCCAACTAT CATTTCACCT GCATTGTATT3101 CGAATTCAAT TTGTTTATGG TCTGAGAGTG ATATTTCATC AGAAACCTTC3151 CAATTTATAA CTCTTTCAGA AATAACTTCG CTACAAAACG TGAGGTCAAG3201 AACCTCCTGT CTTGTTGCTG TAACAAAAGT GGGATTATCA CCATTGTTAC3251 AAATGTCAAT ATTGTTAGTT GTTATGTAAT TGAGCAAATT CTCACCTCTA3301 TTGTTAATGT TAGTGCTACC CCATACTGTG TGATGAGCAT TRGCGTCGCA3351 CCCAACAATA AATTGTTTAT TTATTTTTTT GCAGTAAGAG ATAAACATTG3401 TTACATCAGM TGGAGGGGCC TCCTCGGTGT CTCCGGGGAA GTACGCAGAT3451 GCAATACACA CCCAGGTACT TCCTCTGACC GTAGGGACCT CCATTTGAAT3501 TGCAACAATG TCTCTTGAAA TAAATTCGGT TATTGGAATA AAATTTGTAT3551 CATTTTTTAC WATCATAGCT GCTCTAGGGT TGGGAGAACT ATTATCRTAT3601 AATAGTTTAC CCATTTGAGC TGAAATTCCT TTAATTTTAC CCTTATTTAT3651 CCAGGGTTCT TGAATGAGCC CTACGTCAAT ATTATTTTTA CTAAACCTTC3701 TGCTAAGAAC GGCTGACGCT CCTTTGGCAT GATGAAGATT AATTTGAATA3751 A

//


E.2.1.7 Loner

LOCUS Cp_Loner_Ele1 5618 bp
DEFINITION Cp_Loner_Ele1, 5618 bases, 78D checksum.
ORIGIN

1 CaGTCGGTAT CTGATCGAGG TACAAGCAAG ACGTGTTTGT TTTCGCTAGA
51 GCACCGAAGC TTCTAATCGT TCAACGGATC AAGGGATCCA CCGCCGACTT

101 CTCGAAGGTT GACCTTACGA GTGCGTACGT GACACGCGTG ACAGTTCGTG151 AACGTGAATA GTGCGTGTAA GCCAATTATC GCCGGTCAAG TTCACCATCA201 GCAGCCGAAC CACAAAGgCA aCcctccAaa cCGCGAAGCG TGTTTTGGAG251 TTTTCGAAAA CGTTTACACA AGGCATTGTC TAACCGCGAG AGCTAGACAA301 AGGAGCACCA CCGAATACAA TAGCAAACGG ACGGTTTGTT TCATCCTCAT351 CAACTCGGCG TGGCAAAATC TAGAGTTAAG TAGAAAATCA TCTCCCCCTC401 CCCCCATGTC TGGGGGCGAA GGCGGCATGG AAAGCGATGG CGAAGATGAC451 GAGGCTTCCA ATGTGCGCAC AAAGATCTAC CCTAATGGCT CAACAGGGCC501 GTTTATTGTC TTCTTTCGGC CCAAATTGAA ACCCCTAAAC CTGATCAGCA551 TCACCCGAGA TCTAACGAGA AAGTTTTCTG GCGTATCCGA AATAAAACGT601 GTCCATGCTA ACAAGATCCG CGTAGTTGTG AATAACATCT CTCACGCCAA651 TGAAATTGTC ACCTGCGAGC TTTTCACGCT CGAATATCGA GTTTACATCC701 CTTCACGCAT CGTGGAGTGT GATGGTGTTG TCACAGAAGA GGGCTTAACC751 CTAGATGAGC TGTATGAGTG CCGTGGTTAC TTCAGAAATC CTGCCGTGAA801 CCCCGTGAAG ATCATTGAGG TTAAACAACT GTTCTCCTCC TCCACACAGG851 ATGGCAAAAC GGTTTACTCC CCTTCGAACT CTTTCCGAGT GACCTTTGAA901 GGATCCGCTC TGCCAAAATA CATTGAGATA GACAAGGCTC GTCTACCTGT951 TCGACTTTTT GTCCCCAAGG TAATGAACTG CCAAAAATGC AAACAACTCG1001 GCCATACCAC AGCTTACTGT TGCAACAAGG CTAGATGCAT CAAATGCGGT1051 GAAGAGCACG ATGATAGTAG CTGTACGCAA GCTGCTACCA AGTGCCTCTA1101 CTGCGACGAG GACGCCCTTC ACAAACTCTC GGATTGTCCG ACGTATAAGC1151 AGCGTCAGGA GAAACTTAAG CTTTCTCTGA AGCAACGATC GAATCGTACT1201 TTTGCGGAAA TGCTCAAACA AGCCACCGAA CCACTCAATT CTGGAAACAT1251 CTACAATATA CTGCCTTCCG ACGAGACGGT CGCCGACTCG ATCAATGCGG1301 GCGCGTCAAC GTCCGGGACG GGTAACTCGA GGAAAAGGAA CAATGGATCA1351 CCAAGCATCC GCCGAAAAGA AATAAAGCTA TCCCCACAAC AAGACAGGAT1401 CCCTAATTTT CAGCCAACTC CCCCTGGAAT CAACCCCCCC GGTTTCCCCC1451 CATTGCCAAG GCCCCCACCT CTGACCCCCA AACCAAATCC TAACAAACCT1501 AAGCAAGGAT TAATCGGTTT CACAGTGTTG ATTAACCAAA TTCTCGATGC1551 GCTCCAGATT TCCACCGGTG TTCGAACCGT GGTAATCACT CTGATCCCTT1601 TCGTTCGGAC ATTTTTGATC AAATTATCTG AACAATGGCC CCTTATTTCA1651 ACAATCATAT CCTTCGATGG ATAATTCAAC GTCGAAGATG AATGAAGAAA1701 TTTCTGTTCT CCAGTGGAAC TGCAGGAGCA TTGTTCCAAA ATTAGATTCT1751 CTTAAAATAT TAGCTCACGA AACTAAATGT GAAGTATTTG CTCTCTGTGA1801 
GACATGGCTT CCACCCAACG ATGATGGTCT GAATTTTCCC AATTTTAATA1851 TCATTACCAA AAATAGAGAC GACTCCTACG GAGGGGTTTT GTTAGGCATA1901 AGACACGGTT TAACATTCCA AAGATTGAAT CTTCCTTCTC AGCCTGGAAT1951 TGAAGTAGTT GCGATTCAGG TTCAAATTAA GAATAAATGT TTTTCAATAG2001 CTTCTGTATA TATCCCGCCC AAARCAAGTG TTAATCGTCA ACAGTTAAAA2051 AACATCGTTG AAATGATGCC TGAGCCAAGA CTTATTCTCG GCGACTTCAA2101 TTCTCATGGG ACAGGATGGG GTGAATTGTA CGACGACAAT CGAGCAAATC2151 TTATATATGA CTTATGYGAT GAATTTAATC TAACTATTAA GAACAGTGGT


2201 GAAATAACTC GAATTGCTAG ACCTCCTGCA AGGGAAAGTA GATTGGAYTT2251 GTCAATTTGC TCAAGAACAC TCTCAATAGA TTGCACCTGG AACGTAATTC2301 AAGATCCCCA TGGTAGCGAT CACCTTCCTA TTTTGATTTC AATTGCGACA2351 GGAAATCAAC CTKTAGAACC AGTYAGCTAT ACATACGATC TTACGAAAAA2401 TATAGATTGG AAAAGATATG CTCTCATTAT CACCGAGGCG ATTGAATCAA2451 TAGATCCTCT TACCCCCCAA GAAGAATACA CCTTCCTTGC AAATCTCATC2501 CACAGTAGCG CGATCCAAGC TCAAACAAAA CCAATACCAT CWGCTTCTTC2551 CCGAATGCGA CCTCCATCTT TATGGTGGGA CAAGGAGTGC TCGGAAGTGT2601 ACTCTGAGAA ATCAAATTCT TTCAAAATTT ACAGACGAAC GGGTCAAATT2651 GAGTCTTACG AACAGTACCT CCTTTTGGAG ATTAAGTTCC AAAaTTTAGT2701 AAAATGTAAA AAACGAAAYT ATTgGCGAAC GTTTGTTGAT GGGCTTTCAC2751 GCGAAACCTC CATGCGTACT CTTTGGACTA CAGCAAGAAG AATGAGAAAC2801 CGAGCTCCCA AAAACGCTAG TGAAGAGTAT TCTGATCGGT GGTTGCATAA2851 TTTTGCCAGA AAAGTGTGCC CCGACTCCAC GATTCCCAAA CAGAAAAGGT2901 ATTCGAATGA TCTTGTATTC CCGGAACTAT CATCCGCGTT CTCGATGATA2951 GAATTCTCGG TCGCTCTCCT TTCATGCAAT AACACTGCCT CTGGAATGGA3001 TGGAATTAAA TTTAATCTCC TGAAAAATTT GCCTTCCGWT GCAAAATGTC3051 GACTATTAAA CTTATTCAAT ATTTTCCTTG AACAAAACAT CGTCCCAGAA3101 GTCTGGAGAC AAGTCAGAGT TATAGCTATT CAAAAACCGG GTAAGCCGGC3151 CACCGATCAC CATTCATATA GGCCCATTTG TATGCTATCG TGCGTGCGAA3201 AGTTATTGGA AAAAATGATA CTTTTCAGAT TGGATAAATG GATGGAATCA3251 AACGGATTAT TATCAGATAC TCAGTTTGGA TTTCGTAGGG GCAAGGGAAC3301 GCARGATTGT TTAGCGCTGC TTTCAACCGA AATTCAACTA GCTTTCGCTA3351 AAAAAGMACA AATGGCTTCA ATTTTCTTAG ATGTAAAGGG AGCATTTGAT3401 TCAGTGTGCA TCGAGGTGCT AGCAGATAAA CTCCACAAAA GTGGACTCCC3451 ACCTTTATTG AACAATTTTT TGTATAACTT ACTCTCGGAA AAACACATGA3501 ATTTCATTCA TGGTAACGTG ACAATCACAA GATCTAGCTT TATGGGCCTT3551 CCTCAAGGAT CATGTYTAAG CCCTCTCTTG TACAATTTCT ATGTAAATGC3601 AATTGACTCT TGCCTCGATA ACGGGTGCAC AATAAGACAA TTGGCAGATG3651 ATTGCGTTGT ATCAGTTACT GGTCAGTCGG CCAACCATCT TTCTGAACCT3701 CTGCAGAACA CTTTAAACAA TTTATCTCGC TGGGCTATGG AATTAGGAAT3751 CGAGTTCTCA ACTGAGAAAA CGGAAATGGT CGTCTTCTCC AGAAAGCACA3801 ACCCCCCCTC ACTGAAGCTG TACCTACTGG GAAAACTTAT AATACAGTCC3851 CTGGTTTTCA AATATCTCGG TATTTGGTTT GACTCGAAAG 
GTACTTGGGC3901 TTGTCAAATA AGATACCTGA AACAGAAATG CCAACAGAGA ATAAACTTCC3951 TCCGAACAAT CACGGGTACG TGGTGGGGCG CACATCCCAC GGACCTCATT4001 AGGCTATACC AAACGACGAT ACGTTCAGTA TTGGAATATG GATGTTTTTG4051 CTTTCAATCC GCCGCGAAAA TCCACATGAT CAAACTTGAA AGAATACAGT4101 ATCGTTGTCT GCGCATTGCC TTAGGATGCA TGCACTCAAC TCATACGCTG4151 AGCCTAGAGG TACTTGCAGG CGTTCTTCCG CTGAAAACCA GATTGTATCA4201 GCTCGCTCAC AGAACGTTGA TTCGTTGTGA GATTAGGAAT CCATTAGTGA4251 TCCAGAACTT CGATCTTCTT CTCGACAAAA ATCCTCAGAC TAGGTTTATG4301 ACTATCTATC ACAACCACAT AACCAAGGAA ATCTCACCTT CAAACTTTAC4351 TCCCAACCGC AGCACAATAA GCAGCACGCA TAACCCATCA GTTTTATTTG4401 ATTTATCTAT GCAACAAGAA ATCAAGATGA TACCAGCAAG TCAACGTTCG4451 CAATTAGTAC CGCATATTTT TTTGTCTAAA TATAACCATA TTAAGGCGGA4501 AAACATGTTC TACACAGACG GATCGCTAAT CGAAGGGTCC ACAGGCTTCG4551 GGGTATTTAA TACGAAAGTA AGTGCCTTCC ACAAACTCCA AAATCCTGCT4601 ACAGTATACG TAGCAGAACA AGCTGCAATT CATTATGCAC TAGGGATCAT4651 TAACCTGCAG CCACAAGATC ACTACTACAT ATTTTCTGAC AGCCTTAGTA


4701 CAATTGAGGC TCTCCGGTCG TTGAAATCAC CCAATTCCTC GTCGTTCTTT4751 TTTCATAAAA TTAAAGAAAT CATGAGTTTA CTGGTAGAGA AAAAATACAA4801 AATTACTCTT GTTTGGATCC CTTCTCATTG TTCTGTATTA GGAAATGAGA4851 AAGCGGACTC GTTGGCAAAG CAAGGTGCCT TGGAAGGATC CACTTACGAT4901 CGTATTATCA CTTATGACGA ATATTTTACA ATCCCTCGTC AAGAATCTCT4951 TGTAAGCTGG CAAACCAAAT GGGACAAAAG CGAAATGGGT CGATGGCTTT5001 ACTCTATCAG GCCAAAAGTT TCTACAACTT CGTGGTTCAA ACACATGAAT5051 GTTGAAAGGG ATTTCATACG CGTAATATCA AGATTAATGT CAAACCACTA5101 CCTACTCAAC GCTCACTTAT ATCGGATTAA CTTAAAAGAT GACAATCTCT5151 GCGGTTGTGG AGAGGGTTAT CACGATATCG AACATATTGT TTGGAACTGT5201 CCAGAGAACC TTCACGCTAG ATCTCAACTC TTAGACTCCC TTAGGGCCCA5251 AGGAAGACAA TCAGACTTCC CTGTTCGTGA CATTTTGGCA AGTCAAGATG5301 TGCCATATCT TCTCTGCTTG TACCGCTTTC TAAAGTCAAT TAAAGTGCAC5351 CTGTAACAGC ATCAATCTCG CAAGCATCGC CACCCTGCAA CCTAGCAATA5401 GTAACATCTG ATAAAAACTA GAACCTTAGC CCGCACAGAA GCAAAAGTCC5451 GTCCTTAAAC ATAATGTATT ATTAACCTCG AAACAGCCGC GAGTATTCGG5501 CTTTCCCCCT TTACTAACCC TAGCTTTAAG TAATTATGTA AAAATGATAT5551 CCGGCTCCGT AAAACTTTGG TAGATGAGCC TAAATAAATA AAGACAGTTA5601 TAAAAAAAAA AAAAtAAT

//


E.2.1.8 Outcast

LOCUS Cp_Outcast_Ele3 2336 bp
DEFINITION Cp_Outcast_Ele3, 2336 bases, 2063 checksum.
ORIGIN

1 AACTACATCC AGCTACAAAA AGCCAGAGCA CTGTTTAAAA AAGCTGTCAG
51 GAAGGCGAAA CGGGAACACG TAGCTGAGCT GACGGGAAAG ATTGACGAGT

101 CGACACCTCC MAAACAGCTA TGGAACATCG TCAAGGGAAT AGAYACGGCG151 TTGGCTGRGG GCAGTAAAAA GAGGGCGATC CTGGAGCGCT CCAAAGGAGA201 GGAGTTTATG GAGCACTACT TCAGTGGAAG ATGTGGTACA GTGCAGTTGC251 CGAACTACGA GACAGCCCGG GACTTGGAAG GTTTCGAAAT GGCGCTYAAG301 GACGGCGAAG TGCTTAAMGC ACTGAAAAGA ACAAAAAATC ACTCGGCGCC351 GGGRGAAAAT CAAGTCTCGT ACGACATTRT TAARCAATTG CCGCTAGGTC401 TGCAGCTCAA GTTCGCAGAA ATGCTGAGCA GAGTATTCGC GACTGAAAAC451 ATTCCTGAAA GGTGGCGCAT CACTGAAGTA CGACCGATTC CAAAGAAAGG501 AGYGAACCCC AACCTACCRA ACTCGTGGAG ACCCATTGCG CTCATGAATA551 TCGAGATAAA GCTGATCAAC AGTGTAGTGA AAGACCGACT GGCGGCGATC601 GCGGAGCTAA ATGGTYTGAT CCCGGATYTG TCTTTTGGTT TCCGGAAGAA651 TGTRTCATCG GTAACCTGCG TGAACTATGT CGTGAATGCT GTACGAGAGG701 CGAAGGAGTA CAACAAYGAA GTCATCGTAG CATTTCTCGA CGTGAAGATG751 GCGTATGACA CCGTGAACAC GACTAAGCTG CTTCAGATCT TGGCARGGCT801 GGGTATCCCG GAAAAACTGA CATCGTGGCT CTACGAGTAT CTCAGATGTC851 GCGTGTTACG ACTACAAACG GAGGACGGAG TCGTAGAACA AGTRATCTCY901 GAGGGCCTAT CACAAGGTTG TCCGGCAGCA CCGACACTTT ACAACTTCTA951 CACGGCTGGG TTACACGATC TCTCAAACGA AACGTGCAAG TTRGTGCAGT1001 TTGCTGATGA TTTCGCCGTY ATCGCAACAG GTGCCTCCCT CGAGCTGGCG1051 GAACAACGGT TGAACGGTTT CCTCGATGTT TTGGCAGGSC GGCTRAAAGA1101 GCTGGACATG GAAGTAAGCC CATCCAAGTG CGCTGCGATC GCCTTCACCG1151 GAAAAAGGAT CGACCATCTC AGAGTCAAGA TGCAGGGGCA AGYGGTTCAG1201 ATCGTCAACA CCCACAAGTA TCTGGGRTAC ACCTTGGACC GGRCCTTGAA1251 ACACAGAAAA CACATCGAAA CCGTGACCGC CAAAGCCGGA GAGAAACTTG1301 GWCTGCTCAA GYTACTATCG AGGAAAACAA GTGGTGCGAA TCCGGCAACC1351 TTGGTCAAGG TGGGAAATGC GATTGTTCGG AGCCGGATGG AGTACGGAGC1401 CACGATCTAC GGGAATGCCG CCAAATCAAA TCTGGGAAAG CTGCAGGTGT1451 TACAAAATTC GTACATCAGA ATCGCCATGG GATATGTACG AAGCACACCC1501 ATCCACGTGA TGTTGGCCGA AGCTGRCCAA ATCCCGACAA GTCTTAGAAY1551 AGAGGCTCTT ACCAAGAGRG AACTGATCCG AAGTACGTAC TTCAGGACGC1601 CGTTGCTRCG CTTTATAAGC GACACGCTAT CGAGGGAGAT TCCAAACGGA1651 TCGTACCTGA CGGAAATGGC GGACAAGCAT GCGGATATCC TGTACCAACT1701 GCACCCYTCA GACAAGGATG TCGCACAGGA GGCAAGAATG AGCTACTTCA1751 GCAACTTTGA CCTGGAAGAC TAYGTACAGC ACACRCTAGG AAARGAAACA1801 
CKGAAAAAGG AGAACAYAAA CGAAGCAGTT TGGAGGCAGA ATTTTCATGA1851 AGTGGCCAAT GGAAAGTACA AAGACCACAA GCAGATATWC ACAGATGCTT1901 CGAAGACGCC CGGAGGGACA GCGCTGGCGG TCTACGACTC GAGCGAGGAG1951 GCGACCTACA CGGAGAGCAT TAACGACAAC TACTCAATCA TGAACGCAGA2001 GCTGCGGGCT ATTTGCATCG CAGTTGAGCA TGTGAAGCAA AAACAGTACG2051 AAAAGGCGGT CATCTACACG GATTCCAAGG CGGCTTGTCA GAGCYTGCTA2101 AACCWAAATG CACTGCGAGA GAACTTTATC GTTTGGAACA TTTACAAGGA2151 GATTCAAARC ATGCGGAGAG GCTCGCTGAG AATCCAATGG ATCCCCAGYC


2201 ACGTCGGAAT ACGAGGAAAT GAAATTGCGG ACCAAGCAGC GAAAGCGAAG2251 TCATACGAGA AGCAGACGGA GTTCATTGGA ATTACACTTG GAGATGCTAG2301 AGTACTTTGC CATGAAGAAA TCTGGTACAA TTGGaG

//


E.2.1.9 R1

LOCUS Cp_R1_Ele1 5425 bp
DEFINITION Cp_R1_Ele1, 5425 bases, 116E checksum.
ORIGIN

1 CGTGGGGACA GGTGAGGGTC CCTGCGGAGC TTAGCTGCCA GCTACCGGGC
51 GGGTTGCAGT AGGCGGATAG CTGTCGGCGA TTGCATACAT TCATTGCATC

101 GCCCCCGGAC CAGCAGCGGG AGGATGTCTA GGACGTGGCG GAATTGAACA151 AGGGCTCTGT TTAATTTCTT CGAAWAAAAA AACCACATGG GTCCGTAAAC201 ACCTCTGTCA AGCGATCGAA CGCCGCTATA AGTGTTTTAG CCCAAAACCT251 CACCAAACCC CGAATCCAAG ATGTGAYGCG ACCCGTGTCG AGGGATGCAT301 GGCTGGGGGG GTTCAACAAA TTCCCAGTCG ATAACGGAGC CTGTGGGGCA351 CAGGGGCGAA CCCCACACGT AATTTGCCCT TACTGCGTCA CGGCAGGGCA401 CTGGCGCAGC GGACCGTATT TCCCTAGCGA CTCGTGGGAT TCAAAATGAA451 GACAAGTGAA ACAAACCAAC AAAAGGGTCC CGATGCGCCC CAAGCATCGG501 AAAACACGGA GGTAGAAGAG GACGCAAACG TCGAAATGGC AGGAGGCGAG551 TCAAACGGCG GCGACACAGC AGGCGGAGTG GCGAGCGCGT TCCGTGGAAG601 CGGGAAGGTG TTGAGATCCC CAGTGTTGAA CCAGGCGGCT GCTTCGAGTC651 AGCAGATAGG AGTAATTGGA GAGGAGACTC CCAAGTCATC CTTGTTGAAC701 TTCGCCGGCA GTACCCCTCA GGACGGAGTC CTGCTCGGAa GGACCGcGTT751 GCAGGAGGTC AGAAGGAGGG TCAACGAACT CTTTGATTTC ATCAAGGACA801 AAAACAACGT CCACACCAGA ATCAAGCAGA TGGTGAATGG AGTCAAGGCA851 GCCATGAATG CCGCAGAGCG CGAAAACAGC TCGCTGGTGG TGACGCGGAA901 TTCACTGAAG CTCAGAGCTG AAAGAGCCGA AGAAACGCTG AAGGCAAAAC951 TGGAGGAGGA AGCGCTACGG GAGAAAGAAC CGAAAACGCC GCCCGGCCCA1001 AGCTCTAAAA GGGACAGGGA AACGCCTGGA GAGGAGGAGG ACGCAAAGAA1051 GCAGAAGCAG GGGAATGGAG ACAGTCCGGA CCCAGCGAAG GAGCCAGAAC1101 CAGACCCAGG GAAGGAGAAG GAATGGGAGA AGGTCAAGAA AAAGAAGCGG1151 AAGAAAAAAG GGAAGCAGAA CGAGGACACC CAAAAACCCA AGTTTCGCAG1201 GGAGCGTAAC AAAGGCGAGG CTTTGGTGGT CGAGGTGAAG GAAGGTGTTT1251 CGTACGCAGA CCTCCTCCGG AAAGTACGAA CCGATCCGGA ACTCAAGGAG1301 CTTGGCGAGA ACGTGGTTAA AACCAGGCGC ACTCAAACCG GAGCGATGCT1351 TTTTGAGCTG AAGAATGATC CCGCGGTCAA GAGCTCAGCT TTTAAGTCCC1401 TCGTCGAGAA AGCCGTAGGC TACGAGTCGA AGGTAAGGGC GCTATCGCCG1451 GAGACAACGA TTGAGTGCAG GAACCTGGAC GAGATCACGA CGGAGGAAGA1501 GCTAGAAGAT GCGCTGATCG TTCTTCTGGA TGACCGTACG ACACCGATGG1551 CAATCCGGTT GAGGAAAGCC TACGGCGGCA CGCAAATTGC GTCGATCCGA1601 CTATCGACGC CTTCGGCGTC TAAGCTGCTG GAAGCCGGCA AGGTCAAAGT1651 AGGGTGGTCG GTGTGCCCAC TGAGGCCTGT TCCTCGAGTG ACCCAGCAGA1701 TGACGAGGTG TTTCCGCTGT ATGGGTTTCG GCCACCAGGC GAGAAATTGC1751 GACGGTCCCG ATCGAACCAA CAGTTGCAGA AGGTGTGGTA GAGAAGGCCA1801 
CATGGCAAGA GACTGCAAAA ATAAGCCGAA GTGCGTGCTC TGTAAAGAAG1851 GCGATGGCAA TAGCCATGCG ACGGGTGGCT TTAATTGCCC GGTGTACACA1901 GAAGCTGGCC TCGGGCAAAA AGTAATGGAG GTGTCCCAGG TGAACCTCAA1951 TCACTGCGAC ACTGCACAGC AACTGCTGTG GCAGTCGACC GCGGAGACGG2001 GGTGTGACGT GGCAATTATT GCAGAACCGT ACCGAGTTCC ACACGACAAC2051 GGAAACTGGG CCGCGGATAC AGCAAGAATG GCGGCGATAC ACGTGATGGG2101 GCGGTACCCC ATACAGGAAG TGGTCTCGAG GGCGTTTGAA GGATTCGTGA2151 TCGCCAAAGT AAACGGAACC TTCTTCTGTA GCTGCTATGC TCCCCCAAGA


2201 TGGACCTTGG AGCAGTTTCA GCAGATGCTG GATAGTCTGA CCGACGAACT2251 GATCGGACGA AGCCCGATCG TTATCGGAGG TGACTTCAAC GCGTGGGCGG2301 TCGAGTGGGG TAGCAGATGC ACCAATGCTA GGGGGCATAG CCTAATGGAA2351 GCTCTGGCAA AGCTAGACGT TAGGCTGGCG AATCGCGGAA CCAGCAGTAC2401 CTTCCGCAAA GACGGTCGTG AGTCCATTAT CGACGTTACG TTCTGTAGCC2451 CGCGACTGGC GGCCGACATG AACTGGAGGG TGAGTGAGGA CTATACCCAT2501 AGCGATCACC AAGCGATCCG GTACAGCATC GGGAGACGAG CCCCTGTACC2551 AGATAGGAGC AGCCGGTCCT ACGGAAGGAA ATGGAAGCTG CAGTACTTCG2601 ACGAGGGTCT CTTCGTGGAA GCGCTCCATT GGTGTGATGG TCCCCAAGAC2651 TTGAGTGCCG ACGTGCTAAC AGCACAACTG GTGACAGCAT GCGACACAAC2701 CATGCCGCGG AGACTGGAGC CAAGGAACTG TCGTCGTCCA GCCTACTGGT2751 GGAATGAAGA ACTCGGTACC CTTCGGGCAA GTTGCCTCAG CGCCAGAAGA2801 CGAGTCCAGA GAGCAAGATC CGAAGCAACT AGAGAGGAGT GCAGAGAGGA2851 GTACCGGTCT GCAAAGGCCG CGCTCAAGAA AGCGATCAAA TGCAGCAAGA2901 CAAACTGCTT CAAGGAGTTA TGCCAAGACG CTGATGCAAA CCCTTGGGGG2951 AGCGCATATC GTGTCGCGAT GGCGAAGATC AGAGGCCCAT CGATGGTGGC3001 TGAAACGTGT CCCGACAAGC TGAAGGTCAT TGTGGAAGGG CTCTTCCCAA3051 GACATGACCC AACGACATGG CCTCCTACAC CGTACAACGA CGAAGGGGGT3101 AGCAACGCCG AAGGTCATCT GATCACCAAC GAAGAACTTG TGGCAGTAGC3151 GAAGAGATTG AAGGTGAAGA AAGCTCCCGG CCCGGATGGA ATCCCGAATT3201 TCGCCCTGAA ATCGGCGGTT CAAGCATTCC CGGACAGGTT TCGAACAGTC3251 CTGCAGAAAT GCCTGGACGA AGGACACTTC CCCGACCCGT GGAAGGTTCA3301 AAAGCTCGTG TTGCTGCCGA AGCCAGGCAA ACCACCGGGG GACCCATCAT3351 CGTATAGGCC TATATGTTTG CTGGACACCC TCGGAAAGCT TCTGGAACGG3401 ATCATCCTTA ACCGGCTGAC CAAGTACACG GAGAGCGAGC ATGGCTTAGC3451 AGCGAGGCAG TTCGGCTTCC GTAAAGGGAG ATCCACGGTG GACGCCATCC3501 GGAAAGTGGT CGAGAAAGCC GACGAAGCGC GGAGGAAAAA ACGCAGGGGG3551 AACCGTTGCT GCGCAATAGT CACGATTGAC GTCAAGAACG CGTTCAACAG3601 TGCGAGCTGG GCGGCCATAG CAGCAGCGCT GCACAAAATG AAGGTGCCTG3651 ACTATTTGTG CATGATCTTG AAGAGCTACT TCGAGAACCG CGTGCTGGTC3701 TACGACACTG CCGATGGACA AAAAACCGTT GTTGTTACCG CGGGAGTTCC3751 ACAGGGATCC ATTCTGGGTT CAGCACTGTG GAACGGAATG TATGACGGAG3801 TGTTGACACT GGGGCTACCC AACGGCGTAG AGATTGTTGG CTTTGCAGAC3851 GACATAGTGC TGACGGTAAC CGGCGAAAAT GTCGAGGAGG 
TCGAAATGCT3901 GGCTATGGAG GCAATCGCAA TGATCGAGAA CTGGATGCTC GAGGTGAAGC3951 TGCGGATCGC TCACCACAAG ACGGAGATGG TACTGGTTAG TAACCACAAA4001 AAGGTGCAGC AGGCCCAGAT ACACGTTGGG GAACACGTAG TGCACTCGAA4051 GAGAGCGCTC AAGTACCTCG GGGTGATGGT GGATGACCGG CTGAACTTCA4101 ACAGCCACGT CGATTACGCC TGCGAGAAGG CGGCTAAGGC GATCATGGCA4151 CTGTCGAGGA TGATGCCGAA CAACGCTGGA CCCAGGAGCA GTAGGCGCCG4201 CCTCTTGGCA AGTGTCGCGA CGTCCATACT TAGGTACGGC GGACCGGTAT4251 GGTGGACGGC GCTGGGGACG AAGCGAAATC GAGCGCTGCT CGACAGAACG4301 CAGAGACTGA TGGCCATGCG GGTTGCAAGC GCGTACAGGA CCATYTCGTC4351 GGAAGCAGTT GGCGTCATAG CCGGAATGAT CCCCATCGGC ATCACACTGG4401 AGGAGGACAC CGTGCGCTAC ACCCGRAGAG GCACGAGAGG TATCCGGGAA4451 GCTGCGAGAG CCGAATCGCT GGCAAGGTGG CAACGTGAGT GGGACACCAC4501 GGAGAAAGGC AGATGGACGC ATCGGCTTAT CCCGTCCGTA TCCACGTGGG4551 TGAGCAGAAG GCAYGGAGAG GTCACCTTCC ACCTCACACA GTTCCTGTCG4601 GGCCATGGCT GCTTCAGGAA GTACCTGCAC AGGTTYGGAC ATGCAGAGTC4651 TCCTCTCTGT CCGGACTGCG TCGATTGCGA GGAAACACCG GAGCACGTGG


4701 TGTTCGCCTG CCCTCGCTTC GAGGCAGCGC GAAGCGAAAT GCTGGCCATT4751 ATCGGAGCRG ACACCAGCCC GGATAATGTG GTGCGAAGAA TGTGCAGCGA4801 CATYGCCAAG TGGAATGCGG TCGTCGGAGC GGTGACGCAG ATCACTTCGG4851 CTCTCCAGCG GAAATGGAGA GACGATCAGA GGAGGAACGA CTAGGAGCCT4901 AGTCGAAAAC CCACGAGTGT GGCTGTGAAG GAGAGCACGT TATGATGGTC4951 GGCTCTACCA AATCGGTACA CGTCTCGATG GTCACAGGAG TCGAGAACCC5001 ACGAGTGTGG CTGTGAAGGA GAGCACGTTA TGACGGTTGG CTCTACCAAA5051 TCGGTACACG TCTCGATGGT CCAAGGAGAG GGCTGCATAT GACTAGCCGA5101 TCAAAAGCAA CGCGATTCTT GGGCGCGGTT AAACCCTCGC ATGGACTCAT5151 ATGTATGTRG AACAGGAAAT GGTTCTAGYA CCCGGCATGG ATCCTGTAAG5201 TAGACTAGTG CAGAAAATGC AACGCCTCCC CCCGAAGTTA TACCGAAAGG5251 TGGTCCCGGG GGGAMAAGGG CACGGCGTTC AAGGACTGGT TTAGTGGGTC5301 GGGAAAACTC TTTTTGTTTT CCCAACCCCA CACTACCTGA GAAATGAATT5351 CTCAGGTGTC TGGTAGCAGA TTCCGACCTT GTAAAAAAAA AAACACACAC5401 ACACACACAC ACACACACAC ACACA

//


E.2.1.10 RTE

LOCUS /tmp/readseq.in.19664 4069 bp
DEFINITION /tmp/readseq.in.19664 [Unknown form], 4069 bases, 1CD7 checksum.
ORIGIN

1 Cp_RTE_Ele MCGTGCACAT CGGTATTGTT TTTCGTACCG TTCCCGCGAA
51 AGTGCAAAAT TTACCACGAA AATACGCGTT AGTGGGGTGT GAAATTCATT

101 GCATCCGTGT TCTGAACCCG TCTCCTGGCC CGGGAAAGTG ATAGTCCGGT151 GTTTTAACGA CCGCGAGTGA GACAAATCAA TCACCAGATT TGACCGAATA201 TCGCGGTTCC GCCACGGGCA ATTGAAAAAA AGTACAAAAG AACTGCTAAG251 GCATCGTCGT CGTCGTCGCG ATCGATGGTC GTTGTATTGT GTCGYGCGGC301 GTGTGGAAGT GTACCGCGWA GACCATACTG ASGYCGTGTC GTCGTCGTCC351 TCGTCAGCYG TGDTCTGTTG CTASGAGAGW MGTGYAWAGA AGAGAAKCCG401 TWGGTGCAAC GGWGGGGTGA AKTGCTAAGG ARAGCKACAG AAAAAAAAGT451 GCTGCAAAAG TTTTTATTGG TRAAAWTAAA AAGAAGGAAW VAAKCAARCD501 GAAGMGTCMA CCCTGCTRCR TACACAGCCC CCCCCCCCCC CCTCTCTAGT551 TCCGTTTYTG GGCGTCTGCA CCCCAKGTTA GGGGCGGCCC AAACGGACTA601 GGTKGTCGAT TCCATTCTCA CCCGTGAGCG GCTGTTCCCC ACGTTAGGGG651 CGGCTCATGA AAGGAAGTGA GCCGACACCC CCCCCCCCCT CCCCCCACCC701 TTCTTGAGCG TCTGTTCCCC AGGTTAGGGG CGGCTCRAAG AAACCGGTGT751 CCTGCCTCYA TCGTCGAGGT AAGCGTCTGT TCTCCAKGTT AGGGGCGGCT801 TACAGCAGGA TAGAGTTCGG ASCCCCCACC CCCTCCCCCC CGAGCGTCTG851 TTCCCCAGGT TAGGGGCGGC TCGAAACAGC GTCTGTACCC CAGGTTAGGG901 GCGGCTGAGT AAAAGTCCYT GTGTCGGCGT GGGACTKTAA ACAGTACCGG951 CACGATGGTC CTCCGGCGAG ACAGGGGGTT GGTGCAGGCC ACACGAACCC1001 GCCGTAAAAC ACCAGTGCAG GAAGCACACG ATGCGAGCCG GACCAATCGG1051 CACGGAACTG GACATCWTAT GAGGTCCCAC GATTGGAAGC TCGGGACGTG1101 GAATTGCAGG TCTCTCAAAT TTGACGGGAG TATCCGCATA CTTTCCGACA1151 TATTGAGGGT CCGCAAGTTC AGCATCGTAG CGCTGCAGGA GGTTKGCTGG1201 ATAGGCGCGG AAGAGGTACA AGCGTACCCA AGGATTGGGC TGTACAATCT1251 ACCAGAGCCG CGGCGAAAAC AAGAGGCTGG GGACAGCCTT TATAGTGCTG1301 GGCGAAATGC GCGATCGCGT GATTGGGTGG ACCCCGCTCA CCGACCGAAT1351 GTGCGTGCTG AGGATTAAAG GCCGTTTCTT CAACATTAGC ATCATAAACG1401 TGCACAGCCC GCACTCAGGA AGCGAAGATG ACGACAAGGA CGCATTTTAC1451 GAGCAGCTGA ACTGGACGTA CAACAGCTGC CCAAAACATG ACGTCAAAAT1501 CGTCATCGGA GATTTTAACG CTCAGGTTGG CCAGGAGGAG GAATTCAGAC1551 CGGTGATAGG AAAGTTCAGC GCCCACGTAC GCACGAACGA AAACGGCCTG1601 CGACTGATCG ACTTCGCCAC CTCCAAAAAC ATGGCCGTAC GAAGTACCTG1651 CTTCCAGCAC AACCTCCGAG ACAAGTACAC CTGGAGATCA CCGCAAGGAA1701 CGGAATCACA AATCGACCAC GTCGTAATCG ACGGTAGACA CTTTTCCGAC1751 ATCATCGACG TCAGGACCTA TCGCGGCGCC AACGTCGACT CGGACCACTA1801 
TCTGGTGATG GTGAAAATGC GCCAACGACT TTCCCTGGCG AAAAGCGTTC1851 GGTACCGCCG CCCTCCGCGG TTGGATCTGG AGCGGCTTAA GTTACCGGAA1901 GTCGCATCCC GGTACGCGCA TTCGCTGGAG GCTGCGTTGC CAGGGGAGGG1951 TGAGCTGTTG GAAGCTCCCC TCGAGGACTG CTGGAGGAGC GTCAAGGCAG2001 CCATCACCAA CGCAGCGGAA AGCACCATCG GATTTGTGGA ACGAGGACGA2051 CGGAACGATT GGTTCGACGA GGAGTGTCGA GCGATTTTGG AGGAGAAGAA2101 TGCAGCACGG AGGGCAATGC TGCAGTACAA TCTCCGTGAT TACGAGGAGG2151 CGTATGGACA GAAGCGAAGG CAGCAGCACC AGCTCTTCCG AGCAAAAGTG


2201 CGCCACCAGG AAGAGTTGGA GTTTGAGGAC ATGGAGCAGC TGCATCGCTC2251 AAACGAAACG CGCAAGTTCT ACAAGAAGCT CAACGGATCC CGMAACGGCT2301 TCACGCCGCG AGTCGAAATG TGCCGGGATA AAAATGGAGC TATCTTGACG2351 AACGAGCGTG AGGTGATTGA CAGGTGGAAG CAGCACTTCG ATGAACACCT2401 GAATRGCGCA GAAGCAGAGG CAGGGGTCCA AGGCGGCAGG AGAGAGGACT2451 TCATCGGTAC AGCGGGAGAA GGAGAGGAGC CAGTTCCCAC GATGAGGGAA2501 GTTAAGGATG CCATCAAGAA GCTGAAGAAC AACAAAGCAG CGGGTAAGGA2551 TGGTATCGGT GCTGAACTCA TCAAGATGGG CCCGGAGAAG CTGGCGTCCT2601 GTCTGCACCG ACTGATAGTC AGGGTCTGGG AGTCAGAACA GCTACCGGAG2651 GAGTGGAAAG AGGGAGTAAT ATGCCCGATC TACAAGAAGG GGGACAAGTT2701 AGATTGTGAG AACTACCGTG CCATCACAAT CCTCAACGCG GCCTACAAAG2751 TGTTCTCCCA GATCCTCTTC AGCCGCCTAT CGCCAATAGC GGAAGGTTTT2801 GTTGGAAGTT ATCAAGCCGG ATTCGTCATG GGGAGATCAA CAACCGACCA2851 AATCTTCACT GTGCGACAAA TCCTCCAAAA GTGTCGCGAG TACCAAGTCC2901 CCACGCACCA CCTTTTCATC GACTTCAAAG CCGCGTACGA CTCAGTCGAT2951 CGCGAAGAGC TATGGAAAAT TATGGACGAG AACGGTTTTC CCGGGAAGCT3001 GATCAGACTG ATCAAGATGA CGATGGATGG GGCTAGGTGT TGTGTGAAGA3051 TATCGGGTGC GGAATCGGAC TCGTTTACTT CACTTGGGGG GCTTCGGCAA3101 GGCGATGGGA TCTCTTGYCT YTGTTTCAAT GTCGTGCTAG AAGGTGTTAT3151 GAGACGAGCG GGCTTCAATA TGCGGGGCAC GATCTTCAGC AAGTCCAACC3201 ARTTCATCTG CTWCGCCGAC GACATGGACA TTGTTGGCAG AACGTTCAAG3251 GCGGTTGCKG ATGCGTACAC CGRCTTGAAG CGGGAAGCAG AGAAGGTTGG3301 GCTAAGGGTG AATGTGGCGA AGACAAAGTA CCTGCTGGCA GGAGGAACCG3351 AGTCCCTTAG GGCTCGCATT GGACCRAGCG TTACRATCGA CGGGGACGAA3401 TTCGAGGTRG TGGAGGAGTT TGTATACCTC GGATCGTTGG TAACGTCGGA3451 CAACAGCTGC AGCAGGGAAA TTCGGAGGCG CATCATCGCT GGAAGTCGTG3501 CCTATTTCGG TCTYCACAAG AGCCTAAGGT CCCGGAAATT CTCCCTACAT3551 ACGAAGTGTT CCATCTACAA GTCGCTGATA AGACCGGTCG TCCTCTACGG3601 GCACGAGACG TGGACAATGC TCGARGAGGA CYTACGAGCG CTAARCGTYT3651 TCGAACGTCG AGTGCTAAGG ACCATCTTTG GCGGCGTATA TGAGAACGAC3701 GGATGGCGGC GGAGAATGAA CCACGARCTT GCRCAACTCT ACAACGAACC3751 AAGCATCCGG AARGTCGCGA AGGCTGGACG GTTGCAGTGG GCGGGTCATG3801 TTGCAAGGAT GCCGGAACGA GCCGASCAMT TGAGCCAACG GAACCAGAAG3851 ATCAATCCTG CGAAGTTGGT GTTTGTGTCG GAGCCGGTAG 
GAACAAGACG3901 TAGGGGGGTG CAACGTGCGA GGTGGGTGGA CCAAGTGGAG ARCGATYTRG3951 AAAGTGTGGG TGCGCCGCGA AATTGGAGAM AWGCAGCCAT GGACCGAGCT4001 TGTTGGCGGA GAATCGTGCA GCAGGYCAAG CTAATGGTGT AGCGCCAAYA4051 AAAGTAAAGT AAAGTAAGT

//


E.2.1.11 Unclassified LINE

LOCUS L1_Contig_59 405 bp
DEFINITION L1_Contig_59, 405 bases, 1F4B checksum.
ORIGIN

1 ACAATAGCCT AAGTATTCGC AGAGAAGTTT TAGTATTTGA AGTCTTTGTC
51 TTCTTCCGGC TCCCTCTAGG GCAGGTCTCA GCAGGTTCTC GAATGAGATA
101 GAGCGCCTTY CTCGCAAGAT CACAGTGACC TTGTTTTGAA ACAKCAGCCA
151 CAYGTCTGCC ACSCGTGAAC ATTCACTAAA TTTATGCTGC AGCGTTTCCA
201 CAGCAGCTCC ACAGTGAAGG CAACTCTMAC TGTTCACCCT CTGTATCGTG
251 TGAAGCAACT TACGATGTTC TGTTTTTTCG TTCACGAACA TGTACAGTTG
301 ATTCTGTTGT GCAGAAGTGA GCCCTCTCAA AGCGATGTTT TTTTTCCAAA
351 TTCTCCGCCA GTTGACCGCT GGGTTAGCTT GCTGAACCTT CGGCTGTTCG
401 GTTTG

//


E.3 D. melanogaster

E.3.1 Transposons

E.3.1.1 mariner

DEFINITION Putative Drosophila melanogaster mariner sequence
SOURCE Drosophila melanogaster
    flybase.org, dmel-all-chromosome-r5.29.fasta
FEATURES Location/Qualifiers
    source 1..1043
        /organism="Drosophila melanogaster"
        /mol_type="genomic DNA"
        /transposon="putative mariner transposon"
    repeat_region 1..26
        /note="left terminal inverted repeat"
    ORF1 66..780
        translation=LCNCILSSVSYQLLLLAECQNWFRKFRSGDFSLKHEPRSGRLYEVDDDLIKALIELDRHVNKQEIGEKFNIPKSTVYYHIKRLVKKFDIWVPHVLKEIHLTHRINACDMQLKCNEFDPFLKRITSGKEKWIVYNNVSRKRSWSKHGEPAQTTSKADIHQKKVMLSVWWDWKGVVYFELLPRNQTINSDVYCHQLNKLNTRRSDQNWSIVKVSYSTRITLDCTHLWSLSKNCVSLGRNF
        /product="putative mariner transposase"
    repeat_region 1018..1043
        /note="right terminal inverted repeat"

ORIGIN
1 TGCCCAAAAA GTAATTGCGG ATTTTTCATA TAGTCGGCGT TGACAAATTT
51 TTTCAACGGC TTGTGACTTT GTAATTGCAT TCTTTCATCT GTCAGTTATC
101 AGCTGTTACT ATTAGCTGAG TGTCAAAATT GGTTTCGCAA ATTCCGTTCT
151 GGAGATTTTT CACTTAAACA TGAGCCCCGT TCAGGTCGGC TATATGAAGT
201 TGATGATGAC CTAATCAAAG CATTAATCGA ATTGGATCGT CATGTAAATA
251 AGCAGGAGAT AGGAGAGAAG TTTAATATAC CAAAATCAAC CGTTTACTAT
301 CACATAAAAA GACTAGTGAA AAAGTTTGAT ATTTGGGTAC CACATGTATT
351 GAAAGAAATT CATTTAACAC ACCGAATAAA TGCTTGTGAT ATGCAACTTA
401 AATGCAATGA ATTCGATCCG TTTTTAAAAC GAATCACATC TGGAAAGGAA
451 AAATGGATTG TTTACAACAA CGTTAGTCGA AAACGATCAT GGTCCAAGCA
501 TGGTGAACCA GCTCAAACCA CTTCAAAGGC TGATATCCAC CAAAAGAAGG
551 TTATGCTGTC TGTTTGGTGG GATTGGAAGG GTGTCGTATA TTTTGAACTG
601 CTTCCAAGGA ACCAAACGAT TAATTCGGAT GTTTACTGTC ACCAATTGAA
651 CAAATTGAAT ACAAGGAGAA GCGACCAGAA TTGGTCAATC GTAAAGGTGT
701 CATATTCCAC CAGGATAACG CTAGACTGCA CACATCTTTG GTCACTATCC
751 AAAAACTGTG TGAGCTTAGG TAGGAACTTT TGATGCATCC ACCGTATAGC
801 CCTGACCTGG AACCATCAGA CTACCATTTA TTTCGATCTT TGCAGAACTC
851 CTTAAATGGT AAAACTTTCG GGAATGATGA GGCTATAAAA TCGCACTTGG
901 TTCAGTTTTT TGCAGATAAA GGCCAGAAGT TCTATTGACC GTGGAATAGA
951 AAAAAGGTTA TCGAAAAAAA TGGCAATTCA TTCTAAGTAT TATTAAAAAT
1001 GCATTTACTT TCTTTTAAAA AATCGGAAAT TATTTTTTGG GCA

//
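The two terminal inverted repeats annotated in the mariner record above (1..26 and 1018..1043) can be compared directly by reverse-complementing the right repeat and counting the positions at which it agrees with the left one. The following is a minimal sketch in core Perl, not part of the original analysis; the two 26-bp strings are copied from the ORIGIN block of the record, and the helper name `revcomp` is hypothetical.

```perl
use strict;
use warnings;

# Terminal inverted repeats copied from the record's ORIGIN block.
my $left_tir  = "TGCCCAAAAAGTAATTGCGGATTTTT";   # positions 1..26
my $right_tir = "AAAAATCGGAAATTATTTTTTGGGCA";   # positions 1018..1043

# Reverse complement of a DNA string (hypothetical helper).
sub revcomp {
    my ($dna) = @_;
    my $rc = reverse $dna;
    $rc =~ tr/ACGTacgt/TGCAtgca/;
    return $rc;
}

# Count positions where the left repeat equals the
# reverse complement of the right repeat.
my $rc_right = revcomp($right_tir);
my $matches  = 0;
for my $p (0 .. length($left_tir) - 1) {
    $matches++ if substr($left_tir, $p, 1) eq substr($rc_right, $p, 1);
}

printf "%d of %d positions match\n", $matches, length($left_tir);
```

For these two strings the repeats agree at 23 of 26 positions, so the annotated repeats are inverted copies of each other with three mismatches.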


APPENDIX F

SCRIPT USED TO IDENTIFY MITES

This appendix presents the BioPerl script we used to identify miniature inverted-repeat transposable elements (MITEs) in Pediculus humanus humanus.
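The idea underlying the search is that a MITE is flanked by inverted repeats, so the start of an element matches the reverse complement of its end, possibly with a few mismatches. The following is a minimal sketch of that comparison in core Perl. It is an illustration under stated assumptions rather than the script we used: the helper names `revcomp` and `inverted_repeat_length` and the toy sequence are hypothetical, and it does not use BioPerl.

```perl
use strict;
use warnings;

# Reverse complement of a DNA string (hypothetical helper).
sub revcomp {
    my ($dna) = @_;
    my $rc = reverse $dna;
    $rc =~ tr/ACGTacgt/TGCAtgca/;
    return $rc;
}

# Number of positions, counted inward from both ends of $dna, over which
# the sequence agrees with its own reverse complement. Up to $max_mismatch
# disagreeing positions are tolerated and included in the count; the scan
# stops at the first mismatch beyond that budget.
sub inverted_repeat_length {
    my ($dna, $max_mismatch) = @_;
    my $rc   = revcomp($dna);
    my $half = int(length($dna) / 2);
    my ($len, $mismatch) = (0, 0);
    while ($len < $half) {
        if (substr($dna, $len, 1) ne substr($rc, $len, 1)) {
            last if $mismatch == $max_mismatch;
            $mismatch++;
        }
        $len++;
    }
    return $len;
}

# Toy element: 8-bp inverted repeats around an unrelated 10-bp core.
my $toy = "GATTACAG" . "CCCCCTTTTT" . "CTGTAATC";
printf "repeat length: %d\n", inverted_repeat_length($toy, 0);
```

With no mismatches allowed, the toy element yields a repeat length of 8. Note that a tolerated mismatch at the end of the run is still counted, so a caller that needs exact repeat boundaries would have to trim trailing mismatches.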

# Ryan Kennedy and Scott Christley

use lib '/opt/bioperl';
use Bio::SeqIO;

# Command-line arguments, in order: minimum repeat length, input FASTA file,
# number of mismatches allowed, maximum distance, output name, minimum distance.
$len = 0;
$minlen = $ARGV[0];
$file = $ARGV[1];
$allowMismatch = $ARGV[2];
$maxDistance = $ARGV[3];
$name = $ARGV[4];
$minDistance = $ARGV[5];

$len = $minlen;
$counter = 0;

# write the run parameters to the output file
open(OUTDAT, ">$name.fa");
flock(OUTDAT, $LOCK);
print OUTDAT "==================================================\n";
print OUTDAT "PARAMETERS:\n";
print OUTDAT "\tMinimum Distance:\t $minDistance\n";
print OUTDAT "\tMaximum Distance:\t $maxDistance\n";
print OUTDAT "\tMismatches Allowed:\t $allowMismatch\n";
print OUTDAT "\tMinimum Length:\t $minlen\n";
print OUTDAT "==================================================\n";
flock(OUTDAT, $UNLOCK);
close(OUTDAT);

$in = Bio::SeqIO->new(-file => $file, -format => 'Fasta');

220

if(length($str1)>length($str2)) { $max=length($str1); }else { $max=length($str2); }

$num=1;while($seq=$in->next_seq()) {if ($seq->length() < 80) {

#starting sequence too shortnext;

}

$seq1=$seq->seq();$rvseq=$seq->revcom();$rseq=$rvseq->seq();$seq_full2=$seq1;$rseq_full=$rseq;

# search through scaffold, forward strand$lastMatch = "ZEWQSDV";for ($k = 0; $k < length($seq1); $k++) {

#print "k: " . $k . "\n";

# extract substrings$distance = length($seq1) - $k;if ($distance > $maxDistance) { $distance = $maxDistance; }

$seqSub = substr($seq1, $k, $distance);$rseqSub =

substr($rseq_full, length($rseq_full) - $distance - $k, $distance);

# forward strand$i = 0;#for ($i = 0; $i < $distance; $i++) {

# reverse strandfor ($j = 0; $j < $distance; $j++) {

# compare strings and allow for mismatches$len = 0;$numMismatch = 0;$match = "";while ((substr($seqSub,$i + $len, 1) eq

substr($rseqSub,$j + $len, 1))|| ($numMismatch < $allowMismatch)) {

if (substr($seqSub,$i + $len, 1) nesubstr($rseqSub,$j + $len, 1)) {

++$numMismatch;}$len++;$match=substr($seqSub,$i,$len);

221

if( (($i + $len) > $distance) ||(($j + $len) > $distance) ) { last; }

}

# if the distance between the two repeats is too small then skip$back = $distance - $j + $k - $len;$front = $i + $k;if ($front > $back) { next; }if (abs($front - $back) < $minDistance) { next; }

if (($match ne "") && (length($match) >= $minlen)) {#print $match . " $i\n";

# trim out simple repeatsmy $cntA = 0;my $cntT = 0;my $cntG = 0;my $cntC = 0;for ($l = 0; $l < length($match); ++$l) {

$seqstr = substr($match, $l, 1);if ($seqstr eq "A") { ++$cntA; }if ($seqstr eq "T") { ++$cntT; }if ($seqstr eq "G") { ++$cntG; }if ($seqstr eq "C") { ++$cntC; }

}if (($cntA == 0) || ($cntT == 0) ||

($cntG == 0) || ($cntC == 0)) {# simple repeat

} else {# check the other sequence$cntA = 0;$cntT = 0;$cntG = 0;$cntC = 0;$match=substr($rseqSub,$j,$len);#print $match . " $j\n";for ($l = 0; $l < length($match); ++$l) {

$seqstr = substr($match, $l, 1);if ($seqstr eq "A") { ++$cntA; }if ($seqstr eq "T") { ++$cntT; }if ($seqstr eq "G") { ++$cntG; }if ($seqstr eq "C") { ++$cntC; }

}if (($cntA == 0) || ($cntT == 0) ||

($cntG == 0) || ($cntC == 0)) {#print "Simple Repeat " . $seq->id() . "\n";# simple repeat

} else {#print $lastMatch . "\n";($seqNew,$passed) =

222

loop(substr($seqSub,$i,$len), $lastMatch);if($passed==1) {

$lastMatch = $seqNew;open(OUTDAT, ">>$name.fa");flock(OUTDAT,$LOCK);print OUTDAT ">" . $seq->id() .

" $counter $front $back $len\tRPT1:" .$seqNew . "\tRPT2:" .substr($seq1,$back,$len) . "\n";

flock(OUTDAT,$UNLOCK);close(OUTDAT);$counter++;

}}

}}$len=$minlen;$match="";

}}$num++;

}

sub loop() {$check=0;$pass=0;

$keeper=$sequence;my($sequence, $last_sequence) = @_;#print $sequence . " " . $last_sequence . "\n";while($sequence =~ /(A{4}|T{4}|G{4}|C{4})$/i ||

$sequence =~ /^(A{4}|T{4}|G{4}|C{4})/i ||$sequence =~ /(ATAT|GCGC|GAGA|CTCT|CACA|TGTG)$/i ||$sequence =~ /^(ATAT|GCGC|GAGA|CTCT|CACA|TGTG)/i ) {

#Removed$check=1;$sequence = $‘ . $’;

} #end whileif(length($sequence) >= $minlen) {if($check>0) { $sequence=$keeper;

}if($last_sequence =~ $sequence) {

$pass=0;} else {$pass=1;}#$pass=1;

} else { $pass=0;#Sequence full of Repeats or is a substring

}return $sequence, $pass;

} #end loop
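The core idea of the script above, comparing a stretch of sequence against its own reverse complement to find inverted repeats and discarding low-complexity candidates, can be illustrated with a much smaller sketch. The following is not the script used for the analysis. It is a simplified illustration that uses only core Perl (no BioPerl), allows no mismatches, and checks only the two ends of a single sequence. The subroutine names revcomp, is_simple_repeat, and tir_length, as well as the example sequence, are hypothetical and chosen for illustration.

```perl
use strict;
use warnings;

# Reverse-complement a DNA string (uppercase A, C, G, T only).
sub revcomp {
    my ($dna) = @_;
    my $rc = reverse $dna;
    $rc =~ tr/ACGT/TGCA/;
    return $rc;
}

# Return 1 if the candidate lacks at least one of the four bases.
# This mirrors the "simple repeat" criterion in the appendix script,
# which discards a repeat when any base count is zero.
sub is_simple_repeat {
    my ($candidate) = @_;
    for my $base (qw(A C G T)) {
        return 1 if index($candidate, $base) < 0;
    }
    return 0;
}

# Length of the terminal inverted repeat: the longest prefix of the
# sequence that equals the prefix of its reverse complement.
# Assumption for this sketch: exact matching, capped at half the length.
sub tir_length {
    my ($dna) = @_;
    my $rc  = revcomp($dna);
    my $max = int(length($dna) / 2);
    my $len = 0;
    $len++ while $len < $max
        && substr($dna, $len, 1) eq substr($rc, $len, 1);
    return $len;
}

# Hypothetical element: an 8 bp inverted repeat flanking a 10 bp spacer.
my $element = "CAGTGGTC" . "AAAAAAAAAA" . "GACCACTG";
my $tir     = tir_length($element);
my $repeat  = substr($element, 0, $tir);

printf "TIR length: %d\n", $tir;                 # prints "TIR length: 8"
printf "Simple repeat: %s\n",
    is_simple_repeat($repeat) ? "yes" : "no";    # prints "Simple repeat: no"
```

In the full script, the mismatch allowance and the minimum and maximum distance arguments generalize this comparison to repeat pairs found anywhere within a window of each scaffold, rather than only at the two ends of one sequence.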


REFERENCES

1. S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller,and D. J. Lipman. Gapped BLAST and PSI-BLAST: a new generation ofprotein database search programs. Nucleic Acids Research, 25(17):3389–3402, July 1997.

2. O. Andrieu, A. S. Fiston, D. Anxolabehere, and H. Quesneville. Detectionof transposable elements by their compositional bias. BMC Bioinformatics,5(94), July 2004.

3. S. M. Anwar, M. Musiani, G. McDermid, and D. Marceau. How Do HumanActivities Shape Wolves’ Behavior In The Central Rocky Mountain Region,Alberta, Canada? In L. Yilmaz, editor, Proceedings of the 2009 Agent-Directed Simulation Symposium. The Society for Modeling and SimulationInternational, March 2009.

4. ArcGIS. http://www.esri.com/software/arcgis.

5. P. Arensburger, K. Megy, R. M. Waterhouse, J. Abrudan, P. Amedeo, B. An-telo, L. Bartholomay, S. Bidwell, E. Caler, F. Camara, C. L. Campbell, K. S.Campbell, C. Casola, M. T. Castro, I. Chandramouliswaran, S. B. Chap-man, S. Christley, J. Costas, E. Eisenstadt, C. Feschotte, C. Fraser-Liggett,R. Guigo, B. Haas, M. Hammond, B. S. Hansson, J. Hemingway, S. R. Hill,C. Howarth, R. Ignell, R. C. Kennedy, C. D. Kodira, N. F. Lobo, C. Mao,G. Mayhew, K. Michel, A. Mori, N. Liu, H. Naveira, V. Nene, N. Nguyen,M. D. Pearson, E. J. Pritham, D. Puiu, Y. Qi, H. Ranson, J. M. C. Ribeiro,H. M. Roberston, D. W. Severson, M. Shumway, M. Stanke, R. L. Strausberg,C. Sun, G. Sutton, Z. J. Tu, J. M. C. Tubio, M. F. Unger, D. L. Vanlanding-ham, A. J. Vilella, O. White, J. R. White, C. S. Wondji, J. Wortman, E. M.Zdobnov, B. Birren, B. M. Christensen, F. H. Collins, A. Cornel, G. Di-mopoulos, L. I. Hannick, S. Higgs, G. C. Lanzaro, D. Lawson, N. H. Lee,M. A. T. Muskavitch, A. S. Raikhel, and P. W. Atkinson. Sequence of Culexquinquefasciatus Establishes a Platform for vector Mosquito ComparativeGenomics. Science, 330(6000):86–88, October 2010.

224

6. S. M. N. Arifin, R. C. Kennedy, K. E. Lane, G. R. Madey, A. Fuentes,and H. Hollocher. P-SAM: A Post-Simulation Analysis Module for Agent-Based Models. In Proceedings of the International Simulation Multiconfer-ence (ISMc2010): Summer Computer Simulation Conference (SCSC2010),2010.

7. O. Balci. Handbook of Simulation: Principles, Methodology, Advances, Ap-plications, and Practice, chapter Verification, Validation, and Testing. JohnWiley & Sons, New York, NY, 1998.

8. J. Banks and R. R. Gibson. Don’t simulate when... 10 rules for determiningwhen simulation is not appropriate. IEE Solutions, September 1997.

9. J. Banks and J. S. C. II. Introduction to discrete-event simulation. InProceedings of the 1986 Winter Simulation Conference, pages 17–23, 1986.

10. J. Banks, J. S. C. II, B. L. Nelson, and D. M. Nicol. Discrete-Event SystemSimulation. Pearson Education, Inc., Upper Saddle River, NJ, fourth edition,2005.

11. Z. Bao and S. Eddy. Automated de novo identification of repeat sequencefamilies in sequenced genomes. Genome Research, 12(8):1269–1276, August2002.

12. E. A. Bennett, L. E. Coleman, C. Tsui, W. S. Pittard, and S. E. Devine. Nat-ural genetic variation caused by transposable elements in humans. Genetics,168:933–951, October 2004.

13. C. M. Bergman and H. Quesneville. Discovering and detecting transposableelements in genome sequences. Briefings in Bioinformatics, 8(6):382–392,November 2007.

14. J. Biedler and Z. Tu. Non-LTR Retrotransposons in the African MalariaMosquito, Anopheles gambiae: Unprecedented Diversity and Evidence ofRecent Activity. Molecular Biology and Evolution, 20(11):1811–1825, 2003.

15. E. Birney, M. Clamp, and R. Durbin. GeneWise and Genomewise. GenomeResearch, 14:988–995, 2004.

16. E. Birney and R. Durbin. Using GeneWise in the Drosophila AnnotationExperiment. Genome Research, 10:547–548, 2004.

17. BLAST. http://www.ncbi.nlm.nih.gov/blast.

225

18. D. Brown, R. Riolo, D. Robinson, M. North, and W. Rand. Spatial Pro-cess and Data Models: Toward Integration of Agent-based Models and GIS.Journal of Geographic Systems, Special Issue on Space-Time InformationSystems, 7(1):25–47, 2005.

19. R. V. Bruggner. A system for integration and management of communityannotation for vectorbase.org. Master’s Thesis, University of Notre Dame,2007.

20. C. Burge and S. Karlin. Prediction of complete gene structures in humangenomice DNA. Journal of Molecular Biology, 268:78–94, 1997.

21. R. E. Butler. The Design and Development of VectorBase: A BioinformaticResource Center for Invertebrate Vectors of Human Pathogens. Master’sThesis, University of Notre Dame, 2010.

22. P. Capy, C. Bazin, D. Higuet, and T. Langin. Dynamics and Evolution ofTransposable Elements. Landes Bioscience, Austin, Texas, 1998.

23. L. Cary, M. Goebel, B. Corsaro, H. Wang, E. Rosen, and M. Fraser. Trans-poson mutagenesis of baculoviruses: analysis of Trichoplusia ni transposonifp2 insertions within the fp-locus of nuclear polyhedrosis viruses. Virology,172(1):156–169, September 1989.

24. A. Caspi and L. Pachter. Identification of transposable elements using multi-ple alignments of related genomes. Genome Research, 16:260–270, February2006.

25. C. Castle, A. Crooks, P. Longley, and M. Batty. Agent-based modellingand simulation using repast: A gallery of gis applications from CASA. InG. Priestnall and P. Alpin, editors, Proceedings of the 14th GeographicalInformation Systems Research UK Conference, pages 237–239, 2006.

26. Chado. http://www.gmod.org/wiki/index.php/chado.

27. Chado Best Practices.http://gmod.org/wiki/index.php/Chado Best Practices#Transposons.

28. N. L. Craig, R. Craigie, M. Gellert, and A. M. Lambowitz, editors. MobileDNA II. ASM Press, Washington, DC, 2002.

29. A. T. Crooks. UCL working paper series: The repast sim-ulation/modelling system for geospatial simulation. available athttp://www.casa.ucl.ac.uk/working papers/paper123.pdf, September 2007.

226

30. V. Curwen, E. Eyras, T. D. Andrews, L. Clarke, E. Mongin, S. M. Searle,and M. Clamp. The ensembl automatic gene annotation system. GenomeResearch, 14:942–950, 2004.

31. V. Curwen, E. Eyras, T. D. Andrews, L. Clarke, E. Mongin, S. M. Searle,and M. Clamp. The Ensembl Automatic Gene Annotation System. GenomeResearch, 14(5):942–950, 2004.

32. P. Daszak, A. Cunningham, and A. Hyatt. Anthropogenic environmentalchange and the emergence of infectious disease in wildlife. Acta Tropica,78:103–116, 2001.

33. DNASTAR SeqMan.http://www.dnastar.com/products/seqmanpro.php.

34. Douglas-Peucker Algorithm.http://geometryalgorithms.com/Archive/algorithm 0205/#Douglas-Peucker%20algorithm.

35. R. D. Dowell, R. M. Jokerst, A. Day, S. R. Eddy, and L. Stein. The Dis-tributed Annotation System. BMC Bioinformatics, 2(7), 2001.

36. R. Drysdale and the FlyBase Consortium. FlyBase: a database for theDrosophila Research Community. Methods in Molecular Biology, 420:45–49,2008.

37. R. M. D’Souza, M. Lysenko, S. Marino, and D. Kirschner. Data-ParallelAlgorithms for Agent-Based Model Simulation of Tuberculosis On Graph-ics Processing Units. In L. Yilmaz, editor, Proceedings of the 2009 Agent-Directed Simulation Symposium. The Society for Modeling and SimulationInternational, March 2009.

38. R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological sequence analy-sis: Probabilistic models of proteins and nucleic acids. Cambridge UniversityPress, Cambridge, United Kingdom, 2003.

39. S. R. Eddy. Profile hidden markov models. Bioinformatics Review,14(9):755–763, 1998.

40. R. C. Edgar and E. W. Myers. PILER: identification and classification ofgenomic repeats. Bioinformatics, 21 Suppl. 1:i152–i158, March 2005.

41. L. J. Engel, G. A. Engel, M. A. Schillaci, A. Rompis, A. Putra, K. G.Suaryana, A. Fuentes, B. Beer, S. Hicks, R. White, B. Wilson, and J. S.Allan. Primate-to-human retroviral transmission in asia. Emerging Infec-tious Diseases, 11(7), July 2005.

227

42. J. E. Fa and D. G. Lindburg, editors. Evolution and Ecology of MacaqueSocities. Cambridge University Press, 2005.

43. P. Flicek, B. Aken, K. Beal, B. Ballester, M. Caccamo, Y. Chen, L. Clarke,G. Coates, F. Cunningham, T. Cutts, T. Down, S. Dyer, T. Eyre, S. Fitzger-ald, J. Fernandez-Banet, S. Graf, S. Haider, M. Hammond, R. Holland,K. Howe, K. Howe, N. Johnson, A. Jenkinson, A. Kahari, D. Keefe,F. Kokocinski, E. Kulesha, D. Lawson, I. Longden, K. Megy, P. Meidl,B. Overduin, A. Parker, B. Pritchard, A. Prlic, S. Rice, D. Rios, M. Schus-ter, I. Sealy, G. Slater, D. Smedley, G. Spudich, S. Trevanion, A. Vilella,J. Vogel, S. White, M. Wood, E. Birney, T. Cox, V. Curwen, R. Durbin,X. Fernandez-Suarez, J. Herrero, T. Hubbard, A. Kasprzyk, G. Proctor,J. Smith, A. Ureta-Vidal, and S. Searle. Ensembl 2008. Nucleic Acids Re-search, 36:d707–d714, January 2008.

44. P. Flicek, B. L. Aken, B. Ballester, K. Beal, E. Bragin, S. Brent, Y. Chen,P. Clapham, G. Coates, S. Fairley, S. Fitzgerald, J. Fernandez-Banet, L. Gor-don, S. Graf, S. Haider, M. Hammond, K. Howe, A. Jenkinson, N. Johnson,A. Kahari, D. Keefe, S. Keenan, R. Kinsella, F. Kokocinski, G. Koscielny,E. Kulesha, D. Lawson, I. Longden, T. Massingham, W. McLaren, K. Megy,B. Overduin, B. Pritchard, D. Rios, M. Ruffier, M. Schuster, G. Slater,D. Smedley, G. Spudich, Y. A. Tang, S. Trevanion, A. Vilella, J. Vogel,S. White, S. P. Wilder, A. Zadissa, E. Birney, F. Cunningham, I. Dunham,R. Durbin, X. M. Fernandez-Suarez, J. Herrero, T. J. P. Hubbard, A. Parker,G. Proctor, J. Smith, and S. M. J. Searle. Ensembl’s 10th year. Nucleic AcidsResearch, 38:D557–562, 2010.

45. P. Flicek, M. R. Amode, D. Barrell, K. Beal, S. Brent, Y. Chen, P. Clapham,G. Coates, S. Fairley, S. Fitzgerald, L. Gordon, M. Hendrix, T. Hourlier,N. Johnson, A. Khri, D. Keefe, S. Keenan, R. Kinsella, F. Kokocinski,E. Kulesha, P. Larsson, I. Longden, W. McLaren, B. Overduin, B. Pritchard,H. S. Riat, D. Rios, G. R. S. Ritchie, M. Ruffier, M. Schuster, D. Sobral,G. Spudich, Y. A. Tang, S. Trevanion, J. Vandrovcova, A. J. Vilella, S. White,S. P. Wilder, A. Zadissa, J. Zamora, B. L. Aken, E. Birney, F. Cunningham,I. Dunham, R. Durbin, X. M. Fernndez-Suarez, J. Herrero, T. J. P. Hubbard,A. Parker, G. Proctor, J. Vogel, and S. M. J. Searle. Ensembl 2011. NucleicAcids Research. Epub ahead of print.

46. J. Fooden. Systematic review of southeast asian longtail macaques, Macacafascicularis. Fieldiana Zoology, 81, 1995.

47. A. Fuentes, M. Southern, and K. G. Suaryana. Monkey forests and humanlandscapes: Is extensive sympatry sustainable for Homo sapiens and Macaca

228

fascicularis on Bali? In J. D. Patterson and J. Wallis, editors, Commen-salism and Conflict: The Primate-Human Interface. American Society ofPrimatology Publications, 2005.

48. GD Graphics Library. http://www.libgd.org/.

49. GeoTools. http://geotools.codehaus.org.

50. N. Gilbert. Agent-based Models. SAGE Publications, Thousand Oaks, CA,2008.

51. H. R. Gimblett, editor. Integrating Geographic Information Systems andAgent-based Modeling Techniques for Simulating Social and Ecological Pro-cesses. Oxford University Press, 2002.

52. H. R. Gimblett. Integrating geographic information systems and agent-basedtechnologies for modeling and simulating social and ecological phenomena.In H. R. Gimblett, editor, Integrating Geographic Information Systems andAgent-based Modeling Techniques for Simulating Social and Ecological Pro-cesses. Oxford University Press, 2002.

53. GMOD. http://www.gmod.org/.

54. GMOD Names of Features.http://www.gmod.org/wiki/index.php/Chado Sequence Module#Names of Features.

55. GNU General Public License (GPL) v3.http://www.gnu.org/licenses/gpl.html.

56. GRASS: Geographic Resources Analysis Support System.http://grass.osgeo.org.

57. P. Green. http://www.phrap.org/phredphrapconsed.html.

58. V. Grimm, U. Berger, F. Bastiansen, S. Eliassen, V. Ginot, J. Giske,J. Goss-Custard, T. Grand, S. K. Heinz, G. Huse, A. Huth, J. U. Jepsen,C. Jørgensen, W. M. Mooij, B. Muller, G. Pe’er, C. Piou, S. F. Rails-back, A. M. Robbins, M. M. Robbins, E. Rossmanith, N. Ruger, E. Strand,S. Souissi, R. A. Stillman, R. Vabø, U. Visser, and D. L. DeAngelis. Astandard protocol for describing individual-based and agent-based models.Ecological Modelling, 198(1):115–126, 2006.

59. V. Grimm, U. Berger, D. L. DeAngelis, J. G. Polhill, J. Giske, and S. F.Railsback. The ODD protocol: A review and first update. Ecological Mod-elling, 221:2760–2768, 2010.

229

60. V. Grimm and S. F. Railsback. Individual-based Modeling and Ecology.Princeton University Press, Princeton, NJ, 2005.

61. U. Hellsten and et al. The genome of the Western clawed frog Xenopustropicalis. Science, 328(5978):633–636, April 2010.

62. Hibernate. http://www.hibernate.org.

63. R. A. Holt, G. M. Subramanian, A. Halpern, G. G. Sutton, R. Charlab,D. R. Nusskern, P. Wincker, A. G. Clark, J. M. C. Ribeiro, R. Wides,S. L. Salzberg, B. Loftus, M. Yandell, W. H. Majoros, D. B. Rusch, Z. Lai,C. L. Kraft, J. F. Abril, V. Anthouard, P. Arensburger, P. W. Atkinson,H. Baden, V. de Berardinis, D. Baldwin, V. Benes, J. Biedler, C. Blass,R. Bolanos, D. Boscus, M. Barnstead, S. Cai, A. Center, K. Chatuverdi,G. K. Christophides, M. A. Chrystal, M. Clamp, A. Cravchik, V. Curwen,A. Dana, A. Delcher, I. Dew, C. A. Evans, M. Flanigan, A. Grundschober-Freimoser, L. Friedli, Z. Gu, P. Guan, R. Guigo, M. E. Hillenmeyer, S. L.Hladun, J. R. Hogan, Y. S. Hong, J. Hoover, O. Jaillon, Z. Ke, C. Kodira,E. Kokoza, A. Koutsos, I. Letunic, A. Levitsky, Y. Liang, J.-J. Lin, N. F.Lobo, J. R. Lopez, J. A. Malek, T. C. McIntosh, S. Meister, J. Miller,C. Mobarry, E. Mongin, S. D. Murphy, D. A. O’Brochta, C. Pfannkoch,R. Qi, M. A. Regier, K. Remington, H. Shao, M. V. Sharakhova, C. D. Sit-ter, J. Shetty, T. J. Smith, R. Strong, J. Sun, D. Thomasova, L. Q. Ton,P. Topalis, Z. Tu, M. F. Unger, B. Walenz, A. Wang, J. Wang, M. Wang,X. Wang, K. J. Woodford, J. R. Wortman, M. Wu, A. Yao, E. M. Zdobnov,H. Zhang, Q. Zhao, S. Zhao, S. C. Zhu, I. Zhimulev, M. Coluzzi, A. dellaTorre, C. W. Roth, C. Louis, F. Kalush, R. J. Mural, E. W. Myers, M. D.Adams, H. O. Smith, S. Broder, M. J. Gardner, C. M. Fraser, E. Birney,P. Bork, P. T. Brey, J. C. Venter, J. Weissenbach, F. C. Kafatos, F. H.Collins, and S. L. Hoffman. The genome sequence of the malaria mosquitoAnopheles gambiae. Science, 298(5591), October 2002.

64. X. Huang and A. Madan. CAP3: A DNA Sequence Assembly Program.Genome Research, 9:868–877, 1999.

65. International Human Genome Sequencing Consortium. Initial sequencingand analysis of the human genome. Nature, 409(6822):860–921, 2001.

66. R. M. Itami. Mobile agents with spatial intelligence. In H. R. Gimblett, ed-itor, Integrating Geographic Information Systems and Agent-based ModelingTechniques for Simulating Social and Ecological Processes. Oxford UniversityPress, 2002.

67. Java. http://java.sun.com.

230

68. A. Jenkinson, M. Albrecht, E. Birney, H. Blankenburg, T. Down, R. Finn,H. Hermjakob, T. Hubbard, R. Jimenez, P. Jones, A. Kahari, E. Kulesha,J. Macias, G. Reeves, and A. Prlic. Integrating biological data - the Dis-tributed Annotation System. BMC Bioinformatics, 9(Suppl 8):S3, 2008.

69. N. C. Jones and P. A. Pevzner. An Introduction to Bioinformatics Algo-rithms. The MIT Press, Cambridge, MA, 2004.

70. JTS Topology Suite. http://www.vividsolutions.com/jts/jtshome.htm.

71. J. Jurka, V. V. Kapitonov, A. Pavlicek, P. Klonowski, O. Kohany, andJ. Walichiewicz. Repbase update, a database of eukaryotic repetitive ele-ments. Cytogenetic and Genome Research, 110:462–467, 2005.

72. J. Jurka, P. Klonowski, V. Dagman, and P. Pelton. Censor–a program foridentification and elimination of repetitive elements from dna sequences.Computers and Chemistry, 20(1):119–121, 1996.

73. K. Kaiser, J. W. Sentry, and D. J. Finnegan. Eukaryotic transposable ele-ments as tools to study gene structure and function. In D. J. Sherratt, editor,Mobile Genetic Elements. Oxford University Press, 1995.

74. M. Keeling, M. Woolhouse, R. May, G. Davies, and B. Grenfell. Modellingvaccination strategies against foot-and-mouth disease. Nature, 421:136–142,January 2003.

75. R. C. Kennedy. Verification and Validation of Agent-based and Equation-based Simulations and Bioinformatics Computing: Identifying Transpos-able Elements in the Aedes aegypti Genome. Master’s Thesis, Universityof Notre Dame, April 2006.

76. R. C. Kennedy, K. E. Lane, S. M. N. Arifin, A. Fuentes, H. Hollocher, andG. R. Madey. A GIS Aware Agent-Based Model of Pathogen Transmission.International Journal of Intelligent Control and Systems, 14(1):51–61, March2009. Invited.

77. R. C. Kennedy, K. E. Lane, S. M. N. Arifin, H. Hollocher, A. Fuentes, andG. R. Madey. Effectively integrating gis data into an agent-based epidemi-ological model. In NICO Complexity Conference, September 2009. PosterAward Winner.

78. R. C. Kennedy, K. E. Lane, S. M. N. Arifin, H. Hollocher, A. Fuentes, andG. R. Madey. Simulation and Analysis of Pathogen Transmission in anAgent- and GIS-based Model. In North American Association for Com-putational Social and Organization Science 2009 Conference, Tempe, AZ,October 2009.

231

79. R. C. Kennedy, K. E. Lane, A. Fuentes, H. Hollocher, and G. Madey. Spa-tially Aware Agents: An effective and efficient use of GIS data within anAgent-based Model. In L. Yilmaz, editor, Proceedings of the 2009 Agent-Directed Simulation Symposium. The Society for Modeling and SimulationInternational, March 2009.

80. R. C. Kennedy, M. F. Unger, S. Christley, F. H. Collins, and G. R. Madey. Anautomated homology-based approach for identifying transposable elements.BMC Bioinformatics, 2010. Under review.

81. R. C. Kennedy, X. Xiang, T. F. Cosimano, L. A. Arthurs, P. A. Maurice,and S. E. Cabaniss. Verification and validation of agent-based and equation-based simulations: A comparison. In L. Yilmaz, editor, Proceedings of the2006 Agent-Directed Simulation Symposium. The Society for Modeling andSimulation International, April 2006.

82. M. G. Kidwell and D. Lisch. Transposable elements as sources of variation inanimals and plants. Proceedings of the National Academy of Sciences USA,94:7704–7711, July 1997.

83. E. F. Kirkness, B. J. Haas, W. Sun, H. R. Braig, M. A. Perotti, J. M.Clark, S. H. Lee, H. M. Robertson, R. C. Kennedy, E. Elhaik, D. Ger-lach, E. V. Kriventseva, C. G. Elsik, D. Graur, C. A. Hill, J. A. Veen-stra, B. Walenz, J. M. C. Tubo, J. M. C. Ribeiro, J. Rozas, J. S. Johnston,J. T. Reese, A. Popadic, M. Tojo, D. Raoult, D. L. Reed, Y. Tomoyasu,E. Krause, O. Mittapalli, V. M. Margam, H.-M. Li, J. M. Meyer, R. M.Johnson, J. Romero-Severson, J. P. VanZee, D. Alvarez-Ponce, F. G. Vieira,M. Aguad, S. Guirao-Rico, J. M. Anzola, K. S. Yoon, J. P. Strycharz, M. F.Unger, S. Christley, N. F. Lobo, M. J. Seufferheld, N. Wang, G. A. Dasch,C. J. Struchiner, G. Madey, L. I. Hannick, S. Bidwell, V. Joardar, E. Caler,R. Shao, S. C. Barker, S. Cameron, R. V. Bruggner, A. Regier, J. Johnson,L. Viswanathan, T. R. Utterback, G. G. Sutton, D. Lawson, R. M. Wa-terhouse, J. C. Venter, R. L. Strausberg, M. R. Berenbaum, F. H. Collins,E. M. Zdobnov, and B. R. Pittendrigh. Genome sequences of the humanbody louse and its primary endosymbiont provide insights into the perma-nent parasitic lifestyle. Proceedings of the National Academy of Sciences,107(27):12168–12173, July 2010.

84. O. Kohany, A. J. Gentles, L. Hankus, and J. Jurka. Annotation, submis-sion and screening of repetitive elements in Repbase: RepbaseSubmitter andCensor. BMC Bioinformatics, 7(474), 2006.

85. K. K. Kojima and H. Fujiwara. Evolution of Target Specificity in R1 CladeNon-LTR Retrotransposons. Molecular Biology and Evolution, 20(3):351–361, 2003.

232

86. K. K. Kojima and H. Fujiwara. Cross-Genome Screening of Novel Sequence-Specific Non-LTR Retrotransposons: Various Multicopy RNA Genes andMicrosatellites Are Selected as Targets. Molecular Biology and Evolution,21(2):201–217, 2004.

87. I. Korf. Gene finding in novel genomes. BMC Bioinformatics, 5(59), 2004.

88. K. E. Lane, R. C. Kennedy, L. A. Miller, G. Madey, H. Hollocher, andA. Fuentes. Exploring the use of agent-based models in understanding pat-terns of pathogen transmission. In preparation.

89. M. Larkin, G. Blackshields, N. Brown, R. Chenna, P. McGettigan,H. McWilliam, F. Valentin, I. Wallace, A. Wilm, R. Lopez, J. Thompson,T. Gibson, and D. Higgins. Clustal w and clustal x version 2.0. Bioinfor-matics, 23(21):2947–2948, November 2007.

90. A. M. Law and W. D. Kelton. Simulation Modeling and Analysis. McGraw-Hill, Boston, MA, third edition, 2000.

91. D. Lawson, P. Arensburger, P. Atikinson, N. J. Besansky, R. V. Bruggner,R. Butler, K. S. Campbell, G. K. Christophides, S. Christley, E. Dialynas,D. Emmert, M. Hammond, C. A. Hill, R. C. Kennedy, N. F. Lobo, R. M.MacCallum, G. Madey, K. Megy, S. Redmond, S. Russo, D. W. Severson,E. O. Stinson, P. Topalis, E. M. Zdobnov, E. Birney, W. M. Gelbart, F. C.Kafatos, C. Louis, and F. H. Collins. VectorBase: a home for invertebratevectors of human pathogens. Nucleic Acids Research, 35:D503–D505, 2007.

92. D. Lawson, P. Arensburger, P. Atkinson, N. J. Besansky, R. V. Bruggner,R. Butler, K. S. Campbell, G. K. Christophides, S. Christley, E. Dialynas,M. Hammond, C. A. Hill, N. Konopinski, N. F. Lobo, R. M. MacCallum,G. Madey, K. Megy, J. Meyer, S. Redmond, D. W. Severson, E. O. Stin-son, P. Topalis, E. Birney, W. M. Gelbart, F. C. Kafatos, C. Louis, andF. H. Collins. VectorBase: a data resource for invertebrate vector genomics.Nucleic Acids Research, 37:D583–587, 2009.

93. E. Lerat. Identifying repeats and transposable elements in sequencedgenomes: how to find your way through the dense forest of programs. Hered-ity, 104:520–533, 2010.

94. S. Lewis, S. Searle, N. Harris, M. Gibson, V. Iyer, J. Richter, C. Wiel,L. Bayraktaroglu, E. Birney, M. Crosby, J. Kaminker, B. Matthews,S. Prochnik, C. Smith, J. Tupy, G. Rubin, S. Misra, C. Mungall, andM. Clamp. Apollo: a sequence annotation editor. Genome Biology, 3(12),2002.

233

95. N. Lobo, A. Hua-Van, X. Li, B. Nolen, and J. M.J. Fraser. Germ linetransformation of the yellow fever mosquito, Aedes aegypti, mediated bytranspositional insertion of a piggyBac vector. Insect Molecular Biology,11(2):133–139, April 2002.

96. E. R. Mardis. Next-Generation DNA Sequencing Methods. Annual Reviewof Genomics and Human Genetics, 9:387–402, September 2008.

97. E. M. McCarthy and J. F. McDonald. LTR STRUC: a novel search andidentification program for LTR retrotransposons. Bioinformatics, 19(3):362–367, February 2003.

98. B. McClintock. The discovery and characterization of transposable elements:The collected papers of Barbara McClintock. Garland Publishing, Inc., NewYork, NY, 1987.

99. MediaWiki. http://www.mediawiki.org/wiki/MediaWiki.

100. P. Medstrand, L. N. van de Lagemaat, C. A. Dunn, J. R. Landry, D. Sven-back, and D. L. Mager. Impact of transposable elements on the evolutionof mammalian gene regulation. Cytogenetic and Genome Research, 110:342–352, 2005.

101. J. R. Miller, S. Koren, and G. Sutton. Assembly algorithms for next-generation sequencing data. Genomics, 95(6):315 – 327, 2010.

102. D. M. Mount. Bioinformatics: Sequence and Genome Analysis. Cold SpringHarbor Laboratory Press, Cold Spring Harbor, NY, second edition, 2004.

103. T. Naylor, J. Balintfy, D. Burdick, and K. Chu. Computer Simulation Tech-niques. John Wiley, New York, NY, 1966.

104. NCBI: National Center for Biotechnology Information.http://www.ncbi.nih.gov.

105. V. Nene, J. R. Wortman, D. Lawson, B. Haas, C. Kodira, Z. J. Tu,B. Loftus, Z. Xi, K. Megy, M. Grabherr, Q. Ren, E. M. Zdobnov, N. F.Lobo, K. S. Campbell, S. E. Brown, M. F. Bonaldo, J. Zhu, S. P. Sinkins,D. G. Hogenkamp, P. Amedeo, P. Arensburger, P. W. Atkinson, S. Bidwell,J. Biedler, E. Birney, R. V. Bruggner, J. Costas, M. R. Coy, J. Crabtree,M. Crawford, B. deBruyn, D. DeCaprio, K. Eiglmeier, E. Eisenstadt, H. El-Dorry, W. M. Gelbart, S. L. Gomes, M. Hammond, L. I. Hannick, J. R.Hogan, M. H. Holmes, D. Jaffe, J. S. Johnston, R. C. Kennedy, H. Koo,S. Kravitz, E. V. Kriventseva, D. Kulp, K. LaButti, E. Lee, S. Li, D. D.Lovin, C. Mao, E. Mauceli, C. F. M. Menck, J. R. Miller, P. Montgomery,

234

A. Mori, A. L. Nascimento, H. F. Naveira, C. Nusbaum, S. O’Leary, J. Orvis,M. Pertea, H. Quesneville, K. R. Reidenbach, Y.-H. Rogers, C. W. Roth,J. R. Schneider, M. Schatz, M. Shumway, M. Stanke, E. O. Stinson, J. M. C.Tubio, J. P. VanZee, S. Verjovski-Almeida, D. Werner, O. White, S. Wyder,Q. Zeng, Q. Zhao, Y. Zhao, C. A. Hill, A. S. Raikhel, M. B. Soares, D. L.Knudson, N. H. Lee, J. Galagan, S. L. Salzberg, I. T. Paulsen, G. Di-mopoulos, F. H. Collins, B. Birren, C. M. Fraser-Liggett, and D. W. Sever-son. Genome sequence of Aedes aegypti, a major arbovirus vector. Science,316(5832):1718–1723, June 2007.

106. M. Oliveira de Carvalho, J. Silva, and E. Loreto. Analyses of P -like trans-posable element sequences from the genome of Anopheles gambiae. InsectMolecular Biology, 13(1):55–63, 2006.

107. OpenMap. http://openmap.bbn.com.

108. phpExcelReader. http://sourceforge.net/projects/phpexcelreader.

109. B. R. Pittendrigh, J. M. Clark, J. S. Johnston, S. H. Lee, J. Romero-Severson,and G. A. Dasch. Sequencing of a new target genome: the Pediculus humanushumanus (Phthiraptera: Pediculidae) genome project. Journal of MedicalEntomology, 43(6):1101–1111, November 2006.

110. R. H. Plasterk, Z. Izsvak, and Z. Ivics. Resident aliens: the tc1/marinersuperfamily of transposable elements. Trends in Genetics, 15(8), August1999.

111. M. Pop, S. L. Salzberg, and M. Shumway. Genome sequence assem-bly:algorithms and issues. Computer, 35:47–54, 2002.

112. PostgreSQL. http://www.postgresql.org/.

113. K. D. Pruitt, T. Tatusova, W. Klimke, and D. R. Maglott. NCBI Refer-ence Sequences: current status, policy and new initiatives. Nucleic AcidsResearch, 37:D32–36, 2009.

114. QGIS: Quantum GIS. http://www.qgis.org.

115. H. Quesneville, C. M. Bergman, O. Andrieu, D. Autard, D. Nouaud, M. Ash-burner, and D. Anxolabehere. Combined evidence annotation of transposableelements in genome sequences. PLoS Computational Biology, 1, 2005.

116. H. Quesneville, D. Nouaud, and D. Anxolabehere. Detection of new trans-posable element families in Drosophila melanogaster and Anopheles gambiaegenomes. Journal of Molecular Evolution, 57, 2003.

235

117. H. Quesneville, D. Nouaud, and D. Anxolabehere. P elements and mite rela-tives in the whole genome sequence of Anopheles gambiae. BMC Genomics,7(214), 2006.

118. Repast. http://sourceforge.repast.net.

119. Repbase. http://www.girinst.org/repbase/index.html.

120. L. Roberts and John Janovy, Jr. Gerald D. Schmidt & Larry S. Roberts’Foundations of Parasitology. McGraw-Hill, eighth edition, 2009.

121. L. M. Rocha. From artificial life to semiotic agentmodels: Review and research directions. available athttp://informatics.indiana.edu/rocha/ps/agent review.pdf, Los AlamosNational Laboratory Complex Systems Modeling Team, 1999.

122. G. Rubin and A. Spradling. Genetic transformation of drosophila with trans-posable element vectors. Science, 218(4570):348–353, October 1982.

123. S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Pear-son Education, Inc., Upper Saddle River, NJ, 2003.

124. S. Saha, S. Bridges, and Z. V. Magbanua. Computational approaches andtools used in identification of dispersed repetitive dna sequences. TropicalPlant Biology, 1:85–96, 2008.

125. F. Sanger, S. Nicklen, and A. Coulson. Dna sequencing with chain-terminating inhibitors. In Proceedings of the National Academy of Sciences ofthe United States of America, volume 74, pages 5463–5467, December 1977.

126. A. Sarkar, R. Sengupta, J. Krzywinski, X. Wang, C. Roth, and F. Collins.P elements are found in the genomes of nematoceran insects of the genusAnopheles. Insect Biochemistry and Molecular Biology, 33(4):381–387, April2003.

127. A. Sarkar, C. Sim, Y. Hong, J. Hogan, M. Fraser, H. Robertson, andF. Collins. Molecular evolutionary analysis of the widespread piggyBac trans-poson family and related “domesticated” sequences. Molecular Genetics andGenomics, 270(2):173–180, 2003.

128. R. E. Shannon. Introduction to the art and science of simulation. In Pro-ceedings of the 1998 Winter Simulation Conference, pages 7–14, 1998.

129. J. A. Shapiro. The discovery and significance of mobile genetic elements. InD. J. Sherratt, editor, Mobile Genetic Elements. Oxford University Press,1995.

236

130. R. K. Slotkin and R. Martienssen. Transposable elements and the epigeneticregulation of the genome. Nature Reviews Genetics, 8(4):272–285, April 2007.

131. SOAP. http://www.w3.org/TR/soap/.

132. M. W. Southern. An Assessment of Potential Habitat Corridors and Land-scape Ecology for Long-Tailed Macaques (Macaca fascicularis) on Bali, In-donesia. Master’s Thesis, Central Washington University, June 2002.

133. J. E. Stajich, D. Block, K. Boulez, S. E. Brenner, S. A. Chervitz, C. Dagdigian, G. Fuellen, J. G. Gilbert, I. Korf, H. Lapp, H. Lehvaslaiho, C. Matsalla, C. J. Mungall, B. I. Osborne, M. R. Pocock, P. Schattner, M. Senger, L. D. Stein, E. Stupka, M. D. Wilkinson, and E. Birney. The Bioperl Toolkit: Perl Modules for the Life Sciences. 12(10):1611–1618, October 2002.

134. TEfam. http://tefam.biochem.vt.edu/tefam/.

135. L. Temime, Y. Pannet, L. Kardas, L. Opatowski, D. Guillemot, and P. Y. Boelle. NOSOSIM: an agent-based model of pathogen circulation in a hospital ward. In L. Yilmaz, editor, Proceedings of the 2009 Agent-Directed Simulation Symposium. The Society for Modeling and Simulation International, March 2009.

136. S. Tempel, M. Jurka, and J. Jurka. VisualRepbase: an interface for the study of occurrences of transposable element families. BMC Bioinformatics, 8(345), 2008.

137. TESeeker. http://www.nd.edu/~teseeker.

138. Z. Tu and C. Coates. Mosquito transposable elements. Insect Biochemistry and Molecular Biology, 34:631–644, 2004.

139. Z. Tu and S. Li. Mobile genetic elements of malaria vectors and other mosquitoes. In P. J. Brindley, editor, Mobile Genetic Elements in Metazoan Parasites. Landes Bioscience, September 2008.

140. J. M. C. Tubío, H. Naveira, and J. Costas. Structural and Evolutionary Analyses of the Ty3/gypsy Group of LTR Retrotransposons in the Genome of Anopheles gambiae. Molecular Biology and Evolution, 22(1):29–39, 2005.

141. UniProt Consortium. The Universal Protein Resource (UniProt) in 2010. Nucleic Acids Research, 38:D142–148, 2010.

142. University of Notre Dame Center for Research Computing. http://crc.nd.edu.

143. VectorBase: A Bioinformatics Resource Center for Invertebrate Vectors of Human Pathogens. http://www.vectorbase.org.

144. VirtualBox. http://www.virtualbox.org.

145. J. L. Weber and E. W. Myers. Human whole-genome shotgun sequencing. Genome Research, 7:401–409, 1997.

146. J. D. Westervelt. Geographic information systems and agent-based modeling. In H. R. Gimblett, editor, Integrating Geographic Information Systems and Agent-based Modeling Techniques for Simulating Social and Ecological Processes. Oxford University Press, 2002.

147. B. P. Wheatley. The Sacred Monkeys of Bali. Waveland Press, 1999.

148. WikiPoson. http://www.bioinformatics.org/wikiposon/doku.php.

149. X. Xiang, R. Kennedy, G. Madey, and S. Cabaniss. Verification and validation of agent-based scientific simulation models. In L. Yilmaz, editor, Proceedings of the 2005 Agent-Directed Simulation Symposium, volume 37, pages 47–55. The Society for Modeling and Simulation International, April 2005.
