IDENTIFICATION AND ANNOTATION OF TRANSPOSABLE ELEMENTS
AND
AGENT- AND GIS-BASED MODELING OF PATHOGEN TRANSMISSION
A Dissertation
Submitted to the Graduate School
of the University of Notre Dame
in Partial Fulfillment of the Requirements
for the Degree of
Doctor of Philosophy
by
Ryan C. Kennedy,
Gregory R. Madey, Co-Director
Frank H. Collins, Co-Director
Graduate Program in Computer Science and Engineering
Notre Dame, Indiana
January 2011
Abstract
by
Ryan C. Kennedy
The work presented here has two primary components: 1) the identification
and annotation of transposable elements (TEs) and 2) a spatially-aware agent-
based model of pathogen transmission.
Recent advances in sequencing technology have resulted in an explosion of
genomic data. The identification of TEs is an important part of every genome
project. This dissertation presents an automated homology-based approach to
identify TEs, implemented as TESeeker, that produces consensus TEs up to 98%
identical to manually annotated sequences. It also offers a design and implementa-
tion plan to allow for the inclusion of TEs on VectorBase’s community annotation
pipeline.
Agent-based modeling is well suited to modeling natural phenomena. Coupling
geographical information system (GIS) data with agent-based modeling further
increases the utility of such simulations. This dissertation presents a GIS aware
agent-based model of pathogen transmission as well as methods and recommenda-
tions for incorporating GIS data into a simulation. The model, named LiNK, was
specifically developed to study the impact of landscape on pathogen transmission.
DEDICATION
To my family and friends
CONTENTS
FIGURES
TABLES
ACKNOWLEDGMENTS
CHAPTER 1: INTRODUCTION
  1.1 Overview
  1.2 Identification and Annotation of Transposable Elements
  1.3 Agent- and GIS-based Modeling of Pathogen Transmission
  1.4 Goals
  1.5 Organization
  1.6 Contributions
CHAPTER 2: TRANSPOSABLE ELEMENT AND BIOINFORMATICS BACKGROUND
  2.1 Introduction
  2.2 Molecular Biology
  2.3 Bioinformatics
    2.3.1 VectorBase
  2.4 Transposable Elements
  2.5 Transposable Element Identification
    2.5.1 De novo Discovery
    2.5.2 Structure-based Discovery
    2.5.3 Comparative Genomic Methods
    2.5.4 Homology-based Discovery
  2.6 Annotation
    2.6.1 DAS
    2.6.2 Ensembl
      2.6.2.1 Ensembl Genebuild
    2.6.3 Chado
    2.6.4 Hibernate
    2.6.5 VectorBase Community Annotation Pipeline
      2.6.5.1 Planned Updates to the VectorBase Community Annotation Pipeline
  2.7 Transposable Element Annotation
    2.7.1 VisualRepbase
  2.8 Summary
CHAPTER 3: AUTOMATED HOMOLOGY-BASED APPROACH FOR THE IDENTIFICATION OF TRANSPOSABLE ELEMENTS
  3.1 Introduction
  3.2 Approach for Identification of Transposable Elements
    3.2.1 Dependencies
      3.2.1.1 Library of Representative Sequences
      3.2.1.2 BLAST
      3.2.1.3 DNASTAR SeqMan II
      3.2.1.4 CAP3
      3.2.1.5 ClustalW2
      3.2.1.6 BioPerl
    3.2.2 General Description of Approach
      3.2.2.1 Identify Coding Region
      3.2.2.2 Encompass Complete Transposable Element
      3.2.2.3 Generate Consensus
      3.2.2.4 Identify Complete Transposable Element
    3.2.3 Implementation
    3.2.4 Advantages
    3.2.5 Limitations
  3.3 Results
    3.3.1 Pediculus humanus humanus
      3.3.1.1 Class I Elements
      3.3.1.2 Class II Elements
    3.3.2 Culex quinquefasciatus
    3.3.3 Anopheles gambiae PEST Genome
    3.3.4 Other Organisms
  3.4 Conclusion
CHAPTER 4: DESIGN AND PROOF-OF-CONCEPT PLAN FOR COMMUNITY ANNOTATION OF TRANSPOSABLE ELEMENTS ON VECTORBASE
  4.1 Introduction
  4.2 Transposable Elements and the VectorBase Community Annotation Pipeline
    4.2.1 Similarities to the VectorBase Community Annotation Pipeline
    4.2.2 Differences from the VectorBase Community Annotation Pipeline
    4.2.3 Transposable Element Representation in Chado
    4.2.4 Proof-of-Concept
  4.3 Design and Implementation Plan
  4.4 Conclusion
CHAPTER 5: SIMULATION AND MODELING BACKGROUND
  5.1 Introduction
  5.2 Simulation and Modeling
    5.2.1 Advantages and Disadvantages
    5.2.2 Building a Simulation Model
    5.2.3 Simulation Model Types
    5.2.4 Agent-based Modeling
    5.2.5 Equation-based Modeling
  5.3 Geographic Information Systems
    5.3.1 Raster Data
    5.3.2 Vector Data
  5.4 Integrating Geographic Information System Data into Agent-based Modeling
  5.5 Summary
CHAPTER 6: A GIS AWARE AGENT-BASED MODEL OF PATHOGEN TRANSMISSION
  6.1 Introduction
  6.2 LiNK Simulation Model
    6.2.1 Model Background
    6.2.2 Conceptual Model
    6.2.3 ODD Protocol Description of LiNK
      6.2.3.1 Purpose
      6.2.3.2 State Variables and Scales
      6.2.3.3 Process Overview and Scheduling
      6.2.3.4 Design Concepts
      6.2.3.5 Initialization
      6.2.3.6 Input
      6.2.3.7 Submodels
    6.2.4 Implementation
    6.2.5 Verification and Validation
  6.3 Geographic Information System Data and Agent-Based Modeling
    6.3.1 Approximating Geographic Information System Data in Simulations
    6.3.2 Raster Queries
    6.3.3 Spatial Queries
      6.3.3.1 Simplified Spatial Queries
    6.3.4 Precalculated Query Matrix
    6.3.5 GIS Aware Agents
      6.3.5.1 Movement
  6.4 Results
    6.4.1 Performance
  6.5 Analyzing Massive Amounts of Simulation Data
    6.5.1 LiNKStat
  6.6 Conclusion
CHAPTER 7: CONCLUSION
  7.1 Overview
  7.2 Automated Homology-based Approach for the Identification of Transposable Elements
    7.2.1 Future Work
  7.3 Community Annotation of Transposable Elements on VectorBase
    7.3.1 Future Work
  7.4 GIS Aware Agent-based Model of Pathogen Transmission
    7.4.1 Future Work
  7.5 Contributions
APPENDIX A: AUTOMATED APPROACH WALKTHROUGH
  A.1 Representative Amino Acid Coding Regions
  A.2 Identify Coding Region
    A.2.1 tblastn Search
    A.2.2 Extract Sequences from the Genome
    A.2.3 CAP3 Assembly
      A.2.3.1 CAP3 Contigs
      A.2.3.2 CAP3 Contigs Quality Scores
  A.3 Encompass Complete Transposable Element
  A.4 Generate Consensus
  A.5 Identify Complete Transposable Element
    A.5.1 CAP3 Assembly
    A.5.2 CAP3 Contigs Quality File
    A.5.3 Trimmed CAP3 Contigs
APPENDIX B: TESeeker WEBSITE
APPENDIX C: TESeeker USER MANUAL
  C.1 Installation
  C.2 Usage
  C.3 Example Search
  C.4 Additional Tools
  C.5 Technology
APPENDIX D: SELECTED AUTOMATED APPROACH SOURCE CODE
  D.1 Combine BLAST Hits
  D.2 Extract Sequences
  D.3 Trim CAP3 Contigs
  D.4 Generate Consensus
APPENDIX E: TRANSPOSABLE ELEMENTS IDENTIFIED
  E.1 P. humanus humanus
    E.1.1 Non-LTRs
      E.1.1.1 Hope-like SART
      E.1.1.2 Dong-like R4
    E.1.2 LTRs
      E.1.2.1 Mdg1 ty3/gypsy
    E.1.3 Transposons
      E.1.3.1 mariner
      E.1.3.2 MITE1
      E.1.3.3 MITE2
  E.2 C. quinquefasciatus
    E.2.1 Non-LTRs
      E.2.1.1 CR1
      E.2.1.2 I
      E.2.1.3 Jockey
      E.2.1.4 L1
      E.2.1.5 L2
      E.2.1.6 LOA
      E.2.1.7 Loner
      E.2.1.8 Outcast
      E.2.1.9 R1
      E.2.1.10 RTE
      E.2.1.11 Unclassified LINE
  E.3 D. melanogaster
    E.3.1 Transposons
      E.3.1.1 mariner
APPENDIX F: SCRIPT USED TO IDENTIFY MITES
REFERENCES
FIGURES
1.1 Dissertation Components
2.1 Central Dogma of Molecular Biology
2.2 Genetic Code
2.3 Global and Local Sequence Alignment
2.4 Typical mariner Class II Transposon Structure
2.5 Transposable Element (TE) Classification Scheme and Structures
2.6 Ensembl Location-based View on VectorBase
2.7 Ensembl Gene-based View on VectorBase
2.8 Ensembl Transcript-based View on VectorBase
2.9 VectorBase Gene Submission Form
2.10 VectorBase Community Annotation Pipeline Data Flow
2.11 VisualRepbase Interface
3.1 Approach Schematic
3.2 Methods of Combination
3.3 P. humanus humanus mariner element
3.4 C. quinquefasciatus Jockey element
4.1 Client-side TE Submission Process
4.2 TE Submission Form
4.3 Entity-Relationship Diagram of Selected Chado Tables
4.4 TE Start and Submit Page
4.5 TE Details Page
4.6 TE Structure Page
4.7 Proof-of-Concept Configuration
5.1 Raster vs. Vector Data
6.1 Female Macaque and Infant
6.2 Uluwatu Temple Site
6.3 Life Cycle Transition Diagram
6.4 LiNK Control Panel
6.5 LiNK Display
6.6 Temple Site Display
6.7 Temporal Relationship of Pathogen Parameters and Related Events
6.8 Pathogen Transition Diagram
6.9 Verification and Validation Techniques for Agent-based Models
6.10 Spatial Data Approximation
6.11 Macaque Movement
6.12 Comparison of Total Number of Infections grouped by Landscape and Population Size at Varying Sites
6.13 Pathogen Spread to Varying Temple Sites
6.14 Performance Comparison of Varying Query Methods
6.15 Scalability with Respect to Initial Number of Dispersed Macaques and Amount of GIS data
6.16 LiNKStat
6.17 LiNKStat Pathogen Transmission Graph
B.1 TESeeker Website
C.1 TESeeker Desktop
C.2 TESeeker Genomes Folder
C.3 TESeeker TELibrary
C.4 TESeeker Documentation
C.5 TESeeker Web Interface
C.6 TESeeker BLAST Interface
C.7 TESeeker Extract Interface
C.8 TESeeker Default Parameters
C.9 Web Interface File Browser
C.10 ClustalX Alignment with Annotated Element
TABLES
3.1 Pediculus humanus humanus NON-LONG TERMINAL REPEAT (NON-LTR) RESULTS
3.2 Culex quinquefasciatus RESULTS
6.1 MOVEMENT VALUES FOR DISPERSING MACAQUES
6.2 STATE VARIABLES
6.3 PERFORMANCE COMPARISON OF GUI LOAD TIME
6.4 PERFORMANCE COMPARISON OF TIME STEPS/S
6.5 SCALABILITY COMPARISON OF TIME STEPS/S
6.6 ADVANTAGES AND DISADVANTAGES (1- POOR; 5- EXCELLENT)
ACKNOWLEDGMENTS
I would like to thank my advisors, Dr. Frank Collins and Dr. Greg Madey,
for their direction, patience, and encouragement. I would especially like to thank
Dr. Greg Madey, with whom I have collaborated since I was an undergraduate,
for his constant support and for providing me the opportunity to pursue this
degree.
Thank you also to my committee members: Dr. Scott Emrich for his bioinformatics
guidance, Dr. Agustín Fuentes for his direction on the LiNK project, and
Dr. Tijana Milenkovic for her valuable contributions to this dissertation.
Additionally, I am grateful to Dr. Nora Besansky for serving as my outside chair
and to Dr. Hope Hollocher for her direction on the LiNK simulation model and
for serving as the outside chair for my proposal. Thank you to our administrative
assistant, Ms. Joyce Yeats, for her invaluable assistance over the years.
Special thanks to Dr. Scott Christley for his unwavering support, insight, and
contributions. Thank you also to Maria Unger and Jenica Abrudan for sharing
their biological expertise while also contributing to much of this work. Thank you
to Kelly Lane for her collaboration on the LiNK simulation model.
Finally, I would like to thank my family and friends, particularly my parents
Bill and Martha, and Carrianne Scheib. This work would not have been possible
without them.
This research was supported in part by NIAID/NIH contracts HHSN272200900039C
and HHSN266200400039C for “VectorBase: A Bioinformatics Resource
Center for Invertebrate Vectors of Human Pathogens” [143] and NSF grants
BCS#0639787 and BCS#0629787. Selected bioinformatics simulations were performed
on the Notre Dame Biocomplexity Cluster, supported in part by NSF MRI
Grant No. DBI-0420980. Additional computational resources were provided in part by
the Notre Dame Center for Research Computing [142].
CHAPTER 1
INTRODUCTION
1.1 Overview
The work presented in this dissertation consists of two related parts. The first
part, Chapters 2, 3, and 4, concerns the discovery and annotation of transposable
elements (TEs). The second part, Chapters 5 and 6, involves the development
of a simulation model that utilizes agent-based modeling (ABM) and geographic
information system (GIS) data to model pathogen spread. Both parts fall
within computational biology, as shown in Figure 1.1. The following sections
briefly introduce each part, including our motivations.
1.2 Identification and Annotation of Transposable Elements
Transposable elements (TEs) are a type of repetitive sequence that has been
found in nearly all eukaryotic genomes. They have the ability to move about
and replicate within a genome and are believed to play a major role in genome
evolution [82, 100, 130].
Largely because of their diversity and mobility, TEs are difficult to identify.
We present an automated homology-based approach for the identification of TEs.
This approach utilizes a comprehensive library of representative sequences as the
basis for our search. We make heavy use of common bioinformatics technologies,
namely BLAST, ClustalW2, CAP3, and BioPerl. Our approach, implemented as
TESeeker, is designed to be easier to use than existing approaches and to produce
high-quality consensus TEs, allowing for quicker genome annotation.

[Figure 1.1 diagram labels: Computational Biology; Bioinformatics; Agent-based Modeling; TE Identification; TE Annotation; Model of Pathogen Transmission; Global Health]

Figure 1.1. Dissertation Components. The first part of this dissertation lies within the Bioinformatics realm, while the second is categorized as Agent-based modeling. Each part of this dissertation shares global health implications.

We also present a design and implementation plan for the inclusion of TEs in
VectorBase’s community annotation pipeline for genes. Existing TE annotation
websites, such as TEfam [134] and Repbase [119], lack the detailed data that
is available for genes. Extending VectorBase’s community annotation pipeline
to include TEs would fill this gap and give researchers access to resources and
detailed information not available elsewhere.
1.3 Agent- and GIS-based Modeling of Pathogen Transmission
There are numerous advantages to using a simulation for scientific study [10],
including the ability to model and predict the behavior of a real-world system
without altering the actual system. A simulation that can utilize real-world data,
such as geographic data, is potentially even more valuable. We present an
ABM that utilizes GIS data to simulate pathogen transmission. In particular, we
are interested in the effect landscape has on pathogen transmission. Our simulation,
named LiNK, has been developed to model pathogen transmission among
macaque monkeys on the island of Bali, Indonesia. GIS data has been incorporated
into LiNK to allow for spatially aware macaques. We are unaware of any
other epidemiological studies that couple ABM with GIS data, or of any work
studying efficient ways for mobile agents to interact with their environment. As such,
we explore different means of including GIS data in an agent-based simulation and
offer suggestions for a variety of applications.
1.4 Goals
This dissertation makes the following contributions:
• Development and implementation of an automated approach to detect transposable elements.
• Design and implementation plan for the incorporation of TEs into the VectorBase community annotation pipeline.
• Development and implementation of a GIS aware agent-based model of pathogen transmission.
1.5 Organization
The remainder of this dissertation is organized as follows. Chapter 2 intro-
duces biological concepts necessary to the understanding of the first part of this
dissertation. Chapter 3 describes our approach for the automatic identification
of TEs, as well as the implementation of our approach, TESeeker. This chap-
ter also presents the results of our approach, applied to a number of genomes,
including comparisons to published data. Chapter 4 describes the community an-
notation pipeline for genes on VectorBase and how to extend it to allow for TEs.
We present a design and implementation plan and describe a preliminary
implementation. Chapter 5 introduces agent-based simulations and GIS.
Chapter 6 thoroughly describes our model of pathogen transmission,
LiNK. Chapter 7 summarizes the contributions of this dissertation and proposes
future work. We conclude this document with supplementary information:
Appendix A presents a walkthrough of the inner workings of TESeeker, Appendix B
presents the TESeeker website, Appendix C presents the TESeeker user manual,
Appendix D presents selected source code for TESeeker, Appendix E presents
selected transposable elements, and Appendix F presents our script to detect
MITEs.
1.6 Contributions
Applications of our approach to identify TEs have been included in the Pedicu-
lus humanus humanus and Culex quinquefasciatus genome papers [5, 83]. In par-
ticular, we authored all TE-related sections of the P. humanus humanus paper
and contributed to the non-LTR section of the TE analysis in the C. quinquefas-
ciatus paper. A paper describing our approach and its implementation is under
review [80]. LiNK has been described in detail in an invited journal manuscript
[76] and in conference proceedings [79]. A manuscript detailing initial biological
implications of the model is in preparation [88].
CHAPTER 2
TRANSPOSABLE ELEMENT AND BIOINFORMATICS BACKGROUND¹
2.1 Introduction
Much of this dissertation relies on a basic understanding of molecular biology in
addition to a familiarity with transposable elements and the field of bioinformatics.
This chapter introduces these biological and bioinformatics concepts.
2.2 Molecular Biology
Cells form the basic components of life and serve varying purposes, but most
have the ability to replicate. Each cell contains enough genetic information
and machinery to make a complete copy of itself, a process called
replication [69]. Jones and Pevzner liken this to a car factory that gathers the raw
materials, prepares the materials, and assembles a copy of itself, all while making
cars at the same time [69]. Because cells are the basic reaction vessels in the body,
understanding their inner workings leads to a greater overall understanding
of how the body functions, which is valuable to scientists.
Although cells come in a myriad of shapes and sizes, each has three common
components: DNA, RNA, and protein molecules. DNA, or deoxyribonucleic acid,
is often described as the building block of life, as it contains the genetic material
governing how a cell operates. DNA, composed of the nucleotides adenine,
guanine, cytosine, and thymine, is a double-stranded, helical molecule. RNA, or
ribonucleic acid, is instead composed of only a single strand of the nucleotides
adenine, guanine, cytosine, and uracil. RNA is used to carry copies of pieces of
the DNA strand to other locations in the cell. Proteins are molecules made up of
amino acids, which we describe later. Many proteins are enzymes, which can be
thought of as the laborers of the cell because they perform functions varying from
assembling strands of nucleotides to signaling other cells.

¹ Portions of this chapter were previously reported in Kennedy [75].

The double-stranded helical structure of DNA lends itself to replication. To
replicate, the two strands of a chromosome's DNA “unzip” from one another, and
then an enzyme called DNA polymerase, which is prevalent throughout the cell,
attaches itself to one of the strands. It then moves along the strand of DNA,
attracting complementary nucleotides. These nucleotides form hydrogen bonds
with the bases of the template strand, and the process continues until the
chromosome is copied. The same process happens concurrently on the other
original strand, completing replication.
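
The base-pairing rule behind replication can be illustrated with a short Python sketch. This is a simplified illustration and not part of TESeeker: the function names `complement_strand` and `replicate` are hypothetical, and the sketch pairs bases position by position while ignoring strand direction and the enzymes involved.

```python
# Watson-Crick base pairing: each base on the template attracts its partner.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement_strand(template):
    """Return the strand that pairs base-by-base with the template."""
    return "".join(COMPLEMENT[base] for base in template.upper())


def replicate(strand):
    """Simplified replication of a duplex described by one of its strands.

    The duplex "unzips" into the given strand and its partner; each one
    then serves as the template for a newly built complementary strand.
    Returns two (original strand, new strand) pairs.
    """
    partner = complement_strand(strand)
    daughter_one = (strand, complement_strand(strand))
    daughter_two = (partner, complement_strand(partner))
    return daughter_one, daughter_two


if __name__ == "__main__":
    for original, new in replicate("ATGCCGTA"):
        print(original, new)
```

Each resulting pair keeps one original strand and gains one new strand, which mirrors the description above of both strands being copied concurrently.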
It is also important to understand the Central Dogma of Molecular Biology,
which outlines the general process by which proteins are generated from DNA,
shown in Figure 2.1. First, DNA “unzips” and then an enzyme called RNA
polymerase pairs complementary nucleotides with one strand of DNA, starting at the
promoter region. This “unzipping” continues until the RNA polymerase reaches
the terminator region, at which point the RNA strand breaks off and the DNA
strands “zip” back together, completing transcription. Next, ribosomes translate
the RNA to produce a polypeptide chain. The RNA strand produced by
transcription is more specifically called messenger RNA, or mRNA. Another type
of RNA, transfer RNA, or tRNA, continually floats around within the cell.
Ribosomes move along the mRNA strand, reading groups of three nucleotides,
called codons, at a time. Each codon encodes an amino acid or a stop signal. There
are sixty-four possible combinations of nucleotides for a codon, yet there are only
twenty different amino acids, as multiple combinations can code for the same
amino acid. Partially for this reason, amino acids are the preferred components
of our transposable element library, described in detail in Chapter 3. Figure 2.2
shows which codons encode which amino acids. As the ribosomes move along the
mRNA, they decipher the codons, and tRNA molecules with the matching
anticodons deliver the corresponding amino acids. These amino acids are
assembled into a chain, called a polypeptide chain, which is folded to make up a
protein. This process proceeds from the start codon and continues until a stop
codon is encountered.

[Figure 2.1 diagram: DNA -> (transcription) -> RNA -> (translation) -> protein]

Figure 2.1. Central Dogma of Molecular Biology. This is the process by which DNA undergoes transcription to produce RNA, which in turn undergoes translation to produce protein.
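
The transcription and translation steps described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the dissertation's code: the names `transcribe` and `translate` are hypothetical, the template strand is read position by position without regard to strand direction, and the codon table is the standard genetic code written with one-letter amino acid codes ("*" marks a stop codon).

```python
from itertools import product

BASES = "UCAG"
# One-letter amino acid codes for the standard genetic code, listed with the
# first, second, and third codon positions each cycling through U, C, A, G.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"  # first base U
    "LLLLPPPPHHQQRRRR"  # first base C
    "IIIMTTTTNNKKSSRR"  # first base A
    "VVVVAAAADDEEGGGG"  # first base G
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(template_dna):
    """Build the mRNA whose bases pair with the DNA template strand."""
    pairing = {"A": "U", "T": "A", "G": "C", "C": "G"}
    return "".join(pairing[base] for base in template_dna.upper())


def translate(mrna):
    """Read codons from the first AUG until a stop codon is reached."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, so nothing is translated
    peptide = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        peptide.append(amino_acid)
    return "".join(peptide)


if __name__ == "__main__":
    mrna = transcribe("TACAAACCGATT")
    print(mrna, translate(mrna))
    # Sixty-four codons map onto twenty amino acids plus the stop signal.
    print(len(CODON_TABLE), len(set(AMINO_ACIDS) - {"*"}))
```

Because several codons map to the same amino acid, different nucleotide sequences can yield the same polypeptide, which is one way to see why searching with amino acid sequences tolerates more nucleotide-level divergence.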
2.3 Bioinformatics
Bioinformatics is the application of computer science techniques to solve
biological problems. Bioinformatics aims to develop a better understanding of the
function of genes through the use of advanced, yet easy-to-use and often
web-based, interfaces. Computer science plays a central role in bioinformatics because
the study and analysis of large amounts of genetic data, which is otherwise
time-consuming and prone to error, readily lends itself to the computer science
discipline.

Codons              Amino acid
UUU UUC             Phenylalanine
UUA UUG             Leucine
UCU UCC UCA UCG     Serine
UAU UAC             Tyrosine
UAA UAG             Stop
UGU UGC             Cysteine
UGA                 Stop
UGG                 Tryptophan
CUU CUC CUA CUG     Leucine
CCU CCC CCA CCG     Proline
CAU CAC             Histidine
CAA CAG             Glutamine
CGU CGC CGA CGG     Arginine
AUU AUC AUA         Isoleucine
AUG                 Methionine
ACU ACC ACA ACG     Threonine
AAU AAC             Asparagine
AAA AAG             Lysine
AGU AGC             Serine
AGA AGG             Arginine
GUU GUC GUA GUG     Valine
GCU GCC GCA GCG     Alanine
GAU GAC             Aspartic acid
GAA GAG             Glutamic acid
GGU GGC GGA GGG     Glycine

Figure 2.2. Genetic Code. A codon is represented by a set of three nucleotides. Each codon codes for an amino acid.

We next briefly describe several relevant bioinformatics research areas,
most of which are utilized in Chapters 3 and 4.
Genome Annotation Genome annotation refers to locating genes in a sequence
and then giving biological meaning to those regions. Genome annotation is
often classified into functional and structural annotation. Functional anno-
tation refers to deciphering the function of a gene, such as how a gene is
expressed. Structural annotation identifies characteristics of a gene, such as
where a coding region is located.
Sequence Alignment Sequence alignment is the comparison of two or more
sequences to one another. The goal of sequence alignment is to find similarities
between or among the sequences. There are three types of sequence alignment:
global, local, and semiglobal alignment. Global alignment involves finding
the best alignment of two sequences using every nucleotide or amino acid in
each sequence, while local alignment concentrates on aligning shorter
fragments in highly conserved regions. The Needleman-Wunsch algorithm is
typically used for global alignment, and the Smith-Waterman algorithm is
used for local alignment. Figure 2.3 shows example global and local
alignments for two sequences. Semiglobal alignment aims to align two sequences
such that only one sequence needs to be used in its entirety while only part
of the other sequence is aligned. Aligning more than two sequences together
is called multiple sequence alignment.
Input Sequences
    CAATCAGATCTCAT
    CAATGATCAT
Global Alignment
    CAATCAGATCTCAT
    |||| |    ||||
    CAATGA----TCAT
Local Alignment
    CAATCAGATC
    ||||  ||||
    CAAT--GATC
Figure 2.3. Global and Local Sequence Alignment. Here, we show examples of global and local alignment of the same two sequences.
Global alignment aims to match best over the entire sequence, which may result in the insertion of multiple gaps, as shown here.
Additionally, both sequences are found in their entirety. While global alignment produces the best alignment that utilizes all nucleotides in
each sequence, local alignment does not necessarily utilize all nucleotides and uses shorter segments. Local alignment produces a
different alignment because it preferentially aligns shorter fragments.
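The difference between global and local alignment can be sketched with textbook versions of the two algorithms named above. The following is a minimal illustration only. The scoring values (match +1, mismatch -1, linear gap penalty -1) and the function names are assumptions chosen for clarity, and production tools use more elaborate scoring schemes such as substitution matrices and affine gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment: return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap  # a prefix of a aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap  # a prefix of b aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one best alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Local alignment: return the best score over all pairs of substrings."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # Flooring at zero lets an alignment restart anywhere.
            cur[j] = max(0, prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best
```

The only structural difference is the floor at zero in the local version, which allows poorly matching flanks to be discarded rather than penalized. With these scores, several equally good global alignments may exist for a pair of sequences, so the traceback returns one of them rather than a unique answer.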
Sequencing Sequencing is the process used to find the order of nucleotides in
DNA. A common method to perform this is known as Sanger sequencing
[125]. Another common technique is to perform shotgun sequencing on the
Sanger sequencing results [102, 125]. In shotgun sequencing, a sequence is
divided into many small, random fragments and then sequenced using the
chain termination method. This method was used in the sequencing of the
human genome [145]. High-throughput techniques, such as 454, Illumina,
and SOLiD sequencing, are now the most commonly used DNA sequencing
methods [96].
Genome Assembly Genome assembly refers to the process of assembling many
short DNA sequences together to form the original chromosome(s) they once
composed. The short sequences are often generated by shotgun sequencing.
Gene Expression Gene expression is the process by which the information in a
DNA nucleotide sequence is used to produce protein or RNA. Studying expression
helps elucidate the function of such a sequence; expression is reflected phenotypically,
meaning the effect is observable in the organism.
2.3.1 VectorBase
VectorBase [91, 92] is a bioinformatics resource that serves as a web-based
facilitator to a wealth of information and tools pertaining to invertebrate vectors
of human pathogens. At VectorBase, researchers can, among other things, browse
genomic data, contribute to community annotation of a genome, run bioinformat-
ics tools such as BLAST, ClustalW, or HMMER, and obtain relevant information
about the vector organisms. VectorBase is an NIAID Bioinformatics Resource
Center (BRC) and serves as a facilitator and motivator for much of the work in
this dissertation.
2.4 Transposable Elements
Transposable elements (TEs) are a type of repetitive sequence that has been
found in nearly all eukaryotic (nucleus-containing) genomes. First discovered and
analyzed by McClintock in the 1950s [98], TEs have the ability to move about
and replicate within a genome. Due to their mobile and replicative nature, TEs
often occupy large portions of genomes. TEs are estimated to represent 47% of
the genome of the yellow-fever mosquito, Aedes aegypti [105], 35% of the genome
of the frog Xenopus tropicalis [61], and 45% of the human genome, Homo sapiens [65]. This
prevalence of TEs poses a major difficulty in sequence assembly, as repeat regions
are prone to misassembly [101, 111]. TEs can impact host genomes in a number of
ways. They are believed to play a major role in genome evolution [82, 100, 130], as
they can insert themselves into, mutate, and move genes, thereby influencing gene
expression. In turn, TEs can cause gene variation and transfer genetic material
[12, 28, 129, 138].
The process by which TEs move about a genome is called transposition. TEs
are classified according to their transposition mechanism into Class I and Class II
elements. Class I TEs, or retrotransposons, transpose via an RNA intermediate.
Class I TEs are transcribed to RNA and then reverse transcribed back into DNA,
typically by a TE-encoded reverse transcriptase enzyme, the so-called
“copy-and-paste” mechanism. The presence
or absence of long terminal repeats (LTRs) further classifies Class I TEs into non-
LTR and LTR elements. Class II TEs, or transposons, are DNA-mediated and
transpose through the use of a transposase enzyme. Class II TEs are typically
bounded on each end by terminal inverted repeats (TIRs), which flank and serve
as the recognition sequence for the transposase. The transposase adheres to a
[Figure 2.4 diagram: TSD (TA) - TIR - UTR - TRANSPOSASE - UTR - TIR - TSD (TA), drawn above example sequence segments; size labels: 2 bp, 20-30 bp, ~100 bp, ~900 bp]
Figure 2.4. Typical mariner Class II Transposon Structure. Mariner transposons are characterized by a single transposase flanked by terminal inverted repeats (TIRs) and a preferential target site
duplication (TSD). There are generally 20-30 base pairs (bp) of untranslated region (UTR) flanking the transposase as well.
“cut-and-paste” mechanism, as it cuts out the TE from the host DNA and allows
it to insert at a new site in the host DNA. Many TEs have preferential insertion
sites and the method by which TEs move about genomes often produces artifacts
flanking the TEs, called target site duplications (TSDs). Both Class I and II
TEs are further divided into families, each with distinguishing characteristics. We
follow the classification scheme described by Tu [139], summarized in Figure 2.5.
Class I RNA-mediated retrotransposons are divided into several main families,
which we next describe.
LTR Retrotransposons– Members of the LTR retrotransposon group are
typically 5-9 kb (kilobase pairs) long and have a 200 to 500 bp (base
pairs) LTR on both ends. LTR retrotransposons encode polymerase (pol) and
group-specific antigen (gag) proteins [22, 139]. Example families of LTR
retrotransposons include Ty1/copia, Ty3/gypsy, and BEL retrotransposons.
Non-LTR Retrotransposons– The non-LTR retrotransposons include long
interspersed nuclear elements (LINEs). LINEs are generally 5-8 kb
long with open reading frames (ORFs) that contain coding necessary
[Figure 2.5 diagram: schematic structures of the major TE families.
Class I TEs: LTR Retrotransposon, 5-9 kb (LTR, gag, PR, IN, RT, RH); non-LTR Retrotransposon (LINE), 5-8 kb (ORF I (gag), ORF II (pol), APE, RT, RH); SINE, < 500 bp (promoters, tRNA-related region).
Class II TEs: "cut-and-paste" DNA Transposon, typically under 5 kb (TSD, TIR, ORF); MITE, typically under 500 bp (TSD, TIR, non-coding region); Helitron, around 7 - 9 kb (ORF, Helicase).]
Figure 2.5. TE Classification Scheme and Structures. The figure above, adapted from [139], shows the typical division of TE classes, as well as
major families within each. PR, IN, RT, RH, and APE refer to enzymes. TSD refers to target site duplications and ORF refers to open reading frame. ORFs are coding sequences without a stop codon. This
figure is not drawn to scale.
for transposition [22, 129, 139]. Non-LTR retrotransposons also have a pol region.
L1, L2, R1, Jockey, and Penelope are several example non-LTR
retrotransposons.
SINEs– SINEs, or short interspersed nuclear elements, are similar to LINEs,
but much shorter, with a typical length of under 500 bp [139]. SINEs
do not code for proteins and instead use transposition mechanisms from
other retrotransposons.
Class II Transposons are mediated by DNA transposition. Transposons cut and
paste themselves within genomes without the use of an RNA intermediate.
Descriptions of several families of transposons follow.
mariner– The mariner transposons are characterized by a DDD motif and
generally contain a single-exon transposase. A motif is a conserved pattern of
amino acids, which, in this case, are all aspartic acid residues, each
represented by a ‘D.’ Mariners are widespread in invertebrates and are
well characterized. We show an annotated mariner transposon in
Appendix E.1.3.1.
P– The P element transposons were first discovered in the fruit fly Drosophila
melanogaster genome and were the first transposons to be used as a
gene vector [73, 122]. Since the initial discovery of P elements, they
have been found in several other species, most notably the malaria
mosquito Anopheles gambiae [126].
piggyBac– The piggyBac family of transposons was first discovered in the
cabbage looper moth Trichoplusia ni [23]. This transposon has since
been found in a number of organisms and has proven effective as a gene
vector for a variety of organisms, including Aedes aegypti [95].
Tc1 – Like all transposons, Tc1 transposons are flanked by inverted repeat
sequences. They are valuable gene vectors, extensively used in the
analysis of Caenorhabditis elegans [110]. Tc1 transposons are typically
recognized by their DDE motif.
pogo– These transposable elements are members of the Tc1 superfamily.
They are very similar to Tc1 transposons, but lack the DDE catalytic
motif.
MITEs– MITEs, or miniature inverted-repeat transposable elements, are
small elements often under 500 bp. Their mechanism of transposition
is not entirely known and they do not encode a protein.
Helitrons– Helitrons replicate using a rolling-circle mechanism and encode
helicase-like proteins.
2.5 Transposable Element Identification
There are several difficulties with TE identification. TEs do not adhere to a
universal structure; instead, some families of TEs follow specific structures. An
example would be the TIR-transposase-TIR general structure of a Class II trans-
poson, such as in the mariner element. Additionally, the structure of TEs can
mutate over time. For example, TEs may preferentially insert themselves in sim-
ilar regions of the genome, or even within one another, leading to many nested
and fragmented copies. While autonomous, or active, TEs possess intact reading
frames that encode the machinery for transposition, the majority of TEs are non-
autonomous, or not active. Non-autonomous TEs can often still be transposed,
using the transposition machinery of other elements in their class, but they cannot
transpose themselves. For these reasons, a general approach cannot be used
to identify all TEs. Instead, several approaches are used with varying levels of
effectiveness.
The automatic computational identification of TEs is not as robust or mature
as analogous methods currently used for genes [115]. Bergman and Quesneville [13]
describe many TE discovery methods and classify existing TE discovery techniques
into de novo, structure-based, comparative genomic, and homology-based. Saha
et al. and Lerat more recently reviewed approaches to identify TEs [93, 124] and
classify identification techniques into similar groups: ab initio, signature-based,
and library-based techniques. We next describe the approaches according to the
Bergman and Quesneville classification.
2.5.1 De novo Discovery
De novo TE discovery approaches look for similar sequences found at multiple
positions within a genome. Once identified, the sequences are typically clustered,
filtered, and characterized. While computationally expensive, this approach can
identify novel TEs and is most effective in discovering TEs with high prevalence
within a genome. De novo techniques are not as effective in identifying degraded
TEs, meaning TEs with mutated or incomplete structure. Example de novo tools
include PILER [40] and RECON [11].
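The core idea of looking for sequences found at multiple positions within a genome can be illustrated with a toy k-mer index. This sketch is an assumption-laden simplification for intuition only and is not how PILER or RECON operate: it finds exact repeats of a fixed length, whereas real de novo tools tolerate mismatches and then cluster and filter the candidates. The function name and parameters are hypothetical.

```python
from collections import defaultdict


def repeated_kmers(genome, k, min_copies=2):
    """Map each k-mer seen at least min_copies times to its start positions."""
    positions = defaultdict(list)
    for i in range(len(genome) - k + 1):
        positions[genome[i:i + k]].append(i)
    return {kmer: starts for kmer, starts in positions.items()
            if len(starts) >= min_copies}
```

Even this simple version shows why prevalence matters for de novo discovery: a sequence present in only one copy, or one too degraded to share exact k-mers with its relatives, never reaches the copy threshold.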
2.5.2 Structure-based Discovery
Structure-based approaches, such as LTR_STRUC [97], typically work well
to identify complete TEs that conform to a defined and conserved structure. In
this case, LTR_STRUC is effective at finding retrotransposons with LTRs at each
end of the element. Structure-based methods are less useful when searching for
degraded TEs or for TEs without a conserved structural characteristic.
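A structural signature such as the terminal inverted repeats of a Class II transposon can be tested directly: the 5' end of the element should match the reverse complement of its 3' end. The sketch below is a hypothetical illustration and is not taken from any tool cited here. It requires exact matches, which is precisely why structure-based methods struggle with degraded elements whose repeats have accumulated mutations.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def terminal_inverted_repeat_length(seq, max_len=30):
    """Count how many leading bases match the reverse complement of the 3' end.

    max_len caps the search; the value 30 is an illustrative assumption.
    """
    rc = reverse_complement(seq)
    limit = min(max_len, len(seq) // 2)
    length = 0
    while length < limit and seq[length].upper() == rc[length].upper():
        length += 1
    return length
```

For instance, a candidate built as `"CAGTG" + "AAAAAAAAAA" + reverse_complement("CAGTG")` yields a repeat length of 5, while a sequence with no inverted termini yields 0.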
2.5.3 Comparative Genomic Methods
A comparative genomic discovery method described by Caspi and Pachter
[24] uses multiple sequence alignments of closely related genomes to detect large
changes between the genomes. The idea is that differences in the genomes, called
insertion regions, could be TEs or caused by TEs. Such differences are analyzed
and classified. This approach is useful when related genomes are readily available
and can identify new families of TEs. Common ancestral TEs will likely not be
identified by this approach.
2.5.4 Homology-based Discovery
Homology-based approaches utilize known TEs as a means to discover similar
TEs in genomes. This is typically done by manually using alignment programs,
such as BLAST [1], to align known TEs to the genome in question and then care-
fully analyzing the results. Biedler and Tu [14] reference a suite of TE-related
programs to identify and characterize TEs that are homology-based, and Ques-
neville et al. offer the BLASTER suite of tools [116] to detect TEs. Although
homology-based tools are few in number and struggle to identify TEs unrelated
to known elements, they are normally the most accurate at identifying known
TEs, as well as at detecting degraded TEs. Existing homology-based
approaches also sometimes utilize hidden Markov models (HMMs) [2]. Such
approaches are effective for closely related genomes, but struggle with distantly
related species, as the models tend to capture more irrelevant data when searching
for diverse sequences. Additionally, homology-based approaches currently avail-
able are the fewest in number and least automated. Moreover, many are not
geared to output high-quality consensus sequences. For these reasons, our auto-
mated approach to identify TEs, described in Chapter 3, is homology-based.
2.6 Annotation
The annotation of genomic data refers to giving meaning to such data, namely
describing the functions served by specific regions of DNA. Annotation is most
often applied to finding the structure and function of genes, a process that can be
very time-consuming in the lab. Several computational tools are commonly used in
the automatic annotation of genomes. These tools can be categorized as structure-
or homology-based and are often used in conjunction with other tools. GENSCAN
[20] is a gene identification tool that utilizes the structure of introns (non-coding
regions) and exons (coding regions) in its computation. GeneWise [15] is another
tool that works based on protein sequence similarity, shown to be very effective
in locating known genes [16]. These tools often utilize Hidden Markov Models
(HMMs) [38, 39, 102]. Ensembl’s automatic gene annotation system is one of the
better-known annotation systems; however, it is far from the “gold standard” of
annotation [30]. Such a standard is labor-intensive and takes years to complete.
Genes vary in many ways, making their automatic annotation difficult. Kohany
et al. [84] note that while automatic annotation has the potential for very high
throughput and is not susceptible to user error or bias, it is difficult to reconstruct
sequences, particularly TEs (largely due to their fragmentation), using automated
techniques.
The VectorBase community annotation pipeline eases many of these concerns.
Gene experts have the ability to annotate genes through the community pipeline,
as well as link their scientific publications to their efforts. Their work is visible to
the world through VectorBase, and these experts are publicly recognized.
VectorBase’s current system utilizes the DAS, Ensembl, Chado, and Hibernate
technologies, each of which is used in our design and implementation plan to annotate TEs,
described in Chapter 4. We elaborate upon each technology below.
2.6.1 DAS
DAS, or distributed annotation system [35, 68], is the protocol used to dy-
namically display data in the VectorBase Ensembl genome browser. DAS utilizes
a client-server setup through which a client interacts with multiple servers. Its
main advantage is that the annotation information can be located in multiple
databases and on multiple servers, but can be gathered and used by a single client
(VectorBase).
2.6.2 Ensembl
Ensembl aims to produce and maintain automatic annotations
on selected eukaryotic genomes [44]. Ensembl regularly increases supported
content; as of release 59, Ensembl supports 56 species [45]. For supported organisms,
Ensembl offers detailed information. Researchers can visually browse to locations
within a genome and then view detailed information, such as for a gene. The
annotation information is displayed through the various views of Ensembl’s
genome browser. The location-based view shows details for regions of the genome. The
gene-based view shows detailed gene information. Transcript information is also
available in the transcript-based view. This rich genomic annotation data is produced
through DAS tracks. Figures 2.6, 2.7, and 2.8 show the VectorBase implementation
of each of these views.
2.6.2.1 Ensembl Genebuild
Ensembl uses an automated pipeline to annotate genes in newly sequenced
genomes [31]. Their pipeline is largely homology-based and heavily relies on pro-
tein alignments with GeneWise [15]. A major advantage of GeneWise is that it
allows for frameshifts in the coding regions. Frameshifts are regions where the
reading frame has changed due to an insertion or deletion. Other alignment tools
such as exonerate are also utilized. Protein sources are obtained from UniProt
[141] and RefSeq [113]. A detailed description of the Ensembl pipeline is available
in Curwen et al. [31].
21
Fig
ure
2.6.
Ense
mbl
Loca
tion
-bas
edV
iew
onV
ecto
rBas
e.H
ere,
we
show
the
Vec
torB
ase
imple
men
tati
onof
Ense
mbl’s
loca
tion
-bas
edvie
won
the
2Lch
rom
osom
eof
the
An
ophe
les
gam
biae
genom
e.
22
Fig
ure
2.7.
Ense
mbl
Gen
e-bas
edV
iew
onV
ecto
rBas
e.T
he
figu
reab
ove
show
sth
eV
ecto
rBas
eim
ple
men
tati
onof
Ense
mbl’s
gene-
bas
edvie
w,
show
ing
gene
AG
AP
0033
95in
An
ophe
les
gam
biae
.
23
Fig
ure
2.8.
Ense
mbl
Tra
nsc
ript-
bas
edV
iew
onV
ecto
rBas
e.T
he
Vec
torB
ase
imple
men
tati
onof
Ense
mbl’s
tran
scri
pt-
bas
edvie
wfo
rA
nop
hele
sga
mbi
aege
ne
AG
AP
0033
95is
show
nab
ove.
24
2.6.3 Chado
Chado is a part of the GMOD (generic model organism database) [53] project.
GMOD is a collection of open source software tools for creating and managing
genome-scale biological databases. Chado is a modular schema that is designed to
allow the addition of new modules for new data types. VectorBase uses Chado to
host its complex biological data, currently stored in more than 250 tables. TEs can
have a more complicated structure than “regular” genes, and the existing Chado
schema is complex and abstract enough to be able to accommodate them. There
is limited built-in support for TEs within Chado; in fact, GMOD
encourages adapting TEs to Chado [27], which prompted us to adapt Chado to
our specific needs, as outlined in Section 4.2.3.
2.6.4 Hibernate
Hibernate [62] is a library for Java that provides a framework for mapping
Java classes to a database using XML. This allows for the automatic updating,
retrieval, creation, and deletion of object data. The use of Hibernate eliminates
the need to write SQL for such operations.
2.6.5 VectorBase Community Annotation Pipeline
VectorBase has implemented a community annotation pipeline (CAP) for genes,
which has the ability to accept four types of annotation information:
1. Gene Models Users can submit gene sequences to be incorporated into genebuilds.
2. Publications Sequence data can be linked to the literature.
3. Controlled Vocabulary Terms Controlled vocabulary terms can be associated with genes.
4. Comments General comments are available for publication.
Figure 2.9. VectorBase Gene Submission Form. The figure above shows a portion of the spreadsheet researchers populate with gene data to
submit to VectorBase.
This data is submitted by full-time annotators and community researchers.
Data is collected through spreadsheets, a portion of one of which is shown in
Figure 2.9. The spreadsheets are uploaded through the VectorBase website and
the genome-specific curator is notified of the submission. If approved, the data is
aligned to the genome with exonerate and incorporated into VectorBase, making
it available to the community. Once submitted, links are available for researchers
to associate publications or controlled vocabulary terms with the data, as well as
to add comments. Additionally, community submitted genomic data is displayed
through Ensembl’s browser on VectorBase via DAS.
Community annotation data on VectorBase flows according to the schematic
shown in Figure 2.10, adapted from Bruggner [19], and also briefly described in
Butler [21]. Many technologies are utilized, most of which have been described
previously. VectorBase uses PostgreSQL databases [112], which communicate with
DAS through Hibernate. SOAP [131], an XML-based protocol for exchanging
messages, interfaces with Hibernate and the VectorBase web pages.
[Figure 2.10 diagram: data flow among the following labeled components: Manual Annotation Database; Community Genomic Annotation; PostgreSQL; Java Hibernate Chado API; FeatureSubmission Functions; FeatureSubmission Web Service Interface; FeatureSubmissionQuery Functions; FeatureSubmissionQuery Web Service Interface; Apache Tomcat & Axis; SOAP; SubmissionForm.xls; Apache HTTP Server & PHP; Community Annotation Submission & Review Interface; DAS Server; Gene DAS; Genome Browser Contig View; Genome Browser Gene View; Report]
Figure 2.10. VectorBase Community Annotation Pipeline Data Flow. We show the flow of information on VectorBase for community gene
annotation, adapted from Bruggner [19].
2.6.5.1 Planned Updates to the VectorBase Community Annotation Pipeline
VectorBase has been heavily updated since the initial implementation of CAP.
At the time of this work, the original VectorBase CAP has not been restored
to working order. In the long term, and particularly with VectorBase 2.0, CAP
is likely to evolve further. Potential new features to CAP include an improved
interface for the submission and presentation of data, additional means to submit
data, as well as the ability to search for community submitted data. Additional
plans for the new CAP are currently under development and should eventually
include TEs.
2.7 Transposable Element Annotation
There has been much work done on the annotation of genes [16, 20, 30, 84, 87],
but little on the annotation of TEs. TEfam [134], Repbase [119], and WikiPoson
[148] currently serve as the best online TE resources. TEfam currently caters
to TEs from the Anopheles gambiae, Aedes aegypti, and Culex quinquefasciatus
mosquito species, as well as the Ixodes scapularis tick. TEs from the body louse,
Pediculus humanus humanus, are available internally, with plans to make them
public in early 2011. There, users can submit and view representative sequences.
There is no structural display provided, but the TEs are well-annotated. The
submission process is regulated through user accounts. Meanwhile, Repbase has a
large database of consensus TEs from many taxa. Registered users may export
TEs in the EMBL, FASTA, or IG file format and can submit properly formatted
TEs by emailing the editor-in-chief at the site. WikiPoson is a newer site that has
various information about TEs, including descriptions of how TEs are classified,
descriptions of many elements, as well as TE-related news. Stand-alone tools,
such as Apollo [94], also exist, but they largely rely on biological expertise and
their data are often not available online.
2.7.1 VisualRepbase
VisualRepbase [136] offers an interface for the study of TEs available on Rep-
base. VisualRepbase is available as a downloadable Java archive and has several
important features. First, researchers can search for TEs by family and genome
and download sequences in the FASTA format. Second, VisualRepbase can dis-
play the location and orientation of its TEs on the selected genome, as well as
any available annotation data. Third, VisualRepbase can also display the location
of any properly formatted annotation data available for a given genome. Lastly,
VisualRepbase lists the number of occurrences by chromosome for selected TEs.
This mapping was performed with Censor [72] and is stored in a table format that
includes coordinates within the genome. Figure 2.11 shows the distribution of the
MARINERN3_AG transposon on the 2R chromosome of Anopheles gambiae.
While useful, there are drawbacks to VisualRepbase. The visual display of
TEs lacks structural information. While the sequence orientation is shown, struc-
tural TE features such as TIRs are not shown. VisualRepbase is most useful for
showing the distribution of TEs within a genome, as well as their proximity to
any previously annotated data. Rich genome browsing features, such as those
available in Ensembl’s Genome Browser, would increase the utility of VisualRep-
base. Additionally, while VisualRepbase allows for the entry of new sequences,
it appears that users must also submit the associated annotation information,
including genomic coordinates, which is not always straightforward.
Figure 2.11. VisualRepbase Interface. The figure above shows the distribution of the MARINERN3_AG transposon on the 2R
chromosome of Anopheles gambiae. As shown in the figure, there are 109 occurrences on the chromosome with at least 50% identity to the
consensus.
2.8 Summary
This chapter has presented background information necessary for an under-
standing of Chapters 3 and 4. In particular, we have discussed basic molecular bi-
ology and introduced several bioinformatics research areas, including annotation.
We have described TEs and techniques utilized to detect them within genomes.
Lastly, we have described TE annotation techniques. Chapter 3 uses the infor-
mation this chapter has presented to describe an approach to identify TEs and
Chapter 4 describes a plan for the annotation of TEs on VectorBase.
CHAPTER 3
AUTOMATED HOMOLOGY-BASED APPROACH FOR THE
IDENTIFICATION OF TRANSPOSABLE ELEMENTS¹
3.1 Introduction
The identification of TEs is an important part of every genome project. Unfor-
tunately, the automatic identification of TEs in novel genomes is far from mature.
In particular, there is a lack of automated homology-based approaches that pro-
duce high-quality consensus TEs [13]. As the number of sequenced genomes has
rapidly risen, the identification of TEs has received greater attention from the
scientific community. The ability to identify TEs automatically and effectively
in a manner similar to the methods used for genes is also of increasing
importance. There exist many difficulties in identifying TEs, including their tendency
to degrade over time and the fact that many do not adhere to a conserved structure. In
this chapter, we describe an easy-to-use, automated homology-based approach to
discover high-quality putative TEs. We apply this approach to recently sequenced
arthropod genomes and identify consensus TEs up to 98% identical to manually
annotated TEs. The implementation of our approach, TESeeker, is available for
download as a virtual appliance.
¹ Results of this chapter have appeared in the TE sections of Arensburger et al. [5] and Kirkness et al. [83]. Our approach, with selected results, is under review in Kennedy et al. [80].
3.2 Approach for Identification of TEs
Our approach targets the identification of TEs with homology (similarity) to
known TEs. While our approach has many potential applications, we focus on
characterizing TEs in novel genomes. The approach utilizes a diverse library of
representative TEs as a basis for BLAST searches against the genome in question.
The hits then undergo multiple iterations of processing before we produce a high-
quality consensus TE. This section introduces and describes our approach.
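One of the processing steps described later in this section is combining BLAST hits that overlap or lie very close together. As a minimal sketch of that idea, and not the TESeeker implementation itself, hits can be treated as coordinate intervals and merged in a single pass. The function name, the tuple representation, and the `max_gap` threshold are hypothetical, since no specific distance is stated here. The sketch also assumes all hits lie on the same contig and strand, with start no greater than end.

```python
def merge_hits(hits, max_gap=0):
    """Merge (start, end) hits that overlap or lie within max_gap bases."""
    merged = []
    for start, end in sorted(hits):
        if merged and start <= merged[-1][1] + max_gap:
            # Extend the previous region instead of starting a new one.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(region) for region in merged]
```

With `max_gap=50`, the hits (100, 200), (150, 300), and (320, 400) collapse into a single region (100, 400), whereas with `max_gap=0` the third hit remains separate. Choosing the threshold is a trade-off: too small fragments a single element into pieces, too large fuses neighboring elements.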
3.2.1 Dependencies
This approach relies upon several notable bioinformatics tools, described be-
low.
3.2.1.1 Library of Representative Sequences
Our modular homology-based approach relies on a thorough and high-quality
library of representative TEs, organized by family. When strong information is
available, amino acid coding regions, reverse transcriptases for Class I TEs and
transposases for Class II TEs, are the preferred components of the library. Nu-
cleotide sequences can also be used, but such sequences do not allow for as much
nucleotide variance during the search. Sequences for our library were chosen man-
ually from TEfam [134], NCBI [104], Repbase [71], and the literature. LTR reverse
transcriptases within the representative library were chosen with the assistance of
Jose Manuel C. Tubío [140]. Sequences with complete amino acid coding regions
were preferentially chosen, and a wide variety of related sequences was assembled
for each family. Currently, the library consists of 475 representative coding re-
gions from a variety of organisms and covering the major TE families. Further
details on the provided library are available within the FASTA files and online
[137]. Because the library consists of sequences in the FASTA format, researchers
can easily modify the library or create their own library for use in the approach.
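To illustrate why a FASTA library is easy to modify, the following minimal sketch reads FASTA-formatted lines into a dictionary keyed by header. It is an assumption-based example rather than part of our implementation, which relies on BioPerl for sequence handling; the function name and the record names in the usage note are hypothetical.

```python
def parse_fasta(lines):
    """Return {header: sequence} from an iterable of FASTA-formatted lines."""
    records = {}
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequences may span several lines
    return {name: "".join(parts) for name, parts in records.items()}
```

Given the lines `>mariner_example transposase`, `MKV`, `LLA`, `>tc1_example`, and `MST`, the function returns two records with sequences `MKVLLA` and `MST`. Adding a representative sequence to a library therefore amounts to appending a header line and its sequence to the appropriate file.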
3.2.1.2 BLAST
We utilize BLAST [17, 102], or basic local alignment search tool, to perform
sequence similarity searches. BLAST is one of the most widely used algorithms
in bioinformatics and works well for our purposes.
3.2.1.3 DNASTAR SeqMan II
While not used in the automated approach, DNASTAR SeqMan II [33] played
a central role in the development of this approach. DNASTAR SeqMan II works
similarly to an assembler in that it allows one to set various parameters, such as
match size, minimum match percentage, or minimum sequence length, to produce
contigs, which are consensus sequences generated from multiple sequences. We
utilized DNASTAR SeqMan II extensively in early versions of this approach.
3.2.1.4 CAP3
CAP3 [64] is a popular and mature sequence assembly tool. In most cases, CAP3
produces better quality contigs than the phrap assembler [57]. The ability to clip
low-quality regions of the input sequences is an added benefit.
3.2.1.5 ClustalW2
ClustalW2 [89] is the newest version of the most widely used multiple sequence
alignment program. ClustalW2 offers a balance of speed and accuracy, while also
supporting the ability to produce phylogenetic trees.
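Our approach later derives a consensus sequence from a ClustalW2 multiple sequence alignment. One simple way to do so, shown here as a hypothetical sketch and not necessarily the rule used in our implementation, is a majority vote at each alignment column, dropping columns where a gap is the most common character. The function name and the tie-breaking behavior (the character seen first in the column wins) are assumptions.

```python
from collections import Counter


def majority_consensus(aligned, gap="-"):
    """Build a consensus from equal-length aligned sequences by majority vote."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("aligned sequences must be non-empty and equal in length")
    consensus = []
    for column in zip(*aligned):
        residue, _count = Counter(column).most_common(1)[0]
        if residue != gap:  # omit columns dominated by gaps
            consensus.append(residue)
    return "".join(consensus)
```

For the aligned rows `ACGT-A`, `ACGTTA`, and `ACCT-A`, the consensus is `ACGTA`: the third column resolves to G by two votes to one, and the fifth column is dropped because two of the three rows contain a gap.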
3.2.1.6 BioPerl
We use Perl, or Practical Extraction and Report Language, and an extension of Perl
called BioPerl [133] to perform the majority of our sequence analysis. Perl is
an interpreted scripting language that lends itself well to the bioinformatics field
largely because of its parsing capabilities. BioPerl has many of the common bioin-
formatics applications built into modules, which makes it very powerful.
3.2.2 General Description of Approach
Our approach varies slightly depending on whether the representative TEs
are amino acid or nucleotide sequences, the main difference being that amino
acid searches require only a translated nucleotide genome search, tblastn, while
nucleotide sequences require translation of both themselves and the host genome,
tblastx. We next describe the approach that starts with an amino acid library
of TEs, shown graphically in Figure 3.1. A walkthrough of our approach for the
mariner TE in P. humanus humanus is described in Appendix A.
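The choice between tblastn and tblastx described above can be sketched as a small
helper. This is a minimal illustration, not TESeeker code: the function names and
the alphabet heuristic are assumptions introduced here, and a short protein made
up only of A, C, G, T, or N residues would be misclassified.

```python
# Hypothetical helper (not part of TESeeker): pick the translated BLAST
# program based on whether a library sequence looks like nucleotide data.
NUCLEOTIDE_ALPHABET = set("ACGTUN")


def looks_like_nucleotide(seq: str) -> bool:
    """Heuristic: nucleotide if every letter is one of A, C, G, T, U, N."""
    residues = {c for c in seq.upper() if c.isalpha()}
    return bool(residues) and residues <= NUCLEOTIDE_ALPHABET


def blast_program_for(seq: str) -> str:
    """Amino acid queries use tblastn; nucleotide queries use tblastx."""
    return "tblastx" if looks_like_nucleotide(seq) else "tblastn"
```

In practice the sequence type of a library is usually known in advance, so a
heuristic like this would only serve as a sanity check on user-supplied FASTA files.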
The approach begins with BLAST searches against the genome using repre-
sentative TEs for the chosen family. Resulting BLAST hits are combined if they
overlap or are very close together, and are then extracted from the genome. We
next assemble the hits with CAP3 in an attempt to get a viable representation of
the coding sequence. We use the CAP3 results to do another BLAST search against
the genome and process the hits in the same manner. However, when extracting
the sequences from the genome, we add flanking regions. The length of the
flanking region is dependent on the type of TE and is utilized to enable us to
capture the entire TE. These extracted results are then aligned and a consensus
is generated. We use the consensus to perform a final BLAST search, again
combining, extracting, and assembling the sequences. CAP3 then produces the
high-quality, full-length consensus TE. We next describe the approach in more
detail.
[Figure 3.1 diagram; node labels: TE Library, Genome, BLAST, combine, CAP3,
trim, ClustalW2, generate consensus, Consensus TE]
Figure 3.1. Approach Schematic. The approach is composed of multiple, iterative
steps. The general flow is as follows. A TE family is used in a BLAST search
against the genome. Hits are then combined, extracted from the genome, and
assembled with CAP3. Next, the sequences are trimmed and again used in a BLAST
search against the genome. The results are then used to produce a multiple
sequence alignment in ClustalW2. We generate a consensus from the alignment,
and then perform a final BLAST search against the genome. We again combine,
extract, and assemble with CAP3. Finally, the consensus TE is generated from
CAP3.
3.2.2.1 Identify Coding Region
The coding region is generally most conserved across TEs within a genome,
as it must be complete to produce a functional protein. We begin with local
sequence alignments using BLAST. Nucleotide-based blastn searches are not as
effective in identifying TEs and are not used; the nucleotide sequence for a given
TE may vary considerably, while the translated amino acid sequence is more likely
to be conserved. Instead, tblastn searches are used to identify the coding region.
BLAST produces a set of hits for each TE query against the genome and we
consider hits with an expectation value (e-value) less than 1E-20 for our approach.
Lower e-values correlate to more significant hits. This cutoff was determined
from our empirical data to limit the hits to the most probable TEs while also
eliminating most false positives and can be manually adjusted. Due to slight
sequence variations, BLAST results are often rich in short, nearly-adjacent hits.
We process BLAST results such that all hits are combined if they are within a
specified distance of one another, 50 bp by default, and originate from the same
query sequence. Hits with overlapping coordinates are combined as well. These
combinations increase the quality of our hits and the potential to capture more
complete sequences. In the case where there is a gap between sequences, we also
include the intervening sequence data in our hit. Figure 3.2 shows combination
scenarios. Once all possible combinations are performed, hits are extracted from
the genome.
[Figure 3.2 diagram: hit sequences A and B shown in five arrangements, each
consolidated into a single sequence C]
Figure 3.2. Methods of Combination. This figure shows the five general
combination scenarios used in our scripts. In each case, hit sequences A and B
are consolidated into a single sequence C, which represents a section of
nucleotides from the genome. We have shown combinations of overlaps, nested
sequences, and sequences separated by a short, prespecified distance.
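The filtering and combination rules above can be illustrated with a short
interval-merging sketch. It is a simplified reading of the text, not the TESeeker
scripts: the Hit record and combine_hits function are hypothetical names, strand
is ignored, and the distance between two hits is taken to be the number of
intervening bases.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    query: str     # ID of the representative TE used as the query
    subject: str   # genome sequence (contig or scaffold) containing the hit
    start: int     # 1-based start on the subject, start <= end
    end: int       # 1-based inclusive end on the subject
    evalue: float


def combine_hits(hits, max_evalue=1e-20, max_gap=50):
    """Drop weak hits, then merge overlapping, nested, or nearby hits.

    Hits are merged only when they come from the same query and lie on the
    same genome sequence. The merged hit spans the intervening bases too.
    """
    kept = [h for h in hits if h.evalue < max_evalue]
    kept.sort(key=lambda h: (h.query, h.subject, h.start, h.end))
    merged = []
    for h in kept:
        last = merged[-1] if merged else None
        same_group = (
            last is not None
            and (last.query, last.subject) == (h.query, h.subject)
        )
        # A negative gap means the two hits overlap or one is nested.
        if same_group and h.start - last.end - 1 <= max_gap:
            last.end = max(last.end, h.end)
            last.evalue = min(last.evalue, h.evalue)
        else:
            # Copy so the caller's Hit objects are never modified.
            merged.append(Hit(h.query, h.subject, h.start, h.end, h.evalue))
    return merged
```

Sorting by position first means each hit only has to be compared with the most
recently merged hit, so the combination step is a single pass over the sorted list.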
At this point, we have a set of possible coding sequences, both complete and
partial, many of which are copies or partial copies of one another. To consolidate
and improve our results, we assemble the sequences with the CAP3 assembly pro-
gram [64]. CAP3 produces contigs and singleton sequences. Singleton sequences
are sequences that did not assemble with other sequences. CAP3 also generates
accompanying quality scores for the contigs, based upon the underlying sequences
that produced the consensus. We use the quality scores to trim the sequences
such that the highest quality sequence remains. To do this, we iterate through a
contig, keeping track of the cumulative sum of quality scores for a given number
of consecutive nucleotides, called the sliding window, which is 20 bp by default.
When the average value of a nucleotide in this sliding window exceeds a thresh-
old, typically 18, we consider the corresponding sequence to be high quality. If the
average value drops below the threshold, the sequence is ignored. Once we have
read the entire sequence, there will likely be gaps in the sequence where there is
little commonality. In these cases, we only keep the low-quality regions if they are
of short length and have adjacent high-quality sequences. These results are then
reassembled in CAP3, trimmed, and considered the best potential complete coding
region. In the case that CAP3 produces only singletons, we perform the aforemen-
tioned analysis with them. We then aim to extend the sequence to encompass the
entire TE. Pseudocode for the steps described in this section of our approach is
shown in Algorithm 1.
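The quality-based trimming described in this section can be sketched as follows.
Several details are assumptions made for illustration: each position is judged by
the mean quality of a window centered on it (truncated at the contig ends), the
max_low_run cutoff for "short" low-quality gaps is a hypothetical value, and the
function returns only the longest high-quality stretch.

```python
from itertools import accumulate


def quality_mask(quals, window=20, threshold=18):
    """True where the mean quality of the window around a base exceeds threshold."""
    n = len(quals)
    prefix = [0] + list(accumulate(quals))  # cumulative sums of quality scores
    half = window // 2
    mask = []
    for i in range(n):
        lo = max(0, i - half)
        hi = min(n, i - half + window)
        mask.append((prefix[hi] - prefix[lo]) / (hi - lo) > threshold)
    return mask


def runs(mask, value):
    """Yield (start, end) half-open runs of positions where mask equals value."""
    start = None
    for i, flag in enumerate(mask):
        if flag == value and start is None:
            start = i
        elif flag != value and start is not None:
            yield start, i
            start = None
    if start is not None:
        yield start, len(mask)


def trim_contig(seq, quals, window=20, threshold=18, max_low_run=10):
    """Return the longest high-quality stretch of a contig."""
    if len(seq) != len(quals):
        raise ValueError("sequence and quality scores differ in length")
    mask = quality_mask(quals, window, threshold)
    # Keep short low-quality runs that have high-quality sequence on both sides.
    for start, end in list(runs(mask, False)):
        if start > 0 and end < len(mask) and end - start <= max_low_run:
            mask[start:end] = [True] * (end - start)
    best = max(runs(mask, True), key=lambda r: r[1] - r[0], default=(0, 0))
    return seq[best[0]:best[1]]
```

Because the window smooths the scores, the trimmed boundaries can differ from the
true quality boundaries by a few bases, which is one reason the trimmed results
are reassembled and trimmed again.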
3.2.2.2 Encompass Complete Transposable Element
Once the putative coding region has been identified, we create a consensus for
the complete TE. We perform a blastn search with each contig and singleton
from the previous (CAP3) step to find the instances of the TE within the genome.
We again process these hits as before and extract them from the genome, but
this time we also extract flanking regions on both sides of the viable hits in an
attempt to capture the entire TE. This extracted set of instances can then be used
to generate a consensus sequence.
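Extraction with flanking regions can be sketched as below. The function name and
signature are hypothetical; coordinates are assumed to be 1-based and inclusive,
flanks are clamped to the ends of the genome sequence, and minus-strand hits are
returned as the reverse complement.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def extract_with_flanks(sequence, start, end, flank, strand="+"):
    """Return (region, lo, hi) for a hit widened by `flank` bp on each side.

    start and end are 1-based inclusive coordinates on `sequence`.
    lo and hi are the widened coordinates after clamping to the sequence ends.
    """
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("hit coordinates fall outside the sequence")
    lo = max(1, start - flank)
    hi = min(len(sequence), end + flank)
    region = sequence[lo - 1:hi]
    if strand == "-":
        region = region.translate(_COMPLEMENT)[::-1]
    return region, lo, hi
```

Returning the clamped coordinates alongside the sequence makes it easy to tell
when a hit sits at the edge of a contig and the flank had to be shortened.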
Algorithm 1 P = IdentifyPutativeSequences(Q, S, evalue, distance)
Let Q be the set of representative TEs
Let S be the genome
Let P be the set of putative hits
Let evalue be the maximum e-value of a potential hit
Let distance be the maximum distance between potential hits
// Search genome and sort hits according to location
for all q ∈ Q do
    Hq ← BLAST(q, S)
    Hq ← sort(Hq, position)
end for
// Combine overlapping hits
for all q ∈ Q do
    for all h ∈ Hq do
        if h.evalue ≤ evalue then
            for all i ∈ Hq do
                if i.evalue ≤ evalue then
                    if abs(h.location − i.location) ≤ distance then
                        h ← (h + i)
                    end if
                end if
            end for
        end if
    end for
end for
// Extract putative TEs from genome
for all q ∈ Q do
    for all h ∈ Hq do
        Pq ← extract(h, S)
    end for
end for
// Assemble consensus TEs
for all p ∈ Pq do
    p ← trim(CAP3(p))
end for
return P
3.2.2.3 Generate Consensus
The extracted near full-length sequences from the previous step are inherently
very similar on a nucleotide-by-nucleotide basis. To generate a consensus from
this set of sequences, we perform a multiple sequence alignment with ClustalW2
[89]. A consensus sequence from the multiple sequence alignment is generated as
follows. We record counts for each nucleotide at each position in the alignment
file. If a gap is encountered, counts for each nucleotide are incremented. If the
percentage for any nucleotide at a given position exceeds a given threshold, 49%
by default, that nucleotide is used for that position in the consensus. We now
have a consensus sequence for the TE that is the most likely sequence to occur in
the genome and we need to verify that it is complete.
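The consensus rule above can be sketched in a few lines. Two points are
assumptions, since the text does not specify them: the percentage is computed
against the total of the four nucleotide counts at a column (so each gap adds to
the denominator four times), and a column where no nucleotide exceeds the
threshold is skipped unless a fallback character is given.

```python
def consensus_from_alignment(rows, threshold=0.49, fallback=""):
    """Build a consensus from equal-length aligned sequences ('-' marks a gap)."""
    if not rows:
        return ""
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("aligned sequences must have equal length")
    consensus = []
    for col in range(length):
        counts = {"A": 0, "C": 0, "G": 0, "T": 0}
        for row in rows:
            ch = row[col].upper()
            if ch in counts:
                counts[ch] += 1
            elif ch == "-":
                # A gap increments the count of every nucleotide.
                for base in counts:
                    counts[base] += 1
            # Any other symbol (for example N) is ignored.
        total = sum(counts.values())
        base, best = max(counts.items(), key=lambda kv: kv[1])
        if total and best / total > threshold:
            consensus.append(base)
        else:
            consensus.append(fallback)
    return "".join(consensus)
```

Under this reading, a column made up mostly of gaps spreads its counts evenly
over the four nucleotides, so no base passes the threshold and the column drops
out of the consensus.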
3.2.2.4 Identify Complete Transposable Element
To validate and improve the consensus sequence, we look for similar copies
of it in the genome with a blastn search. We again process the BLAST hits as
previously described and extract them from the genome, generally adding short
flanking sequences. The resulting extracted sequences are again iteratively exam-
ined with CAP3 and trimmed. CAP3 produces a sequence which represents the best
estimate for a representative putative TE in the novel genome. Manual inspection
of the putative TE is advisable, both in terms of validity and classification.
Once validated, this TE can then be utilized to calculate the density of its
particular family within the genome and to find individual instances.
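One way to compute a family's density is sketched below. The text does not state
how density was calculated, so this is an assumption: density is taken to be the
fraction of genome bases covered by at least one copy, with overlapping copies
counted once. The function name and input layout are hypothetical.

```python
def family_density(copies, genome_length):
    """Fraction of the genome covered by the given TE copies.

    copies: iterable of (seqid, start, end) with 1-based inclusive coordinates.
    genome_length: total number of bases in the genome.
    """
    covered = 0
    current = None  # (seqid, start, end) of the interval being extended
    for seqid, start, end in sorted(copies):
        if current and current[0] == seqid and start <= current[2]:
            current = (seqid, current[1], max(current[2], end))
        else:
            if current:
                covered += current[2] - current[1] + 1
            current = (seqid, start, end)
    if current:
        covered += current[2] - current[1] + 1
    return covered / genome_length
```

Multiplying the result by 100 gives a percentage comparable to the density
columns reported in the results tables.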
3.2.3 Implementation
Our approach is implemented as TESeeker and was purposely designed to be
modular, while relying upon common bioinformatics tools, namely BLAST, CAP3,
and ClustalW2, as well as BioPerl [133]. TESeeker is released as a VirtualBox
[144] virtual appliance. The local web browser interface to TESeeker offers the
main gateway to the core TESeeker functionality; however, TESeeker can also be
run through the command line. A researcher needs only to provide basic
parameters, such as TE family, host genome, closeness to combine, minimum BLAST
hit length, flank length, CAP3 window size, CAP3 quality score threshold, and
the nucleotide percentage threshold for consensus generation. We offer suggested
parameters that were determined through extensive testing of the approach on
various TE families. These tests were largely performed on arthropod genomes.
Suggested parameters include combining BLAST hits within 50 bp, a CAP3 window
size of 20 bp and quality score threshold of 18, a 49% nucleotide commonality
cutoff for consensus generation, and a 70% minimum length cutoff (with respect
to the query) for the final BLAST search. Further details on suggested
parameters, as well as a means to perform a sample run, are provided with the
virtual appliance. While TESeeker is not parallelized, researchers can easily
run multiple instances while varying parameters and TE families, offering
scalability.
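The suggested parameters can be collected in a small configuration object, as
sketched below. The class and field names are hypothetical and do not reflect
TESeeker's actual interface. Defaults are the values stated in this chapter;
flank length and minimum hit length have no stated default, so they are left to
the caller, and the flank length of 500 in the usage lines is purely illustrative.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class SearchParams:
    te_family: str
    genome_path: str
    flank_length: int                      # depends on TE type; no default given
    min_hit_length: Optional[int] = None   # no default given in the text
    max_evalue: float = 1e-20
    combine_distance: int = 50             # bp between hits to be combined
    window_size: int = 20                  # bp, sliding window for trimming
    quality_threshold: int = 18
    consensus_threshold: float = 0.49
    final_min_length_fraction: float = 0.70

    def __post_init__(self):
        if self.flank_length < 0 or self.combine_distance < 0:
            raise ValueError("lengths and distances must be non-negative")
        if self.window_size < 1:
            raise ValueError("window size must be at least 1")
        if self.max_evalue <= 0:
            raise ValueError("e-value cutoff must be positive")
        if not 0 < self.consensus_threshold < 1:
            raise ValueError("consensus threshold must be between 0 and 1")
        if not 0 < self.final_min_length_fraction <= 1:
            raise ValueError("length fraction must be in (0, 1]")


# Usage: start from the defaults, then relax the e-value for a second run.
default_run = SearchParams("mariner", "genome.fa", flank_length=500)
relaxed_run = replace(default_run, max_evalue=1e-10)
```

An immutable record like this makes it straightforward to launch several
independent runs that differ in one parameter.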
3.2.4 Advantages
Our approach offers many advantages to researchers. First, TESeeker allows
for the fast and accurate detection of TEs. As demonstrated in several genomes,
across multiple TE families, TESeeker effectively identifies TEs. In addition to
TE identification, our approach offers opportunities to reexamine and validate
previous research. Second, TESeeker is very easy to use; we provide TESeeker
as a virtual appliance, completely configured. Researchers must only provide a
few parameters to begin searching. Parameters are easily modified and multiple
iterations of the approach can be run simultaneously. Third, TESeeker is general.
While we primarily evaluated our approach on Class II TEs in arthropod genomes,
the parameters can be adjusted to allow for the effective detection of a variety of
TE families in any genome, including genomes that contain only degraded TEs.
Less stringent parameters will be more effective in detecting such degraded TEs,
but will also increase the number of false positives. As mentioned previously, we
have utilized various stages of this approach to identify non-LTR and LTR TEs
in a number of genome projects. Last, our approach eases the burden on expert
annotators, decreasing genome annotation time.
3.2.5 Limitations
While robust, this approach has several limitations. First, results are highly
dependent on the quality of the sequences in the library and whether the novel
genome contains TEs with homology to those in the library. The library must
contain a thorough representation of TEs for a given family, preferably amino
acid coding regions. The provided library has performed well, but extensive test-
ing has not been performed on LTR elements. Additionally, this approach is not
designed to detect TEs without a coding region, such as SINEs or MITEs. Second,
the approach is most effective for TEs that exist in multiple copies throughout
the genome. While TESeeker has been shown to find TEs that have only a single
full-length instance, the quality of its output may be reduced, and altering the
parameters requires extra, potentially time-consuming effort from the
researcher. Last, results
from TESeeker must be closely examined. An ongoing issue with TEs concerns
their classification. If a search is seeded with mariner sequences, it may produce
consensus TEs that are not true mariners, but are rather mariner -like TEs. For
this study, TEs were classified through examination of their amino acid coding
regions.
3.3 Results
Our approach was developed over the course of several TE detection projects
on several arthropod genomes [5, 83], but was not originally automated. DNASTAR
SeqMan II [33] was used in place of CAP3 and ClustalW2. DNASTAR SeqMan II
produced viable results, but it required extensive interaction from a researcher.
Sequences had to be manually examined and trimmed in the program, a process
which took considerable time and required a trained researcher. This manual ap-
proach produced results that we consider a high-quality annotation of TEs. We
used these results to partially validate TESeeker against the Pediculus humanus
humanus genome, described below. For example, running the approach with de-
fault parameters for a mariner Class II element in P. humanus humanus and with
a Jockey Class I element in Culex quinquefasciatus produced a consensus TE that
was more than 98% identical to the manually produced element. Additionally, the
elements were correctly trimmed. These consensus sequences were generated with
amino acid coding sequences - transposase in the mariner element and both open
reading frames of the reverse transcriptase in the Jockey element. Figures 3.3 and
3.4 show alignments of the automated approach’s consensus versus manually
annotated elements. We also evaluated our approach against published results
from the Anopheles gambiae PEST genome, as well as a number of other genomes.
Various stages of our methodology have been applied to a number of genome
projects which we describe next. In all cases, we utilized our library of
representative coding regions. If we were searching an annotated genome with
representative coding regions already in our library, we removed them before
running TESeeker.
           5’                                              3’
Automated  AA-TATTGGGTTGGCAAATAAGTA...AATATCTTTTGCCAACCCAATA
              |||||||||||||||||||||   ||||||||||||||||||||||
Manual        TATTGGGTTGGCAAATAAGTA...AATATCTTTTGCCAACCCAATA
Figure 3.3. P. humanus humanus mariner element. The alignment of the consensus
sequence from our approach and the manually annotated element is shown above.
As evident from the figure, our approach correctly identifies the element and
trims the ends almost perfectly.
           5’                                                    3’
Automated  TTT...TTTTTTTTTTTAATTTATATTTAT...GAAGGTTCGCAAGACACTG
                 ||||||||||||||||||||||||   |||||||||||||||||||
Manual           TTTTTTTTTTTAATTTATATTTAT...GAAGGTTCGCAAGACACTGAT
Figure 3.4. C. quinquefasciatus Jockey element. The alignment of the consensus
sequence from our approach and the manually annotated element is shown above.
Extra thymine elements on the 5’ end are typical of the corresponding poly(A)
tail, characteristic of Jockey elements.
3.3.1 Pediculus humanus humanus
The body louse, Pediculus humanus humanus, is the primary vector of typhus
and several other diseases [109]. It has the smallest presently sequenced insect
genome at roughly 110 Mb (mega base pairs). TESeeker was able to identify all
Class I and II TEs, with the exception of MITEs, reported in Kirkness et al. [83].
A separate tool was developed to detect MITEs and is not described here. Unlike
many other arthropod genomes, only 1% of the P. humanus humanus genome
is made up of TEs. Our approach’s ability to discover TEs of varying families,
across classes, in a genome with so few TEs demonstrates its utility. Following
is a description of our results for each class of elements, which is summarized
in Table 3.1. Additional detail is provided in the P. humanus humanus genome
paper [83].
TABLE 3.1
Pediculus humanus humanus NON-LTR RESULTS
Class I     Family       Element    Length (bp)  Full-length Copies  Copies  Density
  non-LTR   SART         Hope-like  4655         1                   522     0.18%
            R4           Dong-like  5266         4                   1739    0.45%
  LTR       ty3/gypsy    Mdg1       5395         2                   976     0.28%
Class II    Family       Element    Length (bp)  Full-length Copies  Copies  Density
            mariner/Tc1  mariner    1276         24                  216     0.09%
            MITE         MITE1      623          4                   39      0.02%
                         MITE2      169          16                  66      0.007%
TOTAL                                                                        1.027%
3.3.1.1 Class I Elements
LTR Retrotransposons– Only one element of the LTR retrotransposon family is
well-represented in the P. humanus humanus genome. Phylogenetic anal-
ysis shows that it belongs to the Mdg1 lineage of LTR retrotransposons
(Ty3/gypsy clade). There were no active copies found - the canonical copy
has point mutations in the gag-like domain. There are only two full length
copies in the genome, which suggests that these genomic insertions are rel-
atively recent and that selective pressure is very efficient in purging func-
tional copies from the genome. The other copies are present in the form of
solo-LTRs and partial to highly deleted proviral copies, demonstrating that
solo-LTR formation (by recombination between the two LTRs of the same
copy) and deletions are important mechanisms in the inactivation and/or
elimination of TEs from this genome. Another characteristic of this element
is that the target site is always ATAT, and many of the copies are located in
poly-AT regions (possible heterochromatin), where recombination rate may
be lower and, therefore, selection pressure is also lower, permitting frag-
mented copies to evolve like pseudogenes over time until selection finally
eliminates them.
Non-LTR Retrotransposons– Two distinct types of non-LTRs were reconstructed
and identified. The longest element is 5266 bp. It has 52% homology to
the A. gambiae Dong reverse transcriptase (R4), possesses a single reading
frame, and has a TAA target site [86]. Four full-length copies and many
partial-length copies were found in the genome.
The second element that was reconstructed was about 4655 bp long; how-
ever, it was difficult to determine the boundaries of the element outside of its
coding region. This transposable element is not represented in the genome
as a full-length copy. However, several copies exist in the genome with in-
terrupted reading frames. The highest homology was found to the Bombyx
mori (50% homology) and the Papilio xuthus (48% homology) Hope reverse
transcriptase protein [85]. Probable loss of target site specificity and a trend
of insertions across all genomic locations was reported by Kojima and Fuji-
wara [85] for Hope-like elements. However, it was observed that some copies
of Hope-like elements in P. humanus humanus have a targeting sequence
similar to 28S rDNA.
3.3.1.2 Class II Elements
mariner/Tc1 – From the many inspected Class II families only one mariner/Tc1
element was identified. It is a 1276 bp mariner transposon, with 33 bp
TIRs. We show the consensus for this element in Appendix E.1.3.1. Its
transposase has highest homology to Apis mellifera Ammar1 transposase
and Ceratitis capitata Ccmar2 transposase. Reconstruction of the mariner
revealed that the element has two reading frames and that some copies have
a 24 bp deletion in the coding part, which caused further reading frame
interruptions. No autonomous elements were found, but 24 full-length copies
and many deteriorated ones are still present in the P. humanus humanus
genome.
MITE– Two MITEs were identified in the P. humanus humanus genome. The
first is 623 bp long, present in 4 copies, with a 12 bp TIR. The second is
169 bp long, present in 16 copies, with a 20 bp TIR. Dot plot analysis re-
vealed that the 623 bp element consists of 4-5 repeats within itself and that
the 169 bp element has 2 repeats. No homologies with other P. humanus hu-
manus TEs were identified. The MITEs were identified by a separate script
we developed that aims to find inverted repeats within specified distances
within the genome. This script is available in Appendix F.
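A search for inverted repeats within a specified distance can be sketched as
follows. This is not the script from Appendix F: it is a minimal illustration
that only finds exact inverted repeats of a fixed length, and its function names,
parameters, and defaults are assumptions made here.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def find_inverted_repeats(seq, tir_len=12, min_span=50, max_span=700):
    """Return (start, end) half-open, 0-based spans bounded by exact TIRs.

    A span is reported when the tir_len bases at its start are the reverse
    complement of the tir_len bases at its end and the total span length
    lies between min_span and max_span.
    """
    seq = seq.upper()
    last = len(seq) - tir_len
    positions = {}
    for i in range(last + 1):
        positions.setdefault(seq[i:i + tir_len], []).append(i)
    candidates = []
    for i in range(last + 1):
        left = seq[i:i + tir_len]
        if set(left) - set("ACGT"):
            continue  # skip windows containing ambiguous bases
        for j in positions.get(reverse_complement(left), ()):
            span = j + tir_len - i
            if j >= i + tir_len and min_span <= span <= max_span:
                candidates.append((i, j + tir_len))
    return candidates
```

Real TIRs often contain mismatches, so a practical tool would need to tolerate
imperfect repeats and filter low-complexity matches. This sketch does neither.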
3.3.2 Culex quinquefasciatus
The Culex quinquefasciatus mosquito is the primary vector of the West Nile
virus and St. Louis encephalitis. We searched its roughly 580 Mb genome for
non-LTR retrotransposons and identified 11 of the 17 known families of non-
LTRs, together occupying 4.4% of the total genome. Among these, full-length
copies of the CR1, I, Jockey, L1, L2, LOA, Loner, and R1 families were found.
The Loner and Outcast families are unique to mosquitoes. There is evidence of
recent activity in the CR1, Jockey, L1, and L2 elements. Across all TE
families, TEs occupy roughly 29% of the genome, comparable to similar mosquito
species. Our full results, also presented in the C. quinquefasciatus genome
paper [5], are shown in Table 3.2.
3.3.3 Anopheles gambiae PEST Genome
Anopheles gambiae serves as the main vector of malaria [63]. The PEST strain
consists of roughly 273 Mb and has been extensively studied. Class II P elements
within the genome have been especially closely examined. Sarkar et al. originally
identified 6 distinct P elements [126]. More recently, Oliveira de Carvalho et al.
identified 4 additional P elements [106], while Quesneville et al. described 9 clades
(subfamilies) that are at least 30% divergent at the nucleotide level [117]. In all,
previous research has described 12 clades of P elements in A. gambiae that are
more than 30% divergent at the nucleotide level.
TABLE 3.2
Culex quinquefasciatus RESULTS
Class I     Family             Number of Elements  Copies  Density
  non-LTR   CR1                31                  973     0.28%
            I                  11                  63      0.02%
            Jockey             16                  5028    1.77%
            L1                 57                  662     0.15%
            L2                 9                   1416    0.61%
            LOA                9                   184     0.09%
            Loner              2                   127     0.12%
            Outcast            4                   15      0.00%
            R1                 32                  250     0.14%
            RTE                8                   892     0.38%
            Unclassified LINE  30                  11,117  0.88%
TOTAL                                                      4.45%
TESeeker detected 11 out of the 12 P elements within A. gambiae, as well as an
additional 2 partial hits that showed strong similarity to P element transposase,
but that were more than 30% divergent at the nucleotide level. The lone element
that TESeeker missed, AgaP14, is most divergent from the other elements, which
may explain its absence and which also suggests our library does not fully represent
the P element family. Additionally, TESeeker produced consensus sequences with
TIRs on every P element where they had been previously reported.
Searches for additional Class II TE families were also successful. In particular,
of the 13 piggyBac elements identified in Sarkar et al. [127], we identified 10,
including TIRs where previously described. Again, the elements TESeeker missed
were most divergent from the other sequences. TESeeker did especially well with
mariner elements. TESeeker identified each of the 5 elements at TEfam, each
with complete TIRs and 4 with the expected TSDs.
3.3.4 Other Organisms
TESeeker was also validated on select elements in a variety of organisms. Of
particular note, we detected a previously unreported putative mariner element
in the well-studied Drosophila melanogaster genome. The 1061 bp element has
TIRs 26 bp in length, with 3 mismatches, but with no apparent TSDs. A sin-
gle full-length copy, as well as several partial hits, exist within the genome. Its
transposase has a high homology to related insects, such as Chymomyza amoena
and Cladodiopsis seyrigi. Searches for this element in existing TE annotations for
D. melanogaster produced no hits. Please refer to Section E.3.1.1 of Appendix E
for an annotated version of this putative element. Further investigation in collab-
oration with FlyBase [36] is warranted to validate this result.
Additionally, TESeeker was used to search for mariner elements in the human
(Homo sapiens), frog (Xenopus tropicalis), and chicken (Gallus gallus) genomes.
Mariner elements are known to exist in all three genomes, and TESeeker found
them.
3.4 Conclusion
As the number of sequenced genomes rises, the necessity to identify TEs within
them also grows. TEs are an important evolutionary force present in the majority
of these genomes. While there exist mature, effective, and automated gene identi-
fication systems, the tools available for the identification of TEs are not as robust.
Particularly, current homology-based approaches are typically very interactive,
requiring numerous user decisions and many separate tools.
The approach described herein successfully identifies TEs in novel genomes in
an automated and easy to use package, offering researchers the ability to quickly
produce high-quality consensus TEs. TESeeker was developed and refined over
the course of several TE identification projects and works best to detect TEs
with homology to known TEs. We are able to generate high-quality putative
TEs as well as characterize the prevalence of TEs in many genomes. We provide
TESeeker as a web-based tool within a VirtualBox virtual appliance, while also
providing our representative TE library both within the virtual appliance and as a
separate download. While its local web interface automates the underlying logic,
each TESeeker step can be manually started through the command line, offering
additional flexibility. Additionally, we provide documentation and test cases to
evaluate the approach with the virtual appliance. The ability to automatically an-
alyze a genome alleviates the exhaustive, error-prone, and time-consuming task of
manually inspecting and manipulating results. The performance of the approach
varies, but is largely dependent on the length of the TE family (longer sequences
take longer to assemble) and its abundance in the genome.
CHAPTER 4
DESIGN AND PROOF-OF-CONCEPT PLAN FOR COMMUNITY
ANNOTATION OF TRANSPOSABLE ELEMENTS ON VECTORBASE
4.1 Introduction
This chapter presents our design and implementation plan for an online com-
munity annotation platform for TEs, designed to work in conjunction with Vec-
torBase [91, 92]. Although TEs often represent a very high percentage of genomic
data within a genome, repositories of TE data are lacking and remain unstandard-
ized, specifically for vectors of human pathogens. Moreover, existing TE reposi-
tories typically lack the user-friendliness that other genomic data is afforded. For
example, at NCBI [104], there is an extensive amount of information available
for genes; one can search for them in multiple ways, as well as visualize and eas-
ily browse many details. In comparison, the primary online resources for TEs,
namely TEfam [134], RepBase [119], and WikiPoson [148], offer far less informa-
tion. TEfam allows users to submit TEs and some associated information about
them, but only supports four organisms and offers no structural display options
for TEs. However, TEfam does have the capability for users to submit detailed
TE information, such as the amino acid open reading frames (ORFs), terminal
inverted repeats (TIRs), long terminal repeats (LTRs), and target site duplica-
tions (TSDs). RepBase has a much larger database of TEs for many organisms
but simply offers them in the standard FASTA and EMBL file formats, most often
only in nucleotide form. Both sites have a review process for TE submission,
yet neither offers any kind of structural display. Additionally, neither offers
community feedback concerning the hosted TEs. The third site, WikiPoson, offers
user feedback in the form of a MediaWiki [99]. While not standardized and quite
limited, WikiPoson offers researchers the ability to submit information and offers
classification guidelines.
The ability to store and visualize the structure of a TE, as well as allow for
community feedback is important for several reasons. First, it allows researchers
unfamiliar with certain types of TEs to be quickly exposed to other types. Second,
the opportunity for user feedback is critical. Much of the existing TE research
is not moderated. The ability for feedback from outsiders increases credibility.
Lastly, a moderated repository for TE data would encourage more standardized
research of TEs.
Our design and implementation plan utilizes the existing VectorBase [91, 92]
bioinformatics resource center for vectors of human pathogens. VectorBase stores
genomic data for a variety of insects and offers the capability to browse and display
genomic data, run scientific analysis tools, obtain information about the organ-
isms, and to provide community feedback. VectorBase also provides a means for
community annotation of genes. It currently does this through a user-submitted
form for the gene-related data. This data is then reviewed and, if approved,
added to an online database. Currently, the data is accessible for all researchers.
Additionally, the data can be browsed on VectorBase through Ensembl’s Genome
Browser [43].
Our plan builds upon VectorBase’s manual annotation of genes by adding the
ability to describe TEs. We utilize and adapt their core methods and extend them
to work with the unique structure of TEs. We utilize the Chado [26] database
schema to store the TEs and build a display based on the PHP GD library for
TE structural display, as Ensembl does not currently support the display of TEs.
Our primary goal is to complement the existing features of VectorBase with a
mechanism for the submission and display of TEs. This proposed system would fill
many gaps in the aforementioned existing systems and improve upon the quality
and spread of information across the scientific community.
4.2 Transposable Elements and the VectorBase Community Annotation Pipeline
Adding the capability to facilitate TEs through VectorBase’s community an-
notation pipeline (CAP) requires several changes. The following sections describe
how TEs would fit into the VectorBase CAP.
4.2.1 Similarities to the VectorBase Community Annotation Pipeline
Extending CAP to support TEs requires a number of steps. Fortunately,
many of the technologies developed for CAP can be utilized for TEs. From the
user standpoint, other than a TE-specific submission spreadsheet, the interface
and submission process would be nearly identical to that of CAP, making
submission easier for the user. Submissions would be expert-regulated, and the
TE information would be available in the Ensembl browser via DAS. A user would
follow these steps (summarized in Figure 4.1):
1. Download SubmissionForm.xls
2. Enter TE data
3. Upload SubmissionForm.xls
4. Wait for approval from curator
5. If approved by the curator, data goes live; otherwise, it must be modified by user
The TE-specific submission spreadsheet for Class II TEs is summarized below
(some fields, such as ORF, can have multiple instances). A portion of the
spreadsheet is shown in Figure 4.2. Researchers are able to enter as many TE
instances as they would like for a given submission.
Transposon Symbol        Unique name for the TE.
Family Name              Family to which the TE belongs.
Organism                 Organism in which the TE was found.
Transposon Description   Description of the TE, as well as any unique properties.
DNA Transposon           Complete sequence data for the TE.
Target Site Duplication  Target site duplication for the TE.
5’ Start                 Genomic location where the transposon starts.
3’ End                   Genomic location where the transposon ends.
Strand                   Strand on which the TE was found.
5’ TIR                   TIR on the 5’ end.
5’ TIR Start             Genomic location where the 5’ TIR starts.
3’ TIR                   TIR on the 3’ end.
3’ TIR Start             Genomic location where the 3’ TIR starts.
ORF                      ORF sequence data.
ORF Start                Genomic location where the ORF starts.
Figure 4.1. Client-side TE Submission Process. Similar to the VectorBase CAP,
researchers download, fill-out, and submit a TE-specific submission form.
Approved data goes online and is eventually incorporated into the Ensembl
browser.
Figure 4.2. TE Submission Form. The figure above shows a snippet of the
spreadsheet to submit TEs. Researchers can submit as many TEs at a time as they
desire.
There are advantages to using an offline form. First, researchers can keep their
information stored locally, making it easy to edit or add to their data. Second,
our form allows researchers to enter as much data as they have for as many TEs as
they have, eliminating the burden of manually going through an online submission
process multiple times.
Once the form has been submitted, phpExcelReader [108] is used to parse the
data. phpExcelReader is an open-source PHP library that works with Excel files.
The library is used to read the file’s contents and display what it has parsed out. If
the researcher clicks approve, the data is inserted into the Chado database with a
status of “under review.” This is the same basic technique used by the VectorBase
CAP.
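The parse-then-stage step described above can be sketched as follows. This is a minimal illustration in Python rather than the PHP used by the pipeline. The row dictionaries stand in for parsed spreadsheet rows; the function names, the choice of required fields, the accepted strand symbols, and the in-memory lists are all assumptions made for the example and are not taken from the CAP code.

```python
# Hypothetical subset of spreadsheet columns treated as mandatory in this sketch.
REQUIRED_FIELDS = ("Transposon Symbol", "Family Name", "Organism",
                   "5' Start", "3' End", "Strand")


def is_blank(value):
    """True when a spreadsheet cell is empty or contains only whitespace."""
    return value is None or str(value).strip() == ""


def validate_row(row):
    """Return a list of problems found in one parsed spreadsheet row."""
    problems = ["missing " + name for name in REQUIRED_FIELDS
                if is_blank(row.get(name))]
    if problems:
        return problems
    for name in ("5' Start", "3' End"):
        try:
            int(row[name])
        except (TypeError, ValueError):
            problems.append(name + " is not an integer")
    if row["Strand"] not in ("+", "-"):  # assumed strand encoding
        problems.append("Strand must be + or -")
    return problems


def stage_submission(rows):
    """Split rows into staged records (marked 'under review') and rejects."""
    staged, rejected = [], []
    for number, row in enumerate(rows, start=1):
        problems = validate_row(row)
        if problems:
            rejected.append((number, problems))
        else:
            staged.append({**row, "status": "under review"})
    return staged, rejected
```

Showing the rejected rows back to the researcher before anything is stored mirrors the "display what was parsed, then approve" flow described in the text.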
4.2.2 Differences from the VectorBase Community Annotation Pipeline
Extending the VectorBase CAP to support TEs also requires several changes.
While the submission process is largely the same from the client side, thereby
facilitating submission in a user-friendly format, the aforementioned changes
(from the spreadsheet) are made when storing TEs in Chado. Additionally, the
spreadsheet allows for the submission of TE instances, the coordinates of which
are also provided. As a result, and unlike the alignment of genes, exonerate will
not be required to align the TE to the genome. The ability to submit consensus
TEs will also be supported. In this case, an alignment algorithm, such as BLAST,
will be used to generate instances of the consensus TE within the genome. This
data will be generated “on-the-fly” and could be displayed via a different DAS
track than the normally submitted and curated data.
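Generating instances from a consensus TE can be sketched as a filter over alignment hits. The example below assumes the alignment was run with BLAST's standard 12-column tabular output (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The `TEInstance` type, the function name, and the identity and coverage thresholds are hypothetical choices for illustration only.

```python
from typing import NamedTuple


class TEInstance(NamedTuple):
    consensus: str
    scaffold: str
    start: int      # leftmost genomic coordinate of the hit
    end: int        # rightmost genomic coordinate of the hit
    strand: str
    identity: float


def instances_from_blast(lines, consensus_length,
                         min_identity=80.0, min_coverage=0.8):
    """Turn tabular BLAST hits into candidate TE instances.

    Thresholds are illustrative assumptions, not values from the pipeline.
    """
    instances = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        qseqid, sseqid = cols[0], cols[1]
        identity = float(cols[2])
        qstart, qend, sstart, send = (int(c) for c in cols[6:10])
        coverage = (abs(qend - qstart) + 1) / consensus_length
        if identity < min_identity or coverage < min_coverage:
            continue
        # In tabular output a minus-strand hit is reported with sstart > send.
        strand = "+" if sstart <= send else "-"
        instances.append(TEInstance(qseqid, sseqid, min(sstart, send),
                                    max(sstart, send), strand, identity))
    return instances
```

Because the instances are derived rather than curated, keeping them in a separate structure (and, as the text suggests, a separate DAS track) avoids mixing them with reviewed submissions.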
4.2.3 Transposable Element Representation in Chado
We utilized Chado’s central module, the Chado sequence module, to store
information about TEs. The fundamental table within this module is the feature
table, which is used for describing biological sequence features. Chado defines
a feature to be a region of a biological polymer, which typically means a DNA,
RNA, or polypeptide molecule. A region can be the entire extent of the molecule
or a junction between two bases. Features can be classified according to ontology,
localized relative to other features, and form part, whole, and other relationships
with other features [54]. We store the model (in our case the consensus) of the
TE and all its functional parts in the feature table. This is accomplished by
identifying the relation of each functional part to the “main” consensus via the
feature_relationship table, as well as the location of the smaller parts within the
main part via the featureloc table. We also make a connection to existing data in
Chado. For example, in the feature table, we need to specify the organism where
the particular TE was found, so we utilize foreign keys connecting the TE, in this
case, to the organism, cvterm, and dbxref tables. The featureprop table is utilized
to set the status of the data (for example, “under review”). Figure 4.3 shows a
graphical representation of some of the tables we utilized.
[Figure 4.3: entity-relationship diagram. Tables shown with their columns:]
FEATURE (feature_id, dbxref_id, organism_id, name, uniquename, residues, seqlen, md5checksum, type_id, is_analysis, timeaccessioned, timelastmodified)
FEATURELOC (featureloc_id, feature_id, srcfeature_id, fmin, is_fmin_partial, fmax, is_fmax_partial, strand, phase, residue_info, locgroup, rank)
FEATURE_SYNONYM (feature_synonym_id, synonym_id, feature_id, pub_id, is_current, is_internal)
FEATURE_RELATIONSHIP (feature_relationship_id, subject_id, object_id, type_id, rank)
[Tables shown by name only: ORGANISM, DBXREF, CVTERM, FEATURE_CVTERM.]
[Legend: primary keys and foreign keys are marked in the diagram. Boxes with solid borders represent relations which we will build; boxes with dashed borders represent relations which we will reference in the existing VectorBase database.]
Figure 4.3. Entity-Relationship Diagram of Selected Chado Tables. The figure above shows selected changes to the Chado schema to account for
TEs.
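The representation described in this section can be sketched with a heavily simplified schema. The example below uses SQLite and keeps only a few columns per table; in particular, it stores feature types and relationship types as plain text, whereas Chado references cvterm rows through type_id. The table subset, the helper names, and the type strings are assumptions for illustration, not the schema changes made in this work.

```python
import sqlite3

# Simplified stand-ins for four Chado tables (assumed subset of columns).
SCHEMA = """
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY, uniquename TEXT UNIQUE,
    type TEXT, seqlen INTEGER);
CREATE TABLE featureloc (
    feature_id INTEGER REFERENCES feature,
    srcfeature_id INTEGER REFERENCES feature,
    fmin INTEGER, fmax INTEGER, strand INTEGER);
CREATE TABLE feature_relationship (
    subject_id INTEGER REFERENCES feature,
    object_id INTEGER REFERENCES feature, type TEXT);
CREATE TABLE featureprop (
    feature_id INTEGER REFERENCES feature, type TEXT, value TEXT);
"""


def store_te(conn, name, length, parts):
    """Store a consensus TE and its parts; parts are (name, type, fmin, fmax)."""
    cur = conn.cursor()
    cur.execute("INSERT INTO feature (uniquename, type, seqlen) VALUES (?, ?, ?)",
                (name, "transposable_element", length))
    te_id = cur.lastrowid
    cur.execute("INSERT INTO featureprop VALUES (?, ?, ?)",
                (te_id, "status", "under review"))
    for part_name, part_type, fmin, fmax in parts:
        cur.execute("INSERT INTO feature (uniquename, type, seqlen) VALUES (?, ?, ?)",
                    (name + ":" + part_name, part_type, fmax - fmin))
        part_id = cur.lastrowid
        # Relate the part to the consensus and locate it on the consensus.
        cur.execute("INSERT INTO feature_relationship VALUES (?, ?, ?)",
                    (part_id, te_id, "part_of"))
        cur.execute("INSERT INTO featureloc VALUES (?, ?, ?, ?, ?)",
                    (part_id, te_id, fmin, fmax, 1))
    conn.commit()
    return te_id


def parts_of(conn, te_id):
    """Return (uniquename, type, fmin, fmax) for each part, ordered by position."""
    rows = conn.execute(
        """SELECT f.uniquename, f.type, l.fmin, l.fmax
           FROM feature_relationship r
           JOIN feature f ON f.feature_id = r.subject_id
           JOIN featureloc l ON l.feature_id = f.feature_id
                            AND l.srcfeature_id = r.object_id
           WHERE r.object_id = ? AND r.type = 'part_of'
           ORDER BY l.fmin""", (te_id,))
    return rows.fetchall()
```

A query such as `parts_of` returns what a structural display needs: each functional part with its extent relative to the consensus sequence.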
Figure 4.4. TE Start and Submit Page. Here, users can select the TE they wish to view information about or submit a new TE for review.
4.2.4 Proof-of-Concept
A sample interface independent of VectorBase has been developed. This in-
terface allows for the submission of TE instances into a clone of the VectorBase
Chado database. The basic interface allows researchers to view or submit TEs,
shown in Figure 4.4. Basic information about the TE can be displayed, as in
Figure 4.5, including a structural display. The structural display uses the PHP GD
Graphics Library [48] to dynamically create a visualization of the TE, a sample
of which is shown in Figure 4.6. This data would eventually be made available
via a link in the Ensembl browser. Figure 4.7 depicts the configuration.
Figure 4.5. TE Details Page. This figure shows the information pertinent to each TE. It also has an option to display the TE structure,
shown in Figure 4.6.
Figure 4.6. TE Structure Page. Here, we show the display of the TE, to scale. The large center region is the open reading frame. On either end, one can see the terminal inverted repeats (TIRs).
Figure 4.7. Proof-of-Concept Configuration. Here, we show the configuration of the data flow and display, independent of VectorBase.
4.3 Design and Implementation Plan
Work performed independently of VectorBase on a clone of the Chado
database utilizes the general design described in the previous sections.
To implement the ability for the VectorBase CAP to accept TEs, the
following steps should be initiated:
1. Add TE-specific Submission Interface to VectorBase. This can be done through modifications (or edited duplications) of existing files. Namely, files such as UserTools.php, ManualModel.php, and SubmitAnnotation.php
could be used as templates to create files to allow for TE submission.
2. Import TEs into Chado. This is largely done through the CAP org.vectorbase.www.cap.importer package. The .java files in this package handle the parsing and insertion of the contents of the spreadsheet into Chado.
3. Export TEs to Ensembl Browser. The CAP org.vectorbase.www.cap.exporter package allows for the display of CAP data in the Ensembl browser. A link to the structural view of the TE would need to be provided through the Ensembl code.
The logic of much of the CAP code would remain unchanged, such as the usage
of hibernate. However, portions of the existing code that utilize exonerate
would not be necessary, and the mechanisms by which TEs are inserted into
Chado must follow the schema previously described. The VectorBase CAP has
many underlying caveats; the implementation is relatively straightforward, yet is
difficult to initially follow because of many interdependencies. Allowing for the
acceptance of TEs into the VectorBase CAP first requires the CAP to be restored
to working order and then edited and extended for TEs. The capability for users
to submit consensus TE sequences is something that should be performed once
the CAP allows for the acceptance and display of TE instances. Once this is
implemented, consensus sequences could be used in blastn searches against the
genome and the results dynamically displayed via a DAS track in the Ensembl
browser. Additionally, TEs would be expert-regulated on an organism-by-organism
basis, much like the CAP.
4.4 Conclusion
This chapter has described common annotation strategies as well as the tech-
nologies used in the VectorBase CAP. We have described the VectorBase CAP
in detail, and offered solutions to extending it to allow for TEs. As a proof-of-
concept, we have cloned the VectorBase Chado database and successfully accepted
user submissions of TEs from the web, while also parsing and inserting them into
Chado. The Chado database has also been used to generate a structural display
of the TE.
Our approach extends the VectorBase CAP to allow for TEs while utilizing
the technologies currently in place. Such an annotation system for TEs has not
been implemented to date, as current systems serve mainly as TE repositories,
offering no structural display or community feedback. The community annotation
of TEs complements the VectorBase CAP for genes while also strengthening the
utility of VectorBase.
CHAPTER 5
SIMULATION AND MODELING BACKGROUND1
5.1 Introduction
Simulations of real-world phenomena have the potential to be extremely valu-
able to researchers, particularly in the public health realm. Rather than relying
on complex equations that are the basis for many scientific models, agent-based
models (ABMs) rely on more natural behavioral rules [60]. This leads to a more
direct translation from natural phenomena to a simulation model. It is logical to
integrate spatial data into some simulation environments; however, as Gilbert [50]
pointed out, utilizing geographical information system (GIS) data for dynamic
agents is a difficult challenge that has not yet been adequately solved. Although
GIS data has successfully been integrated into ABMs for several years, running
complex simulations with thousands of GIS aware agents remains computationally
challenging. This chapter explores simulations, with an emphasis on ABM
and its utilization of GIS data.
1Portions of this chapter were previously reported in Kennedy [75].
5.2 Simulation and Modeling
A simulation is an imitation of a real-world process [10]. This imitation is
usually done with a computer through the use of a conceptual model. A conceptual
model generally refers to the computer representation of the system that a
researcher has chosen to model. The common goal of simulations is to accurately
represent the behavior of a real-world system while providing feedback and in-
sight in a manner that would otherwise be infeasible. For example, an experiment
that would take months to complete in the laboratory may take only hours or
days to complete with a computer simulation. Also, simulations are especially
useful for experiments that would be unethical in real life, such as infecting a population
with a pathogen. It is appropriate to think of simulations as parts of the scientific
method - we use them to help us check our assumptions or hypotheses as well as to
possibly predict future behavior. Sharing the same goal as the scientific method,
we utilize simulations to help us acquire new knowledge.
The literature presents a multitude of reasons why computer simulations are
valuable [10, 103, 128], a collection of which are listed below:
1. Simulations allow for the timely study of phenomena that would otherwise be impractical. For example, the evolution of a species over a long period of time can be simulated in far less time than the actual experiments would take to perform.
2. Simulations can model theoretical behavior that cannot be replicated in the laboratory. An example of this would be a simulation model that tracked the historical migration patterns of icebergs or continental drift.
3. Simulation inputs can be modified to determine the outcome or effect on a real-world system without harming the real-world system. This would be applicable if a researcher wanted to simulate the spread of a pathogen across a population without harming the population.
4. Experimentation with simulations can confirm understanding. For instance, a simulation model that mimics the population dynamics of a group of animals could allow researchers to examine particular entities of the model and follow them over time, thus furthering the understanding of the system.
5. Simulations can be used as prototypes for new experiments before real-world implementation. For example, a disaster recovery team could simulate any sort of disaster as well as their response tactics, allowing them to choose the best approach.
6. Modern systems are sometimes so complex that their internal workings can only be studied through simulations. Banks et al. [10] refer to a complex factory system in which the internal interactions are so complex that simulations offer the only solution.
As evidenced above, simulations are a powerful tool for researchers; however,
there are cases where a simulation would not be appropriate. Banks and Gibson
[8] list ten rules for when simulations should not be used. A sample of the more
meaningful ones for our purposes is summarized below:
1. Simulations should not be used when common sense can solve the problem or when the problem can be solved analytically in reasonable time.
2. Simulations should not be used when the cost of developing the simulation model exceeds the cost of experimentation.
3. Simulations are not useful when system behavior is too complex or unknown.
5.2.1 Advantages and Disadvantages
There are many advantages to using simulations for scientific study [10]. Aside
from the fact that simulations allow one to model the behavior of a real-world
system without harming or altering the real-world system, simulations typically
run and produce results faster than the real-world system being studied, if such
a system exists. Additionally, simulations are useful in testing the influence of
different variables both on the system as a whole and in regard to one another.
Furthermore, a simulation is helpful when performing hypothetical tests or when
testing situations that would be unethical or impractical in the real world.
Simulation studies also have some inherent disadvantages. Banks et al. [10]
list four specific disadvantages. Namely, simulation models are difficult both to 1)
build and to 2) interpret. While true to an extent, experienced programmers will
find model-building manageable. In addition, 3) interpreting and analyzing the
results of a simulation may take some time, but, in many cases, this amount of
time will be less than if the scientist had done the actual real-world experiment.
Lastly, 4) simulations are sometimes incorrectly used when analytical solutions are
more practical. Although valid disadvantages to using a simulation exist, building
a simulation can be extremely useful to scientists as long as the simulation fulfills
the requirements previously mentioned.
5.2.2 Building a Simulation Model
We have already described simulations as being built upon a model. In most
cases, scientists start with a conceptual model, or a model with which they intend
to accurately represent the system they are studying. This conceptual model
typically goes through many phases and revisions as the simulation is being built.
Often, scientists will recognize a problem with their conceptual model or discover a
way to improve it and then implement the change. Once the scientist has sufficient
confidence in the conceptual model, it will transition into simply being called the
model, which will be used as the representation of the system the scientists are
studying. This representation is for the study of the system through simulation.
Building a model that exactly matches a real-world phenomenon
is extremely difficult, if not impossible.
Inherent randomness often appears in simulation models. While many factors
cause this randomness, the main cause is that real-world systems are far too com-
plex to accurately and comprehensively represent through a computer simulation
model. First, randomness is included in simulation models to cover our limited
understanding or uncertainty. In many of the systems we model, we have little idea
about the underlying mechanics. We build simulation models to try to help us to
understand these characteristics and to experiment with them. If done properly,
we will learn about real-world systems through our simulation models. Second,
randomness is included for decision-making. If we have a simulation that models
ants foraging for food, we have to give the ants the ability to make decisions.
If the simulation continuously prompted the entities to perform the same action
or encounter the same obstacles each time, nothing would be learned after the
first run; randomness is included for this purpose. In the ants example above,
the introduction of a random walk would add realism to the model. Lastly, mea-
surement error or quantum effects are accounted for by randomness. Simulations
cannot have the precision of real-world systems because of both the limitations
of computers and of our own knowledge. They also cannot represent entities as
accurately as a real-world system, so we include an inherent randomness. These
examples are not meant to be looked at as limitations of simulation models, but as
reasons why simulation models are created the way they are. In fact, this random-
ness is part of what makes simulations unique and powerful, while representing
how the world actually operates too.
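The random walk mentioned for the ant example can be sketched in a few lines. This is an illustrative sketch only: the four-neighbor move set, the function name, and the use of an explicit seed are assumptions made here, not details of any model described in this dissertation.

```python
import random

# Hypothetical move set: one grid cell north, south, east, or west per step.
MOVES = [(0, 1), (0, -1), (1, 0), (-1, 0)]


def random_walk(steps, seed, start=(0, 0)):
    """Return the list of grid positions visited by one randomly walking agent."""
    rng = random.Random(seed)  # a private generator keeps runs reproducible
    x, y = start
    path = [(x, y)]
    for _ in range(steps):
        dx, dy = rng.choice(MOVES)
        x, y = x + dx, y + dy
        path.append((x, y))
    return path
```

Fixing the seed makes a single run repeatable for debugging, while varying the seed across runs exposes the range of behavior that randomness is meant to capture.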
5.2.3 Simulation Model Types
The literature [10, 90] has divided simulation models into the following three
overlapping subcategories:
Static vs. Dynamic: Static simulation models are representative of a system at
a specific time. An example of a static system is one that solves complex
analytical problems that are infeasible with other methods. Dynamic simu-
lation models are representative of a system over time, such as population
dynamics.
Deterministic vs. Stochastic: Deterministic simulation models produce results
determined by the provided inputs. In such simulation models, probability
does not play a role. An example would be a simulation that models a
student going to class at a specified time every day. Stochastic simulation
models involve random variables and produce different results with each
random seed. Our model with the student would be stochastic if we add a
certain probability as to when and whether the student will arrive to class.
Continuous vs. Discrete: Continuous simulation models characterize systems
constantly over time. An example would be the population dynamics in
a predator-prey simulation model. Discrete simulation models characterize
systems at specific points in time. An example would be people paying tolls
at a toll booth.
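The student example used for the deterministic and stochastic categories can be made concrete with a short sketch. The arrival time in minutes after midnight, the attendance probability, and the maximum delay are hypothetical values chosen only to illustrate the distinction.

```python
import random


def deterministic_arrivals(days, class_minute=540):
    """The student arrives at the same minute every day; inputs fix the output."""
    return [class_minute] * days


def stochastic_arrivals(days, seed, class_minute=540,
                        attend_probability=0.9, max_delay=15):
    """The student may skip class (None) or arrive up to max_delay minutes late."""
    rng = random.Random(seed)
    arrivals = []
    for _ in range(days):
        if rng.random() < attend_probability:
            arrivals.append(class_minute + rng.randint(0, max_delay))
        else:
            arrivals.append(None)
    return arrivals
```

The deterministic version returns the same list on every call, whereas the stochastic version returns the same list only when the same seed is reused.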
For the purposes of this study, we further classify simulation models into the
following subcategories:
Agent-based vs. Equation-based: Agent-based simulation models have indi-
vidual entities, called agents, that drive the simulation. They are good
at modeling systems with emergent properties. Equation-based simulation
models are adept at modeling mathematically based phenomena. We next
elaborate on these two subcategories.
5.2.4 Agent-based Modeling
Agent-based simulations, also known as individual-based simulations, have
recently gained popularity [121] and are proving to be very powerful. In an agent-
based simulation, an agent can be thought of as any acting component in the
system. Each agent is treated as an entity, having its own properties and behav-
iors. These can be influenced by a variety of factors, including the environment
and other agents. The interactions between agents and their environment over
time often lead to emergent properties within the system. Time is typically rep-
resented in the form of time steps; namely, each agent usually has a chance to
change its properties and interact with other agents and the environment once
every time step. A time step can represent any amount of time. An advantage
of agent-based simulations is that they are easily extensible; adding agents to the
model is a well-defined process. Additionally, agent-based simulations are rather
intuitive to code, as they are modeled in the same manner that we tend to think
about systems. Agent-based models have been applied to many areas, including
social network models and models of pathogen spread. Our model, named LiNK,
is an agent-based model.
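The time-step structure described above can be sketched as a minimal agent-based loop. This toy example is not LiNK: the grid size, the move set, the rule that infection passes between agents sharing a cell, and all names are assumptions chosen to keep the sketch short.

```python
import random
from dataclasses import dataclass


@dataclass
class Agent:
    x: int
    y: int
    infected: bool = False


def step(agents, size, rng):
    """One time step: every agent moves, then agents interact with co-located agents."""
    for agent in agents:
        dx, dy = rng.choice([(0, 1), (0, -1), (1, 0), (-1, 0), (0, 0)])
        agent.x = min(max(agent.x + dx, 0), size - 1)  # stay on the grid
        agent.y = min(max(agent.y + dy, 0), size - 1)
    infected_cells = {(a.x, a.y) for a in agents if a.infected}
    for agent in agents:
        if (agent.x, agent.y) in infected_cells:
            agent.infected = True


def run(num_agents=50, size=10, steps=20, seed=1):
    """Run the toy model and return the infected count after each time step."""
    rng = random.Random(seed)
    agents = [Agent(rng.randrange(size), rng.randrange(size))
              for _ in range(num_agents)]
    agents[0].infected = True  # a single index case
    history = []
    for _ in range(steps):
        step(agents, size, rng)
        history.append(sum(a.infected for a in agents))
    return history
```

The infected count returned by `run` is not coded anywhere as a rule; it emerges from repeated local movement and contact, which is the sense in which agent-based models exhibit emergent properties.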
5.2.5 Equation-based Modeling
Equation-based simulations are more mature than agent-based models and
are adept at modeling systems governed by underlying mathematical properties
or formulas. This is somewhat of a limitation, as more complex systems that
cannot be approximated by equations are difficult to model. Also, changing overall
properties of an equation-based simulation is often difficult, as it may require a
new mathematical model. However, modifying parameters in an equation-based
simulation is relatively simple. In this respect, equation-based simulations are
rather simple and straightforward. In general, equation-based simulations are
very good at modeling known systems with aggregate behaviors or systems simply
governed by mathematical rules.
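As a point of comparison, an equation-based model of aggregate behavior can be sketched with the classic susceptible-infected-recovered (SIR) equations, integrated here with a simple forward Euler scheme. The SIR model is used only as a familiar illustration and is not a model from this dissertation; the parameter names and the integration scheme are assumptions.

```python
def sir_euler(s, i, r, beta, gamma, dt, steps):
    """Integrate dS/dt = -beta*S*I/N, dI/dt = beta*S*I/N - gamma*I, dR/dt = gamma*I."""
    n = s + i + r
    series = [(s, i, r)]
    for _ in range(steps):
        new_infections = beta * s * i / n * dt
        new_recoveries = gamma * i * dt
        s = s - new_infections
        i = i + new_infections - new_recoveries
        r = r + new_recoveries
        series.append((s, i, r))
    return series
```

Changing `beta` or `gamma` is a one-argument change, which illustrates why modifying parameters is simple; adding individual behavior such as movement would require a different set of equations.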
5.3 Geographic Information Systems
A GIS is a system that is used to manipulate and store spatial data. For example,
a GIS could consist of a map of the counties of Michigan and the corresponding
population data. Coupled with the proper software, users could query the data
for counties with a population greater than 10,000 or for counties with an area
larger than 500 square miles. Applications of GIS technology span many fields,
including environmental impact assessment, scientific investigations, urban plan-
ning, and resource management [52]. ArcGIS [4], GRASS GIS [56], and Quantum
GIS [114] are several popular GIS software tools. GIS data is usually stored in
either raster or vector format; next, we elaborate on each format. Figure 5.1
visually compares raster and vector data on a portion of Bali, Indonesia.
5.3.1 Raster Data
Raster data is characterized as a collection of pixels, or cells. Many cells make
up a single raster file. These cells are stored in a matrix-like manner, namely in
rows and columns. Each cell has its own attributes and associated data. Raster
files are generally less computationally expensive than vector files. However, they
require more storage space.
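The low query cost of raster data comes from the fact that a coordinate maps to a cell by arithmetic alone. The sketch below assumes a raster stored as a list of rows, an origin at the lower-left corner, and rows that increase with the y coordinate; real raster formats often place the origin at the top-left, so these conventions and the function names are assumptions.

```python
def cell_index(x, y, origin_x, origin_y, cell_size):
    """Map a coordinate to (row, col) using floor division; no search is needed."""
    col = int((x - origin_x) // cell_size)
    row = int((y - origin_y) // cell_size)
    return row, col


def value_at(raster, x, y, origin_x=0.0, origin_y=0.0, cell_size=100.0):
    """Return the cell value at (x, y), or None when the point is off the grid."""
    row, col = cell_index(x, y, origin_x, origin_y, cell_size)
    if 0 <= row < len(raster) and 0 <= col < len(raster[0]):
        return raster[row][col]
    return None
```

The lookup cost is constant regardless of how complex the underlying landscape is, at the price of storing a value for every cell.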
5.3.2 Vector Data
Vector data is coordinate-based and usually represents data as points, lines,
or polygons. Vector files can more realistically represent spatial data in smaller
storage space than raster files. GIS data is collected as coordinates, so there is
typically much more precision when compared to raster files. Querying complex
polygon-based vector files can be expensive, so the data is often approximated.
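The cost of querying polygons, and one common way of approximating them, can be sketched as follows. The exact test uses the standard ray-casting (even-odd) rule, whose cost grows with the number of polygon vertices; the approximation replaces the polygon with its bounding box. Both functions are generic illustrations and are not the approach used later in this dissertation.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count edge crossings of a ray running in the +x direction."""
    inside = False
    n = len(polygon)
    for k in range(n):
        x1, y1 = polygon[k]
        x2, y2 = polygon[(k + 1) % n]
        if (y1 > y) != (y2 > y):  # the edge straddles the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def bounding_box(polygon):
    """Return (min_x, min_y, max_x, max_y) for a list of (x, y) vertices."""
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def in_bounding_box(x, y, box):
    """Constant-time approximation of the polygon test."""
    min_x, min_y, max_x, max_y = box
    return min_x <= x <= max_x and min_y <= y <= max_y
```

The bounding box can be computed once per polygon and answers most negative queries immediately, but it accepts points that lie outside the true shape, which is the loss of precision that approximation implies.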
5.4 Integrating Geographic Information System Data into Agent-based Modeling
There have been several studies in which ABM has been combined with GIS
data [25, 29, 51, 66, 146]. Few of these models have focused on infectious dis-
eases, while even fewer have agents that intelligently move based on their current
environment. Castle et al. [25] mention numerous toolkits and applications for
coupling ABM and GIS yet fail to go beyond the incorporation of GIS data into a
model and into the realm of its effective use. Crooks [29] more deeply describes the
realm of space within ABM and offers example applications but does not specif-
ically address the underlying issue of how agents can most efficiently access GIS
data. Anwar et al. [3] describe a model built upon GIS data, but one that does
not directly query it. Some models imply space, such as NOSOSIM [135], but
few dynamically interact with GIS data. Gimblett [52], Keeling et al. [74], and
Brown et al. [18] describe aspects of the integration of ABMs and GIS data, but
do not go into detail regarding approaches to efficiently create GIS aware agents.
Moreover, standard means of linking agents with GIS data are computationally
expensive and are therefore not feasible for complex, large-scale simulation mod-
els. In many cases, only particular parts of a GIS are necessary for an ABM;
utilizing a feature-rich GIS toolkit, such as ArcGIS [4], at simulation run-time is
not typically advisable. We aim to advance the field through efficient and fast
approaches to dynamically working with GIS data within an ABM.
(a) Raster Data
(b) Vector Data
Figure 5.1. Panels (a) and (b) show the northwest corner of Bali as represented by a raster and a vector file. Here, the granularity of the
raster file is not as precise as the vector file.
5.5 Summary
This chapter has introduced simulation and modeling techniques, while focus-
ing on ABM. We have also discussed GIS and its popular data formats, offering
advantages and disadvantages. The difficulties in integrating GIS data with an
ABM have also been described. Chapter 6 describes our simulation that integrates
ABM and GIS data to model pathogen spread.
CHAPTER 6
A GIS AWARE AGENT-BASED MODEL OF PATHOGEN TRANSMISSION1
6.1 Introduction
In this chapter, we describe an epidemiological model that incorporates spa-
tial data as an influence to agent behavior and pathogen spread. In particular,
we create an epidemiological model to simulate pathogen spread amongst long-
tailed macaque monkeys, Macaca fascicularis, on the Indonesian island of Bali.
GIS data is incorporated into our simulation, and we offer insight on how to ef-
ficiently integrate GIS data into a model, depending on the model’s complexity
and needs. We note optimizations made along the way and compare our methods
to conventional approaches. We conclude with results for our model. This work
is performed with global public health goals in mind and could also be applied to
model infectious diseases carried by arthropod vectors.
1Results from this chapter have appeared in Kennedy et al. [76, 79].
6.2 LiNK Simulation Model
We have created a model, the implementation of which is named LiNK after
its creators (Lane, Niederweiser, and Kennedy) and further described in Lane
[88], to aid in the understanding of pathogen transmission patterns. This model
was designed to simulate the spread of infection amongst long-tailed macaques,
shown in Figure 6.1, on the Indonesian island of Bali. We have coupled detailed
GIS data with a detailed understanding of the macaque population to create
a rich simulation. LiNK is deployed on a computing cluster at the University
of Notre Dame [142]. Development of the model has been performed through
the interdisciplinary collaboration of biologists, anthropologists, and computer
scientists.
6.2.1 Model Background
Several zoonotic diseases have recently emerged on the Asian landscape; macaques
have been implicated as both hosts and reservoirs in these disease emergences in
humans. Anthropogenic landscape changes have increased the incidence of human
to non-human primate interaction, potentially leading to bi-directional pathogen
transmission events [32, 41, 88]. In our model, we evaluate how landscape changes
might influence pathogen transmission patterns, based on the behavior and dis-
persal patterns of long-tailed macaques across the island of Bali. Our long-term
aim is to answer the following research questions:
1. What are potential rates and routes of pathogen transmission in macaques across the island?
2. How do pathogen life history parameters impact this transmission?
3. Do the answers change with the inclusion of humans as a component of the landscape?
Landscape plays a very important role in these questions, necessitating the use
of GIS data in our simulation.
A unique system of temples, one of which is shown in Figure 6.2, has existed
on Bali for centuries; these temples and their associated forests act as refugia for
the large populations of long-tailed macaques [47, 147]. The island itself is fairly
small at 130 km × 80 km, yet it is an ideal size for study. Each of its roughly 40
temple populations consists of between 30 and 400 macaques. Existing behavioral
and preliminary genetic evidence has documented the matrifocal society of the
macaques, resulting in strong female philopatry [42, 46, 47, 88]. Females remain
at their natal (birth) temples throughout their lives, and social status is inherited
maternally. Typically, subdominant and subadult males disperse from their natal
temple at around age seven, traveling to non-natal temple populations. Actual
dispersal distances and rates are unknown.
Figure 6.1. Adult Female Macaque and Infant. Photo courtesy of A. Fuentes.
The ability of long-tailed macaques to coexist with humans has enabled a
number of macaque populations to thrive in areas where other primate species
have become extinct [47]. On Bali, human land-use patterns have resulted in a
mosaic of riparian forest, small forest patches, agricultural lands, and urban areas
across much of the island. The broad distribution of macaque populations on
Bali suggests that the macaques are utilizing the human modified landscape as
it currently exists. Due to the protection and resource availability at temples,
macaques are able to thrive in moderately high densities alongside high density
human populations. This co-existence, particularly surrounding the temples, has
created an ideal study environment for evaluating how both primate behavior and
anthropogenic landscape changes influence pathogen transmission [41].
6.2.2 Conceptual Model
The conceptual model was developed by K.E. Lane, with support from A. Fuentes
and H. Hollocher. This research group has closely studied macaques and an ar-
ray of pathogens for a number of years. The basic model consists of a display of
Bali with temple sites and macaques. Users can also view the contents of a given
temple and provide multiple model and pathogen parameter options. We next
introduce the core components of our model and discuss them in greater detail in
Section 6.2.3.
Figure 6.2. Uluwatu Temple Site. This image shows the southern Balinese temple at Uluwatu.
Agents: Our agents are macaques, each with their own properties, such as loca-
tion, sex, age, natal temple, and infection status. Macaques move according
to their surrounding environment, and males have the ability to enter and
leave temples. Our model can support thousands of agents, easily support-
ing the roughly 10,000 macaques on Bali. We show a simplified transition
diagram for the life cycle of our macaques in Figure 6.3.
Behavior: Macaques have the ability to move through their environment, inter-
act with other macaques, reproduce, and die. Movement is dictated by their
surrounding environment; macaques query their neighborhood and move
appropriately. Macaques within a temple move randomly, with no GIS in-
fluence. All macaques have the ability to carry pathogens and can transmit
pathogens when within a specified distance of one another. Reproduction is
handled by allowing female macaques to produce offspring, with inherited
traits, after they reach a specified age. As macaques age, they have a higher
probability of dying.
Interface: Researchers interact with the model through a simple control panel,
shown in Figure 6.4, that allows them to modify simulation parameters.
Once the parameters are set, the user can begin running the simulation.
The simulation is displayed using OpenMap [107] and is shown in Figure 6.5.
Users can also see macaques within temples, as shown in Figure 6.6.
Figure 6.3. Life Cycle Transition Diagram. Macaques are always born in temple sites. Female macaques spend their entire lives within their natal temple. Mature male macaques disperse throughout the island through varying landscape with the ability to join other, non-natal,
temples.
Figure 6.4. LiNK Control Panel. Here, we show the parameters a user can modify when running a simulation. GIS layers can be enabled or
disabled, and pathogen parameters can be set.
Pathogens: LiNK has the ability to simulate a wide array of pathogens through
the incorporation of several important pathogen parameters. The infectivity
parameter refers to how infectious the pathogen is, while infectiousness is
how close a macaque must be to another macaque to be able
to transmit the pathogen. Latency represents how long a macaque takes to
become symptomatic after becoming infected, and virulence represents the
deadliness of the pathogen. Acquired immunity refers to the amount of time a
macaque is immune to contracting a pathogen after having been previously
infected. Clearance time is the amount of time a macaque takes to be
cleared of a pathogen. Finally, natural resistance represents the proportion
of macaques that are immune to a given pathogen. Selected pathogen-
related variables and their temporal relationships are shown in Figure 6.7.
A transition diagram for these variables is shown in Figure 6.8. LiNK has
the ability to model one unique pathogen during a given simulation run.
Space The macaques move about on 2D grids that represent temple sites and
the island. The island grids are extrapolated from GIS data, at a customized
granularity. For our purposes, a grid cell has sides of roughly 100m, leading
to over one million possible locations. Each grid is called a layer; we have a
total of eight landscape layers: cities, forests, lakes, rice fields, rivers, roads,
temples, and the actual island (called coast). We have three additional
layers that serve as buffers that represent the impact of humans and water
on infectivity. These eleven layers are melded together and use the same
coordinate system. The coast and temple layers are required, while the
remaining layers can be turned on or off.
Time One time step in our simulation corresponds to 12 real-world hours.

Figure 6.5. LiNK Display. The figure above shows the display of our simulation.
Here, we have the forests, lakes, and rivers layers enabled, as well as the
actual island and the temple sites. Temples are shown as squares on the map.
Green temples have no pathogens, while red temples have pathogens present
within. Macaques are shown on the island as circles; they are green if they are
healthy, pink if they are infected and not symptomatic, and red if infected and
symptomatic. This screen capture has one infected temple and several infected
macaques.

Figure 6.6. Temple Site Display. Here, we show the interior of a temple site.
Male macaques are shown as solid circles and females as hollow circles.
Macaques are green if healthy, pink if infected and not symptomatic, and red if
infected and symptomatic. The user can choose which temple site to display at
run-time.

Figure 6.7. Temporal Relationship of Pathogen Parameters and Related Events.
The diagram above shows the relationships of the pathogen parameters in our
simulation. Depending on the parameters used, macaques can become permanently
immune to the modeled pathogen.

Figure 6.8. Pathogen Transition Diagram. Macaques generally begin as
susceptible and then transition to other states after being infected. Macaques
with a symptomatic infection can become reinfected and macaques can reinfect
themselves (autoinfection). An acquired immunity is gained after most
infections, but may be lost after a given amount of time.

6.2.3 ODD Protocol Description of LiNK
Grimm et al. proposed [58] and recently updated [59] a protocol to describe
agent-based models, the ODD protocol, that consists of 1) the model Overview, 2)
Design concepts, and 3) Details. The overview block consists of the purpose, state
variables and scales, and process overview and scheduling elements. The details
block is further divided into the initialization, input, and submodels elements.
This section describes LiNK according to the original ODD protocol.
6.2.3.1 Purpose
The purpose of the LiNK simulation model is to help understand the effect of
landscape on the spread of pathogens among macaque monkeys on Bali, Indonesia.
6.2.3.2 State Variables and Scales
The spatially explicit model consists of agents representing macaque monkeys
on the island of Bali, Indonesia. ESRI shapefiles serve as the backbone for the
GIS in the model. Layers representing landscape include cities, forests, lakes,
rice fields, rivers, and roads, as well as the island of Bali. We also utilize a
layer representing the geographic location of 42 temple sites on the island and 3
additional layers we created that serve as buffers, namely to represent the impact
of humans and water on infectivity. We abstract the shapefiles to a grid-based
system on which movement amongst the layers is probability-based and relies, by
default, on a Moore neighborhood, which is discussed further in Section 6.3.5.1.
At each time step, macaques evaluate potential new positions, noting their current
landscape and directional bias. Each new position is assigned a value, which is
then normalized. The macaque then has the opportunity to move. Each range of
values in Table 6.1 represents a weighted probability that a macaque will move
from one landscape to another.

TABLE 6.1
MOVEMENT VALUES FOR DISPERSING MACAQUES

From \ To     City    Coast   Forest   Lake   Rice Field   River   Road
City          10-30   15-20   40-70    0      10-30        15-45   0-20
Coast         5-20    20-30   40-70    0      10-30        15-45   5-20
Forest        5-20    15-20   10-30    0      10-30        15-45   0-20
Rice Field    5-20    15-20   40-70    0      15-40        15-45   0-20
River         5-20    15-20   40-70    0      10-30        20-55   0-20
Road          5-20    15-30   40-70    0      10-30        15-45   5-30

State variables for LiNK are described in Table 6.2. A time step of 12 hours
was chosen in conjunction with a grid cell size of 111 meters to obtain the appro-
priate level of precision based on our knowledge of macaque behavior. Movement
probabilities were also chosen in accordance with studied macaque behavior.

TABLE 6.2
STATE VARIABLES

Variable                      Value
Model
  Dispersal Deaths per Day    7.14E-4 (2% every 2 weeks)
  Autoinfection               True
  Initial Infected Temples    1
  Natural Resistance          1% of population
  Temples                     Temples populated with realistic numbers
  Time step                   12 hours
  Grid cell size              111 meters
Macaque
  Sex                         Temples: 75% female, 25% male
                              Dispersing: 100% male
  Age                         50% adult (8-18y male; 8-20y female),
                              50% juvenile (0-8 yrs)
  Latitude                    Random within island bounds
  Longitude                   Random within island bounds
  Natal Temple                Random
  Directional Bias            Random
  Current Landscape           Based on latitude and longitude
  Infected                    True if infected
  Sick Steps                  Number of time steps infected to date
  Symptomatic                 True if symptomatic
Pathogen
  Infectiousness              1 grid cell
  Infectivity                 10 (0-100 range)
  Virulence                   80 (0-100 range)
  Clearance Time              28 time steps
  Natural Resistance          1% of population
  Latency                     4 time steps
  Acquired Immunity           120 time steps

6.2.3.3 Process Overview and Scheduling
The LiNK model is event-driven. At each time step, a specified number of
events are scheduled and executed, macaque by macaque. Macaques are handled
in two groups: those dispersing and those within temples.
Dispersing macaques are processed first. We begin by incrementing the macaque’s
age and then allowing the macaque to move according to the movement function.
Next, each infected macaque has the opportunity to transmit infection and to
die from infection. Death is also possible as a result of dispersal deaths per day,
virulence, or macaque age. Finally, each dispersing macaque has the opportunity
to enter a temple, depending on his proximity to it.
Within temples, the process is similar. We begin by aging the macaques
and next remove them if their age or sickness meets appropriate standards. If
a macaque’s previous coordinates exceed those of the temple bounds and if the
macaque is a male of appropriate age, the macaque leaves the temple according to
a given probability. Female macaques have a 25% chance to give birth annually
from 3-13 years of age. Finally, we simulate the pathogen and randomly move
macaques within the temples.
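
As a small worked example of the birth rule, an annual 25% chance can be applied at 12-hour time steps by converting it to an equivalent per-step probability. This is only a sketch under that assumption: the text does not state how LiNK applies the annual rate, and the function names and the 365-day year are hypothetical choices made for illustration.

```python
import random

STEPS_PER_YEAR = 2 * 365  # two 12-hour time steps per day (assumed 365-day year)


def per_step_probability(annual_probability: float, steps: int = STEPS_PER_YEAR) -> float:
    """Per-step probability that yields the same chance of at least one event per year."""
    return 1.0 - (1.0 - annual_probability) ** (1.0 / steps)


def may_give_birth(age_years: float, is_female: bool, rng: random.Random,
                   annual_probability: float = 0.25) -> bool:
    """Hypothetical birth check: only females aged 3 to 13 years are eligible."""
    if not is_female or not (3.0 <= age_years <= 13.0):
        return False
    return rng.random() < per_step_probability(annual_probability)


p_step = per_step_probability(0.25)
```

With these assumptions the per-step probability is a little under 0.0004, and compounding it over 730 steps recovers the 25% annual chance.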
6.2.3.4 Design Concepts
Emergence Influenced by the landscape, patterns of disease spread across Bali
emerge over time.
Sensing Macaques know their current and surrounding landscape, which they
use to make movement decisions.
Interactions Macaques interact with other macaques only to transmit pathogens.
When a macaque is within the ring of infectiousness of an infected macaque,
it has the possibility to become infected.
Stochasticity Survival in the model is stochastic; pathogens and the dispersal
death rate directly affect survival rate. Movement is also probability-based.
Certain landscapes are more desirable than others, and macaques move with
a directional bias, both of which factor into movement decisions. Births are
stochastic such that females between the ages of 3 and 13 have a 25% chance
to give birth each year. The sex of the offspring has an equal
chance of being male or female. Finally, macaques located within temples
often attempt to move beyond the bounds of the temple. This is permissible
only a small percentage of the time and only for males of a specified age.
Observation Data is collected based on events. Namely, each infection, death,
birth, and transition between a temple and the landscape is recorded in
the output file. The model is observed through its GUI (graphical user
interface) and also through analysis of the output file. We have written a
separate program named LiNKStat (described in Section 6.5.1) that presents
and performs basic analysis of the output.
6.2.3.5 Initialization
Upon initialization, several things are constant. First, the number of macaques
within each temple site is always the same and is based on scientific data. Sec-
ond, the landscape layers available are always the same; however, the number
of landscape layers that are enabled varies. The initial geographic placement of
macaques, both inside the temples and dispersing, is random. The initial values of
the parameters were chosen based upon observation and prior studies. Pathogen
parameters are varied according to the characteristics of a given pathogen.
6.2.3.6 Input
The input to the model includes the GIS shapefiles representing the various
landscape features of Bali. These were collected as part of a dissertation [132].
6.2.3.7 Submodels
Pathogens When a macaque becomes infected, it traverses through a variety of
pathogen-related states. Upon infection, a macaque enters a latent state,
which refers to how long it takes the macaque to become symptomatic. A
latent macaque is also able to transmit the pathogen to other macaques.
After completing the symptomatic phase, a macaque will become free of
infection and clear of the pathogen, meaning it can no longer transmit the
pathogen. The macaque will also enter an acquired immunity phase during
which it will not be able to become infected. Transmission of the pathogen
between macaques depends on infectiousness and infectivity. Infectiousness
refers to the transmission ring which both macaques have to be within to
transfer infection; infectivity is the chance that the infection will take place.
Virulence reflects the deadliness of the pathogen. Figure 6.7 shows the
temporal relationships for selected pathogen-related states.
Movement The higher the virulence of an infected macaque, the smaller the
chance that macaque will move. While movement within temples is ran-
dom, movement amongst dispersing macaques is complex. In its simplest
form, macaques move about a Moore neighborhood, namely the eight imme-
diately surrounding grid locations. At each time step, dispersing macaques
consider their previous movement direction, their current landscape, and the
landscape in their Moore neighborhood to determine their next location. We
utilize the numbers in Table 6.1 to quantify the likelihood of a macaque leaving
one landscape for another. This is combined with the macaque’s current
direction of travel and the new location (if any) is determined stochastically.
The mechanism of movement is independent of the number of layers enabled
for any given simulation run.
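
The pathogen submodel described above can be sketched as a small state function. The default values are those listed in Table 6.2; everything else is an assumption made for illustration, in particular that clearance time is counted from the moment of infection. The sketch ignores death, reinfection, autoinfection, and natural resistance, and the class and function names are hypothetical rather than taken from LiNK.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    SUSCEPTIBLE = "susceptible"
    LATENT = "latent"            # infected and able to transmit, not yet symptomatic
    SYMPTOMATIC = "symptomatic"  # infected, able to transmit, symptomatic
    IMMUNE = "immune"            # cleared and temporarily protected


@dataclass
class Pathogen:
    latency: int = 4              # time steps until symptomatic (Table 6.2)
    clearance_time: int = 28      # time steps until cleared (Table 6.2)
    acquired_immunity: int = 120  # time steps of immunity after clearance (Table 6.2)


def state_after(steps_since_infection: int, pathogen: Pathogen) -> State:
    """State of a surviving macaque a given number of time steps after infection.

    Assumption: clearance time is measured from the moment of infection.
    """
    if steps_since_infection < 0:
        raise ValueError("steps_since_infection must be non-negative")
    if steps_since_infection < pathogen.latency:
        return State.LATENT
    if steps_since_infection < pathogen.clearance_time:
        return State.SYMPTOMATIC
    if steps_since_infection < pathogen.clearance_time + pathogen.acquired_immunity:
        return State.IMMUNE
    return State.SUSCEPTIBLE


def can_transmit(state: State) -> bool:
    """Latent and symptomatic macaques can both transmit the pathogen."""
    return state in (State.LATENT, State.SYMPTOMATIC)
```

Under these defaults a macaque is latent for steps 0 to 3, symptomatic for steps 4 to 27, immune for steps 28 to 147, and susceptible again from step 148.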
6.2.4 Implementation
There are several tools and technologies utilized in LiNK. The model is coded
in Java [67] with the Repast simulation toolkit [118]. We utilize Repast and
OpenMap [107] to display the model and GeoTools [49] and JTS Topology Suite
[70] to interact with the spatial information. The choice of tools used in this study
was primarily driven by the necessity to process and visualize GIS data and to be
cross-platform and open-source.
6.2.5 Verification and Validation
Simulations are credible only once they have passed some form of verification
and validation analysis. Verification refers to building the model right, meaning
that the simulation model matches the abstract model. Validation refers to building
the right model, meaning the correct abstract model was chosen. ABMs must
undergo and pass several subjective and quantitative verification and validation
techniques to be considered useful models [7, 9, 81, 149]. Figure 6.9 shows common
techniques for verifying and validating ABMs, adapted from Kennedy et al. [75].
The LiNK model was developed in conjunction with domain experts from multiple
fields and has undergone extensive face validation, both through its display and
evaluation of its output. We have also checked for internal validity and traced
entities of the model. Much of this work has been performed through the use
of LiNKStat, which we describe in Section 6.5.1. We are currently collecting
additional real-world data that we will use in conjunction with the existing data
to continue docking LiNK and to examine LiNK’s predictive power.

Figure 6.9. Verification and Validation Techniques for Agent-based Models.
Here, we show techniques we used and plan to use for the verification and
validation of LiNK.

6.3 GIS Data and Agent-Based Modeling
In this section, we describe common methods to utilize GIS data in an agent-
based simulation environment. We also describe our improvements to these tech-
niques. We conclude this section with details on our spatially aware agents.
6.3.1 Approximating GIS Data in Simulations
When an ABM environment is built upon GIS data, queries can be expensive,
particularly with complex data or movement. As a general rule, the more complex
the GIS data, the more difficult it is to efficiently utilize it within an ABM.
Additionally, the more GIS data that is available, such as multiple landscape
features, the more time-consuming it will be for agents to query. Put simply, at
each time step, an agent needs to query its unknown surroundings and make a
decision regarding its next move. The more GIS data there is, the longer this will
take. A common solution is to approximate GIS data to the level of granularity
required for a given model. As such, the amount of GIS data is decreased while
the integrity of the data required is maintained. We next describe several ways
to access GIS data from a simulation, offering advantages and disadvantages for
each.
6.3.2 Raster Queries
Raster-based (cell-based) spatial queries made through a spatial package can
be costly, as the mechanisms by which agents access this data are typically not
optimized for use in simulations. Additionally, storing and loading potentially
large raster data files is inefficient at simulation run-time, particularly when not
all of the data is necessary. Raster files are also not ideal for representing complex
GIS data where fine-scale granularity is required. An advantage of utilizing raster
data in an ABM is that it easily maps to traditional ABM grid spaces.
6.3.3 Spatial Queries
Spatial queries on vector-based (coordinate) GIS data are the most accu-
rate way an agent can interact with GIS data. Here, an agent simply performs
mathematical-based queries on the loaded GIS data to determine its surroundings.
While very accurate, the cost of performing a spatial query increases as the com-
plexity of the data increases. For example, it may be mathematically simple to
query a rectangle to see whether an agent is contained within it; however, it is very
mathematically expensive to do the same query on a large polygon. Repeatedly
performing such queries is especially expensive, and this problem is exacerbated
as the number of agents and the amount of spatial data increases. While indexing
spatial data alleviates some redundancy, queries are still expensive.
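
To make the cost difference concrete, the following sketch contrasts a rectangle containment test, which needs a fixed number of comparisons, with a ray-casting point-in-polygon test, which must visit every edge. It is a generic illustration and not the query code used by the spatial libraries in LiNK; the function names are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def in_rectangle(p: Point, lower_left: Point, upper_right: Point) -> bool:
    """Constant-time containment test for an axis-aligned rectangle."""
    return (lower_left[0] <= p[0] <= upper_right[0]
            and lower_left[1] <= p[1] <= upper_right[1])


def in_polygon(p: Point, vertices: List[Point]) -> bool:
    """Ray-casting containment test; every edge is visited, so the cost is O(n)."""
    x, y = p
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        # Does a horizontal ray cast to the right from p cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside
```

For a coastline with thousands of vertices, the loop in `in_polygon` runs thousands of times per query, per agent, per time step, which is the expense the text describes.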
6.3.3.1 Simplified Spatial Queries
The performance of spatial queries can be improved if the vector data is
approximated such that the number of vertices in a line or polygon is
decreased while an appropriate level of data integrity is maintained. The
Douglas-Peucker algorithm [34] is commonly used to perform such
simplifications. This technique offers a performance gain over traditional
spatial queries, but at the cost of less accurate spatial data. However,
repeatedly performing similar or identical spatial queries remains redundant,
which can be remedied. Figure 6.10 shows a roughly 99% reduction in data points
(from approximately 10,000 to 100) that maintains considerable data integrity
for Bali’s outline.
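
A minimal recursive sketch of the Douglas-Peucker simplification is given below for readers unfamiliar with it. It is a textbook formulation written for illustration, not the routine used to produce the simplified Bali outline; the `tolerance` parameter and function names are our own, and the recursion depth could become an issue for very long, highly irregular lines.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def _perpendicular_distance(p: Point, a: Point, b: Point) -> float:
    """Distance from p to the line through a and b (or to a if a equals b)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length = math.hypot(dx, dy)
    if length == 0.0:
        return math.hypot(px - ax, py - ay)
    return abs(dy * (px - ax) - dx * (py - ay)) / length


def douglas_peucker(points: List[Point], tolerance: float) -> List[Point]:
    """Simplify a polyline, dropping points within `tolerance` of the chord."""
    if len(points) < 3:
        return list(points)
    # Find the interior point farthest from the chord joining the endpoints.
    index, max_dist = 0, 0.0
    for i in range(1, len(points) - 1):
        d = _perpendicular_distance(points[i], points[0], points[-1])
        if d > max_dist:
            index, max_dist = i, d
    if max_dist <= tolerance:
        return [points[0], points[-1]]
    # Otherwise keep that point and simplify both halves recursively.
    left = douglas_peucker(points[: index + 1], tolerance)
    right = douglas_peucker(points[index:], tolerance)
    return left[:-1] + right
```

A larger tolerance removes more vertices, so the modeler can trade accuracy for query speed by adjusting a single value.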

(a) 10,000 Data Points
(b) 100 Data Points
Figure 6.10. Panels (a) and (b) represent Bali, Indonesia with approximately
10,000 and 100 data points, respectively. Here, we reduce the number of points
by almost 100%, but still retain considerable data integrity.

6.3.4 Precalculated Query Matrix
Recognizing the drawbacks of earlier techniques, we developed and utilized a
technique involving precalculated query matrices to create spatially aware agents.
This technique relies on the advantages of raster data while utilizing the accuracy
of vector data. Here, vector files are used in conjunction with spatial queries
to build arrays of spatial data. Specifically, we iterate through the vector data,
at a specified granularity, and perform spatial queries at each point. The result
of the query is stored in the matrix for that specific layer as a Boolean value
which specifies whether a given landscape is present. This process is shown in
Algorithm 2 and is performed for all available spatial data. The run-time for
Algorithm 2 is O(xyl), where x and y are the number of latitude and longitude
values and l is the number of matrices. The number of matrices refers to the number
of landscape layers in use. While time consuming, the expensive queries only need
to be performed once for a given granularity, prior to simulation run-time. We
utilize serialization to load the arrays into the simulation and agents can access
the data in constant time. The main disadvantage to this method is that arrays
of finer granularity will take longer to build, resulting in larger arrays and slightly
longer query times. The advantages include agents that can more quickly query
their environment and a simulation that scales well, both in terms of the amount
of GIS data available and in the number of agents. Researchers also have the
advantage of choosing a granularity to fit their needs. Currently, we use multiple
precalculated query matrices in LiNK.

Algorithm 2 BuildPrecalculatedQueryMatrix
Let X be the set of latitude values
Let Y be the set of longitude values
Let L be the set of GIS layers
Let M be the Precalculated Query Matrix for a layer
for all x ∈ X do
    for all y ∈ Y do
        for all l ∈ L do
            M_l(x, y) ← SpatialQuery(l, x, y)
        end for
    end for
end for
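
The structure of Algorithm 2 can be sketched in a few lines of Python. The sketch assumes a caller-supplied `spatial_query` function; the toy query below uses bounding boxes as a hypothetical stand-in for real vector queries, and none of the names are taken from the LiNK code base.

```python
from typing import Callable, Dict, List, Sequence

SpatialQuery = Callable[[str, float, float], bool]


def build_query_matrices(lats: Sequence[float], lons: Sequence[float],
                         layers: Sequence[str],
                         spatial_query: SpatialQuery) -> Dict[str, List[List[bool]]]:
    """Run the expensive query once per (layer, cell): O(x * y * l) queries."""
    matrices = {layer: [[False] * len(lons) for _ in lats] for layer in layers}
    for i, lat in enumerate(lats):
        for j, lon in enumerate(lons):
            for layer in layers:
                matrices[layer][i][j] = spatial_query(layer, lat, lon)
    return matrices


# Hypothetical stand-in for a vector query: each layer is a bounding box
# given as (lat_min, lat_max, lon_min, lon_max).
TOY_LAYERS = {
    "coast": (0.0, 3.0, 0.0, 3.0),
    "lakes": (1.0, 2.0, 1.0, 2.0),
}


def toy_spatial_query(layer: str, lat: float, lon: float) -> bool:
    lat_min, lat_max, lon_min, lon_max = TOY_LAYERS[layer]
    return lat_min <= lat <= lat_max and lon_min <= lon <= lon_max


lats = [0.5, 1.5, 2.5]
lons = [0.5, 1.5, 2.5]
matrices = build_query_matrices(lats, lons, list(TOY_LAYERS), toy_spatial_query)

# During the simulation, a lookup is a constant-time array access.
on_lake = matrices["lakes"][1][1]
```

The expensive part runs once, before the simulation starts; afterwards each agent query is an index into a Boolean array regardless of how complex the original geometry was.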
6.3.5 GIS Aware Agents
In traditional ABMs, agents move about a grid-like structure. GIS aware
agents move about the same structure, but in a manner such that each move is
influenced by the surrounding environment, including nearby agents. A simple
example would be allowing agents to move preferentially into one landscape over
another. Previously, we listed ways by which agents can query their environment.
Our agents are able to adequately and efficiently survey their surroundings, mak-
ing use of that data to become spatially aware. We utilize precalculated query
matrices for movement decisions. To display this movement on the native vector
data, we use hash tables to “map” the native GIS latitude and longitude points to
our matrices, and vice versa. This mapping avoids repetitive calculations, while
allowing agents to find their real-world coordinates quickly. This also assists in
enabling agents to move with complex rules, which we next describe.
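
The two-way mapping between matrix cells and real-world coordinates can be sketched with a pair of dictionaries. One degree of latitude spans roughly 111 km, so a step of 0.001 degrees corresponds to about 111 m, which matches the grid cell size in Table 6.2; whether LiNK derives its grid this way is an assumption, and the function name, origin, and grid size below are illustrative values only.

```python
from typing import Dict, Tuple

Index = Tuple[int, int]
Coord = Tuple[float, float]


def build_coordinate_maps(lat_min: float, lon_min: float, step_deg: float,
                          rows: int, cols: int) -> Tuple[Dict[Index, Coord], Dict[Coord, Index]]:
    """Precompute both directions of the cell <-> coordinate mapping once."""
    index_to_coord: Dict[Index, Coord] = {}
    coord_to_index: Dict[Coord, Index] = {}
    for i in range(rows):
        for j in range(cols):
            # Round so floating-point noise does not produce distinct keys.
            coord = (round(lat_min + i * step_deg, 6),
                     round(lon_min + j * step_deg, 6))
            index_to_coord[(i, j)] = coord
            coord_to_index[coord] = (i, j)
    return index_to_coord, coord_to_index


# Illustrative origin and grid size, not the actual extent used for Bali.
index_to_coord, coord_to_index = build_coordinate_maps(-8.0, 115.0, 0.001, 4, 5)
lat, lon = index_to_coord[(2, 3)]
```

Because both tables are built once, converting an agent's cell to a display coordinate (or back) costs a single hash lookup at run-time.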
6.3.5.1 Movement
Adding movement to agents in a GIS-based environment is challenging. With
raster data, agents must perform tedious queries through the GIS system to
determine the surrounding landscape. Spatial queries are inefficient too, as the
queries can be redundant and take considerable time. Utilizing precalculated
query matrices enables us to create many agents with complex and realistic
movements that are computed quickly.
In traditional ABM cellular automata spaces, agent behavior is based on a
von Neumann or Moore neighborhood. Specifically, von Neumann neighborhoods
describe the four cells immediately adjacent to the current cell in a traditional
square grid. A Moore neighborhood extends this to the surrounding eight ad-
jacent cells, including those diagonally adjacent. Performing spatial queries on
such spaces would be tedious and inefficient, particularly if the neighborhood was
extended beyond a Moore neighborhood.
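
The two neighborhoods, and their extension to a larger radius, can be written compactly as follows. This is a generic sketch on integer grid indices; it does not clip cells that fall outside the grid, and the function names are hypothetical.

```python
from typing import List, Tuple

Cell = Tuple[int, int]


def von_neumann(cell: Cell, radius: int = 1) -> List[Cell]:
    """Cells within Manhattan distance `radius` (4 cells when radius is 1)."""
    r, c = cell
    return [(r + dr, c + dc)
            for dr in range(-radius, radius + 1)
            for dc in range(-radius, radius + 1)
            if 0 < abs(dr) + abs(dc) <= radius]


def moore(cell: Cell, radius: int = 1) -> List[Cell]:
    """Cells within Chebyshev distance `radius` (8 cells when radius is 1)."""
    r, c = cell
    return [(r + dr, c + dc)
            for dr in range(-radius, radius + 1)
            for dc in range(-radius, radius + 1)
            if (dr, dc) != (0, 0)]
```

A Moore neighborhood of radius 2 already contains 24 cells, which illustrates why per-cell spatial queries become impractical as the neighborhood grows.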
In our model, spatial movement is based on a Moore neighborhood, with al-
lowance for larger neighborhoods. To move intelligently, agents must know the
landscape they are currently in as well as the surrounding landscape. To repre-
sent possible transitions from one cell to another, we use a matrix of probabilistic
movement values. This table consists of values representing the likelihood that an
agent would move from a given landscape to another, shown in Table 6.1. These
values were determined after discussions with domain experts. Calculations are
performed for each of the cells in the Moore neighborhood. A directional bias
is also added to the agents so they are more likely to continue in the same gen-
eral direction. Once the values for the surrounding cells have been calculated,
they are normalized. We then use probabilities to determine the next location
for the agent, if it moves at all. These calculations are performed quickly, as the
look-ups for the surrounding cells can be performed in constant time, allowing
for realistic movement among agents. Figure 6.11 shows a simplified version of
our movement on an example grid and Algorithm 3 describes dispersed movement
algorithmically (its run-time depends on the number of possible new locations).
WeightedSelectOnAdjust refers to selecting the new location based upon the
normalized probabilities.
Intelligent agents can be classified as simple reflex, model-based reflex,
goal-based, utility-based, or learning agents [123]. Based on the movement
decisions described previously, our agents could be classified as utility-based,
but with stochastic utility functions and decisions. This classification fits
our agents because they make decisions based upon utility: they are more content
in certain landscapes, and their contentment is determined by their previous
location and current landscape.
Algorithm 3 DispersedMovement
Let L_{t+1} be the set of possible locations for the next time step
Let l_{t+1} be the new location
Let b1 be the directional bias
Let b2 be the landscape bias
for each time step t do
    for all l ∈ L_{t+1} do
        l ← b1 + b2
    end for
    l_{t+1} ← WeightedSelectOnAdjust(l ∈ L_{t+1})
end for
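
One step of Algorithm 3 can be sketched as follows for a macaque currently in a forest cell. The landscape ranges are the "from Forest" row of Table 6.1. How a range is turned into a single value, the size of the directional bonus, and the rule that a zero landscape value blocks a cell even in the direction of travel are all assumptions made for illustration; the sketch also omits the possibility that a macaque chooses not to move.

```python
import random
from typing import Dict, List, Optional, Tuple

# "From Forest" row of Table 6.1 as (low, high) ranges.
FROM_FOREST: Dict[str, Tuple[int, int]] = {
    "City": (5, 20), "Coast": (15, 20), "Forest": (10, 30), "Lake": (0, 0),
    "Rice Field": (10, 30), "River": (15, 45), "Road": (0, 20),
}
DIRECTION_BONUS = 10.0  # hypothetical directional bias b1

Candidate = Tuple[str, str]  # (direction, landscape of the neighboring cell)


def movement_probabilities(candidates: List[Candidate], previous_direction: str,
                           rng: random.Random) -> List[float]:
    """Compute b1 + b2 for every candidate cell and normalize the results."""
    weights = []
    for direction, landscape in candidates:
        low, high = FROM_FOREST[landscape]
        b2 = rng.uniform(low, high)  # landscape bias drawn from the range (assumed)
        b1 = DIRECTION_BONUS if direction == previous_direction and b2 > 0 else 0.0
        weights.append(b1 + b2)
    total = sum(weights)
    if total == 0:
        return [0.0] * len(weights)
    return [w / total for w in weights]


def choose_move(candidates: List[Candidate], previous_direction: str,
                rng: random.Random) -> Optional[Candidate]:
    """Weighted random selection of the next cell; None if no cell is enterable."""
    probs = movement_probabilities(candidates, previous_direction, rng)
    if not any(probs):
        return None
    return rng.choices(candidates, weights=probs, k=1)[0]
```

Because every quantity involved is a table or array lookup, the cost of one step grows only with the number of candidate cells.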

Figure 6.11. Macaque Movement. The graphic above shows how a macaque M
determines where to move in a landscape consisting of forests (green) and a
river (blue). There are movement probabilities associated with landscape
features. For example, a macaque would be more likely to enter a forest than a
river. Here, we base movement on the immediate surrounding cells; however, it
can be based on an arbitrary number of cells in an outward direction.

6.4 Results
LiNK has started to demonstrate the importance of landscape in the scope of
epidemiological modeling [88]. The model has been improved in terms of speed
and scalability through an abstraction of typical GIS data representation. We
have shown the ability to have many agents interact with complex spatial data in
a time frame adequate for a simulation while still addressing the research
questions mentioned in Section 6.2.1 at a high level. Additionally, we have started to
show the impact of landscape on pathogen transmission, as shown in Figures 6.12
and 6.13, which is thus far in accordance with real-world data from Roberts and
Janovy [120]. To date, it appears that virulence is the dominant factor in terms of
pathogen spread. Further sensitivity analysis and additional verification and
validation are ongoing.

[Figure 6.12 panels: (a) Total Number of Infections by Landscape, with bars for
Dispar and Histo at temple sites 1 and 27 (heterogeneous) and 28 and 38
(homogeneous); (b) Total Number of Infections by Population Size, with the same
bars grouped by population size. Vertical axis in both panels: Number of
Infections, 0 to 1,200,000,000.]

Figure 6.12. Panels (a) and (b) show the total number of infections at four
temple sites. Temple sites 1 and 27 are in heterogeneous areas, meaning there
are many landscape types present. Temple sites 28 and 38 are in homogeneous
areas. Additionally, temple sites 1 and 28 consist of a small population, while
temple sites 27 and 38 consist of a large population. Dispar refers to
Entamoeba dispar and is an avirulent parasite, while Histo refers to Entamoeba
histolytica and is highly virulent. Panel (a) shows that the diversity of
landscape in which the pathogen is spread has little effect on the total number
of infections. Panel (b) shows the same data grouped by population, showing
that population has little effect on the total number of infections. From this
data, we conclude that virulence has the highest impact on the total number of
infections, while landscape has relatively little impact.

Figure 6.13. Pathogen Spread to Varying Temple Sites. The figure above shows
the number of infected macaques that reach temple sites throughout the island
after having vacated the temple denoted with the red star. The western part of
the island is highly homogeneous, allowing for the pathogen to spread further.
The pathogen likely does not spread to the north central part of the island due
to its heterogeneous landscape.

6.4.1 Performance
The model has utilized the aforementioned (Sections 6.3.2-6.3.4) techniques to
interact with GIS data. We started with hefty raster-based queries and refined our
method until we achieved the balance of specificity and speed we desired. Table 6.3
shows the initial GUI load time for the model for each technique, and Table 6.4
and Figure 6.14 show the number of time steps simulated per second for each query
mechanism. These tables and figure show averages over multiple simulation runs
with either the coast and lakes or the coast, lakes, and forests layers enabled, all
with the same number of initial agents. Spatial queries were predictably slowest, as
the raw vector files contain an enormous amount of detail, making calculations
expensive. Utilizing raster data offers a significant improvement but with the
drawback of a long initial startup time (35 to 42 s, compared with 3.5 s or less
for the other methods). Our simplified spatial query greatly improves upon the
traditional spatial query, but performance drops significantly as more layers
are added (from 39.5 to 15.8 time steps/s when the forests layer is enabled).
Utilizing precalculated query matrices produces the fastest simulation, with
even greater gains when the display is disabled. Table 6.5 and its
corresponding Figure 6.15 show the scalability, in terms of number of agents,
for the raster and precalculated query matrix methods. The precalculated query
matrix method scales very well as the amount of GIS data increases and adequately
as the number of agents increases, and it offers the best and most scalable
results overall. All performance tests were run on a single core as
a single thread on a Core 2 Duo 2.0 GHz laptop, highlighting further potential
in scalability. Numbers listed in the figures are averages of 10 simulation runs.
Additionally, LiNK has been adapted to run on a high-performance computing
cluster, making it easy to automate, greatly increasing its utility.

TABLE 6.3
PERFORMANCE COMPARISON OF GUI LOAD TIME

                                GUI Load Time (s)
Query method                    Coast, Lakes    Coast, Lakes, Forests
Spatial Query                   3.5             3.5
Raster Query                    35              42
Simplified Spatial Query        1.8             2.5
Precalculated Query Matrix      1.6             2

TABLE 6.4
PERFORMANCE COMPARISON OF TIME STEPS/S

                                        Time steps/s
Query method                            Coast, Lakes         Coast, Lakes, Forests
Spatial Query                           1.6                  0.15
Raster Query                            18.5 (11x faster)    19 (126x)
Simplified Spatial Query                39.5 (25x)           15.8 (105x)
Precalculated Query Matrix              126.2 (79x)          124.1 (827x)
Precalculated Query Matrix, non-GUI     669.6 (419x)         650.2 (4335x)

[Figure 6.14: bar chart of time steps/s (logarithmic axis, 0.1 to 1000) for the
Spatial, Raster, Simplified Spatial, Precalculated Query Matrix, and
Precalculated Query Matrix, non-GUI methods; series: Coast, Lakes and Coast,
Lakes, Forests.]

Figure 6.14. Performance Comparison of Varying Query Methods. The figure shows
that we obtained roughly an order of magnitude performance increase in going
from spatial to raster queries, and then almost another order of magnitude in
going from raster or simplified spatial queries to precalculated query
matrices. Finally, disabling the GUI offers roughly a fivefold further
improvement. It is also notable that enabling more layers in non-GUI mode adds
almost no performance hit. We show the figure above with a logarithmic scale.

TABLE 6.5
SCALABILITY COMPARISON OF TIME STEPS/S

                                                   Time steps/s by number of
                                                   initial dispersed macaques
Query method                                       10       100      1000
Raster Query (3 Layers)                            51.3     29       19.9
Raster Query (7 Layers)                            33.6     27.6     11
Precalculated Query Matrix (3 Layers)              140.7    131.4    83.8
Precalculated Query Matrix (7 Layers)              137.5    129.5    82.9
Precalculated Query Matrix, non-GUI (3 Layers)     669.8    487.8    154
Precalculated Query Matrix, non-GUI (7 Layers)     680.4    529.6    158.2

[Figure 6.15: line chart of time steps/s (logarithmic axis, 10 to 1000) against
the Number of Initial Dispersed Macaques (10, 100, 1000); series: Raster Query
(3 and 7 Layers), Precalculated Query Matrix (3 and 7 Layers), and
Precalculated Query Matrix, non-GUI (3 and 7 Layers).]

Figure 6.15. Scalability with Respect to Initial Number of Dispersed Macaques
and Amount of GIS data. Here, we show simulations starting with 10, 100, and
1000 dispersed macaques across different querying mechanisms. The precalculated
query matrix method performs best in all cases, even better with 1000 agents
than other methods with 10 agents. The figure is shown on a logarithmic scale.

6.5 Analyzing Massive Amounts of Simulation Data
LiNK is a complex model; as such, it creates enormous amounts of output, up
to terabytes for a given experiment. To glean scientific insight and validation,
LiNK tracks a wide array of events, including infections, births, deaths, and
when a macaque enters or leaves a temple. When simulations are run over a long
period of time, it is not uncommon to have tens of millions of events, or more. We
have created an interactive graphical tool, originally named LiNKStat, to analyze
output from LiNK.
6.5.1 LiNKStat
Written in Perl and Tcl, LiNKStat parses through output files and builds
graphs to gather statistics about the model. Much of the initial analysis and
graph building is done automatically following a simulation run. For example,
LiNKStat allows users to track the route of infection from a given macaque, ob-
taining statistics such as number of macaques directly or indirectly infected. Such
statistics help subject matter experts collect insight from LiNK. A screen capture
of LiNKStat is shown in Figure 6.16 and an example graph from LiNKStat is
shown in Figure 6.17. LiNKStat is efficient, with a run-time mainly dependent on
the number of infection events and their degree of proliferation. The techniques
used in LiNKStat have been generalized and published as P-SAM [6].
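
The kind of route-of-infection statistic described above can be sketched with a small graph traversal. LiNKStat itself is written in Perl and Tcl; the Python below is only an illustration, the event record layout is hypothetical, and the sample events are made up for the example rather than taken from LiNK output. The traversal also ignores the time ordering of events, which a full analysis would need to respect.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, List, Set, Tuple

# (infector id, infected id, time step); this record layout is hypothetical.
InfectionEvent = Tuple[str, str, int]


def build_transmission_graph(events: Iterable[InfectionEvent]) -> Dict[str, List[str]]:
    """Directed graph with an edge from each infector to each macaque it infected."""
    graph: Dict[str, List[str]] = defaultdict(list)
    for infector, infected, _step in events:
        graph[infector].append(infected)
    return graph


def infection_counts(graph: Dict[str, List[str]], source: str) -> Tuple[int, int]:
    """Return (directly infected, indirectly infected) counts for `source`."""
    direct: Set[str] = set(graph.get(source, [])) - {source}
    seen: Set[str] = {source} | direct
    indirect: Set[str] = set()
    queue = deque(direct)
    while queue:
        current = queue.popleft()
        for nxt in graph.get(current, []):
            if nxt not in seen:  # skips reinfections and autoinfections
                seen.add(nxt)
                indirect.add(nxt)
                queue.append(nxt)
    return len(direct), len(indirect)


# Illustrative events only, using the id convention temple.id.sex.
events = [
    ("27.2969.0", "27.2775.0", 1),
    ("27.2775.0", "27.2870.1", 2),
    ("27.2775.0", "27.2863.0", 3),
    ("27.2863.0", "27.2863.0", 4),  # autoinfection
    ("27.2870.1", "27.2775.0", 5),  # reinfection
]
graph = build_transmission_graph(events)
counts = infection_counts(graph, "27.2969.0")
```

The run-time of such a traversal depends on the number of infection events reachable from the chosen macaque, consistent with the observation that LiNKStat's run-time is driven by the number of infection events and their degree of proliferation.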
6.6 Conclusion
When designing an ABM with GIS aware agents, there are a number of factors
that should be considered. Scalability in terms of the number of agents is generally
most important. Other important issues include the complexity of the GIS data
116
Figure 6.16. LiNKStat. This screen capture shows one of the analysis tabs of LiNKStat. The left column displays an interactive list of
macaques in the simulation that updates the middle right panel with specific infection statistics. These statistics form graphs, an example of
which is shown in Figure 6.17. LiNKStat has been and will continue to
be very helpful in the verification and validation of LiNK.
Figure 6.17. LiNKStat Pathogen Transmission Graph. The graph above allows us to visually track pathogen transmission, helping with
validation and interpretation of output. Nodes refer to macaques, with the naming convention being natal temple number concatenated with
an id concatenated with a sex identifier. For example, the topmost node would be parsed as a female macaque with temple 27 as its natal
temple and 2969 as its id. Transitions are infection events, listed with the time step and location where the infection occurred. Starting at the
top, macaque 27.2969.0 infected macaque 27.2775.0 at time step 1, in temple 27. Macaque 27.2775.0 went on to infect four other macaques,
and was also reinfected by macaque 27.2870.1. Autoinfection is possible, as indicated by nodes 27.2863.0 and 27.2805.1.
and the amount of GIS data that the model will rely upon. An adept modeler
will utilize the GIS data at a granularity appropriate for the model at hand. In
terms of speed, raster data scales reasonably with increasing GIS complexity, but
not as well with an increase in the number of agents. Spatial queries scale poorly
with an increase in the amount of GIS data and complexity, as well as with an
increase in the number of agents. Regarding accuracy, utilizing vector data via
spatial queries offers the highest accuracy, but at the highest performance cost.
Raster data and our precalculated query matrix method offer varying levels of
accuracy, while offering faster speed. Table 6.6 summarizes general ratings for
each approach. Possible ratings are 1-5, from Poor to Excellent. Accuracy of GIS
data refers to the faithfulness to the original GIS data, while the amount of GIS
data refers to the ability of each technique to handle multiple layers of GIS data.
Our precalculated query matrix method scales best in terms of number of agents
and particularly in the amount of GIS data present.
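The speed and accuracy trade-off summarized above can be made concrete with a small sketch. It assumes, purely for illustration, that a precalculated query matrix amounts to evaluating a vector-based spatial query once per grid cell before the simulation and storing the answers, so that agents later perform constant-time lookups. The layer names, coordinates, cell size, and function names are hypothetical, and the method as implemented in LiNK may differ in its details.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: is (x, y) inside the polygon given as [(x0, y0), ...]?"""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def spatial_query(x, y, layers):
    """Exact query: names of all layers whose polygon contains (x, y)."""
    return frozenset(
        name for name, poly in layers.items() if point_in_polygon(x, y, poly)
    )


def precalculate_matrix(layers, width, height, cell_size):
    """Evaluate the spatial query once at the center of every grid cell."""
    cols = int(width // cell_size)
    rows = int(height // cell_size)
    return [
        [
            spatial_query((c + 0.5) * cell_size, (r + 0.5) * cell_size, layers)
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def lookup(matrix, x, y, cell_size):
    """Constant-time stand-in for spatial_query during the simulation."""
    return matrix[int(y // cell_size)][int(x // cell_size)]


# Hypothetical GIS layers on a 10 x 10 landscape.
layers = {
    "forest": [(0, 0), (4, 0), (4, 4), (0, 4)],
    "river": [(3, 0), (5, 0), (5, 10), (3, 10)],
}
matrix = precalculate_matrix(layers, width=10, height=10, cell_size=1)
print(sorted(lookup(matrix, 3.2, 1.7, cell_size=1)))
print(sorted(spatial_query(3.2, 1.7, layers)))
```

The per-step cost of `lookup` does not depend on the number of layers, which is consistent with the ratings for amount of GIS data, while the cell size bounds how far a stored answer can deviate from the exact query near polygon boundaries.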
We have presented a complex model of pathogen transmission that utilizes GIS
data. This model has started to demonstrate the importance of integrating spatial
data into models of pathogen transmission. We have created an efficient and ef-
fective mechanism to allow our agents to become GIS aware. Future extensions to
the model include adding the ability to model different pathogens simultaneously,
deploying a web-based front end to the model, and allowing for the use of cus-
tom GIS data. We would also like to explore running our simulation on graphics
processing units, as described in D’Souza et al. [37]. Finally, we plan to further
verify and validate the LiNK model through real-world data.
TABLE 6.6
ADVANTAGES AND DISADVANTAGES (1 - POOR; 5 - EXCELLENT)

                          Raster   Spatial   Simplified      Precalculated
                          Query    Query     Spatial Query   Query Matrix
Accuracy of GIS Data      3        5         4               4
Amount of GIS Data        3        1         2               5
Complexity of GIS Data    2        5         4               4
Load Time                 1        4         4               5
Memory Requirement        2        4         4               5
Number of Agents          4        1         2               4
Time steps/s              4        1         2               5
CHAPTER 7
CONCLUSION
7.1 Overview
This dissertation has described the significance of TEs both in general and
with respect to their detection within newly sequenced genomes. We described an
automated homology-based approach for the identification of high quality TEs.
We next described a design and implementation plan for the annotation of TEs
on VectorBase. We later described a GIS aware agent-based model of pathogen
transmission. Together, we have created numerous approaches and models that
have important public health implications. We elaborate on our conclusions in
the following sections.
7.2 Automated Homology-based Approach for the Identification of Transposable
Elements
Chapter 2 introduced TEs and described strategies to detect them. Chapter 3
described our approach to the identification of TEs. The approach, implemented
as TESeeker, was tested on multiple families of TEs across a variety of organisms.
Overall, results were very good, with resulting consensus TEs as much as 98%
identical to previously annotated elements. TESeeker is available as a download-
able virtual machine. This work has been submitted to BMC Bioinformatics [80]
while results of this approach have appeared in print [5, 83].
7.2.1 Future Work
Due to the nature of TEs, there will likely never be an all-encompassing ap-
proach to discover them. Instead, existing approaches will be used in conjunction
with other approaches. For example, LTR TEs can be detected by both structure-
based and homology-based approaches. The utilization of multiple tools and ap-
proaches to detect TEs produces the most robust results. With TESeeker, several
improvements could be implemented. First, incorporating the capability to de-
tect LTRs in Class I and TIRs in Class II consensus elements would allow us to
more correctly trim our consensus sequences. Second, the ability for TESeeker
to automatically determine the size of flanking sequence could be implemented
on a family by family basis. Last, TESeeker could be extended to allow for the
detection of MITEs.
TESeeker also opens up numerous opportunities to further study TEs within
the organisms hosted on VectorBase. For example, a comparative study of the
mariner elements within the Anopheline mosquito complex could be performed.
Additionally, a comparative study of TEs within Anopheles gambiae, Culex
quinquefasciatus, and Aedes aegypti could be performed. Initial work has been
performed on the comparative study of TEs within the Anopheles gambiae M and
S forms; TESeeker could help validate and complete this study.
7.3 Community Annotation of Transposable Elements on VectorBase
We introduced the VectorBase CAP in Chapter 2. Its technologies and imple-
mentation were also described in Chapter 2. A design and implementation plan for
the community annotation of TEs on VectorBase was presented in Chapter 4, in-
cluding a preliminary version which demonstrated the ability to store TEs within
the Chado database schema and to dynamically create a structural display of
the TE.
7.3.1 Future Work
Once the VectorBase CAP is restored to working order and TE instances are
able to be submitted by the community, work can begin to allow for the submission
and display of consensus TE sequences. The submission and display of consensus
sequences is more complex because consensus sequences need to be aligned against
the genome to determine the location of instances within the genome. This could
be done with a BLAST search and results could be displayed in the Ensembl
genome browser. Future additions could allow for the use of TESeeker to produce
an annotation of TEs that are submitted through the CAP and displayed in the
Ensembl genome browser.
7.4 GIS Aware Agent-based Model of Pathogen Transmission
Chapter 5 introduced simulations and described their applicability. We also in-
troduced GIS and discussed its utility within an agent-based model. We combined
GIS and agent-based modeling to create a simulation model for the transmission of
pathogens amongst long-tailed macaques on Bali, Indonesia, described in Chap-
ter 6. Macaques in our model are GIS aware and utilize their surroundings to
make movement decisions. Performance improvements were made through itera-
tive improvements in accessing GIS data. This work culminated in an invited and
refereed journal publication [76], in refereed conference proceedings [79], and was
also presented at peer-reviewed conferences [77, 78]. Applications of this work are
also expected to appear in Lane et al. [88].
7.4.1 Future Work
The utility of LiNK could be improved with the ability to utilize custom GIS
data. While this can currently be done manually, the ability to do so “on-the-
fly” would be very useful. This could be accompanied by the capability to add
additional agents with custom behavioral rules to the model.
The most significant improvement to the LiNK model would be the incorpo-
ration of a web-based interface to define and submit simulations, as well as to
view simulation results. The advantages of this include the ability to run simula-
tions on the CRC High-Performance Computing Cluster (HPCC) [142] at Notre
Dame and the ability to store simulation results in a database (rather than a text
file). Simulations run much more quickly on the HPCC, and a richer analysis of
the simulation data could be performed through software designed to work with
massive amounts of data.
7.5 Contributions
This dissertation has described the following contributions:
• Development and implementation of an automated approach to detect transposable elements.
• Design and implementation plan for the incorporation of TEs into the VectorBase community annotation pipeline, including a preliminary version implemented independent of VectorBase.
• Development and implementation of a GIS aware agent-based model of pathogen transmission.
Results from this dissertation have appeared in the following refereed publica-
tions:
• Arensburger, P., Megy, K., Waterhouse, R.M., Abrudan, J., Amedeo, P., Antelo, B., Bartholomay, L., Bidwell, S., Caler, E., Camara, F., Campbell, C.L., Campbell, K.S., Casola, C., Castro, M.T., Chandramouliswaran, I., Chapman, S.B., Christley, S., Costas, J., Eisenstadt, E., Feschotte, C., Fraser-Liggett, C., Guigo, R., Haas, B., Hammond, M., Hansson, B.S., Hemingway, J., Hill, S.R., Howarth, C., Ignell, R., Kennedy, R.C. et al., “Sequencing of Culex quinquefasciatus Establishes a Platform for Mosquito Comparative Genomics,” Science, 330(6000):86-88, October 2010.
• Kirkness, E.F., Haas, B.J., Sun, W., Braig, H.R., Perotti, M.A., Clark, J.M., Lee, S.H., Robertson, H.M., Kennedy, R.C. et al., “Genome Sequences of the Human Body Louse and its Primary Endosymbiont: Insights into the permanent parasitic lifestyle,” Proceedings of the National Academy of Sciences, 107(27):12168-12173, July 2010.
• Kennedy, R.C., Lane, K.E., Arifin, S. M. Niaz, Fuentes, A., Hollocher, H., Madey, G.R., “A GIS Aware Agent-Based Model of Pathogen Transmission,” International Journal of Intelligent Control and Systems, 14(1):51-61, March 2009. (invited)
• Nene, V., Wortman, J.R., Lawson, D., Haas, B., Kodira, C., Tu, Z.J., Loftus, B., Xi, Z., Megy, K., Grabherr, M., Ren, Q., Zdobnov, E.M., Lobo, N.F., Campbell, K.S., Brown, S.E., Bonaldo, M.F., Zhu, J., Sinkins, S.P., Hogenkamp, D.G., Amedeo, P., Arensburger, P., Atkinson, P.W., Bidwell, S., Biedler, J., Birney, E., Bruggner, R.V., Costas, J., Coy, M.R., Crabtree, J., Crawford, M., Debruyn, B., Decaprio, D., Eiglmeier, K., Eisenstadt, E., El-Dorry, H., Gelbart, W.M., Gomes, S.L., Hammond, M., Hannick, L.I., Hogan, J.R., Holmes, M.H., Jaffe, D., Johnston, J.S., Kennedy, R.C. et al., “Genome sequence of Aedes aegypti, a major arbovirus vector,” Science, 316(5832):1718-23, June 2007.
• Lawson, D., Arensburger, P., Atkinson, P., Besansky, N.J., Bruggner, R.V., Butler, R., Campbell, K.S., Christophides, G.K., Christley, S., Dialynas, E., Emmert, D., Hammond, M., Hill, C.A., Kennedy, R.C. et al., “VectorBase: a home for invertebrate vectors of human pathogens,” Nucleic Acids Research, 35(D503-505), January 2007.
• Kennedy, R.C., Lane, K.E., Fuentes, A., Hollocher, H., Madey, G., “Spatially Aware Agents: An effective and efficient use of GIS data with an Agent-based Model,” In proceedings of Agent-Directed Simulation (ADS 2009), Spring Simulation Multiconference 2009, San Diego, CA, March 2009.
The following manuscripts are under review or in preparation:
• Kennedy, R.C., Unger, M.F., Christley, S., Collins, F.H., Madey, G.R., “An automated homology-based approach for identifying transposable elements,” BMC Bioinformatics. (Under review)
• Lane, K.E., Kennedy, R.C., Miller, L.A., Madey, G., Hollocher, H., Fuentes, A., “Exploring the use of agent-based models in understanding patterns of pathogen transmission.” (In preparation)
APPENDIX A
AUTOMATED APPROACH WALKTHROUGH
In this appendix, we utilize our approach described in Chapter 3 to identify
the mariner Class II TE from P. humanus humanus. We show iterative results
along the way.
A.1 Representative Amino Acid Coding Regions
We begin with 26 transposase sequences from various mariner elements:
>gi|600840|gb|AAC46948.1| mariner transposase [Chrysoperla plorabunda]MEKKEFRVLIKYCFLKGKNTVEAKTWLDNEFPDSAPGKSTIIDWYAKFKRGEMSTEDGERSGRPKEVVTDENIKKIHKMILNDRKMKLIEIAEALKISKERVGHIIHQYLDMRKLCAKWVPRELTFDQKQQRVDDSERCLQLLTRNTPEFLRRYVTMDETWLHHYTPEFNRQSAEWTATGEPSPKRGKTQKSAGKVMASVFWDAHGIIFIDYLEKGKTINSDYYMALLERLKVEIAAKRPHMKKKKVLFHQDNAPCHKSLRTMAKIHELGFELLPHPPYSPDLAPSDFFLFSDLKRMLAGKKFGCNEEVIAETEAYFEAKPKEYYQNGIKKLEGRYNRCIALEGNYVE>gi|19570323|dbj|BAB86288.1| mariner transposase [Apis cerana]MQDQKEHFRHILLFYYRKGKNAVQARKKLCEIYGEGILTVRQCQNWFSKFRSDNFDIKDAPRSGRPVEADEDKIKALIEANRRITTREIATRLNLSNSTVHDHMKRLGFVSKLDIWVPHVLKEKDLLCRIDICDSLLKREENDPFLKRIVTGDEKWIVYDNIKRELNEPAQRTSKNNIHKKVMLSVWWDFKGVVFFELLPNNCTINSEVYCNQLDKLNNSIKQKRPELINRKGVVFHHDNAKAHMSLMTRQKLLQLGWEVLPHPPYSPDLAPSDYHLFRSLQNSLNDKTFTSNEDVKNYLDQFFANKDQKFYERGIMLLPKRWQYVLDHNGQYVIK>gi|5353885|gb|AAD42284.1| mariner transposase [Bombyx mori]WVPHELSEKNLNDRIIICTSLLAHNKIEPFLDRIITGYEKWITYENIIRKRAFYEPGKPAPSTSKPKLSLNKRMLCIWWNIRRPMHFELLKPNERLNSERHCQQFDKLKTALQEKRPAMFNRKDIILLHDNARPHAALGTRQKAAELG>gi|1698455|gb|AAC52011.1| mariner transposase [Homo sapiens]MNSAKIEARTNIKFMVKLGWKNGEITDALRKVYGDNAPKKSAVYKWITRFKKGRDDVEDEARSGRPSTSICEEKINLVRALIEEDRRLTAETIANTTDISIGSAYTILTEKLKLSKLSTRWVPKPLRPDQLQTRAELSME
ILNKWDQDPEAFLRRIVTGDETWLYQYDPEDKAQSKQWLPRGGSGPVKAKADWSRAKVMATVFWDAQGILLVDFLEGQRTITSAYYESVLRKLAKALAEKRPGKLHQRVLLHHDNAPAHSSHQTRAILREFRWEIIRHPPYSPDLAPSDFFLFPNLKKSLKGTHFSSVNNVKKTALTWLNSQDPQFFRDGLNGWYHRLQKCLELDGAYVEK>gi|1399036|gb|AAB17945.1| mariner transposase [Ceratitis capitata]MDNEKDHMLYEFRKGKTVGAATKDIREVYSDRAPALRTVKKWFAKFRSGDFNLEDRPRSGRPCELDNDVLRISVANNSRISTKEVASELNVNKPTAFRRLKKVGYTLKLDKWVPHQLSEKNKVDRMSTAISLLRRVKNEPFLDRLLTGDEKWILYNNVQRKRTWKQAHEGAEPMSKGGLHPMMVLLCIWWDIRGVIYFELLPAGETITANKYCQQLVELKKAIDEKRPIFANRKGVLFHYDNARPHVAKPTLAKLKEMNWEIMPHSPYSPDIAPSDYHLFRSLQNNLNGKKFKNVEDVKSHLDNFFNEKPRDFYESGIRKLVERWEWIAEHDGEYIID>gi|2564437|gb|AAC28162.1| mariner transposase [Glossina palpalis]NENQKNRRFEVSSSLLLRNNDDPFLNRIVTCDEKWILYDNRRRSAQWLDADEAPQHFPKPKLHQKKIMVTVWWSAVGLIHHSFLNPGETITAEKYCQQIDEMHQRLQQKQPALVNRKGPILLHDNARPHVSMITRQKLYELGYETLDHPP>gi|2564433|gb|AAC28160.1| mariner transposase [Pycnoscelus surinamensis]SDGLKCTRVEWCTEMLKRFNNGDSRRVSDIVTGEETWIYQFDLKTKCQSSVWVFPDEQPPTKVKRQRSVGKKMVATFFSKSGHLATVVLEDQRTVTVKWYTEVWLPQVFSKIQEKRPRTGLRGILLHHDNASSHTANATIAFLEKMPMKLMTHTA>gi|2564426|gb|AAC28158.1| mariner transposase [Plebeia frontalis]NAKNLHDRVTICTSLLARNKNDPFLDRIITGDEKWITYENIVRKRASCEPGQPAPSTFKPSLSLNKRMLCIWWEVQGPIHYVFLKPNEKLNSERYCQQMDDLNKELKKKRPAVFNRKHIILHHDNARPHTAFGTRQMIAELGWEILSHPP>gi|2564423|gb|AAC28157.1| mariner transposase [Stomoxys uruma]TFDQKQQRVDDSEWCLQLLTRNTPEFLHRYVTMDETFLHHYTPESNRQSAEWTAIGEPTPKRGKDQKSAGKVMASVFWDARGIIFIDYLEKGKTINSDYYMALLERLKVEIAAKRPHMKKKKVLFHQDNAPCHKSLRTMVKIHELGFELLPHPP>gi|2564419|gb|AAC28156.1| mariner transposase [Cryptolestes ferrugineus]TEKNMMDRISICEALTKRNKIDPFLKRMATGNEKWITYDNRVRKRSWSKSGEAPQTVVKPGLTARKVLLCIWWDWKGIIYYELLPYGQTLNSDLYCQQLYRLKIAIDHKRPELTNRRGVVFHQDNPRPHTSTVTRQKLRELGWEVLMHPP>gi|2564414|gb|AAC28155.1| mariner transposase [Delphinia picta]KEIHLTNRINACDMHLKRNEFDPFLKRIITGDEKWIVYNNVNRKRSWSKHGEPAQTTSKADIHQKKVMLSVWWDWKGVVYFELLPRNQTINSDVYCQQLDKLNAAIKEKRPELINRKGVIFHQDNARPHTSLMTRQKLGQLGWEVLMHPP>gi|2564412|gb|AAC28154.1| mariner transposase [Culex 
restuans]NDRQMENRKTVCEMLLQRFERKSFLHRIVTGDEKWIYYENPKRKKSWLSPGEAGPSTAKPNHFGRKTMLCVWWDQDGVVYHELLKPGETVNTARYRQQIINLNYALIEKRPEWARRHGKVILQHDNAPSHTAKPVKDALKTLNWEILSHPP>gi|2564407|gb|AAC28153.1| mariner transposase [Mantispa pulchella]TERQMENRKVTCEMLLQRYKRKSFLYRIVTGDEKWIYLENPKRKKSWVSPGEASTSTARPNRFGRKAMLCVWWDQTGVIYFELLKPGETVNAVRYQQQIKDLSRAIAENRPEYQERQKKVILLHDNAPSHKSKVVRDTLEKLQWEVLDHAA>gi|2564401|gb|AAC28151.1| mariner transposase [Bittacus strigosus]NDGQQENRKTTCEMLLARQKRKSFLHRIVTGDEKWIYFVNLKRKRSYVDPGQPAQLSPRPNRFGRKTMLCVFWDQRGVIWYELLKPGETVNGQRYQQQLANLNRALRQKRSEYETRHDKVIFLDDNAPSHRTKQTRELVE
SYSWQPLPHPP>gi|2564397|gb|AAC28150.1| mariner transposase [Tribolium madens]TLDEKKARVNWCKKMLTKFNNGQSNHVFDIVTGDETWIYRYEPETKRQSAQWVFPYEENPTKLKRPKSVGRKMIAAFFSRSGYIATIPLEDRKTVNANWYTSICLPQVFEKVREKRPRSEIILHHDNASSHTAGETLDFLNVSGIKIMTHPP>gi|2564394|gb|AAC28149.1| mariner transposase [Poecilia reticulata]SEANRQMRVDCCVTLLNRHNNEGILNRIITCDEKWILYDNRKRSSQWLNPGEPAKSCPKRKFTKKKLLVSVWWTSAGVVHYSFLKSGQTITADIYCQQLQTMMEKLAAKQPRLVNRSRPLLLQDNARPHTAQRTATKLEELQLECLRHPP>gi|2564369|gb|AAC28140.1| mariner transposase [Andrena erigeniae]SEENKRRRIDTAASLLSRFKRKSFLHKIIAGDEKWVLYDNPKRQKSWVSPGEPSTSMAKPSIHAKKVMLSIWWDFKGVIHYELLVPGKTITADYYQQQLMNLHDELERKRPFTGQGTRHVILQHDNARPHVAQGTRNTIYALGWEVMSHAA>gi|2564360|gb|AAC28136.1| mariner transposase [Atteva punctella]TERNLMNRVLICDSLLRRNETESFLKKLITGDETWITYDKNVRKRSWSKAGQASQTVAKPGLTRNKVMLCAWWDWKGIIHYELLPPGRTIDSELYCEQMMRLKQKAERKRPELINRRGVVFHHDNARPHTSIATQQKLREFGWGVLMHPP>gi|2564392|gb|AAC28148.1| mariner transposase [Nabis sp. HMR-1997a]TPQQSAKRLEICRNLLENPFDLRFCHRIVTCDEKWVYWRNPNTNKQWLDYGQTALPVAARGQFEKKSMLCVFWNFEGVIHHEFVPDGCSINSELYCEQLERLYSKISERYPALINRKGVLLQQDNARPHTSHRTKEKFTELHGFELLPHPP>gi|2564387|gb|AAC28147.1| mariner transposase [Epicauta funebris]SEKNLNDRVVICTSLLARNNVEPFLNRMITGDEKWITYENILRKRAYCESGKPSPSTSKPNLNLNKRMLCIWWDIRGPIHYELLKPNKKLNSEKYCQQLDNLTTAVQEKRPAMFNRRDIILHHDNARPHTALGTRQKIAELGWEILSHPP>gi|2564376|gb|AAC28142.1| mariner transposase [Buenoa sp. 
HMR-1997]TSDQKQQRIDDSEQCLKMFNRNKSEFLRRYVTMDETWLHHFTPESSRQSAEWTAYDEPNPKRAKTQQSAGKVMASVFLDAHGIIFIDYLEKGKTINSDYYIALLERLKDEIAEKRPHLKKKRVLFHQDNAPCHKSMKTMAKLNELGYELLPHPP>gi|1816499|gb|AAC47445.1| mariner transposase [Cymodusa distincta]TTRNLISRIEICDTLLKRNKMDPFLKRLITGDEKWIKYKNVKRKRSWLKPGEVPQTTTKPELTASKVMLSVWWDWKGIVYYEILEPGQTVDSGLYCQQLTRLQEAIQKKRPELVNRKSIEFHHDNARPHTSLMTRQKLTEFGWEILLHPP>gi|520556|gb|AAA20470.1| mariner transposase [Tetranychus urticae]PPGQMEHRVMACRFNLQMHRKTRELIQRTISIDETWVSLYMEPEKEQAKGWYYPDEQPEEVPRQNIHGNKRMLIMGMDYNGIAFFELLPEKTTVDGQTYKGFLERHVRHWLGTRASKHLWLLHDNARPHKHQVVREWLERHEITLWHHPP>gi|3093971|gb|AAC15448.1| mariner transposase [Heliothis subflexa]MLKLYENGTSNNINNIVSGDETWLYYFDVPSKNKNKVWLFENEQTPVQVRKSRSVKKKMIAVFFTRRGILERIVLESQRTVTASWYINDCLPKVFQKLQEIRPNSRMDTWHFHHDNASAHRARDTVEFLNTSGVKVLEHPAYTPDL>gi|520553|gb|AAA20469.1| mariner transposase [Metaseiulus occidentalis]SERQKEVRLTVCRELLSRYKNKSFLYRIITSDEKWIYYDNPGRKRSWVSPGEPAEKSVRRNRFGKKTMLCVWWDQRGVIYHELLKPGETVDTARYQQQLIDLNRAVKEKRPNWDQVRNRVILLHDNAPCHTSKPTQETLSALNWEVLTHPA
>gi|3142710|gb|AAC16889.1| mariner transposase [Heliothis virescens]KWRKKMLHMYENGTSNNINNIVTGDETWLYYFDLPSKNKNKVWLFENEQTPVQVRKSRSVKKKMIAVFFTRRGILERVLLESQRTVTASWYINECLPKVFQRLQEIRPNSRMDTWHFHHDNAPAHRARDTVEFLNSSGVRVLDHPAYPPDLPQ
A.2 Identify Coding Region
A.2.1 tblastn Search
We utilize the library of TEs from the previous section to perform a tblastn
search against the genome. Here, we present a subset of the tblastn hits.
TBLASTN 2.2.23+

Reference: Stephen F. Altschul, Thomas L. Madden, Alejandro A.
Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J.
Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of
protein database search programs", Nucleic Acids Res. 25:3389-3402.

Database: phumanus.CONTIGS-USDA.PhumU1.fa
8,555 sequences; 108,367,968 total letters

Query= gi|600840|gb|AAC46948.1| mariner transposase [Chrysoperla
plorabunda]
Length=348

                                                              Score    E
Sequences producing significant alignments:                  (Bits)  Value

AAZO01005188.1 Pediculus humanus USDA contig 1103172084976    136    7e-32
AAZO01006015.1 Pediculus humanus USDA contig 1103172085458    132    1e-30
AAZO01003584.1 Pediculus humanus USDA contig 1103172096529    110    1e-28
AAZO01007101.1 Pediculus humanus USDA contig 1103172086085    122    1e-27
AAZO01003080.1 Pediculus humanus USDA contig 1103172096313    120    9e-27
AAZO01005603.1 Pediculus humanus USDA contig 1103172085202    100    3e-26
AAZO01001198.1 Pediculus humanus USDA contig 1103172094763    116    1e-25
AAZO01004437.1 Pediculus humanus USDA contig 1103172096885    114    4e-25
AAZO01001978.1 Pediculus humanus USDA contig 1103172095787    113    8e-25
AAZO01005816.1 Pediculus humanus USDA contig 1103172085330    112    1e-24
AAZO01001816.1 Pediculus humanus USDA contig 1103172095715    94.1   4e-24
AAZO01007534.1 Pediculus humanus USDA contig 1103172086338    111    4e-24
AAZO01007070.1 Pediculus humanus USDA contig 1103172086064    111    4e-24
AAZO01006787.1 Pediculus humanus USDA contig 1103172085910    111    4e-24
AAZO01005899.1 Pediculus humanus USDA contig 1103172085374    111    4e-24
AAZO01007995.1 Pediculus humanus USDA contig 1103172088564    110    9e-24
AAZO01006175.1 Pediculus humanus USDA contig 1103172085571    110    9e-24
AAZO01000215.1 Pediculus humanus USDA contig 1103172094998    110    1e-23
AAZO01003840.1 Pediculus humanus USDA contig 1103172096640    109    1e-23
AAZO01006286.1 Pediculus humanus USDA contig 1103172085643    108    2e-23
AAZO01005421.1 Pediculus humanus USDA contig 1103172085111    108    2e-23
AAZO01000288.1 Pediculus humanus USDA contig 1103172095038    108    2e-23
AAZO01007414.1 Pediculus humanus USDA contig 1103172086274    108    2e-23
AAZO01007386.1 Pediculus humanus USDA contig 1103172086255    108    3e-23
AAZO01006519.1 Pediculus humanus USDA contig 1103172090088    107    3e-23
AAZO01007033.1 Pediculus humanus USDA contig 1103172094563    107    4e-23
AAZO01007487.1 Pediculus humanus USDA contig 1103172086313    107    6e-23
AAZO01004198.1 Pediculus humanus USDA contig 1103172096805    107    6e-23
AAZO01003892.1 Pediculus humanus USDA contig 1103172096659    107    6e-23
AAZO01001012.1 Pediculus humanus USDA contig 1103172095359    106    1e-22
AAZO01001082.1 Pediculus humanus USDA contig 1103172095391    106    1e-22
AAZO01004190.1 Pediculus humanus USDA contig 1103172096798    106    1e-22
AAZO01003375.1 Pediculus humanus USDA contig 1103172096437    105    2e-22
AAZO01006313.1 Pediculus humanus USDA contig 1103172085657    104    3e-22
AAZO01003096.1 Pediculus humanus USDA contig 1103172096321    104    4e-22
AAZO01007094.1 Pediculus humanus USDA contig 1103172086080    104    5e-22
AAZO01008517.1 Pediculus humanus USDA contig 1103172093434    104    5e-22
AAZO01006528.1 Pediculus humanus USDA contig 1103172085776    102    2e-21
AAZO01005218.1 Pediculus humanus USDA contig 1103172084995    100    7e-21
> AAZO01005188.1 Pediculus humanus USDA contig 1103172084976
Length=113412

Score = 136 bits (318), Expect = 7e-32, Method: Compositional matrix adjust.
Identities = 106/329 (32%), Positives = 153/329 (46%), Gaps = 21/329 (6%)
Frame = -1
Query 2 EKKEFRVLIKYCFLKGKNTVEAKTWLDNEFPDSAPGKSTIIDWYAKFKRGEMSTEDGERS 61+ F ++ + F KG N +A L D A +W+AKF+ G+ S ++ ERS
Sbjct 112767 QSEHFLHILLFYF*KGVNASQANKKLWVV*GDEALTERQCQNWFAKFRSGDFSLQNEERS 112588
Query 62 GRPKEVVTDENIKKIHKMILNDRKMKLIEIAEALKISKERVGHIIHQYLDMRKLCAK-WV 120GR EV DE IK +I DR +I L +S V + + +KL A W
Sbjct 112587 GRQLEV-KDEQIKA---LIDYDRYSSTKDIVKKLDVSHTCVKNRLRRLGCQKKLDALLW- 112423
Query 121 PRELTFDQKQQRVDDSERCLQLL--TRNTPEFLRRYVTMDETWLHHYTPEFNRQSAEWTA 178V+++ L+ T+ F R VT DE W + +F R+ + W
Sbjct 112422 ---------GTLVNEATWSLRYAS*TQCK*PFFERMVTGDEKWVVY--DDFLRKRS-WFR 112279
Query 179 TGEPSPKRGKTQKSAGKVMASVFWDAHGIIFIDYLEKGKTINSDYYMALLERLKVEIAAK 238G + + + KV+ S WD GI+ + L + +TINS+ Y+ L L I K
Sbjct 112278 QGNRHQQLLRLTFTKKKVLLSFWWDYKGIVNFELLPRCQTINSEVYIRQLTNLSDTIQEK 112099
Query 239 RPHMKKKK-VLFHQDNAPCHKSLRTMAKIHELGFELLPHPPYSPDLAPSDFFLFSDLKRM 297RP + K + FH+ NA +L T K+ ELG L HPPYSP LAP ++ F LK
Sbjct 112098 RPELANSKGIVFHHHNARPSLTLATGQKLLELGWNVLLHPPYSPKLAPNNYHFFRFLKNF 111919
Query 298 LAGKKFGCNEEVIAETEAYFEAKPKEYYQ 326L G+KF + EV E +F K KE Y+
Sbjct 111918 LNGQKFQNDNEVKTALEQFFAPKTKELYE 111832
> AAZO01006015.1 Pediculus humanus USDA contig 1103172085458
Length=109781

Score = 132 bits (308), Expect = 1e-30, Method: Compositional matrix adjust.
Identities = 94/288 (32%), Positives = 138/288 (47%), Gaps = 21/288 (7%)
Frame = +1
Query 43 DWYAKFKRGEMSTEDGERSGRPKEVVTDENIKKIHKMILNDRKMKLIEIAEALKISKERV 102+W+AKF+ G+ S ++ ERSGR EV DE IK +I DR I L +S+ V
Sbjct 96649 NWFAKFRSGDFSLQNEERSGRQLEV-KDEQIKA---LIDYDRHSSTKYIIKKLDVSRTCV 96816
Query 103 GHIIHQYLDMRKLCAK-WVPRELTFDQKQQRVDDSERCLQLL--TRNTPEFLRRYVTMDE 159+ + +KL A W V+++ L+ T F R VT DE
Sbjct 96817 KNCLRRLECQKKLDALLW----------GTLVNEATWSLRYAS*TECK*PFFERMVTEDE 96966
Query 160 TWLHHYTPEFNRQSAEWTATGEPSPKRGKTQKSAGKVMASVFWDAHGIIFIDYLEKGKTI 219W + +F R+ + + G +P K K+++S WD GI+ + L + +TI
Sbjct 96967 KWVVY--DDFLRKKS*-SRQGKQAPTTSKVDIKQKKILSSFWWDYKGIVNFELLPRCQTI 97137
Query 220 NSDYYMALLERLKVEIAAKRPHMKKKK-VLFHQDNAPCHKSLRTMAKIHELGFELLPHPP 278NS+ Y+ L L I KRP + K + FH+ NA +L T K+ ELG L HP
Sbjct 97138 NSEVYIRQLTNLNDTIQEKRPELANSKGIVFHHHNARPSPTLATGQKLLELGWNVLLHPS 97317
Query 279 YSPDLAPSDFFLFSDLKRMLAGKKFGCNEEVIAETEAYFEAKPKEYYQ 326YSP L P ++ F LK L G+KF + EV + +F K KE+Y+
Sbjct 97318 YSPKLPPNNYHFFRSLKNFLNGQKFQNDNEVKTALDQFFAPKTKEFYE 97461
> AAZO01003584.1 Pediculus humanus USDA contig 1103172096529
Length=67685

Score = 110 bits (254), Expect(3) = 1e-28, Method: Compositional matrix adjust.
Identities = 55/145 (37%), Positives = 78/145 (53%), Gaps = 1/145 (0%)
Frame = -2
Query 183 SPKRGKTQKSAGKVMASVFWDAHGIIFIDYLEKGKTINSDYYMALLERLKVEIAAKRPHM 242+P K K+++S WD GI+ + L + +TIN + Y+ L L I KR +
Sbjct 16786 APTTSKVDIKQKKILSSFWWDYKGIVNFELLPRNQTINLEVYIRQLTNLNDTIQEKRLEL 16607
Query 243 KKKK-VLFHQDNAPCHKSLRTMAKIHELGFELLPHPPYSPDLAPSDFFLFSDLKRMLAGK 301+K + FH+DNA SL T K+ ELG + L HPPYSP LAP ++ F LK L G+
Sbjct 16606 ANRKGIVFHHDNARPSPSLATGQKLLELGWDVLLHPPYSPKLAPNNYHFFRSLKNFLNGQ 16427
Query 302 KFGCNEEVIAETEAYFEAKPKEYYQ 326KF + EV +F K KE+Y+
Sbjct 16426 KFQNDNEVKTALNQFFAPKTKEFYE 16352
Score = 31.8 bits (67), Expect(3) = 1e-28, Method: Compositional matrix adjust.
Identities = 21/54 (38%), Positives = 29/54 (53%), Gaps = 4/54 (7%)
Frame = -1
Query 45 YAKFKRGEMSTEDGERSGRPKEVVTDENIKKIHKMILNDRKMKLIEIAEALKIS 98+AKF G+ S + E SG EV D++ +K I NDR +IAE L +S
Sbjct 17150 FAKFYSGDFSLKNEECSGCLVEV--DDDQRK--AVIVNDRHSSTRDIAEKLDVS 17001
Score = 25.1 bits (51), Expect(3) = 1e-28, Method: Compositional matrix adjust.
Identities = 21/68 (30%), Positives = 32/68 (47%), Gaps = 11/68 (16%)
Frame = -3
Query 117 AKWVPRE---LTFDQKQQRVDDSERCLQLLTRNTPE-FLRRYVTMDETWLHHYTPEFNRQ 172W+P+E +T +Q C LL RN + FL+ VT DE W + +F R+
Sbjct 16974 VSWIPKEACCITLGSVRQL----GLCDMLLKRNANDPFLKEMVTGDEKWVVY--DDFLRK 16813
Query 173 SAEWTATG 180+ W+ G
Sbjct 16812 RS-WSRQG 16792
...
Matrix: BLOSUM90
Gap Penalties: Existence: 10, Extension: 1
Neighboring words threshold: 13
Window for multiple hits: 40
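For readers who wish to repeat a search of this kind, the sketch below assembles a BLAST+ tblastn command line using the scoring parameters reported in the output above (BLOSUM90, gap existence 10, gap extension 1). The query and output file names are hypothetical, and the sketch assumes that a nucleotide BLAST database has already been built from the contig file with makeblastdb. It is not the exact invocation used to produce the results shown here.

```python
import shlex


def build_tblastn_command(query_fasta, database, output_path,
                          matrix="BLOSUM90", gap_open=10, gap_extend=1):
    """Return the argument list for a BLAST+ tblastn run (protein query vs. translated DNA)."""
    return [
        "tblastn",
        "-query", query_fasta,         # protein FASTA of representative transposases
        "-db", database,               # nucleotide database built with makeblastdb
        "-matrix", matrix,             # scoring matrix reported in the output footer
        "-gapopen", str(gap_open),     # gap existence penalty
        "-gapextend", str(gap_extend), # gap extension penalty
        "-out", output_path,
    ]


command = build_tblastn_command(
    "mariner_transposases.faa",          # hypothetical query file name
    "phumanus.CONTIGS-USDA.PhumU1.fa",   # database name shown in the output above
    "mariner_tblastn.txt",               # hypothetical output file name
)
print(shlex.join(command))
# To execute, with BLAST+ installed: subprocess.run(command, check=True)
```

Building the argument list separately from executing it keeps the parameters easy to inspect and log alongside the results.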
A.2.2 Extract Sequences from the Genome
After the tblastn search, we combine hits within 50 bp that originate from
the same query sequence. When extracting the combined hits from the genome, we
include the intervening sequence. A subset of the sequences extracted by these
steps is shown below:
>AAZO01000215.1-0 40432 40911 fATGGGGAGCCAAAGCGAGCATTTCCTCCACATTTTACTTTTTTATTTCTGAAAGGGTGTTAATGCTTCCCAGGCCAATAAAATGTTGTGAGCTGTGTAGGGGGATGAAACCTTAATAGAACGGCAGTGTCAAAACTGGTTTGCGAAATTCCGTTCTGGAGATTTTCCTTTAAAAAATGAGGAGCGCTCCGGGCGTCCATTGGAGGTTAACGATGAGCAAATAAAGGCCCTCATTGATTATGATCGGCATAGTTCGACGAAGGACATTGTAAAGAAGCTAGATGTGTCACATACGTGCGTCAAAAACTGTCTGCGGCGTCTTGGGTGCCAAAAAAAGCTTTATGCGTTACTTTGGGGAACGTTAGTTAACGAGGCGACTTGGTTTTTGCGATATGCTTCTTAAACGCAATGCGAATGGCCCTTTTTTGAAAGAATGGTCACCGGAGATGAAAAGTGGGTTGTCTATGATGACTTTTTGAGA>AAZO01000215.1-1 40432 40923 fATGGGGAGCCAAAGCGAGCATTTCCTCCACATTTTACTTTTTTATTTCTGAAAGGGTGTTAATGCTTCCCAGGCCAATAAAATGTTGTGAGCTGTGTAGGGGGATGAAACCTTAATAGAACGGCAGTGTCAAAACTGGTTTGCGAAATTCCGTTCTGGAGATTTTCCTTTAAAAAATGAGGAGCGCTCCGGGCGTCCATTGGAGGTTAACGATGAGCAAATAAAGGCCCTCATTGATTATGATCGGCATAGTTCGACGAAGGACATTGTAAAGAAGCTAGATGTGTCACATACGTGCGTCAAAAACTGTCTGCGGCGTCTTGGGTGCCAAAAAAAGCTTTATGCGTTACTTTGGGGAACGTTAGTTAACGAGGCGACTTGGTTTTTGCGATATGCTTCTTAAACGCAATGCGAATGGCCCTTTTTTGAAAGAATGGTCACCGGAGATGAAAAGTGGGTTGTCTATGATGACTTTTTGAGAAAAAGATCCTGG>AAZO01000215.1-14 40902 41231 fCTTTTTGAGAAAAAGATCCTGGTCTAGGCAAGGAAAGAGGCACCAACAACTTCTTTGAATGACATTCACCAAAAAAAGGTATTGTTATCATTTTGGTGGGATTACAAAGGCATAGTCTACTTTGAGCTGCTGCCACGAAATCAGATCATAAATTCAGAGGTTTACATTCGACAATTGACAAATTTAAATGATACCATCCAAGAAAAACGATCGGAGCTAGCCAATAGCAAAGAAATTGTCTTTCACCACGATAATGCCAGGCCCTCCCCATCTTTAGCCACTGGACAAAAACTACTGGAGCTAGGCTGGAATGTTTTGCTGCACCCTCCA>AAZO01000215.1-15 40929 41231 fGCAAGGAAAGAGGCACCAACAACTTCTTTGAATGACATTCACCAAAAAAAGGTATTGTTATCATTTTGGTGGGATTACAAAGGCATAGTCTACTTTGAGCTGCTGCCACGAAATCAGATCATAAATTCAGAGGTTTACATTCGACAATTGACAAATTTAAATGATACCATCCAAGAAAAACGATCGGAGCTAGCCAATAGCAAAGAAATTGTCTTTCACCACGATAATGCCAGGCCCTCCCCATCTTTAGCCACTGGACAAAAACTACTGGAGCTAGGCTGGAATGTTTTGCTGCACCCTCCA>AAZO01000215.1-16 40929 41441 f
GCAAGGAAAGAGGCACCAACAACTTCTTTGAATGACATTCACCAAAAAAAGGTATTGTTATCATTTTGGTGGGATTACAAAGGCATAGTCTACTTTGAGCTGCTGCCACGAAATCAGATCATAAATTCAGAGGTTTACATTCGACAATTGACAAATTTAAATGATACCATCCAAGAAAAACGATCGGAGCTAGCCAATAGCAAAGAAATTGTCTTTCACCACGATAATGCCAGGCCCTCCCCATCTTTAGCCACTGGACAAAAACTACTGGAGCTAGGCTGGAATGTTTTGCTGCACCCTCCATATAGTCCCAAACTAGCTCCAAATAATTATCATTTTTTCCGATTCCTAAAAAATTTTTTAAACGGAGAAAAATTCCAAAACGACAATGAGGTCAAAACTGCATTGGAGCAGTTTTTTGCTCCTAAAACTAAAGAGTTTTATGAAAAAAGGAAAATGATACTACCCAAAAAATGTCAAAAGGTCACCAATAATAATGGACATAATATAATA>AAZO01000215.1-17 40935 41207 fAAAGAGGCACCAACAACTTCTTTGAATGACATTCACCAAAAAAAGGTATTGTTATCATTTTGGTGGGATTACAAAGGCATAGTCTACTTTGAGCTGCTGCCACGAAATCAGATCATAAATTCAGAGGTTTACATTCGACAATTGACAAATTTAAATGATACCATCCAAGAAAAACGATCGGAGCTAGCCAATAGCAAAGAAATTGTCTTTCACCACGATAATGCCAGGCCCTCCCCATCTTTAGCCACTGGACAAAAACTACTGGAGCTAGGC>AAZO01000215.1-18 40938 41231 fGAGGCACCAACAACTTCTTTGAATGACATTCACCAAAAAAAGGTATTGTTATCATTTTGGTGGGATTACAAAGGCATAGTCTACTTTGAGCTGCTGCCACGAAATCAGATCATAAATTCAGAGGTTTACATTCGACAATTGACAAATTTAAATGATACCATCCAAGAAAAACGATCGGAGCTAGCCAATAGCAAAGAAATTGTCTTTCACCACGATAATGCCAGGCCCTCCCCATCTTTAGCCACTGGACAAAAACTACTGGAGCTAGGCTGGAATGTTTTGCTGCACCCTCCA>AAZO01000215.1-19 40938 41231 fGAGGCACCAACAACTTCTTTGAATGACATTCACCAAAAAAAGGTATTGTTATCATTTTGGTGGGATTACAAAGGCATAGTCTACTTTGAGCTGCTGCCACGAAATCAGATCATAAATTCAGAGGTTTACATTCGACAATTGACAAATTTAAATGATACCATCCAAGAAAAACGATCGGAGCTAGCCAATAGCAAAGAAATTGTCTTTCACCACGATAATGCCAGGCCCTCCCCATCTTTAGCCACTGGACAAAAACTACTGGAGCTAGGCTGGAATGTTTTGCTGCACCCTCCA
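The hit-combining step described at the start of this section can be sketched as follows. This is an illustration only: the tuple layout, the function names, and the choice to group hits by contig and strand in addition to query are assumptions, and TESeeker's own implementation may differ. The first three example hits use the subject coordinates of the three alignments reported for contig AAZO01003584.1 in Section A.2.1; the fourth hit is hypothetical.

```python
from collections import defaultdict


def merge_hits(hits, max_gap=50):
    """Merge hits from the same query, contig, and strand that lie within max_gap bp."""
    groups = defaultdict(list)
    for query, contig, start, end, strand in hits:
        groups[(query, contig, strand)].append((min(start, end), max(start, end)))

    merged = []
    for (query, contig, strand), spans in sorted(groups.items()):
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start - cur_end <= max_gap:
                cur_end = max(cur_end, end)  # absorb the hit and the gap before it
            else:
                merged.append((query, contig, cur_start, cur_end, strand))
                cur_start, cur_end = start, end
        merged.append((query, contig, cur_start, cur_end, strand))
    return merged


def extract(contig_seq, start, end):
    """Return the 1-based inclusive region [start, end], intervening sequence included."""
    return contig_seq[start - 1:end]


hits = [
    ("AAC46948.1", "AAZO01003584.1", 16786, 16352, "-"),
    ("AAC46948.1", "AAZO01003584.1", 17150, 17001, "-"),
    ("AAC46948.1", "AAZO01003584.1", 16974, 16792, "-"),
    ("AAC46948.1", "AAZO01003584.1", 20000, 20300, "-"),  # hypothetical distant hit
]
for region in merge_hits(hits):
    print(region)
```

Because the three reported alignments are separated by only 6 bp and 27 bp, they collapse into a single region, while the hypothetical distant hit remains separate.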
A.2.3 CAP3 Assembly
The extracted sequences are fed into the CAP3 assembler, which produces con-
tigs and singletons from the sequences. The contigs file is accompanied by a
quality score file that records the quality of each base pair in each contig. We utilize
the quality scores to trim the contigs to encompass the TE, without irrelevant
adjacent sequence. In the following sections, we show the raw CAP3 contigs and
their accompanying quality file. In this case, the quality scores for each contig
never drop below the threshold for the required amount of time; therefore, the
contigs do not get trimmed.
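One plausible reading of this trimming rule is sketched below: a contig is cut wherever the quality stays below a threshold for a sustained run of bases, and the longest remaining stretch is kept. The threshold of 20, the run length of 10, and the function names are placeholders chosen for illustration; the actual criteria are those of the approach described in Chapter 3 and may differ.

```python
def low_quality_runs(quals, threshold, min_run):
    """Yield half-open (start, end) ranges where quality < threshold for >= min_run bases."""
    run_start = None
    # A sentinel equal to the threshold closes any run that reaches the end.
    for i, q in enumerate(list(quals) + [threshold]):
        if q < threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if i - run_start >= min_run:
                yield run_start, i
            run_start = None


def trim_contig(seq, quals, threshold=20, min_run=10):
    """Keep the longest stretch of seq that contains no sustained low-quality run."""
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    best_start, best_end = 0, 0
    prev_end = 0
    for start, end in low_quality_runs(quals, threshold, min_run):
        if start - prev_end > best_end - best_start:
            best_start, best_end = prev_end, start
        prev_end = end
    if len(seq) - prev_end > best_end - best_start:
        best_start, best_end = prev_end, len(seq)
    return seq[best_start:best_end]


# A brief dip in quality is tolerated, so this contig is returned untrimmed.
seq = "A" * 30 + "C" * 3 + "G" * 59
quals = [40] * 30 + [5] * 3 + [40] * 59
print(trim_contig(seq, quals) == seq)
```

Under this reading, a contig whose quality never stays below the threshold for the required run length is returned unchanged, which matches the outcome reported for the contigs in this walkthrough.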
A.2.3.1 CAP3 Contigs
>Contig1ATGGGGAGCCAGAGCGAGCATTTCCTCCACATTTTACTTTTTTATTTTTGAAAGGGTGTTAATGCTTCCCAGGCCAATAAAAAGTTGTGGGTCGTGTAGGGGGATGAAGCCTTAATAGAACGGCAGTGTCAAAACTGGTTTGCGAAATTCCGTTCTGGAGATTTTTCTTTGCAAAATGAGGAGTGCTCCGGGCGCCAATTGGAGGTTAAAGATGAGCAAATAAAGGCCCTCATTGATTATGATCGGCATAGTTCGACTAAGGACATTGTAAAGAAGCTAGATGTGTCACATACGTGCGTCAAAAACCGTCTGCGGCGTCTTGGGTGCCAAAAGAAGCTTGATGCGTTACTTTGGGGAACGTTAGTTAACGAGGCGACTTGGTCTTTGCGATATGCTTCTTAAACGCAATGCAAATGACCCTTTTTTGAAAGAATGGTCACCGGAGATGAAAAGTGGGTTGTCTATGATGACTTTTTGAGAAAAAGATCCTGGTTTAGGCAAGGAAACAGGCACCAACAACTTCTAAGGCTGACATTCACCAAAAAAAGGTATTGTTATCATTTTGGTGGGATTACAAAGGCATAGTCTACTTTGAGCTGCTGCCACGATCAGACCATAAATTCAGAGGTTTACATTCGACAATTGACTAATTTAAATGATACCATCCAAGAAAAACGACCGGAGCTAGCCAATAGCAAAGGAATTGTCTTTCACCACCATAATGCCAGGCCCTCCCCATCTTTAGCCACTGGACAAAAACTACTGGAGCTAGGCTGGAATGTTTTGCTGCACCCTCCATATAGTCCCAAACTAGCTCCAAATAATTATCATTTTTTCCGATTCCTAAAAAATTTTTTAAACGGACAAAAATTCCAAAACGACAATGAGGTCAAAACTGCATTGGAGCAGTTTTTTGCTCCTAAAACTAAAGAGTTGTATAAAAAAAGGAAAATGATACCACCCGAAAAATGTCAAAAGGTCACTAATAATAATAAACATAATATAATAGAT>Contig2ATGGGGAGCCAAAGCGAGCATTTCCTTCACATTTTGCTTTTTTATTTCTGAAAGGGTGTTAATGCTTCACAAGCCAATAAAAAGTTGTGGGCCGTGTAGGGGGATGAAACCTTAATAGAACGGCAGTGTCAAAACTGGTTTGCGAAATTCCGTTCTGGAGATTTTCCTTTGAAAAATGAGGAGTGCTCCGGGCGTCCATTGGAGGTTAATGATGAGCAAATAAAGGCCCTCATTGATTATGATCGGCATAGTTCGACTAAGGACATTGTAAAGAAGCTAGATGAGTCACATACGTGCGTC
AAAAACCGTCTGCGGCGTCTTGGGTACCAAAAGAAGCTTGATGGGTCTTTGCGATATGCTTCTTAAATGCAATGCGAATGACCCTTTTTTGAAAGAATGGTCACCGGAGATGAAAAGTGGGTTGTTTATGATGACTTTTTGAGAAAAAGATCCTGGTCTAGGCAAGGAAACAGGCACCTTAGAACTTCTAAGGCTGACATTCAGCAAAAAAGTTATTGTTATCATTTTGGTGGGATTACAAAGGCATAGTCAACTTTGAGCTGCTGCCACGAAGTCAGACCATAAATTCAGAGGTTTACATTCGACAATTGACAAATTTAAATGATACCATCCAAGAAAAACGATCGGAGCTAGCCAATAGCAAAGGAATTGTCTTTCACCACGATAATGCCAGGCCCTCCCCATCTTTAGCCACTGGACAAAAACTACTGGAGCTAGGCTGGGATGTTTTGCTGCACCCTCCATATAGTCCCAAACTAGCTCCAAATAATTATCATTTTTTCCGATCCCTAAAAAATTTTTTGAACGGACAAAAATTCCAAAACGACAATGAGGTCAAAACTGCATTGGACCAGTTTTTTGCTCCTAAAACTAAAGAGTTTTATGAAAAAAGGAAAATGATT>Contig3ATGGGGAGCCAAAGCGAGCATTTCCTCCACATTTTGCTTTTTTATTTCTGAAAGGGTGTTAATGCTTCACAAGCCAATAAAAAGTTGTGGGCCGTGTAGGGGGATGAAGCCTTAATAGAACGGCAGTGTCAAAACTGGTTTGCGAAATTCCGTTCTGGAGATTTTCCTTTGAAAAATGAGGAGCGCTCCGGGCGTCCATTGGAGGTTAATGATGAGCAAATAAAGGCCCTCATTGATTATGATCGGCATAGTTCGACAAGGGACATTGTAAAGAAGCTAGATGTGTCACATACGTGCGTCAAAAACCGTCTGCGACGTCTTGGGTACCAAAAGAAGCTTGATGCGTTACTTTGGGGAACGTTAGTTAACGAGGCGACTTGGTCTTTGCGATATGCTTCTTAAACGCAATGCGAATGACCCTTTTTTGAAAGAATGGTCATCGAAGATGAAAAGCGGGTTGTTTATGATGACTTTTTGAGAAAAAGATCCTGGTCTAGGTAAGGAAACAGGCACCTTAGAACTTCTAAGGCTGACATTCACCAAAAAAGTTATTGTTATCATTTTGGTGGGCTTACAAAGGCATAGTCAACTTTGAGCTGCTGCTGCCACGAAGTCAGATCATAAATTCAGAGGTTTACATTCGACAATTGACAAATTTAAATGATACCATCCAAGAAAAACGATCATAGCTAGCCAATAGCAAAGGAATTGTCTTTCACTATGATAATGCCAGGCCCTCCCCATCTTTAGCCACTGGACAAAAACTACTGAAGCTAGGCTGGAATGTTTTGCTGCATCCTCCTTAAAGTCCCGACCTAACTCGAAGTGAGAATCATTTTTCCCGATTCCTAAAAAATTTTTTGAACGGACAAAAATTCCAAAACGACAATGAGGTCAATTGTACTACATTGGACCAGTTTTTTGCT>Contig4ATGGGGAGCCAGAGCGAGCATTTCCTCCACATTTTACTTTTTTATTTTTGAAAGAGTGTTAATGCTTCCCAGGCCAATAAAAAGTTGTGGGTCGTGTAGGGGGATGAAGCTTTAACAGAACGGCAGCGTCAAAACTGGTTTGCGAAATTCCGTTCTGGAGATTTTTCTTTGCAAAATGAGGAGCGCTCCGGGCGCCAATTGGAGGTTAAAGATGAGCAAATAAAGGCCCTCATTGATTATGATCGGCATAGTTCGACGAAGTACATTATAAAGAAGCTAGATGTGTCACGTACGTGCGTCAAAAACTGTCTGCGGCGTCTTGAGTGCCAAAAAAAGCTTGATGCGTTACTTTGGGGAACGTTAGTTAACGAGGCGACTTGGTCTTTGCGATATGCTTCTTAAACGGAGTGCAAATGACCCATTTTTGAAAGAATG
GTCACCGAAGATGAAAAGTGGGTTGTCTATGATGACTTTTTGAGGAAAAAATCCTGATCTAGGCAAGGGAAACAGGCACCAACAACTTCTAAGGTTGACATAAAGCAAAAAAAGATCTTGTCATCATTTTGGTGGGATTACAAAGGCATAGTCAACTTTGAACTGCTGCCACGATGTCAGACCATAAATTCAGAGGTTTACATTCGACAATTGACAAATTTAAATGATACCATCCAAGAAAAACGACCGGAACTAGCCAATAGCAAAGGAATTGTCTTTCACCACCATAATGCCAGGCCCTCCCCAACTTTAGCCACTGGACAAAAACTACTGGAGCTAGGCTGGAATGTTTTGCTGCACCCTTCATATAGTCCCAAACTACCTCCAAATAATTATCATTTTTTCCGATCCCTAAAAAATTTTTTGAACGGACAAAAATTCCAAAACGACAATGAGGTCAAAACTGCATTGGACCAGTTTTTTGCTCCTAAAACTAAAGAGTTTTATGAAAAAAGGAAAATGATA
CTACCCGAAAAATGTCAAAAGGTCACTAATAATAATAAACATAATATAATA>Contig5TTATTTTTGAAAGGGTGTTAATGCTTCACAAGTTCATAAAAAGTTGTGGGCTGTGTATGTACGGTGATAAAGCCTTAATAGAACGGCAGTGTCAAAACTGCTTTGAGAAATACAGTTCTGGAGATTTTCCTTTGAAAAATGAGAACCGCTCCAGGCATCCCGTGGAGGTGAATGTCAGTCATAAATAAAGGTTCTCATTGATTATGATCGGCATAATTCGACTATGGACATTGTAAAGAAGCTAGATGTGTCACATACGTGCGTCAAAAACCGTCTACGGCGTCTTGGGTGCCAAAAAAAGCTTGATGCGTTACTTTGGGGAACGTTAGTTAACGAGGCGACTTGGTCTTTGCGATATGCTTCTTAAACGCAATGCGAATGGCCCTTTTTTGAAAGAATGGTCAGCGGAGATGAAAAGTGCATTGTCTATGATGACTTTTTGAGAAAAAAGATCCTGGTCTAGGCAAGGAAACAGGCACCAACAACTTCTAAGACTGACATTCACCAAAAAAAGGTATTGTTATCATTTTGGTGGGATTACAAAGGCATAGTCAACTTTGAGCTGCTGCCACGAAGTCAGACCATAAATTCAGAGGTTTACATTCGACAATTGACAAATTTAAATGATACCATCCAAGAAAAACGACCGGAACTAGCCAATAGCAAAGGAATTGTCTTTCATTACCATAATGCCAGGCCCTCCCCATCTTTAGCCACTGGACAAAAACTACTGGAGCTAGGCTGGGATGTTTTGCTGCACCCTCCATATAGTCCCAAACTAGCTCCAAATAATTATCATTTTTTCCGATCCCTAAAAAATTTTTTGAACGGACAAAAATTCCAAAACGACAATGAGGTCAAAACTGCATTGGACCAGTTTTTTGCTCCTAAAACTAAAGAGTTTTATGAAAAAAGGAAAATGATACTACCCGAAAAATGTCAAAAGGTCATTAATAATAATGGACATAATATAATAGAT>Contig6ATGGAGAGCCAAAGCGAGCATTTCCTCCACATTTTCGTTTTTTATTTTTGAAAGGGTGTTAATGCTTCACAAGCCAATACAAAGTTGTGGGCTGTGTAGGGTGATGAAGCTTTAATAGAACTGCAGTGTCAAAACTGGTTTGCGAAATTCCGTTCTGGAGATTTTTCTTTGAAAGATAAGGAGCGCTCTGGGGGTCCAGTGGAGGTCGATGATGACCAAATAAAGGCCCTAATTGTTAATGATCGGCATAGTTCGACAAGGGACATTGCAAAGAAGCTAGATGTGTCACATAAGTGCGTCCAAAACCGTCTGCGGCGTCTTGGGTACCAAAAGAAGCTTGATGCATAACCTTGGGGAGTGTGAGTTAACGAGGCGACTTGATCTTTGCGATATGCTTCTTAAACGCAATGCGAATGACCCCTTTTTTGAAAGAATGGTCACCGGAGATGAAAAGTGGGTTGTCTATGACAACATTTTGAGGAAAAGATCCTGGTCTAGGCAAGGGAAACAGGCACCAACAACTTCTAAGGCTGACATTAACCAAAAAAAGGTACTGTTATCAGTTTGGTGGGATTACAAAGGCATAGTCTACTTTGAGCTGCTGCTGCCACGAAATCAGACCATAAATTCAGAGGTTAATATTTGACAATTGAGAAATTTAAATGATACCATCCAAGAAAAACGACCGGAACTAGCCAATAGAAAAGGAAGAATTGTCTTTCAACACGATAATGCCAGGCCCTCCCCATCTTTAGCCACTGGACAAAAACTACTGGAGCTAGGCTGGAATGTTTTGCTGCACCCTCCTTATAGTCCCGACCTAGCTCAAAGCGAGTATCATTATTTCCGATCACTAAAAAATTTTTTGAACGGACAAAAATTCCAA>Contig7CTAGATGTGTCACATACGTACGTCGAAAACCGTCTGCGGCGTCTTGGGTACCAAAAGAAG
CACAGGAGATGAAAAGTGGGTTGTCTATGACAACATTTTAAGGAAAAGATCCTGGTCTAGGCAAGGGAAACAGGCACCAACAACTTTTAAGGCTGACATAAACCAAAAAAAAGTATTGTTATCATTTTGGTGGGATTACAAAGGCATAGTTTACTTTGAGCTGTTCCTACAAAGTCAGACCATAAATTCAGAGGTTAACATTCAACAATTGACGAATTTAAATAATGCCATCCAAGAAAAACGACCGGAGCTAGCCAATAGAAAAGGAATTGTCTTTCATCACAATAATGCCAGGCTCCACACATCTTTAGACATCAGACAAAAACTACTGGAACTAGGCTGGGTTGTTTTGCCGCATCCTTCTTATAGTCCCAACTTAGCTCGAAGTGAGTTACATGTGTTTCGATCACTAAAAAAATTTTTGAACGGACAAAAAATCCAAATGAGGTCAAAACTGCATTGGAGCAGTTTTTTGCTCCT
AAAAATAAAGAGTTTTATAAAACATGGGATGATGATACTACCCGAAAAATGGCGAAAGATCATTGATAACAATGGACATAATATAATG
A.2.3.2 CAP3 Contigs Quality Scores
>Contig197 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 82 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 11 97 97 97 97 8297 97 97 97 97 97 97 97 82 97 97 82 97 82 82 82 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 82 82 97 97 97 97 97 97 9797 97 17 97 97 97 97 97 97 97 97 97 97 97 97 87 97 97 97 9782 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 11 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 82 97 97 97 11 97 97 97 17 97 9797 17 97 97 97 97 97 97 97 97 97 97 17 97 97 97 97 97 97 9797 97 97 17 97 97 97 97 97 97 97 97 17 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 17 97 97 9797 97 97 97 97 97 97 97 97 17 97 97 82 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 4 497 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 4 4 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 82 97 97 9797 82 97 97 97 97 97 97 97 97 97 97 97 97 82 97 97 97 97 9782 82 97 97 97 97 97 82 82 97 97 82 97 97 97 82 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 82 97 97 97 97 97 82 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 82 97 97 97 97 9797 97 97 97 97 97 97 11 17 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 82 82 82 97 97 97 97 97 9782 97 97 97 97 97 97 97 97 97 97 82 97 97 97 97 97 97 97 9797 97 97 97 97 82 97 11 82 97 97 97 97 97 97 97 97 97 82 8282 82 82 82 82 82 82 82 97 97 97 82 97 17 97 97 97 97 97 9797 97 97 82 17 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 82 82 82 97 97 11 97 97 97 97 97 97 97 17 97 82 97 9797 97 97 97 97 82 82 97 97 97 97 97 97 97 97 97 
97 97 17 9782 97 97 11 97 97 97 82 97 97 97 97 97 97 97 82 97 97 97 9782 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 82 17 97 9797 97 97 82 97 97 97 97 97 97 97 97 97 97 4 97 4 97 4 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97
97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 497 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 4 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 95 97 97 97>Contig297 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 30 97 97 4 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 70 97 97 97 97 97 97 50 97 97 97 70 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 70 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 7097 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 90 70 9797 97 97 97 97 97 97 97 97 97 70 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 70 97 97 97 97 97 97 97 97 97 90 97 9797 90 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 80 50 97 97 97 97 97 80 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 90 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 42 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 42 97 9797 97 97 42 97 97 97 97 97 97 97 42 97 42 42 97 97 97 97 9742 97 97 97 97 
97 97 97 97 97 97 97 97 97 97 97 97 97 40 4297 40 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 71 97 97 97 97 97 97 97 97 15 97 97 5 42 97 97 597 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 84 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 74 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 77 77 97 77 97 97 77 97 97 97 97 97 9797 97 97 77 97 97 77 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 30 97 97 97 97 97 97 97 97 97 97 97 97 87 97 97
97 97 97 97 97 97 48 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 77 97 97 97 97 97 97 77 77 77 7777 60 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 77 50 97 97 82 97 9797 97 97 97 97 97 97 97 97 97 97 82 97 62 97 97 97 97 97 9797 97 97 52 90 90 90 65 90 90 90 90 67 38 90 67 90 90 90 5267 90 90 67 90 90 67 90 67 90 67 90 90 90 90 90 90 90 90 9090 90 90 53 90 90 90 90 27 90 90 90 90 90 90 90 90 90 90 9090 90 90 90 90 67 20 90 90 90 90 65 90 90 90 90 90 90 90 8282 82 77 77 55 23 75 75 75 75 75 70 70 70 47 70 70 70 70 4770 70 70 70 70 70 70 70 35 35 70 70 35 70 70 70 70 70 70 7070 45 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 7070 70 70 70 70 70 47 70 65 65 65 40 55 55 55 55 55 55 55 5555 55 35>Contig330 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 3030 5 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 3030 30 30 30 30 30 30 15 30 30 30 30 30 30 30 30 30 30 30 3030 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 3030 30 30 30 30 30 30 30 30 30 30 30 15 30 30 30 30 30 30 3030 15 30 30 30 30 30 30 5 30 15 30 30 30 30 30 35 35 35 3537 37 37 37 37 37 15 37 37 37 37 37 37 37 37 42 42 42 42 2742 42 42 42 42 42 42 42 42 42 27 42 42 42 42 42 42 42 42 2042 42 42 42 42 42 42 42 42 42 42 42 20 42 42 42 42 42 42 4242 42 42 42 27 27 42 42 20 42 20 42 5 42 42 20 42 42 20 4242 42 42 42 42 42 42 5 42 42 42 42 42 42 42 20 20 42 42 4242 42 42 42 42 42 42 42 42 42 42 42 42 42 27 42 42 42 42 2727 42 42 42 42 42 20 42 42 42 42 42 42 20 42 20 42 5 42 542 42 42 42 17 42 42 42 5 42 42 42 42 42 42 42 5 42 42 4227 42 42 20 42 42 5 42 42 42 42 42 27 42 42 27 27 42 42 4242 42 42 42 42 20 27 42 42 42 42 20 42 20 5 27 20 42 42 4242 42 42 42 42 20 42 37 37 15 37 37 37 37 37 15 37 17 37 3737 17 37 42 11 42 11 42 20 11 42 42 42 42 42 47 47 47 25 4747 25 47 47 57 57 57 57 57 42 42 35 35 67 42 67 67 67 67 6767 67 67 67 67 52 67 27 67 67 67 67 67 67 67 67 67 0 67 6767 67 67 67 67 67 67 67 67 67 67 67 67 67 67 67 
67 45 67 6767 67 67 67 67 67 67 65 65 65 60 60 60 60 60 60 60 60 60 1560 35 15 60 60 60 60 60 60 60 60 60 60 15 60 60 60 60 60 6060 45 45 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 60 6060 60 60 60 60 60 60 45 60 60 60 60 55 55 55 55 55 55 10 5555 55 55 55 55 30 55 55 55 55 55 55 55 55 55 55 55 55 55 5555 60 60 60 60 60 60 60 60 10 60 60 0 0 0 60 65 10 10 1567 45 57 57 57 95 95 95 95 95 95 95 95 95 95 95 95 95 95 9595 95 92 95 95 95 95 95 95 95 50 95 95 95 95 92 92 92 92 9292 92 92 92 92 92 92 40 92 92 92 92 92 92 92 92 92 92 92 92
92 92 40 40 40 92 92 92 92 25 92 92 40 92 92 92 92 92 25 9292 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 9292 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 9292 92 92 92 92 92 92 25 92 25 40 92 92 92 92 92 92 92 92 9292 92 92 40 25 40 25 25 92 92 92 92 92 92 92 92 92 92 92 9292 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 40 4092 40 40 92 92 92 92 92 92 92 92 92 92 92 92 40 25 92 92 4092 92 40 40 40 92 62 62 62 62 62 62 62 62 62 62 62 62 62 6262 62 62 62 62 62 62 62 62 62 10 62 62 10 62 62 62 62 62 6262 62 10 62 62 62 62 62 62 62 62 62 62 62 62 62 10 62 62 6235 50 35 20 20 5 20 20 20 20 20 20 5 20 5 20 20 22 15 1515 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 1515 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 1515 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 1515 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 1515 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 1515 15 15 15 15 15>Contig420 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 2020 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 2020 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 2020 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 2020 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 2020 20 5 20 20 20 20 20 20 20 5 20 20 20 20 5 20 20 20 2020 20 20 20 20 20 5 20 20 20 20 20 25 25 25 25 25 25 25 2525 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 2525 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 2525 25 25 10 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 2525 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 2525 25 25 25 10 25 25 25 25 25 25 25 25 25 25 25 25 25 25 2525 25 25 25 25 25 25 25 25 25 25 25 25 10 25 25 25 10 25 2525 25 25 25 25 25 25 10 25 25 25 25 10 25 25 25 25 25 25 2525 25 25 10 25 25 25 25 25 10 25 25 10 25 25 25 25 25 25 2525 25 25 25 25 25 10 25 25 25 25 25 25 25 25 25 10 25 25 2525 25 10 25 25 25 25 25 25 10 25 25 10 25 25 25 25 25 25 2525 25 25 25 25 25 25 25 25 25 25 
25 25 25 25 25 25 25 25 2525 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 10 105 5 5 5 45 40 45 45 45 45 45 45 45 45 45 45 45 45 45 4545 45 45 45 45 45 45 45 45 50 50 35 50 50 50 50 50 72 97 975 97 97 97 97 97 97 97 97 97 97 97 97 97 93 97 97 97 97 9797 97 82 97 93 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 82 93 97 97 82 97 97 97 93 93 97 97 9797 97 97 97 82 97 97 97 97 97 97 82 97 97 97 97 97 97 97 9793 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 95 95 95
97 85 97 97 97 97 97 97 97 30 97 30 97 97 97 97 30 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 30 97 97 9797 97 97 97 97 97 97 97 97 30 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 30 97 97 97 97 30 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 30 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 5 905 50 50 25 50 50 50 50 50 50 25 50 25 25 50 50 4 50 50 504 50 50 50 50 25 50 25 25 50 50 50 50 50 50 50 50 50 25 5050 25 50 50 25 50 50 50 50 50 50 50 50 50 50 50 50 50 50 5050 50 50 25 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 5050 50 50 50 50 50 50 50 25 50 50 50 50 50 50 50 50 50 50 5050 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 5050 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 5050 50 50 50 50 45 45 45 45 45 45 45 45 45 45 45 45 45 45 4545 45 45 45 45 45 20 45 45 45 45 45 45 20 45 45 45 45 45 4545 45 45 45 45 45 20 45 45 45 45 45 45 45 45 45 20 4 20 4040 40 35 35 10 35 35 35 35 35 35>Contig510 10 10 10 10 10 10 10 15 15 15 15 15 15 15 15 15 15 15 1515 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 1515 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 1515 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 1515 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 1515 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 1515 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 1515 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 1515 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 1515 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 1515 15 15 15 15 15 
15 15 15 15 15 15 15 15 15 15 15 15 15 1515 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 1515 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 1515 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 1515 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 1515 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 1515 15 15 15 15 15 15 15 15 35 35 35 35 35 35 35 35 35 40 4040 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 4040 40 15 40 40 40 40 40 40 40 40 40 40 40 40 40 15 40 40 4040 15 85 85 85 95 95 95 95 95 95 95 95 95 95 95 95 95 95 95
95 95 95 40 70 95 95 95 95 95 95 95 95 95 95 95 95 95 95 9570 70 95 95 95 95 95 95 95 95 95 95 95 95 95 95 95 95 95 9595 95 95 95 95 95 95 95 90 25 90 90 90 90 95 95 95 95 90 9090 90 90 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 45 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 36 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 42 97 97 97 97 97 97 97 36 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 57 97 57 97 97 97 36 97 97 97 9797 36 42 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 42 97 97 42 97 97 97 97 97 97 97 97 97 9797 97 97 97 42 97 97 97 42 42 97 36 97 97 97 97 97 42 97 9797 97 42 97 42 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 42 42 97 97 42 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 42 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 42 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 42 97 97 97 97 97 9797 97 97 97 82 97 97 97 97 97 97 97 97 97 82 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 5 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 82 97 97 5 82 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 17 97 72 97 97 97 97 97 97 97 97 97 97 9797 75 97 77 67 97 97 97 97 97 97 97 97 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 37 97 97 97 9797 57 97 97 97 97 97 97 97 97 90 60 85 60 70 85 85 85 85 8585 85 85 85 85 75 75 75 60 60 60 60 60 60 60 60 60 60 60 6060 60 60 60 60 60 60 60 60 60 15 60 60 35 60 60 35 60 60 6060 60 60 60 60 60 60 60 60 35 60 60 60 55 55 55 30 30 25>Contig615 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 1515 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 1515 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 1515 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 1515 15 15 15 15 15 15 15 15 15 15 15 15 20 20 20 
20 20 5 2020 20 20 20 20 20 20 20 20 5 5 20 20 20 20 20 20 20 20 520 5 20 5 20 20 20 22 22 22 22 22 7 22 30 30 22 30 30 307 30 7 22 37 22 37 37 37 37 22 7 37 37 37 37 37 30 37 3737 37 37 37 37 37 22 37 37 37 37 37 37 37 3 37 22 7 37 3037 37 37 37 22 30 37 37 37 37 30 37 7 37 37 22 37 37 37 3737 37 37 37 37 37 30 37 37 37 37 37 22 37 37 22 37 30 37 1537 37 37 37 30 30 37 22 5 37 7 37 37 22 37 7 37 7 37 3737 37 37 37 15 15 7 7 0 37 37 37 37 3 37 37 22 37 7 1537 37 15 37 37 37 37 30 15 30 7 37 22 37 37 37 37 37 37 3737 37 15 15 37 37 37 37 37 37 37 22 5 22 22 22 5 22 22 22
5 22 22 22 22 22 22 5 7 7 37 37 30 37 7 3 22 37 37 3037 37 15 37 37 37 30 22 30 37 37 37 37 37 35 12 35 35 35 350 2 35 35 35 0 12 2 35 0 0 12 35 35 35 15 40 25 25 2510 32 25 32 32 32 25 2 25 32 40 10 2 45 45 45 45 45 45 455 45 45 45 45 45 45 32 45 45 45 45 45 45 45 45 45 32 45 4545 45 15 15 0 15 2 15 15 20 37 20 50 50 50 50 42 2 97 9427 97 97 97 97 97 64 94 97 94 94 97 80 97 97 97 97 94 97 9797 97 97 12 97 94 97 97 94 97 97 97 94 97 97 97 97 94 97 9794 97 94 97 97 94 94 94 97 94 97 97 94 97 97 97 97 97 97 9797 97 97 97 97 97 97 97 63 97 97 97 82 64 97 97 97 42 97 9797 97 42 87 72 94 97 97 97 94 53 97 94 97 97 97 97 97 94 9797 97 97 97 97 97 97 97 97 97 97 97 94 97 97 97 97 97 13 9797 97 97 97 97 97 97 97 97 90 97 97 87 97 97 35 26 97 97 9797 97 80 97 97 97 97 97 97 97 97 97 92 97 97 32 97 97 97 9797 97 97 97 97 97 97 97 70 75 97 97 97 65 97 70 95 97 95 9597 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 5545 97 55 97 97 97 97 97 97 97 77 77 77 25 77 77 77 15 77 2070 70 70 35 70 97 85 85 85 85 55 85 90 45 45 97 97 97 97 9797 97 97 97 20 97 97 20 97 97 97 97 97 50 97 97 97 97 97 9797 97 97 97 97 97 97 77 77 97 20 97 97 97 97 77 97 97 97 9797 97 97 97 97 97 97 97 97 97 77 77 77 97 97 77 97 77 97 7797 97 97 20 97 77 77 97 97 97 97 97 97 97 97 97 97 97 50 9720 50 20 20 50 20 97 97 97 97 97 97 97 97 97 97 97 97 97 7777 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 50 9797 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 77 77 9797 97 80 80 97 97 97 97 27 27 27 27 27 27 27 5 27 5 27 2727 27 27 27 27 27 27 5 27 27 27 5 27 27 5 27 27 27 5 2727 27 5 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 2020 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 2020 20 20 10 10 10>Contig710 10 10 10 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 1515 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 1515 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 1515 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 20 20 9292 92 92 92 70 92 92 70 92 92 70 92 37 92 92 92 92 
92 92 1592 92 92 92 92 92 92 92 92 92 92 92 97 42 97 42 20 42 42 4242 97 97 97 97 42 97 97 97 97 97 97 97 97 97 97 97 75 97 9797 97 97 97 97 97 20 42 97 97 97 75 97 97 97 97 97 97 97 2075 42 97 97 97 97 97 97 97 95 40 80 40 95 95 90 90 90 90 9090 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 9090 90 90 90 90 90 90 90 90 90 35 90 90 90 90 90 90 90 90 9090 90 90 90 90 90 90 90 90 35 35 90 90 90 90 90 90 90 90 9090 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 35 90 35 9090 90 90 35 90 90 90 90 90 90 90 90 90 35 90 90 90 90 90 90
85 85 85 85 85 85 85 85 85 85 85 85 80 80 80 80 80 80 80 8060 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 7575 50 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 5075 75 75 50 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 7575 75 75 75 75 75 75 75 75 75 50 50 50 75 50 75 75 75 75 7575 75 75 75 75 75 75 75 75 75 75 75 75 50 75 75 75 75 75 7570 70 70 45 45 70 70 70 70 70 70 70 70 70 70 70 70 70 70 7070 35 35 35 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 3030 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 3030 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 3030 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 3030 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 3030 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 3030 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 3030 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 3030 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 3030 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 3030 30 30 30 30 25 25 25
A.3 Encompass Complete TE
We now aim to extend the TE beyond its coding sequence and to find its instances
in the genome. Using the contigs from the previous step as queries, we perform a
blastn search and then combine and extract the hits. During extraction, we add an
extra 200 bp to both ends of each sequence, which yields instances in the
genome with flanks.
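The combine-and-extract step can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the TESeeker implementation: the function names merge_hits and extract_with_flanks are hypothetical, hit coordinates are assumed to be 1-based and inclusive, and treating "combining" as the merging of overlapping or adjacent hits is one plausible reading of the text.

```python
def merge_hits(hits):
    """Merge overlapping or adjacent (start, end) hits.

    Coordinates are assumed to be 1-based and inclusive, with start <= end.
    """
    merged = []
    for start, end in sorted(hits):
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or touches the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def extract_with_flanks(sequence, start, end, flank=200):
    """Return sequence[start..end] plus up to `flank` bp on each side.

    The flanks are clamped so they never run past the sequence ends.
    """
    left = max(1, start - flank)
    right = min(len(sequence), end + flank)
    return sequence[left - 1:right]
```

For example, extract_with_flanks("ACGTACGTAC", 4, 6, flank=2) returns the hit TAC together with two bases on either side. Clamping at the sequence boundaries is a design choice made here so that hits near the end of a scaffold still yield a valid, if shorter, flank.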
A.4 Generate Consensus
Once we have extracted the instances from the genome with flanks, we perform
a multiple sequence alignment using ClustalW2. Using the alignment, we generate
a consensus sequence, shown below:
>ContigTTTT-AGTA-GTTTAC-TTT-TG----A-A-A-A--AACTTTTG-T-AT-TTATTA-TTGGTTCCTTCTG-----A--TTC-GTACCAAAATTTCA-T--A--TCTTA---AAT-GT-CG-GAGCAAAAGA-A-T--AAAC------GGGAGCCAAAGCGAGC--------ATTTCCTCCACATTTT-CTTTT----TTATTTTTGAAAGGG-TGTTAATG-CTTC-CA-GCCAATAAAAAGTTGTGGG--G----TGTAGG------G-GATGAAGCCT-TAATAGAACGGCA-GTGTCAAAA--CTGGTTTGCGAAATTCCGTTCTGGAGATTTT-CTTTGAAAAATGAG---GAG-----GCTCCGGGCGTC-ATTGGAGGTTAATGATGAGCA------AATAAAGGCCCTCATTGATTATGATCGGCATAGTT-CGAC-AAGGACATTGTAAAGAAGCTAGATGTGTCACATACG-TGCG---TCAAAAACCGTCT--GCGGCGTCTTGGGT-----CCAAAAGAAGCTTGATGC-T-AC-TTGGGGA--GT---------AG-C-AC---TTGGTCTTTGCGATATGCTTC---TTAAACGCAA-------CGAATGACCCTTTTTTGAAA---GAATGGTCACCGG----AGATGAAAA---GTGGGTTGTCT-ATGAT--GACTTTTTGAGA-----AAAAG-ATCCT---GGTCTAGGCAA---G-AAACAGGCACCA-ACAACTTCTAAGGCTG-ACATTCACCAAAAAAA------GGTATTGTTA-TCATTTTGGTGGGATTACAAAGGCATAGTC-------ACTTT-GAGC----TGC-GCCACGAAGTCAGACCATAAATTCAGA--GGTTTACATTCGACAATTGACAAAT----TTAAATGATACCATCC---------------AAGAAAAACGACCG-------------------GAGCTAGCCAATAGCA---AAGGAATT------GTCTT-TCACCAC-ATAATG----CCAGGCCCT--------------------------------CCCC----ATCTTTAGCCACTGG--ACAAA-----AACTACTGGAGCTAGGCTGG-ATGTTTTGCTGCACCCTCC---TATAGTCCCAAACTAGCTCCAAATAATTATCA--------TTTTTTCCGAT--CTAAAAAATTTTTT-AA-GGACAAAAATTC----AAAA-GACAATGAGGTC------ACTGCATTGGA-CAGTTTTTT-GCTCCT--AAAACTAAA-GAGTTTTAT-------GAAAAAA-----G-A-AATGATACTACC--CGAAAAAT------------------A-AA--T-A-TAATA-ATA-------ACATAAT-T---ATA-ATAA----A-TAATTT---ATAATTAATAAAT------TTT-TT-TTTT-T-A-AAA-TTC--A-A-T--A----T-T---------AATA
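To illustrate how a consensus can be derived from a multiple sequence alignment, the following Python sketch applies a simple per-column majority rule. This is an assumption made for illustration: the exact consensus rule used to produce the sequence above is not described here, and the function name majority_consensus and the min_fraction parameter are hypothetical. Columns without a sufficiently common character are written as "-".

```python
from collections import Counter


def majority_consensus(aligned, min_fraction=0.5):
    """Build a consensus from equal-length aligned sequences.

    For each column, emit the most common character if it occurs in more
    than `min_fraction` of the sequences; otherwise emit '-'.
    """
    if not aligned:
        return ""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")

    consensus = []
    for column in zip(*aligned):
        base, count = Counter(column).most_common(1)[0]
        consensus.append(base if count / len(column) > min_fraction else "-")
    return "".join(consensus)
```

With the three aligned rows ACGT, ACGA, and AC-T, every column has a character present in at least two of the three rows, so the sketch returns ACGT.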
A.5 Identify Complete TE
We now use the consensus sequence as the query in a blastn search against the
genome to find its instances. The hits are again combined and extracted, this time
with 50 bp flanks. To be considered, a hit must span at least 90% of the query
length before the flanks are added. We next assemble the hits iteratively in CAP3.
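The 90% length criterion can be expressed as a small Python check. This is a sketch under assumptions rather than the TESeeker code: the function name passes_length_filter is hypothetical, and hit coordinates are assumed to be 1-based and inclusive, with the start greater than the end for reverse-strand hits.

```python
def passes_length_filter(hit_start, hit_end, query_length, min_percent=90):
    """Return True if the hit spans at least `min_percent` of the query.

    The hit length is measured before any flanks are added. Integer
    arithmetic is used to avoid floating-point rounding at the boundary.
    """
    hit_length = abs(hit_end - hit_start) + 1
    return hit_length * 100 >= min_percent * query_length
```

For a query of 1000 bp, a hit covering positions 1 to 900 passes the filter, whereas a hit covering positions 1 to 899 does not.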
A.5.1 CAP3 Assembly
The results from the CAP3 contigs file are shown below.
>Contig1ATTTATTGGGTTGGCCAATAAGTAACTGCGGATTTTACCAACAGATAGTTTGTTTATTTTTTTGAGTACGTTTACGTTTTTGTACAGACATGAACTTTTGATATGTTATTACTTGGTTCCTTCTGTAACATTCGGTACCAAAATTTCATTGAACTCTTAAATAGTACGCGAGCAAAAGACATTTAAACATGGGGAGCCAAAGCGAGCATTTCCTCCACATTTTACTTTTTTATTTTTGAAAGGGTGTTAATGCTTCCCAGGCCAATAAAAAGTTGTGGGCCGTGTAGGGGGATGAAGCCTTAACAGAACGGCAGTGTCAAAACTGGTTTGCGAAATTCCGTTCTGGAGATTTTTCTTTGCAAAATGAGGAGCGCTCCGGGCGCCAATTGGAGGTTAAAGATGAGCAAATAAAGGCCCTCATTGATTATGATCGGCATAGTTCGACGAAGGACATTGTAAAGAAGCTAGATGTGTCACATACGTGCGTCAAAAACCGTCTGCGGCGTCTTGGGTGCCAAAAGAAGCTTGATGCGTTACTTTGGGGAACGTTAGTTAACGAGGCGACTTGGTCTTTGCGATATGCTTCTTAAACGCAATGCAAATGACCCTTTTTTGAAAGAATGGTCACCGGAGATGAAAAGTGGGTTGTCTATGATGACTTTTTGAGAAAAAGATCCTGGTTTAGGCAAGGAAACAGGCACCAACAACTTCTAAGACTGACATTCACCAAAAAAAGGTATTGTTATCATTTTGGTGGGATTACAAAGGCATAGTCAACTTTGAGCTGCTGCCACGAAGTCAGACCATAAATTCAGAGGTTTACATTCGACAATTGACAAATTTAAATGATACCATCCAAGAAAAACGACCGGAACTAGCCAATAGCAAAGGAATTGTCTTTCACCACCATAATGCCAGGCCCTCCCCATCTTTAGCCACTGGACAAAAACTACTGGAGCTAGGCTGGAATGTTTTGCTGCACCCTCCATATAGTCCCAAACTAGCTCCAAATAATTATCATTTTTTCCGATTCCTAAAAAATTTTTTAAACGGACAAAAATTCCAAAACGACAATGAGGTCAAAACTGCATTGGACCAGTTTTTTGCTCCTAAAACTAAAGAGTTTTATGAAAAAAGGAAAATGATACTACCCGAAAAATGTCAAAAGGTCACTAATAATAATGGACATAATATAATAGATAAAAATAATTTTGCATAATTAATAAATCGTTTTTTGTTTTCTTAAAAAATTCGTAAATATCTTTTTGCCAACCCAATA
A.5.2 CAP3 Contigs Quality File
The quality file for the contigs in the previous subsection is shown below.
>Contig135 25 5 72 72 72 72 72 72 72 72 72 72 72 72 37 72 57 72 7272 72 72 72 72 72 72 72 72 72 57 72 72 57 72 72 72 72 65 7257 65 50 57 72 56 72 72 72 72 72 72 72 20 5 42 57 72 72 7272 57 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 4772 72 72 72 72 72 32 72 72 72 72 72 72 72 72 72 72 72 72 7220 72 72 72 65 72 72 72 72 72 72 72 72 72 72 72 72 72 72 5072 72 57 72 72 72 72 72 72 72 72 72 65 37 72 72 72 72 72 7262 72 72 72 72 72 72 72 72 72 35 72 72 20 72 72 47 72 72 7272 72 72 72 72 10 72 72 72 72 72 72 72 72 72 57 72 72 72 1072 21 65 35 72 72 57 72 72 72 72 72 72 72 72 72 72 72 72 072 72 72 72 72 72 72 72 72 57 72 72 72 72 72 50 72 72 72 7272 72 72 20 72 72 72 72 72 72 72 72 72 72 72 37 72 72 72 7272 56 47 72 72 37 72 65 72 72 72 65 72 72 72 72 20 72 72 2072 35 35 35 72 72 72 72 72 72 57 72 72 72 72 72 72 57 72 041 72 72 57 72 72 65 72 72 32 72 72 72 72 65 72 30 72 57 7265 65 65 15 65 72 72 72 47 72 65 72 72 72 57 72 72 72 72 7272 65 72 72 65 72 72 72 65 72 72 72 72 72 72 72 72 72 72 7272 72 72 72 57 72 72 72 72 72 72 72 72 31 57 72 72 72 57 2172 72 72 72 72 72 72 72 72 72 72 65 72 72 72 72 72 72 72 6572 72 37 65 37 72 72 72 72 72 72 72 72 72 72 72 72 25 72 7272 72 72 72 72 72 72 72 72 72 72 57 72 72 72 72 72 72 72 7272 72 72 72 72 72 65 72 72 72 72 72 72 72 20 72 72 65 56 7272 72 72 57 72 20 72 72 72 57 72 72 57 72 72 57 56 72 72 7272 72 72 65 72 72 72 72 72 72 72 57 72 72 72 72 57 57 72 7272 72 72 72 72 72 72 72 72 72 72 72 62 72 20 65 72 72 72 5672 47 57 72 72 72 72 72 72 72 47 65 72 57 72 47 72 72 72 7211 72 72 65 72 72 72 57 72 72 72 72 57 72 72 72 72 72 72 7272 57 72 72 72 65 72 65 72 65 72 72 65 65 72 72 72 65 65 6565 65 50 65 65 72 72 72 72 72 47 72 72 72 72 47 72 72 72 7272 72 72 72 72 50 72 72 72 72 72 72 72 57 72 57 72 72 72 072 72 72 72 37 72 72 72 72 72 72 72 72 72 72 72 72 72 72 7265 72 65 72 57 72 65 65 72 47 40 72 57 72 72 72 72 72 72 7272 72 50 65 65 72 72 72 72 47 72 72 72 72 72 72 72 72 72 7272 72 72 72 72 72 72 47 72 72 72 72 57 72 72 65 
72 57 57 5772 5 72 72 57 72 72 72 72 72 72 72 57 57 42 57 72 57 57 5772 47 57 72 57 72 72 72 72 72 72 72 57 57 72 27 27 72 72 7272 72 72 47 47 72 47 72 72 72 72 72 57 72 50 72 35 72 57 5772 72 72 47 72 72 57 72 57 72 72 72 72 65 72 72 57 72 57 72
72 72 72 57 72 72 72 65 72 72 72 72 72 65 72 25 72 72 72 7272 72 72 57 72 72 72 72 72 72 72 65 72 65 72 72 0 31 72 7272 72 72 37 65 72 72 72 72 72 72 72 72 72 37 72 72 57 72 7272 72 72 72 72 72 72 57 72 72 72 72 72 57 72 72 72 20 72 7272 72 72 72 72 20 72 72 65 72 72 72 72 72 72 72 72 72 72 5772 72 72 72 72 72 72 72 37 57 72 57 72 17 72 72 72 72 72 6572 72 72 72 72 72 65 72 72 72 57 72 72 72 72 72 72 72 72 7272 72 72 35 35 72 57 37 72 72 72 72 72 72 72 72 72 72 72 7257 72 72 72 5 72 20 72 10 72 72 72 65 72 72 72 72 72 72 7272 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 7272 72 72 57 72 72 72 20 72 57 72 72 72 72 72 72 72 72 72 7272 72 72 72 72 47 62 72 72 72 72 72 72 72 72 72 65 72 72 7257 72 72 57 72 72 72 57 72 72 72 72 72 72 72 72 72 72 65 7272 72 72 72 72 72 72 65 57 72 72 10 72 72 72 72 72 72 72 7272 72 72 72 72 72 72 25 72 72 65 65 72 72 47 72 72 72 72 7272 72 72 72 72 72 72 72 72 72 65 72 37 72 72 57 72 72 72 7272 72 72 62 72 72 65 72 72 72 72 72 72 72 72 0 72 72 72 7272 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 72 7272 72 57 65 72 20 72 72 72 57 72 72 72 65 72 65 72 72 72 7272 72 72 72 72 72 72 72 57 72 72 72 72 37 72 72 72 72 72 7257 57 72 72 72 72 72 72 72 72 72 72 40 30 57 72 72 72 72 7272 72 65 20 20 72 72 72 72 52 67 60 67 67 67 52 67 52 42 6752 67 67 67 67 67 67 67 67 67 67 67 5 5 60 67 52 67 67 6767 67 57 67 67 42 67 67 60 67 15 67 67 67 67 67 67 67 67 6767 67 67 67 67 67 67 67 67 67 67 67 67 60 67 67 67 5 67 6720 42 67 42 67 67 67 67 67 67 67 67 60 67 67 67 67 67 67
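Quality files such as the one listed above follow a FASTA-like layout: a header line beginning with ">" followed by whitespace-separated integer scores, one per base. A minimal Python sketch for reading such a file is given below; the function name parse_quality is hypothetical, and the sketch assumes each header occupies its own line.

```python
def parse_quality(text):
    """Parse FASTA-style quality text into {contig_name: [scores]}.

    Each record starts with a '>' header line; the following lines hold
    whitespace-separated integer quality scores.
    """
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the contig name.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].extend(int(token) for token in line.split())
    return records
```

Once parsed, the scores of a contig can be inspected directly, for instance by taking min(records["Contig1"]) to find its lowest-quality position.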
A.5.3 Trimmed CAP3 Contigs
Here we list the final contig, which represents the consensus TE. In this case,
nothing was trimmed from the contig of the previous step, as the quality scores
were all above the threshold.
>Contig1-0 0 1280 fATTTATTGGGTTGGCCAATAAGTAACTGCGGATTTTACCAACAGATAGTTTGTTTATTTTTTTGAGTACGTTTACGTTTTTGTACAGACATGAACTTTTGATATGTTATTACTTGGTTCCTTCTGTAACATTCGGTACCAAAATTTCATTGAACTCTTAAATAGTACGCGAGCAAAAGACATTTAAACATGGGGAGCCAAAGCGAGCATTTCCTCCACATTTTACTTTTTTATTTTTGAAAGGGTGTTAATGCTTCCCAGGCCAATAAAAAGTTGTGGGCCGTGTAGGGGGATGAAGCCTTAACAGAACGGCAGTGTCAAAACTGGTTTGCGAAATTCCGTTCTGGAGATTTTTCTTTGCAAAATGAGGAGCGCTCCGGGCGCCAATTGGAGGTTAAAGATGAGCAAATAAAGGCCCTCATTGATTATGATCGGCATAGTTCGACGAAGGACATTGTAAAGAAGCTAGATGTGTCACATACGTGCGTCAAAAACCGTCTGCGGCGTCTTGGGTGCCAAAAGAAGCTTGATGCGTTACTTTGGGGAACGTTAGTTAACGAGGCGACTTGGTCTTTGCGATATGCTTCTTAAACGCAATGCAAATGACCCTTTTTTGAAAGAATGGTCACCGGAGATGAAAAGTGGGTTGTCTATGATGACTTTTTGAGAAAAAGATCCTGGTTTAGGCAAGGAAACAGGCACCAACAACTTCTAAGACTGACATTCACCAAAAAAAGGTATTGTTATCATTTTGGTGGGATTACAAAGGCATAGTCAACTTTGAGCTGCTGCCACGAAGTCAGACCATAAATTCAGAGGTTTACATTCGACAATTGACAAATTTAAATGATACCATCCAAGAAAAACGACCGGAACTAGCCAATAGCAAAGGAATTGTCTTTCACCACCATAATGCCAGGCCCTCCCCATCTTTAGCCACTGGACAAAAACTACTGGAGCTAGGCTGGAATGTTTTGCTGCACCCTCCATATAGTCCCAAACTAGCTCCAAATAATTATCATTTTTTCCGATTCCTAAAAAATTTTTTAAACGGACAAAAATTCCAAAACGACAATGAGGTCAAAACTGCATTGGACCAGTTTTTTGCTCCTAAAACTAAAGAGTTTTATGAAAAAAGGAAAATGATACTACCCGAAAAATGTCAAAAGGTCACTAATAATAATGGACATAATATAATAGATAAAAATAATTTTGCATAATTAATAAATCGTTTTTTGTTTTCTTAAAAAATTCGTAAATATCTTTTTGCCAACCCAATA
APPENDIX B
TESeeker WEBSITE
The TESeeker website is located at http://www.nd.edu/~teseeker and includes
the virtual appliance, representative TE library, and documentation. Figure B.1
shows a screen capture of the home page.
Figure B.1. TESeeker Website. The website allows researchers to download the
TESeeker virtual appliance, as well as view the documentation. We also provide
the library of representative TEs for download.
APPENDIX C
TESeeker USER MANUAL
TESeeker is available as a VirtualBox [144] virtual appliance in the Open
Virtualization Format (OVF). TESeeker requires at least 5 GB of free hard disk
space and at least 1.5 GB of RAM on the host machine. TESeeker can dynamically
allocate up to 40 GB of hard disk space for use in the virtual appliance. TESeeker
is licensed under the GNU General Public License (GPL) v3 [55].
C.1 Installation
TESeeker can run on any operating system that supports the VirtualBox
virtualization software package, currently available for Windows, OS X, Linux, and
Solaris.
To install TESeeker, follow these steps:
1. Download and install VirtualBox from http://www.virtualbox.org.
2. Download the TESeeker virtual appliance files (2) from http://www.nd.
edu/~teseeker.
3. Open VirtualBox.
4. Click File then Import Appliance... and complete the wizard, selecting theTESeeker .ovf file as the source. Be sure both downloaded TESeeker filesare in the same directory.
C.2 Usage
After installation, start TESeeker by opening VirtualBox, clicking teseeker in
the left frame, and then clicking Start. The virtual appliance hosting TESeeker
will then boot.¹ As shown in Figures C.1-C.7, the booted appliance contains
seven desktop items: the Genomes and TELibrary folders, shortcuts to the
documentation and web interfaces, and the license. The TESeeker interface is
shown in Figure C.5. Hovering the mouse over a parameter name provides
a more detailed description. All genome and library files must be placed in the
corresponding folders on the desktop and must be in the FASTA file format with
a .fa, .fas, or .fasta file extension. We have included the Pediculus humanus
humanus genome and our representative TE library within the virtual appliance.
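The input requirements above can be checked before a file is copied into the Genomes or TELibrary folder. The following Python sketch is an illustration only and is not part of TESeeker; the function names are hypothetical, and the test for a leading ">" is a common FASTA sanity check that we assume here rather than a rule stated by TESeeker.

```python
from pathlib import Path

# Extensions accepted by TESeeker according to the text above.
ALLOWED_EXTENSIONS = {".fa", ".fas", ".fasta"}


def has_fasta_extension(filename: str) -> bool:
    """Return True if the file name ends in .fa, .fas, or .fasta."""
    return Path(filename).suffix in ALLOWED_EXTENSIONS


def looks_like_fasta(text: str) -> bool:
    """Return True if the first non-blank line is a FASTA header line.

    Assumption: a FASTA file starts with a '>' header line.
    """
    for line in text.splitlines():
        if line.strip():
            return line.startswith(">")
    return False
```

A file would be considered ready for the desktop folders only if both checks pass.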
Clicking the TESeeker shortcut on the desktop will load the web interface.
Here, researchers can modify the default parameters, most notably the BLAST
Query Library, BLAST Database, and the Desktop Output Folder Name. Hovering
over a parameter name displays a detailed tooltip description. Once the
parameters have been set, clicking submit will briefly show the selected parameters
and then start the search. The browser will display Job X is Running, where X
represents the job ID number, and will refresh the page until the job completes,
at which point the page notifies the user. When the job has finished,
researchers can navigate to the specified output folder on the desktop to view the results.
If the researcher elects to find only the coding region, results are organized as
follows within the specified output folder: the codingRegion_files folder contains
intermediary output, the output folder contains all the singlets and contigs
produced, and the remaining files represent the contigs and singlets produced from
CAP3. For example, a file called cap2c_out.fas contains the contig sequences from
the second iteration of CAP3, while cap1s_out.fas contains the singlet sequences
produced from the first iteration of CAP3.
¹ Some Linux distributions automatically enable the KVM kernel extension. If this is the case, disable it with the following command: sudo modprobe -r kvm_intel. To restore the KVM kernel extension, run sudo modprobe kvm_intel.
Figure C.1. TESeeker Desktop. This figure shows the desktop.
Figure C.2. TESeeker Genomes Folder. Researchers can place FASTA genome data in this folder.
Figure C.3. TESeeker TELibrary. This figure shows the folder for the representative TEs. Researchers can also place FASTA sequence data in this folder.
Figure C.4. TESeeker Documentation. Here, we show a screen capture of the HTML TESeeker Documentation.
Figure C.5. TESeeker Web Interface. This figure shows the TESeeker web interface. Researchers can alter the default parameters as desired. Library and genome files in the desktop folders are selectable through drop-down menus.
Figure C.6. TESeeker BLAST Interface. Here, we show the BLAST interface. The BLAST Database drop-down menu is populated via the genomes available in the Genomes folder.
Figure C.7. TESeeker Extract Interface. This figure shows the TESeeker extract interface. Researchers can extract specified sequence data from any genome in the Genomes folder.
If a consensus sequence is desired, the results are organized as follows within the
specified output folder: the codingRegion_files folder contains intermediary output
from the coding region search, the consen_files folder contains intermediary files
from the consensus search, and the output folder contains the contig and singlet
sequences produced from each sequence that was fed into the consensus search.
Additionally, all contig and singlet sequences are available in single FASTA files
in the specified output folder.
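The CAP3 file names mentioned in this section encode the iteration number and the sequence type. The Python sketch below is a hedged illustration of how such names could be decoded; it assumes the names are written with underscores (cap2c_out.fas, cap1s_out.fas) and that the pattern generalizes from those two examples, which the text does not state explicitly. The function name is hypothetical.

```python
import re

# Pattern assumed from the two examples in the text: "cap", an iteration
# number, "c" (contigs) or "s" (singlets), then "_out.fas".
CAP3_NAME = re.compile(r"^cap(\d+)([cs])_out\.fas$")


def describe_cap3_file(name):
    """Return (iteration, kind) for a CAP3 output file name, or None."""
    match = CAP3_NAME.match(name)
    if match is None:
        return None
    kind = "contigs" if match.group(2) == "c" else "singlets"
    return int(match.group(1)), kind
```

Names that do not follow the assumed pattern, such as the combined consensus files, yield None.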
C.3 Example Search
TESeeker is distributed with the Pediculus humanus humanus genome as well
as our library of representative TEs. We next describe how one could obtain
a high-quality consensus element for the Pediculus humanus humanus mariner
element, once the virtual appliance has been loaded.
1. Launch TESeeker. Double-click the TESeeker shortcut on the desktop.
2. Confirm Parameters. Ensure mariner_ac.fa is selected for the BLAST Query Library and that the phumanus.SUPERCONTIGS-USDA.PhumUA.fa genome is selected for the BLAST database. Also click Find Consensus? to enable a consensus search. The screen should now look as shown in Figure C.8. Once the search has been submitted, the status for TESeeker will be continuously updated through the web interface until the job completes.
3. Inspect Results. When the job is finished, click the link to the specified output folder, louseOut, and inspect the results. The web view of this folder is shown in Figure C.9. As mentioned in the previous section, the main consensus results will be in up to three FASTA files: consensus_contigs.fas, consensus_iter1_singlets.fas, and consensus_singlets.fas. The best hits are generally in the consensus_contigs.fas file, while the least likely candidates are generally in the consensus_iter1_singlets.fas file. In this case, the first contig in consensus_contigs.fas, Contig1-0 6 1309 f, contains a sequence 99% identical to the manually annotated element, differing mainly in its roughly 10 extra nucleotides on both ends. Figure C.10 shows the ends of the aligned sequences.
Figure C.8. TESeeker Default Parameters. This figure shows the TESeeker web interface with the default parameters set for a search for the mariner transposon in P. humanus humanus.
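The header of the top hit, Contig1-0 6 1309 f, follows the layout printed by the extraction script in Section D.2: a sequence name, a start coordinate, an end coordinate, and a strand flag (f or r). As a minimal sketch, and assuming every header has exactly these four whitespace-separated fields, such a header could be decoded in Python as follows. The class and function names are hypothetical.

```python
from typing import NamedTuple


class HitHeader(NamedTuple):
    name: str    # e.g. "Contig1-0"
    start: int   # start coordinate as printed in the header
    end: int     # end coordinate as printed in the header
    strand: str  # "f" (forward) or "r" (reverse)


def parse_header(header: str) -> HitHeader:
    """Split a header such as '>Contig1-0 6 1309 f' into its four fields."""
    fields = header.lstrip(">").split()
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}: {header!r}")
    name, start, end, strand = fields
    return HitHeader(name, int(start), int(end), strand)
```

For example, parse_header(">Contig1-0 6 1309 f") gives start 6 and end 1309 on the forward strand.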
C.4 Additional Tools
There are also BLAST and Extract shortcuts on the desktop. These web
interfaces offer additional functionality by making it simpler to do a custom BLAST
search or sequence extraction using the files in the Genomes folder.
Figure C.9. Web Interface File Browser. The figure above shows the contents of the main output folder, louseOut. FASTA sequences are in
the .fas files, shown here as consensus_contigs.fas, consensus_iter1_singlets.fas, and consensus_singlets.fas.
(a) 5’ End
(b) 3’ End
Figure C.10. ClustalX Alignment with Annotated Element. Panels (a) and (b) show the 5’ and 3’ ends of the annotated mariner (mariner) and the top consensus sequence produced by TESeeker (Contig1-0)
when run with the default parameters. The sequences are 99% identical. The extra sequence on both ends of Contig1-0 can be reduced
with stricter parameters.
C.5 Technology
TESeeker utilizes a variety of technologies. The core bioinformatics tools,
BLAST, CAP3, ClustalW2, and BioPerl, were mentioned previously in Section 3.2.3
and are united through bash scripts. Researchers interact with TESeeker through
a web-based form implemented in HTML/PHP and served by the lighttpd web
server. The form interacts with the local scripts and utilizes a PostgreSQL
database and CGI/Perl to notify researchers when a job has completed. TESeeker
is installed on Ubuntu 10.04 LTS. The administrative password for user teseeker
is teseeker.
APPENDIX D
SELECTED AUTOMATED APPROACH SOURCE CODE
This appendix presents selected BioPerl scripts written for our automated
approach.
D.1 Combine BLAST Hits
######################################################################
### This is part of TESeeker, licensed under the GNU General Public
### License (GPL) v3.
###
### Authors: Ryan Kennedy and Scott Christley
###
### Arguments:
### ARGV[0]: closeness in order for segments to be joined
### ARGV[1]: minimum length a segment must be to be joined
### ARGV[2]: closeness segments must be in the query to be joined
### ARGV[3]: maximum length percent of query of combined sequence
###          if joined because of ARGV[2]
### ARGV[4]: length combined sequence must be relative to the query
### ARGV[5]: length of flanks to add to each side
### ARGV[6]: BLAST file to process
###
######################################################################

use Bio::SeqIO;
use Bio::SearchIO;
use List::Util qw[min max];

$closeLen = $ARGV[0];
$minSegLen = $ARGV[1];
$querySepDist = $ARGV[2];
$queryMaxPerc = $ARGV[3];
$minLenPerc = $ARGV[4];
$flankLen = $ARGV[5];
$blast = $ARGV[6];

$bres = new Bio::SearchIO(-format => 'blast', -file => $blast );

%TEfor=();
$forNum=1;
%TErev=();
$revNum=1;

while(my $result = $bres->next_result) {
  $minLen=$minLenPerc * $result->query_length;
  $queryMax=$queryMaxPerc * $result->query_length;
  while(my $hit = $result->next_hit) {
    while(my $hsp = $hit->next_hsp) {
      $orientation = $hsp->strand('hit');
      if($orientation>0) { #forward strand
        $hitStart=$hsp->start('hit');
        $hitEnd=$hsp->end('hit');
        $QStart=$hsp->start('query');
        $QEnd=$hsp->end('query');
        $overlap=0;
        $QJoin=0;

        if(abs($hitStart-$hitEnd)>=$minSegLen) {
          #only combine segments of a specified length
          $TEfor{$hit->name}{$forNum}{"start"} = $hsp->start('hit');
          $TEfor{$hit->name}{$forNum}{"end"} = $hsp->end('hit');
          $TEfor{$hit->name}{$forNum}{"qStart"} = $hsp->start('query');
          $TEfor{$hit->name}{$forNum}{"qEnd"} = $hsp->end('query');
          $TEfor{$hit->name}{$forNum}{"minLen"} = $minLen;
          $TEfor{$hit->name}{$forNum}{"query"} =
            $result->query_accession . $result->query_length;
          ++$forNum;
        } #end minSegLen check
      } else { #reverse strand
        $hitStart=$hsp->start('hit');
        $hitEnd=$hsp->end('hit');
        $QStart=$hsp->start('query');
        $QEnd=$hsp->end('query');
        $overlap=0;
        $QJoin=0;

        if(abs($hitStart-$hitEnd)>=$minSegLen) {
          #only combine segments of a specified length
          $TErev{$hit->name}{$revNum}{"start"} = $hsp->start('hit');
          $TErev{$hit->name}{$revNum}{"end"} = $hsp->end('hit');
          $TErev{$hit->name}{$revNum}{"qStart"} = $hsp->start('query');
          $TErev{$hit->name}{$revNum}{"qEnd"} = $hsp->end('query');
          $TErev{$hit->name}{$revNum}{"minLen"} = $minLen;
          $TErev{$hit->name}{$revNum}{"query"} =
            $result->query_accession . $result->query_length;
          ++$revNum;
        } #end minSegLen check
      } #end orientation if
    } #end hsp
  } #end hit
} #end result
$bres->close();
#JOIN
for $scaffold ( sort keys %TEfor ) {
  $i=0;
  $#Forscaf= -1;
  $#teForStart=-1;
  $#teForEnd=-1;
  $#teForQStart=-1;
  $#teForQEnd=-1;
  $#teForMinLen=-1;

  # go through each scaffold and collect into arrays
  for $TE ( keys %{ $TEfor{$scaffold} }) {
    $Forscaf[$i]=$scaffold;
    $teForStart[$i]=$TEfor{$scaffold}{$TE}{"start"};
    $teForEnd[$i]=$TEfor{$scaffold}{$TE}{"end"};
    $teForQStart[$i]=$TEfor{$scaffold}{$TE}{"qStart"};
    $teForQEnd[$i]=$TEfor{$scaffold}{$TE}{"qEnd"};
    $teForMinLen[$i]=$TEfor{$scaffold}{$TE}{"minLen"};
    $teForQuery[$i]=$TEfor{$scaffold}{$TE}{"query"};
    $i++;
  }

  # sort and get indexes
  @list_order = sort { $teForStart[$a] cmp $teForStart[$b] } 0 .. $#teForStart;

  # push sorted stuff onto our final array
  for my $i ( 0 .. $#list_order ) {
    push @joinScaffold, $Forscaf[$list_order[$i]];
    push @joinStart, $teForStart[$list_order[$i]];
    push @joinEnd, $teForEnd[$list_order[$i]];
    push @joinQStart, $teForQStart[$list_order[$i]];
    push @joinQEnd, $teForQEnd[$list_order[$i]];
    push @joinMinLen, $teForMinLen[$list_order[$i]];
    push @joinQuery, $teForQuery[$list_order[$i]];
  }
}

@Forscaf=@joinScaffold;
@teForStart=@joinStart;
@teForEnd=@joinEnd;
@teForMinLen=@joinMinLen;
@teForQuery=@joinQuery;

$joinDist=$closeLen;
for($i=0;$i<=(@Forscaf+0);$i++) {
  if($Forscaf[$i+1]) {
    if(abs($teForStart[$i+1]-$teForEnd[$i])<=$joinDist &&
       $Forscaf[$i] eq $Forscaf[$i+1] && $teForQuery[$i] eq $teForQuery[$i+1])
    {
      $teForEnd[$i]=max($teForEnd[$i],$teForEnd[$i+1]);
      splice @Forscaf, $i+1, 1;
      splice @teForStart, $i+1, 1;
      splice @teForEnd, $i+1, 1;
      splice @teForMinLen, $i+1, 1;
      splice @teForQuery, $i+1, 1;
      $i-=1; #unless $i==0;
    }
    elsif(abs($teForStart[$i]-$teForEnd[$i+1])<=$joinDist &&
          $Forscaf[$i] eq $Forscaf[$i+1] && $teForQuery[$i] eq $teForQuery[$i+1])
    {
      $teForEnd[$i]=max($teForEnd[$i],$teForEnd[$i+1]);
      splice @Forscaf, $i+1, 1;
      splice @teForStart, $i+1, 1;
      splice @teForEnd, $i+1, 1;
      splice @teForMinLen, $i+1, 1;
      splice @teForQuery, $i+1, 1;
      $i-=1; #unless $i==0;
    }
    elsif(abs($teForStart[$i]-$teForStart[$i+1])<=$joinDist &&
          $Forscaf[$i] eq $Forscaf[$i+1] && $teForQuery[$i] eq $teForQuery[$i+1])
    {
      $teForEnd[$i]=max($teForEnd[$i],$teForEnd[$i+1]);
      splice @Forscaf, $i+1, 1;
      splice @teForStart, $i+1, 1;
      splice @teForEnd, $i+1, 1;
      splice @teForMinLen, $i+1, 1;
      splice @teForQuery, $i+1, 1;
      $i-=1; #unless $i==0;
    }
    elsif(abs($teForEnd[$i]-$teForEnd[$i+1])<=$joinDist &&
          $Forscaf[$i] eq $Forscaf[$i+1] && $teForQuery[$i] eq $teForQuery[$i+1])
    {
      $teForEnd[$i]=max($teForEnd[$i],$teForEnd[$i+1]);
      splice @Forscaf, $i+1, 1;
      splice @teForStart, $i+1, 1;
      splice @teForEnd, $i+1, 1;
      splice @teForMinLen, $i+1, 1;
      splice @teForQuery, $i+1, 1;
      $i-=1; #unless $i==0;
    }
  }
}

$#joinScaffold=-1;
$#joinStart=-1;
$#joinEnd=-1;
$#joinMinLen=-1;
$#joinQuery=-1;
$#list_order=-1;
#REVERSE JOIN
for $scaffold ( sort keys %TErev ) {
  $i=0;
  $#Revscaf= -1;
  $#teRevStart=-1;
  $#teRevEnd=-1;
  $#teRevQStart=-1;
  $#teRevQEnd=-1;
  $#teRevMinLen=-1;

  # go through each scaffold and collect into arrays
  for $TE ( keys %{ $TErev{$scaffold} }) {
    $Revscaf[$i]=$scaffold;
    $teRevStart[$i]=$TErev{$scaffold}{$TE}{"start"};
    $teRevEnd[$i]=$TErev{$scaffold}{$TE}{"end"};
    $teRevQStart[$i]=$TErev{$scaffold}{$TE}{"qStart"};
    $teRevQEnd[$i]=$TErev{$scaffold}{$TE}{"qEnd"};
    $teRevMinLen[$i]=$TErev{$scaffold}{$TE}{"minLen"};
    $teRevQuery[$i]=$TErev{$scaffold}{$TE}{"query"};
    $i++;
  }

  # sort and get indexes
  @list_order = sort { $teRevStart[$a] cmp $teRevStart[$b] } 0 .. $#teRevStart;

  # push sorted stuff onto our final array
  for my $i ( 0 .. $#list_order ) {
    push @joinScaffold, $Revscaf[$list_order[$i]];
    push @joinStart, $teRevStart[$list_order[$i]];
    push @joinEnd, $teRevEnd[$list_order[$i]];
    push @joinQStart, $teRevQStart[$list_order[$i]];
    push @joinQEnd, $teRevQEnd[$list_order[$i]];
    push @joinMinLen, $teRevMinLen[$list_order[$i]];
    push @joinQuery, $teRevQuery[$list_order[$i]];
  }
}

@Revscaf=@joinScaffold;
@teRevStart=@joinStart;
@teRevEnd=@joinEnd;
@teRevMinLen=@joinMinLen;
@teRevQuery=@joinQuery;

$joinDist=$closeLen;
for($i=0;$i<=(@Revscaf+0);$i++) {
  if($Revscaf[$i+1]) {
    if(abs($teRevStart[$i+1]-$teRevEnd[$i])<=$joinDist &&
       $Revscaf[$i] eq $Revscaf[$i+1] && $teRevQuery[$i] eq $teRevQuery[$i+1])
    {
      $teRevEnd[$i]=max($teRevEnd[$i],$teRevEnd[$i+1]);
      splice @Revscaf, $i+1, 1;
      splice @teRevStart, $i+1, 1;
      splice @teRevEnd, $i+1, 1;
      splice @teRevMinLen, $i+1, 1;
      splice @teRevQuery, $i+1, 1;
      $i-=1; #unless $i==0;
    }
    elsif(abs($teRevStart[$i]-$teRevEnd[$i+1])<=$joinDist &&
          $Revscaf[$i] eq $Revscaf[$i+1] && $teRevQuery[$i] eq $teRevQuery[$i+1])
    {
      $teRevEnd[$i]=max($teRevEnd[$i],$teRevEnd[$i+1]);
      splice @Revscaf, $i+1, 1;
      splice @teRevStart, $i+1, 1;
      splice @teRevEnd, $i+1, 1;
      splice @teRevMinLen, $i+1, 1;
      splice @teRevQuery, $i+1, 1;
      $i-=1; #unless $i==0;
    }
    elsif(abs($teRevStart[$i]-$teRevStart[$i+1])<=$joinDist &&
          $Revscaf[$i] eq $Revscaf[$i+1] && $teRevQuery[$i] eq $teRevQuery[$i+1])
    {
      $teRevEnd[$i]=max($teRevEnd[$i],$teRevEnd[$i+1]);
      splice @Revscaf, $i+1, 1;
      splice @teRevStart, $i+1, 1;
      splice @teRevEnd, $i+1, 1;
      splice @teRevMinLen, $i+1, 1;
      splice @teRevQuery, $i+1, 1;
      $i-=1; #unless $i==0;
    }
    elsif(abs($teRevEnd[$i]-$teRevEnd[$i+1])<=$joinDist &&
          $Revscaf[$i] eq $Revscaf[$i+1] && $teRevQuery[$i] eq $teRevQuery[$i+1])
    {
      $teRevEnd[$i]=max($teRevEnd[$i],$teRevEnd[$i+1]);
      splice @Revscaf, $i+1, 1;
      splice @teRevStart, $i+1, 1;
      splice @teRevEnd, $i+1, 1;
      splice @teRevMinLen, $i+1, 1;
      splice @teRevQuery, $i+1, 1;
      $i-=1; #unless $i==0;
    }
  }
}
#print
$ct=0;
for($i=0;$i<=(@Forscaf+0);$i++) {
  if(($Forscaf[$i] ne "") && (abs($teForStart[$i]-$teForEnd[$i]) >=
      $teForMinLen[$i])) {
    print "perl get_fasta2.pl " . $Forscaf[$i] . " $ct ";
    print $teForStart[$i] . " " . $teForEnd[$i] . " " . $flankLen .
      " f $teForStart[$i] $teForEnd[$i]\n";
  }
  $ct++;
}
for($i=0;$i<=(@Revscaf+0);$i++) {
  if(($Revscaf[$i] ne "") && (abs($teRevStart[$i]-$teRevEnd[$i]) >=
      $teRevMinLen[$i])) {
    print "perl get_fasta2.pl " . $Revscaf[$i] . " $ct ";
    print $teRevStart[$i] . " " . $teRevEnd[$i] . " " . $flankLen .
      " r $teRevStart[$i] $teRevEnd[$i]\n";
  }
  $ct++;
}
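The joining step in the script above can be pictured as interval merging. The following Python sketch is a simplified illustration, not a port of the script: it merges hits on the same scaffold when the gap between neighboring hits is at most a join distance, and it omits the per-query and per-strand bookkeeping that the Perl code tracks. The function name merge_hits and the (scaffold, start, end) tuple layout are assumptions made for the example.

```python
def merge_hits(hits, join_dist):
    """Merge nearby hits on the same scaffold.

    hits      -- iterable of (scaffold, start, end) tuples, start <= end
    join_dist -- maximum gap between two hits for them to be joined
    Returns a list of merged (scaffold, start, end) tuples.
    """
    merged = []
    # Sorting the tuples orders hits by scaffold, then numerically by start.
    for scaffold, start, end in sorted(hits):
        if merged:
            prev_scaffold, prev_start, prev_end = merged[-1]
            # Overlapping hits give a negative gap and are merged as well.
            if prev_scaffold == scaffold and start - prev_end <= join_dist:
                merged[-1] = (scaffold, prev_start, max(prev_end, end))
                continue
        merged.append((scaffold, start, end))
    return merged
```

In this simplified form, two hits 50 bp apart on the same scaffold become one region when join_dist is 100, while hits on different scaffolds are never joined.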
D.2 Extract Sequences
######################################################################
### This is part of TESeeker, licensed under the GNU General Public
### License (GPL) v3.
###
### Author: Ryan Kennedy
###
### Arguments:
### ARGV[0]: BLAST database (genome)
### ARGV[1]: input file that contains get_fasta queries
###
######################################################################

use Bio::SeqIO;
use Bio::SearchIO;

$GENOMEDB = $ARGV[0];
$in = Bio::SeqIO->new(-file => $GENOMEDB, -format => 'Fasta');
$ii=0;
$eeend=0;
%cpipiens=();
while($seq=$in->next_seq()) {
  $cpipiens{$seq->id()}=$seq;
  #print "scaff: " . $seq->id() . "\n";
  $ii++;
}

# search for a specific scaffold id
$filename=$ARGV[1];
$strand="";
$seq="";
open(INFILE,$filename);
while(<INFILE>) {
  ($p,$g,$scaffold,$scaffoldNum,$pos,$end,$flank,$strand)=split();
  $seq=$cpipiens{$scaffold};

  if ($strand eq "r") {
    $revseq = $seq->revcom();
    $olen = $end - $pos+2*$flank;
    $start = $seq->length() - $pos - $olen + $flank;
    if($start<0) { $start=0; }
    if($olen>$seq->length()) { $olen=$seq->length(); }
    $seqstr = substr($revseq->seq(), $start, $olen+1);
    $start=$pos-$flank;
    $eeend=$start+$olen;
    print ">" . $scaffold . "-" . "$scaffoldNum $start $eeend $strand \n";
    print $seqstr . "\n";
  } elsif ($strand eq "f") {
    $olen = $end - $pos + 2*$flank;
    $start = $pos-$flank-1;
    if($start<0){ $start=0; }
    if($olen>$seq->length()) { $olen=$seq->length(); }
    $seqstr = substr($seq->seq(), $start, $olen+1);
    # seq() returns the raw sequence string from $seq, which also
    # holds additional information about the sequence
    $eeend=$olen+$start+1;
    $start=$pos-$flank;
    print ">" . $scaffold . "-" . "$scaffoldNum $start $eeend $strand \n";
    print $seqstr . "\n";
  }
}
D.3 Trim CAP3 Contigs
######################################################################
### This is part of TESeeker, licensed under the GNU General Public
### License (GPL) v3.
###
### Author: Ryan Kennedy
###
### Arguments:
### ARGV[0]: quality score file
### ARGV[1]: minimum distance to perform joins
### ARGV[2]: sliding window size
### ARGV[3]: CAP3 quality baseline
### ARGV[4]: CAP3 threshold quality multiplier
### ARGV[5]: minimum length of processed CAP3 sequence
###
######################################################################

use Bio::SeqIO;
use Bio::SearchIO;

$qualfile=$ARGV[0];
$joinDist=$ARGV[1];
$windowSize=$ARGV[2];
$quality=$ARGV[3];
$val=$ARGV[4];
$minLen=0;
$minLen=$ARGV[5];

$in = Bio::SeqIO->new(-file => $qualfile, -format => 'qual');

my $sum=0;
my $outside=1;
my $teNum=0;

for($i=0;$i<$windowSize;$i++) { #define array of set size and fill with zeroes
  $window[$i]=0;
}

while($seq=$in->next_seq()) {
  for($i=0;$i<$seq->length();$i++) {
    unshift(@window,$seq->qual()->[$i]); #add new quality score
    $sum+=$seq->qual()->[$i];
    $sum-=pop(@window); #remove oldest quality score and update sum
    if($sum>=($quality * $windowSize * $val) && $outside==1) {
      $outside=0;
      $scaf[$teNum]=$seq->id();
      $te[$teNum]=$teNum;
      $len[$teNum]=$seq->length();
      if($i-$windowSize<0) { $start[$teNum] = 0; }
      else { $start[$teNum]=$i-$windowSize; }
    }
    if($outside==0 && ($sum<($quality*$val*$windowSize)) ) {
      $end[$teNum]=$i;
      $teNum++;
      $outside=1;
    }
  } #end scaffold
  if($outside==0) { $end[$teNum]=$seq->length(); $teNum++; }
  $outside=1;
  $sum=0;
  for($i=0;$i<$windowSize;$i++) { #define array of set size and fill with zeroes
    $window[$i]=0;
  }
}

#perform joins
for($i=0;$i<=(@scaf+0);$i++) {
  if($start[$i+1]) {
    if(abs($end[$i]-$start[$i+1])<$joinDist && $scaf[$i] eq $scaf[$i+1]) {
      $end[$i]=$end[$i+1];
      splice @start, $i+1, 1;
      splice @end, $i+1, 1;
      splice @scaf, $i+1, 1;
      splice @len, $i+1, 1;
      splice @te, $i+1, 1;
      $i-=1 #unless $i==0;
    }
  }
}

#output for get_fasta2
for($i=0;$i<(@scaf+0);$i++) {
  if(abs($start[$i]-$end[$i]) > $len[$i]*$minLen) {
    print "perl get_fasta2.pl " . $scaf[$i] . " " . $te[$i] . " " .
      $start[$i] . " " . $end[$i] . " 0 f \n";
  }
}
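The trimming script above relies on a running sum over a fixed-size window of quality scores. The Python sketch below illustrates that sliding-window idea in isolation. It is a simplified sketch under assumptions: a single per-base threshold stands in for the product of the quality baseline and multiplier, the join step and minimum-length filter are omitted, and the function name is hypothetical.

```python
from collections import deque


def high_quality_regions(quals, window_size, threshold):
    """Return (start, end) index pairs of high-quality regions.

    A region opens when the sum of the last `window_size` scores reaches
    threshold * window_size and closes when the sum drops below it.
    """
    window = deque([0] * window_size, maxlen=window_size)
    total = 0
    regions = []
    start = None
    for i, q in enumerate(quals):
        total += q - window[0]  # newest score enters, oldest leaves
        window.append(q)        # maxlen discards the oldest score
        inside = total >= threshold * window_size
        if inside and start is None:
            start = max(i - window_size, 0)  # back up by one window
        elif not inside and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:  # region still open at the end of the sequence
        regions.append((start, len(quals)))
    return regions
```

Because the window is primed with zeroes, a region cannot open until enough high scores have accumulated, which is why the start is moved back by one window length.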
D.4 Generate Consensus
######################################################################
### This is part of TESeeker, licensed under the GNU General Public
### License (GPL) v3.
###
### Author: Ryan Kennedy
###
### Arguments:
### ARGV[0]: alignment file to create consensus from
### ARGV[1]: percent of nt that must be common for consensus
###
######################################################################

use Bio::SeqIO;
use Bio::SearchIO;

$alnFile=$ARGV[0]; #file to make consensus from
$percThresh=$ARGV[1]; #percent of nt that must be common to go to consensus
my @A=0;
my @T=0;
my @G=0;
my @C=0;
my @consensus="";
my $numSeqs=0;
my $length=0;

$in = Bio::SeqIO->new(-file => $alnFile, -format => 'fasta');

while($seq=$in->next_seq()) {
  $numSeqs++;
  my @seq_array=$seq->seq() =~ /./sg;
  $length=$seq->length();
  for($i=0;$i<$seq->length();$i++) {
    if(uc $seq_array[$i] eq "A") {
      $A[$i]++;
    } elsif (uc $seq_array[$i] eq "T") {
      $T[$i]++;
    } elsif (uc $seq_array[$i] eq "G") {
      $G[$i]++;
    } elsif (uc $seq_array[$i] eq "C") {
      $C[$i]++;
    } elsif (uc $seq_array[$i] eq "N" || $seq_array[$i] eq "-") {
      $A[$i]++;
      $T[$i]++;
      $G[$i]++;
      $C[$i]++;
    }
  }
}

for($i=0;$i<$length;$i++) { #calculate percentages
  $Ap[$i]=$A[$i]/($A[$i]+$T[$i]+$G[$i]+$C[$i]);
  $Tp[$i]=$T[$i]/($A[$i]+$T[$i]+$G[$i]+$C[$i]);
  $Gp[$i]=$G[$i]/($A[$i]+$T[$i]+$G[$i]+$C[$i]);
  $Cp[$i]=$C[$i]/($A[$i]+$T[$i]+$G[$i]+$C[$i]);
}

for($i=0;$i<$length;$i++) { #find commonalities
  if($Ap[$i] > $percThresh && $Ap[$i] > $Tp[$i] &&
     $Ap[$i] > $Gp[$i] && $Ap[$i] > $Cp[$i]) {
    push(@consensus, "A");
  } elsif ($Tp[$i] > $percThresh && $Tp[$i] > $Ap[$i] &&
           $Tp[$i] > $Gp[$i] && $Tp[$i] > $Cp[$i]) {
    push(@consensus, "T");
  } elsif ($Gp[$i] > $percThresh && $Gp[$i] > $Ap[$i] &&
           $Gp[$i] > $Tp[$i] && $Gp[$i] > $Cp[$i]) {
    push(@consensus, "G");
  } elsif ($Cp[$i] > $percThresh && $Cp[$i] > $Ap[$i] &&
           $Cp[$i] > $Tp[$i] && $Cp[$i] > $Gp[$i]) {
    push(@consensus, "C");
  } else {
    push(@consensus, "-");
  }
}

$i=0;
until($consensus[$i] ne "-" && $consensus[$i+1] ne "-" &&
      $consensus[$i+2] ne "-" && $consensus[$i+3] ne "-") {
  shift(@consensus);
}

$i=(@consensus+0);
#print $i;
until($consensus[$i] ne "-" && $consensus[$i-1] ne "-" &&
      $consensus[$i-2] ne "-" && $consensus[$i-3] ne "-") {
  pop(@consensus);
  $i=(@consensus+0)-1;
}

print ">Contig\n";
print @consensus;
print "\n";
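The column-wise voting performed by the consensus script can be summarized with a short Python sketch. It follows the same rule as the listing above: a base is called only if its fraction in the column exceeds the threshold and is strictly greater than that of every other base, and N or gap characters count toward all four bases. It is an illustration, not a replacement for the script: the trimming of leading and trailing gap runs is omitted, and the function names are hypothetical.

```python
def column_consensus(column: str, threshold: float) -> str:
    """Return the consensus base for one alignment column, or '-'."""
    counts = dict.fromkeys("ATGC", 0)
    for ch in column.upper():
        if ch in counts:
            counts[ch] += 1
        elif ch in "N-":
            # Ambiguous or gap positions count toward every base.
            for base in counts:
                counts[base] += 1
    total = sum(counts.values())
    if total == 0:
        return "-"
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    (best, best_n), (_, second_n) = ranked[0], ranked[1]
    # Require a strict winner whose fraction exceeds the threshold.
    if best_n > second_n and best_n / total > threshold:
        return best
    return "-"


def consensus(aligned, threshold: float) -> str:
    """Build a consensus string from equal-length aligned sequences."""
    return "".join(column_consensus("".join(col), threshold)
                   for col in zip(*aligned))
```

Raising the threshold makes the consensus more conservative: columns without a sufficiently dominant base are reported as gaps.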
APPENDIX E
TRANSPOSABLE ELEMENTS IDENTIFIED
This appendix presents all full-length consensus TEs we identified in the P.
humanus humanus genome and selected sequences from the C. quinquefasciatus
genome. We offer a detailed annotation of the mariner element from P. humanus
humanus and an annotated version of the putative mariner from D. melanogaster.
E.1 P. humanus humanus
E.1.1 Non-LTRs
E.1.1.1 Hope-like SART
LOCUS hope-like 4655 bp
DEFINITION hope-like, 4655 bases, 1B9 checksum.
ORIGIN
1 GGCCGATGGG GGTCACGGCC TCGTGTCCCG AGAGGCGAGT CGACCTGCGC51 GTGGAAGGTC TTTGCGAGTC RGCGACGTCG TCGATGGTCG CAGCGGAATT101 AGCTAGGGTA GGGGGGTGCT CGTCATCGGA CATCTCTGTG CGCTTCCCGA151 CCGGCGGAGG CCCACATYGT WACTCGTSYG CGTACGCATC CGTCCCGGCG201 GCCGTCGCGT CCGTYCTCTT RAAGGCRGGC GCGGTTGTCR TCGGGTGGRT251 GCSTGCGAGG GTGGTCCTGC TGCCGAAACR CAGGACCAWC TGCGTCAGRT301 GCCTACAYCY GGGGCATGTA GGCTCGCGTT GTCCCAATGT CAGGGAAAGG351 GGAGATACGG CGCTGGACCG CTGCTTCAGG TGCGGGGAGA AGGGGCACCG401 GGCGCGAGAA TACGCAAACA AAATCCGCTG CGCTCCCTGT TCGGAGGCGG451 GCCGGCAGGC TCAACACCGG CTCGGTGGGT CTTCGTGCGG GGCGCCGGCC501 GTCAGGGGTA GGCCCATTCG GTGGGAATCC GATACCAGGA ACACGGGGGG551 AAACGTTAGG GGGGCAGCCT CCGCATGTCA GCCGGGAACG AACGGCTCCG601 CCGGTAGTTC TCTGTGAGCA TGAGGAGGCT TCTCCAAATA AATCTGAATA651 GGTGCAAAGG GGCACAGGAT CTCTTGCTTA ACTTGGTAGG GACGTGGGGC701 GTCGGGATCG CGATGATCTC GGAACCCCAC GTTGTTCCGT CGGGGGAAGG
751 GAGTTGTTCC GCTGGCACTT GGTTCGTATC CGAGGACGGC GGGGCCGCGA801 TCTTCTTCTC GAGCTTGGCG CCGGGTCCGA AACCTCGGGA ATTGTCCCGT851 GGGGTCGACT TCGTTATTGC AGGCTGGGGA AACTCGGTCC TGGTATCGGT901 ATACGCGTCG CCAAATCGTC CATTRMRAGT ATTTGAAGAT CAGCTGCGRG951 CGATTTCGAG GGAGCTGGAC AGGTTAGGGG ACTCGCGTCC GATTCTGATG1001 TCCGGGGACT TCAACGCGAA ACAYRCATCG TGGTCAGGTT ACGCGACTAA1051 CGCGCGCGGC CGATTGTTAA GACAGTGGTT AGACGAGCGT CATCTCATMG1101 TGTGGAACGC AAAAGGAKTC AAGACGTGCG TGCGGTCCAC CGGCGGGTYC1151 AYGATCGATC TCACGATTTC ATCCGGTTCK ACGGCGRCGC GGGTCTCCGA1201 CTGGGAGGTC CTTTCCGACT TCGAGTCCCT CAGCGATCAC GCCTACGTGG1251 CGTTCTSGTT CGGGGATYCT YGGAGCTCGG GGGGGGGCAG CCTCCTCGCC1301 GGACARRCCT CGCGTCGTCY CRCCCCGRTG GAGTCTGCGG TCGTTSGATG1351 ACGAACGGCT CGCCGAATGT GTCGGCTCGC GGCTGAGGTC TGGAGARATG1401 ACGTYGCTCC CGGCCCGAMA AGACGAAACG CACACGATAT GGCGCGGGGA1451 GTTCGGGACG ACTTAMGACG GATCTCGGAC TCCTGCATGA AGAGAATCGG1501 CTCGCGCTCG ACGACTCGGA AACGGSCGAA GTACTGGTRG ACGGACGAGA1551 TAGGAGAATT GTGGCAGGAG TCCTGTCTCG CACGTCGCCR GTTCGTCGGC1601 AAAAAAAGAC AAYTRATCAA ACGGGGAGGA CKTTACGAGG ACATCGCGAA1651 CGACGACGAA CTCGCGCKTT TAAAGGCGGA GTGGAGRGCT TCCCGGGYCC1701 GGGTCCGACG AGCGATATGG AGTTCTAAGG RGACGTGYTG RAAAGAACTC1751 TTGAGCGAAG CGGATTCGGA TCCGTGGGGT CTCCCRTTCC GTCTYGTGAC1801 CAACAAACTR AGRGGTTCGT YGGSACCGCT GACGGCCGGC ATGACGGAGR1851 AGTTTCTCGC CGAAGTGATC GCAGAGCTCT TCCCTCGCGT CGARCCGTTC1901 GCGTTTCCCC CGCCGGATYT CTCGCACGAC GCGAACCGCG ACCACGASAY1951 CGAAGTGACG GAGGGGGAGG TCCGCCTTCT CCTCGACGCG GCCTCGAAGA2001 GACGCTCGGC TCCGGGCMCG AACGGCGTGC ATTACCGCAT TATCGGCAAC2051 TCTGCCGGGG TGCTGTGTGC GCGTCTCTCG GCGCTCTACA CCGCGTGTTT2101 CCGAGAGGCG ACGTTTCCGA CGCCGTGGAA AGAAGCAAAC TTGGTACTGC2151 TGGACAAACC GGGTAGGGAT CCGACGACGC CGAGCGCGTA CCGGCCGATT2201 TGTCTTCTCG ATGTGGARGG CAAGTTGTTC GAGCGCGTGA TAGCGTCYCG2251 GATMGACGAA CACTTACGAT CAGGGGGAGG GAGGAACGAC CTCTCTCCAA2301 ATCAATACGG CTTCCAGACG GGACGTTCGA CGACCGACGC ACTCGACCGC2351 GTGTGCGCCG GAATCCGGKA CACGCTTCAC CGGGGCGGCG TCGCGATCGC2401 CGTCTCCATC GACATCAAAA ACGCGTTCAA CACGGTACCT 
TGGTCCGCGA2451 TAAGGGACGG GCTTAMGTCG AAGTCCGTGM CAGACTACCT CGTGTCGGTG2501 ATAGRGTCCT TCCTCTCGGA GAGGAARATC GCGTACGAGA AACCGGACGG2551 AWCGACGGGA AGGGCGGACG TGTTCTRCGG CGTGCCTCAG GGATCGGTTC2601 TAGGRCCACT CCTGTGGAAC ATCGCCTACG ACCGYGTCCT CACCCGGACC2651 GTTCTTCCCG AAGGCGTWTC GTTGACTTGC TACGCCGACG ACACTCTGCT2701 CCTCGCGACC GGTCGCGGGT GGGCGGAGGT TCGCGAACGY GCCGAAACRG2751 GACTTAACGC RACGGTGGAG SCCATCCGGG ACACCGGTCT CCGGGTGTCT2801 CTGCCGAAAA CGGAAGCTTG CGGGTTTCAC CGTCCKCGGA ATCCGCGTCC2851 TCGCGACTTR TCGATAATCG TGGACAACGT CRGAATCGGA GTGGGTAGTT2901 CACTTAAATA CTTGGGTTTG GTTCTCGAYT CCGGTCTTCA CTTCGGGGAA2951 CATCTGGCRC GCYTCGGACC GARRATCCGA GCCGTCAGYG CGACGTTAGG3001 RCGTCTGATK CCGAACCTGC GYGGMYCGCA GGTCAAGGTC AGRCGATTAT3051 RCRCGACGGT CGTTCACTCC GTGGYCCTGT AYGGAGCTCC GATCTGGGCC3101 GAGTCGSTTT CGGCAGCTCG GCCTCTTCGC GAGAAGGCAA TCCAACTCCA3151 RAAGGCGTCT CTGAACAAGG TGGCGATGGC GTACAGGGAC GTCGYGGCGG3201 AGGTGTCTTG TCTTCTGTYG GGCACTCCCC CGCTYGATCT CCTCGCGATG
3251 GARAGACTCG TYCTCTATCG CGAACGGGAT CGCGGCGGGC CTAGGACAGC3301 TCGTCACCGC ARGGAACTTC GCTCTCAAAC GATCAACACG TGGCAGGCGC3351 GATTCACCGA CGGGCGTAAG CGTTACGGGG GGGAAATCAT CAACGTTCTC3401 GGCCCCCGAG TGGGGGAATG GGTRGGGCGR GGTCACGGAA ACTTGACTTA3451 TCGCCTCACT CAGRTCCTCA CRGGTCACRG CGTGTTCGGC TCGTACTTGG3501 CGCGCATCGG CAGAGAGGAG ACGGCGRAGT GTTGGTTCTG CGGTGCGCYC3551 GAGGACGACG TCGARCACAC GGTCGCGATA TGTYCCAYGY GRGASGTCCA3601 YAGRCAGCGG CTAGTCGAGG TCATYGGACC CGACTTGTYM ATACGCGGTT3651 TAGTGAACGG GTTGCTCCGC GGACCTCGGG AATGGTCCGT GATCTCGCGA3701 TTTTCGGAAA CGGTCCTACG GACGAAAGAG GACCGCGAAC GCGAAAGAGA3751 AAAATCCGGA GTTCGGCGTC GCGCGCTCCA GGAAAAAAGA AAAAGGAAGA3801 AAAACAACGG CGGCAGCGAC GGCGAAAAGG AAACAGCCAA CGCCTGATCA3851 TCCGTGGATG ATGGACGATC GCATCCTCAG GAAACAAGCG AGTTTCCCAT3901 CCCCTTCCCC CAGCGAAGGA CCGGAGGACA CTCCCGAAAG GGACTTGCCG3951 AACGTCCTGG GAAATGCGTT CTCGTCTTCG AGGTGGCAGC AGAACGAAGA4001 CTAGGGCGAA TCTCCTACCA AGTAATCGTG TAGAACGGTA ACGTTCTGGC4051 TTCCCGGAGA GGGGACTGAC TGAAATGTAG GTACAACTCC CACGAACGTC4101 CCCCCCTCTT CATATAAGCC ACGCTATTGC GACATGCTCA GGAGTTTTAG4151 TGGGTATGGT TCCCGGTTAG GCCTTCTTTT CGAATCCCAT ACCGAGCCCC4201 ACACTCCCTC TTCGGAGGGG TTGTGCGTAA CTTGCATTTC TCCTGAGTTA4251 ACAAAAAAAA AAAAAAAAAA TTTTTTTTTT TTAAAAAAGG TTGGCTTGCT4301 CAGAAAGGGT CAGCGACCCG TTCTGAGTCT CTGGCTGGGG AGTCCTCTCC4351 GGGTTCGAGA CCTTGAGCGA TCACGTTGGC GGTGAATGAA TGTATGGATT4401 TCTTTTTTTT TTGGAATGCA CATCACCATC ATTCTTATAA GGTTCTTATA4451 AGGTACACTC CAAGATGAAT ACTGCATTAA ATTATATATT TATATGTATA4501 TGGGTAATTA TGGAAGGCAG AACCGAAAGT AAAACCTCGG GAACGTCTCC4551 CAGATTGTCA TTGGGAGAAA AGAATTCCAA TTAGGTGGAG AGGATCTTTG4601 CACTCCACTG TTCTATCCAC TATTTACATT TATTTCAAGC GGGAATCCAT4651 TGTTA
//
E.1.1.2 Dong-like R4
LOCUS dong-like 5266 bp
DEFINITION dong-like, 5266 bases, 28B checksum.
ORIGIN
1 GAAAARGAGG GCGCTGGTGT TCCATGGTTG ACAAGCCTTT TAGTCATATC51 GTTTTCTTTA ATTAATTTTA ATAGTTTTTT TTTAGGGTTT ATTTATAAGT
101 TATATATCAA AAGTTTTGTA ATTTAGTCTA GTTTACAGTG ATATAATTAG151 TTTTTTAAAT TATTTAGTAA TAACTATATT ATCTCAAGTT TAAATTTACA201 AATTGTTCAA CGATTCCATG TGACAAAAGG AAGCGTGGGC AATAAAAAAC251 CTTAGAAGGC ACCTTAGTAC ATATAAATTA AACGGATAAG TACCTCATTT301 TAAGACTTAA ATTTCAATTG TAAAACGTTA GGCTAACAAA GAAATTGAAC351 ACATAACCAA AAAATTTTTT TTATTTTTAT TTTTTTTttt tATATTTCTT401 TGATTAATTA TAAGTGATTT GACTATTTAT ATCTTTATCT TATTTAACAA451 AATAAGAATT AAAATATTTT TACAATAATA TAGGTTGATA CATGATTTTT501 TCAGGGTGTA ATTTCTATAC CCTTATTGTT AAGTGTTAAC TTGTGATTGT551 ATGTTATTTT AGCCTTGTCT TAAGCTAACT TAATTCATTG TAGTAAAGTT601 TATTAATAAT AACATAGTGT AACTAATACC TTTCTTAAAT CTGAATCAAA651 TTAGGAAAGT ATAGTAGTAG TCTTTATATT AAGTGATATT AGTGTGAAAG701 GGTACAACAG ATTTCTCTTT AACCAGACTT CATTTAATAA ATTTTGCATC751 TTTTGAATTA CTTATAAGAG TTAAATATCA TTGTCATACT AAATAAGAAT801 AGAGGTAGTT CAATATAAAT ACTCATTATT TATTGGTTTA ATGTAATGTA851 TAGCTGGATA TAACATCTCT GTACTCCTGT GTTAATCTAC TGATTGTTTT901 AATACAGAAT ATTTTGCCAT ATACTTTATG CAAATCTCCA TTTTGATCTA951 CATTAACACT AGCAGAGATC AATTTCTTTT ATACTACTTA ATAATAGTAC1001 AGAATTGTAA TTTGTATGTT GTGCATACAG TTAAGTGTAG TCTATGTAAT1051 CTCTTTTATC CTTAAATTAG TTTGTTTTGT TACTGCTTAG TTTATAATCT1101 TGATAGTGTA ATATTGCCTG TAAATTTGAT TCAAGATCAA TTAATTAAGT1151 TATAAATAAG TAGCCTTTTA GTATTTGTTT CTAATGTAAA AATTGTGAAA1201 AAGCATTATA CATTTGAGGG ACCACAACTT AAACAAACCT TATACAACTT1251 AAGTTGAAGT CTAATTGCAT ATTATAACTA AATAGTCTAG CCAAATATCT1301 CTACCATACT CATTGAAAAT AGGGATAGCT TAACAAAATA GGTTAATAAG1351 ATAGTTAGTA ATATCTGATT TAATATATCA ATTGGGTAAA AAATTGGTTC1401 TGTCTAATCC AGTTGTCTGA TAACTAGTGT TCTAATTGTA TAATATTTAA1451 CCTTACATTA AATTGTTTCT GCCTTATTGC AGCATATTTT ACATGATTTA1501 CAGCAAATTG ACTTTTATAA GTATAAGAGA CATAAGTGAT AAGTTTAATT1551 AAAAATTTTA TAATAGTGAA CTCTAAATAC TGAATTAGAA AATGTTTTGT1601 TATAAAGGTG AATGAATTTA TAAATTTATA TATAGGTAAA AAGAGCAAAA1651 TGAGAACAAG ACAAAGTAAG AACCAAAACA AATCTTGTTC TACTGTCGAC1701 CTGAGGCAGC TCGACGAGAA TGTGAATTTC ACTGCTTCTG ACCCTGGCCA1751 CTCCAACGAT GTTAGTCACC GGAGCCCAGT TCAGGAGACA ACTCGTTCCC1801 
ATCGAATAAG ATGGACTCAG GAAGACCTTC AAGAATTGAT GTGGTGCTAT1851 TTTTACTCTC AAAAATTTGG TTCAGGATCG GAAAGTGACA CCTTTAAAAT1901 CTGGAGAGGG AGAAACCCAA ACAGCAGGAA GGACATGACA TCGAAGAAGT1951 TGGCTGCCCA AAGAAGATAT ATAATAAAAA CAAAAAAAAT TGAAAATGAT2001 AAATTAGAAG AAATTAAGAA AAATGTAGAC TCTTCGTGCA GTAATGTTAT2051 ACGAGATAAT GCACAAATAA TAACAGAATT AAATGAGTCT AACCACAAGA2101 GTAAAACCGA TACTGACGAG CAATTGTTGA CTGATGCAGA AATGAAGAGC2151 ATCGAAGAGA GACTAATTGA AGAGATAAAA AAAGTAAAGA TGTGCCCATT
2201 AATAAACAGG GAGCCATTAA GGAAAATTTA CAAAAATAAA AAGGCAACTG2251 AAGTTTTACA TCTTATAGAT AACACCCTCA TAAATGTATT AGAAAAAGTG2301 GTAGACATTA ATTTAACCAC AATTAATGAA ATTATATATG CTGCGGGGGT2351 AGTTGCCACA GATATTATAC TTGGGCCAAG AAAAGAAGCT CGTCATAAAG2401 GAATGGAAAC AAAAAAGTCA ACATCACCAA TATGGATACA AAGAATAGAG2451 GGAAAAATAG AAAGAATAAG ATTACATATC TCCCTGGTAT CCGAAATGAA2501 GAAGAATAAT AATTTAAAGA AGAGGACAAT AAAGAAACTG GATCACCTTA2551 AAAGGATCTA TAAATTGAAG ACCATGGAAG ATATAGAACT TACCATGGAA2601 ACTTTAAAAC AAAAGGTATT GCTCTATTCT CAGAGAATAA GAAGATATAA2651 GAAGAGAGAG CAGTTTTGGC GACAAAATAA ATTGTTTGAA TCGGACCCAA2701 AGAAATTTTA CAGAACAATT AGAGAGCAGA ACATACAAAA TGGATTCAGC2751 ACCTTAAATG TAGAAAAAAT GGCAGACTTT TGGTCAAATA TTTGGGAAAA2801 ATCTCATCCA TTAAATAAGA ACTCTACATG GATGAATAAA GAAAAAGAAG2851 CACATGCCTG GATTGCTTCT TCAACAATGT CGGATGTAAG AATGGCGGAT2901 TTAGAAACAT GCCTAAAAAA CACGGCCAAT TGGAAAAGTC CCGGATTAGA2951 TAGGGTACAA AACTTCTGGA TTAAGAATTT TACAAGTACC CATAAGTATT3001 TACTGGTATC CATAAATAAA TTAATAATGG GACGGCAAGA AATGCCAGAA3051 TGGATAACCA CAGGTAAAAC GTATTTGCTG CCAAAAAAAT CAGGAGCTAC3101 GGAACCGAAA GATTTTCGGC CGATAACCTG TTTGCCCACA ATGTATAAAA3151 TAATAACAGC TATTATTGCT GAAAAGATTT ATGGGCATTT AAGAAAAAAT3201 AACATTTTTC CTCCTGAACA ATATGGATGC AGAAAAGGGT CTTACGGCTG3251 CAAGGAGGTT TTATTAATAA ATAAATTGAT CATGGCCAGT GCAAAACAAA3301 AGAGGAAGAA TTTAAGCATG GCATGGATTG ACTATCAAAA GGCCTTTGAT3351 AGTGTGCCTC ACGAATGGAT TATTGAGGCA TTGAAAATAT ATAAAGTAGA3401 CCCTAATATT ACAGCGTTCT GCGAGAAGAG TATGAAAAAT TGGTGCACCC3451 AGCTGGAAGT GCAAAAATAC TCTTCTAGAA AAATATTTAT AAAAAGAGGA3501 ATTTTTCAGG GAGATTCATT GTCGCCACTT TTATTTTGCA TGTCTTTAAT3551 TCCTCTATCC AGACAGCTTA ATATCAAGGA TCAAGGATAT GAGTTGGTAC3601 CGGGAGGCAG GAAAATTACC CATATGCTAT ATATGGATGA CTTAAAAATT3651 TATGCCAAAA ATGAAGAGGA GTTAAATAAA ATGTTACGGA CGGTTCAAAC3701 CTTTTCCTCT GACATCAACA TGAAATTTGG GTTAGAGAAA TGTGCCAGAA3751 TAAATATTGT CAGAGGAAAG TTAAAACAAA AGCAAAATAT AGAAGACTCC3801 GAAGAAGAAC TTATTAAAGA ATTGGACCCT GGATCATCAT ACAAATATCT3851 GGGGATTGAA GAAAATTTTG GGATAGCCAA CAAGGAAATT 
AAACCTCGAT3901 TGAAAAGAGA ATATTTTAAA AGATTGAGGC TTATATTACA GTCGGAACTG3951 AATGGAAGAA ATAAGATAAC CGCCGTCGGC ACATTAGCAG TTCCTGTAAT4001 AGAATATAGT TTTGGCCTCG TAGACTGGAC GAAAGAAGAA ATCACGCACC4051 TAGATAGAAG GACAAGAAAA ATATTAACCA TGAATGGTGC GTTACATCCA4101 AARGCTGATG TGGATAGATT GTACGTCAGC AGGAAAGATG GAGGAAGAGG4151 ATTACGACAA ATAGAAGCGG CATACCAGAA TGCCATWATT GGAATGGGAA4201 AATACATAGA ATCCCATCRA GAGGACCCYA TCTTAGCCCA AGTTATACAT4251 GCAGAAGAAA AAACTACAAA AAAAGGAGTT CTGAAAAGGG CAAAACAAAT4301 CGTCCAAGAA AATAAAGAAA ACGAGATAAT GGAAGARGGG CAACTTGCAA4351 CTTATAATAG CAAAGCCCAA TCTCAGAAGA AATTAATAGG CAAATGGGAA4401 CAGAAAAAAT TACATGGGCA ATACCTAAAA AGAATAAATG CCGAAGATAT4451 TAATAAGAAG AGCACGCACA ATTGGCTACG ACGTGGAAAA CTTAAAATTG4501 AAACAGAAGC GTTTATTACA GCGGCTCAAG ATCAGGCATT GCGGACCCAT4551 AACTATGAAA AAGTAATTCT CAAAGTCCGC CAAGATGACA AGTGCCGAAT4601 TTGCCAATCW CAATCGGAAA CCATCGATCA TTTAATTTCC GGTTGTCCGA4651 TACTGGCAAA ACATGAATAC TTAGAAAGGC ACAATAAAAT ATGTCAATAT
4701 CTCCATTGGA ATATATGCCG AGAATATGGA ATGGATGGAT TACCCAAGGA4751 GTGGTACAAC CACATTCCAA GCCCGGTTAC GACAGTAGGT CCATGCACAG4801 TTCTATATGA TCAACAAATC CACACTGATA GAACTGTGCC AGCTAACAAA4851 CCGGATATCA TCCTCAGGCA TAATGGGGAA AAATGGTGTA AGTTAATTGA4901 GGTATCCGTG CCGGCAGAAA AAAACACCAC AGCCAAAGAA GCAGACAAAA4951 GGCTGAAATA CAGAAATTTG GAAATTGAAA TAACCAGGAT GTGGGGAACA5001 AAAACTGAAA CGATTCCGGT CATCGTGGGA GCATTGGGAG CCATGCCAMA5051 TTCAATAAAA GGAAATTTAA AGAAGATTAT GAAGAACCTA AAAGAAGAAA5101 CCATCCAGGA AATCGCACTA TGTGGGACGG CCCACATTCT TCGGAAAATA5151 TTATAAATAG CACCACCGAT ATTCGAGTAT TTGTCCCTAA GGAATCGGGT5201 AGAGACCTGG GATAAATGCT TAAAAACCCC GACAAATAAG AGCCTTGTCG5251 TGTTTAAGAG TGACCA
//
E.1.2 LTRs
E.1.2.1 Mdg1 ty3/gypsy
LOCUS mdg1 5395 bp
DEFINITION mdg1, 5395 bases, 9DF checksum.
ORIGIN
1 TGTTATGATC CCGTACTCAA TATTCACATT TTCAAATTTT TATAAACAAA51 AGACTGGCGA CGAACTCGAA TTGTTCTTCT TTAGAAGCGA AGCTTCTAAA
101 GACTCCTTCT TTTATTTTAT TTGCAAGTGT TTGCTAGTTC CCATTGTTGC151 TCGTTCGAAA CAGTTGACTG ACGAAAGACC AAAGTAATAT AGGCAAGCTA201 CACAGCTTGG CGTnGGTTCA TTTGATTTTT AGTCTAGCTG TAATAAGTTC251 TTGTTTATTA ATATTAGTTT TGTtagtGAT AGTTTTAAGT GCTTGTTCAT301 TACTTACAAA AATAAAGAAC GAACCGAATA TAACATATTA AAATTTTGGC351 GCAGTCGACG AAGGATCATT CTGAACGACA AGAAACCGGA GAATTCTTAA401 GTAATTCGTT TCATGTTGGT GGGTTTTAAC TCTTTATCAG AATTTCTTTA451 TACTGTAAGT AGTAGTAATA GGTAACTCAT TTttTGAACA ATGCCTAAAC501 TTGATCAACC TGTTTTCAAC CCATCCCAAA TGTCCCTTAG TTTCTTGTGT551 AGTCTAGTTC CTAATTCCTT TTCCGGAGAA AGGAACCAAC TTAAtGCATT601 TATctcTGAT TGTGATAGTG TTATCGAAAT GTCCTCTGAA GAgAATAAGT651 ATCCACTCTT TAGATTCATT TATTCCAGAA TCACTGGTAA agCCAgGgaT701 CGAATTTCTa TATACCATTT TGATAATTGG GATGACGTAA AAaCGAAACT751 CATAGAATTA TATcAAGACA AGAAACCCCA TAGTCAACTA ATGGAGGAGT801 TGACTAGTTG CAGGCAAAAA TCAAATGAAT CGGTAACGGA ATTTTATGAA851 AGATTGGAGA ATTTGTCCCG TGAAATTGTA TCCAACTTAA AGGTGGATGT901 CAAGGATAAA AGAGCTCACT CAATAAAAAT AGATTACATT AACGAGATTG951 CTTTGGGTCG TTTTATTTAC CATTCAAATT CCGGAATTTC CCAAGCGCTG1001 AGATGGAGGA ATTTCGACAG TATCAACTCA GCATATTCAG CAGCCATTGC1051 TGAAGAAAAA TTTCTAGAAA TGAGAAGGCG TGATAGGTGC TGTAATTGTG1101 ATTCTAGAAA TTCTAATGCT TCAAAACTAT TGTCTCCAGC TCAAATTCCA1151 AAATTCTGCA AGTACTGTAA AAAGTCAGGA CATCTTTTAG ATGAATGTTT1201 CCGAAGAAAG AAAAtTAATG AGCTCCGAAA AAAACAAAGT AAAGCAAGTG1251 TTAATTTAAA CTTACATCAG TCCCCAGTGG TAGACACCGC ACtGGAGGAG1301 TCAaTAGGtC aACTGAAGGT GTCAGAAATA TAAAATTcTT tGTTGTTAAt1351 GATAATtTAA ATTTCATTAA AGTATCGTCT AATAACTCCA AAAATCCgGG1401 TAAGTCgCTG AAATTCATCT TAGATACCGG TTCCAGtGTA AACATtATAA1451 AaGCGTGTAA GTTAACtCCT GAAACTAAAT TTAACTCCaA gGAAAAAaTC1501 AAaCTTCAgG GAATCAGTCA tagTCAgCAA gCcTTGGAAA CAGTCGGTtC1551 tTGcgTAATa CCacTAagAA TcGAGGtaaA AtaatataCC aCAAAATTTC1601 AtATTCTtaA TCAaaCAACT AATATcCCat AtGATGGTcT GCTAGGAAAA1651 GAATTTTTTa TAaAAcAcTC tGCATGCATC GAATATGAAA CCAACACcGT1701 AAAATTGAAA GAAATTTCTA AACCACTAGT TGTGTATTCC GAGGAGACTT1751 CATCTTCAAA CAACCTTCAC CTTAAGGCTA GAACAGAAAC CaTAGTTAAA1801 
ATAAACATCC TCAACCCtGA AATAAAGGAA GGAATAGTGC CAaACACAAA1851 AATTATGGAA GGCGTCTATT TGTCaCGTGC CATCGTTAAA GTAAACAAtA1901 ACAACGAGGC TTACGCCACC ATTTTGAATA CAAGAACGAC TGATcAAACA1951 ATTGAACCAA TCACaGTTcG CCTTGAGAAA GCTTCAAAAC AGTGTTTcCA2001 AATAAAAAGT CTAGAACCAC GACATAAAAG GAAATCAATA ATAAGTAACC
2051 ATTTGAGGTT AGAGCATCTT AATGAAGAAG AAAAAACATC AATAGTTAAA2101 ATTTGTGAAT CTTATTCcGA TATTTTCCAC TTACCCGGAG ATTATtTAAG2151 TTCGACAGAG GCAATTGAGC ATAAAATCAA TACTATTAAT GAAAATCCTA2201 TTTATACTAA AACTTACAGA TATCCnGAAA TACACAAAAG AGAGGTTAAC2251 AAACAGGTGA CCGATATGCT GAAACAAAAC ATTATAAGAC CTTCCAATTC2301 TCCTTGGTCG TCTCCATTAT GGGTTGTTCC AAAAAAATCT GATGCTTCTG2351 GAGAGAAGAA ATGGAGAGTA GTAATTGATT ATCGAAAATT GAATGAAGTT2401 ACAGTAGATG ACAAATATCC TATTCCTAAT ATTGAAGAAA TTTTAGACCA2451 GTTAGGGCGT TCAAAATATT TTACTACTCT AGACTTGGCT TCCGGTTTTC2501 ACCAAATTCC ATTGAATGAC GATGACAGTA AAAAGACAGC TTTTACAACT2551 CCTTTTGGAC ACTACGAGTA CACGCGTATG CCTTTTGGAC TGAAAAATGC2601 CCCAGCTACT TTTCAAAGAC TAATGAATAC CGTTTTATCC GGTTTACAGG2651 GACTACAATG TTTTGTTTAC CTTGACGACA TCGTGATCCA TGCTGCTAGT2701 ATTCAGGAAC ACGAAATTAA GTTAAGGAAA ATATTCGACA GACTCCGACT2751 AAATAATCTC AAATTACAGC CGGACAAGTG TGAGTTCCTA AGAAGAGAAG2801 TCGTTTACTT GGGACATACT ATCAGTGACG TAGGTGTCAG ACCAAACCCA2851 GACAAAGTAC AAGGAATTAA CTCATTCCCT ATCCCGAAAA GTACTAAGGA2901 CATAAAGTCT TTCTTAGGTC TGGTAGGTTA TTATCGTAGG TTTATAAAAG2951 GATTTGCCAA AATTGCAAAG CCCTTGACAA TTCTTCTTAA GAAAAACCAG3001 GACTTCAAAT GGACAAACAA AGAGCAAGAA GCATTTGAAA AATTCAAGGA3051 AATCCTTTCC ACACAACCCA TAnTACAGTA CCCCGATTTT AATAGCGAGT3101 TTGTCTTGAC AACGGATGCA TCGAATTTCG CAATAGGAGC CGTCCTCAGT3151 CAnGGGGAAA TAGGAAAAGA CTTnCCAATT GCCTATGCAT CTAGAACTCT3201 GAATGAGTCC GAATTAAACT ATAGTGTAAT nGAnAAnGAA CTCCTAGCnA3251 TCGTATGGAG CGTCAAACAT TTTCGACCCT ACCTCTTCGG AAGGAAATTC3301 ACTATTGTCA CAGATCATAG ACCTCTAACC TGGTTGTTTA ATTGCAGGGA3351 GCCCAATAGT AGGTTGGTAC GGTGGAGAAT TAAATTAGAA GAATATGATT3401 ATAAGATTAT TTATAAGAAG GGAAGCCTAA ATGCAAATGC CGATGCACTC3451 TCCAGAAATG TCTATATAAA CGTACCATCG TCCGACAAAT CCGAAAGAAT3501 CCATCCCAAA AGTAAGGAAG AGATTAAGAA AATTTTGGTA GAAAACCATG3551 ATTCTAAACT GGCTGGACAT TGTGGATTTT GTAAAACCTA TCAAAGAATC3601 AAACAACGTT ATTACTGGAA AACAATGAAG AATGATATAA AAAACTATAT3651 AAGAAGTTGC AAGTCCTGCC AAGTTAACAA AATTAATTTT AAACCTATTA3701 AGGTGCCCAT GGAAATAACA ACTACCTCTA AGCAAGCTTT 
TGAAAAATTG3751 GCCATAGATG TCATGGGACC ATTGCCAACC ACTAGTGAGG GAAATAAGTT3801 CATTCTGACC ATGCAAGATG ACTTAACCAA GTATTCATAC GCGGAGCCCA3851 TACCGAACCA TGAGGCGAGA ACAATAGCTT CGAAGATCTC CAAATTCGTG3901 ACGTTATTCG GAATCCCTAA ATTTATTCTG ACAGATCAAG GCACTGATTT3951 TACATCGAAC ATAATTAAAA ACTTAATGAA ATTATTTAAA ACAAATCACA4001 TAAAATCTAC TCCATATCAT CCGCAAACTA ATGGAGCATT GGAGAGATCT4051 CATCTGACCC TGAAGGAATA TTTAAAACAT TATATAAACG AGAGACAAGA4101 CGATTGGGAC GAATTTATTC CTTTTGCCAT GTATTCCTTC AATTCGCACA4151 CGCATACTTC CACAGGTTAC ACTCCATATG AGCTTTTATT TGGCAAGAAA4201 CCGTTCATAC CGAATTCATT GATAAGAAAA TCTACACTCA GAAAATTCAT4251 TGATAAACCA TTTAATAAAA ATTACGATGA TTATATTAGT GATTTAAAGG4301 AGAAGATTCA AATCAGTCAA AAATTAGCCA GAGAGAATCT AATAAAACAT4351 AAGGTGAAGT CAAAACAATA TTACGACAAT AAGATCAATA TTCACGATTA4401 TAAGATTGGA GATTTGGTTT ATATAAAAAA CAATTTAACT AAAATAGGAA4451 TCAATAAAAA ACTTAGTCCA AAATTTAAAG GCCCATACGA GATCAAGAAA4501 ATATCTGGGA ATAACGTTTA TTTAAAAATC CGTAACAAAT TAGTTACTTA
4551 TCATGTTAAC AACACAAAAC CGTGTTCTGG GTAGTTGATC TAAAGAGAGA4601 AATCCAAAAT AATAATAATT TAATCCTTTT ATTTTCCCAT ATAGCTAATC4651 ATCATTTATT TTATTATTAA TCCATTTATT CTTTTTCTAA TCATTATTTA4701 TGTTTCCTAT ATACAATTCT TGTATTTTTC AGAACTTATG TCAATTTTGT4751 TTCGAATTTC ATGTCTTTTG ATTTTATTCT TTTATCTTTA CTCTTTATGT4801 AACATTGTTC ACTAACGAGT ATTGCCTAGG AAAGGTTCTG CTACTTGCAG4851 CTATAAAGCT TAAAATTATG TTCCTGATAA GAAGCCAATG TGAAGAGATA4901 CAACAGTCTA CCATACAGAA GACATAGCAC CTACTACTAC TCAAGGAGTC4951 TCAACCTGGA CGGAACCATT CAGAACCCGA CATAGGAAGC TGTCAAATCC5001 GAACAAGAGG AGAAAACAAA TTCTTTTTCA AATAATTGTT TTCCGTCTTA5051 AGGGGGGAGG TGTTATGATC CCGTACTCAA TATTCACATT TTCAAATTTT5101 TATAAACAAA AGACTGGCGA CGAACTCGAA TTGTTCTTCT TTAGAAGCGA5151 AGCTTCTAAA GACTCCTTCT TTTATTTTAT TTGCAAGTGT TTGCTAGTTC5201 CCATTGTTGC TCGTTCGAAA CAGTTGACTG ACGAAAGACC AAAGTAATAT5251 AGGCAAGCTA CACAGCTTGG CGTCGGTTCA TTTGATTTTT AGTCTAGCTG5301 TAATAAGTTC TTGTTTATTA ATATTAGTTT TGTTAGTGAT AGTTTTAAGT5351 GCTTGTTCAT TACTTACAAA AATAAAGAAC GAACCGAATA TAACA
//
E.1.3 Transposons
E.1.3.1 mariner
DEFINITION  Pediculus humanus humanus mariner consensus sequence
FEATURES             Location/Qualifiers
     source          1..1276
                     /organism="Pediculus humanus humanus"
                     /mol_type="genomic DNA"
                     /transposon="mariner transposon"
     misc_feature    <1..2
                     /note="target site duplication"
     repeat_region   3..33
                     /note="left terminal inverted repeat"
     ORF1            285..584
                     translation=GDEALIERQCQNWFAKFRSGDFSLQNEECSGRQLEVKDEQIKALIDYDRHSSTKDIVKKLDVSHTCVKNRLRRLGCQKKLDALLWGTLVNEATWSLRYAS
                     /product="mariner transposase"
     ORF2            683..1213
                     translation=ARKQAPTTSKTDIHQKKVLLSFWWDYKGIVNFELLPRCQTINSEVYIRQLTNLNDTIQEKRPELANSKGIVFHHHNARPSPSLATGQKLLELGWNVLLHPPYSPKLAPNNYHFFRFLKNFLNGQKFQNDNEVKTALEQFFAPKTKEFYEKRKMILPEKCQKVTNNNKHNIIDKNNLT
                     /product="mariner transposase"
     repeat_region   1244..1274
                     /note="right terminal inverted repeat"
     misc_feature    1274..1276
                     /note="target site duplication"
ORIGIN
   1 TATTGGGTTG GCAAATAAGT AACTGCGGAT TTTACCAACA GATAGTTTGT
  51 TATTTTTTTT GAGTACGTTT ACGTTTTTGT ACAGACATGA ACTTTTGATA
 101 TGTTATTACT TGGTTCCTTC TGTAACATTC GGTACCAAAA TTTCATTGAA
 151 CTCTTAAATA GTACGCGAGC AAAAGATAAT TAAACATGGG GAGCCAGAGC
 201 GAGCATTTCC TCCACATTTT ACTTTTTTAT TTTTGAAAGA GTGTTAATGC
 251 TTCCCAGGCC AATAAAAAGT TGTGGGTCGT GTAGGGGGAT GAAGCCTTAA
 301 TAGAACGGCA GTGTCAAAAC TGGTTTGCGA AATTCCGTTC TGGAGATTTT
 351 TCTTTGCAAA ATGAGGAGTG CTCCGGGCGC CAATTGGAGG TTAAAGATGA
 401 GCAAATAAAG GCCCTCATTG ATTATGATCG GCATAGTTCG ACTAAGGACA
 451 TTGTAAAGAA GCTAGATGTG TCACATACGT GCGTCAAAAA CCGTCTGCGG
 501 CGTCTTGGGT GCCAAAAGAA GCTTGATGCG TTACTTTGGG GAACGTTAGT
 551 TAACGAGGCG ACTTGGTCTT TGCGATATGC TTCTTAAACG CAATGCAAAT
 601 GACCCTTTTT TGAAAGAATG GTCACCGGAG ATGAAAAGTG GGTTGTCTAT
 651 GATGACTTTT TGAGAAAAAG ATCCTGGTTT AGGCAAGGAA ACAGGCACCA
 701 ACAACTTCTA AGACTGACAT TCACCAAAAA AAGGTATTGT TATCATTTTG
 751 GTGGGATTAC AAAGGCATAG TCAACTTTGA GCTGCTGCCA CGATGTCAGA
 801 CCATAAATTC AGAGGTTTAC ATTCGACAAT TGACAAATTT AAATGATACC
 851 ATCCAAGAAA AACGACCGGA ACTAGCCAAT AGCAAAGGAA TTGTCTTTCA
 901 CCACCATAAT GCCAGGCCCT CCCCATCTTT AGCCACTGGA CAAAAACTAC
 951 TGGAGCTAGG CTGGAATGTT TTGCTGCACC CTCCATATAG TCCCAAACTA
1001 GCTCCAAATA ATTATCATTT TTTCCGATTC CTAAAAAATT TTTTAAACGG
1051 ACAAAAATTC CAAAACGACA ATGAGGTCAA AACTGCATTG GAGCAGTTTT
1101 TTGCTCCTAA AACTAAAGAG TTTTATGAAA AAAGGAAAAT GATACTACCC
1151 GAAAAATGTC AAAAGGTCAC TAATAATAAT AAACATAATA TAATAGATAA
1201 AAATAATTTG ACATAATTAA TAAATCGTTT TTTGTTTTCT TAAAAAATTC
1251 GTAAATATCT TTTTGCCAAC CCAATA
//
E.1.3.2 MITE1
LOCUS MITE1 623 bp
DEFINITION MITE1, 623 bases, A64 checksum.
ORIGIN
  1 CTCCGACGTC GGAACCCCGC GACTCGATGG GAGCCGCGAT CTCGCGGTTT
 51 CGGGTGGAGG TGGGGGAGGG CGCGAAAAAT TTTTACTTTT TTTTTTCGGA
101 ATTTTGACCG CGGTCGGAGA CTCTCCGACG TCGGAaCSCS GaCSMCKCGA
151 YGGGAGCCGC GATCTCGCGG TTTCGGRTGG AGGTGGGGGA GGGCGCGAAA
201 AATTTTTACT TTTTTTTTTT TGAATTTTGA CCGCGGTCCC GGACTCTCCG
251 ACGTCGGAaC SCSGaCSMCK CGAYGGGAGC CGCGATCTCG CGGTTTCGGG
301 TGGAGGTGGG GGAGGGCGCG AAAAATTTTT mTTTTTTTTT TYGGAATTTT
351 GACCGCGGTC GGAGACTCTC CGACGTCGGA ACCCCGCGAC TCGATGGGAG
401 CCGCGATCTC GCGGTTTCGG GTGGAGGTGG GGGAGGGCGC RAAAAAWWWT
451 TacTTTTTTT TTTCGGAATT TTGACCGCGG TCGGAGACTC TCCGACGTCG
501 GAaCSCSGaC SMCKCGAYGG GAGCCGCGMT CTCGCGGTTT CGGRTGTAGR
551 TGGGGGAGGG CGCGAAAAAT TTTTTtTTTT TTTTtCGGAA TTYGACCGCG
601 GTCCGAGACT CTCCGACGTC GGA
//
E.1.3.3 MITE2
LOCUS MITE2 169 bp
DEFINITION MITE2, 169 bases, 2038 checksum.
ORIGIN
  1 TAGGTCGACC TTGAATMCAA GGCCATTGGT TTTACATTTC RATTTTGTGA
 51 AATTTTTCAT GGTCATGATT TTTCAACCAA GGTCGACCTT GAATACAAGG
101 CCACCAGTTT TAAAATTTTA TTTTGTGAAA ATTTTTCATG GTCACGTTTT
151 TGCATTCAAG GTCGACCTT
//
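Each record in this appendix interleaves position counters with ten-base groups, so a reader who wants to reuse a sequence must first strip the counters and spacing. The following is a minimal Python sketch of that step, not part of TESeeker; the function name `parse_origin` and the constant `MITE2_ORIGIN` are hypothetical names introduced for illustration. It assumes the block contains only digits, whitespace, and IUPAC nucleotide codes, and it checks the recovered length against the value stated on the LOCUS line.

```python
import re

# IUPAC nucleotide codes, including the ambiguity codes (R, Y, S, W, K, M, N, ...)
# that appear in the consensus sequences of this appendix.
IUPAC_CODES = set("ACGTRYSWKMBDHVN")


def parse_origin(block: str) -> str:
    """Return the bare sequence from an ORIGIN block.

    Position counters and spacing are dropped; letter case is preserved.
    """
    sequence = re.sub(r"[\d\s]+", "", block)
    unexpected = set(sequence.upper()) - IUPAC_CODES
    if unexpected:
        raise ValueError(f"unexpected symbols: {sorted(unexpected)}")
    return sequence


# MITE2 ORIGIN block. Some counters are deliberately fused to the preceding
# group, as can happen when a record is copied out of a PDF.
MITE2_ORIGIN = (
    "1 TAGGTCGACC TTGAATMCAA GGCCATTGGT TTTACATTTC RATTTTGTGA51 "
    "AATTTTTCAT GGTCATGATT TTTCAACCAA GGTCGACCTT GAATACAAGG\n"
    "101 CCACCAGTTT TAAAATTTTA TTTTGTGAAA ATTTTTCATG GTCACGTTTT151 "
    "TGCATTCAAG GTCGACCTT"
)

mite2 = parse_origin(MITE2_ORIGIN)
print(len(mite2))  # 169, matching the length on the MITE2 LOCUS line
```

Because the counters are removed by pattern rather than by column position, the same function should work on both cleanly laid-out and flattened copies of a record.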
E.2 C. quinquefasciatus
Because more than 100 non-LTR TEs were identified in C. quinquefasciatus,
we list only one element per family.
E.2.1 Non-LTRs
E.2.1.1 CR1
LOCUS Cp_CR1_Ele11 3116 bp
DEFINITION Cp_CR1_Ele11, 3116 bases, CE7 checksum.
ORIGIN
1 CCACGAATCT GCTGCTTTAY TAYCAGAACG TTGGAGGCAT TAATACCACT51 ATCGCCAACT ACGCCCTYGC AATCTCTTCY GCCTCMTACG ACCTGTACGC
101 AWTMWCTGAR ACGTGGTTGA CYTCWGCTAC TCTRTCTGGT CAAATYTTYG151 GTCCCGAATA YGAAGTATTC CGTGGAGATC GGACMGYCTC GAACAGYWKT201 AAAGRRTCAG GCGGRGGAGT YCTGCTTGCC GTCCGCTCSA AMCTAAAGCC251 ACGCCAAYTR TTCCCWCCAR ATTGTACCGT TCCRGAGCAA GTMTGGGTYG301 CAGTTCCACT CGCTGCATCY ACGATGTTYG TGTGTGTTAT CTACATTCCT351 CCYAAATTTG ACAACGATAA GCCGCTGTTC GATCAGCACA GACATTCTTT401 GACGTGGATA GTCTCCAAAA TGAAAGTGAA CGATAGTGTT ATGGTCCTCG451 GTGACTTCAA CTTCCCAGCC ATTCGCTGGA CGCGSACMCC GACGAACAAA501 CTGMTTCCAA ACYTAGCCCT YACTCCGACC AAYGMGTTAA AGCACAAMCT551 CCTGGATGAS TATTCYACCG CAAACCTTAG CCAACTGAAY GACATGYGCA601 ACAACTCAAR CAACGTTCTY GACCTRTGCT TTGCCAGCTC WGRKACACCG651 ATCAACTWTA CYCTTCTMCC AGCWCCTYTR CCKTTGGTKA AAGACGTGCG701 SCACCACYTK CCRTTTCTYG TWTCSATWTC YTGCACGRYG CTCSMTTTTC751 GTGAWGTYGC TGGYAAYWCK TTYATGGACT AYCGWAAGGG AAACTAYGAT801 GRCATGAACA ACTTCCTGAC CAACATTAAY TGGMACCAAC TWYTGGCCAA851 CCTTGACGCC GACACAGCYG CTGMTACTTG GACAGGTGTT CTGACGGATG901 CCATCAACAC CTTCGTTCCA AGGAAACWGC GCCAGCCTCC AAGAYATCCA951 CCGTGGTCAA CACMTCGAYT GCAGATTYTG AAGTCCAGGA AACGMGCTGC1001 CCTCAAGAAA TWCGCCAAAC AYCCGACAGA TCGATGGAGA AACCATTATA1051 GGTCAAGAAA CCGGAAGTAC AGTATCCTGA ACAAACAACT TTTTCATCGC1101 CACCAACACC GAATCCAAAG CCGATTGAAA CGAGACCCCA AGAAGTTCTG1151 GAAYCAYGTA AACGAGCAGC GGAAAGAGAC AGGTCTRCCA ACTGCGATGA1201 TACTCGACGG TGARGAGGCY ACTTCCACCG AGAGTATAAG CGATCTKTTT1251 CGTCGCCAGT TCAGCAGCGT ATTCACCAAC GAAGCAGTAG RGGAAACGCA1301 TATTGCTAAG GCTGCTAGCA ACGTTYCACT GCGACCTCCC ATYGGACCTC1351 ACCCGGTGGT CACTTCCGAG TCCATCCGTC GTGCCTGCGC CTCTCTCAAA1401 GGTTCTACCA GCTGCGGYCC AGACGGCATC CCMGCGTTTG TGCTMAAAAA1451 GTGTTGYGAT GCACTCGCGG AACCAYTGGC TCAACTYTTC AAYACCTCGC1501 TTGCTACTGG AGTTTTCCCG TGTTGCTGGA AGAAGTCYTW TGTKTTCCCA1551 GTYCACAAGA AGGGCCCMAA ACGTGATGTC CGGAACTATC GYGGAATTGC1601 TGCCCTCTGC GCAGTYAGCA ARCTGTTCGA AGTWATCGTG CTGGATTTYA
1651 TYAAGTTCAA CTGCTGTGAC YATGTCGCCC WGGAACARCA CGGCTTCATG1701 GCGAAACGTT CCACYAACTC YAACTTGGTC TCYTACTCGT CCTTCATTCT1751 WCGAACCATG CAGCAACGGA AGCAGATCGA TGCCATCTAT ACGGACCTAT1801 CAGCGGCYTT CGACAAGCTG AACCACCGYA TYGCYGTTGC KAAACTGGAA1851 CGMCTAGGYT TCGGCGGGCC CATGCTYGAT TGGCTWCGCT CCTATCTCAC1901 TGGMCGTGAA ATGAGCGTYA AAATCGGTGA CGTGATTTCC GCTGCTTTYT1951 CTGTTTTTTC AGGCRTTCCR CAAGGAAGCC ATCTGGGCCC TCTGATCTTC2001 CTCCTCTACA TGAACGACGT GCATCATCTg YTTAGGCTGT CACAAACTGT2051 CGTATGCGGA TGAYATCAAR CTGTTCRYCG TTGTCGAGAA TGATACCGAC2101 TGCCAGWTTC TTCAGGAGCA GCTCRACCKG TTCGCCAACT GGTGCTCCGA2151 MAACAGGATG GTTCTGAACG CTTCCAAGTG CTCGGTYATC TCTTTCACAC2201 GCAAGCGCAA YACMATKTCY TTYSACTACA CACTTTCAAA CACCACCATA2251 CCYAGGACCT CYTGTGTGAA AGATTTAGGT GTGATGCTGG ATAGCAAAAT2301 GACGTTTRCT GACCACATYA CGTATACAGT CTCCAAGGCT TCCAAAACTC2351 TTGGCTTCAT CTTYAGRATA GCTAAAAACT TCCGGGATTT AGGCTGTCTC2401 AAAGCTCTTT ATTGTTCGTT GGTYCGCTCT ACTTTRGAGT AYTGTTGTAY2451 TGTTTGGGCT CCCTTCTACC AGAACGCGAT TCAACGCGTG GAGTCGGTSC2501 ARCGGAAGTT CGTTAAGTAC GCGCAACGTC ACATTATCTG GCCTGATCCC2551 GCCAATCCRC CGAGTTACSC AGAGCGCTGT AAAATGCTTA ATCTCGAACT2601 TCTTACAGTA AGRCGTGACG TTKCCAAGGC GACYTTCGTT GCAGATCTCC2651 TTCGWTCGTC CATCGATTGT CCTGCCGTTT TGCAAMTGGT MAACATAAAC2701 ACTCGCCCTC GCGTACTCCG CAATCACTCR TTCTTGACTG TCCRCAGGGC2751 TCTCACAAAC TAYGGGCAGA ACGAACCGGT TTCWAGTATG TGTCGTGTTT2801 TTAACTTGTG CTCAGATCTG TTTGACTTTG ACATCTCCCG TGACACAATC2851 AAAAAACGAT TCCTTAATCA CCTGAAATCC CATCCCTAAc CtgAcgATAC2901 ACACGTAGAT TTTAGAACTG TGATATTTWT GTTAATTTAT TGAGTTAGTT2951 TTAAGAGTGA ACCCGTCTTG TATCATTTGA GTTTTGTGTA CTTGTTGATG3001 CGATAAGATG AGGTGGTTTT GTGCCTTTTT GAGAAAGTGT CTTKAAYRAT3051 ACCAGACACA GCTCAAGGGG GCTTTTGTCC ACCTCCAATA AAGAAAAAAT3101 AACaAAAATA AATAAA
//
E.2.1.2 I
LOCUS Cp_I_Ele1 3837 bp
DEFINITION Cp_I_Ele1, 3837 bases, 16BF checksum.
ORIGIN
1 TTTTTTTTTT GTATTTATTT AGGACCTTTT AACTATGAGT CTTTCGGGTC51 CTGTTWATGA ATCTTTACAC TTTTCCATAC ATGTCAGTAG CCTTTAAAAA
101 TTTCAACACA TTTTTCGCCA TATCTCGATC ATCAGCCAAG GCTTCCCTGA151 CGTTGGATGG TACTCCAGCT CGGCGTCTCT GCTCTTCAAA CTCAGGGCAT201 TCCGCTAAGA TGTGTTTCAC CGTCAGCGCC AAATTGCACC TTGTGCAACG251 CGGCGCACTG TCCTTCTCCA GCAGATACTG ATGGGTTAGC AAGGTGTGTC301 CTATCCTGAG TCTTGTCAGA ATGACGTCTT CCTTCCTCGA TCCAACAAAT351 ACATCTCGAT AAGGTAGTAC CGAGTTCTTC ACCTCTCGTA GTTTGTTGCC401 CACCTTTCTG CTCCATTCCG CATTCCAGCG CCAAACAGTC CTCTTCTTGA451 TCACCGTCCT GAACTCCTTG AATTCTACGC TTCTGTCCCA GATGTTTCTG501 TCGTTCAATG ACTGCTTCGC TTCTTCATCT GCCTTCTCGT TGCCTGCTAT551 TCCTACATGA CTTTTCACCC ACATGTAGAT TATCTCCGTG CCATTACATT601 CTGCCTGATT GTGTAGGATG TTGATCTCRT CCTTCCATCT GCATTTGGTT651 TTTCGTTTAC CCAAGGCCGT GATGGCACTC AAGGAATCTG TACAGACGAG701 GTAGGTCCCC ACGCGATTCT GACCAATTAT CCACCTGAGC GCTTCAGTGA751 TCGCACCACA CTCCGCAGCA AAGATACTGC TCAGATCACT GATTCTTCTT801 CGAACAACCA AGTCCCCTCT AACCATCGCG TAACCAACCC TGCCATCTTT851 CTTTGATCCA TCGGTGAAAA TTGTTTCACA CAAGCGATAT TCCGTGTTCC901 TTCGGGACAC GAAGAGCTCC TTCAGTTGTG TTGAAGTTGC TCCGGCTCGT951 TCAGCTTCGA GTAATGTTTT GTCTATCCGG ATCCGTCTGC GTTCCCAGGG1001 GGGACACAAT GGTAGTGTGA ATATCTTCAG CTCGGGTAAC GGTAGTTCCA1051 GCTCTTCAAG TATCGCCTTT CCTCTTGTCT CAGCAGTTTC CACTCCACGT1101 AGCGGACCTC GATGCCTAGT CGAATTCCAC TCCTCTCCTG AACTGTTGCT1151 ATCGTAGCTG CTCTCGGTAC TGCTTCCTGC TGATTGTGAG TCGTCTTCTG1201 CTGGCTGGGT TMTGCTCTGT GCTGATATGG CTGCTTTTCT GGCGGCGTAG1251 ATCGCTGTCC GCTGTTCAAA GAACACTCTC AGGCTCGGGA TTCCTGTTTC1301 AGCTTGGAGA CTATCTACAG GGCTGGTACG GAACGCACCA CAGATAGCTC1351 TCAAGCCTGT GTTGTGTGTT GGCTCGAGGA TCTTCAGGAC GTTGTCACTG1401 ACGGCAGCTG TTATAGGTGC TGCGTACAGR ATTTTCTCCA ACACTGTGGC1451 TCTGTACAGC TTGATTAGAG TCTTCCTATC GCCACCCCAT GACCTACATG1501 CTACACAACG GATWAGCTGG ACTCGTTGTC GGCAAGCCGC TTTCACTTCC1551 TCGCAGTGTG TCTTGAACAG TAAATGTTGA TCCAGGATAA CGCCCAAACA1601 TCGGTGCTGC TTTTTGGTTG GGATCAGGGT CCCATTCAGC TCCAGTGCAG1651 CTCTGCTCGG TGGTTTCCTC GTTCCGTAGC TTCTAAAGAT GACMGTAGCG1701 CTTTTCTCCG CAGAAATTTT AAATCCCGTT GAGCTCTGCC AGCATTCCAC1751 CGCCTTCAGT GCAGCTTGCA AGTCGTTCTC AACTTCTTCA ACATYTCGTC1801 
CACTGGCCAG CAGTACAACA TCATCTGCGT ACAAGAGTGT TGTTATGCTG1851 GGCGGCATCC GTGCTACCAA KGTGTTGATA GCCACCAAGA ACAAGGTGAC1901 GCTCAGTACT GATCCTTGGC AGAGGCCTGT TTCCATGATC TTGCTYTGCG1951 ACAGCTGGCC GTTCACAAAA ACTCTAAATG AGCGATTTTC CAGCATTCGG2001 TCCAAGAATT TCAACATCAG ACCTTCTATG TTCCAGTCAC GCAGTTGGTT2051 CAGGACCAGT CTTCTCCAGG TGGTATCGTA CGCCTTGGTM ACATCCAGAA2101 AAATCCCCTG AACGTACTCC TTCTTGTTCC AGGCTGCTCG TACCACTTTC2151 TCGAGCTCAG CCAAGTGGTC AACGGTAGTT TTTCCTTTCC GGAAGGCAAA
2201 CTGGTGAGGA TGCAGTAATC TTCTTGTCTC GATGATGTGA ACTAGACGAT2251 TGTTGACCAT TCGCTCAAAC ACCTTTCCCA GGCAACTGTT YAAGAAGATT2301 GGCCTATAGT TACGAGGATT TGCTTTTTCA CCACAATTTT TGAAGATAGG2351 GATCACCAGC GATTCCGTCC ATTCGGGTGG GTATACTGAA CCGAGCCACA2401 GCCGGTTGTA CGCGTCAAGA AGCTGTCGTT TGCAATCCAA CGGTAACTTT2451 ACGATCATGG AATAGTGGAC AGTGTCAGGG CCTGGTGATG AGCCACGCAA2501 ACCTGCCGTT GCCTCTTCAA ATTCAGTGAA TAAAAAGGGC TTGTTGTACT2551 CAGCACTCGT GTCAGCTGGR ATTGACAGAG GAGTTTGCTC AACCTGATCC2601 TTGAATGCGC GAAACTCGGG GCTGTAAGCA TCGTTACTAG AGATGGAAGC2651 GAAAGARCTA GCCAGCGCCT CAGCAATATC TTCATCTTTA GACACAGTAC2701 CTCCTTCGGC TGTTATTGCA CTGATACGGT TCGTTTTGCR CTTGCCTTGG2751 ATCCGGCGGA AGTTCTCCCA CACTTCTTTA ACAGGTGTTT GCACTGTGAA2801 GTCGTTGACG AAGTTGATCC ACGAGGTCCG TTGTGCTTCC TGGATGACTT2851 TGCGTGCATG MGAGCGGGCT GCTCTGAATT CTGCTGCAAG AGCCTCTTGA2901 TTCTCCAAGT TGTTCCGTTT GGCTGATTGC ARTGCACGCA AGGCTTTCTT2951 CCTGCTCTTG ACAGCGGAGG CCACCTCTTG GTTCCACCAA GGCACCGCCC3001 TCTTGTTCAC CATCCCCGTC GTTTTCGGTA TAGTTTGTTC CGCTGCTTCA3051 AGTATTTTCT GGGTAATGCT GGAGATTTGT TGTGTAGCTG TAGGCATKAG3101 GGGGAACTGC ACTGTTGTCT GGAATGTCTC CCAGTCTGCT TCCTCGATCT3151 TCCATTTGGG TCGCGTTCGG ATGGCATGGG CCGTGTCAGG GAGAGTTAGC3201 AATARTGGTA GATGGTCACT ACCGTAGGMG TCTTGTAGTA CAGAAAAATC3251 TAGTTGGTCT ATTATTTCTG TTGAGCAGCA TGCCACGTCA ATACTCGTGA3301 GAGTTCCGGT CGCAACACTA ATGTGGGTTG GATCTGTTTT ATTCAGGACC3351 ACCAGGTTGC ATTCGTTCAA TACTTCCTCG AACATTAGTC CTCGGGCGCT3401 GAACGTTTCC GATCCCCAYA GTGGGTGGTG GGCGTTTATA TCTCCCACGA3451 GTAGGCGTGG CTGGGGGACT TGGTTTATCA GGTGTACAAT WTCCGAGGCT3501 TGTATGGCTT GCCCCGGTGG CAGATATRTA TTGATAATGG TCAGRTTAAA3551 TGGGGGACCT ACCTTAACAG CGATGGCTTC CAGGTTGGTG TCCAGGTCGA3601 TTTCTTCACT GTCCAGTTCA GGTTTCACKC CAACCAAAAC TCCTCCGGCT3651 GCACGGTCGC CACCARCTCT CGGTCGGTAG TATATGCTGT AGCCGTTGAT3701 GTTCGCCTGG TCCTTAGATG AGCACATGGT TTCTTGGAGG CATAGTACAG3751 ATGGGTTGAT TTTGCTGCAT AGTATTTTCA AATTTGGTTT ACTGGTTCTA3801 AGTCCTTGGG TGTTCCATGA AATAATGTTC GAGTAGT
//
E.2.1.3 Jockey
LOCUS Cp_Jockey_Ele4 4487 bp
DEFINITION Cp_Jockey_Ele4, 4487 bases, 6C9 checksum.
ORIGIN
1 TTTTTTTTTT TAATTTATAT TTATTCAAAT TTCTTTTCCA TGTACATTCA51 TTCAGTTAAA ATATTATTGA GTGTCCAATC ACAAACGATG ACTTTTCACC
101 TCAATTTTAA ATACTAGCAA CTTTCATTTA TTCATGAAAT ATTGTAGCTT151 TCGCTATTCA GTGATTTCAA ATGTAGGAGG TCCTACATGT ACAAAAGGGA201 AAAGGGATAC CTTAAAACTA ACTTATAAAC TATATAAAGA GCGGATCAAT251 GCAGCTGAAG ACTGCAATGA TTTTTGTCGA AATGCATCAA TTATCTTATT301 GGACATAACA TCCAAAGTGT CAACTTCGGC TAATTGATGA AGTTCACTGG351 TGCTGAACCA GGGAGGAAGT TTCAGAATCA TTTTCAGAAT TTTGTTCTGA401 ATCCTCTGAA GTTTTTTCTT CCTGGTTAAG CAACAGCTTG TCCAGATCGG451 CACAGCATAA AGCATGGCAG GTCTGAAAAT TTGTTTATAA ATTAACAGTT501 TATTCTTGAG ACAAAGTCTA GAATTCCTGT TTATAAGTGG ATACAAACAT551 TTAATATATT TGTTACATTT AACCTGKATA CTTTCAATGT GATCCTTGTA601 AGTAAGGTTT TTGTMWtAAA CCAAGTCCAA GATATTTCAC TTGatCCtCC651 CACTTTAAaT TTACMTCATT CATCTTTATA ATGTGATGAC TTTTTGGTTT701 AAGAAAATCA GCCCTTGGTT TGTGAGGGAA AATAATAAGT TGAGTTTTTG751 CAGCATTTGG AGTAATTTTC CATTCTTTCA AATAAGAATT GAAAATATCC801 AAGCTTTTTT GTAATCTTCT TGTGATGACA CGAAGGCTTC TGCCTTTGGC851 GGAGATGCTT GTGTCATCAG CAAAAAGTGA TTTCTGACAT CCTGGGGGCA901 AATCAGGCAA GTCAGAAGTA AAAATATTGT ATAAAATTGG ACCCAAAATG951 CTTCCTTGAG GGAYGCCAGC ACGTACAGGT AGTTGATCAG ATTTGCTATT1001 CTGATAACAT ACCTGCAGAG TACGATCCGT CAAATAATTT TGAATAATTT1051 TCACGATATA AATCGGAAAA TTAAACCTTT TCAATTTCGC AATCAAACCT1101 TTATGCCAAA CACTGTCAAA TGCTTTTTCT ATGTCTAGAA GAGCAGCGCC1151 AGTAGAATAG CCCTCAGATT TGTTGCTTCG AATTAAATTT GAAACTCTCA1201 ACAACTGATG AGTAGTTGAA TGCCCAAGGC GAAATCCAAA CTGCTCATCA1251 GCGAAAATTG AATTTTCATT AATGTGCGTC ATCATTCTAT TAAGAATTAT1301 TCTTTCGAAT AATTTACTAA TAGATGAAAG CAAACTAATG GGCCGATAGC1351 TTGAGGCTTC AGCAGGATTT TTATCCGGTT TCAAAATCGG AATTACTTTG1401 GCATTTTTCC AACTACTGGG AAAATATGCC AAATCAAAAC ATTTGTTGAA1451 AATTTTGACC AAGCWACTTA AAGTTGCTTC TGGTAATTTT TTAATTAAAA1501 TGTAAAAAAT GCCATCCTCA CCAGGGGCTT TCATATTTTT AAATTTTTTG1551 ATAATAGATT TTATTTCATT CAGATCCGTA TTAAAAACTT CATCTGATGA1601 AAATTCTTGT TCAACAATAT TCTGAAATTC TATTGAAATT TGATTTTCAA1651 TAGGACTCAA AACATTCAAG TTGAAATTAT GAGCACTCTC AAACTGCTGA1701 GCAAGTTTTT GAGCTTTTTC CCCATTAGTT AATAGAATAT TATCACCATC1751 TTTTAAAGAA GGGATGGGTT TTTGAGGTTT CTTAAGAACC TTTGAAAGTT1801 
TCCAAAAAGG TTTGGAATAA GGTTTAATTT GTTCGACATC TCTTGCGAAC1851 TTTTCATTTC GCAGGAGAGT GAATCTGTGG TCAATAACCT TTTGCAAATC1901 TTTTTGAATT CGCTTCAGTG CAGGATCACG AGAACGTTGA TACTGTCTTC1951 GGCGAACATT TTTCAGACGA ATCAGAAGCT GAAGATCGTC ATCAATAATG2001 GGAGAATCAA ATTTGACTTG GACTTTAGGA ATAGCAATAT TCCTAGCATC2051 CAAAATTGCA TTAGTTAAAG ATTCCAAGGC TGAATCAATA TCAGCTTTGG2101 TTTCTAAAAC AAAATCATGA TTTAAATTAT TCTCAATATG ATGCTGATAC2151 CTGTCCCAAT TAGCTTTGTG GTAATTAAAC ACAGAACTAT TGGGTCTGGT
2201 AACTGCTTCA TGAGAAAGTG AAAAAGTTAC TGGAAGATGA TCAGAATCAA2251 AATCAGCATG AGTCACTAAA GGACCACAAT ACTGACTTTG ATTTGTCAAM2301 ACCAAATCAA TTGTTGATGG ATTTCTAACA GAAGAAAAGC AAGTTGGCCC2351 ATTCGGGTAT AAAACCGAAT AAAGACCAGA AGTGCAATCT CTGAACAGAA2401 TTTTACCATT GGAATTTACT TTTGAATTAT TCCAAGATTG GTGTTTGGCA2451 TTAAAATCAC CGATGATCAA AAATCGAGAT CTATGCCGAG TAAGTTTATT2501 CAAATCCCCT TTGAAATAAT TTTTATTTTC CCCAGTGCAT TGGAAWGGCA2551 AATATGCAGC TGCAATCATA ATTTTCCCAA AAGAAGTTTC AAGTTCAATG2601 CCCAAACTTT CAATAACTTT TAACTTAAAG TCACGTAACG TGCTATAAGT2651 CATACTACGG TGGATAACTA TTGCAACTCC ACCGCCATTT CGATTCATTC2701 GGTTATTAGT TATAACTTTA TAATCTGGAT CACTTTTCAA ATAAGTGCCA2751 GTTTTTAAAA ATGTTTCGGT TATAACAGCA ACATGCACGT TATGAACTCG2801 TAAAAAGTTG AAAAATTCAT TTTCTTTCGC TTTTAAAGAG CGAGCATTAA2851 AATTCATAAT ATTGATGGAA TTACTTAGAT CCATGATTAA ACTTCAGGGT2901 AAGAACAACA TCATTCGCAA ATTTTAATCC AATCTGGATT GCTTCCATCA2951 TGGATGTAGC ATTACTCATT GTTTGAATCA AACCAAACAG TGAGTTTTGC3001 AAAAAAGTCA TTTTTTCAAA CGTAACATCG CCGAGATCAG AAGATCCCAA3051 AGCGTTGCCA GCAGAAAAAT TTTCAAATGA GATTTGAGGT ACCTGCCCAA3101 TTTCAGAAAG ATTGGTAGAG GATTTAAAAT TCGTGGATGA ACCCGAAACG3151 ACGTTAGCAT AAGAAATGCC ATTGTTGTTA CCTAACTTTT CCACGGTAGG3201 GGTATTTCTA GAATTGTTCG AGTGAGACAG CACGAACGTT TGATTTAAAG3251 ATGCAGGTAC AACCTGACTT TGAGAAAATT TCGGTTTGGA TTTCGGCTGA3301 TGCTTAGCAC GAGAATCCAA AACCTTTTTT CTGATGGGGC AATCCCAGAA3351 ATTTGATTTG TGATTTCCAC CACAATTTGC ACATTTAAAT TGGGTGACTT3401 CTTTCACGGG ACAATTGTCC TTGTCGTGAG AAGAATCCCC GCAAACCATG3451 CATTTTGGAA CCATGGCGCA ATGATCAGTA CCGTGACCGA ATGCCTGGCA3501 ACGCCGGCAC TGGGTCAGAT TCTGGCCATT ACCGCCATGT TTCTTAAAAT3551 GCTCCCACTT TACCCGTACA TGGAACAAAA ACTGAACTTT GTCCAAAAGT3601 TTCAAATTGT TGATTTCATT TCTGTTGAAA TGAATCAGAT AAAATTGTGA3651 AGTCAAACCA AAGCGAGAAA TATTCCCGTT TGATTTTTTC TTCATTGGTA3701 TTACTTGGGA TGGGGCAAAG CCAAGCAACA CCTTAAGTTC GTTTTTGATC3751 TCATCCACCG ACAAGTCGTT GGAGAGACCT TTCAAGACCG CCTTGAATGG3801 CCGAGCATTC TTGGTCTCAT ACGTGTAGAA ATTGTGTTTG TGGTTTTTCA3851 AATAACCAAC AAAAGTTTGG TGATCTTGTA AAGATTCCGT 
CAACAAGCGA3901 CATTCTCCTC TTCGACCAAG CTGGAACGAA ACYTTCAAAT TGCAAGTTTC3951 CTTGCAATTC TTCAGTTGCG TTCGAAAGCT GGCCAAATCG GAGACGGAAG4001 TCACWACAAT TGGCGGAGCC TTTACTCGTT TCTCGACGGC AGAAGGCTCA4051 GTACGAGGAG AAGGTTCCTT GTCATCAGTT TCGGATAAAA CACCGAAACT4101 GTTTGTCAAT GGAATTGGAG GATTGACCTC ACATTCAGAA TCAGAATCAG4151 ACCTCAGAAG AGGCTGTTTT CTTTTTCTGT TTCCGTTAGC CGGTTTAGCG4201 TTCAAACGCT TCACCGACGT AACGACCAAA TCTTCAGAAG ATTTGCGTTT4251 ACCTTTGTTT TGACGCATTT TGCAAGCAAA GTTCTCTTAA AAAGATGGCT4301 TCGTTTGTAA AATACAACAA AATTTCAGGT GGGGTTAGTC TTGAAAAGAC4351 TGTTTAGAAT TTTGGAAAAT AACTCAGGTA GTCTTTAAAA AGACTGTGAT4401 TTTGTTTTGA AATAACTCTG AGCTTAGGTA GTAAAAAATA CCGCAGCTCT4451 AGTGTCCGTT CACCACGAAG GTTCGCAAGA CACTGAT
//
201
E.2.1.4 L1
LOCUS Cp_L1_Ele39 3228 bp
DEFINITION Cp_L1_Ele39, 3228 bases, D32 checksum.
ORIGIN
1 TACAACGTAG CCACCATCAA CACTAACGCA ATATCCAACG AAAACAAGTT51 AAACGCACTA CGGACACTTG TCCGACTACT CGRCCTTGAC GTAGTGTTGT
101 TGCAAGAGGT CGAGAGCAAC CAATTTTCGA TCCCTGGCTT CAACACCTAC151 ACAAATGTAA ACGAAACTAA AAGAGGAACA GCAATTGCCG TGAAACAACA201 CATGCTGGTG AGCAATGTTC AGCGTAGCCT GGACAGTAGA ATACTCACAC251 TCAAGGTCAA CAACTGCGTT ACGATCTGCA ATGTCTATGC GCCTTCTGGA301 GTCCAAAGCT ATCAGTCACG AGAGAGTATG TTCAACCAAT CTTTGCCTTT351 CTATCTCCAA AACGCGGGGG AGTATGTACT TGTTGGGGGT GACTTCAACT401 GTGTCGTGTC AGCCAGGGAT GCCACGGGTA CAAACAGTCA AAGCATCGCG451 CTGAGAGTCC TTGTACAGAA CATGAACCTG AAGGACACTT GGCAGATTAT501 GAATGGAACT CGGACGGAGT TCAGCTTCAT TCGAGCAAAC TCAATGTCTC551 GTTTGGACAG AATTTACGTG TCGTCAAACA TCTGCTCACA GGTGCGCACA601 ACGTCCTTCC ATGTGAACTC CTTCTCGGAC CACAAAGTCT ACAAAACAAG651 AGTCTGTTTA CCAGATCTTG GAAGGGCAGC TGGCAATGGC TACTGGTCTA701 TGCGAACGCA CACACTCACT GACGAGAACA TAATCGAGTT TGAGCATAAG751 TGGAACTGGT GGACAAGACA AAGACGGGAC TATAACAGCT GGATGAGCTG801 GTGGCTTGAG TACGCAAAAC CGCGCATCAA GACTTTCTTC AAGAGAAAAA851 CCAACGAAGC ATTCCGTGCA TTCAACGCGG AAAATGAGTA CCTGTACGCT901 CAACTGAGGG AGGCATATGA CTGTTTGTAC CTGAACCCGA ATGCTCTTGC951 CGATGTAAAC CACATTAAAG GGAGGATGCT GCGACTGCAA CGTGACTTCT1001 CTTCGAGCTA CCAGCGTCTT AATGATCCTG TCGTTGCCGG GGAGCACATC1051 TCGTCCTTTC AACTTGGAGC CAGGATGAAA AGGAAGAAAA ATTCGTTCAT1101 CTCTAAAATC ACCGACGGAG TCCAGACGCA GCCTTTGGAT GCAGCAGAAA1151 TAGAAGCACA CATTCACCAG TATTTCCAGT CCCTGTACTC TGCTGGAGAC1201 GTAGCTGATC CTGACGGTGC AACAACCAAC CGGGCTATTC CATCTGACTC1251 GGTGCCGAAC GCGCAGGTAA TGGAGGAAAT AACAACTGAG CAGCTGTATA1301 ACATCATCAA AACGAGTGCA TCGCGCAAAG CTCCGGGGAA CGATGGAATA1351 CCCAAAGAGT TTTACGTGCG GACCTTCCAC GTAATACACA GACAACTAAA1401 TCTGGTGATC AATGAAGCGC TGAACGGGAA CATCCCCCAG AAGCTGGTTG1451 AAGGTGTTGT AGTATTGTGC CACAAAAAAG GTGGCAACAA TACTATCAAA1501 TCCTACCGAC CTCTCACGAT GCTCAATTTT GACTACAAAA TCCTAAGCCG1551 AATACTCAAA ACCCGAATTG AGGAGATCAT GGTCCGGCAC GACATCCTCA1601 CACCCTCTCA AAAGTGCTCA AACGGCAAAA GGAATATATT TGAAGCTCTT1651 CTTGCCGTCA AAGATCGAAT TGCCCAGATC AAGCACACAC ACATACAAGG1701 AACGCTCGTA TCATTTGATC TTGATCACGC ATTCGATCGA GTTGAACACT1751 CTTATCTGTT TCGGGTTATG GACGATATGG GCTTCAACAG GGCACTTATA1801 
CAGCTGCTGC GCACTATCAT GGACCACTCA CGCTCTCGTG TGCTAGTAAA1851 CGGGCATTTG TCTCCAGAGT TCGAGATACG GCGCTCGGTT CGGCAAGGGG1901 ATCCGATGAG CATGCATCTC TTCGTTCTCC ATCTGCACCC GCTGCTGGAG1951 AAAATACGCA CACTCTGCAA CGACCAGCTA GACCTCTCCA CCGCATATGC2001 CGACGATATA TCTGTTATCG TGGTTGATAA CACGAAGTTA CCAACACTCA2051 AACAACTCTT CTTCGACTTT GGACGGTATT CTGGGGCCGT CCTCAACCTC2101 GAGAAAACAG TTGCAATGAA CATAGGAAGA AGCAGCGAAA ACCTACCCTG2151 GCCGTCGATG GAAACGCGTG TGAAGATCTT GGGAATCAAT TTCTTCAATG
2201 ACCACAAGCA GATGATACAG TTCAACTGGG ACGAAGTGAT CCGAAAAACT2251 ACGCAGCTAA TGTGGATGTA TAAAGCGCGA AACCTTACGT TGATCCAAAA2301 GGTTACCGTG CTGAACATGT TTGTGACCTC AAAACTGTGG TTCGTGGCAT2351 CTGTGTTGAG CATACGCAAT CAAGACATAG CAAGAATCAC AAGACAACTT2401 GGGTTCTTCC TATGGGGTCG CCAGCTGAGA GTTCCAATGG AGCAAATTTG2451 TCAACCTATT GCAAAGGGAG GGCTGAATCT GCATCTTCCC ATGCACAAAT2501 GCAGAGCACT ACTGGTCAAC CGATTCCTGT GTACGATTGC CGAAACTCCC2551 TTCGCCGAGC ACCTGTACGG CCTGGTTAAC AATGGAGGAT CCCTACCAGC2601 AACATACCCT TGTCTACGGC CGACGTGGAC CACTATTCGA GAACTTCCCC2651 AGCAGCTACG AGACAACCCG TGTTCGAGCA GCATCGAAAG TCATCTTCTG2701 CAAGCTTTGC CAACCCCGAA GGTAGTGGTG AACAACCCCA GAGCATCGTG2751 GAGAAGCGTG TGGCGAAACG TACGAGCAAG GAGTCTCACG TCGTTGGAAA2801 AATCCACGTT CTATCTGCTT GTGAATGGCA AGCTGCCTCA CGCGGCCCTG2851 TTGTTCCGGC AACATCGGAT CAGTAGTGCT TTTTGCATTC ATTGTCCGAA2901 CGAAACAGAA GATCTAGAAC ATAAGCTAAG CAAGTGCCGT AAAATAAGCC2951 ATTTGTGGAA CCACCTTAAA CCAAAATTAG AATCCATTTT GGACCGAAGA3001 GTAGAGTTCA AAAACTTTCA AATCCCTGAA TTCAGGGCAA TAAGAATGGC3051 AAATGTAGAG AGATGTCTAA AATTGTTTAT CAACTACGTA AACTTTATTT3101 TAGATACAAA AAATGATTTT ACGACTCAAG CACTTGATTT TTTACTAAAT3151 TGTAACTGCC CATAATATGT ATCTCTGCAA ACTGTAACAA AACTAAATAA3201 ACGTGTTAAA AAAAAAAAAA AAAAAAAA
//
E.2.1.5 L2
LOCUS Cp_L2_Ele4 2824 bp
DEFINITION Cp_L2_Ele4, 2824 bases, 358 checksum.
ORIGIN
1 TCTAAAAACT AACTATTATT AAAATGCTCT AAACATGCTT TCTTAAATCC51 TATACTCGAA TTTAACAATT GTAGATTAGG CGGCAATTGA TTCCATTCTG
101 CAATACCTCT AACAAAAAAA GAATTGGCAT AGTGAGATGA TCTAAATCTT151 GGCAAGATGA ATTTTCTTCC TCTTCGGCTT CTCATAGGAG TTATTTTTGA201 GGTGATGTAT GGAGGGGACT GATTGACAAT AAATTTATGT AAGAGTAGAA251 CTGATCTAAG CACACCAAAA CGAGAAAACG GACAACCAAT TAGGATATTC301 TGTCTGTGAG TCACGCTTGC AAAACGATTT AAACCAAACA CAAAACGTAC351 ACAACAGTTT AAAGCAACTT TTAATTTATT ATAAGCAAAA GTGGACGATT401 GTTTCAACAC AAAGTCACAG GTTATAAAGT GCGGTAAAAT TAAAGCTTTA451 AATAATTTTA GTTTAATCTT AAAGTCGAGA TGATTTGCTT TGAGACGAAG501 GGATCTTAAA GCAGCATAAA CTTTTCCACA TTGCATCAAA ACAAATTCGT551 CCCACTCGAA TTTACTGGTA ATTTTCACTC CTAAGCTGAT AACTGTATCG601 TTATAATGAA CTAATTGATT ATTTAAAAAT ATGTCGGGTT TAAACACAGG651 TCTTTTTGAT CTTGAAATAC AAATAGCTTG CGTTTTTGAT GCATTAAGTT701 TTAGATTGTT TGAAATAGCC CAATTGGCAA CAATGGTCAT ATTTGTATTT751 ATCATGTTAG ATATTTCAAG TTGTGGTTTA TTGGTACAGT CAAAATATAT801 CTGTACGTCA TCTGCGAACA AGTGAACACC ACATCCTTGT AGGTGTGACG851 GTAGATCATT AATGTATAAT GAGAAAAGTA ATGGACCAAG AACGGATCCT901 TGTGGCACTC CTGAAAGAAC CGGAAGCAAC TGAGAAAATA TTCCGTCGTT951 AAAAACAGTT TGAAATCTAT TTGACAAATA AGATCGTACT AAACGTACTG1001 CATCAGGACA AAAACCAAAA ATCATTTTTA ATTTATCACA CAAAGTTTTA1051 TGACAAATAG TATCRAATGC TTTAGAATAG TCTAAAAGAA TAAGCACAAC1101 GTATCCACCT GAGTCTATCA CTAAACCAAT ATCATCACAT ATCTTTAGCA1151 TTGCTGTTTT AGTACTATGT CTTGGGCGAT ATCCCGACTG GCAAGGTGTT1201 AATAAATCAA AACGAGAAAC AAAATCCGTA ATTTGTTTTT TAATAACTCT1251 TTCAAAAGCT TTAGAAAGCG CAGATAAAAT GCTAATTGGA CGTAAATTGT1301 CCAAACTAAT TTTAGGACCT TTCTTTTTAA TAGGTACAAC TTTCGATATT1351 TTCCAAACTT TGGGGTAAAT ACAAGTTTTA ATTATTTTAT TAAAAATATG1401 TCGAATGGGT TCAACTATGA GGGGAATTAT AGCTTTGCAA AATTTAATTG1451 GAATTTCGTC CAGACCAACT GCATTAGATT TTATTTCATA TATTGCATTA1501 ACAACATGAT ATGATTCAAT AGGTTTAAAA GCAAAAGCCT CAGGTACAGG1551 TGCGATAGCA TTCAATGTTG TGGATGAACT GATCCCAGTT GAAAAATTAC1601 TTGCAAAGAA ATGATTGATT TGATCTGCAT TTAGGTTGTT TTTTTCGTTT1651 TTATTTTTTT TCTTAAATCC AATAGCATTG AGTTTATTCC ATAATTCTTT1701 CGAGTTCAAA TTTTGAGCAA ACATTGTGCT GAAATAATTT TTCTTGGCAT1751 TGTTGATCAT ATTTGTTGCC TTATTTCTAA GCCTTTTATA AAGCGTGAGG1801 
TCATTATCAT GTTTAAATTG TTTCCATTTA GAAAAAGCTA AATCTCTATC1851 AACAATTACT TTAGACAAGT TTGCATTAAA CCAAGGTTTA TTATGGGGTT1901 TTTGGATAAA ATTCCGTAAT GGAACGCAGT TATCATGCAG ATTTTTAATG1951 TTTGTATTGA AAAAATCTAC TAAGATATCT GGATTGTCAA GGCTAAAAAA2001 ATAATTCCAA TCGATCTGAT CAAACATTGT TAAAAGCAGA GACGAATTCA2051 TGTGACCATA ATCACGAAAA CAAATATTTT CATTAACATT CGATACATCA2101 ACATCAACAG ACGCAAAAAT TAAATCATGA TTGGACATAA ACGGTACAGA2151 CACTTGGTTA AAACGCAAAA TAAATTCTTT TTTGTTAGTT AATAGAAGAT
2201 CTATCAAAGA TGCTCCTGTT CTGTGAAAAT GTGTAGGAAT TTCTCCAACA2251 GGTTTCAGAG ACAAACTTTG TAAACACTTT AAAAAACGTC TTGTGTTAGA2301 AGAGTTCAGG TCAAGAAGAT TGGTATTGAA ATCGCCCATG ATAATCATGT2351 GTTCATGCTG TAAATTACTA CGGGAAAGCA AGTCGATCAT CAACTGAGAG2401 CAATCGGAAT TTGGAGTATT GTAGATAAAT CCCATGAGTA AACTTTCATG2451 ATTTACAGAA ACTTCGACGA AAATAAATTC AGTTGGTGCA TTGTCAATAT2501 CATGTGATGA TATGATGACT TTACAATTTA AAAAGTTTTT AACATACAAC2551 AAGATACCAC CTCCAATTCT TCCTTTACGA TCACATCTTA TAATTTTATA2601 ACCATCAATT TCAAGTACGT CGTTTGGTAT TTCCATTTTC AACCATGACT2651 CACAAACACA CATGACGTCG ACTTTGGAAA TGTATGCAAT ATTGCGAAGT2701 TCATCCAATT TAGTTAATTT TCGAGCACAT ATACTTTGGC TATTCATACA2751 GCATACAGAA AGTTTGTTGT GAATGAGAGC AGATTTCATC ACAATACCTG2801 GAATACTCAA ATTGTTAGAG TTAT
//
E.2.1.6 LOA
LOCUS Cp_LOA_Ele7 3751 bp
DEFINITION Cp_LOA_Ele7, 3751 bases, 9C checksum.
ORIGIN
1 TTTTTTTTTT TTTTTTTTTT GTTGAGCCTT ATGACCACTG CGACCAATTA51 GAGGAATATT GTGGTGTGAC TCTTTTTTTT TGTAGCAAAT CAGGTTAACC
101 CAACACCATG GAACGATGAG AGGCAGCTCC TGTTTACTGC ATGTTGTCCC151 AGTTAGGGTC GACTTGTTTT ATAAAGTCTA TCACCCTACT AGGACGGAAA201 TGGTTGATAT TGGAGGGATG TAAGATCCCT TTCCCAAAAA ATAATATTCT251 AGTTTGAATC AGGGCACCAC AATTACACAA AAGGTGCTCT GATGTTTCAT301 TCTCAATATT ACAAAATCTA CAATTTGCGT TTGGAATTTT CCTCATATTA351 AATAAGTGGT ATTTGCATGG GCAGTGACCA GTTAGTAAAC CAGTTATCAC401 CCTTAAATTT TTCTTATCTA GTTTTAATAG TTCTCGAGAT AGTTTCCCAC451 CTGGAAAGAT AAATCTTTTT GCTTGTCTGG AAGTATCTGT AGAATTCCAG501 ATTGAATTGA TTGTTAATTG TTCCCAACGT TTTAGCTCCA TTTTCAGAGA551 ACAATCTGGA ACACCAAAGA AAGGTTCTGG GCCAATGAAG TCAGTTTCAG601 ATCCTTTCCT TGCCAGACTG TCCGCGTTTT CGTTTCCTAA AATTCCACAA651 TGACCAGGAA CCCAGTAAAG ATTGACCGAG TTCTTTTGGG ACAATTGCTT701 TAGTAATTTT ATTCCTTCCC ATACTAGTCT TGACTGGCAG GTATACCCTT751 TCAGAGCATT TAGTGCTGCC AAGCTGTCGG AAAAGATACA AATATTAGCT801 AATCTATAAT TTCTTTTGAG GCAAATGTGC AAACATTCAA TTATTGCATA851 AACTTCTGCT TGAAAAACTG TTGGAAATTG TCCCATTGGA ACAGATGTCT901 TAATTCCTGG TCCAAACACT CCAGCTCCCG TTTTATTATT CATTCTAGAA951 CCATCCGTGT AGAACAGAAT AGAGCCTGGA CGAAGATTAG GCCCACCATT1001 ATCCCACTCG TAGCGACTTT TAGCGTCCAC TGTGAACGGA ATGTCATAGT1051 TAGCCTTTGT CGGCATCCAG TCTACTATTG AGTTTAACGC CGGATTCAAG1101 GTAAAGTTTT TAAGGACTTC TAGGTGCCCT GTCAAATCTT CATCCTGAAT1151 TCTTGTGGCT CGATTCATTA ATAAAGCACT CTTAAGGGCT TCTAATTCTA1201 TGAATTTATG AAGAGGAAGT AAATTTAACA AGGAGTTTAG TGCAGCAGAT1251 GAAGTGCTGT GCTTAGCTCC AGTTATTGAA ATGGTGGCTA ATCTTTGCAG1301 CTTAGCTAAT TTAATTTGGG CAGTTGATTC TTTAACTTTC GGCCACCACA1351 CAAGTGACGC ATACGTAATT CTTGTTCTAA CTATGCATGT ATATATCCAA1401 TAAATCATTT TAGGGCGGAG ACCCCACTTT TTTCCAAAWG TTTTTTTACT1451 CAGCCATAGA GAATTAATTC CTTTGTTTAA TACATTATTC AGGTGAGCAT1501 TCCAATTAAG TTTCCTATCT AAAGTAATAC CAAGATATTT CACTTCGTTA1551 GAATACTCAA TAGTTGTCCC ACGAATAGCA AGATTAGGTA GTACTAGTGA1601 CCTTCGTCTT GTGAACGGTA TTATAACTGT TTTAGAGGGA TTTATACCAA1651 GACCCTCATT TTTRCACCAT TTTGAGATAA AATTCAAGGC CGATTGCATT1701 CGGTTACAAA TTTCGCTACC TATTTTTCCC CGAACTATTA TACAGATGTC1751 ATCCGCATAG CTGATCAGTT CAAAGCCGAG ACTCGCCAAC TTTTTCAGAA1801 
GATCGTCGAC TACTAATGAC CATAGCAGAG GTGATAGAAC TCCACCTTGT1851 GGACATCCCC TCCGTGGTCT GATTACCAAT CGTTCTCCTC CAAGGTCGGC1901 AGATATCTCC CTTTCAATAA GCATTGTTTC AATCCATTTG ATAATCCATG1951 AACCAAAGCC CCTTGATTCC ATGGCGGAAC GCATAGAGCT GTGAGATGCA2001 TTATCAAAAG CTCCTTCAAT GTCTAGGAAA GCAGCCAGAG CTACTTCCTT2051 GGCTTTGAGG GACTTCTCAA TTTTAGAAAC TAAACAGTTT AAAGCAGTTA2101 TAGTAGATTT ACCACTTTGG TATGCAAATT GGCTATTACT WAGAGGAAAT2151 TTGATCAAGA ACGtTTTTTT TtACAAAATC ATCAAGAATT TTTTCCATAG
2201 TTTTCAATAA AATAGAAGAG AGACTAATTG GACGGAAGGA TTTGGGACTT2251 GTTTTGTCCC TTTTCCCAAC TTTTGGAATA AATATGACCC GTACCTCTCT2301 CCAAGCTTTT GGGATATAGC CAAGAGCGAT ACTGGCTCGA AAAATGTCTG2351 TTAATTTAGC ATTGATAATA TCTTTACCCT GCTGGAGAAG AGCAGGAAAG2401 ATGCCATCCA TCCCTGCGGA TTTAAATGGT TTGAAAGAGC TCACCGCCCA2451 GTCAACCTTG GAACGAGTAA ATATTTCACA GGCTTGATTA TGGGCAAGTA2501 CAGACCAATC TTTTGTCAGT CTTTCGCCCA CGGCCTGATC TATACAACTT2551 ATTACCTTGG TCGAGCCAGG AAAGTGAGAT TCCATCATTA AATCCAGAGT2601 TTCAGATGCA TTTTTTGTAA AACACCCATC TGCTTTTTTA AGATTTCCTA2651 GACCGTTTGA ATGATCTTTG GAAAGAGCTT TTTGTAATCT TGCGACTATT2701 GGAGTTTCCT CGATTTTTTC ACAAGAACGT CTCCAGTCGT TGCGTTTAGA2751 ACGTTTTATT TCTTTATTGT ATTCAGTCAG GGCTTTTTTA TATGAGTCCC2801 AATCTCCAGA TTTTTTTGCT CTGTTGAACA ACCTTCTAGA TTTTTTCCTT2851 AACGCAGCTA ATTTTTCATT CCACCAAGAT ACCTCACGAT TTGAAGATCT2901 TTGCTGGACT GGGCAACTAG CGGAGTAAGA ATTGGTWATA GCAATAATTA2951 ATTCAGACGA AGCCTTCTCC AACTGATCAA CAGTACGGAT TTCGGCTTCG3001 GAATTGAAAT CGTAGTCATT TAGAAGAAGT TTATAGGAGT CCCARTTGGT3051 TTTCTTGGGA TTYCTATAAC TTCCAACTAT CATTTCACCT GCATTGTATT3101 CGAATTCAAT TTGTTTATGG TCTGAGAGTG ATATTTCATC AGAAACCTTC3151 CAATTTATAA CTCTTTCAGA AATAACTTCG CTACAAAACG TGAGGTCAAG3201 AACCTCCTGT CTTGTTGCTG TAACAAAAGT GGGATTATCA CCATTGTTAC3251 AAATGTCAAT ATTGTTAGTT GTTATGTAAT TGAGCAAATT CTCACCTCTA3301 TTGTTAATGT TAGTGCTACC CCATACTGTG TGATGAGCAT TRGCGTCGCA3351 CCCAACAATA AATTGTTTAT TTATTTTTTT GCAGTAAGAG ATAAACATTG3401 TTACATCAGM TGGAGGGGCC TCCTCGGTGT CTCCGGGGAA GTACGCAGAT3451 GCAATACACA CCCAGGTACT TCCTCTGACC GTAGGGACCT CCATTTGAAT3501 TGCAACAATG TCTCTTGAAA TAAATTCGGT TATTGGAATA AAATTTGTAT3551 CATTTTTTAC WATCATAGCT GCTCTAGGGT TGGGAGAACT ATTATCRTAT3601 AATAGTTTAC CCATTTGAGC TGAAATTCCT TTAATTTTAC CCTTATTTAT3651 CCAGGGTTCT TGAATGAGCC CTACGTCAAT ATTATTTTTA CTAAACCTTC3701 TGCTAAGAAC GGCTGACGCT CCTTTGGCAT GATGAAGATT AATTTGAATA3751 A
//
E.2.1.7 Loner
LOCUS Cp_Loner_Ele1 5618 bp
DEFINITION Cp_Loner_Ele1, 5618 bases, 78D checksum.
ORIGIN
1 CaGTCGGTAT CTGATCGAGG TACAAGCAAG ACGTGTTTGT TTTCGCTAGA51 GCACCGAAGC TTCTAATCGT TCAACGGATC AAGGGATCCA CCGCCGACTT
101 CTCGAAGGTT GACCTTACGA GTGCGTACGT GACACGCGTG ACAGTTCGTG151 AACGTGAATA GTGCGTGTAA GCCAATTATC GCCGGTCAAG TTCACCATCA201 GCAGCCGAAC CACAAAGgCA aCcctccAaa cCGCGAAGCG TGTTTTGGAG251 TTTTCGAAAA CGTTTACACA AGGCATTGTC TAACCGCGAG AGCTAGACAA301 AGGAGCACCA CCGAATACAA TAGCAAACGG ACGGTTTGTT TCATCCTCAT351 CAACTCGGCG TGGCAAAATC TAGAGTTAAG TAGAAAATCA TCTCCCCCTC401 CCCCCATGTC TGGGGGCGAA GGCGGCATGG AAAGCGATGG CGAAGATGAC451 GAGGCTTCCA ATGTGCGCAC AAAGATCTAC CCTAATGGCT CAACAGGGCC501 GTTTATTGTC TTCTTTCGGC CCAAATTGAA ACCCCTAAAC CTGATCAGCA551 TCACCCGAGA TCTAACGAGA AAGTTTTCTG GCGTATCCGA AATAAAACGT601 GTCCATGCTA ACAAGATCCG CGTAGTTGTG AATAACATCT CTCACGCCAA651 TGAAATTGTC ACCTGCGAGC TTTTCACGCT CGAATATCGA GTTTACATCC701 CTTCACGCAT CGTGGAGTGT GATGGTGTTG TCACAGAAGA GGGCTTAACC751 CTAGATGAGC TGTATGAGTG CCGTGGTTAC TTCAGAAATC CTGCCGTGAA801 CCCCGTGAAG ATCATTGAGG TTAAACAACT GTTCTCCTCC TCCACACAGG851 ATGGCAAAAC GGTTTACTCC CCTTCGAACT CTTTCCGAGT GACCTTTGAA901 GGATCCGCTC TGCCAAAATA CATTGAGATA GACAAGGCTC GTCTACCTGT951 TCGACTTTTT GTCCCCAAGG TAATGAACTG CCAAAAATGC AAACAACTCG1001 GCCATACCAC AGCTTACTGT TGCAACAAGG CTAGATGCAT CAAATGCGGT1051 GAAGAGCACG ATGATAGTAG CTGTACGCAA GCTGCTACCA AGTGCCTCTA1101 CTGCGACGAG GACGCCCTTC ACAAACTCTC GGATTGTCCG ACGTATAAGC1151 AGCGTCAGGA GAAACTTAAG CTTTCTCTGA AGCAACGATC GAATCGTACT1201 TTTGCGGAAA TGCTCAAACA AGCCACCGAA CCACTCAATT CTGGAAACAT1251 CTACAATATA CTGCCTTCCG ACGAGACGGT CGCCGACTCG ATCAATGCGG1301 GCGCGTCAAC GTCCGGGACG GGTAACTCGA GGAAAAGGAA CAATGGATCA1351 CCAAGCATCC GCCGAAAAGA AATAAAGCTA TCCCCACAAC AAGACAGGAT1401 CCCTAATTTT CAGCCAACTC CCCCTGGAAT CAACCCCCCC GGTTTCCCCC1451 CATTGCCAAG GCCCCCACCT CTGACCCCCA AACCAAATCC TAACAAACCT1501 AAGCAAGGAT TAATCGGTTT CACAGTGTTG ATTAACCAAA TTCTCGATGC1551 GCTCCAGATT TCCACCGGTG TTCGAACCGT GGTAATCACT CTGATCCCTT1601 TCGTTCGGAC ATTTTTGATC AAATTATCTG AACAATGGCC CCTTATTTCA1651 ACAATCATAT CCTTCGATGG ATAATTCAAC GTCGAAGATG AATGAAGAAA1701 TTTCTGTTCT CCAGTGGAAC TGCAGGAGCA TTGTTCCAAA ATTAGATTCT1751 CTTAAAATAT TAGCTCACGA AACTAAATGT GAAGTATTTG CTCTCTGTGA1801 
GACATGGCTT CCACCCAACG ATGATGGTCT GAATTTTCCC AATTTTAATA1851 TCATTACCAA AAATAGAGAC GACTCCTACG GAGGGGTTTT GTTAGGCATA1901 AGACACGGTT TAACATTCCA AAGATTGAAT CTTCCTTCTC AGCCTGGAAT1951 TGAAGTAGTT GCGATTCAGG TTCAAATTAA GAATAAATGT TTTTCAATAG2001 CTTCTGTATA TATCCCGCCC AAARCAAGTG TTAATCGTCA ACAGTTAAAA2051 AACATCGTTG AAATGATGCC TGAGCCAAGA CTTATTCTCG GCGACTTCAA2101 TTCTCATGGG ACAGGATGGG GTGAATTGTA CGACGACAAT CGAGCAAATC2151 TTATATATGA CTTATGYGAT GAATTTAATC TAACTATTAA GAACAGTGGT
2201 GAAATAACTC GAATTGCTAG ACCTCCTGCA AGGGAAAGTA GATTGGAYTT2251 GTCAATTTGC TCAAGAACAC TCTCAATAGA TTGCACCTGG AACGTAATTC2301 AAGATCCCCA TGGTAGCGAT CACCTTCCTA TTTTGATTTC AATTGCGACA2351 GGAAATCAAC CTKTAGAACC AGTYAGCTAT ACATACGATC TTACGAAAAA2401 TATAGATTGG AAAAGATATG CTCTCATTAT CACCGAGGCG ATTGAATCAA2451 TAGATCCTCT TACCCCCCAA GAAGAATACA CCTTCCTTGC AAATCTCATC2501 CACAGTAGCG CGATCCAAGC TCAAACAAAA CCAATACCAT CWGCTTCTTC2551 CCGAATGCGA CCTCCATCTT TATGGTGGGA CAAGGAGTGC TCGGAAGTGT2601 ACTCTGAGAA ATCAAATTCT TTCAAAATTT ACAGACGAAC GGGTCAAATT2651 GAGTCTTACG AACAGTACCT CCTTTTGGAG ATTAAGTTCC AAAaTTTAGT2701 AAAATGTAAA AAACGAAAYT ATTgGCGAAC GTTTGTTGAT GGGCTTTCAC2751 GCGAAACCTC CATGCGTACT CTTTGGACTA CAGCAAGAAG AATGAGAAAC2801 CGAGCTCCCA AAAACGCTAG TGAAGAGTAT TCTGATCGGT GGTTGCATAA2851 TTTTGCCAGA AAAGTGTGCC CCGACTCCAC GATTCCCAAA CAGAAAAGGT2901 ATTCGAATGA TCTTGTATTC CCGGAACTAT CATCCGCGTT CTCGATGATA2951 GAATTCTCGG TCGCTCTCCT TTCATGCAAT AACACTGCCT CTGGAATGGA3001 TGGAATTAAA TTTAATCTCC TGAAAAATTT GCCTTCCGWT GCAAAATGTC3051 GACTATTAAA CTTATTCAAT ATTTTCCTTG AACAAAACAT CGTCCCAGAA3101 GTCTGGAGAC AAGTCAGAGT TATAGCTATT CAAAAACCGG GTAAGCCGGC3151 CACCGATCAC CATTCATATA GGCCCATTTG TATGCTATCG TGCGTGCGAA3201 AGTTATTGGA AAAAATGATA CTTTTCAGAT TGGATAAATG GATGGAATCA3251 AACGGATTAT TATCAGATAC TCAGTTTGGA TTTCGTAGGG GCAAGGGAAC3301 GCARGATTGT TTAGCGCTGC TTTCAACCGA AATTCAACTA GCTTTCGCTA3351 AAAAAGMACA AATGGCTTCA ATTTTCTTAG ATGTAAAGGG AGCATTTGAT3401 TCAGTGTGCA TCGAGGTGCT AGCAGATAAA CTCCACAAAA GTGGACTCCC3451 ACCTTTATTG AACAATTTTT TGTATAACTT ACTCTCGGAA AAACACATGA3501 ATTTCATTCA TGGTAACGTG ACAATCACAA GATCTAGCTT TATGGGCCTT3551 CCTCAAGGAT CATGTYTAAG CCCTCTCTTG TACAATTTCT ATGTAAATGC3601 AATTGACTCT TGCCTCGATA ACGGGTGCAC AATAAGACAA TTGGCAGATG3651 ATTGCGTTGT ATCAGTTACT GGTCAGTCGG CCAACCATCT TTCTGAACCT3701 CTGCAGAACA CTTTAAACAA TTTATCTCGC TGGGCTATGG AATTAGGAAT3751 CGAGTTCTCA ACTGAGAAAA CGGAAATGGT CGTCTTCTCC AGAAAGCACA3801 ACCCCCCCTC ACTGAAGCTG TACCTACTGG GAAAACTTAT AATACAGTCC3851 CTGGTTTTCA AATATCTCGG TATTTGGTTT GACTCGAAAG 
GTACTTGGGC3901 TTGTCAAATA AGATACCTGA AACAGAAATG CCAACAGAGA ATAAACTTCC3951 TCCGAACAAT CACGGGTACG TGGTGGGGCG CACATCCCAC GGACCTCATT4001 AGGCTATACC AAACGACGAT ACGTTCAGTA TTGGAATATG GATGTTTTTG4051 CTTTCAATCC GCCGCGAAAA TCCACATGAT CAAACTTGAA AGAATACAGT4101 ATCGTTGTCT GCGCATTGCC TTAGGATGCA TGCACTCAAC TCATACGCTG4151 AGCCTAGAGG TACTTGCAGG CGTTCTTCCG CTGAAAACCA GATTGTATCA4201 GCTCGCTCAC AGAACGTTGA TTCGTTGTGA GATTAGGAAT CCATTAGTGA4251 TCCAGAACTT CGATCTTCTT CTCGACAAAA ATCCTCAGAC TAGGTTTATG4301 ACTATCTATC ACAACCACAT AACCAAGGAA ATCTCACCTT CAAACTTTAC4351 TCCCAACCGC AGCACAATAA GCAGCACGCA TAACCCATCA GTTTTATTTG4401 ATTTATCTAT GCAACAAGAA ATCAAGATGA TACCAGCAAG TCAACGTTCG4451 CAATTAGTAC CGCATATTTT TTTGTCTAAA TATAACCATA TTAAGGCGGA4501 AAACATGTTC TACACAGACG GATCGCTAAT CGAAGGGTCC ACAGGCTTCG4551 GGGTATTTAA TACGAAAGTA AGTGCCTTCC ACAAACTCCA AAATCCTGCT4601 ACAGTATACG TAGCAGAACA AGCTGCAATT CATTATGCAC TAGGGATCAT4651 TAACCTGCAG CCACAAGATC ACTACTACAT ATTTTCTGAC AGCCTTAGTA
4701 CAATTGAGGC TCTCCGGTCG TTGAAATCAC CCAATTCCTC GTCGTTCTTT4751 TTTCATAAAA TTAAAGAAAT CATGAGTTTA CTGGTAGAGA AAAAATACAA4801 AATTACTCTT GTTTGGATCC CTTCTCATTG TTCTGTATTA GGAAATGAGA4851 AAGCGGACTC GTTGGCAAAG CAAGGTGCCT TGGAAGGATC CACTTACGAT4901 CGTATTATCA CTTATGACGA ATATTTTACA ATCCCTCGTC AAGAATCTCT4951 TGTAAGCTGG CAAACCAAAT GGGACAAAAG CGAAATGGGT CGATGGCTTT5001 ACTCTATCAG GCCAAAAGTT TCTACAACTT CGTGGTTCAA ACACATGAAT5051 GTTGAAAGGG ATTTCATACG CGTAATATCA AGATTAATGT CAAACCACTA5101 CCTACTCAAC GCTCACTTAT ATCGGATTAA CTTAAAAGAT GACAATCTCT5151 GCGGTTGTGG AGAGGGTTAT CACGATATCG AACATATTGT TTGGAACTGT5201 CCAGAGAACC TTCACGCTAG ATCTCAACTC TTAGACTCCC TTAGGGCCCA5251 AGGAAGACAA TCAGACTTCC CTGTTCGTGA CATTTTGGCA AGTCAAGATG5301 TGCCATATCT TCTCTGCTTG TACCGCTTTC TAAAGTCAAT TAAAGTGCAC5351 CTGTAACAGC ATCAATCTCG CAAGCATCGC CACCCTGCAA CCTAGCAATA5401 GTAACATCTG ATAAAAACTA GAACCTTAGC CCGCACAGAA GCAAAAGTCC5451 GTCCTTAAAC ATAATGTATT ATTAACCTCG AAACAGCCGC GAGTATTCGG5501 CTTTCCCCCT TTACTAACCC TAGCTTTAAG TAATTATGTA AAAATGATAT5551 CCGGCTCCGT AAAACTTTGG TAGATGAGCC TAAATAAATA AAGACAGTTA5601 TAAAAAAAAA AAAAtAAT
//
E.2.1.8 Outcast
LOCUS Cp_Outcast_Ele3 2336 bp
DEFINITION Cp_Outcast_Ele3, 2336 bases, 2063 checksum.
ORIGIN
1 AACTACATCC AGCTACAAAA AGCCAGAGCA CTGTTTAAAA AAGCTGTCAG51 GAAGGCGAAA CGGGAACACG TAGCTGAGCT GACGGGAAAG ATTGACGAGT
101 CGACACCTCC MAAACAGCTA TGGAACATCG TCAAGGGAAT AGAYACGGCG151 TTGGCTGRGG GCAGTAAAAA GAGGGCGATC CTGGAGCGCT CCAAAGGAGA201 GGAGTTTATG GAGCACTACT TCAGTGGAAG ATGTGGTACA GTGCAGTTGC251 CGAACTACGA GACAGCCCGG GACTTGGAAG GTTTCGAAAT GGCGCTYAAG301 GACGGCGAAG TGCTTAAMGC ACTGAAAAGA ACAAAAAATC ACTCGGCGCC351 GGGRGAAAAT CAAGTCTCGT ACGACATTRT TAARCAATTG CCGCTAGGTC401 TGCAGCTCAA GTTCGCAGAA ATGCTGAGCA GAGTATTCGC GACTGAAAAC451 ATTCCTGAAA GGTGGCGCAT CACTGAAGTA CGACCGATTC CAAAGAAAGG501 AGYGAACCCC AACCTACCRA ACTCGTGGAG ACCCATTGCG CTCATGAATA551 TCGAGATAAA GCTGATCAAC AGTGTAGTGA AAGACCGACT GGCGGCGATC601 GCGGAGCTAA ATGGTYTGAT CCCGGATYTG TCTTTTGGTT TCCGGAAGAA651 TGTRTCATCG GTAACCTGCG TGAACTATGT CGTGAATGCT GTACGAGAGG701 CGAAGGAGTA CAACAAYGAA GTCATCGTAG CATTTCTCGA CGTGAAGATG751 GCGTATGACA CCGTGAACAC GACTAAGCTG CTTCAGATCT TGGCARGGCT801 GGGTATCCCG GAAAAACTGA CATCGTGGCT CTACGAGTAT CTCAGATGTC851 GCGTGTTACG ACTACAAACG GAGGACGGAG TCGTAGAACA AGTRATCTCY901 GAGGGCCTAT CACAAGGTTG TCCGGCAGCA CCGACACTTT ACAACTTCTA951 CACGGCTGGG TTACACGATC TCTCAAACGA AACGTGCAAG TTRGTGCAGT1001 TTGCTGATGA TTTCGCCGTY ATCGCAACAG GTGCCTCCCT CGAGCTGGCG1051 GAACAACGGT TGAACGGTTT CCTCGATGTT TTGGCAGGSC GGCTRAAAGA1101 GCTGGACATG GAAGTAAGCC CATCCAAGTG CGCTGCGATC GCCTTCACCG1151 GAAAAAGGAT CGACCATCTC AGAGTCAAGA TGCAGGGGCA AGYGGTTCAG1201 ATCGTCAACA CCCACAAGTA TCTGGGRTAC ACCTTGGACC GGRCCTTGAA1251 ACACAGAAAA CACATCGAAA CCGTGACCGC CAAAGCCGGA GAGAAACTTG1301 GWCTGCTCAA GYTACTATCG AGGAAAACAA GTGGTGCGAA TCCGGCAACC1351 TTGGTCAAGG TGGGAAATGC GATTGTTCGG AGCCGGATGG AGTACGGAGC1401 CACGATCTAC GGGAATGCCG CCAAATCAAA TCTGGGAAAG CTGCAGGTGT1451 TACAAAATTC GTACATCAGA ATCGCCATGG GATATGTACG AAGCACACCC1501 ATCCACGTGA TGTTGGCCGA AGCTGRCCAA ATCCCGACAA GTCTTAGAAY1551 AGAGGCTCTT ACCAAGAGRG AACTGATCCG AAGTACGTAC TTCAGGACGC1601 CGTTGCTRCG CTTTATAAGC GACACGCTAT CGAGGGAGAT TCCAAACGGA1651 TCGTACCTGA CGGAAATGGC GGACAAGCAT GCGGATATCC TGTACCAACT1701 GCACCCYTCA GACAAGGATG TCGCACAGGA GGCAAGAATG AGCTACTTCA1751 GCAACTTTGA CCTGGAAGAC TAYGTACAGC ACACRCTAGG AAARGAAACA1801 
CKGAAAAAGG AGAACAYAAA CGAAGCAGTT TGGAGGCAGA ATTTTCATGA1851 AGTGGCCAAT GGAAAGTACA AAGACCACAA GCAGATATWC ACAGATGCTT1901 CGAAGACGCC CGGAGGGACA GCGCTGGCGG TCTACGACTC GAGCGAGGAG1951 GCGACCTACA CGGAGAGCAT TAACGACAAC TACTCAATCA TGAACGCAGA2001 GCTGCGGGCT ATTTGCATCG CAGTTGAGCA TGTGAAGCAA AAACAGTACG2051 AAAAGGCGGT CATCTACACG GATTCCAAGG CGGCTTGTCA GAGCYTGCTA2101 AACCWAAATG CACTGCGAGA GAACTTTATC GTTTGGAACA TTTACAAGGA2151 GATTCAAARC ATGCGGAGAG GCTCGCTGAG AATCCAATGG ATCCCCAGYC
2201 ACGTCGGAAT ACGAGGAAAT GAAATTGCGG ACCAAGCAGC GAAAGCGAAG2251 TCATACGAGA AGCAGACGGA GTTCATTGGA ATTACACTTG GAGATGCTAG2301 AGTACTTTGC CATGAAGAAA TCTGGTACAA TTGGaG
//
E.2.1.9 R1
LOCUS Cp_R1_Ele1 5425 bp
DEFINITION Cp_R1_Ele1, 5425 bases, 116E checksum.
ORIGIN
1 CGTGGGGACA GGTGAGGGTC CCTGCGGAGC TTAGCTGCCA GCTACCGGGC51 GGGTTGCAGT AGGCGGATAG CTGTCGGCGA TTGCATACAT TCATTGCATC
101 GCCCCCGGAC CAGCAGCGGG AGGATGTCTA GGACGTGGCG GAATTGAACA151 AGGGCTCTGT TTAATTTCTT CGAAWAAAAA AACCACATGG GTCCGTAAAC201 ACCTCTGTCA AGCGATCGAA CGCCGCTATA AGTGTTTTAG CCCAAAACCT251 CACCAAACCC CGAATCCAAG ATGTGAYGCG ACCCGTGTCG AGGGATGCAT301 GGCTGGGGGG GTTCAACAAA TTCCCAGTCG ATAACGGAGC CTGTGGGGCA351 CAGGGGCGAA CCCCACACGT AATTTGCCCT TACTGCGTCA CGGCAGGGCA401 CTGGCGCAGC GGACCGTATT TCCCTAGCGA CTCGTGGGAT TCAAAATGAA451 GACAAGTGAA ACAAACCAAC AAAAGGGTCC CGATGCGCCC CAAGCATCGG501 AAAACACGGA GGTAGAAGAG GACGCAAACG TCGAAATGGC AGGAGGCGAG551 TCAAACGGCG GCGACACAGC AGGCGGAGTG GCGAGCGCGT TCCGTGGAAG601 CGGGAAGGTG TTGAGATCCC CAGTGTTGAA CCAGGCGGCT GCTTCGAGTC651 AGCAGATAGG AGTAATTGGA GAGGAGACTC CCAAGTCATC CTTGTTGAAC701 TTCGCCGGCA GTACCCCTCA GGACGGAGTC CTGCTCGGAa GGACCGcGTT751 GCAGGAGGTC AGAAGGAGGG TCAACGAACT CTTTGATTTC ATCAAGGACA801 AAAACAACGT CCACACCAGA ATCAAGCAGA TGGTGAATGG AGTCAAGGCA851 GCCATGAATG CCGCAGAGCG CGAAAACAGC TCGCTGGTGG TGACGCGGAA901 TTCACTGAAG CTCAGAGCTG AAAGAGCCGA AGAAACGCTG AAGGCAAAAC951 TGGAGGAGGA AGCGCTACGG GAGAAAGAAC CGAAAACGCC GCCCGGCCCA1001 AGCTCTAAAA GGGACAGGGA AACGCCTGGA GAGGAGGAGG ACGCAAAGAA1051 GCAGAAGCAG GGGAATGGAG ACAGTCCGGA CCCAGCGAAG GAGCCAGAAC1101 CAGACCCAGG GAAGGAGAAG GAATGGGAGA AGGTCAAGAA AAAGAAGCGG1151 AAGAAAAAAG GGAAGCAGAA CGAGGACACC CAAAAACCCA AGTTTCGCAG1201 GGAGCGTAAC AAAGGCGAGG CTTTGGTGGT CGAGGTGAAG GAAGGTGTTT1251 CGTACGCAGA CCTCCTCCGG AAAGTACGAA CCGATCCGGA ACTCAAGGAG1301 CTTGGCGAGA ACGTGGTTAA AACCAGGCGC ACTCAAACCG GAGCGATGCT1351 TTTTGAGCTG AAGAATGATC CCGCGGTCAA GAGCTCAGCT TTTAAGTCCC1401 TCGTCGAGAA AGCCGTAGGC TACGAGTCGA AGGTAAGGGC GCTATCGCCG1451 GAGACAACGA TTGAGTGCAG GAACCTGGAC GAGATCACGA CGGAGGAAGA1501 GCTAGAAGAT GCGCTGATCG TTCTTCTGGA TGACCGTACG ACACCGATGG1551 CAATCCGGTT GAGGAAAGCC TACGGCGGCA CGCAAATTGC GTCGATCCGA1601 CTATCGACGC CTTCGGCGTC TAAGCTGCTG GAAGCCGGCA AGGTCAAAGT1651 AGGGTGGTCG GTGTGCCCAC TGAGGCCTGT TCCTCGAGTG ACCCAGCAGA1701 TGACGAGGTG TTTCCGCTGT ATGGGTTTCG GCCACCAGGC GAGAAATTGC1751 GACGGTCCCG ATCGAACCAA CAGTTGCAGA AGGTGTGGTA GAGAAGGCCA1801 
CATGGCAAGA GACTGCAAAA ATAAGCCGAA GTGCGTGCTC TGTAAAGAAG1851 GCGATGGCAA TAGCCATGCG ACGGGTGGCT TTAATTGCCC GGTGTACACA1901 GAAGCTGGCC TCGGGCAAAA AGTAATGGAG GTGTCCCAGG TGAACCTCAA1951 TCACTGCGAC ACTGCACAGC AACTGCTGTG GCAGTCGACC GCGGAGACGG2001 GGTGTGACGT GGCAATTATT GCAGAACCGT ACCGAGTTCC ACACGACAAC2051 GGAAACTGGG CCGCGGATAC AGCAAGAATG GCGGCGATAC ACGTGATGGG2101 GCGGTACCCC ATACAGGAAG TGGTCTCGAG GGCGTTTGAA GGATTCGTGA2151 TCGCCAAAGT AAACGGAACC TTCTTCTGTA GCTGCTATGC TCCCCCAAGA
2201 TGGACCTTGG AGCAGTTTCA GCAGATGCTG GATAGTCTGA CCGACGAACT2251 GATCGGACGA AGCCCGATCG TTATCGGAGG TGACTTCAAC GCGTGGGCGG2301 TCGAGTGGGG TAGCAGATGC ACCAATGCTA GGGGGCATAG CCTAATGGAA2351 GCTCTGGCAA AGCTAGACGT TAGGCTGGCG AATCGCGGAA CCAGCAGTAC2401 CTTCCGCAAA GACGGTCGTG AGTCCATTAT CGACGTTACG TTCTGTAGCC2451 CGCGACTGGC GGCCGACATG AACTGGAGGG TGAGTGAGGA CTATACCCAT2501 AGCGATCACC AAGCGATCCG GTACAGCATC GGGAGACGAG CCCCTGTACC2551 AGATAGGAGC AGCCGGTCCT ACGGAAGGAA ATGGAAGCTG CAGTACTTCG2601 ACGAGGGTCT CTTCGTGGAA GCGCTCCATT GGTGTGATGG TCCCCAAGAC2651 TTGAGTGCCG ACGTGCTAAC AGCACAACTG GTGACAGCAT GCGACACAAC2701 CATGCCGCGG AGACTGGAGC CAAGGAACTG TCGTCGTCCA GCCTACTGGT2751 GGAATGAAGA ACTCGGTACC CTTCGGGCAA GTTGCCTCAG CGCCAGAAGA2801 CGAGTCCAGA GAGCAAGATC CGAAGCAACT AGAGAGGAGT GCAGAGAGGA2851 GTACCGGTCT GCAAAGGCCG CGCTCAAGAA AGCGATCAAA TGCAGCAAGA2901 CAAACTGCTT CAAGGAGTTA TGCCAAGACG CTGATGCAAA CCCTTGGGGG2951 AGCGCATATC GTGTCGCGAT GGCGAAGATC AGAGGCCCAT CGATGGTGGC3001 TGAAACGTGT CCCGACAAGC TGAAGGTCAT TGTGGAAGGG CTCTTCCCAA3051 GACATGACCC AACGACATGG CCTCCTACAC CGTACAACGA CGAAGGGGGT3101 AGCAACGCCG AAGGTCATCT GATCACCAAC GAAGAACTTG TGGCAGTAGC3151 GAAGAGATTG AAGGTGAAGA AAGCTCCCGG CCCGGATGGA ATCCCGAATT3201 TCGCCCTGAA ATCGGCGGTT CAAGCATTCC CGGACAGGTT TCGAACAGTC3251 CTGCAGAAAT GCCTGGACGA AGGACACTTC CCCGACCCGT GGAAGGTTCA3301 AAAGCTCGTG TTGCTGCCGA AGCCAGGCAA ACCACCGGGG GACCCATCAT3351 CGTATAGGCC TATATGTTTG CTGGACACCC TCGGAAAGCT TCTGGAACGG3401 ATCATCCTTA ACCGGCTGAC CAAGTACACG GAGAGCGAGC ATGGCTTAGC3451 AGCGAGGCAG TTCGGCTTCC GTAAAGGGAG ATCCACGGTG GACGCCATCC3501 GGAAAGTGGT CGAGAAAGCC GACGAAGCGC GGAGGAAAAA ACGCAGGGGG3551 AACCGTTGCT GCGCAATAGT CACGATTGAC GTCAAGAACG CGTTCAACAG3601 TGCGAGCTGG GCGGCCATAG CAGCAGCGCT GCACAAAATG AAGGTGCCTG3651 ACTATTTGTG CATGATCTTG AAGAGCTACT TCGAGAACCG CGTGCTGGTC3701 TACGACACTG CCGATGGACA AAAAACCGTT GTTGTTACCG CGGGAGTTCC3751 ACAGGGATCC ATTCTGGGTT CAGCACTGTG GAACGGAATG TATGACGGAG3801 TGTTGACACT GGGGCTACCC AACGGCGTAG AGATTGTTGG CTTTGCAGAC3851 GACATAGTGC TGACGGTAAC CGGCGAAAAT GTCGAGGAGG 
TCGAAATGCT3901 GGCTATGGAG GCAATCGCAA TGATCGAGAA CTGGATGCTC GAGGTGAAGC3951 TGCGGATCGC TCACCACAAG ACGGAGATGG TACTGGTTAG TAACCACAAA4001 AAGGTGCAGC AGGCCCAGAT ACACGTTGGG GAACACGTAG TGCACTCGAA4051 GAGAGCGCTC AAGTACCTCG GGGTGATGGT GGATGACCGG CTGAACTTCA4101 ACAGCCACGT CGATTACGCC TGCGAGAAGG CGGCTAAGGC GATCATGGCA4151 CTGTCGAGGA TGATGCCGAA CAACGCTGGA CCCAGGAGCA GTAGGCGCCG4201 CCTCTTGGCA AGTGTCGCGA CGTCCATACT TAGGTACGGC GGACCGGTAT4251 GGTGGACGGC GCTGGGGACG AAGCGAAATC GAGCGCTGCT CGACAGAACG4301 CAGAGACTGA TGGCCATGCG GGTTGCAAGC GCGTACAGGA CCATYTCGTC4351 GGAAGCAGTT GGCGTCATAG CCGGAATGAT CCCCATCGGC ATCACACTGG4401 AGGAGGACAC CGTGCGCTAC ACCCGRAGAG GCACGAGAGG TATCCGGGAA4451 GCTGCGAGAG CCGAATCGCT GGCAAGGTGG CAACGTGAGT GGGACACCAC4501 GGAGAAAGGC AGATGGACGC ATCGGCTTAT CCCGTCCGTA TCCACGTGGG4551 TGAGCAGAAG GCAYGGAGAG GTCACCTTCC ACCTCACACA GTTCCTGTCG4601 GGCCATGGCT GCTTCAGGAA GTACCTGCAC AGGTTYGGAC ATGCAGAGTC4651 TCCTCTCTGT CCGGACTGCG TCGATTGCGA GGAAACACCG GAGCACGTGG
4701 TGTTCGCCTG CCCTCGCTTC GAGGCAGCGC GAAGCGAAAT GCTGGCCATT4751 ATCGGAGCRG ACACCAGCCC GGATAATGTG GTGCGAAGAA TGTGCAGCGA4801 CATYGCCAAG TGGAATGCGG TCGTCGGAGC GGTGACGCAG ATCACTTCGG4851 CTCTCCAGCG GAAATGGAGA GACGATCAGA GGAGGAACGA CTAGGAGCCT4901 AGTCGAAAAC CCACGAGTGT GGCTGTGAAG GAGAGCACGT TATGATGGTC4951 GGCTCTACCA AATCGGTACA CGTCTCGATG GTCACAGGAG TCGAGAACCC5001 ACGAGTGTGG CTGTGAAGGA GAGCACGTTA TGACGGTTGG CTCTACCAAA5051 TCGGTACACG TCTCGATGGT CCAAGGAGAG GGCTGCATAT GACTAGCCGA5101 TCAAAAGCAA CGCGATTCTT GGGCGCGGTT AAACCCTCGC ATGGACTCAT5151 ATGTATGTRG AACAGGAAAT GGTTCTAGYA CCCGGCATGG ATCCTGTAAG5201 TAGACTAGTG CAGAAAATGC AACGCCTCCC CCCGAAGTTA TACCGAAAGG5251 TGGTCCCGGG GGGAMAAGGG CACGGCGTTC AAGGACTGGT TTAGTGGGTC5301 GGGAAAACTC TTTTTGTTTT CCCAACCCCA CACTACCTGA GAAATGAATT5351 CTCAGGTGTC TGGTAGCAGA TTCCGACCTT GTAAAAAAAA AAACACACAC5401 ACACACACAC ACACACACAC ACACA
//
E.2.1.10 RTE
LOCUS /tmp/readseq.in.19664 4069 bp
DEFINITION /tmp/readseq.in.19664 [Unknown form], 4069 bases, 1CD7 checksum.
ORIGIN
1 Cp_RTE_Ele MCGTGCACAT CGGTATTGTT TTTCGTACCG TTCCCGCGAA51 AGTGCAAAAT TTACCACGAA AATACGCGTT AGTGGGGTGT GAAATTCATT
101 GCATCCGTGT TCTGAACCCG TCTCCTGGCC CGGGAAAGTG ATAGTCCGGT151 GTTTTAACGA CCGCGAGTGA GACAAATCAA TCACCAGATT TGACCGAATA201 TCGCGGTTCC GCCACGGGCA ATTGAAAAAA AGTACAAAAG AACTGCTAAG251 GCATCGTCGT CGTCGTCGCG ATCGATGGTC GTTGTATTGT GTCGYGCGGC301 GTGTGGAAGT GTACCGCGWA GACCATACTG ASGYCGTGTC GTCGTCGTCC351 TCGTCAGCYG TGDTCTGTTG CTASGAGAGW MGTGYAWAGA AGAGAAKCCG401 TWGGTGCAAC GGWGGGGTGA AKTGCTAAGG ARAGCKACAG AAAAAAAAGT451 GCTGCAAAAG TTTTTATTGG TRAAAWTAAA AAGAAGGAAW VAAKCAARCD501 GAAGMGTCMA CCCTGCTRCR TACACAGCCC CCCCCCCCCC CCTCTCTAGT551 TCCGTTTYTG GGCGTCTGCA CCCCAKGTTA GGGGCGGCCC AAACGGACTA601 GGTKGTCGAT TCCATTCTCA CCCGTGAGCG GCTGTTCCCC ACGTTAGGGG651 CGGCTCATGA AAGGAAGTGA GCCGACACCC CCCCCCCCCT CCCCCCACCC701 TTCTTGAGCG TCTGTTCCCC AGGTTAGGGG CGGCTCRAAG AAACCGGTGT751 CCTGCCTCYA TCGTCGAGGT AAGCGTCTGT TCTCCAKGTT AGGGGCGGCT801 TACAGCAGGA TAGAGTTCGG ASCCCCCACC CCCTCCCCCC CGAGCGTCTG851 TTCCCCAGGT TAGGGGCGGC TCGAAACAGC GTCTGTACCC CAGGTTAGGG901 GCGGCTGAGT AAAAGTCCYT GTGTCGGCGT GGGACTKTAA ACAGTACCGG951 CACGATGGTC CTCCGGCGAG ACAGGGGGTT GGTGCAGGCC ACACGAACCC1001 GCCGTAAAAC ACCAGTGCAG GAAGCACACG ATGCGAGCCG GACCAATCGG1051 CACGGAACTG GACATCWTAT GAGGTCCCAC GATTGGAAGC TCGGGACGTG1101 GAATTGCAGG TCTCTCAAAT TTGACGGGAG TATCCGCATA CTTTCCGACA1151 TATTGAGGGT CCGCAAGTTC AGCATCGTAG CGCTGCAGGA GGTTKGCTGG1201 ATAGGCGCGG AAGAGGTACA AGCGTACCCA AGGATTGGGC TGTACAATCT1251 ACCAGAGCCG CGGCGAAAAC AAGAGGCTGG GGACAGCCTT TATAGTGCTG1301 GGCGAAATGC GCGATCGCGT GATTGGGTGG ACCCCGCTCA CCGACCGAAT1351 GTGCGTGCTG AGGATTAAAG GCCGTTTCTT CAACATTAGC ATCATAAACG1401 TGCACAGCCC GCACTCAGGA AGCGAAGATG ACGACAAGGA CGCATTTTAC1451 GAGCAGCTGA ACTGGACGTA CAACAGCTGC CCAAAACATG ACGTCAAAAT1501 CGTCATCGGA GATTTTAACG CTCAGGTTGG CCAGGAGGAG GAATTCAGAC1551 CGGTGATAGG AAAGTTCAGC GCCCACGTAC GCACGAACGA AAACGGCCTG1601 CGACTGATCG ACTTCGCCAC CTCCAAAAAC ATGGCCGTAC GAAGTACCTG1651 CTTCCAGCAC AACCTCCGAG ACAAGTACAC CTGGAGATCA CCGCAAGGAA1701 CGGAATCACA AATCGACCAC GTCGTAATCG ACGGTAGACA CTTTTCCGAC1751 ATCATCGACG TCAGGACCTA TCGCGGCGCC AACGTCGACT CGGACCACTA1801 
TCTGGTGATG GTGAAAATGC GCCAACGACT TTCCCTGGCG AAAAGCGTTC1851 GGTACCGCCG CCCTCCGCGG TTGGATCTGG AGCGGCTTAA GTTACCGGAA1901 GTCGCATCCC GGTACGCGCA TTCGCTGGAG GCTGCGTTGC CAGGGGAGGG1951 TGAGCTGTTG GAAGCTCCCC TCGAGGACTG CTGGAGGAGC GTCAAGGCAG2001 CCATCACCAA CGCAGCGGAA AGCACCATCG GATTTGTGGA ACGAGGACGA2051 CGGAACGATT GGTTCGACGA GGAGTGTCGA GCGATTTTGG AGGAGAAGAA2101 TGCAGCACGG AGGGCAATGC TGCAGTACAA TCTCCGTGAT TACGAGGAGG2151 CGTATGGACA GAAGCGAAGG CAGCAGCACC AGCTCTTCCG AGCAAAAGTG
2201 CGCCACCAGG AAGAGTTGGA GTTTGAGGAC ATGGAGCAGC TGCATCGCTC2251 AAACGAAACG CGCAAGTTCT ACAAGAAGCT CAACGGATCC CGMAACGGCT2301 TCACGCCGCG AGTCGAAATG TGCCGGGATA AAAATGGAGC TATCTTGACG2351 AACGAGCGTG AGGTGATTGA CAGGTGGAAG CAGCACTTCG ATGAACACCT2401 GAATRGCGCA GAAGCAGAGG CAGGGGTCCA AGGCGGCAGG AGAGAGGACT2451 TCATCGGTAC AGCGGGAGAA GGAGAGGAGC CAGTTCCCAC GATGAGGGAA2501 GTTAAGGATG CCATCAAGAA GCTGAAGAAC AACAAAGCAG CGGGTAAGGA2551 TGGTATCGGT GCTGAACTCA TCAAGATGGG CCCGGAGAAG CTGGCGTCCT2601 GTCTGCACCG ACTGATAGTC AGGGTCTGGG AGTCAGAACA GCTACCGGAG2651 GAGTGGAAAG AGGGAGTAAT ATGCCCGATC TACAAGAAGG GGGACAAGTT2701 AGATTGTGAG AACTACCGTG CCATCACAAT CCTCAACGCG GCCTACAAAG2751 TGTTCTCCCA GATCCTCTTC AGCCGCCTAT CGCCAATAGC GGAAGGTTTT2801 GTTGGAAGTT ATCAAGCCGG ATTCGTCATG GGGAGATCAA CAACCGACCA2851 AATCTTCACT GTGCGACAAA TCCTCCAAAA GTGTCGCGAG TACCAAGTCC2901 CCACGCACCA CCTTTTCATC GACTTCAAAG CCGCGTACGA CTCAGTCGAT2951 CGCGAAGAGC TATGGAAAAT TATGGACGAG AACGGTTTTC CCGGGAAGCT3001 GATCAGACTG ATCAAGATGA CGATGGATGG GGCTAGGTGT TGTGTGAAGA3051 TATCGGGTGC GGAATCGGAC TCGTTTACTT CACTTGGGGG GCTTCGGCAA3101 GGCGATGGGA TCTCTTGYCT YTGTTTCAAT GTCGTGCTAG AAGGTGTTAT3151 GAGACGAGCG GGCTTCAATA TGCGGGGCAC GATCTTCAGC AAGTCCAACC3201 ARTTCATCTG CTWCGCCGAC GACATGGACA TTGTTGGCAG AACGTTCAAG3251 GCGGTTGCKG ATGCGTACAC CGRCTTGAAG CGGGAAGCAG AGAAGGTTGG3301 GCTAAGGGTG AATGTGGCGA AGACAAAGTA CCTGCTGGCA GGAGGAACCG3351 AGTCCCTTAG GGCTCGCATT GGACCRAGCG TTACRATCGA CGGGGACGAA3401 TTCGAGGTRG TGGAGGAGTT TGTATACCTC GGATCGTTGG TAACGTCGGA3451 CAACAGCTGC AGCAGGGAAA TTCGGAGGCG CATCATCGCT GGAAGTCGTG3501 CCTATTTCGG TCTYCACAAG AGCCTAAGGT CCCGGAAATT CTCCCTACAT3551 ACGAAGTGTT CCATCTACAA GTCGCTGATA AGACCGGTCG TCCTCTACGG3601 GCACGAGACG TGGACAATGC TCGARGAGGA CYTACGAGCG CTAARCGTYT3651 TCGAACGTCG AGTGCTAAGG ACCATCTTTG GCGGCGTATA TGAGAACGAC3701 GGATGGCGGC GGAGAATGAA CCACGARCTT GCRCAACTCT ACAACGAACC3751 AAGCATCCGG AARGTCGCGA AGGCTGGACG GTTGCAGTGG GCGGGTCATG3801 TTGCAAGGAT GCCGGAACGA GCCGASCAMT TGAGCCAACG GAACCAGAAG3851 ATCAATCCTG CGAAGTTGGT GTTTGTGTCG GAGCCGGTAG 
GAACAAGACG3901 TAGGGGGGTG CAACGTGCGA GGTGGGTGGA CCAAGTGGAG ARCGATYTRG3951 AAAGTGTGGG TGCGCCGCGA AATTGGAGAM AWGCAGCCAT GGACCGAGCT4001 TGTTGGCGGA GAATCGTGCA GCAGGYCAAG CTAATGGTGT AGCGCCAAYA4051 AAAGTAAAGT AAAGTAAGT
//
E.2.1.11 Unclassified LINE
LOCUS L1_Contig_59 405 bp
DEFINITION L1_Contig_59, 405 bases, 1F4B checksum.
ORIGIN
1 ACAATAGCCT AAGTATTCGC AGAGAAGTTT TAGTATTTGA AGTCTTTGTC
51 TTCTTCCGGC TCCCTCTAGG GCAGGTCTCA GCAGGTTCTC GAATGAGATA
101 GAGCGCCTTY CTCGCAAGAT CACAGTGACC TTGTTTTGAA ACAKCAGCCA
151 CAYGTCTGCC ACSCGTGAAC ATTCACTAAA TTTATGCTGC AGCGTTTCCA
201 CAGCAGCTCC ACAGTGAAGG CAACTCTMAC TGTTCACCCT CTGTATCGTG
251 TGAAGCAACT TACGATGTTC TGTTTTTTCG TTCACGAACA TGTACAGTTG
301 ATTCTGTTGT GCAGAAGTGA GCCCTCTCAA AGCGATGTTT TTTTTCCAAA
351 TTCTCCGCCA GTTGACCGCT GGGTTAGCTT GCTGAACCTT CGGCTGTTCG
401 GTTTG
//
E.3 D. melanogaster
E.3.1 Transposons
E.3.1.1 mariner
DEFINITION Putative Drosophila melanogaster mariner sequence
SOURCE Drosophila melanogaster
  flybase.org, dmel-all-chromosome-r5.29.fasta
FEATURES Location/Qualifiers
  source 1..1043
    /organism="Drosophila melanogaster"
    /mol_type="genomic DNA"
    /transposon="putative mariner transposon"
  repeat_region 1..26
    /note="left terminal inverted repeat"
  ORF1 66..780
    translation=LCNCILSSVSYQLLLLAECQNWFRKFRSGDFSLKHEPRSGRLYEVDDDLIKALIELDRHVNKQEIGEKFNIPKSTVYYHIKRLVKKFDIWVPHVLKEIHLTHRINACDMQLKCNEFDPFLKRITSGKEKWIVYNNVSRKRSWSKHGEPAQTTSKADIHQKKVMLSVWWDWKGVVYFELLPRNQTINSDVYCHQLNKLNTRRSDQNWSIVKVSYSTRITLDCTHLWSLSKNCVSLGRNF
    /product="putative mariner transposase"
  repeat_region 1018..1043
    /note="right terminal inverted repeat"
ORIGIN
1 TGCCCAAAAA GTAATTGCGG ATTTTTCATA TAGTCGGCGT TGACAAATTT
51 TTTCAACGGC TTGTGACTTT GTAATTGCAT TCTTTCATCT GTCAGTTATC
101 AGCTGTTACT ATTAGCTGAG TGTCAAAATT GGTTTCGCAA ATTCCGTTCT
151 GGAGATTTTT CACTTAAACA TGAGCCCCGT TCAGGTCGGC TATATGAAGT
201 TGATGATGAC CTAATCAAAG CATTAATCGA ATTGGATCGT CATGTAAATA
251 AGCAGGAGAT AGGAGAGAAG TTTAATATAC CAAAATCAAC CGTTTACTAT
301 CACATAAAAA GACTAGTGAA AAAGTTTGAT ATTTGGGTAC CACATGTATT
351 GAAAGAAATT CATTTAACAC ACCGAATAAA TGCTTGTGAT ATGCAACTTA
401 AATGCAATGA ATTCGATCCG TTTTTAAAAC GAATCACATC TGGAAAGGAA
451 AAATGGATTG TTTACAACAA CGTTAGTCGA AAACGATCAT GGTCCAAGCA
501 TGGTGAACCA GCTCAAACCA CTTCAAAGGC TGATATCCAC CAAAAGAAGG
551 TTATGCTGTC TGTTTGGTGG GATTGGAAGG GTGTCGTATA TTTTGAACTG
601 CTTCCAAGGA ACCAAACGAT TAATTCGGAT GTTTACTGTC ACCAATTGAA
651 CAAATTGAAT ACAAGGAGAA GCGACCAGAA TTGGTCAATC GTAAAGGTGT
701 CATATTCCAC CAGGATAACG CTAGACTGCA CACATCTTTG GTCACTATCC
751 AAAAACTGTG TGAGCTTAGG TAGGAACTTT TGATGCATCC ACCGTATAGC
801 CCTGACCTGG AACCATCAGA CTACCATTTA TTTCGATCTT TGCAGAACTC
851 CTTAAATGGT AAAACTTTCG GGAATGATGA GGCTATAAAA TCGCACTTGG
901 TTCAGTTTTT TGCAGATAAA GGCCAGAAGT TCTATTGACC GTGGAATAGA
951 AAAAAGGTTA TCGAAAAAAA TGGCAATTCA TTCTAAGTAT TATTAAAAAT
1001 GCATTTACTT TCTTTTAAAA AATCGGAAAT TATTTTTTGG GCA
//
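The two repeat_region features annotated in this record can be checked against the sequence itself. The following Python sketch is an illustration only and was not part of the original analysis; the function names are hypothetical, and the two 26-base strings are copied from positions 1..26 and 1018..1043 of the ORIGIN block. It reverse-complements the right terminal repeat and counts the positions at which it differs from the left one.

```python
# Illustrative check of the terminal inverted repeats (TIRs) annotated above.
# Helper names are hypothetical; the two strings are copied from ORIGIN.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Return the reverse complement of an unambiguous DNA string."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def count_mismatches(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(1 for x, y in zip(a, b) if x != y)


LEFT_TIR = "TGCCCAAAAAGTAATTGCGGATTTTT"   # positions 1..26
RIGHT_TIR = "AAAAATCGGAAATTATTTTTTGGGCA"  # positions 1018..1043

mismatches = count_mismatches(LEFT_TIR, revcomp(RIGHT_TIR))
print(f"{mismatches} mismatches over {len(LEFT_TIR)} bases")
```

With the strings as transcribed here, the two ends differ at 3 of 26 positions, so the repeats are imperfect inverted copies of each other.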
APPENDIX F
SCRIPT USED TO IDENTIFY MITES
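This appendix presents the BioPerl script we used to identify MITEs in
Pediculus humanus humanus. The script takes six positional arguments: the
minimum repeat length, the input FASTA file, the number of mismatches allowed,
the maximum distance, the output file name prefix, and the minimum distance.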
#Ryan Kennedy and Scott Christley
use lib ’/opt/bioperl’;
use Bio::SeqIO;use Bio::SearchIO;
$len=0;$minlen=$ARGV[0];$file=$ARGV[1];$allowMismatch = $ARGV[2];$maxDistance=$ARGV[3];$name=$ARGV[4];
$minDistance = $ARGV[5];
$len=$minlen;$counter=0;open(OUTDAT, ">$name.fa");flock(OUTDAT,$LOCK);print OUTDAT "==================================================\n";print OUTDAT "PARAMETERS:\n";print OUTDAT "\tMinimum Distance:\t $minDistance\n";print OUTDAT "\tMaximum Distance:\t $maxDistance\n";print OUTDAT "\tMismatches Allowed:\t $allowMismatch\n";print OUTDAT "\tMinimum Length:\t $minlen\n";print OUTDAT "==================================================\n";flock(OUTDAT,$UNLOCK);close(OUTDAT);$in = Bio::SeqIO->new(-file => $file, -format => ’Fasta’);
220
if(length($str1)>length($str2)) { $max=length($str1); }else { $max=length($str2); }
$num=1;while($seq=$in->next_seq()) {if ($seq->length() < 80) {
#starting sequence too shortnext;
}
$seq1=$seq->seq();$rvseq=$seq->revcom();$rseq=$rvseq->seq();$seq_full2=$seq1;$rseq_full=$rseq;
# search through scaffold, forward strand$lastMatch = "ZEWQSDV";for ($k = 0; $k < length($seq1); $k++) {
#print "k: " . $k . "\n";
# extract substrings$distance = length($seq1) - $k;if ($distance > $maxDistance) { $distance = $maxDistance; }
$seqSub = substr($seq1, $k, $distance);$rseqSub =
substr($rseq_full, length($rseq_full) - $distance - $k, $distance);
# forward strand$i = 0;#for ($i = 0; $i < $distance; $i++) {
# reverse strandfor ($j = 0; $j < $distance; $j++) {
# compare strings and allow for mismatches$len = 0;$numMismatch = 0;$match = "";while ((substr($seqSub,$i + $len, 1) eq
substr($rseqSub,$j + $len, 1))|| ($numMismatch < $allowMismatch)) {
if (substr($seqSub,$i + $len, 1) nesubstr($rseqSub,$j + $len, 1)) {
++$numMismatch;}$len++;$match=substr($seqSub,$i,$len);
221
if( (($i + $len) > $distance) ||(($j + $len) > $distance) ) { last; }
}
            # if the distance between the two repeats is too small then skip
            $back = $distance - $j + $k - $len;
            $front = $i + $k;
            if ($front > $back) { next; }
            if (abs($front - $back) < $minDistance) { next; }

            if (($match ne "") && (length($match) >= $minlen)) {
                #print $match . " $i\n";

                # trim out simple repeats
                my $cntA = 0;
                my $cntT = 0;
                my $cntG = 0;
                my $cntC = 0;
                for ($l = 0; $l < length($match); ++$l) {
                    $seqstr = substr($match, $l, 1);
                    if ($seqstr eq "A") { ++$cntA; }
                    if ($seqstr eq "T") { ++$cntT; }
                    if ($seqstr eq "G") { ++$cntG; }
                    if ($seqstr eq "C") { ++$cntC; }
                }
                if (($cntA == 0) || ($cntT == 0) ||
                    ($cntG == 0) || ($cntC == 0)) {
                    # simple repeat
                } else {
                    # check the other sequence
                    $cntA = 0;
                    $cntT = 0;
                    $cntG = 0;
                    $cntC = 0;
                    $match = substr($rseqSub, $j, $len);
                    #print $match . " $j\n";
                    for ($l = 0; $l < length($match); ++$l) {
                        $seqstr = substr($match, $l, 1);
                        if ($seqstr eq "A") { ++$cntA; }
                        if ($seqstr eq "T") { ++$cntT; }
                        if ($seqstr eq "G") { ++$cntG; }
                        if ($seqstr eq "C") { ++$cntC; }
                    }
                    if (($cntA == 0) || ($cntT == 0) ||
                        ($cntG == 0) || ($cntC == 0)) {
                        #print "Simple Repeat " . $seq->id() . "\n";
                        # simple repeat
                    } else {
                        #print $lastMatch . "\n";
                        ($seqNew, $passed) =
                            loop(substr($seqSub, $i, $len), $lastMatch);
                        if ($passed == 1) {
                            $lastMatch = $seqNew;
                            open(OUTDAT, ">>$name.fa");
                            flock(OUTDAT, $LOCK);
                            print OUTDAT ">" . $seq->id() .
                                " $counter $front $back $len\tRPT1:" .
                                $seqNew . "\tRPT2:" .
                                substr($seq1, $back, $len) . "\n";
                            flock(OUTDAT, $UNLOCK);
                            close(OUTDAT);
                            $counter++;
                        }
                    }
                }
            }
            $len = $minlen;
            $match = "";
        }
    }
    $num++;
}
sub loop {
    my ($sequence, $last_sequence) = @_;
    $check = 0;
    $pass = 0;
    # keep the untrimmed sequence so it can be restored below
    $keeper = $sequence;
    #print $sequence . " " . $last_sequence . "\n";

    # strip simple repeats (homopolymer runs and dinucleotide repeats)
    # from either end of the sequence until none remain
    while ($sequence =~ /(A{4}|T{4}|G{4}|C{4})$/i ||
           $sequence =~ /^(A{4}|T{4}|G{4}|C{4})/i ||
           $sequence =~ /(ATAT|GCGC|GAGA|CTCT|CACA|TGTG)$/i ||
           $sequence =~ /^(ATAT|GCGC|GAGA|CTCT|CACA|TGTG)/i ) {
        #Removed
        $check = 1;
        $sequence = $` . $';
    } #end while

    if (length($sequence) >= $minlen) {
        if ($check > 0) {
            $sequence = $keeper;
        }
        if ($last_sequence =~ $sequence) {
            $pass = 0;
        } else {
            $pass = 1;
        }
        #$pass=1;
    } else {
        $pass = 0;
        #Sequence full of Repeats or is a substring
    }
    return $sequence, $pass;
} #end loop
REFERENCES
1. S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller,and D. J. Lipman. Gapped BLAST and PSI-BLAST: a new generation ofprotein database search programs. Nucleic Acids Research, 25(17):3389–3402, July 1997.
2. O. Andrieu, A. S. Fiston, D. Anxolabehere, and H. Quesneville. Detectionof transposable elements by their compositional bias. BMC Bioinformatics,5(94), July 2004.
3. S. M. Anwar, M. Musiani, G. McDermid, and D. Marceau. How Do HumanActivities Shape Wolves’ Behavior In The Central Rocky Mountain Region,Alberta, Canada? In L. Yilmaz, editor, Proceedings of the 2009 Agent-Directed Simulation Symposium. The Society for Modeling and SimulationInternational, March 2009.
4. ArcGIS. http://www.esri.com/software/arcgis.
5. P. Arensburger, K. Megy, R. M. Waterhouse, J. Abrudan, P. Amedeo, B. An-telo, L. Bartholomay, S. Bidwell, E. Caler, F. Camara, C. L. Campbell, K. S.Campbell, C. Casola, M. T. Castro, I. Chandramouliswaran, S. B. Chap-man, S. Christley, J. Costas, E. Eisenstadt, C. Feschotte, C. Fraser-Liggett,R. Guigo, B. Haas, M. Hammond, B. S. Hansson, J. Hemingway, S. R. Hill,C. Howarth, R. Ignell, R. C. Kennedy, C. D. Kodira, N. F. Lobo, C. Mao,G. Mayhew, K. Michel, A. Mori, N. Liu, H. Naveira, V. Nene, N. Nguyen,M. D. Pearson, E. J. Pritham, D. Puiu, Y. Qi, H. Ranson, J. M. C. Ribeiro,H. M. Roberston, D. W. Severson, M. Shumway, M. Stanke, R. L. Strausberg,C. Sun, G. Sutton, Z. J. Tu, J. M. C. Tubio, M. F. Unger, D. L. Vanlanding-ham, A. J. Vilella, O. White, J. R. White, C. S. Wondji, J. Wortman, E. M.Zdobnov, B. Birren, B. M. Christensen, F. H. Collins, A. Cornel, G. Di-mopoulos, L. I. Hannick, S. Higgs, G. C. Lanzaro, D. Lawson, N. H. Lee,M. A. T. Muskavitch, A. S. Raikhel, and P. W. Atkinson. Sequence of Culexquinquefasciatus Establishes a Platform for vector Mosquito ComparativeGenomics. Science, 330(6000):86–88, October 2010.
224
6. S. M. N. Arifin, R. C. Kennedy, K. E. Lane, G. R. Madey, A. Fuentes,and H. Hollocher. P-SAM: A Post-Simulation Analysis Module for Agent-Based Models. In Proceedings of the International Simulation Multiconfer-ence (ISMc2010): Summer Computer Simulation Conference (SCSC2010),2010.
7. O. Balci. Handbook of Simulation: Principles, Methodology, Advances, Ap-plications, and Practice, chapter Verification, Validation, and Testing. JohnWiley & Sons, New York, NY, 1998.
8. J. Banks and R. R. Gibson. Don’t simulate when... 10 rules for determiningwhen simulation is not appropriate. IEE Solutions, September 1997.
9. J. Banks and J. S. C. II. Introduction to discrete-event simulation. InProceedings of the 1986 Winter Simulation Conference, pages 17–23, 1986.
10. J. Banks, J. S. C. II, B. L. Nelson, and D. M. Nicol. Discrete-Event SystemSimulation. Pearson Education, Inc., Upper Saddle River, NJ, fourth edition,2005.
11. Z. Bao and S. Eddy. Automated de novo identification of repeat sequencefamilies in sequenced genomes. Genome Research, 12(8):1269–1276, August2002.
12. E. A. Bennett, L. E. Coleman, C. Tsui, W. S. Pittard, and S. E. Devine. Nat-ural genetic variation caused by transposable elements in humans. Genetics,168:933–951, October 2004.
13. C. M. Bergman and H. Quesneville. Discovering and detecting transposableelements in genome sequences. Briefings in Bioinformatics, 8(6):382–392,November 2007.
14. J. Biedler and Z. Tu. Non-LTR Retrotransposons in the African MalariaMosquito, Anopheles gambiae: Unprecedented Diversity and Evidence ofRecent Activity. Molecular Biology and Evolution, 20(11):1811–1825, 2003.
15. E. Birney, M. Clamp, and R. Durbin. GeneWise and Genomewise. GenomeResearch, 14:988–995, 2004.
16. E. Birney and R. Durbin. Using GeneWise in the Drosophila AnnotationExperiment. Genome Research, 10:547–548, 2004.
17. BLAST. http://www.ncbi.nlm.nih.gov/blast.
225
18. D. Brown, R. Riolo, D. Robinson, M. North, and W. Rand. Spatial Pro-cess and Data Models: Toward Integration of Agent-based Models and GIS.Journal of Geographic Systems, Special Issue on Space-Time InformationSystems, 7(1):25–47, 2005.
19. R. V. Bruggner. A system for integration and management of communityannotation for vectorbase.org. Master’s Thesis, University of Notre Dame,2007.
20. C. Burge and S. Karlin. Prediction of complete gene structures in humangenomice DNA. Journal of Molecular Biology, 268:78–94, 1997.
21. R. E. Butler. The Design and Development of VectorBase: A BioinformaticResource Center for Invertebrate Vectors of Human Pathogens. Master’sThesis, University of Notre Dame, 2010.
22. P. Capy, C. Bazin, D. Higuet, and T. Langin. Dynamics and Evolution ofTransposable Elements. Landes Bioscience, Austin, Texas, 1998.
23. L. Cary, M. Goebel, B. Corsaro, H. Wang, E. Rosen, and M. Fraser. Trans-poson mutagenesis of baculoviruses: analysis of Trichoplusia ni transposonifp2 insertions within the fp-locus of nuclear polyhedrosis viruses. Virology,172(1):156–169, September 1989.
24. A. Caspi and L. Pachter. Identification of transposable elements using multi-ple alignments of related genomes. Genome Research, 16:260–270, February2006.
25. C. Castle, A. Crooks, P. Longley, and M. Batty. Agent-based modellingand simulation using repast: A gallery of gis applications from CASA. InG. Priestnall and P. Alpin, editors, Proceedings of the 14th GeographicalInformation Systems Research UK Conference, pages 237–239, 2006.
26. Chado. http://www.gmod.org/wiki/index.php/chado.
27. Chado Best Practices.http://gmod.org/wiki/index.php/Chado Best Practices#Transposons.
28. N. L. Craig, R. Craigie, M. Gellert, and A. M. Lambowitz, editors. MobileDNA II. ASM Press, Washington, DC, 2002.
29. A. T. Crooks. UCL working paper series: The repast sim-ulation/modelling system for geospatial simulation. available athttp://www.casa.ucl.ac.uk/working papers/paper123.pdf, September 2007.
226
30. V. Curwen, E. Eyras, T. D. Andrews, L. Clarke, E. Mongin, S. M. Searle,and M. Clamp. The ensembl automatic gene annotation system. GenomeResearch, 14:942–950, 2004.
31. V. Curwen, E. Eyras, T. D. Andrews, L. Clarke, E. Mongin, S. M. Searle,and M. Clamp. The Ensembl Automatic Gene Annotation System. GenomeResearch, 14(5):942–950, 2004.
32. P. Daszak, A. Cunningham, and A. Hyatt. Anthropogenic environmentalchange and the emergence of infectious disease in wildlife. Acta Tropica,78:103–116, 2001.
33. DNASTAR SeqMan.http://www.dnastar.com/products/seqmanpro.php.
34. Douglas-Peucker Algorithm.http://geometryalgorithms.com/Archive/algorithm 0205/#Douglas-Peucker%20algorithm.
35. R. D. Dowell, R. M. Jokerst, A. Day, S. R. Eddy, and L. Stein. The Dis-tributed Annotation System. BMC Bioinformatics, 2(7), 2001.
36. R. Drysdale and the FlyBase Consortium. FlyBase: a database for theDrosophila Research Community. Methods in Molecular Biology, 420:45–49,2008.
37. R. M. D’Souza, M. Lysenko, S. Marino, and D. Kirschner. Data-ParallelAlgorithms for Agent-Based Model Simulation of Tuberculosis On Graph-ics Processing Units. In L. Yilmaz, editor, Proceedings of the 2009 Agent-Directed Simulation Symposium. The Society for Modeling and SimulationInternational, March 2009.
38. R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological sequence analy-sis: Probabilistic models of proteins and nucleic acids. Cambridge UniversityPress, Cambridge, United Kingdom, 2003.
39. S. R. Eddy. Profile hidden markov models. Bioinformatics Review,14(9):755–763, 1998.
40. R. C. Edgar and E. W. Myers. PILER: identification and classification ofgenomic repeats. Bioinformatics, 21 Suppl. 1:i152–i158, March 2005.
41. L. J. Engel, G. A. Engel, M. A. Schillaci, A. Rompis, A. Putra, K. G.Suaryana, A. Fuentes, B. Beer, S. Hicks, R. White, B. Wilson, and J. S.Allan. Primate-to-human retroviral transmission in asia. Emerging Infec-tious Diseases, 11(7), July 2005.
227
42. J. E. Fa and D. G. Lindburg, editors. Evolution and Ecology of MacaqueSocities. Cambridge University Press, 2005.
43. P. Flicek, B. Aken, K. Beal, B. Ballester, M. Caccamo, Y. Chen, L. Clarke,G. Coates, F. Cunningham, T. Cutts, T. Down, S. Dyer, T. Eyre, S. Fitzger-ald, J. Fernandez-Banet, S. Graf, S. Haider, M. Hammond, R. Holland,K. Howe, K. Howe, N. Johnson, A. Jenkinson, A. Kahari, D. Keefe,F. Kokocinski, E. Kulesha, D. Lawson, I. Longden, K. Megy, P. Meidl,B. Overduin, A. Parker, B. Pritchard, A. Prlic, S. Rice, D. Rios, M. Schus-ter, I. Sealy, G. Slater, D. Smedley, G. Spudich, S. Trevanion, A. Vilella,J. Vogel, S. White, M. Wood, E. Birney, T. Cox, V. Curwen, R. Durbin,X. Fernandez-Suarez, J. Herrero, T. Hubbard, A. Kasprzyk, G. Proctor,J. Smith, A. Ureta-Vidal, and S. Searle. Ensembl 2008. Nucleic Acids Re-search, 36:d707–d714, January 2008.
44. P. Flicek, B. L. Aken, B. Ballester, K. Beal, E. Bragin, S. Brent, Y. Chen,P. Clapham, G. Coates, S. Fairley, S. Fitzgerald, J. Fernandez-Banet, L. Gor-don, S. Graf, S. Haider, M. Hammond, K. Howe, A. Jenkinson, N. Johnson,A. Kahari, D. Keefe, S. Keenan, R. Kinsella, F. Kokocinski, G. Koscielny,E. Kulesha, D. Lawson, I. Longden, T. Massingham, W. McLaren, K. Megy,B. Overduin, B. Pritchard, D. Rios, M. Ruffier, M. Schuster, G. Slater,D. Smedley, G. Spudich, Y. A. Tang, S. Trevanion, A. Vilella, J. Vogel,S. White, S. P. Wilder, A. Zadissa, E. Birney, F. Cunningham, I. Dunham,R. Durbin, X. M. Fernandez-Suarez, J. Herrero, T. J. P. Hubbard, A. Parker,G. Proctor, J. Smith, and S. M. J. Searle. Ensembl’s 10th year. Nucleic AcidsResearch, 38:D557–562, 2010.
45. P. Flicek, M. R. Amode, D. Barrell, K. Beal, S. Brent, Y. Chen, P. Clapham,G. Coates, S. Fairley, S. Fitzgerald, L. Gordon, M. Hendrix, T. Hourlier,N. Johnson, A. Khri, D. Keefe, S. Keenan, R. Kinsella, F. Kokocinski,E. Kulesha, P. Larsson, I. Longden, W. McLaren, B. Overduin, B. Pritchard,H. S. Riat, D. Rios, G. R. S. Ritchie, M. Ruffier, M. Schuster, D. Sobral,G. Spudich, Y. A. Tang, S. Trevanion, J. Vandrovcova, A. J. Vilella, S. White,S. P. Wilder, A. Zadissa, J. Zamora, B. L. Aken, E. Birney, F. Cunningham,I. Dunham, R. Durbin, X. M. Fernndez-Suarez, J. Herrero, T. J. P. Hubbard,A. Parker, G. Proctor, J. Vogel, and S. M. J. Searle. Ensembl 2011. NucleicAcids Research. Epub ahead of print.
46. J. Fooden. Systematic review of southeast asian longtail macaques, Macacafascicularis. Fieldiana Zoology, 81, 1995.
47. A. Fuentes, M. Southern, and K. G. Suaryana. Monkey forests and humanlandscapes: Is extensive sympatry sustainable for Homo sapiens and Macaca
228
fascicularis on Bali? In J. D. Patterson and J. Wallis, editors, Commen-salism and Conflict: The Primate-Human Interface. American Society ofPrimatology Publications, 2005.
48. GD Graphics Library. http://www.libgd.org/.
49. GeoTools. http://geotools.codehaus.org.
50. N. Gilbert. Agent-based Models. SAGE Publications, Thousand Oaks, CA,2008.
51. H. R. Gimblett, editor. Integrating Geographic Information Systems andAgent-based Modeling Techniques for Simulating Social and Ecological Pro-cesses. Oxford University Press, 2002.
52. H. R. Gimblett. Integrating geographic information systems and agent-basedtechnologies for modeling and simulating social and ecological phenomena.In H. R. Gimblett, editor, Integrating Geographic Information Systems andAgent-based Modeling Techniques for Simulating Social and Ecological Pro-cesses. Oxford University Press, 2002.
53. GMOD. http://www.gmod.org/.
54. GMOD Names of Features.http://www.gmod.org/wiki/index.php/Chado Sequence Module#Names of Features.
55. GNU General Public License (GPL) v3.http://www.gnu.org/licenses/gpl.html.
56. GRASS: Geographic Resources Analysis Support System.http://grass.osgeo.org.
57. P. Green. http://www.phrap.org/phredphrapconsed.html.
58. V. Grimm, U. Berger, F. Bastiansen, S. Eliassen, V. Ginot, J. Giske,J. Goss-Custard, T. Grand, S. K. Heinz, G. Huse, A. Huth, J. U. Jepsen,C. Jørgensen, W. M. Mooij, B. Muller, G. Pe’er, C. Piou, S. F. Rails-back, A. M. Robbins, M. M. Robbins, E. Rossmanith, N. Ruger, E. Strand,S. Souissi, R. A. Stillman, R. Vabø, U. Visser, and D. L. DeAngelis. Astandard protocol for describing individual-based and agent-based models.Ecological Modelling, 198(1):115–126, 2006.
59. V. Grimm, U. Berger, D. L. DeAngelis, J. G. Polhill, J. Giske, and S. F.Railsback. The ODD protocol: A review and first update. Ecological Mod-elling, 221:2760–2768, 2010.
229
60. V. Grimm and S. F. Railsback. Individual-based Modeling and Ecology.Princeton University Press, Princeton, NJ, 2005.
61. U. Hellsten and et al. The genome of the Western clawed frog Xenopustropicalis. Science, 328(5978):633–636, April 2010.
62. Hibernate. http://www.hibernate.org.
63. R. A. Holt, G. M. Subramanian, A. Halpern, G. G. Sutton, R. Charlab,D. R. Nusskern, P. Wincker, A. G. Clark, J. M. C. Ribeiro, R. Wides,S. L. Salzberg, B. Loftus, M. Yandell, W. H. Majoros, D. B. Rusch, Z. Lai,C. L. Kraft, J. F. Abril, V. Anthouard, P. Arensburger, P. W. Atkinson,H. Baden, V. de Berardinis, D. Baldwin, V. Benes, J. Biedler, C. Blass,R. Bolanos, D. Boscus, M. Barnstead, S. Cai, A. Center, K. Chatuverdi,G. K. Christophides, M. A. Chrystal, M. Clamp, A. Cravchik, V. Curwen,A. Dana, A. Delcher, I. Dew, C. A. Evans, M. Flanigan, A. Grundschober-Freimoser, L. Friedli, Z. Gu, P. Guan, R. Guigo, M. E. Hillenmeyer, S. L.Hladun, J. R. Hogan, Y. S. Hong, J. Hoover, O. Jaillon, Z. Ke, C. Kodira,E. Kokoza, A. Koutsos, I. Letunic, A. Levitsky, Y. Liang, J.-J. Lin, N. F.Lobo, J. R. Lopez, J. A. Malek, T. C. McIntosh, S. Meister, J. Miller,C. Mobarry, E. Mongin, S. D. Murphy, D. A. O’Brochta, C. Pfannkoch,R. Qi, M. A. Regier, K. Remington, H. Shao, M. V. Sharakhova, C. D. Sit-ter, J. Shetty, T. J. Smith, R. Strong, J. Sun, D. Thomasova, L. Q. Ton,P. Topalis, Z. Tu, M. F. Unger, B. Walenz, A. Wang, J. Wang, M. Wang,X. Wang, K. J. Woodford, J. R. Wortman, M. Wu, A. Yao, E. M. Zdobnov,H. Zhang, Q. Zhao, S. Zhao, S. C. Zhu, I. Zhimulev, M. Coluzzi, A. dellaTorre, C. W. Roth, C. Louis, F. Kalush, R. J. Mural, E. W. Myers, M. D.Adams, H. O. Smith, S. Broder, M. J. Gardner, C. M. Fraser, E. Birney,P. Bork, P. T. Brey, J. C. Venter, J. Weissenbach, F. C. Kafatos, F. H.Collins, and S. L. Hoffman. The genome sequence of the malaria mosquitoAnopheles gambiae. Science, 298(5591), October 2002.
64. X. Huang and A. Madan. CAP3: A DNA Sequence Assembly Program.Genome Research, 9:868–877, 1999.
65. International Human Genome Sequencing Consortium. Initial sequencingand analysis of the human genome. Nature, 409(6822):860–921, 2001.
66. R. M. Itami. Mobile agents with spatial intelligence. In H. R. Gimblett, ed-itor, Integrating Geographic Information Systems and Agent-based ModelingTechniques for Simulating Social and Ecological Processes. Oxford UniversityPress, 2002.
67. Java. http://java.sun.com.
230
68. A. Jenkinson, M. Albrecht, E. Birney, H. Blankenburg, T. Down, R. Finn,H. Hermjakob, T. Hubbard, R. Jimenez, P. Jones, A. Kahari, E. Kulesha,J. Macias, G. Reeves, and A. Prlic. Integrating biological data - the Dis-tributed Annotation System. BMC Bioinformatics, 9(Suppl 8):S3, 2008.
69. N. C. Jones and P. A. Pevzner. An Introduction to Bioinformatics Algo-rithms. The MIT Press, Cambridge, MA, 2004.
70. JTS Topology Suite. http://www.vividsolutions.com/jts/jtshome.htm.
71. J. Jurka, V. V. Kapitonov, A. Pavlicek, P. Klonowski, O. Kohany, andJ. Walichiewicz. Repbase update, a database of eukaryotic repetitive ele-ments. Cytogenetic and Genome Research, 110:462–467, 2005.
72. J. Jurka, P. Klonowski, V. Dagman, and P. Pelton. Censor–a program foridentification and elimination of repetitive elements from dna sequences.Computers and Chemistry, 20(1):119–121, 1996.
73. K. Kaiser, J. W. Sentry, and D. J. Finnegan. Eukaryotic transposable ele-ments as tools to study gene structure and function. In D. J. Sherratt, editor,Mobile Genetic Elements. Oxford University Press, 1995.
74. M. Keeling, M. Woolhouse, R. May, G. Davies, and B. Grenfell. Modellingvaccination strategies against foot-and-mouth disease. Nature, 421:136–142,January 2003.
75. R. C. Kennedy. Verification and Validation of Agent-based and Equation-based Simulations and Bioinformatics Computing: Identifying Transpos-able Elements in the Aedes aegypti Genome. Master’s Thesis, Universityof Notre Dame, April 2006.
76. R. C. Kennedy, K. E. Lane, S. M. N. Arifin, A. Fuentes, H. Hollocher, andG. R. Madey. A GIS Aware Agent-Based Model of Pathogen Transmission.International Journal of Intelligent Control and Systems, 14(1):51–61, March2009. Invited.
77. R. C. Kennedy, K. E. Lane, S. M. N. Arifin, H. Hollocher, A. Fuentes, andG. R. Madey. Effectively integrating gis data into an agent-based epidemi-ological model. In NICO Complexity Conference, September 2009. PosterAward Winner.
78. R. C. Kennedy, K. E. Lane, S. M. N. Arifin, H. Hollocher, A. Fuentes, andG. R. Madey. Simulation and Analysis of Pathogen Transmission in anAgent- and GIS-based Model. In North American Association for Com-putational Social and Organization Science 2009 Conference, Tempe, AZ,October 2009.
231
79. R. C. Kennedy, K. E. Lane, A. Fuentes, H. Hollocher, and G. Madey. Spa-tially Aware Agents: An effective and efficient use of GIS data within anAgent-based Model. In L. Yilmaz, editor, Proceedings of the 2009 Agent-Directed Simulation Symposium. The Society for Modeling and SimulationInternational, March 2009.
80. R. C. Kennedy, M. F. Unger, S. Christley, F. H. Collins, and G. R. Madey. Anautomated homology-based approach for identifying transposable elements.BMC Bioinformatics, 2010. Under review.
81. R. C. Kennedy, X. Xiang, T. F. Cosimano, L. A. Arthurs, P. A. Maurice,and S. E. Cabaniss. Verification and validation of agent-based and equation-based simulations: A comparison. In L. Yilmaz, editor, Proceedings of the2006 Agent-Directed Simulation Symposium. The Society for Modeling andSimulation International, April 2006.
82. M. G. Kidwell and D. Lisch. Transposable elements as sources of variation inanimals and plants. Proceedings of the National Academy of Sciences USA,94:7704–7711, July 1997.
83. E. F. Kirkness, B. J. Haas, W. Sun, H. R. Braig, M. A. Perotti, J. M.Clark, S. H. Lee, H. M. Robertson, R. C. Kennedy, E. Elhaik, D. Ger-lach, E. V. Kriventseva, C. G. Elsik, D. Graur, C. A. Hill, J. A. Veen-stra, B. Walenz, J. M. C. Tubo, J. M. C. Ribeiro, J. Rozas, J. S. Johnston,J. T. Reese, A. Popadic, M. Tojo, D. Raoult, D. L. Reed, Y. Tomoyasu,E. Krause, O. Mittapalli, V. M. Margam, H.-M. Li, J. M. Meyer, R. M.Johnson, J. Romero-Severson, J. P. VanZee, D. Alvarez-Ponce, F. G. Vieira,M. Aguad, S. Guirao-Rico, J. M. Anzola, K. S. Yoon, J. P. Strycharz, M. F.Unger, S. Christley, N. F. Lobo, M. J. Seufferheld, N. Wang, G. A. Dasch,C. J. Struchiner, G. Madey, L. I. Hannick, S. Bidwell, V. Joardar, E. Caler,R. Shao, S. C. Barker, S. Cameron, R. V. Bruggner, A. Regier, J. Johnson,L. Viswanathan, T. R. Utterback, G. G. Sutton, D. Lawson, R. M. Wa-terhouse, J. C. Venter, R. L. Strausberg, M. R. Berenbaum, F. H. Collins,E. M. Zdobnov, and B. R. Pittendrigh. Genome sequences of the humanbody louse and its primary endosymbiont provide insights into the perma-nent parasitic lifestyle. Proceedings of the National Academy of Sciences,107(27):12168–12173, July 2010.
84. O. Kohany, A. J. Gentles, L. Hankus, and J. Jurka. Annotation, submis-sion and screening of repetitive elements in Repbase: RepbaseSubmitter andCensor. BMC Bioinformatics, 7(474), 2006.
85. K. K. Kojima and H. Fujiwara. Evolution of Target Specificity in R1 CladeNon-LTR Retrotransposons. Molecular Biology and Evolution, 20(3):351–361, 2003.
232
86. K. K. Kojima and H. Fujiwara. Cross-Genome Screening of Novel Sequence-Specific Non-LTR Retrotransposons: Various Multicopy RNA Genes andMicrosatellites Are Selected as Targets. Molecular Biology and Evolution,21(2):201–217, 2004.
87. I. Korf. Gene finding in novel genomes. BMC Bioinformatics, 5(59), 2004.
88. K. E. Lane, R. C. Kennedy, L. A. Miller, G. Madey, H. Hollocher, andA. Fuentes. Exploring the use of agent-based models in understanding pat-terns of pathogen transmission. In preparation.
89. M. Larkin, G. Blackshields, N. Brown, R. Chenna, P. McGettigan,H. McWilliam, F. Valentin, I. Wallace, A. Wilm, R. Lopez, J. Thompson,T. Gibson, and D. Higgins. Clustal w and clustal x version 2.0. Bioinfor-matics, 23(21):2947–2948, November 2007.
90. A. M. Law and W. D. Kelton. Simulation Modeling and Analysis. McGraw-Hill, Boston, MA, third edition, 2000.
91. D. Lawson, P. Arensburger, P. Atikinson, N. J. Besansky, R. V. Bruggner,R. Butler, K. S. Campbell, G. K. Christophides, S. Christley, E. Dialynas,D. Emmert, M. Hammond, C. A. Hill, R. C. Kennedy, N. F. Lobo, R. M.MacCallum, G. Madey, K. Megy, S. Redmond, S. Russo, D. W. Severson,E. O. Stinson, P. Topalis, E. M. Zdobnov, E. Birney, W. M. Gelbart, F. C.Kafatos, C. Louis, and F. H. Collins. VectorBase: a home for invertebratevectors of human pathogens. Nucleic Acids Research, 35:D503–D505, 2007.
92. D. Lawson, P. Arensburger, P. Atkinson, N. J. Besansky, R. V. Bruggner,R. Butler, K. S. Campbell, G. K. Christophides, S. Christley, E. Dialynas,M. Hammond, C. A. Hill, N. Konopinski, N. F. Lobo, R. M. MacCallum,G. Madey, K. Megy, J. Meyer, S. Redmond, D. W. Severson, E. O. Stin-son, P. Topalis, E. Birney, W. M. Gelbart, F. C. Kafatos, C. Louis, andF. H. Collins. VectorBase: a data resource for invertebrate vector genomics.Nucleic Acids Research, 37:D583–587, 2009.
93. E. Lerat. Identifying repeats and transposable elements in sequencedgenomes: how to find your way through the dense forest of programs. Hered-ity, 104:520–533, 2010.
94. S. Lewis, S. Searle, N. Harris, M. Gibson, V. Iyer, J. Richter, C. Wiel,L. Bayraktaroglu, E. Birney, M. Crosby, J. Kaminker, B. Matthews,S. Prochnik, C. Smith, J. Tupy, G. Rubin, S. Misra, C. Mungall, andM. Clamp. Apollo: a sequence annotation editor. Genome Biology, 3(12),2002.
233
95. N. Lobo, A. Hua-Van, X. Li, B. Nolen, and J. M.J. Fraser. Germ linetransformation of the yellow fever mosquito, Aedes aegypti, mediated bytranspositional insertion of a piggyBac vector. Insect Molecular Biology,11(2):133–139, April 2002.
96. E. R. Mardis. Next-Generation DNA Sequencing Methods. Annual Reviewof Genomics and Human Genetics, 9:387–402, September 2008.
97. E. M. McCarthy and J. F. McDonald. LTR STRUC: a novel search andidentification program for LTR retrotransposons. Bioinformatics, 19(3):362–367, February 2003.
98. B. McClintock. The discovery and characterization of transposable elements:The collected papers of Barbara McClintock. Garland Publishing, Inc., NewYork, NY, 1987.
99. MediaWiki. http://www.mediawiki.org/wiki/MediaWiki.
100. P. Medstrand, L. N. van de Lagemaat, C. A. Dunn, J. R. Landry, D. Sven-back, and D. L. Mager. Impact of transposable elements on the evolutionof mammalian gene regulation. Cytogenetic and Genome Research, 110:342–352, 2005.
101. J. R. Miller, S. Koren, and G. Sutton. Assembly algorithms for next-generation sequencing data. Genomics, 95(6):315 – 327, 2010.
102. D. M. Mount. Bioinformatics: Sequence and Genome Analysis. Cold SpringHarbor Laboratory Press, Cold Spring Harbor, NY, second edition, 2004.
103. T. Naylor, J. Balintfy, D. Burdick, and K. Chu. Computer Simulation Tech-niques. John Wiley, New York, NY, 1966.
104. NCBI: National Center for Biotechnology Information.http://www.ncbi.nih.gov.
105. V. Nene, J. R. Wortman, D. Lawson, B. Haas, C. Kodira, Z. J. Tu,B. Loftus, Z. Xi, K. Megy, M. Grabherr, Q. Ren, E. M. Zdobnov, N. F.Lobo, K. S. Campbell, S. E. Brown, M. F. Bonaldo, J. Zhu, S. P. Sinkins,D. G. Hogenkamp, P. Amedeo, P. Arensburger, P. W. Atkinson, S. Bidwell,J. Biedler, E. Birney, R. V. Bruggner, J. Costas, M. R. Coy, J. Crabtree,M. Crawford, B. deBruyn, D. DeCaprio, K. Eiglmeier, E. Eisenstadt, H. El-Dorry, W. M. Gelbart, S. L. Gomes, M. Hammond, L. I. Hannick, J. R.Hogan, M. H. Holmes, D. Jaffe, J. S. Johnston, R. C. Kennedy, H. Koo,S. Kravitz, E. V. Kriventseva, D. Kulp, K. LaButti, E. Lee, S. Li, D. D.Lovin, C. Mao, E. Mauceli, C. F. M. Menck, J. R. Miller, P. Montgomery,
234
A. Mori, A. L. Nascimento, H. F. Naveira, C. Nusbaum, S. O’Leary, J. Orvis,M. Pertea, H. Quesneville, K. R. Reidenbach, Y.-H. Rogers, C. W. Roth,J. R. Schneider, M. Schatz, M. Shumway, M. Stanke, E. O. Stinson, J. M. C.Tubio, J. P. VanZee, S. Verjovski-Almeida, D. Werner, O. White, S. Wyder,Q. Zeng, Q. Zhao, Y. Zhao, C. A. Hill, A. S. Raikhel, M. B. Soares, D. L.Knudson, N. H. Lee, J. Galagan, S. L. Salzberg, I. T. Paulsen, G. Di-mopoulos, F. H. Collins, B. Birren, C. M. Fraser-Liggett, and D. W. Sever-son. Genome sequence of Aedes aegypti, a major arbovirus vector. Science,316(5832):1718–1723, June 2007.
106. M. Oliveira de Carvalho, J. Silva, and E. Loreto. Analyses of P -like trans-posable element sequences from the genome of Anopheles gambiae. InsectMolecular Biology, 13(1):55–63, 2006.
107. OpenMap. http://openmap.bbn.com.
108. phpExcelReader. http://sourceforge.net/projects/phpexcelreader.
109. B. R. Pittendrigh, J. M. Clark, J. S. Johnston, S. H. Lee, J. Romero-Severson,and G. A. Dasch. Sequencing of a new target genome: the Pediculus humanushumanus (Phthiraptera: Pediculidae) genome project. Journal of MedicalEntomology, 43(6):1101–1111, November 2006.
110. R. H. Plasterk, Z. Izsvak, and Z. Ivics. Resident aliens: the tc1/marinersuperfamily of transposable elements. Trends in Genetics, 15(8), August1999.
111. M. Pop, S. L. Salzberg, and M. Shumway. Genome sequence assem-bly:algorithms and issues. Computer, 35:47–54, 2002.
112. PostgreSQL. http://www.postgresql.org/.
113. K. D. Pruitt, T. Tatusova, W. Klimke, and D. R. Maglott. NCBI Refer-ence Sequences: current status, policy and new initiatives. Nucleic AcidsResearch, 37:D32–36, 2009.
114. QGIS: Quantum GIS. http://www.qgis.org.
115. H. Quesneville, C. M. Bergman, O. Andrieu, D. Autard, D. Nouaud, M. Ash-burner, and D. Anxolabehere. Combined evidence annotation of transposableelements in genome sequences. PLoS Computational Biology, 1, 2005.
116. H. Quesneville, D. Nouaud, and D. Anxolabehere. Detection of new trans-posable element families in Drosophila melanogaster and Anopheles gambiaegenomes. Journal of Molecular Evolution, 57, 2003.
235
117. H. Quesneville, D. Nouaud, and D. Anxolabehere. P elements and mite rela-tives in the whole genome sequence of Anopheles gambiae. BMC Genomics,7(214), 2006.
118. Repast. http://sourceforge.repast.net.
119. Repbase. http://www.girinst.org/repbase/index.html.
120. L. Roberts and John Janovy, Jr. Gerald D. Schmidt & Larry S. Roberts’Foundations of Parasitology. McGraw-Hill, eighth edition, 2009.
121. L. M. Rocha. From artificial life to semiotic agentmodels: Review and research directions. available athttp://informatics.indiana.edu/rocha/ps/agent review.pdf, Los AlamosNational Laboratory Complex Systems Modeling Team, 1999.
122. G. Rubin and A. Spradling. Genetic transformation of drosophila with trans-posable element vectors. Science, 218(4570):348–353, October 1982.
123. S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Pear-son Education, Inc., Upper Saddle River, NJ, 2003.
124. S. Saha, S. Bridges, and Z. V. Magbanua. Computational approaches andtools used in identification of dispersed repetitive dna sequences. TropicalPlant Biology, 1:85–96, 2008.
125. F. Sanger, S. Nicklen, and A. Coulson. Dna sequencing with chain-terminating inhibitors. In Proceedings of the National Academy of Sciences ofthe United States of America, volume 74, pages 5463–5467, December 1977.
126. A. Sarkar, R. Sengupta, J. Krzywinski, X. Wang, C. Roth, and F. Collins.P elements are found in the genomes of nematoceran insects of the genusAnopheles. Insect Biochemistry and Molecular Biology, 33(4):381–387, April2003.
127. A. Sarkar, C. Sim, Y. Hong, J. Hogan, M. Fraser, H. Robertson, andF. Collins. Molecular evolutionary analysis of the widespread piggyBac trans-poson family and related “domesticated” sequences. Molecular Genetics andGenomics, 270(2):173–180, 2003.
128. R. E. Shannon. Introduction to the art and science of simulation. In Pro-ceedings of the 1998 Winter Simulation Conference, pages 7–14, 1998.
129. J. A. Shapiro. The discovery and significance of mobile genetic elements. InD. J. Sherratt, editor, Mobile Genetic Elements. Oxford University Press,1995.
236
130. R. K. Slotkin and R. Martienssen. Transposable elements and the epigeneticregulation of the genome. Nature Reviews Genetics, 8(4):272–285, April 2007.
131. SOAP. http://www.w3.org/TR/soap/.
132. M. W. Southern. An Assessment of Potential Habitat Corridors and Land-scape Ecology for Long-Tailed Macaques (Macaca fascicularis) on Bali, In-donesia. Master’s Thesis, Central Washington University, June 2002.
133. J. E. Stajich, D. Block, K. Boulez, S. E. Brenner, S. A. Chervitz, C. Dagdi-gian, G. Fuellen, J. G. Gilbert, I. Korf, H. Lapp, H. Lehvaslaiho, C. Matsalla,C. J. Mungall, B. I. Osborne, M. R. Pocock, P. Schattner, M. Senger, L. D.Stein, E. Stupka, M. D. Wilkinson, and E. Birney. The Bioperl Toolkit: PerlModules for the Life Sciences. 12(10):1611–1618, October 2002.
134. TEfam. http://tefam.biochem.vt.edu/tefam/.
135. L. Temime, Y. Pannet, L. Kardas, L. Opatowski, D. Guillemot, and P. Y.Boelle. NOSOSIM: an agent-based model of pathogen circulation in a hos-pital ward. In L. Yilmaz, editor, Proceedings of the 2009 Agent-DirectedSimulation Symposium. The Society for Modeling and Simulation Interna-tional, March 2009.
136. S. Tempel, M. Jurka, and J. Jurka. VisualRepbase: an interface for thestudy of occurrences of transposable element families. BMC Bioinformatics,8(345), 2008.
137. TESeeker. http://www.nd.edu/˜teseeker.
138. Z. Tu and C. Coates. Mosquito transposable elements. Insect Biochemistryand Molecular Biology, 34:631–644, 2004.
139. Z. Tu and S. Li. Mobile genetic elements of malaria vectors and othermosquitoes. In P. J. Brindley, editor, Mobile Genetic Elements in Meta-zoan Parasites. Landes Bioscience, September 2008.
140. J. M. C. Tubıo, H. Naveira, and J. Costas. Structural and EvolutionaryAnalyses of the Ty3/gypsy Group of LTR Retrotransposons in the Genomeof Anopheles gambiae. Molecular Biology and Evolution, 22(1):29–39, 2005.
141. UniProt Consortium. The Universal Protein Resource (UniProt) in 2010.Nucleic Acids Research, 38:D142–148, 2010.
142. University of Notre Dame Center for Research Computing.http://crc.nd.edu.
237
143. VectorBase: A Bioinformatics Resource Center for Invertebrate Vectors ofHuman Pathogens. http://www.vectorbase.org.
144. VirtualBox. http://www.virtualbox.org.
145. J. L. Weber and E. W. Myers. Human whole-genome shotgun sequencing.Genome Research, 7:401–409, 1997.
146. J. D. Westervelt. Geographic information systems and agent-based model-ing. In H. R. Gimblett, editor, Integrating Geographic Information Systemsand Agent-based Modeling Techniques for Simulating Social and EcologicalProcesses. Oxford University Press, 2002.
147. B. P. Wheatley. The Sacred Monkeys of Bali. Waveland Press, 1999.
148. WikiPoson. http://www.bioinformatics.org/wikiposon/doku.php.
149. X. Xiang, R. Kennedy, G. Madey, and S. Cabaniss. Verification and val-idation of agent-based scientific simulation models. In L. Yilmaz, editor,Proceedings of the 2005 Agent-Directed Simulation Symposium, volume 37,pages 47–55. The Society for Modeling and Simulation International, April2005.
238