Building CryptoDB using GUS
Mark Heiges
Center for Tropical and Emerging Global Diseases
University of Georgia
mheiges@uga.edu
[Figure: system overview. Labels: Genomic Data, Analysis Results, GUS, Plugins, Tomcat, WDK, Apache]
[Figure: architecture diagram. Labels: GUS; External Resources: NCBI Taxonomy (SRes), SO (SRes), NRDB (DoTS), our data (DoTS); Plugins; Analysis Input: contigs, proteins, NRDB; helper script; Analysis Results; Web Development Kit]
Site Design Considerations
• data types we wanted to warehouse
• additional analyses desired
• how to load data into GUS
• how to visualize data
– tables
– text
– graphics (interactive, static)
• what types of questions will be asked of the data
Deciding Factors
• What data was available.
• What the research community needed.
• What we could accomplish by the contractual deadline for our first release.
Crypto External Resource Data
• Genomic sequence and gene annotations for two species (GenBank)
– sequence
– CDS translations
– gene product descriptions
– exon coordinates
– RNA type (mRNA, tRNA, snoRNA, rRNA)
– other features
• EST/mRNA (GenBank)
Auxiliary Data Required
• NRDB
• NCBI Taxonomy Reference
• Sequence Ontology Definitions
GUS Plugins
• Perl modules for loading data into GUS
– facilities to connect to the GUS Perl object layer and the database
– process command line arguments
– create tracking information in the database
– log and handle errors
GUS Plugins
• Supported and Community plugins bundled with GUS
• Plugins are versioned
• Each plugin version must be registered with GUS before use
– records CVS version and MD5 checksum
– auditing
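Recording a checksum at registration time makes it possible to detect later that a plugin file has changed. A minimal Python sketch of that idea follows. It is not the GUS implementation: the in-memory `registry` dictionary and the `register` and `is_registered` functions are hypothetical names, and the real bookkeeping lives in GUS database tables.

```python
import hashlib

def md5_of(source: bytes) -> str:
    """Return the hex MD5 digest of a plugin's source bytes."""
    return hashlib.md5(source).hexdigest()

# Hypothetical in-memory registry: (plugin name, version) -> checksum.
registry = {}

def register(name: str, version: str, source: bytes) -> None:
    """Remember the checksum of this plugin version."""
    registry[(name, version)] = md5_of(source)

def is_registered(name: str, version: str, source: bytes) -> bool:
    """True only if this exact file content was registered under this version."""
    return registry.get((name, version)) == md5_of(source)

# Made-up plugin content, for illustration only.
original = b"package LoadExample; 1;\n"
register("LoadExample", "1.1", original)

print(is_registered("LoadExample", "1.1", original))         # True
print(is_registered("LoadExample", "1.1", original + b"#"))  # False
```

An edited file no longer matches the stored checksum, so it would have to be registered again as a new version before use.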
Data Loading at CryptoDB
• Install GUS
• Register selected plugins
• Load Controlled Vocabularies
– NCBI Taxonomy
– Sequence Ontology Definitions
• Load Crypto annotated sequences from GenBank records
• Load NRDB from FASTA file
Data Loading at CryptoDB
• Load Crypto mRNA GenBank records
• Load ESTs from U Penn's database of NCBI's dbEST
CryptoDB Analyses
• BLASTP - compare annotated proteins to NRDB
• BLASTX - compare whole genome to NRDB
• BLASTN - synteny comparison of the two Crypto species we host
• EST/mRNA clustering and alignment
• signal peptide predictions
• transmembrane predictions
Analysis Workflow
• Load Source Data into GUS (NRDB, genomic seqs)
• Dump same data from GUS with GUS Ids
• Perform analysis with this data (BLASTX)
• Load results into GUS
• GUS Ids allow results to be linked back to analysis input data
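The role of the ids in this workflow can be sketched in Python. This is an illustration only: the `sequences` dictionary stands in for GUS tables, `fake_analysis` stands in for BLAST, and every name and value is made up.

```python
# Stand-in for sequences already loaded into the database: primary key -> sequence.
sequences = {101: "MKTAYIAKQR", 102: "MSDNELLQ", 103: "MAAAKKR"}

def dump_fasta(seqs):
    """Dump sequences as FASTA text, using the database id as the defline."""
    return "".join(f">{pk}\n{seq}\n" for pk, seq in sorted(seqs.items()))

def fake_analysis(fasta_text):
    """Placeholder analysis: report the length of each record, keyed by its defline id."""
    results = []
    current_id = None
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            current_id = int(line[1:])
        else:
            results.append({"query_id": current_id, "length": len(line)})
    return results

def link_back(results, seqs):
    """Join each result row to its input sequence through the shared id."""
    return [(row["query_id"], seqs[row["query_id"]], row["length"]) for row in results]

linked = link_back(fake_analysis(dump_fasta(sequences)), sequences)
for pk, seq, length in linked:
    print(pk, seq, length)
```

Because the analysis input carries the database id rather than an external accession, no name matching is needed when the results come back.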
[Figure: architecture diagram, repeated with an added Analysis box]
Data Analysis - BLASTP
• Dump NRDB records from GUS to FASTA file - with GUS Ids
• Dump annotated protein sequences from GUS to FASTA file - with GUS Ids
Example record:
>336 source_id=0703290B secondary_identifier=223280 tubulin alpha length=411
TIGGGDDSFNTFFSETGAGKHVPRAVFVDLEPTVIDEVRTGTYRQLFHPEQLITGKEDAANNYARGHYTIGKEIIDLVLDRIRKLADQCTGLQGFSVFHSFGGGTGSGFTSLLMERLSVDYGKKSKLEFSIYPARQVSTAVVEPYNSILTTHTTLEHSDCAFMVDNEAIYDICRRNLDIERQVSTAVVEPYNSILTTHTTLEHSDCAFMVDNEAIYDICRRNLDIE
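The defline of the FASTA record on this slide can be taken apart with a few lines of Python. The layout assumed here (a leading id, `key=value` pairs, and free text as the description) is inferred from that single example, not from a GUS specification, and `parse_defline` is a hypothetical helper.

```python
def parse_defline(line):
    """Split a defline into its leading id, key=value fields, and free-text description."""
    tokens = line.lstrip(">").split()
    record = {"gus_id": tokens[0]}
    description = []
    for token in tokens[1:]:
        if "=" in token:
            key, value = token.split("=", 1)
            record[key] = value
        else:
            description.append(token)
    record["description"] = " ".join(description)
    return record

defline = ">336 source_id=0703290B secondary_identifier=223280 tubulin alpha length=411"
parsed = parse_defline(defline)
print(parsed["gus_id"], parsed["source_id"], parsed["description"])
```

All values are kept as strings; converting `length` to an integer is left to the caller.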
Data Analysis - BLASTP
• Run BLASTP algorithm with these two GUS Id labeled datasets
– used a Perl wrapper to BLAST executable, included with GUS... plugin compatible output
• Load BLAST results with plugin
– ga GUS::Common::Plugin::LoadBlastSimFast --file blastSimilarity.out --restartAlgInvs "" --queryTable DoTS::ExternalNASequence --subjectTable DoTS::ExternalAASequence --commit
Post Data Loading
• Find where the results were loaded
– read documentation
• ga GUS::Common::LoadBLAST --help
– looked in plugin source code
– asked other users
– gusdb.org schema browser
– fishing expeditions in GUS tables
Getting Our Database On Line
Web Development Kit (WDK)
• provides accelerated development of database driven web sites
– define questions and records in model XML file
– default JavaServer Pages (JSP) views provided
• not specific to GUS
• can be used with any RDBMS
WDK Question - Summary - Record Paradigm
• Users supply parameter values to a canned question on the website
– "Which genes have at least __ exons?"
• The result is returned in summary pages that list links to the record pages
• Record page - detailed view of data object
– text
– graphics
– tables
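The question, summary, and record steps can be illustrated with a small Python sketch against an in-memory SQLite database. Nothing here is WDK code: the `gene` table, its rows, the `showRecord?id=` link format, and the three function names are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gene (source_id TEXT PRIMARY KEY, product TEXT, exon_count INTEGER);
    INSERT INTO gene VALUES ('geneA', 'tubulin alpha', 3);
    INSERT INTO gene VALUES ('geneB', 'hypothetical protein', 1);
    INSERT INTO gene VALUES ('geneC', 'actin', 5);
""")

def ask_genes_by_min_exons(min_exons):
    """Question: 'Which genes have at least __ exons?' Returns the matching ids."""
    rows = conn.execute(
        "SELECT source_id FROM gene WHERE exon_count >= ? ORDER BY source_id",
        (min_exons,),
    )
    return [row[0] for row in rows]

def summary(ids):
    """Summary: one line per hit, each pointing at a record page."""
    return [f"{gene_id} -> showRecord?id={gene_id}" for gene_id in ids]

def record(gene_id):
    """Record: the detailed view of a single data object."""
    row = conn.execute(
        "SELECT source_id, product, exon_count FROM gene WHERE source_id = ?",
        (gene_id,),
    ).fetchone()
    return dict(zip(("source_id", "product", "exon_count"), row))

hits = ask_genes_by_min_exons(3)
print(summary(hits))
print(record(hits[0]))
```

The question only returns ids; everything shown about a gene is fetched separately by the record step, which is what lets the same record page serve many different questions.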
[Figure: panels labeled Questions, Summary, and Record]
WDK Model - View - Controller architecture
• Model XML configuration defines
– questions
– answer summaries
– records
• View
– displays the model
– defined in customizable JavaServer pages
• Controller
– internal, not configurable
WDK Setup
• build
• write WDK model (WDK comes with Toy site - spent some time with that beforehand)
• test model from command line
• install WDK into Tomcat
• customize the view (JSP) pages
• integrate Tomcat with Apache - personal preference
WDK Model: Defining Questions
<question name="GeneByContig"
          displayName="Genes by Contig"
          queryRef="GeneFeatureIds.GeneByContig"
          summaryAttributesRef="source_id,product,organism,contig"
          recordClassRef="GeneRecordClasses.GeneRecordClass">
    <description>Find gene located on a given contig</description>
</question>
<sqlQuery name="GeneByContig" displayName="By Contig" isCacheable='true'>
    <description>
        Find Genes By Contig ID.
    </description>
    <paramRef ref="params.contig"/>
    <column name="source_id" isInternal="false"/>
    <sql>
        <!-- use CDATA because query includes angle brackets -->
        <![CDATA[
        select g.source_id
        from dots.genefeature g, dots.naentry nae,
             dots.sequencetype st, dots.externalNAsequence enas
        where nae.na_sequence_id = g.na_sequence_id
          and enas.sequence_type_id = st.sequence_type_id
          and enas.na_sequence_id = nae.na_sequence_id
          and st.name = 'contig'
          and nae.source_id = '$$contig$$'
        ORDER BY g.source_id
        ]]>
    </sql>
</sqlQuery>
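To see how a user-supplied value reaches the `$$contig$$` macro, here is a hedged Python sketch that runs a cut-down version of the query against an in-memory SQLite database. How the WDK itself expands the macro is not shown in the slides; this sketch swaps in a different technique and rewrites the quoted macro as a bind parameter. The two-table schema, the sample rows, and the helper names are made up for the example.

```python
import re
import sqlite3

# Simplified form of the slide's query: two tables instead of four.
SQL_TEMPLATE = """
    SELECT g.source_id
    FROM genefeature g, naentry nae
    WHERE nae.na_sequence_id = g.na_sequence_id
      AND nae.source_id = '$$contig$$'
    ORDER BY g.source_id
"""

def to_bound_sql(template):
    """Rewrite each quoted '$$name$$' macro as a named bind parameter (:name)."""
    return re.sub(r"'\$\$(\w+)\$\$'", r":\1", template)

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE naentry (na_sequence_id INTEGER, source_id TEXT);
    CREATE TABLE genefeature (na_sequence_id INTEGER, source_id TEXT);
    INSERT INTO naentry VALUES (1, 'contigX');
    INSERT INTO naentry VALUES (2, 'contigY');
    INSERT INTO genefeature VALUES (1, 'gene2');
    INSERT INTO genefeature VALUES (1, 'gene1');
    INSERT INTO genefeature VALUES (2, 'gene3');
""")

def genes_by_contig(contig):
    """Run the query with the user's contig id bound as a parameter."""
    rows = conn.execute(to_bound_sql(SQL_TEMPLATE), {"contig": contig})
    return [row[0] for row in rows]

print(genes_by_contig("contigX"))  # ['gene1', 'gene2']
```

Binding the value rather than pasting it into the SQL text keeps a contig id that contains a quote character from changing the query.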
WDK Model - Record
<recordClass idPrefix="" name="GeneRecordClass" type="Gene"
             attributeOrdering="source_id,exoncount,overview, product,linkout,dnaContext,genomeCompare,tmdata,blastpgraphic, translation,sequence,reference">
    <attributeQueryRef ref="GeneAttributes.GeneAttrs"/>
    <attributeQueryRef ref="GeneAttributes.ExonCount"/>
    <attributeQueryRef ref="GeneAttributes.TMCount"/>
    <tableQueryRef ref="GeneTables.BlastP"/>
    <textAttribute name="overview" displayName="Overview">
        <text>
            <![CDATA[
            This <b><i>$$organism$$</i></b> gene spans positions
            <b>$$start_max$$</b> - <b>$$end_min$$</b> of contig
            <a href="showRecord.do?id=$$contig$$"><b>$$contig$$</b></a>
            which maps to chromosome <b>$$chromosome$$</b>
            ]]>
        </text>
    </textAttribute>
</recordClass>
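The text attribute is a template whose `$$name$$` macros are filled from the record's attribute values. A minimal Python sketch of that substitution follows; it is an assumption about the general mechanism rather than WDK source code, and the shortened template, the `fill` helper, and the sample values are invented for illustration.

```python
import re

# Shortened version of the slide's overview text.
TEMPLATE = (
    "This <b><i>$$organism$$</i></b> gene spans positions "
    "<b>$$start_max$$</b> - <b>$$end_min$$</b> of contig <b>$$contig$$</b>"
)

def fill(template, values):
    """Replace every $$name$$ macro with str(values[name]); unknown names raise KeyError."""
    return re.sub(r"\$\$(\w+)\$\$", lambda match: str(values[match.group(1)]), template)

# Made-up attribute values for one record.
row = {"organism": "Example organism", "start_max": 100, "end_min": 900, "contig": "contigX"}
print(fill(TEMPLATE, row))
```

Raising on an unknown macro name, instead of leaving it in the page, makes a typo in the model file show up during testing.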
Testing the Model: command line tools
• wdkXml - check xml syntax
• wdkSummary - test a summary
• wdkQuery - run specific query
• wdkRecord - test a record
• wdkSanityTest - exercises all queries and records
• wdkCache
Install WDK into Tomcat
• follow the installation instructions carefully
• relies on symbolic links from Tomcat webapp to $GUS_HOME
– disallowed by default Tomcat configuration
• keep an eye on Tomcat logs for troubleshooting
• reload the webapp when model changes
– retest on command line
– don't forget about the cache
WDK Default View
CryptoDB Custom View
• Made style changes, added site branding
• Added additional form elements
– radio buttons, check boxes
• 'Flattened out' the questions
CryptoDB Custom View
• Record pages - alterations to achieve the desired ordering and placement of text, tables and graphics
• Standard JSP tags to embed external objects
– GBrowse graphic
Integrate Tomcat with Apache
• Apache front end answers all web requests
• Serves the static pages and CGI tools
– BLAST interface
– motif search
– BLASTX keyword search
• Calls to the WDK are passed to Tomcat
[Figure: architecture diagram, repeated with an added Pipeline label]