
Post on 26-Jan-2020


Scientific Databasing with TreeGenes:

Genotype, Phenotype, & Environment

Jill Wegrzyn
Department of Ecology & Evolutionary Biology
Institute for Systems Genomics: Computational Biology Core
University of Connecticut, Storrs CT

treegenesdb.org

Big Data in Genomics

“Compared genomics with three other major generators of Big Data: Astronomy, YouTube, and Twitter... Genomics is either on par with or the most demanding of the domains analyzed here in terms of data acquisition, storage, distribution, and analysis”

Unit        Size
Byte        1
Kilobyte    1,000
Megabyte    1,000,000
Gigabyte    1,000,000,000
Terabyte    1,000,000,000,000
Petabyte    1,000,000,000,000,000
Exabyte     1,000,000,000,000,000,000
Zettabyte   1,000,000,000,000,000,000,000

Mostly Genomic but… Proteomics, Phenomics, Metabolomics…

• Kb = 1,000 bp
• Mb = 1 x 10^6 bp
• Gb = 1 x 10^9 bp
• Tb = 1 x 10^12 bp
• Pb = 1 x 10^15 bp

Genomes are vast information repositories

[Figure: genome sizes on a scale marked 1 Gb, 10 Gb, 100 Gb; Human: 3 Gb]
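
The unit prefixes above translate directly into storage estimates. The short sketch below converts a genome size to base pairs and estimates raw storage, assuming an idealized 2 bits per base (four nucleotides, with no ambiguity codes, quality scores, or headers). The function names are illustrative only.

```python
# Base-pair unit multipliers, matching the prefixes listed above.
BP_UNITS = {"bp": 1, "Kb": 10**3, "Mb": 10**6, "Gb": 10**9, "Tb": 10**12, "Pb": 10**15}

def to_bp(value, unit):
    """Convert a genome size such as (3, 'Gb') to a base-pair count."""
    return int(value * BP_UNITS[unit])

def packed_bytes(n_bp, bits_per_base=2):
    """Bytes needed to store n_bp bases at a fixed bit width, rounded up."""
    return (n_bp * bits_per_base + 7) // 8

human = to_bp(3, "Gb")
print(human)                       # 3000000000
print(packed_bytes(human) / 1e6)   # 750.0 (megabytes, decimal units)
```

Real sequencing output is far larger than this lower bound, since reads carry quality scores and each position is typically covered many times.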

Sensors & Metadata: Sequencers, Microscopy, Imaging, Mass spec, Metadata & Ontologies

IO Systems: Hard drives, Networking, Databases, Compression, LIMS

Compute Systems: CPU, GPU, Distributed, Clouds

Scalable Algorithms: Streaming, Sampling, Indexing

Machine Learning: classification, modeling, visualization & data integration

Results: Domain Knowledge

Acquiring Knowledge through Big Data

Gene Conservation of Tree Species – Banking on the Future (2016)

• Survey Conducted
– Breeders, Geneticists, Land Managers, and Ecologists
– 31 Questions
  • Trees (greenhouse, plots, landscape, numbers, species)
  • Data collection (devices, software)
  • Analytical tools (statistical, databases)
  • Data storage
  • Challenges
– 283 Respondents

Gene Conservation of Tree Species – Banking on the Future (2016)

[Bar chart, axis 0 to 80. Categories: Computational Resources; Formatting Data; Hosting Data on the Web; Accessing Data from Databases; Integrating Data Across Databases; Scripting Support to Extract Information]

Motivation (Data Provider)

• Support next-generation data requirements for the biological database
– Increased quantity and availability of new data
– Support data integration across resources
– Support complex data analytics
– Move data efficiently

TreeGenes Database: History

– Began as a repository for forest tree genetic maps and associated markers
– Expanded to other data types
  • Sequence
    – Resequencing, Large-Scale Genotyping, Transcriptomics/Expression
    – Full Genome Sequences
  • Analysis and Visualization Tools
    – Ability for users to mine the data
  • Resources for the user community
    – Literature, Colleagues

TreeGenes Database: Users

[Figure: Unique web visitors to the TreeGenes Database per month, January-December 2016; axis mark at 10,000]

2,086 users from 862 organizations in 94 countries

• 1,774 species from 101 genera
– At least one genetic artifact from each species
• Full genome sequence: 21 species
• Transcriptome/Expression resources: 4,120,817 sequences from 283 species
• 106 genetic maps from 35 species

TreeGenes Database: Species


TreeGenes Database: Data Sources

Primary data sources (semi-automated)
• Primary databases such as NCBI/EBI
• Appropriate data should be submitted to primary databases
• Consistent with changing standards
– Currently no repository for non-human SNPs (new!)

User submissions
• For data and metadata not captured well by primary databases (Journals)

Project submissions
• Internal project management (private to public)

Curated Sources
• Phytozome and PlantGDB
• PLAZA (OrthoFinder)
• TRY-DB (Phenotypes)
• Dryad (Flat files)

Data that is not collected!

TreeGenes Database: Data Sources

Submit genetic maps, association or population study data

Most submissions result from journal requirements: Tree Genetics and Genomes, New Phytologist, and Forests

Population Study
• Publication
• Species

Study Design
• Landscape
• Common Garden
• Greenhouse
• Growth Chamber
• Breeding (Plot)

Phenotype, Genotype, Environment
• Georeferenced

Raw Data
• Trees
• Genotypes
• Phenotypes
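
One way to picture a submission with the parts listed above is as a nested record. The sketch below is hypothetical: the class names, field names, and example values are invented for illustration and do not reflect the actual TreeGenes submission schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional

class StudyDesign(Enum):
    LANDSCAPE = "landscape"
    COMMON_GARDEN = "common garden"
    GREENHOUSE = "greenhouse"
    GROWTH_CHAMBER = "growth chamber"
    BREEDING_PLOT = "breeding (plot)"

@dataclass
class Tree:
    tree_id: str
    latitude: Optional[float] = None    # set only when the tree is georeferenced
    longitude: Optional[float] = None
    genotypes: Dict[str, str] = field(default_factory=dict)     # marker -> call
    phenotypes: Dict[str, float] = field(default_factory=dict)  # trait -> value

@dataclass
class PopulationStudy:
    publication: str
    species: str
    design: StudyDesign
    trees: List[Tree] = field(default_factory=list)

    def georeferenced(self) -> List[Tree]:
        """Trees that carry both coordinates."""
        return [t for t in self.trees
                if t.latitude is not None and t.longitude is not None]

# Made-up example values, for illustration only.
study = PopulationStudy("placeholder citation", "Pinus taeda", StudyDesign.COMMON_GARDEN)
study.trees.append(Tree("T1", 35.9, -79.0, {"snp_1": "A/G"}, {"height_m": 12.5}))
study.trees.append(Tree("T2"))
print(len(study.georeferenced()))  # 1
```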


TreeGenes Database: Data Sources

Metadata on published studies!

TreeGenes Database: Data Sources

Genetic maps, association or population studies


Obtain TGDR accession number!

Tripal: open source content management system (CMS) and database for biological data

Modules for genetic, genomic, and breeding data generated through a CMS and standardized schema

Benefits:
• Reduces development costs
• Provides an API for complete customization
• Uses GMOD Chado and community ontologies for standardization
• Access control for user/user groups
• Allows for sharing of extensions between sites
– Implemented in over 30 databases!
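
To illustrate how a shared schema and ontology terms standardize records, the sketch below builds a few tables modeled on Chado's cv, cvterm, organism, and feature tables. It is heavily simplified and runs on SQLite for convenience: real Chado targets PostgreSQL and defines many more columns and constraints, and the inserted rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cv (cv_id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE cvterm (
    cvterm_id INTEGER PRIMARY KEY,
    cv_id     INTEGER NOT NULL REFERENCES cv(cv_id),
    name      TEXT NOT NULL
);
CREATE TABLE organism (
    organism_id INTEGER PRIMARY KEY,
    genus       TEXT NOT NULL,
    species     TEXT NOT NULL
);
CREATE TABLE feature (
    feature_id  INTEGER PRIMARY KEY,
    organism_id INTEGER NOT NULL REFERENCES organism(organism_id),
    uniquename  TEXT NOT NULL,
    type_id     INTEGER NOT NULL REFERENCES cvterm(cvterm_id)
);
""")

# The feature's type is not free text: it points at a controlled-vocabulary term.
conn.execute("INSERT INTO cv (cv_id, name) VALUES (1, 'sequence')")
conn.execute("INSERT INTO cvterm (cvterm_id, cv_id, name) VALUES (1, 1, 'gene')")
conn.execute("INSERT INTO organism (organism_id, genus, species) VALUES (1, 'Pinus', 'taeda')")
conn.execute("INSERT INTO feature (organism_id, uniquename, type_id) "
             "VALUES (1, 'example_gene_001', 1)")

rows = conn.execute("""
SELECT o.genus, o.species, f.uniquename, t.name
FROM feature f
JOIN organism o ON o.organism_id = f.organism_id
JOIN cvterm t   ON t.cvterm_id = f.type_id
""").fetchall()
print(rows)  # [('Pinus', 'taeda', 'example_gene_001', 'gene')]
```

Because every site types its records against the same vocabulary, a query written for one installation has a reasonable chance of working on another.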

Current State of Tripal

• http://tripal.info
• Content Management System for Biological Data
• Over 100 Installations
• Current Version 2.0

Tripal Gateway Project (Data Provider)

• Support next-generation data requirements for the biological database
• Tripal Gateway Project
– Increased quantity and availability of new data
– Support data integration across resources (Web Services) – Tripal Exchange (v3.0)
– Support complex data analytics (Integration with Galaxy API)
– Move data efficiently (Software Defined Networking – Tripal Data Transfer BDSS)

Tripal Gateway Project: Tree (& Legume) Databases

[Diagram groups: Project PIs; Collaborating Databases; Data Analysis Collaborators; Data Transfer Collaborators]

Alex Feltus, Kuangching Wang, Clemson Univ.
• Data Transfer, SDN, SOS

Dorrie Main, Sook Jung, Stephen Ficklin, Washington State University
• Genome Database for Rosaceae
• Cool Season Food Legumes
• Citrus Genome Database

Kirstin Bett, Lacey Sanderson, Univ. of Saskatchewan
• KnowPulse

Jill Wegrzyn, University of Connecticut
• TreeGenes

Steve Cannon, Ethy Cannon, Iowa State; Andrew Farmer, NCGR
• LegumeInfo, PeanutBase

Meg Staton, University of Tennessee
• Hardwood Genomics

University of Utah, NSF ACI-REF Collaborators

Galaxy Project; Texas Advanced Computing Center, public Galaxy Server

TreeGenes Database: Interfaces

Web-based framework (Galaxy) promotes genomics analysis

Integrating Galaxy with Tripal

Data analysis brought to the user via the database with Galaxy Workflows

DNA Sequence Data
• Re-sequencing alignment
• Variant discovery (against the reference)
• Variant discovery (between samples)
• Prediction of functional genetic variants
• Association Genetics
• Functional Annotation

RNA Sequence Data
• Transcriptome assembly
• Alignment to a reference
• Differential Expression analysis
• Gene co-expression network construction
• miRNA analysis

BDSS: Big Data Smart Socket

• Smart Data Transfer
• Standalone client with a metadata repository
• First step is to build an inventory of data sources relevant to a particular user community
  • NCBI (GenBank for Raw Data)
  • CyVerse (iPlant for analytics)
  • Tripal supported websites for supporting data
• Determines optimal method for data transfer for each data source through testing
• Data transfer methodology is encoded into the metadata repository
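
The idea of testing each source and recording the best transfer method can be sketched as follows. This is not the BDSS implementation: the repository structure, the method names ("https", "parallel"), the example URLs, and the fixed timings are all assumptions made for illustration.

```python
from typing import Callable, Dict

# Hypothetical metadata repository: URL prefix -> preferred transfer method.
repository: Dict[str, str] = {}

def benchmark(source_prefix: str, trials: Dict[str, Callable[[], float]]) -> str:
    """Run each candidate method's timing probe and record the fastest one."""
    timings = {method: probe() for method, probe in trials.items()}
    best = min(timings, key=timings.get)
    repository[source_prefix] = best
    return best

def method_for(url: str, default: str = "https") -> str:
    """Look up the recorded method using the longest matching prefix."""
    matches = [p for p in repository if url.startswith(p)]
    return repository[max(matches, key=len)] if matches else default

# Stand-in probes returning seconds; a real client would time test transfers.
benchmark("https://source-a.example/", {"https": lambda: 9.5, "parallel": lambda: 3.2})
benchmark("https://source-b.example/", {"https": lambda: 1.1, "parallel": lambda: 1.4})

print(method_for("https://source-a.example/reads/run1.fastq"))  # parallel
print(method_for("https://source-b.example/genome.fa"))         # https
print(method_for("https://elsewhere.example/file"))             # https
```

Keeping the decision in a repository rather than in the client means the choice can be re-tested and updated per source without changing user workflows.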


BDSS: Moving data efficiently

Tripal Gateway: Use Cases

Tripal Gateway:

1. A user could search across community DBs for their set of SNPs of interest (from a genotyping array) using Tripal Exchange.

2. The probe sequences could be gathered as a list and transferred to the user with the Data Transfer (BDSS) tool.

3. If the user prefers to use Galaxy for analysis, the transfer could load the probes into the Tripal Galaxy module and align them to a recently released genome reference.

4. A basic workflow for alignment could be selected along with the appropriate target in Galaxy.
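
The first two steps of this use case can be sketched with in-memory stand-ins. Everything here is hypothetical: the database names, SNP identifiers, and probe sequences are invented, and no Tripal Exchange or BDSS call is made. Steps 3 and 4 (Galaxy) are not covered.

```python
# In-memory stand-ins for community databases: SNP id -> probe sequence.
community_dbs = {
    "db_a": {"snp_001": "ACGTACGT", "snp_002": "TTGACCAA"},
    "db_b": {"snp_002": "TTGACCAA", "snp_003": "GGCATCGA"},
}

def search_snps(snp_ids, dbs):
    """Step 1: collect probes for the requested SNPs across all databases.

    The first database that holds a SNP wins; later duplicates are skipped.
    """
    found = {}
    for db_name, records in dbs.items():
        for snp in snp_ids:
            if snp in records and snp not in found:
                found[snp] = (db_name, records[snp])
    return found

def to_fasta(found):
    """Step 2: serialize the probes as FASTA text, ready to hand to a transfer tool."""
    return "".join(f">{snp} source={db}\n{seq}\n"
                   for snp, (db, seq) in sorted(found.items()))

hits = search_snps(["snp_002", "snp_003", "snp_999"], community_dbs)
print(to_fasta(hits), end="")
# >snp_002 source=db_a
# TTGACCAA
# >snp_003 source=db_b
# GGCATCGA
```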


TreeGenes Database: Data Sources

In addition to:
• Internal projects
• TreeSNAP (public)
• Dryad
• TRY-DB

TreeGenes Database: CartograTree

– Providing context to geo-referenced data
– Data from TreeGenes, WorldClim, Ameriflux, TRY-DB

TreeGenes Database: Interfaces

– Retrieve genotype, phenotype, environmental, and sequence data

– Further analysis (MUSCLE, TASSEL, PAML) via SSWAP


TreeGenes Database: SSWAP

– SSWAP “reasons” over the input data and responds with relevant applications

– Send data through pipeline with selection (parameters)


TreeGenes Database: CyVerse (TACC)

– Connect with CyVerse Views
– Download data locally or maintain on cloud-based storage

CartograTree: Current Development

• Flexible georeferenced tagging
  • Approximate
  • Exact
  • Obscured (radius)

• Environmental layers (Geoserver)
  • Soil
  • Fire/Drought
  • Climate models
  • LIDAR

• Integration with Tripal
  • User control of workspace
  • Ability to upload their own trees/phenotypes

• Connection with Galaxy framework
  • More analytical options (PLINK, TASSEL, MSA, PAML)
  • Intelligent workflows
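
One common way to obscure a location within a radius is to displace it by a random bearing and distance. The sketch below shows that approach under a flat-earth approximation; it is an assumption about what "Obscured (radius)" could mean, not a description of how CartograTree or TreeSNAP implements it, and the coordinates are illustrative.

```python
import math
import random

KM_PER_DEG_LAT = 111.32  # approximate length of one degree of latitude

def obscure(lat, lon, radius_km, rng=random):
    """Return a point displaced by a random bearing and distance within radius_km.

    Flat-earth approximation: adequate for small radii away from the poles.
    """
    bearing = rng.uniform(0, 2 * math.pi)
    # The square root spreads points evenly over the disc instead of
    # clustering them near the true location.
    distance = radius_km * math.sqrt(rng.random())
    dlat = distance * math.cos(bearing) / KM_PER_DEG_LAT
    dlon = distance * math.sin(bearing) / (KM_PER_DEG_LAT * math.cos(math.radians(lat)))
    return lat + dlat, lon + dlon

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers on a sphere of radius 6371 km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    a = (math.sin((p2 - p1) / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(math.radians(lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

exact = (41.81, -72.25)  # illustrative coordinates
shown = obscure(*exact, radius_km=5.0, rng=random.Random(0))
print(round(haversine_km(*exact, *shown), 2), "km from the true location")
```

Passing a seeded `random.Random` makes the displacement reproducible for testing; a deployed system would also need to keep the offset stable per tree so that repeated views cannot be averaged to recover the true point.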


CartograTree: TreeSNAP

• Validated accessions from TreeSNAP (obscured)


CartograTree: Galaxy Workflows

• Transcriptomics
• Exome Capture
• RNA-Seq
• Genotyping Array (Affy, Illumina)
• Whole Genome Resequencing
• GBS (RAD-Seq, ddRAD-Seq)

CartograTree: Advanced Interface

• 142 species
• 27,913 TGDR
• 17,412 Inventory
• 26,332 TRY-DB
• 815 TreeSNAP
• Release Date: December 2017

TreeGenes Database: Team

Project Leads
• Jill Wegrzyn
• Emily Grau
• Nic Herndon

Advising
• Damian Gessler, Semantic Options

Project Developers
• Sean Buehler
• Taylor Falk
• Peter Richter
• Clayton Michael

Collaborators
• Stephen Ficklin (Tripal)
• Alex Feltus (BDSS)
• Meg Staton (HWG)
• Dorrie Main (GDR)

Contact
tg-help@gmail.com
@TreeGenes TreeGenes Database