Data Mining (University of Virginia, people.virginia.edu/~wrp/cshl06/pdf/birney_ensmart.pdf)
EnsMart
Ewan Birney, European Bioinformatics Institute (EBI)
Data Mining...
• ...Is more than a buzz-word
• Most molecular biology is moving away from one-gene-at-a-time approaches
• Needs to make and work with Gene Lists
Complex Disease Association
GenomeScan
Linkage peaks
Animal models
Candidate Genes
SNP choosing and validation
Patient vs Control association studies
MicroArray
Interesting Spots from Experiment A, done on Platform A
Interesting Spots from Experiment B, done on Platform B
Integrate Spots to form Gene Set
Dump orthologs and promoters
Computational Screen
Dump Gene regions from set of protein kinase genes
Formulate Clever method in house
Display results in Genome context
On biologist-friendly web display
Data mining problems...
• You need all the data in one place to provide data
• The natural language for database queries (SQL) is not... so natural!
• Often SQL queries are very slow on normalised databases
• Often there is additional analysis which needs to occur
EnsMart
• SQL queries are slow
– Transform the data into a query-optimised, read-only database
• Additional analysis is needed
– Precompute additional analysis for all items (disk space is cheap!)
• You need all the data in one place
– Federate databases (BioMart)
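The "precompute for all items" idea can be sketched in a few lines of Python. This is only an illustration: GC content stands in for whatever "additional analysis" is wanted, and the gene IDs and sequences are made up.

```python
# Hypothetical example: GC content is a stand-in for "additional analysis".
def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

# Made-up sequences keyed by made-up gene IDs.
sequences = {"GENE_A": "GGCC", "GENE_B": "ATAT", "GENE_C": "ATGC"}

# Run the analysis once for every item and keep the result (disk is cheap).
precomputed = {gene_id: gc_content(seq) for gene_id, seq in sequences.items()}

def genes_with_gc_at_least(threshold: float) -> list:
    """Query time: a simple lookup/filter, no analysis is re-run."""
    return sorted(g for g, value in precomputed.items() if value >= threshold)

print(genes_with_gc_at_least(0.5))
```

The cost of the analysis is paid once at build time, so each later query is just a filter over stored values.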
Normalised databases
Gene
Transcripts
Exons
Sequence
External Reference
(links between tables are one-to-many: >1, >1, >1)
Five (six) table join for “genes with this set of Affymetrix IDs on this chromosome”
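To make the cost of that query concrete, here is a minimal sketch using Python's built-in sqlite3 module. The five tables, their columns, and all the data are hypothetical and much simpler than the real Ensembl schema; the point is only how many joins a normalised layout needs for one everyday question.

```python
import sqlite3

# Hypothetical, simplified normalised schema (not the real Ensembl tables).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gene        (gene_id INTEGER PRIMARY KEY, stable_id TEXT, chromosome TEXT);
CREATE TABLE transcript  (transcript_id INTEGER PRIMARY KEY, gene_id INTEGER);
CREATE TABLE object_xref (transcript_id INTEGER, xref_id INTEGER);
CREATE TABLE xref        (xref_id INTEGER PRIMARY KEY, external_db_id INTEGER, accession TEXT);
CREATE TABLE external_db (external_db_id INTEGER PRIMARY KEY, db_name TEXT);

INSERT INTO gene VALUES (1, 'GENE_A', '7'), (2, 'GENE_B', '7'), (3, 'GENE_C', 'X');
INSERT INTO transcript VALUES (10, 1), (20, 2), (30, 3);
INSERT INTO object_xref VALUES (10, 100), (20, 200), (30, 300);
INSERT INTO xref VALUES (100, 1, 'probe_1'), (200, 2, 'P12345'), (300, 1, 'probe_3');
INSERT INTO external_db VALUES (1, 'AFFY'), (2, 'UNIPROT');
""")

def genes_for_probes(probe_ids, chromosome):
    """Genes on one chromosome linked to any of the given Affymetrix IDs."""
    if not probe_ids:
        return []
    placeholders = ", ".join("?" for _ in probe_ids)
    sql = f"""
        SELECT DISTINCT g.stable_id
        FROM gene g
        JOIN transcript  t  ON t.gene_id = g.gene_id
        JOIN object_xref ox ON ox.transcript_id = t.transcript_id
        JOIN xref        x  ON x.xref_id = ox.xref_id
        JOIN external_db db ON db.external_db_id = x.external_db_id
        WHERE db.db_name = 'AFFY'
          AND g.chromosome = ?
          AND x.accession IN ({placeholders})
        ORDER BY g.stable_id
    """
    return [row[0] for row in conn.execute(sql, [chromosome, *probe_ids])]

print(genes_for_probes(["probe_1", "probe_3"], "7"))
```

Every extra one-to-many hop adds a join, which is where both the query complexity and the slowness come from.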
Mart Transformation
Normalised
Query optimised
(reverse star schema)
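A rough sketch of such a transformation, again with Python's sqlite3 module. The source tables, the `gene__main` / `gene__affy__dm` table names, and the data are all assumptions made for illustration; this is not the actual mart-building code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical normalised source tables.
CREATE TABLE gene       (gene_id INTEGER PRIMARY KEY, stable_id TEXT, chromosome TEXT);
CREATE TABLE transcript (transcript_id INTEGER PRIMARY KEY, gene_id INTEGER);
CREATE TABLE xref       (transcript_id INTEGER, db_name TEXT, accession TEXT);

INSERT INTO gene VALUES (1, 'GENE_A', '7'), (2, 'GENE_B', 'X');
INSERT INTO transcript VALUES (10, 1), (11, 1), (20, 2);
INSERT INTO xref VALUES (10, 'AFFY', 'probe_1'), (11, 'AFFY', 'probe_1'),
                        (20, 'UNIPROT', 'P12345');

-- Query-optimised copy: one wide main table with one row per gene...
CREATE TABLE gene__main AS
  SELECT g.gene_id, g.stable_id, g.chromosome,
         COUNT(t.transcript_id) AS transcript_count
  FROM gene g LEFT JOIN transcript t ON t.gene_id = g.gene_id
  GROUP BY g.gene_id, g.stable_id, g.chromosome;

-- ...plus a dimension table for the one-to-many Affymetrix IDs.
CREATE TABLE gene__affy__dm AS
  SELECT DISTINCT t.gene_id, x.accession AS affy_id
  FROM transcript t JOIN xref x ON x.transcript_id = t.transcript_id
  WHERE x.db_name = 'AFFY';

CREATE INDEX idx_affy_id ON gene__affy__dm (affy_id);
""")

def genes_with_probe(affy_id, chromosome):
    """Same question as before, now a single join against the main table."""
    sql = """
        SELECT m.stable_id
        FROM gene__main m
        JOIN gene__affy__dm d ON d.gene_id = m.gene_id
        WHERE d.affy_id = ? AND m.chromosome = ?
        ORDER BY m.stable_id
    """
    return [row[0] for row in conn.execute(sql, (affy_id, chromosome))]

print(genes_with_probe("probe_1", "7"))
```

The joins are done once when the mart is built; since the result is read-only, it can be indexed for the queries users actually run.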
Web User interface
• Web based
• Wizard like
• “dataset” (focus)
• “filter” - restrictions
• output
– columns to show
– sequence
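The dataset / filter / output steps map naturally onto a small query object. The Python sketch below is an assumption about how such a query could be represented; the `MartQuery` class, the `DATASETS` table, and the field names are hypothetical and are not the EnsMart API.

```python
from dataclasses import dataclass, field

@dataclass
class MartQuery:
    dataset: str                                     # the "focus", e.g. genes
    filters: dict = field(default_factory=dict)      # restrictions: column -> value
    attributes: list = field(default_factory=list)   # output columns to show

# Made-up in-memory dataset standing in for a mart table.
DATASETS = {
    "genes": [
        {"stable_id": "GENE_A", "chromosome": "7", "biotype": "protein_coding"},
        {"stable_id": "GENE_B", "chromosome": "7", "biotype": "pseudogene"},
        {"stable_id": "GENE_C", "chromosome": "X", "biotype": "protein_coding"},
    ]
}

def run(query: MartQuery) -> list:
    """Apply every filter, then project onto the requested columns."""
    rows = DATASETS[query.dataset]
    selected = [
        row for row in rows
        if all(row.get(col) == value for col, value in query.filters.items())
    ]
    return [tuple(row[col] for col in query.attributes) for row in selected]

query = MartQuery(
    dataset="genes",
    filters={"chromosome": "7"},
    attributes=["stable_id", "biotype"],
)
print(run(query))
```

A wizard-style interface only has to fill in these three fields step by step; the user never writes SQL.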
Set based work
• EnsMart can export sets of IDs (Ensembl, Affymetrix, Uniprot)...
• EnsMart can also filter on a given set of IDs
– (give me all the chromosome locations of genes defined by my Affymetrix information)
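Set-based work amounts to "IDs in, IDs out". A minimal Python sketch, assuming a made-up in-memory gene table (the field names and all IDs are hypothetical):

```python
# Made-up gene records; each carries cross-references as sets of IDs.
GENES = [
    {"ensembl": "GENE_A", "affy": {"probe_1", "probe_2"},
     "uniprot": {"P00001"}, "location": ("7", 100, 900)},
    {"ensembl": "GENE_B", "affy": {"probe_3"},
     "uniprot": set(), "location": ("X", 50, 400)},
    {"ensembl": "GENE_C", "affy": set(),
     "uniprot": {"P00003"}, "location": ("7", 2000, 2500)},
]

def filter_by_ids(genes, id_type, wanted):
    """Keep genes that carry at least one of the wanted IDs of this type."""
    wanted = set(wanted)
    return [gene for gene in genes if gene[id_type] & wanted]

def export_ids(genes, id_type):
    """Collect every ID of one type across a gene set."""
    ids = set()
    for gene in genes:
        ids |= gene[id_type]
    return ids

# "Give me all the chromosome locations of genes defined by my Affymetrix IDs."
my_probes = {"probe_2", "probe_3", "probe_9"}   # probe_9 matches nothing
hits = filter_by_ids(GENES, "affy", my_probes)
locations = {gene["ensembl"]: gene["location"] for gene in hits}
print(locations)

# The resulting gene set can be exported again as another ID type.
print(export_ids(hits, "uniprot"))
```

Because the output of one query is itself a set of IDs, it can be fed straight back in as the filter of the next.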
BioMart
Ensembl specific, only runs from www.ensembl.org
→ Made generic: multiple installations, query federation
(Arek Kasprzyk)
BioMart
Any Schema → data-mart schema → user interface
Mart Builder; Mart config (XML specification)
BioMart
• www.ebi.ac.uk/biomart
• Google for BioMart
• Ensembl
• Uniprot
• MSD structures
BioMart future...
• ArrayExpress (gene expression dataset)
• WormBase
• ...others
Cross-Internet Mart…
WWW
Mutant Stock Sample Mart
Ensembl Genome Mart
Firewall
ArrayExpress Expression Atlas Mart
Mart Query Building Software
Give me all the genes mapped within phenotype X in my samples that are also at least 4-fold upregulated in kidney.
Give me all the ht SNPs in all the genes…
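A federated query of this kind could be sketched as below. The three dictionaries are made-up stand-ins for independent marts (a local sample mart, an expression mart, a genome mart), and the function name is hypothetical; in a real federation each lookup would be a query sent to a separate installation, with a shared gene ID as the join key.

```python
# Made-up stand-ins for three separate marts, linked only by gene ID.
SAMPLE_MART = {"phenotype_X": {"GENE_A", "GENE_B", "GENE_C"}}
EXPRESSION_MART = {
    ("GENE_A", "kidney"): 5.2,
    ("GENE_B", "kidney"): 1.5,
    ("GENE_C", "kidney"): 4.0,
    ("GENE_D", "kidney"): 9.0,
}
GENOME_MART = {
    "GENE_A": ["snp_1", "snp_2"],
    "GENE_C": ["snp_3"],
    "GENE_D": ["snp_4"],
}

def federated_query(phenotype, tissue, min_fold):
    """Chain three marts: phenotype genes -> upregulated subset -> their SNPs."""
    # Step 1 (sample mart): genes mapped within the phenotype.
    candidates = SAMPLE_MART.get(phenotype, set())
    # Step 2 (expression mart): keep genes at least min_fold upregulated.
    upregulated = {
        gene for gene in candidates
        if EXPRESSION_MART.get((gene, tissue), 0.0) >= min_fold
    }
    # Step 3 (genome mart): SNPs for the surviving genes.
    return {gene: GENOME_MART.get(gene, []) for gene in sorted(upregulated)}

print(federated_query("phenotype_X", "kidney", 4.0))
```

Only the gene ID set travels between steps, which is what lets a private mart behind a firewall take part alongside public ones.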