Institute of Bioinformatics, National Yang-Ming University

Post on 04-Feb-2022

1 of 33

Genome annotation with Ensembl

Institute of Bioinformatics, National Yang-Ming University

Xosé Mª Fernández
European Bioinformatics Institute

December 2004

2 of 33

Outline of talk

• High level overview of Ensembl
  – Making genomes useful
• Outline workshop
  – New web code
  – DAS, display your own data
  – Modify EnsMart
  – BLAST/SSAHA
  – Comparing genomes
  – Customising Ensembl
• Outlook
  – Manual annotation
  – Other features

3 of 33

We make genomes useful

4 of 33

Making genomes useful

• Interpretation
  – Where are the interesting parts of the genome?
  – What do they do?
  – How are they related to elements in other genomes?
• Access
  – for bench biologists
  – for non-programming mid-scale groups
  – for good programming groups

5 of 33

Access… bench biologists

• Mainly via the web
• Web site designed for the non-programming, not especially genome-aware biologist
  – Simple things to find are simple to find
  – Graphical displays and overviews
  – Consistency of layout, colour and text

6 of 33

Ensembl website: Role

– Visual display of Ensembl data
  • A graphical, intuitive display for biologists
– “Public face” of Ensembl
  • Contact point for the project
– Local site installation
  • Free, open-source, supported
– A framework on which to hang user data
  • DAS and data upload
  • Local data integration via data adaptors
– Web-based tools
  • Display tools, primer selection, Anopheles gene name and transposon submission, etc.

7 of 33

Architecture

• Encapsulates
  – Input
  – Output
  – Ensembl API
  – Rendering
• Improves
  – Maintainability
  – Flexibility
  – Code re-use

[Architecture diagram. Labels: Client browsers; Apache / mod_perl web servers; View script (Input, Data, Renderer, Output); Ensembl API; BioPerl; MySQL RDBMS (core, lite, est, snp)]
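The separation of input, data access and rendering named on this slide can be illustrated with a small sketch. This is not Ensembl web code (which the slide places on Apache / mod_perl): `Request`, `FakeGeneAdaptor`, `render_text`, `render_html` and `view_script` are hypothetical names, and the gene record is made-up example data.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Input: parameters parsed from the client request."""
    species: str
    gene_id: str


class FakeGeneAdaptor:
    """Stand-in for the data-access layer (hypothetical, not the Ensembl API)."""

    _genes = {
        "GENE0001": {"name": "example gene", "chromosome": "1",
                     "start": 1000, "end": 5000},
    }

    def fetch(self, gene_id):
        # Return the gene record, or None when the identifier is unknown.
        return self._genes.get(gene_id)


def render_text(gene):
    """Renderer producing plain text."""
    return f"{gene['name']} {gene['chromosome']}:{gene['start']}-{gene['end']}"


def render_html(gene):
    """Renderer producing an HTML fragment from the same data."""
    return f"<p><b>{gene['name']}</b> on chromosome {gene['chromosome']}</p>"


def view_script(request, adaptor, renderer):
    """Glue code: take input, fetch data through the adaptor, hand it to a renderer."""
    gene = adaptor.fetch(request.gene_id)
    if gene is None:
        return f"No gene {request.gene_id}"
    return renderer(gene)
```

Because the view script only knows about an adaptor and a renderer, the same data can be shown as text or HTML by swapping one argument, which is the kind of code re-use the slide lists as a benefit.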

8 of 33

Access… mid scale groups

• Wanting to work with 50 to 1,000 genes, regions, expression data
• Little in-house programming
  – Some web views designed for this group
  – EnsMart focused on this group
• Mix and match queries
• “Instant” refresh of selected set
• Output to Excel, FASTA, HTML table

9 of 33

Mart database

• De-normalised
• Tables with ‘redundant’ information
• Query-optimised
• Fast and flexible
• Ideal for data mining
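A minimal sketch of what de-normalisation means in practice, using Python's built-in `sqlite3` with an in-memory database. The table and column names (`gene`, `transcript`, `mart_transcript`) and the rows are invented for illustration and are not the actual EnsMart schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Normalised source tables: each fact is stored once.
CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, name TEXT, chromosome TEXT);
CREATE TABLE transcript (transcript_id INTEGER PRIMARY KEY, gene_id INTEGER, biotype TEXT);

INSERT INTO gene VALUES (1, 'geneA', '1'), (2, 'geneB', 'X');
INSERT INTO transcript VALUES
    (10, 1, 'protein_coding'), (11, 1, 'protein_coding'), (12, 2, 'pseudogene');

-- De-normalised mart table: the join is done once, ahead of time, so gene
-- name and chromosome are repeated ("redundant") on every transcript row.
CREATE TABLE mart_transcript AS
    SELECT t.transcript_id, t.biotype, g.gene_id, g.name AS gene_name, g.chromosome
    FROM transcript t JOIN gene g ON g.gene_id = t.gene_id;
""")


def transcripts_on(chromosome):
    """Query the mart table directly; no join is needed at query time."""
    rows = conn.execute(
        "SELECT transcript_id, gene_name FROM mart_transcript "
        "WHERE chromosome = ? ORDER BY transcript_id",
        (chromosome,),
    )
    return rows.fetchall()
```

The trade-off is storage and rebuild cost in exchange for simple single-table filters, which suits the read-mostly, mix-and-match querying described for EnsMart.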

10 of 33

There are other ways…

MartShell
Command-line interface to Mart written in Java.
It works with a Mart Query Language.

11 of 33

MartExplorer

12 of 33

BLAST/SSAHA

13 of 33

BLAST/SSAHA

• Different web interfaces exist for sequence comparison at genome scale
• Ensembl’s BlastView is a generic, modular interface that integrates several databases and methods
• BlastView has been extended to integrate tightly with the Ensembl web site
• Server-side state maintenance mechanisms provide a high-performance, flexible framework for the UI

14 of 33

Access… large scale groups

• Full use of the genome, by experienced bioinformaticians
• Complete openness of the group
  – Open data
  – Open software
  – Open MySQL server on the internet
  – Expect everything to be portable
  – Participate in standards and adopt other standards (DAS, UCSC upload)

15 of 33

Ensembl – Open source

Freely available
Community development
– 51 Ensembl installs worldwide
– Both public and commercial, e.g.
  Gramene (CSHL)
  Fugu-sg (ICMB)
  Ciona-sg (Temasek)

16 of 33

Uploading data to Ensembl

17 of 33

Display of uploaded data

18 of 33

Comparing genomes

19 of 33

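Many Genomes

[Diagram of species databases and the comparative databases that link them]
• Vertebrate Compara: Human, Chimp, Mouse, Rat, Chicken, Zebrafish, Takifugu, Tetraodon
• Worm Compara: C elegans, C briggsae
• Diptera Compara: Drosophila, Anopheles
• Also shown: Honey bee, InterPro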

20 of 33

Many more genomes

• Ciona (C. savignyi and C. intestinalis)
• Rhesus
• Sea Urchin, Platynereis…
• Aedes, Ixodes… (vectors)

21 of 33

• High level overview of Ensembl
  – Making genomes useful
• Outline workshop
  – New web code
  – DAS, display your own data
  – Modify EnsMart
  – BLAST/SSAHA
  – Comparing genomes
  – Customising Ensembl
• Outlook
  – Manual annotation
  – Other features

22 of 33

Future plans

• New data
  – More species
  – Variation data
  – Comparative data
• More integrated views
  – GeneSNPView
  – Comparative ContigView
• More focused tool displays
  – Primer & haplotype selection
• Greater integration of user data
  – Gene & Protein DAS

23 of 33

Challenges

• What is the right way to calculate evolutionary relationships between these genomes?
  – How different is the gene build for each new genome?
• Is there novel information to be deduced from the set of related genomes?
• How do we integrate “close” genomes and genome variation?

24 of 33

Manual Curation

• People are the best at
  – Resolving conflicting heterogeneous information
  – Recognising “out of the ordinary” biology
• For high-investment genomes, an automated pipeline with human intervention is the endgame
  – Human and Mouse

25 of 33

Vega

• Vega is the collection of manually annotated human and other vertebrate genome data
  – Reuses Ensembl database and website technology
  – Reuses Ensembl pipelines for Sanger annotation

26 of 33

Two types of variation data

Natural
• Limitless
• Dense markers required
• Need for optimal experimental design (HapMap)
• Human and Anopheles

Managed
• Limited strain number
• Light density adequate for some uses
• (dense for complete dataset)
• Mouse, Rat

27 of 33

Variation data (now)

• dbSNP centric
  – Key data: SNP position and allele
  – Calculate derived properties (coding SNP, amino acid change)
• Provide views on contigview and transview
• Provide selection via EnsMart
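As an illustration of a "derived property", the sketch below works out the amino acid change caused by a SNP inside a coding sequence using the standard genetic code. It is not the Ensembl calculation: the function name `snp_consequence`, its 0-based position convention and the toy sequence in the usage note are assumptions made for this example, and real transcripts also need strand, intron and frame handling.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def snp_consequence(cds, pos, alt):
    """Classify a SNP in an in-frame coding sequence.

    cds: coding sequence starting at the first base of the first codon.
    pos: 0-based position of the SNP within cds (assumed convention).
    alt: the alternative base.
    Returns (kind, change), e.g. ("non-synonymous", "E2V").
    """
    offset = pos % 3                      # position of the SNP within its codon
    codon_start = pos - offset
    ref_codon = cds[codon_start:codon_start + 3]
    alt_codon = ref_codon[:offset] + alt + ref_codon[offset + 1:]
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    kind = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    residue = codon_start // 3 + 1        # 1-based amino acid position
    return kind, f"{ref_aa}{residue}{alt_aa}"
```

For the toy sequence `ATGGAGTAA` (Met, Glu, stop), changing the middle base of the second codon to `T` turns GAG into GTG, so `snp_consequence("ATGGAGTAA", 4, "T")` reports a Glu to Val change at residue 2.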

28 of 33

Variation data (expected)

• Recombination variability and population history of a species provide for optimal experimental design
  – “HapMap”
• Have to add individual, cohort, population and genotype concepts
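One way to picture the concepts this slide says must be added is a small data model. The sketch below is an assumption for illustration only, not the Ensembl variation schema: the class names, fields and the `allele_frequencies` helper are hypothetical, and the cohort concept is left out to keep the example short.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Individual:
    """A sampled individual and the population it belongs to."""
    name: str
    population: str


@dataclass(frozen=True)
class Genotype:
    """The pair of alleles observed for one individual at one SNP."""
    individual: Individual
    snp_id: str
    alleles: tuple  # e.g. ("A", "G")


def allele_frequencies(genotypes, snp_id, population):
    """Allele frequencies for one SNP within one population."""
    counts = Counter()
    for g in genotypes:
        if g.snp_id == snp_id and g.individual.population == population:
            counts.update(g.alleles)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {allele: n / total for allele, n in counts.items()}
```

With genotypes attached to individuals rather than to the SNP alone, per-population summaries such as allele frequency become simple aggregations over the same records.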

29 of 33

Variation data (future)

• Allow for inexpensive hyper-dense genotype determination of large cohorts
• Integrate population substructure, close species and individual variation
  – Understanding positive and negative selection

30 of 33

Other genomic features

31 of 33

There are more than genes!

• RNA genes
  – “well known” structural RNA genes
  – Newer miRNA genes
  – Pseudogenes/duplications a massive headache
• Cis-regulatory motifs
  – Transcriptional motifs
  – RNA processing motifs
• Yet unknown other stuff

32 of 33

Comparative genomics…

• Action of negative selection should let us see these features
  – Honest research problem: how does one expect promoters to evolve?
  – Overlapping signals, e.g. splicing enhancers in exons

33 of 33

Thanks

Ensembl Team

Database Schema and Core API
Arne Stabenau, Yuan Chen, Ian Longden, Craig Melsopp, Glenn Proctor, Daniel Ríos, Guy Slater

Distributed Annotation System
Andreas Kähäri

Project Leader
Ewan Birney (EBI), Tim Hubbard (Sanger)

Ensembl Web Team
James Stalker, Fiona Cunningham, James Smith

Vega Web Team
Patrick Meidl, Steve Trevianon

Analysis and Annotation Pipeline
Val Curwen, Steve Searle, Dan Andrews, Mario Caccamo, Laura Clarke, Martin Hammond, Jan Hinnerck-Vogel, Kevin Howe, Vivek Iyer, Kerstin Jekosch, Felix Kokocinski, Simon White

User Support
Xosé Mª Fernández, Michael Schuster

Comparative Genomics
Abel Ureta-Vidal, Javier Herrero Sánchez, Jessica Severin, Cara Woodwark

EnsMart & BioMart
Arek Kasprzyk, Damian Keefe, Darin London, Damian Smedley
