High throughput methods Biomolecular Databases · complex data structures in a machine, DBMS, and...

33
1 High throughput methods Biomolecular Databases High throughput sequencing applications Han et al., Cell, 2011

Transcript of High throughput methods Biomolecular Databases · complex data structures in a machine, DBMS, and...

Page 1: High throughput methods Biomolecular Databases · complex data structures in a machine, DBMS, and programming language independent manner. • It is an international standard and

1

High throughput methodsBiomolecular Databases

High throughput sequencing applications

Han et al., Cell, 2011

Page 2: High throughput methods Biomolecular Databases · complex data structures in a machine, DBMS, and programming language independent manner. • It is an international standard and

2

Whole genome sequencing

• Few large contigs are better than many small contigs• N50 = length of smallest contig in the set of largest contigs covering 50% of assembly• Maximal and average length of contigs

• Few scaffolds with high number of contigs are better than many with few• S50 (according to N50)• Maximal and average length of contigs

Green et al., Nature Genet, 2001

• Assembly in O(n2) where n is number of reads• Sequencing errors (e.g. homopolymers)• Repeats in different length• Areas without or limited coverages• Finishing gap closure

Greedy algorithm• Find shortest common substring T for reads {s1,s2,…}• Solution is very time consuming (NP hard) but can approximated bygreedy algorithm

• Successful used for small genomes (e.g. bacteria)• CAP3, SSAKE, VCAKE, SHARCGS

Overlap‐layout‐consensus• Graph, where the nodes represent each of the reads and an edge connects two nodes if the corresponding reads overlap

• Identifying a path through the graph that contains all the nodes ‐ a Hamiltonian path

• Arachne, Celera Assembler, newbler, Minimus, Edena

Genome assembly

Page 3: High throughput methods Biomolecular Databases · complex data structures in a machine, DBMS, and programming language independent manner. • It is an international standard and

3

De Brujin graphs• Break up each read into a collection of overlapping k‐mers.• Each k‐mer is represented in a graph as an edge connecting two nodes corresponding to its k‐1 bp prefix and suffix respectively.

• A graph that uses all the edges containing the information obtain from all the reads is a solution to the assembly problem (Eulerian path).

• Repeats • Euler, Velvet, Allpath, ABySS

Genome assembly

Compeau et al., Nature Biotech, 2011

Personal genomes

Sequencing of the genomes of fraternal twins diagnosed with a movement disorder 

6000m nucleotides (diploid human genome)1.63m single‐base variants shared by twins that differ from reference

human genome9531 variants that code for proteins4605 variants that change amino‐acid sequence77  rare variants (which are more likely to cause disease)3  candidate genes1 gene linked to disorder

Bainbridge  et al., Sci Transl Med, 2011Maher, Nature, 2011

Page 4: High throughput methods Biomolecular Databases · complex data structures in a machine, DBMS, and programming language independent manner. • It is an international standard and

4

Exome sequencing

• 85% of disease causing genes are in the exome• exome ~1% of the genome

Bioinformatics analysis

• Sequence quality• Alignment (e.g. BWA,SOAP2)• Filter (mapped read, duplicate, exome, local alignment around DIP)• Variant detection (SNP, DIP, homo/heterozygous splitter)• Annotation and visualization (IGV)

Copy number variation

Page 5: High throughput methods Biomolecular Databases · complex data structures in a machine, DBMS, and programming language independent manner. • It is an international standard and

5

Bisulfite sequencing (DNA methylation)

• Lower sequence complexity can make problems (e.g. primer design)• Incomplete conversions• Degradation of DNA during bisulfite treatment

HT sequencing (PCR)

• Bisulfite converts cytosine (C) residues to uracil (U) • Leaves 5‐methylcytosine (5mC)  residues unaffected

Yeast two‐hybrid (Y2H)

Protein‐protein interaction (Y2H)

TF‐DNA interaction (Y1H)

Page 6: High throughput methods Biomolecular Databases · complex data structures in a machine, DBMS, and programming language independent manner. • It is an international standard and

6

Synthetic lethal interactions

eQTL

• An expression profile can be mapped to gene expression Quantitative Trait Loci by linkage or association method.

• QTLs are stretches of DNA containing or linked to the genes that underlie a quantitative trait (phenotype, charcateristics)

hot spots 

Page 7: High throughput methods Biomolecular Databases · complex data structures in a machine, DBMS, and programming language independent manner. • It is an international standard and

7

LC‐MS/MS

Spectrum

Page 8: High throughput methods Biomolecular Databases · complex data structures in a machine, DBMS, and programming language independent manner. • It is an international standard and

8

Peptide fragment fingerprinting (PFF)

Quantitative proteomics

ICAT

Isotope‐Coded Affinity TagsStable isotope labeling with amino 

acids in cell culture

SILAC

Page 9: High throughput methods Biomolecular Databases · complex data structures in a machine, DBMS, and programming language independent manner. • It is an international standard and

9

Tissue microarray

FACS

Page 10: High throughput methods Biomolecular Databases · complex data structures in a machine, DBMS, and programming language independent manner. • It is an international standard and

10

Examples of public databases

• National Center for Biotechnology Information (NCBI)

GenBank

• European Bioinformatics Institute (EBI) and  Sanger Center

Ensembl

• The Molecular Biology Database Collection 

http://www3.oup.co.uk/nar/database/c/

National Library of Medicine (NLM)

• On the campus of the National Institutes of Health (NIH) in Bethesda, Maryland, USA(NIH Budget 2005: $28,757,000,000, ‐ NASA Budged 2005: $16,194,000,000)

• World's largest medical library (Budged 2005: $317,947,000). 

• National Center for Biotechnology Information (NCBI) of the NLM is one of the world largest provider of public databases for life sciences!

Page 11: High throughput methods Biomolecular Databases · complex data structures in a machine, DBMS, and programming language independent manner. • It is an international standard and

11

National Institutes of Health

Page 12: High throughput methods Biomolecular Databases · complex data structures in a machine, DBMS, and programming language independent manner. • It is an international standard and

12

Databases at the NCBI

• Pubmed• Protein• Nucleotides• Structure• Genome• Books• CancerChromosomes• Conserved Domains• 3D Domains• Gene• Genome Project• dbGAP• GEO Profiles• GEO Datasets• GeneSat

• HomoloGene• Journals• MeSH• NLM Catalogs• OMIA• OMIM• PMC• PopSet• Probe• Protein Cluster• SNP• Taxonomy• UniGene• UniSTS

Entrez

• Entrez is a robust and flexible database search and retrieval system used at the NCBI for all major databases and provides accesses to DNA and protein sequence data from more than 130.000 organisms

• Entrez is at once an indexing and retrieval system, a collection of data from many sources, and an organizing principle for biomedical information. 

• Pre‐computed similarity searches for each database record

• Links from a record in one database to associated records in the other Entrez databases,

Page 13: High throughput methods Biomolecular Databases · complex data structures in a machine, DBMS, and programming language independent manner. • It is an international standard and

13

Entrez

Entrez

Page 14: High throughput methods Biomolecular Databases · complex data structures in a machine, DBMS, and programming language independent manner. • It is an international standard and

14

GenBank

GenBank is the NIH genetic sequence database of all publicly available DNA and derived protein sequences,with annotations describing the biological information these records contain.

• Full release of GenBank every 2 months.

• Incremental and cumulative releases: daily.

• GenBank is only available from the Internet.

GenBank

Page 15: High throughput methods Biomolecular Databases · complex data structures in a machine, DBMS, and programming language independent manner. • It is an international standard and

15

DNA sequences

GenBank

Page 16: High throughput methods Biomolecular Databases · complex data structures in a machine, DBMS, and programming language independent manner. • It is an international standard and

16

GenBank

The NCBI Data Model is defined in ASN.1

• ASN.1 is a data description language similar to a Backus‐Naur Form.

• It is a formal language specifically designed to specify  complex data structures in a machine, DBMS, and programming language independent manner.

• It is an international standard and used by many data exchange protocols.

• The GenBank Flat File (GBFF) is a view (report) you can generate from ASN.1, but has taken a life of its own in the bioinformatics community.

GenBank

Page 17: High throughput methods Biomolecular Databases · complex data structures in a machine, DBMS, and programming language independent manner. • It is an international standard and

17

GenBank Flat file

RefSeq

• Best, comprehensive, non‐redundant set of sequences

• For genomic DNA, transcript (RNA), and protein

• For major research organisms (2645 organisms)

• Based on GenBank derived sequences

• Ongoing curation by NCBI staff and collaborators, with  review status indicated on each record

• Identifiers: NT_ Genomic contig

NM_ mRNA

NP_ protein

NR_ None‐coding RNA

XM_ mRNA

XP_ protienautomatic annotation

Page 18: High throughput methods Biomolecular Databases · complex data structures in a machine, DBMS, and programming language independent manner. • It is an international standard and

18

Gene

• A record represents a single gene from an organism• A gene‐specific information such as map, sequence,    expression, structure, function, homology and  publications

• Includes data for all organisms that have RefSeqgenome records

• Official gene symbol and gene name are used

Gene ID 5091

Official Symbol  PC

Official Full Name pyruvate carboxylase

For human provided fromHUGO Gene NomenclatureCommitee (HGNC)

PubMed

• The database was designed to provide access to citations (with abstracts) from biomedical journals. 

• PubMed has more than 15 million MEDLINE journal article references and abstracts (~1960‐2008)

• 700 million searches per year over the web

• Linking feature to provide access to full‐text journal articles at web sites of participating publishers, as well as to other related web resources. 

Page 19: High throughput methods Biomolecular Databases · complex data structures in a machine, DBMS, and programming language independent manner. • It is an international standard and

19

PubMed Results

Online Mendelian Inheritance in Man (OMIM)

• Is a timely, authoritative compendium of bibliographic material and observations on inherited disorders and human genes. 

• Curation of the database and editorial decisions take place at The Johns Hopkins University School of Medicine. 

• OMIM provides authoritative free text overviews of genetic disorders and gene loci that can be used by clinicians, researchers, students, and educators. 

Page 20: High throughput methods Biomolecular Databases · complex data structures in a machine, DBMS, and programming language independent manner. • It is an international standard and

20

Gene expression databases

• Public microarray data repositories

ArrayExpress (AE)                   Gene Expression Omnibus

www.ebi.ac.uk/arrayexpress/ www.ncbi.nlm.nih.gov/geo/

Gene expression databases

• Gene expression atlas (GNF Symatlas)

• Affymetrix arrays

• 79 human tissues

• 61 mouse tissues

• Samples in duplicate

median 3*median 10*median

Page 21: High throughput methods Biomolecular Databases · complex data structures in a machine, DBMS, and programming language independent manner. • It is an international standard and

21

Protein sequences

Three‐dimensional Protein Structures

Page 22: High throughput methods Biomolecular Databases · complex data structures in a machine, DBMS, and programming language independent manner. • It is an international standard and

22

Protein‐protein interaction

•   centralized platform to visualize domain architecture, post‐translational

modifications, interaction networks and disease association (golden standard for PPI) 

•   36,500 unique PPIs annotated for 25,000 proteins (2007).

•    > 50% of molecules annotated in HPRD have at least one PPI

• 10% have more than 10 PPIs.

•    3 categories of experiments for PPIs: 

in vitro, in vivo and yeast two hybrid (Y2H).

Genome Browsers  

Page 23: High throughput methods Biomolecular Databases · complex data structures in a machine, DBMS, and programming language independent manner. • It is an international standard and

23

NCBI Map Viewer

UCSC Genome Browser

Page 24: High throughput methods Biomolecular Databases · complex data structures in a machine, DBMS, and programming language independent manner. • It is an international standard and

24

Ensembl

Ensembl Project

• Funded to provide metazoan genomes to the world

• Aims to provide the world’s best automated genome annotation

• A leading group for human and mouse analysis

• All software, data and results freely available

• Group split between EBI and Sanger funded by Wellcome Trust

• Largest dedicated compute in biology in Europe

• Developer community >300 people including companies

Sanger Center, Hinxton, UK

Page 25: High throughput methods Biomolecular Databases · complex data structures in a machine, DBMS, and programming language independent manner. • It is an international standard and

25

Ensembl

– Normalized

– Each data point stored only once

– Quick updates

– Minimal storage requirements

– BUT:      Many tables

Many joins for complicated queries

Slow for data mining questions

– De‐normalized– Tables with ‘redundant’ information– Query‐optimized– Fast and flexible

Core database

Mart database (EnsMart)

Comparative Genomics

Page 26: High throughput methods Biomolecular Databases · complex data structures in a machine, DBMS, and programming language independent manner. • It is an international standard and

26

Genomes change over time

Definitions

Homologs: A – B – C

Orthologs: B1 – C1 

Paralogs: C1 – C2 –C3 

Inparalogs: C2 – C3 

Outparalogs: B2 – C1

Xenologs: A1 – AB1

Protein A

Page 27: High throughput methods Biomolecular Databases · complex data structures in a machine, DBMS, and programming language independent manner. • It is an international standard and

27

Orthologues prediction

Ortholog databases

• YOGY (eukarYotic OrtholoGY) is a web‐based resource and integrates 5 independent resources (Sanger)

• COG Cluster of ortholog groups of proteins and KOG for 7 eukaryotic genomes (NCBI), 

• Inparanoid (Center Stockholm Bioinformatics) 

• HomoloGene (NCBI) 

• OrthoMCL use Markov Clustering algorithm (University of Pennsylvania)

Page 28: High throughput methods Biomolecular Databases · complex data structures in a machine, DBMS, and programming language independent manner. • It is an international standard and

28

Synteny regions

Comparative Genomics within Ensembl

Page 29: High throughput methods Biomolecular Databases · complex data structures in a machine, DBMS, and programming language independent manner. • It is an international standard and

29

Multiple alignment of vetebrate genomes

UCSC Genome Browser

MAF file format

Page 30: High throughput methods Biomolecular Databases · complex data structures in a machine, DBMS, and programming language independent manner. • It is an international standard and

30

Rhodes et al., Nat Biotechnol, 2005

Probabilistic model for data integration

DIP         Coexpression GO              Interpro

Integration of datasets

Fraser AG, Marcotte EM. Nat Genet (2004) 

Page 31: High throughput methods Biomolecular Databases · complex data structures in a machine, DBMS, and programming language independent manner. • It is an international standard and

31

Example for data integration

Genetic and environmental perturbations (A)

TFBS prediction (F)PP interactions (B)

Domain‐domain interactions (E)

PD interactions (C)

Hwang et al., Proc Natl Acad Sci U S A, 2005

• Pointilist (Matlab)

Fisher’s χ2 :

Mudholkar‐George’s T:

Liptak‐Stouffer’s Z:

Meta analysis: combination of p‐values

• Get list of significant p‐values from different type of experiments (data)

• There are a number of (weighted) statistical measures  to combine p‐values from 

k datasets 

χ2 = ‐2 ∑ wi*log(pi)  (df=2k)   

T = f(k) ∑ wi*log(pi/(1‐pi)

Z = (1/sqrt(∑wi2))∑ wi*Ф

‐1(1‐pi)i=1

k

• Derive networks based on overall p‐values

Intersection min (pi)

Union max (pi)k

k

i=1

i=1

Page 32: High throughput methods Biomolecular Databases · complex data structures in a machine, DBMS, and programming language independent manner. • It is an international standard and

32

HPRD(Gold standard) Y N

Example protein‐protein interaction network

Naïve Bayes model

LR (f1..fn)P(f1..fn|pos)

P(f1..fn|neg)= ∏

i=1

n

LR (fi) ∏i=1

n

=

Evidence (data sets) f1..fn

Prior odds Oprior =  P(pos)/P(neg)

Posterior odds Opost =  P(pos|f1..fn)/P(neg|f1..fn))

Likelihood ratio LR (f1..fn)  =  P(f1..fn|pos)/P(f1..fn|neg)

Opost =  Oprior* LR (f1..fn)

Log‐likelihood score                 LLS=log LR (f1..fn)

Page 33: High throughput methods Biomolecular Databases · complex data structures in a machine, DBMS, and programming language independent manner. • It is an international standard and

33

Data integration to understand mechanisms

Lee et al. Science (2004)

Technologies for data integration

data sources (databases)

Presentation layer (user)

mediator

wrapper

direct links

data warehouse

data martsOLAP tools (cubes)data mining

extract, transform, load (ETL)

databases

Linking (relational database, SQL)

Mediator‐based approach(federated databases) 

Data warehouse

Semantic integration Agents, RDF, OWL, SPARQL, Ontologies