Unreveiling new biological knowledge from multiresolution structural proteomics data: A Data Base and Pattern Recognition Approach José María Carazo BioComputing Unit, Centro Nacional de Biotecnología, Madrid, Spain

Page 1

Unreveiling new biological knowledge from multiresolution structural proteomics data: A Data Base and Pattern Recognition Approach

José María Carazo

BioComputing Unit, Centro Nacional de Biotecnología, Madrid, Spain

Page 2

(Who am I?) Research Areas

• Image Processing

• Helicase Structure/Function Analysis

• Structural Databases

Page 3

Hypothesis: Medium resolution EM data represent a rich biological information resource. Therefore:

• Step 1) Keep them organized (institutionally) in a new structural data base (do not lose them; keep them organized and accessible)

• Step 2) Extract the now-appearing macro-architecture features (realize the general organizational principles of large assemblies)

• Step 3) Make the “link” to structural proteomics at the amino acid level (go from “density blobs” to defined protein structures; “connect” atomic resolution information with “medium resolution” information)

• Step 4) Integrate this new structural information with other information sources

Page 4

Step 1: Motivated by the impetus in cryo-EM: “Construct the EM Data Base (EMDB)”

• The work started in 1997 with the “BioImage” project of the EU, as a pilot study among research groups

• The work continued through 2000-2003 in the IIMS project, creating the EM Data Base as part of the core facilities of the EBI (European Bioinformatics Institute)

Page 5

EMDB

• IIMS: to integrate the results of three-dimensional electron microscopy (3D-EM) with models from X-ray and NMR methods.

• Part of the MSD (Macromolecular Structure Database)

The project is funded by the European Commission as the IIMS, contract no. QLRI-CT-2000-31237, under the RTD programme "Quality of Life and Management of Living Resources"

Page 6

EMDB

• Relational Data Model
  – Fully integrated in the MSD, together with PDB data

• XML-based Data Model

• EMDep, the Electron Microscopy Deposition Tool
  – Dictionary driven

Page 7

We note that the European Bioinformatics Institute (EBI) through the Macromolecular Structure Database (MSD) now provides a permanent resource for the deposition of three-dimensional maps derived by electron microscopy (see www.ebi.ac.uk/msdsrv/emdep). In addition, coordinate data derived from these maps are deposited in the PDB archive for macromolecular structural data. We intend to use these facilities for the routine deposition of maps and coordinate data produced by our work. These databases are open to the international community and will become part of a family of linked databases in biomedical research. We encourage our colleagues to follow our example by submitting maps, at the stage of publication, to these archival databases.

IIMS Workshop November 15-16, 2002

Page 8

Sending data to EMD

Page 9

… more than a hundred EM structures are now being published in the journals in a typical year. Without EMDB, these data would not be archived for future general use. So the size and usefulness of the database are likely to increase dramatically. Nature Structural Biology is strongly supportive of the general principle that scientific data should be professionally maintained and freely accessible, and so its editors will from now on encourage scientists to deposit their work in EMDB when papers describing EM structures are published in the journal.

Page 10

Step 2: Discover biological knowledge: “Extract information on general organizational principles”

• GOAL: Since EM provides information on (potentially) quite large specimens, devise ways to automatically extract topological and geometrical information about the assemblies

• Driving principle: In order to close the gaps between different techniques of structure determination, such as X-rays and cryo-EM, develop techniques able to work transparently across multiple resolution levels

(HERE COME “ALTERNATIVE REPRESENTATIONS”)

Page 11

FEMME Database

Purpose: to store, in a universal data model, the topological and geometric features of 3D-reconstructed macromolecules regardless of the resolution achieved.

Final aim: automatic detection of general organizational principles; query by content in structural databases.

Methodology: vector quantization and alpha-shape representation theory

J. Struct. Biol., 2004

Page 12

Methodology

Original dataset: set of multimeric proteins coming from:

• PDB/PQS databases (high resolution): macromolecular topology given by the atomic coordinates (Liang et al 1998)

• 3D-EM (medium resolution): macromolecular topology given by the selection of a set of pseudo-atoms (De-Alarcón et al 2002)

Atomic coordinates or pseudo-atoms → ALPHA COMPLEX → IDENTIFICATION, EXTRACTION AND CHARACTERISATION OF CHANNELS/CAVITIES/(PROTRUSIONS)

Page 13

FEMME contents

• Around 140 entries corresponding to alpha-shape representations of macromolecules and macromolecular structural features, from data at any resolution level

• Detailed description of the number and kind of structural features contained in the macromolecule

• One of the possible applications: detection of shape similarities among complexes

Page 14

[Figure: structurally characterised macromolecules (CCT, actin, tricorn protease, ribosome) → several descriptors of the macromolecule structure (shape, size, protrusions, channels, cavities ...) → FEMME database storage → query by content (final aim)]

Page 15

Step 3: Discover biological knowledge: “Make the ‘link’ at the amino acid level” (quantitative “visualization” of fine features)

• Goal: Bridging from atomic resolution to medium resolution

• Motivation: At some point the link from “density blobs” to defined amino acids has to be made, in order to “attach” biochemical and functional information to the medium resolution structures.

• Note: There are many substeps here; we will concentrate on “superfamily recognition” (in cooperation with other groups in the field, such as Chiu’s group)

Page 16

Superfamily recognition

A working definition of superfamily recognition, in order of increasing difficulty:

• Identification of the SSE elements of a protein

• Their spatial distribution and connectivity (topology)

• Assignment of a structural family to the protein

• Assignment of a sequence family to the protein

• Assignment of a function

Information that can be used:

• Protein sequence/atomic resolution information: a bunch of methods (neural networks, threading, etc.)

• Medium resolution views of the protein = 3DEM maps

Questions:

• Is surface information enough to detect a fold?

• Can we detect the fold present in a 3DEM map just by docking other known fold maps into it?

• Can some form of flexible docking using SSE be of help?

Page 17

What are we doing?

• Is surface information enough to help assign a superfamily?
  – Application of the spin-image-representation method by De Alarcon, P.A. and Pascual-Montano, A.

• Can we assign a superfamily in a 3DEM map just by docking other known fold maps into it?
  – Application of the COAN docking method by Volkmann, N. within a new Bayesian scheme

• Can we assign a superfamily by some form of flexible docking, possibly using SSE elements?
  – Work in progress

Page 18

Superfamily assignment using surface information

• Surface information can give information about similarity between different folds.

• Surface comparison can be performed using techniques derived from the field of computer vision.

• Our studies reveal that folds that are similar according to the CATH classification (belonging to the same superfamily) also have similar surfaces at resolutions ranging from 8 to 12 Å.

• Similarities in the surface are related to similarities in the fold sequence of amino acids.

• The surface information can be used to detect folds or entire proteins in large assemblies.

Page 19

Spin image representation (s.i.r.) of 3D-EM Maps

Spin-image representation of a 3D object:

A) s.i.r. principle: to project every point x of the surface with respect to the plane defined by a point p and its normal n.

B) A 3D object with a point and its normal.

C) Points of a surface projected onto a plane.

D) Spin image obtained from the binning of the projected surface points.

[Figure: panels A, B, C and D]
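The spin-image principle can be sketched in a few lines. For an oriented point (p, n), each surface point x is reduced to two coordinates: beta = n · (x − p), the signed height above the tangent plane, and alpha = sqrt(|x − p|² − beta²), the distance from the line through p along n; the pairs are then binned into a 2D histogram. The following is a minimal illustration with simple counting bins, not the implementation used in the work presented; the function name and the `bin_size` and `size` parameters are assumptions.

```python
import math

def spin_image(p, n, points, bin_size, size):
    """Bin surface points into a size x size spin image around the oriented point (p, n).

    Rows index beta, from +size*bin_size/2 (top row) downwards;
    columns index alpha, from 0 upwards.
    """
    length = math.sqrt(sum(c * c for c in n))
    n = [c / length for c in n]                       # unit normal
    beta_max = size * bin_size / 2.0
    image = [[0] * size for _ in range(size)]
    for x in points:
        d = [xi - pi for xi, pi in zip(x, p)]
        beta = sum(di * ni for di, ni in zip(d, n))   # signed height above the tangent plane
        alpha = math.sqrt(max(0.0, sum(di * di for di in d) - beta * beta))  # distance from the normal line
        row = int(math.floor((beta_max - beta) / bin_size))
        col = int(math.floor(alpha / bin_size))
        if 0 <= row < size and 0 <= col < size:       # points outside the image support are ignored
            image[row][col] += 1
    return image
```

Because only alpha and beta are kept, points that differ by a rotation about the normal fall into the same bin, which is what makes the descriptor independent of the pose of the object.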

Page 20

Applications: Partial Matching.

Local patches of the query object can be highlighted according to local similarity with objects in the database.

[Figure: query plane; 1st, 2nd and 3rd matches; coloured patches]

Page 21

Proteins instead of airplanes….(dealing with multiple domains)

• Possibility of docking isolated domains into entire maps

• Take into account the surface info

• Speed

• Modularity

Page 22

Fold recognition using fitting information

• Docking information can be used to detect the CATH superfamily of a single fold present in an electron microscopy map.

• Repeated cross-correlation experiments and a Bayesian probability framework have been used.

• The results show that the use of multiple dockings can overcome the uncertainty when the fold present in the 3D-EM map is unknown.

Page 23

Fold recognition using docking info and Bayesian probability

Bayesian probability:

• $P(\mathrm{fold}_i \mid \mathrm{DensityMap})$: the probability of having a fold given a density map.

• $P(\mathrm{fold}_i)$: background probability of having an individual fold i, computed as the frequency of realizations of that fold in the total data set of structures to dock.

• $P(\mathrm{DensityMap} \mid \mathrm{fold}_i)$: probability of having a density map given a fold i, computed as follows:

1. A set of elements of the CATH superfamily that represents the fold is docked to the density map.

2. The probability that the density map belongs to that fold is computed as the probability that the sample of cross-correlation values came from the same population as the sample of cross-correlations from the elements of the CATH superfamily.

3. This test of homogeneity is done by a chi-squared test.

The fold with the highest value of $P(\mathrm{fold}_i \mid \mathrm{DensityMap})$ is assigned to the map.
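The chi-squared test of homogeneity on two samples of cross-correlation values can be sketched as follows. The slide does not say how the values are binned, so the shared bin edges are an assumption, as are the function names; the tail probability is computed with a standard series for the incomplete gamma function so that the sketch needs only the standard library.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variable with df degrees of freedom."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    term = total = 1.0 / a
    for n in range(1, 1000):            # series for the lower incomplete gamma function
        term *= z / (a + n)
        total += term
        if term < total * 1e-14:
            break
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return min(1.0, max(0.0, 1.0 - lower))

def histogram(sample, edges):
    """Count values per bin; the last bin also includes its upper edge."""
    counts = [0] * (len(edges) - 1)
    for v in sample:
        for i in range(len(counts)):
            upper_ok = v < edges[i + 1] or (i == len(counts) - 1 and v == edges[-1])
            if edges[i] <= v and upper_ok:
                counts[i] += 1
                break
    return counts

def homogeneity_p_value(sample_a, sample_b, edges):
    """P-value of the hypothesis that both samples come from the same population."""
    ca, cb = histogram(sample_a, edges), histogram(sample_b, edges)
    na, nb = sum(ca), sum(cb)
    if na == 0 or nb == 0:
        raise ValueError("each sample needs at least one value inside the bin range")
    stat, used_bins = 0.0, 0
    for a, b in zip(ca, cb):
        if a + b == 0:                  # bins empty in both samples carry no information
            continue
        used_bins += 1
        for observed, n in ((a, na), (b, nb)):
            expected = (a + b) * n / (na + nb)
            stat += (observed - expected) ** 2 / expected
    df = used_bins - 1
    return chi2_sf(stat, df) if df > 0 else 1.0
```

In this reading, `sample_a` would hold the cross-correlations obtained by docking the superfamily members into the map and `sample_b` those obtained among the superfamily members themselves; a p-value near 1 means the two samples are indistinguishable.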

Page 24

Fold recognition using docking info. Results:

At 12 Å resolution the information content is a very discriminant measure. 8 of 9 experiments detect the corresponding family with the best value. Example at 12 Å:

Map belonging to family 1.10.220.10 (SF = Superfamily, M = Map)

SF            Docked elements  Mean CC  Std. Dev  P(M|SF)  P(M|Non-related)  IC      P(fold|M)
3.40.30.10    10               0.782    0.058     0.518    0.521             -0.003  0.00%
2.60.120.60   10               0.636    0.053     0.080    0.478             -0.143  0.00%
1.10.238.10   10               0.832    0.110     0.855    0.734             0.131   6.94%
1.20.90.10    10               0.766    0.012     0.003    0.116             -0.011  0.00%
1.10.220.10   10               0.952    0.036     0.730    0.086             1.563   82.88%
3.40.50.300   10               0.650    0.098     0.649    0.674             -0.025  0.00%
3.10.100.10   10               0.799    0.044     0.519    0.383             0.158   8.40%
2.40.10.10    10               0.787    0.029     0.306    0.274             0.034   1.79%
3.20.20.80    10               0.461    0.048     0.000    0.138             0.000   0.00%
2.60.120.20   10               0.585    0.056     0.203    0.344             -0.107  0.00%

Success rate (superfamilies correctly discriminated):

Resol.  Success rate
12 Å    8 / 9
14 Å    6 / 10
16 Å    4 / 10
20 Å    4 / 10
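The slide does not spell out how the IC and P(fold|M) columns are obtained. As a hedged reading of the numbers only: the tabulated IC values appear consistent with IC = P(M|SF) · ln(P(M|SF) / P(M|Non-related)), and P(fold|M) with each positive IC divided by the sum of the positive ICs. The sketch below recomputes both columns under that assumption; the formula is inferred from the table, not stated by the author, and the function names are illustrative.

```python
import math

# (CATH superfamily, P(M|SF), P(M|Non-related)) copied from the 12 Å example table.
ROWS = [
    ("3.40.30.10", 0.518, 0.521),
    ("2.60.120.60", 0.080, 0.478),
    ("1.10.238.10", 0.855, 0.734),
    ("1.20.90.10", 0.003, 0.116),
    ("1.10.220.10", 0.730, 0.086),
    ("3.40.50.300", 0.649, 0.674),
    ("3.10.100.10", 0.519, 0.383),
    ("2.40.10.10", 0.306, 0.274),
    ("3.20.20.80", 0.000, 0.138),
    ("2.60.120.20", 0.203, 0.344),
]

def information_content(p_sf, p_nr):
    """Assumed form: IC = P(M|SF) * ln(P(M|SF) / P(M|Non-related)); zero when P(M|SF) is zero."""
    return p_sf * math.log(p_sf / p_nr) if p_sf > 0 else 0.0

def fold_probabilities(rows):
    """Assumed normalisation: positive ICs divided by their sum, negative ICs set to zero."""
    ic = {sf: information_content(p_sf, p_nr) for sf, p_sf, p_nr in rows}
    total = sum(value for value in ic.values() if value > 0)
    return {sf: (max(value, 0.0) / total if total > 0 else 0.0) for sf, value in ic.items()}

probs = fold_probabilities(ROWS)
best = max(probs, key=probs.get)
print(best, f"{100 * probs[best]:.1f}%")
```

Under this assumption the map is assigned to 1.10.220.10 with a probability of roughly 83%, close to the tabulated 82.88%; small differences are expected because the inputs are rounded to three decimals.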

Page 25

Fold recognition: extension of the work to multidomain maps

Can a single fold be detected in the entire electron microscopy map?

The cross-correlation approach fails in many cases.

[Figure: correct position vs. position found by cross correlation]

Page 26

Fold recognition: flexible docking

• By flexible docking we mean deforming certain points in the fold to better resemble what we have in the medium resolution density.

• The points chosen for deformation are those located at the ends of the secondary structure elements of the fold.

• To allow for deformations we need to consider different alternatives for each point and choose those that best respect the architecture of the fold superfamily, although it does not need to be exactly the same.

Page 27

Step 4: Discover biological knowledge: “Integrate information”

• Goal: Integrate structural information at all levels of resolution with other sources of information

• Means: Semantic mediation over heterogeneous data sources

• Obviously, this is a necessary step towards new, powerful data mining approaches, and in data mining the “user” should be in the analysis loop via some graphical interface

Page 28

Motivating example: DNA binding macromolecules

• PQS database: multimeric structure

• CATH/SCOP databases: DNA clamp fold

• FEMME database: central channel

Result: multimeric structures containing the DNA clamp fold and with a central channel
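The motivating query amounts to joining three sources on a shared entry identifier. A minimal in-memory sketch follows, assuming each source has already been reduced to a set or dictionary keyed by entry; the function name and the `entry_*` identifiers are hypothetical placeholders, not real database records.

```python
def dna_clamp_candidates(pqs_multimeric, cath_folds, femme_features):
    """Entries that are multimeric, carry the DNA clamp fold and have a central channel."""
    return sorted(
        entry
        for entry in pqs_multimeric
        if "DNA clamp" in cath_folds.get(entry, set())
        and "central channel" in femme_features.get(entry, set())
    )

# Hypothetical placeholder identifiers, for illustration only.
pqs = {"entry_A", "entry_B", "entry_C"}
cath = {"entry_A": {"DNA clamp"}, "entry_B": {"DNA clamp"}, "entry_D": {"DNA clamp"}}
femme = {"entry_A": {"central channel"}, "entry_C": {"central channel"}}

print(dna_clamp_candidates(pqs, cath, femme))  # ['entry_A']
```

In practice the three sources use different identifiers and vocabularies, which is the gap a semantic mediator is meant to close.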

Page 29

Ultimate means: Semantic Data Mediation

• Programmable integrator
  – Interleaves information access and algorithm execution

• Semantic mediator
  – Encodes and executes domain-specific expert rules for data joining

[Figure: mediator architecture. Components shown: USER/Client; Mediator Engine (FL rule proc., LP rule proc., Graph proc., XSB Engine); Domain Map (DM); Integrated View Definition (IVD); CM (Integrated View); Logic API (capabilities); CM Queries & Results (exchanged in XML); CM Plug-In; GCM with CM S1, CM S2 and CM S3; a CM-Wrapper and an XML-Wrapper per source; sources S1, S2 and S3: relational databases, web sources (html, XML), service applications]

Page 30

Extended Domain Map in a Structural Biology Context

[Figure: domain map. Concepts: My_protein (name, My_function), My_Polypeptide chain, Fold Instance, SSE, Medium-Resolution 3D Image, Alpha-shape, Triangulated Surface, Cavity/Channels, Protrusion, Connectivity, 3D Point (X,Y,Z; Curvature; Normal), Properties (area, …). Data sources: Swissprot, PDB, PQS, CATH Superfamily, Enzyme Database, InterPro. Tools: Helix hunter, Beta hunter, Fold hunter, Superfamily detector. Relations between boxes: Has, Has +, Has *, Found_in +, Derive, Derive +.]

Red-framed boxes require visualization tools!!

Page 31

Current state: PLAN – a Language for a Programmable Integrator

• XML-based language

• XQuery

PLAN Example

Retrieve those folds in CATH corresponding to proteins which contain a given InterPro motif (IPR001198)

[Figure: data flow InterPro → SwissProt matches → PDB chains → CATH codes; resources labelled on the slide: http://www.ebi.ac.uk/interpro, BLASTp search, CATH Domain Description File]

Page 32

W.S.J. Valdar, J.M. Thornton, Protein–Protein Interfaces: Analysis of Amino Acid Conservation in Homodimers. PROTEINS: Structure, Function, and Genetics 42:108–124 (2001)

• the protomer to be studied must form a stable, symmetric complex with one other protomer to which it is identical (or nearly identical), such that the oligomer is homodimeric and the conservation of only one chain need be considered;

• the full wild-type complex must be available in PDB or PQS;

• of all the structures available for the complex, the structure chosen must have the best combination of the following properties:
  – high resolution, inclusion of any bound cofactors that occur naturally, the inclusion of a ligand similar in size and shape to that of the natural substrate.

• to enable the robust identification of a diverse set of homologues, the protomer should be represented in CATH;

• the protomer sequence must have non-fragment homologues in SwissProt that are numerous (>10) and diverse (<70% mean pairwise sequence identity) and, by their annotation, share its function and multimeric state

Page 33

Selection criteria:

1. The oligomer is homodimeric

2. Available in CATH

3. Group by protein

3a. Numerous distant homologues

3b. Wild-type protein

4. Share multimeric state

5. Final selection

[Figure: workflow with columns Data sources, Operation, Criteria. Data sources: PQS; CATH; BLAST; SwissProt; PDB, ENZYME. Operations: Filtering; Collection.]
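The filtering and collection steps can be sketched over records that have already been gathered from the data sources. This is a simplified illustration, not the PLAN workflow itself: the `Structure` fields, the use of resolution as the only criterion for the final selection, and the example records are assumptions; the thresholds (more than 10 homologues, less than 70% mean pairwise identity) are the ones quoted from Valdar and Thornton.

```python
from dataclasses import dataclass

@dataclass
class Structure:
    protein: str          # protein name used for grouping (criterion 3)
    homodimer: bool       # criterion 1
    in_cath: bool         # criterion 2
    wild_type: bool       # criterion 3b
    homologues: int       # criterion 3a: number of non-fragment homologues
    mean_identity: float  # criterion 3a: mean pairwise sequence identity, percent
    resolution: float     # Å, lower is better (assumed basis for the final selection)

def select(structures):
    """Filter by the criteria, then keep the best-resolution structure per protein."""
    passed = [
        s for s in structures
        if s.homodimer and s.in_cath and s.wild_type
        and s.homologues > 10 and s.mean_identity < 70.0
    ]
    best = {}
    for s in passed:
        if s.protein not in best or s.resolution < best[s.protein].resolution:
            best[s.protein] = s
    return sorted(best.values(), key=lambda s: s.protein)

# Hypothetical records, for illustration only.
candidates = [
    Structure("protA", True, True, True, 25, 45.0, 2.4),
    Structure("protA", True, True, True, 25, 45.0, 1.9),
    Structure("protB", True, True, True, 6, 40.0, 1.8),    # too few homologues
    Structure("protC", False, True, True, 30, 50.0, 2.0),  # not homodimeric
]

print([(s.protein, s.resolution) for s in select(candidates)])  # [('protA', 1.9)]
```

The criterion that homologues share function and multimeric state depends on annotation and is left out of this sketch.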

Page 34

PLAN Example (I)

<QUERY>
  <result>
    LET $x := set("","ipr","IPR001198"),
        $x := set($x,"display","n"),
        $x := set($x,"dmax","20000"),
        $y := constructURL("GET","http://www.ebi.ac.uk/interpro/ISpy",$x)
    RETURN $y
  </result>
</QUERY>
<TRAVERSE>POP</TRAVERSE>

<QUERY>
  <result>
    <DATA NAME="InterProMatches" TYPE="Add">
      RETURN stream()
    </DATA>
  </result>
</QUERY>

Slide annotations: URL constructor; Wrapper call; Internal data buffer (allows XML filtering)
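For readers unfamiliar with PLAN, the URL-constructor step of the first query can be mirrored in Python. This is only an analogy for what `set(...)` and `constructURL(...)` appear to do in the example; the helper name `construct_url` is an assumption, and the sketch builds the address without sending a request (the ISpy endpoint is taken from the slide and may no longer be served).

```python
from urllib.parse import urlencode

def construct_url(base, params):
    """Append the parameters to the base address as a GET query string."""
    return f"{base}?{urlencode(params)}"

# Same three parameters as in the PLAN query, in the same order.
params = {"ipr": "IPR001198", "display": "n", "dmax": "20000"}
url = construct_url("http://www.ebi.ac.uk/interpro/ISpy", params)
print(url)
```

The resulting address is what the wrapper call would fetch, with the response streamed into the `InterProMatches` buffer.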

Page 35

PLAN Example (II)

<WHILE>
  <CONDITION>
    <STACK>
      <CONDITION>NONEMPTY</CONDITION>
    </STACK>
  </CONDITION>
  <DO>
    <TRAVERSE>POP</TRAVERSE>
    <QUERY>
      <result>
        <DATA NAME="spToPdb" TYPE="Add">
          RETURN stream()
        </DATA>
      </result>
    </QUERY>
  </DO>
</WHILE>

<CONSTRUCT>
  <DATA NAME="r1" />
</CONSTRUCT>
<DELETE FILE="./resultFiles/q1_IPR001198.xml" />
<PRINTOUT FILE="./resultFiles/q1_IPR001198.xml" />

<XMLBUFFER NAME="InterproMatches" />

Slide annotations: Working register is…; Save result data in a file; Nesting requests

Page 36

Final Remark: “Infrastructures”

• All our software is in the public domain, with a sustained tradition of making it really accessible (XMIPP, BPR…)

Page 37

Acknowledgements

• The CNB Biocomputing Unit:
  L.E. Donate, Mikel Valle, Carmen San Martin, María Gómez, Yolanda Robledo, Rafael Núñez, Yacob, Monica Chagoyen, Roberto Marabini, Alberto Pascual, Carlos-Oscar Sanchez, Natalia Jiménez-Lozano, Javier A. Velázquez-Muriel, Pedro Carmona, David Elguero, Jesus Cuenca

• Extra mural:
  The EBI Team, Herbert Edelsbrunner, Wah Chiu’s Lab, SDSC (Gupta’s Lab), Ioannis Kakadiaris’s Lab, Niels Volkmann, Gruss and Cheng Lab, Mark Ellisman Lab

• (and MANY other interactions)