Applications of knowledge discovery to molecular biology: Identifying structural regularities in...


Applications of knowledge discovery to molecular biology: Identifying structural regularities in proteins

Shaobing Su

Supervisor: Dr. Lawrence B. Holder

Committee: Dr. Diane J. Cook

Dr. Edward Bellion

Outline

Motivation and goal of the research

SUBDUE knowledge discovery system

Proteins and PDB

Methods and results

Discussion and conclusion

Future research

Motivation and Goal

An explosive amount of molecular biology information needs to be analyzed to help understand the underlying structure-function relationships in proteins and other macromolecules.

Apply SUBDUE to the Brookhaven Protein Data Bank (PDB) to identify biologically meaningful patterns

SUBDUE knowledge discovery system

SUBDUE discovers patterns (substructures) in structural data sets

SUBDUE represents data as a labeled graph

Inputs: vertices and edges

Outputs: discovered patterns and instances

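Example

[Figure: example labeled graph with the vertex labels "object", "triangle", and "square" and the edge labels "shape" and "on"]

Vertices: objects or attributes

Edges: relationships

4 instances of [substructure shown in figure]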

SUBDUE’s search algorithm

Minimum Description Length (MDL) principle: The best theory to describe a set of data is the one that minimizes the description length (DL) of the entire data set

DL of the graph: the number of bits necessary to completely describe the graph

Search for the substructure that results in the maximum compression
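The search criterion above can be illustrated with a toy calculation. This is only a sketch: the bit-counting formula in `description_length`, the graph sizes, and the candidate substructures "A" and "B" are all made-up assumptions for illustration, not SUBDUE's actual graph encoding.

```python
import math

def description_length(n_vertices, n_edges, n_vertex_labels, n_edge_labels):
    """Crude bit count (an assumption, not SUBDUE's encoding):
    one label per vertex, plus two endpoints and one label per edge."""
    vertex_bits = n_vertices * math.log2(max(n_vertex_labels, 2))
    edge_bits = n_edges * (2 * math.log2(max(n_vertices, 2))
                           + math.log2(max(n_edge_labels, 2)))
    return vertex_bits + edge_bits

def total_length_after_compression(graph, sub, n_instances):
    """DL(S) + DL(G|S): describe the substructure once, then the graph
    with each non-overlapping instance collapsed into a single vertex."""
    v, e, v_labels, e_labels = graph
    sub_v, sub_e = sub
    dl_sub = description_length(sub_v, sub_e, v_labels, e_labels)
    # Each instance loses its internal vertices and edges and gains one new vertex.
    v_left = v - n_instances * sub_v + n_instances
    e_left = e - n_instances * sub_e
    dl_rest = description_length(v_left, e_left, v_labels + 1, e_labels)
    return dl_sub + dl_rest

# Hypothetical graph: (vertices, edges, vertex labels, edge labels)
graph = (20, 19, 4, 2)
# Hypothetical candidates: name -> ((vertices, edges) of substructure, instances)
candidates = {"A": ((2, 1), 4), "B": ((3, 2), 4)}

original_bits = description_length(*graph)
scores = {name: total_length_after_compression(graph, sub, n)
          for name, (sub, n) in candidates.items()}
best = min(scores, key=scores.get)  # smallest total DL = maximum compression
print(best, round(scores[best], 1), "bits vs", round(original_bits, 1), "bits uncompressed")
```

Under these assumptions the larger substructure wins because collapsing its instances removes more vertices and edges from the remaining graph than it costs to describe the substructure once.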

Inexact graph match approach

Find instances with a slight distortion: insertion, deletion, and substitution of edges/vertices.

Threshold parameter: specifies the amount of distortion allowed.

Overview of proteins

most important biomolecule

composed of 20 amino acids

structural hierarchy

very diverse structure and function

Structural hierarchy in proteins

Primary structure (sequence of protein)

Secondary structure (helix, sheet, random)

Tertiary structure (3-D)

Primary structure of proteins

Average 100-150 residues (a.a.) linked head to tail

N-terminus and C-terminus

Peptide bond, alpha-carbon

[Figure: dipeptide backbone H3N(+) - C1 - C - N - C2 - C - O(-) with side chains R1 and R2, showing the peptide bond between the first a.a. and the second a.a., the N-terminus, and the C-terminus]

Secondary structure elements

Ordered backbone arrangement: helix and sheet

Helix (0% to 90%; average 11 a.a.; several types)

Sheet (2 to 15 strands per sheet; parallel and anti-parallel; average 6 a.a. per strand)

[Figures: right-handed α-helix; two-stranded parallel β-sheet; two-stranded anti-parallel β-sheet]

Tertiary structure of protein

Highly complicated 3-D arrangement

Folding of its secondary structure elements

Brookhaven Protein Data Bank (PDB)

Brookhaven National Laboratory

Over 6000 experimentally determined 3-D structures of biomolecules

Majority: protein structures

Contents of PDB

SEQRES: sequence of a.a. (three-letter code)

HELIX: start, end, and type

SHEET: start, end, and sense

ATOM: (x, y, z) coordinates for each atom in the protein

Applications of SUBDUE to PDB - Methods and Results

July 1997 PDB™ release (6000 PDB)

Global data set (4000 PDB)

Category data sets:

Hemoglobin

Myoglobin

Ribonuclease A

Flowchart of Research

[Flowchart with two stages, Preprocessing and Application, and the boxes: Brookhaven PDB; Graphic representation; Inputs to SUBDUE; Patterns in Category; Patterns in Global others; Instance mapping]

Preprocessing

compile PDB list for each category

model.c: extract first model

seq.c: extract sequence info and convert to graphic format

secondary.c: extract secondary structure info and convert to graphic format

coor.c: extract 3D coordinates and convert to graphic format

Primary structure and its representation

Sample PDB lines:
SEQRES 1 150 ALA ASN LYS THR 1ASH 139
SEQRES 2 150 LYS SER LEU GLU 1ASH 140

Sequence (N-terminus to C-terminus):
ALA ASN LYS THR LYS SER LEU GLU

SUBDUE graphic input (ALA ASN):
v 1 ALA - - - ALA residue
v 2 ASN - - - ASN residue
e 1 2 bond - - - a peptide bond between ALA and ASN
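A minimal sketch of this conversion, in the spirit of what seq.c is described as doing. The function name `seqres_to_subdue` and the token filter are assumptions made for illustration; the filter is written for the simplified sample lines on the slide, not for the full fixed-column PDB format.

```python
def seqres_to_subdue(lines):
    """Turn simplified SEQRES lines into SUBDUE-style vertex and edge lines."""
    residues = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0] != "SEQRES":
            continue
        # Keep only three-letter alphabetic tokens (residue names); this skips
        # the serial number, the residue count, and the trailing ID fields.
        residues += [f for f in fields[1:] if len(f) == 3 and f.isalpha()]
    # One vertex per residue, numbered from the N-terminus.
    out = [f"v {i} {name}" for i, name in enumerate(residues, start=1)]
    # One "bond" edge (peptide bond) between consecutive residues.
    out += [f"e {i} {i + 1} bond" for i in range(1, len(residues))]
    return out

sample = [
    "SEQRES 1 150 ALA ASN LYS THR 1ASH 139",
    "SEQRES 2 150 LYS SER LEU GLU 1ASH 140",
]
graph_lines = seqres_to_subdue(sample)
print("\n".join(graph_lines))
```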

Secondary structure and its representation - HELIX

Sample PDB lines (starting, ending, type):
HELIX 1 ASN 1 HIS 13 1
HELIX 2 ASN 20 ASN 36 1

vertex: h_type_length

Helix length:
Hlength = SeqNum(last a.a.) - SeqNum(first a.a.)

SUBDUE graphic input:
v 1 h_1_12 - - - helix 1, type 1, length 12
v 2 h_1_16 - - - helix 2, type 1, length 16
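The helix label can be computed directly from the length rule above. This is a sketch under the assumption that the line has the simplified whitespace-separated layout shown on the slide (record, helix number, start name, start number, end name, end number, type); `helix_vertex_label` is a hypothetical helper name.

```python
def helix_vertex_label(line):
    """Build the h_type_length label from a simplified HELIX line."""
    fields = line.split()
    start_seq = int(fields[3])   # sequence number of the first a.a.
    end_seq = int(fields[5])     # sequence number of the last a.a.
    helix_type = fields[6]
    # Hlength = SeqNum(last a.a.) - SeqNum(first a.a.)
    return f"h_{helix_type}_{end_seq - start_seq}"

labels = [helix_vertex_label(l) for l in ("HELIX 1 ASN 1 HIS 13 1",
                                          "HELIX 2 ASN 20 ASN 36 1")]
print(labels)
```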

Secondary structure and its representation - SHEET

Sample PDB lines (sense, length):
SHEET 1 TYR 284 ILE 286 0
SHEET 2 HIS 292 THR 294 -1

vertex: s_sense_length

SUBDUE graphic input:
v 1 s_0_2 - - - strand 1, sense 0, length 2
v 2 s_-1_2 - - - strand 2, sense -1, length 2
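Strand labels follow the same idea, with the sense taking the place of the helix type. The sketch below assumes the simplified layout of the sample lines (record, strand number, start name, start number, end name, end number, sense) and that strand length is the difference of the two sequence numbers, which matches the sample output; `strand_vertex_label` is a hypothetical name.

```python
def strand_vertex_label(line):
    """Build the s_sense_length label from a simplified SHEET line."""
    fields = line.split()
    start_seq = int(fields[3])
    end_seq = int(fields[5])
    sense = int(fields[6])       # sense value as given on the line, e.g. 0 or -1
    return f"s_{sense}_{end_seq - start_seq}"

strands = [strand_vertex_label(l) for l in ("SHEET 1 TYR 284 ILE 286 0",
                                            "SHEET 2 HIS 292 THR 294 -1")]
print(strands)
```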

Overall secondary structure representation

PDB lines:
HELIX 1 THR 3 MET 13 1
HELIX 2 ASN 24 ASN 34 1
HELIX 3 SER 50 GLN 60 1
SHEET 1 LYS 41 HIS 48 0
SHEET 2 MET 79 THR 87 -1

SUBDUE graphic input:
v 1 h_1_10
v 2 h_1_10    e 1 2 sh
v 3 s_0_7     e 2 3 sh
v 4 h_1_10    e 3 4 sh
v 5 s_-1_8    e 4 5 sh

The sequential relationship is represented as the edge "sh"

Visualization:

[Figure: the five elements drawn in sequence order from N-terminus to C-terminus]
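The ordering step can be sketched as follows: elements are sorted by their starting sequence number, so a strand that starts between two helices lands between them, and consecutive elements are joined by "sh" edges. The `(start_seq, label)` input format and the name `chain_to_subdue` are assumptions for illustration.

```python
def chain_to_subdue(elements):
    """elements: (start_seq, label) pairs; returns SUBDUE-style v/e lines."""
    ordered = sorted(elements)  # order along the chain, N- to C-terminus
    lines = [f"v {i} {label}" for i, (_, label) in enumerate(ordered, start=1)]
    # "sh" edges link each element to the next one in sequence order.
    lines += [f"e {i} {i + 1} sh" for i in range(1, len(ordered))]
    return lines

# Start positions and labels taken from the sample lines on this slide,
# listed in PDB record order (helices first, then strands).
elements = [(3, "h_1_10"), (24, "h_1_10"), (50, "h_1_10"),
            (41, "s_0_7"), (79, "s_-1_8")]
chain_lines = chain_to_subdue(elements)
print("\n".join(chain_lines))
```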

Tertiary structure and its representation

Sample PDB lines (columns: record, atom, residue, sequence number, X, Y, Z):
ATOM CA ALA 1 10.369 0.997 10.519
ATOM CA ASN 2 6.691 0.239 9.830

vertex: backbone carbon; edge: distance (vs, s)

Distance (Å):
$\text{distance} = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2}$

v 1 CA_ALA
v 2 CA_ASN
e 1 2 vs - - - very short distance

Rationale for representation choice - Criteria

Patterns identified by SUBDUE must be representative of each category

Patterns discovered by SUBDUE should discriminate one category from the others

Primary sequence

vertex - a.a. residue name

edge - peptide bond

v 1 ARG
v 2 GLU
v 3 ALA
e 1 2 bond
e 2 3 bond

[Figure: ARG - bond - GLU - bond - ALA]

Secondary structure elements

Type of the helix

Starting and ending points (a.a. name and seq number)

[Figure: Helix 1 with type 1 and length 12, starting at ASN and ending at HIS, drawn from N-terminus to C-terminus]

Other ways of representing helix

Separate type and length

Combine type and length

[Figure: Helix 1 drawn with separate vertices for type (1) and length (12), next to the combined vertex Helix_1_12]

Tertiary structure

(x, y, z) coordinates vary with different origin choice

Avoid numeric values; use vs (≤ 4 Å), s (4 Å < dist ≤ 6 Å)

[Figure: C1 at (10.4, 1.0, 10.5) and C2 at (6.7, 0.2, 9.8) joined by a "vs" edge]
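A short sketch of the discretization, using the two sample CA coordinates from the earlier tertiary-structure slide. The cutoffs of 4 Å and 6 Å are read from this slide; treating them as inclusive upper bounds, and returning `None` (no edge) beyond 6 Å, are assumptions, as is the helper name `distance_label`.

```python
import math

def distance_label(p, q):
    """Map the Euclidean distance between two points (in Å) to an edge label."""
    d = math.dist(p, q)      # sqrt of the summed squared coordinate differences
    if d <= 4.0:
        return "vs"          # very short
    if d <= 6.0:
        return "s"           # short
    return None              # assumption: farther pairs get no edge

ca_ala = (10.369, 0.997, 10.519)
ca_asn = (6.691, 0.239, 9.830)
label = distance_label(ca_ala, ca_asn)
print(round(math.dist(ca_ala, ca_asn), 2), label)
```

Because only the distance enters the label, the result does not change when the coordinate origin is moved, which is the point the slide makes about raw (x, y, z) values.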

Results: Primary structure patterns

Hemo_seq (63/65)
Hemo_sequence: THR LYS THR TYR PHE PRO HIS PHE ASP LEU SER HIS GLY SER ALA GLN VAL LYS GLY HIS GLY LYS LYS VAL ALA ASP ALA LEU THR ASN ALA VAL ALA HIS VAL ASP ASP MET PRO ASN ALA LEU SER ALA LEU SER THR LEU ALA ALA HIS LEU PRO ALA GLU PHE THR PRO ALA VAL HIS ALA SER LEU ASP LYS PHE LEU ALA SER VAL SER THR VAL LEU THR SER LYS TYR

Myo_seq (67/103)
Myoglo_sequence: VAL LEU SER GLU GLY GLU TRP GLN LEU VAL LEU HIS VAL TRP ALA LYS VAL GLU ALA ASP VAL ALA GLY HIS GLY GLN ASP ILE LEU ILE ARG LEU PHE LYS SER HIS PRO GLU THR LEU GLU LYS PHE ASP ARG

Ribo_A (59/68)
Ribonuclease_A_sequence: GLY GLN THR ASN CYS TYR GLN SER TYR SER THR MET SER ILE THR ASP CYS ARG GLU THR GLY SER SER LYS TYR PRO ASN CYS ALA TYR LYS THR THR GLN ALA ASN LYS HIS ILE ILE VAL ALA CYS GLU GLY ASN PRO TYR VAL PRO VAL HIS PHE ASP ALA SER VAL

Primary structure patterns

Unique to each sample category

Hemoglobin and myoglobin proteins share little sequence similarity

Results: Hemo secondary structure patterns

The secondary structure patterns discovered in hemoglobin (Hemo). Each cell gives the pattern name, its footnote number in brackets, and the number of instances in Hemo / Global_Other.

| Exp. parameter | Pattern 1 | Pattern 2 | Pattern 3 |
|---|---|---|---|
| Threshold 0.0 | Hemo_s_1_0.0 [1] (50 / 0) | Hemo_s_2_0.0 [2] (52 / 0) | Hemo_s_3_0.0 [3] (50 / NA) |
| Threshold 0.1 | Hemo_s_1_0.1 [4] (51 / NA) | Hemo_s_2_0.1 [5] (58 / NA) | Hemo_s_3_0.1 [6] (52 / NA) |
| Threshold 0.2 | Hemo_s_1_0.2 [7] (90 / NA) | Hemo_s_2_0.2 [8] (98 / NA) | Hemo_s_3_0.2 [9] (92 / NA) |
| Threshold 0.3 | Hemo_s_1_0.3 [10] (95 / NA) | Hemo_s_2_0.3 [11] (107 / NA) | Hemo_s_3_0.3 [12] (100 / NA) |

[1]: h_1_14 -> h_1_15 -> h_1_6 -> h_1_6 -> h_1_19 -> h_1_8 -> h_1_18 -> h_1_20

[7]: h_1_15 -> h_1_15 -> h_1_6 -> h_1_1 -> h_1_19 -> h_1_8 -> h_1_18 -> h_1_20

Results: Myo secondary structure patterns

The secondary structure patterns discovered in myoglobin (Myo). Each cell gives the pattern name, its footnote number in brackets, and the number of instances in Myo / Global_Other.

| Exp. parameter | Pattern 1 | Pattern 2 | Pattern 3 |
|---|---|---|---|
| Threshold 0.0 | Myo_s_1_0.0 [1] (81 / 0) | Myo_s_2_0.0 [2] (82 / 0) | Myo_s_3_0.0 [3] (81 / 0) |
| Threshold 0.1 | Myo_s_1_0.1 [4] (81 / NA) | Myo_s_2_0.1 [5] (84 / NA) | Myo_s_3_0.1 [6] (81 / NA) |
| Threshold 0.2 | Myo_s_1_0.2 [7] (83 / NA) | Myo_s_2_0.2 [8] (84 / NA) | Myo_s_3_0.2 [9] (83 / NA) |
| Threshold 0.3 | Myo_s_1_0.3 [10] (83 / NA) | Myo_s_2_0.3 [11] (84 / NA) | Myo_s_3_0.3 [12] (84 / NA) |

[1]: h_1_15 -> h_1_15 -> h_1_6 -> h_1_6 -> h_1_19 -> h_1_9 -> h_1_18 -> h_1_25

Results: Ribo_A secondary structure patterns

The secondary structure patterns discovered in ribonuclease A (Ribo_A). Each cell gives the pattern name, its footnote number in brackets, and the number of instances in Ribo_A / Global_Other.

| Exp. parameter | Pattern 1 | Pattern 2 | Pattern 3 |
|---|---|---|---|
| Threshold 0.0 | Ribo_A_s_1_0.0 [1] (25 / 0) | Ribo_A_s_2_0.0 [2] (25 / 0) | Ribo_A_s_3_0.0 [3] (25 / 0) |
| Threshold 0.1 | Ribo_A_s_1_0.1 [4] (27 / NA) | Ribo_A_s_2_0.1 [5] (27 / NA) | Ribo_A_s_3_0.1 [6] (27 / NA) |
| Threshold 0.2 | Ribo_A_s_1_0.2 [7] (27 / NA) | Ribo_A_s_2_0.2 [8] (27 / NA) | Ribo_A_s_3_0.2 [9] (27 / NA) |
| Threshold 0.3 | Ribo_A_s_1_0.3 [10] (36 / NA) | Ribo_A_s_2_0.3 [11] (36 / NA) | Ribo_A_s_3_0.3 [12] (36 / NA) |

[1]: h_1_10 -> h_1_10 -> s_0_7 -> s_0_7 -> h_1_10 -> s_0_3 -> s_0_3 -> s_-1_4 -> s_-1_4 -> s_-1_8 -> s_-1_1 -> s_-1_10 -> s_-1_10 -> s_-1_8 -> s_-1_8 -> s_-1_5 -> s_-1_3

[10]: h_1_10 -> h_1_10 -> s_0_7 -> h_1_10 -> s_0_3 -> s_-1_4 -> s_-1_8 -> s_-1_8 -> s_-1_6

Results: Tertiary structural patterns

SUBDUE finds small patterns (2 or 3 a.a.)

Not unique for each category of proteins

Not biologically meaningful

Visualization of secondary structure patterns - hemoglobin

[Figure: complete hemoglobin structure and 2 instances of the pattern, drawn from N-terminus to C-terminus]

Visualization of secondary structure patterns - myoglobin

[Figure: complete myoglobin structure and 1 instance of the pattern, drawn from N-terminus to C-terminus]

Visualization of secondary structure patterns - ribonuclease_A

[Figure: complete ribonuclease_A structure and 1 instance of the pattern, drawn from N-terminus to C-terminus]

Discussion - Hemoglobin

Hemoglobin: A, B, C, D chains

Two types of patterns identified by SUBDUE

One for the A, C chains, the other for the B, D chains

Patterns exist in a majority of hemoglobin proteins

No instances of the best hemoglobin pattern found in other proteins in the global data set

Occurrence of hemo patterns

The occurrences of the best hemoglobin patterns

| PDB Name | Occurrence | Species |
|---|---|---|
| pdb2hhb.ent | B, D chains [1]; A, C chains [7] | human |
| pdb1sdl.ent | NO | human |
| pdb1bbb.ent | B chain [1] | human |
| pdb4hhb.ent | B, D chains [1]; A, C chains [7] | human |
| pdb1thb.ent | A, C, B, D chains [7] | human |
| pdb3hhb.ent | B chain [1]; A chain [7] | human |
| pdb1sdk.ent | NO | human |
| pdb1cbm.ent | A, B, C, D chains [1] | human |
| pdb1cls.ent | NO | human |
| pdb1hbb.ent | B, D chains [1]; A, C chains [7] | human |
| pdb1hba.ent | B, D chains [1]; A, C chains [7] | human |
| pdb2hbc.ent | B chain [1]; A chain [7] | human |
| pdb1cbl.ent | A, B, C, D chains [1] | human |
| pdb1hga.ent | N/A | human |
| pdb1hgb.ent | N/A | human |
| pdb1hgc.ent | N/A | human |
| pdb2hbd.ent | B chain [1]; A chain [7] | human |
| pdb2hbf.ent | B chain [1]; A chain [7] | human |
| pdb1hho.ent | B chain [1]; A chain [7] | human |
| pdb1nih.ent | B, D chains [1]; A, C chains [7] | human |
| pdb1coh.ent | B, D chains [1]; A, C chains [7] | human |
| pdb1fdh.ent | G chain [1] | human |
| pdb2hco.ent | B chain [1]; A chain [7] | human |
| pdb1cmy.ent | B, D chains [1] | human |
| pdb1hbs.ent | B, D, F, H chains [1]; A, C, E, G chains [7] | human |
| pdb1hco.ent | B chain [1]; A chain [7] | human |

N/A: secondary structure information not available

[1] instance found with a threshold of 0.0 (default): Hemo_s_1_0.0

[4] instance found with a threshold of 0.1: Hemo_s_1_0.1

[7] instance found with a threshold of 0.2: Hemo_s_1_0.2

[10] instance found with a threshold of 0.3: Hemo_s_1_0.3

Occurrence of hemo patterns - continued

The occurrences of the best hemoglobin patterns

| PDB Name | Occurrence | Species |
|---|---|---|
| pdb1bab.ent | B chain [1]; A, C chains [7] | human |
| pdb1dxu.ent | B, D chains [1]; A, C chains [7] | human |
| pdb1dxv.ent | B, D chains [1]; A, C chains [7] | human |
| pdb1dxt.ent | B, D chains [1]; A, C chains [7] | human |
| pdb1gbu.ent | B, D chains [10] | human |
| pdb1gbv.ent | B, D chains [10] | human |
| pdb1hdb.ent | NO | human |
| pdb1dsh.ent | NO | human |
| pdb2hhe.ent | N/A | human |
| pdb1gli.ent | B, D chains [1]; A, C chains [7] | human |
| pdb2hbe.ent | B chain [1] | human |
| pdb1ibe.ent | NO | horse |
| pdb2mhb.ent | NO | horse |
| pdb2dhb.ent | B chain [4]; A chain [7] | horse |
| pdb1hds.ent | N/A | deer |
| pdb1hda.ent | B, D chains [1] | bovine |
| pdb2pgh.ent | B, D chains [1] | pig |
| pdb1out.ent | NO | trout |
| pdb1ouu.ent | NO | trout |
| pdb1pbx.ent | B chain [1] | antarctic fish |
| pdb1hbh.ent | NO | antarctic fish |
| pdb1ith.ent | NO | innkeeper worm |

N/A: secondary structure information not available

[1] instance found with a threshold of 0.0 (default): Hemo_s_1_0.0

[4] instance found with a threshold of 0.1: Hemo_s_1_0.1

[7] instance found with a threshold of 0.2: Hemo_s_1_0.2

[10] instance found with a threshold of 0.3: Hemo_s_1_0.3

Discussion - Myoglobin

Myoglobin: one chain

One dominant pattern identified by SUBDUE

Patterns exist in most myoglobin proteins

No instances of the best myoglobin pattern found in other proteins in the global data set

Discussion - Hemoglobin and Myoglobin

Similar secondary structure patterns

Hemoglobin B, D chains (from N- to C-terminus)

h_1_14 -> h_1_15 -> h_1_6 -> h_1_6 -> h_1_19 -> h_1_8 -> h_1_18 -> h_1_20

Myoglobin chain (from N- to C-terminus)

h_1_15 -> h_1_15 -> h_1_6 -> h_1_6 -> h_1_19 -> h_1_9 -> h_1_18 -> h_1_25

Hemoglobin A, C chains (from N- to C-terminus)

h_1_15 -> h_1_15 -> h_1_6 -> h_1_1 -> h_1_19 -> h_1_8 -> h_1_18 -> h_1_20
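One simple way to quantify how similar these three patterns are is to count the positions at which two patterns carry the same vertex label. This position-by-position count is only an illustrative measure chosen here; it is not SUBDUE's inexact graph match, and the helper name `matching_positions` is hypothetical.

```python
def matching_positions(a, b):
    """Count positions where two equal-length patterns have the same label."""
    return sum(x == y for x, y in zip(a, b))

def parse(pattern):
    return pattern.split(" -> ")

hemo_bd = parse("h_1_14 -> h_1_15 -> h_1_6 -> h_1_6 -> h_1_19 -> h_1_8 -> h_1_18 -> h_1_20")
myo     = parse("h_1_15 -> h_1_15 -> h_1_6 -> h_1_6 -> h_1_19 -> h_1_9 -> h_1_18 -> h_1_25")
hemo_ac = parse("h_1_15 -> h_1_15 -> h_1_6 -> h_1_1 -> h_1_19 -> h_1_8 -> h_1_18 -> h_1_20")

for name, a, b in [("Hemo B,D vs Myo", hemo_bd, myo),
                   ("Hemo A,C vs Myo", hemo_ac, myo),
                   ("Hemo B,D vs Hemo A,C", hemo_bd, hemo_ac)]:
    print(f"{name}: {matching_positions(a, b)} of {len(a)} helices identical")
```

All three patterns have eight helices of type 1; the two hemoglobin patterns share six labels, and each shares five with myoglobin, with the last helix being shorter in hemoglobin (length 20 versus 25).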

Discussion - Hemoglobin and Myoglobin

Consistent with the genetic studies

Hemoglobin and myoglobin share one ancestral gene

Divergence occurred in the course of evolution. One copy of the gene for myoglobin, four copies for hemoglobin.

The last helix of hemoglobin is shorter; one of the helices in the hemoglobin A, C chains almost disappears: allows conformational change

Discussion - Ribonuclease A proteins

All patterns have three helices of the same size

Several strands appear twice, indicating participation in the formation of two sheets.

Ribonuclease S protein (S-protein fragment) also has the pattern.

Conclusion of the results

Secondary structure patterns discovered by SUBDUE are representative of each category

Secondary structure patterns discovered by SUBDUE are distinct for each category

SUBDUE has the ability to discover biologically interesting patterns from PDB and other similar MB databases

Comparison with other related studies

Different graphic representation

Predefined patterns with exact or inexact graph match

Not applied systematically to PDB or other DB

SUBDUE would perform a similar task if the inexact graph match routine were incorporated

Conclusions of the study

Abstraction over 3D structure to its secondary structural elements is suitable for discovery

The secondary structure patterns SUBDUE discovered for each category can be used as a signature for its class

Inexact graph match is useful for finding similar patterns

SUBDUE is suitable for knowledge discovery in MB structural DB

Future Research

More consistent and detailed description of secondary structure

Add relative positions of the secondary structural elements to represent spatial relationship

Investigate alternative representation: more suitable 3D coordinates representation; weighting on different edges

Inexact graph match in predefined substructure

More collaboration with domain scientists