Applications of knowledge discovery to molecular biology: Identifying structural regularities in proteins
Shaobing Su
Supervisor: Dr. Lawrence B. Holder
Committee: Dr. Diane J. Cook
Dr. Edward Bellion
Outline
Motivation and goal of the research
SUBDUE knowledge discovery system
Proteins and PDB
Methods and results
Discussion and conclusion
Future research
Motivation and Goal
An explosive amount of molecular biology information needs to be analyzed to help understand the underlying structure-function relationships in proteins and other macromolecules.
Apply SUBDUE to the Brookhaven Protein Data Bank (PDB) to identify biologically meaningful patterns
SUBDUE knowledge discovery system
SUBDUE discovers patterns (substructures) in structural data sets
SUBDUE represents data as a labeled graph
Inputs: vertices and edges
Outputs: discovered patterns and instances
Example
[Figure: labeled graph with vertices "object", "triangle", and "square" and edges "shape" and "on"; 4 instances of the discovered substructure]
Vertices: objects or attributes
Edges: relationships
SUBDUE’s search algorithm
Minimum Description Length (MDL) principle: the best theory to describe a set of data is the one that minimizes the description length (DL) of the entire data set
DL of the graph: the number of bits necessary to completely describe the graph
Search for the substructure that results in the maximum compression
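
As a rough illustration of the compression idea above, the sketch below scores a candidate substructure by how much it shortens a crude bit count of the graph. The encoding (one label per vertex, one label plus two endpoint ids per edge), the ratio used as the score, and all function names are assumptions made for illustration; SUBDUE's actual description-length encoding is more detailed.

```python
import math

def description_length(num_vertices, num_edges, num_labels):
    """Crude bit count for a labeled graph (illustrative encoding, not SUBDUE's)."""
    label_bits = math.log2(max(num_labels, 2))   # bits to name one label
    id_bits = math.log2(max(num_vertices, 2))    # bits to name one vertex
    vertex_bits = num_vertices * label_bits
    edge_bits = num_edges * (label_bits + 2 * id_bits)  # label + two endpoints
    return vertex_bits + edge_bits

def compression_value(graph, sub, instances, num_labels):
    """Ratio DL(G) / (DL(S) + DL(G|S)); values above 1 mean S compresses G."""
    g_v, g_e = graph
    s_v, s_e = sub
    # Assumption: instances do not overlap, and each collapses to one new vertex.
    c_v = g_v - instances * (s_v - 1)
    c_e = g_e - instances * s_e
    dl_g = description_length(g_v, g_e, num_labels)
    dl_s = description_length(s_v, s_e, num_labels)
    dl_g_given_s = description_length(c_v, c_e, num_labels + 1)  # +1: new label
    return dl_g / (dl_s + dl_g_given_s)

# Hypothetical graph: 16 vertices, 15 edges, 5 distinct labels,
# containing 4 instances of a 4-vertex, 3-edge substructure.
good = compression_value((16, 15), (4, 3), instances=4, num_labels=5)
none = compression_value((16, 15), (4, 3), instances=0, num_labels=5)
print(good, none)
```

Under this toy encoding, a substructure with many instances scores above 1 and one with no instances scores below 1, which is the ordering the search relies on.
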
Inexact graph match approach
Find instances with a slight distortion: insertion, deletion, and substitution of edges/vertices.
Threshold parameter: specifies the amount of distortion allowed.
Overview of proteins
most important biomolecule
composed of 20 amino acids
structural hierarchy
very diverse structure and function
Structural hierarchy in proteins
Primary structure (sequence of protein)
Secondary structure (helix, sheet, random)
Tertiary structure (3-D)
Primary Structure of proteins
Average 100-150 residues (a.a.) linked head to tail
N-terminus and C-terminus
Peptide bond, alpha-carbon
[Figure: dipeptide backbone H3N(+) - C1 - C - N - C2 - C - O(-) with side chains R1 and R2; the peptide bond joins the first a.a. (N-terminus side) to the second a.a. (C-terminus side)]
Secondary structure elements
Ordered backbone arrangement: helix and sheet
Helix (0 % to 90 %; average 11 a.a.; several types)
Sheet (2 to 15 strands per sheet; parallel and anti-parallel; average 6 a.a. per strand)
[Figures: right-handed α-helix; two-stranded parallel β-sheet; two-stranded anti-parallel β-sheet]
Tertiary Structure of protein
Highly complicated 3-D arrangement
Folding of its secondary structure elements
Brookhaven Protein Data Bank (PDB)
Brookhaven National Laboratory
Over 6000 experimentally determined 3-D structures of biomolecules
Majority: protein structures
Contents of PDB
SEQRES: sequence of a.a. (three-letter code)
HELIX: start, end, and type
SHEET: start, end, and sense
ATOM: (x, y, z) coordinates for each atom in the protein
Applications of SUBDUE to PDB - Methods and Results
July 1997 PDB™ release (6000 PDB entries)
Global data set (4000 PDB entries)
Category data sets: hemoglobin, myoglobin, ribonuclease A
Flowchart of Research
[Flowchart with stages "Preprocessing" and "Application"; boxes: Brookhaven PDB, Graphic representation, Inputs to SUBDUE, Patterns in Category, Patterns in Global others, Instance mapping]
Preprocessing
compile PDB list for each category
model.c: extract first model
seq.c: extract sequence info and convert to graphic format
secondary.c: extract secondary structure info and convert to graphic format
coor.c: extract 3D coordinates and convert to graphic format
Primary structure and its representation
Sample PDB lines:
SEQRES 1 150 ALA ASN LYS THR        1ASH 139
SEQRES 2 150 LYS SER LEU GLU        1ASH 140
Sequence (N-terminus to C-terminus): ALA ASN LYS THR LYS SER LEU GLU
SUBDUE graphic input (ALA ASN):
v 1 ALA      - - - ALA residue
v 2 ASN      - - - ASN residue
e 1 2 bond   - - - a peptide bond between ALA and ASN
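
The mapping from a residue sequence to graph input can be sketched as follows. This is a minimal illustration that assumes the residues have already been read from the SEQRES lines into a list; the function name is hypothetical and is not taken from seq.c.

```python
def sequence_to_subdue(residues):
    """Return SUBDUE-style 'v' and 'e' lines for a chain of residues."""
    # One vertex per residue, numbered from 1 in N- to C-terminus order.
    vertices = [f"v {i} {name}" for i, name in enumerate(residues, start=1)]
    # One 'bond' edge per peptide bond between consecutive residues.
    edges = [f"e {i} {i + 1} bond" for i in range(1, len(residues))]
    return vertices + edges

for line in sequence_to_subdue(["ALA", "ASN", "LYS", "THR"]):
    print(line)
```

A chain of n residues therefore yields n vertices and n - 1 edges.
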
Secondary structure and its representation - HELIX
Sample PDB lines (starting, ending, type):
HELIX 1 ASN 1 HIS 13 1
HELIX 2 ASN 20 ASN 36 1
vertex: h_type_length
Helix length: Hlength = SeqNum(last a.a.) - SeqNum(first a.a.)
SUBDUE graphic input:
v 1 h_1_12   - - - helix 1, type 1, length 12
v 2 h_1_16   - - - helix 2, type 1, length 16
Secondary structure and its representation - SHEET
Sample PDB lines (sense, length):
SHEET 1 TYR 284 ILE 286 0
SHEET 2 HIS 292 THR 294 -1
vertex: s_sense_length
SUBDUE graphic input:
v 1 s_0_2    - - - strand 1, sense 0, length 2
v 2 s_-1_2   - - - strand 2, sense -1, length 2
Overall secondary structure representation
PDB lines:
HELIX 1 THR 3 MET 13 1
HELIX 2 ASN 24 ASN 34 1
HELIX 3 SER 50 GLN 60 1
SHEET 1 LYS 41 HIS 48 0
SHEET 2 MET 79 THR 87 -1
SUBDUE graphic input:
v 1 h_1_10
v 2 h_1_10   e 1 2 sh
v 3 s_0_7    e 2 3 sh
v 4 h_1_10   e 3 4 sh
v 5 s_-1_8   e 4 5 sh
sequential relationship is represented as edge “sh”
Visualization: [Figure: the five elements drawn in order from N-terminus to C-terminus]
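
The labeling and ordering shown in this example can be sketched as below. The records are assumed to be pre-parsed tuples of (record kind, type or sense, first sequence number, last sequence number), and the function names are hypothetical rather than taken from secondary.c.

```python
def element_label(kind, code, first_seq, last_seq):
    """Build a vertex label: h_<type>_<length> or s_<sense>_<length>."""
    prefix = "h" if kind == "HELIX" else "s"
    # Length follows the slide's definition: last SeqNum minus first SeqNum.
    return f"{prefix}_{code}_{last_seq - first_seq}"

def secondary_to_subdue(records):
    """Turn (kind, code, first_seq, last_seq) tuples into 'v' and 'e' lines."""
    ordered = sorted(records, key=lambda r: r[2])  # N- to C-terminus order
    lines = [f"v {i} {element_label(*rec)}"
             for i, rec in enumerate(ordered, start=1)]
    # Consecutive elements are linked by an 'sh' edge.
    lines += [f"e {i} {i + 1} sh" for i in range(1, len(ordered))]
    return lines

records = [
    ("HELIX", 1, 3, 13), ("HELIX", 1, 24, 34), ("HELIX", 1, 50, 60),
    ("SHEET", 0, 41, 48), ("SHEET", -1, 79, 87),
]
for line in secondary_to_subdue(records):
    print(line)
```

Sorting by starting position is what interleaves SHEET 1 between HELIX 2 and HELIX 3, even though the PDB file lists all helices before the sheets.
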
Tertiary structure and its representation
Sample PDB lines:
                    X        Y        Z
ATOM  CA  ALA  1    10.369   0.997    10.519
ATOM  CA  ASN  2    6.691    0.239    9.830
vertex: backbone carbon; edge: distance (vs, s)
Distance (Å): $\text{distance} = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2}$
v 1 CA_ALA
v 2 CA_ASN
e 1 2 vs   - - - very short distance
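
The distance and edge label for the two sample atoms can be checked with a short sketch. The cutoffs (vs up to 4 Å, s up to 6 Å) come from the representation rationale later in the deck; treating them as inclusive bounds, and emitting no edge for longer distances, are assumptions, as are the function names.

```python
import math

def distance(p, q):
    """Euclidean distance between two (x, y, z) points, in the input units."""
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(p, q)))

def distance_label(d, very_short=4.0, short=6.0):
    """Map a distance in angstroms to an edge label, or None for no edge."""
    if d <= very_short:
        return "vs"
    if d <= short:
        return "s"
    return None  # assumption: longer distances are not represented

ca_ala = (10.369, 0.997, 10.519)
ca_asn = (6.691, 0.239, 9.830)
d = distance(ca_ala, ca_asn)
print(round(d, 2), distance_label(d))
```

The two alpha-carbons are roughly 3.8 Å apart, which falls in the "vs" bin and agrees with the edge shown on the slide.
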
Rationale for representation choice - Criteria
Patterns identified by SUBDUE must be representative of each category
Patterns discovered by SUBDUE should discriminate one category from others
Primary sequence
vertex - a.a. residue name
edge - peptide bond
[Figure: ARG -bond- GLU -bond- ALA]
v 1 ARG    v 2 GLU    v 3 ALA
e 1 2 bond    e 2 3 bond
Secondary structure elements
Type of the helix
Starting and ending points (a.a. name and seq number)
[Figure: Helix 1 drawn from N-terminus to C-terminus; type 1, length 12; starts at ASN, ends at HIS]
Other ways of representing helix
Separate type and length
Combine type and length
[Figure: separate form, vertex "Helix 1" with type 1 and length 12; combined form, single vertex "Helix_1_12"]
Tertiary structure
(x, y, z) coordinates vary with different origin choice
Avoid numeric values; use vs (≤ 4 Å), s (4 Å < dist ≤ 6 Å)
[Figure: C1 at (10.4, 1.0, 10.5) and C2 at (6.7, 0.2, 9.8) joined by a "vs" edge]
Results: Primary structure patterns
Hemo_seq (63/65)
Hemo_sequence: THR LYS THR TYR PHE PRO HIS PHE ASP LEU SER HIS GLY SER ALA GLN VAL LYS GLY HIS GLY LYS LYS VAL ALA ASP ALA LEU THR ASN ALA VAL ALA HIS VAL ASP ASP MET PRO ASN ALA LEU SER ALA LEU SER THR LEU ALA ALA HIS LEU PRO ALA GLU PHE THR PRO ALA VAL HIS ALA SER LEU ASP LYS PHE LEU ALA SER VAL SER THR VAL LEU THR SER LYS TYR
Myo_seq (67/103)
Myoglo_sequence: VAL LEU SER GLU GLY GLU TRP GLN LEU VAL LEU HIS VAL TRP ALA LYS VAL GLU ALA ASP VAL ALA GLY HIS GLY GLN ASP ILE LEU ILE ARG LEU PHE LYS SER HIS PRO GLU THR LEU GLU LYS PHE ASP ARG
Ribo_A (59/68)
Ribonuclease_A_sequence: GLY GLN THR ASN CYS TYR GLN SER TYR SER THR MET SER ILE THR ASP CYS ARG GLU THR GLY SER SER LYS TYR PRO ASN CYS ALA TYR LYS THR THR GLN ALA ASN LYS HIS ILE ILE VAL ALA CYS GLU GLY ASN PRO TYR VAL PRO VAL HIS PHE ASP ALA SER VAL
Primary structure patterns
Unique to each sample category
Hemoglobin and myoglobin proteins share little sequence similarity
Results: Hemo secondary structure patterns
The secondary structure patterns discovered in hemoglobin (Hemo)
Each cell gives the pattern name, its footnote number in brackets, and (# of instances in Hemo / Global_Other).
Exp. parameter  Pattern 1                     Pattern 2                      Pattern 3
Threshold 0.0   Hemo_s_1_0.0 [1]  (50 / 0)    Hemo_s_2_0.0 [2]  (52 / 0)     Hemo_s_3_0.0 [3]  (50 / NA)
Threshold 0.1   Hemo_s_1_0.1 [4]  (51 / NA)   Hemo_s_2_0.1 [5]  (58 / NA)    Hemo_s_3_0.1 [6]  (52 / NA)
Threshold 0.2   Hemo_s_1_0.2 [7]  (90 / NA)   Hemo_s_2_0.2 [8]  (98 / NA)    Hemo_s_3_0.2 [9]  (92 / NA)
Threshold 0.3   Hemo_s_1_0.3 [10] (95 / NA)   Hemo_s_2_0.3 [11] (107 / NA)   Hemo_s_3_0.3 [12] (100 / NA)
[1] h_1_14 -> h_1_15 -> h_1_6 -> h_1_6 -> h_1_19 -> h_1_8 -> h_1_18 -> h_1_20
[7] h_1_15 -> h_1_15 -> h_1_6 -> h_1_1 -> h_1_19 -> h_1_8 -> h_1_18 -> h_1_20
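
To make the role of the threshold concrete, the sketch below compares the two hemoglobin patterns listed above as plain label sequences. It is a simplification, not SUBDUE's graph matcher: it uses Levenshtein distance on the chain of vertex labels, and it assumes that a match is accepted when the distortion is at most the threshold times the size (vertices plus edges) of the larger chain. Both the rule and the function names are assumptions for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two label sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def chain_size(chain):
    """Vertices plus the 'sh' edges that link them."""
    return max(2 * len(chain) - 1, 0)

def matches(pattern, candidate, threshold):
    """Accept if distortion <= threshold * size of the larger chain (assumed rule)."""
    allowed = threshold * max(chain_size(pattern), chain_size(candidate))
    return edit_distance(pattern, candidate) <= allowed

p1 = "h_1_14 h_1_15 h_1_6 h_1_6 h_1_19 h_1_8 h_1_18 h_1_20".split()
p7 = "h_1_15 h_1_15 h_1_6 h_1_1 h_1_19 h_1_8 h_1_18 h_1_20".split()
print(edit_distance(p1, p7), matches(p1, p7, 0.1), matches(p1, p7, 0.2))
```

Under these assumptions the two patterns differ by two substitutions, so they are treated as distinct at a threshold of 0.1 and as instances of one another at 0.2.
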
Results: Myo secondary structure patterns
The secondary structure patterns discovered in myoglobin (Myo)
Each cell gives the pattern name, its footnote number in brackets, and (# of instances in Myo / Global_Other).
Exp. parameter  Pattern 1                    Pattern 2                    Pattern 3
Threshold 0.0   Myo_s_1_0.0 [1]  (81 / 0)    Myo_s_2_0.0 [2]  (82 / 0)    Myo_s_3_0.0 [3]  (81 / 0)
Threshold 0.1   Myo_s_1_0.1 [4]  (81 / NA)   Myo_s_2_0.1 [5]  (84 / NA)   Myo_s_3_0.1 [6]  (81 / NA)
Threshold 0.2   Myo_s_1_0.2 [7]  (83 / NA)   Myo_s_2_0.2 [8]  (84 / NA)   Myo_s_3_0.2 [9]  (83 / NA)
Threshold 0.3   Myo_s_1_0.3 [10] (83 / NA)   Myo_s_2_0.3 [11] (84 / NA)   Myo_s_3_0.3 [12] (84 / NA)
[1] h_1_15 -> h_1_15 -> h_1_6 -> h_1_6 -> h_1_19 -> h_1_9 -> h_1_18 -> h_1_25
Results: Ribo_A secondary structure patterns
The secondary structural patterns discovered in ribonuclease A (Ribo_A)
Each cell gives the pattern name, its footnote number in brackets, and (# of instances in Ribo_A / Global_Other).
Exp. parameter  Pattern 1                       Pattern 2                       Pattern 3
Threshold 0.0   Ribo_A_s_1_0.0 [1]  (25 / 0)    Ribo_A_s_2_0.0 [2]  (25 / 0)    Ribo_A_s_3_0.0 [3]  (25 / 0)
Threshold 0.1   Ribo_A_s_1_0.1 [4]  (27 / NA)   Ribo_A_s_2_0.1 [5]  (27 / NA)   Ribo_A_s_3_0.1 [6]  (27 / NA)
Threshold 0.2   Ribo_A_s_1_0.2 [7]  (27 / NA)   Ribo_A_s_2_0.2 [8]  (27 / NA)   Ribo_A_s_3_0.2 [9]  (27 / NA)
Threshold 0.3   Ribo_A_s_1_0.3 [10] (36 / NA)   Ribo_A_s_2_0.3 [11] (36 / NA)   Ribo_A_s_3_0.3 [12] (36 / NA)
[1] h_1_10 -> h_1_10 -> s_0_7 -> s_0_7 -> h_1_10 -> s_0_3 -> s_0_3 -> s_-1_4 -> s_-1_4 -> s_-1_8 -> s_-1_1 -> s_-1_10 -> s_-1_10 -> s_-1_8 -> s_-1_8 -> s_-1_5 -> s_-1_3
[10] h_1_10 -> h_1_10 -> s_0_7 -> h_1_10 -> s_0_3 -> s_-1_4 -> s_-1_8 -> s_-1_8 -> s_-1_6
Results: Tertiary structural patterns
SUBDUE finds small patterns (2 or 3 a.a.)
not unique for each category of proteins
not biologically meaningful
Visualization of secondary structure patterns - hemoglobin
[Figure: complete hemoglobin structure; 2 instances of the pattern, each running from N-terminus to C-terminus]
Visualization of secondary structure patterns - myoglobin
[Figure: complete myoglobin structure; 1 instance of the pattern, running from N-terminus to C-terminus]
Visualization of secondary structure patterns - ribonuclease_A
[Figure: complete ribonuclease_A structure; 1 instance of the pattern, running from N-terminus to C-terminus]
Discussion - Hemoglobin
Hemoglobin: A, B, C, D chains
Two types of patterns identified by SUBDUE
One for A, C chains, the other for B, D chains
Patterns exist in a majority of hemoglobin proteins
No instances of the best hemoglobin pattern found in other proteins in the global data set
Occurrence of hemo patterns
The occurrences of the best hemoglobin patterns
PDB Name      Occurrence                                 Species
pdb2hhb.ent   B, D chains [1]; A, C chains [7]           human
pdb1sdl.ent   NO                                         human
pdb1bbb.ent   B chain [1]                                human
pdb4hhb.ent   B, D chains [1]; A, C chains [7]           human
pdb1thb.ent   A, C, B, D chains [7]                      human
pdb3hhb.ent   B chain [1]; A chain [7]                   human
pdb1sdk.ent   NO                                         human
pdb1cbm.ent   A, B, C, D chains [1]                      human
pdb1cls.ent   NO                                         human
pdb1hbb.ent   B, D chains [1]; A, C chains [7]           human
pdb1hba.ent   B, D chains [1]; A, C chains [7]           human
pdb2hbc.ent   B chain [1]; A chain [7]                   human
pdb1cbl.ent   A, B, C, D chains [1]                      human
pdb1hga.ent   N/A                                        human
pdb1hgb.ent   N/A                                        human
pdb1hgc.ent   N/A                                        human
pdb2hbd.ent   B chain [1]; A chain [7]                   human
pdb2hbf.ent   B chain [1]; A chain [7]                   human
pdb1hho.ent   B chain [1]; A chain [7]                   human
pdb1nih.ent   B, D chains [1]; A, C chains [7]           human
pdb1coh.ent   B, D chains [1]; A, C chains [7]           human
pdb1fdh.ent   G chain [1]                                human
pdb2hco.ent   B chain [1]; A chain [7]                   human
pdb1cmy.ent   B, D chains [1]                            human
pdb1hbs.ent   B,D,F,H chains [1]; A,C,E,G chains [7]     human
pdb1hco.ent   B chain [1]; A chain [7]                   human
N/A: secondary structure information not available
[1] instance found with a threshold of 0.0 (default): Hemo_s_1_0.0
[4] instance found with a threshold of 0.1: Hemo_s_1_0.1
[7] instance found with a threshold of 0.2: Hemo_s_1_0.2
[10] instance found with a threshold of 0.3: Hemo_s_1_0.3
Occurrence of hemo patterns - continued
The occurrences of the best hemoglobin patterns
PDB Name      Occurrence                                 Species
pdb1bab.ent   B chain [1]; A, C chains [7]               human
pdb1dxu.ent   B, D chains [1]; A, C chains [7]           human
pdb1dxv.ent   B, D chains [1]; A, C chains [7]           human
pdb1dxt.ent   B, D chains [1]; A, C chains [7]           human
pdb1gbu.ent   B, D chains [10]                           human
pdb1gbv.ent   B, D chains [10]                           human
pdb1hdb.ent   NO                                         human
pdb1dsh.ent   NO                                         human
pdb2hhe.ent   N/A                                        human
pdb1gli.ent   B, D chains [1]; A, C chains [7]           human
pdb2hbe.ent   B chain [1]                                human
pdb1ibe.ent   NO                                         horse
pdb2mhb.ent   NO                                         horse
pdb2dhb.ent   B chain [4]; A chain [7]                   horse
pdb1hds.ent   N/A                                        deer
pdb1hda.ent   B, D chains [1]                            bovine
pdb2pgh.ent   B, D chains [1]                            pig
pdb1out.ent   NO                                         trout
pdb1ouu.ent   NO                                         trout
pdb1pbx.ent   B chain [1]                                antarctic fish
pdb1hbh.ent   NO                                         antarctic fish
pdb1ith.ent   NO                                         innkeeper worm
N/A: secondary structure information not available
[1] instance found with a threshold of 0.0 (default): Hemo_s_1_0.0
[4] instance found with a threshold of 0.1: Hemo_s_1_0.1
[7] instance found with a threshold of 0.2: Hemo_s_1_0.2
[10] instance found with a threshold of 0.3: Hemo_s_1_0.3
Discussion - Myoglobin
Myoglobin: one chain
One dominant pattern identified by SUBDUE
Patterns exist in most myoglobin proteins
No instances of the best myoglobin pattern found in other proteins in the global data set
Discussion - Hemoglobin and Myoglobin
Similar secondary structure patterns
Hemoglobin B, D chains (from N- to C-terminus)
h_1_14 -> h_1_15 -> h_1_6 -> h_1_6 -> h_1_19 -> h_1_8 -> h_1_18 -> h_1_20
Myoglobin chain (from N- to C-terminus)
h_1_15 -> h_1_15 -> h_1_6 -> h_1_6 -> h_1_19 -> h_1_9 -> h_1_18 -> h_1_25
Hemoglobin A, C chains (from N- to C-terminus)
h_1_15 -> h_1_15 -> h_1_6 -> h_1_1 -> h_1_19 -> h_1_8 -> h_1_18 -> h_1_20
Discussion - Hemoglobin and Myoglobin
Consistent with the genetic studies
Hemoglobin and myoglobin share one ancestral gene
Divergence occurred in the course of evolution. One copy of the gene for myoglobin, four copies for hemoglobin.
The last helix of hemoglobin is shorter; one of the helices in the hemoglobin A, C chains almost disappears: allows conformational change
Discussion - Ribonuclease A proteins
All patterns have three helices of the same size
Several strands appear twice, indicating participation in the formation of two sheets.
Ribonuclease S protein (S-protein fragment) also has the pattern.
Conclusion of the results
Secondary structure patterns discovered by SUBDUE are representative of each category
Secondary structure patterns discovered by SUBDUE are distinct for each category
SUBDUE has the ability to discover biologically interesting patterns from PDB and other similar molecular biology (MB) databases
Comparison with other related studies
Different graphic representation
Predefined patterns with exact or inexact graph match
Not applied systematically to PDB or other DB
SUBDUE would perform a similar task if the inexact graph match routine is incorporated
Conclusions of the study
Abstraction of 3D structure to its secondary structural elements is suitable for discovery
The secondary structure patterns SUBDUE discovered for each category can be used as a signature for its class
Inexact graph match is useful for finding similar patterns
SUBDUE is suitable for knowledge discovery in MB structural DB
Future Research
More consistent and detailed description of secondary structure
Add relative positions of the secondary structural elements to represent spatial relationships
Investigate alternative representations: more suitable 3D coordinate representation; weighting on different edges
Inexact graph match in predefined substructure
More collaboration with domain scientists