GESTS Int’l Trans. Computer Science and Engr., Vol.18, No.1 111

GESTS-Oct.2005

Mining Structure Patterns with the Motif as a Specified Starting Point

Pai-Tsen Cheng1, Ming-Chuan Hung1, Don-Lin Yang1, and Jungpin Wu2

1Department of Information Engineering and Computer Science
2Department of Statistics

Feng Chia University, 407 Taichung, Taiwan {1mchong, 1dlyang, 2cwu}@fcu.edu.tw,

1 [email protected]

Abstract. With the recent completion of the Human Genome Project, biologists are confronted by countless experiments and analyses of huge datasets representing DNA, RNA, and proteins. Although biologists can use these data to predict the functions of DNA, RNA, and proteins, traditional experimental processes cannot keep pace with current research trends. Bioinformatics, which draws on statistics, information engineering, and data mining, has become an effective means for biologists to manage such large datasets.

In this paper, we design an effective method to mine widely known proteins in Proteomics and report experiments with useful results. We also support the outcomes with relevant references, showing that the patterns we discovered are functional: they allow the cell-adhesion motif Arg-Gly-Asp (RGD) to be exposed on the outside of the protein, so that the cell can bind with others, and they play significant roles in resisting cancer and other diseases.

1 Introduction

With the recent completion of the Human Genome Project, Bioinformatics [1] has combined the knowledge and techniques of biology, chemical engineering, statistics, and computer science to become a new field of study and research.

Since the inception of Bioinformatics, many technologies have been developed to accelerate the process of using data to understand biological processes, especially in the field of computer science. As computer hardware and software have improved, computers have become essential tools in research. We believe that combining the techniques of computer science with the domain knowledge of biology produces a powerful synergy, especially in data mining, a key area in the emerging field of Bioinformatics.

Although a vast amount of new information has been acquired in biology in recent years, much of the analysis needed for the next important discoveries has not yet been done, and much remains to be discovered. Now that the Human Genome Project has been completed, we consider Proteomics an important direction for research. To this end, we started this project with a clear goal: to define a technique for mining large datasets for patterns from a specified point. To show how it differs from other proposed methods, we first explain our motivation. Patterns in proteins with specific functions or structures are always useful, but they are usually not discussed starting from a specified point. In this paper, we try to discover more useful and detailed attributes that can be used in concrete applications. We present our data mining process through experimental examples using a useful motif, RGD.

A motif is a recurring pattern of protein folding, e.g., a homeobox or a zinc finger [2]. A higher order of protein structure is a module, which consists of several motifs and units of secondary structure. A motif can be used to predict functional or structural properties of a protein, or it can describe (non-trivial) features common to biologically (structurally or functionally) related proteins. The term motif can include sites (a small part of the structure having a specific functional or structural role, e.g., the active site in enzymes or metal binding sites), cores (an often bigger part of the structure in the interior of the protein, e.g., the hydrophobic core), secondary structures, and supersecondary motifs (composed of secondary structures).

RGD (Arg-Gly-Asp) is a useful motif that mediates binding with other proteins. The Arg-Gly-Asp sequence resides in the cell attachment region of fibronectin. Arg-Gly-Asp-containing peptides support fibroblast attachment, inhibit fibroblast adhesion to fibronectin, and inhibit fibronectin binding to thrombin-stimulated platelets. Interactions between integrins and extracellular proteins are often facilitated by the RGD sequence. These interactions are important for the adhesion of many types of cells. Integrin-mediated cell attachment influences and regulates the migration, growth, differentiation, and apoptosis of cells [3]. The regulation of extracellular matrix assembly [4], the inhibition of angiogenesis [5], and the inhibition of fertilization [6] by RGD-containing peptides suggest that integrins are also involved in these processes. The RGD motif occurs primarily in sequences of proteins of the extracellular matrix and blood. These proteins include fibronectin, vitronectin, osteopontin, collagens, thrombospondin, fibrinogen, and von Willebrand factor [7], as well as disintegrins (snake venom proteins), which are specific inhibitors of cell attachment. The RGD motif can also be found in the sequences of many other proteins. However, the mere presence of RGD in a protein sequence may not be sufficient for biological activity. The conformation or structural environment of an RGD site in a protein may render the sequence inactive. Therefore, biochemical studies, structural studies, or both are necessary to establish the activity of an RGD site. Structural studies of many of these proteins have been carried out [8]-[18], and the structures are freely available from the PDB. However, in the structures of proteins with established RGD activity, RGD does not show a distinct ‘active’ conformation. Although some qualitative judgments could be based on particular structures, there are large conformational differences between the biologically active RGD sequences.

Effective data mining techniques [1] are evolving into powerful tools for discovering meaningful data and knowledge. In this paper, we discuss advantages of data mining and examine some known issues. We present findings that demonstrate how our algorithm can mine structure patterns from a specified starting point in the examined sequences. There are many kinds of mining algorithms, such as those for association rules, time series, sequential patterns, and clustering. An appropriate algorithm can be chosen depending on the type of problem to be solved. These basic concepts will be discussed later in this paper.

Generally, data mining does not require that the mining process start from a particular position in the sequences. Because we want more precise results, we decide where the mining process starts. In this paper, our goal is to provide an efficient process that yields correct results and reduces unnecessary cost. As a result, traditional biological experiments can be reduced or replaced by examining the mined results.

We assume that we can obtain the same results as the natural process after confirming the specific properties. For example, the hydrophilicity of each amino acid beside the RGD is the basic condition we are concerned about. We can use our algorithm to discover rules with sensible patterns without using the hydrophilicity. In so doing, we can propose a process that helps biologists discover new tectonics more quickly and cheaply.

The rest of the paper is structured as follows. Section 2 briefly describes related work in data mining. Sections 3 and 4 present the proposed algorithm in detail and the experimental results, respectively. Finally, Section 5 presents conclusions and future work.

2 Related Work

In this section, we present two related programs that served as references for this project. We then describe our own process in detail.

First, GYM [19] is an algorithm used to detect known motifs in protein sequences. The algorithm is referred to as the “Pattern Dictionary” method. Fig. 1 shows the system process of GYM. The algorithm requires that an approximate length of the motif be known beforehand. In addition, a reasonably large number of motifs must be known, having been detected and verified by experiments in the standard way. The training set can be chosen from these known motifs. The algorithm consists mainly of two parts. The first part is a preprocessing step that needs to be performed only once. The second part is where the actual motif detection takes place.

The preprocessing step, called “Pattern Mining” (see Fig. 2), takes as input a master set of aligned motifs without spaces and generates a dictionary of frequent patterns for them. The dictionary is then applied in the motif detection process. A difficulty with this process is that each of the input motif sequences occurs at different locations in different proteins.

The motif detection algorithm takes as input a motif length m, the dictionary of significant patterns L output by the Pattern-Mining algorithm, an integer k representing the number of best matches required as output, and the given protein sequence P to be examined for the motif. A window of length m slides across the input sequence P. The subsequence of P that lies in the window is then matched with every significant pattern in L. This is performed in a subroutine called “Match”. Match returns a Match-Score that quantifies how well the window matched the patterns in the dictionary L. While it is convenient to think of Match-Score as a number, it is in reality a collection of measures that describe the quality of the match.
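The sliding-window detection described above can be sketched roughly as follows. This is only an illustration and not the GYM implementation: the pattern syntax (a string of length m in which `.` matches any residue), the function names, and the reduction of the Match-Score to a plain count of matching patterns are all assumptions made here.

```python
import heapq

def match_score(window, patterns):
    """Assumed score: the number of dictionary patterns that match the window.

    A pattern is taken to be a string as long as the window, where '.'
    matches any residue (a hypothetical syntax, not the one used by GYM).
    """
    return sum(
        all(p == "." or p == w for p, w in zip(pattern, window))
        for pattern in patterns
    )

def detect_motif(protein, m, patterns, k):
    """Slide a window of length m over the protein and return the k best
    (score, start, window) triples, best first."""
    scored = [
        (match_score(protein[i:i + m], patterns), i, protein[i:i + m])
        for i in range(len(protein) - m + 1)
    ]
    # Highest score first; ties are broken by the leftmost start position.
    return heapq.nsmallest(k, scored, key=lambda t: (-t[0], t[1]))

# Toy dictionary and sequence, invented for this example.
toy_patterns = ["RGD", "R.D", ".GD"]
best = detect_motif("ACRGDEKGD", 3, toy_patterns, 2)
```

A real Match-Score would carry several quality measures, as noted above; a single integer is used here only to keep the window loop visible.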

Algorithm: Pattern-Mining
Input: Motif length m, support threshold T, and list of aligned motifs.
Output: Dictionary L of frequent patterns.

Generate all frequent patterns of length 1 and insert into list L_1
for i := 2 to m
    for every pair of patterns p, q ∈ L_{i-1} such that …
    {
        if (support(p) > T) then
            Insert p into L_i
    }
    if (L_{i-1} …) then
        return L = ∪_i L_i

Fig. 1. Pattern Mining Process

Fig. 2. Pattern Mining Algorithm

Second, SPratt2 is an algorithm used in the automatic discovery of recurring patterns in protein structures [20]. The patterns consist of individual residues that have a defined order along the protein’s backbone, come close together in the structure, and have similar spatial conformations. The residues in a pattern may be separated in the protein’s sequence.

Given a set of N structures, packing patterns with occurrences in at least k of the structures are wanted, i.e., patterns with support at least k. Rather than devising a method for generating all possible packing patterns, the patterns will be generated as generalizations of neighbor strings from the structures. For example, the neighbor string ACEWGGTGEA can be generalized to a large number of (matching) packing patterns, like G, GG, WG, and CWGT. If packing patterns are allowed to have amino acid match sets, the amino acids in the neighbor string can be generalized to match sets. Without a specified starting point of the probe, there may exist ambiguously mined patterns.
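The generalization relation in the example above amounts to an ordered-subsequence test, which can be sketched as follows. This is a simplified illustration rather than SPratt2 itself: it ignores amino acid match sets and spatial constraints, and it assumes one neighbor string per structure when counting support. The function names are hypothetical.

```python
def is_generalization(pattern, neighbor_string):
    """True if the residues of pattern occur in neighbor_string in the same
    order, possibly with other residues in between."""
    remaining = iter(neighbor_string)
    # 'residue in remaining' consumes the iterator up to the first hit,
    # so later residues must be found strictly after earlier ones.
    return all(residue in remaining for residue in pattern)

def support(pattern, neighbor_strings):
    """Number of neighbor strings (assumed: one per structure) matched."""
    return sum(is_generalization(pattern, s) for s in neighbor_strings)

examples = ["G", "GG", "WG", "CWGT"]
all_match = all(is_generalization(p, "ACEWGGTGEA") for p in examples)
```

Under these assumptions, a pattern would be kept when `support(pattern, strings) >= k`.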

Following SPratt2, we adopt the idea of generating packing patterns; by setting a specified starting point as the probe, we can discuss the mined patterns more easily.

3 Proposed Method – PMSS

This section describes our proposed method, Pattern Mining with the Specified Starting point (PMSS), and the basic concepts of our research. Our algorithm builds on some concepts from the related work.

3.1 Overview

In this research, we are interested in looking for patterns that sit beside a specific functional motif. As many motifs are functional, they naturally need to exist with secondary structures, such as α-helices, β-sheets, or loops (turns). Based on this premise, we develop an algorithm for mining these special patterns.

The process has three steps. First, we select the motif to study; this is the data collection step. Second, to reduce the time cost of the next step, we apply data preprocessing to filter out items that occur fewer than the minimum number of times we require. Finally, the mining algorithm mines the meaningful patterns, which are provided to biologists. Section 4 shows experimental results obtained with these steps. Fig. 3 depicts the flow of our mining process.

[Fig. 3 flowchart: Data of PDB and references → Data Collection (data collection with specified properties) → Data Collected → Data Preprocessing (Pruning) by occurrence of each item: items that meet the min. frequency are retained, items that do not are removed → Data Pruned → Pattern Mining with PMSS (Pattern Mining Starting from a Specified point) → Patterns Mined]

Fig. 3. The flow of mining process

Fig. 4. Distance between each amino acid and ‘R’GD

3.2 Data Collection

In data mining and knowledge discovery, the mined results suffer if too little raw data is collected. In this section, we describe how we collect the data.

In the related work, GYM takes sequences from proteins and decides the location in each sequence through many experiments. Because this algorithm requires so many iterations to make these decisions, we believe it is meaningful and useful to establish a starting point as an unequivocal location. In addition, we refer to the Pattern-Mining algorithm of GYM in designing the following algorithm.

For RGD, we first select proteins whose cell adhesion function is based on RGD. During data collection, we also find some entries that are peptides but not natural proteins. These entries are not selected for analysis because we want to find the hidden characteristics of natural proteins. Once the collected data have been confirmed, the subsequent steps are stable and meaningful.

In the data collection, we also pay attention to data within different ranges of Ångströms. For example, we compute the Euclidean distance between each amino acid and Arg (measured between Cα atoms). We observe that a functional RGD is always separated from the other amino acids by some distance. Fig. 4 shows that the distance of the amino acids from ‘R’GD is over 15 Å, so that the motif RGD is not disturbed by other amino acids.

After the data collection process, we assume that the raw data can be input into the mining process. For better results, we can perform the data preprocessing step.
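The distance check described above can be sketched as follows. The coordinates, residue labels, and function names are hypothetical and chosen only for illustration; real coordinates would come from the Cα atom records of a PDB entry, and the 15 Å default simply mirrors the value mentioned for Fig. 4.

```python
import math

def ca_distance(a, b):
    """Euclidean distance between two (x, y, z) coordinates, in Å."""
    return math.dist(a, b)

def residues_beyond(arg_ca, other_cas, cutoff=15.0):
    """Labels of residues whose Cα atom lies farther than cutoff Å from
    the Cα atom of the Arg in RGD."""
    return [
        label
        for label, xyz in other_cas.items()
        if ca_distance(arg_ca, xyz) > cutoff
    ]

# Invented coordinates, not taken from any PDB entry.
arg = (0.0, 0.0, 0.0)
others = {"Cys4": (3.0, 4.0, 0.0), "Pro9": (0.0, 0.0, 16.0)}
far_residues = residues_beyond(arg, others)
```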

3.3 Data Preprocessing

In this step, we aim to reduce the processing time of the mining task. We preprocess the data to remove items whose count at a single position never reaches the minimum frequency. The algorithm and an example are shown in Fig. 5 and Table 1 below.

In Table 1(a), we collect the data that satisfy the conditions we set, such as proteins having the function of cell adhesion. We select the amino acids sitting beside the RGD, on both the left and right sides, and align them with the RGD. To avoid spending unnecessary time on aligning items that are never used, we filter out the items that appear fewer than the number of times we set, as in Table 1(b).

Preprocess: Data Pruning
Input: Data, D, collected beside a specified point with a specified length, M.
       Minimum frequency, F, of amino acid in each position occurred.
Output: Pruned list, L, with satisfying minimum frequency.

for i := 1 to M
    for j := 1 to the number of D {
        if (D[i]_j occurs < F) then
            D[i]_j := null
    }
    L[i] := D[i]

return L

Fig. 5. The data preprocessing: Data Pruning

Table 1(a). Data before pruning

Seq.1   A   R   C   D   R   G   D   E   A   G   D
Seq.2   F   R   C   A   R   G   D   E   P   R   V
Seq.3   E   A   C   D   R   G   D   R   P   G   C
Seq.4   A   R   D   D   R   G   D   R   A   G   E
Seq.5   C   A   A   R   R   G   D   C   V   A   D

Table 1(b). Data filtered/sorted (minimum frequency = 2)

Seq.1   A   R   C   D   R   G   D   E   A   G   D
Seq.2   A   R   --  D   R   G   D   R   A   G   --
Seq.3   --  A   C   D   R   G   D   R   P   G   --
Seq.4   --  R   C   --  R   G   D   E   P   --  --
Seq.5   --  A   --  --  R   G   D   --  --  --  D

We not only prune the infrequent items but also sort the list, starting from the leftmost position relative to the starting point, i.e., RGD. After preprocessing, the raw data are pruned, which reduces the processing time of the subsequent mining step.
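The pruning step can be sketched as follows, using the sequences of Table 1(a) as input. This is a minimal illustration under stated assumptions: the function name and the use of `--` as the null marker are choices made here, and the sorting step is not reproduced because its exact ordering rule is not spelled out in the text.

```python
from collections import Counter

GAP = "--"  # marker for a pruned (null) item, as printed in Table 1(b)

def prune(sequences, min_freq):
    """Replace every residue whose count at its own position is below
    min_freq with GAP. All sequences must be aligned and of equal length."""
    # One Counter per aligned position (column).
    counts = [Counter(column) for column in zip(*sequences)]
    return [
        [res if counts[pos][res] >= min_freq else GAP
         for pos, res in enumerate(seq)]
        for seq in sequences
    ]

table_1a = [
    "ARCDRGDEAGD",
    "FRCARGDEPRV",
    "EACDRGDRPGC",
    "ARDDRGDRAGE",
    "CAARRGDCVAD",
]
pruned = prune(table_1a, min_freq=2)
```

With a minimum frequency of 2, the pruned rows correspond to the rows of Table 1(b) before sorting; for example, the second input sequence becomes `-- R C -- R G D E P -- --`.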

3.4 Mining Algorithm

After data collection and pruning, we assume that the data can be mined to discover hidden knowledge. Although we do not use measures such as the confidence of the Apriori algorithm, we consider counting frequent items sufficient for discovering patterns.

Fig. 6 shows our proposed PMSS algorithm, which takes the pruned data as input. We set a basic item set for generating candidate lists, and every element of it is appended to the end of each candidate. For example, if ‘A’ is in the 1-item set, the candidate list is {A_, AA, AC, …, AY}, and each candidate is aligned with the pruned list to generate 2-item sets.

The algorithm is described below and illustrated with Table 1(b).
1. Get the pruned list L satisfying the minimum frequency; the length of L[i] is M.
2. Set the threshold, T, the minimum number of occurrences of the patterns.
3. Scan L for 1-item sets from the candidate list C to see whether the occurrences of each 1-item in L are greater than or equal to T.
4. Append C at the end of the 1-item candidates.
5. Scan L for 2-item sets with the candidates generated from the 1-item sets.
6. Repeat Step 2 to Step 5 until M is reached, and return the final patterns.

PMSS Algorithm
Input: Pruned list, L, with satisfying minimum frequency; the length of L[i] is M.
       Threshold, T, the number of occurrences the patterns should satisfy.
       Basic item for candidate list, C, of single amino acid and a gap: {“ ”, “A”, “C”, “D”, …, “Y”}, length = 21.
Output: Patterns, P, detected.

produce initial candidates that are 1-item sets
while (M > 0) do {
    for i := 1 to 21 {
        for j := 1 to the number of L {
            if (L[j] = C[i]) then
                count(C[i])++
            if (count(C[i]) ≥ T) then
                push into Queue(Q)
        }
    }
    M--
    C := pop(Q) + C    // generating larger itemsets
}
P := C
return P

Fig. 6. The proposed PMSS mining algorithm
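One possible reading of the level-wise growth in Fig. 6 is sketched below. It is an interpretation, not the authors' Java implementation, and it rests on several assumptions: rows are position-aligned strings of one-letter residues with `_` for a pruned item; a gap in a pattern matches any row; support is the number of rows that agree with the pattern at every non-gap position; and candidate symbols are taken from the supporting rows instead of iterating over all 21 basic items, which gives the same result. The sketch also returns every frequent pattern, not only the longest or most specific ones.

```python
GAP = "_"  # gap symbol, as in the candidate list {A_, AA, AC, ...}

def pmss(rows, threshold):
    """Grow patterns one aligned position at a time, keeping an extension
    only if at least `threshold` rows still support it."""
    # Each entry: (pattern so far, indices of the rows supporting it).
    frontier = [("", list(range(len(rows))))]
    for pos in range(len(rows[0])):
        grown = []
        for prefix, supporters in frontier:
            # Appending a gap never loses support.
            grown.append((prefix + GAP, supporters))
            by_symbol = {}
            for idx in supporters:
                symbol = rows[idx][pos]
                if symbol != GAP:
                    by_symbol.setdefault(symbol, []).append(idx)
            for symbol, idxs in sorted(by_symbol.items()):
                if len(idxs) >= threshold:
                    grown.append((prefix + symbol, idxs))
        frontier = grown
    # Drop the trivial all-gap pattern.
    return {p: len(s) for p, s in frontier if set(p) != {GAP}}

toy_patterns = pmss(["AC", "AC", "AD"], threshold=2)
```

Because every prefix can also be extended by a gap, the number of patterns can grow quickly on longer rows; a practical version would likely report only patterns that are not subsumed by a more specific one.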

In the example of Fig. 7, we obtain four patterns after executing the PMSS algorithm. From the computer science point of view, they are simply the resulting patterns. To interpret what they mean, we discuss them with biologists. More experimental results are explained in the next section, and we believe they show that our algorithm works.

Fig. 7. The example of mining process and outcomes

4 Experimental Results and Discussion

We performed our experiments on a personal computer with an Intel Pentium 4 processor running at 2.4 GHz and 512 MB of DDR 266 MHz memory. The PMSS algorithm was implemented in Java, and the development tool was Borland JBuilder 9.0.

We used data downloaded from the PDB [21] and related references, selected according to three conditions (natural protein, cell adhesion, and the functional motif RGD), to mine the patterns. We also set ranges of Ångströms to classify the raw data and assist in analyzing the mined patterns, because in different ranges there should be different amino acid attributes supporting the solid structure of RGD.

For example, Fig. 8 and Fig. 9 show the solution structure of Rhodostomin (PDB ID: 1JYP). These figures indicate that the motif RGD works because ‘R’GD is far away from the other amino acids. Fig. 10 and Fig. 11 show the same for 2MFN, but the distances between ‘R’GD and the other amino acids differ: in 1JYP the distance is larger than 10 Å, whereas in 2MFN (Fig. 11) it is smaller than 10 Å. We are interested in this discrepancy and discuss it below.

Section 2 introduced the related work, GYM and SPratt2, each of which has advantages and disadvantages. From GYM we take the detection approach of counting pattern frequencies, but we apply it to a known class of proteins. The mining process of SPratt2, on the other hand, suggests that we can examine the amino acids beside a specified starting point. With the specified starting point, we can analyze the mined patterns much more distinctly.

Fig. 8. The backbone of 1JYP, PDB

Fig. 9. The distance between ‘R’GD and amino acid (1JYP)

Fig. 10. The backbone of 2MFN, PDB

Fig. 11. The distance between ‘R’GD and amino acid (2MFN)

Our proposed PMSS takes raw data representing the primary structure of proteins and mines useful patterns from them. As shown in Table 2, we collected data on disintegrins, whose function relates to platelet aggregation. We discuss the mined patterns as follows.

Table 2. The collected data

Function               Protein       PDB ID
Platelet aggregation   Salmosin      1L3X
                       Trimestatin   1J2L
                       Decorsin      1DEC
                       Kistrin       1N4Y
                       Flavoridin    1FVL
                       Echistatin    1RO3
                       Saxatilin     none

4.1 Disintegrin

Disintegrins are a family of naturally occurring proteins derived from viper venom that contain 49 to 84 amino acids, including an RGD or KGD sequence, and 8 to 14 cysteines linked by S–S bonds [22]-[28]. Disintegrins bind with high affinity to several integrins, including αIIbβ3, αvβ3 and α5β1, and inhibit their function. They show significant variations in their activity and selectivity. NMR-derived structures of disintegrins such as echistatin, kistrin and flavoridin reveal that they possess an RGD sequence at the apex of a mobile loop between two strands of the protein, protruding 14–17 Å from the protein core. We computed this distance for the data we used, and it matches the range of 14–17 Å, so we believe the approach is feasible.

Table 3 shows that both patterns we mined include a disulphide bond that could act as a brace supporting the motif, RGD. There are still some amino acids beside RGD that we cannot explain exactly now, besides Pro and Gly, two residues that could disrupt the stable secondary structures, α-helix and β-sheet. In Pattern 1, the critical aspartate is immediately followed by a residue with a hydrophobic side chain, and a proline residue is present on the right side of the RGD loop sequence. The motif RGD should exist in a loop so that it can interact with other receptors. Because of the unexplained residues beside RGD, more work remains to be done in the future.

Table 3. Mined patterns with threshold = 3

Pattern 1 : _ _ I C R I RGD _ P D D R C T

Pattern 2 : G T I C _ _ A RGD D _ D D Y C N

5 Conclusion

We designed the PMSS algorithm to mine the supporting structure of RGD by starting from a specified point. The experimental results indicate that the algorithm works, and it can be strengthened in future studies.

Following the trend of the scientific world, we develop data mining and knowledge discovery technologies for use in Bioinformatics. This research requires help from biologists; we therefore consulted biologists in microbiology to steer our research in a proper direction and to examine the resulting findings.

The PMSS can be used as a tool for biologists to mine special patterns starting from a specified point. Admittedly, the amount of data used in our experiments is small by the standards of regular data mining. However, we believe the approach is correct: the experimental results show that we find patterns that have been published in well-known journals. Our results are based on a prototype; further research should extend it to a full-scale trial.

In this paper, we use data taken from the technical literature and the PDB, for which verification is easier. A larger scale of input data is definitely recommended in future work. The functional difference among different angles of atoms in RGD is also an interesting problem that is worthy of further investigation. The PMSS algorithm takes only the primary structure of the protein into the mining process, and we think it should be improved so that it can mine the secondary structure of meaningful patterns beside the specified starting point.

Acknowledgements

The authors thank Prof. David C. Chen for his valuable suggestions and technical support, and Prof. Inge Jonassen, Department of Informatics, University of Bergen, Norway, for providing us with the SPratt2 program. This research was partially supported by the National Science Council, Taiwan, under grant number NSC92-2213-E-035-039.

References

1. N. C. Jones and P. A. Pevzner, An Introduction to Bioinformatics Algorithms, The MIT Press, 2004

2. M. J. Sutcliffe, M. Jaseja, and E. I. Hyde, “Three-Dimensional Structure of the RGD-containing Neurotoxin Homologue Dendroaspin,” Nat Struc Biol, Vol. 1, No. 11, pp. 802-807, 1994.

3. E. Ruoslahti, “RGD and Other Recognition Sequences for Integrins,” Annu Rev Cell Dev Biol, Vol. 12, pp. 697-715, 1996.

4. C. Wu, A. E. Chung, and J. A. McDonald, “A Novel Role for Alpha 3 Beta 1 Integrins in Extracellular Matrix Assembly,” J Cell Sci, Vol.108, pp. 2511-2523, 1995.

5. R. F. Nicosia and E. Bonanno, “Inhibition of Angiogenesis in Vitro by Arg-Gly-Asp-containing Synthetic Peptide,” Am J Pathol, Vol. 138, No. 4, pp. 829-833, 1991.

6. R. A. Bronson and F. Fusi, “Evidence that an Arg-Gly-Asp Adhesion Sequence Plays a Role in Mammalian Fertilization,” Biol Reprod, Vol. 43, No. 6, pp. 1019-1025, 1990.

7. E. Ruoslahti and M. D. Pierschbacher, “New Perspectives in Cell Adhesion: RGD and Integrins,” Science, Vol. 238, No. 4826, pp. 491-497, 1987.

8. W. Bode, I. Mayr, and U. Baumann, “The Refined 1.9 Å Crystal Structure of Human Alpha-Thrombin: Interaction with D-Phe-Pro-Arg Chloromethylketone and Significance of the Tyr-Pro-Pro-Trp Segment,” EMBO J, Vol. 8, No. 11, pp. 3467-3475, 1989.

9. P. D. Martin, M. G. Malkowski, and O. J. DiMai, “Bovine Thrombin Complexed with an Uncleavable Analog of Residues 7-19 of Fibrinogen A Alpha: Geometry of the Catalytic Triad and Interaction with P2’ and P3’ Substrate Residues,” Biochemistry, Vol. 35, No. 40, pp. 13030-13039, 1996.

10. D. J. Leahy, W. A. Hendrickson, I. Aukhil, and H. P. Erickson, “Structure of a Fibronectin Type III Domain from Tenascin Phased by MAD Analysis of the Selenomethionyl Protein,” Science, Vol. 258, No. 5084, pp. 987-991, 1992.

11. T. J. Rydel, M. Yin, and K. P. Padmanabhan, “Crystallographic Structure of Human Gamma-Thrombin,” J Biol Chem, Vol. 269, No. 35, pp. 22000-22006, 1994.

12. C. D. Dickinson, B. Veerapandian, and X. P. Dai, “Crystal Structure of the Tenth Type III Cell Adhesion Module of Human Fibronectin,” J Mol Biol, Vol. 266, No. 4, pp. 1079-1092, 1994.

13. A. L. Main, T. S. Harvey, and M. Baron, “The Three-Dimensional Structure of the Tenth Type III Module of Fibronectin: An Insight into RGD-Mediated Interactions,” Cell, Vol. 71, No. 4, pp. 671-678, 1992.

14. M. J. Sutcliffe, M. Jaseja, and E. I. Hyde, “Three-Dimensional Structure of the RGD-Containing Neurotoxin Homologue Dendroaspin,” Nat Struc Biol, Vol. 1, No. 11, pp. 802-807, 1994.

15. H. Senn and W. Klaus, “The Nuclear Magnetic Resonance Solution Structure of Flavoridin, an Antagonist of the Platelet GP IIb-IIIa Receptor,” J Mol Biol, Vol. 232, No. 3, pp. 907-925, 1993.

16. V. Saudek, R. A. Atkinson, P. Lepage, and J. T. Pelton, “The Secondary Structure of Echistatin from 1H-NMR, Circular-Dichroism and Raman Spectroscopy,” Eur J Biochem, Vol. 202, No. 2, pp. 329-338, 1991.

17. A. M. Krezel, G. Wagner, J. Seymour-Ulmer, and R. A. Lazarus, “Structure of the RGD Protein Decorsin: Conserved Motif and Distinct Function in Leech Proteins that Affect Blood Clotting,” Science, Vol. 264, No. 5167, pp. 1944-1947, 1994.

18. J. Shapiro and D. Brutlag, “FoldMiner: Structural Motif Discovery Using an Improved Superposition Algorithm,” Protein Science, Vol. 13, pp. 278-294, 2004

19. Y. Gao, G. Narasimhan, K. Mathee, and X. Wang, “Motif Detection in Protein Sequences,” String Processing and Information Retrieval Symposium, 1999 and International Workshop on Groupware, pp. 63-72, 1999.

20. I. Jonassen, I. Eidhammer, D. Conklin, and W.R. Taylor, “Structure Motif Discovery and Mining the PDB,” Bioinformatics, Vol. 18, pp. 362-367, 2002.

21. H. M. Berman, J. Westbrook, and Z. Feng, “The Protein Data Bank,” Nucleic Acids Res, Vol. 28, No.1, pp. 235-242, 2000.

22. M. A. McLane, V. K. Senadhi, C. Marcinkiewicz, J. J. Calvete, and S. Niewiarowski, “Importance of the Structure of the RGD-Containing Loop in the Disintegrins Echistatin and Eristostatin for Recognition of IIb 3 and v 3Integrins,” FEBS Letters, Vol. 391, pp. 139-143, 1996.

23. Y. Fujii, D. Okuda, Z. Fujimoto, K. Horii, T. Morita, and H. Mizuno, “Crystal Structure of Trimestatin, a Disintegrin Containing a Cell Adhesion Recognition Motif RGD,” J. Mol. Biol., Vol. 332, pp. 1115-1122, 2003.

24. R. J. Gould, M. A. Polokoff, P. A. Friedman, T. F. Huang, H. C. Holt, J. J. Cook, and S. Niewiarowski, “Disintegrins: A Family of Integrin Inhibitory Proteins from Viper Venoms,” Proc. Soc. Exptl Biol. Med., Vol. 195, pp. 168-171, 1990.

25. M. A. McLane, C. Marcinkiewicz, S. Vijay-Kumar, I. Wierzbicka-Patynowski, and S. Niewiarowski, “Viper Venom Disintegrins and Related Molecules,” Proc. Soc. Exptl Biol. Med., Vol. 219, pp. 109-119, 1998.

26. H. Senn and W. Klaus, “The Nuclear Magnetic Resonance Solution Structure of Flavoridin, an Antagonist of the Platelet GP IIb-IIIa Receptor,” J. Mol. Biol., Vol. 232, pp. 907-925, 1993.

27. M. Adler, R. A. Lazarus, M. S. Dennis, and G. Wagner, “Solution Structure of Kistrin, a Potent Platelet Aggregation Inhibitor and GP IIb-IIIa Antagonist,” Science, Vol. 253, pp. 445-448, 1991.

28. S. Y. Hong, Y. S. Koh, K. H. Chung, and D. S. Kim, “Snake Venom Disintegrin, Saxatilin, Inhibits Platelet Aggregation, Human Umbilical Vein Endothelial Cell Proliferation, and Smooth Muscle Cell Migration,” Thrombosis Research, Vol. 105, pp. 79-86, 2002.

Biography

Name: Pai-Tsen Cheng

Address: Department of Information Engineering and Computer Science, Feng Chia University, 407 Taichung, Taiwan

Education & Work experience: He received the B.E. degree in Computer Science from Feng Chia University in 2002 and the M.S. degree in Computer Science from Feng Chia University, Taichung, Taiwan, in 2004. His research interest is data mining.

Tel: +886-4-24517250 Ext.3623

E-mail: mhamlet@soft.iecs.fcu.edu.tw

Name: Ming-Chuan Hung

Address: Department of Information Engineering and Computer Science, Feng Chia University, 407 Taichung, Taiwan

Education & Work experience: He received the B.E. degree in Industrial Engineering and the M.S. degree in Automatic Control Engineering from Feng Chia University, Taichung, Taiwan, in 1979 and 1985, respectively. He is now a Ph.D. candidate in Information Engineering at Feng Chia University. From 1985 to 1987, he was an instructor in the Mechanics Engineering Department at National Chin-Yi Institute of Technology, Taichung, Taiwan. Since 1987, he has been an instructor in the Department of Industrial Engineering and Systems Management at Feng Chia University, where he also served as a secretary in the College of Engineering from 1991 to 1996. His research interests include data mining, CIM, and e-commerce applications. He is a member of the CIIE.

Tel: +886-4-24517250 Ext.3623

E-mail: [email protected]

Name: Don-Lin Yang

Address: Department of Information Engineering and Computer Science, Feng Chia University, 407 Taichung, Taiwan

Education & Work experience: He received the B.E. degree in Computer Science from Feng Chia University in 1973, the M.S. degree in Applied Science from the College of William and Mary in 1979, and the Ph.D. degree in Computer Science from the University of Virginia in 1985. He was a staff programmer at IBM Santa Teresa Laboratory from 1985 to 1987 and a member of the technical staff at AT&T Bell Laboratories from 1987 to 1991. He then joined the faculty of Feng Chia University, where he was in charge of the University Computer Center from 1993 to 1997 and served as the Chairperson of the Department of Information Engineering and Computer Science from 2001 to 2003. Dr. Yang is currently a professor and the head of the Information Technology Office at Feng Chia University. His research interests include distributed and parallel computing, image processing, and data mining. He is a member of the IEEE Computer Society and the ACM.

Tel: +886-4-24517250 Ext.3743

E-mail: [email protected]

Name: Jungpin Wu

Address: Department of Statistics, Feng Chia University, 407 Taichung, Taiwan

Education & Work experience: He received the B.S. degree in Applied Mathematics from Tatung University, Taipei, Taiwan, in 1988, the M.S. degree in Statistics from the Graduate Institute of Statistics of National Central University, Taoyuan, Taiwan, in 1993, and the Ph.D. degree in Statistics from North Carolina State University in 1998. He was a postdoctoral staff member at Academia Sinica from 1998 to 1999. He then joined the faculty of Feng Chia University, where he was an Assistant Professor in the Department of Statistics from 1999 to 2004. Dr. Wu is currently an Associate Professor. His research interests include spatial statistics, generalized estimating equations, the empirical process approach, and data mining.

Tel: +886-4-24517250 Ext.4422

E-mail: [email protected]