[IEEE 2010 2nd IEEE International Conference on Network Infrastructure and Digital Content (IC-NIDC...

6
Proceedings of IC-NIDC2010 THE IMPROVEMENT OF THE RS3 SYSTEM FOR DRUG SUBSTRUCTURE SEARCHING Hwankoo Lee, Hyunsang Kang, Yoonsoo Lee, Byungchol Chang, Jaehyuk Cha Department of Electronic Computer Engineering Hanyang University, Korea {prodog, coreafirst, eclipse, bcchang, chajh}@hanyang.ac.kr Abstract For the development of a new drug, medicinal chemists perform the structure-based drug design, i.e., search the existing drugs with the chemical structure similar to the chemical structure of the target drug. In brief, researcher and developer want to know drugs to have a specific chemical structure. Therefore, from large drug databases, it is necessary to give the information of selected drugs fast, which should have the members of a set of chemical moieties matching a user-defined query moiety. Substructure searching is the process of identifying the members of a set of chemical moieties that match a specific query moiety. Testing for substructure searching was developed in the late 1950s. In graph theoretical terms, this problem corresponds to determining which graphs in a set are subgraph isomorphic to a specified query graph. Testing for subgraph isomorphism has been proved, in the general case, to be an NP- complete problem. For the purpose of overcoming this difficulty, there were computational approaches. On the 1990s, a US patent has been granted on an atom-centered indexing scheme, used by the RS3 system; this has the virtue that the indexes generated can be searched by direct text comparison. This system is commercially used(http://www.acelrys.com/rs3 ). We define the RS3 system's drawback and present a new indexing scheme. This system treats substructure searching with substring matching by means of expressing chemical structure as predefined strings. However, it has insufficient 'recall' and 'precision' because of a fatal shortage of the indexing scheme. Expressing 2D chemical structure into 1D a string has limit. Therefore, we break 2D chemical structure into 1D structure fragments. We present in this paper a new index technique to improve recall and precision surprisingly. Keywords: Substructure Searching, Subgraph Isomorphism, RS3 system 1 Introduction Among the methods of drug design, medicinal chemists research the relationship between drug's chemical structure and its function. In other words, structure-based drug design is tried. At the base of these tries, medicinal chemists search the existing drugs of similar chemical structure to target drug for the development of a new drug. In brief, researcher and developer want to search drugs to have a specific chemical structure fast. Therefore, it is such necessity that an automatic system selects drug files fast to have the members of a set of chemical moieties that match a user-defined query moiety in very many drug files. Substructure searching is the process of identifying the members of a set of chemical moieties that match a specific query moiety. Testing for substructure searching was developed in the late 1950s. In graph theoretical terms, this problem corresponds to determining which graphs in a set are subgraph isomorphic to a specified query graph. Testing for subgraph isomorphism has been proved, in the general case, to be an NP-complete problem. For the purpose of overcoming this difficulty, there were computational approaches. For instance, there are strategies that use faster computers or take advantage of parallelism in the hardware. Also, they use algorithms and heuristics to reduce and order the number of possible mappings of nodes and edges for each query/target comparison. [8] Because these atom-to-atom searching is still a very time-consuming process, the screening system which preprocesses time-consuming operations that are query-independent was developed apart from the improvements to those algorithms. These systems stored the keys that index the entries in a chemical database based on structural features present. Because the structural features present in the query are identified at search time, these systems improved performance. The preprocessing and massive atom-centered indexing concepts have been a more recent development, catalyzed by the availability of low- cost hard-disk storage and CD-ROMs in the late 1980s. On the 1990s, a US patent has been 342 ___________________________________ 978-1-4244-6853-9/10/$26.00 ©2010 IEEE

Transcript of [IEEE 2010 2nd IEEE International Conference on Network Infrastructure and Digital Content (IC-NIDC...

Page 1: [IEEE 2010 2nd IEEE International Conference on Network Infrastructure and Digital Content (IC-NIDC 2010) - Beijing, China (2010.09.24-2010.09.26)] 2010 2nd IEEE InternationalConference

Proceedings of IC-NIDC2010

THE IMPROVEMENT OF THE RS3 SYSTEM FOR DRUG SUBSTRUCTURE SEARCHING

Hwankoo Lee, Hyunsang Kang, Yoonsoo Lee, Byungchol Chang, Jaehyuk Cha

Department of Electronic Computer Engineering Hanyang University, Korea {prodog, coreafirst, eclipse, bcchang, chajh}@hanyang.ac.kr

AbstractFor the development of a new drug, medicinal chemists perform the structure-based drug design, i.e., search the existing drugs with the chemical structure similar to the chemical structure of the target drug. In brief, researcher and developer want to know drugs to have a specific chemical structure. Therefore, from large drug databases, it is necessary to give the information of selected drugs fast, which should have the members of a set of chemical moieties matching a user-defined query moiety. Substructure searching is the process of identifying the members of a set of chemical moieties that match a specific query moiety. Testing for substructure searching was developed in the late 1950s. In graph theoretical terms, this problem corresponds to determining which graphs in a set are subgraph isomorphic to a specified query graph. Testing for subgraph isomorphism has been proved, in the general case, to be an NP-complete problem. For the purpose of overcoming this difficulty, there were computational approaches. On the 1990s, a US patent has been granted on an atom-centered indexing scheme, used by the RS3 system; this has the virtue that the indexes generated can be searched by direct text comparison. This system is commercially used(http://www.acelrys.com/rs3).

We define the RS3 system's drawback and present a new indexing scheme. This system treats substructure searching with substring matching by means of expressing chemical structure as predefined strings. However, it has insufficient 'recall' and 'precision' because of a fatal shortage of the indexing scheme. Expressing 2D chemical structure into 1D a string has limit. Therefore, we break 2D chemical structure into 1D structure fragments. We present in this paper a new index technique to improve recall and precision surprisingly.

Keywords: Substructure Searching, Subgraph Isomorphism, RS3 system

1 Introduction Among the methods of drug design, medicinal chemists research the relationship between drug's chemical structure and its function. In other words, structure-based drug design is tried. At the base of these tries, medicinal chemists search the existing drugs of similar chemical structure to target drug for the development of a new drug. In brief, researcher and developer want to search drugs to have a specific chemical structure fast. Therefore, it is such necessity that an automatic system selects drug files fast to have the members of a set of chemical moieties that match a user-defined query moiety in very many drug files.

Substructure searching is the process of identifying the members of a set of chemical moieties that match a specific query moiety. Testing for substructure searching was developed in the late 1950s. In graph theoretical terms, this problem corresponds to determining which graphs in a set are subgraph isomorphic to a specified query graph. Testing for subgraph isomorphism has been proved, in the general case, to be an NP-complete problem. For the purpose of overcoming this difficulty, there were computational approaches. For instance, there are strategies that use faster computers or take advantage of parallelism in the hardware. Also, they use algorithms and heuristics to reduce and order the number of possible mappings of nodes and edges for each query/target comparison. [8]

Because these atom-to-atom searching is still a very time-consuming process, the screening system which preprocesses time-consuming operations that are query-independent was developed apart from the improvements to those algorithms. These systems stored the keys that index the entries in a chemical database based on structural features present. Because the structural features present in the query are identified at search time, these systems improved performance. The preprocessing and massive atom-centered indexing concepts have been a more recent development, catalyzed by the availability of low-cost hard-disk storage and CD-ROMs in the late 1980s. On the 1990s, a US patent has been

342

___________________________________ 978-1-4244-6853-9/10/$26.00 ©2010 IEEE

Page 2: [IEEE 2010 2nd IEEE International Conference on Network Infrastructure and Digital Content (IC-NIDC 2010) - Beijing, China (2010.09.24-2010.09.26)] 2010 2nd IEEE InternationalConference

granted on an atom-centered indexing scheme, used by the RS3 system(http://www.acelrys.com/rs3); this has the virtue that the indexes generated can be searched by direct text comparison. This system is commercially used. But the RS3 system has insufficient 'recall' and 'precision' because of a fatal shortage of the indexing scheme.

The strength of the RS3 system is that a NP time complexity of this problem is reduced to O(n) by transforming substructure searching into substring searching. For the purpose of that, the RS3 system previously makes the atom-centered index which transforms structural relations of the other atoms into string and uses these at search time. To explain in detail, the strings of representing structural relationship with atoms are previously stored in RDBMS. At search time the strings which resulted from a similar analysis, but utilizing wild cards, match the stored strings in RDBMS by direct text comparison. We promote the new indexing scheme that promotes the RS3 system's virtue and overcomes that's drawback.

2 Related work

2.1 The ullmann algorithm for subgraph Isomorphism Problem

In graph theoretical terms, this problem corresponds to determining which graphs in a set are subgraph isomorphic to a specified query graph. A subgraph isomorphism exists if all the nodes (atoms) of the query graph (Gq) can be mapped to a subset of the nodes of the target graph (Gt), such that all the edges (bonds) of Gq map to a subset of the edges in Gt. Testing for subgraph isomorphism has been proved, in the general case, to be an NP-complete problem[2].

For this NP-complete problem, there are many computational approaches. They are illustrated below. 1) use faster computers or take advantage of parallelism in the hardware; 2) produce efficient implementations of

algorithms in terms of space utilization, disk access, and execution time;

3) use algorithms and heuristics to remove candidates from the set of target structures to match;

4) use algorithms and heuristics to reduce and order the number of possible mappings of nodes and edges for each query/target comparison;

5) preprocess time-consuming operations that are query-independent and store answers or partial answers to common classes of queries that can then be used at search time.

The step 1 is implemented in CAS(Chemical Abstracts Service)[3], Daylight Chemical Information System[4] and Synopsys[5]. The backtracking algorithms correspond to the step 3, 4 of the section 2.2. If at any point no node in Q can map to a node in T, then the algorithm 'backtracks', to the last mapped node in Q and selects an alternative node in T to map to it. The use of a backtracking algorithm to map chemical structures was first reported in 1957 by Ray and Kirsch[6].

A number of refinement algorithms and heuristics have been developed with the goal of reducing the number of backtrack steps required on average for each query-target comparison. They can be grouped as follows: 1) reduce the number of candidates atoms in T

that can potentially map on atoms in Q (either before or during the backtracking phase);

2) prioritize the order in which atoms in Q and T are mapped;

3) identify end-conditions that allow the backtrack algorithm to terminate early with either success or failure.

The typical solution of subgraph isomorphism is the Ullmann algorithm[7]. The Ullmann heuristic can be described as follows: given a query node Qi that has a neighbor Qj, and a target atom Ti(Tj) that can be mapped to Qj. This heuristic is applied iteratively to each unmapped query atom, removing candidate atoms in T until no further refinement can be made.

2.2 The RS3 system

Because these atom-to-atom searching is still a very time-consuming process, the screening system which has the strategy of restricting the number of candidate structures that need to be mapped was implemented apart from the improvements to those algorithms. These systems index the entries in a chemical database based on structural features (keys) present. These keys are stored in the database. At search time, the structural features present in the query are identified. If a key is present in the query but is not present in a target, then this entry can be eliminated from further consideration. So we represent the RS3 system that is the root of our research. The RS3 system's strength is what transforms substructure searching into substring searching. For the purpose of doing that, this system represents chemical structures as strings when the atom-centered indexes are generated. At search time a similar analysis is performed on the query, but utilizing wild cards. As a result, the time complexity of substructure searching which is a NP-complete problem is reduced to O(n).

First of all common atom-bond pairs are assigned an alphanumeric value (Table 2 ) for transforming

343

Page 3: [IEEE 2010 2nd IEEE International Conference on Network Infrastructure and Digital Content (IC-NIDC 2010) - Beijing, China (2010.09.24-2010.09.26)] 2010 2nd IEEE InternationalConference

this problem to substring searching. To give an example single to bromine is 'b', double bond to carbon is 'd'.

Table 1. An example RS3 bonded atom codes

- Br b - C c = C d = O e wildcard %

At each level, the codes are separated from the next level by a period after that (Figure 1). On this occasion the codes are stored in alphabetical order. In figure 1 the structural relationship between bromine and others is represented by string as follows:

1) We generate the string (key), "Br c", because there is a single bond to carbon on bromine and then add an end-of-atom marker(a period) for separating from the next level. Consequently the key became "Br c . ".

2) We extend the key to "Br c . c . " because this carbon has a single bond to carbon with the exception of the described bromine.

3) There are two unused neighbors on the last described atom: a double bond to oxygen denoted by 'e', and a single bond to carbon denoted by 'c'. These bonds are ordered, with 'c' taking precedence over 'e'. These codes are then added to the key in order. Next, an end-of-atom marker is added to the key. The key now reads "Br c . c . c e . ".

4) The current key is "Br c . c . c e . ". The carbon is selected to examine next because of an alphabetical order, between the carbon and the oxygen corresponding to 'c' and 'e' each. This carbon has an unused single bond to carbon; therefore, the key now reads "Br c . c . c e . c . ".

5) The next undescribed atom (oxygen) has no unused neighbor atom; therefore, an end-of-atom marker is simply added to the key. The key now reads "Br c . c . c e . c . . ".

6) The last examined atom (carbon) has no unused neighbor atom; therefore, an end-of-atom marker is appended to the key. The key now reads "Br c . c . c e . c . . . ".

7) Now all atoms are described. The final search key would be: "Br c . c . c e . c . . . ".

Figure 1. DB-stored structure and keys in the RS3 system

Querying whether the substructure is present is illustrated in Figure 2. A '*' is marked on carbon where other atoms can be bonded. The sites marked with a '*' are the positions of searching substructure, and these are the targets of searching substrings from a text comparison. Therefore, wild cards(%) are appropriately inserted in a query string. The analysis of query structure is performed with a similar manner. The string generation of the bromine in Figure 2 is illustrated as follow:

1) First of all the key is assembled into "Br c" because the bromine only has a single bond to carbon. Then an end-of-atom marker is added. The key now reads "Br c . ".

2) We extend the key to "Br c . c . " because this carbon has a single bond to carbon with the exception of the described bromine.

3) The last described atom(carbon) has a double bond to oxygen and has been marked with a '*'. It means that other atoms besides oxygen may be bonded here. Therefore, a wild card(%) is inserted in front and rear of code 'e' that assigns a double bond to oxygen and the key can be used for substring searching. The key now reads "Br c . c . % e % . ".

4) If the end of the key is "% . ", then only a wild card is added to the key and the key generation is terminated. The last key reads "Br c . c . % e % . %".

Figure 2. Query substructure and keys in the RS3 system

The execution of text comparison between query key and DB-stored key results in the candidates of the target chemical structure files. Then an ABAM(Atom-by-Atom Matching) is performed for retrieving exact answers. The RS3 system is commercially used in the Accelrys Inc[1].

O

C*Br

C

1. Br c . c . % e % . % 2. C b c . . % e % . % 3. C % c % e % . % 4. O d . % c % . %

O

CBr

C C

C

1. Br c . c . c e . c . . . 2. C b c . . c e . c . . . 3. C c c e . b . c . . . . 4. O d . c c . b . c . . . 5. C c c . . c e . b . . . 6. C c . c . c e . b . . .

344

Page 4: [IEEE 2010 2nd IEEE International Conference on Network Infrastructure and Digital Content (IC-NIDC 2010) - Beijing, China (2010.09.24-2010.09.26)] 2010 2nd IEEE InternationalConference

2.3 Weakness of the RS3 System

As the number of atoms marked by '*' increases, the precision of the RS3 system largely decrease. This is the most weakness of the RS3 system. If '*' is once encountered, the description of the chemical structure is anymore impossible because of addition of only one wild card followed by a termination of generation of the key. The query structure is specified by the relationship between bromine and carbon but this query string accommodates any relationship. Therefore, the precision is dropped.

A second problem is the no rule of assembling sequence among same bonds to same atom. The main idea of the RS3 system is that this system describes a chemical structure according to alphabetical order. For example, the second key of Figure 3, "C b c . . c e . c . . . ", demonstrates that carbon has a single bond to bromine and a single bond to carbon, and then the key is assembled into "b c" according to alphabetical order. In other words, this structure has corresponded to the only this string. However, the third key, "C c c e . b . c . . . . ", is replaced by "C c c e . c . b . . . . " because of same alphabetical order of 'c' and 'c'. The no elaborateness of description sequence is the reason for low recall of the RS3 system.

3 Improvements of the RS3 systemIn this chapter, we represent an improved scheme of the RS3 system.

3.1 Solutions to problems of the RS3 system's weakness

An indexing scheme of this system needs to be modified for the raise of RS3 system's recall and precision. In other words, a key is generated according to a different rule. The rule is that the shortest paths between a centered atom and other atoms are encoded into strings similar to the RS3 system, and separated by its length, and stored in alphabetical order. The result to which the improved scheme transforms a structure in Figure 3 is illustrated in Table 3. In a similar manner, the query structure in Figure 4 is described according to the improved scheme in Table 4.

Figure 3. DB-stored structure

Table 3. DB-stored keys according to an improved scheme

atom level 1 level 2 level 3 level 4

Br , c , c . c , c . c . c , c . c . e

, c . c. c . c

C , b , c , c . c , c . e , c . c . c

C , c , c , e

, c . b , c . c

O , d , d . c , d . c

, d . c . b , d . c . c

C , c , c , c . c , c . e , c . c . b

C , c , c . c , c . c . c , c . c . e

, c . c . c . b

Figure 4. Query structure

Table 4. Query keys according to an improved scheme

Atom level 1 level 2 level 3 level 4

Br % , c % % , c . c % % , c . c . e %

C % , b % , c % % , c . e %

C % , c % , e % % , c . b %

O % , d % % , d . c % % , d . c . b %

The level 3 of the record relative to the bromine in Table 5 results from appending a wildcard(%) to front and rear of the path "c . c . e". We do substring search for this string and can select the level 3 of the record relative to the bromine in Table 4. Provided this improved scheme is used, '*' of the RS3 system needn't be designated. Therefore, the precision is increased because all of the structure can be described. Besides, the rule of sequence among same bonds to same atom is clear because all of the path between the centered atom and other atom are described and arranged in alphabetical order. Therefore, the recall is increased.

3.2 Flowchart of the improved scheme

The flowchart of the improved scheme is illustrated in Figure 5.

O

C* Br

C

O

CBr

C C

C

345

Page 5: [IEEE 2010 2nd IEEE International Conference on Network Infrastructure and Digital Content (IC-NIDC 2010) - Beijing, China (2010.09.24-2010.09.26)] 2010 2nd IEEE InternationalConference

Figure 5. Flowchart of the improved scheme

4 Performance evaluation In this chapter, we represent comparison of performance among the Ullmann algorithm, the RS3 system, and the improved scheme.

4.1 Experiment

We search 672 drug chemical structure files for the 15 distinct substructures (moieties) in Figure 4.1. About the RS3 system's experiment, all atoms is designated by '*'.

Figure 6. Query Substructure (Moiety)

4.2 Experiment result

About 672 drug chemical structures, the Ullmann algorithm, the RS3 system, and the improved scheme are experimented in execution time, precision, and recall. As for the experiment of the RS3 system and the improved scheme, DBMS's search results which results from the main engine is filtered by the comparison of the atom numbers per its symbol between query structure and those

results. From a different standpoint, it was performed in a polynomial time complexity. Screening systems which are the RS3 system and the improved scheme output exact results by performing an ABAM(Atom-by-Atom Matching) on the intermediate results. ABAM is performed by the Ullmann algorithm. Performance of execution time is illustrated in Table 5. The columns including "+ Ullmann" show the results of performing an ABAM at last. Performance of precision and recall in the RS3 system and the improved scheme is illustrated in Table 6. For your information, precision and recall in the Ullmann algorithm are naturally all 100%.

Table 5. Execution Time [sec]

Table 6. Precision and Recall

Precision Recall moiety RS3

system Improved RS3system Improved

benzene 72.5 98.5 100 100 carbazole 0.3 100 100 100

furan 1.4 46.7 100 100 imidazole 21.8 93.2 63.4 100

indole 2.5 84.6 100 100 naphthalene 0.6 25.0 100 100 piperazine 10.6 64.3 100 100 piperidine 11.9 48.8 100 100

purine 1.4 100 100 100 pyrazole 0.8 80.0 25.0 100 pyridine 8.2 82.2 100 100

pyrimidine 11.4 100 100 100 pyrrole 2.9 27.1 100 100

quinoline 0.2 100 100 100 toluene 61.2 98.9 100 100 Average 13.8 76.6 92.6 100

RS3 system Improvedscheme

moiety Ullmann

RS3

RS3+

Ullmann

Improved

Improved +

Ullmann

benzene 10 1 10 1 7 carbazole 11 1 12 3 3

furan 2 1 2 1 1 imidazole 1 1 1 1 1

indole 20 1 20 2 2 naphthalene 15 1 15 2 2 piperazine 11 1 12 1 11 piperidine 13 1 14 1 13

purine 6 2 7 1 1 pyrazole 1 1 1 1 1 pyridine 13 2 15 1 1

pyrimidine 6 2 7 1 1 pyrrole 7 1 8 1 1

quinoline 10 1 11 3 3 toluene 5 1 6 2 5 Average 8.7 1.2 9.4 1.5 3.5

346

Page 6: [IEEE 2010 2nd IEEE International Conference on Network Infrastructure and Digital Content (IC-NIDC 2010) - Beijing, China (2010.09.24-2010.09.26)] 2010 2nd IEEE InternationalConference

4.3 Performance evaluation

The execution time when the improved scheme is followed by the Ullmann algorithm is 2.5 times less than the execution time of doing only the Ullmann algorithm. According to expectation, the precision in the RS3 system is very low. It is raised by the improved scheme. The recall in the RS3 system is 63.4% at imidazole and 25.0% at pyrazole, but that in the improved scheme is always 100%.

4.4 Examination of the Recall in the improved scheme

From the viewpoint of graph theory, the improved scheme doesn't accomplish the 100% of recall when we search a part of a cycle. As for chemical structure, however, a cycle has meaning only in itself. That is to say, the try of searching a part of a cycle is meaningless, because chemical characteristic of a cycle is very different from its part. Therefore, the improved scheme accomplishes the 100% of recall on chemical substructure searching.

4.5 Comparison of storage size

We mark the number of atoms as 'n' and compare the improved scheme with the RS3 system for storage size. The RS3 system requires 2n2 storage size, because all atom codes and corresponding '.'s would be written and result in n*(n+n).

The improved scheme requires a maximum storage size when structure has a chain shape, because the long path would be described. Per the centered atom, codes of 1+2+3+...+n numbers and corresponding '.'s are required. Therefore, the total storage size is n3+n2 because of n*{(1+2+3+...+n)*2}. These are illustrated in Table 7.

According to experiments with 672 drug structure files, the improved scheme's storage size is 2.7 times as large as the RS3 system. The more precise description requires the more storage size.

Table 7 Comparison of Storage sizes (n: total atom number)

RS3 The improved scheme 2n2 max n3+n2

5 Results We define the weakness of the RS3 system and present the improved scheme.

The RS3 system treats substructure searching with substring matching by means of expressing chemical structure as predefined strings.

However, the RS3 system has insufficient 'recall' and 'precision' because of a fatal shortage of the indexing scheme. The improved scheme has a weakness of the increase of storage size, but raises recall and precision sharply.

References [1] M. F. Lynch, in 'Chemical Information

Systems', eds. J. E. Ash and E. Hyde, Ellis Horwood, Chichester, 1985, pp. 160-167

[2] R. C. Read and D. G. Corneil, J. Graph Theory, 1, 339-363, 1977

[3] N. Farmer, J. Amoss, W. Farel, J. Fehribach, and C. R. Zeidner, in 'Chemical Structures: The International Language of Chemistry', ed. W. A. Warr, Springer-Verlag, Heidelberg, pp. 283-295, 1988

[4] Daylight Chemical Information Systems, Inc., 27401 Los Altos, Suite 370, Mission Viejo, CA 92691, USA

[5] G. A. Hopkinson, J. Chem. Inf. Comput. Sci., 37, 143-145, 1997

[6] L. C. Ray and R. A. Kirsch, Science, 1957, 126, 814-819

[7] J. R. Ullmann, J. Assoc. Comput. Mach., 1976, 23, 31-42

[8] RS3, http://www.acelrys.com/rs3

347