
North-Holland Microprocessing and Microprogramming 21 (1987) 507-514

A PARALLEL APPROACH TO RULE BASED SYSTEMS

Ir. Paul DELBAR

Laboratory for Automatic Control, State University of Ghent, Grotesteenweg Noord 2, B-9710 GENT - ZWIJNAARDE, BELGIUM

This paper discusses the performance of rule based systems using the OPS5 paradigm, and shows how it can be increased by algorithm modification. Also, a parallel implementation is discussed, and the expected benefits are presented.

1. INTRODUCTION

Recent developments in both hardware and software technology illustrate the fact that the attention of software developers is not only focused on application design, but increasingly on the design and selection of the appropriate tools. New processor architectures have been developed to better fit the requirements of high-level software development. As the relative cost of hardware further decreases, it becomes more economical to design a computer system on specifications defined by the applications that are to run on it.

Artificial intelligence has always set its own standards. Its need for symbolic processing, its awkward data structures and highly interactive user environment have resulted in the development of dedicated machines, such as LISP machines. However, as AI applications become better known and are more widely used, the need exists for efficient implementations on conventional machines. New system architectures, such as multiprocessor systems, are being explored for this purpose.

This paper discusses how OPS5, an AI tool for the development of rule based systems, can be implemented on a parallel architecture. The goal is to significantly increase the performance of applications written in OPS5. Optimization is performed both in the algorithm itself and in the implementation on a parallel machine, The Computing Tower.

2. OPS5

2.1 The OPS5 Language

OPS5 [1] is a language for writing production systems. The basic control component in production languages is the rule, consisting of an antecedent part (conditions to be fulfilled in order for the rule to be satisfied) and a consequent part (actions to be taken in case a rule is applicable). Rules are stored in production memory, and are activated by a forward inference on data elements residing in working memory. These elements may match a number of conditions, constituting the left hand side of a rule. A group of elements that cause a rule to be applicable (i.e. an ordered set of elements that each match one of the conditions and have consistent bindings within the rule's left hand side) are said to instantiate the rule, which then becomes ready to fire. The instantiation is stored in the conflict set, on which a resolution algorithm acts to select the rule that will be executed.

Firing a rule is done by performing the actions specified in the rule's right hand side. Actions include the deletion and creation of working memory elements as well as input and output actions. Also, calling user defined functions is possible. The global strategy OPS5 uses consists of three steps:

1. MATCH: use the Rete network to discover rule instantiations that can fire (i.e., delimit the conflict set)

2. RECOGNIZE: use a strategy to select one rule instantiation (i.e., conflict resolution)

3. ACT: fire the rule (i.e., perform the actions specified in the rule's right hand side) and proceed to step 1

Figure 1 shows a sample rule that could be used in an OPS5 application in order to establish links between items of a certain nature. The rule's left hand side contains four condition elements, that express a pattern the elements should match. There are three types of elements used in this rule: goal, item and link. Symbols preceded by a caret are attribute field identifiers, while symbols like <item1> stand for variables.


(p find-link
    (goal ^task link-items
          ^status active
          ^item-1 <item1>
          ^item-2 <item2>)
    (item ^id <item1> ^connection-type <c1>)
    (item ^id <item2> ^connection-type <c2>)
    (link ^status not-used
          ^id <lid>
          ^connection-type-1 <c1>
          ^connection-type-2 <c2>)
    -->
    (modify 2 ^link-id <lid>)
    (modify 3 ^link-id <lid>)
    (modify 4 ^item-1 <item1> ^item-2 <item2>)
    (remove 1)
    (call activatelink <lid>))

Figure 1 -- A Sample OPS5 Rule

OPS5 was originally implemented in LISP, and its syntax still shows traces of this. Besides providing the symbolic processing OPS5 needs, LISP also brought OPS5 users the interpreting environment, allowing for runtime addition and exclusion of rules and ease of debugging. Other implementations have followed, trading this advantage for that of speed (e.g. on VAX, the implementation is done in BLISS and assembler, while the rule right hand sides are compiled).

2.2 How It Works

Measurements on a number of existing OPS5 applications [2] show that the average OPS5 system comprises 909.83 rules with an average of 4.11 condition elements, each having 2.93 atomic tests. The average number of working memory elements is 295. A naive matching algorithm would take an immense number of evaluations. Performing this task every time a rule has fired would be a sheer waste of processor time. Therefore, a number of optimizing matching algorithms have been designed. The one used in OPS5 is the Rete algorithm [3], developed by Charles Forgy.
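To see why, consider the following minimal sketch of a naive matcher (our illustration, not code from the paper; the rule and condition interfaces are hypothetical). With R rules, C condition elements per rule and W working memory elements, every cycle inspects on the order of R * W^C candidate tuples; for the averages quoted above (R = 910, C = 4.11, W = 295) that is roughly 10^13 evaluations per cycle.

    from itertools import permutations

    def naive_match(rules, working_memory):
        # Re-derive the entire conflict set from scratch: for every rule,
        # try every ordered combination of working memory elements against
        # its condition elements, then check variable bindings.
        conflict_set = []
        for rule in rules:
            n = len(rule.conditions)
            for combo in permutations(working_memory, n):
                if (all(cond.matches(elem)
                        for cond, elem in zip(rule.conditions, combo))
                        and rule.bindings_consistent(combo)):
                    conflict_set.append((rule, combo))
        return conflict_set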

The Rete matcher works on the observation that working memory contents are temporally redundant, meaning that a firing rule only makes minor changes to working memory (on average 2.21 changes per firing [4]). Therefore, storing the results of the matching process provides an effective way to reduce the computational load. For this purpose, the Rete matcher converts the left hand sides of the rules in production memory into a network we will now discuss.

2.3 The Rete Network

The Rete matcher distinguishes a number of test types. First, there are intra-condition tests, which can be performed using only the working memory element being matched. An example is

(temperature ^name feed-temp-t322 ^value { < 200 })

Second, there are inter-condition tests, uniting multiple working memory elements using variable bindings. These bindings must first of all be consistent, and also, further testing may be required, as is the case in

(temperature <t>)
(alarm-temp-max <mt> { > <t> })
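Conceptually, both kinds of test compile down to simple predicates; here is a rough Python rendering (ours, with a hypothetical dictionary representation of working memory elements). The intra-condition test touches one element, while the inter-condition test relates two elements through the binding of <t>.

    def intra_test(wme):
        # (temperature ^name feed-temp-t322 ^value { < 200 })
        return (wme['class'] == 'temperature'
                and wme['name'] == 'feed-temp-t322'
                and wme['value'] < 200)

    def inter_test(temp, alarm):
        # (temperature <t>) (alarm-temp-max <mt> { > <t> }):
        # the alarm value <mt> must exceed the bound temperature <t>.
        return alarm['value'] > temp['value']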

Also, negated condition elements are allowed. A negated pattern is satisfied if no working memory elements can be found to match it. An example of using negated condition elements is

-(task ^status not-finished)

In the Rete network, t-const nodes perform intra-condition tests. Working memory elements are identified by a time tag value, indicating the order of their creation. According to the working memory element class, all t-const nodes that test elements of that class will be activated. If an element matches all conditions, its time tag is stored in an alpha memory node. The inter-condition tests are performed in two-input nodes, where memory nodes are combined to produce new lists of tuples, indicating working memory element combinations that consistently fulfill all previous conditions. Tuples like these are stored in beta memory nodes. There are two major types of two-input nodes: AND nodes for normal binding verification, and NOT nodes, used for negated condition elements.

As seen from figure 2, the first two-input node combines two alpha nodes, while all others combine an alpha node (denoted as right memory) and a beta node (left memory). At the output of each two-input node is a beta node, except for the last node in a rule: at its output is a special production node, that will insert the rule's id, together with the matching tuple and relevant bindings for rule variables, into the conflict set. The information passed in this data flow structure consists of tokens like

< +, (12 14 32) >
< -, (1 2 1) >

for memory addition or removal respectively.
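As an illustration of how such tokens drive a two-input node, here is a small Python sketch (our own, not from the paper; 'consistent' stands in for the binding test the network compiler would generate). An AND node joins each incoming token against its opposite memory and propagates the consistent combinations.

    class AndNode:
        def __init__(self, consistent, successor):
            self.left = []               # beta memory: tuples of time tags
            self.right = []              # alpha memory: single time tags
            self.consistent = consistent # binding test between tuple and tag
            self.successor = successor   # called as successor(sign, tuple)

        def activate_left(self, sign, tup):
            self._update(self.left, sign, tup)
            for tag in self.right:
                if self.consistent(tup, tag):
                    self.successor(sign, tup + (tag,))

        def activate_right(self, sign, tag):
            self._update(self.right, sign, tag)
            for tup in self.left:
                if self.consistent(tup, tag):
                    self.successor(sign, tup + (tag,))

        @staticmethod
        def _update(memory, sign, item):
            if sign == '+':
                memory.append(item)
            else:                        # '-' token: retract the stored item
                memory.remove(item)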

Frequently, similar condition elements will be found in multiple rules. The Rete network compiler optimizes t-const node activations by allowing this condition element to be shared. Also, combinations of condition elements, hence two-input and beta memory nodes, can be shared under certain circumstances. According to Gupta and Forgy [2], the average static sharing coefficients for nodes range from 1.13 (NOT nodes) to 6.76 (t-const nodes). The Rete algorithm thus provides a considerable optimization.


(p rule001
    (task ^status active ^job-id <jobid>)
    (job ^status active ^id <jobid> ^machine <mach>)
    (machine ^id <mach> ^status down)
    -->
    (modify 1 ^status down))

[Network diagram: t-const nodes test ^status active for task and job elements and ^status down for machine elements; the first two-input node checks consistent bindings for <jobid>, the second for <mach>, feeding the production node.]

Figure 2 -- A Sample Rete Network

This short introduction should suffice for the purpose of this paper. A more thorough discussion of the Rete network can be found in [3].

3. BASIC PERFORMANCE IMPROVEMENT

3.1. Support For Frequently Used Techniques

A primary source of optimization lies in the efficient implementation of often used features of OPS5. Of course, the solution provided must not violate the requirement of compatibility with the OPS5 syntax definition. We will present only one possible optimization here, but we will return to this subject after discussing the parallel implementation.

A possibility for performance improvement comes from the use of negated condition elements. Approximately one out of four condition elements is negated. This clearly shows that negated conditions are often used, and they are indeed very useful. Consider the following rule, meant to select the working memory element for which the value attribute is maximal:

(p find-maximal
    (number ^value <max>)
    -(number ^value { > <max> })
    -->
    (modify 1 ^maximal yes) )

Supposing that N elements of the class number exist, and since there are no intra-condition tests to reduce the contents of the alpha memories for both conditions, the amount of matching is N^2. It would be relatively easy to provide a number of functions such as minimum and maximum, average, ... for working memory element attribute values. The above rule would then look like

(p find-maximal
    (number ^value (max))
    -->
    (modify 1 ^maximal yes) )

and the related amount of matching would be reduced to N. Taking into account that, on the average, 25.8 percent of all two-input nodes are NOT nodes, and the average count of tokens in left/right memories is 13.79 (giving the value for N in the above calculation), and assuming that processing time for NOT and AND nodes is comparable, we obtain performance improvements up to 20 percent for two-input node matching, depending on the amount of negated condition elements effectively used in constructs such as the one used in the above rule.
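The rough arithmetic behind that bound can be written out as follows (our reading of the figures above, not a calculation taken from the paper): NOT nodes account for 25.8 percent of the two-input matching work, and replacing an N^2 scan by a single pass saves a fraction (1 - 1/N) of each such node's work.

    NOT_FRACTION = 0.258   # share of two-input nodes that are NOT nodes [2]
    N = 13.79              # average token count in a left/right memory [2]

    # Upper bound on the saving, if every NOT node used such a construct:
    saving = NOT_FRACTION * (1.0 - 1.0 / N)
    print(round(saving, 3))   # 0.239, i.e. 'up to 20 percent' in practice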

A number of similar optimizations could be performed, each with a comparable performance benefit. We will not discuss them in detail, but turn to optimizations due to parallel implementation of OPS5 algorithms.

4. A PARALLEL APPROACH

4.1. Introduction

Significant work has already been done [4,5] on the execution of data-directed algorithms on parallel machines. At the roots of this research lies the observation that, generally, data-directed applications can be modelled as data flow (or comparable) systems. Graphs such as the Rete model from figure 2 illustrate this. The approaches differ in the grain size of parallelism used. Very fine grain parallel machines, such as the NON-VON machine described in [6] or connectionist models, use a large number of very simple processing elements. Coarse grain parallel machines distribute the load among a more limited number of processors.

A general discussion of the issues influencing the selection of either one, such as load balancing, programmability, etc., falls outside the scope of this paper. We will concentrate on the mapping of an algorithm onto a parallel architecture, and try to evaluate the possible benefits.

4.2. Potential for Parallelisation in OPS5

It will be clear that the matching process shows a great deal of potential for parallel execution. As shown in [4], the recognize and act phases of the OPS5 cycle show little or negligible potential. We will therefore focus on the Rete matcher.

Basically, the idea behind parallel execution of the Rete algorithm is that all intra-condition tests for a certain class can be performed at the same time. All related alpha nodes can be accessed in parallel, since an alpha node receives data from only one t-const node. Two-input nodes can then receive new tokens, and propagate any new data items towards the production nodes. Once the conflict set is stable, the matching process ends.

Although it seems evident that a parallel implementation will draw benefits from this, in practice there are a number of issues that should be considered carefully. For example, care should be taken when dividing the processing load into separate tasks, each allotted to a (different) processor. The communication overhead may be drastic. The application will typically dictate the amount of data sharing, and it is up to the implementation designer to choose between shared memory access or high speed communication between processes.

In OPS5, this is certainly the case. Working memory elements have attribute values that are needed in multiple nodes of the Rete network. It may well be that new elements trigger beta node binding tests with relatively old elements, residing in memory pages already swapped out of physical address space. This may happen quite repetitively, causing immense overhead. It is therefore not a good approach to centralize working memory storage. One could consider passing working memory elements as messages, but data on the amount of matching done in OPS5 indicate this would require a vast amount of communication.

The Rete network also allows for pipelining. Indeed, two-input nodes generate beta node additions or removals that ripple through the rule's chain towards the production nodes. In sequential machines, this fact is hardly exploitable. In parallel machines, however, using pipelining is quite common and easy to implement.
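A toy Python rendering of that ripple (ours; the two stage functions are placeholders for node work, not anything from CHEOPS): each node runs as its own worker connected by queues, so successive tokens overlap in time exactly as in a hardware pipeline.

    import queue
    import threading

    def run_stage(work, inbox, outbox):
        # Consume tokens until the None sentinel, pass results downstream.
        for token in iter(inbox.get, None):
            outbox.put(work(token))
        outbox.put(None)   # propagate shutdown to the next stage

    q0, q1, q2 = queue.Queue(), queue.Queue(), queue.Queue()
    threading.Thread(target=run_stage, args=(lambda t: t + 1, q0, q1)).start()
    threading.Thread(target=run_stage, args=(lambda t: t * 2, q1, q2)).start()

    for token in (1, 2, 3):   # three tokens ripple through both stages
        q0.put(token)
    q0.put(None)
    print(list(iter(q2.get, None)))   # prints [4, 6, 8]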

4.3. Neural Net Models

The Rete network comprises a number of nodes, each with a defined function. The network may be very large (typically 7.985 nodes per rule, yielding some 7250 nodes per application). It would therefore be beneficial to implement OPS5 in such a way that this multitude of separate processes is supported.

Neural net models are a suitable paradigm for this. Research on neural functions has resulted in experimental computer systems using similar structures. Functional centers are connected through stimulation paths, constituting a structure that closely resembles that of the neural network of living creatures. Stimuli are passed from sensor nodes and processed along the way, reaching storage centers and triggering actions in, for example, muscles. The CHEOPS implementation discussed below uses such a model, implementing a fine grain process structure on a coarse grain multiprocessor system.


5. CHEOPS: A CONCURRENT OPS5 IMPLEMENTATION

5.1. The Structure of CHEOPS

CHEOPS (Concurrent Highly Efficient OPS) is an OPS5 implementation that attempts to exploit the parallel potential available in the language. This is done by the following design decisions:

- a CHEOPS machine is considered to have one or more processors with a similar structure, including sufficient RAM

- a copy of working memory is maintained in each processing node, in order to minimize communication while matching

- the CHEOPS network compiler assesses the potential for parallel and pipelined execution (for example, t-const nodes for a certain working memory element class are evenly distributed among all processors, as sketched below)

- the CHEOPS kernel consists of a set of template processes for each node function, and the code generated is only for testing, execution of right hand side actions and node communications configuration.

These are the basic components of the CHEOPS strategy. A CHEOPS implementation is currently being developed at the Automatic Control Laboratory, on a parallel machine called The Computing Tower. The implementation language is OCCAM, designed by INMOS Ltd. as a parallel programming language for its Transputer microprocessor series. As we will show, OCCAM provides the necessary programming paradigms for data-directed, network-structured applications.
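The distribution of t-const nodes mentioned above can be pictured with a short sketch (ours, in Python for brevity; CHEOPS itself generates OCCAM, and round-robin is only one possible placement). Dealing the nodes of one working memory element class out over the processors lets the intra-condition tests for a new element of that class run everywhere at once.

    def distribute(t_const_nodes, n_processors):
        # Round-robin placement: node i goes to processor i mod P, so the
        # per-class test load is spread evenly over the machine.
        placement = {p: [] for p in range(n_processors)}
        for i, node in enumerate(t_const_nodes):
            placement[i % n_processors].append(node)
        return placement

    print(distribute(['t1', 't2', 't3', 't4', 't5'], 2))
    # {0: ['t1', 't3', 't5'], 1: ['t2', 't4']}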

5.2. Using OCCAM to Model Node Functions

OCCAM [7] is a language for describing parallel processes. The basic means for inter-process communication is the OCCAM channel, along which data can be sent or received. As an example, consider the following OCCAM program, in which two parallel processes pass data:

PROC demoprocess ()
  CHAN channel :
  INT message :
  PAR
    channel ! message
    channel ? message

Channels can be software controlled (the Transputer's multitasking microcode also handles the semaphores necessary for communication) or mapped to one of the hardware links, operating at 20 Mbit per second. Processes can be developed and tested on a single processor, and then be configured to run on a multiprocessor system. An OCCAM process can therefore be seen as a neural activity center, passing information along channels and possibly maintaining a working state in local variables.

5.3. Implementation Layout

The features of OCCAM discussed above allow for a very simple mapping of OPS5 node functions onto template processes, parallel instantiations of which are created by a top level procedure. Nodes are connected through a communications network. The compiler provides the definitions for the necessary channels, and selects those to be mapped onto a hard link. All processes receive the needed set of channels as parameters. Code for the top level program, initialising all processes, is also written by the network compiler.

Some general services are also provided as standard processes. These are

- the OPS5 top level interpreter

- the tokenizer, which maintains the global symbol table for the application (symbolic atoms are converted into an integer value by hash coding, as sketched after this list)

- the working memory manager, copied to each node in the processor tree

- the tuple memory manager, also copied to each processor, providing memory for alpha and beta memory nodes

- the processor router, allowing for transparent hard/soft link placement and working memory copying to each processor
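The hash coding used by the tokenizer can be illustrated as follows (our sketch; any interning scheme that maps equal atoms to equal integers would do). Node tests then compare small integers instead of symbol strings.

    symbols = {}

    def intern(atom):
        # Return a stable integer id for a symbolic atom, assigning a
        # fresh one on first sight.
        return symbols.setdefault(atom, len(symbols))

    print(intern('feed-temp-t322'))   # 0
    print(intern('active'))           # 1
    print(intern('feed-temp-t322'))   # 0 again: equal atoms, equal ids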

5.4. Sample Node Processes

Let us look at a number of node processes. We will start with the t-const node. The network compiler will provide a test function for each of the t-const tests to be performed. These test processes accept a time tag as input, and return TRUE or FALSE. Figure 3 shows the resulting OCCAM code for the t-const process. This shows that the alpha nodes are initialized by the corresponding t-const node.

In a way similar to t-const nodes, two-input nodes receive their input from two channels, and start a beta node for tuple storage. An extra parameter tells the node how to behave: as an AND node, a NOT node or an ANY node (used for combination of condition elements that require no inter-condition testing).
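The NOT behaviour deserves a sketch of its own (ours, continuing the Python rendering used earlier; 'consistent' is again the generated binding test). A NOT node passes a left tuple on only while no right element joins with it, so it keeps a match counter per tuple and retracts the tuple when the count rises from zero.

    class NotNode:
        def __init__(self, consistent, successor):
            self.left = {}      # tuple -> number of matching right elements
            self.right = []
            self.consistent = consistent
            self.successor = successor

        def activate_left(self, sign, tup):
            if sign == '+':
                count = sum(1 for tag in self.right
                            if self.consistent(tup, tag))
                self.left[tup] = count
                if count == 0:               # nothing matches: negation holds
                    self.successor('+', tup)
            else:
                if self.left.pop(tup) == 0:  # it had been passed on: retract
                    self.successor('-', tup)

        def activate_right(self, sign, tag):
            if sign == '+':
                self.right.append(tag)
            else:
                self.right.remove(tag)
            for tup in self.left:
                if self.consistent(tup, tag):
                    self.left[tup] += 1 if sign == '+' else -1
                    if sign == '+' and self.left[tup] == 1:
                        self.successor('-', tup)   # negation just broken
                    elif sign == '-' and self.left[tup] == 0:
                        self.successor('+', tup)   # negation holds again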

Alpha and beta memory nodes have two tasks to perform: they must store tuples of time tags, and they must also be able to provide the following two-input node with the entire list of tuples. Let us look at the case where a new memory element is stored in the left memory of a two-input node.


PROC t.const.node ( CHAN OF token in, CHAN OF INT test,
                    CHAN OF BOOL result, CHAN OF token alpha )
  INT tag :
  BOOL res :
  PAR
    alpha.node ( alpha, two.in.node )
    WHILE TRUE
      CASE in ?
        (plus, tag)
          SEQ
            test ! tag
            result ? res
            IF
              (res)
                alpha ! (plus, tag)
              TRUE
                SKIP
        (minus, tag)
          alpha ! (minus, tag)

Figure 3 -- OCCAM Source for the t-const Process

In order to perform the tests, it must now find out which elements are located in its right memory, and check all consistent bindings. This resource sharing limits the amount of parallel processing possible. In order to allow for a limited sharing, the tuple memory node process acts as a local memory manager, either adding or removing elements from it, or allowing the two-input node to browse through the tuple list.

Data on the average population of alpha and beta nodes, and on the frequency of their visits by a two-input node, allow us to devise a memory management strategy at run time. Indeed, we can compute a limit above which relinquishing the entire tuple memory causes an addition/removal request to wait too long. Memory sizes above this limit will cause the tuple memory node to switch strategy, and relinquish only part of tuple memory for browsing, while the other part remains available for insertion and deletion. Hit rates and speedup factors are the subject of current simulation and statistical research.
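A crude model of that limit (our assumption about the underlying reasoning, not a formula from the paper): if a browse hands over the whole memory and visits it at some rate, an insertion arriving meanwhile waits for the remainder of the browse, so the switch point is roughly the size whose full browse time equals the tolerable wait.

    def switch_limit(browse_rate, max_wait):
        # Largest tuple memory (in tuples) whose complete browse still
        # finishes within the acceptable stall for an addition/removal.
        return int(browse_rate * max_wait)

    # e.g. 50000 tuples/s browse rate, 2 ms tolerable stall -> 100 tuples
    print(switch_limit(50_000, 0.002))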

5.5. Process Partitioning

The actual mapping of processes onto processors is an important issue. It is vital to performance that communication and memory accesses are balanced, so that they can effectively happen in parallel. The Transputer provides a benefit for communication, since all four hardware links can run independently of each other and of the processor. Also, a number of limitations are suggested by the implementation, such as locating tuple memory nodes on the same processor as the test nodes that fill them.

[Figure omitted in this transcript: a CHEOPS network for a simple rule, with t-const and two-input nodes (T1-T3), alpha/beta memory nodes and test nodes; the node roles are described in the text below.]

Figure 4 -- A CHEOPS Network

Consider the following example, in which the network communication is shown for a simple rule (figure 4). Squares (representing t-const or two-input nodes) only work with time tag values (or tuples). Vertical boxes (alpha or beta memory nodes) access the tuple memory manager, while horizontal boxes (test nodes for t-const or two-input nodes) access working memory elements in order to retrieve attribute values.

Let us look at the sequence of events when a new memory element is inserted into the network of figure 4, activating three conflict set additions. Analysis of the communication in a sequential implementation results in the time diagram of figure 5.a. A parallel scheme in which infinite processing power is available at any time, permitting all processes that are computable to run simultaneously, results in figure 5.b.

Analysis of other resources, such as tuple memory, given infinite processing speed, results in diagrams 5.c and 5.d, for the sequential and parallel implementations respectively. Notice that addition to any left or right memory causes the opposite memory node to be read.
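The gap between the two extremes can be reproduced with a toy computation (entirely our own numbers and labels, loosely echoing the node names of figure 4): sequential time is the sum of all node costs, while parallel time is the cost of the longest dependency chain.

    # Toy dependency graph: T1 feeds T2 and T3; B2 joins after T2; B3
    # joins after B2 and T3; P is the production node. Unit cost each.
    costs = {'T1': 1, 'T2': 1, 'T3': 1, 'B2': 1, 'B3': 1, 'P': 1}
    deps = {'T1': [], 'T2': ['T1'], 'T3': ['T1'],
            'B2': ['T2'], 'B3': ['B2', 'T3'], 'P': ['B3']}

    def finish(node, memo={}):
        # Earliest completion with unlimited processors: own cost after
        # the latest prerequisite finishes.
        if node not in memo:
            memo[node] = costs[node] + max((finish(d) for d in deps[node]),
                                           default=0)
        return memo[node]

    print(sum(costs.values()))              # sequential: 6 time units
    print(max(finish(n) for n in costs))    # parallel: 5 (T1-T2-B2-B3-P)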


[Figure omitted in this transcript: four time/resource diagrams over nodes T1, T2, T3, B2, B3 and P -- (a) sequential and (b) parallel communication timing; (c) sequential and (d) parallel tuple memory access.]

Figure 5 -- A Time and Resource Study on the Sample CHEOPS Network from Figure 4

These extreme cases delimit the performance improvement of realistic implementations. Based upon these considerations, a number of process placement strategies can be deduced from analysis of time and resource concurrency. Given a statistical model (using data from [2]), it becomes possible to estimate the performance benefits that result from this.


5.6. The Computing Tower

Researchers at the Automatic Control Laboratory have developed a parallel computer system using a number of Transputer based boards (TOWER stands for Transputer and Occam Workbench for Experimental Research). Although the topology of the system is easily modified, the network configuration chosen for the CHEOPS implementation is a modified hypercube or tower topology (see figure 6). Each of the nodes consists of a 10 MIPS, 32 bit Transputer running at 20 MHz, with 1 Mbyte of on-board dynamic RAM. Nodes may have an application specific interface, such as serial communication ports, a disk controller, a windowing graphics controller, etc.

The partitioned CHEOPS network is mapped onto this architecture. The root node (see figure 6) contains the top level interpreter and the tokenizer, while all other nodes have copies of working memory and act as a part of the network.

Figure 6 -- The Computing Tower Topology for CHEOPS


The number of processors has only a small effect on the overall performance boost, because of the granularity of the processes. The gradual pipelining of the processors in the tower topology permits an efficient balancing of message passing and local computation.

6. PERFORMANCE BENEFITS

According to Hillyer [5], researchers at CMU estimate the benefit from parallelization of OPS5 to be a factor of 5 to 10. The influence of high speed architectures is estimated to be a factor of 8 to 16. Current implementations on VAX run at a maximum of 20 rules per second. Given the relative processing advantage of the Transputer, and the benefit from parallel execution, we estimate the execution speed of the CHEOPS implementation at some 250 to 400 rules per second per Transputer. Experiments will have to confirm these computations. So far, results on small test cases indicate a speed improvement of 20 to 50 over a VAX implementation, with CHEOPS running on a single Transputer. We believe overhead costs (memory management, message routing, etc.) will cut this factor down to about 10 to 20.

REFERENCES

[1] Forgy, Charles L., OPS5 User's Manual, Technical Report CMU-CS-81-135 (Carnegie-Mellon University, 1981)

[2] Gupta, Anoop, and Forgy, Charles L., Measurements on Production Systems, Technical Report (Department of Computer Science, Carnegie-Mellon University, 1983)

[3] Forgy, Charles L., Rete: A Fast Algorithm for the Many Pattern/Many Object Pattern Match Problem, Artificial Intelligence, vol. 19, no. 1, September 1982, pp. 17-37

[4] Hillyer, Bruce K., and Shaw, David E., Execution of OPS5 Production Systems on a Massively Parallel Machine (Department of Computer Science, Columbia University, September 1984)

[5] Hillyer, Bruce K., On Applying Heterogeneous Parallelism to Elements of Knowledge-Based Data Management, PhD Thesis, Columbia University, 1986

[6] Shaw, David E., The NON-VON Supercomputer, Technical Report (Computer Science Department, Columbia University, 1982)

[7] May, David, OCCAM 2 Product Definition, INMOS Ltd., June 1986