Protein Sequencing Experiment Planning Using Analogy

Brian Kettler and Lindley Darden
Dept. of Computer Science and Dept. of Philosophy

University of Maryland
College Park, Maryland 20742

Email: kettler@cs.umd.edu and darden@umiacs.umd.edu

Abstract

Experiment design and execution is a central activity in the natural sciences. The SeqER system provides a general architecture for the integration of automated planning techniques with a variety of domain knowledge in order to plan scientific experiments. These planning techniques include rule-based methods and, especially, the use of derivational analogy. Derivational analogy allows planning experience, captured as cases, to be reused. Analogy also allows the system to function in the absence of strong domain knowledge. Cases are efficiently and flexibly retrieved from a large casebase using massively parallel methods. The SeqER system is initially configured to plan protein sequencing experiments. Planning is interleaved with experiment execution, simulated using the SequenceIt program. SeqER interacts with a human user who analyzes the data generated by laboratory procedures and supplies SeqER with new hypotheses. SeqER is a vehicle in which to test theories about how scientists reason about experimental design.

1. Introduction

Experiments serve a variety of important purposes in science, including empirical discovery, theory evaluation, and anomaly resolution. In many fields scientists spend the bulk of their time planning and executing experiments and analyzing the data produced. Planning techniques from Artificial Intelligence can be used to provide automation, ranging from partial automation - advising or tutoring a scientist - to full automation of the task (for more routine experiments). The abilities of computers to systematically handle large amounts of information with high speed and accurate recall can complement those of human scientists.

SeqER[1] is a prototype computer program that plans

[1] "SeqER", pronounced as "seeker", is used in this paper to refer both to the SeqER system as configured for protein sequencing and to SeqER as an architecture which could be configured for various domains.

scientific experiments. SeqER provides a general architecture for the integration of techniques from A.I. planning - including reasoning using rules and reasoning by derivational and transformational analogy - with a variety of kinds of conceptual and episodic domain knowledge.

The use of derivational analogy allows SeqER to reuse previous planning experiences from similar situations and to function in the absence of strong domain knowledge. As it solves new problems, SeqER accumulates additional planning experience which it encodes as new cases. These cases can be efficiently retrieved from the casebase using the massively parallel mechanisms of the CAPER case-based planning system (Kettler et al. 93).

The SeqER system is composed of a Planner component and a knowledge base (KB) component. The Planner component consists of general-purpose mechanisms for creating experiment plans, which are sequences of laboratory and data manipulation/analysis procedures. The KB component consists of knowledge about concepts, laboratory methodology and procedures, previous experiments performed, etc. This knowledge is encoded in a variety of forms including frames, rules, and cases. Much of this knowledge is specific to a particular experiment planning domain. The modular nature of SeqER's architecture allows different domain-dependent KBs to be used with the domain-independent Planner component, so that SeqER could be used to plan experiments in other domains.

The initial domain SeqER is being tested on is protein sequencing experiment planning, where the objective is to determine the amino acid sequence (and other properties) of a protein. SeqER produces a plan to achieve this objective using a KB created from an analysis of the domain. While protein sequencing can be done using sequencing machines[2], we have chosen the domain of protein sequencing using manual methods as the initial testbed for SeqER because it is well understood and a simulator for the domain is available.

[2] The accuracy of these machines varies from 90-95 percent, depending on make, model, etc.

216 ISMB-93

From: ISMB-93 Proceedings. Copyright © 1993, AAAI (www.aaai.org). All rights reserved.


The task of manually sequencing a protein is a complex one amenable to automated planning.

In addition to automation, the SeqER architecture provides a computational framework for the empirical testing of different reasoning strategies for experiment design suggested by historical, philosophical, and psychological studies of scientists. This work supplements work on theory formation and revision (Darden 91). Rule-based and case-based reasoning have been observed in think-aloud protocols from a human expert planning protein sequencing experiments.

The following section describes the task of experiment planning. Section 2 describes the SeqER architecture and the KB component being used for protein sequencing experiment planning. Section 3 describes the analogical reasoning mechanisms of the Planner component. Section 4 presents an example of SeqER's operation for a protein sequencing experiment. Section 5 discusses related work, and Section 6 outlines some directions for future research.

1.1 Experiment Planning

Experiment planning is a complex task. The experiment should efficiently and inexpensively achieve its goals (e.g., test a hypothesis, determine a property, etc.). The use of resources (e.g., money, time, people, equipment, supplies) should be minimized. Planning may be complicated due to the unavailability of strong domain knowledge, established methodologies, or experienced personnel. In addition, uncertainty about experimental conditions, instrument readings, and the objects being manipulated must be dealt with. The results of executing a laboratory procedure may not be completely predictable in cases where domain knowledge is lacking or uncertainty exists. Thus, planning may need to be interleaved with execution such that the results of previous procedures help to determine the next procedure to execute.

This interleaving is necessary in experiments to sequence an unknown protein (polypeptide). The sequence of a protein (often hundreds of amino acids long), along with its shape and other properties, determines the structure (and hence the function) of the protein. There are 20^100 possible sequences for a polypeptide of average length. SeqER plans an experiment to determine a polypeptide's amino acid sequence, N-terminal and C-terminal end groups (i.e., Formyl, Acetyl, Pyroglutamyl or Amide blocked, or unblocked), circularity, and disulfide bond locations (if any).

Various laboratory procedures can be performed to achieve these goals, including acid and base hydrolysis, chemical and enzymatic cleavage, oxidation, Edman degradation, end group analysis, exopeptidase degradation, etc. The execution of many of these procedures is realistically simulated by the SequenceIt program (Place and Schmidt 92)[3], a tutorial program for

[3] This program was developed as part of the BioQUEST biology software project.

[Figure 1: The SeqER Architecture: Main Modules and Process Flow]

protein sequencing techniques which we are using to test SeqER's plans in lieu of an actual laboratory. SequenceIt provides 35 laboratory procedures to choose from, each of which has several parameters that can be varied.

2. An Overview of SeqER

SeqER’s top-level processes and modules are shown inFigure 1. The domain-independent Planner compo-nent consists of the Experiment Controller, Rule Ap-plier, Case Applier, and Decision Episode Storer mod-ules (shown as dark-outlined boxes). These last twomodules, in conjunction with modules of the CAPERcase-based planner, contain the mechanisms for rea-soning by analogy (described in Section 3). The RuleApplier module’matches and fires rules. The Experi-ment Controller module runs the top-level process (de-scribed in Section 2.2). (Solid arrows indicate flow control. Dashed arrows indicate flow of information).The domain-specific KB component is described in thefollowing section.


Kettler 217


2.1 SeqER’s Knowledge Base

SeqER uses a variety of kinds of knowledge to determine the subgoals of the experiment, the tasks that achieve them, and the task to be executed next. This knowledge is contained in the KB component, which contains both working knowledge[4] (knowledge about the problem at hand) and persistent knowledge (which persists between problems). (The latter is indicated in Figure 1 by circles with underlined labels; the former by circles with plain labels.) Most of the KB is stored in a single large semantic network, implemented using the PARKA massively parallel, frame-based knowledge representation system (Spector et al. 90; Evett et al. 93). The following paragraphs describe the modules ("knowledge sources") within the KB component and give examples from the KB used for protein sequencing.

The Concepts Set consists of those frames in the KB representing domain concepts, including domain objects (e.g., "Polypeptide", "Amino Acid", "Disulfide Bond"), properties (e.g., "Amino Acid Composition", "Sequence Length"), and procedures (e.g., "Acid Hydrolysis", "Clostripain Cleavage"). Procedures are considered (atomic) actions from SeqER's viewpoint. Actions can have roles (e.g., "Agent", "Object"), preconditions, filter conditions, parameters, and effects. A plan is a sequence of primitive actions, those that can be executed directly. Concepts are organized in an is-a (a-kind-of) abstraction hierarchy that supports inheritance. There are currently 450 concept frames in the protein sequencing KB.
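The is-a hierarchy with inheritance can be sketched minimally as follows. The Frame class and all frame and slot names below are illustrative inventions, not SeqER's actual PARKA encoding:

```python
# Minimal sketch of an is-a hierarchy with slot inheritance, in the spirit of
# the frame organization described above. Frame and slot names are invented.

class Frame:
    def __init__(self, name, isa=None, **slots):
        self.name = name
        self.isa = isa          # parent frame in the is-a hierarchy
        self.slots = dict(slots)

    def get(self, slot):
        """Look up a slot value, inheriting along the is-a chain."""
        frame = self
        while frame is not None:
            if slot in frame.slots:
                return frame.slots[slot]
            frame = frame.isa
        return None

procedure = Frame("Procedure", atomic=True)
cleavage = Frame("Cleavage Procedure", isa=procedure, effect="splits sample")
clostripain = Frame("Clostripain Cleavage", isa=cleavage, agent="Clostripain")

print(clostripain.get("agent"))    # own slot
print(clostripain.get("atomic"))   # inherited from "Procedure"
```

This kind of chain lookup is also what makes probe relaxation (Section 3.2) cheap: generalizing a feature is just a step up the same hierarchy.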

Concepts may have instances in the world. For example, "Sample 101"[5], an instance of "Polypeptide Sample", could be a particular sample with particular properties from a particular experiment. Like "thing" concepts, action concepts may also have instances, termed "episodic actions" (e-actions). An e-action represents a particular potential occurrence of a primitive action involving particular objects, parameters, etc. For example, "Acid Hydrolysis 202", an instance of "Acid Hydrolysis", could be acid hydrolysis on Sample 101 with parameters duration = 24 hours and sample amount = 1 nMole. An e-action is a kind of episode. Other kinds of episodes include experiment episodes and decision episodes, both of which are part of the Casebase (see Section 3).

The Goal Set contains the current outstanding goals (subgoals) for the experiment, such as "Amino acid composition for Sample 101 determined", "Disulfide bonds of Sample 202 broken", "Have overlapping sequenced fragments for Sample 303", etc. A goal may have an associated parent goal, sibling goals, a task for which the goal is a precondition, and task(s) which achieve the goal. Goals are currently added to the

[4] A.k.a. "working memory".

[5] A name consisting of a concept name and a numerical suffix designates an instance frame.

Goal Set by the firing of production (if-then) rules that match structures in the KB, or by their instantiation as preconditions of an action to be executed. The use of analogy to make use of previous subgoaling information is also being investigated.

The Task Set contains tasks currently being considered for execution. Tasks are e-actions and are either laboratory procedures (such as Aminopeptidase or Edman Degradation) or data procedures (such as Estimate Amino Acid Amounts or Assemble Sequence Fragments). A task may have an associated purpose (i.e., the goal it achieves) and preconditions. Tasks may be partially ordered. Tasks are currently added to the Task Set by the firing of production rules that match structures in the KB, especially goals in the Goal Set. The use of analogy to add tasks to the Task Set by reusing previous task-goal associations is being investigated.

The Samples Set consists of all instance frames representing samples in the current experiment. A polypeptide sample has an associated quantity, origin (the sample and procedure it was derived from), and properties (composition, sequence, length, etc.). SeqER starts with an initial sample of a polypeptide. Additional samples can be derived from this sample using cleavage procedures, performic acid oxidation, etc.

The Lab State, a module of the KB that is separate from the main semantic network, is a set of facts that represent the current state of the lab (the state of objects, equipment, etc.). This includes facts stating which data has been collected, which samples have been modified, etc. The Lab State and Samples Set are updated to reflect the effects of an experiment procedure's execution.

The Hypothesis Set contains frames representing hypotheses about the various properties (amino acid composition and sequence, disulfide bond locations, end groups, and circularity) of current polypeptide samples. An amino acid composition hypothesis is of the form "There are N A amino acids in Sample S", where N is a number, A is a kind of amino acid (e.g., Alanine, Arginine, etc.), and S is a polypeptide sample. A sequence hypothesis is of the form "There is an A amino acid in Position M in S", where M is a position in S's sequence. A disulfide bond hypothesis is of the form "There is a Disulfide Bond between Positions M1 and M2 in S", where M1 and M2 are the hypothesized locations of two Cysteines. Hypotheses may also be negative: e.g., "Sample S does not contain Serine". A hypothesis may have an associated confidence value (e.g., "high", "medium high", etc.), supporting evidence (procedures executed), and related hypotheses. Hypotheses are entered by the user based on his/her interpretation of the results of a lab or data procedure that was executed via SequenceIt.
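These hypothesis forms suggest a record-like encoding. The sketch below is one possible shape; the field names and confidence vocabulary are assumptions for illustration, not SeqER's actual frames:

```python
# Hypothetical record encoding of the hypothesis forms described above.
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    kind: str                 # "composition", "sequence", "disulfide-bond", ...
    sample: str               # e.g. "Sample 101"
    claim: dict               # the N/A/M/M1/M2 slots of the hypothesis form
    negative: bool = False    # e.g. "Sample S does not contain Serine"
    confidence: str = "medium"
    evidence: list = field(default_factory=list)   # procedures executed

# "There are 3 Alanine amino acids in Sample 101"
h1 = Hypothesis("composition", "Sample 101",
                {"amino_acid": "Alanine", "count": 3},
                confidence="high", evidence=["Acid Hydrolysis 202"])

# "Sample 101 does not contain Serine"
h2 = Hypothesis("composition", "Sample 101",
                {"amino_acid": "Serine"}, negative=True)
```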

The Rule Base contains production rules of three types: goal determination rules, task determination rules, and task selection rules. Rule antecedents are



matched against the KB (semantic net and Lab State) to determine if a rule can fire. The consequents of fired rules can modify the working knowledge in the KB. Goal determination rules and task determination rules add goals to the Goal Set and tasks to the Task Set, respectively. Task selection rules are used to select among alternative, executable tasks (see Sections 3.3 and 3.4).

Some rules are domain-specific and were derived from analysis of human protocols and from domain textbooks, papers, etc. An example is the goal determination rule GDS1: "If the goal is 'Determine Polypeptide Properties for Sample S', then instantiate subgoals 'Determine Sequence for S', 'Determine Circularity for S', ..."[6] Others are more generic rules, such as the task determination rule TD1: "If there is a Goal G for Sample S and there is a procedure A which achieves G, then instantiate the task A for S."
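The generic task determination rule can be read as a simple matching procedure. The sketch below reduces the KB to plain dictionaries; the real system matches rule antecedents against its semantic network:

```python
# Literal-minded sketch of the generic task determination rule quoted above:
# for each goal G on sample S with a procedure A achieving G, instantiate
# the task "A on S". The dictionary KB is a stand-in, not SeqER's encoding.

def task_determination(goal_set, achieves):
    """goal_set: list of (goal, sample); achieves: goal -> list of procedures."""
    tasks = []
    for goal, sample in goal_set:
        for procedure in achieves.get(goal, []):
            tasks.append((procedure, sample))
    return tasks

achieves = {"amino-acid-composition-determined":
            ["Acid Hydrolysis", "Base Hydrolysis"]}
goals = [("amino-acid-composition-determined", "Sample 101")]
print(task_determination(goals, achieves))
# -> [('Acid Hydrolysis', 'Sample 101'), ('Base Hydrolysis', 'Sample 101')]
```

Note that the rule deliberately over-generates: both hydrolysis tasks land in the Task Set, and the task selection machinery of Section 3 chooses between them.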

2.2 SeqER’s Top-Level Process

Given the overall goal(s) of the experiment and information about the initial situation, SeqER produces a plan. Planning and execution are interleaved in an experiment: after each procedure is selected by SeqER, it is "executed" by the user, who then interprets the results of the procedure and supplies SeqER with new hypotheses to investigate.

For protein sequencing experiments, the input goal is to determine the amino acid sequence and other properties of a polypeptide. As initial input, SeqER is told of the quantity (and any other known properties) of a polypeptide sample whose properties are to be determined (via SequenceIt). It is also told of any unusual situations, such as any lab equipment that is not functioning, or other constraints (time limits, etc.) on the plan to be produced.

SeqER repeats the following process until the experiment successfully concludes (the polypeptide's properties have been determined) or fails (SeqER is out of sample or other resources, or it cannot find executable tasks for its current goals).

1. The Experiment Controller (EC) adds applicable goals (subgoals) to the Goal Set through the firing of goal determination rules.[7]

2. The EC checks each goal in the Goal Set to see if it is already established (by examining the KB). If the Goal Set is empty or all goals have been established, SeqER ends the experiment with success.

3. The EC adds applicable tasks (i.e., those achieving unestablished goals in the Goal Set) to the Task Set through the firing of task determination rules.

[6] Where S is a variable whose value is an instance of Polypeptide Sample.

[7] Currently all firable rules are fired; no conflict set resolution is performed.

4. If new goals or tasks were added in the above steps, Steps 1-3 are repeated (because the new goals and tasks may suggest additional goals and tasks).

5. If no executable task can be found in the Task Set (and the Goal Set is not empty), SeqER ends the experiment with failure. Otherwise, the EC chooses a task (laboratory or data manipulation/analysis procedure) from the Task Set to execute. Task selection is done using previous decision episodes (see Section 3).

6. The task is presented to the user, who executes it (via the SequenceIt simulator).

7. The user interprets the results of the procedure's execution and enters changes to the Lab State and Hypothesis Set. (Other changes to the Lab State are made automatically using knowledge of the procedure's default effects.)

8. SeqER stores a new decision episode in the casebase which describes the task selection just made.

9. If the user judges the experiment complete, SeqER stores a new experiment episode and exits.[8] Otherwise, the process is repeated from Step 1.
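A toy, self-contained rendition of this nine-step loop might look as follows. The dictionary KB, the two hard-coded rules, and the implicit "user" (execution and interpretation are collapsed into one step) are stand-ins for SeqER's actual modules:

```python
# Toy rendition of the Experiment Controller loop (Steps 1-9 above).
# Everything here is a simplification for illustration, not SeqER's code.

def run_experiment(kb):
    while True:
        changed = True
        while changed:                       # Steps 1-4: fire rules to quiescence
            changed = False
            for rule in kb["rules"]:         # goal/task determination rules
                changed |= rule(kb)
        open_goals = [g for g in kb["goals"] if g not in kb["established"]]
        if not open_goals:
            return "success"                 # Step 2: all goals established
        runnable = [t for t in kb["tasks"] if t not in kb["done"]]
        if not runnable:
            return "failure"                 # Step 5: no executable task
        task = runnable[0]                   # Step 5: selection (DEs / rules)
        kb["done"].add(task)                 # Steps 6-7: execute and interpret
        kb["established"].update(kb["achieves"].get(task, ()))
        kb["episodes"].append((tuple(runnable), task))   # Step 8: store a DE

def goal_rule(kb):   # instantiate the composition subgoal once
    if "composition" not in kb["goals"]:
        kb["goals"].append("composition"); return True
    return False

def task_rule(kb):   # instantiate a task that achieves an open goal
    if "composition" in kb["goals"] and "acid-hydrolysis" not in kb["tasks"]:
        kb["tasks"].append("acid-hydrolysis"); return True
    return False

kb = {"rules": [goal_rule, task_rule], "goals": [], "tasks": [],
      "established": set(), "done": set(), "episodes": [],
      "achieves": {"acid-hydrolysis": ["composition"]}}
print(run_experiment(kb))   # -> success
```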

3. SeqER’s Use of Analogy

SeqER uses derivational analogy (e.g., (Carbonell 86)) to reuse previous planning experience. Derivational analogy differs from transformational analogy in that the latter makes use of the solution of the previous problem (e.g., a previously generated plan) whereas the former makes use of the reasoning used to produce that solution (e.g., the decisions made during the planning). Both kinds of analogy have been used by A.I. planning systems. Case-based planners such as CAPER typically use transformational analogy. Derivational analogy is used for planning by the PRODIGY/Analogy system, where it provides a considerable reduction in search effort (Veloso 92).

SeqER captures planning experience in the form of decision episodes and experiment episodes. Both of these kinds of episodes are stored as cases in the casebase. An experiment episode describes an experiment that was performed and consists of a sequence of decision episodes. A decision episode (DE) describes the choice of a task (laboratory or data procedure) to execute. A DE consists of the set of tasks chosen from, the task chosen, and the reason(s) for the choice made. DEs capture control knowledge which can be reused in analogous situations. For example, suppose a decision was made in a particular situation to choose task A from the set of tasks {A, B}, where both tasks achieve the goal G. When faced with the same (or a similar)

[8] Currently the user decides when the sequence has been accurately determined. Determining when hypotheses have been proven is a nontrivial task. SeqER might be modified to assist in this, perhaps by checking to see if the hypothesis set has quiesced.



choice in a similar future situation, a task analogous to A in that situation can be chosen, assuming that the choice of A was successful before.

In the event the original experiment failed some time after the choice of A was made, however, it may be difficult to determine if the choice of A over B was successful or not. Determining which choice(s) led to failure is an instance of the difficult credit-blame assignment problem. When available, information about success or failure of a choice is also stored in the DE. SeqER currently assumes that a previous choice was successful unless it is obviously at fault. For example, the choice of A can be deemed a failure if A clearly did not achieve goal G when A was predicted to achieve G. Thus, in the future when the objective is to achieve G in similar circumstances, one has reason to avoid choosing A, and one can avoid possibly repeating one's previous mistake by pruning A from the Task Set.
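A decision episode with optional outcome information might be encoded as below; the field names and the "assume success unless obviously at fault" test are assumptions for illustration:

```python
# Hypothetical record encoding of a decision episode (DE) as described above.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class DecisionEpisode:
    alternatives: list                  # the set of tasks chosen from
    chosen: str                         # the task chosen
    reasons: list = field(default_factory=list)   # supporting rules and/or DEs
    outcome: Optional[str] = None       # "success"/"failure" when determinable

de = DecisionEpisode(["A", "B"], chosen="A", reasons=["achieves goal G"])

# Credit-blame assignment is hard, so the outcome is often left unknown;
# the default policy assumes success unless the choice is marked a failure:
assumed_success = de.outcome != "failure"
print(assumed_success)   # -> True
```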

The savings from reusing control information are greatest when there is a large number of alternatives to consider and, through knowledge of previous choices, a more informed choice can be made from among them, rather than pursuing the path (the sequence of tasks) from each. Pursuing a path can be expensive in SeqER because planning and execution are interleaved and there is no ability to backtrack by undoing choices made. Knowledge of which alternative previously led to success can be used to select that alternative (or an analogous one) in the current situation. Knowledge of which alternatives previously led to failure can be used to prune those alternatives (or analogous ones) from the Task Set. The danger in pruning alternatives is the possibility of not finding a more optimal path to a solution.

When SeqER must decide which task in the Task Set to execute next, it searches for episodes in which similar decisions were made. If it finds such a DE, it can opt to choose the task in the current ("target") situation that is analogous to the task chosen in the previous ("source") episode. An alternative approach to using DEs to guide task selection, or making an arbitrary choice, is the use of heuristic task selection rules (see Section 3.4).

3.1 Acquiring Decision Episodes

Initial DEs were obtained by analyzing experiment logs kept by a human expert user of the SequenceIt program. Thus some of the user's expertise is captured in these DEs and can be made use of by SeqER. These initial experiments are "correct": the choices made led eventually to a plan that determined a polypeptide's properties. These experiments, however, are not assumed to be optimal; it may be possible to find a successful experiment that has fewer procedures or that consumes fewer resources.

Each time SeqER chooses among tasks, a new DE is created to describe this choice and is stored in the casebase. In this manner, SeqER can improve over


time as it accumulates additional planning experience. It is expected that SeqER's knowledge base will grow quite large as these DEs accumulate.

3.2 Decision Episode Retrieval

When faced with choosing a task from the Task Set, SeqER attempts to retrieve similar decision episodes from its casebase. SeqER's Case Applier component retrieves DEs using the case retrieval mechanism of the CAPER case-based planning system. Through the use of massive parallelism, CAPER can retrieve cases efficiently from a very large, unindexed memory (Kettler et al. 93). Because this case memory is unindexed, there is high flexibility as to which features of the current situation (i.e., Lab State, Goal Set, Hypothesis Set, current samples, etc.) can be used as part of the retrieval probe.

For protein sequencing, SeqER currently uses a simple probe consisting of the type and object of each executable task in the Task Set. As a simple example, consider the case where the current task set contains two target tasks: (TT1) Acid Hydrolysis on Sample S1 and (TT2) Base Hydrolysis on S1. A retrieval probe will be issued to find all DEs where the alternatives included Acid Hydrolysis on Sample S and Base Hydrolysis on Sample S (where S is any previous sample). If a single source DE is returned, it is selected for application (as described in the next section). If no source DE is returned, SeqER must use other means (perhaps even guessing) to choose among the target tasks. CAPER's retrieval flexibility makes it easy to add other features, such as the hypotheses about a sample, to the retrieval probe.

In addition, SeqER can make use of CAPER's probe relaxation mechanism in the event that no cases match a retrieval probe (or some feature in the retrieval probe). Features in the retrieval probe can be relaxed using information from the concept is-a hierarchy. For example, SeqER may have to choose between the tasks Clostripain Cleavage and Acid Hydrolysis. It issues a probe containing these tasks and gets no matching cases. Using its taxonomic knowledge that Clostripain Cleavage is-a Enzymatic Cleavage Procedure and that Acid Hydrolysis is-a Hydrolysis Procedure, SeqER could issue a new, more general probe, replacing "Acid Hydrolysis" with "Hydrolysis" and/or replacing "Clostripain Cleavage" with "Enzymatic Cleavage".
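Probe matching with is-a relaxation can be sketched as follows. The taxonomy and casebase contents are invented, and CAPER's actual retrieval is massively parallel rather than a sequential scan:

```python
# Sketch of unindexed probe matching with one level of is-a relaxation.
# Taxonomy and casebase are invented examples for illustration only.

ISA = {"Clostripain Cleavage": "Enzymatic Cleavage",
       "Trypsin Cleavage": "Enzymatic Cleavage",
       "Acid Hydrolysis": "Hydrolysis",
       "Base Hydrolysis": "Hydrolysis"}

def matches(probe_feature, case_task):
    """A case task matches a probe feature exactly or via its is-a parent."""
    return case_task == probe_feature or ISA.get(case_task) == probe_feature

def retrieve(probe, casebase):
    """Return the DEs (here: sets of alternative tasks) covering every probe feature."""
    return [de for de in casebase
            if all(any(matches(p, t) for t in de) for p in probe)]

casebase = [{"Acid Hydrolysis", "Base Hydrolysis"},
            {"Trypsin Cleavage", "Acid Hydrolysis"}]

print(retrieve(["Clostripain Cleavage", "Acid Hydrolysis"], casebase))  # no match
print(retrieve(["Enzymatic Cleavage", "Hydrolysis"], casebase))         # relaxed probe
```

The exact probe fails because no stored DE mentions Clostripain Cleavage; the relaxed probe, generalized one step up the is-a hierarchy, recovers the DE that chose between Trypsin Cleavage and Acid Hydrolysis.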

SeqER uses a simple similarity metric to rank the DEs retrieved using the probe. Each of these DEs contains one or more source tasks that match some target task in the retrieval probe. SeqER counts the number of tasks that each source DE has in common with the target DE. This score ranges from 1 to K, the number of executable tasks in the Task Set. In the above example, this score could be 1 or 2. DEs with scores below 2 are rejected.[9] Multiple DEs with the same similarity score are ordered such that those DEs that have the fewest extraneous tasks (tasks with no analogues among the target tasks) are preferred. The adequacy of this simple similarity metric needs to be determined through further empirical study.

[9] This heuristic eliminates DEs having only a single task analogous to a target task. This is not considered a sufficient basis for an analogy between the source DE and the target situation.
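The scoring and tie-breaking just described can be sketched as below, with task matching reduced to name equality for illustration:

```python
# Sketch of the similarity ranking: a DE's score is the number of its tasks
# with matches among the target tasks; DEs scoring below 2 are rejected, and
# ties are broken in favor of fewer extraneous tasks. A simplification of
# SeqER's matching, with invented task names.

def rank_des(target_tasks, source_des):
    target = set(target_tasks)
    scored = []
    for de in source_des:
        common = len(set(de) & target)
        if common >= 2:                         # reject single-task analogies
            extraneous = len(set(de) - target)
            scored.append((common, -extraneous, de))
    scored.sort(reverse=True)                   # best score, then fewest extras
    return [de for _, _, de in scored]

target = ["Acid Hydrolysis", "Base Hydrolysis"]
des = [["Acid Hydrolysis", "Base Hydrolysis", "Oxidation"],   # score 2, 1 extra
       ["Acid Hydrolysis", "Base Hydrolysis"],                # score 2, 0 extras
       ["Acid Hydrolysis"]]                                   # score 1: rejected
print(rank_des(target, des))
```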

SeqER's Case Applier next attempts to find a complete mapping between each source DE retrieved and the target situation. A one-to-one mapping must be found between target entities (target tasks, and their samples, matched during case retrieval) and source entities (tasks, and samples, in the source DE). It is possible that several such mappings may exist between a source DE and the target situation. SeqER currently considers only the first such mapping it finds. If no such mapping can be found for a source DE, it is rejected.

SeqER next checks each source DE for which it has established a mapping(s) to determine if a target task was found that is analogous to the task chosen in the source DE. If no analogous target task exists, the source DE is rejected. For example, consider a source DE that has two tasks, As and Bs, which match two target tasks AT and BT, respectively. Suppose the source DE has an additional task Cs, which was the task chosen but has no analogue among the target tasks. This source DE will thus be rejected, since there is no choice among the targets to make that is analogous to Cs.[10]
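The mapping search and the chosen-task check can be sketched as below, with tasks reduced to (procedure, sample) pairs and "analogous" simplified to having the same procedure name:

```python
# Sketch of the one-to-one mapping between source-DE tasks and target tasks,
# including rejection of DEs whose chosen task has no target analogue.
# Representations and matching criteria are simplifications for illustration.
from itertools import permutations

def find_mapping(source_tasks, target_tasks, chosen):
    """Return a one-to-one source->target mapping (procedures must match)
    in which the chosen source task has an analogue, else None."""
    if len(source_tasks) > len(target_tasks):
        return None
    for perm in permutations(target_tasks, len(source_tasks)):
        mapping = dict(zip(source_tasks, perm))
        if all(s[0] == t[0] for s, t in mapping.items()) and chosen in mapping:
            return mapping
    return None

src = [("Acid Hydrolysis", "S2"), ("Base Hydrolysis", "S2")]
tgt = [("Acid Hydrolysis", "S1"), ("Base Hydrolysis", "S1")]

m = find_mapping(src, tgt, chosen=("Acid Hydrolysis", "S2"))
print(m)   # maps each S2 task to its S1 counterpart

# A DE whose chosen task has no analogue among the matched tasks is rejected:
print(find_mapping(src, tgt, chosen=("Oxidation", "S2")))   # -> None
```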

3.3 Decision Episode Application

SeqER considers each source DE returned by the retrieval/mapping procedure described above. Source DEs are considered in decreasing order of their similarity scores. For each source DE, SeqER checks the reason(s) for the choice that was made. Reasons for a choice are: the choice is supported by a rule(s), the choice is supported by an (analogous) DE(s), or both. It is also possible that a choice was made arbitrarily. SeqER looks for reasons analogous to those from the source DE that can justify choosing the analogous target task in the current situation.11 In addition, SeqER also considers any information it has as to the success or failure of the choice made in the source DE. This information can also be used to justify a particular choice among the target tasks.

To continue the previous section's simple example, suppose SeqER is considering a source DE - DE3 - where a choice was made between source tasks (ST1) Acid Hydrolysis on sample S2 (previously chosen) and (ST2) Base Hydrolysis on S2. The mapping {(TT1 :: ST1), (TT2 :: ST2), (S1 :: S2)} is found.12 Since ST1 was previously chosen, SeqER considers choosing the analogous target task, TT1 (Acid Hydrolysis on S1).

10This information could be of use in pointing out tasks that perhaps should be under consideration but are not. If the source DE describes a situation sufficiently similar to the current situation, the system might decide to look for and instantiate (via a rule, etc.) a task analogous to Cs and add it to the Task Set.

11The importing of reasons from the source DE to the target situation is a kind of transformational analogy.

Suppose the reason for DE3 was the applicable task selection rule "Rule 11: If taskA has fewer harmful side effects than taskB, choose taskA over taskB" with variable bindings taskA = ST1, taskB = ST2. The Rule Applier checks to see if the rule applies in the current situation when instantiated with the proper substitutions from the mapping (TT1 for taskA, TT2 for taskB). If so, Rule 11 justifies choosing TT1 over TT2. Although it is possible that Rule 11 could have been checked for application without first retrieving any DE, there are advantages in the approach used (see Section 3.4).
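Re-instantiating a source DE's rule under the mapping can be sketched as follows, assuming rules are encoded as plain predicates. SeqER's actual rules are frame-based; `rule_11`, the binding names, and the `side_effects` table are our stand-ins.

```python
# Hedged sketch: test whether a source DE's task-selection rule still holds
# once its source bindings are replaced by their target analogues.

def rule_11(task_a, task_b, side_effects):
    """Choose task_a over task_b if it has fewer harmful side effects."""
    return side_effects[task_a] < side_effects[task_b]

def reapply_rule(rule, bindings, mapping, **context):
    """Substitute target entities for the source bindings, then test the rule."""
    target_bindings = {var: mapping[src] for var, src in bindings.items()}
    return rule(**target_bindings, **context), target_bindings
```

If the re-instantiated rule holds, the rule (not merely the case) becomes the justification for the target choice.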

If, however, the reason for choosing ST1 in DE3 had been another decision episode (say DE2), then SeqER assumes that DE2, by virtue of its justifying the choice of ST1 in DE3, also justifies choosing TT1 in the current situation. In other words, because the current situation is analogous to DE3, which is analogous to DE2, the current situation is analogous to DE2 by transitivity. Thus, DE2 would justify choosing TT1.

Finally, suppose ST1 was chosen arbitrarily in DE3. In this event, if it is known that choosing ST1 (rather than ST2) proved to be a "good" decision, SeqER is justified in choosing TT1 given the assumption that DE3 is similar to the current situation. If the quality of the choice of ST1 is not known, however, then SeqER will, in the absence of any other justification, consider DE3 sufficient justification for choosing TT1.13 This is an example of using analogy to make a choice when no stronger domain knowledge such as a rule is available, rather than just making an arbitrary choice.

For source DEs with the same similarity score, SeqER prefers the source DE that suggests (by analogy) the choice of a particular target task for which the best justification exists in the current situation. Justification by a rule is preferable to justification by a case. Both are preferable to choosing a target task analogous to one chosen arbitrarily in some source DE. The target task, along with the source DE that suggested it, is returned to the Experiment Controller for execution by the user.
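One way to encode this preference ordering is shown below. The encoding is ours: similarity score dominates, and among equal scores a rule justification beats a case justification, which beats an analogue of an arbitrary choice (known-good arbitrary choices ranking above unknown ones).

```python
# Sketch of the justification preference ordering described above.
# Lower rank is better.

JUSTIFICATION_RANK = {"rule": 0, "case": 1, "arbitrary-good": 2, "arbitrary": 3}

def best_suggestion(suggestions):
    """suggestions: list of (target_task, de_id, score, justification)."""
    return min(suggestions,
               key=lambda s: (-s[2], JUSTIFICATION_RANK[s[3]]))
```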

The justification mechanism in SeqER is being fine-tuned through empirical evaluation. There are several possible extensions, including the use of multiple source DEs to justify the choice of a particular target task. Multiple source DEs might suggest the same target task and hence provide better justification for choosing it than a single source DE could.

12Note that TTi and STj are constants standing for particular tasks. S1 and S2 are actual sample names.

13This approach is more warranted when there is reason to believe the source DE describes a good (or at least "non-bad") choice, such as when the source DE was from an experiment performed by a human expert.

Kettler 221

3.4 Cases and Rules

The interaction of rules and cases is being explored in SeqER. Analysis of human expert protocols in this domain indicates that both are used to determine goals/tasks and for task selection. These types of rules are thus used by SeqER. The integration of rule-based reasoning and case-based reasoning has proven useful for systems in other domains (e.g., (Rissland and Skalak 91)).

Our initial approach in SeqER is to minimize the use of task selection rules. Often such rules are not available, particularly in less understood domains, or are highly specific as to their applicability. Our aim is to see how far SeqER can get by its use of analogy.

In SeqER, cases can be used to organize the task-selection rules. Consider the example from the previous section in which the justification for DE3 is Rule 11. When DE3 is retrieved, Rule 11 is "retrieved" and checked to see if it applies in the current situation. If so, Rule 11 justifies choosing TT1. An alternate approach would be to consider all such rules at task selection time and see if any can apply, before retrieving any DEs. Rule 11 would thus be found and applied directly, without retrieving DE3. Although this approach is practical for small rulebases, it becomes less practical to check each rule in a large rulebase for applicability at every task selection step, because rule matching can be expensive. With SeqER's approach, a task-selection rule is not considered until it is suggested by a case (as Rule 11 was suggested by DE3). Some recent psychological results have shown how some experts use cases to organize rules (Boshuizen and Schmidt 92).
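This lazy, case-indexed rule lookup might be sketched as below. The names and the flat reason lists are illustrative; in SeqER the reasons are stored in the DE frames themselves.

```python
# Sketch of using cases to index rules: only the rules recorded as reasons
# in the retrieved DEs are matched, instead of scanning the whole rulebase
# at every task-selection step.

def rules_suggested_by(retrieved_des, de_reasons):
    """Collect the rules recorded as reasons in the retrieved DEs."""
    suggested = []
    for de in retrieved_des:
        for reason in de_reasons.get(de, []):
            if reason.startswith("Rule") and reason not in suggested:
                suggested.append(reason)
    return suggested
```

Only the returned rules then need to be checked for applicability in the current situation.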

Cases can also provide a context in which a task selection rule applies. This is especially useful for heuristic rules that only work in certain situations. For example, suppose that we have task selection Rule 22 that only works in very limited situations and that was applied with successful results in DE5. If we have retrieved DE5 and are considering applying Rule 22 to the target situation, we can be more confident that Rule 22 will work since our current situation is similar to DE5.
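The confidence check just described could be sketched as follows, assuming a record of which rules were applied successfully in which DEs (the `success_log` table is our assumption, not a SeqER structure):

```python
# Sketch: apply a narrow heuristic rule only when some retrieved (i.e.,
# similar) DE records a successful application of that rule.

def confident_to_apply(rule_id, retrieved_des, success_log):
    """True if a retrieved DE applied rule_id with successful results."""
    return any(success_log.get((de, rule_id)) for de in retrieved_des)
```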

One extension to SeqER would be to add a mechanism to induce task selection rules from similar cases. Another extension would be to induce the context of applicability for task selection rules from similar cases where the rule was applied with success.

4. An Example of SeqER’s Operation

The following text describes part of an actual SeqER run. SeqER was initially given 200 nMoles of Beta-Endorphin protein to determine its sequence and other properties. The casebase used here contained 12 experiments of fairly short length (15-20 DEs each - a total of 207 decision episodes). All of these experiments were encoded from the logs of a human domain expert. The total KB size (conceptual and episodic memory) was over 1400 frames. To test the ability of reasoning by analogy, task selection rules were disabled and cases alone were used to choose among tasks.

SeqER first instantiated the initial goal (Goal-1325: PP-Props-Determined for Sample-1324) for the new experiment, Experiment-1323. It then fired as many goal and task determination rules as possible, until no new rules could be fired. SeqER now had to choose among the following executable target tasks (procedures/e-actions) from the Task Set: Base-Hydrolysis-1424 of Sample-1324, Acid-Hydrolysis-1426 of Sample-1324, and Performic-Acid-Oxidation-1430 of Sample-1324.

Three candidate source decision episodes were retrieved (DE-1890, DE-1844, and DE-1793), each with a score of 2 (they each have two source tasks matching the above target tasks). Each decision episode was considered and all were found to have the same reasons: NIL (i.e., no justification was given in the source DEs). In the absence of source DEs with justifications, SeqER chose one of these (DE-1890, from an experiment to sequence Bombesin) where the choice of source task Acid-Hydrolysis-1888 on Sample-1886 (over Base-Hydrolysis-1889 on Sample-1886) was successful. Using the mapping between these source tasks and the target tasks, the analogous target task to be chosen (Acid-Hydrolysis-1426 of Sample-1324) was determined.

The user executed Acid Hydrolysis (24 hours duration, 1 nMole of Sample-1324) using SequenceIt, which successfully achieved the goal of obtaining data on the amino acid composition of Sample-1324. This goal was marked achieved (and removed from the Goal Set), and task Estimate-Composition-1370, for which this goal was a precondition, was added to the Task Set. (Base-Hydrolysis-1424 was no longer needed to achieve this goal and was thus removed from the Task Set.) A new DE was created to record the choice of Acid-Hydrolysis-1426, and DE-1890 was stored as the reason justifying this choice.

Other rules were fired, and eventually SeqER next chose between Estimate-Composition-1370 and Performic-Acid-Oxidation-1430, both for Sample-1324. Three candidate source decision episodes were retrieved, but each only had a single source task analogous to a target task (a score of 1). This was not sufficient for an analogy, and the source DEs were rejected. SeqER arbitrarily chose Estimate-Composition-1370.

The user used the acid hydrolysis data in SequenceIt (obtained in the previous DE) to successfully estimate the amino acid amounts in Sample-1324. The user returned these estimates as 20 medium-high confidence hypotheses: 2 Alanines, 2 Aspartic Acids, 5 Lysines, etc. These were added to the Hypothesis Set, and a new DE was created to describe the choice and results of this action (the reason was "Arbitrary Choice"). The target goal of obtaining amino acid composition hypotheses for Sample-1324 was marked achieved.

222 ISMB-93

The addition of new hypotheses triggered (via the firing of rules matching the Hypothesis Set and Concepts Set) the addition to the Task Set of laboratory procedures that cleaved on the hypothesized amino acids in Sample-1324. SeqER next chose Performic-Acid-Oxidation-1430 over the following tasks (all for Sample-1324): Estimate-Length-1356, Thermolysin-Cleavage-1476, Lysobacter-Protease-Cleavage-1478, and Trypsin-Cleavage-1480. Performic Acid Oxidation was executed to derive a new sample (instantiated as Sample-1489) from 100 nMoles of Sample-1324. SeqER then continued to perform additional procedures on this new sample, which lacked the disulfide bonds that can complicate sequence determination.

This example is illustrative of several aspects of SeqER's operation in this domain and in general. The features used in the retrieval probe and in the similarity metric are a kind of implicit domain knowledge. In the protein sequencing domain the use of only the target tasks (and samples) in case retrieval/selection often results in a source DE being discarded due to its matching only a single target task. This mainly occurs when the source DE was from a human-planned experiment, where the alternatives being considered at each decision point are few. We conjecture that the human expert is using additional domain knowledge to limit the alternatives under consideration. SeqER, in contrast, adds all the tasks it can (via the firing of rules) to the Task Set and then chooses among them. SeqER's approach results in a systematic enumeration of the alternative tasks, but these include ones that might not be particularly useful to execute. To mimic the human expert's strategy, additional domain knowledge could be incorporated into the goal and task determination rules.

In addition, it has been observed that SeqER often lacks the ability to make fine discriminations among source DEs that have highly similar source tasks. When selecting which of these DEs is most similar to the target situation, it would be useful to compare the (hypothesized) known properties of samples in their source tasks with those of the samples in the target tasks. This is currently being investigated. The modular nature of the KB makes it easy to add, delete, or modify its contents based on this kind of empirical evaluation of the system's operation.

The size of the KB can be quite large. Experiments to sequence small polypeptides (20 or so amino acids) often have more than 20 DEs. Some large polypeptides (hundreds of amino acids) sequenced by a human expert have over a hundred DEs. SeqER must be able to support these large casebases of thousands of frames. The parallel case retrieval mechanisms of CaPER/Parka that are used have been proven efficient on large casebases (tens of thousands of frames) in other domains and should scale well for SeqER casebases too.

5. Related Work

The MOLGEN (Stefik 81) system used generative planning techniques along with constraint posting and meta-planning to design experiments for gene cloning. Unlike MOLGEN, SeqER must interleave plan generation and execution. SeqER also uses planning techniques that MOLGEN did not, such as case-based planning. More recent work using MOLGEN for automated discovery is described in (Friedland and Kedes 85).

The use of derivational analogy to control a generative planner has been investigated in the PRODIGY/Analogy system (Veloso 92). SeqER also uses analogy for control of a (simpler) planner but employs different case representations and retrieval techniques and interleaves planning with execution. Other systems have used derivational analogy for planning in domains other than experiment planning, including the APU system (Bhansali and Harandi 93) for the synthesis of UNIX programs and various systems that use derivational analogy to design objects (see (Mostow 90) for an overview of these).

Several systems have used analogy for scientific reasoning, including the PHINEAS system (Falkenhainer 90), which explained new scientific phenomena by analogy to previously-explained phenomena, and the PI system (Thagard 88). Also related to the general domain of SeqER is the HYPGENE/GENSIM system (Karp 89), which used iterative design of hypotheses to improve the predictive power of theories. There has also been some research in psychology on the task of experiment design (e.g., (Klahr et al. 90)).

6. Directions for Future Research and Conclusions

There are many possible interesting extensions to the protein sequencing configuration of SeqER and the architecture in general, to be investigated after further empirical study of the prototype. The Hypothesis Set could be monitored to collect data about how it changes based on the results of executing a lab procedure, and checked for weak or inconsistent hypotheses; the deficiencies found could be used to suggest appropriate lab procedures to remedy them. SeqER could focus more on plan optimality (i.e., minimize resource consumption, etc.). The capturing of more detailed information on the quality of choices would allow better evaluation of whether an analogous choice should be made in a future situation. The reuse of larger units (i.e., several consecutive actions) from previous plans where possible is also being investigated.

SeqER's architecture has allowed us to explore the application of derivational analogy and other planning techniques to scientific experiment planning. Relevant planning experience can be efficiently retrieved and applied to select tasks for execution in protein sequencing experiments. New experience is encoded as new cases in the casebase. Knowledge about laboratory and data manipulation/analysis procedures, laboratory objects, goals, methodology, hypotheses, and previous plans is represented as frames, rules, and cases in a modular knowledge base component. This KB component captures domain-specific knowledge while the reasoning mechanisms of the Planner component are domain-independent. Thus it is anticipated that SeqER could easily be configured for other experiment planning domains.

We are currently investigating potential domains where there is a real need for automation to complement the skills of human scientists. SeqER, acting as an assistant, could accurately recall and present relevant planning experiences from the past, advise on which procedure to select next, and maintain up-to-date information about current hypotheses.

Acknowledgments: We wish to thank Dr. James Hendler for his direction and support of the CAPER project; Dr. Joshua Lederberg and his associates for feedback on SeqER and experiment planning; Dr. Allen Place for making the SequenceIt program available to us; Ting-Hsien Wang and especially Nathan Felix for their knowledge acquisition work for SeqER; and Bill Andersen for his development of new Parka code and tools. The CAPER part of this research was supported in part by grants from NSF (IRI-8907890), ONR (N00014-J-91-1451), and AFOSR (F49620-93-1-0065).

References

Bhansali, S., and Harandi, M.T. 1993. Synthesis of UNIX Programs Using Derivational Analogy. Machine Learning, Vol. 10, pp. 7-55.

Boshuizen, H.P.A., and Schmidt, H.G. 1992. On the Role of Biomedical Knowledge in Clinical Reasoning by Experts, Intermediates, and Novices. Cognitive Science, 16, pp. 153-184.

Carbonell, J.G. 1986. Derivational Analogy: A Theory of Reconstructive Problem Solving and Expertise Acquisition. In Machine Learning, Vol. 2, Eds. R.S. Michalski, J.G. Carbonell, T.M. Mitchell. San Mateo, California: Morgan Kaufmann Publishers.

Darden, L. 1991. Theory Change in Science: Strategies from Mendelian Genetics. New York: Oxford University Press.

Evett, M.P.; Hendler, J.A.; and Spector, L. 1993. Parallel Knowledge Representation on the Connection Machine. Journal of Parallel and Distributed Computing, forthcoming.

Falkenhainer, B. 1990. A Unified Approach to Explanation and Theory Formation. In Computational Models of Scientific Discovery and Theory Formation. Eds. J. Shrager and P. Langley. San Mateo, California: Morgan Kaufmann Publishers, pp. 157-196.

Friedland, P., and Kedes, L.H. 1985. Discovering the Secrets of DNA. Communications of the ACM, Vol. 28, No. 11, November 1985, pp. 1164-1186.

Karp, P.D. 1989. Hypothesis Formation and Qualitative Reasoning in Molecular Biology. Doctoral Dissertation Stan-CS-89-1263, Stanford University.

Kettler, B.P.; Hendler, J.A.; Andersen, W.A.; and Evett, M.P. 1993. Massively Parallel Support for Case-Based Planning. In Proceedings of the Ninth Conference on Artificial Intelligence Applications (IEEE). Held in Orlando, Florida, March 1-5, 1993. Washington, D.C.: IEEE Computer Society Press, pp. 3-9.

Klahr, D.; Dunbar, K.; and Fay, A.L. 1990. Designing Good Experiments to Test Bad Hypotheses. In Computational Models of Scientific Discovery and Theory Formation. Eds. J. Shrager and P. Langley. San Mateo, California: Morgan Kaufmann Publishers, pp. 355-402.

Mostow, J. 1990. Design by Derivational Analogy: Issues in the Automated Replay of Design Plans. In Carbonell, J.G., Machine Learning: Paradigms and Methods. Cambridge, Massachusetts: The MIT Press, pp. 119-184.

Place, A.R., and Schmidt, T.G. 1992. User's Guide for SequenceIt!: An Apprenticeship in the Art and Logic of Protein Sequencing. Center of Marine Biotechnology, Maryland Biotechnology Institute and Project BioQUEST.

Rissland, E.L., and Skalak, D.B. 1991. CABARET: Rule Interpretation in a Hybrid Architecture. International Journal of Man-Machine Studies, Vol. 34, pp. 839-887.

Spector, L.; Hendler, J.A.; and Evett, M.P. 1990. Knowledge Representation in PARKA. Technical Report 2410, Department of Computer Science, University of Maryland at College Park.

Stefik, M. 1981. Planning with Constraints (MOLGEN: Part 1). Artificial Intelligence, Vol. 16, pp. 111-140.

Thagard, P.R. 1988. Computational Philosophy of Science. Cambridge, Massachusetts: The MIT Press.

Veloso, M.M. 1992. Learning By Analogical Reasoning in General Problem Solving. Doctoral Dissertation CMU-CS-92-174, School of Computer Science, Carnegie Mellon University.