Grounded Situation Models for Situated Conversational Assistants

Nikolaos Mavridis, PhD candidate, The Media Laboratory, Massachusetts Institute of Technology

Transcript of "Grounded Situation Models for Situated Conversational Assistants" (alumni.media.mit.edu/~nmav/thesis/ThesisPresentation.pdf)

Page 1:

Grounded Situation Models for

Situated Conversational Assistants

Nikolaos Mavridis,

PhD candidate,

The Media Laboratory,

Massachusetts Institute of Technology

Page 2:

Prelude

Acknowledgments: Roz, Whitman, Yiannis, Deb, Kai-yuh, Cogmac, Affective…

Page 3:

Central Theme

How can we build cooperative Situated Conversational Assistants around Grounded Situation Models, which are amodal representations of current, past or imagined situations, that can be fed both by the senses and by language?

Page 4:

Vision and Contributions

Vision: Cooperative Situated Conversational Assistants (SCAs)

Contributions: The Grounded Situation Model (GSM) proposal:

Motivation and Requirements

Explicit Computational Model

A modular Implementation Architecture

An Implemented System

Quantitative and Qualitative Evaluations

A discussion of a Future Extensions plan

A discussion with proposals: GSMs in relation to theories in AI, cognitive science, and semantics

Other smaller contributions: a rough method for designing GSM-based systems

Page 5:

News

HOW WE WENT FROM…

…A SOMEWHAT DUMMY CONVERSATIONAL ROBOT

A robot that could only touch currently visible objects

Page 6:

News

TO A ROBOT THAT CAN… IMAGINE, REMEMBER, AND MORE!

Imagining: parts of situations through language, and later updating/verifying through the senses

Remembering / answering questions: about the current or past situations

Verbalising gradations of certainty: expressing and maintaining alternative possibilities

Motor commands: Pick up, touch, hand me…

Latest extensions: face & hand detection, bilinguality

Page 7:

News

1) Three Novelties above the state of the art in conversational robots

2) Behavioral repertoire comparable to / surpassing the Token Test for children

3) Quantitative Performance: median 1 correction (repeat) per 10-step task, 100% task completion

Page 9:

Salient Idea: Grounded Situation Models

A LAYMAN'S DEFINITION: Customizable amodal representations with processes, acting as a "theater in the head", containing descriptions of present, past or imagined situations.

REQUIREMENTS FOR GSMs:

1. Break down situations into the types of the NL "folk ontology":
Histories = moments (situations) + events
Situations = agents + objects + relations
Agents = physical + mental part
Physical part = subparts with properties, etc.
Contain representations for "me, you, others" in situations, and "embedded GSMs" within the mental part
Handle uncertainty / represent multiple alternatives

2. Multimodal connectivity / bi-directionality between senses and language:
Fed by multiple modalities (vision, speech, …), feeding parameters to multiple modalities (speech, motor)
Bi-directionality between sensing and language
Standard processes for basic operations: updates, matching, …

Page 10:

GSM BIDIRECTIONALITY

SENSES vs. LANGUAGE: DIFFERENCE in Informational Granularity / Specificity

Page 11:

In more detail…

I. Vision, Background, Motivation

II. Computational Model

III. Modular Implementation Architecture

IV. Implemented System: Ripley the Robot

V. Evaluation

VI. Future Extensions plan

VII. Discussion / relation to AI, semantics...

Page 13:

Vision: Cooperative SCAs

A Situated Conversational Assistant (SCA)

is a system with sensing, acting and speech synthesis/recognition abilities, which engages in physically situated natural language conversation with human partners and assists them in carrying out tasks.

A continuum: from DUMMY TOOL to TRULY COOPERATIVE ASSISTANT

Tool: [Calligraphy Pen]
No autonomy – needs continuous control
Human adapts to tool
Tool does not know about human, objects & task

Cooperative Assistant: [Friend helping with cooking]
Coordinated autonomy – takes initiative
Assistant adapts to human
Assistant knows about human, objects & task

Three axes: autonomy, adaptation, knowledge

SOME PREREQUISITES FOR COOPERATIVITY:

natural language interaction: Commands, Questions, Inform statements, Declarations, Promises…

intention recognition: He’s looking at the tomatoes…what does he want?

action coordination: Once he cuts the tomatoes, I’ll throw them in the pan

Page 14:

Background: Relevant Notions

I. Symbol Grounding [Harnad]

Any system of symbols, if isolated from a meaning space external to it, cannot help but become circular. (Ex: Semantic Nets.)

Meaning Spaces, Indirect Grounding, Approximations, …

II. Situation Models [Zwaan/Radvansky] ([Johnson-Laird], [Minsky]): "A situation model is a mental representation of a described or experienced situation in a real or imaginary world"

Situation model components & levels; main focus on story understanding / integration. No explicit processes for sensory connection!

III. Situated Language

Language about the concrete here-and-now

Development and levels of situatedness, also in Ripley’s progress

Also: Chapter 2 – purpose, internal rep, prediction etc.

Page 15:

Motivation: Four pathways

Main motivation:

Building Cooperative SCAs

Other motivations:

Cognitive Architectures: How to interconnect vision, motor control, and speech? (NLP cannot PNP)

How can we go from percept to words & vice-versa?

GSMs as "structured blackboards" in AI?

Internal Models [Zwaan, Johnson-Laird, Minsky]: What if there was a "theatre in the head"

that is filled with present, past or imagined “Situation Models?”

Grounding the Meaning of Words [Roy, Siskind, Regier]: How can we empirically learn models of the meaning of words?

What is wrong with today’s NLP?

[Figure: the semiotic triangle – MIND (concepts), WORLD (objects), SIGN (words) – with the example "A blue ball"]

Page 16:

II. Computational Model

Representation
Why three layers?
The three layers in operation
Example Contents
Operational Cycle
Histories (Moments & Events)

Page 17:

Representational Hierarchy

Two Desiderata for proposed design:

I) Folk Ontology

II) Multimodal connectivity /

bidirectionality between senses and language

Page 18:

Why three layers?

Different Layers for Different Purposes!

Consider an object (a ball) which is part of a situation, and consider one of its properties (position). Imagine that you:

C1) See a ball and notice its position under uncertainty
C2) Need to grasp the ball
C3) Hear that "the ball is at the center"

What is a matching representation of the position for each of C1-C3?

L1) Stochastic (n-D continuous distribution) (INFO FROM SENSES)

L2) Continuous (n-D continuous value) (ACTION PARAMETERS)

L3) Categorical (1-D discrete distribution) (INFO FROM/TO WORDS)

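To make the three layers concrete, here is a minimal sketch (illustrative names, not the thesis code) of a single "position" property held at all three layers, reusing the slide's example point (0.2, 0.34) as a mixture mean:

```python
import numpy as np

class PositionProperty:
    """One GSM property at the three layers (names are assumptions)."""
    def __init__(self):
        # L1 (stochastic): Gaussian mixture as (mean, sigma, weight) triples,
        # fed by noisy sensing (case C1).
        self.l1 = [(np.array([0.2, 0.34]), 0.05, 1.0)]
        # L2 (continuous): a single point estimate, usable as an action
        # parameter for grasping (case C2).
        self.l2 = None
        # L3 (categorical): discrete distribution over location words (case C3).
        self.l3 = {}

    def l1_to_l2(self):
        """Collapse the mixture to its weighted mean."""
        total = sum(w for _, _, w in self.l1)
        self.l2 = sum(w * m for m, _, w in self.l1) / total

    def l1_to_l3(self, region_of, n_samples=1000):
        """Estimate P(category) by sampling the mixture and classifying each
        sample; `region_of` maps a point to a word such as "center"."""
        rng = np.random.default_rng(0)
        counts = {}
        for mean, sigma, weight in self.l1:
            for p in rng.normal(mean, sigma, size=(n_samples, mean.size)):
                c = region_of(p)
                counts[c] = counts.get(c, 0.0) + weight
        total = sum(counts.values())
        self.l3 = {c: v / total for c, v in counts.items()}
```

Words-to-L3 would run in the opposite direction, setting the categorical distribution directly from an utterance and then resampling L1 from it, per the operational cycle below.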

Page 19:

The three layers in relation to traditional logical sentences

[Figure: the three layers span the spectrum from SIGNALS (stochastic end) to SYMBOLS (deterministic end)]

Page 21:

Matching Algorithm

Match: Sensory Expectation to Incoming Sensory Data

3 Outcomes: Create, Update, Destroy
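A hedged sketch of this step: pair each incoming sensed object with its nearest sensory expectation (an object already in the GSM); unmatched incoming data proposes a Create, matched pairs propose Updates, and unmatched expectations propose a Destroy. The nearest-neighbor gate and the names are assumptions; the slide only names the three outcomes.

```python
import numpy as np

def match_expectations(expected: dict, incoming: list, gate: float = 0.1):
    """expected: {object_id: predicted position (np.ndarray)};
    incoming: list of sensed positions. Returns (creates, updates, destroys)."""
    updates, creates = {}, []
    unmatched = set(expected)
    for obs in incoming:
        best = min(unmatched, key=lambda i: np.linalg.norm(expected[i] - obs),
                   default=None)
        if best is not None and np.linalg.norm(expected[best] - obs) < gate:
            updates[best] = obs        # UPDATE an existing GSM object
            unmatched.remove(best)
        else:
            creates.append(obs)        # CREATE a new GSM object
    return creates, updates, unmatched  # unmatched expectations -> DESTROY
```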

Page 22:

Example Contents of GSM

At time = 137 (present):

Agent #1 ("I")
  Physical: Compound Object with 5 objects (body parts)
    Object #1 ("head")
      Property #1 ("position")
        Layer #1 (Stochastic): Circular Gaussian Mix G1{m=(1.5, 1.89, 2), s=2}, G2{…}
        Layer #2 (Continuous): (1.6, 1.85, 2.05)
        Layer #3 (Categorical): #7 ("At the center")
      Property #2 ("pose") …
    Object #2 ("neck") …
  Interface:
    Viewpoint: R eye center (1.4, 1.89, 2.1), viewdir (-14deg, 5deg, 0deg)
    Mover: Head Motor: attachment point (1.5, 1.89, 1.8), rot axis: (…) …
  Mental:
    Goals: …
    Estimated GSM: (what Robot believes Robot's GSM contents are) …

Agent #2 ("You")
  Physical: … Interface: … Mental: …
    Estimated GSM: (what Robot believes Human's GSM contents are)

Page 23:

Operational Cycle

Basic Cycle

Init L1, L2, L3
DoForever {
  If (incoming sensory data)
    Senses-to-L1; L1-to-L2; L1-to-L3
  Else If (incoming speech data)
    Words-to-L3; L3-to-L1; L1-to-L2
  Else (no incoming data – diffuse)
    Diffuse-L1; L1-to-L2; L1-to-L3
  Endif
  If (incoming request for words)
    L3-to-words
  Else If (incoming request for action parameters)
    L2-to-actionparams
  Endif
}

Senses-to-L1 contains the EXPECTATION MATCHING algorithm (sketched above)

Distribution across modules

VISOR and PROPRIOCEPTOR: contribute sensory data

IMAGINER: contributes speech data

INQUIRER and REMEMBERER: create words and action parameters

GSM: diffuses data that have not been updated

Page 24:

From Situation to History

Moments: Just copy and index current contents (situations)! [No filtering/consolidation yet.]

Events:
Event classifiers take input from GSM contents
Currently hand-coded; in the future HMMs / CFGs etc.
Continuously updated "live" events
<type> <participants> <time0> <time1> <parameters>

Two types of classifiers:
1) categorical classifiers ("red", "small", "heavy")
2) event classifiers ("appeared", "x started touching y") – see the sketch below
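A hedged sketch of the event record from the slide, plus one hand-coded event classifier ("x started touching y") run over successive GSM moments; the contact threshold and helper names are assumptions:

```python
from dataclasses import dataclass, field

import numpy as np

@dataclass
class Event:
    type: str                 # <type>, e.g. "started_touching"
    participants: tuple       # <participants>: object/agent ids
    time0: int                # <time0>: onset moment index
    time1: int | None = None  # <time1>: offset (None while the event is "live")
    parameters: dict = field(default_factory=dict)  # <parameters>

def touching(moment, a, b, eps=0.02):
    """Categorical test: are objects a and b in contact at this moment?
    `moment` maps object ids to positions (a stored GSM situation)."""
    return np.linalg.norm(moment[a] - moment[b]) < eps

def started_touching(prev, curr, a, b, t):
    """Fires when contact holds now but did not at the previous moment."""
    if touching(curr, a, b) and not touching(prev, a, b):
        return Event("started_touching", (a, b), time0=t)
    return None
```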

Page 25:

Semantics of verbs & adjectives

Adjective meaning: categorical classifier

Verb meaning (second/third person): event classifier

Verb meaning (first person): action controller

Page 26:

Video Example Revisited

Ripley answers, imagines and remembers… Notice the contents of the layers!

Page 30:

Embodiment

Hardware:

7 DOF with force-feedback actuators

Gripper with force-sensitive touch sensors

Joint angle encoders

Dual cameras mounted around head

Microphone / Speakers

8 PCs with gigabit net / Linux / PVM

Environment: table/objects/human

Goal: Conversational Helping Hand

Page 31:

Behavioral specs that were set

1) Answering questions about the present physical context: "What color is the one on the left?"

2) Quantifying and expressing confidence in beliefs: "Where is the red one?" (red object out of view for a while) – "Probably at the left, but maybe not"

3) Responding to spoken motor requests: "Look to the left", "Hand me the small red one"

4) Imagining situations described through language: "Imagine an object at the left", "There is a small object at the right" (& sensory updates!)

5) Remembering and resolving temporal referents: "What color was the object that was at the center of the table when the red one appeared?"

6) Resolving indexical human hand pointing: "What color is this one" + pointing

7) Recognizing human hand-related events: "Where was the blue one when my hand started touching the red one?"

8) Extending the speech interface to multiple languages: Greek / English; rudimentary GSM-mediated translation

Page 32:

GSM Customization

AGENTs: for Robot, Human, Objects

PROPERTY structures (sketched below):
P1 - Position (3D): object center position
P2 - Pose (3D): object pose in scaled quaternions, converted to matrices / angles
P3 - ShapeParams (3D): object shape parameters (radius for spheres) etc.
P4 - Weight (1D): as acquired from the force-sensitive sensors
P5 - Color (3D): simple RGB-space color
P6 - 3D voxelised occupancy grid: for 3D model acquisition

EVENT types:
appeared / disappeared
came in view / came out of view
started moving / stopped moving
started touching / stopped touching

SENSORY connections: Vision, Language, Proprioception
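As a minimal sketch (field names are illustrative, not the thesis code), the customized property set might be bundled per object like this, with each field ultimately held as a three-layer property as sketched earlier:

```python
from dataclasses import dataclass

import numpy as np

@dataclass
class ObjectProperties:
    position: np.ndarray      # P1: 3D object-center position
    pose: np.ndarray          # P2: pose in scaled quaternions, converted
                              #     to matrices / angles as needed
    shape_params: np.ndarray  # P3: 3D shape parameters (radius for spheres etc.)
    weight: float             # P4: 1D, from the force-sensitive touch sensors
    color: np.ndarray         # P5: simple RGB-space color
    occupancy: np.ndarray     # P6: 3D voxelised occupancy grid (model acquisition)
```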

Page 33:

Vision System

[Figure: vision pipeline – Capture feeds an Object Segmenter, a Skin Segmenter, and a Face Detector; each feeds a 2D Objecter turning regions into blobs (ellipses for objects and skin, rectangles for faces); left- and right-channel blobs are fused by StereObjecters (Objects / Skin / Faces); stereo & mono blobs and mono regions flow through the Visor into the Situation Model; the right channel pipeline mirrors the left]

Segmenters: trainable color-based probabilistic, with ellipse fit

Face Detector: Viola-Jones based (OpenCV)

StereObjecter: centroid/endpoint correspondence

Create / Delete / Update proposals
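For the face-detector stage, the slide names Viola-Jones via OpenCV; a minimal sketch with today's OpenCV Python bindings (stock cascade file and parameters, not necessarily those of the thesis system) would be:

```python
import cv2

# Stock OpenCV Haar cascade (Viola-Jones detector).
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_faces(frame_bgr):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    # Returns face blobs as rectangles (x, y, w, h), matching the
    # "blobs (rectangles)" output of the face branch in the pipeline.
    return cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
```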

Page 34:

Spoken Language Processing

Transcript parsing (see the sketch below):
Keyword-based semantic frames
Speech-act type classification
Tense determination
Description extraction

Object referent resolution:
Body-part / attributive descriptions, indexicals with pointing
Singular / Plural
Disambiguating questions

Temporal referent resolution:
Recency criterion, time of description
Object should exist for the action

Speech generation:
Decision trees / templates
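A hedged sketch of keyword-based semantic-frame extraction, with the vocabulary drawn from the behavioral repertoire slide below; the real parser's frame fields and logic are more elaborate:

```python
ACTIONS = {"touch", "pick up", "hand me", "look at"}
COLORS = {"red", "green", "blue"}
SIZES = {"small", "medium", "large"}
LOCI = {"center", "left", "right", "top", "bottom"}

def parse(utterance: str) -> dict:
    """Fill a flat semantic frame by keyword spotting (toy version)."""
    u = utterance.lower()
    return {
        "speech_act": ("question" if u.split()[0] in {"where", "what", "how"}
                       else "action_request"),
        "tense": "past" if (" was " in u or " were " in u) else "present",
        "action": next((a for a in ACTIONS if a in u), None),
        "color": next((c for c in COLORS if c in u), None),
        "size": next((s for s in SIZES if s in u), None),
        "locus": next((l for l in LOCI if l in u), None),
    }

# parse("hand me the small red one")
# -> {'speech_act': 'action_request', 'tense': 'present',
#     'action': 'hand me', 'color': 'red', 'size': 'small', 'locus': None}
```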

Page 35:

Verbalizing Uncertainty

Gradations

Input:

P(first): probability of the most probable category

P(second): probability of the second most probable category

Entropy Ratio: Entropy / Max Entropy
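A hedged sketch of how these three inputs could drive the choice of hedge words; the thresholds and exact wording are assumptions, since the slide only names the inputs:

```python
import math

def verbalize(dist: dict[str, float]) -> str:
    """dist: the L3 categorical distribution, e.g. {"left": 0.7, "right": 0.3}."""
    cats = sorted(dist, key=dist.get, reverse=True)
    p_first = dist[cats[0]]
    p_second = dist[cats[1]] if len(cats) > 1 else 0.0
    entropy = -sum(p * math.log(p) for p in dist.values() if p > 0)
    max_entropy = math.log(len(dist)) if len(dist) > 1 else 1.0
    ratio = entropy / max_entropy          # Entropy Ratio from the slide
    if p_first > 0.95:
        return f"It is at the {cats[0]}."
    if ratio > 0.9:
        return "I am not sure where it is."
    if p_first - p_second > 0.3:
        return f"Probably at the {cats[0]}, but maybe not."
    return f"Maybe at the {cats[0]}, or maybe at the {cats[1]}."
```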

Page 36:

Input:

Nmatch: number of objects matching the description given by the human

Number: whether the human's description was singular or plural

From referent resolution to action execution

Page 39:

Current Behavioral Repertoire

The system responds to the following utterance classes:

1. Questions (present/past):
<question> is/are the <obj des>
<question> was/were the <obj des> when <event des>

2. Action requests (referent description in present/past):
<action> the <obj des>
<action> the <obj des> when <event des>

3. Imaginer request: Imagine/There is a <obj des>

4. Location-centered "look at": Look at <location>

5. Viewpoint: <action> the one on my/your left/right

6. Basic mode switching: Wake up, sleep, relax, look at me/the table

Types:

<question> belongs to {where, how big, what color}

<action> belongs to {touch, pick up, hand me, look at}

<obj des> contains <size> <color> <object> <locus>

<event des> contains <actor> <event type>

<size> belongs to {small, medium, large}

<color> belongs to {red, green, blue}

<locus> belongs to {center, left, right, top, bottom}

<actor> is either an <obj des> or <agent part>

<agent part> belongs to {my, your} × {head, arm}

<event type> belongs to {appeared, disappeared, started moving, stopped moving, came in view, came out of view}

7. Recent extensions:
Indexicals: <obj des> extended to {this} + finger pointing
Touch events: <event des> extended to <actor> <event type> <patient>; <event type> extended: {started touching, stopped touching}
Multilingual support: English/Greek (automatic language switching)

Page 40:

V. Evaluation

Novelties / Qualitative

Generics: levels of evaluation for SCAs, psychological tests

Quantitative

Comparison with the Token Test

Page 41:

Novelties above state-of-the-art

Our system exhibits three novelties:

N1. Imagining situations described through language “Imagine a large object at the left!”

N2. Remembering and resolving temporal referents“Where was the green one when the red one appeared?”

N3. Quantifying and expressing confidence in beliefs “It is probably at the left”

Some comparable systems: [Crangle and Suppes 94], [McGuire et al 2002], [Sofge et al 2003], [Breazeal et al 2004]

Page 42:

Levels of Evaluation for SCAs

It is proposed that an SCA may be evaluated at various levels (& by analogy with humans!):

E1) Module-level: One could attempt to quantify the performance of each of the system's components (speech recognition, vision etc.). However, this would not be highly informative about the effectiveness of the system as a whole.

E2) System-level: One could evaluate through measurables that involve more than one component. For example, for the composite VisionSystem-Visor-SituationModel chain, how much statistical error or delay exists between the "outside world" (external reality) and the "inside world" (situation model)? What is the transfer function between the two?

E3) Task-level: For a repertoire of specific tasks, and some criteria of success, how often is the robot successful in carrying out these tasks? For example, in responding to a human saying "hand me the red one" by indeed handing him the red one?

E4) User satisfaction: One could measure human satisfaction while interacting with the system, through questionnaires or otherwise.

E5) Human replacement test: How close to a human are the SCA's responses, given the same situations and utterances heard?

Also: Guidelines for Using Human Psychological Tests as Design Specs / Evaluation for SCAs

Page 43:

Quantitative Evaluation

Task-Level (T1): 10 turns

Rich mix of commands

Repeat command if unsuccessful

Error types:

SRE: Speech Recognition

CME: Category Misclassification

VSE: Vision System

MME: Mechanical Manipulation

Page 45:

Quantitative Evaluation

Module-Level

T2: "touch" 100%, "hand me" 82% (grip slip)

T3: “what color?” red 100%, green 64%

Page 46:

The Token Test for Children

Psychological Tests usable as:

Evaluations and Behavioral Specifications

A test for 3-year-old children & aphasics!

Page 47:

Ripley vs. Token Test

ALSO: Ripley surpasses the Token Test in other directions (inform, remember…)!

Page 48:

VI. Future Extensions plan

A 5-10 year research plan… The current system provides SIGNIFICANT LEVERAGE towards these extensions!

Sensory domain: X1-3: object vision, tactile, human activity

Cognitive / Affective domain: X4-6: embedded situation models, language, plans / goals / affect, LTM

Real-world operation and design / theory: X7-10: online learning, evaluation, design, theoretical reflections

Page 49:

Future Extensions plan: Sensory domain

X1) Better Visual Abilities:
X1.1 Richer shapes & shape recognition
X1.2 Active sensing
X1.3 Richer spatial relations (containment, support, etc.)

X2) Better Tactile Abilities:
X2.1 Soft/hard, texture, and temperature sensing
X2.2 Shape by touch

X3) Better support for Mind Reading / Embedded situation models:
X3.1 Eye gaze direction
X3.2 Absence/presence of humans, face recognition, multiple speakers
X3.3 Extend recognition of human-related events
X3.4 Keep history of interactions with each human

Page 50:

Future Extensions plan: Cognitive / Affective domain

X4) Better Language subsystem:
X4.1 Online naming of objects and agents
X4.2 Conditional execution of actions
X4.3 More systematic treatment of vocabulary, speech acts, descriptions etc.
X4.4 Extend inform speech-act handling for story understanding / future anticipation

X5) Mixed motor/speech planning, explicit goals, inference, affective state:
X5.1 Translate categorical layer to FOPC
X5.2 Perform simple inference on the basis of the translated categorical layer
X5.3 Perform situational commonsense knowledge acquisition
X5.4 Devise goal representations and perform simple planning
X5.5 Perform online collaborative planning / action coordination
X5.6 Introduce affective state for the robot, fed through goal satisfaction, novelty etc.
X5.7 Introduce affective state for the human, fed through expression recognition etc.
X5.8 Affective learning for the robot by observation of human affective responses

X6) Core GSM enhancements:
X6.1 Extend basic GSM operations: attentional focusing etc.
X6.2 Better memory subsystem: consolidation, long-term experience acquisition
X6.3 Incorporate more results from cognitive science research

Page 51:

Future Extensions plan: Real-world operation / design

X7) Integrated Online Learning:
X7.1 Integrate online learning of adjectives, verbs & parameters with normal operation
X7.2 Online learning of tasks and situationally-appropriate actions

X8) Passing Landmark Tests / Evaluations:
X8.1 Pass all of the "Token Test" with a professional testgiver
X8.2 Perform simple story understanding tests
X8.3 Perform cooperative assembly task tests
X8.4 Perform extensive tests through the levels described in section 7.2

X9) Standardisation / Open Sourcing / Design Tools:
X9.1 Open-source a generic library of basic GSM representations and processes
X9.2 Build design tools for semi-automated CAD of GSM-based SCAs
X9.3 Build an affordable educational GSM-based robot (One Robot Per Child)

X10) Theoretical:
X10.1 Alignment of Situation Models
X10.2 Empirically Sound Semantics

Scalability! Hierarchical learnability / open sourcing – a huge teacher/developer base. Achieving intention recognition and activity coordination, the prerequisites for the ideal SCA.

Page 52:

VII. Discussion / relation to AI, semantics, philosophy…

Block 1: Language grounding, Situated language
Block 2: Abstract entities, Imaginary situations
Block 3: Semantics, Situations
Block 4: Logic, Ontologies
Block 5: Stance X and its relation to GSMs

Page 53:

Discussion

9.1 Language grounding, Situated language
X 9.1.1 Meaning spaces
9.1.2 Unifying speech with sensing and acting
X 9.1.3 Grounding the whole of language
9.1.4 Levels of Detachment from the "here-and-now"

9.2 Abstract entities, Imaginary situations
9.2.1 Do abstract entities exist in a primary manner?
9.2.2 Can abstract entities exist in a primary manner in GSMs?
9.2.3 Representing abstract entities within situation models
9.2.4 Inform speech acts, Imagine statements & the "Imaginer" module
9.2.5 On the possibility of "Pure Imagination"

9.3 Semantics, Situations
X 9.3.1 The semantics of adjectives and verbs within the GSM proposal
X 9.3.2 Successive approximations of meaning models
9.3.3 Comparative adjectives and better spatial semantics
9.3.4 Similarity of situations
9.3.5 The Enumerability of Situations

9.4 Logic, Ontologies
X 9.4.1 Bridging the categorical layer with FOPC
9.4.2 Grounding ontologies

9.5 Stance X and its relation to GSMs
X 9.5.1 Embodiment and GSMs
X 9.5.2 SHRDLU and GSMs
9.5.3 Procedural semantics and GSMs
X 9.5.4 A proposal for Empirically Sound Semantics
9.5.5 The intentional stance and GSMs
9.5.6 The homunculus attack and GSMs
9.5.7 Some of Minsky's views and GSMs

Page 54:

Meaning Spaces & Grounding the whole of language

Page 55:

Semantics of verbs & adjectives

Adjective meaning: categorical classifier

Verb meaning (second/third person): event classifier

Verb meaning (first person): action controller

Page 56:

Successive Approximations of Meaning Models

F1 (absolute), 1st order approx: category = f(property value)

F2 (noun-dependent), 2nd order approx: category = f(property value, noun)

F3 (situational-context-dependent), 3rd order approx: category = f(property value, noun, values of the same property for other relevant objects)

"the small cup" when surrounded by other large cups

vs. "the small cup"

when surrounded by miniature cups

"small elephant" vs.

"small ant"

Example: Small = radius less than 3cm
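A hedged sketch of the three approximations for "small", using the slide's F1 threshold (radius less than 3cm); the F2/F3 reference values and scaling factor are illustrative assumptions:

```python
def small_f1(radius_cm):
    # F1 (absolute): category = f(property value)
    return radius_cm < 3.0

def small_f2(radius_cm, noun):
    # F2 (noun-dependent): category = f(property value, noun)
    typical_cm = {"cup": 4.0, "elephant": 200.0, "ant": 0.3}  # assumed values
    return radius_cm < 0.75 * typical_cm[noun]

def small_f3(radius_cm, noun, context_radii_cm):
    # F3 (situational-context-dependent): compare against the same property
    # of the other relevant objects in the situation (the surrounding cups).
    mean_ctx = sum(context_radii_cm) / len(context_radii_cm)
    return radius_cm < 0.75 * mean_ctx
```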

Page 57:

GSMs and FOPC

Example of translation:

From Event(now, Object1, ismoving) to [FOPC formula shown in the slide's figure]

ADVANTAGE: we become able to use all FOPC-based methods
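The target formula is a figure on the original slide; as a hedged illustration only (not the slide's actual formula, and with predicate names chosen here), such a translation might read:

```latex
% Hedged illustration; predicate names are assumptions.
\exists e \,\big(\mathit{Moving}(e) \wedge \mathit{Agent}(e,\mathit{Object1}) \wedge \mathit{Holds}(e,\mathit{now})\big)
```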

Page 59:

SHRDLU vs. GSMs

Differences: SHRDLU exists in a virtual world, while the GSM-based Ripley exists in the real world.

1) No need for sensing - world continuously & totally accessible at object & property level;

2) No ExternalReality-to-SituationModel process – no segmentation, property extraction etc.

3) The Situation Model IS the only Reality in SHRDLU's case – and this reality is totally accessible, not partially hidden, noisy, or only partially visible at a time.

4) No need for a stochastic layer, no need for object permanence – the omniscient "eye" of SHRDLU knows all objects with their unique IDs, and they will never disappear from sight.

5) No concept of "sensory expectation" and verification of sensory expectation – it can never be the case that reality does not match its estimate (the situation model), because reality IS the situation model.

6) No learning whatsoever of the semantics of words – "red" is not attached to any sensory-space semantics through examples, "pick up" is never taught by demonstration.

7) No "imagination" in the sense used in this document either – no arbitrary recombination of sensory-derived elements, no sensory expectation formed that can be verified or disproved.

8) Actions in SHRDLU always have their predicted effects.

9) No user model or self model – no embedded models.

Similarities:
1) Some "understanding" of natural language – and dialogue with the user about the virtual / real world
2) 3D visualizations of the beliefs of the system about the state of the world
3) Keeping of sequences of events etc.

The differences run much deeper than the apparent similarities! SHRDLU was an amazing system in its time; but in order to get to the next level and build SCAs in real environments, GSMs are crucial…

Page 60:

Requirements for Empirically Sound Semantics

Motivational Question: What is a meaningful definition of the "meaning" of an action that fulfills the following three requirements?

R1) It is empirically sound.

Meanings defined in this way should be estimable in a computational manner, on the basis of external observations of a set of agents acting and possibly interacting (not by human expert introspection).

R2) It is not defined on the basis of a set of high-level natural-language "concepts".

I.e., it is defined in terms of an agreed-upon description of externally observable states of the world, where the world is modelled in some non-language-like model.

R3) It contains both an internal / mental and an external / physical part.

(Within clearly explicated “mind” and “world” models)

Page 61:

A proposal forEmpirically Sound Semantics

Meaning (of a given sign, at a given instant) =

Production Conditions and Reception Effects, across all agents

Production Conditions(sign, agent) =
Which of all possible past experiences will make this agent produce the sign (past experiences which always ultimately ground out to states of external reality) =
A classifier on all pasts, which dichotomizes them into those that will result in the production of the sign, and those that won't

Reception Effects(sign, agent) =
How the receiving agent's behavior will be modified for all possible futures, which will in turn further constrain all subsequent futures (states of external reality) after the agent's actions =
A further constraint on all possible futures after the reception of the sign
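Read as a definition, the proposal can be written compactly; this is a hedged formalization of the slide's wording, with the set names chosen here for illustration:

```latex
% H = all possible past experience histories; F = all possible futures.
\mathrm{Meaning}(s,t) \;=\; \big\{\, \big(\mathrm{PC}(s,a),\ \mathrm{RE}(s,a)\big) \;:\; a \in \mathrm{Agents} \,\big\}
\quad \text{where } \mathrm{PC}(s,a) : H \to \{0,1\} \ \text{ and } \ \mathrm{RE}(s,a) \subseteq F .
```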

Page 63:

Central Theme

How can we build cooperative Situated Conversational Assistants around Grounded Situation Models, which are amodal representations of current, past or imagined situations, that can be fed both by the senses and by language?

Page 64:

Vision and Contributions

Vision: Cooperative Situated Conversational Assistants (SCAs)

Contributions: The Grounded Situation Model (GSM) proposal:

Motivation and Requirements

Explicit Computational Model

A modular Implementation Architecture

An Implemented System

Quantitative and Qualitative Evaluations

A discussion of a Future Extensions plan

A discussion with proposals: GSMs in relation to theories in AI, cognitive science, and semantics

Other smaller contributions: a rough method for designing GSM-based systems

Page 65:

News

1) Three Novelties above the state of the art in conversational robots

2) Behavioral repertoire comparable to / surpassing the Token Test for children

3) Quantitative Performance: median 1 correction (repeat) per 10-step task, 100% task completion

Page 66:

Epimetron

In the introductory section, we stated upfront and discussed the vision of this thesis – SCAs.

But taking a higher viewpoint, one might ask: why do we need SCAs, for humanity?

Or taking an even higher viewpoint, one should ask: where should SCAs and intelligent robots fit within the Order of the world?

This is a question that deserves deep consideration and significant future-reaching examination. Thus, I will end this thesis not with an answer, but with a wish:

May the day come when humans and their artificial partners can co-exist in harmony and peace, enabling each other to utilize their full potential, and most importantly, helping each other towards fulfilling their true purpose of existence.

Page 67:

OPTIONAL MATERIAL

Behavioral Repertoire

Design Method

All the material from the previous presentation

All the new material

Page 68:

Some questions to explore…

Meaning of Words: How can we empirically learn models of the meaning of words?

What is wrong with today’s NLP?

Cooperative Machines:What does it mean for a machine to be “cooperative”?

Cognitive Architectures:How can we interconnect vision, motor control, and speech?

How can we go from percept to words & vice-versa?

Internal Models: What if there was a “theatre in the head” that is filled with present, past or imagined “Situation Models?”

Areas implicated: AI, Vision, Robotics, Linguistics, Philosophy

Page 69:

Word meaning?

Important Theoretical Questions:
Q1a) What is the meaning of an utterance?
Q1b) Where is it?
Q1c) How should it be represented?

“A BLUE BALL” = ???

“JOHN KICKED THE BALL” = ???

"I LIKED REMEMBERING YESTERDAY'S LUNCH" = ???

Referential vs. Functional Meaning; Semantics vs. Pragmatics

Origin of signs / Empirical Semantics

[Figure: the "Semiotic Triangle" – MIND (concepts), WORLD (objects), SIGN (words)]

Page 70:

The Central Problem

The Symbol Grounding Problem

1) How is meaning represented in traditional NLP systems?

Semantic Networks

2) Very much like:

Dictionary Definitions: Apple = a red fruit that has a kernel

Μήλο = είναι ένα κόκκινο φρούτο που έχει ένα κουκούτσι (the same definition, in Greek)

3) Thus, one inescapable problem:

Circularity

But how do children solve it? How do we solve it? See next slide…

Page 71:

Escaping the Problem

Our Proposed Solution: Equip machines with:
- a starting vocabulary whose meaning is acquired by "being in the world" (meaning("apple"), meaning("red") etc.)
- as well as a way to compose meanings: meaning("red apple") = f(meaning("red"), meaning("apple"))
- Then, by analogy, extension to abstract concepts could potentially be addressed (meaning("justice") = ? meaning("opposite") = ?)

Meaning Spaces:
1. Sensory ("red", "loud")
2. SensoryMotor ("jump", "search")
3. Goal-oriented ("it is hot in here")

Bigger picture: the semiotic triangle (see the composition sketch below)
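A hedged sketch of grounded meaning composition: word meanings as classifiers over sensory property values, composed by conjunction. The composition rule f and the feature thresholds are assumptions; the slide only states that some f is needed.

```python
# Grounded "starting vocabulary": each word is a classifier over sensory
# properties of an object (toy feature names and thresholds).
meaning = {
    "red":   lambda obj: obj["hue"] < 0.05 or obj["hue"] > 0.95,
    "apple": lambda obj: 0.02 < obj["radius_m"] < 0.06,
}

def compose(*words):
    # One possible f: meaning("red apple") = AND of the word classifiers.
    return lambda obj: all(meaning[w](obj) for w in words)

red_apple = compose("red", "apple")
print(red_apple({"hue": 0.98, "radius_m": 0.04}))  # True
```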

Page 72:

Practical Pathways for models of grounded meaning

Three Practical Pathways:

Robots – Games – AV recordings

Page 73:

Interconnecting vision, motor control, and language

Q2) How can we interconnect existing subsystems?

And then, what is in the star? (Reps / Processes)

How do we map psychological terms:Beliefs? Goals? Intentions? Emotions?


Page 74:

Are there mental models in the head?

Q3) Is a mental model constructed (and if so, how) with structure analogous to the situation in which we are embedded, or which we are remembering / imagining?

What are its contents?

What are its main operations?

Page 75:

Theoretical Steps

Theoretical agent model: a proposal which is GSM-based.

Design methods: incremental, producing a design given a behavioral spec.

Psychological tests: using them for specification / evaluation.

Page 76:

The representation: What the GSM is and what it is not…

The GSM is amodal – it is NOT:
- A viewer-dependent 2D image
- An absolute 3D physical model (it even includes invisibles such as the beliefs of others)
- An unstructured list of language-like propositions

The GSM is a representation:
- Accumulating information from multiple modalities (vision, touch, proprioception, language)
- Which has a structure analogous to the situation the robot is embedded in or is imagining
- Which might extend to the past / future (not only the present)
- Which contains Mental and Physical parts in agent descriptions – supporting "embedded GSMs" (estimated GSMs of other agents)
- Similar to "Situation/Mental/Internal Models" in CogSci (Johnson-Laird, Minsky)

The GSM is NOT only a representation, but also a set of associated standard processes!

Page 77:

Theory: GSM-Based Agent Models

Representations:
R1) Present Situation model: known world (not only physical necessarily)
R2) Internal state (goals, affect)
R3) Past states/situations (includes events)
R4) Future predictions
R5) Action list

Processes:
P1) Sensor-to-SituationModel process
P2) SituationModel-to-InternalState process
P3) Storage/Maintenance of Past process (includes event recognition)
P4) Future Prediction Generation process
P5) Action Selection process (SituationModel-to-action table or planner)

Also: further breakdown! And more arrows…

[Figure: GSM-based agent model – inside the AGENT, Present Situation (R1), Past (R3), Future (R4), Goals (R2) and the Action List are connected by processes P1, P3, P4, P5 to Sensors and Effectors interfacing with the WORLD; a loop sketch follows below]
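A hedged sketch wiring R1-R5 through P1-P5 as one sense-think-act cycle (all process bodies are stubs; the real processes are the modules described in the implementation sections):

```python
class GSMAgent:
    def __init__(self):
        self.present = {}   # R1: present situation model (known world)
        self.internal = {}  # R2: internal state (goals, affect)
        self.past = []      # R3: past states/situations, incl. events
        self.future = []    # R4: future predictions
        self.actions = []   # R5: action list

    def step(self, percept):
        self.present = self.p1_sense(percept)                    # P1
        self.internal = self.p2_appraise(self.present)           # P2
        self.past.append(self.p3_store(self.present))            # P3 (+ events)
        self.future = self.p4_predict(self.past)                 # P4
        action = self.p5_select(self.present, self.internal,
                                self.future)                     # P5
        self.actions.append(action)
        return action

    # Stub processes; each stands in for a whole module of the real system.
    def p1_sense(self, percept): return dict(percept)
    def p2_appraise(self, present): return {"goals": []}
    def p3_store(self, present): return dict(present)
    def p4_predict(self, past): return []
    def p5_select(self, present, internal, future): return None
```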

Page 78:

A closer view: Reality-to-SituationModel

External Reality → M1: Sensor Projection → P1a: Individuation → P1b: Property Extraction → P1c: Category Projection → Internal Situation Model

Spatiotemporally local external reality is projected into a “sensory stream”

Sensory streams are segmented into "objects" – i.e. some information is marked as pertaining to an object, which object is also given a unique ID

All the segments of the sensory streams that resolve to a common object are collected, and properties are calculated on their basis: color, size…

Continuous-valued or fine-granularity property values are quantized to meaningful categories – color is quantized to "red", "green", "yellow" etc.

SITUATION-VIEWPOINT of an agent: {M1, P1a, P1b, P1c}(t)

All the info required to go from External Reality to Situation Model
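A hedged sketch of the M1 / P1a / P1b / P1c chain as four composable stages; the stage names follow the slide, but the toy data format (measurements tagged with a 'blob' key and a scalar 'color') is an assumption:

```python
def m1_sensor_projection(reality):
    # M1: project local external reality into a "sensory stream" (toy: pass-through).
    return list(reality)

def p1a_individuation(stream):
    # P1a: segment the stream into objects, each given a unique ID.
    objs = {}
    for m in stream:
        objs.setdefault(m["blob"], []).append(m)
    return {i: ms for i, ms in enumerate(objs.values())}

def p1b_property_extraction(objects):
    # P1b: compute properties per object from its collected segments.
    return {i: {"color": sum(m["color"] for m in ms) / len(ms)}
            for i, ms in objects.items()}

def p1c_category_projection(props):
    # P1c: quantize fine-grained values to meaningful categories.
    def word(c): return "red" if c < 0.33 else "green" if c < 0.66 else "blue"
    return {i: {"color": word(p["color"])} for i, p in props.items()}

def situation_viewpoint(reality):
    # {M1, P1a, P1b, P1c}(t): External Reality -> Internal Situation Model.
    stream = m1_sensor_projection(reality)
    return p1c_category_projection(p1b_property_extraction(
        p1a_individuation(stream)))
```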

Two questions of fundamental Importance:Q1) How do I estimate another agent’s situation model?Q2) How do I align situation models across agents?

Page 79:

Motivation

How do we interconnect language with perception/action subsystems in an SCA? (why NLP cannot PNP)

How can we find a systematic way to build an SCA starting from a behavioral specification?

GSMs as "structured blackboards" GSMs as cognitive science-like "situation

models" GSMas a "theatrical stage"