Hammer & Hitzler ● KI2005 Tutorial ● Koblenz, Germany ● September 2005

Slide 1
Connectionist Knowledge Representation and Reasoning
(Part I)
Neural Networks and Structured Knowledge
SCREECH
Fodor & Pylyshyn: "What's deeply wrong with Connectionist architecture is this: Because it acknowledges neither syntactic nor semantic structure in mental representations, it perforce treats them not as a generated set but as a list." [Connectionism and Cognitive Architecture, 1988]
Our claim: state-of-the-art connectionist architectures do adequately deal with structures!
Slide 2
Tutorial Outline (Part I): Neural networks and structured knowledge
• Feedforward networks
  – The good old days: KBANN and co.
  – Useful: neurofuzzy systems, data mining pipeline
  – State of the art: structure kernels
• Recurrent networks
  – The basics: partially recurrent networks
  – Lots of theory: principled capacity and limitations
  – To do: challenges
• Recursive data structures
  – The general idea: recursive distributed representations
  – One breakthrough: recursive networks
  – Going on: towards more complex structures
Slide 4
The good old days – KBANN and co.
[Figure: a feedforward neural network computes a function f_w : ℝ^n → ℝ^o, mapping an input x to an output y. A single neuron with inputs x_1, …, x_n, weights w_1, …, w_n, and threshold θ computes σ(w^T x − θ), where σ(t) = sgd(t) = (1 + e^(−t))^(−1) is the logistic sigmoid.]
Open questions:
1. black box
2. distributed representation
3. connection to rules for symbolic I/O?
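As a concrete illustration (not taken from the slides), here is a minimal NumPy sketch of such a sigmoid neuron and a one-hidden-layer feedforward network; the layer sizes and random weights are arbitrary placeholders.

```python
import numpy as np

def sigmoid(t):
    # sgd(t) = (1 + e^(-t))^(-1), the logistic activation from the slide
    return 1.0 / (1.0 + np.exp(-t))

def neuron(x, w, theta):
    # a single neuron computes sigma(w^T x - theta)
    return sigmoid(np.dot(w, x) - theta)

def feedforward(x, W1, theta1, W2, theta2):
    # a two-layer feedforward network f_w : R^n -> R^o
    hidden = sigmoid(W1 @ x - theta1)
    return sigmoid(W2 @ hidden - theta2)

# toy example with arbitrary dimensions
rng = np.random.default_rng(0)
n, h, o = 4, 3, 2
x = rng.normal(size=n)
y = feedforward(x, rng.normal(size=(h, n)), np.zeros(h),
                rng.normal(size=(o, h)), np.zeros(o))
print(y.shape)  # (2,)
```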
Slide 5
The good old days – KBANN and co.
Knowledge Based Artificial Neural Networks [Towell/Shavlik, AI 94]:
• start with a network which represents known rules
• train using additional data
• extract a set of symbolic rules after training
Pipeline: (partial) rules → initialize network → train on data → extract (complete) rules
Slide 6
The good old days – KBANN and co.
Rule insertion: the (partial) rules are propositional acyclic Horn clauses such as
  A :- B, C, D
  E :- A
These are mapped onto an FNN with one neuron per boolean variable: the neurons B, C, D feed neuron A with weights 1, 1, 1 and threshold 2.5 (= 3 − 0.5), and neuron A feeds neuron E with weight 1 and threshold 0.5.
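A minimal sketch (my own illustration, not the original KBANN code) of how such conjunctive Horn clauses could be compiled into weights and thresholds; the clause format and function names are assumptions.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def compile_rules(rules, variables):
    """Map propositional Horn clauses {head: [body vars]} onto network weights.

    Each variable gets one neuron; a conjunctive clause with k antecedents
    gets weight 1 on each antecedent and threshold k - 0.5."""
    idx = {v: i for i, v in enumerate(variables)}
    W = np.zeros((len(variables), len(variables)))
    theta = np.zeros(len(variables))
    for head, body in rules.items():
        for b in body:
            W[idx[head], idx[b]] = 1.0
        theta[idx[head]] = len(body) - 0.5
    return W, theta, idx

# the example from the slide: A :- B, C, D and E :- A
rules = {"A": ["B", "C", "D"], "E": ["A"]}
variables = ["B", "C", "D", "A", "E"]
W, theta, idx = compile_rules(rules, variables)

# forward-propagate truth values (1.0 = true); the rules are acyclic, so a few sweeps suffice
x = np.array([1.0, 1.0, 1.0, 0.0, 0.0])   # B, C, D true
for _ in range(2):
    x = np.maximum(x, sigmoid(10 * (W @ x - theta)))  # steep sigmoid approximates boolean behavior
print(dict(zip(variables, np.round(x, 2))))  # A and E become ~1
```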
Slide 7
The good old days – KBANN and co.
Training on data: use some form of backpropagation, and add a penalty to the error, e.g. for changing the weights away from their initial rule-derived values.
1. The initial network biases the training result, but
2. there is no guarantee that the initial rules are preserved, and
3. there is no guarantee that the hidden neurons maintain their semantics.
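One way to read "penalty for changing the weights" is a quadratic regularizer towards the rule-derived initial weights; this is a hedged sketch of such a cost term, not necessarily the exact one used by KBANN.

```python
import numpy as np

def penalized_error(y_pred, y_true, w, w_init, lam=0.1):
    """Squared error plus a penalty for moving away from the initial (rule) weights."""
    data_term = np.mean((y_pred - y_true) ** 2)
    rule_term = lam * np.sum((w - w_init) ** 2)
    return data_term + rule_term
```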
Slide 8
The good old days – KBANN and co.
Rule extraction ((complete) rules):
1. There is no exact direct correspondence between a neuron and a single rule, although each neuron (and the overall mapping) can be approximated arbitrarily well by a set of rules.
2. It is NP-complete to find a minimum logical description for a trained network [Golea, AISB'96].
3. Therefore, a number of different rule extraction algorithms have been proposed, and this is still a topic of ongoing research.
Slide 9
The good old days – KBANN and co.
Two families of extraction methods:
• decompositional approach: rules are read off the individual neurons and weights of the network
• pedagogical approach: the trained network is treated as a black box mapping x to y, and rules are extracted which describe this input-output behavior
Slide 10
The good old days – KBANN and co.
Decompositional approaches:
• subset algorithm, MofN algorithm: describe single neurons by sets of active predecessors [Craven/Shavlik, 94]
• local activation functions (RBF-like) allow an approximate direct description of single neurons [Andrews/Geva, 96]
• MLP2LN biases the weights towards 0/-1/1 during training and can then extract exact rules [Duch et al., 01]
• prototype-based networks can be decomposed along relevant input dimensions by decision tree nodes [Hammer et al., 02]
Observation:
• usually some variation of if-then rules is achieved
• small rule sets are only achieved if further constraints guarantee that single weights/neurons have a meaning
• tradeoff between accuracy and size of the description
Slide 11
The good old days – KBANN and co.
Pedagogical approaches:
• extraction of conjunctive rules by extensive search [Saito/Nakano 88]
• interval propagation [Gallant 93, Thrun 95]
• extraction by minimum separation [Tickle/Diederich, 94]
• extraction of decision trees [Craven/Shavlik, 94]
• evolutionary approaches [Markovska, 05]
Observation:
• usually some variation of if-then rules is achieved
• this amounts to symbolic rule induction, with a little (or a bit more) help from a neural network
Slide 12
The good old days – KBANN and co.
What is this good for?
• Nobody uses FNNs these days
• Insertion of prior knowledge might be valuable, but efficient training algorithms allow one to substitute it by additional training data (generated via the rules)
• Validation of the network output might be valuable, but there exist alternative (good) guarantees from statistical learning theory
• If-then rules are not very interesting, since there exist good symbolic learners for propositional classification rules
However:
• Propositional rule insertion/extraction is often an essential part of more complex rule insertion/extraction mechanisms
• It demonstrates a key problem, the different modes of representation, in a very nice way
• Some people, e.g. in the medical domain, also want an explanation for a classification
• There are at least two application domains where if-then rules are very interesting and not so easy to learn: fuzzy control and unsupervised data mining
Slide 14
Useful: neurofuzzy systems
[Figure: a control loop in which the controller observes the process and feeds a control signal back as process input.]
Fuzzy control:
if (observation ∈ FM_I) then (control ∈ FM_O)
where FM_I and FM_O are fuzzy sets over the observation space and the control space, respectively.
Slide 15
Useful: neurofuzzy systems
Fuzzy control:
if (observation ∈ FM_I) then (control ∈ FM_O)
Neurofuzzy control: a neural network maps observations to controls and implements the fuzzy if-then rules.
Benefit: the form of the fuzzy rules (i.e. the neural architecture) and the shape of the fuzzy sets (i.e. the neural weights) can be learned from data!
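To make the correspondence concrete, here is a small sketch (my own construction, not NEFCON or ANFIS) of a zero-order Takagi-Sugeno style controller: Gaussian membership functions play the role of the weights, and the rule list fixes the architecture; both could be fitted from data. All names and numbers are illustrative assumptions.

```python
import numpy as np

def gaussian_mf(x, center, width):
    # fuzzy membership for "observation is approximately `center`"
    return np.exp(-((x - center) ** 2) / (2 * width ** 2))

def fuzzy_control(observation, rules):
    """rules: list of (center, width, control_value) triples.
    Output: membership-weighted average of the rule consequents."""
    memberships = np.array([gaussian_mf(observation, c, w) for c, w, _ in rules])
    consequents = np.array([out for _, _, out in rules])
    return np.sum(memberships * consequents) / (np.sum(memberships) + 1e-12)

# e.g. "if temperature is low then heating is high", etc. (made-up numbers)
rules = [(15.0, 3.0, 1.0), (20.0, 3.0, 0.5), (25.0, 3.0, 0.0)]
print(fuzzy_control(18.0, rules))
```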
Slide 16
Useful: neurofuzzy systems
• NEFCON implements Mamdani control [Nauck/Klawonn/Kruse, 94]
• ANFIS implements Takagi-Sugeno control [Jang, 93]
• and many others …
• Learning
  – of rules: evolutionary or clustering
  – of fuzzy set parameters: reinforcement learning or some form of Hebbian learning
Slide 17
Useful: data mining pipeline
Task: describe given inputs (no class information) by if-then rules
• Data mining with emergent SOMs, clustering, and rule extraction [Ultsch, 91]
[Figure: pipeline from data through SOM and clustering to extracted rules.]
Slide 19
State of the art: structure kernels
[Figure: data (sets, sequences, tree structures, graph structures) is fed through a kernel k(x,x′) into an SVM.]
Idea: just compute pairwise kernel values (similarities) for this complex data using the structure information.
Slide 20
State of the art: structure kernels
• Closure properties of kernels [Haussler, Watkins]
• Principled problems for complex structures: computing informative graph kernels is at least as hard as deciding graph isomorphism [Gärtner]
• Several promising proposals; taxonomy [Gärtner], ranging from syntax to semantics:
  – count common substructures
  – derived from a probabilistic model
  – derived from local transformations
Slide 21
State of the art: structure kernels
Count common substructures, e.g. shared 2-mers of two sequences:

          GA  AG  AT
GAGAGA     3   2   0
GAT        1   0   1

kernel value = 3·1 + 2·0 + 0·1 = 3

Efficient computation: dynamic programming, suffix trees
• locality improved kernel [Sonnenburg et al.], bag of words [Joachims]
• string kernel [Lodhi et al.], spectrum kernel [Leslie et al.], word-sequence kernel [Cancedda et al.]
• convolution kernels for language [Collins/Duffy, Kashima/Koyanagi, Suzuki et al.]
• kernels for relational learning [Zelenko et al., Cumby/Roth, Gärtner et al.]
• graph kernels based on paths or subtrees [Gärtner et al., Kashima et al.]
• kernels for prolog trees based on similar symbols [Passerini/Frasconi/deRaedt]
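A minimal spectrum-kernel style sketch (my own illustration) that reproduces the 2-mer counts above by counting shared substrings:

```python
from collections import Counter

def kmer_counts(s, k=2):
    # count all overlapping substrings of length k
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def spectrum_kernel(s1, s2, k=2):
    # inner product of the k-mer count vectors
    c1, c2 = kmer_counts(s1, k), kmer_counts(s2, k)
    return sum(c1[m] * c2[m] for m in c1)

print(kmer_counts("GAGAGA"))             # {'GA': 3, 'AG': 2}
print(kmer_counts("GAT"))                # {'GA': 1, 'AT': 1}
print(spectrum_kernel("GAGAGA", "GAT"))  # 3
```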
Slide 22
State of the art: structure kernels
Derived from a probabilistic model: describe the data by a probabilistic model P(x), then compare characteristics of P(x).
• Fisher kernel [Jaakkola et al., Karchin et al., Pavlidis et al., Smith/Gales, Sonnenburg et al., Siolas et al.]
• tangent vector of log odds [Tsuda et al.]
• marginalized kernels [Tsuda et al., Kashima et al.]
• kernel of Gaussian models [Moreno et al., Kondor/Jebara]
Slide 23
State of the art: structure kernels
Derived from local transformations: a local neighborhood relation ("x is similar to x′") defines a generator H, which is expanded to a global kernel.
• diffusion kernel [Kondor/Lafferty, Lafferty/Lebanon, Vert/Kanehisa]
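For instance, the diffusion kernel expands a local generator H (here assumed to be the negative graph Laplacian) into a global kernel K = exp(βH) via the matrix exponential; a small sketch under that assumption:

```python
import numpy as np

def diffusion_kernel(adjacency, beta=0.5):
    """K = exp(beta * H) with H = -(D - A), the negative graph Laplacian."""
    A = np.asarray(adjacency, dtype=float)
    H = -(np.diag(A.sum(axis=1)) - A)
    # H is symmetric, so the matrix exponential can be computed via eigendecomposition
    eigvals, eigvecs = np.linalg.eigh(H)
    return eigvecs @ np.diag(np.exp(beta * eigvals)) @ eigvecs.T

# tiny path graph 0 - 1 - 2
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
K = diffusion_kernel(A)
print(np.round(K, 3))  # a symmetric, positive definite similarity matrix
```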
Slide 24
State of the art: structure kernels
• Intelligent preprocessing (kernel extraction) allows an adequate integration of semantic/syntactic structure information
• This can be combined with state-of-the-art neural methods such as the SVM
• Very promising results for
  – classification of documents and text [Duffy, Leslie, Lodhi, …]
  – detecting remote homologies of genomic sequences and further problems in genome analysis [Haussler, Sonnenburg, Vert, …]
  – quantitative structure-activity relationships in chemistry [Baldi et al.]
Slide 25
Conclusions: feedforward networks
• propositional rule insertion and extraction are possible (to some extent)
• useful for neurofuzzy systems and data mining
• structure-based kernel extraction followed by learning with an SVM yields state-of-the-art results
  but: this is a sequential rather than a fully integrated neuro-symbolic approach
• FNNs themselves are restricted to flat data which can be processed in one shot; there is no recurrence
Slide 27
The basics: partially recurrent networks
x_{t+1} = f(x_t, I_t)
[Elman, Finding structure in time, CogSci 90]: a very natural architecture for processing speech, temporal signals, control, robotics
1. can process time series of arbitrary length
2. interesting for speech processing [see e.g. Kremer, 02]
3. training uses a variation of backpropagation [see e.g. Pearlmutter, 95]
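A minimal sketch (my own, with arbitrary dimensions and random weights) of such an Elman-style recurrent step, where the context units hold the previous hidden state:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def elman_step(x_prev, inp, W_ctx, W_in, W_out):
    """One step of x_{t+1} = f(x_t, I_t): context + input -> new state -> output."""
    x_next = np.tanh(W_ctx @ x_prev + W_in @ inp)
    y = sigmoid(W_out @ x_next)
    return x_next, y

# unroll over a sequence of arbitrary length
rng = np.random.default_rng(1)
state_dim, in_dim, out_dim = 5, 3, 2
W_ctx, W_in, W_out = (rng.normal(scale=0.3, size=s) for s in
                      [(state_dim, state_dim), (state_dim, in_dim), (out_dim, state_dim)])
x = np.zeros(state_dim)
for inp in rng.normal(size=(7, in_dim)):   # a time series of length 7
    x, y = elman_step(x, inp, W_ctx, W_in, W_out)
print(y)
```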
Slide 29
Lots of theory: principled capacity and limitations
RNNs and finite automata [Omlin/Giles, 96]
[Figure: an RNN with input, state, and output units mirrors the dynamics of the transition function of a DFA.]
Slide 30
Lots of theory: principled capacity and limitations
DFA → RNN:
• unary (one-hot) input
• unary (one-hot) state representation
• implement (approximate) the boolean formula corresponding to the state transition within a two-layer network
RNNs can exactly simulate finite automata.
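A hedged sketch of that construction (my own toy code, using hard thresholds instead of steep sigmoids; the `delta` dictionary format and the parity example are assumptions): a first layer of neurons detects each (state, symbol) pair, and a second layer ORs together the pairs that lead to the same successor state.

```python
import numpy as np

def step(state_onehot, sym_onehot, delta, n_states, n_syms):
    """One DFA transition, computed by a two-layer threshold network.

    delta[(q, a)] = q' is the transition function."""
    # layer 1: one neuron per (state, symbol) pair, active iff both inputs are active (AND)
    pair = np.zeros((n_states, n_syms))
    for q in range(n_states):
        for a in range(n_syms):
            pair[q, a] = 1.0 if state_onehot[q] + sym_onehot[a] - 1.5 > 0 else 0.0
    # layer 2: one neuron per successor state, active iff some pair maps to it (OR)
    new_state = np.zeros(n_states)
    for (q, a), q_next in delta.items():
        new_state[q_next] = max(new_state[q_next], pair[q, a])
    return new_state

# toy DFA over {0, 1} that tracks the parity of 1s (2 states)
delta = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
state = np.array([1.0, 0.0])                   # start in state 0
for bit in [1, 1, 0, 1]:
    sym = np.eye(2)[bit]
    state = step(state, sym, delta, 2, 2)
print(state)  # one-hot encoding of the DFA state after reading 1101: state 1
```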
Slide 31
Lots of theory: principled capacity and limitations
RNN → DFA:
• unary input
• in general: distributed state representation
• cluster the state space into disjoint subsets corresponding to automaton states and observe their behavior, yielding an approximate description
Approximate extraction of automaton rules is possible.
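A rough sketch of that extraction idea (my own simplification: k-means discretizes the hidden states and the observed transitions between clusters are recorded; actual extraction algorithms are more careful, e.g. resolving conflicting observations by majority vote):

```python
import numpy as np
from sklearn.cluster import KMeans

def extract_automaton(hidden_states, symbols, n_clusters=4):
    """hidden_states[t] is the RNN state after reading symbols[t].

    Returns a (cluster, symbol) -> cluster transition table observed in the run."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(np.asarray(hidden_states))
    transitions = {}
    for t in range(1, len(labels)):
        transitions[(labels[t - 1], symbols[t])] = labels[t]
    return transitions
```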
Slide 32
Lots of theory: principled capacity and limitations
The principled capacity of RNNs can be characterized exactly:
• RNNs with small weights or Gaussian noise = finite memory models [Hammer/Tino, Maass/Sontag]
• RNNs with limited noise = finite state automata [Omlin/Giles, Maass/Orponen]
• RNNs with rational weights = Turing machines [Siegelmann/Sontag]
• RNNs with arbitrary weights = non-uniform Boolean circuits (super-Turing capability) [Siegelmann/Sontag]
Slide 33
Lots of theory: principled capacity and limitations
However, learning might be difficult:
• gradient-based learning schemes face the problem of long-term dependencies [Bengio/Frasconi]
• RNNs are not PAC-learnable (infinite VC-dimension); only distribution-dependent bounds can be derived [Hammer]
• there exist only few general guarantees for the long-term behavior of RNNs, e.g. stability [Suykens, Steil, …]
[Figure: error curves for long repeated input sequences ("tatata…ta").]
Slide 34
Lots of theory: principled capacity and limitations
RNNs
• naturally process time series
• incorporate plausible regularization such as a bias towards finite memory models
• have sufficient power for interesting dynamics (context-free, context-sensitive; arbitrary attractors and chaotic behavior)
but:
• training is difficult
• only limited guarantees for the long-term behavior and generalization ability
Symbolic description/knowledge can provide solutions.
Slide 35
Lots of theory: principled capacity and limitations
RNN versus recurrent symbolic system:
• RNN: real numbers; iterated function systems give rise to fractals/attractors/chaos; implicit memory
• recurrent symbolic system: discrete states; crisp boolean function on the states plus explicit memory
Correspondence? E.g. attractor/repellor dynamics for counting a^n b^n c^n.
Slide 37
To do: challenges
Training RNNs:
• search for appropriate regularizations inspired by a focus on specific functionalities: architecture (e.g. local), weights (e.g. bounded), activation function (e.g. linear), cost term (e.g. additional penalties) [Hochreiter, Boden, Steil, Kremer, …]
• insertion of prior knowledge: finite automata and beyond (e.g. context-free/context-sensitive languages, specific dynamical patterns/attractors) [Omlin, Croog, …]
Long-term behavior:
• enforce appropriate constraints while training
• investigate the dynamics of RNNs: rule extraction, investigation of attractors, relating dynamics and symbolic processing [Omlin, Pasemann, Haschke, Rodriguez, Tino, …]
Slide 38
To do: challenges
Some further issues:
• processing spatial data, e.g. sequences x_1 … x_10 [bicausal networks, Pollastri et al.; contextual RCC, Micheli et al.]
• unsupervised processing of sequences x_1, x_2, x_3, x_4, … [TKM, Chappell/Taylor; RecSOM, Voegtlin; SOMSD, Sperduti et al.; MSOM, Hammer et al.; general formulation, Hammer et al.]
Slide 39
Conclusions: recurrent networks
• the capacity of RNNs is well understood and promising, e.g. for natural language processing, control, …
• the recurrence of symbolic systems has a natural counterpart in the recurrence of RNNs
• training and generalization face problems which could be solved by hybrid systems
• discrete dynamics with explicit memory versus real-valued iterated function systems
• sequences are nice, but not enough
Slide 41
The general idea: recursive distributed representations
How to turn tree structures/acyclic graphs into a connectionist representation, i.e. a vector?
Slide 42
The general idea: recursive distributed representations
Turn a tree into a vector by recursion: a network f : ℝ^i × ℝ^c × ℝ^c → ℝ^c maps a node label together with the context codes of the two subtrees to a new context code. This yields the encoding f_enc, where
  f_enc(ξ) = 0
  f_enc(a(l,r)) = f(a, f_enc(l), f_enc(r))
(ξ denotes the empty tree.)
Slide 43
The general idea: recursive distributed representations
Encoding, given f : ℝ^{n+2c} → ℝ^c (label, left context, right context → context) and g : ℝ^c → ℝ^o:
  f_enc : (ℝ^n)^{2*} → ℝ^c
  f_enc(ξ) = 0
  f_enc(a(l,r)) = f(a, f_enc(l), f_enc(r))
Decoding, given h : ℝ^o → ℝ^{n+2o} with components h_0 (label), h_1 (left), h_2 (right):
  h_dec : ℝ^o → (ℝ^n)^{2*}
  h_dec(0) = ξ
  h_dec(x) = h_0(x) ( h_dec(h_1(x)), h_dec(h_2(x)) )
Here (ℝ^n)^{2*} denotes the set of binary trees with labels in ℝ^n.
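A minimal sketch of the encoding part (my own illustration with a single tanh layer standing in for f; class names and dimensions are arbitrary):

```python
import numpy as np

class Node:
    def __init__(self, label, left=None, right=None):
        self.label, self.left, self.right = label, left, right

def f_enc(node, W, b, code_dim):
    """f_enc(xi) = 0 for the empty tree, f_enc(a(l, r)) = f(a, f_enc(l), f_enc(r))."""
    if node is None:                       # empty tree xi
        return np.zeros(code_dim)
    left = f_enc(node.left, W, b, code_dim)
    right = f_enc(node.right, W, b, code_dim)
    x = np.concatenate([node.label, left, right])
    return np.tanh(W @ x + b)              # f : R^(n+2c) -> R^c

# toy example: labels in R^2, codes in R^4
n, c = 2, 4
rng = np.random.default_rng(0)
W, b = rng.normal(scale=0.5, size=(c, n + 2 * c)), np.zeros(c)
tree = Node(np.array([1.0, 0.0]),
            Node(np.array([0.0, 1.0])),
            Node(np.array([1.0, 1.0])))
print(f_enc(tree, W, b, c))   # a fixed-size vector code for the whole tree
```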
Slide 44
The general idea: recursive distributed representations
• recursive distributed description [Hinton, 90]
  – general idea without concrete implementation
• tensor construction [Smolensky, 90]
  – encoding/decoding given by (a,b,c) ↦ a⊗b⊗c
  – increasing dimensionality
• holographic reduced representation [Plate, 95]
  – circular correlation/convolution (see the sketch after this list)
  – fixed encoding/decoding with fixed dimensionality (but potential loss of information)
  – necessity of chunking or clean-up for decoding
• binary spatter codes [Kanerva, 96]
  – binary operations, fixed dimensionality, potential loss
  – necessity of chunking or clean-up for decoding
• RAAM [Pollack, 90], LRAAM [Sperduti, 94]
  – trainable networks, trained for the identity, fixed dimensionality
  – encoding optimized for the given training set
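As one concrete instance, a holographic reduced representation binds two vectors by circular convolution and approximately unbinds by circular correlation; a small sketch of that operation (my own, using the FFT for the convolution; dimensions and seeds are arbitrary):

```python
import numpy as np

def bind(a, b):
    # circular convolution, computed via the FFT
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

def unbind(c, a):
    # circular correlation with a (the approximate inverse of binding with a)
    return np.real(np.fft.ifft(np.fft.fft(c) * np.conj(np.fft.fft(a))))

d = 512
rng = np.random.default_rng(0)
a, b = rng.normal(0, 1 / np.sqrt(d), size=(2, d))   # random HRR vectors
c = bind(a, b)
b_hat = unbind(c, a)
# clearly above 0, but below 1: b is recovered only approximately, hence the need for clean-up
print(np.dot(b_hat, b) / (np.linalg.norm(b_hat) * np.linalg.norm(b)))
```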
Slide 45
The general idea: recursive distributed representations
Nevertheless: the results are not promising.
Theorem [Hammer]:
• There exists a fixed-size neural network which can uniquely encode tree structures of arbitrary depth with discrete labels.
• For every code, decoding of all trees up to height T requires Ω(2^T) neurons for sigmoidal networks.
Encoding seems possible, but no fixed-size architecture exists for decoding.
Slide 47
One breakthrough: recursive networks
Recursive networks [Goller/Küchler, 96]:
• do not use decoding
• combine encoding and mapping
• train this combination directly for the given task with backpropagation through structure
An efficient, data- and problem-adapted encoding is learned.
[Figure: the recursive encoding network is composed with a transformation network that maps the tree code to an output y.]
Slide 48
One breakthrough: recursive networks
Applications:
• term classification [Goller, Küchler, 1996]
• automated theorem proving [Goller, 1997]
• learning tree automata [Küchler, 1998]
• QSAR/QSPR problems [Schmitt, Goller, 1998; Bianucci, Micheli, Sperduti, Starita, 2000; Vullo, Frasconi, 2003]
• logo recognition, image processing [Costa, Frasconi, Soda, 1999; Bianchini et al., 2005]
• natural language parsing [Costa, Frasconi, Sturt, Lombardo, Soda, 2000, 2005]
• document classification [Diligenti, Frasconi, Gori, 2001]
• fingerprint classification [Yao, Marcialis, Roli, Frasconi, Pontil, 2001]
• prediction of contact maps [Baldi, Frasconi, Pollastri, Vullo, 2002]
• protein secondary structure prediction [Frasconi et al., 2005]
• …
Slide 49
One breakthrough: recursive networks
Desired: approximation completeness, i.e. for every (reasonable) function f and every ε > 0 there exists a RecNN which approximates f up to ε (with an appropriate distance measure).
Approximation properties can be measured in several ways: given f, ε, a probability P, and data points x_i, find f_w such that
• P(x : |f(x) − f_w(x)| > ε) is small (L1 norm), or
• |f(x) − f_w(x)| < ε for all x (max norm), or
• f(x_i) = f_w(x_i) for all x_i (interpolation of points).
Slide 50
One breakthrough: recursive networks
Approximation properties for RecNNs and tree-structured data. RecNNs are …
• … capable of approximating every continuous function in the max-norm for restricted height, and every measurable function in the L1-norm (σ: squashing) [Hammer]
• … capable of interpolating every set {f(x_1), …, f(x_m)} with O(m²) neurons (σ: squashing, C² in a neighborhood of some t with σ''(t) ≠ 0) [Hammer]
• … able to approximate every tree automaton for arbitrarily large inputs [Küchler]
• … not able to approximate every f : {1}^{2*} → {0,1} (for realistic σ) [Hammer]
Fairly good results: 3:1.
Slide 52
Going on: towards more complex structures
More general trees: an arbitrary number of unpositioned children.
  f_enc = f( 1/|ch| · Σ_ch w · f_enc(ch) · label_edge, label )
i.e. the children's codes, weighted by the edge labels, are averaged and combined with the node label.
Approximation complete for appropriate edge labels [Bianchini et al., 2005].
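A hedged sketch of such an encoding for unordered children (my own reading of the formula above: children codes are scaled by their edge labels, averaged, and combined with the node label through a tanh layer; all names are illustrative):

```python
import numpy as np

def f_enc(label, children, W, b):
    """children: list of (edge_label, child_code) pairs; returns the node code."""
    if children:
        agg = np.mean([edge_label * code for edge_label, code in children], axis=0)
    else:
        agg = np.zeros(W.shape[0])     # leaves contribute the zero context
    return np.tanh(W @ np.concatenate([agg, label]) + b)
```

Because the children are aggregated by an average, the encoding is invariant to their order, which is exactly the point of this extension.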
Slide 53
Going on: towards more complex structures
Planar graphs [Baldi, Frasconi, …, 2002]
Slide 54
Going on: towards more complex structures
Acyclic graphs:
[Figure: contextual processing combining predecessor contexts (q^-1) with successor context (q^+1).]
Contextual cascade correlation [Micheli, Sperduti, 03]. Approximation complete (under a mild structural restriction), even for structural transduction [Hammer, Micheli, Sperduti, 05].
Slide 55
Going on: towards more complex structures
Cyclic graphs: each vertex is encoded from the codes of its neighbors [Micheli, 05]
Slide 56
Conclusions: recursive networks
• Very promising neural architectures for direct processing of tree structures
• Successful applications and mathematical background
• Connections to symbolic mechanisms (tree automata)
• Extensions to more complex structures (graphs) are under development
• Only a few approaches achieve structured outputs
Slide 58
Conclusions (Part I)
Overview literature:
• FNNs and rules: Duch, Setiono, Zurada, Computational intelligence methods for understanding of data, Proceedings of the IEEE 92(5):771-805, 2004
• Structure kernels: Gärtner, Lloyd, Flach, Kernels and distances for structured data, Machine Learning 57, 2004 (a new overview is forthcoming)
• RNNs: Hammer, Steil, Perspectives on learning with recurrent networks, in: Verleysen (ed.), ESANN'2002, D-side publications, 357-368, 2002
• RNNs and rules: Jacobsson, Rule extraction from recurrent neural networks: a taxonomy and review, Neural Computation 17:1223-1263, 2005
• Recursive representations: Hammer, Perspectives on learning symbolic data with connectionistic systems, in: Kühn, Menzel, Menzel, Ratsch, Richter, Stamatescu (eds.), Adaptivity and Learning, 141-160, Springer, 2003
• Recursive networks: Frasconi, Gori, Sperduti, A general framework for adaptive processing of data structures, IEEE Transactions on Neural Networks 9(5):768-786, 1998
• Neural networks and structures: Hammer, Jain, Neural methods for non-standard data, in: Verleysen (ed.), ESANN'2004, D-side publications, 281-292, 2004
Slide 59
Conclusions (Part I)
• There exist networks which can directly and successfully deal with structures (sequences, trees, graphs): kernel machines, recurrent and recursive networks
• Efficient training algorithms and theoretical foundations exist
• (Loose) connections to symbolic processing have been established and indicate benefits
Now: towards strong connections. PART II: Logic and neural networks