
Relational Representations. Daniel Lowd, University of Oregon. April 20, 2015.

Transcript of Relational Representations

Page 1:

Relational Representations

Daniel Lowd, University of Oregon

April 20, 2015

Page 2:

Caveats

• The purpose of this talk is to inspire meaningful discussion.
• I may be completely wrong.

My background: Markov logic networks, probabilistic graphical models

Page 3:

Q: Why relational representations?

A: To model relational data.

Page 4:

Relational Data

• A relation is a set of n-tuples:

  Friends: {(Anna, Bob), (Bob, Anna), (Bob, Chris)}
  Smokes: {(Bob), (Chris)}
  Grade: {(Anna, CS612, Fall2012, "A+"), …}

• Relations can be visualized as tables:

• Typically we make the closed-world assumption: all tuples not listed are false.

Friends:            Smokes:
  Anna   Bob          Bob
  Bob    Anna         Chris
  Bob    Chris
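As a concrete sketch (in Python, with illustrative names of my own choosing), relations can be stored directly as sets of tuples, and the closed-world assumption becomes a simple membership test:

```python
# Relations from the slide, stored as sets of tuples.
friends = {("Anna", "Bob"), ("Bob", "Anna"), ("Bob", "Chris")}
smokes = {("Bob",), ("Chris",)}
grade = {("Anna", "CS612", "Fall2012", "A+")}

def holds(relation, *args):
    # Closed-world assumption: a tuple is true iff it is listed.
    return tuple(args) in relation

print(holds(friends, "Anna", "Bob"))    # True
print(holds(friends, "Anna", "Chris"))  # False (unlisted, so assumed false)
```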

Page 5:

Relational Knowledge

• First-order logic
• Description logic
• Logic programs

General form: A set of rules of the form "For every tuple of objects (x1, x2, …, xk), certain relationships hold."

e.g., For every pair of objects (x, y), if Friends(x, y) is true then Friends(y, x) is true.
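A hypothetical sketch of what grounding such a rule means: instantiate (x, y) with every pair of objects and test the implication against the database (relation and object names are illustrative):

```python
from itertools import product

objects = ["Anna", "Bob", "Chris"]
friends = {("Anna", "Bob"), ("Bob", "Anna"), ("Bob", "Chris")}

def rule_violations(db, domain):
    # Ground "Friends(x, y) => Friends(y, x)" over every pair of objects
    # and collect the groundings where the implication fails.
    return [(x, y) for x, y in product(domain, repeat=2)
            if (x, y) in db and (y, x) not in db]

print(rule_violations(friends, objects))  # [('Bob', 'Chris')]
```

In pure first-order logic a single violation makes the world impossible; the statistical version below merely makes it less likely.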

Page 6:

Statistical Relational Knowledge

• First-order logic
• Description logic
• Logic programs

• Bayesian networks
• Markov networks
• Dependency networks

General form: A set of rules of the form "For every tuple of objects (x1, x2, …, xk), certain relationships probably hold." (Parametrized factors, or "parfactors".)

e.g., For every pair of objects (x, y), if Friends(x, y) is true then Friends(y, x) is more likely.
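To illustrate the difference from the hard rule, here is a toy parfactor sketch (the weight value and all names are mine): each grounding that satisfies the soft rule "Friends(x, y) ⇒ Friends(y, x)" multiplies the world's unnormalized weight by exp(w), so symmetric friendship networks become more likely without being strictly required:

```python
import math
from itertools import product

def world_weight(friends, domain, w=1.5):
    # Count groundings (x, y) where the implication holds; each one
    # contributes a factor of exp(w) to the unnormalized world weight,
    # as in a Markov-network log-linear model.
    satisfied = sum(
        1 for x, y in product(domain, repeat=2)
        if not ((x, y) in friends and (y, x) not in friends)
    )
    return math.exp(w * satisfied)

objects = ["Anna", "Bob"]
symmetric = {("Anna", "Bob"), ("Bob", "Anna")}
asymmetric = {("Anna", "Bob")}
print(world_weight(symmetric, objects) > world_weight(asymmetric, objects))  # True
```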

Page 7:

Applications and Datasets

What are the "killer apps" of relational learning?

They must be relational.

Page 8:

Graph or Network Data

• Many kinds of networks:
  – Social networks
  – Interaction networks
  – Citation networks
  – Road networks
  – Cellular pathways
  – Computer networks
  – Webgraph

Page 9:

Graph  Mining  

Page 10:

Graph Mining

• Well-established field within data mining
• Representation: nodes are objects, edges are relations
• Many problems and methods:
  – Frequent subgraph mining
  – Generative models to explain degree distribution and graph evolution over time
  – Community discovery
  – Collective classification
  – Link prediction
  – Clustering

• What's the difference between graph mining and relational learning?

Page 11:

Social  Network  Analysis  

Page 12:

Specialized vs. General Representations

In many domains, the best results come from more restricted, "specialized" representations and algorithms.

• Specialized representations and algorithms
  – May represent key domain properties better
  – Typically much more efficient
  – E.g., stochastic block model, label propagation, HITS

• General representations
  – Can be applied to new and unusual domains
  – Easier to define complex models
  – Easier to modify and extend
  – E.g., MLNs, PRMs, HL-MRFs, ProbLog, RBNs, PRISM, etc.

Page 13:

Specializing and Unifying Representations

Many representations have been proposed over the years, each with its own advantages and disadvantages.

• How many do we need?
• Which comes first, representational power or algorithmic convenience?
• What are the right unifying frameworks?
• When should we resort to domain-specific representations?
• Which domain-specific ideas actually generalize to other domains?

Page 14:

Applications and Datasets

What are the "killer apps" of general relational learning?

They must be relational.

They should probably be complex.

Page 15:

BioNLP Shared Task Workshop

Task: Extract biomedical information from text.

In 2009, Riedel et al. win with a Markov logic network!

• They claim Markov logic contributed to their success: "Furthermore, the declarative nature of Markov Logic helped us to achieve these results with a moderate amount of engineering. In particular, we were able to tackle task 2 by copying the local formulae for event prediction and adding three global formulae."

• However, converting this problem to an MLN was non-trivial: "In future work we will therefore investigate means to extend Markov Logic (interpreter) in order to directly model event structure."

event(i) ⇒ ∃t. eventType(i, t)
eventType(i, t) ⇒ event(i)
eventType(i, t) ∧ t ≠ o ⇒ ¬eventType(i, o)
¬site(i) ∨ ¬event(i)
role(i, j, r) ⇒ event(i)
role(i, j, r1) ∧ r1 ≠ r2 ⇒ ¬role(i, j, r2)
eventType(e, t) ∧ role(e, a, r) ∧ event(a) ⇒ regType(t)
role(i, j, r) ∧ taskOne(r) ⇒ event(j) ∨ protein(j)
role(i, j, r) ∧ taskTwo(r) ⇒ site(j)
site(j) ⇒ ∃i, r. role(i, j, r) ∧ taskTwo(r)
event(i) ⇒ ∃j. role(i, j, ·)
eventType(i, t) ∧ ¬allowed(t, r) ⇒ ¬role(i, j, r)
role(i, j, r1) ∧ k ≠ i ⇒ ¬role(k, j, r2)
j < k ∧ i < j ∧ role(i, j, r1) ⇒ ¬role(i, k, r2)
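As an illustration of what these hard formulae enforce, here is a small sketch (the data structures and names are mine, not from Riedel et al.) that checks the first three constraints — an event has at least one type, only events have types, and the type is unique — on a candidate assignment:

```python
def consistent(event, event_type):
    """event: set of trigger ids; event_type: dict from trigger id to set of labels."""
    for i in event:                      # event(i) => exists t. eventType(i, t)
        if not event_type.get(i):
            return False
    for i, types in event_type.items():
        if types and i not in event:     # eventType(i, t) => event(i)
            return False
        if len(types) > 1:               # eventType(i, t) ^ t != o => not eventType(i, o)
            return False
    return True

print(consistent({1, 2}, {1: {"Phosphorylation"}, 2: {"Binding"}}))  # True
print(consistent({1}, {1: {"Phosphorylation", "Binding"}}))          # False: two types
```

In the MLN these appear as infinite-weight (hard) formulae, so inference only considers assignments where checks like these pass.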


Page 16:

BioNLP Shared Task Workshop

For 2011, Riedel and McCallum produce a more accurate model as a factor graph.

Is this a victory or a loss for relational learning?

[Figure 1: (a) sentence with target event structure; (b) projection to labelled graph. Example sentence: "... phosphorylation of TRAF2 inhibits binding to the CD40 ...", with event labels Phosphorylation, Regulation, Binding, argument edges Theme/Cause, and variables such as e2,Phos., a6,9,Theme, and b4,9.]


We will first present some basic notation to simplify our exposition. For each sentence x we have a set of candidate trigger words Trig(x), and a set of candidate proteins Prot(x). We will generally use the indices i and l to denote members of Trig(x), the indices p, q for members of Prot(x), and the index j for members of Cand(x) := Trig(x) ∪ Prot(x).

We label each candidate trigger i with an event type t ∈ T (with None ∈ T), and use the binary variable e_{i,t} to indicate this labeling. We use binary variables a_{i,l,r} to indicate that between i and l there is an edge labelled r ∈ R (with None ∈ R).

The representation so far has been used in previous work (Riedel et al., 2009; Björne et al., 2009). Its shortcoming is that it does not capture whether two proteins are arguments of the same binding event, or arguments of two binding events with the same trigger. To overcome this problem, we introduce binary "same Binding" variables b_{p,q} that are active whenever there is a binding event that has both p and q as arguments. Our inference algorithm will also need, for each trigger i and protein pair p, q, a binary variable t_{i,p,q} that indicates that at i there is a binding event with arguments p and q. All t_{i,p,q} are summarized in t.

Constructing events from solutions (e, a, b) can be done almost exactly as described by Björne et al. (2009). However, while Björne et al. (2009) group arguments according to ad-hoc rules based on dependency paths from trigger to argument, we simply query the variables b_{p,q}.

3 Model

We use the following objective to score the structures we like to extract:

s(e, a, b) := Σ_{e_{i,t}=1} s_T(i, t) + Σ_{a_{i,j,r}=1} s_R(i, j, r) + Σ_{b_{p,q}=1} s_B(p, q)

with local scoring functions s_T(i, t) := ⟨w_T, f_T(i, t)⟩, s_R(i, j, r) := ⟨w_R, f_R(i, j, r)⟩, and s_B(p, q) := ⟨w_B, f_B(p, q)⟩.
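The objective above is just a sum of local scores over the active indicator variables. A minimal sketch (with stand-in score tables in place of the dot products ⟨w, f⟩; all values are illustrative):

```python
def score(e_active, a_active, b_active, sT, sR, sB):
    # Sum the local score of every active trigger label e_{i,t},
    # argument edge a_{i,j,r}, and same-binding pair b_{p,q}.
    return (sum(sT[i, t] for (i, t) in e_active)
            + sum(sR[i, j, r] for (i, j, r) in a_active)
            + sum(sB[p, q] for (p, q) in b_active))

# Stand-in local scores; in the paper each is a dot product <w, f(...)>.
sT = {(2, "Phosphorylation"): 1.2}
sR = {(2, 9, "Theme"): 0.8}
sB = {(4, 9): 0.5}
print(score({(2, "Phosphorylation")}, {(2, 9, "Theme")}, {(4, 9)}, sT, sR, sB))  # 2.5
```

Inference then searches over the assignments (e, a, b) allowed by the hard constraints for the highest-scoring structure.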

Our model scores all parts of the structure in isolation. It is a joint model due to the three types of constraints we enforce. The first type acts on trigger labels and their outgoing edges. It includes constraints such as "an active label at trigger i requires at least one active outgoing Theme argument". The second type enforces consistency between trigger labels and their incoming edges. That is, if an incoming edge has a label that is not None, the trigger must not be labelled None either. The third type of constraints ensures that when two proteins p and q are part of the same binding (as indicated by b_{p,q} = 1), there needs to be a binding event at some trigger i that has p and q as arguments. We will denote the set of structures (e, a, b) that satisfy all above constraints as Y.

To learn w we choose the passive-aggressive online learning algorithm (Crammer and Singer, 2003). As loss function we apply a weighted sum of false positive and false negative labels and edges. The weighting scheme penalizes false negatives 3.8 times more than false positives.

3.1 Features

For feature vector f_T(i, t) we use a collection of representations for the token i: word-form, lemma, POS tag, syntactic heads, syntactic children; membership in two dictionaries used by Riedel et al. (2009). For f_R(a; i, j, r) we use representations of the token pair (i, j) inspired by Miwa et al. (2010). They contain: labelled and unlabeled n-gram dependency paths; edge and vertex walk features (Miwa et al., 2010), argument and trigger modifiers and heads, words in between (for close distance i and j). For f_B(b; p, q) we use a small subset of the token pair representations in f_R.


Task: Extract biomedical information from text.

Page 17:

Other NLP Tasks?

Hoifung Poon and Pedro Domingos obtained great NLP results with MLNs:
• "Joint Unsupervised Coreference Resolution with Markov Logic," ACL 2008.
• "Unsupervised Semantic Parsing," EMNLP 2009. Best Paper Award.
• "Unsupervised Ontology Induction from Text," ACL 2010.

…but Hoifung hasn't used Markov logic in any of his follow-up work:
• "Probabilistic Frame Induction," NAACL 2013. (with Jackie Cheung and Lucy Vanderwende)
• "Grounded Unsupervised Semantic Parsing," ACL 2013.
• "Grounded Semantic Parsing for Complex Knowledge Extraction," NAACL 2015. (with Ankur P. Parikh and Kristina Toutanova)

Page 18:

MLNs were successfully used to obtain state-of-the-art results on several NLP tasks. Why were they abandoned? Because it was easier to hand-code a custom solution as a log-linear model.

Page 19:

Software

• There are many good machine learning toolkits
  – Classification: scikit-learn, Weka
  – SVMs: SVM-Light, LibSVM, LIBLINEAR
  – Graphical models: BNT, FACTORIE
  – Deep learning: Torch, Pylearn2, Theano

• What's the state of software for relational learning and inference?
  – Frustrating.
  – Are the implementations too primitive?
  – Are the algorithms immature?
  – Are the problems just inherently harder?

Page 20:

Hopeful Analogy: Neural Networks

• In computer vision, specialized feature models (e.g., SIFT) outperformed general feature models (neural networks) for a long time.
• Recently, convolutional nets are best and are used everywhere for image recognition.
• What changed? More processing power and more data.

Specialized relational models are widely used. Is there a revolution in general relational learning waiting to happen?

Page 21:

Conclusion

• Many kinds of relational data and models
  – Specialized relational models are clearly effective.
  – General relational models have potential, but they haven't taken off.

• Questions:
  – When can effective specialized representations become more general?
  – What advances do we need for general-purpose methods to succeed?
  – What "killer apps" should we be working on?