Iteravely*Learning*Condi&onal*Statements*in* Transforming...

25
Itera&vely Learning Condi&onal Statements in Transforming Data by Example Bo Wu and Craig A. Knoblock University of Southern California

Transcript of Iteravely*Learning*Condi&onal*Statements*in* Transforming...

Page 1: Iteravely*Learning*Condi&onal*Statements*in* Transforming ...usc-isi-i2.github.io/slides/wu14-icdm-slides.pdf · A*DataTable* Accessi on Credit( Dimensions( Medium( Name(01.2 GiK*of*the*ar&st

Itera&vely  Learning  Condi&onal  Statements  in  Transforming  Data  by  Example  

Bo  Wu  and  Craig  A.  Knoblock  University  of  Southern  California  

Page 2: Iteravely*Learning*Condi&onal*Statements*in* Transforming ...usc-isi-i2.github.io/slides/wu14-icdm-slides.pdf · A*DataTable* Accessi on Credit( Dimensions( Medium( Name(01.2 GiK*of*the*ar&st

   

Introduc&on  

Page 3: Iteravely*Learning*Condi&onal*Statements*in* Transforming ...usc-isi-i2.github.io/slides/wu14-icdm-slides.pdf · A*DataTable* Accessi on Credit( Dimensions( Medium( Name(01.2 GiK*of*the*ar&st

Mo&va&on  Data  Source  A   Data  Source  B   Data  Source  C  

Data  Transforma&on    

Data  in  the  ready  format  

Page 4: Iteravely*Learning*Condi&onal*Statements*in* Transforming ...usc-isi-i2.github.io/slides/wu14-icdm-slides.pdf · A*DataTable* Accessi on Credit( Dimensions( Medium( Name(01.2 GiK*of*the*ar&st

A  Data  Table  

Accession  

Credit   Dimensions   Medium   Name  

01.2   GiK  of  the  ar&st   5.25  in  HIGH  x  9.375  in  WIDE    

Oil  on  canvas   John  Mix  Stanley    

05.411   GiK  of  James  L.  Edison    20  in  HIGH  x  24  in  WIDE    

Oil  on  canvas   Mor&mer  L.  Smith    

06.1   GiK  of  the  ar&st   Image:  20.5  in.  HIGH  x  17.5  in.  WIDE    

Oil  on  canvas   Theodore  ScoV  Dabo    

06.2   GiK  of  the  ar&st   9.75  in|16  in  HIGH  x  13.75  in|19.5  in  WIDE    

Oil  on  canvas   Leon  Dabo    

…  

09.8   GiK  of  the  ar&st   12  in|14  in  HIGH  x  16  in|18  in  WIDE Oil  on  canvas   Gari  Melchers    

4  

Page 5: Iteravely*Learning*Condi&onal*Statements*in* Transforming ...usc-isi-i2.github.io/slides/wu14-icdm-slides.pdf · A*DataTable* Accessi on Credit( Dimensions( Medium( Name(01.2 GiK*of*the*ar&st

Programming  by  Example  Raw  Value   Target  Value  

R1   5.25  in  HIGH  x  9.375  in  WIDE    

R2    20  in  HIGH  x  24  in  WIDE    

R3   Image:  20.5  in.  HIGH  x  17.5  in.  WIDE    

R4   9.75  in|16  in  HIGH  x  13.75  in|19.5  in  WIDE    

R5   12  in|14  in  HIGH  x  16  in|18  in  WIDE

.  .  .  

5  

9.375  

24  

17.5  

null  

null  

24  

17.5  

19.5  

null  

24  

17.5  

19.5  

18  

Page 6: Iteravely*Learning*Condi&onal*Statements*in* Transforming ...usc-isi-i2.github.io/slides/wu14-icdm-slides.pdf · A*DataTable* Accessi on Credit( Dimensions( Medium( Name(01.2 GiK*of*the*ar&st

   

Problem:  Learn  accurate  condi&onal  statements  efficiently  for  data  with  heterogeneous  

formats  using  few  examples  

Page 7: Iteravely*Learning*Condi&onal*Statements*in* Transforming ...usc-isi-i2.github.io/slides/wu14-icdm-slides.pdf · A*DataTable* Accessi on Credit( Dimensions( Medium( Name(01.2 GiK*of*the*ar&st

GUI  

Examples  

Par&&ons  

Condi&onal   Branch    Transforma&on  program  

Data  Preprocessing  

Transforma&on  Program  

Cluster  Learn    classifier   Derive  branch  program  

Combine  

Examine  Results  and    provide  examples  

Get  examples  and    provide  results  

Previous  Approach  

Compa&bility  score  (O(n3))      

Few  Training  Data  

Page 8: Iteravely*Learning*Condi&onal*Statements*in* Transforming ...usc-isi-i2.github.io/slides/wu14-icdm-slides.pdf · A*DataTable* Accessi on Credit( Dimensions( Medium( Name(01.2 GiK*of*the*ar&st

Transforma&on  Program  

BNK:  blankspace  NUM[0-­‐9]+:  98  UWRD[A-­‐Z]:  I  LWRD[a-­‐z]:  mage  WORD[a-­‐zA-­‐Z]  START:  END:  

8  Example:  9.75  in|16  in  HIGH  x  13.75  in|19.5  in  WIDE  è  19.5  

Page 9: Iteravely*Learning*Condi&onal*Statements*in* Transforming ...usc-isi-i2.github.io/slides/wu14-icdm-slides.pdf · A*DataTable* Accessi on Credit( Dimensions( Medium( Name(01.2 GiK*of*the*ar&st

     

Our  Approach        

Page 10: Iteravely*Learning*Condi&onal*Statements*in* Transforming ...usc-isi-i2.github.io/slides/wu14-icdm-slides.pdf · A*DataTable* Accessi on Credit( Dimensions( Medium( Name(01.2 GiK*of*the*ar&st

Main  Idea  

Learning  the  condi?onal  statement  itera?vely  

Input:  5.25  in  HIGH  x  9.375  in  WIDE    Output:  9.375  

Input:  9.75  in|16  in  HIGH  x  13.75  in|19.5  in  WIDE  Output:  13.35  

10  

Page 11: Iteravely*Learning*Condi&onal*Statements*in* Transforming ...usc-isi-i2.github.io/slides/wu14-icdm-slides.pdf · A*DataTable* Accessi on Credit( Dimensions( Medium( Name(01.2 GiK*of*the*ar&st

Our  Approach  

Utilize previous constraints

U&lize  Unlabeled  Data

11  

Examples  

Par&&ons  

Condi&onal   Branch    Transforma&on  program  

Data  Preprocessing  

Transforma&on  Program  

Cluster  

Learn    classifier   Derive  branch  program  

Combine  

Convert data into feature vectors

Page 12: Iteravely*Learning*Condi&onal*Statements*in* Transforming ...usc-isi-i2.github.io/slides/wu14-icdm-slides.pdf · A*DataTable* Accessi on Credit( Dimensions( Medium( Name(01.2 GiK*of*the*ar&st

Data  Preprocessing  

Page 13: Iteravely*Learning*Condi&onal*Statements*in* Transforming ...usc-isi-i2.github.io/slides/wu14-icdm-slides.pdf · A*DataTable* Accessi on Credit( Dimensions( Medium( Name(01.2 GiK*of*the*ar&st

Our  Approach  

Utilize previous constraints

U&lize  Unlabeled  Data

13  

Examples  

Par&&ons  

Condi&onal   Branch    Transforma&on  program  

Data  Preprocessing  

Transforma&on  Program  

Cluster  

Learn    classifier   Derive  branch  program  

Combine  

Convert data into feature vectors

Page 14: Iteravely*Learning*Condi&onal*Statements*in* Transforming ...usc-isi-i2.github.io/slides/wu14-icdm-slides.pdf · A*DataTable* Accessi on Credit( Dimensions( Medium( Name(01.2 GiK*of*the*ar&st

Constraints  

•  Two  Types  of  Constraints:  •  Cannot-­‐merge  Constraints:    •  Ex:  

•  Must-­‐merge  Constraints:    •  Ex:      

5.25  in  HIGH  x  9.375  in  WIDE   9.375  

9.75  in|16  in  HIGH  x  13.75  in|19.5  in  WIDE   13.75  

20  in  HIGH  x  24  in  WIDE   24  

5.25  in  HIGH  x  9.375  in  WIDE   9.375  

20  in  HIGH  x  24  in  WIDE   24  

9.75  in|16  in  HIGH  x  13.75  in|19.5  in  WIDE   13.75  

Image:  20.5  in.  HIGH  x  17.5  in.  WIDE   17.5  

P1  

P2  

P3  

14  

Page 15: Iteravely*Learning*Condi&onal*Statements*in* Transforming ...usc-isi-i2.github.io/slides/wu14-icdm-slides.pdf · A*DataTable* Accessi on Credit( Dimensions( Medium( Name(01.2 GiK*of*the*ar&st

Constrained Agglomerative Clustering

r1   r3   r4  r2  

r1   r2  

r4  

r3  

r1   r3   r2  

Update  constraints  Learn  distance  metric  

r2  r1   r3   r4  

r4  

15  

Page 16: Iteravely*Learning*Condi&onal*Statements*in* Transforming ...usc-isi-i2.github.io/slides/wu14-icdm-slides.pdf · A*DataTable* Accessi on Credit( Dimensions( Medium( Name(01.2 GiK*of*the*ar&st

Distance  Metric  Learning  

•  Distance  Metric  (Weighted  Euclidean)  Learning  

•  Objec&ve  Func&on  

Too  far  away  

Close  to  each  other  

16  

Page 17: Iteravely*Learning*Condi&onal*Statements*in* Transforming ...usc-isi-i2.github.io/slides/wu14-icdm-slides.pdf · A*DataTable* Accessi on Credit( Dimensions( Medium( Name(01.2 GiK*of*the*ar&st

Our  Approach  

Utilize previous constraints

U&lize  Unlabeled  Data

17  

Examples  

Par&&ons  

Condi&onal   Branch    Transforma&on  program  

Data  Preprocessing  

Transforma&on  Program  

Cluster  

Learn    classifier   Derive  branch  program  

Combine  

Convert data into feature vectors

Page 18: Iteravely*Learning*Condi&onal*Statements*in* Transforming ...usc-isi-i2.github.io/slides/wu14-icdm-slides.pdf · A*DataTable* Accessi on Credit( Dimensions( Medium( Name(01.2 GiK*of*the*ar&st

U&lize  Unlabeled  data  in  Learning  Classifier  

Filter  unlabeled  data  1.  Filter  unlabeled  data  on  

the  boundary  

2.  Only  choose  top  K  unlabeled  data  

 Learn  a  SVM  classifier  

18  

Page 19: Iteravely*Learning*Condi&onal*Statements*in* Transforming ...usc-isi-i2.github.io/slides/wu14-icdm-slides.pdf · A*DataTable* Accessi on Credit( Dimensions( Medium( Name(01.2 GiK*of*the*ar&st

   

Results  

Page 20: Iteravely*Learning*Condi&onal*Statements*in* Transforming ...usc-isi-i2.github.io/slides/wu14-icdm-slides.pdf · A*DataTable* Accessi on Credit( Dimensions( Medium( Name(01.2 GiK*of*the*ar&st

Evalua&on  

•  Dataset:  30  edi&ng  scenarios  –  Museum    –  Google  Refine  and  Excel  user  forums  

•  Comparing  Methods:  –  SP  

•  The  state-­‐of-­‐the-­‐art  approach  that  uses  compa&bility  score  to  select  par&&ons  to  merge  

–  SPIC  •  U&lize  previous  constraints  besides  using  compa&bility  score  

–  DP  •  Learn  distance  metric    

–  DPIC  •  U&lize  previous  constraints  besides  learning  distance  metric  

–  DPICED  •  Our  approach  in  this  paper  

20  

Page 21: Iteravely*Learning*Condi&onal*Statements*in* Transforming ...usc-isi-i2.github.io/slides/wu14-icdm-slides.pdf · A*DataTable* Accessi on Credit( Dimensions( Medium( Name(01.2 GiK*of*the*ar&st

Results  Success  Rates:  

Time  and  Examples:  

21  

Page 22: Iteravely*Learning*Condi&onal*Statements*in* Transforming ...usc-isi-i2.github.io/slides/wu14-icdm-slides.pdf · A*DataTable* Accessi on Credit( Dimensions( Medium( Name(01.2 GiK*of*the*ar&st

Related  Work  •  Wrapper  induc&on  approaches  – WIEN  [Kushmerick,  1997],  SoKMealy  [Hsu  et  al.,  1998],  STALKER  [Muslea  et  al.,  1999]  

•  Programming-­‐by-­‐example  approaches  –  FlashFill[Gulwani,  2011][Perelman  et  al.,  2014],  Data  Wrangler  [Kandel  et  al.,  2011],  SmartEditor  [Lau  et  al.  2003]  

•  Clustering  with  constraints  –  Clustering  with  constraints  [Xing  et  al.,  2002][Bilenko  et  al.,  2004][Bade  et  al.,  2006][Zhao  et  al.,  2010][Zheng  et  al.,  2011]  

Page 23: Iteravely*Learning*Condi&onal*Statements*in* Transforming ...usc-isi-i2.github.io/slides/wu14-icdm-slides.pdf · A*DataTable* Accessi on Credit( Dimensions( Medium( Name(01.2 GiK*of*the*ar&st

Discussion  

•  Itera&vely  learn  condi&onal  statements  in  PBE  selng  –  Improve  the  efficiency  – Learn  more  accurate  condi&onal  statements  – generate  a  small  number  of  branches.  

•  Incorporate  ML  tools  as  external  func&ons  in  induc&ve  programming  

Page 24: Iteravely*Learning*Condi&onal*Statements*in* Transforming ...usc-isi-i2.github.io/slides/wu14-icdm-slides.pdf · A*DataTable* Accessi on Credit( Dimensions( Medium( Name(01.2 GiK*of*the*ar&st

Future  Work  

•  Integrate  the  par&&oning  and  classifica&on  steps  – Reduce  accumulated  errors  

•  Improve  GUI  to  help  user  verifying  the  data  –  Iden&fy  unseen  formats  –  Iden&fy  incorrectly  classified  records  

Page 25: Iteravely*Learning*Condi&onal*Statements*in* Transforming ...usc-isi-i2.github.io/slides/wu14-icdm-slides.pdf · A*DataTable* Accessi on Credit( Dimensions( Medium( Name(01.2 GiK*of*the*ar&st

•  Thanks