Ekaw2014 ziqi zhang

Learning with Partial Data for Semantic Table Interpretation Ziqi Zhang Department of Computer Science, University of Sheffield

Page 1: Ekaw2014 ziqi zhang

Learning with Partial Data for Semantic Table Interpretation

Ziqi Zhang Department of Computer Science, University of Sheffield

Page 2: Ekaw2014 ziqi zhang

Semantic Table Interpretation

• Input

• Ontology

• Relational table

• Goals/Tasks

• Column – classes/concepts

• Cell – named entities

• Column, Column – relation

[Figure: an ontology fragment (Thing, Company, Work, Time Period, Video Game, …) shown alongside the table below. Annotations in the figure: column classes Video Game, Company, Year; cell entities such as Ent:2kGames and Ent:THQ; relation Rel:publishedBy.]

Table of video games (PC)

      Name              Publisher   Year
1     Gears of War      Microsoft   2006
2     Civilization IV   2k Games    2006
3     Titan Quest       THQ         2006
< … … >
99    Civilization V    2k Games    2010

Z. Zhang / Learning with Partial Data for Semantic Table Interpretation

Page 3: Ekaw2014 ziqi zhang

Motivation

• State-of-the-art (SoA) semantic table interpretation methods, e.g. [Limaye2010, Venetis2011, Mulwad2013]

Limitation: the algorithms are ‘exhaustive’, which is unnecessary

Goal: Assign a concept to this column

Hint: Content in the column gives useful clues

How much data do we need for inference (99 rows in this example)?

- Human: SOME (learn by example)

- SoA: ALL

      Name              Publisher   Year
1     Gears of War      Microsoft   2006
2     Civilization IV   2k Games    2006
3     Titan Quest       THQ         2006
< … … >
99    Civilization V    2k Games    2010

Page 4: Ekaw2014 ziqi zhang

Research Questions

• Can machines ‘learn by example’?

• Inference using only partial data (a sample)

• Achieving good accuracy

• How to choose a sample?

• Does it matter (e.g., in terms of accuracy)?

• How to optimize?

• TableMiner: Zhang, Z. (2014). Towards efficient and effective semantic table interpretation. In Proceedings of the 13th International Semantic Web Conference, 487-502

• Sample Selection (contribution of this work)

Page 5: Ekaw2014 ziqi zhang

Method

Page 6: Ekaw2014 ziqi zhang

TableMiner (modified)

• Incremental inference (I-Inf) to address two tasks

• Column classification

• Using some data in the column

• Cell disambiguation

• Using column label to constrain disambiguation

Page 7: Ekaw2014 ziqi zhang

TableMiner (modified)

• Incremental inference (I-Inf)

• Tj – a column

• Cj – candidate concepts for the column

• Ei,j – candidate entities for a cell

Page 8: Ekaw2014 ziqi zhang

TableMiner (modified)

[Figure: incremental inference shown as numbered steps 1, 2, 3, …, repeated until Cj changes little (convergence).]
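The stopping rule on this slide can be illustrated with a minimal sketch. It is not TableMiner's actual scoring or convergence test: the `candidate_concepts` callable, the additive score update, and the "total change below a threshold" rule are all assumptions made for illustration.

```python
def incremental_classify(cells, candidate_concepts, threshold=0.01):
    """Process cells in order and stop once the concept scores barely change.

    candidate_concepts: hypothetical callable, cell text -> {concept: score}.
    Returns (normalised concept scores, number of cells processed).
    """
    totals = {}      # running, unnormalised score per candidate concept (Cj)
    previous = {}    # normalised scores after the previous cell
    current = {}
    processed = 0
    for text in cells:
        processed += 1
        for concept, score in candidate_concepts(text).items():
            totals[concept] = totals.get(concept, 0.0) + score
        norm = sum(totals.values()) or 1.0
        current = {c: s / norm for c, s in totals.items()}
        # Assumed convergence measure: summed absolute change in scores.
        change = sum(abs(current.get(c, 0.0) - previous.get(c, 0.0))
                     for c in set(current) | set(previous))
        if processed > 1 and change < threshold:
            break
        previous = current
    return current, processed


# Toy column of 99 identical cells; the lookup is a stand-in, not a real service.
scores, used = incremental_classify(["Microsoft"] * 99,
                                    lambda text: {"Company": 1.0})
```

In this toy case the loop stops after the second cell, because the normalised scores no longer change, so only a small part of the column is read.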

Page 9: Ekaw2014 ziqi zhang

TableMiner (modified)

Cj = {<c1,s1’>, <c2,s2’>, <c3,s3’>, …, <c11,s11’>}

Column label (class) used as constraint in selecting candidate entities for disambiguation
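One way to picture the constraint is a filter over candidate entities. This is only a sketch: the data shapes, the `top_k` parameter, and the entity `Ent:THQ_other` are hypothetical, and the real system may apply the constraint differently.

```python
def constrain_candidates(candidates, column_concepts, top_k=1):
    """Keep candidate entities whose types include a top-scoring column concept.

    candidates: list of (entity_id, set_of_types) pairs (assumed shape).
    column_concepts: list of (concept, score) pairs, i.e. Cj.
    """
    ranked = sorted(column_concepts, key=lambda cs: cs[1], reverse=True)
    allowed = {concept for concept, _ in ranked[:top_k]}
    return [(ent, types) for ent, types in candidates if types & allowed]


# Hypothetical candidates for the cell "THQ" in the Publisher column.
candidates = [("Ent:THQ", {"Company"}), ("Ent:THQ_other", {"Work"})]
column_concepts = [("Company", 0.8), ("Work", 0.2)]
kept = constrain_candidates(candidates, column_concepts)
```

Only the candidate typed as Company survives, so the disambiguation step has fewer entities to compare.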

Page 10: Ekaw2014 ziqi zhang

Sample Selection – the Principle

• ‘Order matters’

• TableMiner processes data in order until convergence

• Changing the order means

• (Possibly) Different convergence speed

• Different data are processed

• Change the order of cells in a column (and corresponding row) such that

• cells that are ‘easier’ to disambiguate come to the top

• because the class for a column depends on cells in the column

Page 11: Ekaw2014 ziqi zhang

Sample Selection – ‘name length’ hypothesis

• Longer names are easier to disambiguate than shorter names

• e.g., “Manchester” vs. “Manchester United F.C.”

• Method ‘name length’ (nl):

• nl(Ti,j) = # of tokens in cell Ti,j

• Re-order table rows by sorting on column Tj using nl(Ti,j)
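The name length method can be sketched in a few lines. Two assumptions are made here: tokens are split on whitespace, and rows are sorted in descending order so that longer (presumably easier) names come to the top.

```python
def name_length(cell_text):
    """nl: number of whitespace-separated tokens in a cell."""
    return len(cell_text.split())


def reorder_by_name_length(rows, col):
    """Sort whole rows so that cells with more tokens in column `col` come first."""
    return sorted(rows, key=lambda row: name_length(row[col]), reverse=True)


rows = [
    ["Titan Quest", "THQ", "2006"],
    ["Gears of War", "Microsoft", "2006"],
    ["Civilization IV", "2k Games", "2006"],
]
reordered = reorder_by_name_length(rows, col=0)
```

Python's `sorted` is stable, so rows with equal name length keep their original relative order.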

Page 12: Ekaw2014 ziqi zhang

Sample Selection – ‘feature density’ hypothesis

• Names that have a richer feature representation are easier to disambiguate

• Bag-of-words (B.O.W.) representation using row context

• ‘one-sense-per-discourse’ (ospd) in non-subject columns

Page 13: Ekaw2014 ziqi zhang

Sample Selection – ‘feature density’ hypothesis

• Method ‘duplicate content cell’ (dup)

• Re-arrange the target column and table following one-sense-per-discourse (ospd)

• dup(Ti,j) = # of times the text of Ti,j is duplicated in column Tj

• Re-order table rows by sorting on column Tj using dup(Ti,j)
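The dup ordering can be sketched as below. It assumes exact string matching of cell text and a descending sort, and it uses the raw occurrence count as dup; counting only the extra occurrences (count minus one) would give the same ordering.

```python
from collections import Counter


def reorder_by_duplicates(rows, col):
    """Sort rows so that cells whose text repeats most often in `col` come first."""
    counts = Counter(row[col] for row in rows)
    return sorted(rows, key=lambda row: counts[row[col]], reverse=True)


rows = [
    ["Gears of War", "Microsoft", "2006"],
    ["Civilization IV", "2k Games", "2006"],
    ["Titan Quest", "THQ", "2006"],
    ["Civilization V", "2k Games", "2010"],
]
by_publisher = reorder_by_duplicates(rows, col=1)
```

Here the two "2k Games" rows move to the top of the Publisher column, since that text occurs twice.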

Page 14: Ekaw2014 ziqi zhang

Sample Selection – ‘feature density’ hypothesis

• Method ‘feature representation size’ (rep)

• Re-arrange the target column and table following one-sense-per-discourse (ospd)

• rep(Ti,j) = # of tokens in the B.O.W. representation of Ti,j

• Re-order table rows by sorting on column Tj using rep(Ti,j)
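A rough sketch of rep follows. Several details are assumptions rather than the method as published: the bag of words is built from the other cells in the same row, rows sharing the same cell text pool their context (one reading of ospd), tokens are lowercased and split on whitespace, and repeated tokens are counted.

```python
from collections import Counter, defaultdict


def rep_sizes(rows, col):
    """Map each distinct cell text in column `col` to the size of its bag of words."""
    bows = defaultdict(Counter)
    for row in rows:
        for j, other in enumerate(row):
            if j != col:  # row context = every other cell in the same row
                bows[row[col]].update(other.lower().split())
    return {text: sum(bag.values()) for text, bag in bows.items()}


def reorder_by_rep(rows, col):
    """Sort rows so that cells with the largest bag of words come first."""
    sizes = rep_sizes(rows, col)
    return sorted(rows, key=lambda row: sizes.get(row[col], 0), reverse=True)


rows = [
    ["Gears of War", "Microsoft", "2006"],
    ["Civilization IV", "2k Games", "2006"],
    ["Titan Quest", "THQ", "2006"],
    ["Civilization V", "2k Games", "2010"],
]
sizes = rep_sizes(rows, col=1)
by_rep = reorder_by_rep(rows, col=1)
```

Under these assumptions "2k Games" gets the largest representation because it pools the context of two rows, which is how duplicated cells end up with denser features.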

Page 15: Ekaw2014 ziqi zhang

Page 16: Ekaw2014 ziqi zhang

Evaluation

Page 17: Ekaw2014 ziqi zhang

Data

• Data

• Freebase as reference ontology/background knowledge

• Limaye200 – 200 Web tables from Limaye2010 originally annotated with Wikipedia

• Column classes are manually annotated

• LimayeAll – 6310 Web tables from Limaye2010

• Names in content cells are automatically mapped to Freebase

Page 18: Ekaw2014 ziqi zhang

Settings

• Baseline

• TM_bs – TableMiner modified to use all cells in a column for column classification (everything else unchanged)

• Comparison*

• TM_mod^nl – TableMiner using the name length sample selection method

• TM_mod^dup – TableMiner using the duplicate content cell sample selection method

• TM_mod^rep – TableMiner using the feature representation size sample selection method

* The original TableMiner is modified. For details and other settings see paper.

Page 19: Ekaw2014 ziqi zhang

Results

• Results in F1

                              TM_bs   TM_mod^nl   TM_mod^dup   TM_mod^rep
Classification (Limaye200)    72.1    72.3        72.0         72.1
Disambiguation (LimayeAll)    80.9    81.3        81.22        81.24

• Convergence speed in column classification

                              TM_bs   TM_mod^nl   TM_mod^dup   TM_mod^rep
Limaye200                     100%    36.3%       36.1%        35.3%

• Reduced candidate named entities for disambiguation

                              TM_bs   TM_mod^nl   TM_mod^dup   TM_mod^rep
Limaye200                     0       32.4%       48.1%        46.8%

Page 20: Ekaw2014 ziqi zhang

Results

• Comparable or better accuracy (F1)

• … while using only partial data for column classification

• … and processing far fewer candidate entities for disambiguation

Page 21: Ekaw2014 ziqi zhang

Conclusion

• Learning with partial data for semantic table interpretation can be both effective and efficient

• The choice of sample selection methods makes limited difference in terms of accuracy and efficiency

Page 22: Ekaw2014 ziqi zhang

Thank you

@ziqizhang_zz http://staffwww.dcs.shef.ac.uk/people/Z.Zhang