A confidence-based framework for disambiguating geographic terms Erik Rauch, Michael Bukatin, and Kenneth Baker MetaCarta, Inc.

Page 1:

A confidence-based framework for disambiguating geographic terms

Erik Rauch, Michael Bukatin, and Kenneth Baker

MetaCarta, Inc.

Page 3:

‘wine’ in Europe

Page 4:

Al Hamra

(= ‘red’ in Arabic)

Page 6:

Local and non-local information

Madison

Wisconsin

Milwaukee’s downtown

Using more non-local information -> too many states to estimate probabilities

Page 7:

Candidate places

• 38°01'10.5"N 121°44'48.8"W

• four miles south of Lusaka – (22.10 S 15.51 E)

• Deir az Zor – (32.10 N 41.11 E), confidence 0.325

– (25.03 N 31.44 E), confidence 0.151

– (….)
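One way to picture the candidate list above is as a set of (place, coordinates, confidence) records per name. The following is a minimal sketch, not the authors' implementation: the `Candidate` class and `best_candidate` helper are hypothetical names, and the slide's coordinates are stored as plain numbers exactly as written, without assuming whether they are decimal degrees or degrees and minutes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One possible location for a name (hypothetical structure)."""
    name: str
    lat: float         # northern latitude, value copied from the slide
    lon: float         # eastern longitude, value copied from the slide
    confidence: float  # confidence that the name refers to this point


def best_candidate(candidates):
    """Return the candidate with the highest confidence."""
    return max(candidates, key=lambda c: c.confidence)


# The two fully specified candidates for "Deir az Zor" from the slide.
deir_az_zor = [
    Candidate("Deir az Zor", 32.10, 41.11, 0.325),
    Candidate("Deir az Zor", 25.03, 31.44, 0.151),
]

top = best_candidate(deir_az_zor)
print(top.lat, top.lon, top.confidence)
```

Keeping every candidate, rather than only the top one, leaves room for later steps to raise or lower individual confidences.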

Page 8:

Local context

resident of Madison

Minister Ishihara

Ishihara, Japan (32.36 N 147.21 E)

Madison, WI; Madison, ID; Madison, CT; Madison, KY…

Page 9:

Context affects confidence

• Increase or decrease c(p,n) based on strength of context words

– “by Madison” vs. “President Madison”

– can be added manually or automatically

• and/or use HMM
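The context adjustment could be sketched as follows. This is an illustration under stated assumptions, not the method from the talk: the `CONTEXT_STRENGTH` table and its values are invented for the example, and scaling the odds of c(p,n) is just one convenient way to keep the result between 0 and 1.

```python
# Hypothetical context strengths: values above 1 support a geographic
# reading of the following name, values below 1 count against it.
CONTEXT_STRENGTH = {
    "by": 2.0,
    "resident of": 4.0,
    "president": 0.1,
    "minister": 0.2,
}


def adjust_confidence(confidence, context):
    """Scale the odds of `confidence` by the strength of the context phrase."""
    strength = CONTEXT_STRENGTH.get(context.lower(), 1.0)
    if confidence <= 0.0 or confidence >= 1.0:
        return confidence  # certain values are left unchanged
    odds = confidence / (1.0 - confidence) * strength
    return odds / (1.0 + odds)


print(adjust_confidence(0.5, "by"))         # "by Madison": raised
print(adjust_confidence(0.5, "President"))  # "President Madison": lowered
```

Entries in such a table could be written by hand or derived from tagged text, matching the "manually or automatically" point above.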

Page 10:

Local context problems

Madison family attractions

Madison, WI; Madison, ID; Madison, CT; Madison, KY…

Milwaukee

Page 11:

Using spatial patterns of geographic references

Page 12:

Madison

Milwaukee

Wisconsin

Increase c(p,n) based on number of other references:

Enclosing regions or nearby points
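A rough sketch of this spatial heuristic is given below. The function names, the 200 km radius, the 0.1 step, and the update rule are all assumptions made for illustration, and the coordinates are approximate decimal degrees for the real cities rather than values from the slides.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def supporting_references(candidate, other_points, mentioned_regions, radius_km=200.0):
    """Count nearby points plus one if the candidate's enclosing region is mentioned."""
    nearby = sum(
        1 for lat, lon in other_points
        if haversine_km(candidate["lat"], candidate["lon"], lat, lon) <= radius_km
    )
    enclosing = 1 if candidate["region"] in mentioned_regions else 0
    return nearby + enclosing


def boosted(confidence, n_support, step=0.1):
    """Move confidence toward 1 by a fixed fraction per supporting reference."""
    return 1.0 - (1.0 - confidence) * (1.0 - step) ** n_support


madison_wi = {"lat": 43.07, "lon": -89.40, "region": "Wisconsin"}
madison_ct = {"lat": 41.28, "lon": -72.60, "region": "Connecticut"}
milwaukee = (43.04, -87.91)

for cand in (madison_wi, madison_ct):
    n = supporting_references(cand, [milwaukee], {"Wisconsin"})
    print(cand["region"], n, boosted(0.3, n))
```

In a document that also mentions Milwaukee and Wisconsin, only the Wisconsin candidate for "Madison" gains confidence; the Connecticut candidate stays where it started.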

Page 13:

Pitfalls

Ishihara, Japan (32.36 N 147.21 E)

Ishihara, Japan’s leading epidemiologist,

Page 14:

Training

• “Philadelphia” is usually geographic; “Bend” usually isn’t

• If name n often refers to point p in documents, give (n,p) high confidence to start with

• Use average confidence in a large corpus
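The averaging step might look like the sketch below. The input format is an assumption: each occurrence of a name in the corpus is taken to carry a mapping from candidate point to the confidence assigned in that document, with an empty mapping when the name was judged non-geographic. The point labels and numbers are made up for the example.

```python
from collections import defaultdict


def average_confidences(occurrences):
    """Average the confidence of each (name, point) over all occurrences of the name.

    `occurrences` is an iterable of (name, {point: confidence}) pairs, one per
    mention in the corpus (hypothetical format).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for name, per_point in occurrences:
        counts[name] += 1
        for point, conf in per_point.items():
            totals[(name, point)] += conf
    return {(n, p): t / counts[n] for (n, p), t in totals.items()}


corpus = [
    ("Philadelphia", {"Philadelphia, PA": 0.9}),
    ("Philadelphia", {"Philadelphia, PA": 0.7}),
    ("Bend", {"Bend, OR": 0.1}),
    ("Bend", {}),  # a non-geographic use, e.g. the verb
]

priors = average_confidences(corpus)
print(priors)
```

Dividing by the number of occurrences of the name, rather than of the pair, means non-geographic uses pull the starting confidence down, which is the "Bend" effect described above.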

Page 15:

Training cont’d

• Extract local linguistic contexts that often occur with geographic names in tagged corpora

• Or train HMM
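Extracting local contexts from a tagged corpus can be approximated by counting how often the word before a candidate name is followed by a name tagged as geographic. This sketch assumes the corpus has already been reduced to (preceding word, is-geographic) pairs; that input format, the `min_count` cutoff, and the function name are illustrative only, and it does not cover the HMM alternative.

```python
from collections import Counter


def context_geo_rates(tagged, min_count=2):
    """Fraction of times each preceding word introduces a geographic name.

    Words seen fewer than `min_count` times are dropped as unreliable.
    """
    geo, total = Counter(), Counter()
    for word, is_geo in tagged:
        w = word.lower()
        total[w] += 1
        if is_geo:
            geo[w] += 1
    return {w: geo[w] / n for w, n in total.items() if n >= min_count}


# Toy tagged data, invented for the example.
tagged = [
    ("in", True), ("in", True), ("in", False),
    ("President", False), ("President", False),
    ("near", True),
]

rates = context_geo_rates(tagged)
print(rates)
```

Words with a rate near 1 or near 0 are the strong contexts; words near 0.5 say little either way.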

Page 16:

Relevance

• Several dimensions to relevance:

– Traditional textual relevance of query terms

– Georelevance

Query: “cheese” in France

Page 17:

Georelevance

• Aim: combination reflects user’s preferred balance between recall and correctness of the geographic reference

• e.g. Georelevance = query term relevance * geoconfidence

• Depends on:

– Attributes of the geotext, e.g. document frequency, font size, position

– Geoconfidence
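The example product formula on this slide can be tried out directly. The document names and scores below are invented, and the sketch ignores the geotext attributes (document frequency, font size, position) that the slide also lists.

```python
def georelevance(term_relevance, geoconfidence):
    """Slide's example combination: query term relevance times geoconfidence."""
    return term_relevance * geoconfidence


# (document id, query term relevance, geoconfidence): hypothetical values
docs = [
    ("doc_a", 0.9, 0.2),  # strong text match, doubtful place reference
    ("doc_b", 0.6, 0.8),  # weaker text match, confident place reference
]

ranked = sorted(docs, key=lambda d: georelevance(d[1], d[2]), reverse=True)
for doc_id, rel, conf in ranked:
    print(doc_id, georelevance(rel, conf))
```

With a plain product, a confident geographic reference can outrank a stronger text match; a different combination would shift the balance between recall and correctness.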

Page 18:

Conclusion

• Ambiguity problem much worse with large gazetteers

• Can use probabilistic methods where feasible (local information), combine with confidence-based heuristics