Information Extraction Shih-Hung Wu Assistant Professor CSIE,
Chaoyang University of Technology
Slide 2
Outline
Information Extraction
  Introduction
Applications
  Table Reading
  Citation Extraction
  Chinese Named Entity Recognition
Slide 3
Introduction
Slide 4
Information Extraction: extracts pieces of information that are
salient to the user's needs.
Slide 5
Message Understanding Conferences (MUC)
The evaluations provide prepared data and task definitions, together
with fully automated scoring software to measure machine and human
performance. The databases now include named entities, multilingual
named entities, attributes of those entities, facts about
relationships between entities, and events in which the entities
participated. The multilingual portion was known as the "Multilingual
Entity Task (MET)".
Slide 6
Examples The following fictional news story portrays the levels
of detail that systems can extract: Fletcher Maddox, former Dean of
the UCSD Business School, announced the formation of La Jolla
Genomatics together with his two sons. La Jolla Genomatics will
release its product Geninfo in June 1999. Geninfo is a turnkey
system to assist biotechnology researchers in keeping up with the
voluminous literature in all aspects of their field. Dr. Maddox
will be the firm's CEO. His son, Oliver, is the Chief Scientist and
holds patents on many of the algorithms used in Geninfo. Oliver's
brother, Ambrose, follows more in his father's footsteps and will
be the CFO of L.J.G. headquartered in the Maddox family's hometown
of La Jolla, CA.
Slide 7
Entities:
Persons: Fletcher Maddox, Dr. Maddox, Oliver, Oliver, Ambrose, Maddox
Organizations: UCSD Business School, La Jolla Genomatics, La Jolla
Genomatics, L.J.G.
Locations: La Jolla, CA
Artifacts: Geninfo, Geninfo
Dates: June 1999
Slide 8
Attributes:
NAME: Fletcher Maddox
  DESCRIPTOR: former Dean of the UCSD Business School; his father;
  the firm's CEO
  CATEGORY: PERSON
NAME: La Jolla Genomatics
  DESCRIPTOR:
  CATEGORY: ORGANIZATION
NAME: Geninfo
  DESCRIPTOR: its product
  CATEGORY: ARTIFACT
NAME: La Jolla
  DESCRIPTOR: the Maddox family's hometown
  CATEGORY: LOCATION
Slide 9
Facts:
PERSON            Employee_of   ORGANIZATION
Fletcher Maddox   Employee_of   UCSD Business School
Fletcher Maddox   Employee_of   La Jolla Genomatics
Oliver            Employee_of   La Jolla Genomatics
Ambrose           Employee_of   La Jolla Genomatics

ARTIFACT          Product_of    ORGANIZATION
Geninfo           Product_of    La Jolla Genomatics

LOCATION          Location_of   ORGANIZATION
La Jolla          Location_of   La Jolla Genomatics
CA                Location_of   La Jolla Genomatics
Slide 10
Events:
COMPANY-FORMATION_EVENT:
  COMPANY: La Jolla Genomatics
  PRINCIPALS: Fletcher Maddox, Oliver, Ambrose
  DATE:
  CAPITAL:
RELEASE-EVENT:
  COMPANY: La Jolla Genomatics
  PRODUCT: Geninfo
  DATE: June 1999
  COST:
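The four levels of output shown on the preceding slides (entities, attributes, facts, events) map naturally onto simple record types. The following is only a sketch of one possible representation; the class names `Entity`, `Fact`, and `Event` and the helper `employers` are hypothetical and do not come from MUC or from the slides.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    name: str
    category: str            # PERSON, ORGANIZATION, LOCATION, ARTIFACT, DATE
    descriptors: tuple = ()  # attribute level: free-text descriptors


@dataclass(frozen=True)
class Fact:
    subject: str
    relation: str            # Employee_of, Product_of, Location_of
    obj: str


@dataclass
class Event:
    kind: str
    slots: dict = field(default_factory=dict)  # empty string = unfilled slot


entities = [
    Entity("Fletcher Maddox", "PERSON",
           ("former Dean of the UCSD Business School", "the firm's CEO")),
    Entity("La Jolla Genomatics", "ORGANIZATION"),
    Entity("Geninfo", "ARTIFACT", ("its product",)),
    Entity("La Jolla", "LOCATION", ("the Maddox family's hometown",)),
]

facts = [
    Fact("Fletcher Maddox", "Employee_of", "UCSD Business School"),
    Fact("Fletcher Maddox", "Employee_of", "La Jolla Genomatics"),
    Fact("Oliver", "Employee_of", "La Jolla Genomatics"),
    Fact("Ambrose", "Employee_of", "La Jolla Genomatics"),
    Fact("Geninfo", "Product_of", "La Jolla Genomatics"),
]

events = [
    Event("COMPANY-FORMATION_EVENT",
          {"COMPANY": "La Jolla Genomatics",
           "PRINCIPALS": ["Fletcher Maddox", "Oliver", "Ambrose"],
           "DATE": "", "CAPITAL": ""}),
    Event("RELEASE-EVENT",
          {"COMPANY": "La Jolla Genomatics", "PRODUCT": "Geninfo",
           "DATE": "June 1999", "COST": ""}),
]


def employers(person, facts):
    """Return every organization linked to `person` by Employee_of."""
    return [f.obj for f in facts
            if f.subject == person and f.relation == "Employee_of"]
```

Keeping unfilled slots (DATE, CAPITAL, COST) as empty strings mirrors the templates above, where a slot stays blank when the text does not supply a value.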
Slide 11
Information Extraction: current indicators of the state of the art

Items of Information   Reliability (%)
Entities               90
Attributes             80
Facts                  70
Events                 60
Slide 12
Technical definition of IE
The process of creating database entries by skimming a text and
looking for occurrences of a particular class of object or event and
for relationships among those objects and events [Russell, Norvig
2003].
Slide 13
Basic IE tasks
Extract addresses from Web pages
  target: street, city, state, and zip code
Extract storms from weather reports
  target: temperature, wind speed, and precipitation
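As a rough illustration of the address task, the sketch below pulls the four target fields out of a line of text with a single regular expression. It assumes a simplified US-style layout ("number street, city, ST 12345"); the pattern, the function name `extract_address`, and the sample address are all illustrative assumptions, and real Web pages would need far more robust handling.

```python
import re

# Assumed layout: "<number> <street>, <city>, <two-letter state> <5-digit zip>"
ADDRESS = re.compile(
    r"(?P<street>\d+\s[\w .]+?),\s"
    r"(?P<city>[A-Za-z .]+?),\s"
    r"(?P<state>[A-Z]{2})\s"
    r"(?P<zip>\d{5})"
)


def extract_address(text):
    """Return a dict with street, city, state, zip, or None if no match."""
    m = ADDRESS.search(text)
    return m.groupdict() if m else None


record = extract_address("Visit us at 123 Main St., Springfield, IL 62704 today.")
```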
Slide 14
IE Applications
Competitive intelligence: find instances of corporate mergers and
joint ventures.
Intelligence gathering: find terrorist activities, including any
damage to buildings or the infrastructure, as well as the time and
location of the event.
Health care delivery: summarize medical patient records by
extracting diagnoses, symptoms, physical findings, test results,
and therapeutic treatments.
Slide 15
Technology
Methods in the literature
  Regular expressions
  Cascaded finite-state transducers
Our approaches
  Ontological domain knowledge
  Machine learning
  Hybrid method
Slide 16
Regular expression approach example
From the text: 17in SXGA Monitor for only $249.99
Extract: ∃m m ∈ ComputerMonitors ∧ Size(m, Inches(17)) ∧
Price(m, $(249.99)) ∧ Resolution(m, 1280 × 1024)
Slide 17
Regular Expressions
[0-9]                    Any digit from 0 to 9
[0-9]+                   One or more digits
.[0-9][0-9]              A period followed by two digits
(.[0-9][0-9])?           A period followed by two digits, or nothing
$[0-9]+(.[0-9][0-9])?    Matches $249.99, $1.23, $100000
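The price pattern above can be tried directly in Python. This is a minimal sketch: in Python's `re` syntax the period and the dollar sign must be escaped, and the size pattern, the `RESOLUTIONS` lookup table, and the function name `extract_monitor` are assumptions added for illustration.

```python
import re

# Python form of $[0-9]+(.[0-9][0-9])? with "$" and "." escaped.
PRICE = re.compile(r"\$[0-9]+(\.[0-9][0-9])?")
# Assumed pattern for a screen size such as "17in".
SIZE = re.compile(r"([0-9]+)\s?in\b")
# Assumed lookup from a display-standard name to its pixel resolution.
RESOLUTIONS = {"SXGA": (1280, 1024)}


def extract_monitor(text):
    """Fill a small attribute record for one monitor advertisement."""
    price = PRICE.search(text)
    size = SIZE.search(text)
    resolution = next(
        (res for name, res in RESOLUTIONS.items() if name in text), None
    )
    return {
        "size_inches": int(size.group(1)) if size else None,
        "price": float(price.group(0)[1:]) if price else None,
        "resolution": resolution,
    }


record = extract_monitor("17in SXGA Monitor for only $249.99")
```

Here `record` holds the size, price, and resolution that the logical form on the previous slide attaches to the monitor m.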
Slide 18
Weakness
What's the price?
List price $99.00, special sale price $78.00, shipping $3.00.
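The weakness is easy to reproduce: the price pattern matches all three amounts and gives no basis for choosing among them. The second pattern below, which keeps the words preceding each amount, is only an assumed workaround for illustration, not a method from the slides.

```python
import re

PRICE = re.compile(r"\$[0-9]+(?:\.[0-9][0-9])?")
# Assumed helper pattern: capture the words immediately before each amount.
LABELED = re.compile(r"([A-Za-z ]+?)\s*(\$[0-9]+(?:\.[0-9][0-9])?)")

text = "List price $99.00, special sale price $78.00, shipping $3.00."

# Every amount matches; the pattern alone cannot say which is "the" price.
prices = PRICE.findall(text)

# Keeping the preceding words gives later stages something to decide with.
labeled = [(m.group(1).strip(), m.group(2)) for m in LABELED.finditer(text)]
```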
Slide 19
Cascaded finite-state transducers approach example
From the text: Bridgestone Sports Co. said Friday it has set up a
joint venture in Taiwan with a local concern and a Japanese trading
house to produce golf clubs to be shipped to Japan.
Extract: ∃e e ∈ JointVentures ∧ Product(e, golf clubs) ∧
Date(e, Friday) ∧ Entity(e, Bridgestone Sports Co) ∧
Entity(e, a local concern) ∧ Entity(e, a Japanese trading house)
Slide 20
Cascaded finite-state transducers
A typical relational extraction system consists of the following
five stages:
1. Tokenization
2. Complex word handling
3. Basic group handling
4. Complex phrase handling
5. Structure merging
Slide 21
Tokenization
  Word segmentation
Complex word handling
  Bridgestone Sports Co.
  CapitalizedWord+(Company|Co|Inc|Ltd)
  Intel Chairman Andy Grove
  CapitalizedWord+(Grove|Forest|Village|)
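The company rule CapitalizedWord+(Company|Co|Inc|Ltd) can be written as an ordinary regular expression. In this sketch, the definition of a capitalized word as `[A-Z][a-z]+` and the optional trailing period are simplifying assumptions, and `find_companies` is a hypothetical helper name.

```python
import re

# One or more capitalized words followed by a company suffix.
COMPANY = re.compile(r"(?:[A-Z][a-z]+ )+(?:Company|Co|Inc|Ltd)\b\.?")


def find_companies(text):
    """Return every span that the company-name rule recognizes."""
    return [m.group(0) for m in COMPANY.finditer(text)]


found = find_companies(
    "Bridgestone Sports Co. said Friday it has set up a joint venture in Taiwan"
)
```

Note that "Friday" and "Taiwan" are capitalized too, but neither is followed by a company suffix, so the rule leaves them alone.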
Slide 22
Basic group handling
Noun group (NG), verb group (VG), preposition (PR), conjunction (CJ)
 1 NG: Bridgestone Sports Co.
 2 VG: said
 3 NG: Friday
 4 NG: it
 5 VG: has set up
 6 NG: a joint venture
 7 PR: in
 8 NG: Taiwan
 9 PR: with
10 NG: a local concern
11 CJ: and
12 NG: a Japanese trading house
13 VG: to produce
14 NG: golf clubs
15 VG: to be shipped
16 PR: to
17 NG: Japan
Slide 23
Complex phrase handling
  Company+ SetUp JointVenture (with Company+)?
Structure merging
  If the next sentence says something about the same event, the two
  structures are merged into one.
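A rule like Company+ SetUp JointVenture (with Company+)? operates on the sequence of basic groups rather than on raw characters. One way to sketch this is to map each group to a one-letter symbol and run a regular expression over the resulting string. Everything here is an illustrative assumption: the symbol alphabet, the `classify` heuristics, and the relaxed rule, which lets the subject company sit a few groups before the verb and accepts plain noun groups as partners.

```python
import re

CHUNKS = [
    ("NG", "Bridgestone Sports Co."), ("VG", "said"), ("NG", "Friday"),
    ("NG", "it"), ("VG", "has set up"), ("NG", "a joint venture"),
    ("PR", "in"), ("NG", "Taiwan"), ("PR", "with"),
    ("NG", "a local concern"), ("CJ", "and"),
    ("NG", "a Japanese trading house"), ("VG", "to produce"),
    ("NG", "golf clubs"), ("VG", "to be shipped"), ("PR", "to"),
    ("NG", "Japan"),
]


def classify(tag, text):
    """Map a basic group to a one-letter symbol (assumed alphabet)."""
    low = text.lower()
    if tag == "NG":
        if re.search(r"\b(Company|Co|Inc|Ltd)\b\.?$", text):
            return "C"                      # company
        return "J" if "joint venture" in low else "N"
    if tag == "VG":
        return "S" if "set up" in low else "V"
    if tag == "PR":
        return "W" if low == "with" else "P"
    return "A"                              # conjunction


# SetUp JointVenture, optional "in <place>", optional "with N (and N)*".
RULE = re.compile(r"SJ(?:PN)?(?:W(N(?:AN)*))?")


def extract_joint_venture(chunks):
    symbols = "".join(classify(tag, text) for tag, text in chunks)
    m = RULE.search(symbols)
    if m is None:
        return None
    entities = []
    subject = symbols.rfind("C", 0, m.start())  # nearest company before the verb
    if subject != -1:
        entities.append(chunks[subject][1])
    if m.group(1):                              # partners listed after "with"
        start, end = m.span(1)
        entities += [chunks[i][1] for i in range(start, end)
                     if symbols[i] == "N"]
    return {"event": "JointVenture", "entities": entities}


result = extract_joint_venture(CHUNKS)
```

Because each group becomes exactly one character, match positions in the symbol string index straight back into the list of groups.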
Slide 24
A brief remark
IE works well for a restricted domain, where the subjects and the
ways they are mentioned can be predetermined.
Slide 25
Applications
Slide 26
Table Reading Citation Extraction Chinese NER
Slide 27
Semantic Search on Internet Tabular Information Extraction for
Answering Queries (CIKM 2000)
Slide 28
Table Reading
Gives an algorithm to interpret tables in which some cells span
multiple rows or columns. An example of an interpretation is:
(Attribute) => (Value)
(Adult-Price-Single Room-Economic class) => 35,450
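To make the attribute-path idea concrete, the sketch below expands spanning cells into a full grid and then joins the header cells above and to the left of each value. This is a simplified stand-in, not the tagging and layout-recognition algorithm of the paper. The table layout is hypothetical, and every value except 35,450 is made up for illustration.

```python
# Each cell is (text, rowspan, colspan). Hypothetical layout: three header
# rows on top, one header column on the left.
TABLE = [
    [("", 3, 1), ("Adult", 1, 2)],
    [("Price", 1, 2)],
    [("Single Room", 1, 1), ("Double Room", 1, 1)],
    [("Economic class", 1, 1), ("35,450", 1, 1), ("40,000", 1, 1)],  # 40,000 is made up
    [("Business class", 1, 1), ("50,000", 1, 1), ("60,000", 1, 1)],  # made-up row
]


def expand(rows):
    """Copy each spanning cell into every grid position it covers."""
    grid = {}
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            while (r, c) in grid:        # position already taken by a span
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    return grid


def interpret(rows, header_rows, header_cols):
    """Map 'attribute path' -> value for every data cell."""
    grid = expand(rows)
    n_rows = 1 + max(r for r, _ in grid)
    n_cols = 1 + max(c for _, c in grid)
    result = {}
    for r in range(header_rows, n_rows):
        for c in range(header_cols, n_cols):
            path = [grid[(h, c)] for h in range(header_rows)]
            path += [grid[(r, h)] for h in range(header_cols)]
            result["-".join(path)] = grid[(r, c)]
    return result


pairs = interpret(TABLE, header_rows=3, header_cols=1)
```

With this layout, `pairs["Adult-Price-Single Room-Economic class"]` is "35,450", the interpretation given above.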
Tagging
Cells are tagged as concepts (C) or instances (I):
C: Departure Information, C: Departure City, C: Arrival City,
I: Departure City, I: Arrival City
Concept vs. descendant concept
C: Departure City, I: Departure City
Concept vs. instance of the concept
Instance vs. instance of the same concept
Slide 34
Four Relations of Table Cells
Relations of concepts and instances:
1. Concept - instance of the concept
2. Concept - descendant concept
3. Concept - instance of a descendant concept
4. Instance - instance of the same concept
Slide 35
Layout Recognition
The C-I table is compared with the layout descriptions by template
matching, which yields the matched layout description. Layout
descriptions are defined by a layout syntax grammar.
Experiments
23 tables from 23 web pages: 13 two-dimensional tables and 10
complex tables.
A table counts as a success only if no cell is missed; any miss
counts as a failure.
Slide 38
Conclusion & Future Works
Layout transformation from complex tables to simple tables (1D, 2D).
A general approach:
1. Tagging
2. Semantic layout recognition
3. Layout transformation
Ambiguity is reduced by checking cell relations.
Slide 39
Reference
Huei-Long Wang, Shih-Hung Wu, I. C. Wang, Cheng-Lung Sung, W. L. Hsu,
and W. K. Shih, "Semantic Search on Internet Tabular Information
Extraction for Answering Queries," Ninth International Conference on
Information and Knowledge Management (CIKM-2000), McLean, VA,
November 6-11, 2000, pp. 243-249. (EI)
H.-H. Chen, S.-C. Tsai, and J.-H. Tsai, "Mining Tables from Large
Scale HTML Texts," in Proc. 18th International Conference on
Computational Linguistics, Saarbrücken, Germany, July 2000.
Slide 40
A Knowledge-based Approach to Citation Extraction IRI-2005
Slide 41
Introduction
Integration of the bibliographical information of scholarly
publications available on the Internet
Accurate reference metadata extraction from heterogeneous reference
sources
We propose a knowledge-based approach to reference metadata
extraction
  INFOMAP: an ontological knowledge representation framework
  Automatically extracts the reference metadata
Slide 42
Proposed Approach
Slide 43
Reference Data Collection (Phase 1)
Journal Spider (journal agent): collects journal data from the
Journal Citation Reports (JCR) indexed by the ISI and from digital
libraries on the Web.
Citation data sources: ISI Web of Science, DBLP, Citeseer, PubMed
Slide 44
Domain Knowledge Phase 2
Slide 45
INFOMAP
INFOMAP, an ontological knowledge representation framework, extracts
important citation concepts from natural language text.
Features of INFOMAP:
  represents and matches complicated template structures
  hierarchical matching
  regular expressions
  semantic template matching
  frame (non-linear relations) matching
Using INFOMAP, we can extract author, title, journal, volume,
number (issue), year, and page information from different kinds of
reference formats or styles.
Slide 46
Reference Metadata Extraction (Phase 3)
Table 1. Examples of different journal reference styles

Bioinformatics style (BIOI):
  Davenport, T., DeLong, D., & Beers, M. (1998) Successful knowledge
  management projects. Sloan Management Review, 39(2), 43-57.
ACM style (ACM):
  1. Davenport, T., DeLong, D. and Beers, M. 1998. Successful
  knowledge management projects. Sloan Management Review, 39 (2).
  43-57.
IEEE style (IEEE):
  [1] T. Davenport, D. DeLong, and M. Beers, "Successful knowledge
  management projects," Sloan Management Review, vol. 39, no. 2,
  pp. 43-57, 1998.
APA style (APA):
  Davenport, T., DeLong, D., & Beers, M. (1998). Successful knowledge
  management projects. Sloan Management Review, 39(2), 43-57.
JCB style (JCB):
  Davenport, T., DeLong, D., & Beers, M. 1998. Successful knowledge
  management projects. Sloan Management Review 39(2), 43-57.
MISQ style (MISQ):
  Davenport, T., DeLong, D., and Beers, M. "Successful knowledge
  management projects," Sloan Management Review (39:2) 1998,
  pp 43-57.
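The style examples differ mainly in field order and punctuation, which is why each style needs its own template. The sketch below handles two of the six styles with one hand-written regular expression each. It is a plain-regex stand-in for illustration and not the INFOMAP template matching used in the paper; the patterns and the function name `parse_reference` are assumptions.

```python
import re

# One assumed pattern per style; named groups are the metadata fields.
STYLE_PATTERNS = {
    "APA": re.compile(
        r"^(?P<author>.+?)\s\((?P<year>\d{4})\)\.\s(?P<title>.+?)\.\s"
        r"(?P<journal>.+?),\s(?P<volume>\d+)\((?P<issue>\d+)\),\s"
        r"(?P<pages>\d+-\d+)\.$"
    ),
    "IEEE": re.compile(
        r'^\[\d+\]\s*(?P<author>.+?),\s"(?P<title>.+?),"\s'
        r"(?P<journal>.+?),\svol\.\s(?P<volume>\d+),\sno\.\s(?P<issue>\d+),\s"
        r"pp\.\s(?P<pages>\d+-\d+),\s(?P<year>\d{4})\.$"
    ),
}


def parse_reference(reference):
    """Try each style in turn; return the fields plus the matched style."""
    for style, pattern in STYLE_PATTERNS.items():
        m = pattern.match(reference.strip())
        if m:
            return {"style": style, **m.groupdict()}
    return None


apa = parse_reference(
    "Davenport, T., DeLong, D., & Beers, M. (1998). Successful knowledge "
    "management projects. Sloan Management Review, 39(2), 43-57."
)
ieee = parse_reference(
    '[1]T. Davenport, D. DeLong, and M. Beers, "Successful knowledge '
    'management projects," Sloan Management Review, vol. 39, no. 2, '
    "pp. 43-57, 1998."
)
```

Both calls recover the same title, journal, volume, issue, pages, and year even though the two surface formats share almost no punctuation.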
Slide 47
Knowledge-based Reference Metadata Extraction - Online Service
Phase 4
Slide 48
Citation Extraction: From Text to BibTeX

W. L. Hsu, "The coloring and maximum independent set problems on
planar perfect graphs," J. Assoc. Comput. Machin., (1988), 535-563.
W. L. Hsu, "On the general feasibility test of scheduling lot sizes
for several products on one machine," Management Science 29, (1983),
93-105.
W. L. Hsu, "The distance-domination numbers of trees," Operations
Research Letters 1, (3), (1982), 96-100.
Figure 3. The system input of knowledge-based RME

@article{
  Author = {W. L. Hsu},
  Title = {The coloring and maximum independent set problems on planar perfect graphs,"},
  Journal = {J. Assoc. Comput. Machin.},
  Volume = {},
  Number = {},
  Pages = {535-563},
  Year = {1988}
}
@article{
  Author = {W. L. Hsu},
  Title = {On the general feasibility test of scheduling lot sizes for several products on one machine,"},
  Journal = {Management Science},
  Volume = {29},
  Number = {},
  Pages = {93-105},
  Year = {1983}
}
@article{
  Author = {W. L. Hsu},
  Title = {The distance-domination numbers of trees,"},
  Journal = {Operations Research Letters},
  Volume = {1},
  Number = {3},
  Pages = {96-100},
  Year = {1982}
}
Figure 5. The system output in BibTeX format
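Once the fields are extracted, producing the BibTeX record is a formatting step. The sketch below mirrors the layout of the system output above, including the absence of a citation key and the empty braces for missing fields; the function name `to_bibtex` and the field order are assumptions, and a bibliography tool would normally also require a citation key after `@article{`.

```python
FIELD_ORDER = ["Author", "Title", "Journal", "Volume", "Number", "Pages", "Year"]


def to_bibtex(fields):
    """Render extracted fields as an @article entry; missing fields stay empty."""
    body = ",\n".join(
        f"  {name} = {{{fields.get(name, '')}}}" for name in FIELD_ORDER
    )
    return "@article{\n" + body + "\n}"


entry = to_bibtex({
    "Author": "W. L. Hsu",
    "Title": "The coloring and maximum independent set problems on planar perfect graphs",
    "Journal": "J. Assoc. Comput. Machin.",
    "Pages": "535-563",
    "Year": "1988",
})
```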
Slide 49
Figure 6. The online service of knowledge-based RME
(http://bioinformatics.iis.sinica.edu.tw/CitationAgent/)
System input: plain text. System output: BibTeX.
Slide 50
Experimental Results and Discussion
Experimental data: We used EndNote to collect Bioinformatics
citation data for 2004 from PubMed. A total of 907 bibliography
records were collected from PubMed digital libraries on the Web.
Reference testing data was generated for each of the six reference
styles (BIOI, ACM, IEEE, APA, MISQ, and JCB), and 500 records were
randomly selected for testing from each of the six styles.
Slide 51
Experimental results of citation extraction from six reference
styles
Slide 52
Example Results
Slide 53
The various structures of different styles (analysis of the
structures of 30 reference styles)

Field     Percentage of each field relation structure   N/A
Author    54.29%, 42.86%                                 2.85%
Year      48.57%, 20.00%, 14.29%, 5.71%, 2.86%, 2.86%    5.71%
Title     48.57%, 42.86%                                 8.57%
Journal   71.43%, 20.00%, 5.71%                          2.86%
Volume    40.00%, 31.43%, 14.29%, 5.71%, 2.86%, 2.86%    2.85%
Issue     34.29%, 14.29%                                 51.42%
Pages     42.86%, 34.29%, 17.14%, 2.86%                  2.85%
Slide 54
Comparison with related works
Knowledge-based approach: our proposed knowledge-based method for
scholarly publications can extract reference information from 907
records in various reference styles with a high degree of precision.
The overall average field accuracy is:
  97.87% for the six major styles listed in Table 1
  98.20% for the MISQ style
  87% for 30 other randomly selected styles
Slide 55
Conclusions
Citation extraction is a challenging problem because of the diverse
nature of reference styles.
We have proposed a knowledge-based citation extraction method for
scholarly publications.
The experimental results indicate that, by using INFOMAP, we can
extract author, title, journal, volume, number (issue), year, and
page information from different reference styles with a high degree
of precision.
The overall average field accuracy of citation extraction is 97.87%
for six major reference styles.
Slide 56
Future Research
Integrate the ontological and the machine learning approaches to
boost the performance of citation information extraction:
  Maximum-Entropy Method (MEM)
  Hidden Markov Model (HMM)
  Conditional Random Fields (CRF)
  Support Vector Machines (SVM)
Slide 57
Reference
Min-Yuh Day, Tzong-Han Tsai, Cheng-Lung Sung, Cheng-Wei Lee,
Shih-Hung Wu, Chorng-Shyong Ong, and Wen-Lian Hsu, "A Knowledge-based
Approach to Citation Extraction," to appear in Proceedings of the
IEEE International Conference on Information Reuse and Integration
(IEEE IRI-2005), pp. 50-55. (EI)
Slide 58
Chinese Named Entity Recognition Using a Hybrid Approach of
Machine Learning and Domain Knowledge (ROCLING 2003, CLCLP 2004)
Machine learning: named entity corpus, target named entity, NER
Slide 62
Hybrid NER method
Domain knowledge
Machine learning: SVM, bigram/trigram model
Hybrid: maximum-entropy framework, in which domain knowledge serves
as features
Slide 63
Static knowledge is insufficient
  New names (e.g., SARS)
  Ambiguity
  Context dependence
Slide 64
Pure machine learning might suffer
  Lack of context information
  Window size: token, tag, NE
Slide 65
Basic Concepts of Our ME-based Hybrid Approach
  NE context information
  Internal/external features
  Training data
  Feature confidence
Slide 66
Internal/External Features
Internal: found within the name string itself
External: context
Slide 67
Tag Set (outcome)
Token: character
Tag sequences for three-character named entities:
/B-P /I-P /I-P, /B-L /I-L /I-L, /B-O /I-O /I-O
Slide 68
ME-based NER Framework - Feature Representation
For example: when a token satisfies the condition of feature f,
feature f is active.
Slide 69
ME-based NER Framework - Training
Given a set of features and a training corpus, the ME estimation
process produces a model in which every feature f_i has its own
weight. We can then compute the conditional probability p(o|h) of an
outcome o given a history h (Equation 1).
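For reference, the conditional probability in a maximum-entropy model is usually written in the standard form below. The symbol \alpha_i for the weight of feature f_i is an assumption about notation. The same model is often written with \lambda_i = \log \alpha_i as an exponential of a weighted sum.

```latex
p(o \mid h) = \frac{1}{Z(h)} \prod_{i} \alpha_i^{\,f_i(h, o)},
\qquad
Z(h) = \sum_{o'} \prod_{i} \alpha_i^{\,f_i(h, o')}
```

Because each f_i(h, o) is 0 or 1, the product only involves the weights of the features that are active for the pair (h, o), and Z(h) normalizes over all outcomes.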
Slide 70
ME-based NER Framework - Decoding
1. Tokenize the text and preprocess the test sentence.
2. For each token, check which features are active and combine the
   weights of the active features according to Equation 1.
3. Run a Viterbi search to find the highest-probability path.
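The decoding steps can be sketched on a toy example. Everything specific below is assumed for illustration: English tokens instead of Chinese characters, a reduced tag set (B-P, I-P, and a hypothetical non-entity tag X), five hand-picked binary features with made-up weights, and the constraint that I-P may only follow B-P or I-P. It is not the Mencius feature set or its trained model.

```python
TAGS = ["B-P", "I-P", "X"]      # X is an assumed tag for "not an entity"
TITLES = {"Mr", "Dr"}

# (weight alpha_i, binary feature f_i(tokens, position, outcome)); made up.
FEATURES = [
    (8.0, lambda toks, t, o: o == "B-P" and t > 0 and toks[t - 1] in TITLES),
    (1.5, lambda toks, t, o: o == "B-P" and toks[t][0].isupper()),
    (3.0, lambda toks, t, o: o == "I-P" and toks[t][0].isupper()),
    (5.0, lambda toks, t, o: o == "X" and toks[t].islower()),
    (4.0, lambda toks, t, o: o == "X" and toks[t] in TITLES),
]


def p_outcomes(toks, t):
    """p(o|h): multiply the weights of active features, then normalize."""
    scores = {}
    for o in TAGS:
        score = 1.0
        for alpha, feature in FEATURES:
            if feature(toks, t, o):
                score *= alpha
        scores[o] = score
    z = sum(scores.values())
    return {o: s / z for o, s in scores.items()}


def allowed(prev, cur):
    """I-P can only continue a person name started by B-P."""
    return cur != "I-P" or prev in ("B-P", "I-P")


def viterbi(toks):
    """Return the tag sequence with the highest product of p(o|h)."""
    if not toks:
        return []
    probs = p_outcomes(toks, 0)
    # best[o] = (probability, path) of the best path ending in tag o
    best = {o: (0.0 if o == "I-P" else probs[o], [o]) for o in TAGS}
    for t in range(1, len(toks)):
        probs = p_outcomes(toks, t)
        new_best = {}
        for o in TAGS:
            candidates = [
                (best[p][0] * probs[o], best[p][1] + [o])
                for p in TAGS if allowed(p, o)
            ]
            new_best[o] = max(candidates, key=lambda c: c[0])
        best = new_best
    return max(best.values(), key=lambda c: c[0])[1]


path = viterbi(["Mr", "Andy", "Grove", "spoke"])
```

The title "Mr" acts as an external (context) feature for the following token, while capitalization is an internal feature of the token itself.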
Slide 71
Hybrid NER Example
The NER problem is formulated as maximizing p(o|h) and finding the
corresponding outcome o.
W0: the current token
Context (history): Os, Ls, Ps
Feature 1:
Advantages of Hybrid NER , . , Performance
Slide 73
Experiment - Data Set: United Daily News (December 2002)

                  Number of Named Entities
Domain            PER     LOC     ORG     Size (in characters)
Local News          84     139      97     11835
Social Affairs     310     287     354     37719
Investment          20      63      33     14397
Politics           419     209     233     17168
Headline News      267      70     243     19938
Business           142     186     187     25815
Total             1242     954    1147    126872
Slide 74
Experiment Result

Use domain knowledge only
NE      P(%)    R(%)    F(%)
PER     72.98   97.93   83.63
LOC     67.96   74.67   71.16
ORG     95.77   64.07   76.78
Total   75.62   82.13   78.74

ME-based Hybrid
NE      P(%)    R(%)    F(%)
PER     97.94   87.39   92.36
LOC     78.60   69.35   73.69
ORG     94.39   62.57   75.25
Total   90.56   73.70   81.26
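The F column is consistent with the balanced F-measure, the harmonic mean of precision and recall, F = 2PR / (P + R). The short check below recomputes it from the reported P and R values; the function name `f_score` is illustrative.

```python
def f_score(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R, reported F) for the eight rows of the two result tables.
ROWS = [
    (72.98, 97.93, 83.63), (67.96, 74.67, 71.16),
    (95.77, 64.07, 76.78), (75.62, 82.13, 78.74),
    (97.94, 87.39, 92.36), (78.60, 69.35, 73.69),
    (94.39, 62.57, 75.25), (90.56, 73.70, 81.26),
]

recomputed = [round(f_score(p, r), 2) for p, r, _ in ROWS]
```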
Conclusion and Future Work
Conclusion
  Hybrid approach: precision improvement
Future Work
  Named entity features
  Multi-iteration NER
  Hierarchical named entities
Slide 77
References
[Tsai 2003] Tzong-Han Tsai, Shih-Hung Wu, and Wen-Lian Hsu,
"Mencius: A Chinese Named Entity Recognizer Using Hybrid Model," in
Proceedings of the Fifteenth Research on Computational Linguistics
International Conference (ROCLING XV), pp. 193-209, 2003.
[Tsai 2004] Tzong-Han Tsai, Shih-Hung Wu, and Wen-Lian Hsu,
"Mencius: A Chinese Named Entity Recognizer Based on a Maximum
Entropy Framework," Computational Linguistics and Chinese Language
Processing, Vol. 9, No. 1, pp. 65-82, 2004.
[Shih 2004] Cheng-Wei Shih, Tzong-Han Tsai, Shih-Hung Wu, Chiu-Chen
Hsieh, and Wen-Lian Hsu, "The Construction of a Chinese Named Entity
Tagged Corpus: CNEC1.0," in Proceedings of the Sixteenth Conference
on Computational Linguistics and Speech Processing (ROCLING XVI),
pp. 305-313, 2004.