Transcript of Information Extraction Shih-Hung Wu Assistant Professor CSIE, Chaoyang University of Technology.
- Slide 1
- Information Extraction. Shih-Hung Wu, Assistant Professor, CSIE,
Chaoyang University of Technology
- Slide 2
- Outline. Information Extraction: Introduction. Applications: Table
Reading, Citation Extraction, Chinese Named Entity Recognition
- Slide 3
- Introduction
- Slide 4
- Information Extraction: extracts pieces of information that are
salient to the user's needs
- Slide 5
- Message Understanding Conferences (MUC): The evaluations provide
prepared data and task definitions, along with fully automated
scoring software to measure machine and human performance. The
databases now include named entities, multilingual named entities,
attributes of those entities, facts about relationships between
entities, and events in which the entities participated. The
multilingual portion was known as the "Multilingual Entity Task
(MET)".
- Slide 6
- Examples: The following fictional news story portrays the levels
of detail that systems can extract: Fletcher Maddox, former Dean of
the UCSD Business School, announced the formation of La Jolla
Genomatics together with his two sons. La Jolla Genomatics will
release its product Geninfo in June 1999. Geninfo is a turnkey
system to assist biotechnology researchers in keeping up with the
voluminous literature in all aspects of their field. Dr. Maddox
will be the firm's CEO. His son, Oliver, is the Chief Scientist and
holds patents on many of the algorithms used in Geninfo. Oliver's
brother, Ambrose, follows more in his father's footsteps and will
be the CFO of L.J.G. headquartered in the Maddox family's hometown
of La Jolla, CA.
- Slide 7
- Entities:
  Persons: Fletcher Maddox, Dr. Maddox, Oliver, Oliver, Ambrose, Maddox
  Organizations: UCSD Business School, La Jolla Genomatics, La Jolla Genomatics, L.J.G.
  Locations: La Jolla, CA
  Artifacts: Geninfo, Geninfo
  Dates: June 1999
- Slide 8
- Attributes:
  NAME: Fletcher Maddox; DESCRIPTOR: former Dean of the UCSD Business School, his father, the firm's CEO; CATEGORY: PERSON
  NAME: La Jolla Genomatics; DESCRIPTOR: (empty); CATEGORY: ORGANIZATION
  NAME: Geninfo; DESCRIPTOR: its product; CATEGORY: ARTIFACT
  NAME: La Jolla; DESCRIPTOR: the Maddox family's hometown; CATEGORY: LOCATION
- Slide 9
- Facts:
  PERSON Employee_of ORGANIZATION:
    Fletcher Maddox Employee_of UCSD Business School
    Fletcher Maddox Employee_of La Jolla Genomatics
    Oliver Employee_of La Jolla Genomatics
    Ambrose Employee_of La Jolla Genomatics
  ARTIFACT Product_of ORGANIZATION:
    Geninfo Product_of La Jolla Genomatics
  LOCATION Location_of ORGANIZATION:
    La Jolla Location_of La Jolla Genomatics
    CA Location_of La Jolla Genomatics
- Slide 10
- Events:
  COMPANY-FORMATION_EVENT: COMPANY: La Jolla Genomatics; PRINCIPALS: Fletcher Maddox, Oliver, Ambrose; DATE: (empty); CAPITAL: (empty)
  RELEASE-EVENT: COMPANY: La Jolla Genomatics; PRODUCT: Geninfo; DATE: June 1999; COST: (empty)
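One way to picture the four levels on the preceding slides (entities, attributes, facts, events) is as typed records. The sketch below is only an illustration: the class and field names are assumptions, not the official MUC template format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Entity:
    name: str
    category: str                                   # PERSON, ORGANIZATION, LOCATION, ARTIFACT, ...
    descriptors: List[str] = field(default_factory=list)   # the "attributes" level


@dataclass
class Fact:
    subject: str
    relation: str                                   # e.g. Employee_of, Product_of, Location_of
    obj: str


@dataclass
class Event:
    kind: str
    slots: Dict[str, Optional[str]]                 # unfilled slots stay None


geninfo = Entity("Geninfo", "ARTIFACT", ["its product"])
ljg = Entity("La Jolla Genomatics", "ORGANIZATION")
product_of = Fact(geninfo.name, "Product_of", ljg.name)
release = Event(
    "RELEASE-EVENT",
    {"COMPANY": ljg.name, "PRODUCT": geninfo.name, "DATE": "June 1999", "COST": None},
)
print(product_of)
print(release)
```

Leaving unfilled slots as `None` mirrors the empty DATE, CAPITAL, and COST slots in the slide's event templates.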
- Slide 11
- Information Extraction: current indicators of the state of the art
  Items of information: percentile reliability
  Entities: 90
  Attributes: 80
  Facts: 70
  Events: 60
- Slide 12
- Technical definition of IE: The process of creating database
entries by skimming a text and looking for occurrences of a
particular class of object or event and for relationships among
those objects and events [Russell, Norvig 2003]
- Slide 13
- Basic IE tasks: Extract addresses from Web pages (target: street,
city, state, and zip code). Extract storms from weather reports
(target: temperature, wind speed, and precipitation).
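A minimal sketch of the address task, assuming US-style addresses written as "street, city, ST zip". The pattern and the sample address are made up for illustration; addresses on real pages vary far more than this.

```python
import re

# Assumed layout: "<number> <street>, <city>, <two-letter state> <zip>"
ADDRESS = re.compile(
    r"(?P<street>\d+ [A-Za-z0-9 .]+?),\s*"
    r"(?P<city>[A-Za-z .]+?),\s*"
    r"(?P<state>[A-Z]{2})\s+"
    r"(?P<zip>\d{5}(?:-\d{4})?)"
)


def extract_address(text):
    """Return the four target fields as a dict, or None if nothing matches."""
    match = ADDRESS.search(text)
    return match.groupdict() if match else None


sample = "Contact: 12 Main St., Springfield, IL 62701."   # hypothetical page text
print(extract_address(sample))
```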
- Slide 14
- IE Applications. Competitive intelligence: find instances of
corporate mergers and joint ventures. Intelligence gathering:
terrorist activities, any damage to buildings or the
infrastructure, as well as the time and location of the event.
Health care delivery: summarize medical patient records by
extracting diagnoses, symptoms, physical findings, test results,
and therapeutic treatments.
- Slide 15
- Technology. Methods in the literature: regular expressions;
cascaded finite-state transducers. Our approaches: ontological
domain knowledge; machine learning; hybrid method.
- Slide 16
- Regular expression approach example. From the text
  17in SXGA Monitor for only $249.99
extract
  ∃ m  m ∈ ComputerMonitors ∧ Size(m, Inches(17)) ∧ Price(m, $(249.99)) ∧ Resolution(m, 1280 × 1024)
- Slide 17
- Regular Expressions
  [0-9] matches any digit from 0 to 9
  [0-9]+ matches one or more digits
  \.[0-9][0-9] matches a period followed by two digits
  (\.[0-9][0-9])? matches a period followed by two digits, or nothing
  \$[0-9]+(\.[0-9][0-9])? matches $249.99, $1.23, $100000, ...
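The price pattern on this slide can be tried directly with Python's `re` module, with the dollar sign and period escaped so they match literally. The size pattern and the small resolution lookup table below are assumptions added to fill in the other two attributes of the monitor example.

```python
import re

PRICE = re.compile(r"\$([0-9]+(?:\.[0-9][0-9])?)")
SIZE = re.compile(r"([0-9]+)in\b")              # assumed pattern for "17in"
RESOLUTIONS = {"SXGA": (1280, 1024)}            # assumed lookup table


def extract_monitor(text):
    """Fill a small attribute record from one product description."""
    record = {}
    size = SIZE.search(text)
    if size:
        record["size_inches"] = int(size.group(1))
    price = PRICE.search(text)
    if price:
        record["price_usd"] = float(price.group(1))
    for name, resolution in RESOLUTIONS.items():
        if name in text:
            record["resolution"] = resolution
    return record


print(extract_monitor("17in SXGA Monitor for only $249.99"))
```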
- Slide 18
- Weakness: What's the price? List price $99.00, special sale
price $78.00, shipping $3.00.
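A short sketch of why this is a weakness: the same price pattern matches every dollar amount in the sentence, and nothing in the pattern says which one is the price of the item.

```python
import re

PRICE = re.compile(r"\$[0-9]+(?:\.[0-9][0-9])?")

text = "List price $99.00, special sale price $78.00, shipping $3.00."
prices = PRICE.findall(text)
print(prices)   # three candidates; the regex cannot choose among them
```

Choosing the right candidate needs context (the words "list", "sale", "shipping"), which is what the later stages of a cascaded system are meant to supply.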
- Slide 19
- Cascaded finite-state transducers approach example. From the text
  Bridgestone Sports Co. said Friday it has set up a joint venture in
  Taiwan with a local concern and a Japanese trading house to produce
  golf clubs to be shipped to Japan.
extract
  ∃ e  e ∈ JointVentures ∧ Product(e, golf clubs) ∧ Date(e, Friday) ∧ Entity(e, Bridgestone Sports Co) ∧ Entity(e, a local concern) ∧ Entity(e, a Japanese trading house)
- Slide 20
- Cascaded finite-state transducers: A typical relational
extraction system consists of the following five stages:
  1. Tokenization
  2. Complex word handling
  3. Basic group handling
  4. Complex phrase handling
  5. Structure merging
- Slide 21
- Tokenization: word segmentation [Chinese example not preserved]
  Complex word handling:
  Bridgestone Sports Co.: CapitalizedWord+(Company|Co|Inc|Ltd)
  Intel Chairman Andy Grove: CapitalizedWord+(Grove|Forest|Village|)
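The two complex-word rules can be approximated with ordinary regular expressions. In this sketch `CapitalizedWord` is assumed to mean one uppercase letter followed by lowercase letters, and the location rule keeps only the three suffixes that survive in the transcript.

```python
import re

CAP = r"[A-Z][a-z]+"    # assumed definition of CapitalizedWord
COMPANY = re.compile(rf"(?:{CAP} )+(?:Company|Co|Inc|Ltd)\b\.?")
LOCATION = re.compile(rf"(?:{CAP} )+(?:Grove|Forest|Village)\b")

sentence = "Bridgestone Sports Co. said Friday it has set up a joint venture"
company = COMPANY.search(sentence).group(0)
print(company)

# The same style of rule also fires on a person's name that ends in "Grove".
spurious = LOCATION.search("Intel Chairman Andy Grove").group(0)
print(spurious)
```

The second match suggests why purely lexical rules need support from later stages or from domain knowledge: the string fits the location pattern although it names a person.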
- Slide 22
- Basic group handling: noun groups (NG), verb groups (VG),
prepositions (PR), conjunctions (CJ)
  1 NG: Bridgestone Sports Co.
  2 VG: said
  3 NG: Friday
  4 NG: it
  5 VG: has set up
  6 NG: a joint venture
  7 PR: in
  8 NG: Taiwan
  9 PR: with
  10 NG: a local concern
  11 CJ: and
  12 NG: a Japanese trading house
  13 VG: to produce
  14 NG: golf clubs
  15 VG: to be shipped
  16 PR: to
  17 NG: Japan
- Slide 23
- Complex phrase handling: Company+ SetUp JointVenture (with
Company+)?
  Structure merging: if the next sentence says something about the
same event, the two structures are merged into one.
- Slide 24
- A brief remark: IE works well for a restricted domain.
Predetermine the subjects and how they are mentioned.
- Slide 25
- Applications
- Slide 26
- Table Reading, Citation Extraction, Chinese NER
- Slide 27
- Semantic Search on Internet Tabular Information Extraction for
Answering Queries (CIKM 2000)
- Slide 28
- Table Reading: gives an algorithm to interpret tables of the type
shown below, where some cells span multiple rows or columns. An
example of an interpretation is (Attribute)=>(Value):
(Adult-Price-Single Room-Economic class)=>35,450
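To make the (Attribute)=>(Value) interpretation concrete, here is a minimal sketch that expands spanning header cells and joins the headers above and beside each value into one attribute path. It is not the paper's method (which relies on tagging, layout recognition, and layout transformation); the table layout and the second value are made up for illustration.

```python
def expand(header_row):
    """Repeat each (label, colspan) pair over every column it spans."""
    columns = []
    for label, span in header_row:
        columns.extend([label] * span)
    return columns


def interpret(column_headers, row_labels, values):
    """Map 'header-...-row label' attribute paths to cell values."""
    levels = [expand(row) for row in column_headers]
    result = {}
    for r, row_label in enumerate(row_labels):
        for c, value in enumerate(values[r]):
            path = [level[c] for level in levels] + [row_label]
            result["-".join(path)] = value
    return result


column_headers = [
    [("Adult", 2)],                               # spans both value columns
    [("Price", 2)],
    [("Single Room", 1), ("Double Room", 1)],
]
row_labels = ["Economic class"]
values = [["35,450", "32,000"]]                   # second value is hypothetical

table = interpret(column_headers, row_labels, values)
print(table["Adult-Price-Single Room-Economic class"])
```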
- Slide 29
- Slide 30
- Method: Tagging -> Layout Recognition -> Layout Transformation.
Diagram labels: HTML Table; C-I Table; Layout Description; Layout
Transition Rule; Database Table; Ambiguous Relations of Cells
- Slide 31
- Method Tagging Layout Identifying Layout Trans.
- Slide 32
- Airline Schedule Ontology
- Slide 33
- Tagging. Cell tags: C: Departure Information; C: Departure City;
C: Arrival City; I: Departure City; I: Arrival City.
  Concept vs. Descent Concept (C: Departure Information, C: Departure City)
  Concept vs. Instance of the Concept (C: Departure City, I: Departure City)
  Instance vs. Instance of the same Concept
- Slide 34
- Four Relations of Table Cells (relations of concepts and instances):
  1. Concept - Instance of the Concept
  2. Concept - Descent Concept
  3. Concept - Instance of Descent Concept
  4. Instance - Instance of the same Concept
- Slide 35
- Layout Recognition: C-I Table and Layout Descriptions -> Template
Matching -> Matched Layout Description (defined by Layout Syntax
Grammar)
- Slide 36
- Layout Transformation: Origin Layout Description -> Destination
Layout Description
- Slide 37
- Experiments: 23 tables from 23 web pages (13 two-dimensional
tables, 10 complex tables). A table counts as a success only if
there is no miss; any miss counts as a failure.
- Slide 38
- Conclusion & Future Works: Layout transformation from complex
tables to simple tables (1D, 2D). A general approach: 1. Tagging
2. Semantic Layout Recognition 3. Layout Transformation.
Ambiguity is reduced by checking cell relations.
- Slide 39
- Reference: Huei-Long Wang, Shih-Hung Wu, I. C. Wang, Cheng-Lung
Sung, W. L. Hsu, W. K. Shih, Semantic Search on Internet Tabular
Information Extraction for Answering Queries, Ninth International
Conference on Information and Knowledge Management (CIKM-2000),
McLean, VA, November 6-11, 2000, pp. 243-249. (EI)
  H.-H. Chen, S.-C. Tsai, and J.-H. Tsai, Mining Tables from Large
Scale HTML Texts, in Proc. 18th International Conference on
Computational Linguistics, Saarbrücken, Germany, July 2000.
- Slide 40
- A Knowledge-based Approach to Citation Extraction (IRI-2005)
- Slide 41
- Introduction. Goal: integration of the bibliographical information
of scholarly publications available on the Internet. Need: accurate
reference metadata extraction from heterogeneous reference sources.
We propose a knowledge-based approach to reference metadata
extraction. INFOMAP: an ontological knowledge representation
framework, used to automatically extract the reference metadata.
- Slide 42
- Proposed Approach
- Slide 43
- Reference Data Collection (Phase 1): Journal Spider (journal
agent) collects journal data from the Journal Citation Reports (JCR)
indexed by the ISI and from digital libraries on the Web. Citation
data sources: ISI Web of Science, DBLP, CiteSeer, PubMed.
- Slide 44
- Domain Knowledge (Phase 2)
- Slide 45
- INFOMAP: INFOMAP, as an ontological knowledge representation
framework, extracts important citation concepts from a natural
language text. Features of INFOMAP: it represents and matches
complicated template structures through hierarchical matching,
regular expressions, semantic template matching, and frame
(non-linear relations) matching. Using INFOMAP, we can extract
author, title, journal, volume, number (issue), year, and page
information from different kinds of reference formats or styles.
- Slide 46
- Reference Metadata Extraction (Phase 3): journal reference styles
  Table 1. Examples of different journal reference styles
  Bioinformatics style (BIOI): Davenport, T., DeLong, D., & Beers, M. (1998) Successful knowledge management projects. Sloan Management Review, 39(2), 43-57.
  ACM style (ACM): 1. Davenport, T., DeLong, D. and Beers, M. 1998. Successful knowledge management projects. Sloan Management Review, 39 (2). 43-57.
  IEEE style (IEEE): [1] T. Davenport, D. DeLong, and M. Beers, "Successful knowledge management projects," Sloan Management Review, vol. 39, no. 2, pp. 43-57, 1998.
  APA style (APA): Davenport, T., DeLong, D., & Beers, M. (1998). Successful knowledge management projects. Sloan Management Review, 39(2), 43-57.
  JCB style (JCB): Davenport, T., DeLong, D., & Beers, M. 1998. Successful knowledge management projects. Sloan Management Review 39(2), 43-57.
  MISQ style (MISQ): Davenport, T., DeLong, D., and Beers, M. "Successful knowledge management projects," Sloan Management Review (39:2) 1998, pp 43-57.
- Slide 47
- Knowledge-based Reference Metadata Extraction - Online Service
(Phase 4)
- Slide 48
- Citation Extraction: From Text to BibTeX
  Figure 3. The system input of knowledge-based RME:
  W. L. Hsu, "The coloring and maximum independent set problems on planar perfect graphs," J. Assoc. Comput. Machin., (1988), 535-563.
  W. L. Hsu, "On the general feasibility test of scheduling lot sizes for several products on one machine," Management Science 29, (1983), 93-105.
  W. L. Hsu, "The distance-domination numbers of trees," Operations Research Letters 1, (3), (1982), 96-100.
  Figure 5. The system output of BibTeX format:
  @article{ Author = {W. L. Hsu}, Title = {The coloring and maximum independent set problems on planar perfect graphs,"}, Journal = {J. Assoc. Comput. Machin.}, Volume = {}, Number = {}, Pages = {535-563}, Year = {1988 }}
  @article{ Author = {W. L. Hsu}, Title = {On the general feasibility test of scheduling lot sizes for several products on one machine,"}, Journal = {Management Science}, Volume = {29}, Number = {}, Pages = {93-105}, Year = {1983 }}
  @article{ Author = {W. L. Hsu}, Title = {The distance-domination numbers of trees,"}, Journal = {Operations Research Letters}, Volume = {1}, Number = {3}, Pages = {96-100}, Year = {1982 }}
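For a single, fixed reference style, the text-to-BibTeX step can be imitated with one regular expression. The sketch below is a stand-in for illustration only: it is not INFOMAP, it handles only the style of the three input lines on this slide, and the function names are assumptions.

```python
import re

CITATION = re.compile(
    r'^(?P<author>.+?), "(?P<title>.+?),"\s+'       # author, then quoted title
    r'(?P<journal>.+?)(?:\s+(?P<volume>\d+))?,\s*'  # journal, optional volume
    r'(?:\((?P<number>\d{1,3})\),\s*)?'             # optional (issue)
    r'\((?P<year>\d{4})\),\s*'                      # (year)
    r'(?P<pages>\d+-\d+)\.?$'                       # page range
)


def parse_citation(line):
    """Return the seven fields as strings ('' when absent), or None."""
    match = CITATION.match(line.strip())
    if match is None:
        return None
    return {key: (value or "") for key, value in match.groupdict().items()}


def to_bibtex(fields):
    order = ["author", "title", "journal", "volume", "number", "pages", "year"]
    body = ", ".join(f"{key.capitalize()} = {{{fields[key]}}}" for key in order)
    return "@article{ " + body + " }"


line = ('W. L. Hsu, "The distance-domination numbers of trees," '
        'Operations Research Letters 1, (3), (1982), 96-100.')
fields = parse_citation(line)
print(to_bibtex(fields))
```

A pattern like this breaks as soon as the style changes (for example, the IEEE or MISQ layouts in Table 1), which is the motivation the slides give for a knowledge-based approach.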
- Slide 49
- Figure 6. The online service of knowledge-based RME
(http://bioinformatics.iis.sinica.edu.tw/CitationAgent/). System
input: plain text. System output: BibTeX.
- Slide 50
- Experimental Results and Discussion. Experimental data: We used
EndNote to collect Bioinformatics citation data for 2004 from
PubMed. A total of 907 bibliography records were collected from
PubMed digital libraries on the Web. Reference testing data was
generated for each of the six reference styles (BIOI, ACM, IEEE,
APA, MISQ, and JCB). We randomly selected 500 records for testing
from each of the six reference styles.
- Slide 51
- Experimental results of citation extraction from six reference
styles
- Slide 52
- Example Results
- Slide 53
- The various structures of different styles (analysis of structures
of 30 reference styles). Percentage of styles using each field
relation structure, per field [structure diagrams not preserved]:
  Author: 54.29%, 42.86%, N/A 2.85%
  Year: 48.57%, 20.00%, 14.29%, 5.71%, 2.86%, 2.86%, N/A 5.71%
  Title: 48.57%, 42.86%, N/A 8.57%
  Journal: 71.43%, 20.00%, 5.71%, N/A 2.86%
  Volume: 40.00%, 31.43%, 14.29%, 5.71%, 2.86%, 2.86%, N/A 2.85%
  Issue: 34.29%, 14.29%, N/A 51.42%
  Pages: 42.86%, 34.29%, 17.14%, 2.86%, N/A 2.85%
- Slide 54
- Comparison with related works. Knowledge-based approach: our
proposed knowledge-based method for scholarly publications can
extract reference information from 907 records in various reference
styles with a high degree of precision. The overall average field
accuracy is 97.87% for the six major styles listed in Table 1,
98.20% for the MISQ style, and 87% for 30 other randomly selected
styles.
- Slide 55
- Conclusions: Citation extraction is a challenging problem because
of the diverse nature of reference styles. We have proposed a
knowledge-based citation extraction method for scholarly
publications. The experimental results indicate that, by using
INFOMAP, we can extract author, title, journal, volume, number
(issue), year, and page information from different reference styles
with a high degree of precision. The overall average field accuracy
of citation extraction is 97.87% for six major reference
styles.
- Slide 56
- Future Research: Integrate the ontological and the machine
learning approaches to boost the performance of citation
information extraction. Candidate methods: Maximum-Entropy Method
(MEM), Hidden Markov Model (HMM), Conditional Random Fields (CRF),
Support Vector Machines (SVM).
- Slide 57
- Reference: Min-Yuh Day, Tzong-Han Tsai, Cheng-Lung Sung,
Cheng-Wei Lee, Shih-Hung Wu, Chorng-Shyong Ong, and Wen-Lian Hsu, A
Knowledge-based Approach to Citation Extraction, to appear in
Proceedings of IEEE International Conference on Information Reuse
and Integration (IEEE IRI-2005), pp. 50-55. (EI)
- Slide 58
- Chinese Named Entity Recognition Using a Hybrid Approach of
Machine Learning and Domain Knowledge (ROCLING 2003, CLCLP 2004)
- Slide 59
- Named Entity Recognition Named Entity
- Slide 60
- Sequential Labeling. Tags per entity type: Per: B-P, I-P; Loc:
B-L, I-L; Org: B-O, I-O. Token-based vs. character-based.
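A small sketch of how a B-/I- tag sequence is read back into named entities. The English tokens, the `OUT` label for tokens outside any entity, and the type names are assumptions made for the example; for character-based Chinese tagging the pieces would be joined without spaces.

```python
TYPE_NAMES = {"P": "PER", "L": "LOC", "O": "ORG"}   # B-O / I-O are organization tags


def bio_to_entities(tokens, tags):
    """Group B-X followed by I-X tags into (type, text) entities."""
    entities, current = [], None        # current = (type letter, list of tokens)
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append(current)
            current = (tag[2:], [token])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(token)
        else:                           # outside tag, or an I- tag with no matching B-
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return [(TYPE_NAMES[kind], " ".join(words)) for kind, words in entities]


tokens = ["Fletcher", "Maddox", "founded", "La", "Jolla", "Genomatics", "in", "La", "Jolla"]
tags = ["B-P", "I-P", "OUT", "B-O", "I-O", "I-O", "OUT", "B-L", "I-L"]
entities = bio_to_entities(tokens, tags)
print(entities)
```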
- Slide 61
- Machine Learning [Chinese text not preserved; surviving keywords:
named entity corpus, target named entity, NER]
- Slide 62
- Hybrid NER method. Domain knowledge: [Chinese examples not
preserved]. Machine learning: SVM, bigram/trigram model. Hybrid:
maximum-entropy framework, in which domain knowledge serves as
features.
- Slide 63
- Static knowledge is insufficient: new names (e.g., SARS);
ambiguity; context dependence
- Slide 64
- Pure machine learning might suffer: lack of context information
[Chinese text not preserved; surviving keywords: window size, token,
tag, NE, NER]
- Slide 65
- Basic Concepts of Our ME-based Hybrid Approach [Chinese text not
preserved; surviving keywords: NE, context information,
internal/external features, training data, feature, confidence]
- Slide 66
- Internal/External Features. Internal: found within the name
string itself, e.g., [Chinese example not preserved]. External:
context, e.g., [Chinese example not preserved].
- Slide 67
- Tag Set (outcome): character-based tokens and named entities
[Chinese examples not preserved]. Tag set, shown on three-character
examples: _/B-P _/I-P _/I-P; _/B-L _/I-L _/I-L; _/B-O _/I-O _/I-O
- Slide 68
- ME-based NER Framework - Feature Representation. For example:
token [Chinese example not preserved]; feature f is active!!
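A maximum-entropy feature is commonly written as a binary function of the history h (the context of the current token) and a candidate outcome o (a tag). The sketch below assumes a romanized surname list as the piece of domain knowledge; the list, the dictionary key `w0`, and the function name are illustrative, not taken from the original slide.

```python
SURNAMES = {"Wu", "Hsu", "Tsai"}    # hypothetical surname dictionary (domain knowledge)


def f_surname_starts_person(history, outcome):
    """Return 1 if the current token is a known surname and the tag is B-P, else 0."""
    current_token = history["w0"]   # w0: the current token
    return 1 if current_token in SURNAMES and outcome == "B-P" else 0


history = {"w0": "Hsu"}
print(f_surname_starts_person(history, "B-P"))   # the feature is active
print(f_surname_starts_person(history, "B-L"))   # same token, other tag: inactive
```

This is the sense in which domain knowledge "serves as features": the dictionary lookup does not decide the tag by itself, it only contributes a weighted vote.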
- Slide 69
- ME-based NER Framework - Training: Given a set of features and a
training corpus, the ME estimation process produces a model in
which every feature f_i has its own weight. We can then compute:
[equation not preserved]
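The equation itself is not reproduced in the transcript. For orientation, the standard conditional maximum-entropy form is shown below; the weight symbol \lambda_i and the normalizer Z(h) are assumed notation and may differ from the original slide.

```latex
p(o \mid h) = \frac{1}{Z(h)} \exp\Big( \sum_i \lambda_i \, f_i(h, o) \Big),
\qquad
Z(h) = \sum_{o'} \exp\Big( \sum_i \lambda_i \, f_i(h, o') \Big)
```

Here h is the history (context), o is an outcome (tag), and only the active features, those with f_i(h, o) = 1, contribute their weights to the sum.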
- Slide 70
- ME-based NER Framework - Decoding: Tokenize the text and
preprocess the testing sentence. For each token, check which
features are active and combine the weights of the active features
according to Equation 1. A Viterbi search is run to find the
highest-probability path.
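A minimal sketch of the decoding step, assuming the model has already produced a probability for each tag at each position. The validity rule (an I-X tag may only follow B-X or I-X), the omission of an outside tag, and the toy probabilities are assumptions for illustration.

```python
import math

TAGS = ["B-P", "I-P", "B-L", "I-L", "B-O", "I-O"]


def valid(prev, cur):
    """I-X may only follow B-X or I-X; a B- tag may follow anything."""
    if cur.startswith("I-"):
        return prev is not None and prev[2:] == cur[2:]
    return True


def viterbi(probs):
    """probs[t][tag] = p(tag | history at t); return the best valid tag path."""
    best = {}   # tag -> (log-probability, path) of the best valid path ending in tag
    for tag in TAGS:
        p = probs[0].get(tag, 0.0)
        if p > 0 and valid(None, tag):
            best[tag] = (math.log(p), [tag])
    for t in range(1, len(probs)):
        new_best = {}
        for cur in TAGS:
            p = probs[t].get(cur, 0.0)
            if p <= 0:
                continue
            candidates = [
                (score + math.log(p), path + [cur])
                for prev, (score, path) in best.items()
                if valid(prev, cur)
            ]
            if candidates:
                new_best[cur] = max(candidates, key=lambda item: item[0])
        best = new_best
    # Raises ValueError if no valid path exists at all.
    return max(best.values(), key=lambda item: item[0])[1]


probs = [
    {"B-P": 0.6, "B-L": 0.4},
    {"I-P": 0.3, "I-L": 0.5, "B-O": 0.2},
    {"I-P": 0.6, "I-O": 0.4},
]
path = viterbi(probs)
print(path)
```

Picking the most probable tag at each position independently would give B-P, I-L, I-P here, which is not a valid sequence; the search over whole paths avoids that.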
- Slide 71
- Hybrid NER Example: The NER problem has been formulated as
maximizing p(o|h) and finding the corresponding outcome o. W0: the
current token. [Diagram labels: Os, Ls, Ps; Context (History);
Feature 1; the rest is not preserved]
- Slide 72
- Advantages of Hybrid NER [Chinese text not preserved; surviving
keyword: performance]
- Slide 73
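- Experiment: Data Set. United Daily News (December, 2002)
  Domain: number of named entities (PER, LOC, ORG); size in characters
  Local News: PER 84, LOC 139, ORG 97; size 11835
  Social Affairs: PER 310, LOC 287, ORG 354; size 37719
  Investment: PER 20, LOC 63, ORG 33; size 14397
  Politics: PER 419, LOC 209, ORG 233; size 17168
  Headline News: PER 267, LOC 70, ORG 243; size 19938
  Business: PER 142, LOC 186, ORG 187; size 25815
  Total: PER 1242, LOC 954, ORG 1147; size 126872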
- Slide 74
- Experiment Result (NE: P(%), R(%), F(%))
  Use domain knowledge only:
  PER: 72.98, 97.93, 83.63
  LOC: 67.96, 74.67, 71.16
  ORG: 95.77, 64.07, 76.78
  Total: 75.62, 82.13, 78.74
  ME-based Hybrid:
  PER: 97.94, 87.39, 92.36
  LOC: 78.60, 69.35, 73.69
  ORG: 94.39, 62.57, 75.25
  Total: 90.56, 73.70, 81.26
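The F values on this slide are consistent with the balanced F-measure, the harmonic mean of precision and recall. The slide does not state the formula, so treating it as F = 2PR / (P + R) is an assumption; it reproduces the reported figures.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall (both in %)."""
    return 2 * precision * recall / (precision + recall)


# PER with domain knowledge only, and Total for the ME-based hybrid.
per_f = f_measure(72.98, 97.93)
hybrid_total_f = f_measure(90.56, 73.70)
print(round(per_f, 2), round(hybrid_total_f, 2))
```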
- Slide 75
- Performance Comparison. Corpus: MET2 dataset. Number of entities:
3646. Values are P, R, F.
  NTU (98): Person 74, 91, 81.6; Location 69, 78, 73.2; Organization 85, 78, 81.3; Overall 77, 83, 79.9
  KRDL (98): Person 66.4, 92, 77.1; Location 89, 90.9, 90; Organization 89.5, 87.8, 88.6; Overall 85.2, 90.2, 87.6
  IASL (03): Person 92.1, 83.3, 87.5; Location 88.1, 81.8, 84.9; Organization 93.3, 88.7, 90.9; Overall 90.4, 85, 87.7
- Slide 76
- Conclusion and Future Work. Conclusion: hybrid approach;
precision improvement [Chinese text not preserved]. Future work:
named entity features; multi-iteration NER; hierarchical named
entities [Chinese text not preserved].
- Slide 77
- References
  [Tsai 2003] Tzong-Han Tsai, Shih-Hung Wu, and Wen-Lian Hsu,
"Mencius: A Chinese Named Entity Recognizer Using Hybrid Model," in
Proceedings of the Fifteenth Research on Computational Linguistics
International Conference (ROCLING XV), pp. 193-209, 2003.
  [Tsai 2004] Tzong-Han Tsai, Shih-Hung Wu, and Wen-Lian Hsu,
"Mencius: A Chinese Named Entity Recognizer Based on a Maximum
Entropy Framework," Computational Linguistics and Chinese Language
Processing, Vol. 9, No. 1, pp. 65-82, 2004.
  [Shih 2004] Cheng-Wei Shih, Tzong-Han Tsai, Shih-Hung Wu,
Chiu-Chen Hsieh, and Wen-Lian Hsu, "The Construction of a Chinese
Named Entity Tagged Corpus: CNEC1.0," in Proceedings of the
Sixteenth Conference on Computational Linguistics and Speech
Processing (ROCLING XVI), pp. 305-313, 2004.