Information Extraction Shih-Hung Wu Assistant Professor CSIE, Chaoyang University of Technology.

Outline Information Extraction Introduction Applications Table Reading Citation Extraction Chinese Named Entity Recognition

Introduction

Information Extraction: extracts pieces of information that are salient to the user's needs.

Message Understanding Conferences (MUC): The evaluations provide prepared data and task definitions, along with fully automated scoring software to measure machine and human performance. The databases now include named entities, multilingual named entities, attributes of those entities, facts about relationships between entities, and events in which the entities participated. The multilingual portion was known as the "Multilingual Entity Task (MET)".

Examples The following fictional news story portrays the levels of detail that systems can extract: Fletcher Maddox, former Dean of the UCSD Business School, announced the formation of La Jolla Genomatics together with his two sons. La Jolla Genomatics will release its product Geninfo in June 1999. Geninfo is a turnkey system to assist biotechnology researchers in keeping up with the voluminous literature in all aspects of their field. Dr. Maddox will be the firm's CEO. His son, Oliver, is the Chief Scientist and holds patents on many of the algorithms used in Geninfo. Oliver's brother, Ambrose, follows more in his father's footsteps and will be the CFO of L.J.G. headquartered in the Maddox family's hometown of La Jolla, CA.

Entities:
Persons: Fletcher Maddox, Dr. Maddox, Oliver, Oliver, Ambrose, Maddox
Organizations: UCSD Business School, La Jolla Genomatics, La Jolla Genomatics, L.J.G.
Locations: La Jolla, CA
Artifacts: Geninfo, Geninfo
Dates: June 1999

Attributes:
NAME: Fletcher Maddox
  DESCRIPTOR: former Dean of the UCSD Business School; his father; the firm's CEO
  CATEGORY: PERSON
NAME: La Jolla Genomatics
  DESCRIPTOR:
  CATEGORY: ORGANIZATION
NAME: Geninfo
  DESCRIPTOR: its product
  CATEGORY: ARTIFACT
NAME: La Jolla
  DESCRIPTOR: the Maddox family's hometown
  CATEGORY: LOCATION

Facts:
PERSON Employee_of ORGANIZATION
  Fletcher Maddox Employee_of UCSD Business School
  Fletcher Maddox Employee_of La Jolla Genomatics
  Oliver Employee_of La Jolla Genomatics
  Ambrose Employee_of La Jolla Genomatics
ARTIFACT Product_of ORGANIZATION
  Geninfo Product_of La Jolla Genomatics
LOCATION Location_of ORGANIZATION
  La Jolla Location_of La Jolla Genomatics
  CA Location_of La Jolla Genomatics

Events:
COMPANY-FORMATION_EVENT:
  COMPANY: La Jolla Genomatics
  PRINCIPALS: Fletcher Maddox, Oliver, Ambrose
  DATE:
  CAPITAL:
RELEASE-EVENT:
  COMPANY: La Jolla Genomatics
  PRODUCT: Geninfo
  DATE: June 1999
  COST:

Information Extraction: current indicators of the state of the art
Items of Information    Percentile Reliability
Entities                90
Attributes              80
Facts                   70
Events                  60

Technical definition of IE The process of creating database entries by skimming a text and looking for occurrences of a particular class of object or event and for relationships among those objects and events [Russell, Norvig 2003]

Basic IE tasks
Extract addresses from Web pages. Target: street, city, state, and zip code.
Extract storms from weather reports. Target: temperature, wind speed, and precipitation.

IE Applications
Competitive intelligence: find instances of corporate mergers and joint ventures.
Intelligence gathering: find terrorist activities, including any damage to buildings or the infrastructure, as well as the time and location of the event.
Health care delivery: summarize medical patient records by extracting diagnoses, symptoms, physical findings, test results, and therapeutic treatments.

Technology Method in literature Regular expressions Cascaded finite-state transducers Our approaches Ontological domain knowledge Machine Learning Hybrid method

Regular expression approach example
From the text: "17in SXGA Monitor for only $249.99"
Extract: ∃ m  m ∈ ComputerMonitors ∧ Size(m, Inches(17)) ∧ Price(m, $(249.99)) ∧ Resolution(m, 1280 × 1024)

Regular Expressions
[0-9]                        matches any digit from 0 to 9
[0-9]+                       matches one or more digits
[.][0-9][0-9]                matches a period followed by two digits
([.][0-9][0-9])?             matches a period followed by two digits, or nothing
[$][0-9]+([.][0-9][0-9])?    matches $249.99, $1.23, $100000, ...
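The price pattern on this slide can be tried directly. The following is a minimal sketch in Python, assuming the period and the dollar sign are meant literally (so both are escaped); the SIZE pattern and the function name extract_monitor are hypothetical additions for illustration and are not part of the slides.

```python
import re

# Price pattern from the slide; "$" and "." are escaped so they match literally.
PRICE = re.compile(r"\$[0-9]+(?:\.[0-9][0-9])?")
# Hypothetical pattern for the screen size (an assumption, not from the slides).
SIZE = re.compile(r"([0-9]+)in\b")

def extract_monitor(text):
    """Return the attributes found in a short product listing."""
    price = PRICE.search(text)
    size = SIZE.search(text)
    return {
        # Drop the leading "$" before converting to a number.
        "price": float(price.group()[1:]) if price else None,
        "size_inches": int(size.group(1)) if size else None,
    }

record = extract_monitor("17in SXGA Monitor for only $249.99")
print(record)
```

The resolution attribute is left out on purpose: it cannot be read off the text with a pattern and needs the background knowledge that SXGA means 1280 × 1024.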

Weakness: What's the price? "List price $99.00, special sale price $78.00, shipping $3.00."
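The weakness is easy to reproduce: the price pattern matches all three amounts and gives no way to choose among them. The sketch below also shows one possible refinement, capturing the words before each amount as a label; the LABELLED pattern is a hypothetical illustration, not a method from the slides.

```python
import re

PRICE = re.compile(r"\$[0-9]+(?:\.[0-9][0-9])?")
text = "List price $99.00, special sale price $78.00, shipping $3.00."

# The bare pattern returns every dollar amount in the sentence.
all_prices = PRICE.findall(text)

# Hypothetical refinement: keep the words that precede each amount as its label.
LABELLED = re.compile(r"([A-Za-z ]+?)\s*(\$[0-9]+(?:\.[0-9][0-9])?)")
labelled = [(label.strip(), amount) for label, amount in LABELLED.findall(text)]

print(all_prices)
print(labelled)
```

Even with labels, deciding which amount answers "what is the price" still requires domain knowledge about list prices, sale prices, and shipping charges.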

Cascaded finite-state transducers approach example
From the text: "Bridgestone Sports Co. said Friday it has set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be shipped to Japan."
Extract: ∃ e  e ∈ JointVentures ∧ Product(e, "golf clubs") ∧ Date(e, "Friday") ∧ Entity(e, "Bridgestone Sports Co") ∧ Entity(e, "a local concern") ∧ Entity(e, "a Japanese trading house")

Cascaded finite-state transducers
A typical relational extraction system consists of the following five stages:
1. Tokenization
2. Complex word handling
3. Basic group handling
4. Complex phrase handling
5. Structure merging

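Tokenization: word segmentation.
Complex word handling: "Bridgestone Sports Co." is recognized as a company by the rule CapitalizedWord+ (Company|Co|Inc|Ltd). Care is needed: "Intel Chairman Andy Grove" would be recognized as a place by the rule CapitalizedWord+ (Grove|Forest|Village|...).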
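The two complex-word rules can be written as ordinary regular expressions. This is a rough sketch, assuming a capitalized word is an uppercase letter followed by letters; the names CAP, COMPANY, and PLACE are illustrative only, and the place rule lists just the three suffixes visible on the slide.

```python
import re

# A capitalized word: one uppercase letter followed by letters (an assumption).
CAP = r"[A-Z][A-Za-z]*"

# CapitalizedWord+ (Company|Co|Inc|Ltd), with an optional trailing period.
COMPANY = re.compile(rf"(?:{CAP}\s+)+(?:Company|Co|Inc|Ltd)\b\.?")
# CapitalizedWord+ (Grove|Forest|Village); only the suffixes shown on the slide.
PLACE = re.compile(rf"(?:{CAP}\s+)+(?:Grove|Forest|Village)\b")

company = COMPANY.search("Bridgestone Sports Co. said Friday it has set up a joint venture")
false_place = PLACE.search("Intel Chairman Andy Grove")

print(company.group())      # the company name is found as intended
print(false_place.group())  # a person's name is wrongly matched by the place rule
```

The second match shows why such rules need exception lists or context checks before their output is trusted.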

Basic group handling: noun group (NG), verb group (VG), preposition (PR), conjunction (CJ)
1 NG: Bridgestone Sports Co.
2 VG: said
3 NG: Friday
4 NG: it
5 VG: has set up
6 NG: a joint venture
7 PR: in
8 NG: Taiwan
9 PR: with
10 NG: a local concern
11 CJ: and
12 NG: a Japanese trading house
13 VG: to produce
14 NG: golf clubs
15 VG: to be shipped
16 PR: to
17 NG: Japan

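Complex phrase handling: basic groups are combined into complex phrases by rules such as Company+ SetUp JointVenture ("with" Company+)?
Structure merging: if the next sentence says something about the same event, the two structures are merged into one.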
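The complex phrase rule can be illustrated on the basic groups of the Bridgestone sentence. The sketch below is a simplification under stated assumptions: the groups are typed in by hand rather than produced by earlier transducers, the subject is taken to be the first noun group instead of resolving "it", and the function name find_joint_venture is hypothetical.

```python
# Basic groups of the example sentence as (tag, text) pairs.
groups = [
    ("NG", "Bridgestone Sports Co."), ("VG", "said"), ("NG", "Friday"),
    ("NG", "it"), ("VG", "has set up"), ("NG", "a joint venture"),
    ("PR", "in"), ("NG", "Taiwan"), ("PR", "with"),
    ("NG", "a local concern"), ("CJ", "and"),
    ("NG", "a Japanese trading house"), ("VG", "to produce"),
    ("NG", "golf clubs"), ("VG", "to be shipped"), ("PR", "to"), ("NG", "Japan"),
]

def find_joint_venture(groups):
    """Apply a simplified 'SetUp JointVenture (with Company+)?' rule."""
    for i in range(len(groups) - 1):
        tag, text = groups[i]
        if tag == "VG" and "set up" in text and "joint venture" in groups[i + 1][1]:
            # Simplification: the first noun group stands in for the subject.
            event = {"type": "JointVenture", "entities": [groups[0][1]]}
            j = i + 2
            # Skip ahead to an optional "with", stopping at the next verb group.
            while j < len(groups) and groups[j][0] != "VG" and groups[j] != ("PR", "with"):
                j += 1
            if j < len(groups) and groups[j] == ("PR", "with"):
                j += 1
                # Collect the partner noun groups joined by conjunctions.
                while j < len(groups) and groups[j][0] in ("NG", "CJ"):
                    if groups[j][0] == "NG":
                        event["entities"].append(groups[j][1])
                    j += 1
            return event
    return None

event = find_joint_venture(groups)
print(event)
```

A structure-merging stage would then compare this event with events found in later sentences and unify those that describe the same joint venture.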

A brief remark: IE works well for a restricted domain, where the subjects and the ways they are mentioned can be determined in advance.

Applications

Table Reading Citation Extraction Chinese NER

Semantic Search on Internet Tabular Information Extraction for Answering Queries CIKM 2000

Table Reading: Gives an algorithm to interpret tables of the type shown below, where some cells span multiple rows or columns. An example of an interpretation is: (Attribute)=>(Value), (Adult-Price-Single Room-Economic class)=>35,450
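To make the (Attribute)=>(Value) interpretation concrete, the following sketch expands spanning cells into a full grid and then joins the header cells that govern each data cell. It is only a plain span-expansion illustration, not the ontology-based method of the paper. The table layout, the "Double Room" column, and its placeholder value are assumptions made up for the example; only the 35,450 entry comes from the slide.

```python
# Each cell is (text, rowspan, colspan). Layout is hypothetical.
rows = [
    [("", 3, 1), ("Adult", 1, 2)],
    [("Price", 1, 2)],
    [("Single Room", 1, 1), ("Double Room", 1, 1)],
    [("Economic class", 1, 1), ("35,450", 1, 1), ("(placeholder)", 1, 1)],
]

def expand(rows):
    """Copy every spanning cell into each grid position it covers."""
    grid = {}
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            while (r, c) in grid:  # skip positions filled by an earlier rowspan
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    return grid

def interpret(grid, header_rows, header_cols):
    """Map 'column headers - row headers' attribute paths to cell values."""
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    result = {}
    for r in range(header_rows, n_rows):
        for c in range(header_cols, n_cols):
            path = [grid[(i, c)] for i in range(header_rows)]
            path += [grid[(r, j)] for j in range(header_cols)]
            result["-".join(path)] = grid[(r, c)]
    return result

table = interpret(expand(rows), header_rows=3, header_cols=1)
print(table["Adult-Price-Single Room-Economic class"])
```

This sketch assumes the number of header rows and columns is known; deciding which cells are concepts and which are instances is the job of the tagging and layout recognition steps.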

Method: Tagging, Layout Recognition, Layout Transformation. Diagram labels: HTML Table, C-I Table, Layout Description, Layout Transition Rule, Database Table, Ambiguous Relations of Cells.

Method Tagging Layout Identifying Layout Trans.

Airline Schedule Ontology

Tagging C: Departure Information C: Departure City C: Arrival City I: Departure City I: Arrival City Concept v.s. Descent Concept C: Departure City I: Departure City Concept v.s. Instance of the Concept Instance v.s. Instance of the same Concept

Four Relations of Table Cells (relations of concepts and instances)
1. Concept - Instance of the Concept
2. Concept - Descent Concept
3. Concept - Instance of Descent Concept
4. Instance - Instance of the same Concept

Layout Recognition C-I Table Layout Descriptions Template Matching Matched Layout Description Defined by Layout Syntax Grammar

Layout Transformation Origin Layout Description Destination Layout Description

Experiments: 23 tables from 23 web pages (13 two-dimensional tables, 10 complex tables). A table counts as a success only if no cell is missed; any miss counts as a failure.

Conclusion & Future Works: Layout transformation from complex tables to simple tables (1D, 2D). A general approach: 1. Tagging 2. Semantic Layout Recognition 3. Layout Transformation. Ambiguity is reduced by checking cell relations.

Reference
Huei-Long Wang, Shih-Hung Wu, I. C. Wang, Cheng-Lung Sung, W. L. Hsu, W. K. Shih, Semantic Search on Internet Tabular Information Extraction for Answering Queries, Ninth International Conference on Information and Knowledge Management (CIKM-2000), McLean, VA, November 6-11, 2000, pp. 243-249. (EI)
H.-H. Chen, S.-C. Tsai, and J.-H. Tsai, Mining Tables from Large Scale HTML Texts, in Proc. 18th International Conference on Computational Linguistics, Saarbrücken, Germany, July 2000.

A Knowledge-based Approach to Citation Extraction IRI-2005

Introduction Integration of the bibliographical information of scholarly publications available on the Internet Accurate reference metadata extraction from heterogeneous reference sources. We propose a knowledge-based approach to reference metadata extraction INFOMAP: ontological knowledge representation framework Automatically extract the reference metadata.

Proposed Approach

Phase 1: Reference Data Collection. The Journal Spider (journal agent) collects journal data from the Journal Citation Reports (JCR) indexed by the ISI and from digital libraries on the Web. Citation data sources: ISI Web of Science, DBLP, Citeseer, PubMed.

Phase 2: Domain Knowledge

INFOMAP: As an ontological knowledge representation framework, INFOMAP extracts important citation concepts from a natural language text. Features of INFOMAP: it represents and matches complicated template structures, including hierarchical matching, regular expressions, semantic template matching, and frame (non-linear relations) matching. Using INFOMAP, we can extract author, title, journal, volume, number (issue), year, and page information from different kinds of reference formats or styles.

Phase 3: Reference Metadata Extraction. Journal reference styles
Table 1. Examples of different journal reference styles
Bioinformatics style (BIOI): Davenport, T., DeLong, D., & Beers, M. (1998) Successful knowledge management projects. Sloan Management Review, 39(2), 43-57.
ACM style (ACM): 1. Davenport, T., DeLong, D. and Beers, M. 1998. Successful knowledge management projects. Sloan Management Review, 39 (2). 43-57.
IEEE style (IEEE): [1] T. Davenport, D. DeLong, and M. Beers, "Successful knowledge management projects," Sloan Management Review, vol. 39, no. 2, pp. 43-57, 1998.
APA style (APA): Davenport, T., DeLong, D., & Beers, M. (1998). Successful knowledge management projects. Sloan Management Review, 39(2), 43-57.
JCB style (JCB): Davenport, T., DeLong, D., & Beers, M. 1998. Successful knowledge management projects. Sloan Management Review 39(2), 43-57.
MISQ style (MISQ): Davenport, T., DeLong, D., and Beers, M. "Successful knowledge management projects," Sloan Management Review (39:2) 1998, pp 43-57.

Phase 4: Knowledge-based Reference Metadata Extraction - Online Service

Citation Extraction: From Text to BibTeX
Figure 3. The system input of knowledge-based RME
W. L. Hsu, "The coloring and maximum independent set problems on planar perfect graphs," J. Assoc. Comput. Machin., (1988), 535-563.
W. L. Hsu, "On the general feasibility test of scheduling lot sizes for several products on one machine," Management Science 29, (1983), 93-105.
W. L. Hsu, "The distance-domination numbers of trees," Operations Research Letters 1, (3), (1982), 96-100.
Figure 5. The system output in BibTeX format
@article{
  Author = {W. L. Hsu},
  Title = {The coloring and maximum independent set problems on planar perfect graphs,"},
  Journal = {J. Assoc. Comput. Machin.},
  Volume = {},
  Number = {},
  Pages = {535-563},
  Year = {1988 }}
@article{
  Author = {W. L. Hsu},
  Title = {On the general feasibility test of scheduling lot sizes for several products on one machine,"},
  Journal = {Management Science},
  Volume = {29},
  Number = {},
  Pages = {93-105},
  Year = {1983 }}
@article{
  Author = {W. L. Hsu},
  Title = {The distance-domination numbers of trees,"},
  Journal = {Operations Research Letters},
  Volume = {1},
  Number = {3},
  Pages = {96-100},
  Year = {1982 }}
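As a point of comparison for the text-to-BibTeX step, the sketch below parses a single reference style with one regular expression. It is not the INFOMAP method: it handles only the IEEE style example from Table 1, and the names IEEE, parse_ieee, and to_bibtex are hypothetical. It mainly shows why one pattern per style does not scale to the many styles a knowledge-based approach aims to cover.

```python
import re

# One hand-written pattern for the IEEE style only (an illustrative assumption).
IEEE = re.compile(
    r'^\[\d+\]\s*(?P<author>.+?),\s*"(?P<title>.+?),"\s*'
    r'(?P<journal>.+?),\s*vol\.\s*(?P<volume>\d+),\s*no\.\s*(?P<number>\d+),\s*'
    r'pp\.\s*(?P<pages>\d+-\d+),\s*(?P<year>\d{4})\.$'
)

def parse_ieee(reference):
    """Return the metadata fields of an IEEE-style reference, or None."""
    match = IEEE.match(reference.strip())
    return match.groupdict() if match else None

def to_bibtex(fields):
    """Render the fields as a BibTeX-like entry without a citation key."""
    lines = [f"  {key.capitalize()} = {{{value}}}" for key, value in fields.items()]
    return "@article{\n" + ",\n".join(lines) + "\n}"

reference = ('[1]T. Davenport, D. DeLong, and M. Beers, "Successful knowledge '
             'management projects," Sloan Management Review, vol. 39, no. 2, '
             'pp. 43-57, 1998.')
fields = parse_ieee(reference)
print(to_bibtex(fields))
```

Feeding the same function a reference in APA or MISQ style returns None, which is the heterogeneity problem the slides describe.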

Figure 6. The online service of knowledge-based RME (http://bioinformatics.iis.sinica.edu.tw/CitationAgent/). Panels: System Input (plain text), System Output (BibTeX).

Experimental Results and Discussion. Experimental data: We used EndNote to collect Bioinformatics citation data for 2004 from PubMed. A total of 907 bibliography records were collected from PubMed digital libraries on the Web. Reference testing data was generated for each of the six reference styles (BIOI, ACM, IEEE, APA, MISQ, and JCB). We randomly selected 500 records for testing from each of the six reference styles.

Experimental results of citation extraction from six reference styles

Example Results

The various structures of different styles (analysis of structures of 30 reference styles)
Field      Percentage of each field relation structure
Author     54.29%, 42.86%, N/A 2.85%
Year       48.57%, 20.00%, 14.29%, 5.71%, 2.86%, 2.86%, N/A 5.71%
Title      48.57%, 42.86%, N/A 8.57%
Journal    71.43%, 20.00%, 5.71%, N/A 2.86%
Volume     40.00%, 31.43%, 14.29%, 5.71%, 2.86%, 2.86%, N/A 2.85%
Issue      34.29%, 14.29%, N/A 51.42%
Pages      42.86%, 34.29%, 17.14%, 2.86%, N/A 2.85%

Comparison with related works. Knowledge-based approach: our proposed knowledge-based method for scholarly publications can extract reference information from 907 records in various reference styles with a high degree of precision. The overall average field accuracy is 97.87% for the six major styles listed in Table 1, 98.20% for the MISQ style, and 87% for 30 other randomly selected styles.

Conclusions: Citation extraction is a challenging problem because of the diverse nature of reference styles. We have proposed a knowledge-based citation extraction method for scholarly publications. The experimental results indicate that, by using INFOMAP, we can extract author, title, journal, volume, number (issue), year, and page information from different reference styles with a high degree of precision. The overall average field accuracy of citation extraction is 97.87% for six major reference styles.

Future Research: Integrate the ontological and the machine learning approaches to boost the performance of citation information extraction: Maximum-Entropy Method (MEM), Hidden Markov Model (HMM), Conditional Random Fields (CRF), Support Vector Machines (SVM).

Reference Min-Yuh Day, Tzong-Han Tsai, Cheng-Lung Sung, Cheng-Wei Lee, Shih-Hung Wu, Chorng-Shyong Ong, and Wen-Lian Hsu, A Knowledge-based Approach to Citation Extraction, to appear in Proceedings of IEEE International Conference on Information Reuse and Integration (IEEE IRI-2005), pp.50-55. (EI)

Chinese Named Entity Recognition Using a Hybrid Approach of Machine Learning and Domain Knowledge ROCLING 2003, CLCLP 2004

Named Entity Recognition Named Entity

Sequential Labeling
Type     Begin   Inside
Per      B-P     I-P
Loc      B-L     I-L
Org      B-O     I-O
Labeling unit: token-based or character-based
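The B-/I- tag scheme can be sketched in a few lines. The function below assigns a tag to every unit of a sentence given entity spans; the function name bio_tags, the use of "O" for units outside any entity, and the English example are assumptions for illustration. For character-based labeling the units would be single characters instead of tokens.

```python
def bio_tags(units, spans):
    """units: tokens or characters; spans: (start, end, type), end exclusive."""
    tags = ["O"] * len(units)          # "O" marks units outside any entity
    for start, end, kind in spans:
        tags[start] = f"B-{kind}"      # first unit of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{kind}"      # remaining units of the entity
    return tags

# Token-based labeling: P = person, O = organization.
tokens = ["Fletcher", "Maddox", "founded", "La", "Jolla", "Genomatics"]
token_tags = bio_tags(tokens, [(0, 2, "P"), (3, 6, "O")])

# Character-based labeling: each character is one unit.
chars = list("Maddox")
char_tags = bio_tags(chars, [(0, 6, "P")])

print(list(zip(tokens, token_tags)))
```

Character-based labeling avoids depending on a word segmenter, which matters for Chinese text where word boundaries are not marked.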

Machine Learning named entity corpus corpus target named entity, corpus . NER NER

Hybrid NER method. Domain knowledge. Machine learning: SVM, Bigram/Trigram Model. Hybrid: Maximum-Entropy Framework, in which domain knowledge serves as features.

Static knowledge is insufficient: new names (e.g., SARS), ambiguity, context dependence.

Pure machine learning might suffer Lack context information Window Size token tag NE NER , NER

Basic Concepts of Our ME-based Hybrid Approach NE Context Information Internal/External Features Training Data Feature , confidence

Internal/External Features Internal Found within the name string itself e.g., External Context e.g.,

Tag Set (outcome) Character Token, Named Entity , , Tag Set /B-P /I-P /I-P /B-L /I-L /I-L /B-O /I-O /I-O

ME-based NER Framework -Feature Representation For example: token , Feature f is active!!

ME-based NER Framework - Training: Given a set of features and a training corpus, the ME estimation process produces a model in which every feature f_i has a weight α_i. Then we are allowed to compute:
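The equation itself is missing from this transcript. The standard conditional maximum-entropy form, assumed here to be the one intended (with α_i as the symbol for the weight of feature f_i, and Z(h) as a normalization term), is:

```latex
p(o \mid h) = \frac{1}{Z(h)} \prod_{i} \alpha_i^{\,f_i(h,\,o)},
\qquad
Z(h) = \sum_{o'} \prod_{i} \alpha_i^{\,f_i(h,\,o')}
```

Here h is the history (context) of the current token and o is a candidate outcome (tag). Because each f_i is binary, only the weights of active features contribute to the product.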

ME-based NER Framework - Decoding: Tokenize the text and preprocess the testing sentence. For each token, check which features are active and combine the weights α_i of the active features according to Equation 1. A Viterbi search is run to find the highest probability path.
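The decoding steps can be sketched as follows, assuming the usual normalized-product form for combining weights and a tag constraint that an I- tag may only follow a B- or I- tag of the same type. The reduced tag set, the weights, and the per-token probabilities are all made up for illustration; they are not values from the system described in the slides.

```python
import math

TAGS = ["B-P", "I-P", "O"]  # reduced tag set for the example

def me_probs(active_weights):
    """Combine the weights of active features into p(o|h) for each outcome."""
    scores = {o: math.prod(active_weights.get(o, [])) for o in TAGS}
    z = sum(scores.values())  # normalization term Z(h)
    return {o: s / z for o, s in scores.items()}

def valid(prev, cur):
    """An I-X tag may only follow B-X or I-X; any other tag is unconstrained."""
    if cur.startswith("I-"):
        return prev in ("B-" + cur[2:], "I-" + cur[2:])
    return True

def viterbi(prob_seq):
    """Return the highest-probability valid tag path."""
    # Each layer maps a tag to (best log-probability so far, previous tag).
    best = [{t: (math.log(p), None) for t, p in prob_seq[0].items() if valid(None, t)}]
    for probs in prob_seq[1:]:
        layer = {}
        for cur, p in probs.items():
            cands = [(score + math.log(p), prev)
                     for prev, (score, _) in best[-1].items() if valid(prev, cur)]
            if cands:
                layer[cur] = max(cands)
        best.append(layer)
    # Trace back from the best final tag.
    tag = max(best[-1], key=lambda t: best[-1][t][0])
    path = [tag]
    for layer in reversed(best[1:]):
        tag = layer[tag][1]
        path.append(tag)
    return path[::-1]

# Hypothetical per-token distributions; picking the top tag per token would
# give O, I-P, O, which is not a valid tag sequence.
seq = [
    {"B-P": 0.4, "I-P": 0.1, "O": 0.5},
    {"B-P": 0.1, "I-P": 0.7, "O": 0.2},
    {"B-P": 0.1, "I-P": 0.1, "O": 0.8},
]
print(viterbi(seq))
```

The search trades a slightly less likely first tag for a path that is valid as a whole, which is the reason for running Viterbi rather than choosing each tag independently.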

Hybrid NER Example: The NER problem is formulated as maximizing p(o|h) and finding the corresponding outcome o. W_0: the current token. Os Ls Ps Context (History) Feature 1:

Advantages of Hybrid NER , . , Performance

Experiment - Data Set: United Daily News (December, 2002)
                  Number of Named Entities
Domain            PER     LOC     ORG     Size (in characters)
Local News        84      139     97      11835
Social Affairs    310     287     354     37719
Investment        20      63      33      14397
Politics          419     209     233     17168
Headline News     267     70      243     19938
Business          142     186     187     25815
Total             1242    954     1147    126872

Experiment Result
Use domain knowledge only
NE       P(%)     R(%)     F(%)
PER      72.98    97.93    83.63
LOC      67.96    74.67    71.16
ORG      95.77    64.07    76.78
Total    75.62    82.13    78.74
ME-based Hybrid
NE       P(%)     R(%)     F(%)
PER      97.94    87.39    92.36
LOC      78.60    69.35    73.69
ORG      94.39    62.57    75.25
Total    90.56    73.70    81.26
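The F column in these results is consistent with the balanced F-measure, the harmonic mean of precision and recall. A short check, with the function name f_measure chosen for illustration:

```python
def f_measure(precision, recall):
    """Balanced F-measure: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# PER row with domain knowledge only, then PER row of the ME-based hybrid.
f_domain_per = round(f_measure(72.98, 97.93), 2)
f_hybrid_per = round(f_measure(97.94, 87.39), 2)
print(f_domain_per, f_hybrid_per)
```

Both values reproduce the table entries (83.63 and 92.36), showing that the hybrid gains far more in precision than it loses in recall for person names.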

Performance Comparison. Corpus: MET2 Dataset. Number of Entities: 3646
              Person              Location            Organization        Overall
Sys.          P      R      F     P      R      F     P      R      F     P      R      F
NTU (98)      74     91     81.6  69     78     73.2  85     78     81.3  77     83     79.9
KRDL (98)     66.4   92     77.1  89     90.9   90    89.5   87.8   88.6  85.2   90.2   87.6
IASL (03)     92.1   83.3   87.5  88.1   81.8   84.9  93.3   88.7   90.9  90.4   85     87.7

Conclusion and Future Work. Conclusion: Hybrid Approach, Precision Improvement. Future Work: Named Entity Features, Multi Iteration NER, Hierarchical Named Entity.

References
[Tsai 2003] Tzong-Han Tsai, Shih-Hung Wu, and Wen-Lian Hsu, "Mencius: A Chinese Named Entity Recognizer Using Hybrid Model," in Proceedings of the Fifteenth Research on Computational Linguistics International Conference (ROCLING XV), pp. 193-209, 2003.
[Tsai 2004] Tzong-Han Tsai, Shih-Hung Wu, and Wen-Lian Hsu, "Mencius: A Chinese Named Entity Recognizer Based on a Maximum Entropy Framework," Computational Linguistics and Chinese Language Processing, Vol. 9, No. 1, pp. 65-82, 2004.
[Shih 2004] Cheng-Wei Shih, Tzong-Han Tsai, Shih-Hung Wu, Chiu-Chen Hsieh, and Wen-Lian Hsu, "The Construction of a Chinese Named Entity Tagged Corpus: CNEC1.0," in Proceedings of the Sixteenth Conference on Computational Linguistics and Speech Processing (ROCLING XVI), pp. 305-313, 2004.
