Information Extraction Shih-Hung Wu Assistant Professor CSIE, Chaoyang University of Technology.

Transcript of Information Extraction Shih-Hung Wu Assistant Professor CSIE, Chaoyang University of Technology.

  • Slide 1
  • Information Extraction Shih-Hung Wu Assistant Professor CSIE, Chaoyang University of Technology
  • Slide 2
  • Outline: Information Extraction (Introduction); Applications (Table Reading, Citation Extraction, Chinese Named Entity Recognition)
  • Slide 3
  • Introduction
  • Slide 4
  • Information Extraction: extracts pieces of information that are salient to the user's needs
  • Slide 5
  • Message Understanding Conferences (MUC): The evaluations provide prepared data and task definitions, along with fully automated scoring software to measure machine and human performance. The databases now include named entities, multilingual named entities, attributes of those entities, facts about relationships between entities, and events in which the entities participated. The multilingual portion was known as the "Multilingual Entity Task (MET)".
  • Slide 6
  • Examples The following fictional news story portrays the levels of detail that systems can extract: Fletcher Maddox, former Dean of the UCSD Business School, announced the formation of La Jolla Genomatics together with his two sons. La Jolla Genomatics will release its product Geninfo in June 1999. Geninfo is a turnkey system to assist biotechnology researchers in keeping up with the voluminous literature in all aspects of their field. Dr. Maddox will be the firm's CEO. His son, Oliver, is the Chief Scientist and holds patents on many of the algorithms used in Geninfo. Oliver's brother, Ambrose, follows more in his father's footsteps and will be the CFO of L.J.G. headquartered in the Maddox family's hometown of La Jolla, CA.
  • Slide 7
  • Entities: Persons: Fletcher Maddox, Dr. Maddox, Oliver, Oliver, Ambrose, Maddox. Organizations: UCSD Business School, La Jolla Genomatics, La Jolla Genomatics, L.J.G. Locations: La Jolla, CA. Artifacts: Geninfo, Geninfo. Dates: June 1999.
  • Slide 8
  • Attributes: NAME: Fletcher Maddox; DESCRIPTOR: former Dean of the UCSD Business School, his father, the firm's CEO; CATEGORY: PERSON. NAME: La Jolla Genomatics; DESCRIPTOR: (none); CATEGORY: ORGANIZATION. NAME: Geninfo; DESCRIPTOR: its product; CATEGORY: ARTIFACT. NAME: La Jolla; DESCRIPTOR: the Maddox family's hometown; CATEGORY: LOCATION.
  • Slide 9
  • Facts: PERSON Employee_of ORGANIZATION: Fletcher Maddox Employee_of UCSD Business School; Fletcher Maddox Employee_of La Jolla Genomatics; Oliver Employee_of La Jolla Genomatics; Ambrose Employee_of La Jolla Genomatics. ARTIFACT Product_of ORGANIZATION: Geninfo Product_of La Jolla Genomatics. LOCATION Location_of ORGANIZATION: La Jolla Location_of La Jolla Genomatics; CA Location_of La Jolla Genomatics.
  • Slide 10
  • Events: COMPANY-FORMATION_EVENT: COMPANY: La Jolla Genomatics; PRINCIPALS: Fletcher Maddox, Oliver, Ambrose; DATE: (unfilled); CAPITAL: (unfilled). RELEASE-EVENT: COMPANY: La Jolla Genomatics; PRODUCT: Geninfo; DATE: June 1999; COST: (unfilled).
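The event templates on this slide are essentially records with named slots, some of which stay empty when the text does not supply a value. A minimal Python sketch of that idea follows; the class names, field names, and the `unfilled` helper are illustrative assumptions, not part of any MUC specification.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CompanyFormationEvent:
    # Slot names mirror the slide; None marks a slot the text did not fill.
    company: str
    principals: List[str] = field(default_factory=list)
    date: Optional[str] = None
    capital: Optional[str] = None


@dataclass
class ReleaseEvent:
    company: str
    product: str
    date: Optional[str] = None
    cost: Optional[str] = None


def unfilled(event):
    """Return the names of slots that are still empty (hypothetical helper)."""
    return [name for name, value in vars(event).items() if value is None]


formation = CompanyFormationEvent(
    "La Jolla Genomatics", ["Fletcher Maddox", "Oliver", "Ambrose"]
)
release = ReleaseEvent("La Jolla Genomatics", "Geninfo", date="June 1999")

print(unfilled(formation))
print(unfilled(release))
```

Keeping empty slots explicit, rather than dropping them, matches the slide, where DATE, CAPITAL, and COST are listed even though the story gives no value for them.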
  • Slide 11
  • Information Extraction: current indicators of the state of the art (item of information: percentile reliability): Entities: 90; Attributes: 80; Facts: 70; Events: 60.
  • Slide 12
  • Technical definition of IE The process of creating database entries by skimming a text and looking for occurrences of a particular class of object or event and for relationships among those objects and events [Russell, Norvig 2003]
  • Slide 13
  • Basic IE tasks: Extract addresses from Web pages (target: street, city, state, and zip code). Extract storms from weather reports (target: temperature, wind speed, and precipitation).
  • Slide 14
  • IE Applications. Competitive intelligence: find instances of corporate mergers and joint ventures. Intelligence gathering: terrorist activities, any damage to buildings or the infrastructure, as well as the time and location of the event. Health care delivery: summarize medical patient records by extracting diagnoses, symptoms, physical findings, test results, and therapeutic treatments.
  • Slide 15
  • Technology. Methods in the literature: regular expressions; cascaded finite-state transducers. Our approaches: ontological domain knowledge; machine learning; hybrid method.
  • Slide 16
  • Regular expression approach example. From the text: "17in SXGA Monitor for only $249.99". Extract: ∃m m ∈ ComputerMonitors ∧ Size(m, Inches(17)) ∧ Price(m, $(249.99)) ∧ Resolution(m, 1280 × 1024)
  • Slide 17
  • Regular Expressions: [0-9] matches any digit from 0 to 9; [0-9]+ matches one or more digits; \.[0-9][0-9] matches a period followed by two digits; (\.[0-9][0-9])? matches a period followed by two digits, or nothing; \$[0-9]+(\.[0-9][0-9])? matches $249.99, $1.23, $100000.
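The price pattern built up on this slide can be tried directly with Python's `re` module. The sketch below is only an illustration: the period and dollar sign are escaped, since unescaped they are regex metacharacters, and the function name `find_price` is a hypothetical helper.

```python
import re

# "$", one or more digits, then optionally a period and exactly two digits.
PRICE = re.compile(r"\$[0-9]+(\.[0-9][0-9])?")


def find_price(text):
    """Return the first price-like substring in text, or None."""
    match = PRICE.search(text)
    return match.group(0) if match else None


print(find_price("17in SXGA Monitor for only $249.99"))
for candidate in ["$249.99", "$1.23", "$100000"]:
    print(candidate, bool(PRICE.fullmatch(candidate)))
```

Because the cents part is optional, a string such as "$249.9" still yields a match on "$249"; the pattern says nothing about what follows the matched span.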
  • Slide 18
  • Weakness: What's the price? "List price $99.00, special sale price $78.00, shipping $3.00."
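The weakness is easy to reproduce: a price pattern alone returns every dollar amount and cannot say which one is "the" price. The sketch below pairs each match with the text preceding it, purely to make the ambiguity visible; the helper name and the context heuristic are assumptions, not a method from the slides.

```python
import re

# Non-capturing group so the whole price is returned as the match.
PRICE = re.compile(r"\$[0-9]+(?:\.[0-9][0-9])?")


def prices_with_context(text):
    """Pair each price with the words between it and the previous price."""
    results = []
    prev_end = 0
    for match in PRICE.finditer(text):
        label = text[prev_end:match.start()].strip(" ,.")
        results.append((label, match.group(0)))
        prev_end = match.end()
    return results


text = "List price $99.00, special sale price $78.00, shipping $3.00."
for label, price in prices_with_context(text):
    print(f"{label!r}: {price}")
```

Choosing among the three candidates requires knowledge about what "list price", "sale price", and "shipping" mean, which is exactly what a bare regular expression lacks.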
  • Slide 19
  • Cascaded finite-state transducers approach example. From: "Bridgestone Sports Co. said Friday it has set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be shipped to Japan." Extract: ∃e e ∈ JointVentures ∧ Product(e, "golf clubs") ∧ Date(e, "Friday") ∧ Entity(e, "Bridgestone Sports Co") ∧ Entity(e, "a local concern") ∧ Entity(e, "a Japanese trading house")
  • Slide 20
  • Cascaded finite-state transducers: A typical relational extraction system consists of the following five stages: tokenization; complex word handling; basic group handling; complex phrase handling; structure merging.
  • Slide 21
  • Tokenization: word segmentation. Complex word handling: "Bridgestone Sports Co." is matched by CapitalizedWord+(Company|Co|Inc|Ltd); "Intel Chairman Andy Grove": CapitalizedWord+(Grove|Forest|Village|)
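The company rule CapitalizedWord+(Company|Co|Inc|Ltd) translates almost directly into a regular expression. The following is a rough sketch under the assumption that a capitalized word is an uppercase letter followed by letters and that words are separated by single spaces; `find_companies` is a hypothetical helper name.

```python
import re

# One or more capitalized words, then a company designator, optional period.
COMPANY = re.compile(r"(?:[A-Z][A-Za-z]* )+(?:Company|Co|Inc|Ltd)\b\.?")


def find_companies(text):
    """Return all substrings that look like company names under the rule."""
    return [match.group(0) for match in COMPANY.finditer(text)]


sentence = (
    "Bridgestone Sports Co. said Friday it has set up a joint venture "
    "in Taiwan with a local concern and a Japanese trading house."
)
print(find_companies(sentence))
```

The word boundary after the designator keeps "Co" from matching inside longer words such as "Corp"; a real system would need many more designators and exceptions.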
  • Slide 22
  • Basic group handling: noun group (NG), verb group (VG), preposition (PR), conjunction (CJ). 1 NG: Bridgestone Sports Co.; 2 VG: said; 3 NG: Friday; 4 NG: it; 5 VG: has set up; 6 NG: a joint venture; 7 PR: in; 8 NG: Taiwan; 9 PR: with; 10 NG: a local concern; 11 CJ: and; 12 NG: a Japanese trading house; 13 VG: to produce; 14 NG: golf clubs; 15 VG: to be shipped; 16 PR: to; 17 NG: Japan
  • Slide 23
  • Complex phrase handling: Company+ SetUp JointVenture (with Company+)? Structure merging: merge structures if the next sentence says something about the same event.
  • Slide 24
  • A brief remark: IE works well for a restricted domain. Predetermine the subjects and how they are mentioned.
  • Slide 25
  • Applications
  • Slide 26
  • Table Reading; Citation Extraction; Chinese NER
  • Slide 27
  • Semantic Search on Internet Tabular Information Extraction for Answering Queries (CIKM 2000)
  • Slide 28
  • Table Reading: Gives an algorithm to interpret tables of the type shown below, where some cells span multiple rows or columns. An example of interpretation is (Attribute) => (Value): (Adult-Price-Single Room-Economic class) => 35,450
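To illustrate what such an interpretation involves, here is a deliberately simple sketch that expands spanning cells into a full grid and then joins column headers and row headers into an attribute path. This is not the paper's tagging and layout-recognition method; the table layout, the second price, and the function names are invented for the example, and only the path "Adult-Price-Single Room-Economic class" and the value 35,450 come from the slide.

```python
def expand_spans(rows):
    """Expand (text, rowspan, colspan) cells into a rectangular grid."""
    grid = {}
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            while (r, c) in grid:  # skip slots filled by earlier row spans
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


def interpret(rows, header_rows, header_cols):
    """Map 'column headers-row headers' attribute paths to cell values."""
    grid = expand_spans(rows)
    result = {}
    for r in range(header_rows, len(grid)):
        for c in range(header_cols, len(grid[r])):
            col_path = [grid[h][c] for h in range(header_rows)]
            row_path = [grid[r][h] for h in range(header_cols)]
            result["-".join(col_path + row_path)] = grid[r][c]
    return result


# Hypothetical table: three header rows, one header column.
table = [
    [("", 3, 1), ("Adult", 1, 2)],
    [("Price", 1, 2)],
    [("Single Room", 1, 1), ("Double Room", 1, 1)],
    [("Economic class", 1, 1), ("35,450", 1, 1), ("32,500", 1, 1)],  # 32,500 is made up
]

for attribute, value in interpret(table, header_rows=3, header_cols=1).items():
    print(f"({attribute}) => {value}")
```

The sketch assumes the caller already knows how many header rows and columns there are; deciding that automatically is the hard part that the tagging and layout-recognition steps address.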
  • Slide 29
  • Slide 30
  • Method: Tagging, Layout Recognition, Layout Transformation. Diagram labels: HTML Table; C-I Table; Layout Description; Layout Transition Rule; Database Table; Ambiguous Relations of Cells.
  • Slide 31
  • Method Tagging Layout Identifying Layout Trans.
  • Slide 32
  • Airline Schedule Ontology
  • Slide 33
  • Tagging: C: Departure Information; C: Departure City; C: Arrival City; I: Departure City; I: Arrival City. Concept vs. Descent Concept. C: Departure City, I: Departure City: Concept vs. Instance of the Concept. Instance vs. Instance of the same Concept.
  • Slide 34
  • Four Relations of Table Cells (relations of concepts and instances): 1. Concept - Instance of the Concept; 2. Concept - Descent Concept; 3. Concept - Instance of Descent Concept; 4. Instance - Instance of the same Concept
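Given cells tagged as concepts (C) or instances (I), the four relations can be checked mechanically against an ontology. The sketch below assumes a toy ontology stored as a child-to-parent map and cells represented as (kind, concept) pairs; the placement of "Departure City" under "Departure Information" and all names are assumptions for illustration.

```python
# Hypothetical fragment of an airline schedule ontology: child -> parent.
PARENT = {"Departure City": "Departure Information"}


def is_descendant(child, ancestor):
    """True if `ancestor` is reachable from `child` via PARENT links."""
    node = PARENT.get(child)
    while node is not None:
        if node == ancestor:
            return True
        node = PARENT.get(node)
    return False


def relation(cell_a, cell_b):
    """Classify the relation between two tagged cells, or return None."""
    (kind_a, concept_a), (kind_b, concept_b) = cell_a, cell_b
    if kind_a == "C" and kind_b == "I":
        if concept_a == concept_b:
            return "Concept - Instance of the Concept"
        if is_descendant(concept_b, concept_a):
            return "Concept - Instance of Descent Concept"
    if kind_a == "C" and kind_b == "C" and is_descendant(concept_b, concept_a):
        return "Concept - Descent Concept"
    if kind_a == "I" and kind_b == "I" and concept_a == concept_b:
        return "Instance - Instance of the same Concept"
    return None


print(relation(("C", "Departure City"), ("I", "Departure City")))
print(relation(("C", "Departure Information"), ("C", "Departure City")))
```

Cell pairs for which `relation` returns None have no recognized relation under this toy ontology.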
  • Slide 35
  • Layout Recognition: C-I Table; Layout Descriptions (defined by Layout Syntax Grammar); Template Matching; Matched Layout Description
  • Slide 36
  • Layout Transformation: Origin Layout Description; Destination Layout Description
  • Slide 37
  • Experiments: 23 tables from 23 web pages: 13 two-dimensional tables and 10 complex tables. Success means no miss; any miss results in failure.
  • Slide 38
  • Conclusion & Future Works: Layout transformation from complex tables to simple tables (1D, 2D). A general approach: 1. Tagging; 2. Semantic Layout Recognition; 3. Layout Transformation. Ambiguity is reduced by checking cell relations.
  • Slide 39
  • Reference
    Huei-Long Wang, Shih-Hung Wu, I. C. Wang, Cheng-Lung Sung, W. L. Hsu, W. K. Shih, Semantic Search on Internet Tabular Information Extraction for Answering Queries, Ninth International Conference on Information and Knowledge Management (CIKM-2000), McLean, VA, November 6-11, 2000, pp. 243-249. (EI)
    H.-H. Chen, S.-C. Tsai, and J.-H. Tsai, Mining Tables from Large Scale HTML Texts, in Proc. 18th International Conference on Computational Linguistics, Saarbrücken, Germany, July 2000.
  • Slide 40
  • A Knowledge-based Approach to Citation Extraction (IRI-2005)
  • Slide 41
  • Introduction: Integration of the bibliographical information of scholarly publications available on the Internet. Accurate reference metadata extraction from heterogeneous reference sources. We propose a knowledge-based approach to reference metadata extraction. INFOMAP: an ontological knowledge representation framework. Automatically extract the reference metadata.
  • Slide 42
  • Proposed Approach
  • Slide 43
  • Reference Data Collection (Phase 1): Journal Spider (journal agent) collects journal data from the Journal Citation Reports (JCR) indexed by the ISI and from digital libraries on the Web. Citation data sources: ISI Web of Science, DBLP, Citeseer, PubMed.
  • Slide 44
  • Domain Knowledge (Phase 2)
  • Slide 45
  • INFOMAP: INFOMAP, as an ontological knowledge representation framework, extracts important citation concepts from a natural language text. Features of INFOMAP: represents and matches complicated template structures; hierarchical matching; regular expressions; semantic template matching; frame (non-linear relations) matching. Using INFOMAP, we can extract author, title, journal, volume, number (issue), year, and page information from different kinds of reference formats or styles.
  • Slide 46
  • Reference Metadata Extraction (Phase 3). Table 1. Examples of different journal reference styles (reference style: example)
    Bioinformatics style (BIOI): Davenport, T., DeLong, D., & Beers, M. (1998) Successful knowledge management projects. Sloan Management Review, 39(2), 43-57.
    ACM style (ACM): 1. Davenport, T., DeLong, D. and Beers, M. 1998. Successful knowledge management projects. Sloan Management Review, 39 (2). 43-57.
    IEEE style (IEEE): [1] T. Davenport, D. DeLong, and M. Beers, "Successful knowledge management projects," Sloan Management Review, vol. 39, no. 2, pp. 43-57, 1998.
    APA style (APA): Davenport, T., DeLong, D., & Beers, M. (1998). Successful knowledge management projects. Sloan Management Review, 39(2), 43-57.
    JCB style (JCB): Davenport, T., DeLong, D., & Beers, M. 1998. Successful knowledge management projects. Sloan Management Review 39(2), 43-57.
    MISQ style (MISQ): Davenport, T., DeLong, D., and Beers, M. "Successful knowledge management projects," Sloan Management Review (39:2) 1998, pp 43-57.
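To give a feel for what field extraction from one of these styles involves, the sketch below parses the APA example with a single hand-written regular expression. It is a plain-regex stand-in, not INFOMAP's template matching; the pattern and the function name `parse_apa` are assumptions and would fail on references that deviate from this exact layout.

```python
import re

# Authors (Year). Title. Journal, Volume(Number), Pages.
APA = re.compile(
    r"^(?P<author>.+?) \((?P<year>\d{4})\)\. "
    r"(?P<title>.+?)\. "
    r"(?P<journal>.+?), (?P<volume>\d+)\((?P<number>\d+)\), "
    r"(?P<pages>\d+-\d+)\.$"
)


def parse_apa(reference):
    """Return a dict of reference fields, or None if the pattern fails."""
    match = APA.match(reference)
    return match.groupdict() if match else None


reference = (
    "Davenport, T., DeLong, D., & Beers, M. (1998). "
    "Successful knowledge management projects. "
    "Sloan Management Review, 39(2), 43-57."
)
for name, value in parse_apa(reference).items():
    print(f"{name}: {value}")
```

Each of the six styles would need its own pattern under this approach, which hints at why a knowledge-based representation of field order and delimiters is attractive.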
  • Slide 47
  • Knowledge-based Reference Metadata Extraction - Online Service (Phase 4)
  • Slide 48
  • Citation Extraction: From Text to BibTex
    Figure 3. The system input of knowledge-based RME:
    W. L. Hsu, "The coloring and maximum independent set problems on planar perfect graphs," J. Assoc. Comput. Machin., (1988), 535-563.
    W. L. Hsu, "On the general feasibility test of scheduling lot sizes for several products on one machine," Management Science 29, (1983), 93-105.
    W. L. Hsu, "The distance-domination numbers of trees," Operations Research Letters 1, (3), (1982), 96-100.
    Figure 5. The system output of BibTex Format:
    @article{ Author = {W. L. Hsu}, Title = {The coloring and maximum independent set problems on planar perfect graphs,"}, Journal = {J. Assoc. Comput. Machin.}, Volume = {}, Number = {}, Pages = {535-563}, Year = {1988 }}
    @article{ Author = {W. L. Hsu}, Title = {On the general feasibility test of scheduling lot sizes for several products on one machine,"}, Journal = {Management Science}, Volume = {29}, Number = {}, Pages = {93-105}, Year = {1983 }}
    @article{ Author = {W. L. Hsu}, Title = {The distance-domination numbers of trees,"}, Journal = {Operations Research Letters}, Volume = {1}, Number = {3}, Pages = {96-100}, Year = {1982 }}
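Once the fields are extracted, producing output in the shape of Figure 5 is a matter of formatting. The sketch below is a hypothetical formatter, not the system's code; like the figure, it emits no citation key, which a BibTeX processor would normally require.

```python
FIELD_ORDER = ["Author", "Title", "Journal", "Volume", "Number", "Pages", "Year"]


def to_bibtex(fields):
    """Format extracted fields as an @article entry; missing fields stay empty."""
    body = ",\n".join(f"  {name} = {{{fields.get(name, '')}}}" for name in FIELD_ORDER)
    return "@article{\n" + body + "\n}"


entry = to_bibtex({
    "Author": "W. L. Hsu",
    "Title": "On the general feasibility test of scheduling lot sizes "
             "for several products on one machine",
    "Journal": "Management Science",
    "Volume": "29",
    "Pages": "93-105",
    "Year": "1983",
})
print(entry)
```

Emitting empty braces for a missing field, as with Number here, mirrors the figure, where Volume and Number are left blank when the input reference does not contain them.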
  • Slide 49
  • Figure 6. The online service of knowledge-based RME (http://bioinformatics.iis.sinica.edu.tw/CitationAgent/). Panels: System Input (plain text); System Output (BibTex)
  • Slide 50
  • Experimental Results and Discussion. Experimental data: We used EndNote to collect Bioinformatics citation data for 2004 from PubMed. A total of 907 bibliography records were collected from PubMed digital libraries on the Web. Reference testing data was generated for each of the six reference styles (BIOI, ACM, IEEE, APA, MISQ, and JCB). We randomly selected 500 records for testing from each of the six reference styles.
  • Slide 51
  • Experimental results of citation extraction from six reference styles
  • Slide 52
  • Example Results
  • Slide 53
  • The various structures of different styles (analysis of structures of 30 reference styles). Percentage of each field relation structure, by field:
    Author: 54.29%, 42.86%, N/A 2.85%
    Year: 48.57%, 20.00%, 14.29%, 5.71%, 2.86%, 2.86%, N/A 5.71%
    Title: 48.57%, 42.86%, N/A 8.57%
    Journal: 71.43%, 20.00%, 5.71%, N/A 2.86%
    Volume: 40.00%, 31.43%, 14.29%, 5.71%, 2.86%, 2.86%, N/A 2.85%
    Issue: 34.29%, 14.29%, N/A 51.42%
    Pages: 42.86%, 34.29%, 17.14%, 2.86%, N/A 2.85%
  • Slide 54
  • Comparison with related works. Knowledge-based approach: Our proposed knowledge-based method for scholarly publications can extract reference information from 907 records in various reference styles with a high degree of precision. The overall average field accuracy is 97.87% for the six major styles listed in Table 1; 98.20% for the MISQ style; 87% for 30 other randomly selected styles.
  • Slide 55
  • Conclusions: Citation extraction is a challenging problem because of the diverse nature of reference styles. We have proposed a knowledge-based citation extraction method for scholarly publications. The experimental results indicate that, by using INFOMAP, we can extract author, title, journal, volume, number (issue), year, and page information from different reference styles with a high degree of precision. The overall average field accuracy of citation extraction is 97.87% for six major reference styles.
  • Slide 56
  • Future Research: Integrate the ontological and the machine learning approaches to boost the performance of citation information extraction: Maximum-Entropy Method (MEM); Hidden Markov Model (HMM); Conditional Random Fields (CRF); Support Vector Machines (SVM)
  • Slide 57
  • Reference Min-Yuh Day, Tzong-Han Tsai, Cheng-Lung Sung, Cheng-Wei Lee, Shih-Hung Wu, Chorng-Shyong Ong, and Wen-Lian Hsu, A Knowledge-based Approach to Citation Extraction, to appear in Proceedings of IEEE International Conference on Information Reuse and Integration (IEEE IRI-2005), pp.50-55. (EI)
  • Slide 58
  • Chinese Named Entity Recognition Using a Hybrid Approach of Machine Learning and Domain Knowledge (ROCLING 2003, CLCLP 2004)
  • Slide 59
  • Named Entity Recognition Named Entity
  • Slide 60
  • Sequential Labeling: Per: B-P, I-P; Loc: B-L, I-L; Org: B-O, I-O. Token-based; Character-based.
  • Slide 61
  • Machine Learning named entity corpus corpus target named entity, corpus . NER NER
  • Slide 62
  • Hybrid NER method. Domain knowledge. Machine Learning: SVM, Bigram/Trigram Model. Hybrid: Maximum-Entropy Framework; domain knowledge serves as features.
  • Slide 63
  • Static knowledge is insufficient: new names (e.g. SARS); ambiguity; context dependence
  • Slide 64
  • Pure machine learning might suffer Lack context information Window Size token tag NE NER , NER
  • Slide 65
  • Basic Concepts of Our ME-based Hybrid Approach NE Context Information Internal/External Features Training Data Feature , confidence
  • Slide 66
  • Internal/External Features. Internal: found within the name string itself. External: context.
  • Slide 67
  • Tag Set (outcome) Character Token, Named Entity , , Tag Set /B-P /I-P /I-P /B-L /I-L /I-L /B-O /I-O /I-O
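Character-based tagging with this tag set can be sketched as a conversion from entity spans to one tag per character. In the sketch below the example sentence, the span format, and the "O" tag for characters outside any entity are assumptions; the slide itself only shows the B-/I- tags.

```python
def bio_tags(chars, entities):
    """Assign B-X/I-X tags to entity characters and O to the rest.

    entities: list of (start, end, type) with `end` exclusive.
    """
    tags = ["O"] * len(chars)
    for start, end, entity_type in entities:
        tags[start] = "B-" + entity_type
        for i in range(start + 1, end):
            tags[i] = "I-" + entity_type
    return tags


# Hypothetical sentence: a three-character person name, a verb,
# and a three-character location name.
sentence = "王小明到台北市"
entities = [(0, 3, "P"), (4, 7, "L")]

tags = bio_tags(sentence, entities)
print(" ".join(f"{char}/{tag}" for char, tag in zip(sentence, tags)))
```

Treating every character as a token sidesteps Chinese word segmentation errors, at the cost of longer tag sequences.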
  • Slide 68
  • ME-based NER Framework -Feature Representation For example: token , Feature f is active!!
  • Slide 69
  • ME-based NER Framework - Training: Given a set of features and a training corpus, the ME estimation process produces a model in which every feature f_i has an associated weight. The model then allows us to compute p(o|h), the probability of outcome o given history h.
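The equation itself does not appear in the transcript. For orientation, the usual maximum-entropy form with binary features is given below; the weight symbol \alpha_i and the normalizer Z(h) are assumed notation, and this is presumably, though not certainly, the expression the decoding slide calls Equation 1.

```latex
p(o \mid h) = \frac{1}{Z(h)} \prod_{i} \alpha_i^{\,f_i(h, o)},
\qquad
Z(h) = \sum_{o'} \prod_{i} \alpha_i^{\,f_i(h, o')}
```

Since each f_i(h, o) is 0 or 1, only the weights of active features contribute to the product, and Z(h) makes the probabilities over all outcomes sum to one.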
  • Slide 70
  • ME-based NER Framework - Decoding: Tokenize the text and preprocess the testing sentence. For each token, check which features are active and combine the weights of the active features according to Equation 1. A Viterbi search is then run to find the highest-probability path.
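The final decoding step can be sketched as a small Viterbi search over per-token outcome probabilities. The sketch assumes the probabilities have already been computed, uses a reduced tag set, and enforces the common constraint that I-X may only follow B-X or I-X; that constraint and all names are assumptions rather than details given on the slide.

```python
import math

TAGS = ["O", "B-P", "I-P"]  # reduced tag set for illustration


def valid(prev_tag, tag):
    """Assumed constraint: I-X may only follow B-X or I-X."""
    if tag.startswith("I-"):
        return prev_tag in ("B-" + tag[2:], "I-" + tag[2:])
    return True


def viterbi(probs):
    """probs: one dict per token mapping tag -> p(tag | history)."""
    # best[t][tag] = (log-probability of best valid path ending in tag, previous tag)
    best = [{
        tag: (math.log(probs[0][tag]), None)
        for tag in TAGS
        if not tag.startswith("I-") and probs[0].get(tag, 0.0) > 0.0
    }]
    for t in range(1, len(probs)):
        layer = {}
        for tag in TAGS:
            p = probs[t].get(tag, 0.0)
            if p <= 0.0:
                continue
            candidates = [
                (score + math.log(p), prev_tag)
                for prev_tag, (score, _) in best[t - 1].items()
                if valid(prev_tag, tag)
            ]
            if candidates:
                layer[tag] = max(candidates)
        best.append(layer)
    # Trace back from the best final tag.
    tag = max(best[-1], key=lambda k: best[-1][k][0])
    path = [tag]
    for t in range(len(probs) - 1, 0, -1):
        tag = best[t][tag][1]
        path.append(tag)
    return path[::-1]


# Picking the top tag per token would give O, I-P, which is invalid.
probs = [
    {"O": 0.6, "B-P": 0.3, "I-P": 0.1},
    {"O": 0.25, "B-P": 0.1, "I-P": 0.65},
]
print(viterbi(probs))
```

The example shows why a path search is needed: the locally best tags form an inadmissible sequence, while the best valid path starts the name at the first token.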
  • Slide 71
  • Hybrid NER Example: The NER problem has been formulated as maximizing p(o|h) and finding its corresponding outcome o. W_0: the current token. Os Ls Ps. Context (History). Feature 1:
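A toy version of this formulation shows how domain knowledge can enter as a feature. In the sketch below the surname list, the feature, its weight, and the outcome set are all invented for illustration, and the probability uses the standard product-of-weights form with normalization.

```python
# Hypothetical domain knowledge: a tiny surname dictionary.
SURNAMES = {"王", "陳", "林"}
OUTCOMES = ["B-P", "I-P", "O"]


def f_surname_starts_person(history, outcome):
    """Binary feature: current token w0 is a known surname and outcome is B-P."""
    return 1 if history["w0"] in SURNAMES and outcome == "B-P" else 0


# (feature, weight) pairs; the weight 4.0 is made up, not a trained value.
FEATURES = [(f_surname_starts_person, 4.0)]


def cond_prob(outcome, history):
    """p(o | h): product of the weights of active features, normalized."""
    def score(o):
        value = 1.0
        for feature, weight in FEATURES:
            if feature(history, o):
                value *= weight
        return value

    return score(outcome) / sum(score(o) for o in OUTCOMES)


for token in ["王", "到"]:
    history = {"w0": token}
    best = max(OUTCOMES, key=lambda o: cond_prob(o, history))
    print(token, best, round(cond_prob(best, history), 3))
```

When no feature is active, as for the second token, all outcomes are equally likely, so a usable model needs many features covering both internal and contextual evidence.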
  • Slide 72
  • Advantages of Hybrid NER , . , Performance
  • Slide 73
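  • Experiment - Data Set: United Daily News (December, 2002). Number of named entities (PER, LOC, ORG) and size in characters, by domain:
    Domain | PER | LOC | ORG | Size (in characters)
    Local News | 84 | 139 | 97 | 11835
    Social Affairs | 310 | 287 | 354 | 37719
    Investment | 20 | 63 | 33 | 14397
    Politics | 419 | 209 | 233 | 17168
    Headline News | 267 | 70 | 243 | 19938
    Business | 142 | 186 | 187 | 25815
    Total | 1242 | 954 | 1147 | 126872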
  • Slide 74
  • Experiment Result
    Use domain knowledge only:
    NE | P(%) | R(%) | F(%)
    PER | 72.98 | 97.93 | 83.63
    LOC | 67.96 | 74.67 | 71.16
    ORG | 95.77 | 64.07 | 76.78
    Total | 75.62 | 82.13 | 78.74
    ME-based Hybrid:
    NE | P(%) | R(%) | F(%)
    PER | 97.94 | 87.39 | 92.36
    LOC | 78.60 | 69.35 | 73.69
    ORG | 94.39 | 62.57 | 75.25
    Total | 90.56 | 73.70 | 81.26
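The F column is consistent with the balanced F-measure, the harmonic mean of precision and recall. The slides do not state the formula, so treating F as F1 is an assumption; the short check below recomputes two PER entries from their P and R values.

```python
def f_measure(precision, recall):
    """Balanced F-measure (harmonic mean of precision and recall)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# PER rows: domain knowledge only, then ME-based hybrid.
print(round(f_measure(72.98, 97.93), 2))
print(round(f_measure(97.94, 87.39), 2))
```

Read this way, the hybrid model trades some recall on person names for a large gain in precision, and the balanced score rises as a result.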
  • Slide 75
  • Performance Comparison. Corpus: MET2 Dataset. Number of Entities: 3646
    Sys. | Person P / R / F | Location P / R / F | Organization P / R / F | Overall P / R / F
    NTU (98) | 74 / 91 / 81.6 | 69 / 78 / 73.2 | 85 / 78 / 81.3 | 77 / 83 / 79.9
    KRDL (98) | 66.4 / 92 / 77.1 | 89 / 90.9 / 90 | 89.5 / 87.8 / 88.6 | 85.2 / 90.2 / 87.6
    IASL (03) | 92.1 / 83.3 / 87.5 | 88.1 / 81.8 / 84.9 | 93.3 / 88.7 / 90.9 | 90.4 / 85 / 87.7
  • Slide 76
  • Conclusion and Future Work. Conclusion: Hybrid Approach; Hybrid Approach Precision Improvement; Hybrid Improvement. Future Work: Named Entity Features; Named Entity Multi Iteration NER; Hierarchical Named Entity
  • Slide 77
  • References
    [Tsai 2003] Tzong-Han Tsai, Shih-Hung Wu, and Wen-Lian Hsu, "Mencius: A Chinese Named Entity Recognizer Using Hybrid Model," in Proceedings of the Fifteenth Research on Computational Linguistics International Conference (ROCLING XV), pp. 193-209, 2003.
    [Tsai 2004] Tzong-Han Tsai, Shih-Hung Wu, and Wen-Lian Hsu, "Mencius: A Chinese Named Entity Recognizer Based on a Maximum Entropy Framework," Computational Linguistics and Chinese Language Processing, Vol. 9, No. 1, pp. 65-82, 2004.
    [Shih 2004] Cheng-Wei Shih, Tzong-Han Tsai, Shih-Hung Wu, Chiu-Chen Hsieh, and Wen-Lian Hsu, "The Construction of a Chinese Named Entity Tagged Corpus: CNEC1.0," in Proceedings of the Sixteenth Conference on Computational Linguistics and Speech Processing (ROCLING XVI), pp. 305-313, 2004.