On-Demand Information Extraction (mavir2006.mavir.net/docs/Sekine-OnDemandIE.pdf)

On-Demand Information Extraction New York University Satoshi Sekine Summer/Fall 07


  • On-Demand Information Extraction

    Satoshi Sekine, New York University

    Summer/Fall 07

  • Introduction(http://nlp.cs.nyu.edu/sekine)

  • Research topics

    • Attribute for ENE
    • English Analyzer (OAK)
    • Web People Search
    • Japanese Spelling
    • Multi/Single-doc. sum.
    • IE and Sum.
    • Cross Ling.
    • Pre-ODIE
    • Paraphrase Discovery
    • IE pattern Discovery
    • Encyclopedia QA
    • IR, clustering & sum.
    • Cross Ling. QA
    • 2nd grade language test
    • Query log & class
    • Information needs
    • On-demand IE
    • QA
    • IR
    • Summarization
    • Relation Discovery
    • Extended NE

  • What is Information Extraction

    • Automatically extract information on a specific scenario from unstructured text and put it into a table

    • Example: management succession, extracted from a newspaper

    Position  | Company                 | Person     | Date
    President | Bank of Manhattan       | Bill Brown | 7/2/2003
    COE       | Smith Trade corporation | John Smith | 7/1/2003

  • Problem in conventional IE

    • Preparation of knowledge for a given scenario
      – Patterns, lexicon and semantic knowledge
        • e.g. Company announced Person’s promotion to Position
      – Writing rules by hand or creating training data
    • Restrictions
      – Preparation time (one month in MUC)
      – Specific/limited scenario/domain

  • Our goal

    • Reduce one month to one minute (= automatic)
    • IE on the fly
    • How?
      – Use unsupervised learning methods
        • Pattern discovery, paraphrase discovery
      – Prepare as much knowledge and as many tools as we can for as many scenarios as possible
        • Extended NE, semantic knowledge, relation
    • Demo: http://blueberry.cs.nyu.edu:8080/odie

  • ODIE – Flow Chart

    [Flow chart. Components shown: Description of task, IR system, Relevant documents, Language Analyzer, ENE tagger, Co-reference, Pattern discovery, Patterns, Paraphrase discovery, Pattern sets, Table construction, Table]

  • Automatic Discovery of IE patterns (Sudo et al. HLT01) (Sudo et al. ACL03)

    • The bottleneck of IE is knowledge creation
    • IE pattern
      – Company announced Person’s promotion to Position
    • Key idea
      – Retrieve documents on the given topic
      – Collect relatively frequent patterns in the retrieved documents
    • Research focus and results
      – Pattern format (predicate-argument, chain, sub-tree)
      – Computation time (tree mining algorithms)
      – Near-human performance was achieved
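
The key idea above (retrieve topic documents, then keep patterns that are relatively frequent in them) can be sketched as follows. This is a minimal illustration, not the method of Sudo et al.: it assumes patterns have already been extracted as strings, and it uses a TF x IDF style score as a stand-in for the ranking. The function name `rank_patterns` and the input layout are hypothetical.

```python
import math
from collections import Counter

def rank_patterns(relevant_docs, all_docs):
    """Rank candidate patterns found in the retrieved (relevant) documents.

    Each document is a list of pattern strings (hypothetical, pre-extracted).
    `all_docs` must include the relevant documents, so every pattern has a
    document frequency of at least 1.
    """
    # How often each pattern occurs in the retrieved documents.
    tf = Counter(p for doc in relevant_docs for p in doc)
    # In how many documents of the whole collection each pattern occurs.
    df = Counter(p for doc in all_docs for p in set(doc))
    n = len(all_docs)
    scores = {p: tf[p] * math.log(n / df[p]) for p in tf}
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda p: (-scores[p], p))

relevant = [
    ["COMPANY announce PERSON promotion", "PERSON say"],
    ["COMPANY announce PERSON promotion", "PERSON say"],
]
collection = relevant + [["PERSON say"], ["PERSON say"], ["TEAM win GAME"]]
print(rank_patterns(relevant, collection))
```

A pattern such as "PERSON say" is frequent everywhere, so it is ranked below the scenario-specific pattern even though both occur equally often in the retrieved documents.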

  • IE Pattern Discovery - Result

    • Scenario: Succession
    • Train: 1 yr. of newspaper
    • Test: 148 (87 relevant, 61 irrelevant)
    • Slots: Person: 135, Org: 172, Post: 215

  • Automatic acquisition of paraphrase (Sekine Gengo01) (Shinyama et al. HLT02) (Shinyama et al. IWP03)

    • Purpose: find relationships between IE patterns
    • Key idea
      – The same event is usually reported in different newspapers on the same day
      – The wording differs, but NEs and numeric and date expressions are identical
      – Use them as anchors to acquire paraphrases
    • We observed encouraging results, but found that it is not an easy task

  • Procedure (Paraphrase Acquisition)

    [Diagram: Newspaper A and Newspaper B → find comparable articles → Article A and Article B → NE tagging and co-reference → anchors identified in each article → find comparable sentences → extract paraphrase → phrase A and phrase B]
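
The anchor step of this procedure can be sketched as below. It is only an illustration under stated assumptions: each sentence is represented as a dict with its text and a pre-computed list of anchors (named entities, numbers, dates), and the threshold `min_shared=2` is an arbitrary choice, not a value from the cited work.

```python
def shared_anchors(sent_a, sent_b):
    """Anchors (NEs, numbers, dates) that appear in both sentences."""
    return set(sent_a["anchors"]) & set(sent_b["anchors"])

def comparable_sentence_pairs(article_a, article_b, min_shared=2):
    """Pair sentences from two articles on the same event.

    Two sentences are treated as comparable when they share at least
    `min_shared` anchors (hypothetical threshold).
    """
    pairs = []
    for sa in article_a:
        for sb in article_b:
            if len(shared_anchors(sa, sb)) >= min_shared:
                pairs.append((sa["text"], sb["text"]))
    return pairs

article_a = [
    {"text": "Police arrested John Doe on Monday.", "anchors": ["John Doe", "Monday"]},
    {"text": "The investigation continues.", "anchors": []},
]
article_b = [
    {"text": "John Doe was taken into custody Monday.", "anchors": ["John Doe", "Monday"]},
]
print(comparable_sentence_pairs(article_a, article_b))
```

The paired sentences are then the candidates from which paraphrase phrases would be extracted; that extraction step is not shown here.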

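  • Evaluation (Paraphrase Acquisition)

    • Two techniques (co-reference, Argument Structure Database)
    • One year of two Japanese newspapers
    • 195 article pairs on arrests of murder suspects

    Rows differ in whether co-reference (Coref) and the Argument Structure Database (ASD) were used.

    Unit           | Obtained | Precision | Recall
    Sentence (93)  | 55       | 75% (41)  | 44%
    Sentence (93)  | 75       | 69% (52)  | 56%
    Phrases (>100) | 106      | 24% (25)  |
    Phrases (>100) | 37       | 62% (23)  |
    Phrases (>100) | 32       | 56% (18)  |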

  • Relation discovery (Hasegawa et al. ACL04)

    • Motivation
      – Discover particular relations between NEs
      – e.g. Country & President, Merged Companies
    • Unsupervised method
      – The relationships are not known in advance
      – Find as many relationships as possible
    • Idea: context-based clustering
      – Pairs of entities with similar contexts would be in the same relation
      – Similar contexts could also characterize the relations

  • Procedure (Relation Discovery)

    • Tag ENEs in the text and select two ENE types
    • Collect pairs of ENEs that occur within 5 words of each other, and use the words between them as the context vector
    • Cluster ENE pairs based on their context vectors
    • Label each cluster based on frequent context words

    [Diagram: contexts of 5 words or less between two named entities are accumulated per pair, e.g. for Company A and Company B: “is offering to buy”, “’s interest in”, “is negotiating to acquire”, “’s planned purchase of”, …]
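
The four steps above can be sketched in a few lines. This is a simplified illustration, not the published system: it assumes the intervening context words per entity pair have already been collected, uses raw word counts as the vector, and replaces the clustering algorithm with a greedy grouping in which a pair joins a cluster only if it is similar to every member. The threshold value is arbitrary.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def cluster_pairs(contexts, threshold=0.3):
    """contexts: {(ne_a, ne_b): [accumulated context words]}."""
    vectors = {pair: Counter(words) for pair, words in contexts.items()}
    clusters = []
    for pair in vectors:
        for cluster in clusters:
            # Join only if similar to every current member of the cluster.
            if all(cosine(vectors[pair], vectors[m]) >= threshold for m in cluster):
                cluster.append(pair)
                break
        else:
            clusters.append([pair])
    return clusters

def label(cluster, contexts):
    """Label a cluster with its most frequent context word."""
    counts = Counter(w for pair in cluster for w in contexts[pair])
    return counts.most_common(1)[0][0]

contexts = {
    ("Company A", "Company B"): ["is", "offering", "to", "buy", "acquire", "buy"],
    ("Company C", "Company D"): ["agreed", "to", "buy", "acquire"],
    ("Team X", "Person Y"): ["coach", "coach", "head"],
}
for cluster in cluster_pairs(contexts):
    print(label(cluster, contexts), cluster)
```

A real system would also drop stop words and weight the vector entries; both are omitted to keep the sketch short.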

  • Result (Relation Discovery)

    Major relations | Ratio | Common words (relative frequency)
    President       | 17/23 | President (1.0), president (0.415), …
    Senator         | 19/21 | Sen. (1.0), Republican (0.214), …
    Prime Minister  | 15/16 | Minister (1.0), minister (0.875), Prime (0.875), …
    Governor        | 15/16 | Gov. (1.0), governor (0.458), Governor (0.3), …
    Secretary       | 6/7   | Secretary (1.0), secretary (0.143), …
    Republican      | 5/6   | Rep. (1.0), Republican (0.667), …
    Coach           | 5/5   | coach (1.0), …
    M&A             | 10/11 | buy (1.0), bid (0.382), …
    M&A             | 9/9   | acquire (1.0), acquisition (0.583), buy (0.583), …
    Parent          | 7/7   | parent (1.0), unit (0.476), own (0.143), …

  • Evaluation and error analysis (Relation Discovery)

    • Evaluation result
      – Labeling accuracy: 10/10 = 100% (cluster level), 108/121 = 89% (NE pair level)
      – (cf. recall: 140/(177+65) = 58% (all clusters))
    • False alarms (mis-clustered NE pairs)
      – Common words of another cluster appeared in the 5-word context by accident (noise words)
      – … Chechnya war may exhaust President Boris Yeltsin …
    • Misses (undetected NE pairs)
      – Absence of common words in the 5-word context
      – Outer context might be helpful
      – … Boris Yeltsin on end the fighting in Chechnya …

  • Another Paraphrase Discovery via Relation Discovery (Hasegawa et al. Gengo-05)

    • Expressions found in the same relation should contain paraphrases
    • Two filters
      – The expression is used with more than one pair of instances
      – The expression contains a frequent term
    • Example: A bought B / A has agreed to buy B / A, which is buying B / A’s proposed acquisition of B / A’s acquisition of B / A’s agreement to buy B / A’s purchase of B / A bid for B / A’s takeover of B / A merger with B / A succeeded in buying B / B, which was acquired by A / B would become a subsidiary of A / B agreed to be bought by A

  • Yet Another Paraphrase Discovery using context and keywords (Sekine IWP-05)

    • Goal: eliminate the frequency threshold for the vector model in Hasegawa’s method
    • Algorithm
      – Find keywords in the context of each pair of NEs using TFIDF
      – Cluster phrases based on NE instances (buy – acquire / buy – agree / buy – purchase)
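
The two algorithm steps can be sketched as follows, under loudly stated assumptions: contexts are plain word lists per NE pair, TF-IDF is the textbook `tf * log(N/df)` variant, and two phrases are linked whenever they are observed with the same NE pair instance. The function names and data layout are hypothetical, not taken from the cited paper.

```python
import math
from collections import Counter, defaultdict

def keywords_by_tfidf(pair_contexts, top_k=1):
    """Return the top-k TF-IDF words in the context of each NE pair."""
    n = len(pair_contexts)
    df = Counter(w for words in pair_contexts.values() for w in set(words))
    result = {}
    for pair, words in pair_contexts.items():
        tf = Counter(words)
        ranked = sorted(tf, key=lambda w: (-tf[w] * math.log(n / df[w]), w))
        result[pair] = ranked[:top_k]
    return result

def link_phrases(phrase_instances):
    """Link keyword phrases that occur with the same NE pair instance.

    phrase_instances: iterable of (phrase, ne_pair) observations.
    """
    by_pair = defaultdict(set)
    for phrase, pair in phrase_instances:
        by_pair[pair].add(phrase)
    links = set()
    for phrases in by_pair.values():
        ordered = sorted(phrases)
        links.update((a, b) for i, a in enumerate(ordered) for b in ordered[i + 1:])
    return links

observations = [
    ("buy", ("Company A", "Company B")),
    ("acquire", ("Company A", "Company B")),
    ("buy", ("Company C", "Company D")),
    ("purchase", ("Company C", "Company D")),
]
print(sorted(link_phrases(observations)))
```

Because links come from shared NE instances rather than from vector similarity, no frequency threshold on the context vectors is needed in this sketch.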

  • Extended Named Entity (Sekine et al. LREC02) (Sekine et al. LREC04)

    • Named Entity with 200 categories
    • Definition (150 pages) was released
    • Automatic tagger
      – 130K entries
      – 2000 rules
      – 70% accuracy
      – Improving…

  • Attributes for ENE (ver. 7.0.1) (example for PERSON)

    Attribute (20)  | Example of value                                                 | Freq. (%) | ENE
    Vocation        | professional baseball player, economist, poet                    | 46 (100)  | Vocation
    Nationality     | American, Chinese, Japanese                                      | 29 (63)   | Country
    Career          | A professor at Yale University, The Princess of Wales            | 26 (57)   | Vocation
    Masterpiece     | Guernica, Mona Lisa                                              | 25 (54)   | Product, Facility
    Graduate        | M.A. in German at Cambridge, MK High School                      | 20 (44)   | School
    Hometown        | Paris, Manchester, Shanghai                                      | 19 (41)   | City
    Native Province | State of Illinois, Sichuan                                       | 18 (39)   | Province
    Previous stay   | England, New York                                                | 12 (26)   | Location
    Mentor          | Andrea del Verrocchio, Michelangelo di Lodovic Buonarroti Simoni | 10 (22)   | Person
    Death date      | 04,23,1704, 04/23/1704, unknown                                  | 10 (22)   | date
    Era             | Edo period, the 11th century                                     | 8 (17)    | Era
    Award           | Academy Award, MVP, Nobel Prize                                  | 8 (17)    | Award
    Real Name       | Saint Nicholas                                                   | 8 (17)    | Person
    Another name    | Santa, father Christmas                                          | 8 (17)    | Person
    Title           | Knight, an honorary degree at Yale                               | 6 (13)    | Title
    Competition     | World Series, 1955 piano competition in Paris                    | 6 (13)    | Game
    Place of birth  | New York, Birmingham                                             | 5 (11)    | Location
    Father          | John B. Kelly, Sr.                                               | 5 (11)    | Person
    Cause of death  | Car accident, Guillotine                                         | 5 (11)    |


  • On-demand Information Extraction - ODIE -

    (Middle-size NSF project at NYU, 03-08)

    • Demo: http://blueberry.cs.nyu.edu:8080/odie
    • Description of the task (topic)
      – acquire acquisition merge merger buy purchase
    • Build a table in 1 minute

    Docid          | Company                                      | Money         | Date
    nyt950420.0056 | Comdata Holdings Corp., Ceridian Corp.       | $900 million  | Last week
    nyt951010.0389 | CoreStates Financial Corp., Meridian Bancorp | $3.1 billion  |
    nyt951002.0099 | Tenneco Inc., Mobil Corp.                    | $1.27 billion |
    nyt951113.0483 | Chemicals Inc.,                              | $400 million  | Last year

  • Source sentences for the table rows

    • nyt950831.0485 The move comes amid a flurry of acquisitions in the computer financial processing industry. Last week, check processor Ceridian Corp. agreed to buy Comdata Holdings Corp. for $900 million.

    • nyt951113.0483 Last year, Terra Industries Inc., another large fertilizer maker, bought Agricultural Minerals & Chemicals Inc. for $400 million. ``Industry growth is primarily overseas,'' said Harvey Stober, an analyst with Donaldson, Lufkin & Jenrette in New York.

    • nyt951002.0099 Tenneco Inc. said it agreed to buy Mobil Corp.'s plastics division for $1.27 billion as part of its strategy to expand into less cyclical, higher-growth businesses.

    • nyt951010.0389 CoreStates Financial Corp. agreed to buy Meridian Bancorp for $3.1 billion in stock, creating a bank with a commanding role in Philadelphia and the industrial cities of eastern Pennsylvania.
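
To make the step from matched sentences to table rows concrete, here is a small sketch. It is an assumption-heavy stand-in: ODIE fills slots from discovered patterns over ENE-tagged text, whereas this example uses one hand-written regular expression for the "X agreed to buy Y for $N" pattern. The names `ACQUISITION` and `build_table` are hypothetical.

```python
import re

# One surface pattern, standing in for an automatically discovered pattern set.
ACQUISITION = re.compile(
    r"(?P<buyer>(?:[A-Z][\w&.]*\s)+)(?:said it )?agreed to buy "
    r"(?P<target>.+?) for (?P<money>\$[\d.]+ (?:million|billion))"
)

def build_table(sentences):
    """sentences: list of (docid, text); returns one row per matched sentence."""
    rows = []
    for docid, text in sentences:
        m = ACQUISITION.search(text)
        if m:
            rows.append({
                "docid": docid,
                "buyer": m.group("buyer").strip(),
                "target": m.group("target"),
                "money": m.group("money"),
            })
    return rows

sentences = [
    ("nyt950831.0485", "Last week, check processor Ceridian Corp. agreed to buy "
                       "Comdata Holdings Corp. for $900 million."),
    ("nyt951002.0099", "Tenneco Inc. said it agreed to buy Mobil Corp.'s plastics "
                       "division for $1.27 billion as part of its strategy."),
    ("nyt951010.0389", "CoreStates Financial Corp. agreed to buy Meridian Bancorp "
                       "for $3.1 billion in stock."),
]
for row in build_table(sentences):
    print(row)
```

The sentence with "bought" is not matched by this single pattern, which is exactly the gap that paraphrase discovery is meant to close.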

  • ODIE - Evaluation Result

    • Out of 20 topics, the tables created for 2 topics are judged very useful, 12 are useful, and 6 are not useful.

    • Correctness of the table fillers

    Evaluation        | Number of rows
    Correct           | 84
    Partially correct | 4
    Incorrect         | 12


  • Preemptive Information Extraction (Shinyama, Sekine NAACL06)

    • We can even eliminate queries…

    1. Find a pair (or triple) of entities that have a similar relation in articles.
       e.g. the relation between “Katrina” and “New Orleans” in Article A is similar to the relation between “Longwang” and “Taiwan” in Article B.
    2. Cluster relations based on their similarity.
    3. A cluster of similar relations = a table.

  • Knowledge Discovery from Query Log (work at Microsoft Research, Summer 06)

    • Query logs contain a lot of knowledge
    • Discover knowledge about ENE…

    bathroom+showers+picturesmiss+mundoowen+mulligan+tyronenews+pressfree+wedding+plannermarc+arther+glenpreadolescente+follandogokmenler+agricultural+machinery+co.+-erotika+comacademy+award+winner

    100+greatest+britons AWARD
    100+worst+britons AWARD
    aaass/orbis+books+prize+for+polish+studies AWARD
    abel+prize AWARD
    academy+award AWARD
    academy+awards AWARD
    acm+turing+award AWARD
    agatha+award AWARD
    agatha+awards AWARD
    air+medal AWARD

    Samples above: query log (6 months, 1.5B) and ENE dictionary (245K)

  • Problems to be solved

    • Problem 1 (important context): This is a very general context. I want to see the important contexts for this category.
    • Problem 2 (noise): This is noise in the ENE dictionary. I want to get rid of it!
    • Problem 3 (populate the list): The list may not have enough coverage. I want to populate the ENE dictionary.
    • Problem 4 (clustering): Individual entities are verbose. I want to cluster them.

  • Key Idea

    • Find important contexts for each category
      – New formula (not TFIDF, MI, log likelihood ratio …), for a context c:

        $$\mathrm{Score}(c) = f_{type}(c) \cdot \log\frac{g(c)}{C}, \qquad g(c) = \frac{f_{type}(c)}{F_{inst}(c)}, \qquad C = \frac{f_{type}(c_{top1000})}{F_{inst}(c_{top1000})}$$

      – Result
        • ACADEMIC: #+syllabus, degrees+in+#, phd+in+#, masters+in+#, diploma+in+#, #+lecture+notes, masters+degree+in+#, courses+in+,
        • AWARD: #+winners, #+nominees, #+nominations, #+winner, #+award, who+won+#, winners+of+#, list+of+#+winners, winners+of+the+#
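
The scoring formula on this slide can be written out as a short function. The slide does not define the counts, so the following readings are assumptions: `f_type` is how often a context occurs with entities of the target category, `F_inst` is how often it occurs in the whole query log, and the normalizer C is computed from the summed counts of the 1000 most frequent contexts. The function name and the toy numbers are illustrative only.

```python
import math

def context_scores(stats, top_n=1000):
    """Score contexts as f_type * log(g / C), with g = f_type / F_inst.

    stats: {context: (f_type, f_inst)}; see the assumptions stated above.
    """
    # Assumption: "top1000" means the top_n contexts by overall frequency,
    # and C is the ratio of their summed counts.
    top = sorted(stats, key=lambda c: -stats[c][1])[:top_n]
    c_norm = sum(stats[c][0] for c in top) / sum(stats[c][1] for c in top)
    scores = {}
    for context, (f_type, f_inst) in stats.items():
        if f_type > 0:  # log is undefined for contexts never seen with the category
            scores[context] = f_type * math.log((f_type / f_inst) / c_norm)
    return scores

toy_stats = {
    "#+winners": (50, 100),     # mostly seen with the category
    "free+#": (10, 1000),       # general context
    "#+pictures": (40, 4000),   # very general context
}
scores = context_scores(toy_stats)
for context in sorted(scores, key=scores.get, reverse=True):
    print(context, round(scores[context], 2))
```

Contexts whose category share g falls below the average share C receive a negative score, which is what pushes very general contexts such as "#+pictures" to the bottom.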

  • Results

    • Delete noise (Problem 2)
      – Entities which do not co-occur with important contexts are noise
      – BOOK: it, porno, space, night, we, working, candy, wheels, foundation, jazz, ghost, couples, the bible, giant, daddy, creation …
    • Find new entities (Problem 3)
      – Words which co-occur with important contexts are entities
      – AWARD: golden+globes, grammys, golden+globe, kentucky+derby, daytime+emmy, sag, sag+awards, american+idol, daytime+emmys
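
Both results follow from the same test: does a term co-occur with any important context? A minimal sketch, assuming the co-occurrences have already been collected from the query log as a mapping from term to the set of contexts it was seen in (the function name and data layout are hypothetical):

```python
def filter_and_populate(dictionary, cooccurrences, important_contexts):
    """Split dictionary entries into kept and noise, and propose new entities.

    dictionary: set of known entities for one category
    cooccurrences: {term: set of contexts the term was seen with}
    important_contexts: set of high-scoring contexts for the category
    """
    kept = {e for e in dictionary
            if cooccurrences.get(e, set()) & important_contexts}
    noise = set(dictionary) - kept
    new_entities = {t for t, ctxs in cooccurrences.items()
                    if t not in dictionary and ctxs & important_contexts}
    return kept, noise, new_entities

award_dictionary = {"academy+award", "air+medal", "space"}
seen_with = {
    "academy+award": {"#+winners", "#+pictures"},
    "air+medal": {"who+won+#"},
    "space": {"#+pictures"},
    "golden+globes": {"#+nominees"},
    "wedding+planner": {"free+#"},
}
important = {"#+winners", "#+nominees", "who+won+#"}
print(filter_and_populate(award_dictionary, seen_with, important))
```

In this toy data "space" is flagged as noise and "golden+globes" is proposed as a new entity; real use would likely require a minimum number of co-occurrences rather than a single one.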

  • Web People Search, a SemEval 2007 task

    • Input: first 100 results for a person name search
    • Output: clustering according to the actual people
    • John Smith 1 (Captain)
      – Captain John Smith – www.apva.org
      – John Smith wikipedia – en.wikipedia…
      – …
    • John Smith 2 (Labour leader)
      – BBC: Labour leader John Smith – news…
      – John Smith wikipedia – en.wikipedia…
    • John Smith 3 (IBM researcher)
    • John Smith 4 (Film director)
    • John Smith 5 (Shoe company)
    • John Smith 6 (Writer)
    • …

  • WePS

    • SemEval - formerly called “Senseval” (since 1998)
      – 27 proposals / 18 final tasks
      – >100 teams, >125 systems, 110 papers
    • WePS is the largest single task
      – 16 participants (38 people in ML)
      – http://nlp.uned.es/weps
    • Data

    Set      | Source    | Av. entity | Av. doc.
    Training | Wikipedia | 23.14      | 99.00
    Training | ECDL06    | 15.30      | 99.20
    Training | WEB03*    | 5.90       | 47.20
    Training | Total av. | 10.76      | 71.20
    Test     | Wikipedia | 56.50      | 99.30
    Test     | ACL06     | 31.00      | 98.40
    Test     | Census    | 50.30      | 99.10
    Test     | Total av. | 45.93      | 98.93

  • WePS (Result)

    Team               | F (α=0.5) | Purity | Inv. purity | Presentation/Poster
    CU_COMSEM          | 0.79 | 0.72 | 0.88 | Tomorrow - PS2 - 11:15–12:30
    COMBINED_BASELINE  | 0.78 | 0.64 | 1.00 |
    IRST-BP            | 0.77 | 0.75 | 0.80 | Tomorrow - PS2 - 11:15–12:30
    PSNUS              | 0.77 | 0.73 | 0.82 | Next presentation - 17:00–17:15
    UVA                | 0.69 | 0.81 | 0.60 | Tomorrow - PS3 - 14:30–15:45
    SHEF               | 0.66 | 0.60 | 0.82 |
    FICO               | 0.67 | 0.53 | 0.90 | Tomorrow - PS2 - 11:15–12:30
    UNN                | 0.66 | 0.60 | 0.73 | Tomorrow - PS3 - 14:30–15:45
    SCATTERED_BASELINE | 0.64 | 1.00 | 0.47 |
    AUG                | 0.64 | 0.50 | 0.88 | Tomorrow - PS2 - 11:15–12:30
    SWAT-IV            | 0.62 | 0.55 | 0.71 | -
    UA-ZSA             | 0.61 | 0.58 | 0.64 | Tomorrow - PS3 - 14:30–15:45
    TITPI              | 0.60 | 0.45 | 0.89 | Tomorrow - PS3 - 14:30–15:45
    JHU1-13            | 0.58 | 0.45 | 0.82 | Tomorrow - PS2 - 11:15–12:30
    DFKI2              | 0.53 | 0.39 | 0.83 | Tomorrow - PS2 - 11:15–12:30
    WIT                | 0.52 | 0.36 | 0.93 | Tomorrow - PS3 - 14:30–15:45
    UC3M_13            | 0.51 | 0.35 | 0.95 | Tomorrow - PS3 - 14:30–15:45
    UBC-AS             | 0.45 | 0.30 | 0.91 | -
    JOINED_BASELINE    | 0.45 | 0.29 | 1.00 |
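
For readers unfamiliar with the three measures in this table, the sketch below computes purity, inverse purity and their weighted harmonic mean for one person name. It assumes non-overlapping clusters represented as sets of document ids; the official scorer may differ in details such as how overlapping clusters and discarded documents are handled.

```python
def purity(clusters, categories):
    """Fraction of documents that fall in the best-matching category
    of their cluster. Both arguments are lists of sets of document ids."""
    n = sum(len(c) for c in clusters)
    return sum(max(len(c & g) for g in categories) for c in clusters) / n

def f_measure(clusters, categories, alpha=0.5):
    """Weighted harmonic mean of purity and inverse purity."""
    p = purity(clusters, categories)
    ip = purity(categories, clusters)  # inverse purity swaps the roles
    return 1.0 / (alpha / p + (1 - alpha) / ip)

gold = [{1, 2, 3}, {4, 5}]          # documents grouped by actual person
system = [{1, 2}, {3, 4, 5}]        # a system's clustering
all_in_one = [{1, 2, 3, 4, 5}]      # everything in a single cluster

print(purity(system, gold), purity(gold, system), f_measure(system, gold))
print(purity(all_in_one, gold), purity(gold, all_in_one))
```

The single-cluster output shows why a baseline that joins everything reaches inverse purity 1.00 with low purity, and why one that scatters every document reaches purity 1.00 with low inverse purity, as in the baseline rows above.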

  • WePS

    • Systems
      – Basic strategies are similar: HTML analysis => pre-processing (POS, NE…) => feature selection (token, NE…) => calculate similarities => make clusters => produce output
      – Summary is available
    • Future
      – Second event (fall 08 or spring 09)
      – Attribute extraction (DOB, vocation…)

  • Human Relationship Extraction (Goto NHK)

    • Extract human relationships from movie/TV summaries
    • Display them as a network in a graph
    • NE for the domain
    • Attribute for NE
    • Clustering the relationships

  • English Analyzer (OAK system)

    [Diagram of the OAK system]
    • Modules: NE tagger, Chunker, Dependency Analyzer, POS tagger, Parser, Tokenizer, Function Tagger, Sentence Splitter, Regularizer
    • Data: Text; Sentence; Tokenized sentence; POS tagged sentence; Chunked sentence (low level constituent); NE tagged sentence (org, loc, person, time…); Dependency between chunks and constituents; Parse tree; Parse tree with function labels; Regularized structure (predicate-argument)
    • Output formats shown: Plain, PTB/tag, PTB/tree, Collins, CONLL, MUC, Triplet, Glarf, OAK


  • Extra slides

  • IR, clustering and multi-doc. summarization (Nobata et al. Gengo03) (Nobata et al. DUC03) (Murakami et al. Gengo04)

    • Large number of retrieved documents
    • Automatic clustering by sub-topics
      – 200 documents -> 5 clusters
    • Summarize each cluster (MDS)
    • Demo (http://apple.cs.nyu.edu/eliot)

  • Cross Lingual Question Answering (Sekine DARPA03) (Sekine TALIP04)

    • Developed in a month in DARPA’s Surprise Language experiment
    • Type in a question in English → translate the question to Hindi → find the answer in a Hindi newspaper → translate the answer to English
    • Demo (http://apple.cs.nyu.edu/clqa-eh)

  • Q Examiner

    [System diagram]
    • Components: Q Examiner, CLIR, Find answer, Translation (H->E), NE tagger, Numeric tagger, User I/F
    • Data: Q in English; Keywords in English; Type of expected answer; Keywords in Hindi; Relevant documents; Answer and context in Hindi; Answer and context in English
    • Resources: Bi-ling. dict., newspaper, NE tagged corpus, Num. knowledge

  • Huge Text data & knowledge acquisition (Yoshihira et al. Gengo04) (Ando et al. LREC04) (Sekine Gengo04)

    • Collecting a huge corpus (untagged)
      – WEB pages (Japanese 350G, English 300G)
      – Newspaper: 38 years, ~10G
      – Encyclopedia
    • Tool
      – Suffix array for the huge data
    • Application
      – Semantic knowledge acquisition by lexico-syntactic patterns
      – Automatic acquisition of NE instances
      – IE pattern and paraphrase acquisition
      – Collecting spelling variations
      – KWIC system (http://kwic.jp)

  • Survey on Multi-document summarization (Sekine et al. DUC03)

    • Purpose
      – Multi-doc. summarization & IE
    • Analyze summary/category of 100 doc. sets
    • Category of doc. set
      – 13 categories: (multi|single)-(person|org|loc|…), other
      – Consistent tagging by people
      – It should be useful for summarization strategy
    • Table format summary
      – 40-45%: table is natural, 36-38%: table is OK
      – Encouraging results for On-demand IE

  • Spelling variation in Japanese (Masuyama et al. Gengo04)

    • Serious problem in real-world systems
      – 30% of questions for an encyclopedia QA system
      – 40% of zero hits on dictionary look-up
    • A system to find the closest entry
      – 190,000 entries
      – Edit distance after narrowing down candidates
    • If the target is in the dictionary, 100% in top 3
    • Demo (http://languagecraft.jp)
    • Planning to construct a variation dictionary from WEB corpus
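
"Edit distance after narrowing down candidates" can be sketched as a two-stage lookup. The narrowing step here (keep entries that share at least one character bigram with the query) is an assumption made for illustration; the slide does not say how candidates are narrowed. Romanized strings are used only to keep the example readable.

```python
def edit_distance(a, b):
    """Levenshtein distance by dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]

def bigrams(s):
    return {s[i:i + 2] for i in range(len(s) - 1)}

def closest_entries(query, entries, top_k=3):
    """Return the top_k dictionary entries closest to the query."""
    q = bigrams(query)
    # Cheap narrowing (assumed): keep entries sharing a bigram with the query.
    candidates = [e for e in entries if bigrams(e) & q]
    return sorted(candidates, key=lambda e: (edit_distance(query, e), e))[:top_k]

dictionary = ["konpyuutaa", "konsaato", "deeta", "sakura"]
print(closest_entries("konpyuta", dictionary))
```

With a large dictionary the bigram test would be served from an inverted index rather than a linear scan, so that the quadratic edit-distance computation only runs on a small candidate set.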

  • Encyclopedia QA (Sekine Gengo03) (Sekine et al. Gengo05)

    • Users can ask a question in natural language and get the answer from an encyclopedia

    • The system is running as a live demo (http://www.japanknowledge.com)

    • Answers are entry words, or numerals in the explanation

    • Answers are prepared in advance by IE, and a pattern-based question matcher pinpoints the answer

  • 2nd Grade Language Test

    • Objectives
      – Put NLP technologies into a form that ordinary people can easily observe
      – Observe the problems of NLP technologies by lowering the level of the target materials
      – http://languagecraft.net/dennou
    • Results

    Section | # of Q | Known | Correct | Cor. in known (%) | Cor. in total (%)
    Kanji   | 76     | 74    | 69      | 93.2              | 90.8
    Word K. | 245    | 140   | 117     | 83.6              | 47.8
    Reading | 34     | 22    | 10      | 45.5              | 29.4
    Total   | 355    | 235   | 196     | 83.4              | 55.2


  • My Research Interest

    • “Things” are central
      – People, organization, location, event name, vocation, product name …
      – How they are related
      – How they act, are used or are evaluated
    • It is “knowledge”
      – How to create “knowledge”?
        • Extract from huge corpus
        • Fusion with the existing knowledge (e.g. encyclopedia)

  • Knowledge

    [Diagram built up over successive slides: a graph of entities linked by labeled relations. Relation labels shown: Belongs to, A part of, Located nearby, Located in, Has a lunch, Work in / commute in, Favorite team, features, Largest city, Now visiting, Friendship countries, A member of, Awarded, Saw it play, Pay money, Born in, Nationality, Live nearby, Same name, Cheer!, Boo!]

  • Knowledge Creation

    • Not only by hand
      – Cyc, EDR
    • Learn from corpus
      – Frequency, TFIDF
        • Pattern discovery, NE discovery
      – Distributional similarity
        • NE discovery, attribute for NE, paraphrase discovery
      – Lexico-syntactic pattern
        • Hypernym-hyponym discovery, IE pattern
      – Alignment
        • Paraphrase discovery

  • Topics

    • Names
      – Create categories (e.g. product name, event name)
      – Find entities (e.g. instances of laptop names)
    • Relation
      – Find relations (e.g. pros/cons of products, sentiment, opinion)
    • Categorize relation
      – Paraphrase, textual entailment (e.g. same kind of opinion)
    • Knowledge
      – Link and structure the extracted knowledge
      – Human interface