Using Wikipedia as a reference for extracting semantic information from a text
Andrea Prato & Marco Ronchetti
Università di Trento, Italy
Explicit Semantic Analysis
Gabrilovich & Markovitch, 2007
Throw away:
- Stopwords
- Fragment pages (<100 words)
- Suffixes (stemming)
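The three filters above can be sketched in a few lines. This is only an illustration under stated assumptions: the stopword list is a tiny made-up sample, `crude_stem` is a toy suffix stripper standing in for a real stemmer, and the 100-word threshold is assumed to apply to the raw word count before stopword removal.

```python
import re
from typing import List, Optional

# Tiny illustrative stopword list (assumption); a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "in", "and", "is", "to", "for"}
MIN_WORDS = 100  # pages with fewer words are treated as fragments


def crude_stem(word: str) -> str:
    """Strip a few common English suffixes (toy stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep alphabetic runs only."""
    return re.findall(r"[a-z]+", text.lower())


def preprocess_page(text: str) -> Optional[List[str]]:
    """Return the cleaned term list for a page, or None for a fragment page."""
    words = tokenize(text)
    if len(words) < MIN_WORDS:
        return None  # fragment page: thrown away
    return [crude_stem(w) for w in words if w not in STOPWORDS]
```

Pages for which `preprocess_page` returns `None` would simply be left out of the concept index.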
A sample (ESA)
The development of T-cell leukaemia following the otherwise successful treatment of three patients with X-linked severe combined immune deficiency (X-SCID) in gene-therapy trials using haematopoietic stem cells has led to a re-evaluation of this approach. Using a mouse model for gene therapy of X-SCID, we find that the corrective therapeutic gene IL2RG itself can act as a contributor to the genesis of T-cell lymphomas, with one-third of animals being affected. Gene-therapy trials for X-SCID, which have been based on the assumption that IL2RG is minimally oncogenic, may therefore pose some risk to patients.
- Leukemia
- Severe combined immunodeficiency
- Cancer
- Non-Hodgkin lymphoma
- AIDS
- ICD-10 Chapter II: Neoplasms;
- Chapter III: Diseases of the blood and blood-forming organs, and certain disorders involving the immune mechanism
- Bone marrow transplant
- Immunosuppressive drug
- Acute lymphoblastic leukemia
- Multiple sclerosis
A sample (ESA)
Being so tightly packed, Venice doesn't make an ideal place to come to practise your favourite sport, although you'll get a decent workout just walking around and up and down bridges! If you've got any energy left for some extra exercise, try a spot of swimming (although pools are rare) or even a jog. Venice is a bit of a desert for swimmers. You can go in off the Lido (if you're game) or at one of Venice's two public swimming pools (handily, they close in summer).
Lonely Planet Tourist Guide
1 - Glossary_of_cue_sports_terms
2 - Swimming
3 - Ian_Thorpe
4 - NCAA_football_bowl_games,2005-06
5 - Swimming_machine
6 - American_football_strategy
7 - Contract_bridge_glossary
8 - Olympic_Games
9 - Pingu_episodes_series_6
10 - Venice
…
15 - Corruption_in_Ghana
…
27 - Legislative_system_of_the_Peopleʼs_Republic_of_China
Clustering
Wikipedia is hyperlinked
Swimming is clustered with Olympic Games
Throw away:
- Large aggregators
- Category links
- Numbers
- Pages with more than N links (N=100)
After clustering:
Only 3 clusters with cardinality larger than 1. The first cluster, with cardinality 21, was automatically named Swimming. The second and the third both have cardinality equal to 2, and they are named Training and Venice-bucentaur.
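The slides do not say which clustering algorithm is used, so the following is only one plausible sketch: concepts are grouped into connected components of the hyperlink graph, after dropping pages with more than N links. The function name, the `links` dictionary and the choice to drop heavily linked pages entirely are all assumptions made for illustration.

```python
from collections import deque

MAX_LINKS = 100  # N from the slide: pages with more links are ignored


def cluster_concepts(concepts, links, max_links=MAX_LINKS):
    """Group concepts into connected components of the undirected link graph.

    concepts: list of page titles returned for a text
    links: dict mapping a page title to the set of titles it links to
    """
    # Drop heavily linked pages, which would otherwise glue unrelated topics together.
    kept = [c for c in concepts if len(links.get(c, ())) <= max_links]
    kept_set = set(kept)

    # Build an undirected adjacency map restricted to the kept concepts.
    adjacency = {c: set() for c in kept}
    for c in kept:
        for target in links.get(c, ()):
            if target in kept_set and target != c:
                adjacency[c].add(target)
                adjacency[target].add(c)

    # Breadth-first search to collect each connected component.
    clusters, seen = [], set()
    for start in kept:
        if start in seen:
            continue
        seen.add(start)
        component, queue = [], deque([start])
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbour in sorted(adjacency[node]):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        clusters.append(component)

    # Largest clusters first; singletons end up at the tail.
    return sorted(clusters, key=len, reverse=True)
```

Under this reading, a cluster's cardinality is simply the length of its component, and concepts that link to nothing else in the result list remain singletons.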
Validation: Turing test
Classification
Text Classification
Classification
Which one is machine-generated?
Outcome
20 texts of length ranging between 60 and 200 words. Texts were collected from various sources like newspaper articles, textbooks, random web pages, MSN Encarta.
Further improvements
Using only nouns
Using a POS tagger to identify syntactic roles in the document to be classified
Keep only nouns (throw away the rest)
No degradation in the results!
Define Multiwords
Lexical multiword identification approach: the following generative pattern is considered:
((Adj | Noun)+ | ((Adj | Noun)* (Noun-Prep)?) (Adj | Noun)*) Noun
+: one or more; *: zero or more; ?: zero or one; |: or
Validation: a candidate multiword is valid if there is a Wikipedia entry related to it.
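One way to apply the pattern is to encode each token's part of speech as a single character and run an ordinary regular expression over the encoded string. The sketch below assumes the text has already been tagged with the labels `Adj`, `Noun` and `Prep` (any other label is treated as a break), reads "Noun-Prep" as a noun followed by a preposition, keeps only candidates of two or more words, and approximates "a Wikipedia entry related to it" by an exact lookup in a hypothetical set of lowercase titles.

```python
import re

# One character per token: A = adjective, N = noun, P = preposition, x = anything else.
TAG_CODES = {"Adj": "A", "Noun": "N", "Prep": "P"}

# ((Adj|Noun)+ | ((Adj|Noun)* (Noun Prep)?) (Adj|Noun)*) Noun
MULTIWORD = re.compile(r"(?:[AN]+|[AN]*(?:NP)?[AN]*)N")


def candidate_multiwords(tagged_tokens):
    """Return candidate multiwords from a list of (word, tag) pairs."""
    encoded = "".join(TAG_CODES.get(tag, "x") for _, tag in tagged_tokens)
    candidates = []
    for match in MULTIWORD.finditer(encoded):
        # The pattern also matches a lone noun; keep only real multiwords.
        if match.end() - match.start() >= 2:
            words = [w for w, _ in tagged_tokens[match.start():match.end()]]
            candidates.append(" ".join(words))
    return candidates


def valid_multiwords(tagged_tokens, wikipedia_titles):
    """Keep candidates that appear in the (hypothetical) set of lowercase titles."""
    return [c for c in candidate_multiwords(tagged_tokens)
            if c.lower() in wikipedia_titles]


tokens = [("gene", "Noun"), ("therapy", "Noun"), ("is", "Verb"),
          ("a", "Det"), ("promising", "Adj"), ("approach", "Noun")]
print(valid_multiwords(tokens, {"gene therapy"}))
```

Because `finditer` returns non-overlapping leftmost matches, a long candidate that fails validation is not retried on its sub-spans; a fuller implementation might want to do that.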
Text with multiwords:
- Keep all nouns
- Keep all adjectives that are part of a multiword
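The two rules above amount to a simple token filter. The sketch below assumes tokens arrive as (word, tag) pairs and that multiwords have already been located as (start, end) token index ranges with an exclusive end; both conventions, and the function name, are illustrative choices rather than anything stated on the slides.

```python
def filter_tokens(tagged_tokens, multiword_spans):
    """Keep every noun, plus adjectives that fall inside a multiword span.

    tagged_tokens: list of (word, tag) pairs, tags such as "Noun" or "Adj"
    multiword_spans: list of (start, end) token index pairs, end exclusive
    """
    in_multiword = set()
    for start, end in multiword_spans:
        in_multiword.update(range(start, end))

    kept = []
    for i, (word, tag) in enumerate(tagged_tokens):
        if tag == "Noun" or (tag == "Adj" and i in in_multiword):
            kept.append(word)
    return kept


tokens = [("successful", "Adj"), ("treatment", "Noun"), ("of", "Prep"),
          ("haematopoietic", "Adj"), ("stem", "Noun"), ("cells", "Noun")]
# Only tokens 3..5 form a multiword, so "successful" is dropped.
print(filter_tokens(tokens, [(3, 6)]))
```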
Evaluation (human inspection of results)
100 samples (50 technical, 50 generic)
- Multiword improved significantly: 7 (5 technical)
- It improved marginally: 13
- It worsened marginally: 6
Overall improvement: 10% on technical text
Work in progress
Concept-mediated mapping among documents
How similar are two docs?
Doc 1 Doc 3
Concept 1
Concept 2 Concept 2
Concept 3 Concept 3
Concept 4
Jaccard Index
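The Jaccard index of two concept sets is the size of their intersection divided by the size of their union. A minimal sketch follows; the concept names are placeholders, the assignment of concepts to the two documents is a hypothetical reading of the diagram, and returning 0.0 for two empty sets is a convention chosen here.

```python
def jaccard_index(concepts_a, concepts_b):
    """Return |A ∩ B| / |A ∪ B| for two collections of concept titles."""
    a, b = set(concepts_a), set(concepts_b)
    if not a and not b:
        return 0.0  # convention for two empty documents (assumption)
    return len(a & b) / len(a | b)


doc_1 = {"Concept 1", "Concept 2", "Concept 3"}
doc_3 = {"Concept 2", "Concept 3", "Concept 4"}
print(jaccard_index(doc_1, doc_3))  # 2 shared concepts out of 4 distinct ones: 0.5
```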
Syllabi comparison
Interlinks
Mapping documents in different languages
Deploying Wikipedia Interlinks
Doc 1 Doc 3
Concept 1
Concept 2 Concept 2
Concept 3 Concept 3
Concept 4
Jaccard Index
INTERLINKS
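For documents in different languages, one reading of the diagram is that concept titles are first mapped to a common language through Wikipedia's interlanguage links and then compared with the Jaccard index. The sketch below assumes the interlinks are available as a plain dictionary; the titles in it are toy data, not a dump of real interlinks, and concepts with no interlink are simply dropped.

```python
def to_pivot_language(concepts, interlinks):
    """Translate concept titles via interlinks; titles without a link are dropped."""
    return {interlinks[c] for c in concepts if c in interlinks}


def cross_language_similarity(concepts_src, concepts_pivot, interlinks):
    """Jaccard index between a source-language and a pivot-language concept set."""
    mapped = to_pivot_language(concepts_src, interlinks)
    target = set(concepts_pivot)
    union = mapped | target
    return len(mapped & target) / len(union) if union else 0.0


# Toy interlink table (illustrative only).
interlinks_it_en = {
    "Nuoto": "Swimming",
    "Venezia": "Venice",
    "Giochi olimpici": "Olympic Games",
}
italian_doc = {"Nuoto", "Venezia", "Giochi olimpici"}
english_doc = {"Swimming", "Venice", "Ian Thorpe"}
print(cross_language_similarity(italian_doc, english_doc, interlinks_it_en))
```

Dropping unlinked concepts keeps the sketch short, but it lowers the score for documents whose concepts exist in only one language edition.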