Text Analytics on 2 Million Documents: A Case Study
Plus, An Introduction to Keyword Extraction
Alyona Medelyan
Text Analytics World, Boston, October 3-4, 2012
What are these books about?
• “Because he could” by D. Morris, E. McGann
• “Still stripping after 25 years” by E. Burns
• “Glut” by A. Wright
Only metadata will tell…
What this talk will cover:
• Who am I & my relation to the topic
• What types of keyword extraction are out there
• How does keyword extraction work
• How accurate can keywords be
• How to analyze 2 million documents efficiently
Kea: nzdl.org/kea/
Maui (multi-purpose automatic topic indexing): maui-indexer.googlecode.com
My Background
2005-2009: PhD thesis on keyword extraction, “Human-competitive automatic topic indexing”
2010: co-organized the keyword extraction competition SemEval-2, Track 5, “Automatic keyphrase extraction from scientific articles”
2010-2012: leading the R&D of Pingar’s text analytics API (features: keyword & named entity extraction, summarization, etc.)
@zelandiya · medelyan.com
Document metadata
• Easy to extract: title, file type & location, creation & modification date, authors, publisher
• Difficult to extract: keywords & keyphrases, people & companies mentioned, suppliers & addresses mentioned
Text analytics derives such metadata, and findability is ensured with its help.
What can text analytics determine from text?
From raw text it can determine:
• sentiment
• genre
• entities: names, patterns, biochemical entities, …
• keywords, tags, categories & taxonomy terms (the focus of this presentation)
Types of keyword extraction (or topic indexing)
• Subject headings in libraries (controlled indexing)
  - general, with Library of Congress Subject Headings
  - domain-specific, in PubMed with MeSH
• Keyphrases in academic publications (free indexing)
• Tags in folksonomies (free indexing)
  - by authors on Technorati
  - by users on Del.icio.us
Free indexing (e.g. keywords, tags): inconsistent, no control, no semantics, ad hoc
Controlled indexing (e.g. LCSH, ACM, MeSH): restricted, centrally controlled, inflexible, not always available
Document → Candidates → Keywords
How keyword extraction works
NEJM usually has the highest impact factor of the journals of clinical medicine.
1. Extract phrases using the sliding window approach (ignore stopwords)
Candidates: NEJM; highest, highest impact, highest impact factor; impact, impact factor; …

Alternative approach:
a) Assign part-of-speech tags
b) Extract valid noun phrases (NPs)
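The sliding-window step can be sketched in a few lines of Python. The function name and the tiny stopword list below are illustrative only, not taken from Kea or Maui:

```python
# Sliding-window candidate extraction: every n-gram up to max_len words
# that neither starts nor ends with a stopword becomes a candidate.
STOPWORDS = {"the", "a", "an", "of", "and", "has", "usually"}

def extract_candidates(text, max_len=3):
    tokens = [t.strip(".,;:") for t in text.split()]
    phrases = set()
    for i in range(len(tokens)):
        for n in range(1, max_len + 1):
            window = tokens[i:i + n]
            if len(window) < n:
                break  # ran past the end of the document
            # stopwords may occur inside a phrase, but not at its edges
            if window[0].lower() in STOPWORDS or window[-1].lower() in STOPWORDS:
                continue
            phrases.add(" ".join(window))
    return phrases

cands = extract_candidates(
    "NEJM usually has the highest impact factor of the journals of clinical medicine.")
# yields e.g. "impact factor", "highest impact factor", "journals of clinical"
```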
Document → Candidates → Keywords
NEJM usually has the highest impact factor of the journals of clinical medicine.
2. Normalize phrases (case folding, stemming, etc.)

Candidate              | Free indexing (normalized) | Controlled indexing
NEJM                   | nejm                       | New England J of Med
highest                | high                       | -
highest impact factor  | high impact factor         | -
impact                 | impact                     | -
impact factor          | impact factor              | Impact Factor
journals               | journal                    | Journal
journals of clinical   | journal of clinic          | -
clinical               | clinic                     | Clinic
clinical medicine      | clinic medic               | Medicine
medicine               | medic                      | Medicine
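A sketch of the normalization step, assuming case folding plus a crude suffix-stripping stemmer. The suffix list is a toy stand-in for a real stemmer such as Porter's:

```python
# Normalize a candidate phrase: case folding plus naive suffix stripping.
# The suffix list is purely illustrative, not a real stemming algorithm.
def crude_stem(word):
    for suffix in ("ine", "est", "al", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def normalize(phrase):
    return " ".join(crude_stem(w) for w in phrase.lower().split())

print(normalize("Highest Impact Factor"))  # high impact factor
print(normalize("Clinical Medicine"))      # clinic medic
```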
Document → Candidates → Properties → Keywords
1. Frequency: number of occurrences (incl. synonyms)
2. Position: beginning/end of a document, title, headers
3. Phrase length: longer means more specific
4. Similarity: semantic relatedness to other candidates
5. Corpus statistics: how prominent in this particular text compared to a reference corpus (e.g. TF×IDF)
6. Popularity: how often people select this candidate
7. Part of speech pattern: some patterns are more common
…
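The first three properties are cheap to compute directly. A minimal sketch (the feature names are mine, not Kea's or Maui's):

```python
# Compute a few candidate properties: frequency, relative position of the
# first occurrence (0.0 = start of document), and phrase length in words.
def features(candidate, doc_text):
    text, cand = doc_text.lower(), candidate.lower()
    return {
        "frequency": text.count(cand),
        "first_position": text.find(cand) / len(text),
        "phrase_length": len(cand.split()),
    }

doc = "NEJM has the highest impact factor. The impact factor matters."
f = features("impact factor", doc)  # frequency 2, phrase_length 2
```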
Document → Candidates → Properties → Scoring → Keywords

Heuristics
A formula that combines the most powerful features
• requires careful crafting
• performs equally well, or somewhat worse, across various domains
Supervised machine learning
Train a model from manually indexed documents
• requires training data
• performs very well on docs that are similar to the training data, but poorly on dissimilar ones
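A toy version of the heuristic-formula approach, with weights invented purely for illustration (not from any published system): frequent candidates that occur early in the document, with a small bonus for specificity, score highest.

```python
# Toy heuristic scorer combining three candidate features.
# The weights are invented for illustration only.
def score(frequency, first_position, phrase_length):
    return frequency * (1.0 - first_position) * (1.0 + 0.1 * phrase_length)

# (frequency, first_position, phrase_length) per candidate
candidates = {
    "impact factor": (2, 0.4, 2),
    "journals":      (1, 0.6, 1),
}
ranked = sorted(candidates, key=lambda c: score(*candidates[c]), reverse=True)
print(ranked)  # ['impact factor', 'journals']
```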
How accurate is keyword extraction?
• It’s subjective…
• But: the higher the indexing consistency, the better the search effectiveness (findability)
A = set of keyphrases 1
B = set of keyphrases 2
C = set of keyphrases in common

Consistency (Rolling) = 2C / (A + B)
Consistency (Hopper) = C / (A + B − C)
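Both measures are straightforward to compute over two indexers' keyword sets; note that Hopper's measure is exactly the Jaccard coefficient:

```python
# Inter-indexer consistency between two keyword sets a and b.
def rolling(a, b):
    c = len(a & b)                   # keyphrases in common
    return 2 * c / (len(a) + len(b))

def hopper(a, b):                    # equals the Jaccard coefficient
    c = len(a & b)
    return c / (len(a) + len(b) - c)

a = {"diet", "taxes", "overweight", "food policies"}
b = {"diet", "taxes", "public health", "nutrition policies"}
print(rolling(a, b))  # 0.5
```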
Professional indexers’ keywords*

[Term cloud of Agrovoc terms: overweight, nutrition policies, food consumption, price policies, feeding habits, nutritional requirements, food intake, diet, fiscal policies, prices, taxes, developing countries, overeating, human nutrition, nutrition surveillance, nutritional physiology, nutrition status, food policies, foods, nutrient excesses, dietary guidelines, meal patterns, nutrition programs, public health, disease control, nutritional disorders, globalization, weight reduction, urbanization, developed countries, price formation, energy value, direct taxation, regulations]

* 6 professional FAO indexers assigned terms from the Agrovoc thesaurus to the same document, entitled “The global obesity problem”
Comparison of 2 indexers

[Term cloud: the same Agrovoc terms, marked by which of the two indexers (Indexer 1, Indexer 2) chose them; lines show Agrovoc relations between terms]
Comparison of 6 indexers & Kea

[Term cloud: the same Agrovoc terms, marked by how many of the 6 indexers (1-6) chose each, with the Kea algorithm’s picks highlighted; further terms appear: taxes, body weight, saturated fat, price fixing, policies, controlled prices]

Comparison of CS students* & Maui

* 15 teams of 2 students each assigned keywords to the same document, entitled “A safe, efficient regression test selection technique”
Human vs. algorithm consistency

6 professional indexers vs. Kea on 30 agricultural documents & the Agrovoc thesaurus:
Method        | Min | Avg | Max
Professionals | 26  | 39  | 47
Kea           | 24  | 32  | 38

15 teams of 2 CS students vs. Maui on 20 CS documents & Wikipedia vocabulary:
Method   | Min | Avg | Max
Students | 21  | 31  | 37
Maui     | 24  | 32  | 36

CiteULike taggers vs. Maui (each tagger had ≥ 2 co-taggers) & free indexing:
Dataset                | With other taggers | With Maui
330 taggers & 180 docs | 19                 | 24
35 taggers & 140 docs  | 38                 | 35
Text Analytics on 2 Million Documents: A Case Study
Collaboration with Gene Golovchinsky (fxpal.com/?p=gene)
The dataset
• CiteSeer: 1.7 million scientific publications, 110 GB
• Twitter: 490 million tweets per week, 84 GB
• Wikipedia: 3.6 million articles, 13 GB
• Britannica: 0.65 million articles, 0.3 GB
• ICWSM 2011: news, blogs, forums, etc., 2.1 TB (compressed!)

Sources: slideshare.net/raffikrikorian/twitter-by-the-numbers, en.wikipedia.org/wiki/Wikipedia:Size_comparisons
The task
1. Extract all phrases that appear in search results
2. Weigh and suggest the best phrases for query refinement

Gene’s collaborative search system: Querium
Step 1: Get time estimates
A. Take a subset, e.g. 100 documents
B. Run on various machines / settings
C. Extrapolate to the entire dataset, e.g. 1.7M docs

Our example:
• Standard laptop (4 cores, 8 GB RAM): 30 days
• Similar Rackspace VM: 46 days
• Threading reduces time: 24 days
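Step 1 is plain arithmetic. Assuming, say, that a 100-document sample took about 2.5 minutes on the laptop, the extrapolation reproduces a figure of roughly 30 days:

```python
# Extrapolate total processing time from a timed sample run.
def extrapolate_days(sample_docs, sample_minutes, total_docs):
    total_minutes = sample_minutes / sample_docs * total_docs
    return total_minutes / 60 / 24

# hypothetical sample timing: 100 docs in 2.54 minutes
print(round(extrapolate_days(100, 2.54, 1_700_000)))  # 30
```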
Step 2: Look into your data
Understand the nature of your data: look at samples, compute statistics. Speed up by removing anomalies & targeting the text analytics.

Our example:
• 30% of docs exceed 50 KB (some ≈ 600 KB)
• The most important phrases appear in the title, abstract, introduction and conclusions
• Only process the top 30% and last 20% of each document: this reduces the time by 57%!
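The cropping itself is nearly a one-liner; this sketch keeps the first 30% and the last 20% of the raw text, as described above:

```python
# Crop a document to its head (first 30%) and tail (last 20%).
def crop(text, head=0.30, tail=0.20):
    n = len(text)
    return text[: int(n * head)] + text[int(n * (1 - tail)):]

doc = "A" * 70 + "B" * 10 + "C" * 20   # 100-character toy document
cropped = crop(doc)                     # keeps 30 A's and 20 C's
```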
Validate: can we crop our documents?

Top N keywords in original doc | Found in the cropped doc
10                             | 91%
50                             | 80%
100                            | 75%
All                            | 64%
Top 20 keywords from* the original document: ontology, knowledge base, knowledge representation, Semantic Web, WordNet, knowledge engineering, predicate logic, artificial intelligence, semantic networks, natural language, first-order logic, ontology engineering, lexicon, conceptual graphs, higher-order logic, natural language processing, design rationale, block diagram

Top 20 keywords from the cropped document: ontology, knowledge base, knowledge engineering, knowledge representation, WordNet, predicate logic, artificial intelligence, ontology engineering, semantic networks, Semantic Web, first-order logic, block diagram, dynamic systems, higher-order logic, conceptual graphs, modeling & simulation, universe of discourse, bond graph, lexicon

* “Toward principles for the design of ontologies used for knowledge sharing”, T. R. Gruber (1993)
Step 3: Go cloud
Don’t be afraid to bring out the big guns:
• Large Elastic Compute instance: 1000 docs × 4 threads = 30 min
• High-CPU Extra Large (8 virtual cores): 1000 docs × 24 threads = 6 min

Also: increase the number of machines
• 4 machines = 4 times faster, i.e. 50 instead of 200 hours (or 1 weekend!)
How long would a human need to extract keywords from 1.7M docs?

Min per doc | Minutes   | Hours  | Days*  | Years**
1           | 1,700,000 | 28,333 | 3,542  | 14
2           | 3,400,000 | 56,666 | 7,083  | 28
3           | 5,100,000 | 85,000 | 10,625 | 42

* Taking into account 8 h per working day
** Assuming 250 working days per year (no holidays, no sick days)
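The first row of the table is easy to verify:

```python
# Sanity check: 1.7M docs at 1 minute each, 8-hour working days,
# 250 working days per year.
minutes = 1_700_000
hours = minutes / 60    # ≈ 28,333
days = hours / 8        # ≈ 3,542
years = days / 250      # ≈ 14
print(round(hours), round(days), round(years))  # 28333 3542 14
```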
http://www.flickr.com/photos/mararie/2663711551/
CiteSeer (1.7 million scientific publications, 110 GB): can be done in a weekend.
Don’t do it manually!
1. Get time estimates
2. Look into your data
3. Go cloud
Document → Candidates → Properties → Scoring → Keywords
To estimate quality, take a sample and compute inter-indexer consistency between several people.
Keyword extraction: medelyan.com/files/phd2009.pdf
CiteSeer study: pingar.com/technical-blog/
Pingar API: apidemo.pingar.com