Text Mining - Danushka Bollegaladanushka.net/lect/dm/text_mining_1.pdfprocess of extracting...

48
Text Mining Department of Computer Science University of Liverpool February 27, 2019 comp 527: Text Mining February 27, 2019 1 / 49

Transcript of Text Mining - Danushka Bollegaladanushka.net/lect/dm/text_mining_1.pdfprocess of extracting...

Page 1: Text Mining - Danushka Bollegaladanushka.net/lect/dm/text_mining_1.pdfprocess of extracting interesting and non-trivial patterns or knowledge from unstructured text documents [Tan

Text Mining

Department of Computer Science

University of Liverpool

February 27, 2019

comp 527: Text Mining February 27, 2019 1 / 49

Page 2: Text Mining - Danushka Bollegaladanushka.net/lect/dm/text_mining_1.pdfprocess of extracting interesting and non-trivial patterns or knowledge from unstructured text documents [Tan

Overview

1 IntroductionSome examplesDefinition and ChallengesSteps in text mining

2 PreprocessingTokenisationStemmingStopword RemovalSentence Segmentation

3 Part-Of-Speech (POS) TaggingRule-based MethodsProbabilistic Models

4 References

comp 527: Text Mining February 27, 2019 2 / 49

Page 3: Text Mining - Danushka Bollegaladanushka.net/lect/dm/text_mining_1.pdfprocess of extracting interesting and non-trivial patterns or knowledge from unstructured text documents [Tan

Simple Question: Why do dogs howl at the moon?

comp 527: Text Mining February 27, 2019 3 / 49

Page 4: Text Mining - Danushka Bollegaladanushka.net/lect/dm/text_mining_1.pdfprocess of extracting interesting and non-trivial patterns or knowledge from unstructured text documents [Tan

Text Mining Around Us - Sentiment Analysis

source: https://www.jellyfish.co.uk/news-and-views/update-eu-referendum-campaigns-seem-to-be-causing-little-impact

comp 527: Text Mining February 27, 2019 4 / 49

Page 5: Text Mining - Danushka Bollegaladanushka.net/lect/dm/text_mining_1.pdfprocess of extracting interesting and non-trivial patterns or knowledge from unstructured text documents [Tan

Text Mining Around Us - Opinion Mining

source: https://phys.org/news/2018-04-brexit-debate-twitter-driven-economic.html

comp 527: Text Mining February 27, 2019 5 / 49

Page 6: Text Mining - Danushka Bollegaladanushka.net/lect/dm/text_mining_1.pdfprocess of extracting interesting and non-trivial patterns or knowledge from unstructured text documents [Tan

Text Mining Around Us - Movie Recommendation Systems

comp 527: Text Mining February 27, 2019 6 / 49

Page 7: Text Mining - Danushka Bollegaladanushka.net/lect/dm/text_mining_1.pdfprocess of extracting interesting and non-trivial patterns or knowledge from unstructured text documents [Tan

Text Mining Around Us - Document Summarization

comp 527: Text Mining February 27, 2019 7 / 49

Page 8: Text Mining - Danushka Bollegaladanushka.net/lect/dm/text_mining_1.pdfprocess of extracting interesting and non-trivial patterns or knowledge from unstructured text documents [Tan

Text Mining - Definition and Challenges

Text miningprocess of extracting interesting and non-trivial patterns or knowledgefrom unstructured text documents [Tan et al., 1999].a.k .a text data mining [Hearst, 1997],knowledge discovery from textual databases[Feldman and Dagan, 1995]text analytics - application to solve business problems

comp 527: Text Mining February 27, 2019 8 / 49

Page 9: Text Mining - Danushka Bollegaladanushka.net/lect/dm/text_mining_1.pdfprocess of extracting interesting and non-trivial patterns or knowledge from unstructured text documents [Tan

Text Mining - Challenges

Unorganized form of datasemi-structured or unstructured

Deriving semantics from contentambiguities at di↵erent levels - lexical, syntactic, semantic andpragmaticText has multiple interpretationsTeacher Strikes Idle KidsViolinist linked to JAL crash blossomsWord sense ambiguityRed Tape Holds Up New Bridges

Non-standard Englishlanguage in TweetsSOO PROUD of what U accomp.

comp 527: Text Mining February 27, 2019 9 / 49

Page 10: Text Mining - Danushka Bollegaladanushka.net/lect/dm/text_mining_1.pdfprocess of extracting interesting and non-trivial patterns or knowledge from unstructured text documents [Tan

Text Mining - Challenges

New Words850 new words added dictionary at Merriam-Webster.com in 2018CryptocurrencyChiweenie - a cross between a Chihuahua and a dachshundDumpster fire - a disastrous event

Idiomsdark horse; get cold feet

Combining information from multi-lingual texts

Integrate domain knowledge

comp 527: Text Mining February 27, 2019 10 / 49

Page 11: Text Mining - Danushka Bollegaladanushka.net/lect/dm/text_mining_1.pdfprocess of extracting interesting and non-trivial patterns or knowledge from unstructured text documents [Tan

Steps in Text Mining

source: http://openminted.eu/text-mining-101/comp 527: Text Mining February 27, 2019 11 / 49

Page 12: Text Mining - Danushka Bollegaladanushka.net/lect/dm/text_mining_1.pdfprocess of extracting interesting and non-trivial patterns or knowledge from unstructured text documents [Tan

Text Mining - Preprocessing Steps

Tokenisation

Stemming

Stopword Removal

Sentence Segmentation

comp 527: Text Mining February 27, 2019 12 / 49

Page 13: Text Mining - Danushka Bollegaladanushka.net/lect/dm/text_mining_1.pdfprocess of extracting interesting and non-trivial patterns or knowledge from unstructured text documents [Tan

Tokenisation

Process of splitting text into words

What is a word?string of contiguous alphanumeric characters with space on either side;may include hyphens and apostrophes, but no other punctuationmarks [Kucera and Francis, 1967].

Useful clue - space or tab (English)

comp 527: Text Mining February 27, 2019 13 / 49

Page 14: Text Mining - Danushka Bollegaladanushka.net/lect/dm/text_mining_1.pdfprocess of extracting interesting and non-trivial patterns or knowledge from unstructured text documents [Tan

Tokenisation - Problems

Periodsusually helps if we remove thembut useful to retain in certain cases such as $22.50; Ed.,

hypenationuseful to retain in some cases e.g., state-of-the-artbetter to remove in other cases e.g., gold-import ban, 50-year-old

Single apostrophesuseful to remove them e.g., is’nt, didn’t

space may not be a useful clue all the time

sometimes we want to use words separated by space as ‘single’ word

For example:San FranciscoUniversity of LiverpoolDanushka Bollegala

comp 527: Text Mining February 27, 2019 14 / 49

Page 15: Text Mining - Danushka Bollegaladanushka.net/lect/dm/text_mining_1.pdfprocess of extracting interesting and non-trivial patterns or knowledge from unstructured text documents [Tan

Regular Expressions for Tokenisation

Regular Expressions Cheatsheet

comp 527: Text Mining February 27, 2019 15 / 49

Page 16: Text Mining - Danushka Bollegaladanushka.net/lect/dm/text_mining_1.pdfprocess of extracting interesting and non-trivial patterns or knowledge from unstructured text documents [Tan

Regular Expressions for Tokenisation

comp 527: Text Mining February 27, 2019 16 / 49

Page 17: Text Mining - Danushka Bollegaladanushka.net/lect/dm/text_mining_1.pdfprocess of extracting interesting and non-trivial patterns or knowledge from unstructured text documents [Tan

Stanford Parser for Tokenisation

comp 527: Text Mining February 27, 2019 17 / 49

Page 18: Text Mining - Danushka Bollegaladanushka.net/lect/dm/text_mining_1.pdfprocess of extracting interesting and non-trivial patterns or knowledge from unstructured text documents [Tan

Tokenisation

Tokenisation turns out to be more di�cult than one expects

No single solution works well

Decide what counts as a token depending on the application domain

comp 527: Text Mining February 27, 2019 18 / 49

Page 19: Text Mining - Danushka Bollegaladanushka.net/lect/dm/text_mining_1.pdfprocess of extracting interesting and non-trivial patterns or knowledge from unstructured text documents [Tan

spaCy (https://spacy.io/)

spaCy - a relatively new package for “Industrial strength NLP inPython”.

Developed by Matt Honnibal at Explosion AI

Designed with applied data scientist in mind

spaCy supports:TokenisationLemmatisationPart-of-speech taggingEntity recognitionDependency parsingSentence recognitionWord-to-vector transformations

comp 527: Text Mining February 27, 2019 19 / 49

Page 20: Text Mining - Danushka Bollegaladanushka.net/lect/dm/text_mining_1.pdfprocess of extracting interesting and non-trivial patterns or knowledge from unstructured text documents [Tan

spaCy - Feature Comparison

source: https://spacy.io/usage/facts-figurescomp 527: Text Mining February 27, 2019 20 / 49

Page 21: Text Mining - Danushka Bollegaladanushka.net/lect/dm/text_mining_1.pdfprocess of extracting interesting and non-trivial patterns or knowledge from unstructured text documents [Tan

spaCy - Benchmarks

source: https://spacy.io/usage/facts-figures

comp 527: Text Mining February 27, 2019 21 / 49

Page 22: Text Mining - Danushka Bollegaladanushka.net/lect/dm/text_mining_1.pdfprocess of extracting interesting and non-trivial patterns or knowledge from unstructured text documents [Tan

spaCy - Detailed Speed Comparison

source: https://spacy.io/usage/facts-figures

comp 527: Text Mining February 27, 2019 22 / 49

Page 23: Text Mining - Danushka Bollegaladanushka.net/lect/dm/text_mining_1.pdfprocess of extracting interesting and non-trivial patterns or knowledge from unstructured text documents [Tan

Tokenization in spaCy

Tokenizes text into words, puntuations and so on.

Applies rules specific to each language

Step 1: Split raw text based on whitespace characters (text.split(‘ ’))

Step 2: Processes each substring from left to right and performs twochecks:

Does the substring match a tokenizer exception rulee.g., “don’t” ==> no whitespace ==> but split into two tokens “do”and “nt“U.K.” ==> remain as one token

comp 527: Text Mining February 27, 2019 23 / 49

Page 24: Text Mining - Danushka Bollegaladanushka.net/lect/dm/text_mining_1.pdfprocess of extracting interesting and non-trivial patterns or knowledge from unstructured text documents [Tan

Tokenization in spaCy

source: https://spacy.io/usage/spacy-101

comp 527: Text Mining February 27, 2019 24 / 49

Page 25: Text Mining - Danushka Bollegaladanushka.net/lect/dm/text_mining_1.pdfprocess of extracting interesting and non-trivial patterns or knowledge from unstructured text documents [Tan

Tokenization in spaCy

source: https://spacy.io/usage/spacy-101

comp 527: Text Mining February 27, 2019 25 / 49

Page 26: Text Mining - Danushka Bollegaladanushka.net/lect/dm/text_mining_1.pdfprocess of extracting interesting and non-trivial patterns or knowledge from unstructured text documents [Tan

Tokenization in spaCy

source: https://spacy.io/usage/spacy-101comp 527: Text Mining February 27, 2019 26 / 49

Page 27: Text Mining - Danushka Bollegaladanushka.net/lect/dm/text_mining_1.pdfprocess of extracting interesting and non-trivial patterns or knowledge from unstructured text documents [Tan

Stemming

Removal of inflectional ending from words (strip o↵ any a�xes)connections, connecting, connect, connected ! connect

ProblemsCan conflate semantically di↵erent words

Gallery and gall may both be stemmed to gall

Lemmatization: a further step to ensure that the resulting form is aword present in a dictionary

comp 527: Text Mining February 27, 2019 27 / 49

Page 28: Text Mining - Danushka Bollegaladanushka.net/lect/dm/text_mining_1.pdfprocess of extracting interesting and non-trivial patterns or knowledge from unstructured text documents [Tan

Regular Expressions for Stemming

note that the star operator is “greedy”

the .* part of expression tries to consume as much as the input aspossible

for non-greedy version of the star operator = *?

comp 527: Text Mining February 27, 2019 28 / 49

Page 29: Text Mining - Danushka Bollegaladanushka.net/lect/dm/text_mining_1.pdfprocess of extracting interesting and non-trivial patterns or knowledge from unstructured text documents [Tan

Regular Expressions for Stemming

comp 527: Text Mining February 27, 2019 29 / 49

Page 30: Text Mining - Danushka Bollegaladanushka.net/lect/dm/text_mining_1.pdfprocess of extracting interesting and non-trivial patterns or knowledge from unstructured text documents [Tan

Regular Expressions for Stemming

ProblemsRE removes ‘s’ from ‘ponds’, but also from ‘is’ and ‘basis’produces some non-words like ‘distribut’, ‘deriv’

comp 527: Text Mining February 27, 2019 30 / 49

Page 31: Text Mining - Danushka Bollegaladanushka.net/lect/dm/text_mining_1.pdfprocess of extracting interesting and non-trivial patterns or knowledge from unstructured text documents [Tan

NLTK Stemmers

NLTK provides several o↵-the-shelf stemmers

Porter and Lancaster stemmers have their own rules for strippinga�xes

comp 527: Text Mining February 27, 2019 31 / 49

Page 32: Text Mining - Danushka Bollegaladanushka.net/lect/dm/text_mining_1.pdfprocess of extracting interesting and non-trivial patterns or knowledge from unstructured text documents [Tan

Is stemming useful?

Provides some improvement for IR performance (especially for smallerdocuments).

Very useful for some queries, but on an average does not help much.

Since improvement is very minimal, often IR engines does not usestemming.

comp 527: Text Mining February 27, 2019 32 / 49

Page 33: Text Mining - Danushka Bollegaladanushka.net/lect/dm/text_mining_1.pdfprocess of extracting interesting and non-trivial patterns or knowledge from unstructured text documents [Tan

Stopword Removal

Removal of high frequency words

Most common words such as articles, prepositions, and pronouns etc.does not help in identifying meaning

Figure: A stop list of 25 semantically non-selective words which are common inReuters-RCV1

comp 527: Text Mining February 27, 2019 33 / 49

Page 34: Text Mining - Danushka Bollegaladanushka.net/lect/dm/text_mining_1.pdfprocess of extracting interesting and non-trivial patterns or knowledge from unstructured text documents [Tan

Methods for stopword removal

Classic Methodremoving stop-words using pre-compiled lists

Zipf’s law (Z-methods)frequency of a word is inversely proportional to its rank in thefrequency tableremove most frequent words

Mutual Information Methodsupervised method that computes mutual information between a giventerm and a document classlow mutual information suggests low discrimination power of the termand hence should be removed

comp 527: Text Mining February 27, 2019 34 / 49

Page 35: Text Mining - Danushka Bollegaladanushka.net/lect/dm/text_mining_1.pdfprocess of extracting interesting and non-trivial patterns or knowledge from unstructured text documents [Tan

Sentence Segmentation

Divide text into sentences

Involves identifying sentence boundaries between words in di↵erentsentences

a.k .a sentence boundary detection, sentence boundarydisambiguation, sentence boundary recognition

Useful and necessary for various NLP tasks such assentiment analysisrelation extractionquestion answering systemsknowledge extraction

comp 527: Text Mining February 27, 2019 35 / 49

Page 36: Text Mining - Danushka Bollegaladanushka.net/lect/dm/text_mining_1.pdfprocess of extracting interesting and non-trivial patterns or knowledge from unstructured text documents [Tan

Sentence boundary detection algorithms

Heuristic methods

Statistical classification trees [Riley, 1989]probability of a word occurring before or after a boundary, case andlength of words

Neural Networks [Palmer and Hearst, 1997]POS distribution of preceding and following words

Maximum entropy model [Mikheev 1998]

comp 527: Text Mining February 27, 2019 36 / 49

Page 37: Text Mining - Danushka Bollegaladanushka.net/lect/dm/text_mining_1.pdfprocess of extracting interesting and non-trivial patterns or knowledge from unstructured text documents [Tan

Sentence Segmentation - Using spaCy

comp 527: Text Mining February 27, 2019 37 / 49

Page 38: Text Mining - Danushka Bollegaladanushka.net/lect/dm/text_mining_1.pdfprocess of extracting interesting and non-trivial patterns or knowledge from unstructured text documents [Tan

Sentence Segmentation - Using spaCy

comp 527: Text Mining February 27, 2019 38 / 49

Page 39: Text Mining - Danushka Bollegaladanushka.net/lect/dm/text_mining_1.pdfprocess of extracting interesting and non-trivial patterns or knowledge from unstructured text documents [Tan

Part-of-Speech Tagging (POS)

Task of tagging POS tags (Nouns, Verbs, Adjectives, Adverbs, ...) forwords

POS tags provide lot of information about a wordknowing whether a word is noun or verb gives information aboutneighbouring wordsnouns are preceded by determiners; adjectives and verbs by nounsuseful for Named entity recognition; Machine Translation; Parsing;Word sense disambiguation

Given a word, we assume it can belong to only of the POS tags.

POS Tagging problemGiven a sentence S = w

1

w2

....wn consisting of n words, determine thecorresponding tag sequence P = P

1

P2

....Pn

comp 527: Text Mining February 27, 2019 39 / 49

Page 40: Text Mining - Danushka Bollegaladanushka.net/lect/dm/text_mining_1.pdfprocess of extracting interesting and non-trivial patterns or knowledge from unstructured text documents [Tan

POS Tagging - Challenges

comp 527: Text Mining February 27, 2019 40 / 49

Page 41: Text Mining - Danushka Bollegaladanushka.net/lect/dm/text_mining_1.pdfprocess of extracting interesting and non-trivial patterns or knowledge from unstructured text documents [Tan

POS Tagging - Tagset

Figure: Penn Treebank POS Tags

comp 527: Text Mining February 27, 2019 41 / 49

Page 42: Text Mining - Danushka Bollegaladanushka.net/lect/dm/text_mining_1.pdfprocess of extracting interesting and non-trivial patterns or knowledge from unstructured text documents [Tan

POS Tagging - Brown Corpus

Brown Corpus - standard corpus used for POS tagging task

first text corpus of American English

published in 1963-1964 by Francis and Kucera

consists of 1 million words (500 samples of 2000+ words each)

Brown corpus is PoS tagged with Penn TreeBank tagset.

⇡ 11% of the word types are ambiguous with regard to POS

⇡ 40% of the word tokens are ambiguous

ambiguity for common words. e.g. thatI know that he is honest = preposition (IN)Yes, that play was nice = determiner (DT)You can’t to that far = adverb (RB)

comp 527: Text Mining February 27, 2019 42 / 49

Page 43: Text Mining - Danushka Bollegaladanushka.net/lect/dm/text_mining_1.pdfprocess of extracting interesting and non-trivial patterns or knowledge from unstructured text documents [Tan

Automatic POS Tagging

SymbolicRule-basedTransformation-based

ProbabilisticHidden Markov ModelsMaximum Entropy Markov ModelsConditional Random Fields

comp 527: Text Mining February 27, 2019 43 / 49

Page 44: Text Mining - Danushka Bollegaladanushka.net/lect/dm/text_mining_1.pdfprocess of extracting interesting and non-trivial patterns or knowledge from unstructured text documents [Tan

Automatic POS Tagging - Brill Tagger

comp 527: Text Mining February 27, 2019 44 / 49

Page 45: Text Mining - Danushka Bollegaladanushka.net/lect/dm/text_mining_1.pdfprocess of extracting interesting and non-trivial patterns or knowledge from unstructured text documents [Tan

Automatic POS Tagging - Brill Tagger

comp 527: Text Mining February 27, 2019 45 / 49

Page 46: Text Mining - Danushka Bollegaladanushka.net/lect/dm/text_mining_1.pdfprocess of extracting interesting and non-trivial patterns or knowledge from unstructured text documents [Tan

Automatic POS Tagging: Brill Tagger - Example

comp 527: Text Mining February 27, 2019 46 / 49

Page 47: Text Mining - Danushka Bollegaladanushka.net/lect/dm/text_mining_1.pdfprocess of extracting interesting and non-trivial patterns or knowledge from unstructured text documents [Tan

Automatic POS Tagging: Brill Tagger - Sample Final Rules

comp 527: Text Mining February 27, 2019 47 / 49

Page 48: Text Mining - Danushka Bollegaladanushka.net/lect/dm/text_mining_1.pdfprocess of extracting interesting and non-trivial patterns or knowledge from unstructured text documents [Tan

Next Week

Probabilistic Models for POS Tagging

Relation Extraction

Question and Answering

comp 527: Text Mining February 27, 2019 48 / 49