Text mining by examples, By Hadi Mohammadzadeh

Post on 11-May-2015

Hadi Mohammadzadeh Text Mining by Examples Pages

By: Hadi Mohammadzadeh, Institute of Applied Information Processing, University of Ulm – 27 Jan. 2010

Seminar on Text Mining by Examples

Outline

1. New Terminologies
2. WordNet - A Large Lexical Database of English
3. Reuters-21578 … as a Text Collection
4. CMU Text Learning Group Data Archives
5. Text Mine Software - Web-based algorithms
6. Text Mine Software - Command-based algorithms
7. Useful Web Sites

Part One

New Terminologies: Word and Meaning Relationships

Understanding Text – Hyponym and Hypernym

• In linguistics, a hyponym is a word or phrase whose semantic range is included within that of another word, its hypernym. For example, scarlet and crimson are both hyponyms of red (their hypernym), which is, in turn, a hyponym of colour.
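The hyponym/hypernym relation can be pictured as a chain of "is a kind of" links. The sketch below encodes only the colour example from this slide; the dictionary `HYPERNYM` and both function names are illustrative assumptions, not part of WordNet or of any library.

```python
# Toy taxonomy: each word maps to its direct hypernym (illustrative data only).
HYPERNYM = {
    "scarlet": "red",
    "crimson": "red",
    "red": "colour",
}

def hypernym_chain(word):
    """Return the hypernyms of `word`, from the nearest to the most general."""
    chain = []
    while word in HYPERNYM:
        word = HYPERNYM[word]
        chain.append(word)
    return chain

def is_hyponym_of(word, candidate):
    """True if `candidate` appears somewhere above `word` in the taxonomy."""
    return candidate in hypernym_chain(word)

print(hypernym_chain("scarlet"))            # ['red', 'colour']
print(is_hyponym_of("crimson", "colour"))   # True
```

Because the relation is followed transitively, scarlet counts as a hyponym of colour as well as of red.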

Understanding Text – Meronym

• Meronymy is a semantic relation used in linguistics. A meronym denotes a constituent part of, or a member of, something. That is,
– X is a meronym of Y if Xs are parts of Y(s), or
– X is a meronym of Y if Xs are members of Y(s).

• For example, 'finger' is a meronym of 'hand' because a finger is part of a hand. Similarly 'wheel' is a meronym of 'automobile'.

Understanding Text – Holonym

• Holonymy defines the relationship between a term denoting the whole and a term denoting a part of the whole. That is,
– 'X' is a holonym of 'Y' if Ys are parts of Xs, or
– 'X' is a holonym of 'Y' if Ys are members of Xs.

• For example, 'tree' is a holonym of 'bark', of 'trunk' and of 'limb'.
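Meronymy and holonymy are two views of the same part-whole link, which a few lines of Python can make concrete. The data below comes from the slide examples only, and the names `MERONYMS`, `is_meronym`, `is_holonym` and `holonyms_of` are hypothetical.

```python
# Toy part-of data taken from the slide examples (illustrative only).
MERONYMS = {
    "hand": {"finger"},
    "automobile": {"wheel"},
    "tree": {"bark", "trunk", "limb"},
}

def is_meronym(part, whole):
    """True if `part` is recorded as a part or member of `whole`."""
    return part in MERONYMS.get(whole, set())

def is_holonym(whole, part):
    """Holonymy is the inverse relation: the whole seen from the part."""
    return is_meronym(part, whole)

def holonyms_of(part):
    """All wholes that list `part` among their parts, sorted for stable output."""
    return sorted(w for w, parts in MERONYMS.items() if part in parts)

print(is_meronym("finger", "hand"))  # True
print(is_holonym("tree", "bark"))    # True
print(holonyms_of("wheel"))          # ['automobile']
```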

Part Two

WordNet: A Large Lexical Database of English

WordNet

• WordNet® is a large lexical database of English, developed under the direction of George A. Miller.

• Development of WordNet began in 1985, and it is widely used in tools that manage text.

• WordNet is more than just a dictionary and thesaurus; it includes all kinds of relationships between words. WordNet version 2.0 contains roughly 150,000 content words.

WordNet cont.

• Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept.

• WordNet is also freely and publicly available for download.

• WordNet's structure makes it a useful tool for computational linguistics and natural language processing.

Understanding Text – Polysemy

Number of Senses in WordNet

• A word can have more than one meaning, and the intended meaning is not always obvious from the sentence.

• In WordNet a word has an average of 1.4 senses.

Average Number of Senses by Word Class

Word class   Average number of senses
Verb         2.1
Adjective    1.45
Adverb       1.25
Noun         1.24

Understanding Text – Polysemy

Number of Senses in WordNet

Words with the Highest Number of Senses from WordNet

Word Number of Senses

Break 74

Cut 73

Run 57

Play 52

Make 51

Understanding Text – Polysemy

Number of POS in WordNet

• Some words also have more than one part of speech (POS). For example, 'still' has five different parts of speech.

Word    Number of POS
Out     5
Round   5
Still   5
Down    5
Over    4

Word Classifications in WordNet

• Words can be classified into word classes or POS.

• We refer to nouns, verbs, adjectives, and adverbs as content words.

• Conjunctions, determiners, pronouns, and prepositions are called function words.

Frequencies of Word Classes from WordNet

Type          Number
Noun          114,400 (75%)
Adjective     21,438 (14%)
Verb          11,341 (7.4%)
Adverb        4,662 (3%)
Preposition   133 (0.08%)
Pronoun       118 (0.07%)
Conjunction   89 (0.05%)
Determiner    14 (0.009%)
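The percentages in the table can be recomputed from the counts. The sketch below assumes that each percentage is the class count divided by the sum of all eight counts (152,195), which is an inference from the figures and not something the slide states; the variable names are illustrative.

```python
# Counts copied from the table above.
COUNTS = {
    "Noun": 114_400, "Adjective": 21_438, "Verb": 11_341, "Adverb": 4_662,
    "Preposition": 133, "Pronoun": 118, "Conjunction": 89, "Determiner": 14,
}
CONTENT_CLASSES = {"Noun", "Verb", "Adjective", "Adverb"}

total = sum(COUNTS.values())
shares = {name: 100 * n / total for name, n in COUNTS.items()}
content_share = sum(shares[c] for c in CONTENT_CLASSES)

for name, pct in shares.items():
    print(f"{name:<12} {pct:.3f}%")
print(f"Content words: {content_share:.2f}% of {total} entries")
```

Under that assumption the content words account for well over 99% of the entries, and the function words for only a fraction of a percent.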

WordNet Website and Developed Program

• WordNet Website

• WordNet Developed Program

Part Three

Reuters-21578 as a Text Collection

Reuters-21578 History

• The documents in the Reuters-21578 collection appeared on the Reuters newswire in 1987.

• Reuters-21578 is a test collection for evaluating automatic text categorization techniques. It is a classic benchmark for text categorization algorithms.

• The Reuters-21578 collection is distributed in 22 files. Each of the first 21 files contains 1000 documents, while the last contains 578 documents.
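The file layout explains the name of the collection: 21 × 1000 + 578 = 21578. A small sketch of that arithmetic follows; the helper `file_index` is hypothetical and assumes the documents are stored in order across the files.

```python
# 22 files: the first 21 hold 1000 documents each, the last holds 578.
docs_per_file = [1000] * 21 + [578]

def file_index(doc_number):
    """0-based index of the file holding the doc_number-th document (1-based),
    assuming documents are stored in order."""
    if not 1 <= doc_number <= sum(docs_per_file):
        raise ValueError("document number out of range")
    return (doc_number - 1) // 1000

print(len(docs_per_file), sum(docs_per_file))               # 22 21578
print(file_index(1), file_index(1000), file_index(21578))   # 0 0 21
```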

Reuters-21578

• Distribution 1.0 was released on 26 September 1997 by David D. Lewis, AT&T Labs - Research.

• The data was originally collected and labeled by Carnegie Group, Inc. and Reuters, Ltd. in the course of developing the CONSTRUE text categorization system.

Part Four

CMU Text Learning Group Data Archives as a Text Collection

CMU Text Learning Group Data Archives

• This data set is a collection of 20,000 messages, collected from 20 different netnews newsgroups. One thousand messages from each of the twenty newsgroups were chosen at random and partitioned by newsgroup name.

• Link

• Sample Message

• Experiment Results

• Prof. Cho, Sam Houston State University

CMU Text Learning Group Data Archives

1. alt.atheism
2. talk.politics.guns
3. talk.politics.mideast
4. talk.politics.misc
5. talk.religion.misc
6. soc.religion.christian
7. comp.sys.ibm.pc.hardware
8. comp.graphics
9. comp.os.ms-windows.misc
10. comp.sys.mac.hardware
11. comp.windows.x
12. rec.autos
13. rec.motorcycles
14. rec.sport.baseball
15. rec.sport.hockey
16. sci.crypt
17. sci.electronics
18. sci.space
19. sci.med
20. misc.forsale

Part Five

Text Mine Software: Web-Based Algorithms

Text Mine Application

• The three scripts in the first row handle:
1. The creation of text statistics
   • Number of word types
   • Letter frequencies
   • Word frequencies
2. Entity extraction
3. Finding the POS tags for words
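The three text statistics listed above can be sketched with Python's standard library. This is only an illustration of the idea, not the Text Mine scripts themselves: the function name `text_statistics` and the crude rule that a word is a run of the letters a to z are assumptions.

```python
import re
from collections import Counter

def text_statistics(text):
    """Return (number of word types, letter frequencies, word frequencies)."""
    lowered = text.lower()
    words = re.findall(r"[a-z]+", lowered)            # crude word tokens
    letters = [ch for ch in lowered if ch.isalpha()]  # ignore spaces, digits, punctuation
    word_freq = Counter(words)
    letter_freq = Counter(letters)
    return len(word_freq), letter_freq, word_freq

types, letter_freq, word_freq = text_statistics("The cat saw the other cat.")
print(types)                      # 4
print(word_freq.most_common(2))   # [('the', 2), ('cat', 2)]
print(letter_freq["t"])           # 5
```

The number of word types is the number of distinct words, so the six tokens of the sample sentence yield four types.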

Text Mine Application

• As input, use a text file such as the Help file, or type text into the text box.

Part Six

Text Mine Software: Command-Based Algorithms

Zeroth Program: Tokens

• Name of program: tokens.pl
• Input: sample.
• Output: after running this program, it generates a text file named tokens.txt
• Aim: generating tokens
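Token generation can be illustrated with a single regular expression. The slide does not show the rules that tokens.pl applies, so the pattern below is an assumption: it treats words (optionally with an internal apostrophe or hyphen), numbers, and single punctuation marks as tokens.

```python
import re

# Words (with optional internal apostrophe or hyphen), numbers, or one punctuation mark.
TOKEN_RE = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*|\d+(?:\.\d+)?|[^\w\s]")

def tokenize(text):
    """Split text into word, number, and punctuation tokens."""
    return TOKEN_RE.findall(text)

print(tokenize("WordNet 2.0 isn't just a dictionary."))
# ['WordNet', '2.0', "isn't", 'just', 'a', 'dictionary', '.']
```

Keeping "2.0" and "isn't" whole is a design choice; a tokenizer that splits on every non-letter would break both.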

First Program: Part of Speech Tagger

• Name of program: pos-test.pl
• Input: inside the Perl file.
• Output: after running this program, it generates a text file named pos_test_results.txt
• Aim: part-of-speech tagging

Second Program: Entity Extraction

• To generate named entities with associated types, we need some dictionaries for categories such as
– Person, place, organization, number, currency, dimension, time, technical term, or miscellaneous.
– For example, co_abbrev.dat contains a list of about 900 abbreviations, and the co_places table is a list of about 3000 of the world's larger cities.
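Dictionary-based entity extraction boils down to looking tokens up in typed word lists. The sketch below shows that lookup with a tiny stand-in dictionary; `GAZETTEER`, its three entries, and the function name are hypothetical and far smaller than tables such as co_places.

```python
import re

# Tiny stand-in dictionary; the real tables hold hundreds or thousands of entries.
GAZETTEER = {
    "ulm": "place",
    "london": "place",
    "reuters": "organization",
}
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")

def extract_entities(text):
    """Return (token, type) pairs for numbers and for tokens found in the dictionary."""
    entities = []
    for token in re.findall(r"\d+(?:[.,]\d+)*|[A-Za-z]+", text):
        if NUMBER_RE.fullmatch(token):
            entities.append((token, "number"))
        elif token.lower() in GAZETTEER:
            entities.append((token, GAZETTEER[token.lower()]))
    return entities

print(extract_entities("Reuters reported 21,578 stories; none came from Ulm."))
# [('Reuters', 'organization'), ('21,578', 'number'), ('Ulm', 'place')]
```

A pure lookup cannot resolve ambiguous names or multi-word entities, which is one reason real systems combine dictionaries with context rules.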

Second Program: Entity Extraction (cont.)

• Name of program: test-ent.pl
• Input: inside the Perl file.
• Output: after running this program, it generates a text file named test_ent_results.txt
• Aim: entity extraction

Third Program: Disambiguate Words with Multiple Senses

• Name of program: sense.pl
• Input: inside the Perl file.
• Output: after running this program, it generates a text file named sense.txt

Fourth Program: Random Text Generator

• Name of program: tgen.pl
• Input: inside the Perl file.
• Output: after running this program, it generates a text file named tgen.txt
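The slide does not say how tgen.pl produces its text. One common approach, shown here purely as an assumption, is a bigram Markov chain: record which words follow each word in a sample, then walk those links at random. All names in the sketch are illustrative.

```python
import random
from collections import defaultdict

def build_bigram_model(text):
    """Map each word to the list of words that follow it in the text."""
    words = text.split()
    followers = defaultdict(list)
    for current, nxt in zip(words, words[1:]):
        followers[current].append(nxt)
    return followers

def generate(followers, start, length, seed=None):
    """Random walk over the bigram model, stopping early at a dead end."""
    rng = random.Random(seed)
    output = [start]
    while len(output) < length and followers.get(output[-1]):
        output.append(rng.choice(followers[output[-1]]))
    return " ".join(output)

sample = "the cat saw the dog and the dog saw the cat"
model = build_bigram_model(sample)
print(generate(model, "the", 8, seed=1))
```

Every adjacent word pair in the output also occurs in the sample, which is why such text looks locally plausible even when it is globally meaningless. Passing a fixed `seed` makes a run repeatable.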

Fifth Program: Splitting Text into Sentences

• Name of program: tsplit.pl
• Input: inside the Perl file.
• Output: after running this program, it generates a text file named tsplit.txt
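Sentence splitting is harder than splitting on every full stop, because abbreviations also end in one. The sketch below is one simple heuristic and not necessarily what tsplit.pl does: split after ., ! or ? when a capital letter follows, then rejoin pieces that ended in a known abbreviation. The abbreviation list is a small hypothetical sample.

```python
import re

# Abbreviations that should not end a sentence (a tiny, hypothetical list).
ABBREVIATIONS = {"prof.", "dr.", "inc.", "ltd.", "jan."}

def split_sentences(text):
    """Split on ., ! or ? followed by whitespace and a capital letter,
    then merge pieces whose last word is a known abbreviation."""
    pieces = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    sentences = []
    for piece in pieces:
        if sentences and sentences[-1].split()[-1].lower() in ABBREVIATIONS:
            sentences[-1] += " " + piece
        else:
            sentences.append(piece)
    return sentences

print(split_sentences("Reuters, Ltd. collected the data. Prof. Cho used it! Why?"))
# ['Reuters, Ltd. collected the data.', 'Prof. Cho used it!', 'Why?']
```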

Sixth Program: Clustering

• Name of program: cluster.pl
• Input data: a collection of 55 Reuters documents from three topics
– Cocoa, 15 documents
– Sugar, 22 documents
– Coffee, 18 documents
The input file is included in cluster.pl.
• Input parameters: a similarity threshold, a linking parameter, and an indexing parameter.
• Output: a list of clusters and a similarity matrix (Cluster.txt).
• Method: this program is based on a genetic algorithm.

Part Seven

Useful Web Sites

Talk to Ditto

• http://www.convo.co.uk/x02/?

How does it work?

• Bayesian Classification is used to teach Ditto the donkey the basics of the English language

• When Ditto receives a message, he evaluates it for niceness or nastiness, then responds emotionally on a scale of –100 to +100

• Ditto was trained using 5525 examples
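A Bayesian classifier of this kind can be sketched as a two-class naive Bayes model. Everything specific below is an assumption: the four training messages are invented, and the mapping from the probability of "nice" to the –100..+100 scale (200 × P − 100) is just one plausible choice, since the slide does not say how Ditto computes its response.

```python
import math
from collections import Counter

def train(examples):
    """examples: list of (text, label) pairs with label 'nice' or 'nasty'."""
    word_counts = {"nice": Counter(), "nasty": Counter()}
    doc_counts = Counter()
    for text, label in examples:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts["nice"]) | set(word_counts["nasty"])
    return word_counts, doc_counts, vocab

def score(message, model):
    """Map P(nice | message) from [0, 1] onto the scale -100..+100."""
    word_counts, doc_counts, vocab = model
    log_post = {}
    for label in ("nice", "nasty"):
        total = sum(word_counts[label].values())
        logp = math.log(doc_counts[label] / sum(doc_counts.values()))
        for word in message.lower().split():
            if word in vocab:  # ignore words never seen in training
                # Laplace (add-one) smoothing avoids zero probabilities.
                logp += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
        log_post[label] = logp
    p_nice = 1 / (1 + math.exp(log_post["nasty"] - log_post["nice"]))
    return round(200 * p_nice - 100)

model = train([
    ("you are lovely", "nice"), ("what a kind donkey", "nice"),
    ("you are stupid", "nasty"), ("what an ugly donkey", "nasty"),
])
print(score("lovely kind donkey", model))   # 60
print(score("stupid ugly donkey", model))   # -60
print(score("hello", model))                # 0
```

A message made only of unseen words scores 0 here because the two classes have equal priors; with thousands of training examples the scores would be driven by far richer word statistics.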

Dragon Toolkit

• Dragon Toolkit

Disp

• http://www.ltg.ed.ac.uk/disp/resources/
