1 CS 430 / INFO 430 Information Retrieval Lecture 26 Classification 1.

Date posted: 20-Dec-2015

CS 430 / INFO 430 Information Retrieval

Lecture 26

Classification 1


Course Administration


Classification and Categorization

[Diagram relating terms and documents to pre-defined classes and empirically-defined classes. Labels: thesaurus, classification, text categorization, document clustering.]


Text categorization

Text categorization

• Problem is to classify documents by whether they belong to a fixed set of pre-determined categories.

• Each document may belong to many categories.

• Each category is taken as a separate binary classification problem.

Classification

• Problem is to assign each document to exactly one pre-determined category.


Text categorization

Outline

• Select a subject domain.

• Choose a corpus of documents that cover the domain.

• Obtain a training set of documents that have been assigned to a set of categories.

• Create a vocabulary by extracting terms, normalization, precoordination of phrases, etc.

• Use the terms in a document as a feature set that describes the document. Scale the terms using idf or similar measure.

• Use machine learning methods (e.g., support vector machine) to train an automatic classifier.
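The feature-scaling step in the outline above can be sketched as follows. This is a minimal illustration, assuming idf = log(N / df) and raw term counts; the documents and the function names `idf_weights` and `feature_vector` are hypothetical, and the classifier training step is not shown.

```python
import math
from collections import Counter

def idf_weights(docs):
    """Return {term: idf}, with idf = log(N / df) and df = number of docs containing the term."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / count) for term, count in df.items()}

def feature_vector(doc, idf):
    """Scale the raw term counts of one document by idf; terms outside the vocabulary are skipped."""
    tf = Counter(doc)
    return {term: count * idf[term] for term, count in tf.items() if term in idf}

# Hypothetical training documents, already tokenized and normalized.
training_docs = [
    ["canal", "lock", "water"],
    ["canal", "boat", "water"],
    ["museum", "art", "canal"],
]
idf = idf_weights(training_docs)
vec = feature_vector(["canal", "art", "art"], idf)
```

A term that occurs in every training document ("canal" here) gets idf 0 and so contributes nothing to the feature vector, while a rare term ("art") is weighted up.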


Lexicon and Thesaurus

Lexicon contains information about words, their morphological variants, and their grammatical usage.

Thesaurus relates words by meaning:

ship, vessel, sail; craft, navy, marine, fleet, flotilla

book, writing, work, volume, tome, tract, codex

search, discovery, detection, find, revelation

(From Roget's Thesaurus, 1911)


Thesaurus in Information Retrieval

Use of a thesaurus in indexing (precoordination)

A. Manual

A human indexer assigns standard terms and associations.

computer-aided instruction
    see also education
    UF (used for): teaching machines
    BT (broader term): educational computing
    TT (top term): computer applications
    RT (related term): education
    RT (related term): teaching

From: INSPEC Thesaurus


Thesaurus in Information Retrieval

Use of a thesaurus in indexing (precoordination)

B. Automatic

Divide terms into thesaurus classes. Replace similar terms by a thesaurus class.

408: dislocation, junction, minority-carrier, n-p-n, p-n-p, point-contact, recombine, transition, unijunction

409: blast-cooled, heat-flow, heat-transfer

410: anneal, strain

From: Salton and McGill
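Replacing terms by their thesaurus class can be sketched as a simple lookup. The class membership below is my reading of the Salton and McGill example above; the `class:` prefix and the function name `index_terms` are hypothetical.

```python
# Thesaurus classes (class number -> member terms), as read from the example above.
THESAURUS_CLASSES = {
    408: ["dislocation", "junction", "minority-carrier", "n-p-n", "p-n-p",
          "point-contact", "recombine", "transition", "unijunction"],
    409: ["blast-cooled", "heat-flow", "heat-transfer"],
    410: ["anneal", "strain"],
}

# Invert to a term -> class lookup table.
TERM_TO_CLASS = {term: cls for cls, terms in THESAURUS_CLASSES.items() for term in terms}

def index_terms(tokens):
    """Replace each token that belongs to a thesaurus class by a class identifier."""
    return [f"class:{TERM_TO_CLASS[t]}" if t in TERM_TO_CLASS else t for t in tokens]

indexed = index_terms(["heat-flow", "in", "junction", "transistor"])
```

After this step, documents that use different members of the same class (e.g., "heat-flow" and "heat-transfer") share an index term.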


Desirable Properties for Information Retrieval

• Thesaurus is specific to a subject area. Contains only terms of interest for identification within that subject area.

• Ambiguous terms are coded only for the senses important for that field.

• Target is that each thesaurus class should include terms of moderate frequency. Ideally the classes should have similar frequency.


Alexandria Thesaurus: Example

canals

A feature type category for places such as the Erie Canal.

Used for:

The category canals is used instead of any of the following.

canal bends, canalized streams, ditch mouths, ditches, drainage canals, drainage ditches, ... more ...

Broader Terms:

Canals is a sub-type of hydrographic structures.


Alexandria Thesaurus: Example (continued)

canals (continued)

Related Terms:

The following is a list of other categories related to canals (non-hierarchical relationships).

channels, locks, transportation features, tunnels

Scope Note:

Manmade waterway used by watercraft or for drainage, irrigation, mining, or water power. (Definition of canals.)


Art and Architecture Thesaurus

• Controlled vocabulary for describing and retrieving information: fine art, architecture, decorative art, and material culture.

• Almost 128,000 terms for objects, textual materials, images, architecture and culture from all periods and all cultures.

• Used by archives, museums, and libraries to describe items in their collections.

• Used as a database accessed via a Web site.

• Used by computer programs, for information retrieval, and natural language processing.

A project of the J. Paul Getty Trust

http://www.getty.edu/research/conducting_research/vocabularies/aat/


Art and Architecture Thesaurus

Provides the terminology for objects, and the vocabulary necessary to describe them, such as style, period, shape, color, construction, or use, and scholarly concepts, such as theories, or criticism.

Concept:

a cluster of terms, one of which is established as the preferred term, or descriptor.

Categories:

associated concepts, physical attributes, styles and periods, agents, activities, materials, and objects.


Art and Architecture Thesaurus: Sample Record

Record ID: 198841

Descriptor: rhyta

Note: Refers to vessels from Ancient Greece, eastern Europe, or the Middle East that typically have a closed form with two openings, one at the top for filling and one at the base so that liquid could stream out. They are often in the shape of a horn or an animal's head, and were typically used as a drinking cup or for pouring wine into another vessel.

Hierarchy:

Containers [TQ]
    <containers by function or context>
        <culinary containers>
            <containers for serving and consuming food>


Art and Architecture Thesaurus: Sample Record (continued)

Terms: rhyta, rhyton (alternate, singular), protomai, protome, rhea, rheon, rheons

Related concepts: stirrup cups, sturzbechers, drinking vessels, ceremonial vessels


Medical Subject Headings (MeSH)

National Library of Medicine's controlled vocabulary thesaurus

The library provides MeSH subject headings for each article in the 4,800 journals that it indexes every year and for every book, etc., acquired by the library.

• 23,000 primary headings.

• Additional thesaurus of about 151,000 chemical terms.

• Terms are organized in a hierarchy.

• 135,000 cross-references.

Suitable for experts who understand the field and are able to formulate queries using MeSH terms and the MeSH structure.


MeSH hierarchy

Biological Sciences [G]
    Biological Sciences [G01] +
    Health Occupations [G02] +
    Environment and Public Health [G03] +
    Biological Phenomena, Cell Phenomena, and Immunity [G04] +
    Genetics [G05] +
    Biochemical Phenomena, Metabolism, and Nutrition [G06] +
    Physiological Processes [G07] +
    Reproductive and Urinary Physiology [G08] +
    Circulatory and Respiratory Physiology [G09] +
    Digestive, Oral, and Skin Physiology [G10] +
    Musculoskeletal, Neural, and Ocular Physiology [G11] +
    Chemical and Pharmacologic Phenomena [G12] +


MeSH hierarchy (continued)

Physiological Processes [G07]
    Adaptation, Physiological [G07.062] +
    Aging [G07.168] +
    Body Constitution [G07.265] +
    Body Temperature [G07.315]
        Body Temperature Regulation [G07.315.232] +
        Skin Temperature [G07.315.753]
    Chronobiology [G07.450] +
    Electrophysiology [G07.453] +
    Fluid Shifts [G07.503]
    Growth and Embryonic Development [G07.553] +
    Homeostasis [G07.621] +
    Tensile Strength [G07.900]
    Tropism [G07.950] +


MeSH hierarchy (continued)

MeSH Heading: Body Temperature
Tree Number: E01.370.600.120
Tree Number: G07.315
Entry Term: Organ Temperature
See Also: Fever
See Also: Thermography
See Also: Thermometers
Allowable Qualifiers: DE GE IM PH RE
Unique ID: D001831
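The tree numbers in the examples above appear to encode the path through the hierarchy: Body Temperature Regulation [G07.315.232] sits under Body Temperature [G07.315], which sits under Physiological Processes [G07]. Assuming that convention holds, broader headings can be found by trimming dotted components. The function names below are hypothetical, and the single-letter top category (e.g., G) is not recovered by this sketch.

```python
def ancestors(tree_number):
    """Return the tree numbers of all broader headings, nearest first."""
    parts = tree_number.split(".")
    return [".".join(parts[:i]) for i in range(len(parts) - 1, 0, -1)]

def is_narrower(child, parent):
    """True if `child` lies strictly below `parent` in the same tree."""
    return child.startswith(parent + ".")

broader = ancestors("G07.315.232")
```

Because a heading may carry more than one tree number (Body Temperature has both E01.370.600.120 and G07.315), a query expansion would apply this to each tree number separately.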


Automatic Thesaurus Construction

Outline

• Select a subject domain.

• Choose a corpus of documents that cover the domain.

• Create vocabulary by extracting terms, normalization, precoordination of phrases, etc.

• Devise a measure of similarity between terms and thesaurus classes.

• Cluster terms into thesaurus classes, using a cluster method that generates compact clusters (e.g., complete linkage).


Normalization of vocabulary

Normalization rules map variant forms into base expressions. Typical normalization rules for manual thesaurus construction are:

(a) Nouns only, or nouns and noun phrases.

(b) Singular nouns only.

(c) Spelling (e.g., U.S.).

(d) Capitalization, punctuation (e.g., hyphens), initials (e.g., IBM), abbreviations (e.g., Mr.).

Usually, many possible decisions can be made, but they should be followed consistently.

Which of these can be carried out automatically with reasonable accuracy?


Phrase construction

In a precoordinated thesaurus, term classes may contain phrases.

Informal definitions:

pair-frequency (i, j) is the frequency with which a given pair of words, i and j, occurs in context (e.g., in succession within a sentence).

A phrase is a pair of words, i and j, that occurs in context with a higher frequency than would be expected from the words' overall frequencies.

$$\text{cohesion}(i, j) = \frac{\text{observed pair-frequency}(i, j)}{\text{expected pair-frequency if } i, j \text{ independent}}$$


Phrase construction: simple case

Example: corpus of n terms

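$p_{i,j}$ is the observed frequency (number of times) with which terms $i$ and $j$ occur in succession, $i$ followed by $j$.

$f_i$ is the number of occurrences of term $i$ in the corpus.

There are $n-1$ pairs. If the terms are independent, the probability that a given pair begins with term $i$ and ends with term $j$ is

$$\frac{f_i}{n} \cdot \frac{f_j}{n}$$

so the expected pair-frequency is $(n-1) f_i f_j / n^2$, and

$$\text{cohesion}(i, j) = \frac{n^2 \, p_{i,j}}{(n-1) \, f_i \, f_j}$$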


Phrase construction

Salton and McGill algorithm

1. Compute pair-frequency for all terms.

2. Reject all pairs that fall below a certain frequency threshold.

3. Calculate cohesion values.

4. If cohesion is above a threshold value, consider the word pair as a phrase.

There is promising research on phrase identification using methods of computational linguistics.
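The four steps can be sketched as below, using the simple-case cohesion formula n² p / ((n-1) f_i f_j) with adjacent word pairs as the context. This is an illustration only, not Salton and McGill's implementation: the function name, the threshold values, and the sample text are assumptions, and sentence boundaries are ignored.

```python
from collections import Counter

def find_phrases(tokens, min_pair_freq=2, min_cohesion=2.0):
    """Return {(i, j): cohesion} for adjacent word pairs that pass both thresholds."""
    n = len(tokens)
    term_freq = Counter(tokens)                    # f_i: occurrences of each term
    pair_freq = Counter(zip(tokens, tokens[1:]))   # step 1: p_ij for adjacent pairs
    phrases = {}
    for (i, j), p in pair_freq.items():
        if p < min_pair_freq:                      # step 2: reject rare pairs
            continue
        # step 3: cohesion = observed / expected pair-frequency
        cohesion = (n * n * p) / ((n - 1) * term_freq[i] * term_freq[j])
        if cohesion > min_cohesion:                # step 4: accept as a phrase
            phrases[(i, j)] = cohesion
    return phrases

# Hypothetical 12-word corpus.
tokens = ("information retrieval systems support information retrieval "
          "and the retrieval of the information").split()
phrases = find_phrases(tokens)
```

In this corpus only the pair (information, retrieval) occurs twice; its cohesion is 144 × 2 / (11 × 3 × 3) ≈ 2.9, so it is the only pair accepted.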


Similarities

The vocabulary consists of a set of elements, each of which can be a single term or a phrase.

The next step is to calculate a measure of similarity between elements.

One measure of similarity is the number of documents that have terms j and k in common:

$$S(t_j, t_k) = \sum_{i=1}^{n} t_{ij} \, t_{ik}$$

where $t_{ij} = 1$ if document $i$ contains term $j$ and 0 otherwise, and $n$ is the number of documents.


Similarities: Incidence array

        alpha  bravo  charlie  delta  echo  foxtrot  golf
D1        1      1       1       1      1      1       1
D2        1      0       0       1      0      0       1
D3        0      1       1       0      1      1       0
D4        1      0       0       1      0      1       1
n         3      2       2       3      2      3       3


Term similarity matrix

          alpha  bravo  charlie  delta  echo  foxtrot  golf
alpha       -      1       1       3      1      2       3
bravo       1      -       2       1      2      2       1
charlie     1      2       -       1      2      2       1
delta       3      1       1       -      1      2       3
echo        1      2       2       1      -      2       1
foxtrot     2      2       2       2      2      -       2
golf        3      1       1       3      1      2       -

Using count of documents that have two terms in common
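The matrix above can be reproduced from the incidence array with a few lines of code. In this sketch the rows for D3 and D4 are a reconstruction from the flattened array (swapping them would not change any pairwise count), and the function names are hypothetical.

```python
TERMS = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf"]

# Incidence array: one row per document; entry is 1 if the document contains the term.
INCIDENCE = [
    [1, 1, 1, 1, 1, 1, 1],  # D1
    [1, 0, 0, 1, 0, 0, 1],  # D2
    [0, 1, 1, 0, 1, 1, 0],  # D3 (reconstructed)
    [1, 0, 0, 1, 0, 1, 1],  # D4 (reconstructed)
]

def cooccurrence(incidence, j, k):
    """S(t_j, t_k): number of documents that contain both term j and term k."""
    return sum(row[j] * row[k] for row in incidence)

def similarity_matrix(incidence):
    """Full term-by-term matrix; the diagonal holds each term's document count."""
    m = len(incidence[0])
    return [[cooccurrence(incidence, j, k) for k in range(m)] for j in range(m)]

S = similarity_matrix(INCIDENCE)
alpha_delta = S[TERMS.index("alpha")][TERMS.index("delta")]
```

The diagonal, left blank in the matrix above, comes out as the number of documents containing each term (the row n of the incidence array).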


Similarity measures

Improved similarity measures can be generated by:

• Using term frequency matrix instead of incidence matrix

• Weighting terms by frequency:

cosine measure

$$S(t_j, t_k) = \frac{\sum_{i=1}^{n} t_{ij} \, t_{ik}}{|t_j| \, |t_k|}$$

dice measure

$$S(t_j, t_k) = \frac{\sum_{i=1}^{n} t_{ij} \, t_{ik}}{\sum_{i=1}^{n} t_{ik} + \sum_{i=1}^{n} t_{ij}}$$


Term similarity matrix

          alpha  bravo  charlie  delta  echo  foxtrot  golf
alpha       -     0.2     0.2     0.5    0.2    0.33    0.5
bravo      0.2     -      0.5     0.2    0.5    0.4     0.2
charlie    0.2    0.5      -      0.2    0.5    0.4     0.2
delta      0.5    0.2     0.2      -     0.2    0.33    0.5
echo       0.2    0.5     0.5     0.2     -     0.4     0.2
foxtrot    0.33   0.4     0.4     0.33   0.4     -      0.33
golf       0.5    0.2     0.2     0.5    0.2    0.33     -

Using incidence matrix and dice weighting
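A short sketch of how these values arise, using the dice measure as written in this lecture (common documents divided by the sum of the two document counts). Note that the commonly cited Dice coefficient carries an extra factor of 2 in the numerator; the code follows the lecture's form because that is what reproduces the table. The D3 and D4 rows are a reconstruction, and the function names are hypothetical.

```python
import math

TERMS = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf"]
INCIDENCE = [
    [1, 1, 1, 1, 1, 1, 1],  # D1
    [1, 0, 0, 1, 0, 0, 1],  # D2
    [0, 1, 1, 0, 1, 1, 0],  # D3 (reconstructed)
    [1, 0, 0, 1, 0, 1, 1],  # D4 (reconstructed)
]

def term_vector(name):
    """Column of the incidence array for one term."""
    j = TERMS.index(name)
    return [row[j] for row in INCIDENCE]

def dot(tj, tk):
    return sum(a * b for a, b in zip(tj, tk))

def dice(tj, tk):
    """Dice measure in the lecture's form (no factor of 2)."""
    return dot(tj, tk) / (sum(tj) + sum(tk))

def cosine(tj, tk):
    """Cosine measure: dot product divided by the product of vector lengths."""
    return dot(tj, tk) / (math.sqrt(dot(tj, tj)) * math.sqrt(dot(tk, tk)))

alpha, delta, foxtrot, bravo = (term_vector(t) for t in ("alpha", "delta", "foxtrot", "bravo"))
d_alpha_delta = dice(alpha, delta)      # 3 / (3 + 3)
d_alpha_foxtrot = dice(alpha, foxtrot)  # 2 / (3 + 3)
d_bravo_foxtrot = dice(bravo, foxtrot)  # 2 / (2 + 3)
```

With the cosine measure, alpha and delta (which occur in exactly the same documents) score 1, the maximum, whereas the lecture's dice form gives them 0.5.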


Clustering terms to form concepts

The final stage is to group similar terms together into concepts. This is done by cluster analysis. Cluster analysis is the topic of the next lecture.
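As a preview, here is a minimal sketch of complete-linkage clustering (the compact-cluster method named in the outline) applied to the dice similarity values from the table above. The stopping threshold of 0.45 and the function name are assumptions chosen for illustration; the next lecture's treatment may differ.

```python
TERMS = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf"]

# Dice similarities copied from the table above. The diagonal is blank in the
# table; 1.0 is a placeholder that the algorithm never reads.
SIM = [
    [1.0,  0.2, 0.2, 0.5,  0.2, 0.33, 0.5],
    [0.2,  1.0, 0.5, 0.2,  0.5, 0.4,  0.2],
    [0.2,  0.5, 1.0, 0.2,  0.5, 0.4,  0.2],
    [0.5,  0.2, 0.2, 1.0,  0.2, 0.33, 0.5],
    [0.2,  0.5, 0.5, 0.2,  1.0, 0.4,  0.2],
    [0.33, 0.4, 0.4, 0.33, 0.4, 1.0,  0.33],
    [0.5,  0.2, 0.2, 0.5,  0.2, 0.33, 1.0],
]

def complete_linkage(sim, threshold):
    """Agglomerative clustering; stops when no pair of clusters reaches the threshold."""
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: two clusters are only as similar as their
                # least similar pair of members.
                s = min(sim[x][y] for x in clusters[a] for y in clusters[b])
                if best is None or s > best[0]:
                    best = (s, a, b)
        if best[0] < threshold:
            break
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

concepts = [[TERMS[i] for i in c] for c in complete_linkage(SIM, threshold=0.45)]
```

With this threshold the terms fall into three groups: {alpha, delta, golf}, {bravo, charlie, echo}, and foxtrot on its own, since foxtrot's best similarity to any other term is 0.4.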