Grade clustering and seriation of words based on their co-occurrences Emilia Jarochowska & Krzysztof...

Grade clustering and seriation of words based on their co-occurrences
Emilia Jarochowska & Krzysztof Ciesielski
Institute of Computer Science, Poland

Summary
Using data on term co-occurrence extracted from a newsgroup sample, we seek the most regular arrangement of the terms and show how the resulting pattern allows convenient visualization and clustering.

Clustering of documents and terms: what for?
Improving and grouping search results
Finding synonyms: construction of thesauri, query expansion based on the synonyms of the entered terms
Finding collocations

The common approach to term clustering
Association matrices which quantify term correlations
This global approach does not necessarily adapt well to the local context

The local approach
Co-occurrence is identified within a sliding window instead of the whole document, and the counts are arranged into a contingency table (a symmetric matrix).
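The sliding-window counting can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `cooccurrence_counts`, the `window` parameter, the toy sentences, and the choice to skip pairs of identical terms are all assumptions.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count term pairs lying at most `window` positions apart (hypothetical helper)."""
    counts = Counter()
    for tokens in sentences:
        for pos, term in enumerate(tokens):
            # Look ahead only, then record the pair in both directions,
            # so the resulting table is symmetric by construction.
            for other in tokens[pos + 1 : pos + 1 + window]:
                if other != term:  # assumption: a term does not co-occur with itself
                    counts[(term, other)] += 1
                    counts[(other, term)] += 1
    return counts

# Toy input: each sentence is an ordered list of (already stemmed) terms.
sentences = [["ftp", "server", "unix"], ["unix", "server", "display"]]
table = cooccurrence_counts(sentences, window=1)
```

With a window of 1 only adjacent terms are paired; widening the window moves the table towards whole-sentence co-occurrence.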

Material
A collection of posts from 20 newsgroups, widely used as a benchmark for text-mining methods
http://people.csail.mit.edu/jrennie/20Newsgroups/

comp.windows.x, rec.antiques.radio+phono, rec.sport.hockey, sci.med, talk.religion.misc
Entropy of within-group frequencies (condition)
363 automatically selected keywords representing these groups
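The entropy of a term's within-group frequencies can be computed as in the sketch below. The slide names entropy only as the selection condition, so the function name, the base-2 logarithm, and the reading that low entropy marks a group-specific term are assumptions.

```python
import math

def within_group_entropy(group_counts):
    """Shannon entropy (in bits) of one term's frequency distribution over the groups."""
    total = sum(group_counts)
    if total == 0:
        raise ValueError("the term does not occur in any group")
    entropy = 0.0
    for count in group_counts:
        if count > 0:  # empty groups contribute nothing to the sum
            share = count / total
            entropy -= share * math.log2(share)
    return entropy

# Hypothetical counts of one term in each of the five newsgroups.
specific_term = within_group_entropy([10, 0, 0, 0, 0])  # concentrated in one group
generic_term = within_group_entropy([5, 5, 5, 5, 5])    # spread evenly over all groups
```

A term concentrated in one group has entropy 0, while a term spread evenly over five groups reaches the maximum of log2(5) bits; any threshold between these extremes would have to be chosen by the analyst.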

Methods
Stemming to reduce inflected forms to one representative
HAL (Hyperspace Analogue to Language)
Grade correspondence analysis implemented in the GradeStat program

HAL
HAL generates a matrix H in which the cell h_ij is a similarity measure of the terms i and j. If s = (t_1, ..., t_k) is a sentence (an ordered list of terms), then h_ij is the sum, over all sentences in a collection of documents, of the co-occurrences of terms i and j.

Several forms of normalization are possible.

Grade Correspondence Analysis
GCA transforms a data matrix into a probability table and iteratively permutes its rows and columns to make it more strongly and regularly positively dependent, by maximizing Kendall's tau.
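The quantity being maximized can be illustrated with a small sketch. It computes Kendall's tau of a table as the probability of a concordant pair of cells minus the probability of a discordant pair. It does not reproduce the GradeStat permutation search; the function name and the toy tables are assumptions.

```python
import numpy as np

def kendall_tau_table(counts):
    """Kendall's tau of a two-way table: P(concordant pair) - P(discordant pair)."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()  # turn the data matrix into a probability table
    n_rows, n_cols = p.shape
    concordant = 0.0
    discordant = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            # Mass in later rows and later columns: concordant with cell (i, j).
            concordant += p[i, j] * p[i + 1:, j + 1:].sum()
            # Mass in later rows but earlier columns: discordant with cell (i, j).
            discordant += p[i, j] * p[i + 1:, :j].sum()
    return 2.0 * (concordant - discordant)

# Reordering the rows of the same table changes tau.
ordered = [[4, 1, 0], [1, 4, 1], [0, 1, 4]]
shuffled = [ordered[1], ordered[2], ordered[0]]
tau_ordered = kendall_tau_table(ordered)
tau_shuffled = kendall_tau_table(shuffled)
```

An ordering of rows and columns with a higher tau places the large cells closer to the diagonal, which is the kind of arrangement the permutation step looks for.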

Regularity and deviation from it
In the most regular arrangement possible, the deviation from regularity for each pair of observations or variables can be measured as ar_max - |ar|, where ar is the concentration index of the two distributions describing that pair of observations or variables, and ar_max is the respective maximum concentration index.

Overrepresentation maps
Contingency matrices are visualized here by means of overrepresentation maps. Overrepresentation is defined as follows:
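As a minimal sketch, assuming the ratio form commonly used in grade data analysis (an assumption, since no formula is given here): the overrepresentation of a cell is its observed probability divided by the probability expected if rows and columns were independent. The function name is hypothetical, and all row and column sums are assumed to be non-zero.

```python
import numpy as np

def overrepresentation(counts):
    """Observed cell probability divided by the product of its marginals (assumed definition)."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    row_marginals = p.sum(axis=1, keepdims=True)
    col_marginals = p.sum(axis=0, keepdims=True)
    expected = row_marginals * col_marginals  # cell probabilities under independence
    return p / expected

# Values above 1 mark overrepresented cells, values below 1 underrepresented ones.
ratios = overrepresentation([[2, 0], [0, 2]])
```

In a map, each cell would then be shaded according to its ratio, so that blocks of overrepresented term pairs stand out once rows and columns have been reordered.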

Results

Polarization between groups of terms
Computer-related terms: ftp, server, unix, MIT, Columbia, mac, graphic, video, display, internet
Political and religious terms: murder, belief, kill, faith, Jewish, moral, hell, death, children, shot, war, fire, arm, defense, absolut, burn, Bible

Deviations from regularity
Are themselves more regular than the original data
Thus are better descriptors of the position of a term in the dataset

Examples of seriation
religion, war, sport, computers, commerce, general
Clusters: ftp, company, produce, house, april, example, city, baseball, war

Conclusions
We identified two disjoint groups composed of very specific terms and a group of terms with varying affinities to these extremes: a scale obtained through unsupervised learning.
Deviation from regularity in the dataset characterizes terms better than raw co-occurrence data.

Plans for the future
Used as a criterion in outlier detection, deviation from regularity might indicate words used inappropriately for their context, neologisms, etc.

Thank you for your attention
http://gradestat.ipipan.waw.pl/english/
