Grade clustering and seriation of words based on their co-occurrences Emilia Jarochowska & Krzysztof...

Grade clustering and seriation of words based on their co-occurrences Emilia Jarochowska & Krzysztof Ciesielski Institute of Computer Science, Poland

  • Grade clustering and seriation of words based on their co-occurrences

    Emilia Jarochowska & Krzysztof Ciesielski
    Institute of Computer Science, Poland

  • Summary

    Using term co-occurrence data extracted from a newsgroup sample, we seek the most regular arrangement of the terms and show how the resulting pattern allows convenient visualization and clustering.

  • Clustering of documents and terms: what for?

    Improving and grouping search results
    Finding synonyms: construction of thesauri, query expansion based on the synonyms of the entered terms
    Finding collocations

  • The common approach to term clustering

    Association matrices which quantify term correlations
    This global approach does not necessarily adapt well to the local context

  • The local approach

    Co-occurrence is identified within a sliding window rather than across the whole document, and the counts are arranged into a contingency table (a symmetric matrix).
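A minimal sketch of sliding-window co-occurrence counting, in Python. The function name `cooccurrence_counts`, the default window size and the toy token list are illustrative assumptions, not details taken from the slides.

```python
from collections import defaultdict


def cooccurrence_counts(tokens, window=5):
    """Count co-occurrences of distinct terms inside a sliding window.

    `window` is the number of consecutive tokens regarded as one context
    (an assumed parameter). Counts are stored symmetrically, so
    counts[(a, b)] == counts[(b, a)].
    """
    counts = defaultdict(int)
    for i, term in enumerate(tokens):
        # Pair the current token with the following window - 1 tokens.
        for other in tokens[i + 1:i + window]:
            if other != term:
                counts[(term, other)] += 1
                counts[(other, term)] += 1
    return dict(counts)


# Toy usage on an already stemmed token stream (made-up data).
tokens = ["ftp", "server", "unix", "ftp"]
counts = cooccurrence_counts(tokens, window=2)
```

Storing both `(a, b)` and `(b, a)` keeps the resulting table symmetric, which matches the contingency table described above.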

  • Material

    A collection of posts from 20 newsgroups, widely used as a benchmark for text-mining methods
    http://people.csail.mit.edu/jrennie/20Newsgroups/

    comp.windows.x, rec.antiques.radio+phono, rec.sport.hockey, sci.med, talk.religion.misc
    Entropy of within-group frequencies (condition)
    363 automatically selected keywords representing these groups

  • Methods

    Stemming to reduce inflected forms to one representative
    HAL (Hyperspace Analogue to Language)
    Grade correspondence analysis implemented in the GradeStat program

  • HAL

    HAL generates a matrix H in which the cell h_ij is a similarity measure between terms i and j. If s = (t_1, ..., t_k) is a sentence (an ordered list of terms), then h_ij is the sum, over all sentences in a collection of documents, of the co-occurrences of terms i and j.

    Several forms of normalizations are possible.
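The construction of H can be sketched in Python as follows. The slides do not say how individual co-occurrences are weighted or which normalization was used, so the distance weighting `window - dist + 1` (the weighting usually described for HAL), the symmetrization step and the row normalization are all assumptions, as are the function names.

```python
def hal_matrix(sentences, vocab, window=3):
    """Accumulate directed HAL-style weights over all sentences.

    h[i][j] grows whenever term j follows term i within `window`
    positions; closer pairs get a larger weight (assumed weighting).
    """
    index = {term: k for k, term in enumerate(vocab)}
    h = [[0.0] * len(vocab) for _ in vocab]
    for sentence in sentences:
        for pos, term in enumerate(sentence):
            if term not in index:
                continue
            for dist in range(1, window + 1):
                if pos + dist >= len(sentence):
                    break
                other = sentence[pos + dist]
                if other in index:
                    h[index[term]][index[other]] += window - dist + 1
    return h


def symmetrize(h):
    """Add the matrix to its transpose to obtain a symmetric table."""
    n = len(h)
    return [[h[i][j] + h[j][i] for j in range(n)] for i in range(n)]


def row_normalize(h):
    """One possible normalization: scale every non-empty row to sum to 1."""
    result = []
    for row in h:
        total = sum(row)
        result.append([v / total if total else 0.0 for v in row])
    return result


# Toy usage with a single made-up sentence.
vocab = ["ftp", "server", "unix"]
h = hal_matrix([["ftp", "server", "unix"]], vocab, window=2)
h_sym = symmetrize(h)
h_norm = row_normalize(h_sym)
```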

  • Grade Correspondence Analysis

    GCA transforms a data matrix into a probability table and iteratively permutes its rows and columns to make the table more strongly and regularly positively dependent, by maximizing Kendall's tau.
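The idea of permuting rows and columns to increase Kendall's tau can be illustrated with a small hill-climbing sketch in Python. This is not the algorithm implemented in GradeStat: the pairwise-swap search, the function names and the tau variant (twice the concordant minus discordant probability mass, without a correction for ties) are assumptions chosen for brevity.

```python
def kendall_tau_table(table):
    """Kendall's tau of a two-way table treated as a probability table."""
    total = float(sum(sum(row) for row in table))
    m, n = len(table), len(table[0])
    conc = disc = 0.0
    for i in range(m):
        for j in range(n):
            for k in range(i + 1, m):
                for l in range(n):
                    mass = table[i][j] * table[k][l]
                    if l > j:
                        conc += mass   # concordant pair of cells
                    elif l < j:
                        disc += mass   # discordant pair of cells
    return 2.0 * (conc - disc) / (total * total)


def permute(table, rows, cols):
    """Return the table with rows and columns reordered."""
    return [[table[r][c] for c in cols] for r in rows]


def greedy_seriation(table):
    """Swap pairs of rows or columns while tau keeps increasing."""
    rows = list(range(len(table)))
    cols = list(range(len(table[0])))
    best = kendall_tau_table(table)
    improved = True
    while improved:
        improved = False
        for order in (rows, cols):
            for a in range(len(order)):
                for b in range(a + 1, len(order)):
                    order[a], order[b] = order[b], order[a]
                    tau = kendall_tau_table(permute(table, rows, cols))
                    if tau > best + 1e-12:
                        best, improved = tau, True
                    else:
                        # Undo a swap that did not help.
                        order[a], order[b] = order[b], order[a]
    return rows, cols, best


# Toy usage: a shuffled permutation matrix (made-up counts).
shuffled = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
rows, cols, tau = greedy_seriation(shuffled)
ordered = permute(shuffled, rows, cols)
```

Such a local search can stop in a local optimum on real data, so in practice it would be restarted from several random orderings.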

  • Regularity and deviation from it

    In the most regular arrangement possible, the deviation from regularity for each pair of observations or variables can be measured as ar_max - |ar|, where ar is the concentration index of the two distributions describing that pair of observations or variables, and ar_max is the respective maximum concentration index.

  • Overrepresentation maps

    Contingency matrices are visualized here by means of overrepresentation maps. Overrepresentation is defined as follows:
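The defining formula is not preserved in this transcript. The Python sketch below assumes the ratio commonly used for such maps: the observed cell probability divided by the probability expected under independence, p_ij / (p_i. * p_.j). Both the formula and the function name `overrepresentation` are assumptions rather than the authors' stated definition.

```python
def overrepresentation(table):
    """Ratio of observed to expected cell mass under independence.

    Values above 1 mark overrepresented cells, values below 1
    underrepresented ones (assumed definition, see the note above).
    """
    total = float(sum(sum(row) for row in table))
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    result = []
    for i, row in enumerate(table):
        out_row = []
        for j, value in enumerate(row):
            expected = row_sums[i] * col_sums[j] / total
            # Cells with an empty row or column get 0.0 by convention here.
            out_row.append(value / expected if expected else 0.0)
        result.append(out_row)
    return result


# Toy usage on a made-up 2 x 2 contingency table.
over = overrepresentation([[2, 0], [0, 2]])
```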

  • Results

  • Polarization between groups of terms

    Computer-related terms: ftp, server, unix, MIT, Columbia, mac, graphic, video, display, internet
    Political and religious terms: murder, belief, kill, faith, Jewish, moral, hell, death, children, shot, war, fire, arm, defense, absolut, burn, Bible

  • Deviations from regularity

    Are themselves more regular than the original data
    Thus are better descriptors of the position of a term in the dataset

  • Examples of seriation

    [Figure. Labels: religion, war, sport, computers, commerce, general. Clusters: ftp, company, produce, house, april, example, city, baseball, war]

  • Conclusions

    We identified two disjoint groups composed of very specific terms, and a group of terms with varying affinities to these two extremes: a scale obtained through unsupervised learning
    Deviation from regularity in the dataset characterizes terms better than raw co-occurrence data alone

  • Plans for the future

    Deviation from regularity, used as a criterion in outlier detection, might indicate words used inappropriately for their context, neologisms, etc.

  • Thank you for your attention

    http://gradestat.ipipan.waw.pl/english/