Grade clustering and seriation of words based on their co-occurrences
Emilia Jarochowska & Krzysztof Ciesielski
Institute of Computer Science, Poland
-
Summary
Using data on term co-occurrence extracted from a newsgroup sample, we seek the most regular arrangement of the terms and show how the resulting pattern allows convenient visualization and clustering.
-
Clustering of documents and terms: what for?
Improving and grouping search results
Finding synonyms: construction of thesauri, query expansion based on the synonyms of the entered terms
Finding collocations
-
The common approach to term clustering
Association matrices which quantify term correlations
This global approach does not necessarily adapt well to the local context
-
The local approach
Co-occurrence is identified within a sliding window instead of the whole document and arranged into a contingency table (a symmetric matrix).
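A minimal sketch of the sliding-window idea, under stated assumptions: the function names, the window size and the toy token list are illustrative and do not come from the slides. It counts each pair of distinct terms that appear within `window` positions of each other and stores the counts symmetrically.

```python
from collections import defaultdict


def window_cooccurrence(tokens, window=5):
    """Count symmetric co-occurrences of terms within a sliding window.

    `window` (an assumed parameter) is the maximum distance, in tokens,
    between two positions that still count as co-occurring.
    """
    counts = defaultdict(int)
    for i, left in enumerate(tokens):
        # Pair the token at position i with the next `window` tokens.
        for right in tokens[i + 1 : i + 1 + window]:
            if left == right:
                continue  # this sketch ignores self-pairs
            counts[(left, right)] += 1
            counts[(right, left)] += 1  # keep the table symmetric
    return counts


def to_matrix(counts, vocabulary):
    """Arrange the pair counts into a square table indexed by `vocabulary`."""
    return [[counts.get((a, b), 0) for b in vocabulary] for a in vocabulary]


# Toy input, not taken from the newsgroup data.
tokens = "ftp server unix ftp server display".split()
vocab = sorted(set(tokens))
matrix = to_matrix(window_cooccurrence(tokens, window=2), vocab)
```

A narrow window keeps only close neighbours, which is what makes the table reflect local context rather than whole-document association.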
-
Material
A collection of posts from 20 newsgroups, widely used as a benchmark for text-mining methods
http://people.csail.mit.edu/jrennie/20Newsgroups/
comp.windows.x
rec.antiques.radio+phono
rec.sport.hockey
sci.med
talk.religion.misc
Entropy of within-group frequencies (condition)
363 automatically selected keywords representing these groups
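The slide names entropy of within-group frequencies as the selection condition but gives no further detail. The sketch below shows one way such a condition could work, assuming that a term concentrated in few groups (low entropy) is a good keyword. The threshold, the counts and the function name are hypothetical.

```python
import math


def within_group_entropy(freqs):
    """Shannon entropy (in bits) of a term's frequency distribution over groups."""
    total = sum(freqs)
    if total == 0:
        return 0.0
    probs = [f / total for f in freqs if f > 0]
    return -sum(p * math.log2(p) for p in probs)


# Hypothetical term counts in five groups (not from the original data).
group_counts = {
    "hockey": [0, 0, 40, 0, 0],   # occurs in one group only
    "example": [8, 8, 8, 8, 8],   # spread evenly over all groups
}

# Assumed rule: keep terms whose entropy falls below an arbitrary threshold.
keywords = [t for t, f in group_counts.items() if within_group_entropy(f) < 1.0]
```

A term that occurs in a single group has entropy 0, while a term spread evenly over five groups reaches the maximum of log2(5) bits.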
-
Methods
Stemming to reduce inflected forms to one representative
HAL (Hyperspace Analogue to Language)
Grade correspondence analysis implemented in the GradeStat program
-
HAL
HAL generates a matrix H in which the cell h_ij measures the similarity of terms i and j. If s = (t_1, ..., t_k) is a sentence (an ordered list of terms), then h_ij is the sum, over all sentences in a collection of documents, of the co-occurrences of terms i and j.
Several forms of normalization are possible.
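The definition above can be sketched as follows. This is an illustrative reading, not the authors' implementation: the window size, the handling of out-of-vocabulary words and the two normalizations shown are assumptions, since the slides do not say which normalization was used.

```python
def hal_matrix(sentences, vocabulary, window=5):
    """Sum co-occurrences of vocabulary terms over all sentences."""
    index = {term: k for k, term in enumerate(vocabulary)}
    n = len(vocabulary)
    H = [[0.0] * n for _ in range(n)]
    for sentence in sentences:
        # Assumption: words outside the vocabulary are dropped before windowing.
        terms = [t for t in sentence if t in index]
        for pos, t_i in enumerate(terms):
            for t_j in terms[pos + 1 : pos + 1 + window]:
                i, j = index[t_i], index[t_j]
                if i != j:
                    H[i][j] += 1.0
                    H[j][i] += 1.0  # symmetric table
    return H


def normalize_total(H):
    """Divide by the grand total so that all cells sum to 1."""
    total = sum(map(sum, H))
    return [[h / total for h in row] for row in H] if total else H


def normalize_rows(H):
    """Divide each row by its own sum (rows of zeros stay zero)."""
    out = []
    for row in H:
        s = sum(row)
        out.append([h / s if s else 0.0 for h in row])
    return out


# Toy sentences, not taken from the newsgroup data.
sentences = [["ftp", "server", "unix"], ["ftp", "server"]]
vocabulary = ["ftp", "server", "unix"]
H = hal_matrix(sentences, vocabulary, window=2)
P = normalize_total(H)
```

Normalizing by the grand total yields a probability table, which is the form a data matrix takes at the start of grade correspondence analysis.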
-
Grade Correspondence Analysis
GCA transforms a data matrix into a probability table and iteratively permutes its rows and columns to make the table more strongly and regularly positively dependent by maximizing Kendall's tau.
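To make the idea of reordering by Kendall's tau concrete, here is a naive local-search sketch. It is not the algorithm used in GradeStat, which the slides do not describe. It assumes a simple, tie-uncorrected tau computed directly on the table (concordant minus discordant probability mass) and accepts swaps of adjacent rows or columns only when they raise tau. All function names are illustrative.

```python
def kendall_tau(table):
    """Tie-uncorrected Kendall's tau of a two-way table of counts or probabilities."""
    total = float(sum(map(sum, table)))
    n, m = len(table), len(table[0])
    conc = disc = 0.0
    for i in range(n):
        for j in range(m):
            for k in range(i + 1, n):
                for l in range(m):
                    w = table[i][j] * table[k][l]
                    if l > j:
                        conc += w  # second cell lies below and to the right
                    elif l < j:
                        disc += w  # second cell lies below and to the left
    return 2.0 * (conc - disc) / (total * total)


def permuted(table, rows, cols):
    """Return the table with rows and columns in the given orders."""
    return [[table[r][c] for c in cols] for r in rows]


def reorder_by_tau(table, max_sweeps=100):
    """Greedy adjacent swaps of rows and columns that increase tau."""
    orders = [list(range(len(table))), list(range(len(table[0])))]
    best = kendall_tau(table)
    for _ in range(max_sweeps):
        improved = False
        for axis in (0, 1):  # 0: rows, 1: columns
            for a in range(len(orders[axis]) - 1):
                trial = [list(orders[0]), list(orders[1])]
                trial[axis][a], trial[axis][a + 1] = trial[axis][a + 1], trial[axis][a]
                tau = kendall_tau(permuted(table, trial[0], trial[1]))
                if tau > best + 1e-12:
                    orders, best, improved = trial, tau, True
        if not improved:
            break
    return orders[0], orders[1], best


# Toy table: a diagonal pattern with shuffled rows.
table = [[0, 0, 1], [1, 0, 0], [0, 1, 0]]
rows, cols, tau = reorder_by_tau(table)
seriated = permuted(table, rows, cols)
```

The four nested loops make this practical only for small tables, and a greedy search of this kind can stop at a local maximum on real data.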
-
Regularity and deviation from it
In the most regular arrangement possible, the deviation from regularity for each pair of observations or variables can be measured as:
ar_max - |ar|
where ar is the concentration index of the two distributions describing that pair of observations or variables, and ar_max is the corresponding maximum concentration index.
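The slides do not spell out how the concentration index is computed. The sketch below assumes the usual construction from grade data analysis: ar equals 1 minus twice the area under the concentration curve of one distribution relative to the other, and the maximum concentration index is obtained after sorting the categories by increasing likelihood ratio. The function names and the example values are illustrative.

```python
def concentration_index(p, q):
    """ar = 1 - 2 * (area under the concentration curve of q relative to p).

    p and q are two distributions over the same categories, given in the
    current category order. They need not be pre-normalized.
    """
    sp, sq = float(sum(p)), float(sum(q))
    area = 0.0
    y = 0.0  # cumulative share of q so far
    for pi, qi in zip(p, q):
        dx, dy = pi / sp, qi / sq
        area += dx * (y + dy / 2.0)  # trapezoid under one curve segment
        y += dy
    return 1.0 - 2.0 * area


def max_concentration_index(p, q):
    """Concentration index after sorting categories by increasing q/p."""
    def ratio(pq):
        return pq[1] / pq[0] if pq[0] > 0 else float("inf")

    ps, qs = zip(*sorted(zip(p, q), key=ratio))
    return concentration_index(ps, qs)


def deviation_from_regularity(p, q):
    """Maximum concentration index minus |ar|; zero for a perfectly regular pair."""
    return max_concentration_index(p, q) - abs(concentration_index(p, q))


# Two hypothetical rows of a co-occurrence table in the current column order.
p = [1, 1, 1, 1]
q = [4, 1, 3, 2]
dev = deviation_from_regularity(p, q)
```

For this pair ar is -0.1 and the maximum concentration index is 0.25, so the deviation is 0.15. A pair whose categories are already ordered by likelihood ratio has deviation 0.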
-
Overrepresentation maps
Contingency matrices are visualized here by means of overrepresentation maps. Overrepresentation is defined as follows:
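The formula from this slide is not preserved in the transcript. A definition commonly used in grade data analysis, assumed here, is the ratio of each observed cell probability to the value expected under independence of rows and columns. The sketch below computes that ratio. The function name and the example table are illustrative.

```python
def overrepresentation(table):
    """Ratio of observed cell probability to the product of its marginals.

    Values above 1 mark overrepresented cells and values below 1 mark
    underrepresented ones. Cells with a zero marginal are set to 0.
    """
    total = float(sum(map(sum, table)))
    row_p = [sum(row) / total for row in table]
    col_p = [sum(col) / total for col in zip(*table)]
    return [
        [(cell / total) / (r * c) if r * c > 0 else 0.0 for cell, c in zip(row, col_p)]
        for row, r in zip(table, row_p)
    ]


# Toy contingency table.
ratios = overrepresentation([[2, 1], [1, 2]])
```

In an overrepresentation map each cell would then be shaded according to this ratio.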
-
Results
-
Polarization between groups of terms
Computer-related terms: ftp, server, unix, MIT, Columbia, mac, graphic, video, display, internet
Political and religious terms: murder, belief, kill, faith, Jewish, moral, hell, death, children, shot, war, fire, arm, defense, absolut, burn, Bible
-
Deviations from regularity
Are themselves more regular than the original data
Are thus better descriptors of the position of a term in the dataset
-
Examples of seriation
[Figure labels: religion, war, sport, computers, commerce, general]
Clusters
[Figure labels: ftp, company, produce, house, april, example, city, baseball, war]
-
Conclusions
We identified two disjoint groups composed of very specific terms and a group of terms with varying affinities to these extremes: a scale obtained through unsupervised learning
Deviation from regularity in the dataset characterizes terms better than co-occurrence data alone
-
Plans for the future
Deviation from regularity, used as a criterion in outlier detection, might indicate words used inappropriately for their context, neologisms, etc.
-
Thank you for your attention
http://gradestat.ipipan.waw.pl/english/