Creating and Visualizing Document Classification J. Gelernter, D. Cao, R. Lu, E. Fink, J. Carbonell.

Date posted: 22-Dec-2015


Page 1

Creating and Visualizing

Document Classification

J. Gelernter, D. Cao, R. Lu, E. Fink, J. Carbonell

Page 2

Justification for fuzzy document classification

Fuzzy aims: how can you know exactly what you are looking for when you do not know the possibilities? This is the “anomalous state of knowledge” (Belkin et al., 1982).

So fuzzy clusters reflect the searcher's cognitive state.

Page 3

Research overview

Hypothesis: Fuzzy results clustering and visualization should save time by directing searchers to the level of results that they wish to view (rather than breaking off arbitrarily at screen bottom)

…in a prototype digital library for paleontology

Page 4

Talk overview

Background: classification is often done with algorithms alone, de-emphasizing document representation

Approach:

* Facets and browse categories

* Metadata generation

* Classifier algorithms

* Visualization: labels and color grid

Findings from paleontologist experiments:

* positive response to our fuzzy classification

* muted response to our fuzzy visualization

Page 5

Background: fuzzy clustering

Text classification is well researched (see the review by Sebastiani, 2002). Its results depend on the algorithm used (k-nearest neighbor, naïve Bayes, support vector machines, etc.) and on the document representation (bag of words, or with natural language processing factors).

Our work differs from others’ in its emphasis on document representation, which we hoped would provide greater precision.

Page 6

Background: fuzzy info visualization

“Research in visualisation of fuzzy systems is still at an early stage” (Pham and Brown, 2003)

-- location on the page, with the top being most relevant (see left, ours)

-- 3D

-- icons (see left)

-- color gradations, with dark being most relevant (ours)

Page 7

Pre-set queries: facets based on user needs

Page 8

Queries are supported by controlled vocabulary, or ontology

Page 9

Metadata generation: classification according to article rhetoric (could be improved)

Page 10

Knowledge engineering, rather than machine learning, for a small document set

Rules for finding matches of document to query

Example: Ma [number], Mya [number], Myr [number], or B.P. [number] in a document is matched to the associated time periods
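A rule of this kind could be sketched in Python as a regular expression plus a lookup table. This is only an illustration under stated assumptions, not the system's actual rule set: the function names are hypothetical, the period boundaries are rounded approximations, and the pattern accepts the number either before or after the unit because the slide does not say which order the rule expects.

```python
import re

# Approximate period boundaries in millions of years ago (Ma).
# Rounded, illustrative values; NOT the ontology used in the talk.
PERIODS = [
    ("Triassic", 252.0, 201.0),
    ("Jurassic", 201.0, 145.0),
    ("Cretaceous", 145.0, 66.0),
]

# Multipliers that convert each unit to millions of years.
UNIT_TO_MA = {"ma": 1.0, "mya": 1.0, "myr": 1.0, "bp": 1e-6}

# A number such as 150, 3.5, or 10,000.
NUMBER = r"(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d+)?"
# A unit that is not the start of a longer word (so "Mammal" is ignored).
UNIT = r"(?:Mya|Myr|Ma|B\.?P\.?)(?![A-Za-z])"

# Assumption: accept both "Ma 150" and "150 Ma".
AGE_PATTERN = re.compile(
    rf"\b(?:(?P<u1>{UNIT})\s*(?P<n1>{NUMBER})|(?P<n2>{NUMBER})\s*(?P<u2>{UNIT}))"
)


def find_ages_ma(text):
    """Return every age mentioned in `text`, converted to millions of years."""
    ages = []
    for m in AGE_PATTERN.finditer(text):
        unit = (m.group("u1") or m.group("u2")).replace(".", "").lower()
        number = (m.group("n1") or m.group("n2")).replace(",", "")
        ages.append(float(number) * UNIT_TO_MA[unit])
    return ages


def periods_for(text):
    """Return the names of the periods whose range contains a mentioned age."""
    matched = []
    for age in find_ages_ma(text):
        for name, start, end in PERIODS:
            if end <= age <= start and name not in matched:
                matched.append(name)
    return matched
```

With these assumed boundaries, a document containing "150 Ma" would be associated with the Jurassic, and one containing "Mya 70" with the Cretaceous.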

Rules for clustering documents into fuzzy categories (requires metadata generation)

Example: *** Highly relevant if match found in title or abstract

** Relevant if match found in caption…
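The two rules quoted above could be sketched in Python as a field-to-stars lookup. This is a hedged illustration only: the document layout (a dict with `id`, `title`, `abstract`, and `caption` keys) and the function names are hypothetical, and because the slide elides the remaining rules, a match anywhere else is simply left unrated here.

```python
# Star levels from the two rules shown on the slide.
# The remaining rules are elided there, so no other fields are scored.
FIELD_STARS = {"title": 3, "abstract": 3, "caption": 2}


def relevance_stars(document, term):
    """Return the highest star level whose field contains `term`, or 0."""
    term = term.lower()
    best = 0
    for field, stars in FIELD_STARS.items():
        # Missing fields are treated as empty text.
        if term in document.get(field, "").lower():
            best = max(best, stars)
    return best


def cluster_by_relevance(documents, term):
    """Group document ids by star level; unrated documents are omitted."""
    clusters = {}
    for doc in documents:
        stars = relevance_stars(doc, term)
        if stars:
            clusters.setdefault(stars, []).append(doc["id"])
    return clusters
```

A document matching in both the title and a caption lands in the three-star cluster, since the highest applicable level is kept.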

Page 11

To solve the problem of showing uncertainty clusters in a familiar list

Page 12

To solve the problem of showing more results per screen as well as showing clusters

Page 13

Participant experiments (algorithm testing)

Participants: 3 paleontologists (an undergraduate, a graduate student, and a museum curator)

Method: Compare the classifications made by people and by the system for the same articles

• Sample: 30 articles, a mix of training-set and non-training-set articles, from 3 categories: ginkgo (3 levels of relevancy), Allosaurus (3 levels of relevancy), neither

RESULTS: 70% agreed with at least 1/3 of participant ratings
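One possible reading of the result above is that the system's label for an article counts as agreeing when it matches at least one third of the participants' ratings for that article. A minimal Python sketch of that computation follows. The function name, label strings, and toy ratings are all hypothetical; they are not the study's data, and the slide does not spell out the exact agreement measure.

```python
from fractions import Fraction


def agreement_rate(system_labels, participant_labels, min_share=Fraction(1, 3)):
    """Share of articles whose system label matches at least `min_share`
    of the participant labels for that article."""
    agreed = 0
    for article, system_label in system_labels.items():
        ratings = participant_labels[article]
        matches = sum(1 for rating in ratings if rating == system_label)
        if Fraction(matches, len(ratings)) >= min_share:
            agreed += 1
    return agreed / len(system_labels)


# Toy data (invented for illustration): one system label and
# three participant labels per article.
toy_system = {
    "a1": "ginkgo-3",
    "a2": "allosaurus-2",
    "a3": "neither",
    "a4": "ginkgo-1",
}
toy_participants = {
    "a1": ["ginkgo-3", "ginkgo-2", "ginkgo-3"],
    "a2": ["allosaurus-1", "allosaurus-1", "neither"],
    "a3": ["neither", "neither", "ginkgo-1"],
    "a4": ["ginkgo-1", "ginkgo-2", "ginkgo-2"],
}
```

With three raters, a one-third threshold means a single matching rating is enough, so this measure is lenient compared with requiring a majority.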

Page 14

Participant experiments (interface testing)

• Pilot testing with a paleontologist in our group

• Paleontology conferences:

– Spring 2009 NAPC (North American Paleontological Convention): 17 returned

– Fall 2009 SVP (Society of Vertebrate Paleontology): ask 3 graduate or undergraduate students in paleontology to classify the articles; results not yet returned

• Questionnaires:

– Spring questionnaire: design focus

– Fall questionnaire: comparative focus (features as well as design)

RESULTS: 58.8% liked our labels; 35.7% liked our grid

Page 15

Future directions

To improve fuzzy classification:

adapt the CiteSeer parsing algorithm to improve our classification

To improve visualization:

list view with labels and colors for uncertainty levels

Page 16

Contributions in summary

(1) Fuzzy result groupings represent the “fuzzy” concept of the search aim as it exists in the user’s mind, so uncertainty labels are appreciated

(2) Fuzzy color blocks that represent abstract categories are not liked; stick to minor modifications of the familiar list

Page 17

References

Belkin, N.J., Oddy, R.N. and Brooks, H.M. (1982). ASK for information retrieval. Part I: Background and theory; Part II: Results of a design study. Journal of Documentation, vol. 38, no. 2 & 3, pp. 61-71, 145-164.

Pham, B. & Brown, R. (2003). Analysis of visualization requirement for fuzzy systems. Proceedings of the 1st international conference on computer graphics and interactive techniques in Australasia and South East Asia, Melbourne, Australia, 181 ff.

Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34 (1), 1-47.