Understanding Text Corpora with Multiple Facets Lei Shi, Furu Wei, Shixia Liu, Xiaoxiao Lian, Li Tan...

18
Understanding Text Corpora with Multiple Facets Lei Shi, Furu Wei, Shixia Liu, Xiaoxiao Lian, Li Tan and Michelle X. Zhou IBM Research

Transcript of Understanding Text Corpora with Multiple Facets Lei Shi, Furu Wei, Shixia Liu, Xiaoxiao Lian, Li Tan...

Understanding Text Corpora with Multiple Facets

Lei Shi, Furu Wei, Shixia Liu, Xiaoxiao Lian, Li Tan and Michelle X. Zhou

IBM Research

Emergency Room Records

Hotel Reviews

Intelligence Reports

Email Documents

Financial News/Blogs/Message Boards

Outline Problem & Related Work Multi-Facet Text Data Model and Text Processing

– Data model

– Text pre-processing

– Content summarization

Visualization– Metaphor

– Creation algorithm

– Interactions

Video Demo

Problem & Related Work It’s challenging to build a visual analytics

tool to explain multi-faceted text corpora!– How to combine the raw text data with rich

text analytics result for visualization?

– What visual metaphors to apply to effectively illustrate text content, evolution and facet correlations?

– How to customize interactions to assist user in data navigation and other visual analytics task?

Related work– Text trend visualization

• ThemeRiver, NameVoyager, etc.

– Text content visualization• Tag cloud, Wordle, PhraseNet, etc.

– Text entity pattern visualization• TileBars, Jigsaw, FeatureLens, Takmi, etc.

– Text visualization in specific domains• Themail@email, TileBars@search,

Multi-Facet Data Model and Text Pre-Processing

Multi-Facet Data Model for Text Corpora -- – Time Facet

• Explicit field or extracted from raw text

– Category Facet• Topic modeling by Latent Dirichlet Allocation (LDA, Blei et al. 2003)• Category labels from document classification/clustering• Leverage other nominal structured information (hotel names, countries, etc.)

– Unstructured (Content) Facets• Inherent multiple text fields• Multiple facets from NE extraction (people, location, organization) or POS

parsing (Noun, Verbs, Adjective)

– Structured Facets• Categorical, numerical or nominal data fields• Other calculated categorical value (sentiment orientations, average ratings)

TFFD su ,,,

Content Facet Summarization

A set of topics {T1, …Ti,… TN }

A set of keywords

{W1, …, Wj, …, WM}

A set of topic probabilities

{…, P(Ti | Dk), …}

A set of word probabilities

{…, P(Wj | Ti), …}

kth document in the collection

Rank the topics to present most valuable ones first

Select keyword sub-set for each time segment for content summary

{…} t-1, {…, Wj, …}t, {…} t+1,

Doc-topic dist.

Doc lengthDoc no.

Content Facet Summarization Topic/category re-ranking by topic coverage and variance: find the most

active topic with significant variety– Topic coverage:

– Topic variance:

– Balancing two metrics:

Keyword re-ranking– Topic keyword re-ranking:

– Time-sensitive keyword re-ranking: preserve completeness and distinctiveness

• Completeness: cover the original keywords of a topic

• Distinctiveness: distinguish one time segment from another

Topic-keyword distribution

Topic number

System Architecture

Text Summarization

Text Preprocessing

Text content + meta data

Visualization

Text collection

UserInteraction

Summarizationresults

Visualization Metaphors Multi-stack trend visualization + Time-sensitive tag clouds

– Vis-data mappings: time facet – x (time) axis, category facet – stack, unstructured facets – tag clouds, structured facet – keyword style (color/font)

– Other mappings: document count – y axis, re-ranked occurrence count -- keyword size

Category Facet

Time

Unstructured Facets

Structured Facets

Keywords Layout Keyword layout with the sweep-line greedy algorithm

Interactions Temporal zooming for time facet navigation Topic editing for category facet navigation Unstructured facet navigation panel Structured facet mapping Other customized interactions: topic focus-in-context view

Focus-In-Context View Calculation Constraints for detailed trend view

– Contour-preserving

– Flexible space control

– All topic trends as undistorted as possible

1D fisheye distortion– Height calculation for expanded trend

– Order-preserving height adjustment

– Apply fisheye distortion from the center line of selected topic

Video Demo

Visual Analytics for Emergency Room Record

18

Thank You

MerciGrazie

Gracias

Obrigado

Danke

Japanese

English

French

Russian

German

Italian

Spanish

Brazilian PortugueseArabic

Traditional Chinese

Simplified Chinese

HindiTamil

Thai

Korean