Large Scale Topic Detection using Node-Cut Partitioning on Dense Weighted-Graphs
Kambiz Ghoorchian, Šarūnas Girdzijauskas
ghoorian@kth.se
22.06.2016
• Motivation
• Solution
• Results
• Conclusion
What is a Topic (Trending Topic)?

Examples of trending topics: #ChewbaccaMom, #Aylan, #uselections2016, #susanboyle, #Apple, #Wimbledon, #FacebookIsDown, #Superbowl, #Politics, #JobMarket, #Stefanlöfven, #Sport, #Euro2016, #TweetDeck, #FindingDory, #رمضان (#Ramadan), #IranElection, #Immigration, #Russia, #Trump
Why are Topics (Trends) Important?

What is Topic Detection?
Given a large number of documents (e.g., tweets), how can we extract the most frequent (significant) topics (trends)?
Current Solutions
• Statistical Topic Modeling
  • Matrix Factorization
  • Latent Dirichlet Allocation (LDA) [1]
  • Hierarchical LDA (HLDA)
• Machine Learning
  1. Document Modeling
    • Vector Modeling
    • Graph Modeling
  2. Topic Detection
    • Unsupervised - Clustering
    • Supervised - Classification
Document-Term
        W1    W2    W3    W4    …
D1      1     0     1     1     …
D2      0     1     0     1     …
D3      0     0     1     1     …
…
Dn      1     1     0     1     …

Word-Topic
        T1    T2    T3    …     Tk
W1      0.1   0.6   0.01  …     0.2
W2      0.7   0.1   0.1   …     0.02
W3      0.01  0.1   0.4   …     0.4
…
Wm      0.2   0.4   0.4   …     0.0

Document-Topic
        T1    T2    T3    …     Tk
D1      0.1   0.6   0.01  …     0.2
D2      0.7   0.1   0.1   …     0.02
D3      0.01  0.1   0.4   …     0.4
…
Dn      0.2   0.4   0.4   …     0.0

1. David M. Blei, Andrew Y. Ng, Michael I. Jordan: "Latent Dirichlet Allocation", Journal of Machine Learning Research 3(Jan):993-1022, 2003.
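As a concrete illustration of the statistical topic modeling branch, the sketch below fits an LDA model on a handful of toy documents with scikit-learn, producing the Document-Topic and Word-Topic factors shown above. The library, toy corpus, and parameters are illustrative choices, not part of the original presentation.

```python
# Minimal LDA sketch: factor a Document-Term matrix into Document-Topic
# and (Topic-)Word factors, mirroring the matrices shown above.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "apple releases a new phone",
    "wimbledon tennis final today",
    "apple pie is my favourite fruit dessert",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)          # Document-Term matrix (counts)

k = 2
lda = LatentDirichletAllocation(n_components=k, random_state=0)
doc_topic = lda.fit_transform(X)            # Document-Topic matrix (n_docs x k)
topic_word = lda.components_                # Topic-Word matrix (k x vocab size)

words = vectorizer.get_feature_names_out()
for t in range(k):
    top = topic_word[t].argsort()[::-1][:3] # three strongest words per topic
    print(f"Topic {t}:", [words[i] for i in top])
```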
Limitations
• Sparsity
  • Short messages have less informative co-occurrence patterns, which results in [1]:
    1. False segmentation of topics.
    2. Difficulty in identifying ambiguous words (e.g., Apple: the company vs. the fruit).
• Dynamism
  • Constant emergence of new phrases and acronyms
    (e.g., Selfie, Unlike, Phablet, IAVS = I am very sorry, IWSN = I want sex now).
• Scalability
  • 310M active users/month [2]
  • 500M messages/day [2]

[1] Liangjie Hong et al., "Empirical Study of Topic Modeling in Twitter", SOMA 2010.
[2] http://www.statista.com/statistics/282087/number-of-monthly-active-twitter-users/
Solution

Unsupervised learning: 1 - Graph Modeling, 2 - Node-Cut Partitioning

[Pipeline figure: Documents (D1, D2, D3, D4, D5, D6, …) → 1 - Graph Modeling, using a Random Indexing Knowledge Base (Word → RI Vector) → 2 - Node-Cut Partitioning.]

1 - Graph Modeling using Random Indexing
Random Indexing (RI)
• A dimensionality reduction method (similar to hashing).
• Each word is mapped to an RI vector that is:
  1. Unique
  2. Fixed length
  3. Captures the co-occurrence patterns of the words

Random Indexing Knowledge Base
Word   RI Vector
W1     V1 = {a1, b1, c1, d1, e1, f1}
W4     V4 = {a4, b4, c4, d4, e4, f4}
W8     V8 = {a8, b8, c8, d8, e8, f8}
…      …

Documents
D1 = {W1, W4, W8, …}
D2, D3, D4, D5, D6, …
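To make the Random Indexing step concrete, here is a minimal sketch of how such a knowledge base can be built: every word gets a fixed-length, near-unique sparse ternary index vector, and its RI (context) vector accumulates the index vectors of co-occurring words. The dimensionality, number of non-zero seeds, and document-level context window are illustrative assumptions, not the presentation's exact settings.

```python
# Minimal Random Indexing sketch (parameters are illustrative assumptions).
from collections import defaultdict
import numpy as np

DIM, NONZERO = 512, 8                  # fixed vector length, number of +/-1 seeds
rng = np.random.default_rng(0)

def index_vector():
    """Sparse ternary index vector: mostly zeros with a few random +1/-1 entries."""
    v = np.zeros(DIM)
    pos = rng.choice(DIM, size=NONZERO, replace=False)
    v[pos] = rng.choice([-1.0, 1.0], size=NONZERO)
    return v

index_vecs = defaultdict(index_vector)              # word -> fixed index vector
context_vecs = defaultdict(lambda: np.zeros(DIM))   # word -> accumulated RI vector

def update(document_words):
    """Add the index vectors of co-occurring words to each word's RI vector."""
    for w in document_words:
        for c in document_words:
            if c != w:
                context_vecs[w] += index_vecs[c]

update(["apple", "iphone", "release"])
update(["apple", "pie", "fruit"])
print(context_vecs["apple"][:10])                   # RI vector for 'apple'
```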
Graph Modeling

Documents
D1 = {W1, W4, W8, …}
D2 = {W2, W3, W7, …}
D3 = {W4, W1, W3, …}
D4 = {W2, W6, W9, …}
D5 = {W3, W4, W8, …}
D6 = {W1, W3, W7, …}
…

RI - Knowledge Base
Word   RI Vector
W1     V1 = {a1, b1, c1, d1, e1, f1}
W2     V2 = {a2, b2, c2, d2, e2, f2}
W3     V3 = {a3, b3, c3, d3, e3, f3}
W4     V4 = {a4, b4, c4, d4, e4, f4}
W5     V5 = {a5, b5, c5, d5, e5, f5}
W6     V6 = {a6, b6, c6, d6, e6, f6}
W7     V7 = {a7, b7, c7, d7, e7, f7}
W8     V8 = {a8, b8, c8, d8, e8, f8}
…      …

[Figures: each document's word RI vectors are combined into a small weighted subgraph over the vector elements (a, b, c, d, e, f); the per-document subgraphs are then merged into a single dense weighted graph.]
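The construction itself is carried by the figures; as one plausible reading of them, the sketch below builds a weighted graph whose nodes are RI vector dimensions and whose edge weights count how often two dimensions are jointly activated within a document. Treat it as a hedged illustration of "documents → dense weighted graph" under that assumption, not the presentation's exact algorithm.

```python
# Illustrative construction of a dense weighted graph from documents, assuming
# nodes = RI vector dimensions and edge weights = within-document co-activation.
import itertools
import networkx as nx
import numpy as np

def top_dims(vec, k=3):
    """Indices of the k strongest dimensions of an RI vector."""
    return np.argsort(-np.abs(vec))[:k]

def build_graph(documents, context_vecs, k=3):
    G = nx.Graph()
    for words in documents:
        dims = set()                              # dominant dimensions in this document
        for w in words:
            dims.update(top_dims(context_vecs[w], k))
        for u, v in itertools.combinations(sorted(dims), 2):
            prev = G.get_edge_data(u, v, {"weight": 0})["weight"]
            G.add_edge(u, v, weight=prev + 1)     # accumulate co-activation weight
    return G

# Usage, reusing `context_vecs` from the Random Indexing sketch:
# G = build_graph([["apple", "iphone"], ["apple", "pie", "fruit"]], context_vecs)
```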
2 - Node-Cut Partitioning
Node-Cut Partitioning
Ja-Be-Ja-VC [1]: balanced, k-way partitioning for un-weighted graphs based on node-cut minimization.

1. F. Rahimian, A. H. Payberah, S. Girdzijauskas, S. Haridi: "Distributed Vertex-Cut Partitioning", Distributed Applications and Interoperable Systems, 186-200, 2014.
Node-Cut Partitioning

[Figures: example run with k = 2 partitions (C = Blue, C' = Red). Edge colors start from a random initialization; in each iteration, candidate edge pairs e and e' swap colors when the HeatGain utility improves, until a minimum cut size is reached.]
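A schematic of the Ja-Be-Ja-VC-style local search that the figures depict: edges carry partition colors, and pairs of edges repeatedly swap colors when a utility improves, with a temperature ("heat") term that tolerates worse moves early on. The utility below simply counts same-colored edges around an edge's endpoints, and the schedule is simplified; the paper's exact HeatGain formula and sampling policy are not reproduced.

```python
# Schematic edge-color swap loop in the spirit of Ja-Be-Ja-VC (simplified).
import random

def utility(G, colors, e, color):
    """How well edge e fits `color`: count of neighboring edges with that color."""
    u, v = e
    return sum(
        1
        for n in (u, v)
        for f in (tuple(sorted(g)) for g in G.edges(n))
        if f != e and colors[f] == color
    )

def partition(G, k=2, rounds=50, temp=2.0, cooling=0.95):
    edges = [tuple(sorted(e)) for e in G.edges()]
    colors = {e: random.randrange(k) for e in edges}       # random initialization
    for _ in range(rounds):
        for e in edges:
            ep = random.choice(edges)                       # candidate partner e'
            if colors[e] == colors[ep]:
                continue
            before = utility(G, colors, e, colors[e]) + utility(G, colors, ep, colors[ep])
            after = utility(G, colors, e, colors[ep]) + utility(G, colors, ep, colors[e])
            if after * temp > before:                       # heat tolerates worse swaps early
                colors[e], colors[ep] = colors[ep], colors[e]
        temp = max(1.0, temp * cooling)                     # cool down over iterations
    return colors

# Usage: colors = partition(G, k=2)   # G built by the graph-modeling sketch above
```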
Node-Cut Partitioning - Modifications
• Same utility function (HeatGain)
• Weighted gain factor
• Weighted cut
Modifications

[Figures: worked examples comparing candidate swaps on an un-weighted graph (cut sizes 5, 5) against the same graph with edge weights (cut sizes 11, 11 vs. 13, 9 and 12, 10), illustrating how the weighted cut and weighted gain factor change which swap is preferred; the modified algorithm is also discussed with respect to 1. Scalability and 2. Convergence.]
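A minimal sketch of what a weighted variant of the swap utility could look like: instead of counting same-colored neighboring edges it sums their weights, so heavier edges dominate the gain. This is an assumption about how the "weighted gain factor" and "weighted cut" enter the computation, not the formula from the paper.

```python
# Weighted variant of the swap utility (assumed form): sum the weights of
# same-colored neighboring edges instead of counting them.
def weighted_utility(G, colors, e, color):
    u, v = e
    total = 0.0
    for n in (u, v):
        for f in G.edges(n):
            f = tuple(sorted(f))
            if f != e and colors[f] == color:
                total += G.edges[f]["weight"]     # edge weight drives the gain
    return total

# The swap rule stays as in the un-weighted sketch, with weighted_utility
# substituted for utility when deciding whether e and e' exchange colors.
```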
Experiments
1. Accuracy (Quantitative)
  • SNAP Twitter Trending Topics from 2009 [1]
  • EXP1 - 3 Topics
    • 2531 Documents
    • K = 100
    • Sam = 20%
  • EXP2 - 8 Topics
    • 23175 Documents
    • K = 100
    • Sam = 20%
2. Scalability (Qualitative)
  • TREC Tweets 2011 - 16M Tweets [2]
  • EXP3
    • 275336 Documents
SNAP Twitter 2009

Topic                 Acronym   EXP1   EXP2
Harry Potter          HP        1457   —
American Idol         AI        —      4241
Dollhouse             DH        —      1262
Slumdog Millionaire   SM        —      280
Susan Boyle           SB        555    992
Swine Flu             SF        519    1944
Tiger Woods           TW        —      2242
Tweetdeck             TD        —      5860
Wimbledon             WI        —      6354

1. https://snap.stanford.edu/data/
2. http://trec.nist.gov/data/tweets/
Experiments
• Comparison
  • GibbsLDA - baseline [1]
  • BiTerm - best known solution [2]

1. David M. Blei, Andrew Y. Ng, Michael I. Jordan: "Latent Dirichlet Allocation", Journal of Machine Learning Research 3(Jan):993-1022, 2003.
2. Xiaohui Yan, Jiafeng Guo, Yanyan Lan, Xueqi Cheng: "A Biterm Topic Model for Short Texts", WWW '13.
Experiments - Evaluation
• F1-Score (Quantitative): values in [0, 1]
• Average Coherence Score (Qualitative): values in [log(k/n), log(1 + k/n)] ≈ [-∞, 0.000001]
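For reference, a minimal sketch of a UMass-style average coherence computation over each topic's top words, which is the usual form of the metric named above. The document-level co-occurrence counts and the +1 smoothing are assumptions; the exact variant used in the experiments is not specified on the slides.

```python
# UMass-style average topic coherence sketch (assumed variant, +1 smoothing).
import math
from itertools import combinations

def doc_freq(w, docs):
    return sum(1 for d in docs if w in d)

def co_doc_freq(w1, w2, docs):
    return sum(1 for d in docs if w1 in d and w2 in d)

def avg_coherence(topics_top_words, docs):
    """Average over topics of sum over ranked word pairs of
    log((D(wi, wj) + 1) / D(wi)), where wi is ranked above wj."""
    scores = []
    for top_words in topics_top_words:
        s = 0.0
        for wi, wj in combinations(top_words, 2):   # wi precedes wj in the ranking
            s += math.log((co_doc_freq(wi, wj, docs) + 1) / doc_freq(wi, docs))
        scores.append(s)
    return sum(scores) / len(scores)

docs = [{"apple", "iphone"}, {"apple", "pie"}, {"wimbledon", "tennis"}]
print(avg_coherence([["apple", "iphone"], ["wimbledon", "tennis"]], docs))
```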
[Figure: EXP1 - SNAP 3 Topics - F-Score comparison of LDA, BiTerm, and our algorithm.]
[Figure: EXP2 - SNAP 8 Topics - F-Score comparison of LDA, BiTerm, and our algorithm.]
EXP3 - TREC - Coherency
EXP3 - Twitter Large Dataset - Average Coherence Score - K=500

• Tweets: 300K
• Edges: 7.9M
• Vertices: 4000
• Avg_Deg: 3948
• Partitions: 500
• Duration:
  • LDA: 1684s
  • BiTerm: 1973s
  • Our Algorithm: 7000s (Centralized)

Num Top Words    20        10        5
LDA             -637.75   -162.96   -41.52
BiTerm          -597.5    -143.45   -34.3
Our Algorithm   -582.0    -166.15   -49.59
EXP1 - SNAP 3 Topics - Coherency
EXP1 - Twitter 3 Topics - Average Coherence Score - K=100

• Tweets: 2K
• Edges: 2.3M
• Vertices: 3994
• Avg_Deg: 1175
• Partitions: 100
• Duration:
  • LDA: 1.3s
  • BiTerm: 2s
  • Our Algorithm: 6000s (Centralized)

Num Top Words    20       10       5
LDA             -37.94   -15.85   -5.3
BiTerm          -32.05   -12.57   -4.32
Our Algorithm   -20.62   -9.12    -3.25
EXP2 - SNAP 8 Topics - Coherency
EXP2 - Twitter 8 Topics - Average Coherence Score - K=100

• Tweets: 2K
• Edges: 7.5M
• Vertices: 4000
• Avg_Deg: 3779
• Partitions: 100
• Duration:
  • LDA: 7s
  • BiTerm: 24s
  • Our Algorithm: 6000s (Centralized)

Num Top Words    20        10       5
LDA             -162.89   -52.52   -13.88
BiTerm          -141.37   -42.16   -11.15
Our Algorithm   -124.67   -37.24   -9.18
Scalability

[Figure: duration growth rate (percentage).]
Conclusion
• Achievements
  • An efficient and scalable solution for topic detection.
  • Addresses sparsity and dynamism using the RI knowledge base.
  • Achieves scalability using graph partitioning.
• Future work
  • Enhance initialization and language modeling.
  • Extend the algorithm to a streaming model, since graph construction is incremental.
Thank You
Questions?
Bibliography
1. Sahlgren, M. (2005): "An Introduction to Random Indexing", Proceedings of the Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering, TKE 2005, August 16, Copenhagen, Denmark.
2. Kanerva, P.: "Sparse Distributed Memory and Related Models", Associative Neural Memories, Oxford University Press, 1993.
3. Kanerva, P., Kristoferson, J., and Holst, A. (2000): "Random Indexing of Text Samples for Latent Semantic Analysis", Proceedings of the 22nd Annual Conference of the Cognitive Science Society, page 1036, Mahwah, New Jersey: Erlbaum.
4. Johnson, W. and Lindenstrauss, J. (1984): "Extensions of Lipschitz Mappings into a Hilbert Space", Conference on Modern Analysis and Probability (1982: Yale University), volume 26 of Contemporary Mathematics, pages 189-206, American Mathematical Society.
5. K. Ghoorchian, F. Rahimian, S. Girdzijauskas: "Semi Supervised Multiple Disambiguation", Trustcom/BigDataSE/ISPA, 2015 IEEE, 88-95.