SIGIR Paper Presentation (Sergio)

Efficient and Scalable MetaFeature-based Document Classification using Massively Parallel Computing
Sergio Canuto, Marcos André Gonçalves, Wisllay Santos, Thierson Rosa, Wellington Martins
[email protected]

description

Slides presenting the work on efficient and scalable meta-feature-based document classification, using a massively parallel (GPU) kNN implementation.

Transcript of the SIGIR paper presentation (Sergio)

Page 1

Efficient and Scalable MetaFeature-based Document Classification using Massively Parallel Computing

Sergio Canuto, Marcos André Gonçalves, Wisllay Santos, Thierson Rosa, Wellington Martins

[email protected]


Page 2

Automatic Text Classification (ATC)

- ATC goal: find a function F : X → Y, where X = R^d is the input space (a news article, webpage, tweet, etc.) and Y = {1, 2, ..., m} is the set of categories.
- Given:
  - A set of training examples {x_i | x_i ∈ R^d}.
  - For each training instance, its category y_i ∈ {1, 2, ..., m}.
- ATC with meta-level features:
  - Transform the original feature space X (bag-of-words) into a new one.
  - The new space M is potentially smaller and more informed.
- Our goal changes to finding a function F : M → Y (a minimal sketch follows).
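To make the setup concrete, here is a small sketch (ours, not from the paper) of learning F : M → Y once a meta-feature matrix is available; the random arrays stand in for real meta-features, and scikit-learn's LinearSVC (which wraps Liblinear, the library used later in the experiments) stands in for the authors' exact setup:

    import numpy as np
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)
    M_train = rng.random((100, 12))         # placeholder meta-feature matrix M
    y_train = rng.integers(0, 3, size=100)  # placeholder categories {0, 1, 2}

    clf = LinearSVC().fit(M_train, y_train)  # learn F : M -> Y
    print(clf.predict(M_train[:5]))          # predicted categories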


Page 3

Distance-based Meta-features

- Global information: distance between a test example and a class centroid.
- Local information: distance between a test example and each one of its k nearest neighbors.
- They use Cosine, Euclidean and Manhattan distances [1] (a minimal sketch follows the reference below).

[1] S. Gopal and Y. Yang. Multilabel classification with meta-level features. In Proc. SIGIR, pages 315–322, 2010
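As a sequential illustration (not the paper's GPU code), the sketch below builds the distance-based meta-feature vector for one test example; the helper names distances and make_meta_features are ours, and the neighbor search is a naive dense scan:

    import numpy as np

    def distances(a, b):
        # Cosine, Euclidean and Manhattan distances between two vectors.
        cos = 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
        euc = float(np.linalg.norm(a - b))
        man = float(np.abs(a - b).sum())
        return [cos, euc, man]

    def make_meta_features(x, X_train, y_train, k=5):
        feats = []
        # Global information: distances from x to each class centroid.
        for c in np.unique(y_train):
            feats += distances(x, X_train[y_train == c].mean(axis=0))
        # Local information: distances from x to its k nearest neighbors.
        for i in np.argsort(np.linalg.norm(X_train - x, axis=1))[:k]:
            feats += distances(x, X_train[i])
        return np.array(feats)

    rng = np.random.default_rng(0)
    X_train = rng.random((50, 8))
    y_train = rng.integers(0, 3, size=50)
    meta = make_meta_features(X_train[0], X_train, y_train)
    print(meta.shape)   # (24,) = (3 classes + 5 neighbors) * 3 distances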



Page 5

Meta-feature Generation Problems

- An efficient meta-feature generator is very important, since the meta-features have to be generated at classification time.
- However, kNN has slow classification time, and kNN-based meta-features inherit this poor performance.
- For textual datasets the performance problem is exacerbated, since the kNN algorithm has to run on high-dimensional data.


Page 6

GPU-based Meta-feature Generation

- Both the kNN library used in [1] and the state-of-the-art parallel kNN implementation use a D×V matrix, with D training documents and V features.
- Our kNN GPU implementation considers the high dimensionality and heterogeneity of the representation of text documents.
- It takes advantage of Zipf's law.
- The parallel implementation makes the generation of meta-features feasible for big datasets (see the estimate below).

[1] S. Gopal and Y. Yang. Multilabel classification with meta-level features. In Proc. SIGIR, pages 315–322, 2010
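A back-of-the-envelope estimate (ours, not a number from the paper) of why the dense D×V layout breaks down on the larger datasets, assuming the Density column of the setup table is the average number of non-zero terms per document:

    # Back-of-the-envelope only: D, V and density come from the setup
    # table; the byte counts per entry are assumptions.
    D = 861_454       # documents in MED
    V = 803_358       # attributes in MED
    density = 31.805  # assumed: average non-zero terms per document

    dense_gb = D * V * 4 / 2**30           # float32 D x V matrix
    sparse_gb = D * density * 8 / 2**30    # ~8 bytes per posting entry
    print(f"dense:  {dense_gb:,.0f} GB")   # thousands of GB
    print(f"sparse: {sparse_gb:,.2f} GB")  # well under one GB

Under these assumptions the dense matrix needs terabytes while the postings fit in a fraction of a gigabyte, which is consistent with the memory-consumption table on page 14.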


Page 7

Inverted Index Implementation

- Compact representation of the inverted index in the GPU memory, illustrated on a toy collection:

Document Collection:
  d1: t1 t3 t4
  d2: t2 t5
  d3: t1
  d4: t2
  d5: t1 t3 t5

E (entries, one per term occurrence, in document order):
  (d1,t1) (d1,t3) (d1,t4) (d2,t2) (d2,t5) (d3,t1) (d4,t2) (d5,t1) (d5,t3) (d5,t5)

count (number of documents containing each term, t1..t5):
  3 2 2 1 2

index (exclusive prefix sum of count; where each term's posting list starts):
  0 3 5 7 8

invertedIndex (the entries of E grouped by term):
  t1: d1 d3 d5 | t2: d2 d4 | t3: d1 d5 | t4: d1 | t5: d2 d5
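A minimal host-side sketch (illustrative; the paper builds the structure for the GPU) that assembles count, index and invertedIndex for the toy collection above:

    docs = {"d1": ["t1", "t3", "t4"], "d2": ["t2", "t5"], "d3": ["t1"],
            "d4": ["t2"], "d5": ["t1", "t3", "t5"]}
    terms = sorted({t for ts in docs.values() for t in ts})  # t1..t5

    # count[j]: number of documents containing term j.
    count = [sum(t in ts for ts in docs.values()) for t in terms]

    # index[j]: start of term j's posting list (exclusive prefix sum).
    index, acc = [], 0
    for c in count:
        index.append(acc)
        acc += c

    # Scatter every (document, term) entry of E into its term's segment.
    inverted = [None] * acc
    cursor = list(index)
    for d, ts in sorted(docs.items()):    # document order, as in E
        for t in ts:
            j = terms.index(t)
            inverted[cursor[j]] = d
            cursor[j] += 1

    print(count)     # [3, 2, 2, 1, 2]
    print(index)     # [0, 3, 5, 7, 8]
    print(inverted)  # ['d1','d3','d5','d2','d4','d1','d5','d1','d2','d5']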


Page 8

Calculating the Distances

- For each query, we generate a reduced logical array from the full inverted index (df, index and invertedIndex are the arrays from the previous slide; a code sketch follows the example):

query q: t1 t3 t4

dfq (parallel copy of the query terms' document frequencies from df):
  t1: 3, t3: 2, t4: 1

startq (parallel copy of the query terms' posting-list starts from index):
  t1: 0, t3: 5, t4: 7

indexq (prefix sum of dfq; where each query term's segment ends in Eq):
  t1: 3, t3: 5, t4: 6

Logical array Eq (only the postings of the query terms):
  (d1,t1) (d3,t1) (d5,t1) (d1,t3) (d5,t3) (d1,t4)
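Continuing the previous sketch (it reuses terms, count, index and inverted built there), here is a sequential version of this query flow; on the GPU the entries of Eq go to different threads, and the per-document accumulation below simply counts matching terms where a real system would add term-weight contributions:

    from itertools import accumulate
    from collections import defaultdict

    def build_logical_array(query, terms, count, index, inverted):
        qpos = [terms.index(t) for t in query]
        dfq = [count[j] for j in qpos]        # copy of the terms' df values
        startq = [index[j] for j in qpos]     # copy of the posting starts
        indexq = list(accumulate(dfq))        # prefix sum: segment ends in Eq
        Eq = []                               # reduced logical array
        for s, n in zip(startq, dfq):
            Eq.extend(inverted[s:s + n])
        return dfq, startq, indexq, Eq

    def accumulate_matches(Eq):
        # One accumulator per candidate document; on the GPU each entry
        # of Eq is handled by a different thread with atomic updates.
        acc = defaultdict(int)
        for d in Eq:
            acc[d] += 1    # a real system adds a term-weight product here
        return dict(acc)

    dfq, startq, indexq, Eq = build_logical_array(
        ["t1", "t3", "t4"], terms, count, index, inverted)
    print(dfq, startq, indexq)     # [3, 2, 1] [0, 5, 7] [3, 5, 6]
    print(accumulate_matches(Eq))  # {'d1': 3, 'd3': 1, 'd5': 2}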


Page 9

Calculating the Distances

- The elements of the logical array are distributed evenly among the GPU cores to parallelize the distance calculations.
- To sort the distances, we use a truncated bitonic sort (a plain bitonic sort is sketched below).
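For reference, a compact sequential bitonic sort (the standard network; the paper's truncated variant stops the network early once only the k smallest distances are needed, a detail this sketch replaces by sorting fully and slicing):

    import math

    def bitonic_sort(a):
        # In-place ascending bitonic sort; len(a) must be a power of two.
        n = len(a)
        k = 2
        while k <= n:
            j = k // 2
            while j >= 1:
                for i in range(n):
                    p = i ^ j              # partner of element i
                    up = (i & k) == 0      # direction of this block
                    if p > i and (a[i] > a[p]) == up:
                        a[i], a[p] = a[p], a[i]
                j //= 2
            k *= 2
        return a

    # Pad the distances to a power of two and keep the k smallest.
    dists = [0.42, 0.11, 0.93, 0.27, 0.64, 0.08]
    padded = dists + [math.inf] * (8 - len(dists))
    print(bitonic_sort(padded)[:3])   # [0.08, 0.11, 0.27]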


Page 10

Experimental Setup

- Evaluation: efficiency measured as the wall time of each experiment; effectiveness measured with MicroF1 and MacroF1.
- Software: Liblinear 1.92, Ubuntu Server 12.04.
- Hardware: Intel Core i7-870, 16 GB RAM, NVIDIA Tesla K40.

- General information on the datasets:

Dataset   Classes  # attributes  # documents  Density  Space on disk
4UNI            7        40,194        8,274  140.325          14 MB
20NG           20        61,049       18,766  130.780          30 MB
ACM            11        59,990       24,897   38.805         8.5 MB
REUT90         90        19,589       13,327   78.1646         13 MB
MED             7       803,358      861,454   31.805         327 MB
RCV1Uni       103       134,932      804,427   79.133         884 MB


Page 11

Experimental Results - Effectiveness

- Bag-of-words presented good results on our two big datasets: big datasets allow the SVM to deal better with the high-dimensional data.
- The combination of meta-features and bag-of-words improves the results on our big datasets.

Dataset   Meta-features  Bag           Bag + Meta-features
4UNI      62.50 ± 2.27   54.55 ± 1.64  62.93 ± 2.03
20NG      89.26 ± 0.23   87.08 ± 0.33  90.11 ± 0.30
ACM       63.83 ± 2.05   53.62 ± 1.12  63.58 ± 1.15
REUT90    38.96 ± 1.04   29.13 ± 2.03  37.36 ± 1.31
MED       74.33 ± 0.17   75.15 ± 0.18  79.90 ± 0.20
RCV1Uni   55.77 ± 0.92   55.32 ± 0.66  57.21 ± 0.32

Table: MacroF1 of the meta-features, bag-of-words, and the combination of meta-features and bag-of-words.


Page 12

Experimental Results - Execution Time

- On the small datasets:
  - High speedup in relation to the non-parallel ANN.
  - The parallel BF-CUDA does not optimize the distance calculations to deal with textual documents.
  - Low speedup on REUT90.
- GTkNN was the only implementation able to generate meta-features for the larger datasets.

           Execution time (s)                      Speedup
Dataset    GTkNN        BF-CUDA    ANN             BF-CUDA  ANN
4UNI       40 ± 1       259 ± 46   1590 ± 29       6.4      39.6
20NG       187 ± 4      2004 ± 17  10947 ± 1323    10.7     68.7
ACM        112 ± 3      1760 ± 91  13589 ± 1539    15.7     141.3
REUT90     625 ± 12     2242 ± 5   3024 ± 303      3.6      4.8
MED        4637 ± 43    *          *               *        *
RCV1Uni    33884 ± 111  *          *               *        *

Table: Average time in seconds to generate meta-features using different kNN strategies (* indicates the implementation could not handle the dataset).


Page 13

Experimental Results - Execution Time

[Figure: execution time in seconds (0 to 2.5) versus number of training samples (200 to 2000) for BF-CUDA, ANN and GTkNN.]

Figure: Time to generate meta-features for one example with different sample sizes from the MED dataset. GTkNN keeps a very low execution time (up to 0.005 seconds). The other two approaches slow down dramatically as the training dataset grows in size.


Page 14

Experimental Results - Memory and Efficiency of the Literature Implementation

- Our inverted index structure provides a very compact way to represent documents.
- The traditional data representation using a D×V matrix, with D training documents and V features, is not a good choice.
- It is infeasible for large datasets with many documents per class.

           Memory consumption (MB)
Dataset    GTkNN  BF-CUDA  ANN
4UNI       92     1697     945
20NG       93     1257     395
ACM        90     2541     2487
REUT90     90     909      494
MED        339    1859104  2857048
RCV1Uni    120    43245    69328

Table: Memory consumption in megabytes.


Page 15

Conclusions and Future Work

- We provide an efficient and scalable way to generate meta-features.
- We analyse the behaviour of meta-features on big datasets.
- As future work, we intend to explore other classification and ranking tasks.
