Study of the parallel techniques for dimensionality reduction and its impact on quality of the text processing algorithms
Marcin Pietroń 1,2, Maciej Wielgosz 1,2, Michał Karwatowski 1,2, Kazimierz Wiatr 1,2
1 AGH University of Science and Technology, al. Mickiewicza 30, 30-059 Kraków; 2 ACK Cyfronet AGH, ul. Nawojki 11, 30-950 Kraków
RUC 17-18.09.2015 Kraków
2 Agenda
Text classification
System architecture
Metrics
Dimensionality reduction
Experiments and results
Conclusions and future work
3 Text classification
A very useful and popular problem in internet and big-data processing
Real-time processing requirement
Preceded by text preprocessing
Clustering is one of several techniques that supports text classification
4 System architecture
Text pre-processing
Dictionary and model
transformation
SVD
K-means
5 System architecture
Document corpus generation (e.g. a crawler)
Text preprocessing (implemented with the gensim library: lemmatization, stoplist, etc.)
SVD
K-means as the clustering method (clustering documents into chosen domains)
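The pipeline above (vector space model, TF-IDF, SVD, k-means) can be sketched end to end in plain NumPy so every step is visible. The toy corpus, the two-cluster setting, and the reduction size are illustrative assumptions, not the authors' actual data or parameters:

```python
# Minimal sketch of the pipeline: VSM -> TF-IDF -> SVD -> k-means.
# Toy corpus and parameters are illustrative assumptions.
import numpy as np

docs = [
    "stock market trading profit",
    "match goal league season",
    "stock profit bank market",
    "goal season match team",
]

# 1. Vector space model: document-term count matrix
vocab = sorted({w for d in docs for w in d.split()})
idx = {w: i for i, w in enumerate(vocab)}
A = np.zeros((len(docs), len(vocab)))
for r, d in enumerate(docs):
    for w in d.split():
        A[r, idx[w]] += 1

# 2. TF-IDF weighting
df = (A > 0).sum(axis=0)                  # document frequency of each term
tfidf = A * np.log(len(docs) / df)

# 3. SVD: project documents onto the top-2 left singular directions
U, s, Vt = np.linalg.svd(tfidf, full_matrices=False)
X = U[:, :2] * s[:2]

# 4. k-means (fixed iterations, initialized on the first two documents)
centers = X[:2].copy()
for _ in range(10):
    labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
    centers = np.array([X[labels == k].mean(axis=0) for k in range(2)])

print(labels)  # the two finance-like documents should share a cluster
```

In the full system the preprocessing and TF-IDF/SVD steps are handled by gensim rather than written by hand; this sketch only mirrors the structure of the flow.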
6 Quality metrics
7 Entropy
$$E(C_i) = -\sum_{h=1}^{k} \frac{n_i^h}{n_i} \log\left(\frac{n_i^h}{n_i}\right)$$
$$Entropy = \sum_{i=1}^{k} \frac{n_i}{n} E(C_i)$$
where $n_i^h$ is the number of documents of class $h$ in cluster $C_i$, $n_i$ is the size of cluster $C_i$, and $n$ is the total number of documents.
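The entropy metric above can be computed directly from a cluster-by-class count matrix. The example assignments below are illustrative, not taken from the paper's experiments:

```python
# Entropy of a clustering: per-cluster class-distribution entropy,
# weighted by relative cluster size.
import numpy as np

def cluster_entropy(counts):
    """counts[i][h] = number of documents of class h in cluster i."""
    counts = np.asarray(counts, dtype=float)
    n_i = counts.sum(axis=1)              # cluster sizes
    n = counts.sum()                      # total number of documents
    p = counts / n_i[:, None]             # per-cluster class distribution
    with np.errstate(divide="ignore", invalid="ignore"):
        e_i = np.sum(np.where(p > 0, -p * np.log(p), 0.0), axis=1)
    return float((n_i / n) @ e_i)         # size-weighted average

# Two pure clusters (each holding a single class) give zero entropy:
print(cluster_entropy([[10, 0], [0, 10]]))   # -> 0.0
# Two maximally mixed clusters give log(2):
print(round(cluster_entropy([[5, 5], [5, 5]]), 4))
```

Lower entropy means purer clusters, which is why the results table below reports decreasing entropy as TF-IDF and SVD are added to the pipeline.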
8 Dimensionality reduction
SVD:
$$A = U \Sigma V^T$$
where $U$ is the matrix of left singular vectors, $V$ is the matrix of right singular vectors, and $\Sigma$ is a diagonal matrix of singular values
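A small NumPy sketch of SVD-based reduction: truncating to the $k$ largest singular values gives the best rank-$k$ approximation of $A$, and $U_k \Sigma_k$ gives the documents' coordinates in the reduced space. The matrix shape and $k$ are illustrative:

```python
# SVD dimensionality reduction: keep the top-k singular directions.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 50))        # e.g. documents x terms

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 10
A_reduced = U[:, :k] * s[:k]              # documents in the k-dim space
A_approx = A_reduced @ Vt[:k]             # best rank-k reconstruction of A

print(A_reduced.shape)                    # -> (100, 10)
```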
9 Dimensionality reduction
Random Projection:
random projection of vectors onto a reduced space by special matrices (distances between points in the reduced space are preserved up to a scale factor)
$$A' = A R$$
where $R$ is e.g. the Achlioptas random projection matrix
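A sketch of the Achlioptas construction mentioned above: the entries of $R$ take the values $+\sqrt{3}$, $0$, $-\sqrt{3}$ with probabilities $1/6$, $2/3$, $1/6$, so most of the matrix is zero, and pairwise distances are approximately preserved after scaling by $1/\sqrt{k}$. The sizes below are illustrative assumptions:

```python
# Achlioptas sparse random projection: entries +sqrt(3), 0, -sqrt(3)
# with probabilities 1/6, 2/3, 1/6.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 1000, 100                  # documents, original dim, reduced dim

A = rng.standard_normal((n, d))
R = np.sqrt(3) * rng.choice([1.0, 0.0, -1.0], size=(d, k), p=[1/6, 2/3, 1/6])
A_red = A @ R / np.sqrt(k)                # reduced representation

# Distance between the first two documents before and after projection:
orig = np.linalg.norm(A[0] - A[1])
red = np.linalg.norm(A_red[0] - A_red[1])
print(round(red / orig, 2))               # close to 1 in expectation
```

The sparsity (two thirds of the entries are zero) is what makes this construction attractive for a hardware implementation, as mentioned in the future-work slide.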
10 Results and experiments

category    number of clusters  Precision     Recall        F-measure
business    3.9 (0.3)           0.81 (0.022)  0.56 (0.077)  0.66 (0.034)
culture     3 (0)               0.37 (0.015)  0.70 (0.061)  0.48 (0.024)
automotive  4.8 (0.4)           0.39 (0.007)  0.56 (0.021)  0.45 (0.01)
science     2.1 (0.3)           0.39 (0.014)  0.74 (0.016)  0.51 (0.014)
sport       4.8 (0.4)           0.39 (0.007)  0.56 (0.021)  0.45 (0.01)

employed algorithms     Entropy
vsm+kmeans              0.28 (0.012)
vsm+tfidf+kmeans        0.17 (0.019)
vsm+tfidf+svd+kmeans    0.16 (0.006)
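As a quick consistency check on the table above, the F-measure is the harmonic mean of precision and recall, $F = 2PR/(P+R)$; recomputing it from the reported mean precision and recall reproduces the reported F values up to rounding:

```python
# Recompute F-measure from the table's mean precision and recall values.
precision = {"business": 0.81, "culture": 0.37, "science": 0.39}
recall    = {"business": 0.56, "culture": 0.70, "science": 0.74}

for cat in precision:
    p, r = precision[cat], recall[cat]
    f = 2 * p * r / (p + r)               # harmonic mean
    print(cat, round(f, 2))               # 0.66, 0.48, 0.51 as in the table
```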
11 Results and experiments
[Plot: entropy mean as a function of the reduction size (y axis ≈ 0.75-1.05)]
12 GPU implementation

reduction size  GPGPU [ms]  CPU [ms]
10              33          80
20              77          305
30              107         420
40              161         624

Hardware: NVIDIA Tesla M2090 (GPU), Intel Xeon E5645 (CPU)
13 Conclusions and future work
Applying more algorithms in the pipeline lowers the entropy
The GPU can efficiently reduce the time of text classification
Hardware implementation of random projection
K-means GPU acceleration
14 Questions?