Study of the parallel techniques for dimensionality reduction and its impact on quality of the text processing algorithms
Marcin Pietroń 1,2, Maciej Wielgosz 1,2, Michał Karwatowski 1,2, Kazimierz Wiatr 1,2
1 AGH University of Science and Technology, al. Mickiewicza 30, 30-059 Kraków; 2 ACK Cyfronet AGH, ul. Nawojki 11, 30-950 Kraków
RUC 17-18.09.2015 Kraków
2 Agenda
Text classification
System architecture
Metrics
Dimensionality reduction
Experiments and results
Conclusions and future work
3 Text classification
A very useful and popular problem in internet and big-data processing
Real-time processing requirement
Preceded by text preprocessing
Clustering is one of several techniques that supports text classification
4 System architecture
Text pre-processing
Dictionary and model
transformation
SVD
K-means
5 System architecture
Document corpus generation (e.g. a crawler)
Text preprocessing (implemented with the gensim library: lemmatization, stoplist, etc.)
SVD
K-means as the clustering method (clustering documents into chosen domains)
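The pipeline above (vector space model, TF-IDF, SVD, k-means) can be sketched end to end in plain NumPy so every step is visible. The toy corpus, the two-cluster setting, and the reduction size are illustrative assumptions, not the authors' actual data or parameters:

```python
# Minimal sketch of the pipeline: VSM -> TF-IDF -> SVD -> k-means.
# Toy corpus and parameters are illustrative assumptions.
import numpy as np

docs = [
    "stock market trading profit",
    "match goal league season",
    "stock profit bank market",
    "goal season match team",
]

# 1. Vector space model: document-term count matrix
vocab = sorted({w for d in docs for w in d.split()})
idx = {w: i for i, w in enumerate(vocab)}
A = np.zeros((len(docs), len(vocab)))
for r, d in enumerate(docs):
    for w in d.split():
        A[r, idx[w]] += 1

# 2. TF-IDF weighting
df = (A > 0).sum(axis=0)                  # document frequency of each term
tfidf = A * np.log(len(docs) / df)

# 3. SVD: project documents onto the top-2 left singular directions
U, s, Vt = np.linalg.svd(tfidf, full_matrices=False)
X = U[:, :2] * s[:2]

# 4. k-means (fixed iterations, initialized on the first two documents)
centers = X[:2].copy()
for _ in range(10):
    labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
    centers = np.array([X[labels == k].mean(axis=0) for k in range(2)])

print(labels)  # the two finance-like documents should share a cluster
```

In the full system the preprocessing and TF-IDF/SVD steps are handled by gensim rather than written by hand; this sketch only mirrors the structure of the flow.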
6 Quality metrics
7 Entropy
$$E(C_i) = -\sum_{h=1}^{k} \frac{n_i^h}{n_i} \log\left(\frac{n_i^h}{n_i}\right)$$
$$Entropy = \sum_{i=1}^{k} \frac{n_i}{n} E(C_i)$$
where $n_i^h$ is the number of documents of class $h$ in cluster $C_i$, $n_i$ is the size of cluster $C_i$, and $n$ is the total number of documents.
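The entropy metric above can be computed directly from a cluster-by-class count matrix. The example assignments below are illustrative, not taken from the paper's experiments:

```python
# Entropy of a clustering: per-cluster class-distribution entropy,
# weighted by relative cluster size.
import numpy as np

def cluster_entropy(counts):
    """counts[i][h] = number of documents of class h in cluster i."""
    counts = np.asarray(counts, dtype=float)
    n_i = counts.sum(axis=1)              # cluster sizes
    n = counts.sum()                      # total number of documents
    p = counts / n_i[:, None]             # per-cluster class distribution
    with np.errstate(divide="ignore", invalid="ignore"):
        e_i = np.sum(np.where(p > 0, -p * np.log(p), 0.0), axis=1)
    return float((n_i / n) @ e_i)         # size-weighted average

# Two pure clusters (each holding a single class) give zero entropy:
print(cluster_entropy([[10, 0], [0, 10]]))   # -> 0.0
# Two maximally mixed clusters give log(2):
print(round(cluster_entropy([[5, 5], [5, 5]]), 4))
```

Lower entropy means purer clusters, which is why the results table below reports decreasing entropy as TF-IDF and SVD are added to the pipeline.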
8 Dimensionality reduction
SVD:
$$A = U \Sigma V^T$$
where $U$ is the matrix of left singular vectors, $V$ is the matrix of right singular vectors, and $\Sigma$ is a diagonal matrix of singular values
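A small NumPy sketch of SVD-based reduction: truncating to the $k$ largest singular values gives the best rank-$k$ approximation of $A$, and $U_k \Sigma_k$ gives the documents' coordinates in the reduced space. The matrix shape and $k$ are illustrative:

```python
# SVD dimensionality reduction: keep the top-k singular directions.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 50))        # e.g. documents x terms

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 10
A_reduced = U[:, :k] * s[:k]              # documents in the k-dim space
A_approx = A_reduced @ Vt[:k]             # best rank-k reconstruction of A

print(A_reduced.shape)                    # -> (100, 10)
```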
9 Dimensionality reduction
Random Projection:
random projection of vectors onto a reduced space by special matrices (distances between points in the reduced space are preserved up to a scale factor)
$$A' = A R$$
where $R$ is e.g. the Achlioptas random projection matrix
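A sketch of the Achlioptas construction mentioned above: the entries of $R$ take the values $+\sqrt{3}$, $0$, $-\sqrt{3}$ with probabilities $1/6$, $2/3$, $1/6$, so most of the matrix is zero, and pairwise distances are approximately preserved after scaling by $1/\sqrt{k}$. The sizes below are illustrative assumptions:

```python
# Achlioptas sparse random projection: entries +sqrt(3), 0, -sqrt(3)
# with probabilities 1/6, 2/3, 1/6.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 1000, 100                  # documents, original dim, reduced dim

A = rng.standard_normal((n, d))
R = np.sqrt(3) * rng.choice([1.0, 0.0, -1.0], size=(d, k), p=[1/6, 2/3, 1/6])
A_red = A @ R / np.sqrt(k)                # reduced representation

# Distance between the first two documents before and after projection:
orig = np.linalg.norm(A[0] - A[1])
red = np.linalg.norm(A_red[0] - A_red[1])
print(round(red / orig, 2))               # close to 1 in expectation
```

The sparsity (two thirds of the entries are zero) is what makes this construction attractive for a hardware implementation, as mentioned in the future-work slide.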
10 Results and experiments

category    number of clusters  Precision     Recall        F-measure
business    3.9 (0.3)           0.81 (0.022)  0.56 (0.077)  0.66 (0.034)
culture     3 (0)               0.37 (0.015)  0.70 (0.061)  0.48 (0.024)
automotive  4.8 (0.4)           0.39 (0.007)  0.56 (0.021)  0.45 (0.01)
science     2.1 (0.3)           0.39 (0.014)  0.74 (0.016)  0.51 (0.014)
sport       4.8 (0.4)           0.39 (0.007)  0.56 (0.021)  0.45 (0.01)

employed algorithms     Entropy
vsm+kmeans              0.28 (0.012)
vsm+tfidf+kmeans        0.17 (0.019)
vsm+tfidf+svd+kmeans    0.16 (0.006)
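As a quick consistency check on the table above, the F-measure is the harmonic mean of precision and recall, $F = 2PR/(P+R)$; recomputing it from the reported mean precision and recall reproduces the reported F values up to rounding:

```python
# Recompute F-measure from the table's mean precision and recall values.
precision = {"business": 0.81, "culture": 0.37, "science": 0.39}
recall    = {"business": 0.56, "culture": 0.70, "science": 0.74}

for cat in precision:
    p, r = precision[cat], recall[cat]
    f = 2 * p * r / (p + r)               # harmonic mean
    print(cat, round(f, 2))               # 0.66, 0.48, 0.51 as in the table
```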
11 Results and experiments
[Plot: entropy mean as a function of the reduction size (y axis ≈ 0.75-1.05)]
12 GPU implementation

reduction size  GPGPU [ms]  CPU [ms]
10              33          80
20              77          305
30              107         420
40              161         624

Hardware: NVIDIA Tesla M2090 (GPU), Intel Xeon E5645 (CPU)
13 Conclusions and future work
Applying more algorithms in the pipeline lowers the entropy
The GPU can efficiently reduce the time of text classification
Hardware implementation of random projection
K-means GPU acceleration
14 Questions?