Big Data Management and Analytics Assignment 10
Transcript of Big Data Management and Analytics Assignment 10
![Page 1: Big Data Management and Analytics Assignment 10](https://reader034.fdocuments.in/reader034/viewer/2022051315/627a341c61302a2d401620d4/html5/thumbnails/1.jpg)
DATABASESYSTEMSGROUP
Big Data Management and AnalyticsAssignment 10
1
![Page 2: Big Data Management and Analytics Assignment 10](https://reader034.fdocuments.in/reader034/viewer/2022051315/627a341c61302a2d401620d4/html5/thumbnails/2.jpg)
DATABASESYSTEMSGROUP
Assignment 10-1
Finding similar items
Suppose that the universal set is given by {1,…,10}. Construct minhashsignatures for the following sets:
(a) 3,6,9(b) 2,4,6,8(c) 2,3,4
1. Construct the signatures for the sets using the following list ofpermutations:• (1,2,3,4,5,6,7,8,9,10)
• (10,8,6,4,2,9,7,5,3,1)
• (4,7,2,9,1,5,3,10,6,8)
2
![Page 3: Big Data Management and Analytics Assignment 10](https://reader034.fdocuments.in/reader034/viewer/2022051315/627a341c61302a2d401620d4/html5/thumbnails/3.jpg)
DATABASESYSTEMSGROUP
Assignment 10-1
i. Create first a characteristic matrix foreach permutation
(a) Set one column with the shingles
(b) Set columns for each document
(c) Fill the document columns by setting a 1 where there is an occurence for each shingle, and 0 else
3
Element
1 0 0 0
2 0 1 1
3 1 0 1
4 0 1 1
5 0 0 0
6 1 1 0
7 0 0 0
8 0 1 0
9 1 0 0
10 0 0 0
![Page 4: Big Data Management and Analytics Assignment 10](https://reader034.fdocuments.in/reader034/viewer/2022051315/627a341c61302a2d401620d4/html5/thumbnails/4.jpg)
DATABASESYSTEMSGROUP
Assignment 10-1
i. Create first a characteristic matrix for each permutation
4
Element
1 0 0 0
2 0 1 1
3 1 0 1
4 0 1 1
5 0 0 0
6 1 1 0
7 0 0 0
8 0 1 0
9 1 0 0
10 0 0 0
Element
10 0 0 0
8 0 1 0
6 1 1 0
4 0 1 1
2 0 1 1
9 1 0 0
7 0 0 0
5 0 0 0
3 1 0 1
1 0 0 0
Element
4 0 1 1
7 0 0 0
2 0 1 1
9 1 0 0
1 0 0 0
5 0 1 0
3 1 0 1
10 0 1 0
6 1 1 0
8 0 1 0
1st permutation 2nd permutation 3rd permutation
![Page 5: Big Data Management and Analytics Assignment 10](https://reader034.fdocuments.in/reader034/viewer/2022051315/627a341c61302a2d401620d4/html5/thumbnails/5.jpg)
DATABASESYSTEMSGROUP
Assignment 10-1
ii. Compute the minhash for each permutation
5
Element
1 0 0 0
2 0 1 1
3 1 0 1
4 0 1 1
5 0 0 0
6 1 1 0
7 0 0 0
8 0 1 0
9 1 0 0
10 0 0 0
1st permutation
Select the first occurence of 1 per setand get the elements at which they canbe found
Which leads to the following minhash:322
![Page 6: Big Data Management and Analytics Assignment 10](https://reader034.fdocuments.in/reader034/viewer/2022051315/627a341c61302a2d401620d4/html5/thumbnails/6.jpg)
DATABASESYSTEMSGROUP
Assignment 10-1
ii. Compute the minhash for each permutation
6
The same procedure for the 2nd permutation…
684
Element
10 0 0 0
8 0 1 0
6 1 1 0
4 0 1 1
2 0 1 1
9 1 0 0
7 0 0 0
5 0 0 0
3 1 0 1
1 0 0 0
2nd permutation
![Page 7: Big Data Management and Analytics Assignment 10](https://reader034.fdocuments.in/reader034/viewer/2022051315/627a341c61302a2d401620d4/html5/thumbnails/7.jpg)
DATABASESYSTEMSGROUP
Assignment 10-1
ii. Compute the minhash for each permutation
7
…and for the 3rd permutation
Which leads to the following minhash:944
Element
4 0 1 1
7 0 0 0
2 0 1 1
9 1 0 0
1 0 0 0
5 0 1 0
3 1 0 1
10 0 1 0
6 1 1 0
8 0 1 0
3rd permutation
This yields the following signatures:3,6,92,8,42,4,4
![Page 8: Big Data Management and Analytics Assignment 10](https://reader034.fdocuments.in/reader034/viewer/2022051315/627a341c61302a2d401620d4/html5/thumbnails/8.jpg)
DATABASESYSTEMSGROUP
Assignment 10-1
2. Instead of using the previously given permutations use hash functions:
102 1 103 2 10
8
![Page 9: Big Data Management and Analytics Assignment 10](https://reader034.fdocuments.in/reader034/viewer/2022051315/627a341c61302a2d401620d4/html5/thumbnails/9.jpg)
DATABASESYSTEMSGROUP
Assignment 10-1
2. Instead of using the previously given permutations use hash functions:
i. Set up a table:
9
Element
1 0 0 0
2 0 1 1
3 1 0 1
4 0 1 1
5 0 0 0
6 1 1 0
7 0 0 0
8 0 1 0
9 1 0 0
10 0 0 0
![Page 10: Big Data Management and Analytics Assignment 10](https://reader034.fdocuments.in/reader034/viewer/2022051315/627a341c61302a2d401620d4/html5/thumbnails/10.jpg)
DATABASESYSTEMSGROUP
Assignment 10-1
ii. Compute the hash values (except for zero-rows):
10
Element
1 0 0 0 - - -
2 0 1 1 2 5 8
3 1 0 1 3 7 1
4 0 1 1 4 9 4
5 0 0 0 - - -
6 1 1 0 6 3 0
7 0 0 0 - - -
8 0 1 0 8 7 6
9 1 0 0 9 9 9
10 0 0 0 - - -
![Page 11: Big Data Management and Analytics Assignment 10](https://reader034.fdocuments.in/reader034/viewer/2022051315/627a341c61302a2d401620d4/html5/thumbnails/11.jpg)
DATABASESYSTEMSGROUP
Assignment 10-1
iii. Create a table for all hash functions and sets and initialize them withinfinite distance:
11
∞ ∞ ∞∞ ∞ ∞∞ ∞ ∞
e
1 0 0 0 - - -
2 0 1 1 2 5 8
3 1 0 1 3 7 1
4 0 1 1 4 9 4
5 0 0 0 - - -
6 1 1 0 6 3 0
7 0 0 0 - - -
8 0 1 0 8 7 6
9 1 0 0 9 9 910 0 0 0 - - -
![Page 12: Big Data Management and Analytics Assignment 10](https://reader034.fdocuments.in/reader034/viewer/2022051315/627a341c61302a2d401620d4/html5/thumbnails/12.jpg)
DATABASESYSTEMSGROUP
Assignment 10-1
iii. Create a table for all hash functions and sets and initialize them withinfinite distance:
12
∞ 2 2∞ 5 5∞ 8 8
e
1 0 0 0 - - -
2 0 1 1 2 5 8
3 1 0 1 3 7 1
4 0 1 1 4 9 4
5 0 0 0 - - -
6 1 1 0 6 3 0
7 0 0 0 - - -
8 0 1 0 8 7 6
9 1 0 0 9 9 910 0 0 0 - - -
Update of 2nd row:
Update only , !
:min ∞, 2 2min ∞, 5 5min ∞, 8 8
:min ∞, 2 2min ∞, 5 5min ∞, 8 8
![Page 13: Big Data Management and Analytics Assignment 10](https://reader034.fdocuments.in/reader034/viewer/2022051315/627a341c61302a2d401620d4/html5/thumbnails/13.jpg)
DATABASESYSTEMSGROUP
Assignment 10-1
iii. Create a table for all hash functions and sets and initialize them withinfinite distance:
13
3 2 27 5 51 8 1
e
1 0 0 0 - - -
2 0 1 1 2 5 8
3 1 0 1 3 7 1
4 0 1 1 4 9 4
5 0 0 0 - - -
6 1 1 0 6 3 0
7 0 0 0 - - -
8 0 1 0 8 7 6
9 1 0 0 9 9 9
Update of 3rd row:
Update only , !
:min ∞, 3 3min ∞, 7 7min ∞, 1 1
:min 2,3 2min 5,7 5min 8,1 1
![Page 14: Big Data Management and Analytics Assignment 10](https://reader034.fdocuments.in/reader034/viewer/2022051315/627a341c61302a2d401620d4/html5/thumbnails/14.jpg)
DATABASESYSTEMSGROUP
Assignment 10-1
iii. Create a table for all hash functions and sets and initialize them withinfinite distance:
14
3 2 27 5 51 4 1
e
1 0 0 0 - - -
2 0 1 1 2 5 8
3 1 0 1 3 7 1
4 0 1 1 4 9 4
5 0 0 0 - - -
6 1 1 0 6 3 0
7 0 0 0 - - -
8 0 1 0 8 7 6
9 1 0 0 9 9 910 0 0 0 - - -
Update of 4th row:
Update only , !
:min 2,4 2min 5,9 5min 8,4 4
:min 2,4 2min 5,9 5min 1,4 1
![Page 15: Big Data Management and Analytics Assignment 10](https://reader034.fdocuments.in/reader034/viewer/2022051315/627a341c61302a2d401620d4/html5/thumbnails/15.jpg)
DATABASESYSTEMSGROUP
Assignment 10-1
iii. Create a table for all hash functions and sets and initialize them withinfinite distance:
15
3 2 23 3 50 0 1
e
1 0 0 0 - - -
2 0 1 1 2 5 8
3 1 0 1 3 7 1
4 0 1 1 4 9 4
5 0 0 0 - - -
6 1 1 0 6 3 0
7 0 0 0 - - -
8 0 1 0 8 7 6
9 1 0 0 9 9 910 0 0 0 - - -
Update of 6th row:
Update only , !
:min 3,6 3min 7,3 3min 1,0 0
:min 2,6 2min 5,3 3min 4,0 0
![Page 16: Big Data Management and Analytics Assignment 10](https://reader034.fdocuments.in/reader034/viewer/2022051315/627a341c61302a2d401620d4/html5/thumbnails/16.jpg)
DATABASESYSTEMSGROUP
Assignment 10-1
iii. Create a table for all hash functions and sets and initialize them withinfinite distance:
16
3 2 23 3 50 0 1
e
1 0 0 0 - - -
2 0 1 1 2 5 8
3 1 0 1 3 7 1
4 0 1 1 4 9 4
5 0 0 0 - - -
6 1 1 0 6 3 0
7 0 0 0 - - -
8 0 1 0 8 7 6
9 1 0 0 9 9 910 0 0 0 - - -
Update of 8th row:
Update only !
:min 2,8 2min 3,7 3min 0,6 0
![Page 17: Big Data Management and Analytics Assignment 10](https://reader034.fdocuments.in/reader034/viewer/2022051315/627a341c61302a2d401620d4/html5/thumbnails/17.jpg)
DATABASESYSTEMSGROUP
Assignment 10-1
iii. Create a table for all hash functions and sets and initialize them withinfinite distance:
17
3 2 23 3 50 0 1
e
1 0 0 0 - - -
2 0 1 1 2 5 8
3 1 0 1 3 7 1
4 0 1 1 4 9 4
5 0 0 0 - - -
6 1 1 0 6 3 0
7 0 0 0 - - -
8 0 1 0 8 7 6
9 1 0 0 9 9 910 0 0 0 - - -
Update of 8th row:
Update only !
:min 3,9 3min 3,9 3min 0,9 0
![Page 18: Big Data Management and Analytics Assignment 10](https://reader034.fdocuments.in/reader034/viewer/2022051315/627a341c61302a2d401620d4/html5/thumbnails/18.jpg)
DATABASESYSTEMSGROUP
Assignment 10-1
iii. Create a table for all hash functions and sets and initialize them withinfinite distance:
This yields the following signatures:3,3,02,3,02,5,1
18
3 2 23 3 50 0 1
![Page 19: Big Data Management and Analytics Assignment 10](https://reader034.fdocuments.in/reader034/viewer/2022051315/627a341c61302a2d401620d4/html5/thumbnails/19.jpg)
DATABASESYSTEMSGROUP
Assignment 10-1
3. How does the estimated Jaccard similarity from (1.) and (2.) comparewith the true Jaccard similarity of the original data? How to reducedeviations in the approximated Jaccard similarities?
19
RECAP: Jaccard similarity:
, ∩∪
![Page 20: Big Data Management and Analytics Assignment 10](https://reader034.fdocuments.in/reader034/viewer/2022051315/627a341c61302a2d401620d4/html5/thumbnails/20.jpg)
DATABASESYSTEMSGROUP
Assignment 10-1
i. Actual Jaccard similarity:
20
RECAP: Jaccard similarity:
, ∩∪
- 16
15
- - 25
- - -
(a) 3,6,9(b) 2,4,6,8(c) 2,3,4
![Page 21: Big Data Management and Analytics Assignment 10](https://reader034.fdocuments.in/reader034/viewer/2022051315/627a341c61302a2d401620d4/html5/thumbnails/21.jpg)
DATABASESYSTEMSGROUP
Assignment 10-1
ii. Permutation estimated Jaccard similarity:
21
RECAP: Jaccard similarity:
, ∩∪
- 06
05
- - 23
- - -
(a) 3,6,9(b) 2,8,4(c) 2,4,4
![Page 22: Big Data Management and Analytics Assignment 10](https://reader034.fdocuments.in/reader034/viewer/2022051315/627a341c61302a2d401620d4/html5/thumbnails/22.jpg)
DATABASESYSTEMSGROUP
Assignment 10-1
iii. Hash estimated Jaccard similarity:
22
RECAP: Jaccard similarity:
, ∩∪
- 23
05
- - 15
- - -
(a) 3,3,0(b) 2,3,0(c) 2,5,1
![Page 23: Big Data Management and Analytics Assignment 10](https://reader034.fdocuments.in/reader034/viewer/2022051315/627a341c61302a2d401620d4/html5/thumbnails/23.jpg)
DATABASESYSTEMSGROUP
Assignment 10-1
iv. Comparison:
23
- 23
05
- - 15
- - -
- 06
05
- - 23
- - -
- 16
15
- - 25
- - -
Actual J. similarity Perm. est. J similarity Hash. est. J similarity
Estimation of the actual J. similarity is rather poor, why?
Too small minhash vectors. Get more permutations of the universal set ormore hash functions to extend the minhash vectors!
![Page 24: Big Data Management and Analytics Assignment 10](https://reader034.fdocuments.in/reader034/viewer/2022051315/627a341c61302a2d401620d4/html5/thumbnails/24.jpg)
DATABASESYSTEMSGROUP
Assignment 10-2
(a) Describe what a PCA aims for and under what circumstances it is most helpful
24
From the lecture slides:
• Detect hidden linear correlations
• Remove redundant and noisy features
• Interpretation and visualization
• Easier storage and processing of the data
When is PCA most helpful:
• The assumption is that the observed variable can be expressed as a linear combination of the hidden variables . If that is not thecase, another heuristics should be used (e.g. LDA,RCA etc.)
![Page 25: Big Data Management and Analytics Assignment 10](https://reader034.fdocuments.in/reader034/viewer/2022051315/627a341c61302a2d401620d4/html5/thumbnails/25.jpg)
DATABASESYSTEMSGROUP
Assignment 10-2
(b) Which possibly netgative consequences might arise when applying PCA to a dataset of unknown structure?
25
• Data which is not normed can skew the result. Therefore first norm thedata!
• Loss of possibly relevant structures (see red lines within the figures)
• Solution: subspace clustering / correlation clustering
![Page 26: Big Data Management and Analytics Assignment 10](https://reader034.fdocuments.in/reader034/viewer/2022051315/627a341c61302a2d401620d4/html5/thumbnails/26.jpg)
DATABASESYSTEMSGROUP
Assignment 10-2
(b) Which possibly netgative consequences might arise when applying PCA to a dataset of unknown structure?
26
• Further, problems with outliers may arise, as they may massively skewthe PCA transformation:
![Page 27: Big Data Management and Analytics Assignment 10](https://reader034.fdocuments.in/reader034/viewer/2022051315/627a341c61302a2d401620d4/html5/thumbnails/27.jpg)
DATABASESYSTEMSGROUP
Assignment 10-3
Consider the ∈ matrix containing six data points ∈ .
Conduct a PCA on the given data, i.e. project the data onto a one-dimensional space. Please state the eigenvectors, eigenvalues, covariance matrix and visualize the databefore and after PCA.
27
1 02 03 05 66 67 6
dim 1 dim 2
![Page 28: Big Data Management and Analytics Assignment 10](https://reader034.fdocuments.in/reader034/viewer/2022051315/627a341c61302a2d401620d4/html5/thumbnails/28.jpg)
DATABASESYSTEMSGROUP
Assignment 10-3
i. Center the data by substracting the mean value for each dimension:
28
1 4 0 32 4 0 33 4 0 35 4 6 36 4 6 37 4 6 3
3 32 31 31 32 33 3
1 43
![Page 29: Big Data Management and Analytics Assignment 10](https://reader034.fdocuments.in/reader034/viewer/2022051315/627a341c61302a2d401620d4/html5/thumbnails/29.jpg)
DATABASESYSTEMSGROUP
Assignment 10-3
ii. Calculate the covariance matrix · :
29
Σ 4, 7 66 9
= ⁄ 66 9
![Page 30: Big Data Management and Analytics Assignment 10](https://reader034.fdocuments.in/reader034/viewer/2022051315/627a341c61302a2d401620d4/html5/thumbnails/30.jpg)
DATABASESYSTEMSGROUP
Assignment 10-3
iii. Now compute the eigenpairs (eigenvalues, eigenvectors). Construct theeigendecomposition Σ with sorted eigenvalues in
30
det Σ14
3 66 9
143 · 9 36
14 · 3 36413 6 0
, ∓ ·
13.21 0.45
Compute the eigenvalues:
![Page 31: Big Data Management and Analytics Assignment 10](https://reader034.fdocuments.in/reader034/viewer/2022051315/627a341c61302a2d401620d4/html5/thumbnails/31.jpg)
DATABASESYSTEMSGROUP
Assignment 10-3
iii. Now compute the eigenpairs (eigenvalues, eigenvectors). Construct theeigendecomposition Σ with sorted eigenvalues in
31
Σ ,
Compute the eigenvectors:
143 6
6 9,
⇒14
3 66 9 ⇒ 1st (normed) eigenvector: ,
,
Eigenvalues: 13,21 00 0,45
Eigenvectors: 0,57 0,820,82 0,57
![Page 32: Big Data Management and Analytics Assignment 10](https://reader034.fdocuments.in/reader034/viewer/2022051315/627a341c61302a2d401620d4/html5/thumbnails/32.jpg)
DATABASESYSTEMSGROUP
Assignment 10-3
iv. Reduce to one-dimensional space. For this purpose remove the secondeigenvector and form the transformation matrix :
Now transform the data with
·
3 32 31 31 32 33 3
0,570,82 4,18 3,6 3,033,033,64,18
32
0,57 00,82 0 ⇒
0,570,82
![Page 33: Big Data Management and Analytics Assignment 10](https://reader034.fdocuments.in/reader034/viewer/2022051315/627a341c61302a2d401620d4/html5/thumbnails/33.jpg)
DATABASESYSTEMSGROUP
Assignment 10-3
iv. Reduce to one-dimensional space. For this purpose remove the secondeigenvector and form the transformation matrix :
We can now try to reconstruct the original data matrix with
· · ·
1,6 0,421,93 0,052,26 0,525,74 5,486,07 5,956,40 6,42
33
![Page 34: Big Data Management and Analytics Assignment 10](https://reader034.fdocuments.in/reader034/viewer/2022051315/627a341c61302a2d401620d4/html5/thumbnails/34.jpg)
DATABASESYSTEMSGROUP
Assignment 10-3
v. As we have already reduced to the one-dimensional space (here we did that byeliminiating the second principal component), the reconstruction does not imply theinformation of the second pc:
34