[ARM 15 | ACM/IFIP/USENIX Middleware 2015] Research Paper Presentation
Transcript of [ARM 15 | ACM/IFIP/USENIX Middleware 2015] Research Paper Presentation
1
An Efficient incremental indexing mechanism for extracting Top-k
representative queries over continuous data streams
Y.S. Horawalavithana, D.N. Ranasinghe
Adaptive and Reflective Middleware (ARM) ACM/IFIP/USENIX Middleware
Vancouver, BC, Canada, December 08, 2015
University of Colombo School of Computing, Sri Lanka
2
Overview
• Motivation
• Adaptive Diversification
• Incremental Top-k
• Evaluation
• Conclusion
• Future work
3
4
Diversity: Top-k representative set
Representative Top-k
• Drawback (without diversity)
• What we want (with diversity)
Method to retrieve Top-k publications from matching publications
1.Motivation 2.Adaptive Diversification 3.Incremental Top-k 4.Evaluation 5.Conclusion 6.Future Work
5
Minimum independent-dominating set
Publication space: publications p1 to p5, with distance threshold α
Neighborhood: $N(p_i) = \{\, p_j \mid p_i \neq p_j,\ d(p_i, p_j) \le \alpha \,\}$
Graph model: vertices v1 to v5, one per publication
[Figure: four candidate vertex subsets of the graph. Three are independent and dominating; one is dominating but not independent.]
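A minimal Python sketch of the two properties named on this slide. It assumes one-dimensional publications, absolute difference as the distance d, and an edge between two publications whenever their distance is at most α. The coordinates, the value of α, and the helper names `build_graph`, `is_independent`, and `is_dominating` are hypothetical and do not come from the talk.

```python
from itertools import combinations

def build_graph(points, dist, alpha):
    """Return {id: set of neighbor ids}; two ids are neighbors when dist <= alpha."""
    nbrs = {p: set() for p in points}
    for a, b in combinations(points, 2):
        if dist(points[a], points[b]) <= alpha:
            nbrs[a].add(b)
            nbrs[b].add(a)
    return nbrs

def is_independent(subset, nbrs):
    """True when no two chosen vertices are neighbors of each other."""
    return all(b not in nbrs[a] for a, b in combinations(subset, 2))

def is_dominating(subset, nbrs):
    """True when every vertex is chosen or has at least one chosen neighbor."""
    chosen = set(subset)
    return all(v in chosen or bool(nbrs[v] & chosen) for v in nbrs)

# Hypothetical data: five publications on a line, alpha = 1.5
points = {"p1": 0.0, "p2": 1.0, "p3": 2.0, "p4": 5.0, "p5": 6.0}
graph = build_graph(points, lambda x, y: abs(x - y), alpha=1.5)

for subset in (["p2", "p4"], ["p1", "p2", "p4"], ["p1", "p4"]):
    print(subset, is_independent(subset, graph), is_dominating(subset, graph))
```

In this toy data, {p2, p4} is both independent and dominating, {p1, p2, p4} is dominating but not independent, and {p1, p4} is independent but leaves p3 uncovered.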
6
NAÏVE Greedy
$\arg\max_{p_i} \; r(p_i)^2 \sum_{p_j \in N(p_i)} r(p_j) \times d(p_i, p_j)$
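A hedged Python sketch of a greedy selection in the spirit of this slide. It reads the scoring expression as the squared relevance of p_i multiplied by the relevance-weighted distances to its neighbors, which is an assumption because the slide text is damaged. Removing a selected item's neighborhood from further consideration is also an assumption, chosen so that the result is independent and dominating. The function names, the one-dimensional data, and the relevance values are hypothetical.

```python
from itertools import combinations

def neighborhoods(points, dist, alpha):
    """Map each id to the set of other ids within distance alpha."""
    nbrs = {p: set() for p in points}
    for a, b in combinations(points, 2):
        if dist(points[a], points[b]) <= alpha:
            nbrs[a].add(b)
            nbrs[b].add(a)
    return nbrs

def naive_greedy(points, relevance, dist, alpha):
    """Repeatedly pick the best-scoring remaining item, then drop it and its neighbors."""
    nbrs = neighborhoods(points, dist, alpha)
    remaining = set(points)
    selected = []

    def score(p):
        # Assumed reading of the slide: r(p)^2 * sum of r(q) * d(p, q) over live neighbors q
        live = nbrs[p] & remaining
        return relevance[p] ** 2 * sum(
            relevance[q] * dist(points[p], points[q]) for q in live
        )

    while remaining:
        best = max(sorted(remaining), key=score)  # sorted() makes ties deterministic
        selected.append(best)
        remaining -= nbrs[best] | {best}
    return selected

# Hypothetical data
points = {"p1": 0.0, "p2": 1.0, "p3": 2.0, "p4": 5.0, "p5": 6.0}
relevance = {"p1": 0.5, "p2": 0.9, "p3": 0.4, "p4": 0.7, "p5": 0.6}
print(naive_greedy(points, relevance, lambda x, y: abs(x - y), alpha=1.5))
```

Each call recomputes every neighborhood from scratch, which is the cost the incremental index is meant to avoid.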
7
Handling streaming publications
[Figure: publication space p1 to p5 with threshold α and its graph v1 to v5; a new publication p6 arrives and vertex v6 is added to the graph]
Continuity Requirements
1. Durability: an item selected as diverse in one window may still be selected in the next window, provided it has not expired and no other valid item in that window outcompetes it.
2. Order: the publication stream follows chronological order. Once an item i has been selected, we avoid later selecting as diverse an item j that is no newer than i.
8
Adaptive Diversification
Matching publication stream: P1, P2, P3, P4, ..., Pj, Pj+1, ...
ith window: $S_i^*$
(i+1)th window: $S_{i+1}^*$
Independence
Dominance
Durability
Order
Straightforward solution: apply the naïve greedy method at each instance.
Proposal: an incremental index mechanism that avoids the curse of re-calculating neighborhoods.
9
Locality Sensitive Hashing (LSH)
Simple idea: if two points are close together, then after a “projection” operation these two points will remain close together.
10
LSH in Adaptive Diversification: Publications as categorical data
11
LSH in Adaptive Diversification: Characteristic Matrix
12
LSH in Adaptive Diversification: Minhashing
No publications any more! A signature represents each publication.
Technique
• Randomly permute the rows of the characteristic matrix m times.
• For each permutation, take the number of the first row, in the permuted order, in which the corresponding publication's column has a 1.
[Figure: first permutation of the rows of the characteristic matrix]
Advantage: reduces the dimensions to a small minhash signature.
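The permutation technique on this slide can be sketched in Python as follows. This is an illustration under assumptions, not the authors' code: the characteristic matrix is a list of rows with one column per publication, every column contains at least one 1, and the function name `minhash_by_permutation`, the seed, and the toy matrix are hypothetical.

```python
import random

def minhash_by_permutation(char_matrix, m, seed=0):
    """char_matrix[row][col] is 1 when publication `col` has attribute `row`.

    Returns one signature (a list of m values) per column.
    """
    rng = random.Random(seed)
    n_rows, n_cols = len(char_matrix), len(char_matrix[0])
    signatures = [[] for _ in range(n_cols)]
    for _ in range(m):
        order = list(range(n_rows))
        rng.shuffle(order)  # one random permutation of the rows
        for col in range(n_cols):
            # Position of the first row, in permuted order, where the column has a 1
            first = next(pos for pos, row in enumerate(order) if char_matrix[row][col])
            signatures[col].append(first)
    return signatures

# Hypothetical matrix: 5 attributes (rows), 3 publications (columns)
char_matrix = [
    [1, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
]
for sig in minhash_by_permutation(char_matrix, m=6):
    print(sig)
```

Publications 0 and 1 have identical columns, so their signatures agree in every position, while publication 2 shares no attribute with them and never agrees.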
13
LSH in Adaptive Diversification: Signature Matrix
Fast minhashing
• Select m random hash functions to model the effect of m random permutations.
• Mathematically proved only when the number of rows is a prime.
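A short Python sketch of fast minhashing, assuming hash functions of the form h(x) = (a·x + b) mod p, where p is the (prime) number of rows and a is non-zero. The talk does not state the hash family, so this form, the function name `fast_minhash`, and the toy columns are assumptions for illustration. With p prime and a non-zero, each h maps the row numbers 0..p-1 onto themselves without collisions, so it behaves like a row permutation.

```python
import random

def fast_minhash(columns, n_rows, m, seed=0):
    """columns: list of sets of row indices that hold a 1; n_rows is assumed prime.

    Returns one m-value signature per column without permuting the matrix.
    """
    rng = random.Random(seed)
    # One (a, b) pair per simulated permutation; a != 0 keeps h a bijection mod n_rows
    coeffs = [(rng.randrange(1, n_rows), rng.randrange(n_rows)) for _ in range(m)]
    return [
        [min((a * row + b) % n_rows for row in col) for a, b in coeffs]
        for col in columns
    ]

# Hypothetical data: 5 rows (a prime), 3 publications
columns = [{0, 3}, {0, 3}, {1, 2, 4}]
for sig in fast_minhash(columns, n_rows=5, m=4):
    print(sig)
```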
14
LSH in Adaptive Diversification: LSH Buckets
• Take r-sized signature vectors from the m-sized minhash signature.
• Map them into L hash tables, each with an arbitrary number b of buckets.
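One way to picture the mapping into buckets, as a Python sketch. The talk does not say how the r-sized vectors are drawn from the signature, so taking L consecutive chunks of length r is an assumption, as are the function name `bucket_ids` and the use of Python's built-in `hash` on a tuple of integers.

```python
def bucket_ids(signature, r, L, b):
    """Return one bucket id per hash table for a minhash signature.

    Table t uses the chunk signature[t*r : (t+1)*r] (assumed layout) and
    hashes it into one of b buckets.
    """
    if len(signature) < r * L:
        raise ValueError("signature must hold at least r * L values")
    return [hash(tuple(signature[t * r:(t + 1) * r])) % b for t in range(L)]

# Hypothetical signatures with m = 6, split into L = 3 tables of r = 2 values
sig_x = [1, 2, 3, 4, 5, 6]
sig_y = [1, 2, 9, 9, 5, 6]
print(bucket_ids(sig_x, r=2, L=3, b=16))
print(bucket_ids(sig_y, r=2, L=3, b=16))
```

The two signatures agree on their first and last chunks, so they land in the same bucket in tables 0 and 2.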
15
LSH in Adaptive Diversification: Batch-wise Top-k computation
• Bucket “winner”: the publication with the highest relevancy score in its bucket.
• A winner is dominant: it represents its bucket neighborhood.
• Top-k: the k “winners” that have a majority of votes; the k winners are independent.
[Figure: publications P_A, P_B, ..., P_H in the ith window]
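A minimal Python sketch of winners and votes for one window. It assumes each publication already has one bucket id per hash table, counts a vote for every bucket a publication wins, and breaks vote ties by relevance; the tie-break, the function name `batch_top_k`, and the toy data are assumptions, and the sketch does not itself check independence of the winners.

```python
from collections import Counter

def batch_top_k(buckets, relevance, k):
    """buckets: {pub_id: [bucket id in table 0, ..., bucket id in table L-1]}.

    Returns (top-k publication ids, {(table, bucket): winner}).
    """
    winners = {}
    for pub, ids in buckets.items():
        for table, bucket in enumerate(ids):
            key = (table, bucket)
            # The bucket winner is the publication with the highest relevancy score
            if key not in winners or relevance[pub] > relevance[winners[key]]:
                winners[key] = pub
    votes = Counter(winners.values())  # one vote per bucket won
    ranked = sorted(votes, key=lambda p: (-votes[p], -relevance[p]))
    return ranked[:k], winners

# Hypothetical window with L = 2 hash tables
buckets = {"A": [0, 1], "B": [0, 1], "C": [2, 1], "D": [3, 4]}
relevance = {"A": 0.9, "B": 0.5, "C": 0.7, "D": 0.6}
top, winners = batch_top_k(buckets, relevance, k=2)
print(top)
```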
16
LSH in Dynamic Diversification: Incremental Top-k computation
1. New publication i arrives.
2. Update the ith characteristic vector (Characteristic Matrix).
3. Generate the ith minhash signature (Signature Matrix).
4. Map the signature into L hash tables.
5. Update the “winner” at each bucket the signature maps into.
6. Vote for the Top-k candidate.
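The update steps on this slide can be sketched as a small Python class. It assumes the bucket ids of the new publication are already computed, touches only those buckets, and ranks candidates by votes and then relevance. The class name `IncrementalTopK`, the tie-break, and the sample arrivals are hypothetical, and window expiry (the durability and order requirements) is left out to keep the sketch short.

```python
from collections import Counter

class IncrementalTopK:
    """Keeps bucket winners and their votes up to date as publications arrive."""

    def __init__(self, k):
        self.k = k
        self.relevance = {}
        self.winners = {}       # (table, bucket) -> winning publication
        self.votes = Counter()  # publication -> number of buckets it currently wins

    def add(self, pub, score, bucket_ids):
        """Insert one publication; only the buckets it maps into are re-examined."""
        self.relevance[pub] = score
        for key in enumerate(bucket_ids):  # key is a (table, bucket) pair
            old = self.winners.get(key)
            if old is None or score > self.relevance[old]:
                if old is not None:
                    self.votes[old] -= 1   # the previous winner loses this bucket
                self.winners[key] = pub
                self.votes[pub] += 1
        return self.top_k()

    def top_k(self):
        live = [p for p, v in self.votes.items() if v > 0]
        return sorted(live, key=lambda p: (-self.votes[p], -self.relevance[p]))[: self.k]

index = IncrementalTopK(k=2)
for pub, score, ids in [("A", 0.9, [0, 1]), ("B", 0.5, [0, 1]),
                        ("C", 0.7, [2, 1]), ("D", 0.6, [3, 4])]:
    print(pub, index.add(pub, score, ids))
```

No neighborhood is recomputed when a publication arrives; each insertion costs one comparison per hash table.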
17
LSH in Dynamic Diversification: When new publication F arrives…
• Only the buckets that F maps into will vote.
• Follow the continuity requirements: durability and order.
[Figure: publications P_A, P_B, ..., P_H, with the ith and (i+1)th windows]
18
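LSH in Adaptive Diversification: Analysis
For two vectors x, y with Jaccard similarity $s$
For publications x & y
• At a particular hash table (r-sized signature vectors)
  x & y map into the same bucket: $s^r$
  x & y do not map into the same bucket: $1 - s^r$
• At L hash tables
  x & y do not map into the same bucket in any table: $(1 - s^r)^L$
  x & y map into the same bucket in at least one table: $1 - (1 - s^r)^L$
True near neighbors are unlikely to be unlucky in all the projections.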
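A small numeric illustration in Python of the standard minhash-LSH collision probability, 1 − (1 − s^r)^L, where s is the Jaccard similarity, r the size of each signature vector, and L the number of hash tables. The function name and the parameter values r = 4, L = 8 are hypothetical and are not the settings used in the talk.

```python
def collision_probability(s, r, L):
    """Probability that two items with Jaccard similarity s share a bucket
    in at least one of L hash tables, each keyed on r minhash values."""
    same_bucket_one_table = s ** r
    miss_all_tables = (1 - same_bucket_one_table) ** L
    return 1 - miss_all_tables

for s in (0.3, 0.5, 0.8):
    print(s, round(collision_probability(s, r=4, L=8), 3))
```

With these illustrative settings, a pair with similarity 0.8 collides in at least one table with probability above 0.98, while a pair with similarity 0.3 does so with probability below 0.07.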
19
Evaluation: Dataset
Publication stream
• Amazon on-line marketplace data available on 17th – 19th November 2014
Zipfian subscriptions
• N - number of elements in the distribution
• k - rank of element
• s - value of exponent
Normalized preferences
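The parameters N, k and s match the standard Zipf probability mass function, f(k; s, N) = k^(-s) / Σ_{n=1..N} n^(-s). Assuming that is the distribution meant here, a minimal Python sketch follows; the function name `zipf_pmf` and the values N = 10, s = 1 are hypothetical, and how subscriptions were actually drawn is not stated in the talk.

```python
def zipf_pmf(k, s, N):
    """Probability of the element of rank k among N elements with exponent s."""
    normaliser = sum(1 / n ** s for n in range(1, N + 1))
    return (1 / k ** s) / normaliser

probabilities = [zipf_pmf(k, s=1.0, N=10) for k in range(1, 11)]
print([round(p, 3) for p in probabilities])
```

The probabilities sum to 1 and fall as the rank k grows, so a few top-ranked elements attract most of the mass.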
20
Terminology: ILSH, BLSH and NAÏVE
[Figure: publication stream P1, P2, ..., P8, marked with four “BLSH or NAÏVE” labels and one “ILSH” label]
21
Accuracy: ILSH vs. NAÏVE
[Figure: probability of ILSH producing the optimal diverse set of results, by Jaccard similarity threshold (s)]
22
Performance & Efficiency: ILSH vs. BLSH vs. NAÏVE
[Figure: log(Top-k matching time) against number of publications, with D = 500]
23
Conclusions
Locality Sensitive Hashing (LSH) indexing method
• Produces a diverse set of results at 70% accuracy on average relative to the naïve method
• Reduces the matching time significantly compared with the NAÏVE method
• Is further refined by its incremental version
  • For handling streaming publications
  • Avoids the curse of re-computing neighborhoods
Top-k to restrict the delivery of top publications
• Given a window size & delivery method, the model can produce the best diverse set of personalized results
• To represent the set of all matching publications at a given instance
24
Future work
• Explore other suitable use cases for the proposed model & develop prototype applications, e.g.
  • Personalized newspaper for every Facebook user
  • Adaptive resource scheduling in large-scale distributed systems
• Exploit overlap among the diversified results of users who have similar interests
• Develop an LSH-based index for a multi-threaded distributed environment
25
Q&A
THANK YOU!