Transcript of Dancing with publish/subscribe
- 1. Dancing with publish/subscribe (Distributed event based
systems) Lightning Talk on Top-k publish/subscribe By Y.S.
Horawalavithana BSc(Hons.) Computer Science MSc. Distributed System
1
- 2. Today: For whom? Outline. Discussion.
- 3. Communication paradigms. Point-to-point communication:
participants need to exist at the same time; direct coupling; strict
identity management; not good for volatile environments; not a good
way to communicate with several participants. Indirect communication:
communication through an intermediary between sender(s) and
receiver(s); no direct coupling; space uncoupling (anonymity); time
uncoupling (independent lifetimes) through a persistent communication
channel.
- 4. Indirect communication. Scenarios where users connect and
disconnect very often: mobile environments, messaging services,
forums. Event dissemination where receivers may be unknown and
change often: RSS, event feeds in financial services. Scenarios with
a very large number of participants: Google Ads system, Spotify.
Commonly used in cases where change is anticipated. Need to provide
dependable services.
- 5. Taxonomy of indirect communication. Communication based: group
communication, message queues, publish/subscribe. State based: tuple
spaces, distributed shared memory.
- 6. Publish/Subscribe: "Notify me of all stock quotes of Google
from NYSE if the price is greater than 150."
- 7. Introduction: pub/sub systems. Information consumers express
their interests in information with subscriptions, identifying
which items are of interest. Information producers publish
information by submitting publications (a.k.a. publication events
or event notifications). A pub/sub system performs: Subscription
processing: indexing and storing subscriptions. Event processing:
upon event arrival, access subscription indices and identify all
matched subscriptions. Event delivery: deliver the event to clients
with matched subscriptions.
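The three steps above can be sketched as a toy in-memory broker. This is a minimal illustration, not an implementation from the talk: the `Broker` class, its method names and the `inbox` attribute are hypothetical, and subscriptions are stored in a plain dictionary rather than a real index.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Event = Dict[str, object]
Predicate = Callable[[Event], bool]


class Broker:
    """Toy broker (hypothetical): store subscriptions, match events, deliver."""

    def __init__(self) -> None:
        self._subscriptions: Dict[str, List[Predicate]] = defaultdict(list)
        self.inbox: Dict[str, List[Event]] = defaultdict(list)

    def subscribe(self, client: str, predicate: Predicate) -> None:
        # Subscription processing: store the subscription per client.
        self._subscriptions[client].append(predicate)

    def publish(self, event: Event) -> None:
        # Event processing: scan the stored subscriptions for matches.
        for client, predicates in self._subscriptions.items():
            if any(pred(event) for pred in predicates):
                # Event delivery: hand the event to the matched client.
                self.inbox[client].append(event)


broker = Broker()
broker.subscribe(
    "dealer",
    lambda e: e.get("name") == "Google" and e.get("price", 0) > 150,
)
broker.publish({"name": "Google", "price": 200})  # delivered
broker.publish({"name": "Google", "price": 100})  # filtered out
```

A production system would replace the linear scan in `publish` with a subscription index; the scan is kept here only to make the three phases visible.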
- 8. Programming model. Figure adapted from Instructor's Guide for
Coulouris, Dollimore, Kindberg and Blair, Distributed Systems:
Concepts and Design, Edn. 5, Pearson Education 2012.
- 9. Introduction: DB view of pub/sub. Events correspond to data
(data-carrying events). Subscriptions correspond to continuous
queries that define predicates on attributes. Fundamentally
different model: instead of storing/indexing data and issuing
queries to access it, queries (subscriptions) are stored/indexed and
incoming data (events) is matched against the stored queries.
- 10. Introduction: Communications view of pub/sub. Akin to
multicasting (group IPC, 1-N communication): each publisher (through
its events) communicates with a large number of subscribers. However,
communication is: Anonymous: subscribers do not know publishers and
vice versa. Asynchronous: publishers and subscribers do not block
when publishing/subscribing. Mutually out-of-sync: no rendezvous in
time. Heterogeneous: can be used to connect heterogeneous components.
- 11. Example: Real-world Implementation
- 12. Pub/sub: System Space. Figure adapted from K. Pripužić, I.
Podnar Žarko, and K. Aberer, Top-k/w publish/subscribe, 2012.
- 13. Pub/sub: Subscription models. Content based, type based, topic
based, context. Figure labels: type, object types, independent
channels, hierarchical topics, (un)structured queries, complex event
processing.
- 14. Pub/sub: Real-world Applications. Too numerous to list; some
representative application classes: news alerts, online stock quotes,
Internet games, sensor networks, location-based services, network
management, Internet auctions.
- 15. Case study: Dealing Room
- 16. Case study: Spotify
- 17. Spotify at first glance. End-to-end architecture to support
social interaction. Topic-based subscriptions: Friends (Spotify +
Facebook): Facebook friends who are Spotify users and who share
music. Playlists (URI): other users' playlists (updates), either
collaborative or modifiable only by their creator. Artist pages
(follow artist): new albums or news related to the artist.
- 18. Spotify at first glance. Hybrid engine: relays events to
online users in real time; stores and forwards selected events to
offline users. DHT-based overlay. 3 sites: Stockholm (Sweden), London
(UK), Ashburn (USA). Designed to scale: stores approx. 600 million
subscriptions at any given time and matches billions of publication
events every day.
- 19. Large scale publish/subscribe systems
- 20. Boolean matching in pub/sub. Assume the dealing room system is
implemented on top of the pub/sub paradigm. A dealer submits a
subscription [Name = Google, price > 150, volume < 5000]. The stock
exchange publishes a stock quote (publication) [Name = Google,
price = 200, volume = 3000].
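The dealer example can be checked with a few lines of code. The sketch below assumes a subscription is a conjunction of (attribute, operator, value) constraints; the `matches` helper and the `OPS` table are illustrative names, not part of any system mentioned in the talk.

```python
import operator

# Supported comparison operators (an illustrative subset).
OPS = {
    "=": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
}


def matches(subscription, publication):
    """Boolean matching: True only if every constraint holds."""
    for attribute, op, value in subscription:
        if attribute not in publication:
            return False
        if not OPS[op](publication[attribute], value):
            return False
    return True


subscription = [("name", "=", "Google"), ("price", ">", 150), ("volume", "<", 5000)]
publication = {"name": "Google", "price": 200, "volume": 3000}
print(matches(subscription, publication))
```

The result is a plain yes/no answer per publication: there is no score, so two matching publications cannot be compared with each other.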
- 21. Drawbacks of Boolean pub/sub. A subscriber may be either
overloaded with publications or receive too few publications. It is
impossible to compare different matching publications, as ranking
functions are not defined. Partial matching between subscriptions
and publications is not supported.
- 22. Real-world Requirements: Sensor Web. Real-time environmental
monitoring: environmental scientists would like to identify and
monitor up to 10 sites with the largest pollution readings over the
course of a single day (NSF's Ocean Observatories Initiative, OOI);
identify the 10 sensors closest to a particular location measuring
the largest pollution levels over time, e.g. top-10 readings provided
on an hourly basis (SNSF's SensorScope project). Power grid
monitoring: operators would like to monitor over time the 100 sites
with the largest or the lowest power production, using solar panel
current and voltage readings, so that they can identify power grid
hot-spots.
- 23. Real-world Requirements: Forest Fire rescue
- 24. Real-world Requirements: Social Media. Personalized newspaper:
a Facebook user is exposed to more than approximately 1500 stories
per day, but an average user engages with only 100 stories from the
current news feed. What about a personalized newspaper at the end of
the day? Social annotation of news stories: serving Yahoo! News
page-views with a fresh set of Top-k tweets, by treating each news
story as a subscription and tweets as incoming publications.
- 25. Top-k publish/subscribe: "Notify me of the Top-10 stock quotes
of Google hourly from NYSE if the price is greater than 150."
- 26. Top-k publish/subscribe. How many matching publications will
be delivered to a subscriber during a period of time? Actually, we
don't know in state-of-the-art pub/sub systems. Top-k pub/sub models
are powered by: expressive stateful query processing engines; a
user-defined parameter k that restricts the delivered publications;
time-(in)dependent Top-k computing methods; a sliding window model
for handling streaming publications; methods to deliver Top-k
notifications (pro-active, on-demand).
- 27. Abstract Top-k/w matching. Limit the number of matching and
delivered publications to the k best within a sliding window of size
w. (Figure: Top-2 selection over a matching publication stream, shown
for jumping steps h = 1 and h = 3.)
- 28. [Pripužić 2012] Top-k/w model: DaZaLaPS. The subscriber
controls the number of publications it receives per subscription
(top-k) within a sliding window. A subscription is defined by: a
totally-ordered and time-independent scoring function; a parameter
$k \in \mathbb{N}$; a parameter $w \in \mathbb{R}^{+}$ (time-based) or
$n \in \mathbb{N}$ (count-based sliding window). It ranks
publications according to their degree of relevance (score) to a
subscription. Each publication competes with the other publications
in the sliding window for a position among the top-k publications.
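A count-based variant of this model can be sketched in a few lines. This is a naive illustration that re-ranks the whole window on every arrival, not the algorithm from the cited work; `TopKWindow` and the price-based scoring function are assumptions made for the example.

```python
from collections import deque
import heapq


class TopKWindow:
    """Naive top-k over a count-based sliding window of the last n publications."""

    def __init__(self, k, n, score):
        self.k = k
        self.score = score
        # deque(maxlen=n) drops the oldest publication automatically.
        self.window = deque(maxlen=n)

    def add(self, publication):
        self.window.append(publication)
        return self.top_k()

    def top_k(self):
        # Re-rank the whole window; fine for a sketch, costly at scale.
        return heapq.nlargest(self.k, self.window, key=self.score)


w = TopKWindow(k=2, n=3, score=lambda p: p["price"])
for price in [160, 200, 180, 170, 150]:
    current = w.add({"name": "Google", "price": price})
print([p["price"] for p in current])
```

After the last arrival the window holds the prices 180, 170 and 150, so the quote at 200 has already expired and no longer competes.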
- 29. [Pripužić 2012] Top-k/w model: DaZaLaPS. When can a
publication become a Top-k object in the subscription window? Either
immediately upon publication, or later on, when better publications
have left the window. Hence: maintain a set of candidate (potential
Top-k) publications in memory!
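One way to keep such a candidate set small, sketched here under stated assumptions, is to drop a publication as soon as k newer publications score higher: those k outlive it in the window, so it can never enter the top-k. The slide does not spell out a pruning rule, so this rule, the `CandidateSet` class and the count-based window are assumptions of the example.

```python
import heapq


class CandidateSet:
    """Candidate publications for a count-based window of size n (sketch)."""

    def __init__(self, k, n):
        self.k = k
        self.n = n
        self.seq = 0
        self.candidates = []  # (sequence number, score), oldest first

    def add(self, score):
        self.seq += 1
        oldest_valid = self.seq - self.n + 1
        # Expire publications that fell out of the window.
        alive = [c for c in self.candidates if c[0] >= oldest_valid]
        alive.append((self.seq, score))
        kept = []
        for i, (seq, sc) in enumerate(alive):
            # Count newer publications with a strictly higher score.
            newer_and_better = sum(1 for _, other in alive[i + 1:] if other > sc)
            if newer_and_better < self.k:
                kept.append((seq, sc))
        self.candidates = kept

    def top_k(self):
        return heapq.nlargest(self.k, self.candidates, key=lambda c: c[1])


c = CandidateSet(k=1, n=3)
for s in [5, 3, 4]:
    c.add(s)
first = c.top_k()   # the publication with score 5 is still in the window
c.add(1)
later = c.top_k()   # score 5 expired; score 4 is promoted from the candidates
```

The publication with score 4 was not a top-1 object on arrival but becomes one later, which is exactly why it had to be retained as a candidate, while the publication with score 3 could be discarded early.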
- 30. [Pripužić 2012] Distributed Top-k/w model. Network of
processing nodes, where each node is responsible for computing
Top-k/w publications. Publication flooding. (Figure: nodes A to F
with subscribe(s), change( ) and publish(p); p is forwarded to every
node.)
- 31. [Pripužić 2012] Distributed Top-k/w model. Subscription
flooding. Proxy subscriptions: replicas of the original
subscriptions, which are advertised over the network. (Figure: nodes
A to F with subscribe(s), change( ) and publish(p).)
- 32. [Pripužić 2012] Distributed Top-k/w model. Rendezvous routing.
Often implemented on top of a structured peer-to-peer network. The
rendezvous node is responsible for matching the publications and
subscriptions mapped to it and for delivering matching publications
to subscribers directly. (Figure: nodes A to F; subscribe(s) and
publish(p) are routed to the same node, followed by change( ).)
- 33. [Pripužić 2012] Distributed Top-k/w model. Basic gossiping.
Similar to publication flooding, but publications spread randomly
through an overlay network as gossip. Cannot provide any guarantee
regarding publication delivery: purely probabilistic. (Figure: nodes
A to F with subscribe(s), change( ) and publish(p).)
- 34. [Pripužić 2012] Distributed Top-k/w model. Informed gossiping.
Each node additionally stores and processes the subscriptions of its
close neighbors. Partially probabilistic and partially
deterministic. (Figure: nodes A to F with subscribe(s), change( ) and
publish(p).)
- 35. [Shraer 2014] Google Top-k pub/sub. News story as a
subscription, tweets as publications.
- 36. [Shraer 2014] Google Top-k pub/sub. Annotating news stories
with social updates (tweets) at a news website serving a high volume
of page-views: billions of page-views at Yahoo! News per day and more
than 100 million related tweets per day. Top-k pub/sub approach:
stories are standing subscriptions on tweets. The Story Index is
queried frequently but updated infrequently, based on DAAT and TAAT
algorithms. The Tweet Index is updated frequently but queried only
for new stories.
- 37. [Drosou 2009] PrefSIENA. Say Addison is more interested in
horror movies than comedies. Addison would like to receive
notifications about (various) comedies only if there are no (or
just a few) notifications about horror movies. Published events:
[title = The Godfather, genre = drama, showing time = 21:10];
[title = Ratatouille, genre = comedy, showing time = 21:15];
[title = Fight Club, genre = drama, showing time = 23:00];
[title = Casablanca, genre = drama, showing time = 23:10];
[title = Vertigo, genre = drama, showing time = 23:20]. User
subscriptions: genre = drama; genre = horror.
- 38. [Drosou 2009] PrefSIENA. To express some form of ranking
among subscriptions, PrefSIENA allows users to define priorities
among them. To do this, they introduce preferential subscriptions.
Based on preferential subscriptions, only the k most interesting
events are delivered to users. Covering/matching relation, example
constraints: string director = Peter Jackson; time release date > 1
Jan 2003; string director = Steven Spielberg; string genre = fantasy;
string release date > 1 Jan 2003. Example event: [string title =
LOTR: The Return of the King, string director = Peter Jackson, time
release date = 1 Dec 2003, string genre = fantasy, integer oscars =
11].
- 39. [Drosou 2009] PrefSIENA. Ordering subscriptions: to order
user subscriptions according to the preference relation, they use
the winnow operator, applying it on various levels. Step 01:
construct a DAG. (Figure: user preferences over the subscriptions
genre = drama, horror, comedy, romance and action, and the resulting
preference graph.)
- 40. [Drosou 2009] PrefSIENA. Step 02: perform a topological sort
to compute winnow levels. The subscriptions of level $i$ are
associated with a preference rank $\mathcal{G}(i)$, where
$\mathcal{G}$ is a monotonically decreasing function with values in
$[0, 1]$, e.g. $\mathcal{G}(l) = (D + 1 - (l - 1)) / (D + 1)$.
(Figure: preference graph over genre = drama, comedy, horror, romance
and action, with preference ranks 1, 2/3 and 1/3.)
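The level computation and the rank formula can be sketched as follows. The preference pairs below are hypothetical (the edges of the slide's graph are not recoverable from the transcript), and D is assumed to be one less than the number of levels, which reproduces the ranks 1, 2/3 and 1/3 shown on the slide.

```python
def winnow_levels(subscriptions, prefers):
    """Assign winnow levels; level 1 holds the most preferred subscriptions.

    prefers is a set of (better, worse) pairs forming a DAG.
    """
    remaining = set(subscriptions)
    levels = {}
    level = 1
    while remaining:
        # Winnow: keep subscriptions with no better subscription remaining.
        top = {s for s in remaining
               if not any((b, s) in prefers for b in remaining)}
        if not top:
            raise ValueError("preference relation contains a cycle")
        for s in top:
            levels[s] = level
        remaining -= top
        level += 1
    return levels


def preference_rank(level, D):
    """Rank formula from the slide: (D + 1 - (l - 1)) / (D + 1)."""
    return (D + 1 - (level - 1)) / (D + 1)


# Hypothetical preferences: drama > horror, comedy > romance > action.
prefers = {("drama", "horror"), ("comedy", "romance"), ("romance", "action")}
levels = winnow_levels(["drama", "horror", "comedy", "romance", "action"], prefers)
D = max(levels.values()) - 1  # assumption: D + 1 equals the number of levels
ranks = {s: preference_rank(l, D) for s, l in levels.items()}
```

Repeatedly peeling off the undominated subscriptions is equivalent to a layered topological sort of the preference DAG.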
- 41. [Drosou 2009] PrefSIENA. Step 03: computing event ranks.
Example: with user subscriptions genre = adventure (0.9) and
director = Peter Jackson (0.7), the event [string title = King Kong,
string director = Peter Jackson, time release date = 14 Dec 2005,
string genre = adventure] gets rank 0.9 = max. Step 04: based on the
ranks, they deliver to users only the k most interesting events.
Delivery modes: continuous, periodic and sliding window.
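The King Kong example can be reproduced with a short sketch in which an event's rank is the maximum preference rank among the subscriptions it matches. Equality-only constraints and the `event_rank` helper are simplifying assumptions of this example.

```python
def event_rank(event, subscriptions):
    """Return the highest rank among matching subscriptions, or None.

    subscriptions is a list of (constraints, rank) pairs, where
    constraints maps attribute names to required values.
    """
    ranks = [
        rank
        for constraints, rank in subscriptions
        if all(event.get(attr) == value for attr, value in constraints.items())
    ]
    return max(ranks) if ranks else None


subscriptions = [
    ({"genre": "adventure"}, 0.9),
    ({"director": "Peter Jackson"}, 0.7),
]
king_kong = {
    "title": "King Kong",
    "director": "Peter Jackson",
    "release date": "14 Dec 2005",
    "genre": "adventure",
}
print(event_rank(king_kong, subscriptions))
```

An event that matches no subscription gets no rank at all and is never a delivery candidate.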
- 42. [Drosou 2009] PrefSIENA: Sliding window delivery, with k = 2
and w = 4. User subscriptions: genre = comedy (0.9), genre = romance
(0.9), genre = drama (0.8), genre = horror (0.6). Timeline marks:
20:00, 20:15, 20:22, 20:25, 20:50, 20:40, 20:45, 20:55. Matching
events: [The Big Parade, romance, showing time 21:00]; [The
Apartment, comedy, 21:10]; [The Godfather, drama, 21:25]; [Forrest
Gump, romance, 21:10]; [Jaws, horror, 20:55]; [Vertigo, horror,
21:45]; [Psycho, horror, 21:50]; [Pulp Fiction, drama, 21:25].
Delivered events: [The Big Parade, romance, 21:00]; [The Apartment,
comedy, 21:10]; [Forrest Gump, romance, 21:10]; [The Godfather,
drama, 21:25]; [Psycho, horror, 21:50]; [Pulp Fiction, drama, 21:25].
- 43. [Drosou 2009] PrefSIENA. But wait: the most highly ranked
events may be very similar to each other, and we wish to retrieve
results on a broader variety of user interests. Two different
perspectives on achieving diversity: Avoid overlap: choose
notifications that are dissimilar to each other. Increase coverage:
choose notifications that cover as many user interests as possible.
How to measure diversity? There are many alternative ways; the common
ground is to measure similarity/distance among the selected items.
- 44. Diversity: Top-k representative set. A method to retrieve
representative Top-k publications from the matching publications.
(Figure panels: drawback, without diversity; what we want, with
diversity.)
- 45. MAX* k-diversity problem. Find
$S^{*} = \operatorname{argmax}_{S \subseteq P,\ |S| = k} f(S, d)$,
where 1. $P = \{p_1, \dots, p_n\}$; 2. $k \le n$; 3. $d$: a distance
metric; 4. $f$: a diversity function.
- 46. Proposed: MAXDIVREL k-diversity problem. Find
$S^{*} = \operatorname{argmax}_{S \subseteq P} f(S, d, r) = \operatorname{argmax}_{S \subseteq P} \frac{g(S, d, r)}{h(S, d, r)}$,
where $g(S, d, r)$ is to maximize the dissimilarity and relevancy in
$S$, and $h(S, d, r)$ is to minimize the dissimilarity and relevancy
in $P - S$. Here 1. $P = \{p_1, \dots, p_n\}$; 2. $d$: a distance
metric; 3. $r$: a relevance metric; 4. $f$: a diversity function.
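- 47. Formal definition: MAXDIVREL k-diversity. Both objectives are
averages over pairs $(p_i, p_j)$ of terms built from $r(p_i)$,
$r(p_j)$ and $d(p_i, p_j)$: $g(S, d, r)$, normalized by $1/|S|$ over
pairs in $S$, is maximized while independence holds; $h(S, d, r)$,
normalized by $1/|P - S|$ over pairs with $p_i \in S$ and
$p_j \in P - S$, is minimized while dominance holds. Here 1.
$P = \{p_1, \dots, p_n\}$; 2. $d$: a distance metric; 3. $r$: a
relevance metric; 4. a distance threshold greater than 0.
Independence condition: every two publications in $S$ are farther
apart than the threshold. Dominance condition: for every publication
in $P - S$ there is a publication in $S$ within the threshold.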
- 48. NP-hardness: minimum independent dominating set. The
publication space is modeled as a graph in which
Neighborhood($p_i$) = { $p_j$ | $d(p_i, p_j)$ is within the distance
threshold }. (Figure: five publications, 1 to 5, in the publication
space and as a graph model, with three selections that are
independent and dominating and one that is dominating but not
independent.)
- 49. Naïve Greedy. (Selection formula: an argmax over candidate
publications; its terms are not recoverable from the transcript.)
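Since the slide's formula is lost, the sketch below shows a generic greedy heuristic for building an independent dominating set instead, and should not be read as the author's exact method. Publications are visited in decreasing relevance and kept only if they are farther than the threshold from everything already selected; `greedy_diverse`, the 1-D positions and the threshold value are all assumptions.

```python
def greedy_diverse(publications, relevance, distance, threshold):
    """Greedy independent dominating set, most relevant publications first."""
    selected = []
    for p in sorted(publications, key=relevance, reverse=True):
        # Independence: p must not be a neighbor of any selected item.
        if all(distance(p, q) > threshold for q in selected):
            selected.append(p)
        # Otherwise p is dominated by the selected neighbor that rejected it.
    return selected


# Hypothetical data: name -> (position on a line, relevance score).
data = {"a": (0.0, 0.9), "b": (0.1, 0.8), "c": (1.0, 0.7), "d": (1.05, 0.95)}
result = greedy_diverse(
    publications=list(data),
    relevance=lambda p: data[p][1],
    distance=lambda p, q: abs(data[p][0] - data[q][0]),
    threshold=0.2,
)
print(result)
```

Every rejected publication lies within the threshold of some selected one, so the output is dominating as well as independent. Its size is determined by the threshold, not by a fixed k, and each run recomputes all pairwise distances.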
- 50. Handling streaming publications. Continuity requirements: 1.
Durability: an item selected as diverse in the i-th window may still
be selected in the (i+1)-th window if it has not expired and the
other valid items in the (i+1)-th window fail to compete with it. 2.
Order: the publication stream follows chronological order, so we
avoid selecting an item j as diverse later when we have already
selected an item i that is not older than j.
- 51. MAXDIVREL continuous k-diversity. MAXDIVREL k-diversity is
applied to the i-th window and to the (i+1)-th window of the matching
publication stream, subject to independence, dominance, durability
and order. Straightforward solution: apply the naïve greedy method at
each instance. Proposed: an incremental index mechanism that avoids
the curse of re-calculating neighborhoods.
- 52. Locality Sensitive Hashing (LSH). Simple idea: if two points
are close together, then after a projection operation these two
points will remain close together.
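- 53. LSH analysis. For any given points $x, y$: if
$d(x, y) \le d_1$ then $\Pr[h(x) = h(y)] \ge p_1$; if
$d(x, y) \ge d_2$ then $\Pr[h(x) = h(y)] \le p_2$. Such a hash
function $h$ is $(d_1, d_2, p_1, p_2)$-sensitive. Ideally we need
$(p_1 - p_2)$ to be large and the gap between $d_1$ and $d_2$ to be
small.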
- 54. LSH in MAXDIVREL: Publications as categorical data
- 55. LSH in MAXDIVREL: Characteristic Matrix
- 56. LSH in MAXDIVREL: Minhashing. No publications any more: each
one is represented by a signature. Technique: randomly permute the
rows of the characteristic matrix m times; for each permutation, take
the number of the first row, in the permuted order, in which the
publication's column has a 1. Advantage: reduces the dimensions to a
small minhash signature. (Figure: first permutation of the rows of
the characteristic matrix.)
- 57. LSH in MAXDIVREL: Signature Matrix. Fast minhashing: select m
random hash functions to model the effect of m random permutations.
Mathematically proved only when the number of rows is a prime.
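Fast minhashing can be sketched with linear hash functions of the form (a·x + b) mod p. With a prime number of rows p and a non-zero multiplier a, each such function is a true permutation of the row numbers, which is the condition the slide refers to. The helper names, the seed and the 7-row example are assumptions of this sketch.

```python
import random


def make_hash_funcs(m, prime, seed=0):
    """Build m hash functions h(x) = (a*x + b) mod prime, with a != 0."""
    rng = random.Random(seed)
    funcs = []
    for _ in range(m):
        a = rng.randrange(1, prime)
        b = rng.randrange(0, prime)
        # Default arguments freeze a and b for each lambda.
        funcs.append(lambda x, a=a, b=b: (a * x + b) % prime)
    return funcs


def minhash_signature(rows_with_one, hash_funcs):
    """For each hash function, take the smallest hashed row that has a 1."""
    return [min(h(r) for r in rows_with_one) for h in hash_funcs]


ROWS = 7  # prime number of rows in the characteristic matrix (assumed)
hash_funcs = make_hash_funcs(m=20, prime=ROWS)
sig_x = minhash_signature({0, 2, 5}, hash_funcs)
sig_y = minhash_signature({0, 2, 6}, hash_funcs)
# Fraction of agreeing positions estimates the Jaccard similarity of x and y.
estimate = sum(a == b for a, b in zip(sig_x, sig_y)) / len(sig_x)
```

The estimate approaches the true Jaccard similarity (2/4 for these two sets) as m grows; with m = 20 it is only a rough approximation.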
- 58. LSH in MAXDIVREL: LSH Buckets. Take r-sized signature vectors
from the m-sized minhash signature and map them into L hash tables,
each with an arbitrary number b of buckets.
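A minimal sketch of this mapping follows, assuming the r-sized vectors are contiguous slices ("bands") of the signature; the slide does not say how the r values are chosen, so the slicing, the `lsh_buckets` name and the use of Python's built-in `hash` are assumptions.

```python
def lsh_buckets(signature, L, r, b):
    """Return one bucket id per hash table for a minhash signature.

    Table t hashes the r signature values in positions [t*r, (t+1)*r).
    """
    if len(signature) < L * r:
        raise ValueError("signature too short for L bands of r values")
    return [
        hash(tuple(signature[t * r:(t + 1) * r])) % b
        for t in range(L)
    ]


buckets_x = lsh_buckets([1, 2, 3, 4], L=2, r=2, b=16)
buckets_y = lsh_buckets([1, 2, 9, 9], L=2, r=2, b=16)
# x and y agree on the first band, so they share a bucket in table 0.
same_in_table_0 = buckets_x[0] == buckets_y[0]
```

Two publications become neighbor candidates as soon as they share a bucket in at least one of the L tables.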
- 59. LSH in MAXDIVREL: How to select L, r? For two vectors x, y.
(Formulas not recoverable from the transcript.)
- 60. LSH in MAXDIVREL: Analysis. For two publications x and y with
similarity $s = \mathrm{sim}(x, y)$: at a particular hash table, x
and y map into the same bucket with probability $s^{r}$ and do not
map into the same bucket with probability $1 - s^{r}$. Over L hash
tables, x and y never map into the same bucket with probability
$(1 - s^{r})^{L}$, so they share at least one bucket with probability
$1 - (1 - s^{r})^{L}$. True near neighbors will be unlikely to be
unlucky in all the projections.
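The effect of r and L on this probability is easy to explore numerically. The function below evaluates the standard banding expression 1 - (1 - s^r)^L; the chosen values r = 4 and L = 10 are only illustrative.

```python
def collision_probability(s, r, L):
    """Probability that two items with similarity s share a bucket
    in at least one of L tables, each keyed on r signature values."""
    return 1 - (1 - s ** r) ** L


near = collision_probability(0.8, r=4, L=10)  # similar pair
far = collision_probability(0.2, r=4, L=10)   # dissimilar pair
print(round(near, 3), round(far, 3))
```

With these settings a pair at similarity 0.8 collides with probability above 0.99, while a pair at similarity 0.2 collides with probability below 0.02. Raising r sharpens the filter, and raising L recovers true neighbors that a single table would miss.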
- 61. LSH in MAXDIVREL: Batch-wise Top-k computation. Bucket winner:
the publication with the highest relevancy score in a bucket. A
winner is dominant and represents its bucket neighborhood. Top-k:
the winners that have a majority of votes; the k winners are
independent. (Figure: i-th window.)
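One possible reading of this voting scheme is sketched below: every bucket in every hash table votes for its most relevant publication, and the k publications with the most votes are returned. The slide does not define the voting rule precisely, so this interpretation, the `top_k_winners` name and the sample data are assumptions.

```python
from collections import Counter


def top_k_winners(bucket_ids, relevance, k):
    """bucket_ids maps each publication to its bucket id in every table."""
    best = {}  # (table, bucket) -> current bucket winner
    for pub, ids in bucket_ids.items():
        for table, bucket in enumerate(ids):
            key = (table, bucket)
            if key not in best or relevance[pub] > relevance[best[key]]:
                best[key] = pub
    # Each bucket casts one vote for its winner.
    votes = Counter(best.values())
    return [pub for pub, _ in votes.most_common(k)]


bucket_ids = {"A": [1, 1, 1], "B": [1, 1, 2], "C": [3, 3, 3]}
relevance = {"A": 0.9, "B": 0.8, "C": 0.5}
winners = top_k_winners(bucket_ids, relevance, k=2)
```

Here B shares two of its three buckets with the more relevant A and wins only one vote, so A and C are selected: B is represented by its near neighbor A.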
- 62. LSH in MAXDIVREL: Incremental Top-k computation. Steps:
characteristic matrix; signature matrix; map the signature into L
hash tables; update the winner at each bucket the signature maps
into; vote.
- 63. LSH in MAXDIVREL: when a new publication F arrives, only
buckets 13, 23, 32, 43 will vote. Follow the continuity
requirements: durability and order. (Figure: i-th window and
(i+1)-th window.)
- 64. Implementation
- 65. Cloud service modules. Figure sources: Amazon Kinesis, Amazon
ElastiCache.
- 66. Top-k pub/sub: DEMO
- 67. P2P Pub/Sub. Scribe: topic-based, built on top of Pastry,
stateful, rendezvous. Hermes: topic- and content-based, built on
top of a Pastry(-like) net, stateful, rendezvous and flooding-like.
Meghdoot: content-based, built on top of CAN, stateful, rendezvous.
Tera: topic-based, built on an unstructured P2P net, stateful,
random-walk-based flooding. Sub2Sub: content-based, built on an
unstructured P2P net, stateful, flooding-like. DHTStrings:
content-based, DHT-independent, string support, stateless,
rendezvous. OP-DHT Pub/Sub: content-based, (can be) built on top of
Chord/Pastry/Bamboo.
- 68. DHT based pub/sub: Scribe. Topic based. Based on a DHT
(Pastry). Rendezvous event routing: a random identifier is assigned
to each topic, and the Pastry node with the identifier closest to
that of the topic becomes responsible for that topic.
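The rendezvous rule can be illustrated without a real overlay. The sketch below derives identifiers by hashing names into a 32-bit circular space and picks the node whose identifier is closest to the topic's. It does not use Pastry's API or routing: the hashing scheme, the identifier width and all function names are assumptions.

```python
import hashlib

ID_BITS = 32  # assumed identifier width for this sketch


def make_id(name):
    """Derive a pseudo-random identifier from a name (assumption: SHA-1)."""
    return int.from_bytes(hashlib.sha1(name.encode()).digest()[:4], "big")


def ring_distance(a, b):
    """Distance between two identifiers on the circular id space."""
    d = abs(a - b)
    return min(d, 2 ** ID_BITS - d)


def rendezvous_node(topic, nodes):
    """Pick the node whose identifier is closest to the topic identifier."""
    topic_id = make_id(topic)
    return min(nodes, key=lambda n: ring_distance(make_id(n), topic_id))


nodes = ["node-A", "node-B", "node-C", "node-D"]
owner = rendezvous_node("stock/google", nodes)
```

Publishers and subscribers of the same topic compute the same owner independently, so they meet at one node without knowing each other. In a real DHT the lookup is resolved by multi-hop routing rather than by scanning a node list.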
- 69. DHT based pub/sub: Meghdoot. Content based. Based on the
structured overlay CAN. Maps the subscription language and the event
space to the CAN space. Subscription and event routing exploit the
CAN routing algorithms.
- 70. Top-k publish/subscribe on P2P. Stateful approaches introduce
some kind of state at (intermediate) nodes. State can refer to:
state needed to support specialized structures built on top of the
network structure, e.g. trees (parent, children pointers); routing
state for content-based routing, i.e. subscription paths to be
followed by matching publications; subscription (meta)data, meaning
not just forward pointers to be followed and subscription content
(its predicates), but also possible info as to What about query
inherent diversification? The controlled parameters (k and w) can
change. Updates and the need to maintain state consistency may
stress the system and revoke any benefits, so we'll be left with the
complexity.
- 71. Future work. Apply Top-k diversification modules on
(un)structured P2P networks. Exploit the overlap among the
diversified results of users who have similar interests. Develop an
LSH-based index for a multi-threaded distributed environment. Develop
large scale Top-k pub/sub applications by exploring other suitable
use-cases, e.g. a personalized newspaper for every Facebook user, a
diverse set of personalized Twitter trends, social annotation of
news stories.
- 72. Thank you! sam2010ucsc@acm.org @SamTube405
http://geektube405.wordpress.com