Counting is Hard: Probabilistically Counting Views at Reddit
Krishnan Chandra, Data Engineer
Overview
● What is probabilistic counting?
● How did probabilistic counting help us scale?
● What issues did we face along the way?
What is Reddit?
Reddit is the front page of the internet
A social network with tens of thousands of communities built around whatever passions or interests you might have
It’s where people converse about the things that are most important to them
Reddit by the numbers
● Alexa Rank (US/World): 4th/7th
● MAU: 330M+
● Active Communities: 138K+
● Posts per month: 10.7M
● Screenviews per month: 14B
Counting Views
Why Count Views?
● Includes logged-out users
● Better measure of reach than votes
● Currently exposed to moderators and content creators
[Images: “Cat Walking a Human” and “Cat Fist Bumping”]
Why is Counting Hard?
Product Requirements
● Counts are over the life of a post
● The same user should not count multiple times within a short time frame
● Should build in some protections against spamming/cheating (similar to votes)
● Should provide (near) real-time feedback
● Exact counting:
○ Requires storing state per user, per post
● Approximate counting:
○ Requires much less state and storage
○ Provides an estimate of reach within a few percentage points of the exact number
Exact vs. Approximate Counting
● HyperLogLog (HLL)
○ Hash-based probabilistic algorithm published in 2007
○ Approximates set cardinality
○ Works well for large cardinalities, but not for small ones
● HyperLogLog++
○ Introduced by Google in 2013
○ Uses sparse and dense HLL representations
○ Starts sparse and switches to the dense representation once the set grows large enough
HyperLogLog (And Friends)
● Hash table consisting of m registers (buckets), each k bits wide
● Hash the input value, and split the hash value into 2 portions
● First portion (log2 m bits) used to index to a register
● Second portion used to count the number of leading zeros; the register is set to the maximum of its current value and (leading zeros + 1)
How does HLL work?
Assume: m = 8 registers, k = 3 bits

Input hash: 111 00011

First 3 bits (111) → register #7
Remaining bits (00011) contain 3 leading zeros
Record 3 + 1 = 4 (binary 100) into register #7

Registers: r0–r6 = 000, r7 = 100

Adapted from HyperLogLog - A Layman’s Overview
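The register update in the example above can be sketched in Python. This is a toy illustration, not Reddit’s production code; the 8-bit hash width and register layout follow the slide’s example:

```python
# Toy HLL register update for m = 8 registers and an 8-bit hash,
# matching the worked example: hash 11100011 -> register 7, value 4.
M = 8                      # number of registers
P = 3                      # log2(M) index bits
HASH_BITS = 8              # assumed hash width, from the slide

def update(registers, hash_value):
    # First P bits index the register.
    index = hash_value >> (HASH_BITS - P)
    # Remaining bits: count leading zeros, store (zeros + 1),
    # keeping the maximum seen so far.
    rest = hash_value & ((1 << (HASH_BITS - P)) - 1)
    width = HASH_BITS - P
    zeros = width - rest.bit_length() if rest else width
    registers[index] = max(registers[index], zeros + 1)
    return index

registers = [0] * M
idx = update(registers, 0b11100011)
print(idx, registers[idx])   # -> 7 4
```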
● Cardinality is estimated from the harmonic mean of 2 raised to each register value, scaled by the number of registers and a bias-correction constant
● Intuition: HLL is like flipping a coin!
● The longest run of heads gives an estimate of the total number of flips: seeing n heads in a row suggests roughly 2^n flips
Computing Cardinality
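The estimate can be sketched as below. This is the raw HLL estimate only, without the small- and large-range corrections from the paper, and the bias constant shown is the asymptotic approximation, which is an assumption for such a small m:

```python
# Raw HLL cardinality estimate: a bias-correction constant times
# m^2 times the harmonic mean term 1 / sum(2^-M[j]).
def raw_estimate(registers):
    m = len(registers)
    alpha = 0.7213 / (1 + 1.079 / m)   # asymptotic bias constant
    return alpha * m * m / sum(2.0 ** -r for r in registers)

# Registers from the worked example: one register holds 4.
regs = [0, 0, 0, 0, 0, 0, 0, 4]
print(round(raw_estimate(regs), 1))
```

With only a handful of registers the raw estimate is badly biased at low cardinalities, which is exactly the gap the small-range correction (and HLL++’s sparse mode) addresses.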
Counting Error
● HLL standard error
○ Depends on the number of registers/hash buckets m
○ Standard error = 1.04/sqrt(m)
○ Redis’s HLL implementation uses m = 16384 registers, so the standard error is 0.81%!
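That figure follows directly from the formula (m = 16384 is the register count in Redis’s implementation):

```python
import math

m = 16384                        # registers in Redis's HLL implementation
std_error = 1.04 / math.sqrt(m)  # 1.04 / 128
print(round(std_error * 100, 2))  # -> 0.81 (percent)
```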
Using HLL to Count Views
● 1 HLL per post
● HLL inserts are idempotent!
○ Allows reprocessing data if needed
● How to manage de-duping over a short time window?
○ Store user + truncated timestamp as the value
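One way to build that de-duplicated value is sketched below; the one-hour window and the key format are illustrative assumptions, not necessarily what Reddit uses:

```python
WINDOW_SECONDS = 3600  # assumed de-dup window

def view_event_value(user_id: str, ts: int) -> str:
    # Truncate the timestamp to the window, so repeat views by the
    # same user within the window produce the identical HLL element.
    return f"{user_id}:{ts // WINDOW_SECONDS}"

# Two views by the same user 10 minutes apart collapse to one element...
a = view_event_value("user42", 1_500_000_000)
b = view_event_value("user42", 1_500_000_600)
# ...while a view in a later window counts again.
c = view_event_value("user42", 1_500_003_600)
print(a == b, a == c)   # -> True False
```

Because HLL only stores register maxima, inserting the same element twice changes nothing, which is what makes this de-duplication trick work.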
Space Usage
● Exact counting:
○ User id = 8-byte long
○ ~1.5M users × 8 bytes = 12 MB
● HLL (Redis implementation):
○ Max size = 12 KB
○ 0.1% of the exact-counting storage
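The comparison on this slide works out as:

```python
exact = 1_500_000 * 8   # bytes: one 8-byte user id per viewer
hll = 12 * 1024         # bytes: Redis HLL max size (12 KB)

print(exact // 10**6)   # -> 12 (MB for exact counting)
print(hll / exact)      # roughly 0.001, i.e. ~0.1% of the exact storage
```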
Counting Architecture
Architecture Goals
1. Consume a stream of view events and filter out spam/bad events
2. For good events, insert into an HLL in real time
3. Allow clients to consume view counts in real time
[Architecture diagram: client-side and server-side events flow from the app servers through anti-spam into counting]
Stream Processing Infrastructure
● Kafka
○ Main message bus for view events
● Redis
○ Used for storing state + HLLs
○ Intended as short-term storage
○ Functions as a cache for Cassandra
● Cassandra
○ Used to store the final counts and HLLs in separate column families
○ Intended as long-term storage
Counting Application (Part 1)
● Anti-Spam Consumer
○ Consumes the stream of views from Kafka
○ Basic rules engine backed by Redis
○ Outputs a decision to a Kafka topic
Counting Application (Part 2)
● Counting Consumer
○ Consumes the decisions topic output by the anti-spam consumer
○ Creates/updates the HLL for the post in Redis
○ Stores both the count and the HLL filter out to Cassandra
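A minimal in-memory sketch of the counting consumer, with Kafka and Redis replaced by plain Python structures; everything here (the tiny HLL, the function names, the register count) is a stand-in for illustration, not Reddit’s code:

```python
import hashlib
from collections import defaultdict

class TinyHLL:
    """Toy HLL with 2^P registers; inserts are idempotent by construction."""
    P = 6  # 64 registers (real implementations use far more)

    def __init__(self):
        self.registers = [0] * (1 << self.P)

    def add(self, value: str):
        h = int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "big")
        index = h >> (64 - self.P)                  # first P bits pick a register
        rest = h & ((1 << (64 - self.P)) - 1)       # remaining bits
        rank = (64 - self.P) - rest.bit_length() + 1  # leading zeros + 1
        self.registers[index] = max(self.registers[index], rank)

# One HLL per post, mirroring the per-post Redis keys.
hlls = defaultdict(TinyHLL)

def handle_decision(post_id: str, dedup_value: str, legitimate: bool):
    """Counting consumer: apply good events from the decisions topic."""
    if legitimate:
        hlls[post_id].add(dedup_value)

# Replaying the same event changes nothing -- HLL inserts are idempotent,
# which is what makes reprocessing safe.
handle_decision("post1", "user42:416666", True)
before = list(hlls["post1"].registers)
handle_decision("post1", "user42:416666", True)
assert hlls["post1"].registers == before
```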
Scaling Challenges
● Problems
○ Rules engine is very memory-heavy
○ HLL counting is very CPU-heavy
○ Rules engine data is generally time-bound, with expiry
○ HLL data should be kept in Redis as long as possible to avoid reading from Cassandra
Redis
● Solutions
○ Separate Redis instances for the 2 parts of the application
○ Different instance types to reflect the different workloads
○ allkeys-lru eviction on the HLL instance, volatile-ttl eviction on the rules-engine instance
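In redis.conf terms, the two instances would be configured along these lines; this is a sketch, and the maxmemory values are placeholders:

```
# HLL instance: evict any key, least-recently-used first,
# so hot (new) posts stay cached and cold ones fall back to Cassandra.
maxmemory 4gb
maxmemory-policy allkeys-lru

# Rules-engine instance: evict only keys that carry a TTL,
# shortest remaining TTL first, matching the time-bound data.
maxmemory 8gb
maxmemory-policy volatile-ttl
```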
● Problems
○ 1 row per post, overwritten frequently
○ Read rate on page loads overwhelming the cluster
○ Issues with load when “catching up”
○ Storage grows forever with the number of posts!
Cassandra
● Solutions
○ Updates to the same row in Cassandra throttled to every 10 seconds
○ Read caching
○ Slow the update rate when catching up
○ More disk!
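The per-row write throttle can be sketched as follows; the 10-second interval comes from the slide, while the class shape and explicit clock are illustrative:

```python
THROTTLE_SECONDS = 10  # from the slide: at most one write per row per 10s

class WriteThrottle:
    def __init__(self):
        self.last_write = {}   # post_id -> timestamp of last flush

    def should_write(self, post_id: str, now: float) -> bool:
        last = self.last_write.get(post_id)
        if last is None or now - last >= THROTTLE_SECONDS:
            self.last_write[post_id] = now
            return True
        return False  # skip; a later update will carry the newer count

t = WriteThrottle()
print(t.should_write("post1", 0.0),    # first write goes through
      t.should_write("post1", 4.0),    # throttled
      t.should_write("post1", 10.0))   # interval elapsed -> write
# -> True False True
```

Skipping a write is safe here because each flush carries the full current count, so the last write in any window makes the row correct.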
● Views on Reddit skew towards newer posts
○ Allows most views to be served by Redis
○ Keeps the read rate on Cassandra very low
Observations
● Thanks to HLLs, counting views became much more efficient
○ Current storage usage is ~1 TB for a full year of posts!
● Delivery was possible in a quarter with an engineering team of 3 (not always full-time)
Takeaways
Thanks to our team!
● /u/gooeyblob - Cassandra + Backend
● /u/d3fect - Backend + API
● /u/powerlanguage - Product Management
Thanks!
Krishnan Chandra
[email protected]
/u/shrink_and_an_arch
PS: We’re hiring! http://reddit.com/jobs
References
● View Counting at Reddit (blog post from 2017)
● Original HyperLogLog paper
● Redis blog announcing HLL support
● Google paper announcing the HLL++ algorithm
● HyperLogLog - A Layman’s Overview