Large-Scale Real-Time Data Management for Engagement and Monetization

Large-Scale Real-Time Data Management for Engagement and Monetization Simon Lia-Jonassen LSDS-IR 2015


Page 1: Large-Scale Real-Time Data Management for Engagement and Monetization

Large-Scale Real-Time Data Management for Engagement and Monetization Simon Lia-Jonassen LSDS-IR 2015

Page 2: Large-Scale Real-Time Data Management for Engagement and Monetization

Our mission is to help companies understand their audience and build great online user experiences.

– Find interesting articles.
– Stay longer on the site.
– Get relevant ads.
– Sign up for subscriptions.

Some of our customers:

About Cxense

Page 3: Large-Scale Real-Time Data Management for Engagement and Monetization

Our solutions

Page 4: Large-Scale Real-Time Data Management for Engagement and Monetization

Cxense DMP

Page 5: Large-Scale Real-Time Data Management for Engagement and Monetization

How does it work!? (JavaScript tag example)

Page 6: Large-Scale Real-Time Data Management for Engagement and Monetization

Page view events (example)

Page 7: Large-Scale Real-Time Data Management for Engagement and Monetization

Content profiles (example)

Page 8: Large-Scale Real-Time Data Management for Engagement and Monetization

Custom events (example)

Page 9: Large-Scale Real-Time Data Management for Engagement and Monetization
Page 10: Large-Scale Real-Time Data Management for Engagement and Monetization

UI and API capabilities

Page 11: Large-Scale Real-Time Data Management for Engagement and Monetization

UI and API capabilities

Page 12: Large-Scale Real-Time Data Management for Engagement and Monetization

User Segments

Page 13: Large-Scale Real-Time Data Management for Engagement and Monetization

Data Volume and Traffic (monthly)

– 5 000 active websites
– 100 million pages
– 1 billion users
– 15 billion page views

Constraints and Requirements
– Online and real-time processing
  • Show, analyze and act on what is happening right now.
– High and sustainable performance
  • Peak load of 10K+ requests/sec.
  • 50 ms latency constraint for ads and recs.
– Availability, reliability, durability
  • Multi-DC and fault tolerance
– Security and privacy

Challenges

Page 14: Large-Scale Real-Time Data Management for Engagement and Monetization

Heterogeneity and Reliability
– Hundreds of mobile and desktop platforms, browsers, internet providers, etc.
– Multiple browsers and devices per user, cross-domain tracking (3rd-party cookies are dying out).
– Web pages (articles, image/video galleries, chats, search/front pages) and human language.
– The Internet is Broken™

Customer success
– Providing the right insights.
  • Data, metrics and visualization.
– Providing the right set of tools.
  • Usability, brevity, expressiveness, completeness.
– Best practices.
  • Analytics, ads, recs, user engagement, personalization and subscription optimization.
– Onboarding and support.

Challenges

Page 15: Large-Scale Real-Time Data Management for Engagement and Monetization

Communication
– HTTP with JSON payload.
– Durable and idempotent.

Local storage
– Atomically append to file.
– Use a separate directory for each partition and a new file each hour.
– Tail files and/or directories.

Metadata
– Keeps the state.
– Rewind and re-feed when needed.

System
– Configured via Upstart and Cron.
– Monitoring via Graphite and log files.
– Automatic alerting.

Architecture and Data Flow
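
The local-storage bullets above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual Cxense implementation: the helper names (`partition_path`, `append_event`, `read_unique`), the file naming scheme, and the use of the `(time, rnd)` pair as a de-duplication key are all hypothetical.

```python
import json
import os
from datetime import datetime, timezone


def partition_path(root, partition, ts_ms):
    """Return <root>/<partition>/<YYYY-MM-DD-HH>.log for a timestamp in milliseconds."""
    hour = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc).strftime("%Y-%m-%d-%H")
    return os.path.join(root, str(partition), hour + ".log")


def append_event(root, partition, event):
    """Append one JSON event as one line, using a single write on an O_APPEND descriptor."""
    path = partition_path(root, partition, event["time"])
    os.makedirs(os.path.dirname(path), exist_ok=True)
    line = (json.dumps(event, sort_keys=True) + "\n").encode("utf-8")
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, line)  # one write call per event
        os.fsync(fd)        # push the bytes to disk before acknowledging
    finally:
        os.close(fd)
    return path


def read_unique(path):
    """Re-read a file and skip repeated (time, rnd) pairs, so re-feeding is idempotent.

    Using (time, rnd) as the event identity is an assumption made for this sketch.
    """
    seen, events = set(), []
    with open(path, encoding="utf-8") as f:
        for line in f:
            event = json.loads(line)
            key = (event["time"], event["rnd"])
            if key not in seen:
                seen.add(key)
                events.append(event)
    return events
```

With one directory per partition and one file per hour, a consumer can tail the newest file, and a rewind amounts to re-reading older files from a saved offset.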

Page 16: Large-Scale Real-Time Data Management for Engagement and Monetization

Data Cubes
– Partitioned column store database.
– Efficient string handling and integer compression.
– Fast filtering and aggregation over billions of data points.
– Low update latency (100 ms).
– Exists in multiple variants:
  • Disk or memory based.
  • Partitioned by site, by user or by both.
– Low-level API.

Example:

The Cube

time            user     rnd      siteid   url                                 browser
1409425329634   "4szi"   "xzst"   "9978"   "cxnews.com"                        "Chrome"
1409425329634   "zthp"   "fd0z"   "9978"   "cxnews.com/seahawks-win-again…"    "Firefox"
1409425329635   "4szi"   "tzdt"   "9978"   "cxnews.com/tesla-model-3-will-…"   "Chrome"
1409425329640   "4szi"   "aext"   "9978"   "cxnews.com/elon-musk-is-awes…"     "Chrome"
1409425329640   "zx5t"   "dxrf"   "9978"   "cxnews.com/tesla-model-3-will-…"   "Safari"
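
To make "partitioned column store" concrete, here is a toy sketch that stores the example rows column by column and splits them by a partition key. The class names `Cube` and `PartitionedCube` and their methods are hypothetical and only illustrate the layout; the real low-level API is not shown in the slides. The `rnd` and `url` columns are left out for brevity.

```python
class Cube:
    """A toy column store: one Python list per column, rows appended in arrival order."""

    def __init__(self, column_names):
        self.columns = {name: [] for name in column_names}

    def append(self, row):
        for name, values in self.columns.items():
            values.append(row[name])

    def __len__(self):
        return len(next(iter(self.columns.values())))


class PartitionedCube:
    """Routes each row to a separate Cube chosen by the values of the partition columns."""

    def __init__(self, column_names, partition_by):
        self.column_names = list(column_names)
        self.partition_by = list(partition_by)
        self.partitions = {}

    def append(self, row):
        key = tuple(row[c] for c in self.partition_by)
        if key not in self.partitions:
            self.partitions[key] = Cube(self.column_names)
        self.partitions[key].append(row)


NAMES = ["time", "user", "siteid", "browser"]
ROWS = [
    {"time": 1409425329634, "user": "4szi", "siteid": "9978", "browser": "Chrome"},
    {"time": 1409425329634, "user": "zthp", "siteid": "9978", "browser": "Firefox"},
    {"time": 1409425329635, "user": "4szi", "siteid": "9978", "browser": "Chrome"},
    {"time": 1409425329640, "user": "4szi", "siteid": "9978", "browser": "Chrome"},
    {"time": 1409425329640, "user": "zx5t", "siteid": "9978", "browser": "Safari"},
]

by_site = PartitionedCube(NAMES, ["siteid"])
by_user = PartitionedCube(NAMES, ["user"])
for row in ROWS:
    by_site.append(row)
    by_user.append(row)
```

Partitioning by site keeps all five example rows together, while partitioning by user spreads them over three small cubes.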

Page 17: Large-Scale Real-Time Data Management for Engagement and Monetization

Frame of Reference Compression
– Compress the numbers in groups of 64.
– If the sequence is increasing – use the first number as the reference and compute the differences between each two consecutive numbers (deltas).
– Find the maximum number of bits (width) needed to represent the largest delta and compress the deltas using a fixed bit width.
– For non-increasing sequences, use the smallest number as the reference and the differences between the numbers and the reference as deltas.

The Cube – Integer Columns
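
The scheme described on this slide can be sketched as below. It is a minimal illustration, not the production encoder: the tuple layout of an encoded group, the function names, and packing the deltas into one Python integer are assumptions made for readability, and "increasing" is read as non-decreasing.

```python
def encode_group(values):
    """Encode up to 64 non-negative integers as (increasing, reference, width, count, packed)."""
    assert 0 < len(values) <= 64
    increasing = all(a <= b for a, b in zip(values, values[1:]))
    if increasing:
        reference = values[0]
        deltas = [b - a for a, b in zip(values, values[1:])]  # consecutive differences
    else:
        reference = min(values)
        deltas = [v - reference for v in values]              # offsets from the minimum
    width = max((d.bit_length() for d in deltas), default=0)  # bits for the largest delta
    packed = 0
    for i, d in enumerate(deltas):
        packed |= d << (i * width)                            # fixed-width bit packing
    return increasing, reference, width, len(values), packed


def decode_group(group):
    increasing, reference, width, count, packed = group
    mask = (1 << width) - 1
    n_deltas = count - 1 if increasing else count
    deltas = [(packed >> (i * width)) & mask for i in range(n_deltas)]
    if not increasing:
        return [reference + d for d in deltas]
    values = [reference]
    for d in deltas:
        values.append(values[-1] + d)
    return values


def encode(values, group_size=64):
    return [encode_group(values[i:i + group_size]) for i in range(0, len(values), group_size)]


def decode(groups):
    return [v for group in groups for v in decode_group(group)]
```

For the five timestamps in the example table, the deltas are 0, 1, 5 and 0, so each delta fits in 3 bits next to a single full-width reference value.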

Page 18: Large-Scale Real-Time Data Management for Engagement and Monetization

– A global lexicon maps all strings to numbers and back.
– For each column, map global keys to a smaller set of numbers and back.

The Cube – String Columns
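
A rough sketch of the two-level mapping follows. The class names `Lexicon` and `StringColumn`, and the choice of dense integers assigned in insertion order, are assumptions for illustration only.

```python
class Lexicon:
    """Global lexicon: string <-> global key (dense integers in insertion order)."""

    def __init__(self):
        self._keys = {}
        self._strings = []

    def key(self, s):
        if s not in self._keys:
            self._keys[s] = len(self._strings)
            self._strings.append(s)
        return self._keys[s]

    def string(self, key):
        return self._strings[key]


class StringColumn:
    """Per-column mapping: global key <-> small local code. Rows store local codes only."""

    def __init__(self, lexicon):
        self.lexicon = lexicon
        self._local = {}    # global key -> local code
        self._global = []   # local code -> global key
        self.rows = []

    def append(self, s):
        g = self.lexicon.key(s)
        if g not in self._local:
            self._local[g] = len(self._global)
            self._global.append(g)
        self.rows.append(self._local[g])

    def get(self, row):
        return self.lexicon.string(self._global[self.rows[row]])
```

Because local codes only count the distinct values of one column, they stay small even when the global lexicon grows large, which keeps the per-row integers cheap to compress.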

Page 19: Large-Scale Real-Time Data Management for Engagement and Monetization

Structured data
– Can represent any simple JSON object (document).
– Node types: Null, Object, Array, Integer, Float, String, Boolean.
– Stored in a separate container, separate columns for each node type.
– Each document is decomposed into a list of paths and nodes.
– Each node is added to the corresponding column.

The Cube – Advanced Data Types
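
The decomposition into paths and typed nodes might look like the sketch below. The tuple-based path representation, the function names, and storing the child count as the value of Object and Array nodes are assumptions of this example, not details given in the slides.

```python
from collections import defaultdict


def node_type(value):
    """Map a parsed JSON value to one of the node types listed on the slide."""
    if value is None:
        return "Null"
    if isinstance(value, bool):  # must be tested before int: bool is a subclass of int
        return "Boolean"
    if isinstance(value, int):
        return "Integer"
    if isinstance(value, float):
        return "Float"
    if isinstance(value, str):
        return "String"
    if isinstance(value, dict):
        return "Object"
    if isinstance(value, list):
        return "Array"
    raise TypeError(f"not a JSON value: {value!r}")


def decompose(doc, path=()):
    """Yield (path, type, value) for every node, parents before children."""
    kind = node_type(doc)
    if kind == "Object":
        yield path, kind, len(doc)
        for name, child in doc.items():
            yield from decompose(child, path + (name,))
    elif kind == "Array":
        yield path, kind, len(doc)
        for index, child in enumerate(doc):
            yield from decompose(child, path + (index,))
    else:
        yield path, kind, doc


def add_document(columns, doc_id, doc):
    """Append every node of the document to the column of its node type."""
    for path, kind, value in decompose(doc):
        columns[kind].append((doc_id, path, value))


columns = defaultdict(list)
add_document(columns, 0, {"title": "Tesla", "tags": ["cars", 3], "paid": False})
```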

Page 20: Large-Scale Real-Time Data Management for Engagement and Monetization

Filtering operations and tricks:
– Keep a bit-filter over a range of rows (1 = include).
– By a number or range – unset the bits where the numbers do not match. Can use binary search for ordered columns such as time, and inverted indexes for unordered columns such as user id.
– By a key or set of keys – map the keys to a number or bit-set and filter.
– By pattern – filter by the set of keys matching the pattern.
– Logical AND, OR, NOT – use a stack of filters and binary operations.

The Cube – Filtering and Aggregation
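
The filtering tricks can be illustrated with Python integers standing in for bit-filters. This is a sketch under assumptions: a set bit is taken to mean the row is still included, the function names are hypothetical, shell-style wildcards stand in for whatever pattern syntax the real system uses, and filters are combined from a postfix program on a stack.

```python
from bisect import bisect_left, bisect_right
from fnmatch import fnmatchcase


def all_rows(n):
    """Bit-filter with the lowest n bits set."""
    return (1 << n) - 1


def filter_range_sorted(column, lo, hi):
    """Rows with lo <= value <= hi in an ordered column, found by binary search."""
    start, end = bisect_left(column, lo), bisect_right(column, hi)
    return ((1 << (end - start)) - 1) << start


def build_inverted_index(column):
    """Map each distinct value of an unordered column to the bit-set of its rows."""
    index = {}
    for row, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row)
    return index


def filter_keys(index, keys):
    mask = 0
    for key in keys:
        mask |= index.get(key, 0)
    return mask


def filter_pattern(index, pattern):
    """Filter by the set of keys matching a shell-style pattern."""
    return filter_keys(index, [k for k in index if fnmatchcase(k, pattern)])


def evaluate(program, n):
    """Evaluate a postfix list of bit-filters and 'AND' / 'OR' / 'NOT' using a stack."""
    stack = []
    for item in program:
        if item == "NOT":
            stack.append(all_rows(n) & ~stack.pop())
        elif item == "AND":
            b, a = stack.pop(), stack.pop()
            stack.append(a & b)
        elif item == "OR":
            b, a = stack.pop(), stack.pop()
            stack.append(a | b)
        else:
            stack.append(item)
    return stack.pop()


def rows(mask):
    """List the row numbers whose bits are set."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]
```

On the example table, a time-range filter combined with "NOT Chrome" would be written as `evaluate([time_filter, chrome_filter, "NOT", "AND"], 5)`.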

Page 21: Large-Scale Real-Time Data Management for Engagement and Monetization

Some aggregation operations and tricks:
– Count, Sum, Cardinality – bit-counting. Can use HyperLogLog (HLL) for distributed cardinality.
– Frequency, SumBy, CardinalityMap – sorting and bit-counting using pairs of integers.
– Frequency-, SumBy- and CardinalityDistribution – histograms, more sorting and bit-counting.

The Cube – Filtering and Aggregation
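
A few of these aggregations can be sketched over an integer-coded column and a bit-filter (here a Python integer in which a set bit selects a row). The function names are hypothetical, and the exact-cardinality routine shown is a simple single-node stand-in, not the HLL variant mentioned for the distributed case.

```python
from itertools import groupby


def selected(mask):
    """Yield the row numbers of the set bits, lowest first."""
    row = 0
    while mask:
        if mask & 1:
            yield row
        mask >>= 1
        row += 1


def count(mask):
    """Count: number of set bits in the filter."""
    return bin(mask).count("1")


def total(column, mask):
    """Sum of the selected values."""
    return sum(column[row] for row in selected(mask))


def cardinality(codes, mask):
    """Exact distinct count: set one bit per code that occurs, then count the bits."""
    seen = 0
    for row in selected(mask):
        seen |= 1 << codes[row]
    return count(seen)


def frequency(codes, mask):
    """Sort the selected codes, count runs, return (code, count) pairs, most frequent first."""
    values = sorted(codes[row] for row in selected(mask))
    pairs = [(code, sum(1 for _ in run)) for code, run in groupby(values)]
    return sorted(pairs, key=lambda pair: (-pair[1], pair[0]))
```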

Page 22: Large-Scale Real-Time Data Management for Engagement and Monetization

Advanced operations
– Use aggregation output as filtering input (e.g., top-list, histogram, etc.).
– Join between cubes on one or multiple dimensions.

The Cube – Filtering and Aggregation
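
Both ideas can be illustrated in a few lines. This is a simplified sketch: the function names are hypothetical, rows are plain dictionaries rather than columns, the join is an ordinary hash join, and the `segment` field in the usage note is an invented example attribute.

```python
from collections import Counter


def top_keys(column, k):
    """Aggregation step: the k most frequent values in a column (a top-list)."""
    return [value for value, _ in Counter(column).most_common(k)]


def filter_by_keys(column, keys):
    """Filtering step: row numbers whose value is in the given key set."""
    keys = set(keys)
    return [row for row, value in enumerate(column) if value in keys]


def join(left_rows, right_rows, on):
    """Hash join of two lists of dict rows on one or more shared dimensions."""
    index = {}
    for right in right_rows:
        index.setdefault(tuple(right[d] for d in on), []).append(right)
    return [
        {**left, **right}
        for left in left_rows
        for right in index.get(tuple(left[d] for d in on), [])
    ]
```

For instance, `filter_by_keys(users, top_keys(users, 1))` selects all page views of the most active user, and `join(events, profiles, ["user"])` attaches a hypothetical `segment` attribute from a second cube to each matching event.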

Page 23: Large-Scale Real-Time Data Management for Engagement and Monetization

Partitioning
– Most of the data structures are partitioned into chunks of data.
– This improves memory allocation, materialization, skipping, compression and locking.

Static and dynamic parts
– Each data column, lexicon or mapping consists of a static and a dynamic part.
– The static part is ordered – use binary search and Minimal Perfect Hashing.
– The dynamic part is read-write – it has to be searched exhaustively, but this is improved using Wavelet Trees.
– Updates are mostly appends, but an update can also be done as a deletion followed by a new write.

Maintenance
– Periodically flush the dynamic part into the static part.
– Remove the old data, delete unused strings, optimize the mapping.

The Cube – Updates
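
The static/dynamic split can be sketched with a small key set. This is a deliberately simplified illustration: the class `KeySet` and its methods are hypothetical, the static part uses only binary search, and neither Minimal Perfect Hashing nor Wavelet Trees are implemented here.

```python
from bisect import bisect_left


class KeySet:
    """A sorted, read-only static part plus a small append-only dynamic part."""

    def __init__(self, static=()):
        self.static = sorted(set(static))
        self.dynamic = []

    def __contains__(self, key):
        i = bisect_left(self.static, key)  # binary search in the ordered static part
        if i < len(self.static) and self.static[i] == key:
            return True
        return key in self.dynamic         # exhaustive scan of the dynamic part

    def add(self, key):
        """Updates are appends to the dynamic part; known keys are ignored."""
        if key not in self:
            self.dynamic.append(key)

    def flush(self, keep=lambda key: True):
        """Maintenance: merge the dynamic part into the static part, dropping unwanted keys."""
        merged = set(self.static) | set(self.dynamic)
        self.static = sorted(k for k in merged if keep(k))
        self.dynamic = []
```

Keeping the dynamic part small through periodic flushes bounds the cost of the exhaustive scan, while the bulk of the data stays in the ordered, compact static part.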

Page 24: Large-Scale Real-Time Data Management for Engagement and Monetization

Thank you! Questions?

Credits: Erik Gorset and the Oslo R&D Team

[email protected] …btw, we are hiring!

cxense.com facebook.com/cxense

twitter.com/cxense linkedin.com/company/cxense

youtube.com/user/cxense

Page 25: Large-Scale Real-Time Data Management for Engagement and Monetization

One more thing… the Internet of Things!