Leveraging Big Data and Real-Time Analytics at Cxense
Simon Lia-Jonassen, 08/04/15
About Cxense

Our mission is to help companies understand their audience and build great online user experiences, where users:
– Stay longer on the site.
– Sign up for subscriptions.
– Find interesting articles.
– Buy recommended products.
About Cxense

Founded in 2010, ~100 employees in 2015.
Offices:
– Melbourne, Tokyo, Singapore, Stockholm, Copenhagen, Oslo*, London, Buenos Aires, Rio de Janeiro, Miami, New York, San Francisco.
Some of our customers:
Our solutions
How does it work!?
Event (example)
Content Profile (example)
Challenges

Data Volume and Traffic
– 5K+ websites.
– 50M+ pages (last month).
– 500M+ users (last month).
– 10B+ events/month (20K events/sec peak).

Heterogeneity and Reliability
– Hundreds of mobile and desktop platforms, browsers, internet providers, etc.
– Multiple devices per user, cross-domain tracking (the 3rd-party cookie is dying).
– Web pages (articles, image/video galleries, chats, search/front pages) and human language.
– The Internet is Broken™.

Constraints and Requirements
– Online and real-time processing: show and analyze what is happening right now.
– High and sustainable performance:
  • Throughput: peak load of 10K+ requests/sec.
  • Latency: 100ms constraint for ads and recommendations.
– Fault-tolerance and durability.
Architecture and Data Flow (simplified)
Data Flow and Feeding

Communication
– HTTP with JSON payload.
– Durable and idempotent.

Local storage
– Atomically append to a file.
– Use a new file each hour.
– Use a separate directory for each partition.
– Tail files and/or directories (see the sketch below).

Metadata
– Keeps the state.
– Can go backwards and re-feed when needed.

System
– Semi-automatic configuration via Upstart and Crontab.
– Monitoring via Graphite and log files.
– Automatic alerting and centralized log search.
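A minimal sketch of the hourly-file feeding scheme, assuming a line-oriented JSON log (the directory layout and function names are illustrative, not Cxense's actual code):

```python
import os
import time

# Sketch of the local-storage scheme above: events are appended to an hourly
# file inside a per-partition directory, and a tailer resumes from a saved
# offset so feeding can be replayed (re-fed) from any earlier position.

def append_event(base_dir: str, partition: int, line: str) -> None:
    """Append one JSON event line to the current hourly file of a partition."""
    hour = time.strftime("%Y%m%d%H")                # a new file each hour
    part_dir = os.path.join(base_dir, f"partition-{partition:04d}")
    os.makedirs(part_dir, exist_ok=True)
    with open(os.path.join(part_dir, hour + ".log"), "a") as f:
        f.write(line + "\n")                        # single append per event

def tail(path: str, offset: int = 0):
    """Yield (new_offset, line) pairs from a saved offset; the offset is the
    metadata state that lets a consumer go backwards and re-feed."""
    with open(path) as f:
        f.seek(offset)
        for line in f:
            yield f.tell(), line.rstrip("\n")
```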
The Cube

What is The Cube?
– Partitioned column-store database.
– Uses efficient string handling and integer compression.
– Provides fast filtering and aggregation over 50B data points.
– Guarantees low update latency (100ms).
– Exists in multiple variants:
  • Disk- or memory-based.
  • Partitioned by site, by user, or by both.
– Low-level API.

Example:
time           user    rnd     siteid  url                                 browser
1409425329634  "4szi"  "xzst"  "9978"  "cxnews.com"                        "Chrome"
1409425329634  "zthp"  "fd0z"  "9978"  "cxnews.com/seahawks-win-again…"    "Firefox"
1409425329635  "4szi"  "tzdt"  "9978"  "cxnews.com/tesla-model-3-will-…"   "Chrome"
1409425329640  "4szi"  "aext"  "9978"  "cxnews.com/elon-musk-is-awes…"     "Chrome"
1409425329640  "zx5t"  "dxrf"  "9978"  "cxnews.com/tesla-model-3-will-…"   "Safari"
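To make the columnar layout concrete, here is a minimal sketch of how a partition could hold the example rows above column-wise (a subset of the columns; an illustration, not the Cube's storage format):

```python
from collections import Counter

# Each column is its own array, so a query touches only the columns it needs.
cube_partition = {
    "time":    [1409425329634, 1409425329634, 1409425329635,
                1409425329640, 1409425329640],
    "user":    ["4szi", "zthp", "4szi", "4szi", "zx5t"],
    "siteid":  ["9978", "9978", "9978", "9978", "9978"],
    "browser": ["Chrome", "Firefox", "Chrome", "Chrome", "Safari"],
}

# Example: count events per browser without reading any other column.
print(Counter(cube_partition["browser"]))
# Counter({'Chrome': 3, 'Firefox': 1, 'Safari': 1})
```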
The Cube – Integer Columns

Frame of Reference Compression
– Compress the numbers in groups of 64.
– If the sequence is increasing, use the first number as the reference and compute the differences between each two consecutive numbers (deltas).
– Find the maximum number of bits (width) needed to represent the largest delta and compress the deltas using a fixed bit width.
– For non-increasing sequences, use the smallest number as the reference and the differences between the numbers and the reference as deltas.

A sketch of this scheme follows.
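A minimal sketch of the frame-of-reference scheme described above, assuming non-empty groups of non-negative integers (the actual bit-packing step is elided; function names are illustrative):

```python
def compress_group(nums):
    """Compress one group (up to 64 numbers) into (increasing, reference, width, deltas)."""
    increasing = all(a <= b for a, b in zip(nums, nums[1:]))
    if increasing:
        reference = nums[0]
        deltas = [b - a for a, b in zip(nums, nums[1:])]   # consecutive differences
    else:
        reference = min(nums)
        deltas = [n - reference for n in nums]             # offsets from the minimum
    # Fixed bit width: just enough bits to hold the largest delta.
    width = max((d.bit_length() for d in deltas), default=0)
    return increasing, reference, width, deltas            # deltas get bit-packed at `width` bits each

def decompress_group(increasing, reference, width, deltas):
    if increasing:
        out = [reference]
        for d in deltas:
            out.append(out[-1] + d)                        # prefix-sum the deltas
        return out
    return [reference + d for d in deltas]
```

Ordered columns such as time compress particularly well here: consecutive timestamps in the example table differ by only a few milliseconds, so each delta needs just a handful of bits.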
The Cube – String Columns

– A global lexicon maps all strings to numbers and back.
– For each column, we map global keys to a smaller set of numbers and back (see the sketch below).
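A minimal sketch of the two-level mapping, using in-memory dictionaries (illustrative, not the Cube's implementation). The per-column numbers are small and dense, which pairs well with the integer compression above:

```python
class Lexicon:
    """Global, bidirectional string <-> number mapping."""
    def __init__(self):
        self._key_by_string = {}
        self._string_by_key = []

    def key(self, s):
        if s not in self._key_by_string:
            self._key_by_string[s] = len(self._string_by_key)
            self._string_by_key.append(s)
        return self._key_by_string[s]

    def string(self, key):
        return self._string_by_key[key]

class StringColumn:
    """Per-column remapping of global keys to a smaller number range."""
    def __init__(self, lexicon):
        self._lexicon = lexicon
        self._local_by_global = {}   # global key -> small local number
        self._global_by_local = []   # local number -> global key
        self.rows = []               # one small local number per row

    def append(self, s):
        g = self._lexicon.key(s)
        if g not in self._local_by_global:
            self._local_by_global[g] = len(self._global_by_local)
            self._global_by_local.append(g)
        self.rows.append(self._local_by_global[g])

    def value(self, row):
        return self._lexicon.string(self._global_by_local[self.rows[row]])
```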
The Cube – Filtering

Filter
– Keep a bit-filter over a particular range of rows as the state.

Filtering
– By number or range: pass through a column and update the filter. Use binary search for ordered columns such as time, and an inverted index for user id.
– By key: map the key to a number and filter by the number.
– By set of keys: map the keys to a bit-set and filter using the bit-set.
– By pattern: filter by the set of keys matching the pattern.

Logical operations
– AND, OR, NOT: use unary negation, binary intersection/join, and a stack of filters.

Advanced operations
– Use aggregation output as filtering input (e.g., top-list, explosion, histogram, etc.).
– Join between different cubes on one or multiple dimensions.

A sketch of the bit-filter approach follows.
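A minimal sketch of bit-filter filtering, using a Python integer as the bit-set (illustrative; a real implementation would scan compressed chunks and use binary search on sorted columns):

```python
class BitFilter:
    """One bit per row in a range; a set bit means the row is still selected."""
    def __init__(self, num_rows):
        self.num_rows = num_rows
        self.bits = (1 << num_rows) - 1          # start with all rows selected

    def restrict_to(self, rows):
        """Intersect the filter with the given row indices."""
        mask = 0
        for r in rows:
            mask |= 1 << r
        self.bits &= mask

    def negate(self):                            # NOT
        self.bits ^= (1 << self.num_rows) - 1

def filter_by_key(filt, column, key):
    """By key: pass through the column and keep matching rows."""
    filt.restrict_to(i for i, v in enumerate(column) if v == key)

def filter_by_range(filt, column, lo, hi):
    """By range: keep rows whose value falls in [lo, hi]."""
    filt.restrict_to(i for i, v in enumerate(column) if lo <= v <= hi)

# AND is just filter composition; OR would union the bits of two filters.
f = BitFilter(5)
filter_by_key(f, ["Chrome", "Firefox", "Chrome", "Chrome", "Safari"], "Chrome")
filter_by_range(f, [1, 2, 3, 4, 5], 1, 3)
print(bin(f.bits))                               # rows 0 and 2 remain: 0b101
```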
The Cube – Aggregation

Operations
– Count: count the number of bits set in the filter.
– Sum: sum the numbers where the filter bit is set.
– Cardinality: count the number of distinct keys/numbers.
– CardinalityEstimator: create a HyperLogLog cardinality estimator.
– Frequency: create a map of keys/numbers with the associated count.
– TopList: create a frequency map with only the k most popular keys/numbers.
– SumBy: create a map of keys/numbers with the associated sum.
– CardinalityMap: create a map of keys/numbers with the associated cardinality.
– FrequencyDistribution: create a histogram over frequencies.
– CardinalityDistribution: create a histogram over cardinalities.
– SumByDistribution: create a histogram over sums.
– NumericalStatistics: compute distribution statistics for numbers (min, max, percentiles).

A sketch of a few of these operations follows.
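A minimal sketch of a few of the operations above, applied to rows selected by a bit-filter (illustrative, not the Cube's implementation):

```python
from collections import Counter, defaultdict

def selected_rows(bits):
    """Yield the indices of the set bits in the filter."""
    i = 0
    while bits:
        if bits & 1:
            yield i
        bits >>= 1
        i += 1

def count(bits):                                  # Count
    return bin(bits).count("1")

def sum_where(bits, column):                      # Sum
    return sum(column[i] for i in selected_rows(bits))

def top_list(bits, column, k):                    # TopList
    return Counter(column[i] for i in selected_rows(bits)).most_common(k)

def sum_by(bits, key_column, value_column):       # SumBy
    out = defaultdict(int)
    for i in selected_rows(bits):
        out[key_column[i]] += value_column[i]
    return dict(out)
```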
The Cube – Updates

Partitioning
– Most of the data structures are partitioned into chunks of data in order to improve memory allocation, materialization, skipping, compression, and locking.

Static and dynamic parts
– Each data column, lexicon, or mapping consists of a static and a dynamic part.
– The static part is ordered: it can use binary search and Minimal Perfect Hashing.
– The dynamic part is read-write: it has to be searched exhaustively, but this is improved using Wavelet Trees.

Locking
– Distinct read and read-write locks with different granularity/scope.
– The updates are mostly appends, but some of the columns might be updated later (e.g., active time, exit query, etc.).

Maintenance
– Periodically flush the dynamic part into the static part.
– Remove the old data, delete unused strings, optimize the mapping.

A sketch of the static/dynamic split follows.
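A minimal sketch of the static/dynamic split, using a plain sorted list for the static part and a linear scan for the dynamic part (illustrative; the slide's Minimal Perfect Hashing and Wavelet Trees are elided):

```python
import bisect

class SplitColumn:
    """A column with an ordered static part and an append-only dynamic part."""
    def __init__(self):
        self.static = []     # sorted; read-only between flushes
        self.dynamic = []    # unordered appends since the last flush

    def append(self, value):
        self.dynamic.append(value)

    def contains(self, value):
        i = bisect.bisect_left(self.static, value)         # binary search
        if i < len(self.static) and self.static[i] == value:
            return True
        return value in self.dynamic                       # exhaustive scan

    def flush(self):
        """Maintenance: merge the dynamic tail into the sorted static part."""
        self.static = sorted(self.static + self.dynamic)
        self.dynamic = []
```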
The Cube – Advanced Data Types

Keyword vectors
– Represent user and document profiles.
– Each contains a document id, a version, and a set of group-item pairs with weights.
– Stored in a separate, highly partitioned set of containers.
– Each container keeps multiple groups.
– Each group contains document ids, items, and weights as columns (see the sketch below).
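A minimal sketch of the columnar keyword-vector layout (names and types are illustrative, not the Cube's actual containers):

```python
from dataclasses import dataclass, field

@dataclass
class Group:
    """One group inside a container; entries are stored column-wise."""
    doc_ids: list = field(default_factory=list)    # document id per entry
    items: list = field(default_factory=list)      # item (keyword) per entry
    weights: list = field(default_factory=list)    # weight per entry

    def add(self, doc_id, item, weight):
        self.doc_ids.append(doc_id)
        self.items.append(item)
        self.weights.append(weight)

@dataclass
class Container:
    """A container keeps multiple groups, keyed by group name."""
    groups: dict = field(default_factory=dict)

    def add(self, group, doc_id, item, weight):
        self.groups.setdefault(group, Group()).add(doc_id, item, weight)
```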
The Cube – Advanced Data Types

Structured data
– Can represent any simple JSON object (document).
– Node types: Null, Object, Array, Integer, Float, String, Boolean.
– Stored in a separate container, with separate columns for each node type.
– Each document is decomposed into a list of paths and nodes.
– Each node is added to the corresponding column (see the sketch below).
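A minimal sketch of decomposing a JSON document into (path, node type, value) triples, each of which would be appended to the column for its node type (the path syntax is an assumption):

```python
def decompose(node, path=""):
    """Yield (path, node_type, value) for every node in a JSON document."""
    if isinstance(node, dict):
        yield path, "Object", None
        for key, child in node.items():
            yield from decompose(child, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        yield path, "Array", None
        for i, child in enumerate(node):
            yield from decompose(child, f"{path}[{i}]")
    elif isinstance(node, bool):       # bool before int: bool subclasses int
        yield path, "Boolean", node
    elif isinstance(node, int):
        yield path, "Integer", node
    elif isinstance(node, float):
        yield path, "Float", node
    elif isinstance(node, str):
        yield path, "String", node
    elif node is None:
        yield path, "Null", None

doc = {"site": "cxnews.com", "views": 42, "tags": ["tesla", "ev"]}
for triple in decompose(doc):
    print(triple)
# ('', 'Object', None), ('site', 'String', 'cxnews.com'),
# ('views', 'Integer', 42), ('tags', 'Array', None),
# ('tags[0]', 'String', 'tesla'), ('tags[1]', 'String', 'ev')
```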
Analytics API and UI

Analytics API
– RESTful API: client-server, HTTP requests and response codes, stateless, cacheable, etc.
– API resource paths, JSON in, JSON out.
– Most of the APIs require authentication.
– Simple integration via cx.py, Java/JavaScript/C#/Python/Perl/PHP, or direct HTTP calls.

Traffic API
– A rich set of high-level APIs.
– Powerful ad-hoc syntax: types, groups, items, filters, fields, etc.
– See the demo!

Analytics UI
– HTML and JavaScript.
– Built on top of the Analytics API.
– Has multiple fixed, functional views which can be combined with arbitrary filters.
– Premium users have a workspace area for dynamic, configurable widgets.

A sketch of a JSON-in/JSON-out call follows.
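A minimal sketch of a JSON-in/JSON-out REST call of the kind described above. The resource path, query fields, and auth header here are hypothetical placeholders, not the actual Traffic API schema:

```python
import json
import urllib.request

def query_api(base_url, token, path, query):
    """POST a JSON query to an API resource path and return the JSON response."""
    req = urllib.request.Request(
        url=base_url + path,                        # hypothetical resource path
        data=json.dumps(query).encode("utf-8"),     # JSON in
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + token,     # hypothetical auth scheme
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)                      # JSON out

# Hypothetical usage:
# result = query_api("https://api.example.com", "TOKEN", "/traffic",
#                    {"siteId": "9978", "fields": ["events"],
#                     "filters": [{"group": "browser", "item": "Chrome"}]})
```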
Demo Session
Thank you! Questions?
Credits: Erik Gorset & Oslo Dev Team
…btw, we are hiring!
Connect with Cxense
www.cxense.com
https://twitter.com/cxense
www.facebook.com/cxense
www.linkedin.com/company/cxense
© http://www.perspectivaconica.com/