Leveraging Big Data and Real-Time Analytics at Cxense
Simon Lia-Jonassen, 08/04/15
About Cxense

Our mission is to help companies understand their audience and build great online user experiences, where users:
– Stay longer on the site.
– Sign up for subscriptions.
– Find interesting articles.
– Buy recommended products.
About Cxense

Founded in 2010, ~100 employees in 2015.
Offices:
– Melbourne, Tokyo, Singapore, Stockholm, Copenhagen, Oslo*, London, Buenos Aires, Rio de Janeiro, Miami, New York, San Francisco.
Some of our customers:
Our solutions
How does it work!?
Event (example)
Content Profile (example)
Challenges

Data Volume and Traffic
– 5K+ websites.
– 50M+ pages (last month).
– 500M+ users (last month).
– 10B+ events/month (20K events/sec peak).

Heterogeneity and Reliability
– Hundreds of mobile and desktop platforms, browsers, internet providers, etc.
– Multiple devices per user, cross-domain tracking (the 3rd-party cookie is dying).
– Web pages (articles, image/video galleries, chats, search/front pages) and human language.
– The Internet is Broken™.

Constraints and Requirements
– Online and real-time processing: show and analyze what is happening right now.
– High and sustainable performance:
  • Throughput: peak load of 10K+ requests/sec.
  • Latency: 100ms constraint for ads and recommendations.
– Fault-tolerance and durability.
Architecture and Data Flow (simplified)
Data Flow and Feeding

Communication
– HTTP with JSON payload.
– Durable and idempotent.

Local storage
– Atomically append to a file.
– Use a new file each hour.
– Use a separate directory for each partition.
– Tail files and/or directories (see the sketch below).

Metadata
– Keeps the state.
– Can go backwards and re-feed when needed.

System
– Semi-automatic configuration via Upstart and Crontab.
– Monitoring via Graphite and log files.
– Automatic alerting and centralized log search.
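A minimal sketch of the hourly-file feeding scheme, assuming a line-oriented JSON log (the directory layout and function names are illustrative, not Cxense's actual code):

```python
import os
import time

# Sketch of the local-storage scheme above: events are appended to an hourly
# file inside a per-partition directory, and a tailer resumes from a saved
# offset so feeding can be replayed (re-fed) from any earlier position.

def append_event(base_dir: str, partition: int, line: str) -> None:
    """Append one JSON event line to the current hourly file of a partition."""
    hour = time.strftime("%Y%m%d%H")                # a new file each hour
    part_dir = os.path.join(base_dir, f"partition-{partition:04d}")
    os.makedirs(part_dir, exist_ok=True)
    with open(os.path.join(part_dir, hour + ".log"), "a") as f:
        f.write(line + "\n")                        # single append per event

def tail(path: str, offset: int = 0):
    """Yield (new_offset, line) pairs from a saved offset; the offset is the
    metadata state that lets a consumer go backwards and re-feed."""
    with open(path) as f:
        f.seek(offset)
        for line in f:
            yield f.tell(), line.rstrip("\n")
```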
The Cube

What is The Cube?
– Partitioned column-store database.
– Uses efficient string handling and integer compression.
– Provides fast filtering and aggregation over 50B data points.
– Guarantees low update latency (100ms).
– Exists in multiple variants:
  • Disk- or memory-based.
  • Partitioned by site, by user, or by both.
– Low-level API.

Example:
time           user    rnd     siteid  url                                 browser
1409425329634  "4szi"  "xzst"  "9978"  "cxnews.com"                        "Chrome"
1409425329634  "zthp"  "fd0z"  "9978"  "cxnews.com/seahawks-win-again…"    "Firefox"
1409425329635  "4szi"  "tzdt"  "9978"  "cxnews.com/tesla-model-3-will-…"   "Chrome"
1409425329640  "4szi"  "aext"  "9978"  "cxnews.com/elon-musk-is-awes…"     "Chrome"
1409425329640  "zx5t"  "dxrf"  "9978"  "cxnews.com/tesla-model-3-will-…"   "Safari"
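To make the columnar layout concrete, here is a minimal sketch of how a partition could hold the example rows above column-wise (a subset of the columns; an illustration, not the Cube's storage format):

```python
from collections import Counter

# Each column is its own array, so a query touches only the columns it needs.
cube_partition = {
    "time":    [1409425329634, 1409425329634, 1409425329635,
                1409425329640, 1409425329640],
    "user":    ["4szi", "zthp", "4szi", "4szi", "zx5t"],
    "siteid":  ["9978", "9978", "9978", "9978", "9978"],
    "browser": ["Chrome", "Firefox", "Chrome", "Chrome", "Safari"],
}

# Example: count events per browser without reading any other column.
print(Counter(cube_partition["browser"]))
# Counter({'Chrome': 3, 'Firefox': 1, 'Safari': 1})
```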
The Cube – Integer Columns

Frame of Reference Compression
– Compress the numbers in groups of 64.
– If the sequence is increasing, use the first number as the reference and compute the differences between each two consecutive numbers (deltas).
– Find the maximum number of bits (width) needed to represent the largest delta and compress the deltas using a fixed bit width.
– For non-increasing sequences, use the smallest number as the reference and the differences between the numbers and the reference as deltas.

A sketch of this scheme follows.
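A minimal sketch of the frame-of-reference scheme described above, assuming non-empty groups of non-negative integers (the actual bit-packing step is elided; function names are illustrative):

```python
def compress_group(nums):
    """Compress one group (up to 64 numbers) into (increasing, reference, width, deltas)."""
    increasing = all(a <= b for a, b in zip(nums, nums[1:]))
    if increasing:
        reference = nums[0]
        deltas = [b - a for a, b in zip(nums, nums[1:])]   # consecutive differences
    else:
        reference = min(nums)
        deltas = [n - reference for n in nums]             # offsets from the minimum
    # Fixed bit width: just enough bits to hold the largest delta.
    width = max((d.bit_length() for d in deltas), default=0)
    return increasing, reference, width, deltas            # deltas get bit-packed at `width` bits each

def decompress_group(increasing, reference, width, deltas):
    if increasing:
        out = [reference]
        for d in deltas:
            out.append(out[-1] + d)                        # prefix-sum the deltas
        return out
    return [reference + d for d in deltas]
```

Ordered columns such as time compress particularly well here: consecutive timestamps in the example table differ by only a few milliseconds, so each delta needs just a handful of bits.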
The Cube – String Columns

– A global lexicon maps all strings to numbers and back.
– For each column, we map global keys to a smaller set of numbers and back (see the sketch below).
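A minimal sketch of the two-level mapping, using in-memory dictionaries (illustrative, not the Cube's implementation). The per-column numbers are small and dense, which pairs well with the integer compression above:

```python
class Lexicon:
    """Global, bidirectional string <-> number mapping."""
    def __init__(self):
        self._key_by_string = {}
        self._string_by_key = []

    def key(self, s):
        if s not in self._key_by_string:
            self._key_by_string[s] = len(self._string_by_key)
            self._string_by_key.append(s)
        return self._key_by_string[s]

    def string(self, key):
        return self._string_by_key[key]

class StringColumn:
    """Per-column remapping of global keys to a smaller number range."""
    def __init__(self, lexicon):
        self._lexicon = lexicon
        self._local_by_global = {}   # global key -> small local number
        self._global_by_local = []   # local number -> global key
        self.rows = []               # one small local number per row

    def append(self, s):
        g = self._lexicon.key(s)
        if g not in self._local_by_global:
            self._local_by_global[g] = len(self._global_by_local)
            self._global_by_local.append(g)
        self.rows.append(self._local_by_global[g])

    def value(self, row):
        return self._lexicon.string(self._global_by_local[self.rows[row]])
```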
The Cube – Filtering

Filter
– Keep a bit-filter over a particular range of rows as the state.

Filtering
– By number or range: pass through a column and update the filter. Use binary search for ordered columns such as time, and an inverted index for user id.
– By key: map the key to a number and filter by the number.
– By set of keys: map the keys to a bit-set and filter using the bit-set.
– By pattern: filter by the set of keys matching the pattern.

Logical operations
– AND, OR, NOT: use unary negation, binary intersection/join, and a stack of filters.

Advanced operations
– Use aggregation output as filtering input (e.g., top-list, explosion, histogram, etc.).
– Join between different cubes on one or multiple dimensions.

A sketch of the bit-filter approach follows.
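A minimal sketch of bit-filter filtering, using a Python integer as the bit-set (illustrative; a real implementation would scan compressed chunks and use binary search on sorted columns):

```python
class BitFilter:
    """One bit per row in a range; a set bit means the row is still selected."""
    def __init__(self, num_rows):
        self.num_rows = num_rows
        self.bits = (1 << num_rows) - 1          # start with all rows selected

    def restrict_to(self, rows):
        """Intersect the filter with the given row indices."""
        mask = 0
        for r in rows:
            mask |= 1 << r
        self.bits &= mask

    def negate(self):                            # NOT
        self.bits ^= (1 << self.num_rows) - 1

def filter_by_key(filt, column, key):
    """By key: pass through the column and keep matching rows."""
    filt.restrict_to(i for i, v in enumerate(column) if v == key)

def filter_by_range(filt, column, lo, hi):
    """By range: keep rows whose value falls in [lo, hi]."""
    filt.restrict_to(i for i, v in enumerate(column) if lo <= v <= hi)

# AND is just filter composition; OR would union the bits of two filters.
f = BitFilter(5)
filter_by_key(f, ["Chrome", "Firefox", "Chrome", "Chrome", "Safari"], "Chrome")
filter_by_range(f, [1, 2, 3, 4, 5], 1, 3)
print(bin(f.bits))                               # rows 0 and 2 remain: 0b101
```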
The Cube – Aggregation

Operations
– Count: count the number of bits set in the filter.
– Sum: sum the numbers where the filter bit is set.
– Cardinality: count the number of distinct keys/numbers.
– CardinalityEstimator: create a HyperLogLog cardinality estimator.
– Frequency: create a map of keys/numbers with the associated count.
– TopList: create a frequency map with only the k most popular keys/numbers.
– SumBy: create a map of keys/numbers with the associated sum.
– CardinalityMap: create a map of keys/numbers with the associated cardinality.
– FrequencyDistribution: create a histogram over frequencies.
– CardinalityDistribution: create a histogram over cardinalities.
– SumByDistribution: create a histogram over sums.
– NumericalStatistics: compute distribution statistics for numbers (min, max, percentiles).

A sketch of a few of these operations follows.
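A minimal sketch of a few of the operations above, applied to rows selected by a bit-filter (illustrative, not the Cube's implementation):

```python
from collections import Counter, defaultdict

def selected_rows(bits):
    """Yield the indices of the set bits in the filter."""
    i = 0
    while bits:
        if bits & 1:
            yield i
        bits >>= 1
        i += 1

def count(bits):                                  # Count
    return bin(bits).count("1")

def sum_where(bits, column):                      # Sum
    return sum(column[i] for i in selected_rows(bits))

def top_list(bits, column, k):                    # TopList
    return Counter(column[i] for i in selected_rows(bits)).most_common(k)

def sum_by(bits, key_column, value_column):       # SumBy
    out = defaultdict(int)
    for i in selected_rows(bits):
        out[key_column[i]] += value_column[i]
    return dict(out)
```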
The Cube – Updates

Partitioning
– Most of the data structures are partitioned into chunks of data in order to improve memory allocation, materialization, skipping, compression, and locking.

Static and dynamic parts
– Each data column, lexicon, or mapping consists of a static and a dynamic part.
– The static part is ordered: it can use binary search and Minimal Perfect Hashing.
– The dynamic part is read-write: it has to be searched exhaustively, but this is improved using Wavelet Trees.

Locking
– Distinct read and read-write locks with different granularity/scope.
– The updates are mostly appends, but some of the columns might be updated later (e.g., active time, exit query, etc.).

Maintenance
– Periodically flush the dynamic part into the static part.
– Remove the old data, delete unused strings, optimize the mapping.

A sketch of the static/dynamic split follows.
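A minimal sketch of the static/dynamic split, using a plain sorted list for the static part and a linear scan for the dynamic part (illustrative; the slide's Minimal Perfect Hashing and Wavelet Trees are elided):

```python
import bisect

class SplitColumn:
    """A column with an ordered static part and an append-only dynamic part."""
    def __init__(self):
        self.static = []     # sorted; read-only between flushes
        self.dynamic = []    # unordered appends since the last flush

    def append(self, value):
        self.dynamic.append(value)

    def contains(self, value):
        i = bisect.bisect_left(self.static, value)         # binary search
        if i < len(self.static) and self.static[i] == value:
            return True
        return value in self.dynamic                       # exhaustive scan

    def flush(self):
        """Maintenance: merge the dynamic tail into the sorted static part."""
        self.static = sorted(self.static + self.dynamic)
        self.dynamic = []
```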
The Cube – Advanced Data Types

Keyword vectors
– Represent user and document profiles.
– Each contains a document id, a version, and a set of group-item pairs with weights.
– Stored in a separate, highly partitioned set of containers.
– Each container keeps multiple groups.
– Each group contains document ids, items, and weights as columns (see the sketch below).
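A minimal sketch of the columnar keyword-vector layout (names and types are illustrative, not the Cube's actual containers):

```python
from dataclasses import dataclass, field

@dataclass
class Group:
    """One group inside a container; entries are stored column-wise."""
    doc_ids: list = field(default_factory=list)    # document id per entry
    items: list = field(default_factory=list)      # item (keyword) per entry
    weights: list = field(default_factory=list)    # weight per entry

    def add(self, doc_id, item, weight):
        self.doc_ids.append(doc_id)
        self.items.append(item)
        self.weights.append(weight)

@dataclass
class Container:
    """A container keeps multiple groups, keyed by group name."""
    groups: dict = field(default_factory=dict)

    def add(self, group, doc_id, item, weight):
        self.groups.setdefault(group, Group()).add(doc_id, item, weight)
```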
The Cube – Advanced Data Types

Structured data
– Can represent any simple JSON object (document).
– Node types: Null, Object, Array, Integer, Float, String, Boolean.
– Stored in a separate container, with separate columns for each node type.
– Each document is decomposed into a list of paths and nodes.
– Each node is added to the corresponding column (see the sketch below).
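A minimal sketch of decomposing a JSON document into (path, node type, value) triples, each of which would be appended to the column for its node type (the path syntax is an assumption):

```python
def decompose(node, path=""):
    """Yield (path, node_type, value) for every node in a JSON document."""
    if isinstance(node, dict):
        yield path, "Object", None
        for key, child in node.items():
            yield from decompose(child, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        yield path, "Array", None
        for i, child in enumerate(node):
            yield from decompose(child, f"{path}[{i}]")
    elif isinstance(node, bool):       # bool before int: bool subclasses int
        yield path, "Boolean", node
    elif isinstance(node, int):
        yield path, "Integer", node
    elif isinstance(node, float):
        yield path, "Float", node
    elif isinstance(node, str):
        yield path, "String", node
    elif node is None:
        yield path, "Null", None

doc = {"site": "cxnews.com", "views": 42, "tags": ["tesla", "ev"]}
for triple in decompose(doc):
    print(triple)
# ('', 'Object', None), ('site', 'String', 'cxnews.com'),
# ('views', 'Integer', 42), ('tags', 'Array', None),
# ('tags[0]', 'String', 'tesla'), ('tags[1]', 'String', 'ev')
```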
Analytics API and UI

Analytics API
– RESTful API: client-server, HTTP requests and response codes, stateless, cacheable, etc.
– API resource paths, JSON in, JSON out.
– Most of the APIs require authentication.
– Simple integration via cx.py, Java/JavaScript/C#/Python/Perl/PHP, or direct HTTP calls.

Traffic API
– A rich set of high-level APIs.
– Powerful ad-hoc syntax: types, groups, items, filters, fields, etc.
– See the demo!

Analytics UI
– HTML and JavaScript.
– Built on top of the Analytics API.
– Has multiple fixed, functional views which can be combined with arbitrary filters.
– Premium users have a workspace area for dynamic, configurable widgets.

A sketch of a JSON-in/JSON-out call follows.
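A minimal sketch of a JSON-in/JSON-out REST call of the kind described above. The resource path, query fields, and auth header here are hypothetical placeholders, not the actual Traffic API schema:

```python
import json
import urllib.request

def query_api(base_url, token, path, query):
    """POST a JSON query to an API resource path and return the JSON response."""
    req = urllib.request.Request(
        url=base_url + path,                        # hypothetical resource path
        data=json.dumps(query).encode("utf-8"),     # JSON in
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + token,     # hypothetical auth scheme
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)                      # JSON out

# Hypothetical usage:
# result = query_api("https://api.example.com", "TOKEN", "/traffic",
#                    {"siteId": "9978", "fields": ["events"],
#                     "filters": [{"group": "browser", "item": "Chrome"}]})
```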
Demo Session
Thank you! Questions?
Credits: Erik Gorset & Oslo Dev Team
…btw, we are hiring!
Connect with Cxense
www.cxense.com
https://twitter.com/cxense
www.facebook.com/cxense
www.linkedin.com/company/cxense
© http://www.perspectivaconica.com/