Acunu Analytics: Simpler Real-Time Cassandra Apps
Tim Moreton, CTO, @timmoreton
Monday, 29 April 13
• Scalable. No single point of {failure, bottleneck}
• Fast. Especially for writes
• Available. Effortless multi-DC support
• Maturing fast. Lots of production deployments
WE C*
• Virtual nodes
• CQL support
• Spartan queries
• Thrift (and CQL, a bit)
• Denormalization hurts agility
• Weak update semantics
Challenges remain, of course.
WE C*
C*: Two uses
Session storage
• Many more reads than writes
• Updates to existing records (ideally, transactionally)
• Probably fits in RAM: distribute for availability
Real-time analytics
02:44:02 241.24.41 0.0.1 GET /index.html
• Many more writes than reads
• Almost all reads are to results
• Almost no writes are ‘updates’
• Distribute for availability, performance, capacity
C* on
Supplement Cassandra with:
• Rich, SQL-like queries
• RESTful HTTP APIs, JSON-based
• Automated denormalization
• Update semantics: less critical for analytics
Analytics: Two patterns
Exploratory Analytics
• Unstructured Warehouses
• Data Mining
• Machine Learning
• Complex analysis, data variety
• Query richness
Operational Intelligence
• Dashboards
• Real-time Decisions
• Alerting
• Data freshness, response time
• Query speed
(Architecture diagram: an event stream passes through Ingest and Processing into an event store and roll-up cubes; an API serves dashboard queries and a programmatic interface.)
Who uses Acunu?
• Location Data
• Web and Visitor
• Market/Tick Data
• Infrastructure
• Sensor Data
• Social Media
• Social Gaming
• Smart Grid
• Production Line
Cassandra stores raw events and intermediate aggregates
Acunu Analytics is a Cassandra client mapping new events, queries and schema changes to aggregate reads and writes
Acunu Dashboards provides embeddable, custom data visualization using the HTTP API
(Loosely) Define a schema
CREATE TABLE APICalls (
  time TIME('PST', HOUR, MIN, SEC),
  path PATH(/),
  useragent STRING,
  latitude DOUBLE(0.1, 0.01),
  longitude DOUBLE(0.1, 0.01)
);
CREATE CUBE SELECT COUNT, AVG(respTime) FROM APICalls WHERE time, path GROUP BY time, path;
CREATE CUBE SELECT COUNT FROM APICalls WHERE latitude, longitude GROUP BY latitude, longitude;
• Tables have an HTTP endpoint and map to a set of ColumnFamilies
• Dimensions map to keys in events and allow hierarchical aggregation
• Cubes define the dimensions and aggregates to maintain
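One way to picture a hierarchical numeric dimension such as the latitude and longitude columns is as a set of nested buckets. The sketch below is only an illustration: it assumes the two arguments of DOUBLE(0.1, 0.01) are bucket widths, which the slides do not state, and the names `bucket_levels` and `record` are hypothetical.

```python
from collections import Counter
from decimal import Decimal, ROUND_FLOOR

def bucket_levels(value, widths=("0.1", "0.01")):
    """Return the lower edge of the bucket holding value at each width, coarse to fine."""
    exact = Decimal(str(value))
    return [exact.quantize(Decimal(w), rounding=ROUND_FLOOR) for w in widths]

def record(counts, lat, lon, widths=("0.1", "0.01")):
    """Count one event in the (latitude, longitude) cell at every assumed width."""
    for w, la, lo in zip(widths, bucket_levels(lat, widths), bucket_levels(lon, widths)):
        # The width is part of the key so that coarse and fine cells never collide.
        counts[(w, la, lo)] += 1

counts = Counter()
record(counts, 51.5074, -0.1278)
record(counts, 51.5231, -0.1586)
for key, n in sorted(counts.items()):
    print(key, n)
```

Both events fall in the same 0.1-degree cell but in different 0.01-degree cells, so a coarse map view and a zoomed-in view can each be answered from a precomputed count.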
Aggregation
CREATE CUBE SELECT SUM(a) FROM t WHERE x, y GROUP BY g, h, i;
New event: apply SUM(v, v’) on this cell
(Figure: an event {A: v’, X: x, Y: y, Z: z} updates the cube cell holding v, located by x, y and (g, h, i).)
• Hierarchical dimensions cause multiple writes per event (That’s ok: Cassandra’s good at writes)
• Most aggregates result in atomic counter increments
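The write fan-out described in these bullets can be sketched in a few lines. This is a minimal in-memory illustration, not Acunu's implementation: the hierarchy rules in `expand`, the event field values, and the nested dictionary standing in for Cassandra rows and columns are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import product

def expand(dim, value):
    """Hypothetical hierarchy rules: list every level that contains value."""
    if dim == "time":    # "22:02:07" -> hour, minute, second
        parts = value.split(":")
        return [":".join(parts[: i + 1]) for i in range(len(parts))]
    if dim == "path":    # "/api/users" -> "/", "/api", "/api/users"
        parts = [p for p in value.split("/") if p]
        return ["/"] + ["/" + "/".join(parts[: i + 1]) for i in range(len(parts))]
    return [value]       # flat dimension: a single level

def cells_for(event, where_dims, group_dims):
    """Yield every (row key, column key) cell that one event contributes to."""
    cols = list(product(*(expand(d, event[d]) for d in group_dims)))
    for row in product(*(expand(d, event[d]) for d in where_dims)):
        for col in cols:
            yield row, col

def apply_event(cube, event, measure, where_dims, group_dims):
    """Add the event's measure to each affected cell; return the number of writes."""
    writes = 0
    for row, col in cells_for(event, where_dims, group_dims):
        cube[row][col] += event[measure]   # stands in for a counter increment
        writes += 1
    return writes

cube = defaultdict(lambda: defaultdict(float))
event = {"time": "22:02:07", "path": "/api/users", "useragent": "curl", "respTime": 12.5}
writes = apply_event(cube, event, "respTime", ("time", "path"), ("useragent",))
print(writes)  # 3 time levels x 3 path levels x 1 user agent = 9
```

Every extra level in a hierarchy multiplies the number of cells an event touches, which is why the cheapness of writes matters for this design.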
Queries
SELECT SUM(a) FROM t WHERE x = .. AND y = .. GROUP BY g, h, i;
• WHEREs map to a Cassandra row and GROUP BY to a compound column key in that row (very roughly)
New query:
• Locate slice that matches WHERE
• Return all mappings from GROUP BY tuples to cell values
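The two query steps above amount to a row lookup followed by a scan over that row's columns. The following sketch uses a hand-built dictionary as a stand-in for one cube; the keys, values, and the optional `group_prefix` argument are hypothetical and only illustrate the "very roughly" mapping from WHERE to rows and GROUP BY to compound column keys.

```python
# Stand-in for one cube: row key = (x, y), column key = (g, h, i), value = SUM(a).
cube = {
    ("x1", "y1"): {("g1", "h1", "i1"): 10.0, ("g1", "h2", "i1"): 4.0},
    ("x1", "y2"): {("g2", "h1", "i1"): 7.5},
}

def query(cube, where, group_prefix=()):
    """Locate the row matching the WHERE values and return its GROUP BY mappings.

    group_prefix optionally narrows the result to columns whose compound key
    starts with the given values, similar to a slice over sorted columns.
    """
    row = cube.get(tuple(where), {})
    n = len(group_prefix)
    return {col: v for col, v in sorted(row.items()) if col[:n] == tuple(group_prefix)}

print(query(cube, ("x1", "y1")))
print(query(cube, ("x1", "y1"), group_prefix=("g1", "h2")))
```

Because the aggregates were computed at write time, answering the query needs no scan over raw events.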
A concrete example
21:00 all→1345 :00→45 :01→62 :02→87 ...
22:00 all→3221 :00→22 :01→19 :02→104 ...
... ...
UK all→228 user01→1 user14→12 user99→7 ...
US all→354 user01→4 user04→8 user56→17 ...
...
UK, 22:00 all→1904 ...
∅ all→87314 UK→238 US→354 ...
Each event updates multiple aggregates:
{
  cust_id: user01,
  session_id: 102,
  geography: UK,
  browser: IE,
  time: 22:02,
}
21:00 all→1345 :00→45 :01→62 :02→87 ...
22:00 all→3222 :00→22 :01→19 :02→105 ...
... ...
UK all→229 user01→2 user14→12 user99→7 ...
US all→354 user01→4 user04→8 user56→17 ...
...
UK, 22:00 all→1905 ...
∅ all→87315 UK→239 US→355 ...
WHERE time IN (22:00,23:00) GROUP BY minute
WHERE geography=US GROUP BY user
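The update in this example can be replayed in memory to check the arithmetic. The sketch below copies a subset of the rows from the slide and applies the UK event; the function names and the choice of which cells an event touches are assumptions read off the before and after tables, not a description of Acunu's code.

```python
from collections import Counter

# A subset of the aggregate rows from the slide, before the event arrives.
rows = {
    "22:00": Counter({"all": 3221, ":00": 22, ":01": 19, ":02": 104}),
    "UK": Counter({"all": 228, "user01": 1, "user14": 12, "user99": 7}),
    "UK, 22:00": Counter({"all": 1904}),
    "∅": Counter({"all": 87314, "UK": 238}),
}

def cells_touched(event):
    """List the (row key, column) pairs that one event increments."""
    hour = event["time"][:2] + ":00"   # "22:02" -> "22:00"
    minute = event["time"][2:]         # "22:02" -> ":02"
    geo, user = event["geography"], event["cust_id"]
    return [
        (hour, "all"), (hour, minute),
        (geo, "all"), (geo, user),
        (f"{geo}, {hour}", "all"),
        ("∅", "all"), ("∅", geo),
    ]

def apply_event(rows, event):
    """Increment every affected counter by one."""
    for row_key, column in cells_touched(event):
        rows.setdefault(row_key, Counter())[column] += 1

event = {"cust_id": "user01", "session_id": 102, "geography": "UK",
         "browser": "IE", "time": "22:02"}
apply_event(rows, event)
print(rows["22:00"]["all"], rows["22:00"][":02"])  # 3222 105
print(rows["UK"]["all"], rows["UK"]["user01"])     # 229 2
```

One event leads to seven increments here, and each of the example queries is then answered from a single row.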
Rich queries
Arithmetic expressions
SELECT SUM(x)/(MAX(y) - MIN(y) + 0.5) AS 'spread' FROM ...
Fast inner joins
SELECT a - b AS lbound, a + b AS ubound FROM (SELECT AVG(score) AS a FROM scores WHERE year = 2012) JOIN (SELECT STDDEV(score) AS b FROM scores) USING (school)
Time zone support
SELECT COUNT UNIQUE (visitors) GROUP BY time(DAY('US/Pacific'))
Hierarchical aggregation
SELECT SUM(size) FROM .. WHERE path MATCHES /usr/*
Drill down to raw events
SELECT DRILL FROM errors WHERE category IN ("warn", "error")
Limits
SELECT COUNT (items) FROM .. GROUP BY category LIMIT 3, country
Query-time filtering
... HAVING AVG(rating) < 2.0 AND COUNT >= 10
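Query-time filtering such as the HAVING clause above can be evaluated over aggregates that are already stored, without touching raw events. The sketch below assumes each group keeps a running SUM(rating) and COUNT; the group names, the numbers, and the function name are made up for illustration.

```python
# Hypothetical per-group aggregates: group -> (SUM(rating), COUNT).
aggregates = {
    "books": (45.0, 30),
    "games": (9.0, 6),
    "music": (80.0, 20),
    "toys": (18.0, 12),
}

def having_low_rated(aggregates, max_avg=2.0, min_count=10):
    """Apply HAVING AVG(rating) < max_avg AND COUNT >= min_count at query time."""
    result = {}
    for group, (total, count) in aggregates.items():
        # AVG is derived from the stored SUM and COUNT when the query runs.
        if count >= min_count and total / count < max_avg:
            result[group] = total / count
    return result

print(having_low_rated(aggregates))  # {'books': 1.5, 'toys': 1.5}
```

The same idea covers arithmetic expressions: values like SUM, MAX and MIN are read from the cube and combined when the query runs.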
Apache, Apache Cassandra, Cassandra, Hadoop, and the eye and elephant logos are trademarks of the Apache Software Foundation.
Thank You.
Tim Moreton, CTO, @timmoreton