LazyBase: Trading freshness and performance in a scalable database
![Page 1: LazyBase: Trading freshness and performance in a scalable database](https://reader036.fdocuments.in/reader036/viewer/2022062309/568147aa550346895db4e505/html5/thumbnails/1.jpg)
LazyBase: Trading freshness and performance in a scalable database
Jim Cipar, Greg Ganger, *Kimberly Keeton, *Craig A. N. Soules, *Brad Morrey, *Alistair Veitch
PARALLEL DATA LABORATORY, Carnegie Mellon University
* HP Labs
![Page 2: LazyBase: Trading freshness and performance in a scalable database](https://reader036.fdocuments.in/reader036/viewer/2022062309/568147aa550346895db4e505/html5/thumbnails/2.jpg)
(very) High-level overview
LazyBase is…
• Distributed database
• High-throughput, rapidly changing data sets
• Efficient queries over consistent snapshots
• Tradeoff between query latency and freshness
Jim Cipar © Apr 21, 2023 http://www.pdl.cmu.edu/
![Page 4: LazyBase: Trading freshness and performance in a scalable database](https://reader036.fdocuments.in/reader036/viewer/2022062309/568147aa550346895db4e505/html5/thumbnails/4.jpg)
Example application
• High-bandwidth stream of Tweets
  • Many thousands per second
  • 200 million per day
• Queries accept different freshness levels
  • Freshest: USGS Twitter Earthquake Detector
  • Fresh: Hot news in last 10 minutes
  • Stale: Social network graph analysis
• Consistency is important
  • Tweets can refer to earlier tweets
  • Some apps look for cause-effect relationships
![Page 5: LazyBase: Trading freshness and performance in a scalable database](https://reader036.fdocuments.in/reader036/viewer/2022062309/568147aa550346895db4e505/html5/thumbnails/5.jpg)
Class of analytical applications
• Performance
  • Continuous high-throughput updates
  • “Big” analytical queries
• Freshness
  • Varying freshness requirements for queries
  • Freshness varies by query, not data set
• Consistency
  • There are many types, more on this later...
![Page 6: LazyBase: Trading freshness and performance in a scalable database](https://reader036.fdocuments.in/reader036/viewer/2022062309/568147aa550346895db4e505/html5/thumbnails/6.jpg)
Applications and freshness

| Domain / Freshness | Seconds | Minutes | Hours+ |
|---|---|---|---|
| Retail | Real-time coupons, targeted ads | Just-in-time inventory | Product search, earnings reports |
| Enterprise information management | Infected machine identification | File-based policy validation | E-discovery requests, search |
| Transportation | Emergency response | Real-time traffic maps | Traffic engineering, route planning |
![Page 8: LazyBase: Trading freshness and performance in a scalable database](https://reader036.fdocuments.in/reader036/viewer/2022062309/568147aa550346895db4e505/html5/thumbnails/8.jpg)
Current Solutions
[Table: rows OLTP/SQL DBs, Data warehouse, NoSQL; columns Performance, Freshness, Consistency]
LazyBase is designed to support all three
![Page 9: LazyBase: Trading freshness and performance in a scalable database](https://reader036.fdocuments.in/reader036/viewer/2022062309/568147aa550346895db4e505/html5/thumbnails/9.jpg)
Key ideas
• Separate concepts of consistency and freshness
  • Batching to improve throughput, provide consistency
• Trade latency for freshness
  • Can choose on a per-query basis
    – Fast queries over stale data
    – Slower queries over fresh data
![Page 10: LazyBase: Trading freshness and performance in a scalable database](https://reader036.fdocuments.in/reader036/viewer/2022062309/568147aa550346895db4e505/html5/thumbnails/10.jpg)
Consistency ≠ freshness
Separate concepts of consistency and freshness
• Query results may be stale: missing recent writes
• Query results must be consistent
![Page 11: LazyBase: Trading freshness and performance in a scalable database](https://reader036.fdocuments.in/reader036/viewer/2022062309/568147aa550346895db4e505/html5/thumbnails/11.jpg)
Consistency in LazyBase
• Similar to snapshot consistency
• Atomic multi-row updates
• Monotonicity:
  • If a query sees update A, all subsequent queries will see update A
• Consistent prefix:
  • If a query sees update number X, it will also see updates 1…(X-1)
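The monotonicity and consistent-prefix properties above can be written as simple checks over the update numbers each query observes. This is a minimal sketch, not LazyBase code: the function names and the representation of a query result as a collection of update numbers are assumptions made for illustration.

```python
def is_consistent_prefix(visible_updates):
    """True if the visible update numbers are exactly 1..X for some X (or empty)."""
    seen = sorted(set(visible_updates))
    return seen == list(range(1, len(seen) + 1))


def is_monotonic(query_results):
    """True if every query sees all updates seen by the queries before it.

    `query_results` is a list of collections of update numbers, in the
    order the queries were issued (a hypothetical representation).
    """
    previous = set()
    for visible in query_results:
        current = set(visible)
        if not previous <= current:
            return False
        previous = current
    return True
```

A result that sees updates 1 and 3 but not 2 fails the prefix check, and a later query that sees fewer updates than an earlier one fails the monotonicity check.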
![Page 12: LazyBase: Trading freshness and performance in a scalable database](https://reader036.fdocuments.in/reader036/viewer/2022062309/568147aa550346895db4e505/html5/thumbnails/12.jpg)
LazyBase limitations
• Only supports observational data
  • Transactions are read-only or write-only
  • No read-modify-write
  • Not online transaction processing
• Not (currently) targeting really huge scale
  • 10s of servers, not 1000s
  • Not everyone is a Google (or a Facebook…)
![Page 13: LazyBase: Trading freshness and performance in a scalable database](https://reader036.fdocuments.in/reader036/viewer/2022062309/568147aa550346895db4e505/html5/thumbnails/13.jpg)
LazyBase design
• LazyBase is a distributed database
  • Commodity servers
  • Can use direct attached storage
• Each server runs:
  • General purpose worker process
  • Ingest server that accepts client requests
  • Query agent that processes query requests
• Logically LazyBase is a pipeline
  • Each pipeline stage can be run on any worker
  • Single stage may be parallelized on multiple workers
![Page 15: LazyBase: Trading freshness and performance in a scalable database](https://reader036.fdocuments.in/reader036/viewer/2022062309/568147aa550346895db4e505/html5/thumbnails/15.jpg)
Pipelined data flow
[Diagram: Client, then pipeline stages Ingest → Sort → Merge producing the Authority table; Old authority also shown]
![Page 16: LazyBase: Trading freshness and performance in a scalable database](https://reader036.fdocuments.in/reader036/viewer/2022062309/568147aa550346895db4e505/html5/thumbnails/16.jpg)
Batching updates
• Batching for performance and atomicity
• Common technique for throughput
  • E.g. bulk loading of data in data warehouse ETL
• Also provides basis for atomic multi-row operations
  • Batches are applied atomically and in-order
  • Called SCU (self-consistent update)
  • SCUs contain inserts, updates, deletes
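To make "applied atomically and in-order" concrete, here is a small in-memory sketch. The `Table` class, the `apply_scu` method, and the `(operation, key, value)` tuple format are all hypothetical; the slides do not describe LazyBase's actual data structures.

```python
class Table:
    """Toy table that accepts SCUs strictly in sequence."""

    def __init__(self):
        self.rows = {}
        self.last_applied = 0  # number of the newest SCU applied so far

    def apply_scu(self, scu_number, operations):
        # In-order: reject any SCU that is not the next one in sequence.
        if scu_number != self.last_applied + 1:
            raise ValueError("SCUs must be applied in order")
        # Atomic: stage all changes on a copy, publish only if all succeed.
        staged = dict(self.rows)
        for op, key, value in operations:
            if op in ("insert", "update"):
                staged[key] = value
            elif op == "delete":
                staged.pop(key, None)
            else:
                raise ValueError(f"unknown operation: {op}")
        self.rows = staged
        self.last_applied = scu_number
```

Because readers only ever see `self.rows` before or after the swap, a multi-row SCU is either fully visible or not visible at all.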
![Page 17: LazyBase: Trading freshness and performance in a scalable database](https://reader036.fdocuments.in/reader036/viewer/2022062309/568147aa550346895db4e505/html5/thumbnails/17.jpg)
Batching for performance
Large batches of updates increase throughput
![Page 18: LazyBase: Trading freshness and performance in a scalable database](https://reader036.fdocuments.in/reader036/viewer/2022062309/568147aa550346895db4e505/html5/thumbnails/18.jpg)
Problem with batching: latency
• Batching trades update latency for throughput
  • Large batches mean the database is very stale
  • Very large batches/busy system could be hours old
• OK for some queries, bad for others
![Page 19: LazyBase: Trading freshness and performance in a scalable database](https://reader036.fdocuments.in/reader036/viewer/2022062309/568147aa550346895db4e505/html5/thumbnails/19.jpg)
Put the “lazy” in LazyBase
As updates are processed through the pipeline, they become progressively “easier” to query.
We can use this to trade query latency for freshness.
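One way to see why later pipeline stages are "easier" to query: unsorted ingest output has to be scanned in full, while sorted output supports binary search. The sketch below is illustrative only; the row layout as `(key, value)` tuples and both function names are assumptions, not LazyBase's on-disk format.

```python
import bisect


def lookup_raw(raw_rows, key):
    """Unsorted ingest output: every row must be scanned."""
    return [value for k, value in raw_rows if k == key]


def lookup_sorted(sorted_rows, key):
    """Sorted output: binary search jumps to the first matching row."""
    # (key,) sorts before any (key, value) tuple, so this finds the
    # first row whose key is >= the one requested.
    i = bisect.bisect_left(sorted_rows, (key,))
    matches = []
    while i < len(sorted_rows) and sorted_rows[i][0] == key:
        matches.append(sorted_rows[i][1])
        i += 1
    return matches
```

Both functions return the same values for the same data; the difference is that the raw lookup costs time proportional to the batch size, which is the latency paid for reading fresher data.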
![Page 21: LazyBase: Trading freshness and performance in a scalable database](https://reader036.fdocuments.in/reader036/viewer/2022062309/568147aa550346895db4e505/html5/thumbnails/21.jpg)
Query freshness
[Diagram: Client, then Ingest → Sort → Merge with stage outputs Raw Input, Sorted, and Authority table (Old authority also shown); query cost is labeled from "Slow, fresh" at Raw Input to "Fast, stale" at the Authority table]
![Page 22: LazyBase: Trading freshness and performance in a scalable database](https://reader036.fdocuments.in/reader036/viewer/2022062309/568147aa550346895db4e505/html5/thumbnails/22.jpg)
Query freshness
[Diagram: stage outputs Raw Input, Sorted, and Authority table on an axis from "Slow, fresh" to "Fast, stale"; example queries shown: Graph analysis, Hot news!, Quake!]
![Page 26: LazyBase: Trading freshness and performance in a scalable database](https://reader036.fdocuments.in/reader036/viewer/2022062309/568147aa550346895db4e505/html5/thumbnails/26.jpg)
Query interface
• User issues high-level queries
  • Programmatically or like a limited subset of SQL
  • Specifies freshness
SELECT COUNT(*) FROM tweets
WHERE user = 'jcipar'
FRESHNESS 30;
• Client library handles all the “dirty work”
![Page 27: LazyBase: Trading freshness and performance in a scalable database](https://reader036.fdocuments.in/reader036/viewer/2022062309/568147aa550346895db4e505/html5/thumbnails/27.jpg)
Query latency/freshness
Queries allowing staler results return faster
![Page 28: LazyBase: Trading freshness and performance in a scalable database](https://reader036.fdocuments.in/reader036/viewer/2022062309/568147aa550346895db4e505/html5/thumbnails/28.jpg)
Experimental setup
• Ran in virtual machines on OpenCirrus cluster
  • 6 dedicated cores, 12 GB RAM, local storage
• Data set was ~38 million tweets
  • 50 GB uncompressed
• Compared to Cassandra
  • Reputation as “write-optimized” NoSQL database
![Page 29: LazyBase: Trading freshness and performance in a scalable database](https://reader036.fdocuments.in/reader036/viewer/2022062309/568147aa550346895db4e505/html5/thumbnails/29.jpg)
Performance & scalability
• Large, rapidly changing data sets
  • Update performance
• Analytical queries
  • Query performance, especially “big” queries
• Must be fast and scalable
  • Looking at clusters of 10s of servers
![Page 30: LazyBase: Trading freshness and performance in a scalable database](https://reader036.fdocuments.in/reader036/viewer/2022062309/568147aa550346895db4e505/html5/thumbnails/30.jpg)
Ingest scalability experiment
• Measured time to ingest entire data set
  • Uploaded in parallel from 20 servers
• Varied number of worker processes
![Page 31: LazyBase: Trading freshness and performance in a scalable database](https://reader036.fdocuments.in/reader036/viewer/2022062309/568147aa550346895db4e505/html5/thumbnails/31.jpg)
Ingest scalability results
LazyBase scales effectively up to 20 servers
Efficiency is ~4x better than Cassandra
![Page 32: LazyBase: Trading freshness and performance in a scalable database](https://reader036.fdocuments.in/reader036/viewer/2022062309/568147aa550346895db4e505/html5/thumbnails/32.jpg)
Stale query experiments
• Test performance of fastest queries
  • Access only authority table
• Two types of queries: point and range
  • Point queries get single tweet by ID
  • Range queries get list of valid tweet IDs in range
    – Range size chosen to return ~0.1% of all IDs
• Cassandra range queries used get_slice
  • Actual range queries discouraged
![Page 33: LazyBase: Trading freshness and performance in a scalable database](https://reader036.fdocuments.in/reader036/viewer/2022062309/568147aa550346895db4e505/html5/thumbnails/33.jpg)
Point query throughput
Queries scale to multiple clients
Raw performance suffers due to on-disk format
![Page 34: LazyBase: Trading freshness and performance in a scalable database](https://reader036.fdocuments.in/reader036/viewer/2022062309/568147aa550346895db4e505/html5/thumbnails/34.jpg)
Range query throughput
Range query performance ~4x Cassandra
![Page 35: LazyBase: Trading freshness and performance in a scalable database](https://reader036.fdocuments.in/reader036/viewer/2022062309/568147aa550346895db4e505/html5/thumbnails/35.jpg)
Consistency experiments
• Workload
  • Two rows, A and B
  • Each row has sequence number and timestamp
  • Sequence of update pairs (increment A, decrement B)
  • Goal: invariant of A+B = 0
  • Background Twitter workload to keep servers busy
• LazyBase
  • Issue update pair, commit
• Cassandra
  • Use “batch update” call
  • Quorum consistency model for reads and writes
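The invariant in this workload can be sketched as a simple check over reader snapshots. This is an illustrative model, not the experiment's actual harness: `run_workload` and `invariant_holds` are hypothetical names, and the workload is simulated in memory rather than issued against LazyBase or Cassandra.

```python
def run_workload(num_pairs):
    """Yield each snapshot a reader could observe when update pairs commit atomically."""
    snapshot = {"A": 0, "B": 0}
    yield dict(snapshot)
    for _ in range(num_pairs):
        # Increment A and decrement B are staged together and
        # published as a single unit, so no reader sees only one of them.
        snapshot = {"A": snapshot["A"] + 1, "B": snapshot["B"] - 1}
        yield dict(snapshot)


def invariant_holds(snapshot):
    """The experiment's goal: A + B must equal 0 in every observed snapshot."""
    return snapshot["A"] + snapshot["B"] == 0
```

A read that observes the increment of A without the matching decrement of B, for example `{"A": 1, "B": 0}`, fails the check; that is the kind of inter-row inconsistency the experiment looks for.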
![Page 36: LazyBase: Trading freshness and performance in a scalable database](https://reader036.fdocuments.in/reader036/viewer/2022062309/568147aa550346895db4e505/html5/thumbnails/36.jpg)
LazyBase maintains inter-row consistency
[Plot: Sum = A+B]
![Page 37: LazyBase: Trading freshness and performance in a scalable database](https://reader036.fdocuments.in/reader036/viewer/2022062309/568147aa550346895db4e505/html5/thumbnails/37.jpg)
Freshness
LazyBase results may be stale, but timestamps are nondecreasing
![Page 38: LazyBase: Trading freshness and performance in a scalable database](https://reader036.fdocuments.in/reader036/viewer/2022062309/568147aa550346895db4e505/html5/thumbnails/38.jpg)
Summary
• Provide performance, consistency and freshness
  • Batching improves update throughput, hurts latency
• Separate ideas of consistency and freshness
  • Tradeoff between freshness and latency
  • Allow queries to access intermediate pipeline data
![Page 39: LazyBase: Trading freshness and performance in a scalable database](https://reader036.fdocuments.in/reader036/viewer/2022062309/568147aa550346895db4e505/html5/thumbnails/39.jpg)
Future work
When freshness is a first-class consideration…
• Consistency
  • Something more specific than snapshot consistency?
  • Eliminate the “aggregate everything first” bottleneck?
• Ingest pipeline
  • Can we prioritize stages to optimize for workload?
  • Can we use temporary data for fault tolerance?
![Page 40: LazyBase: Trading freshness and performance in a scalable database](https://reader036.fdocuments.in/reader036/viewer/2022062309/568147aa550346895db4e505/html5/thumbnails/40.jpg)
Parallel stages
[Diagram: multiple Clients and parallel Ingest, ID Remap, ID Rewrite, Sort, and Merge workers; stage outputs labeled Raw Input, Sorted, and Authority, with Old Authority also shown]
![Page 41: LazyBase: Trading freshness and performance in a scalable database](https://reader036.fdocuments.in/reader036/viewer/2022062309/568147aa550346895db4e505/html5/thumbnails/41.jpg)
Dynamic pipeline allocation
Servers’ roles are reassigned as data moves through the pipeline
![Page 42: LazyBase: Trading freshness and performance in a scalable database](https://reader036.fdocuments.in/reader036/viewer/2022062309/568147aa550346895db4e505/html5/thumbnails/42.jpg)
Query library
• Every query sees a consistent snapshot
  • If SCU X is in results, so are SCUs 1…(X-1)
• User specifies required freshness
  • E.g. “All data from 15 seconds ago”
  • LazyBase may provide fresher data if possible
• May also specify “same freshness as other query”
  • E.g. “All data up to SCU 500, and nothing newer”
  • Allows multiple queries over same snapshot
![Page 43: LazyBase: Trading freshness and performance in a scalable database](https://reader036.fdocuments.in/reader036/viewer/2022062309/568147aa550346895db4e505/html5/thumbnails/43.jpg)
Query overview
1. Find all SCUs that must be in query results
   • Current time - freshness = time limit
   • All SCUs uploaded before time limit are in results
   • Newer SCUs may be in results
2. For relevant SCUs, find fastest file to query
   • If merged, may include fresher SCUs than required
   • Most SCUs can be read from authority file
   • Some SCUs may have to be read from earlier stages
3. Query all relevant files, combine results
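The first two steps above can be sketched as a small planning function. Everything here is a hypothetical model: the `Scu` record, the stage names `"raw"`, `"sorted"`, and `"authority"`, and `plan_query` itself are illustrative, and the sketch assumes SCU numbers follow upload order and that each SCU is read from its most-processed form.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scu:
    number: int         # position in the global update order
    upload_time: float  # seconds, on the same clock as `now`
    stage: str          # most-processed form available: "raw", "sorted", or "authority"


def plan_query(scus, now, freshness):
    """Return {stage: [SCU numbers]} describing which files to read."""
    # Step 1: every SCU uploaded at or before the time limit is required.
    time_limit = now - freshness
    required = [s.number for s in scus if s.upload_time <= time_limit]
    # Reading the authority file brings in every merged SCU, so the
    # snapshot may extend past what the freshness bound requires.
    merged = [s.number for s in scus if s.stage == "authority"]
    last = max(required + merged, default=0)
    # Step 2: read each SCU in the prefix 1..last from its current stage.
    plan = {}
    for s in scus:
        if s.number <= last:
            plan.setdefault(s.stage, []).append(s.number)
    for numbers in plan.values():
        numbers.sort()
    return plan
```

With a loose freshness bound the plan touches only the authority file; tightening the bound pulls in sorted and then raw files, which is where the extra query latency comes from.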
![Page 44: LazyBase: Trading freshness and performance in a scalable database](https://reader036.fdocuments.in/reader036/viewer/2022062309/568147aa550346895db4e505/html5/thumbnails/44.jpg)
Query components
• Client
  • Gets plan from controller, sends request to agents
  • Aggregates results
• Controller
  • Single controller for cluster
  • Finds files relevant to query
• Agents
  • Run on all servers
  • Read files, filter, send results to client
![Page 46: LazyBase: Trading freshness and performance in a scalable database](https://reader036.fdocuments.in/reader036/viewer/2022062309/568147aa550346895db4e505/html5/thumbnails/46.jpg)
Query pictured
[Diagram: Client, Controller, Agents]
1. Get query plan
2. Find SCU files to match query
3. List of files to query
![Page 47: LazyBase: Trading freshness and performance in a scalable database](https://reader036.fdocuments.in/reader036/viewer/2022062309/568147aa550346895db4e505/html5/thumbnails/47.jpg)
Query pictured
[Diagram: Client, Controller, Agents]
4. Request data
5. Filter data
6. Send results
![Page 48: LazyBase: Trading freshness and performance in a scalable database](https://reader036.fdocuments.in/reader036/viewer/2022062309/568147aa550346895db4e505/html5/thumbnails/48.jpg)
Query pictured
[Diagram: Client, Controller, Agents]
7. Aggregate results
![Page 49: LazyBase: Trading freshness and performance in a scalable database](https://reader036.fdocuments.in/reader036/viewer/2022062309/568147aa550346895db4e505/html5/thumbnails/49.jpg)
Query scalability
• Query clients send parallel requests to agents
• Agents perform filtering and projection locally
• Query execution controlled by client
  • Multiple processes can work on same query
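The division of labor described here, with the client fanning out requests and agents filtering and projecting locally, might look roughly like the following. This is a single-process sketch under stated assumptions: agents are simulated as local function calls over in-memory rows, and `agent_scan`, `run_query`, and `files_by_agent` are hypothetical names, not the LazyBase client library.

```python
from concurrent.futures import ThreadPoolExecutor


def agent_scan(rows, predicate, columns):
    """Agent-side work: filter rows, then project the requested columns."""
    return [{c: row[c] for c in columns} for row in rows if predicate(row)]


def run_query(files_by_agent, predicate, columns):
    """Client-side work: send one request per agent in parallel, then combine."""
    workers = max(1, len(files_by_agent))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(agent_scan, rows, predicate, columns)
            for rows in files_by_agent.values()
        ]
        results = []
        for future in futures:
            results.extend(future.result())
    return results
```

For a query like the `COUNT(*)` example shown earlier in the deck, the client would take `len()` of the combined results; only the filtered, projected rows cross the network in this model.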