Ebay: DB Capacity planning at eBay

Feng Qu, Sr MTS; Bass Chorng, Principal Capacity Engineer. DB Capacity Planning at eBay. #CassandraSummit2015

Transcript of Ebay: DB Capacity planning at eBay

Page 1: Ebay: DB Capacity planning at eBay

Feng Qu, Sr MTS
Bass Chorng, Principal Capacity Engineer

DB Capacity Planning at eBay

#CassandraSummit2015    

Page 2: Ebay: DB Capacity planning at eBay

Who Am I?


Bass Chorng, Principal Capacity Engineer @ eBay
Specializes in database performance, availability & scalability in a large website.
Established DB capacity team at eBay in 2003.
Loves mountain biking.

Page 3: Ebay: DB Capacity planning at eBay


eBay Site DB Traffic At A Glance

NoSQL Total: 52 B/Day
  Cassandra: 15 B
  Mongo: 15 B
  CouchBase: 12 B
  PushVM: 10 B

RDBMS Total: 350 B/Day
  MySQL: 10 B
  Oracle: 340 B

Site Total DB Calls: 400 B/Day across 2,000 NoSQL Nodes + 450 Oracle Nodes
Peak Traffic: 8M/sec
Hosting 800M Active Items & 120M Active Users
Y-o-Y Growth: 30% ~ 35%

[Chart: Billion SQL Calls per Day by platform. Cassandra 15, Mongo 15, CouchBase 12, PushVM 10, MySQL 10, Oracle 340]
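As a quick sanity check on the figures from the traffic slide, the daily total can be converted to an average per-second rate and compared with the stated peak. This is only a sketch of the arithmetic; the variable names are illustrative and not from the talk.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

daily_calls = 400e9   # site total DB calls per day (from the slide)
peak_per_sec = 8e6    # peak traffic per second (from the slide)

# Average rate implied by the daily total.
avg_per_sec = daily_calls / SECONDS_PER_DAY

# How far the peak sits above the daily average.
peak_to_avg = peak_per_sec / avg_per_sec

print(f"average: {avg_per_sec:,.0f} calls/sec")
print(f"peak-to-average ratio: {peak_to_avg:.2f}")
```

The average works out to roughly 4.6M calls/sec, so the 8M/sec peak is a little over 1.7 times the daily average.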

Page 4: Ebay: DB Capacity planning at eBay

Capacity Planning - Simply Put

Ø  Analyze Traffic
    o  Data
Ø  Analyze Utilization
    o  Data
Ø  Analyze The Relationship Of The Above Two
    o  Same Data
Ø  Forecast Growth
    o  Simple Models, Then Impress Your Boss.
Ø  Convert Resource Need into $
    o  A Calculator, Then Impress Your CIO

BTW, You Also Need To Know …

•  Platform Domain Knowledge: Server, DB Engine, IO Subsystem, Networks …
•  Relationship Between System Overhead & Utilization
•  Seasonality & Workload Characteristics
•  Bottlenecks: Components, Systems, Platforms, Architecture, Site & Apps
•  New Technologies


Page 5: Ebay: DB Capacity planning at eBay

Domain Knowledge Stack

aka Whom To Blame Stack

APPS
DB
UNIX
STORAGE
CAPACITY  <= Bottom of food chain

Page 6: Ebay: DB Capacity planning at eBay

Data

Ø  What To Collect?
    o  Apps, Database, Sessions, CPU, Memory, Connections, IOPS, IO Time, NIC, HBA, Array
Ø  How To Collect?
    o  Time Resolution, Aggregation Level, Retention
Ø  How To Use It?
    o  Average, Max, 95th percentile, Dashboard, Reporting, Trending

[Two time-series charts, titles not recoverable. Chart 1: y-axis 0.0 to 4.0, daily from 5/1/2015 to 5/27/2015. Chart 2: y-axis 0 to 40,000,000, from 1/26/2015 to 3/1/2015.]
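The Data slide lists Average, Max and 95th percentile as ways to use collected metrics. A minimal sketch of how the three differ on the same samples follows. The `summarize` helper and the sample values are hypothetical, and the nearest-rank percentile is one of several common definitions, not necessarily the one used at eBay.

```python
import statistics

def summarize(samples):
    """Return average, max and nearest-rank 95th percentile of the samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Nearest-rank method: 1-based rank = ceil(0.95 * n), in integer arithmetic.
    rank = (95 * len(ordered) + 99) // 100
    return {
        "avg": statistics.fmean(ordered),
        "max": ordered[-1],
        "p95": ordered[rank - 1],
    }

# Hypothetical CPU utilization samples (%): steady load with one spike.
cpu = [20] * 19 + [90]
stats = summarize(cpu)
print(stats)
```

With one spike in twenty samples, the max reports the spike, the average moves only slightly, and the 95th percentile ignores it, which is why the choice of statistic matters for dashboards and trending.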

Page 7: Ebay: DB Capacity planning at eBay

Forecast

Ø  Model Traffic, Not Resources
Ø  Need One Year Trend
Ø  Forecast At Daily Level
Ø  Eliminate Outliers
Ø  No Data Is Better Than Wrong Data
Ø  Convert Traffic To Resource Usage
Ø  Linear Extrapolation Only (CPU Utilization, not IO Time)
Ø  Simple Excel Formula Works Well
Ø  For Long Term Resource Planning Only
Ø  Use Average, Not Max
Ø  Not All Workloads Are Predictable

[Chart: CATY Traffic Forecast. Y-axis: Billion Calls (0 to 70). X-axis: 01/01/2012 to 01/01/2015. Series: Forecast, Actual, Capacity]
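The Forecast slide's recipe (model traffic at daily level, eliminate outliers, extrapolate linearly, then convert traffic to resource usage) can be sketched as below. Everything here is an assumption for illustration: the helper names, the residual-based outlier rule, the ten-day series (the slide asks for a one-year trend) and the CPU-per-call ratio. The slide itself only says a simple Excel formula works well.

```python
import statistics

def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def drop_outliers(xs, ys, k=3.0):
    """Drop points whose residual from a first-pass fit exceeds k times the median residual."""
    slope, intercept = fit_line(xs, ys)
    residuals = [abs(y - (slope * x + intercept)) for x, y in zip(xs, ys)]
    # Small floor so float noise on an already-clean series removes nothing.
    cutoff = max(k * statistics.median(residuals), 1e-9 * max(abs(y) for y in ys))
    kept = [(x, y) for x, y, r in zip(xs, ys, residuals) if r <= cutoff]
    return [x for x, _ in kept], [y for _, y in kept]

# Hypothetical daily traffic in billions of calls, with one bad sample.
days = list(range(10))
calls = [10 + 0.01 * d for d in days]
calls[5] = 30.0  # wrong data point that should not drive the trend

clean_days, clean_calls = drop_outliers(days, calls)
slope, intercept = fit_line(clean_days, clean_calls)

# Linear extrapolation of traffic one year out.
forecast_calls = slope * 365 + intercept

# Convert traffic to resource usage, assuming CPU scales linearly with calls:
# 40% average CPU observed at 10 billion calls/day (assumed figures).
cpu_pct_per_billion = 40.0 / 10.0
forecast_cpu_pct = forecast_calls * cpu_pct_per_billion

print(f"forecast traffic: {forecast_calls:.2f} B calls/day")
print(f"forecast average CPU: {forecast_cpu_pct:.1f}%")
```

The forecast is made on traffic first and converted to CPU afterwards, following the slide's "Model Traffic, Not Resources" and "Use Average, Not Max" points.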

Page 8: Ebay: DB Capacity planning at eBay

Things To Watch For

Myths

Ø  More CPU Makes Apps Run Faster
Ø  More Data Makes Apps Run Slower
Ø  Apps Run Twice As Fast On CPU Twice The Speed
Ø  High Session = High Load

Pitfalls

Ø  Cause vs. Symptom
Ø  Time Resolution Masks Issues
Ø  Look At The Whole Picture
Ø  Slow Down In Order To Go Faster < Throttle >

Challenges

Ø  Data Quality: Data Missing, Data Source Changes, F/O Data Residency, Data Errors …
Ø  Varieties of Data Formats & Resolutions
Ø  Data Collection In Secured Zones
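The "Time Resolution Masks Issues" pitfall can be illustrated with a small sketch: a short saturation event that is obvious at one-minute resolution nearly disappears in an hourly average. The `downsample_mean` helper and the sample values are hypothetical.

```python
def downsample_mean(samples, width):
    """Average consecutive windows of `width` samples."""
    out = []
    for i in range(0, len(samples), width):
        window = samples[i:i + width]
        out.append(sum(window) / len(window))
    return out

# Hypothetical per-minute CPU utilization (%) for one hour.
per_minute = [30] * 60
per_minute[17:20] = [100, 100, 100]  # three minutes of saturation

hourly = downsample_mean(per_minute, 60)

print("max at 1-minute resolution:", max(per_minute))
print("hourly average:", hourly[0])
```

At hourly resolution the host looks about one third busy, while the one-minute data shows it was pinned at 100% for three minutes.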

Page 9: Ebay: DB Capacity planning at eBay

Me: Everything NoSQL

CassandraSummit2015  |  #CassandraSummit  

Ø Prior to 2011: Worked on Oracle at DoubleClick/Yahoo/Intuit

Ø Worked on NoSQL at eBay Database Infrastructure team:
    Ø Cassandra since 2011
    Ø MongoDB since 2012
    Ø Couchbase since 2014

Ø Cassandra Summit speaker for 2013, 2014, 2015

Ø DataStax Cassandra MVP for 2014, 2015

Page 10: Ebay: DB Capacity planning at eBay

For Cassandra

Ø Capacity Measurements
    Ø Throughput
    Ø Latency
    Ø E.g. 30,000 reads/sec with SLA of P99 at 5ms

Ø Hardware SKU Example
    Ø CPU: 20 cores
    Ø Memory: 128GB RAM
    Ø Storage: 1.5TB local SSD
    Ø Network: 10G NIC


Page 11: Ebay: DB Capacity planning at eBay

Benchmarking

Ø Benchmarking for different hardware
    Ø High I/O SKU
    Ø High memory SKU
    Ø High storage SKU
    Ø Bare metal or cloud

Ø Benchmarking for different software releases

Ø Benchmarking for different workloads
    Ø 100% Writes
    Ø 50% Writes, 50% Reads
    Ø 5% Writes, 95% Reads
    Ø 100% Reads

Ø Benchmarking Tools
    Ø YCSB
    Ø Cassandra-stress

Ø Proactive and repeated process using near real-time traffic in prod-like environment
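The four workload mixes on the Benchmarking slide could be turned into cassandra-stress invocations along the following lines. This is a sketch that only builds the argument lists; the `stress_args` helper, operation count, thread count and node address are all assumptions, and the option spellings should be checked against the help output of the cassandra-stress version in use.

```python
def stress_args(write_pct, ops=1_000_000, threads=50, node="127.0.0.1"):
    """Build a cassandra-stress argument list for a given write percentage."""
    if not 0 <= write_pct <= 100:
        raise ValueError("write_pct must be between 0 and 100")
    if write_pct == 100:
        command = ["write"]
    elif write_pct == 0:
        command = ["read"]
    else:
        # Mixed workload expressed as a write:read ratio.
        command = ["mixed", f"ratio(write={write_pct},read={100 - write_pct})"]
    return ["cassandra-stress", *command, f"n={ops}",
            "-rate", f"threads={threads}", "-node", node]

# The four mixes from the slide: 100% writes, 50/50, 5/95, 100% reads.
for pct in (100, 50, 5, 0):
    print(" ".join(stress_args(pct)))
```

Passing the list form to a process launcher avoids shell quoting of the parentheses in the `ratio(...)` argument. A read-only run needs data written first, so the write run usually goes before the others.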


Page 12: Ebay: DB Capacity planning at eBay

Capacity Planning

Ø Key to avoiding surprises in production
Ø The concept behind capacity planning is simple, but the mechanics are harder.
Ø Business requirements may grow, so you need to forecast how much resource must be added to the system to ensure that user experience continues uninterrupted
    Ø Input: clearly defined capacity goal coming from business requirement, and performance baseline from benchmark test
    Ø Output: resources to be added, such as memory, CPU, storage, I/O, network

Ø Always prepare for peak + headroom
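"Peak + headroom" can be turned into a node count once a per-node baseline exists from benchmarking. The sketch below assumes throughput scales linearly with node count and uses a hypothetical 30% headroom and a hypothetical 200,000 reads/sec peak; only the 30,000 reads/sec per-node figure echoes the example given earlier in the deck.

```python
import math

def nodes_needed(peak_ops_per_sec, per_node_ops_per_sec, headroom=0.30):
    """Nodes required to serve peak traffic plus headroom, assuming linear scaling."""
    if per_node_ops_per_sec <= 0:
        raise ValueError("per-node throughput must be positive")
    target = peak_ops_per_sec * (1 + headroom)
    return math.ceil(target / per_node_ops_per_sec)

# Business goal (assumed): 200,000 reads/sec at peak.
# Benchmark baseline: 30,000 reads/sec per node within the latency SLA.
required = nodes_needed(200_000, 30_000)
print("nodes required:", required)
```

The capacity goal is the input from the business side and the per-node rate is the input from the benchmark; the output is the resource to add, matching the Input/Output framing on the slide.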


Page 13: Ebay: DB Capacity planning at eBay

Capacity Planning Process

Ø Initial Sizing
    Ø Storage size vs. data size
    Ø Compaction overhead, compression ratio, RF, indexes
    Ø Cost-effective configuration to meet capacity/latency SLA

Ø Routine Review
    Ø System utilization on I/O, storage, network, CPU, memory etc.
    Ø Cassandra metrics on GC, compaction, latency, throughput etc.
    Ø nodetool compactionstats, cfhistograms, tpstats etc.

Ø Forecasting
    Ø Historical comparison
    Ø Traffic projection

Ø Flex up or Flex down
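For the Initial Sizing step, the gap between data size and storage size can be estimated from the factors the slide names: compaction overhead, compression ratio, RF and indexes. The formula and every input value below are assumptions for illustration (in particular the 50% maximum disk fill to leave room for compaction); only the 1.5TB per-node SSD comes from the hardware SKU example earlier in the deck.

```python
import math

def disk_needed_gb(raw_data_gb, compression_ratio, replication_factor,
                   index_overhead, max_disk_fill):
    """Provisioned disk needed across the cluster for a given raw data size.

    compression_ratio: compressed size / uncompressed size (0.5 = halves the data)
    index_overhead:    extra space for indexes as a fraction of the data
    max_disk_fill:     fraction of each disk allowed to fill, leaving room for compaction
    """
    on_disk = raw_data_gb * compression_ratio * (1 + index_overhead) * replication_factor
    return on_disk / max_disk_fill

# Hypothetical inputs: 1 TB raw data, 2:1 compression, RF 3, 10% indexes, 50% max fill.
total_gb = disk_needed_gb(1000, 0.5, 3, 0.10, 0.5)

# Nodes needed for storage alone with 1.5 TB (1,500 GB) of local SSD each.
storage_nodes = math.ceil(total_gb / 1500)

print(f"provisioned disk: {total_gb:,.0f} GB on {storage_nodes} nodes")
```

Here 1 TB of raw data turns into about 3.3 TB of provisioned disk. The node count from storage is then compared with the count needed for throughput, and the larger of the two drives the configuration.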


Page 14: Ebay: DB Capacity planning at eBay

Scale Up vs. Scale Out

Ø Scale Up (vertical)
    Ø Pros
        Ø Smaller data center footprint, such as space, power, cooling
        Ø Lower license cost
    Ø Cons
        Ø Likely costs more using proprietary hardware
        Ø Less fault tolerant
        Ø Limited upgradability in future

Ø Scale Out (horizontal)
    Ø Pros
        Ø Cheaper using commodity hardware
        Ø More fault tolerant
        Ø (Unlimited) upgradability
    Ø Cons
        Ø Bigger data center footprint
        Ø Higher license cost
        Ø Likely needs more network equipment


Page 15: Ebay: DB Capacity planning at eBay

Questions ?


eBay is hiring experienced NoSQL professionals, please send resume to [email protected]