Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main...

67
Using monitoring to understand Cassandra

Transcript of Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main...

Page 1: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

Using monitoring to understand Cassandra

Page 2: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

About TLP and Myself

Page 3: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

Hi, I'm Alain. @arodream https://www.linkedin.com/in/arodrime

Cassandra operator since 2011 (v 0.8) Datastax MVP for Apache Cassandra Montpellier, France

Consultant - The Last Pickle

Page 4: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?
Page 5: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

About you!

Page 6: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

Poll time!

Page 7: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

About Cassandra

Page 8: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

Cassandra - History

2008

Page 9: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

Cassandra - History

2008

2009 -

2010

Page 10: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

Cassandra - Today!

Sharing knowledge slow + Growing fast —> Skills gap

Source: http://db-engines.com/en/ranking

Page 11: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

Cassandra Internals - Main characteristicsA scalable, Highly Available, and eventually consistent Database!

Scalable? HA? Eventual consistency?

Shard! Redundancy! Anti-entropy systems!

Token range / vnodes Consistent hashing Linear scalability

Replication Factor (RF) No SPOF (Masterless)

Hinted handoff Read Repair

Anti-entropy Repair Consistency Level (CL)

Page 12: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

Cassandra Internals - Tunable Consistency

• ANY (Writes - uses HH)• ONE (Two, Three)• LOCAL_ONE• QUORUM → floor(RF/2) + 1• LOCAL_QUORUM• EACH_QUORUM• ALL + Strong Consistency

- Poor Availability

+ Strong Availability- Poor Consistency

Page 13: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

Cassandra Internals - Writes

Write withCL: ONE

RF = 3

Page 14: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

Cassandra Internals - R&W Operations

Multiple Data Center - Write requestsRF {DC1: 3, DC2: 3} and CL: ONE

Page 15: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

Cassandra Internals - R&W Operations

Read withCL: Quorum

RF = 3

Page 16: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

About Monitoring

Page 17: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

Monitoring is important!

Page 18: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

I meant: Efficient monitoring is important!

Cassandra Summit 2016: https://www.youtube.com/watch?v=Q9AAR4UQzMk

I have been advocating this way

Page 19: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

We want to prevent this!

Or at least be aware that it is happening!

Page 20: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

An accurate feel for the system

Page 21: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

Troubleshooting and optimization

What else!

Page 22: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

Monitoring to learn about internals!

Page 23: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

Monitor to learn about internals!

Page 24: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

The idea?

We can use monitoring to learn

But we need efficient monitoring tools!

Page 25: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

The problem?

How to have a relevant set of dashboards?

How could a new user know what dashboards to build?

?

Page 26: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

A strong set of dashboards- Is long, painful and expensive to build- Requires an expert knowledge to be built

Page 27: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

An utopic world ?

So is this idea a fantasy?

Yet another unrealistic recommendation?

Page 28: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

A strong set of dashboards+ Is very useful to any operator from day 1

+ A strong set of dashboards is actually expert knowledge easily sharable!

+ A template is easy to use and immediately available

Page 29: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

Out of the box dashboards?An ideal world as an operator would probably look like this

Page 30: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

Building dashboards: Templates idea origin

(From my talk at Cassandra Summit 2016)

• TLP = Consulting company = many Cassandra clusters

Page 31: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

Building dashboards: Templates idea origin

(From my talk at Cassandra Summit 2016)

• TLP = Consulting company = many Cassandra clusters • Need to build dashboards for many clients

Page 32: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

Building dashboards: Templates idea origin

(From my talk at Cassandra Summit 2016)

• TLP = Consulting company = many Cassandra clusters • Need to build dashboards for many clients • Laziness, I do not like repeating tasks

Page 33: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

Building dashboards: Templates idea origin

• TLP = Consulting company = many Cassandra clusters • Need to build dashboards for many clients • Laziness, I do not like repeating tasks

We decided to build templates and share with the community

(From my talk at Cassandra Summit 2016)

Page 34: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

Out of the box TLP-dashboards!Ongoing work to be shared with the C* community “soon”

Page 35: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

Back to the idea…

With the right tools!

We can use monitoring to learn

Page 36: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

The right tools

1 - Overview Dashboards - Detect

Page 37: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

The right tools

1 - Overview Dashboards - Detect2 - Themed Dashboards - Troubleshoot

Page 38: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

The right tools

1 - Overview Dashboards - Detect2 - Themed Dashboards - Troubleshoot3 - Themed Dashboards - Optimize

Page 39: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

Applying to Cassandra

Page 40: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

How to learn Cassandra relying on monitoring?

Let’s try to understand a couple production issues together

Example 1: Read messages dropped

Page 41: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

A dropped read issue - Monitoring

From overview dashboard

Read path issue

Page 42: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

A dropped read issue - Internals

How does Cassandra read data?

Let’s read and learn about it

Page 43: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

A dropped read issue - Internals

How does Cassandra read data?

Read issue —> Read Path Dashboard!

Get values from MemtableGet values from Row Cache if presentCheck BF - find possible SSTablesCheck Key Cache for fast search in SSTablesGet values from SSTablesRepopulate the Row Cache?

Page 44: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

A dropped read issue - InternalsSo what can be causing this issue?

• Under-sized cluster• Inefficient bloomfilters• Inefficient key caching• Tombstones read• Too many SSTables• Wide rows• Poorly designed queries• JVM / GC issues• Hardware issue• …

Page 45: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

A dropped read issue - Monitoring

From read path dashboard

Spikes only appear on disks charts-

On 2 nodes

Page 46: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

A dropped read issue - Thinking

Looks like a hardware issue…

… but then, why aren’t writes dropped as well?

Page 47: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

A dropped read issue - Internals

Searching to understand internals

Cassandra write path

Page 48: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

Insert into MemtableAppend to commitlog

→ No read, no data disk hit

→ Very fast

→ Bottleneck CPU before I/O !

Written data is not hitting the data disk in Cassandra!

A dropped read issue - Internals

Cassandra write path

Page 49: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

A dropped read issue - takeaway

We learned about read and write path at the node levelChanging Machines / disks solved the issue!

Page 50: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

How to learn Cassandra relying on monitoring?

Let’s try to understand a few production issues together

Example 2: Latency issue

Page 51: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

A latency issue - MonitoringFrom overview dashboard

Page 52: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

A latency issue - MonitoringFrom overview dashboard

Page 53: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

A latency issue - TroubleshootingFrom the read path dashboard

Page 54: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

A latency issue - InternalsWait a minute, what are compactions in Cassandra?

Back to the write path!

Page 55: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

A latency issue - InternalsWait a minute, what are compactions in Cassandra?

Inserts → Size up / Spread rows

Updates → Inserts

Deletes → Insert tombstone

SSTables are immutable !

So, why are compactions not keeping up?

Page 56: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

A latency issue - Troubleshooting

From the resources dashboard

Page 57: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

A latency issue - TroubleshootingSumming things up

• Latency issue• Dropped messages• High # of compactions pending• Frequent and long GC activity• Read and Write paths affected• Looks like a global outage• Resources doing OK though

From Monitoring

From context • Old Cassandra version• Huge dataset (4 TB / node)

Page 58: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

A latency issue - ActingTake 1 - Improve compactions

Page 59: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

A latency issue - Internals

Take 1 - Improve compactions

Facing CASSANDRA-9662: No-op compactions triggering

Page 60: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

A latency issue - Think

• Compaction max speed is not enough

• Disks are not a bottleneck, nor is CPU

• Not a resource issue, rather a Cassandra one

• Adding node is always possible solution

• Solving CASSANDRA-9662 is an other solution

Page 61: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

A latency issue - Acting

Take 2 - Add nodes + Upgrade Cassandra

We added some nodes as a “quick” fix

Then planned an upgrade of the cluster

Page 62: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?
Page 63: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

A dropped read issue - takeawayWe learned that about compactions in Cassandra

We also learned about a Cassandra issue

Page 64: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

Conclusion

Page 65: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

Do not undervalue Monitoring!

Anomaly detection Troubleshooting

Optimization Feel for the cluster

Knowledge

Learning

Page 66: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

Use Monitoring to learn as you go!

Happy trip, I hope you’ll chose the right way ;-)

Page 67: Using monitoring to understand Cassandra · 2017-12-14 · Cassandra Internals - Main characteristics A scalable, Highly Available, and eventually consistent Database! Scalable? HA?

Thanks!@arodrime

I hope this was “enlightening.”Questions?