Scaling for the Known Unknown - QCon London 2020 · Scaling for the Known Unknown Suhail Patel....


Scaling for the Known Unknown

Suhail Patel


March 2016

1,861 Investors

£1,000,000 Raised

96 Seconds



February 2017

41,267 Pledges to invest

£2,500,000 Raised


Late 2018

Monzo is raising £20,000,000 and all our customers will be eligible to participate


Hi, I’m Suhail

I’m a Platform Engineer at Monzo. I work on the Infrastructure and Reliability squad. We help build the base so other engineers can ship their services and applications.

● Email: [email protected]
● Twitter: @suhailpatel


Introduction

A brief overview of our Platform

Building a Crowdfunding Backend

Load testing + Finding bottlenecks


Number of services


Deployment Service

Please deploy service.account at revision b32a9e64

● Review checks
● Static analysis
● Build checks


Running services

service.account


Running services

What we want from services:
● Self-contained
● Scalable
● Stateless
● Fault tolerant



Running services

[Diagram: Kubernetes Worker Nodes hosting service.transaction and a service.account replica at 10.0.10.123]


Running services

[Diagram: across the Kubernetes Worker Nodes, service.transaction sends a request through the Service Mesh to a service.account replica at 10.0.10.123]

Request from service.transaction:
Host: service.account
Proxy: 127.0.0.1:4140
HTTP GET /account

Service Mesh: Route request to a service.account replica, let’s try the one at 10.0.10.123


Service Mesh

The Service Mesh ties the microservices together. It acts as the RPC proxy.

● Handles service discovery and routing
● Retries / Timeouts / Circuit Breaking
● Observability
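
To make the retry and circuit-breaking behaviour listed above concrete, here is a minimal Python sketch. It is not how the Service Mesh at Monzo is implemented. The class and function names, the thresholds, and any replica address other than 10.0.10.123 are assumptions made for the example.

```python
import time


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures, then rejects
    calls until `reset_after` seconds have passed."""

    def __init__(self, max_failures=3, reset_after=5.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        # Half-open: allow a trial request once the cool-down has elapsed.
        return self.clock() - self.opened_at >= self.reset_after

    def record(self, ok):
        if ok:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()


def call_with_retries(send, replicas, breaker, attempts=3):
    """Try replicas in turn; `send(replica)` is expected to raise on failure."""
    last_error = None
    for i in range(attempts):
        if not breaker.allow():
            raise RuntimeError("circuit open")
        replica = replicas[i % len(replicas)]
        try:
            result = send(replica)
        except Exception as exc:  # a real proxy would only retry safe errors
            breaker.record(False)
            last_error = exc
            continue
        breaker.record(True)
        return result
    raise RuntimeError("all attempts failed") from last_error
```

A caller passes its own `send` function, so the same loop works for any transport. Timeouts would live inside `send` in this sketch.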


Asynchronous messaging

Many things can occur asynchronously rather than through a direct blocking RPC.

Message queues like NSQ and Kafka provide asynchronous flows with at-least-once message delivery semantics.

[Diagram: service.transaction replicas and service.txn-enrichment connected by a message queue]
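
At-least-once delivery means a consumer can receive the same message more than once, so handlers are usually written to be idempotent. The sketch below shows one common way to do that in Python. The message fields (`id`, `txn_id`, `merchant`) and the class name are hypothetical, and no NSQ or Kafka client is shown.

```python
class EnrichmentConsumer:
    """Handles each message id once, even if the queue redelivers it."""

    def __init__(self):
        self.seen = set()      # message ids already processed
        self.enriched = {}     # txn_id -> enriched merchant name

    def handle(self, message):
        msg_id = message["id"]
        if msg_id in self.seen:
            # Duplicate delivery: acknowledge it without redoing the work.
            return False
        self.enriched[message["txn_id"]] = message["merchant"].title()
        self.seen.add(msg_id)
        return True
```

In a real service the set of seen ids would need to be stored durably, or the write itself would need to be naturally idempotent.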



Storing data with Cassandra

Please give me transaction id txn_00000123456

service.transaction


Storing data with Cassandra

Cassandra Ring

Please give me transaction id txn_00000123456

service.transaction

Replication Factor: 3

Quorum: Local



Storing data with Cassandra

Please give me transaction id txn_00000123456

service.transaction

Replication Factor: 3

Quorum: One
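
The Replication Factor and Quorum labels on these slides come down to simple arithmetic. The sketch below assumes "Quorum: Local" refers to a quorum of the replicas in the local datacenter and "Quorum: One" to a single replica. The function names are made up for the example.

```python
def quorum(replication_factor):
    """Number of replicas that must respond for a quorum read or write."""
    return replication_factor // 2 + 1


def reads_see_latest_write(replication_factor, read_replicas, write_replicas):
    """True when every read set overlaps every write set (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor


RF = 3
print(quorum(RF))                                          # 2
print(reads_see_latest_write(RF, quorum(RF), quorum(RF)))  # True
print(reads_see_latest_write(RF, 1, 1))                    # False
```

With a replication factor of 3, quorum reads and writes tolerate one unavailable replica while still overlapping. Reading and writing at One is faster but can return stale data.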



Distributed Locking with etcd

Please can I get a lock on transaction txn_00000123456 so I have sole access

service.transaction
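
The lock request above can be pictured with a small in-memory stand-in. This is only an illustration of lock semantics with an expiry. It runs in a single process and does not use etcd or Raft. The class name, the owner names, and the TTL values are assumptions.

```python
import time


class LeaseLock:
    """A lock per key that expires after `ttl` seconds unless released."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.holders = {}  # key -> (owner, expires_at)

    def acquire(self, key, owner, ttl):
        now = self.clock()
        current = self.holders.get(key)
        if current is not None and current[1] > now and current[0] != owner:
            return False  # someone else holds a live lease
        self.holders[key] = (owner, now + ttl)
        return True

    def release(self, key, owner):
        current = self.holders.get(key)
        if current is not None and current[0] == owner:
            del self.holders[key]
            return True
        return False
```

The expiry matters because a service replica that crashes while holding a lock should not block the transaction forever.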


Distributed Locking with etcd

Source: https://raft.github.io/


Monitoring with Prometheus

Prometheus is a flexible time-series data store and query engine

Each of our services exposes metrics in Prometheus format at /metrics

Monitor all the things
● RPC Request/Response cycles
● CPU / Memory / Network use
● Asynchronous processing
● C* and Distributed Locking
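
For readers unfamiliar with what a /metrics endpoint returns, the sketch below renders one counter in the Prometheus text exposition format. The metric name `rpc_requests_total` and its labels are hypothetical, and a real service would normally use a Prometheus client library rather than formatting strings by hand.

```python
def render_counter(name, help_text, samples):
    """Render a counter as Prometheus text format.

    `samples` maps a tuple of (label, value) pairs to the counter value.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


body = render_counter(
    "rpc_requests_total",  # hypothetical metric name
    "Total RPC requests.",
    {(("service", "service.account"), ("code", "200")): 3},
)
print(body, end="")
```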


Introduction

A brief overview of our Platform

Building a Crowdfunding Backend

Load testing + Finding bottlenecks


Requirements

1. Raise at most £20,000,000
We’d agreed with institutional investors leading the funding round that £20M was the cap

2. Ensure users have enough money
Users should have the money they are pledging. We need to verify this before accepting the investment.

3. Handle lots of traffic
It was first come, first served, so we expected a lot of interest at the start of the crowdfunding round

4. Don’t bring down the bank
All banking functions should continue to work whilst we’re running the crowdfunding



Counters / Transactions

What if we used a Cassandra counter?

“In Cassandra, at any given moment, the counter value may be stored in the Memtable, commit log, and/or one or more SSTables. Replication between nodes can cause consistency issues in certain edge cases”
Source: https://docs.datastax.com/en/cql/3.3/cql/cql_using/useCountersConcept.html


[Diagram: Edge Proxy → service.crowdfunding-pre-investment → (rate limited consumption) → service.crowdfunding-investment (ledger checks, confirm transaction)]
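
One way to read the flow above is that pledges are accepted quickly at the front and then consumed at a controlled rate by a worker that enforces the cap and checks balances. The Python sketch below shows that idea only. The function name, the batch size, the pence-based amounts, and the in-memory queue and balances are all assumptions, not the actual service code.

```python
from collections import deque

CAP_PENCE = 20_000_000 * 100  # the £20M cap, held in pence


def process_pledges(queue, balances, raised=0, cap=CAP_PENCE, batch_size=100):
    """Consume up to `batch_size` pledges, one at a time, in arrival order."""
    accepted, rejected = [], []
    for _ in range(min(batch_size, len(queue))):
        user, amount = queue.popleft()
        if balances.get(user, 0) < amount:
            rejected.append((user, "insufficient funds"))
        elif raised + amount > cap:
            rejected.append((user, "round full"))
        else:
            balances[user] -= amount  # stands in for the ledger check
            raised += amount
            accepted.append(user)
    return raised, accepted, rejected
```

Calling this function on a fixed schedule gives the rate limit: the bank's ledger only ever sees `batch_size` checks per tick, however many pledges arrive at the edge.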



Introduction

A brief overview of our Platform

Building a Crowdfunding Backend

Load testing + Finding bottlenecks


Building our own load tester

There are some great off-the-shelf solutions for load testing
● Bees with Machine Guns
● Locust
● ApacheBench (ab)
● Gatling


Building our own load tester

[Diagram: four Load Test Workers send GET /account, GET /balance and GET /news requests through the AWS Load Balancer and the Monzo Edge Proxy to service.account, service.balance and service.news]


At one point, we saw really high error rates in the load testing metrics. We didn’t see load test requests make it to our AWS Load Balancer.

The load test nodes were using internal DNS provided by Amazon Route 53. We were constantly resolving *.monzo.com subdomains.
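
The slides do not say how this was resolved. One common mitigation when a client resolves the same names constantly is to cache lookups for a short time. The sketch below shows that approach in Python. The class name, the TTL, and the hostname in the usage note are assumptions.

```python
import socket
import time


class CachingResolver:
    """Caches hostname lookups for `ttl` seconds to avoid repeated DNS queries."""

    def __init__(self, ttl=30.0, lookup=socket.gethostbyname, clock=time.monotonic):
        self.ttl = ttl
        self.lookup = lookup
        self.clock = clock
        self.cache = {}  # host -> (address, expires_at)

    def resolve(self, host):
        now = self.clock()
        hit = self.cache.get(host)
        if hit is not None and hit[1] > now:
            return hit[0]
        address = self.lookup(host)
        self.cache[host] = (address, now + self.ttl)
        return address
```

A load test worker would call `resolver.resolve("api.monzo.com")` (a hypothetical hostname) before opening connections, so thousands of requests share one lookup per TTL window.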


Load testing in production

For our testing to create realistic load and give us useful results, we needed to test against our production systems – the real bank.


Load testing in production

We set up our load testing system as a third “app” alongside our iOS and Android apps, and we gave it read-only access to the data we needed to test.

Target: Reach 1,000 app launches per second


Scaling services

Target: Reach 1,000 app launches per second


Scaling services

Target: Reach 1,000 app launches per second

replicas: 9
template:
  spec:
    containers:
      resources:
        limits:
          cpu: 30m
          memory: 40Mi
        requests:
          cpu: 10m
          memory: 20Mi


Scaling services

Target: Reach 1,000 app launches per second

replicas: 9
template:
  spec:
    containers:
      resources:
        limits:
          cpu: 100m
          memory: 40Mi
        requests:
          cpu: 50m
          memory: 20Mi
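
The two configurations above differ only in CPU. A small calculation shows what that means for the cluster as a whole. The helper names below are made up for the example. The "m" suffix is the Kubernetes millicore unit, where 1000m is one CPU core.

```python
def cpu_millicores(quantity):
    """Convert a Kubernetes CPU quantity such as '30m' or '0.5' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def total_cpu_request(replicas, cpu_request):
    """CPU the scheduler must reserve for all replicas, in millicores."""
    return replicas * cpu_millicores(cpu_request)


print(total_cpu_request(9, "10m"))  # 90
print(total_cpu_request(9, "50m"))  # 450
```

Summing these totals across every service in the request path is one way to estimate how many worker nodes a target load needs.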


“But wait, you are re-inventing autoscaling, manually?”


Cassandra Bottlenecks

We got to around 500-600 app launches per second before we found a major Platform bottleneck


The numbers

21 x i3.4xlarge EC2 machines
● 16 cores
● 122GiB memory
● 2 * 1.9TiB of NVMe disks

Each node holds about 500GB of data


Cassandra Bottlenecks

Our profiling identified three key areas
● Generating Prometheus metrics
● LZ4 Decompression
● CQL Statement Processing


LZ4 Decompression


CQL Statement Parsing

We saw a significant amount of time being spent in parsing CQL statements.

The majority of our applications had a fixed model during the service pod lifetime so we would’ve been processing the same statement over and over again.


Prepared Statements

Cassandra supports prepared statements! gocql, the library we use to run Cassandra queries, was already using them for the majority of queries.


Prepared Statements

SELECT id, accountid, userid, amount, currency
FROM transaction.transaction_map_Id
WHERE id = ?

SELECT currency, accountid, userid, id, amount
FROM transaction.transaction_map_Id
WHERE id = ?
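
The two statements above select the same columns in a different order. A plausible reading, which is an assumption here, is that statements differing only in column order count as different prepared statements and so get parsed separately. The Python sketch below models a cache keyed by statement text to show the effect. It does not use gocql or Cassandra, and all names are made up.

```python
class StatementCache:
    """Models a prepared-statement cache keyed by the exact statement text."""

    def __init__(self):
        self.prepared = {}
        self.parse_count = 0

    def prepare(self, cql):
        if cql not in self.prepared:
            self.parse_count += 1  # stands in for the costly parse step
            self.prepared[cql] = len(self.prepared) + 1
        return self.prepared[cql]


def select_statement(columns, table, key, sort_columns=False):
    """Build a SELECT; sorting the columns makes the text deterministic."""
    cols = sorted(columns) if sort_columns else list(columns)
    return f"SELECT {', '.join(cols)} FROM {table} WHERE {key} = ?"
```

With `sort_columns=True`, both column orders from the slide produce identical text, so the cache parses the statement once instead of twice.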


Service Mesh Bottlenecks

Target: Reach 1,000 app launches per second

At around 800 app launches per second, we saw our RPCs take a really long time across our Platform.


What we ended up with

● A comprehensive spreadsheet of all the services involved and how much we’d need to scale them (replicas/resource requests/limits)
● An idea of how many EC2 Kubernetes Worker Nodes we’d need, so we could provision them before the crowdfunding round started
● Much more knowledge of where things can fail at this scale
● Confidence!
  ○ Knowing what levers you can pull when things go wrong


Levers

No matter how much preparation we did beforehand, we wanted to ensure we could recover the Platform if anything went wrong

● Feature Toggles
  ○ Gracefully degrading the less critical app features
● Shedding traffic
  ○ Stopping the traffic before it even enters our edge
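
Both levers can be sketched in a few lines of Python. This is an illustration of the two ideas only, not Monzo's tooling. The class name, the in-flight limit, and the feature name used in the usage note are assumptions.

```python
class Levers:
    """Two emergency controls: feature toggles and traffic shedding."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self.in_flight = 0
        self.disabled_features = set()

    def admit(self):
        """Shed traffic: refuse new requests once the limit is reached."""
        if self.in_flight >= self.max_in_flight:
            return False
        self.in_flight += 1
        return True

    def done(self):
        """Mark one admitted request as finished."""
        self.in_flight = max(0, self.in_flight - 1)

    def disable(self, feature):
        """Feature toggle: switch off a less critical feature."""
        self.disabled_features.add(feature)

    def feature_enabled(self, feature):
        return feature not in self.disabled_features
```

For example, an operator could call `levers.disable("news")` to stop serving a non-essential feed while payments keep working, and lower `max_in_flight` to reject excess requests at the edge.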


Things went well

36,006 Investors

£20M Raised

£6.8M in the first 5 minutes


What we learned

Here are the key takeaways and what we learnt as a result of this exercise
● Horizontal scaling has limits
● Treat software as just that, software
● Continuously load test


Thanks!

Email: [email protected]
Twitter: @suhailpatel / @monzo