Scaling Infrastructure at Carousell

Harshad Rotithor & Ankur Shrivastava

January 12, 2017

Who are we?

Harshad Rotithor

Principal Software Engineer

Leads Infrastructure team

Previously at Flipkart, Airpush, Zynga, etc.

harshad@carousell.com

Who are we?

Ankur Shrivastava

Senior Software Engineer

Engineer in the Infrastructure team

Previously at Flipkart, Amazon, Zynga, etc.

ankur@carousell.com

Where are we currently?

Started in 2012 at a Hackathon

7 countries, 19 cities

57M+ listings

23M+ items sold

Carousell makes buying and selling simple, so that you can fill your life with more meaningful things

400+ servers

Multiple Services see 2000+ requests per second

Self Managed deployments

PostgreSQL, ElasticSearch, Cassandra, RabbitMQ, Kafka, Redis, Memcache and more ...

Uptime of 99.95%

Ability to handle AZ failures
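
To put the uptime figure in perspective, the arithmetic below converts an availability percentage into an allowed-downtime budget. It is a sketch that assumes the 99.95 figure is a percentage measured over a 365-day year; the function name is illustrative.

```python
def downtime_budget_hours(availability_percent: float,
                          period_hours: float = 365 * 24) -> float:
    """Return the hours of downtime allowed in a period at a given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    # The unavailable fraction of the period is the downtime budget.
    return period_hours * (1 - availability_percent / 100)


if __name__ == "__main__":
    # 99.95% over a year leaves a little under four and a half hours.
    print(round(downtime_budget_hours(99.95), 2))
```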

So what is this talk about?

What it took to reach here, and what lies ahead!

Current Infrastructure - Overview

Infrastructure is:

Architecture, Systems, Operations

Stateful components most important

We self-manage user path datastores

Enable choice of data stores

Right tradeoff in terms of consistency

Enable possibilities of workarounds during rough times

Have flexibility in node configuration etc.

Current Infrastructure

Current Infrastructure - Data Stores

Master + 2 slaves in each AZ (total 7)

pgbouncer + HAProxy (config-service)

Dedicated data disks (always use SSDs)

Master disk snapshot every 3 hr (fsync enabled)

Don’t turn off autovacuum (transaction ID wraparound)
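
The autovacuum point relates to transaction ID wraparound: PostgreSQL transaction IDs are 32-bit, so rows that are never frozen by vacuum become a problem once their age approaches roughly 2 billion transactions. A minimal sketch of a headroom check, assuming the age value comes from a query such as `SELECT age(datfrozenxid) FROM pg_database`; the warning threshold and function names are illustrative, not the team's actual alerting.

```python
XID_WRAPAROUND_LIMIT = 2**31  # about 2.1 billion transactions


def xid_headroom(xid_age: int) -> float:
    """Fraction of the wraparound limit still unused (1.0 = fresh, 0.0 = at the limit)."""
    if xid_age < 0:
        raise ValueError("xid_age must be non-negative")
    return max(0.0, 1 - xid_age / XID_WRAPAROUND_LIMIT)


def needs_attention(xid_age: int, warn_below: float = 0.5) -> bool:
    """True when less than `warn_below` of the headroom remains (assumed threshold)."""
    return xid_headroom(xid_age) < warn_below
```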

3 clusters, largest being close to 75 nodes

Shard allocation awareness

Use plugins (kopf/head/cerebro)

Keep masters in different AZ

HAProxy with L7 health checks (config-service)

Incremental backups

Set shard count correctly; err on the higher side

Rely on the Linux page cache
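
Shard allocation awareness aims to keep copies of the same shard in different zones, so that losing one AZ does not lose every copy. A minimal sketch of that invariant as a check, assuming a hypothetical list of (shard, zone) placements rather than any real Elasticsearch API response:

```python
from collections import defaultdict


def shards_confined_to_one_zone(placements):
    """Return the shard IDs whose copies all sit in a single zone.

    placements: iterable of (shard_id, zone) pairs, one pair per shard copy.
    A shard with only one copy is also reported, since it has no second zone.
    """
    zones_by_shard = defaultdict(set)
    for shard_id, zone in placements:
        zones_by_shard[shard_id].add(zone)
    return sorted(s for s, zones in zones_by_shard.items() if len(zones) < 2)
```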

History

Cloud provider ’x’

Everyday firefighting

We hit upper limits

Network

Disk

Noisy neighbours

Limited types of instances

Lack of features

Load balancer

Autoscaling

Security!

Decided on Migration

Planning

Around June 2016

250+ Nodes

Identify ALL nodes and their functionalities

Identify ALL traffic flows and patterns

Architecture Freeze

Perform comparative benchmarks

Redefine node and cluster configuration

Isolated deployment in GCP

Dry run data migration for all clusters

Estimate time
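
The time estimate for moving each cluster is mostly data size divided by sustained throughput. A rough sketch of that arithmetic; the efficiency factor and any sample figures are assumptions for illustration, not numbers from the migration.

```python
def transfer_hours(data_gb: float, link_mbps: float, efficiency: float = 0.7) -> float:
    """Estimate the hours needed to copy `data_gb` gigabytes over a `link_mbps` link.

    `efficiency` discounts protocol overhead and contention (assumed value).
    Decimal units are used: 1 GB = 8000 megabits.
    """
    if link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_mbps must be positive and efficiency in (0, 1]")
    megabits = data_gb * 8 * 1000
    seconds = megabits / (link_mbps * efficiency)
    return seconds / 3600
```

Comparing such an estimate with the time measured in a dry run shows how far real throughput falls short of the nominal link speed.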

Preparation

July 2016

VPN across the providers (heavy duty)

Replicate all that can be replicated (inter DC)

Keep stateless nodes ready

Make DNS nameserver changes in advance (3-4 days)

Script everything - node creation, data movement, etc.

Aim for only data movement during migration

Practice, Practice, Practice!

Migration

29th July 2016 at 3am

Queues - RabbitMQ, Kafka, etc

Drain on X

Switch to new on GCP

Replicated slaves across DC

Promote to master and create slaves

ElasticSearch & Cassandra

Snapshot/Restore

Very quick - fast GCP network

RDB restore, create slaves

Beware of cluster state in case of Redis cluster
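
The queue step (drain on the old provider, then switch to the new one) can be sketched as a polling loop. `get_depth` and `switch_consumers` are hypothetical callables standing in for whatever RabbitMQ or Kafka tooling is actually used; the timeout values are assumptions.

```python
import time


def drain_then_switch(get_depth, switch_consumers,
                      poll_seconds=1.0, timeout_seconds=600.0, sleep=time.sleep):
    """Wait until the old queue is empty, then switch; return True on success."""
    waited = 0.0
    while get_depth() > 0:
        if waited >= timeout_seconds:
            return False  # never switch while messages remain on the old side
        sleep(poll_seconds)
        waited += poll_seconds
    switch_consumers()
    return True
```

Passing `sleep` as a parameter keeps the loop easy to rehearse without real waiting, which fits the "practice" advice above.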

Post Migration

5-6 hours of maintenance

Latency dropped to 1/4th on GCP

DNS propagation issue (even after 2 days)

L7 tunnels over VPN

Ensure monitoring is taken over after migration

Key Take Away

Practice makes the migration perfect!

Keep configuration updated

Expect issues

Redis cluster state switch

DNS caching by ISPs for days

Keep Calm!

From Pets To Cattle

Static Infrastructure is a myth!

Manual updates can be faulty

Nodes can fail quickly, one after another

Configuration can quickly become stale

Misconfiguration of nodes

Salt propagation issues

Recent config update

Painful to detect and fix

Production impact!

Infrastructure at scale needs →

Centralized configurations

Dynamic Discovery

Automatic recovery from failures

Autoscaling

Scripts for stateful nodes (create/update/migrate)

Aggressive Monitoring and Alerting

Streamline Deployments

Configuration and Service Discovery

For Configuration we needed →

Centralized configuration storage

Consistent store

Audit of configuration changes

Versioning for quick reverts

Easy to deploy and manage

For Service Discovery we needed →

Decoupled from application code

Health checks

Easy to Scale Out

Easy to deploy and manage

We built ’Config-Service’ on top of ’Consul’

Configuration on nodes using Consul Template & Envconsul

Installation on instances using internal Debian package and repo

’Config-Service’ package takes care of Consul cluster configuration and health check registration
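
Health check registration in Consul is commonly done with a service definition file read by the local agent. A minimal sketch that generates one, assuming a hypothetical service called `listing-api` with an HTTP `/health` endpoint; the definitions actually written by the ’Config-Service’ package are not shown in the talk.

```python
import json


def consul_service_definition(name, port, health_path="/health", interval="10s"):
    """Build a Consul agent service definition with a single HTTP health check."""
    return {
        "service": {
            "name": name,
            "port": port,
            "check": {
                # The agent polls this URL; a non-2xx response marks the node unhealthy.
                "http": f"http://localhost:{port}{health_path}",
                "interval": interval,
            },
        }
    }


if __name__ == "__main__":
    # Hypothetical service name and port, for illustration only.
    print(json.dumps(consul_service_definition("listing-api", 8080), indent=2))
```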

Configuration Management

Git repository to manage configuration

Filename is the key, content is the value

Single source of truth

Audit log of changes

Easy reverts and versioning (just use git revert)
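
The "filename is the key, content is the value" convention can be sketched as a walk over a checked-out repository. This is an illustration under assumptions: the directory layout is hypothetical, and pushing the resulting pairs into the key/value store is left out.

```python
from pathlib import Path


def load_config_tree(root):
    """Map each file's path relative to `root` (the key) to its text content (the value)."""
    root = Path(root)
    config = {}
    for path in sorted(root.rglob("*")):
        relative = path.relative_to(root)
        # Skip directories and Git's own metadata.
        if path.is_file() and ".git" not in relative.parts:
            config[relative.as_posix()] = path.read_text().strip()
    return config
```

With this layout, a file `db/host` containing `10.0.0.5` becomes the pair `db/host` → `10.0.0.5`, and reverting a commit restores the previous value.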

Service Discovery

Named discovery

Loose coupling

Auto failover

Load balancing

Auto scaling on CPU usage / number of requests

Node Maintenance
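
Named discovery, auto failover and load balancing fit together roughly as follows. This is a toy in-memory sketch, not how Consul or HAProxy implement it; the class, method names and addresses are hypothetical.

```python
class Registry:
    """Toy service registry: resolve a name to a healthy address, round-robin."""

    def __init__(self):
        self._nodes = {}    # service name -> {address: is_healthy}
        self._cursors = {}  # service name -> number of resolves so far

    def report(self, service, address, healthy):
        """Record the latest health check result for one node."""
        self._nodes.setdefault(service, {})[address] = healthy

    def resolve(self, service):
        """Return the next healthy address for `service`."""
        healthy = sorted(a for a, ok in self._nodes.get(service, {}).items() if ok)
        if not healthy:
            raise LookupError(f"no healthy nodes for {service!r}")
        cursor = self._cursors.get(service, 0)
        self._cursors[service] = cursor + 1
        return healthy[cursor % len(healthy)]
```

Callers only know the service name, so a node that fails its health check or is taken out for maintenance simply stops being returned.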

Config-Service Overview

Auto Scaling

Pay as you go, lower cost

Better fault tolerance

Availability zone failures

Handle sudden increase in traffic (especially at midnight!)

Key Take Away

Assume things will break

Set Convention

Script everything

Use deb/rpm packages

Instance groups for stateless services

More Cattle, less Pets

Kubernetes

Partial Kubernetes deployment since Oct 2016

Full production deployment since Nov 2016

Using Google Container Engine

30+ deployments

500+ containers (at peak)

Autoscale on CPU targets

Not all services onboarded yet
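
Autoscaling on a CPU target follows the rule documented for the Kubernetes Horizontal Pod Autoscaler: scale the replica count by the ratio of observed to target utilisation and round up. A minimal sketch of that rule; the min/max bounds are illustrative defaults, and the real controller also applies a tolerance band that is omitted here.

```python
import math


def desired_replicas(current_replicas, current_cpu_percent, target_cpu_percent,
                     min_replicas=1, max_replicas=100):
    """Replica count suggested by a CPU-utilisation target, clamped to the bounds."""
    if target_cpu_percent <= 0:
        raise ValueError("target_cpu_percent must be positive")
    raw = math.ceil(current_replicas * current_cpu_percent / target_cpu_percent)
    return max(min_replicas, min(max_replicas, raw))
```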

Kubernetes

We don’t use K8S Ingress/Service

Config-Service (Consul) as DaemonSet

Containers get registered on Config-Service (NodePort) from health check

No change in existing architecture needed

Service discovery from internal/external HAProxy still works

Kubernetes

’Config-Service’ allows us to have hybrid model

Instance groups can coexist with Kubernetes

Recovery mechanism / Transitioning

Instance group size set to zero (Fully on K8S)

Deployment Pipeline

Jenkins Pipeline

Pipeline triggers Jenkins jobs

3 Clicks to Deploy

Approval Steps

Jobs to pause, resume or revert deployment

Tracked in Slack channels

Soon to be transformed to CI/CD

Monitoring & Alerting

Monitoring is critical

Know your Infrastructure

Capture everything, always

Use proper tools

Prometheus (with exporters)

ELK

Sentry

StatsD

NewRelic

OpsGenie

Pingdom

Identify Retention

Bare minimum required metrics →

Load Average

CPU percent

Memory Available

Network Bandwidth

Network Connections

Disk IOPS

Disk Usage
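
A few of these metrics can be read with nothing but the Python standard library, which is handy for a quick sanity check before exporters are in place. A minimal sketch for a Unix-like host; it covers only load average, CPU count and disk usage, and the dictionary keys are illustrative names rather than any exporter's metric names.

```python
import os
import shutil


def basic_host_metrics(path="/"):
    """Collect load average, CPU count and disk usage for the filesystem at `path`."""
    load_1m, load_5m, load_15m = os.getloadavg()  # available on Unix-like systems
    disk = shutil.disk_usage(path)
    used_percent = 100 * disk.used / disk.total if disk.total else 0.0
    return {
        "load_average_1m": load_1m,
        "load_average_5m": load_5m,
        "load_average_15m": load_15m,
        "cpu_count": os.cpu_count(),
        "disk_used_percent": used_percent,
    }
```

Load average is easier to interpret next to the CPU count, which is why both are returned together.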

Build Dashboards

’Config-Service’ logs auto failover

Slack for notifications

On Call

Avoid alert blindness

Keep links handy

Schedule jobs

Automate
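
One common tactic against alert blindness is to suppress repeats of the same alert inside a time window, so that a flapping check does not flood the channel. A minimal sketch of that idea; the class name, the alert key format and the 5-minute default are assumptions, not the team's setup.

```python
class AlertThrottle:
    """Allow one notification per alert key per time window."""

    def __init__(self, window_seconds=300):
        self.window_seconds = window_seconds
        self._last_sent = {}  # alert key -> timestamp of the last notification

    def should_notify(self, alert_key, now):
        """Return True if this alert should be sent at time `now` (in seconds)."""
        last = self._last_sent.get(alert_key)
        if last is not None and now - last < self.window_seconds:
            return False  # same alert fired recently; stay quiet
        self._last_sent[alert_key] = now
        return True
```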

Future Plans

Hire more engineers!

Move more services to Kubernetes

Move away from PG (don’t need ACID)

Transition to Microservices

Improve monitoring further

More fault tolerance

Microservices

Golang (go-kit inspired)

Cassandra for storage

ElasticSearch for lookup

gRPC for communication

Hystrix for real-time monitoring

Zipkin for request tracing

Prometheus for metrics

Flash Sale

Ultimate test of scalability

Hard to judge peak

Throughput can multiply in a short time

Planned for 2x throughput

Flash Sale - Latency

Flash Sale

Cache read calls at multiple layers

Upsized ES nodes, eventually replacing entire cluster

Local SSD PG slaves with RAID 0 (100k IOPS)

Identify network bottlenecks

Recheck ulimit and connection limits

Build and keep SOP handy
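
Caching read calls at multiple layers can be sketched as a read-through lookup that tries the fastest layer first and back-fills the layers above a hit. This is an illustration with plain dictionaries standing in for a process-local cache and a shared cache such as Redis or Memcache; the class and the `load` callable are hypothetical.

```python
class LayeredCache:
    """Read-through cache over several dict-like layers, fastest first."""

    def __init__(self, layers, load):
        self.layers = layers  # e.g. [local_dict, shared_dict]
        self.load = load      # fallback that reads from the datastore

    def get(self, key):
        for index, layer in enumerate(self.layers):
            if key in layer:
                value = layer[key]
                # Back-fill the faster layers that missed.
                for upper in self.layers[:index]:
                    upper[key] = value
                return value
        # Every layer missed: hit the datastore once and populate all layers.
        value = self.load(key)
        for layer in self.layers:
            layer[key] = value
        return value
```

During a spike, repeated reads for the same hot item are served from the layers, and the datastore sees the key only on a full miss. Expiry and invalidation are left out of the sketch.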

Flash Sale - Standard Operating Procedure

Infrastructure Team at Carousell

400+ servers

Thousands of requests per second

Production issues get looked at in under 5 minutes

Uptime of 99.95%

Failures don’t result in outages

All thanks to Planning, Monitoring and Automation

Take Away

Isolate stateful and stateless components

Isolating compute is equally important

Choose data stores carefully; you won’t be changing them frequently

Use abstractions only after understanding them

Perform Root Cause Analysis not just workarounds/isolations

Identify bottlenecks

Monitor everything

Blame CODE not CODER

Thank You

P.S. we are hiring http://careers.carousell.com/
