Scaling Infrastructure at Carousell

Posted on 26-Jan-2017

Harshad Rotithor & Ankur Shrivastava

January 12, 2017

Harshad Rotithor & Ankur Shrivastava Scaling Infrastructure at Carousell January 12, 2017 1 / 48

Who are we?

Harshad Rotithor

Principal Software Engineer

Leads the Infrastructure team

Previously at Flipkart, Airpush, Zynga, etc.

harshad@carousell.com

Who are we?

Ankur Shrivastava

Senior Software Engineer

Engineer in the Infrastructure team

Previously at Flipkart, Amazon, Zynga, etc.

ankur@carousell.com

Where are we currently?

Started in 2012 at a Hackathon

7 countries, 19 cities

57M+ listings

23M+ items sold

Carousell makes buying and selling simple, so that you can fill your life with more meaningful things

Where are we currently?

400+ servers

Multiple services see 2000+ requests per second

Self-managed deployments

PostgreSQL, ElasticSearch, Cassandra, RabbitMQ, Kafka, Redis, Memcache and more ...

Uptime of 99.95%

Ability to handle AZ failures
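
To put the uptime figure in perspective, the short calculation below converts an availability target into a downtime budget. It is plain arithmetic, not something shown in the talk, and the period lengths (a 30-day month, a 365-day year) are assumptions chosen for illustration.

```python
# Downtime budget implied by an availability target (illustrative arithmetic).

def downtime_budget_minutes(availability_percent, period_hours):
    """Minutes of downtime allowed in a period at the given availability."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * period_hours * 60.0


if __name__ == "__main__":
    target = 99.95  # the uptime figure quoted on the slide
    for label, hours in (("30-day month", 30 * 24), ("365-day year", 365 * 24)):
        print(f"{label}: {downtime_budget_minutes(target, hours):.1f} min")
```

At 99.95% this works out to roughly 21.6 minutes per 30-day month, or a little under 4.5 hours per year.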

So what is this talk about?

What it took to reach here, and what lies ahead!

Current Infrastructure - Overview

Infrastructure is:

Architecture, Systems, Operations

Stateful components most important

We self-manage user path datastores

Enable choice of data stores

Right tradeoff in terms of consistency

Enable possibilities of workarounds during rough times

Have flexibility in node configuration etc

Current Infrastructure

Current Infrastructure - Data Stores

Master + 2 Slaves in each AZ (Total 7)

pgbouncer + HAProxy (config-service)

Dedicated data disks (always use SSDs)

Master disk snapshot every 3 hr (fsync enabled)

Don’t turn off Autovacuum (transaction ID wraparound)
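
The autovacuum warning is about transaction ID wraparound: PostgreSQL transaction IDs are 32-bit, and rows have to be frozen by vacuum before their age approaches about 2 billion transactions. The sketch below shows one way to reason about that headroom. The ages are made-up sample values and the function name is hypothetical; in practice the numbers could come from a query such as `SELECT datname, age(datfrozenxid) FROM pg_database;`.

```python
# Sketch: how close is a database to transaction ID wraparound?
# The ages passed in are sample values, not measurements from the talk.

XID_WRAPAROUND_LIMIT = 2**31          # about 2.1 billion transactions
DEFAULT_FREEZE_MAX_AGE = 200_000_000  # PostgreSQL default for autovacuum_freeze_max_age


def wraparound_report(xid_ages):
    """Summarise, per database, how much of the XID space has been used."""
    report = {}
    for name, age in xid_ages.items():
        report[name] = {
            "percent_of_limit": round(100.0 * age / XID_WRAPAROUND_LIMIT, 1),
            # Past this age autovacuum forces an anti-wraparound vacuum.
            "past_freeze_max_age": age > DEFAULT_FREEZE_MAX_AGE,
        }
    return report


if __name__ == "__main__":
    sample_ages = {"app": 150_000_000, "legacy": 1_500_000_000}  # hypothetical
    for database, stats in wraparound_report(sample_ages).items():
        print(database, stats)
```

A database that stays well below the freeze age is being vacuumed regularly; one that keeps climbing is a sign that vacuuming is disabled or falling behind.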

Current Infrastructure - Data Stores

3 clusters, largest being close to 75 nodes

Shard allocation awareness

Use plugins (kopf / head / cerebro)

Keep masters in different AZs

HAProxy with L7 health checks (config-service)

Incremental backups

Set the shard count correctly; err on the higher side

Rely on the Linux page cache
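
Shard allocation awareness is switched on through cluster settings. The sketch below only builds the request body for the cluster settings API and prints it; it does not contact a cluster. The attribute name `zone` and the zone names are assumptions, since the talk does not show the actual configuration, and each node would also need a matching node attribute set in its own config.

```python
import json


def awareness_settings(attribute, zones):
    """Body for the cluster settings API enabling forced allocation awareness."""
    prefix = "cluster.routing.allocation.awareness"
    return {
        "persistent": {
            # Spread copies of a shard across distinct values of this attribute.
            f"{prefix}.attributes": attribute,
            # Forced awareness: do not pile all copies into the surviving zones.
            f"{prefix}.force.{attribute}.values": ",".join(zones),
        }
    }


if __name__ == "__main__":
    # Hypothetical attribute and zone names.
    body = awareness_settings("zone", ["zone-a", "zone-b", "zone-c"])
    print(json.dumps(body, indent=2))
```

With awareness enabled, a primary and its replicas are placed in different zones where possible, which is what lets a cluster survive the loss of one AZ.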

History

Cloud provider ’x’

Everyday firefighting

We hit upper limits

Network, Disk

Noisy neighbours

Limited types of instances

Lack of features

Load balancer, Autoscaling, Security!

Decided on Migration

Planning

Around June 2016

250+ Nodes

Identify ALL nodes and their functionalities

Identify ALL traffic flows and patterns

Architecture Freeze

Perform comparative benchmarks

Redefine node and cluster configuration

Isolated deployment in GCP

Dry run data migration for all clusters

Estimate time

Preparation

July 2016

VPN across the providers (heavy duty)

Replicate all that can be replicated (inter-DC)

Keep stateless nodes ready

Make DNS nameserver changes in advance (3-4 days)

Script everything - node creation, data movement, etc.

Aim for only data movement during Migration

Preparation

Practice, Practice, Practice!

Migration

29th July 2016 at 3am

Queues - RabbitMQ, Kafka, etc

Drain on X

Switch to new on GCP

DB

Replicated slaves across DC

Promote to master and create slaves

ElasticSearch & Cassandra

Snapshot/Restore

Very quick - fast GCP network

Redis

RDB restore, create slaves

Beware of cluster state in case of Redis Cluster

Post Migration

5-6 hr of maintenance

Latency dropped to 1/4th on GCP

DNS propagation issue (even after 2 days)

L7 tunnels over VPN

Ensure monitoring is taken over after migration

Key Take Away

Practice makes the migration perfect!

Keep stateless nodes ready

Keep configuration updated

Expect issues

Redis cluster state switch

DNS caching by ISPs for days

Keep Calm!

From Pets To Cattle

Static Infrastructure is a myth!

Manual updates can be faulty

Nodes can fail quickly, one after another

Configuration can quickly become stale

Misconfiguration of Nodes

Salt propagation issues

Recent config update

Painful to detect and fix

Production impact!

From Pets To Cattle

Infrastructure at scale needs →

Centralized configurations

Dynamic Discovery

Automatic recovery from failures

Autoscaling

Scripts for stateful nodes (create/update/migrate)

Aggressive Monitoring and Alerting

Streamline Deployments

Configuration and Service Discovery

For Configuration we needed →

Centralized configuration storage

Consistent store

Audit of configuration changes

Versioning for quick reverts

Easy to deploy and manage

Configuration and Service Discovery

For Service Discovery we needed →

Decoupled from application code

Health checks

Easy to Scale Out

Easy to deploy and manage

Configuration and Service Discovery

We built ’Config-Service’ on top of ’Consul’

Configuration on nodes using Consul Template & Envconsul

Installation on instances using internal Debian package and repo

’Config-Service’ package takes care of Consul cluster configuration and health check registration

Configuration Management

Git repository to manage configuration

Filename is the key, content is the value

Single source of truth

Audit log of changes

Easy reverts and versioning (just use git revert)
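
The "filename is the key, content is the value" convention can be sketched as a small loader that walks a checked-out repository and builds a key/value map. This is an illustration only: the function name, the directory layout and the sample key are hypothetical, and the step that would push the result into the key/value store is deliberately left out.

```python
import os
import tempfile


def load_config_tree(root):
    """Map each file under root to key = relative path, value = file content."""
    config = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip Git metadata so only configuration files become keys.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for filename in sorted(filenames):
            path = os.path.join(dirpath, filename)
            key = os.path.relpath(path, root).replace(os.sep, "/")
            with open(path, encoding="utf-8") as handle:
                config[key] = handle.read().strip()
    return config


if __name__ == "__main__":
    # A throwaway directory stands in for the configuration repository.
    with tempfile.TemporaryDirectory() as repo:
        os.makedirs(os.path.join(repo, "search"))
        with open(os.path.join(repo, "search", "timeout_ms"), "w", encoding="utf-8") as f:
            f.write("250\n")
        print(load_config_tree(repo))
```

Because every key is just a file, a bad change is undone with an ordinary revert commit followed by another sync, and the Git history doubles as the audit log.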

Service Discovery

Named discovery

Loose coupling

Auto failover

Load balancing

Auto scaling on CPU usage / number of requests

Node Maintenance
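
A minimal sketch of how named discovery, failover and node maintenance fit together, assuming a simple in-memory registry. The class and method names are hypothetical and this is not the Consul API; it only illustrates that callers ask for a service by name and receive the nodes that are healthy and not under maintenance.

```python
import itertools


class ServiceRegistry:
    """In-memory stand-in for a discovery backend (illustrative only)."""

    def __init__(self):
        # service name -> {address: {"healthy": bool, "maintenance": bool}}
        self._nodes = {}

    def register(self, service, address):
        self._nodes.setdefault(service, {})[address] = {
            "healthy": True,
            "maintenance": False,
        }

    def report_health(self, service, address, healthy):
        self._nodes[service][address]["healthy"] = healthy

    def set_maintenance(self, service, address, enabled):
        self._nodes[service][address]["maintenance"] = enabled

    def available(self, service):
        """Addresses that pass health checks and are not in maintenance."""
        return sorted(
            address
            for address, state in self._nodes.get(service, {}).items()
            if state["healthy"] and not state["maintenance"]
        )


def round_robin(addresses):
    """Cycle over the available addresses (simple load balancing)."""
    return itertools.cycle(addresses)


if __name__ == "__main__":
    registry = ServiceRegistry()
    for addr in ("10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"):
        registry.register("listing-api", addr)  # hypothetical service name
    registry.report_health("listing-api", "10.0.0.2:8080", False)   # failed check
    registry.set_maintenance("listing-api", "10.0.0.3:8080", True)  # drained node
    print(registry.available("listing-api"))
```

Callers never hold a fixed address, which is the loose coupling the slide refers to: a failed or drained node simply drops out of the answer.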

Config-Service Overview

Auto Scaling

Pay as you go, lower cost

Better fault tolerance

Availability zone failures

Handle sudden increases in traffic (especially at midnight!)

Key Take Away

Assume things will break

Set Convention

Script everything

Use deb/rpm packages

Instance groups for stateless services

More Cattle, less Pets

Kubernetes

Partial Kubernetes deployment since Oct, 2016

Full Production deployment since Nov, 2016

Using Google Container Engine

30+ deployments

500+ containers (At Peak)

Autoscale on CPU targets

Not all services onboarded yet
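
Autoscaling on a CPU target comes down to a ratio rule. The sketch below follows the formula documented for the Kubernetes Horizontal Pod Autoscaler, desired = ceil(current replicas x current metric / target metric), with a small tolerance band to avoid flapping. The talk does not give its own targets or limits, so the function name, the replica bounds and the tolerance value here are assumptions.

```python
import math


def desired_replicas(current_replicas, current_cpu_percent, target_cpu_percent,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Replica count suggested by the ratio of observed to target CPU usage."""
    ratio = current_cpu_percent / target_cpu_percent
    # Inside the tolerance band the replica count is left unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    # Illustrative numbers: 4 replicas running at 90% CPU against a 60% target.
    print(desired_replicas(4, 90, 60))
```

Scaling up when usage is above target and down when it is below is what lets a deployment follow traffic peaks without paying for idle capacity the rest of the day.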

Kubernetes

We don’t use K8S Ingress/Service

Config-Service (consul) as DaemonSet

Containers get registered on Config-Service (NodePort) from health check

No change in existing architecture needed

Service discovery from Internal/External HAProxy still works

Kubernetes

’Config-Service’ allows us to have a hybrid model

Instance groups can coexist with Kubernetes

Recovery mechanism / Transitioning

Instance group size set to zero (Fully on K8S)

Deployment Pipeline

Jenkins Pipeline

Pipeline triggers Jenkins jobs

3 Clicks to Deploy

Approval Steps

Jobs to pause, resume or revert deployment

Tracked in Slack channels

Soon to be transformed to CI/CD

Monitoring & Alerting

Monitoring is critical

Know your Infrastructure

Capture everything, always

Use proper tools

Prometheus (with exporters), ELK, Sentry, StatsD, NewRelic, OpsGenie, Pingdom

Identify Retention

Monitoring & Alerting

Bare minimum required metrics →

Load Average

CPU percent

Memory Available

Network Bandwidth

Network Connections

Disk IOPS

Disk Usage
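
Two of the metrics above, load average and disk usage, can be read with nothing but the Python standard library, as sketched below. This is an illustration for a Unix host, not the collection method used in the talk; the function name and dictionary keys are hypothetical. CPU percent, memory, network and disk IOPS need OS-specific sources and are typically gathered by a metrics exporter instead.

```python
import os
import shutil


def basic_host_metrics(path="/"):
    """Collect load average and disk usage for the host (Unix only)."""
    load_1m, _load_5m, _load_15m = os.getloadavg()
    disk = shutil.disk_usage(path)
    return {
        "load_average_1m": load_1m,
        # Load relative to core count is easier to compare across node sizes.
        "load_average_per_cpu": load_1m / (os.cpu_count() or 1),
        "disk_used_percent": 100.0 * disk.used / disk.total,
    }


if __name__ == "__main__":
    for name, value in basic_host_metrics().items():
        print(f"{name}: {value:.2f}")
```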

Build Dashboards

Monitoring & Alerting

’Config-Service’ logs auto failover

Slack for notifications

On Call

Avoid alert blindness

Keep links handy

Schedule jobs

Automate
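
One common way to avoid alert blindness is to suppress repeats of the same alert for a cooldown period, so that the channel shows each problem once rather than every time a check fires. The sketch below is a generic illustration of that idea; the class name, the alert keys and the five-minute window are assumptions, not details from the talk.

```python
class AlertThrottle:
    """Suppress repeats of the same alert inside a cooldown window."""

    def __init__(self, cooldown_seconds):
        self.cooldown_seconds = cooldown_seconds
        self._last_sent = {}  # alert key -> time of the last notification

    def should_notify(self, key, now):
        """Return True if this alert should be sent at time `now` (seconds)."""
        last = self._last_sent.get(key)
        if last is not None and now - last < self.cooldown_seconds:
            return False
        self._last_sent[key] = now
        return True


if __name__ == "__main__":
    throttle = AlertThrottle(cooldown_seconds=300)
    # Hypothetical alert stream: (alert key, timestamp in seconds).
    events = [
        ("disk_full:db-1", 0),
        ("disk_full:db-1", 60),
        ("disk_full:db-2", 90),
        ("disk_full:db-1", 400),
    ]
    for key, timestamp in events:
        print(key, timestamp, throttle.should_notify(key, timestamp))
```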

Future Plans

Hire more engineers!

Move more services to Kubernetes

Move away from PostgreSQL (we don’t need ACID)

Transition to Microservices

Improve monitoring further

More fault tolerance

Microservices

Golang (go-kit inspired)

Cassandra for storage

ElasticSearch for lookup

gRPC for communication

Hystrix for real-time monitoring

Zipkin for request tracing

Prometheus for metrics
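
Hystrix is best known for the circuit-breaker pattern, whose open/closed state is what its dashboards display. The slide does not say how the services use it, so the following is only a generic sketch of the pattern in Python, not the team's Go code; `CircuitBreaker`, `CircuitOpenError`, and all parameters are hypothetical names.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures: int, reset_after: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; call skipped")
            # Reset window elapsed: fall through and allow one trial call.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

After `max_failures` consecutive failures the breaker rejects calls immediately for `reset_after` seconds, which keeps a struggling dependency from tying up callers.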

Flash Sale

Ultimate test of scalability

Hard to judge peak

Throughput can multiply in a short time

Planned for 2x throughput
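
To make "planned for 2x throughput" concrete, the arithmetic can be sketched as below. The per-server capacity and the target utilization are assumptions invented for the example, and `servers_needed` is a hypothetical helper; only the 2x multiplier and the 2000+ requests per second figure come from the slides.

```python
import math


def servers_needed(baseline_rps: float, multiplier: float,
                   per_server_rps: float,
                   target_utilization: float = 0.5) -> int:
    """Servers required to serve baseline_rps * multiplier while keeping
    each server at or below target_utilization of its capacity."""
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    if per_server_rps <= 0:
        raise ValueError("per_server_rps must be positive")
    usable_rps = per_server_rps * target_utilization
    return math.ceil(baseline_rps * multiplier / usable_rps)
```

For a service at 2000 requests per second, a 2x plan with an assumed 250 requests per second per server at 50% utilization works out to `servers_needed(2000, 2, 250, 0.5)`, which is 32 servers.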

Flash Sale - Latency

Flash Sale

Cache read calls at multiple layers

Upsized ES nodes, eventually replacing the entire cluster

Local SSD PG slaves with RAID 0 (100k IOPS)

Identify network bottlenecks

Recheck ulimit and connection limits

Build and keep SOP handy
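
Caching read calls at multiple layers can be sketched as a read-through lookup that tries the fastest layer first and falls back to the data store only on a full miss. In this sketch plain dicts stand in for an in-process cache and a shared cache such as Memcache or Redis; `LayeredCache` and `load` are hypothetical names, and expiry and invalidation are deliberately left out.

```python
from typing import Any, Callable, Hashable, List, MutableMapping


class LayeredCache:
    """Read-through cache over several layers, fastest layer first."""

    def __init__(self, layers: List[MutableMapping],
                 load: Callable[[Hashable], Any]) -> None:
        self.layers = layers
        self.load = load  # fallback that reads from the backing data store

    def get(self, key: Hashable) -> Any:
        for i, layer in enumerate(self.layers):
            if key in layer:
                value = layer[key]
                # Backfill the faster layers that missed.
                for upper in self.layers[:i]:
                    upper[key] = value
                return value
        # Full miss: hit the data store once, then populate every layer.
        value = self.load(key)
        for layer in self.layers:
            layer[key] = value
        return value
```

With two layers, a key evicted from the local layer is still served from the shared one, so the database sees at most one read per key until the shared entry is gone.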

Flash Sale - Standard Operating Procedure

Infrastructure Team at Carousell

400+ servers

Thousands of requests per second

Production issues get looked at in under 5 minutes

Uptime of 99.95%

Failures don’t result in outages

All thanks to Planning, Monitoring and Automation
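
Reading the 99.95 uptime figure as a percentage, the downtime it allows can be worked out directly. The helper name `downtime_budget_minutes` and the choice of 30-day and 365-day periods are assumptions made for this example.

```python
def downtime_budget_minutes(uptime_percent: float, period_days: float) -> float:
    """Minutes of downtime allowed in `period_days` at the given uptime."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    period_minutes = period_days * 24 * 60
    return period_minutes * (100.0 - uptime_percent) / 100.0
```

At 99.95% this comes to roughly 21.6 minutes per 30-day month, or about 262.8 minutes (a little under 4.4 hours) per 365-day year.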

Take Away

Isolate stateful and stateless components

Isolating compute is equally important

Choose data stores carefully; you won’t be changing them frequently

Use abstractions only after understanding them

Perform Root Cause Analysis, not just workarounds/isolations

Identify bottlenecks

Monitor everything

Blame CODE not CODER

Thank You

Q&A

P.S. we are hiring http://careers.carousell.com/
