Scaling Infrastructure at Carousell
Harshad Rotithor & Ankur Shrivastava
January 12, 2017
Harshad Rotithor & Ankur Shrivastava Scaling Infrastructure at Carousell January 12, 2017 1 / 48
Who are we?
Harshad Rotithor
Principal Software Engineer
Leads Infrastructure team
Previously at Flipkart, Airpush, Zynga, etc.
Ankur Shrivastava
Senior Software Engineer
Engineer in the Infrastructure team
Previously at Flipkart, Amazon, Zynga, etc.
Where are we currently?
Started in 2012 at a Hackathon
7 countries, 19 cities
57M+ listings
23M+ items sold
Carousell makes buying and selling simple, so that you can fill our life with more meaningful things
Where are we currently?
400+ servers
Multiple services see 2000+ requests per second
Self-managed deployments
PostgreSQL, ElasticSearch, Cassandra, RabbitMQ, Kafka, Redis, Memcache, and more ...
Uptime of 99.95%
Ability to handle AZ failures
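To put the uptime figure in perspective, it can be converted into a downtime budget. This is a small worked calculation, assuming the 99.95 figure is a percentage; the slides do not say over which window it is measured, so the 30-day and 365-day periods below are illustrative.

```python
def downtime_budget_minutes(uptime_percent: float, period_days: float) -> float:
    """Return the minutes of downtime allowed over period_days at the given uptime."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (100 - uptime_percent) / 100


# Assumed measurement windows; the talk does not state one.
monthly_minutes = downtime_budget_minutes(99.95, 30)
yearly_minutes = downtime_budget_minutes(99.95, 365)

print(f"30 days:  {monthly_minutes:.1f} minutes")
print(f"365 days: {yearly_minutes / 60:.2f} hours")
```

At 99.95% this works out to roughly 21.6 minutes per 30 days, or about 4.4 hours per year.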
So what is this talk about?
What it took to reach here, and what lies ahead!
Current Infrastructure - Overview
Infrastructure is:
Architecture, Systems, Operations
Stateful components most important
We self-manage user path datastores
Enable choice of data stores; right tradeoff in terms of consistency; enable possibilities of workarounds during rough times; have flexibility in node configuration, etc.
Current Infrastructure
Current Infrastructure - Data Stores (PostgreSQL)
Master + 2 slaves in each AZ (total 7)
pgbouncer + HAProxy (config-service)
Dedicated data disks (always use SSDs)
Master disk snapshot every 3 hr (fsync enabled)
Don’t turn off autovacuum (transaction ID wraparound)
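The autovacuum warning concerns transaction ID wraparound: PostgreSQL transaction IDs are 32-bit counters, and old rows must be frozen by vacuum before about 2^31 transactions have passed. The following is a minimal sketch of a headroom check, assuming the age has already been read with a query such as `SELECT datname, age(datfrozenxid) FROM pg_database`. The function names and the 50% threshold are illustrative and not from the talk.

```python
XID_WRAPAROUND_LIMIT = 2**31                      # about 2.1 billion transactions
DEFAULT_AUTOVACUUM_FREEZE_MAX_AGE = 200_000_000   # PostgreSQL's default setting


def xid_usage_fraction(xid_age: int) -> float:
    """Fraction of the wraparound horizon already consumed by a database."""
    return xid_age / XID_WRAPAROUND_LIMIT


def needs_attention(xid_age: int, threshold: float = 0.5) -> bool:
    """True when the oldest unfrozen XID is past the (assumed) alert threshold."""
    return xid_usage_fraction(xid_age) >= threshold


# Hypothetical ages, as age(datfrozenxid) would report them.
for name, age in {"listings": 180_000_000, "chat": 1_500_000_000}.items():
    print(name, f"{xid_usage_fraction(age):.0%}", needs_attention(age))
```

With autovacuum running, ages normally stay near the freeze max age; a value that keeps climbing well beyond it suggests vacuum is disabled or blocked.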
Current Infrastructure - Data Stores (ElasticSearch)
3 clusters, largest being close to 75 nodes
Shard allocation awareness
Use plugins (kopf / head / cerebro)
Keep masters in different AZs
HAProxy with L7 health checks (config-service)
Incremental backups
Set shard count correctly; err on the higher side
Rely on the Linux page cache
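Shard allocation awareness tells the cluster which node attribute marks a failure domain, so that copies of a shard are spread across it. The sketch below shows the idea using `cluster.routing.allocation.awareness.attributes`, the Elasticsearch setting involved. The attribute name `zone`, the node names, and the helper function are hypothetical.

```python
import json
from collections import defaultdict

# Body for a cluster-settings update; "zone" is an assumed node attribute name.
awareness_settings = {
    "persistent": {"cluster.routing.allocation.awareness.attributes": "zone"}
}


def shards_confined_to_one_zone(allocation, node_zone):
    """Return ids of shards whose copies all sit in a single zone.

    allocation is a list of (shard_id, node_name) pairs, one per shard copy.
    A shard with only one copy is reported too, since it has no redundancy.
    """
    zones_per_shard = defaultdict(set)
    for shard_id, node in allocation:
        zones_per_shard[shard_id].add(node_zone[node])
    return sorted(s for s, zones in zones_per_shard.items() if len(zones) < 2)


# Hypothetical layout: shard 1 has both copies in zone "a".
node_zone = {"es-1": "a", "es-2": "b", "es-3": "a"}
allocation = [(0, "es-1"), (0, "es-2"), (1, "es-1"), (1, "es-3")]

print(json.dumps(awareness_settings))
print(shards_confined_to_one_zone(allocation, node_zone))
```

A check of this kind shows which shards would be lost entirely if one AZ failed.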
History
Cloud provider ’x’
Everyday firefighting
We hit upper limits
Network, Disk
Noisy neighbours
Limited types of instances
Lack of features
Load balancer, Autoscaling, Security!
Decided on Migration
Planning
Around June 2016
250+ Nodes
Identify ALL nodes and their functionalities
Identify ALL traffic flows and patterns
Architecture Freeze
Perform comparative benchmarks
Redefine node and cluster configuration
Isolated deployment in GCP
Dry run data migration for all clusters
Estimate time
Preparation
July 2016
VPN across the providers (heavy duty)
Replicate all that can be replicated (inter-DC)
Keep stateless nodes ready
Make DNS nameserver changes in advance (3-4 days)
Script everything: node creation, data movement, etc.
Aim for only data movement during migration
Practice, Practice, Practice!
Migration
29th July 2016 at 3am
Queues - RabbitMQ, Kafka, etc.
Drain on X, switch to new on GCP
DB
Replicated slaves across DCs; promote to master and create slaves
ElasticSearch & Cassandra
Snapshot/restore; very quick thanks to the fast GCP network
Redis
RDB restore, create slaves; beware of cluster state in the case of Redis Cluster
Post Migration
5-6 hr of maintenance
Latency dropped to 1/4th on GCP
DNS propagation issue (even after 2 days)
L7 tunnels over VPN
Ensure monitoring is taken over after migration
Key Take Away
Practice makes the migration perfect!
Keep stateless nodes ready
Keep configuration updated
Expect issues
Redis cluster state switch; DNS caching by ISPs for days
Keep Calm!
From Pets To Cattle
Static Infrastructure is a myth!
Manual updates can be faulty
Nodes can fail quickly, one after another
Configuration can quickly become stale
Misconfiguration of nodes
Salt propagation issues; recent config update
Painful to detect and fix
Production impact!
Infrastructure at scale needs →
Centralized configurations
Dynamic Discovery
Automatic recovery from failures
Autoscaling
Scripts for stateful nodes (create/update/migrate)
Aggressive Monitoring and Alerting
Streamline Deployments
Configuration and Service Discovery
For Configuration we needed →
Centralized configuration storage
Consistent store
Audit of configuration changes
Versioning for quick reverts
Easy to deploy and manage
For Service Discovery we needed →
Decoupled from application code
Health checks
Easy to Scale Out
Easy to deploy and manage
We built ’Config-Service’ on top of ’Consul’
Configuration on nodes using Consul Template & Envconsul
Installation on instances using internal Debian package and repo
’Config-Service’ package takes care of Consul cluster configuration and health check registration
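Consul Template re-renders a config file whenever the service catalog changes. The Python sketch below imitates that rendering step for an HAProxy backend so the data flow is visible. It does not use Consul Template's own template syntax, and the service name, addresses, and record shape are made up for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Instance:
    """One registered service instance (hypothetical record shape)."""
    name: str
    address: str
    port: int
    healthy: bool


def render_backend(service: str, instances: Iterable[Instance]) -> str:
    """Render an HAProxy backend stanza that lists only healthy instances."""
    lines = [f"backend {service}", "    balance roundrobin"]
    for inst in sorted(instances, key=lambda i: i.name):
        if inst.healthy:
            lines.append(f"    server {inst.name} {inst.address}:{inst.port} check")
    return "\n".join(lines) + "\n"


catalog = [
    Instance("api-1", "10.0.0.1", 8080, True),
    Instance("api-2", "10.0.0.2", 8080, False),  # failing its health check
    Instance("api-3", "10.0.0.3", 8080, True),
]
rendered = render_backend("listing-api", catalog)
print(rendered)
```

An instance that fails its health check drops out of the rendered file on the next render, so the proxy stops sending it traffic.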
Configuration Management
Git repository to manage configuration
Filename is the key, content is the value
Single source of truth
Audit log of changes
Easy reverts and versioning (just use git revert)
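The "filename is the key, content is the value" convention maps directly onto a key/value store. Below is a minimal sketch of turning a checked-out repository into key/value pairs. The directory layout and function name are assumptions, and the step that pushes the pairs into the store is left out.

```python
import tempfile
from pathlib import Path
from typing import Dict


def load_config_tree(root: Path) -> Dict[str, str]:
    """Map each file's path relative to root (the key) to its content (the value)."""
    pairs = {}
    for path in sorted(root.rglob("*")):
        relative = path.relative_to(root)
        # Skip directories and git's own metadata.
        if path.is_file() and ".git" not in relative.parts:
            pairs[relative.as_posix()] = path.read_text(encoding="utf-8")
    return pairs


# Demo with a throwaway directory standing in for the config repository.
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    (repo / "search").mkdir()
    (repo / "search" / "timeout_ms").write_text("250", encoding="utf-8")
    demo = load_config_tree(repo)

print(demo)
```

Because every key is a file, a `git revert` of a commit restores the previous values, and `git log` serves as the audit trail.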
Service Discovery
Named discovery
Loose coupling
Auto failover
Load balancing
Auto scaling on CPU usage / number of requests
Node Maintenance
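Named discovery, automatic failover, load balancing, and node maintenance can be pictured as one small routine: pick the next node in rotation, skipping any that is unhealthy or drained. This is a toy sketch, not how Config-Service or HAProxy implement it; the class and the addresses are hypothetical.

```python
from itertools import count


class ServicePool:
    """Round-robin over the nodes currently marked healthy."""

    def __init__(self, nodes):
        self._healthy = {node: True for node in nodes}
        self._counter = count()

    def mark(self, node, healthy):
        """Record a health-check result, or drain a node for maintenance."""
        self._healthy[node] = healthy

    def pick(self):
        candidates = [n for n, ok in self._healthy.items() if ok]
        if not candidates:
            raise RuntimeError("no healthy nodes")
        return candidates[next(self._counter) % len(candidates)]


pool = ServicePool(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
pool.mark("10.0.0.2", False)  # failed health check or under maintenance
picks = [pool.pick() for _ in range(4)]
print(picks)
```

Callers only ever ask the pool for a node, which is the loose coupling the slide refers to: they never hold a fixed address.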
Config-Service Overview
Auto Scaling
Pay as you go, lower cost
Better fault tolerance
Availability zone failures
Handle sudden increases in traffic (especially at midnight!)
Key Take Away
Assume things will break
Set Convention
Script everything
Use deb/rpm packages
Instance groups for stateless services
More Cattle, less Pets
Kubernetes
Partial Kubernetes deployment since Oct 2016
Full production deployment since Nov 2016
Using Google Container Engine
30+ deployments
500+ containers (at peak)
Autoscale on CPU targets
Not all services onboarded yet
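Autoscaling on a CPU target reduces to a proportional rule. The Kubernetes Horizontal Pod Autoscaler documents it as desired = ceil(current replicas × current metric / target metric). The sketch below applies that rule; the numbers and replica bounds are illustrative and not from the talk, and the real controller's tolerance band and stabilisation behaviour are omitted.

```python
import math


def desired_replicas(current_replicas: int, current_cpu_percent: float,
                     target_cpu_percent: float,
                     min_replicas: int = 1, max_replicas: int = 100) -> int:
    """Scale replicas in proportion to how far CPU usage is from its target."""
    raw = math.ceil(current_replicas * current_cpu_percent / target_cpu_percent)
    return max(min_replicas, min(max_replicas, raw))


# Illustrative values: 10 replicas with a 60% CPU target.
scale_up = desired_replicas(10, 90, 60)
scale_down = desired_replicas(10, 30, 60)
print(scale_up, scale_down)
```

With 10 replicas and a 60% target, 90% usage scales up to 15 replicas and 30% usage scales down to 5.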
We don’t use K8S Ingress/Service
Config-Service (Consul) as a DaemonSet
Containers get registered on Config-Service (NodePort) from the health check
No change in existing architecture needed
Service discovery from internal/external HAProxy still works
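The talk does not show how a container is registered, so the following is only a sketch of what a registration body for Consul's agent API (`PUT /v1/agent/service/register`) could look like for a pod reached through a NodePort. The field names are Consul's; the service name, IP, port, health path, and intervals are assumptions.

```python
import json


def registration_payload(service: str, node_ip: str, node_port: int,
                         health_path: str = "/health") -> dict:
    """Build a Consul service registration with an HTTP health check."""
    return {
        "ID": f"{service}-{node_ip}-{node_port}",  # unique per instance
        "Name": service,                           # the name clients discover
        "Address": node_ip,
        "Port": node_port,
        "Check": {
            "HTTP": f"http://{node_ip}:{node_port}{health_path}",
            "Interval": "10s",  # assumed check frequency
            "Timeout": "2s",    # assumed check timeout
        },
    }


payload = registration_payload("listing-api", "10.8.0.12", 31080)
print(json.dumps(payload, indent=2))
```

Since a container is registered the same way as an instance-group node, the existing HAProxy layer does not need to know whether a backend runs on Kubernetes.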
’Config-Service’ allows us to have hybrid model
Instance groups can coexist with Kubernetes
Recovery mechanism / Transitioning
Instance group size set to zero (Fully on K8S)
Deployment Pipeline
Jenkins Pipeline
Pipeline triggers Jenkins jobs
3 clicks to deploy
Approval steps
Jobs to pause, resume or revert deployment
Tracked in Slack channels
Soon to be transformed to CI/CD
Monitoring & Alerting
Monitoring is critical
Know your Infrastructure
Capture everything, always
Use proper tools
Prometheus (with exporters), ELK, Sentry, StatsD, New Relic, OpsGenie, Pingdom
Identify Retention
Bare minimum required metrics →
Load Average
CPU percent
Memory Available
Network Bandwidth
Network Connections
Disk IOPS
Disk Usage
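Several of these metrics can be read with the Python standard library alone, which is handy for a quick check on a single node. This is a rough sketch assuming a Unix-like host; the metric names are made up, and in a setup like the one described a Prometheus exporter would normally collect these instead.

```python
import os
import shutil


def collect_basic_metrics(path: str = "/") -> dict:
    """Sample load average and disk usage for one node (Unix-like hosts only)."""
    load_1m, load_5m, load_15m = os.getloadavg()
    disk = shutil.disk_usage(path)
    return {
        "load_average_1m": load_1m,
        "load_average_15m": load_15m,
        # Load divided by CPU count makes nodes of different sizes comparable.
        "load_per_cpu_1m": load_1m / (os.cpu_count() or 1),
        "disk_used_percent": 100 * disk.used / disk.total,
    }


metrics = collect_basic_metrics()
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Memory, network, and disk IOPS figures need platform-specific sources such as `/proc` on Linux, so they are left out here.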
Build Dashboards
Monitoring & Alerting
’Config-Service’ logs auto failover
Slack for notifications
On Call
Avoid alert blindness
Keep links handy
Schedule jobs
Automate
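One common way to reduce alert blindness is to suppress repeats of the same alert for a quiet period, so a flapping check does not flood the channel. The talk does not say how notifications are throttled; the class, alert key format, and the 300-second window below are illustrative assumptions.

```python
class AlertThrottle:
    """Allow one notification per alert key within each quiet period."""

    def __init__(self, quiet_seconds: float):
        self.quiet_seconds = quiet_seconds
        self._last_sent = {}

    def should_notify(self, alert_key: str, now: float) -> bool:
        last = self._last_sent.get(alert_key)
        if last is not None and now - last < self.quiet_seconds:
            return False  # same alert fired recently; stay quiet
        self._last_sent[alert_key] = now
        return True


throttle = AlertThrottle(quiet_seconds=300)
# The same alert firing at these timestamps (in seconds).
decisions = [throttle.should_notify("disk-full:db-1", t) for t in (0, 60, 299, 300, 310)]
print(decisions)
```

Only the first alert and the one arriving after the full quiet period produce a notification; the rest are dropped.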
Future Plans
Hire more engineers!
Move more services to Kubernetes
Move away from PG (don’t need ACID)
Transition to Microservices
Improve monitoring further
More fault tolerance
Microservices
Golang (go-kit inspired)
Cassandra for storage
ElasticSearch for lookup
gRPC for communication
Hystrix for real-time monitoring
Zipkin for request tracing
Prometheus for metrics
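Hystrix is best known for the circuit-breaker pattern: after repeated failures, calls to a dependency are rejected immediately until a timeout passes, then a trial call is allowed. The sketch below illustrates that pattern only. It is not Hystrix's API, it is written in Python rather than the Go the slide names, and the class names, thresholds and timeout are assumptions.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; call rejected")
            # Timeout elapsed: half-open, so let this one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Failing fast like this keeps one slow dependency from tying up callers, which matters once a monolith is split into services that call each other over the network.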
Flash Sale
Ultimate test of scalability
Hard to judge peak
Throughput can multiply in a short time
Planned for 2x throughput
Flash Sale - Latency
Flash Sale
Cache read calls at multiple layers
Upsized ES nodes, eventually replacing the entire cluster
Local SSD PG slaves with RAID 0 (100k IOPS)
Identify network bottlenecks
Recheck ulimit and connection limits
Build and keep SOP handy
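"Cache read calls at multiple layers" can be pictured as a read path that checks a short-lived in-process cache, then a longer-lived shared cache, and only then the backing store. The sketch below is a minimal illustration under assumptions: the slides do not say which layers or TTLs were used, and `TTLCache`, `layered_read` and `load_from_store` are hypothetical names. A real shared layer would be something like Redis or Memcache rather than a second in-process dictionary.

```python
import time


class TTLCache:
    """Tiny in-process cache whose entries expire after ttl seconds."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._data = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._data[key]  # expired: drop it and report a miss
            return None
        return value

    def set(self, key, value):
        self._data[key] = (self.clock() + self.ttl, value)


def layered_read(key, layers, load_from_store):
    """Try each cache layer in order; on a full miss, load and backfill.

    None is treated as a miss, so None values are never cached.
    """
    for i, layer in enumerate(layers):
        value = layer.get(key)
        if value is not None:
            for upper in layers[:i]:
                upper.set(key, value)  # backfill the faster layers that missed
            return value
    value = load_from_store(key)
    for layer in layers:
        layer.set(key, value)
    return value
```

A short TTL on the innermost layer bounds staleness while still absorbing the bursts of identical reads that a flash sale produces.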
Flash Sale - Standard Operating Procedure
Infrastructure Team at Carousell
400+ servers
Thousands of requests per second
Production issues get looked at in < 5 minutes
Uptime of 99.95%
Failures don’t result in outages
All thanks to Planning, Monitoring and Automation
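To put the uptime figure in perspective, and assuming the 99.95 is a percentage measured over calendar time, the allowed downtime works out to roughly 262.8 minutes (about 4.4 hours) per year, or 21.6 minutes per 30 days. The helper below is a hypothetical illustration of that arithmetic, not something from the talk.

```python
def downtime_budget_minutes(uptime_percent, period_days):
    """Minutes of downtime allowed over period_days at the given uptime percentage."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100.0 - uptime_percent) / 100.0


yearly = downtime_budget_minutes(99.95, 365)   # budget over a 365-day year
monthly = downtime_budget_minutes(99.95, 30)   # budget over a 30-day month
print(f"per year: {yearly:.1f} min, per 30 days: {monthly:.1f} min")
```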
Take Away
Isolate stateful and stateless components
Isolating compute is equally important
Choose data stores carefully; you won’t be changing them frequently
Use abstractions only after understanding them
Perform Root Cause Analysis, not just workarounds/isolations
Identify bottlenecks
Monitor everything
Blame CODE not CODER
Thank You
Q&A
P.S. we are hiring http://careers.carousell.com/