Otimizando Servidores Web (Optimizing Web Servers)
Davi Menezes, Lead Cloud Technical Account Manager, AWS Support – Latin America

Page 1:

Optimizing Web Servers

Davi Menezes
Lead Cloud Technical Account Manager, AWS Support – Latin America

Page 2:

Different strategies for better performance

• Leverage newer hardware and software.
• Apply more resources through auto scaling.
• Offload the heavy lifting to someone else.
• Optimize the web server stack.

Page 3:

Defining “better” performance

• Throughput: transactions per second (tps).
• Latency reduction.
• Cost reduction.

Page 4:

Optimizations by definition are app-specific

• Test and validate together with the application itself.
• There is no substitute for production data.
• Make it an integral part of the application itself.
  – E.g. Elastic Beanstalk .ebextensions

Page 5:

Identifying Bottlenecks

Page 6:

First understand your workload

• What are we serving?
  – Number of transactions
  – Transaction size
  – Back-end resource consumption
• How much can we do today?
  – Theoretical benchmark
  – Actual production load (observability / data-driven)
• What is the bottleneck resource?
  – “Choose instance type for the bounding resource”
  – Workload Analysis vs. Resource Analysis

https://youtu.be/7Cyd22kOqWc

Page 7:

Avoid tuning finds at random

Page 8:

Logs: the ultimate source of truth

    119.246.177.166 - - [02/Nov/2014:05:02:00 +0000] "GET /tmUnblock.cgi HTTP/1.1" 400 301 "-"
    117.21.173.27 - - [02/Nov/2014:06:28:39 +0000] "GET /manager/html HTTP/1.1" 404 289 "-"
    117.21.225.165 - - [02/Nov/2014:16:36:58 +0000] "GET /manager/html HTTP/1.1" 404 289 "-"
    50.62.6.117 - - [02/Nov/2014:20:50:39 +0000] "GET //wp-login.php HTTP/1.1" 404 289 "-"
    50.62.6.117 - - [02/Nov/2014:20:50:39 +0000] "GET /blog//wp-login.php HTTP/1.1" 404 295 "-"
    50.62.6.117 - - [02/Nov/2014:20:50:40 +0000] "GET /wordpress//wp-login.php HTTP/1.1" 404 300 "-"
    50.62.6.117 - - [02/Nov/2014:20:50:40 +0000] "GET /wp//wp-login.php HTTP/1.1" 404 293 "-"
    24.199.131.50 - - [03/Nov/2014:08:00:30 +0000] "GET /tmUnblock.cgi HTTP/1.1" 400 301 "-"
    76.10.82.137 - - [03/Nov/2014:08:55:49 +0000] "GET /tmUnblock.cgi HTTP/1.1" 400 301 "-"
    123.249.19.23 - - [03/Nov/2014:09:15:29 +0000] "GET /manager/html HTTP/1.1" 404 289 "-"
    117.21.173.27 - - [03/Nov/2014:15:55:25 +0000] "GET /manager/html HTTP/1.1" 404 289 "-"
    62.210.136.228 - - [03/Nov/2014:22:31:22 +0000] "GET / HTTP/1.1" 403 3839 "-"
    24.27.104.175 - - [04/Nov/2014:00:18:18 +0000] "GET /tmUnblock.cgi HTTP/1.1" 400 301 "-"
    198.20.69.74 - - [04/Nov/2014:02:07:05 +0000] "GET / HTTP/1.1" 403 3839 "-"
    198.20.69.74 - - [04/Nov/2014:02:07:13 +0000] "GET /robots.txt HTTP/1.1" 404 287 "-"
    181.188.47.118 - - [04/Nov/2014:03:02:56 +0000] "GET /tmUnblock.cgi HTTP/1.1" 400 301 "-"
    117.21.173.27 - - [04/Nov/2014:09:27:19 +0000] "GET /manager/html HTTP/1.1" 404 289 "-"
    193.174.89.19 - - [04/Nov/2014:13:34:23 +0000] "GET / HTTP/1.1" 403 3839 "-"
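Access log entries like these become useful once they are aggregated. The sketch below is one possible way to count status codes and requested paths; the regular expression only covers the fields visible in the excerpt, and the names `LOG_LINE`, `summarize`, and `SAMPLE` are illustrative, not part of the talk.

```python
import re
from collections import Counter

# Covers only the fields visible in the excerpt:
# client IP, timestamp, request line, status code, response size.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def summarize(lines):
    """Return (status_counts, path_counts) for the lines that parse."""
    statuses, paths = Counter(), Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip malformed entries instead of failing
        statuses[match.group("status")] += 1
        paths[match.group("path")] += 1
    return statuses, paths


# Three entries copied from the log excerpt.
SAMPLE = [
    '119.246.177.166 - - [02/Nov/2014:05:02:00 +0000] "GET /tmUnblock.cgi HTTP/1.1" 400 301 "-"',
    '117.21.173.27 - - [02/Nov/2014:06:28:39 +0000] "GET /manager/html HTTP/1.1" 404 289 "-"',
    '62.210.136.228 - - [03/Nov/2014:22:31:22 +0000] "GET / HTTP/1.1" 403 3839 "-"',
]

if __name__ == "__main__":
    status_counts, path_counts = summarize(SAMPLE)
    print(status_counts.most_common())
    print(path_counts.most_common())
```

Run against a full log, the most common paths in this excerpt (/tmUnblock.cgi, /manager/html, wp-login.php) would stand out immediately as scanner traffic rather than real users.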

Page 9:

CloudWatch Metric Anatomy

• Statistical aggregation
  – Min
  – Max
  – Sum
  – Average
  – Count
• One data point per minute.
• Can trigger actions via alarms.

Page 10:

Micro metrics vs. Macro metrics

• Agent-based monitoring

• Available in Amazon Linux

• Provides highly-granular, server-specific insights

Source: http://demo.munin-monitoring.org/

Page 11:

Coming from a variety of sources

Customer generated
• Kernel and Operating System
• Web Server
• Application Server/Middleware
• Application code
• Instance networking

AWS generated
• Amazon CloudFront
• Amazon Elastic Load Balancing
• Amazon CloudWatch
• Amazon Simple Storage Service

Page 12:

More than meets the eye

[Figure: two charts. One plots "Latency at percentile" against "Average Latency"; the other is a "Latency Histogram" showing frequency per latency bucket.]
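The gap between an average and the percentiles can be illustrated with a small calculation. The sketch below uses made-up latencies (95 fast requests and 5 slow outliers) and a nearest-rank percentile; the data and the helper name `percentile` are assumptions for illustration only.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds.
latencies = [20] * 95 + [2000] * 5

average = statistics.mean(latencies)
p50 = percentile(latencies, 50)
p95 = percentile(latencies, 95)
p99 = percentile(latencies, 99)

if __name__ == "__main__":
    print(f"average={average} ms p50={p50} ms p95={p95} ms p99={p99} ms")
```

With this data the average lands at 119 ms, a latency that no request actually experienced: most users see 20 ms and a few see 2000 ms, which only the percentiles and the histogram reveal.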

Page 13:

Noteworthy AWS CloudWatch metrics

• EC2 Instances
  – New T2 CPU Credits
  – CPU utilization
  – Bandwidth (In/Out)
• EBS
  – PIOPS utilization
  – GP2 utilization
  – Remember: an 8GB volume will provision 24 IOPS!
• Elastic Load Balancing
  – RequestCount
  – Latency
  – Queue length and spillover
  – Backend connection errors
• CloudFront
  – Requests
  – BytesDownloaded

Page 14:

Diving Deep on the Last Mile (you & us)

Page 15:

Elastic Load Balancer

Page 16:

ELB Connection Behavior

• No true limits on influx of connections
  – But may require preemptive scaling (aka pre-warming)
• Leverages HTTP Keep-Alives
• Configurable Idle Connection Timeout
• HTTP Session Stickiness & Health-checking
  – Fast Registration
• SSL Off-loading and Back-end authentication

Page 17:

ELB access logs

HTTP log entries
• Only one side of the picture.
• Can’t log custom headers or format logs.
• Logs are delayed.
• Drill down to instance-level responsiveness, but can’t dive into latency outliers.

[Figure: "Processing Time" chart plotting backend_processing_time, request_processing_time, and response_processing_time; one axis is labeled in bytes.]
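The three processing-time fields can be pulled out of an access log entry with a few lines of code. This is a minimal sketch that assumes the space-delimited Classic Load Balancer log layout (timestamp, ELB name, client, backend, the three timings, status codes, byte counts, quoted request); the sample line, ELB name, and addresses are invented for illustration.

```python
import shlex

# Assumed field order of a Classic Load Balancer access log entry.
FIELDS = (
    "timestamp", "elb", "client", "backend",
    "request_processing_time", "backend_processing_time",
    "response_processing_time", "elb_status_code",
    "backend_status_code", "received_bytes", "sent_bytes", "request",
)
TIMINGS = (
    "request_processing_time",
    "backend_processing_time",
    "response_processing_time",
)


def parse_elb_line(line):
    """Split one log entry into a dict, converting the timings to floats."""
    parts = shlex.split(line)  # keeps the quoted request line together
    entry = dict(zip(FIELDS, parts))
    for key in TIMINGS:
        entry[key] = float(entry[key])
    return entry


def total_time(entry):
    """Seconds spent in the ELB plus the back end for one request."""
    return sum(entry[key] for key in TIMINGS)


# Hypothetical entry; not taken from the talk.
SAMPLE_LINE = (
    '2015-05-13T23:39:43.945958Z my-elb 192.0.2.10:2817 10.0.0.1:80 '
    '0.000073 0.001048 0.000057 200 200 0 29 '
    '"GET http://www.example.com:80/ HTTP/1.1" "curl/7.38.0" - -'
)

if __name__ == "__main__":
    parsed = parse_elb_line(SAMPLE_LINE)
    print(parsed["backend"], parsed["elb_status_code"], total_time(parsed))
```

Grouping `backend_processing_time` by the `backend` field is one way to get the per-instance view mentioned above.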

Page 18:

ELB Key Metrics

• Latency and Request Count
• Surge Queue and Spillover
• ELB 5xx and 4xx
• Back-end Connection Errors
• Healthy and Unhealthy Host Counts

Page 19:

The life of an HTTP connection

Page 20:

    #include <arpa/inet.h>    /* sockaddr_in, inet_aton, htons */
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Left undefined on the slide: read(cfd, ...) until "\r\n\r\n". */
    void read_request(int cfd);

    int main(void) {
        const char *resp = "HTTP/1.0 200 OK\r\n\r\n"
                           "Bem-vindo ao AWS Summit SP 2015.";
        int cfd, fd = socket(PF_INET, SOCK_STREAM, IPPROTO_TCP);
        struct sockaddr_in si = {0};
        socklen_t len = sizeof si;

        si.sin_family = AF_INET;
        inet_aton("127.0.0.1", &si.sin_addr);
        si.sin_port = htons(80);
        bind(fd, (struct sockaddr *)&si, sizeof si);
        listen(fd, 512);          /* 512 is the requested accept backlog */
        while ((cfd = accept(fd, (struct sockaddr *)&si, &len)) != -1) {
            read_request(cfd);
            write(cfd, resp, strlen(resp));
            close(cfd);
            len = sizeof si;      /* accept() overwrites len on each call */
        }
        return 0;
    }

[Diagram: the path to serving http:80. fd=socket(PF_INET,SOCK_STREAM,IPPROTO_TCP), then bind(fd,...) and listen(fd,512), then accept(fd,...); every accepted connection adds to the number of open file descriptors.]

Page 21:

The last TCP mile

• Accept Pending Queue
  – man listen(2): “(…) backlog argument defines the maximum length to which the queue of pending connections for sockfd may grow.”
  – Recv-Q & Send-Q
  – TCP is stream oriented
• man accept(2): Blocking vs. Non-blocking sockets

Page 22:

Tweaking the TCP stack (aka sysctl)

Page 23:

Queuing at the TCP layer first

• ECONNREFUSED, per man listen(2): “if the underlying protocol supports retransmission, the request may be ignored so that a later reattempt at connection succeeds” – aka: TCP Retransmit

Page 24:

Scaling in the Linux Networking Stack

• Connection States
  – man netstat(8)
• Backlog Maximum Length
  – Waiting to be accepted: /proc/sys/net/core/somaxconn
  – Half-open connections: /proc/sys/net/ipv4/tcp_max_syn_backlog
  – CPU's input packet queue: /proc/sys/net/core/netdev_max_backlog
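One practical consequence of the accept-queue limit: per listen(2), a backlog larger than the value in /proc/sys/net/core/somaxconn is silently truncated to that value, so a web server asking for 512 may get far less. The sketch below shows the arithmetic; `effective_backlog` and `read_somaxconn` are illustrative helper names, and the value 128 in the usage example is an assumed system setting (check the file on your own instance).

```python
def effective_backlog(requested, somaxconn):
    """Accept-queue length the kernel will actually honor for listen()."""
    return min(requested, somaxconn)


def read_somaxconn(path="/proc/sys/net/core/somaxconn"):
    """Read the current limit on a Linux host (the file is Linux-specific)."""
    with open(path) as handle:
        return int(handle.read().strip())


if __name__ == "__main__":
    # 512 is the backlog used in the listen() call on the earlier slide;
    # 128 is an assumed somaxconn value for illustration.
    print(effective_backlog(512, 128))
```

If the effective value is lower than what the web server is configured to request, raising somaxconn is a prerequisite for the web server setting to have any effect.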

Page 25:

TCP is a Window based protocol

• TCP Receive Window: “considered one of the most important TCP tweaks” (ugh!)
  – BDP = avail. bandwidth (KBps) × RTT (ms)
• Choose an EC2 instance with proper bandwidth

[Diagram: the sender transmits bytes {62,6C,…,75,6D}; the receiver replies with an ACK carrying its window size.]
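The bandwidth-delay product (BDP) is simple to work out: kilobytes per second multiplied by milliseconds gives bytes directly, because the two factors of 1000 cancel. The sketch below uses illustrative numbers (a 1 Gbit/s path, which is 125,000 KB/s, and a 100 ms round trip); the function names are not from the talk.

```python
def bdp_bytes(bandwidth_kbytes_per_s, rtt_ms):
    """Bandwidth-delay product in bytes.

    (KB/s * 1000 B/KB) * (ms / 1000 ms/s) simplifies to KB/s * ms.
    """
    return bandwidth_kbytes_per_s * rtt_ms


def max_throughput_bytes_per_s(window_bytes, rtt_ms):
    """Upper bound on throughput when at most one window is in flight per RTT."""
    return window_bytes * 1000 / rtt_ms


if __name__ == "__main__":
    # Assumed example: 1 Gbit/s = 125,000 KB/s, 100 ms round-trip time.
    print(bdp_bytes(125_000, 100))
    # A 65,535-byte window (the maximum without window scaling) on the same path.
    print(max_throughput_bytes_per_s(65_535, 100))
```

In this example the path can hold 12.5 MB in flight, while a 64 KB window caps a single connection at roughly 655 KB/s, which is why the receive window matters on high-latency links.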

Page 26:

TCP Initial Congestion Window

• RFC 3390 – Higher Initial Window
  – ip route (…) initcwnd 10 (kernel <2.6.39)
• Disable slow start after idle (net.ipv4.tcp_slow_start_after_idle)
• Google Research
  – “propose to increase (…) to at least ten segments (about 15KB)”
  – Pub: “An Argument for Increasing TCP's Initial Congestion Window”

    +/* TCP initial congestion window */
    +#define TCP_INIT_CWND 10

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=(…) committed to kernel 2.6.39 (May 2011)
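To see why the initial congestion window matters for small web responses, the sketch below counts round trips under an idealized slow start: the window doubles every round trip, there is no loss, and the receiver window never limits the sender. The MSS of 1460 bytes and the 50 KB response are assumptions chosen for illustration.

```python
import math


def round_trips(response_bytes, initcwnd, mss=1460):
    """Round trips needed to deliver a response under idealized slow start."""
    segments = math.ceil(response_bytes / mss)
    cwnd, sent, rounds = initcwnd, 0, 0
    while sent < segments:
        sent += cwnd   # one full congestion window per round trip
        cwnd *= 2      # slow start doubles the window each round trip
        rounds += 1
    return rounds


if __name__ == "__main__":
    size = 50 * 1024  # hypothetical 50 KB response
    print("initcwnd 3: ", round_trips(size, 3))
    print("initcwnd 10:", round_trips(size, 10))
```

Under these assumptions the 50 KB response takes four round trips with an initial window of 3 segments and three with an initial window of 10, and anything up to ten segments (about 15 KB) fits in a single round trip.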

Page 27:

TCP Buffers & Memory Utilization

• Buffering
  – Use case: sending/receiving large amounts of data
  – Auto-tunable by the kernel
  – However, has bounds: min, default, and max.
  – Tune: net.ipv4.tcp_rmem/wmem (in bytes)
• Sockets demand on page allocation
  – Tune: net.ipv4.tcp_mem (in pages)

Page 28:

inet_timewait_death_row

Page 29:

About TIME-WAIT state

• TIME-WAIT Assassination RFC (RFC 1337)
• Increase your port range
  – net.ipv4.ip_local_port_range
  – A ballpark of your rate of connections per second: (ip_local_port_range / tcp_fin_timeout) leads to about 500 connections per second!

“The TIME_WAIT state is our friend and is there to help us (i.e., to let old duplicate segments expire in the network). Instead of trying to avoid the state, we should understand it.”

Vincent Bernat - (vincent.bernat.im)
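The ballpark figure for connections per second can be reproduced with simple arithmetic. The sketch below assumes a local port range of 32768 to 61000 and a 60-second hold time per port; both numbers are assumptions about a typical configuration of the period, so check your own sysctl values. The limit applies to outbound connections toward a single destination address and port.

```python
def max_new_connections_per_second(port_low, port_high, hold_seconds):
    """Rough ceiling on new outbound connections per second to one
    destination when every closed connection pins its local port for
    hold_seconds."""
    ports = port_high - port_low + 1
    return ports / hold_seconds


if __name__ == "__main__":
    # Assumed range and hold time; see net.ipv4.ip_local_port_range.
    print(max_new_connections_per_second(32768, 61000, 60))
    # Widening the range to 1024-65535 roughly doubles the ceiling.
    print(max_new_connections_per_second(1024, 65535, 60))
```

The first result is about 470 per second, in line with the "about 500" ballpark; widening the port range raises it to about 1075.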

Page 30:

Check your sources

XKCD: Duty Calls - https://xkcd.com/386/

Page 31:

TL;DR: Do *not* enable net.ipv4.tcp_tw_recycle

• Clients behind NAT or a stateful firewall will get dropped.
• Linux’s TCP protocol man page does not recommend it.
• 99.99999999% of the time it should never be enabled.*

* Probably 100%, but there may be a valid case out there.

Page 32:

Page 33:

net.ipv4.tcp_tw_reuse

Makes a safer attempt at freeing sockets in TIME_WAIT state.

Page 34:

Customer Story

Page 35:

Architecture

• More than 400k requests per minute
• 100+ EC2 instances in production, distributed across different Availability Zones in Virtual Private Clouds, behind several Elastic Load Balancing load balancers
• RDS clusters, SQS, ElastiCache (Redis), CloudSearch, CloudWatch...
• Managed services let our sysadmins be more productive

[Diagram: two Availability Zones, each running a group of API instances and a Mongo node.]

Page 36:

400 errors on the ELB

• We noticed an increase in 400 errors on the ELB.
• Together with AWS Enterprise Support, we did a deep dive into the ELB access logs using Elasticsearch.
• We found that the events correlated with mobile users on carriers that used NAT on their 3G connections.
• Packet traces taken with tcpdump revealed that connections were being silently dropped.

Page 37:

Analysis results

• After the analysis, we discovered that our servers had the following configuration:
  – net.ipv4.tcp_tw_recycle and net.ipv4.tcp_tw_reuse enabled
• With recycle enabled, the kernel tries to make decisions based on the timestamps used by remote hosts. It tracks the last timestamp used by each remote host that has a connection in TIME_WAIT and allows the socket to be reused if the timestamp has increased correctly; if the timestamp from that host has not increased correctly, the kernel drops the packet.
• Many of our customers connect through carriers that use NAT. With a high rate of requests arriving from the same IP, the kernel started refusing those connections because of inconsistent timestamps, which showed up as Bad Request (400) on the ELB.
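The failure mode described here can be reproduced with a toy model. The sketch below is a deliberately simplified simulation, not the kernel's actual logic: it remembers the last TCP timestamp per source IP and drops any new connection attempt whose timestamp goes backwards, ignoring the time window and timestamp wraparound that the real check involves. The IP addresses and timestamp values are invented.

```python
def accepted_syns(syns):
    """Simulate a per-source-IP timestamp check.

    syns is a list of (source_ip, tcp_timestamp) tuples in arrival order.
    Returns one boolean per attempt: True if accepted, False if dropped.
    """
    last_seen = {}
    results = []
    for source_ip, timestamp in syns:
        previous = last_seen.get(source_ip)
        if previous is not None and timestamp < previous:
            results.append(False)  # looks like an old duplicate: dropped silently
            continue
        last_seen[source_ip] = timestamp
        results.append(True)
    return results


# Two phones behind the same carrier NAT address, each with its own
# timestamp clock, plus one client on a separate address.
ATTEMPTS = [
    ("203.0.113.7", 900_000),   # phone A
    ("203.0.113.7", 1_200),     # phone B, clock far behind phone A
    ("203.0.113.7", 900_050),   # phone A again
    ("198.51.100.9", 1_200),    # unrelated client, not behind the NAT
]

if __name__ == "__main__":
    for attempt, ok in zip(ATTEMPTS, accepted_syns(ATTEMPTS)):
        print(attempt, "accepted" if ok else "dropped")
```

Phone B is dropped only because it shares an address with phone A; the same timestamp from a different address is accepted. That matches the pattern seen in the packet traces: silent drops concentrated on NAT-ed mobile users.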

Page 38:

Testimonial from Vinicius Garcia (CTO of Easy):

• The help from Enterprise Support was extremely important in finding the solution for our case.
• If we had not had all the logs and the data we collected for the analysis, it would have been extremely difficult, and we probably would not have been able to figure out what was happening.

Page 39:

Tweaking the Webserver stack

Page 40:

Webservers Tuning 101

• Tune resource consumption
  – Context Switches / CPU
  – Memory Utilization
• Allow your webserver processes to handle enough requests concurrently
  – “Child Processes” / “Max Clients” tunables

Page 41:

The backlog is back, again!

• Keep an eye on the somaxconn limits
• Understand resource utilization by the webserver
  – Process Isolation vs. Blast Radius
  – Avoid Resource Saturation & Starvation

Page 42:

Telling the webserver when to start

• man tcp(7) – tcp_defer_accept: the webserver only wakes up when there is data available!
• Reduces the burden on the webserver’s processes
• The TCP socket is already established (i.e. no SYN flood)

Nginx
• listen [deferred]

Apache
• AcceptFilter http data
• AcceptFilter https data

Page 43:

Using the Zero-copy pattern

• man sendfile(2): “copying is done within the kernel”
• I.e. no use of User Space

Nginx
• sendfile on

Apache
• EnableSendFile on
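The same zero-copy idea is available to application code. The sketch below uses Python's `socket.sendfile()`, which relies on `os.sendfile` where the platform provides it and otherwise falls back to ordinary `send()` calls. The payload, the temporary file, and the local socket pair are stand-ins so the example runs without a network.

```python
import os
import socket
import tempfile

# Stand-in for a static asset on disk.
PAYLOAD = b"static asset bytes\n" * 64


def send_file(sock, path):
    """Send the whole file over a connected stream socket.

    Returns the number of bytes sent. The file contents are not read
    into this process when os.sendfile is available.
    """
    with open(path, "rb") as source:
        return sock.sendfile(source)


def demo():
    """Send PAYLOAD through a local socket pair and read it back."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(PAYLOAD)
    sender, receiver = socket.socketpair()
    try:
        sent = send_file(sender, tmp.name)
        sender.close()  # signals end-of-stream to the receiver
        received = b""
        while True:
            chunk = receiver.recv(4096)
            if not chunk:
                break
            received += chunk
        return sent, received
    finally:
        sender.close()
        receiver.close()
        os.remove(tmp.name)


if __name__ == "__main__":
    sent_bytes, data = demo()
    print(sent_bytes, data == PAYLOAD)
```

The payload is kept small on purpose so it fits in the socket buffer; a real server would send to a client socket returned by accept().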

Page 44:

HTTP Keep-Alive

Nginx
• keepalive_timeout 75s
• keepalive_requests 100

Apache
• KeepAlive On
• KeepAliveTimeout 5
• MaxKeepAliveRequests 100

Ensure it matches your ELB idle timeout setting; otherwise, look into your ELB’s 5XX metric.

Page 45:

“The small-packet problem”

Flush() (tcp_cork)
• flush() analogy
• The application needs to “uncork” the stream
• sendfile() is a must
• Auto in Apache (+sendfile option)
• Set tcp_nopush to false in NGINX

Nagle’s algo (tcp_nodelay)
• The initial problem: “congestion collapse”
• write() vs. writev()
• Onto the wire asap
• Always On in Apache
• Set tcp_nodelay flag in NGINX

Page 46:

“The small-packet problem” (continued)

    /* TCP_NODELAY is weaker than TCP_CORK, so that
     * this option on corked socket is remembered, but
     * it is not activated until cork is cleared.
     *
     * However, when TCP_NODELAY is set we make
     * an explicit push, which overrides even TCP_CORK
     * for currently queued segments.
     */
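At the socket level, the tcp_nodelay setting corresponds to the TCP_NODELAY socket option. The sketch below shows how an application would disable Nagle's algorithm on its own socket and confirm the setting; the helper names are illustrative. TCP_CORK is Linux-specific and is left out to keep the example portable.

```python
import socket


def make_low_latency_socket():
    """Create a TCP socket with Nagle's algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


def nagle_disabled(sock):
    """Return True if TCP_NODELAY is set on the socket."""
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0


if __name__ == "__main__":
    sock = make_low_latency_socket()
    print("TCP_NODELAY set:", nagle_disabled(sock))
    sock.close()
```

With the option set, small writes go onto the wire immediately instead of waiting to be coalesced, which trades some packet efficiency for lower latency.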

Page 47:

Thanks Chartbeat!

Further details: http://engineering.chartbeat.com/author/justinlintz/

Page 48:

Start w/ Small Wins and keep iterating!

Page 49:

Page 50:

Quick review

• Keep the connection for as long as possible.

• Minimize the latency.

• Increase throughput.

• Most importantly, research what settings make most sense for your environment.

Page 51:

Offload opportunities

• Leverage ELBs
  – Large-volume connection handling
  – SSL Off-loading
• CloudFront + S3 for static file delivery
  – Tune HTTP responses’ cache headers
• Go Multi-region w/ Route 53 LBR

Page 52:

Last thoughts

• Monitor everything.
• Tune your server to your workload.
• Improvement must be quantifiable.
• Experiment and continuously re-validate!

And most importantly, REMEMBER:

Page 53:

Optimizing Web Servers

Davi Menezes
Cloud Technical Account Manager | AWS Support

Thank you!