SSLShader: Cheap SSL Acceleration with …...•Test opportunistic offloading Data transfer rate at...

40
SSLShader: Cheap SSL Acceleration with Commodity Processors Keon Jang + , Sangjin Han + , Seungyeop Han * , Sue Moon + , and KyoungSoo Park + KAIST + and University of Washington *

Transcript of SSLShader: Cheap SSL Acceleration with …...•Test opportunistic offloading Data transfer rate at...

Page 1: SSLShader: Cheap SSL Acceleration with …...•Test opportunistic offloading Data transfer rate at various content size 32 HTTPS Connection Rate 33 29K 21K 11K 3.6K 0 5,000 10,000

SSLShader: Cheap SSL Acceleration with

Commodity Processors

Keon Jang+, Sangjin Han+, Seungyeop Han*,

Sue Moon+, and KyoungSoo Park+

KAIST+ and University of Washington*

1

Page 2: SSLShader: Cheap SSL Acceleration with …...•Test opportunistic offloading Data transfer rate at various content size 32 HTTPS Connection Rate 33 29K 21K 11K 3.6K 0 5,000 10,000

Security of Paper Submission Websites

2

Network and Distributed System Security Symposium

Page 3: SSLShader: Cheap SSL Acceleration with …...•Test opportunistic offloading Data transfer rate at various content size 32 HTTPS Connection Rate 33 29K 21K 11K 3.6K 0 5,000 10,000

Security Threats in the Internet

Public WiFi without encryption

• Easy target that requires almost no effort

Deep packet inspection by governments

• Used for censorship

• In the name of national security

NebuAd’s targeted advertisement

• Modify user’s Web traffic in the middle

3

Page 4: SSLShader: Cheap SSL Acceleration with …...•Test opportunistic offloading Data transfer rate at various content size 32 HTTPS Connection Rate 33 29K 21K 11K 3.6K 0 5,000 10,000

Secure Sockets Layer (SSL)

A de-facto standard for secure communication

• Authentication, Confidentiality, Content integrity

4

Client Server TCP handshake

Encrypted data

Key exchange using public key algorithm

(e.g., RSA) Server identification

Page 5: SSLShader: Cheap SSL Acceleration with …...•Test opportunistic offloading Data transfer rate at various content size 32 HTTPS Connection Rate 33 29K 21K 11K 3.6K 0 5,000 10,000

SSL Deployment Status

Most of Web-sites are not SSL-protected

• Less than 0.5%

• [NETCRAFT Survey Jan ‘09]

Why is SSL not ubiquitous?

• Small sites: lack of recognition, manageability, etc.

• Large sites: cost

• SSL requires lots of computation power

5

Page 6: SSLShader: Cheap SSL Acceleration with …...•Test opportunistic offloading Data transfer rate at various content size 32 HTTPS Connection Rate 33 29K 21K 11K 3.6K 0 5,000 10,000

SSL Computation Overhead

Performance overhead (HTTPS vs. HTTP)

• Connection setup

• Data transfer

Good privacy is expensive

• More servers

• H/W SSL accelerators

Our suggestion:

• Offload SSL computation to GPU

6

22x

50x

Page 7: SSLShader: Cheap SSL Acceleration with …...•Test opportunistic offloading Data transfer rate at various content size 32 HTTPS Connection Rate 33 29K 21K 11K 3.6K 0 5,000 10,000

SSL-accelerator leveraging GPU

• High-performance

• Cost-effective

SSL reverse proxy

• No modification on existing servers

SSLShader

7

SSLShader

Web Server

SMTP Server

POP3 Server

Plain TCP SSL-encrypted session

Page 8: SSLShader: Cheap SSL Acceleration with …...•Test opportunistic offloading Data transfer rate at various content size 32 HTTPS Connection Rate 33 29K 21K 11K 3.6K 0 5,000 10,000

Our Contributions

GPU cryptography optimization

• The fastest RSA on GPU

• Superior to high-end hardware accelerators

• Low latency

SSLShader

• Complete system exploiting GPU for SSL processing

• Batch processing

• Pipelining

• Opportunistic offloading

• Scaling with multiple cores and NUMA nodes

8

Page 9: SSLShader: Cheap SSL Acceleration with …...•Test opportunistic offloading Data transfer rate at various content size 32 HTTPS Connection Rate 33 29K 21K 11K 3.6K 0 5,000 10,000

CRYPTOGRAPHIC PROCESSING

WITH GPU

9

Page 10: SSLShader: Cheap SSL Acceleration with …...•Test opportunistic offloading Data transfer rate at various content size 32 HTTPS Connection Rate 33 29K 21K 11K 3.6K 0 5,000 10,000

How GPU Differs From CPU?

Intel Xeon 5650 CPU:

6 cores

NVIDIA GTX580 GPU:

512 cores

Control

ALU

ALU

ALU

ALU

ALU ALU

Cache

ALU

10

62×109 870×109 < Instructions / sec

Core

Cache

Page 11: SSLShader: Cheap SSL Acceleration with …...•Test opportunistic offloading Data transfer rate at various content size 32 HTTPS Connection Rate 33 29K 21K 11K 3.6K 0 5,000 10,000

void VecAdd(

int *A, int *B, int *C, int N)

{

//iterate over N elements

for(int i = 0; i < N; i++)

C[i] = A[i] + B[i]

}

VecAdd(A, B, C, N);

__global__ void VecAdd(

int *A, int *B, int *C)

{

int i = threadIdx.x;

C[i] = A[i] + B[i]

}

//Launch N threads

VecAdd<<<1, N>>>(A, B, C);

Single Instruction Multiple Threads (SIMT)

11

GPU code CPU code

Example code: vector addition (C = A + B)

1/3지점 8분 10초

Page 12: SSLShader: Cheap SSL Acceleration with …...•Test opportunistic offloading Data transfer rate at various content size 32 HTTPS Connection Rate 33 29K 21K 11K 3.6K 0 5,000 10,000

Parallelism in SSL Processing

12

Client 1

Client 2

Client N 1. Independent Sessions

SSL Record SSL Record SSL Record 2. Independent SSL Record

3. Parallelism in Cryptographic Operations

SSLShader

Page 13: SSLShader: Cheap SSL Acceleration with …...•Test opportunistic offloading Data transfer rate at various content size 32 HTTPS Connection Rate 33 29K 21K 11K 3.6K 0 5,000 10,000

Our GPU Implementation

Choices of cipher-suite

Optimization of GPU algorithms • Exploiting massive parallel processing

• Parallelization of algorithms

• Batch processing

• Data copy overhead is significant

• Concurrent copy and execution

13

앞에랑 매핑이 되게-_- 그림을 가져와서 매핑이 되게 하는게 좋을듯

Client Server

Encryption: AES Message Authentication: SHA1

Key exchange: RSA

Page 14: SSLShader: Cheap SSL Acceleration with …...•Test opportunistic offloading Data transfer rate at various content size 32 HTTPS Connection Rate 33 29K 21K 11K 3.6K 0 5,000 10,000

Basic RSA Operations

M: plain-text, C: cipher-text

(e, n): public key, (d, n): private key

Encryption:

C = Me mod n

Decryption:

M = Cd mod n

14

1024/2048 bits integer (300 ~ 600 digits)

Small number: 3, 17, 65537

Decryption at the server side is the bottleneck

Exponentiation many multiplications

Server-side

Server

Client

Page 15: SSLShader: Cheap SSL Acceleration with …...•Test opportunistic offloading Data transfer rate at various content size 32 HTTPS Connection Rate 33 29K 21K 11K 3.6K 0 5,000 10,000

Breakdown of Large Integer Multiplication

15

Schoolbook multiplication

649 X 627 ---------

63 280 4200 180 800

12000 5400 32000

+ 360000 ---------

406923

Accumulation is difficult to parallelize due to

“overlapping digits”

“carry propagation”

3 x 3 = 9 multiplications 9 addition of 6-digits integers

Page 16: SSLShader: Cheap SSL Acceleration with …...•Test opportunistic offloading Data transfer rate at various content size 32 HTTPS Connection Rate 33 29K 21K 11K 3.6K 0 5,000 10,000

O(s) Parallel Multiplications

16

Example of 649 x 627 = 406,923

2s steps

1 or 2 steps (s – 1 worst case)

s = # of words in a large integer (E.g., 1024-bits = 16 x 64 bits word)

Page 17: SSLShader: Cheap SSL Acceleration with …...•Test opportunistic offloading Data transfer rate at various content size 32 HTTPS Connection Rate 33 29K 21K 11K 3.6K 0 5,000 10,000

More Optimizations on RSA

Common optimizations for RSA • Chinese Remainder Theorem (CRT) • Montgomery Multiplication • Constant Length Non-zero Window (CLNW)

Parallelization of serial algorithms • Faster Calculation of M×n • Interleaving of T + M×n • Mixed-Radix Conversion Offloading

GPU specific optimizations • Warp Utilization • Loop Unrolling • Elimination of Divergence • Avoiding Bank Conflicts • Instruction-Level Optimization

17

4054 6620 13281 9891 10146 6627 21041

0 10,000 20,000 30,000 40,000 50,000 60,000 70,000 80,000

Throughput (operations/s)

Initial (1)

(2)

(3) Warp

Utilization

(4)

(5) (6) 64-bit words (7) Avoiding bank

conflicts

(8) Instruction-level

Optimization CLNW (9) Post-exponentiation offloading

Read our paper for details

Page 18: SSLShader: Cheap SSL Acceleration with …...•Test opportunistic offloading Data transfer rate at various content size 32 HTTPS Connection Rate 33 29K 21K 11K 3.6K 0 5,000 10,000

Parallelism in SSL Processing

18

Client 1

Client 2

Client N 1. Independent Sessions

SSL Record SSL Record SSL Record 2. Independent SSL Record

3. Parallelism in Cryptographic Operations

SSLShader

Batch Processing

Page 19: SSLShader: Cheap SSL Acceleration with …...•Test opportunistic offloading Data transfer rate at various content size 32 HTTPS Connection Rate 33 29K 21K 11K 3.6K 0 5,000 10,000

GTX580 Throughput w/o Batching

19

0.08x 0.02x

1.57x

0.02x 0.0x

0.5x

1.0x

1.5x

2.0x

RSA AES-ENC AES-DEC SHA1

Throughput relative to a “single CPU core”

Intel Nehalem single core (2.66Ghz)

Page 20: SSLShader: Cheap SSL Acceleration with …...•Test opportunistic offloading Data transfer rate at various content size 32 HTTPS Connection Rate 33 29K 21K 11K 3.6K 0 5,000 10,000

22.1x

6.8x 7.7x 9.4x

0x

5x

10x

15x

20x

25x

RSA AES-ENC AES-DEC SHA1

Throughput relative to a "single CPU core"

GTX580 Throughput w/ Batching

20

Difference: ratio of computation to copy

Batch size: 32~4096 depending on the algorithm

Page 21: SSLShader: Cheap SSL Acceleration with …...•Test opportunistic offloading Data transfer rate at various content size 32 HTTPS Connection Rate 33 29K 21K 11K 3.6K 0 5,000 10,000

Copy Overhead in GPU Cryptography

GPU processing works by

• Data copy: CPU GPU

• Execution in GPU

• Data copy: GPU -> CPU

21

AES-ENC (Gbps)

AES-DEC (Gbps)

HMAC-SHA1 (Gbps)

GTX580 w/ copy 8.8 10 31

GTX580 no copy 21.8 33 124

0

20

40

60

80

100

120

140

AES-ENC AES-DEC HMAC-SHA1

Th

rou

gh

pu

t (G

bp

s)

↑2.4x ↑3.3x

↑4x

w/o copy

w/ copy

w/o copy

Page 22: SSLShader: Cheap SSL Acceleration with …...•Test opportunistic offloading Data transfer rate at various content size 32 HTTPS Connection Rate 33 29K 21K 11K 3.6K 0 5,000 10,000

Hiding Copy Overhead

22

Synchronous Execution

Pipelining

Processing time : 3t

t

Amortized processing time : t

Data copy: CPU -> GPU

Execution in GPU

Data copy: GPU -> CPU

Data copy: CPU -> GPU

Execution in GPU

Data copy: GPU -> CPU

Page 23: SSLShader: Cheap SSL Acceleration with …...•Test opportunistic offloading Data transfer rate at various content size 32 HTTPS Connection Rate 33 29K 21K 11K 3.6K 0 5,000 10,000

0x

5x

10x

15x

20x

AES-ENC AES-DEC SHA1

Throughput relative to a single core

GTX580 Performance w/ Pipelining

23

↑ 36% ↑ 36%

↑ 51% w/o copy

synchronous

pipelining

9x 9x

14x

Page 24: SSLShader: Cheap SSL Acceleration with …...•Test opportunistic offloading Data transfer rate at various content size 32 HTTPS Connection Rate 33 29K 21K 11K 3.6K 0 5,000 10,000

Summary of GPU Cryptography

Performance gain from GTX580

• GPU performs as fast as 9 ~ 28 CPU cores

• Superior to high-end hardware accelerators

Lessons • Batch processing is essential to fully utilize a GPU

• AES and SHA1 are bottlenecked by data copy • PCIe 3.0

• Integrated GPU and CPU

24

RSA-1024 (ops/sec)

AES-ENC (Gbps)

AES-DEC (Gbps)

SHA1 (Gbps)

GTX580 91.9K 11.5 12.5 47.1

CPU core 3.3K 1.3 1.3 3.3

16분 30초

Page 25: SSLShader: Cheap SSL Acceleration with …...•Test opportunistic offloading Data transfer rate at various content size 32 HTTPS Connection Rate 33 29K 21K 11K 3.6K 0 5,000 10,000

BUILDING SSL-PROXY THAT

LEVERAGES GPU

25

Page 26: SSLShader: Cheap SSL Acceleration with …...•Test opportunistic offloading Data transfer rate at various content size 32 HTTPS Connection Rate 33 29K 21K 11K 3.6K 0 5,000 10,000

SSLShader Design Goals

Use existing application without modification

• SSL reverse proxy

Effectively leverage GPU

• Batching cryptographic operations

• Load balancing between CPU and GPU

Scale performance with architecture evolution

• Multi-core CPUs

• Multiple NUMA nodes

26

Page 27: SSLShader: Cheap SSL Acceleration with …...•Test opportunistic offloading Data transfer rate at various content size 32 HTTPS Connection Rate 33 29K 21K 11K 3.6K 0 5,000 10,000

Batching Crypto Operations

Network workloads vary over time • Waiting for fixed batch size doesn’t work

27

Output queue

GPU

Input queue

CPU

GPU

SSL Stack

Batch size is dynamically adjusted to queue length

Page 28: SSLShader: Cheap SSL Acceleration with …...•Test opportunistic offloading Data transfer rate at various content size 32 HTTPS Connection Rate 33 29K 21K 11K 3.6K 0 5,000 10,000

Balancing Load Between CPU and GPU

For small batch, CPU is faster than GPU • Opportunistic offloading

28

Output queue

GPU

Input queue

CPU processing

GPU processing when input queue length > threshold

GPU queue

CPU

Cryptographic operation Minimum Maximum

RSA (1024-bit) 16 512

AES Decryption 32 2048

AES Encryption 128 2048

HMAC-SHA1 128 2048

Input queue length > threshold

Page 29: SSLShader: Cheap SSL Acceleration with …...•Test opportunistic offloading Data transfer rate at various content size 32 HTTPS Connection Rate 33 29K 21K 11K 3.6K 0 5,000 10,000

Scaling with Multiple Cores

29

Per-core worker threads • Network I/O, cryptographic operation

Sharing a GPU with multiple cores • More parallelism with larger batch size

Output queues

GPU

CPU

CPU

CPU

Input queues GPU

queue CPU

GPU

Core0

Core1

Core2

Page 30: SSLShader: Cheap SSL Acceleration with …...•Test opportunistic offloading Data transfer rate at various content size 32 HTTPS Connection Rate 33 29K 21K 11K 3.6K 0 5,000 10,000

Scaling with NUMA systems

A process = worker threads + a GPU thread

• Separate process per NUMA node

• Minimizes data sharing across NUMA nodes

30

CPU0

IOH0

GPU0

RAM

NIC0

CPU1

IOH1

GPU1

RAM

NIC1

Node 0 Node 1

Page 31: SSLShader: Cheap SSL Acceleration with …...•Test opportunistic offloading Data transfer rate at various content size 32 HTTPS Connection Rate 33 29K 21K 11K 3.6K 0 5,000 10,000

Evaluation

Experimental configurations

31

Model Spec Qty

CPU Intel X5650 2.66Ghz x 6 croes 2

GPU NVIDIA GTX580 1.5Ghz x 512 cores 2

NIC Intel X520-DA2 10GbE x 2 2

SSLShader+ Lighttpd

8 Clients

4x 10GbE

Server Server

Lighttpd

SSLShader

Lighttpd

OpenSSL

HTTP

Clients

GPU

Clients

HTTPS HTTPS

Server Specification

Page 32: SSLShader: Cheap SSL Acceleration with …...•Test opportunistic offloading Data transfer rate at various content size 32 HTTPS Connection Rate 33 29K 21K 11K 3.6K 0 5,000 10,000

Evaluation Metrics

HTTPS connection handling performance

• Use small content size

• Stress on RSA computation

Latency distribution at different loads

• Test opportunistic offloading

Data transfer rate at various content size

32

Page 33: SSLShader: Cheap SSL Acceleration with …...•Test opportunistic offloading Data transfer rate at various content size 32 HTTPS Connection Rate 33 29K 21K 11K 3.6K 0 5,000 10,000

HTTPS Connection Rate

33

29K

21K

11K

3.6K 0

5,000

10,000

15,000

20,000

25,000

30,000

35,000

1024 bits 2048 bits

SSLShader

lighttpd

2.5x

6x

RSA Key Size

Connections / sec

Page 34: SSLShader: Cheap SSL Acceleration with …...•Test opportunistic offloading Data transfer rate at various content size 32 HTTPS Connection Rate 33 29K 21K 11K 3.6K 0 5,000 10,000

CPU Usage Breakdown (RSA 1024)

34

Kernel NIC device driver,

2.32

SSLShader, 5.31

Libc , 9.88

IPP + libcrypto,

12.89

lighttpd, 4.9

others, 4.35

Kernel (Including

TCP/IP stack), 60.35

Current Bottleneck

Page 35: SSLShader: Cheap SSL Acceleration with …...•Test opportunistic offloading Data transfer rate at various content size 32 HTTPS Connection Rate 33 29K 21K 11K 3.6K 0 5,000 10,000

Latency at Light Load

35

0102030405060708090

100

1 10 100 1000

CD

F (

%)

Latency (ms)

Similar latency at light load

Lighttpd at 1k connections / sec

SSLShader at 1k connections / sec

Page 36: SSLShader: Cheap SSL Acceleration with …...•Test opportunistic offloading Data transfer rate at various content size 32 HTTPS Connection Rate 33 29K 21K 11K 3.6K 0 5,000 10,000

Latency at Heavy Load

36

Lower latency and higher throughput at heavy load

0

20

40

60

80

100

1 10 100 1000

CD

F (

%)

Latency (ms)

Lighttpd at 11k connections / sec

SSLShader at 29k connections / sec

Page 37: SSLShader: Cheap SSL Acceleration with …...•Test opportunistic offloading Data transfer rate at various content size 32 HTTPS Connection Rate 33 29K 21K 11K 3.6K 0 5,000 10,000

Data Transfer Performance

37

0.0x

0.5x

1.0x

1.5x

2.0x

2.5x

4KB 16KB 64KB 256KB 1MB 4MB 16MB 64MB

Rel

ati

ve

Per

form

an

ce

Content Size

2.1x

0.87x

Lighttpd performance

Typical web content size is under 100KB

SSLShader: 13 Gbps

Page 38: SSLShader: Cheap SSL Acceleration with …...•Test opportunistic offloading Data transfer rate at various content size 32 HTTPS Connection Rate 33 29K 21K 11K 3.6K 0 5,000 10,000

CONCLUSIONS

38

Page 39: SSLShader: Cheap SSL Acceleration with …...•Test opportunistic offloading Data transfer rate at various content size 32 HTTPS Connection Rate 33 29K 21K 11K 3.6K 0 5,000 10,000

Summary

Cryptographic algorithms in GPU

• Fast RSA, AES, and SHA1

• Superior to high-end hardware accelerators

SSLShader

• Transparent integration

• Effective utilization of GPU for SSL processing

• Up to 6x connections / sec

• 13 Gbps throughput

39

Linux network stack performance

Copy overhead

Page 40: SSLShader: Cheap SSL Acceleration with …...•Test opportunistic offloading Data transfer rate at various content size 32 HTTPS Connection Rate 33 29K 21K 11K 3.6K 0 5,000 10,000

QUESTIONS?

THANK YOU!

For more details

https://shader.kaist.edu/sslshader

40