Microservices summit talk 1/31
-
Upload
varun-talwar -
Category
Engineering
-
view
46 -
download
1
Transcript of Microservices summit talk 1/31
Google confidential │ Do not distribute Google confidential │ Do not distribute
Bringing learnings from Googley microservices with gRPCMicroservices Summit
Varun Talwar
Contents1. Context: Why are we here?
2. Learnings from Stubby experience
a. HTTP/JSON doesnt cut it
b. Establish a Lingua Franca
c. Design for fault tolerance and control: Sync/Async, Deadlines, Cancellations, Flow control
d. Flying blind without stats
e. Diagnosing with tracing
f. Load Balancing is critical
3. gRPC
a. Cross platform matters !
b. Performance and Standards matter: HTTP/2
c. Pluggability matters: Interceptors, Name Resolvers, Auth plugins
d. Usability matters !
CONTEXTWHY ARE WE HERE?
Agility & Resilience
Developer Productivity
Performance
INTRODUCING STUBBY
Google confidential │ Do not distribute
Microservices at Google ~O(1010) RPCs per second.
Images by Connie Zhou
Stubby Magic @ Google
Making Google magic available to all
Kubernetes
Borg
Stubby
LEARNINGS FROM
STUBBY
Key learnings
1. HTTP/JSON doesnt cut it !
2. Establish a lingua franca
3. Design for fault tolerance and provide control knobs
4. Dont fly blind: Service Analytics
5. Diagnosing problems: Tracing
6. Load Balancing is critical
7. Solve for auth
HTTP1.x/JSON doesn’t cut it !
1. WWW, browser growth - bled into services
2. Stateless
3. Text on the wire
4. Loose contracts
5. TCP connection per request
6. Nouns based
7. Harder API evolution
8. Think compute, network on cloud platforms
1
Establish a lingua franca
1. Protocol Buffers - Since 2003.
2. Start with IDL
3. Have a language agnostic way of agreeing on data semantics
4. Code Gen in various languages
5. Forward and Backward compatibility
6. API Evolution
2
How we roll at Google
Google Cloud Platform
Service Definition (weather.proto)syntax = "proto3";
service Weather { rpc GetCurrent(WeatherRequest) returns (WeatherResponse);}
message WeatherRequest { Coordinates coordinates = 1;
message Coordinates { fixed64 latitude = 1; fixed64 longitude = 2; }}
message WeatherResponse { Temperature temperature = 1; float humidity = 2;}
message Temperature { float degrees = 1; Units units = 2;
enum Units { FAHRENHEIT = 0; CELSIUS = 1; KELVIN = 2; }}
Design for fault tolerance and control
Sync and Async APIs
Need fault tolerance: Deadlines, Cancellations
Control Knobs: Flow control, Service Config, Metadata
3
Google Cloud Platform 18
First-class feature in gRPC.
Deadline is an absolute point in time.
Deadline indicates to the server how long the client is willing to wait for an answer.
RPC will fail with DEADLINE_EXCEEDED status code when deadline reached.
gRPC Deadlines
Google Cloud Platform
Deadline Propagation
Gateway
90 ms
Now = 147660000000
0
Deadline = 147660000020
0
40 ms
20 ms
20 ms
60 ms
withDeadlineAfter(200, MILLISECONDS)
Now = 147660000004
0
Deadline = 147660000020
0
Now = 147660000015
0
Deadline = 147660000020
0
Now = 147660000023
0
Deadline = 147660000020
0
DEADLINE_EXCEEDED
DEADLINE_EXCEEDED
DEADLINE_EXCEEDED
DEADLINE_EXCEEDED
Google Cloud Platform 20
Deadlines are expected.
What about unpredictable cancellations?
•User cancelled request.
•Caller is not interested in the result any more.
•etc
Cancellation?
Google Cloud Platform
Cancellation?
GW
Busy Busy Busy
Busy Busy Busy
Busy Busy Busy
Active RPC
Active RPC
Active RPC
Active RPC
Active RPC
Active RPC
Active RPC
Active RPC
Active RPC
Google Cloud Platform
Cancellation Propagation
GW
Idle Idle Idle
Idle Idle Idle
Idle Idle Idle
Google Cloud Platform 23
Automatically propagated.
RPC fails with CANCELLED status code.
Cancellation status be accessed by the receiver.
Server (receiver) always knows if RPC is valid!
Cancellation
Google Cloud Platform
BiDi Streaming - Slow Client
Fast ServerRequest
Responses
Slow Client
CANCELLEDUNAVAILABLE
RESOURCE_EXHAUSTED
Google Cloud Platform
BiDi Streaming - Slow Server
Slow ServerRequest
Response
Fast Client
CANCELLEDUNAVAILABLE
RESOURCE_EXHAUSTED
Requests
Google Cloud Platform 26
Flow-control helps to balance computing power and network capacity between client and server.
gRPC supports both client- and server-side flow control.
Flow-Control
Photo taken by Andrey Borisenko.
Google Cloud Platform 27
Policies where server tells client what they should do
Can specify deadlines, lb policy, payload size per method of a service
Loved by SREs, they have more control
Discovery via DNS
Service Config
Metadata Exchange - Common cross-cutting concerns like authentication or tracing rely on the exchange of data that is not part of the declared interface of a service. Deployments rely on their ability to evolve these features at a different rate to the individual APIs exposed by services.
Metadata helps in exchange of useful information
Don’t fly blind: Stats4
What is the mean latency time per RPC?
How many RPCs per hour for a service?
Errors in last minute/hour?
How many bytes sent? How many connections to my server?
Data collection by arbitrary metadata is useful
Any service’s resource usage and performance stats in real time by (almost) any arbitrary metadata
1. Service X can monitor CPU usage in their jobs broken down by the name of the invoked RPC and the mdb user who sent it.
2. Ads can monitor the RPC latency of shared bigtable jobs when responding to their requests, broken down by whether the request originated from a user on web/Android/iOS.
3. Gmail can collect usage on servers, broken down by according POP/IMAP/web/Android/iOS. Layer propagates Gmail's metadata down to every service, even if the request was made by an intermediary job that Gmail doesn't own
Stats layer export data to varz and streamz, and provides stats to many monitoring systems and dashboards
Diagnosing problems: Tracing5
1/10K requests takes very long. Its an ad query :-) I need to find out.
Take a sample and store in database; help identify request in sample which took similar amount of time
I didnt get a response from the service. What happened? Which link in the service dependency graph got stuck? Stitch a trace and figure out.
Where is it taking time for a trace? Hotspot analysis
What all are the dependencies for a service?
Load Balancing is important !5
Iteration 1: Stubby BalancerIteration 2: Client side load balancingIteration 3: HybridIteration 4: gRPC-lb
● Round-robin-over-list - Lists not sets → ability to represent weights
● For anything more advanced, move the burden to an external "LB Controller", a regular gRPC server and rely on a client-side implementation of the so-called gRPC LB policy.
client LB Controller
backends
1) Control RPC2) address-list
3) RR over addresses of address-list
gRPC LB
Some new ideas !Iteration 1: Stubby BalancerIteration 2: Client side load balancingIteration 3: HybridIteration 4: gRPC-lb
In summary, what did we learn
Contracts should be strict
Common language helps
Common understanding for deadlines, cancellations, flow control
Common stats/tracing framework is essential for monitoring, debugging
Common framework lets uniform policy application for control and lb
Single point of integration for logging, monitoring, tracing, service discovery and load balancing makes lives much easier !
INTRODUCING gRPC
Open source on Github for C, C++, Java, Node.js, Python, Ruby, Go, C#, PHP, Objective-C
gRPC core
gRPC Java
gRPC Go
1.0 with stable APIs
Well documented with an active community
Reliable with continuous running tests on GCE
Deployable in your environment
Measured with an open performance dashboard
Deployable in your environment
Well adopted inside and outside Google
Where is the project today?
1. Cross language & Cross platform matters !
2. Performance and Standards matter: HTTP/2
3. Pluggability matters: Interceptors, Name Resolvers, Auth plugins
4. Usability matters !
More lessons
1. Cross language & Cross platform matters !
2. Performance and Standards matter: HTTP/2
3. Pluggability matters: Interceptors, Name Resolvers, Auth plugins
4. Usability matters !
More lessons
Google Cloud PlatformGoogle Cloud Platform
Coverage & Simplicity
The stack should be available on every popular development platform and easy for someone to build for their platform of choice. It should be viable on CPU & memory limited devices.
gRPC Principles & Requirements
http://www.grpc.io/blog/principles
Google Cloud Platform
gRPC Speaks Your Language
● Java● Go● C/C++● C#● Node.js● PHP● Ruby● Python● Objective-C
● MacOS● Linux● Windows● Android● iOS
Service definitions and client libraries
Platforms supported
Google Cloud Platform
Interoperability
Java Servic
e
Python
Service
GoLang
Service
C++ Servic
e
gRPC Service
gRPC
Stub
gRPC
Stub
gRPC
Stub
gRPC
Stub
gRPC Service
gRPC Service
gRPC Service
gRPC
Stub
1. Cross language & Cross platform matters !
2. Performance and Standards matter: HTTP/2
3. Pluggability matters: Interceptors, Name Resolvers, Auth plugins
4. Usability matters !
More lessons
Google Cloud Platform
• Single TCP connection.
• No Head-of-line blocking.
• Binary framing layer.
• Request –> Stream.
• Header Compression.
HTTP/2 in One Slide
Transport(TCP)
Application (HTTP/2)
Network (IP)
Session (TLS) [optional]Binary Framing
HEADERS FrameDATA Frame
HTTP/2
POST: /upload HTTP/1.1 Host: www.javaday.org.ua Content-Type: application/json Content-Length: 27
HTTP/1.x
{“msg”: “Welcome to 2016!”}
Google Cloud Platform
HTTP/2 breaks down the HTTP protocol communication into an exchange of binary-encoded frames, which are then mapped to messages that belong to a stream, and all of which are multiplexed within a single TCP connection.
Binary Framing Stream 1 HEADERS
Stream 2
:method: GET :path: /kyiv :version: HTTP/2 :scheme: https
HEADERS :status: 200 :version: HTTP/2 :server: nginx/1.10.1 ...
DATA
<payload>
Stream N
Request
Response
TCP
Google Cloud Platform
HTTP/1.x vs HTTP/2http://http2.golang.org/gophertileshttp://www.http2demo.io/
Google Cloud Platform
gRPC Service Definitions
Unary RPCs where the client sends a single request to the server and gets a single response back, just like a normal function call.
The client sends a request to the server and gets a stream to read a sequence of messages back. The client reads from the returned stream until there are no more messages.
The client send a sequence of messages to the server using a provided stream. Once the client has finished writing the messages, it waits for the server to read them and return its response.
Client streaming
Both sides send a sequence of messages using a read-write stream. The two streams operate independently. The order of messages in each stream is preserved.
BiDi streamingUnary Server streaming
Google Cloud Platform 48
Messaging applications.
Games / multiplayer tournaments.
Moving objects.
Sport results.
Stock market quotes.
Smart home devices.
You name it!
BiDi Streaming Use-Cases
Open Performance Benchmark and Dashboard
Benchmarks run in GCE VMs per Pull Request for regression testing.
gRPC Users can run these in their environments.
Good Performance across languages:
Java Throughput: 500 K RPCs/Sec and 1.3 M Streaming messages/Sec on 32 core VMs
Java Latency: ~320 us for unary ping-pong (netperf 120us)
C++ Throughput: ~1.3 M RPCs/Sec and 3 M Streaming Messages/Sec on 32 core VMs.
Performance
1. Cross language & Cross platform matters !
2. Performance and Standards matter: HTTP/2
3. Pluggability matters: Interceptors, Auth
4. Usability matters !
More lessons
Google Cloud PlatformGoogle Cloud Platform
Pluggable
Large distributed systems need security, health-checking, load-balancing and failover, monitoring, tracing, logging, and so on. Implementations should provide extensions points to allow for plugging in these features and, where useful, default implementations.
gRPC Principles & Requirements
http://www.grpc.io/blog/principles
Google Cloud Platform
Interceptors
Client ServerRequest
Response
Client interceptor
s
Server interceptor
s
Auth & Security - TLS [Mutual], Plugin auth mechanism (e.g. OAuth)
Proxies
Basic: nghttp2, haproxy, traefik
Advanced: Envoy, linkerd, Google LB, Nginx (in progress)
Service Discovery
etcd, Zookeeper, Eureka, …
Monitor & Trace
Zipkin, Prometheus, Statsd, Google, DIY
Pluggability
1. Cross language & Cross platform matters !
2. Performance and Standards matter: HTTP/2
3. Pluggability matters: Interceptors, Auth
4. Usability matters !
More lessons
Get Started
1. Server reflection2. Health Checking3. Automatic retries4. Streaming compression5. Mechanism to do caching6. Binary Logging
a. Debugging, auditing though costly7. Unit Testing support
a. Automated mock testingb. Dont need to bring up all dependent services just to test
8. Web support
Coming soon !
Microservices: in data centres
Streaming telemetry from network devices
Client Server communication/Internal APIs
Mobile Apps
Some early adopters
Thank you!Thank you!Twitter: @grpcioSite: grpc.ioGroup: [email protected]: github.com/grpc
github.com/grpc/grpc-java github.com/grpc/grpc-go
Q & A