Post on 15-Apr-2017
Microservices tracing with Spring Cloud and Zipkin
Marcin Grzejszczak
Marcin Grzejszczak @mgrzejszczak, 21 May 2016
About me
Developer at Pivotal
Part of Spring Cloud Team
Working with OSS:
● Accurest - Consumer Driven Contracts verifier for Java
● JSON Assert - fluent JSON assertions
● Spock Subjects Collaborators Extension
● Gradle Test Profiler
● Up To Date Gradle Plugin
TWITTER: @MGrzejszczak
BLOG: http://TOOMUCHCODING.COM
Agenda
What is distributed tracing?
How to correlate logs with Spring Cloud Sleuth?
How to visualize latency with Spring Cloud Sleuth and Zipkin?
An ordinary system...
UI calls backend
UI -> BACKEND
Everything is awesome
CLICK 200
Until it’s not
CLICK 500
Time to debug
https://tonysbologna.files.wordpress.com/2015/09/mario-and-luigi.jpg?w=468&h=578&crop=1
It doesn’t look like this
More like this
On which server / instance was the exception thrown?
SSH and grep for ERROR to find it?
Distributed tracing - terminology
Span
Trace
Logs (annotations)
Tags (binary annotations)
Span
The basic unit of work (e.g. sending an RPC)
● Spans are started and stopped
● They keep track of their timing information
● Once you create a span, you must stop it at some point in the future
● A span has a parent (except for the root span) and can have multiple children
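The span lifecycle described above can be sketched in plain Java. This is only an illustration: `SketchSpan` and its methods are hypothetical names invented here, not the Spring Cloud Sleuth API. It shows a span that records its timing, knows its parent and children, and is always stopped in a `finally` block.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is NOT Spring Cloud Sleuth's Span class.
final class SketchSpan {
    final String name;
    final SketchSpan parent;                           // null for the root span
    final List<SketchSpan> children = new ArrayList<>();
    private final long startNanos = System.nanoTime(); // timing starts on creation
    private long endNanos;
    private boolean stopped;

    private SketchSpan(String name, SketchSpan parent) {
        this.name = name;
        this.parent = parent;
        if (parent != null) {
            parent.children.add(this);
        }
    }

    static SketchSpan startRoot(String name) {
        return new SketchSpan(name, null);
    }

    SketchSpan startChild(String childName) {
        return new SketchSpan(childName, this);
    }

    // Stopping twice is harmless; only the first call records the end time.
    void stop() {
        if (!stopped) {
            endNanos = System.nanoTime();
            stopped = true;
        }
    }

    boolean isRunning() {
        return !stopped;
    }

    long durationNanos() {
        if (!stopped) {
            throw new IllegalStateException("span " + name + " was never stopped");
        }
        return endNanos - startNanos;
    }

    public static void main(String[] args) {
        SketchSpan root = SketchSpan.startRoot("list-books");
        try {
            SketchSpan child = root.startChild("call-inventory");
            try {
                // the unit of work (e.g. an RPC) would happen here
            } finally {
                child.stop(); // a started span must always be stopped
            }
        } finally {
            root.stop();
        }
        System.out.println(root.name + " took " + root.durationNanos() + " ns");
    }
}
```

The try/finally nesting is the point of the sketch: a span that is never stopped has no duration to report.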
Trace
A set of spans forming a tree-like structure.
● For example, if you are running a book store then
○ A trace could be retrieving a list of available books
○ Assuming that to retrieve the books you have to send 3 requests to 3 services, you could have at least 3 spans (1 for each hop) forming 1 trace
[Diagram: a request enters SERVICE 1 with no Trace Id and no Span Id. Every span created afterwards carries Trace Id = X. Span A covers SERVICE 1. Span B covers the call from SERVICE 1 to SERVICE 2 (Client Sent, Server Received, Server Sent, Client Received). Span C covers the work inside SERVICE 2. Spans D and F cover the calls from SERVICE 2 to SERVICE 3 and SERVICE 4, and spans E and G cover the work inside SERVICE 3 and SERVICE 4.]
Span Id = A, Parent Id = null
Span Id = B, Parent Id = A
Span Id = C, Parent Id = B
Span Id = D, Parent Id = C
Span Id = E, Parent Id = D
Span Id = F, Parent Id = C
Span Id = G, Parent Id = F
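The span and parent IDs listed above are enough to rebuild the tree of a trace. A minimal sketch, assuming the IDs are plain strings and `null` marks the root; `SpanTreeSketch` is a hypothetical helper written for this example, not part of Sleuth or Zipkin.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: rebuilds the span tree from (span id -> parent id) pairs.
final class SpanTreeSketch {

    // The seven spans of trace X; a null parent marks the root span.
    static Map<String, String> exampleParents() {
        Map<String, String> parentOf = new LinkedHashMap<>();
        parentOf.put("A", null);
        parentOf.put("B", "A");
        parentOf.put("C", "B");
        parentOf.put("D", "C");
        parentOf.put("E", "D");
        parentOf.put("F", "C");
        parentOf.put("G", "F");
        return parentOf;
    }

    // Renders the tree with two spaces of indentation per level.
    static String render(Map<String, String> parentOf) {
        Map<String, List<String>> children = new LinkedHashMap<>();
        List<String> roots = new ArrayList<>();
        for (Map.Entry<String, String> entry : parentOf.entrySet()) {
            if (entry.getValue() == null) {
                roots.add(entry.getKey());
            } else {
                children.computeIfAbsent(entry.getValue(), k -> new ArrayList<>())
                        .add(entry.getKey());
            }
        }
        StringBuilder out = new StringBuilder();
        for (String root : roots) {
            append(out, root, children, 0);
        }
        return out.toString();
    }

    private static void append(StringBuilder out, String id,
                               Map<String, List<String>> children, int depth) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(id).append('\n');
        List<String> kids = children.getOrDefault(id, Collections.<String>emptyList());
        for (String kid : kids) {
            append(out, kid, children, depth + 1);
        }
    }

    public static void main(String[] args) {
        System.out.print(render(exampleParents()));
    }
}
```

Span C is the only span with two children (D and F), which matches SERVICE 2 calling both SERVICE 3 and SERVICE 4.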
Is it that simple?
How do you pass tracing information (incl. Trace ID) between:
● different libraries?
● thread pools?
● asynchronous communication?
● …?
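To see why thread pools are a problem, here is a small sketch in plain Java. It assumes the trace ID lives in a `ThreadLocal`, which is an assumption made for this example only; `TraceContextSketch` and `wrap` are hypothetical names, not Sleuth classes. A task submitted as-is loses the trace ID, while a wrapped task carries it over to the worker thread.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch of carrying a trace id across a thread pool by hand.
final class TraceContextSketch {
    private static final ThreadLocal<String> TRACE_ID = new ThreadLocal<>();

    static void set(String traceId) { TRACE_ID.set(traceId); }
    static String current() { return TRACE_ID.get(); }
    static void clear() { TRACE_ID.remove(); }

    // Captures the trace id on the submitting thread and restores it on the worker.
    static <T> Callable<T> wrap(Callable<T> task) {
        final String captured = current();
        return () -> {
            String previous = current();
            set(captured);
            try {
                return task.call();
            } finally {
                // leave the pooled thread as we found it
                if (previous == null) {
                    clear();
                } else {
                    set(previous);
                }
            }
        };
    }

    // Returns { id seen by a plain task, id seen by a wrapped task }.
    static String[] demo() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        set("X");
        try {
            Callable<String> readId = TraceContextSketch::current;
            String plain = pool.submit(readId).get();
            String wrapped = pool.submit(wrap(readId)).get();
            return new String[] { plain, wrapped };
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            clear();
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        String[] seen = demo();
        System.out.println("plain task saw: " + seen[0]);
        System.out.println("wrapped task saw: " + seen[1]);
    }
}
```

Every executor, HTTP client and messaging library needs the same kind of wrapping, which is exactly the work you would rather not do yourself.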
Would you want to do that yourself?
Log correlation with Spring Cloud Sleuth
We take care of passing tracing information between threads / libraries / contexts for
● Hystrix
● RxJava
● Rest Template
● Feign
● Messaging with Spring Integration
● Zuul
● ...
If you don’t do anything unexpected there’s nothing you need to do to make Sleuth work. Check the docs for more info.
Now let’s aggregate the logs!
Instead of SSHing to the machines, aggregate the logs!
● With Cloud Foundry’s (CF) Loggregator the logs from different instances are streamed into a single place
● You can harvest your logs with Logstash Forwarder / FileBeat
● You can use the ELK stack to stream and visualize the logs
Spring Cloud Sleuth with Maven
<dependencyManagement>
    <dependencies>
        <dependency>
            <groupId>org.springframework.cloud</groupId>
            <artifactId>spring-cloud-dependencies</artifactId>
            <version>Brixton.RELEASE</version>
            <type>pom</type>
            <scope>import</scope>
        </dependency>
    </dependencies>
</dependencyManagement>
<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-starter-sleuth</artifactId>
</dependency>
Spring Cloud Sleuth with Gradle
dependencies {
    compile "org.springframework.cloud:spring-cloud-starter-sleuth"
}
dependencyManagement {
    imports {
        mavenBom "org.springframework.cloud:spring-cloud-dependencies:Brixton.RELEASE"
    }
}
[Diagram: SERVICE 1 calls SERVICE 2, which calls SERVICE 3 and SERVICE 4. SERVICE 3 responds “Hello from service3”, SERVICE 4 responds “Hello from service4”, and SERVICE 2 responds “Hello from service2, response from service3 [Hello from service3] and from service4 [Hello from service4]”.]
Log correlation with Spring Cloud Sleuth
DEMO
Great! We’ve found the exception!
But meanwhile...
The system is slow...
CLICK 200
One of the services is slow?
Which one?
How to measure that?
Let’s log events!
● Client Sent (CS) - The client has made a request
● Server Received (SR) - The server side got the request and will start processing it
● Server Sent (SS) - Annotated upon completion of request processing
● Client Received (CR) - The client has successfully received the response from the server side
CS 0 ms, SR 100 ms, SS 200 ms, CR 300 ms
Conclusions
CS 0 ms, SR 100 ms, SS 200 ms, CR 300 ms
● The request started at T=0 ms
● It took 300 ms for the client to receive a response
● Server side received the request at T=100 ms
● The request got processed on the server side in 100 ms
Why is there a delay between sending and receiving messages?!!11!one!?!1!
See the fallacies of distributed computing (one of them is “latency is zero”): https://blogs.oracle.com/jag/resource/Fallacies.html
Distributed tracing - terminology
Span
Trace
Logs (annotations)
Tags (binary annotations)
Logs
A log represents an event in time associated with a span
● Every span has zero or more logs
● Each log is a timestamped event name
● Event should be the stable name of some notable moment in the lifetime of a span
● For instance, a span representing a browser page load might add an event for each of the Performance.timing moments (check https://developer.mozilla.org/en-US/docs/Web/API/PerformanceTiming)
Main logs
● Client Sent (CS)
○ The client has made a request - the span was started
● Server Received (SR)
○ The server side got the request and will start processing it
○ SR timestamp - CS timestamp = NETWORK LATENCY
CS 0 ms, SR 100 ms
Main logs
● Server Sent (SS)
○ Annotated upon completion of request processing
○ SS timestamp - SR timestamp = SERVER SIDE PROCESSING TIME
● Client Received (CR)
○ The client has successfully received the response from the server side
○ CR timestamp - CS timestamp = TIME NEEDED TO RECEIVE RESPONSE
○ CR timestamp - SS timestamp = NETWORK LATENCY
CS 0 ms, SR 100 ms, SS 200 ms, CR 300 ms
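The four subtractions above can be checked against the example timeline (CS 0 ms, SR 100 ms, SS 200 ms, CR 300 ms). A minimal sketch follows; `SpanTimingSketch` is a hypothetical class written for this example, and it assumes the client and server clocks agree, which is rarely exactly true across machines.

```java
// Hypothetical helper that turns the four main logs into the durations from the slides.
final class SpanTimingSketch {
    final long clientSent;      // CS, in ms
    final long serverReceived;  // SR, in ms
    final long serverSent;      // SS, in ms
    final long clientReceived;  // CR, in ms

    SpanTimingSketch(long clientSent, long serverReceived, long serverSent, long clientReceived) {
        this.clientSent = clientSent;
        this.serverReceived = serverReceived;
        this.serverSent = serverSent;
        this.clientReceived = clientReceived;
    }

    // SR - CS: network latency on the way to the server
    long requestNetworkLatency() { return serverReceived - clientSent; }

    // SS - SR: server side processing time
    long serverProcessingTime() { return serverSent - serverReceived; }

    // CR - SS: network latency on the way back to the client
    long responseNetworkLatency() { return clientReceived - serverSent; }

    // CR - CS: time needed to receive the response
    long timeToReceiveResponse() { return clientReceived - clientSent; }

    public static void main(String[] args) {
        SpanTimingSketch timing = new SpanTimingSketch(0, 100, 200, 300);
        System.out.println("request latency:  " + timing.requestNetworkLatency() + " ms");
        System.out.println("server time:      " + timing.serverProcessingTime() + " ms");
        System.out.println("response latency: " + timing.responseNetworkLatency() + " ms");
        System.out.println("total:            " + timing.timeToReceiveResponse() + " ms");
    }
}
```

For the example timeline the three parts are 100 ms each and add up to the 300 ms the client waited.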
Tag
Key-value pair
● Every span may have zero or more key/value Tags
● They do not have timestamps and simply annotate the spans.
● Example of default tags in Sleuth
○ message/payload-size
○ http.method
○ commandKey for Hystrix
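The difference between logs and tags is easy to see as a data structure: logs are timestamped events, tags are plain key/value pairs. A minimal sketch, where `AnnotatedSpanSketch` is a hypothetical type and the event names and tag values are example data, not output captured from Sleuth.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical types for illustration; these are not Sleuth or Zipkin classes.
final class AnnotatedSpanSketch {

    // A log: an event name plus the moment it happened.
    static final class Log {
        final long timestampMillis;
        final String event;

        Log(long timestampMillis, String event) {
            this.timestampMillis = timestampMillis;
            this.event = event;
        }
    }

    final List<Log> logs = new ArrayList<>();              // zero or more, ordered
    final Map<String, String> tags = new LinkedHashMap<>(); // zero or more, no timestamps

    void logEvent(long timestampMillis, String event) {
        logs.add(new Log(timestampMillis, event));
    }

    void tag(String key, String value) {
        tags.put(key, value);
    }

    public static void main(String[] args) {
        AnnotatedSpanSketch span = new AnnotatedSpanSketch();
        span.logEvent(0, "cs");          // example event: client sent
        span.logEvent(300, "cr");        // example event: client received
        span.tag("http.method", "GET");  // example tag value
        System.out.println(span.logs.size() + " logs, tags = " + span.tags);
    }
}
```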
How to visualise latency in a distributed system?
The answer is: Zipkin
● Zipkin is a distributed tracing system
● It runs as a separate process (you can run it as a Spring Boot application)
● It helps gather timing data needed to troubleshoot latency problems in microservice architectures
● The front end is a "waterfall" style graph of service calls showing call durations as horizontal bars
How does Zipkin work?
[Diagram: each APP sends its spans to the collectors, the collectors store them in the DB, and the UI queries for trace info via the API.]
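The flow on this slide (spans sent to collectors, stored in a DB, queried by the UI) can be mimicked with an in-memory stand-in. This is a toy sketch only: `ZipkinFlowSketch`, `ReportedSpan`, `collect` and `queryTrace` are hypothetical names and say nothing about how Zipkin itself is implemented.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-ins; none of these types are Zipkin's own API.
final class ZipkinFlowSketch {

    static final class ReportedSpan {
        final String traceId;
        final String name;
        final long durationMillis;

        ReportedSpan(String traceId, String name, long durationMillis) {
            this.traceId = traceId;
            this.name = name;
            this.durationMillis = durationMillis;
        }
    }

    // "STORE IN DB": spans grouped by the trace they belong to.
    private final Map<String, List<ReportedSpan>> storage = new LinkedHashMap<>();

    // "SPANS SENT TO COLLECTORS": an app reports a finished span.
    void collect(ReportedSpan span) {
        storage.computeIfAbsent(span.traceId, k -> new ArrayList<>()).add(span);
    }

    // "UI QUERIES FOR TRACE INFO VIA API": fetch every span of one trace.
    List<ReportedSpan> queryTrace(String traceId) {
        return storage.getOrDefault(traceId, Collections.<ReportedSpan>emptyList());
    }

    public static void main(String[] args) {
        ZipkinFlowSketch zipkin = new ZipkinFlowSketch();
        zipkin.collect(new ReportedSpan("X", "service1 -> service2", 300));
        zipkin.collect(new ReportedSpan("X", "service2 -> service3", 100));
        for (ReportedSpan span : zipkin.queryTrace("X")) {
            System.out.println(span.name + ": " + span.durationMillis + " ms");
        }
    }
}
```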
Spring Cloud Sleuth and Zipkin integration
● We take care of passing tracing information between threads / libraries / contexts
● Upon closing of a Span we will send it to Zipkin
○ either via HTTP (spring-cloud-sleuth-zipkin)
○ or via Spring Cloud Stream (spring-cloud-sleuth-stream)
● You can run Zipkin Spring Cloud Stream Collector as a Spring Boot app (spring-cloud-sleuth-zipkin-stream)
○ you can add the dependency to Zipkin UI!
Spring Cloud Sleuth Zipkin with Maven
<dependencyManagement>
    <dependencies>
        <dependency>
            <groupId>org.springframework.cloud</groupId>
            <artifactId>spring-cloud-dependencies</artifactId>
            <version>Brixton.RELEASE</version>
            <type>pom</type>
            <scope>import</scope>
        </dependency>
    </dependencies>
</dependencyManagement>
<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-starter-zipkin</artifactId>
</dependency>
Spring Cloud Sleuth Zipkin with Gradle
Marcin Grzejszczak @mgrzejszczak, Kraków, 11-13 May 2016
dependencies {
    compile "org.springframework.cloud:spring-cloud-starter-zipkin"
}
dependencyManagement {
    imports {
        mavenBom "org.springframework.cloud:spring-cloud-dependencies:Brixton.RELEASE"
    }
}
HOLD IT!
● If I have a billion services that emit a gazillion spans - won’t I kill Zipkin?
Sampling to the rescue!
● By default Spring Cloud Sleuth sends only 10% of requests to Zipkin
● You can change that by setting the property spring.sleuth.sampler.percentage (for 100% pass 1.0)
● Or register a custom org.springframework.cloud.sleuth.Sampler implementation
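To make the idea of percentage-based sampling concrete, here is a standalone sketch. It deliberately does not implement the `org.springframework.cloud.sleuth.Sampler` interface, and it is not how Sleuth's own sampler works; `PercentageSamplerSketch` is a hypothetical class that simply lets the first N out of every 100 requests through.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of percentage-based sampling; not Sleuth's implementation.
final class PercentageSamplerSketch {
    private final int sampledPerHundred;
    private final AtomicLong counter = new AtomicLong();

    // percentage uses the same scale as the property on the slide: 0.1 = 10%, 1.0 = 100%
    PercentageSamplerSketch(double percentage) {
        if (percentage < 0.0 || percentage > 1.0) {
            throw new IllegalArgumentException("percentage must be between 0.0 and 1.0");
        }
        this.sampledPerHundred = (int) Math.round(percentage * 100);
    }

    // Samples the first N requests of every block of 100 (thread-safe counter).
    boolean isSampled() {
        long positionInBlock = counter.getAndIncrement() % 100;
        return positionInBlock < sampledPerHundred;
    }

    public static void main(String[] args) {
        PercentageSamplerSketch sampler = new PercentageSamplerSketch(0.1);
        int sampled = 0;
        for (int i = 0; i < 1000; i++) {
            if (sampler.isSampled()) {
                sampled++;
            }
        }
        System.out.println("sampled " + sampled + " of 1000 requests");
    }
}
```

Sampling in bursts like this keeps the sketch short; a production sampler would more likely spread the sampled requests evenly or pick them at random.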
[Diagram: SERVICE 1 (/start), SERVICE 2 (/foo), SERVICE 3 (/bar), SERVICE 4 (/baz) and CYBERCOM SERVICE (/cybercom) exchanging requests and responses.]
DEMO
Traced call
Traced call
[Screenshot labels: TOTAL DURATION, START, END]
Traced call
[Screenshot labels: SERVICE 2 CLIENT, CLIENT SENT, CLIENT RECEIVED]
Traced call
[Screenshot labels: SERVICE 4 SERVER, SERVER RECEIVED, SERVER SENT]
Zipkin for Brewery
● A test app for Spring Cloud end to end tests
● Source code: https://github.com/spring-cloud-samples/brewery
● Around 10 applications involved
Summary
● Log correlation allows you to match logs for a given trace
● Distributed tracing allows you to quickly see latency issues in your system
● Zipkin is a great tool to visualize the latency graph and system dependencies
● Spring Cloud Sleuth integrates with Zipkin and grants you log correlation
THANK YOU
● https://github.com/marcingrzejszczak/vagrant-elk-box/tree/presentation - code for this presentation (clone and run getReadyForConference.sh - NOTE: you need Vagrant!)
● https://github.com/spring-cloud/spring-cloud-sleuth - Spring Cloud Sleuth repository
● http://cloud.spring.io/spring-cloud-sleuth/spring-cloud-sleuth.html - Sleuth’s documentation
● http://toomuchcoding.com/blog/2016/03/25/spring-cloud-sleuth-rc1-deployed/ - article about RC1 release
● https://github.com/openzipkin/zipkin-java - Repo with Spring Boot Zipkin server
● http://docssleuth-service1.cfapps.io/start - The service1 app from this presentation deployed to Pivotal Cloud Foundry - point of entry to the app
● http://docssleuth-zipkin-server.cfapps.io/ - Zipkin deployed to Pivotal Cloud Foundry
● http://brewery-zipkin-web.cfapps.io - Zipkin deployed to PCF for Brewery Sample app