JUG Poznan - 2017.01.31

Avoiding software fails Few metrics to improve application reliability [email protected] Poznań, 2017/01/31


Page 1: JUG Poznan - 2017.01.31

Avoiding software fails

Few metrics to improve application reliability

[email protected]

Poznań, 2017/01/31

Page 2: JUG Poznan - 2017.01.31

2 COMPANY CONFIDENTIAL – DO NOT DISTRIBUTE #Dynatrace

What to do with the fastest car …

Page 3: JUG Poznan - 2017.01.31


… if it fails to reach the finish line

Page 4: JUG Poznan - 2017.01.31

In 2005, only 2% of performance incidents had been predicted

Source: Gartner

Page 5: JUG Poznan - 2017.01.31

What % of problems were predicted in 2015?

A. 75%
B. 46%
C. 11%
D. 3%
E. None of the above

Page 6: JUG Poznan - 2017.01.31


Page 7: JUG Poznan - 2017.01.31

Why do software projects fail so often?

http://spectrum.ieee.org/computing/software/why-software-fails

Unrealistic or unarticulated project goals

Inaccurate estimates of needed resources

Badly defined system requirements

Poor reporting of the project's status

Unmanaged risks

Poor communication among customers, developers, and users

Commercial pressures

Stakeholder politics

Poor project management

Sloppy development practices

Inability to handle the project's complexity

Use of immature technology

Page 8: JUG Poznan - 2017.01.31

Performance issues increase costs

63% of IT organizations spend 20%+ of their time working on performance issues

Inability to innovate

40% of developers’ time is wasted in triage, stealing focus from activities that drive innovation

Page 9: JUG Poznan - 2017.01.31

The good thing is

80:20

Page 10: JUG Poznan - 2017.01.31

Let’s start on the frontend: the 80/20 rule from Steve

Page 11: JUG Poznan - 2017.01.31

But then we’d focus on the backend

Page 12: JUG Poznan - 2017.01.31

5 use cases & metrics that really pay off…

Page 13: JUG Poznan - 2017.01.31

#1

Pushing without a Plan

Page 14: JUG Poznan - 2017.01.31

Web Site: this shouldn’t happen

Some ad company during the American Super Bowl

Total size ~ 20MB

434 resources on that page

Page 15: JUG Poznan - 2017.01.31

Web Site: this could be easily eliminated

Obama Care

16 individual jQuery-related files that should be merged

Most JavaScript files contain developer documentation, which makes up to 80% of the file size

Page 16: JUG Poznan - 2017.01.31

Web Site: this shouldn’t happen

FIFA.com during the World Cup

Favicon: the largest element

Some heavy CSS & JS: +150 kB

Page 17: JUG Poznan - 2017.01.31

Lessons Learnt – NO Excuses for …

• Developers not using the browser’s built-in diagnostics tools
• Testers not doing sanity checks with the same tools
• Some tools for you
  • Built-in Inspectors via Ctrl-Shift-I in Chrome and Firefox
  • YSlow, PageSpeed
  • Dynatrace Ajax Edition
• Level-Up: Automate Testing & Diagnostics Check

Page 18: JUG Poznan - 2017.01.31

# Resources

# of Domains

Usage of CDNs

Page Load & Size
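
The front-end metrics on this slide can be checked automatically once the list of resources a page loads is available, for example exported from the browser's network panel. The following is only a minimal sketch under that assumption: the class `PageMetrics`, its methods, the example URLs, and the budget thresholds are all hypothetical and not part of the talk.

```java
import java.net.URI;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: derives "# Resources", "# of Domains" and total size
// from parallel lists of resource URLs and their sizes in bytes.
class PageMetrics {
    final int resourceCount;
    final Set<String> domains;
    final long totalBytes;

    PageMetrics(int resourceCount, Set<String> domains, long totalBytes) {
        this.resourceCount = resourceCount;
        this.domains = domains;
        this.totalBytes = totalBytes;
    }

    // urls.get(i) is assumed to have been downloaded with sizes.get(i) bytes.
    static PageMetrics of(List<String> urls, List<Long> sizes) {
        Set<String> domains = new TreeSet<>();
        long total = 0;
        for (int i = 0; i < urls.size(); i++) {
            domains.add(URI.create(urls.get(i)).getHost());
            total += sizes.get(i);
        }
        return new PageMetrics(urls.size(), domains, total);
    }

    // A budget an automated test could assert on; the limits are arbitrary examples.
    boolean withinBudget(int maxResources, int maxDomains, long maxBytes) {
        return resourceCount <= maxResources
                && domains.size() <= maxDomains
                && totalBytes <= maxBytes;
    }

    public static void main(String[] args) {
        PageMetrics m = PageMetrics.of(
                List.of("https://www.example.com/index.html",
                        "https://www.example.com/app.js",
                        "https://cdn.example.net/jquery.js"),
                List.of(20_000L, 150_000L, 90_000L));
        System.out.println(m.resourceCount + " resources, "
                + m.domains.size() + " domains, " + m.totalBytes + " bytes");
        System.out.println("within budget: " + m.withinBudget(100, 5, 1_000_000L));
    }
}
```

A page like the 434-resource, ~20MB example earlier in the deck would fail such a budget check long before it reached production.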

Page 19: JUG Poznan - 2017.01.31

#2

Not every Architect makes good decisions

Page 20: JUG Poznan - 2017.01.31

Project: Online Room Reservation System

• Symptoms
  • HTML takes 60-120s to render
  • High GC Time
• Developer Assumptions
  • Bad GC Tuning
  • Probably bad DB performance as rendering was simple
• Resulted in: months of finger-pointing between Dev & DBA

Page 21: JUG Poznan - 2017.01.31

Developer-built monitoring

void roomreservationReport(int officeId) {
    long startTime = System.currentTimeMillis();
    Object data = loadDataForOffice(officeId);
    // Wall-clock time of the whole load call, not only of the SQL statements
    long dataLoadTime = System.currentTimeMillis() - startTime;

    generateReport(data, officeId);
}

Result: Avg. Data Load Time: 41s!

DB Tool says: Avg. SQL Query: <1ms!

Page 22: JUG Poznan - 2017.01.31
Page 23: JUG Poznan - 2017.01.31

#1: Loading too much data

24889! Calls to the DB API

High CPU & High Memory Usage to keep all data in Memory

Page 24: JUG Poznan - 2017.01.31

#2: On individual connections

12444! individual connections

Individual SQL really fast <1ms

Classic N+1 Query Problem
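
To make the N+1 pattern concrete, here is a minimal sketch that counts round trips against an in-memory stand-in for a database. Everything in it is hypothetical: the class `NPlusOneDemo`, the room data, and the SQL text in the comments are illustrative assumptions and not the reservation system's actual code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for a database that counts round trips.
class NPlusOneDemo {
    static final Map<Integer, String> ROOMS = new LinkedHashMap<>();
    static int queryCount = 0;

    static {
        for (int id = 1; id <= 100; id++) {
            ROOMS.put(id, "Room " + id);
        }
    }

    // Stands for: SELECT id FROM room
    static List<Integer> findRoomIds() {
        queryCount++;
        return new ArrayList<>(ROOMS.keySet());
    }

    // Stands for: SELECT name FROM room WHERE id = ?
    static String findRoomName(int id) {
        queryCount++;
        return ROOMS.get(id);
    }

    // Stands for: SELECT name FROM room
    static List<String> findAllRoomNames() {
        queryCount++;
        return new ArrayList<>(ROOMS.values());
    }

    // N+1 pattern: one query for the ids, then one more query per row.
    static int queriesForNPlusOne() {
        queryCount = 0;
        List<String> names = new ArrayList<>();
        for (int id : findRoomIds()) {
            names.add(findRoomName(id));
        }
        return queryCount;
    }

    // Same data fetched in a single round trip.
    static int queriesForSingleQuery() {
        queryCount = 0;
        findAllRoomNames();
        return queryCount;
    }

    public static void main(String[] args) {
        System.out.println("N+1 queries:  " + queriesForNPlusOne());    // prints 101
        System.out.println("Single query: " + queriesForSingleQuery()); // prints 1
    }
}
```

Each individual query is fast, which is why a database tool reporting the average SQL time does not reveal the problem; the count of executions does.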

Page 25: JUG Poznan - 2017.01.31

#3: Putting all data in temp Hashtable

Lots of time spent in Hashtable.get

Called from their Entity Objects

Page 26: JUG Poznan - 2017.01.31

Lessons Learned – Don’t Assume …

• … You know what code is doing
• Challenge the developers
• Don’t use Hashtables as a workaround, use O/R mappers
• Explore Tools that “might seem” out of your league!
  • Built-In Database Analysis Tools
  • “Logging” options of Frameworks such as Hibernate, …
  • JMX, Perf Counters, … of your Application Servers
  • APM (Performance Tracing) Tools: Dynatrace Personal Ed., …

Page 27: JUG Poznan - 2017.01.31

# SQL Executions

# of Same SQLs

Conn. Acquisition Time
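
One way to collect these three metrics per request is a small recorder that the data access layer reports into. This is a sketch under assumptions: `SqlMetrics` and its methods are hypothetical names, and in a real application the recording calls would have to be wired into a JDBC wrapper, an O/R mapper's statistics, or an APM tool.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-request recorder for "# SQL Executions", "# of Same SQLs"
// and "Conn. Acquisition Time".
class SqlMetrics {
    private final Map<String, Integer> executionsPerStatement = new HashMap<>();
    private int getConnectionCalls = 0;
    private long connectionAcquisitionNanos = 0;

    void recordExecution(String sql) {
        executionsPerStatement.merge(sql, 1, Integer::sum);
    }

    void recordConnectionAcquisition(long nanos) {
        getConnectionCalls++;
        connectionAcquisitionNanos += nanos;
    }

    // Total number of statements executed.
    int totalExecutions() {
        int total = 0;
        for (int n : executionsPerStatement.values()) {
            total += n;
        }
        return total;
    }

    // Highest execution count of any single statement text.
    int maxSameSql() {
        int max = 0;
        for (int n : executionsPerStatement.values()) {
            max = Math.max(max, n);
        }
        return max;
    }

    int getConnectionCalls() {
        return getConnectionCalls;
    }

    long connectionAcquisitionNanos() {
        return connectionAcquisitionNanos;
    }

    public static void main(String[] args) {
        SqlMetrics metrics = new SqlMetrics();
        // Simulated request: one list query, then the same lookup three times,
        // each on a freshly acquired connection (2 ms each, made-up value).
        metrics.recordConnectionAcquisition(2_000_000L);
        metrics.recordExecution("SELECT id FROM room");
        for (int i = 0; i < 3; i++) {
            metrics.recordConnectionAcquisition(2_000_000L);
            metrics.recordExecution("SELECT name FROM room WHERE id = ?");
        }
        System.out.println("# SQL executions: " + metrics.totalExecutions());   // prints 4
        System.out.println("# of same SQLs:   " + metrics.maxSameSql());        // prints 3
        System.out.println("# GetConnection:  " + metrics.getConnectionCalls()); // prints 4
    }
}
```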

Page 28: JUG Poznan - 2017.01.31

Root Cause: Deployment Considerations

Log Service provides a Synchronized File across all JVMs

1M Log exceptions over 30 min

Page 29: JUG Poznan - 2017.01.31

Production Deployment leads to Log SYNC Issues

Log message

Time in Sync

Two calls coming from customer-coded methods

Page 30: JUG Poznan - 2017.01.31

Time Spent in Sync & Logging

# of Log Messages

# of Exceptions

Page 31: JUG Poznan - 2017.01.31

#3

Deployment Gone Bad

Page 32: JUG Poznan - 2017.01.31

Test Environment

Production Environment

8x slower

3x more SQL

Page 33: JUG Poznan - 2017.01.31

Test Environment Production Environment

That’s Normal: Having I/O for Web Requests as main contributor

Hibernate, Classloading, XML – The Key Hotspots

I/O for Web Requests doesn’t even show up!

Page 34: JUG Poznan - 2017.01.31

These calls all originate from thousands of calls to find item by code

Top Contributor: Class.getInterfaces

Called from Hibernate’s FieldInterceptionHelper

Page 35: JUG Poznan - 2017.01.31

Top Methods related to XML Processing

Classloading is triggered through CustomMonkey and the Xalan Parser

Page 36: JUG Poznan - 2017.01.31

Lessons Learned

• Plan enough time for proper testing
• Anticipate changed user behavior during peak load
• Only test what really ends up in Production

Page 37: JUG Poznan - 2017.01.31

Time Spent in API

# Calls to API

Page 38: JUG Poznan - 2017.01.31

#4

Incorrect Sizing of Pools and Queues

Page 39: JUG Poznan - 2017.01.31

Online Banking: Slow Balance Check

101s! To Check Balance!

600! SQL Executions

87% spent in IIS

Page 40: JUG Poznan - 2017.01.31

#1 Time really spent in IIS?

Tip: Elapsed Time tells us WHEN a Method was executed!

Tip: Thread# gives us insight on Thread Queues / Switches

Finding: Thread 32 in IIS waited 87s to pass control to Thread 30 in ASP.NET

Page 41: JUG Poznan - 2017.01.31

#2 What about these SQL Executions?

Finding: EVERY SQL statement is executed on ITS OWN Connection!

Tip: Look at “GetConnection”

Page 42: JUG Poznan - 2017.01.31

#2 SQL Executions! continued …

#1: Same SQL is executed 67! times

#2: NO PREPARATION because everything is executed on a new Connection

Page 43: JUG Poznan - 2017.01.31

Lessons Learned!

ASP.NET Worker Thread Pool Sizing!

DB Connection Pools

More Efficient SQL

Page 44: JUG Poznan - 2017.01.31

Idle vs. Busy Threads

# SQLs / Request

# GetConnection

%CPU Starvation
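
The "Idle vs. Busy Threads" metric can be illustrated with a fixed-size worker pool. The banking case involved IIS and ASP.NET; the sketch below uses Java's `ThreadPoolExecutor` purely as a stand-in, and the class `PoolSizingDemo`, its method names, and the pool sizes are hypothetical choices for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical demo: how many worker threads are busy or idle, and how many
// requests wait in the queue, while blocking requests are in flight.
class PoolSizingDemo {

    // Returns {busy threads, idle threads, queued requests}.
    static int[] snapshot(int poolSize, int requests) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                poolSize, poolSize, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
        pool.prestartAllCoreThreads(); // make idle workers visible in getPoolSize()
        CountDownLatch started = new CountDownLatch(Math.min(poolSize, requests));
        CountDownLatch release = new CountDownLatch(1);
        try {
            for (int i = 0; i < requests; i++) {
                pool.execute(() -> {
                    started.countDown();   // this request now occupies a worker
                    awaitQuietly(release); // simulate a slow backend call
                });
            }
            started.await(); // wait until every worker that can run is running
            int busy = pool.getActiveCount();
            int idle = pool.getPoolSize() - busy;
            int queued = pool.getQueue().size();
            return new int[] {busy, idle, queued};
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            release.countDown(); // let all requests finish
            pool.shutdown();
        }
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        int[] small = snapshot(2, 5);
        int[] large = snapshot(8, 5);
        System.out.println("pool=2: busy=" + small[0] + " idle=" + small[1] + " queued=" + small[2]);
        System.out.println("pool=8: busy=" + large[0] + " idle=" + large[1] + " queued=" + large[2]);
    }
}
```

With the undersized pool, requests sit in the queue even though nothing is consuming CPU, which is the kind of wait the Thread# and Elapsed Time tips help to spot.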

Page 45: JUG Poznan - 2017.01.31

#5

Do know what you Test

Page 46: JUG Poznan - 2017.01.31

New Generation CRM: Angular.js / Coherence

23s for one click

22s

$3-5M worth data grid

Page 47: JUG Poznan - 2017.01.31

New Generation CRM: Angular.js / Coherence

7s for filter execution

Filter Value

Page 48: JUG Poznan - 2017.01.31

Talk to Architects, and trace argument values for performance-sensitive methods

# of unique invocations

Response Time
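
Tracing argument values can be sketched as a thin wrapper around a performance-sensitive method that records, per argument value, how often it was called and how long it took. The class `ArgumentTracer`, its methods, and the example filter values are hypothetical; an APM tool or an aspect would normally do this without code changes.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical tracer: wraps one method and records calls and time per argument value.
class ArgumentTracer<A, R> {
    private final Function<A, R> target;
    private final Map<A, Integer> callsPerArgument = new LinkedHashMap<>();
    private final Map<A, Long> nanosPerArgument = new LinkedHashMap<>();
    private int invocations = 0;

    ArgumentTracer(Function<A, R> target) {
        this.target = target;
    }

    R call(A argument) {
        long start = System.nanoTime();
        try {
            return target.apply(argument);
        } finally {
            long elapsed = System.nanoTime() - start;
            invocations++;
            callsPerArgument.merge(argument, 1, Integer::sum);
            nanosPerArgument.merge(argument, elapsed, Long::sum);
        }
    }

    int invocations() {
        return invocations;
    }

    // "# of unique invocations": number of distinct argument values seen.
    int uniqueInvocations() {
        return callsPerArgument.size();
    }

    int callsFor(A argument) {
        return callsPerArgument.getOrDefault(argument, 0);
    }

    long nanosFor(A argument) {
        return nanosPerArgument.getOrDefault(argument, 0L);
    }

    public static void main(String[] args) {
        // String::length stands in for an expensive filter execution.
        ArgumentTracer<String, Integer> filter = new ArgumentTracer<>(String::length);
        for (String value : new String[] {"city=Poznan", "city=Poznan", "status=open"}) {
            filter.call(value);
        }
        System.out.println("invocations: " + filter.invocations());        // prints 3
        System.out.println("unique:      " + filter.uniqueInvocations());  // prints 2
    }
}
```

If the number of unique invocations is far below the total, the test is exercising the same input over and over and may say little about production behavior.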

Page 49: JUG Poznan - 2017.01.31

# Images

# Redirects

# and Size of Resources

# SQL Executions

# of SAME SQLs

# Items per Page

# AJAX per Page

Remember: New Metrics When Testing Apps

Time Spent in API

# Calls into API

# Functional Errors

# 3rd Party calls

# of Domains

Total Size

Resource (W3C) Timings: PLT, DOM Processing/Ready, Page Interactive

Page 50: JUG Poznan - 2017.01.31

Online Performance Clinics

Every week @

bit.ly/onlineperfclinic

bit.ly/dttrial

Page 51: JUG Poznan - 2017.01.31

Putting it into a Test Automation

          Test Framework Results      Architectural Data
Build #   Test Case      Status       # SQL   # Excep   CPU
17        testPurchase   OK           12      0         120ms
17        testSearch     OK           3       1         68ms
18        testPurchase   FAILED       12      5         60ms
18        testSearch     OK           3       1         68ms
19        testPurchase   OK           75      0         230ms
19        testSearch     OK           3       1         68ms
20        testPurchase   OK           12      0         120ms
20        testSearch     OK           3       1         68ms

We identified a regression

Problem solved

Exceptions probably reason for failed tests

Problem fixed but now we have an architectural regression

Now we have the functional and architectural confidence

Let’s look behind the scenes
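
A build pipeline could apply the same reasoning automatically by comparing each test run's architectural metrics with a known-good baseline. The sketch below is an assumption about how such a check might look, not the tooling used in the talk: `ArchitecturalRegressionCheck`, `TestRun`, and the "any increase is a finding" rule are hypothetical, and the testPurchase figures are taken as read from this slide, with build 17 as the baseline.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical check: a run is flagged when it fails functionally or when
// # SQL or # exceptions rise above the baseline run.
class ArchitecturalRegressionCheck {

    static final class TestRun {
        final int build;
        final boolean passed;
        final int sqlCount;
        final int exceptionCount;

        TestRun(int build, boolean passed, int sqlCount, int exceptionCount) {
            this.build = build;
            this.passed = passed;
            this.sqlCount = sqlCount;
            this.exceptionCount = exceptionCount;
        }
    }

    static List<String> findings(TestRun baseline, TestRun current) {
        List<String> out = new ArrayList<>();
        if (!current.passed) {
            out.add("functional failure");
        }
        if (current.sqlCount > baseline.sqlCount) {
            out.add("# SQL " + baseline.sqlCount + " -> " + current.sqlCount);
        }
        if (current.exceptionCount > baseline.exceptionCount) {
            out.add("# exceptions " + baseline.exceptionCount + " -> " + current.exceptionCount);
        }
        return out;
    }

    public static void main(String[] args) {
        TestRun baseline = new TestRun(17, true, 12, 0);
        TestRun[] runs = {
            new TestRun(18, false, 12, 5),
            new TestRun(19, true, 75, 0),
            new TestRun(20, true, 12, 0),
        };
        for (TestRun run : runs) {
            System.out.println("Build " + run.build + ": " + findings(baseline, run));
        }
        // prints:
        // Build 18: [functional failure, # exceptions 0 -> 5]
        // Build 19: [# SQL 12 -> 75]
        // Build 20: []
    }
}
```

Build 19 is the interesting case: the functional test is green, so only the architectural data exposes the regression.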