Post on 15-Dec-2015
Windows Azure SQL Database (WASD) Troubleshooting
Bob WardPrincipal Architect Escalation Engineer
bobward@microsoft.com
I will assume basic SQL Server knowledge
2
My Goals for You Today
Prepare
React
Prevent
3
What Will We Cover Today
The Azure Troubleshooting Challenge
Troubleshooting Connectivity
WASD Errors
Query Performance
Practical Advice and Tips
4
The Azure Troubleshooting Challenge
WASD is a platform service (PAAS) This is not a VM running SQL Server “box” (IAAS)
Multi-tenant platform You are sharing a SQL instance with other databases from
other customers You are abstracted from the SQL Server instance,
Windows, and computer server Less admin tasks means lower TCO but also means less access
You are isolated to a specific database You have a logical server and a master but most things are
done in your database Most things are database scoped (Ex. DMVs)
We make decisions to maximize all database availability
Application design may be required The service can be updated far quicker than the
“box” product
5
WASD Connectivity Errors
WASD specific errors
Firewall blocked in Azure
Windows authentication not
supported
Invalid login – Invalid account or password
Denial of Service – After a large number of
login failures
Network related errors
“…Server not found”
Connection Timeout Expired
Msg 121 “.. The semaphore timeout period has expired”
You could lose
connectivity
Idle connections terminated after 30 minutes (Msg 10053
and 10054)
We may forcibly disconnect on
failover/some errors or change to MAXSIZE
Retries you need to take into account
Use min 30sec login
timeout
6
Example Connectivity Errors
40XXX errors unique to WASD
Be sure to give this to support
May see this after deleting a server
Network latency
After getting dropped on idle connection
7
Troubleshooting Connectivity
Configuration issues• WASD Firewall and your firewall• Allow Windows Azure Service
Is it our service or your internet?
• Windows Azure Management Portal• Windows Azure Service Dashboard• Windows Azure SQL Database Connectivity Troubleshooting Guide
General Tools to use• ping.exe, telnet.exe, tracert.exe• SQL Server 2012 Management Studio – Free with SQL Server 2012 Express• ostress.exe and sqlcmd.exe (username requires @<full server name>)• SQL Database Management Portal –
https://<servername>.database.windows.net
New System Views (Event Tables) – in master database• sys.event_log• sys.database_connection_stats
History tables – not real time
Demo
Tools for Connectivity
9
WASD Errors
Failover
Governance and Quota
Throttling Limits
Engine Throttling
“Not supported”
Database copy
Federation
These can result in connection termination
and possible future rejection of work
Many “box” errors still apply – Ex.
1205 = deadlock
Msg 40XXX range can be
seen in sys.messages in SQL Server
2012
full list here
10
Failover
• Your database, the instance, or the computer is “unhealthy”• We may need to patch the instance and/or computer
We may decide to “move you” to a replica of your database to another server
• Msg 40197• “..Server not available”
What will you see?
Implement retry logic in your application
The partition is in transition and transactions are being terminated.
SHUTDOWN is in progress.
11
Governance Max number of concurrent worker threads (currently
180) per database Msg 10928 if you exceed the limit
Connection terminated. Retry when your concurrent work subsides Check for blocking problems or inefficient queries
Msg 10929 if the overall system has too many workers
You may get less than 180 max Connection terminated. You can retry but it may take longer to
stabilize Still could be an application issue but a service issue could also be
occurring
Resource ID : 1 = worker threads
12
Quotas Quota errors for space used
Msg 40544 when you run out of space for your max size for your db
Only reads and DELETE/DROP allowed until you free up space Use sys.dm_db_partition_stats to find what is consuming
space Solutions
Increase max size Delete data or drop tables/indexes Partition out database
But…freeing up may not be immediately recognized
Changing MAXSIZE disconnects all users
13
Throttling Limits
• We have a service called a “Watchdog Service” querying the instance for “conditions” to terminate connections to prevent resource problems.
We also call these “Watchdogs alerts”
• We will kill the session with a “reason”. The “reason” is the error message you get• Application gets an error message (high severity) and connection terminated
(KILL/ROLLBACK status)• Sometimes retry works but these usually require some change on your part• throttling_long_transaction in sys.event_log
We monitor all databases and look for conditions to prevent problems
Error Condition40549 Session blocking system task for long period of time (20 secs)
40550 Session is consuming too many locks (1 million)
40551 Session is consuming too much tempdb space (5Gb)
40552 Transaction consuming too much log space or active transaction preventing log truncation
40553 Session consuming memory (16Mb) and there are memory waits (20secs)
Rebuild index Online
14
Engine Throttling This is more of a legacy monitoring method used to keep instances
healthy Another external service monitors the health of the instance and
computer Soft throttling – we have detected a resource issue so pick specific
databases Hard throttling – entire instance at risk so all databases are affected
How it Works Existing requests run to completion New requests for existing connections and new connections may get
Msg 40501 and connection terminated depending on type of request Reason code in Error has more details on soft vs hard, what will be
rejected, and why throttling in sys.event_log
Decode reason codesAnother resource
0x8003
x03 = RejectAllx80 = Hard Throttling on I/O
15
“Not Supported” Errors USE <db> not supported – specify when connecting ALTER DATABASE supported minimally (Ex. Name, Edition,
MAXSIZE, READ_ONLY) All DBCC commands not supported except for DBCC
SHOW_STATISTICS Database scoped DMVs supported Feature Support for Windows Azure SQL Database Unsupported Transact-SQL Statements (Windows Azure SQL Datab
ase)
Partially Supported Transact-SQL Statements (Windows Azure SQL Database)
Demo
Using Event Tables to Troubleshoot WASD Errors
17
WASD and Query Performance
Stick to the basics…..
• Running or waiting? Blocking or CPU?• Is it your application, Windows Azure role, your computer, or
queries?• Is it network latency?• Differences from when “good”? Did the query plan change?• Proper indexes – Avoid scans, large sorts, ….• Auto create and Auto update stats on by default
There are methods to optimize performance specific to Azure
• Windows Azure SQL Database and SQL Server -- Performance and Scalability Compared and Contrasted
• Inevitably you may have to shard your data• “Chatty” applications don’t usually perform well• Avoid large result sets• Application problems may show up earlier on this platform (Ex.
Transaction keeping the log from being truncated)
18
WASD Performance Scenarios
Interesting Performance
ScenariosOn-premise clients may
see higher ASYNC_NETWORK_IO
waits
Small transactions may result in WRITELOG and
SE_REPL* waits
Deadlocks (Msg 1205) just like the “box” – Use sys.event_log to debug
Troubleshooting Query Timeouts
Could just be blocking
Trace your queries so you know which one timed
out
Examine query plan and tune the query/indexes
19
Dynamic Management Views (DMV) for Performance
• Find out currently running requests in your database. Use this to detect blocking
sys.dm_exec_requests
• Find out the performance of queries that have run in your database. Look here for worst performing queries
sys.dm_exec_query_stats
• Display the query plan of a specific querysys.dm_exec_query_plan
• Aggregation history of waits – Some new for WASD
• Only shows any wait_type with count > 0
sys.dm_db_wait_stats
• Could indexes help query performance?“missing index DMVs”
20
A look at WASD Wait Types
Demo
Troubleshooting Query Performance on WASD
22
Watch Out for These
Keep database copies for “user error”
Be careful dropping servers and databases in portal
DML may fail if no clustered index (temp tables excluded)
DMVs are database scoped
Databases have RCSI on by default – tables can be larger
DATETIME in all data centers is stored as UTC time
You may not have access to objects that appear in catalog views
Non-supported or partial supported commands/featuresSystem Views Unique to WASD
23
Before you contact support
Check the Azure forums: MSDN or stackoverflow
Check the service dashboard
Is it Windows Azure? On-premise problem?
Have exact error message(s) available
Have TracingID available
Do you know the query?
Do you have application retry logic?
Give us the date and time of issue with “observed” timezone
Is this happening now or in the past?We can do RCA but….
It can take some time and we may not have enough history
24
References
Retry Logic for Transient Failures in Windows Azure SQL Database
Error Messages (Windows Azure SQL Database) Windows Azure SQL Database Performance and Elast
icity Guide
Windows Azure SQL Database Connection Management
sys.event_log documentation CSS SQL Escalation Blog Troubleshoot and Optimize Queries with Windows Az
ure SQL Database
Questions?
Thank you!
http://sdrv.ms/Zqdkex
26
The Troubleshooting Checklist
Does the Windows Azure Portal work and list your databases?
Is there a dashboard posting for an outage in your region?
Does the SQL Management Portal work? Does SQL Server Management Studio work? Is there an internet provider issue? Is your firewall configuration correct? Is the problem Windows Azure vs WASD? Is there blocking? Are your queries and index tuned? Is this really an application retry issue? Governance, quotas, limits, and throttling are “part
of this platform” Have you looked at Event Tables?