Inside Azure Diagnostics (DevLink 2014)
Inside Azure Diagnostics
Michael S. Collier
Principal Cloud Architect
[email protected]
@MichaelCollier
www.MichaelSCollier.com
COLUMBUS, OH OCTOBER 17, 2014 CLOUDDEVELOP.ORG
Today’s Agenda
1. The need for diagnostic data in cloud applications
2. Data we can monitor
3. Using the Azure Diagnostic Agent
4. Real-world guidance for troubleshooting Azure apps
Successful projects share at least one common trait . . .
Success vs. Failure
node.js C# Java
Agile vs. Waterfall
Diagnostics Data / Telemetry
A True Story
Scenario: 1 week before the production launch date. “Am I ready?”
“Let’s run some tests and look at your logs.”
“Logs? Yeah . . . we really don’t have logs.”
“OH . . .”
“Well, we eventually log any fatal errors, but that’s all.”
“I guess that’s better than nothing.”
“We looked at Azure diagnostic logging but didn’t see much value in it.”
“You’re kidding? Right?”
Scenario
o Determine if solution is production ready
o Deployed as an Azure Cloud Service
o No load tests
o No performance tests
o No unit tests
o Very little instrumentation
We have a problem
(Image: http://www.cutedaily.com/wp-content/uploads/2011/11/shockedbaby.jpg)
Resolution
1. Enable Azure diagnostics
   – Set key performance counters
2. Add logging statements around key functionality
   – Especially external services
3. Test, test, test
4. Analyze
5. Fix it
Instrumentation is more important in “the cloud”
o On-premises applications need good instrumentation too
o Cloud – it matters more!
  o Distributed environments and services
  o Composite applications
  o Reliance on 3rd party vendors . . . such as Microsoft for Azure
  o Highly automated environments
  o Scale out model
  o Massive amounts of data
The Cloud Scales
The Cloud Scales . . . You Do Not
[Diagram: web roles, worker roles]
Diagnostic Data – 4x
Diagnostic Data
What data do you gather today?
Performance Counters
Custom Logs (NLog, log4net, etc.)
IIS Logs
Windows Event Logs
Crash Dumps
Diagnostic Data – Azure Not so Different
Performance Counters
Custom Logs (NLog, log4net, etc.)
IIS Logs
Windows Event Logs
Crash Dumps
Azure Storage
Diagnostic Data Storage

Diagnostic Item                          | Table Name                           | Blob Container Name
Windows Event Logs                       | WADWindowsEventLogsTable             |
Performance Counters                     | WADPerformanceCountersTable          |
Trace Log Statements                     | WADLogsTable                         |
Azure Diagnostic Infrastructure Logs     | WADDiagnosticInfrastructureLogsTable |
Custom Logs (e.g. log4net, NLog, etc.)   |                                      | <custom>
IIS Logs                                 | WADDirectoriesTable*                 | wad-iis-logfiles
IIS Failed Request Logs                  | WADDirectoriesTable*                 | wad-iis-failedreqlogfiles
Crash Dumps                              | WADDirectoriesTable*                 |

* Location of the blob log file is specified in the Container field and name of the blob in the RelativePath field. The AbsolutePath field contains the name of the file as it existed on the role instance.
Diagnostic Monitor Agent
1. Role starts
2. Diagnostic monitor agent starts
3. Diagnostics configured
4. Data buffered locally
5. Data transferred to storage
wad-control-container
o Container in Azure blob storage
Configuration Options
Default Configuration
o Trace logs
o IIS logs
o Infrastructure logs
o No transfer
Imperative Configuration
o OnStart()
o Overrides default
Declarative Configuration
o diagnostics.wadcfg
o Root of worker role or \bin of web role
Imperative
    public override bool OnStart()
    {
        // Create the DiagnosticMonitorConfiguration object to use for configuring the monitoring agent.
        DiagnosticMonitorConfiguration config = DiagnosticMonitor.GetDefaultInitialConfiguration();

        // Performance Counter configuration
        config.PerformanceCounters.DataSources.Add(new PerformanceCounterConfiguration
        {
            CounterSpecifier = @"\Processor(_Total)\% Processor Time",
            SampleRate = TimeSpan.FromSeconds(30)
        });
        config.PerformanceCounters.ScheduledTransferPeriod = TimeSpan.FromMinutes(1);

        // Log configuration
        config.Logs.ScheduledTransferLogLevelFilter = LogLevel.Information;
        config.Logs.ScheduledTransferPeriod = TimeSpan.FromMinutes(1);

        // Event Log configuration
        config.WindowsEventLog.DataSources.Add("Application!*");
        config.WindowsEventLog.DataSources.Add("System!*");
        config.WindowsEventLog.ScheduledTransferLogLevelFilter = LogLevel.Warning;
        config.WindowsEventLog.ScheduledTransferPeriod = TimeSpan.FromMinutes(1);

        // Start the diagnostic monitor with the new configuration
        DiagnosticMonitor.Start("Microsoft.WindowsAzure.Plugins.Diagnostics.ConnectionString", config);

        return base.OnStart();
    }
Impacts local agent only!
Imperative
Deployment ID
Declarative Configuration using Visual Studio
demo
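For readers without the demo, here is a rough sketch of what a declarative diagnostics.wadcfg might look like. It mirrors the settings from the imperative OnStart() example (one processor counter, trace logs at Information, Application and System event logs at Warning). The quota and poll-interval values are illustrative assumptions, not recommendations from the talk, so check the schema that ships with your SDK version before relying on it.

```xml
<?xml version="1.0" encoding="utf-8"?>
<!-- Sketch only: overallQuotaInMB and configurationChangePollInterval are assumed values. -->
<DiagnosticMonitorConfiguration
    xmlns="http://schemas.microsoft.com/ServiceHosting/2010/10/DiagnosticsConfiguration"
    configurationChangePollInterval="PT1M"
    overallQuotaInMB="4096">

  <!-- Trace log statements, transferred to WADLogsTable -->
  <Logs scheduledTransferLogLevelFilter="Information" scheduledTransferPeriod="PT1M" />

  <!-- Performance counters, transferred to WADPerformanceCountersTable -->
  <PerformanceCounters scheduledTransferPeriod="PT1M">
    <PerformanceCounterConfiguration
        counterSpecifier="\Processor(_Total)\% Processor Time"
        sampleRate="PT30S" />
  </PerformanceCounters>

  <!-- Windows event logs, transferred to WADWindowsEventLogsTable -->
  <WindowsEventLog scheduledTransferLogLevelFilter="Warning" scheduledTransferPeriod="PT1M">
    <DataSource name="Application!*" />
    <DataSource name="System!*" />
  </WindowsEventLog>
</DiagnosticMonitorConfiguration>
```

Durations use the ISO 8601 form (PT30S is 30 seconds, PT1M is 1 minute). The file sits in the root of a worker role or the bin folder of a web role.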
1. wad-control-container
   a. Created for each role instance
2. Imperative code
   a. RoleInstanceDiagnosticManager.SetCurrentConfiguration() – update instance’s diagnostics.wadcfg only
   b. DiagnosticMonitor.Start() – impacts current instance only; will not update diagnostics.wadcfg
3. Declarative configuration
   a. Root of worker role or bin of web role
   b. Updates to diagnostics.wadcfg take effect only if the wad-control-container blob has never been updated programmatically.
4. Default configuration
   a. Last resort
   b. Collects, but doesn’t transfer to Azure storage
There’s a Precedence
Problem?
o Deployment Update
  o Change configuration and redeploy package
o Remotely
  o Visual Studio
  o API
  o Cerebrata Azure Management Studio
Update Diagnostic Configuration
On-Demand Transfer
Instruct WAD to transfer specific data sources to storage
Specify which data sources
Specify time range to transfer
Specify a notification queue
Code or API (or tool)
Overwrites current diagnostic configuration
Use sparingly . . . with caution
For more info, see http://mcollier.net/DiagOnDemand
Bonus: Verbose Logging
Additional host-level data – not DiagnosticAgent.exe
WAD*deploymentID*PT*aggregation_interval*[R|RI]Table
Aggregation at 5 minutes, 1 hour, and 12 hour intervals
10 day retention period
Let’s Get Real
o Sample every 1-2 minutes*
o Transfer every 5 minutes*
o Transfer only what is needed
o Azure Diagnostics writes data in 60 second wide partitions
o Too much data could overwhelm the partition
* Don’t take my word for it. You don’t know me. Test and validate for your situation.
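To make the "too much data" point concrete, here is a back-of-the-envelope sketch of how sample rate, counter count, and instance count multiply. TelemetryVolume is a hypothetical helper, not part of any Azure SDK. It assumes one table row per counter sample per role instance, which you should confirm against your own WADPerformanceCountersTable.

```csharp
using System;

// Hypothetical sizing helper; plain arithmetic, no Azure SDK calls.
public static class TelemetryVolume
{
    // Estimated rows per minute across the deployment.
    // Assumption: one row per counter sample per role instance.
    public static double RowsPerMinute(int instances, int counters, TimeSpan sampleRate)
    {
        if (sampleRate <= TimeSpan.Zero)
        {
            throw new ArgumentOutOfRangeException("sampleRate");
        }

        double samplesPerMinute = 60.0 / sampleRate.TotalSeconds;
        return instances * counters * samplesPerMinute;
    }

    // Same estimate scaled to a full day (1440 minutes).
    public static double RowsPerDay(int instances, int counters, TimeSpan sampleRate)
    {
        return RowsPerMinute(instances, counters, sampleRate) * 1440;
    }
}
```

Under these assumptions, 20 instances sampling 10 counters every 30 seconds produce about 400 rows per minute (576,000 per day). Moving to a 2-minute sample rate cuts that to about 100 rows per minute.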
Query Azure Diagnostic Data
demo
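When querying the WAD tables by time, a commonly described approach is to filter on PartitionKey rather than Timestamp. The sketch below assumes the convention that the PartitionKey is "0" followed by the UTC tick count of the minute the data belongs to. WadPartitionKey is a hypothetical helper name; verify the key format against rows in your own tables before depending on it.

```csharp
using System;
using System.Globalization;

// Hypothetical helper for building time-range filters over WAD tables.
// Assumption: PartitionKey = "0" + UTC ticks, rounded down to the minute.
public static class WadPartitionKey
{
    // Builds the partition key for the minute containing the given UTC time.
    public static string ForTime(DateTime utc)
    {
        long ticks = utc.Ticks - (utc.Ticks % TimeSpan.TicksPerMinute);
        return "0" + ticks.ToString(CultureInfo.InvariantCulture);
    }

    // Builds a table storage filter string covering [fromUtc, toUtc).
    public static string RangeFilter(DateTime fromUtc, DateTime toUtc)
    {
        return string.Format(
            CultureInfo.InvariantCulture,
            "PartitionKey ge '{0}' and PartitionKey lt '{1}'",
            ForTime(fromUtc),
            ForTime(toUtc));
    }
}
```

The resulting string can be passed as the filter of a table query in whichever storage client or tool you use. Filtering on the partition key keeps the query within a narrow range of partitions and avoids a scan of the whole table.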
o Two separate channels for telemetry data
o Vital information
  o Application or service failures. Higher level of alerting.
  o Fix and return to “normal” as soon as possible
  o Alert now – email, SMS, dashboard, ninjas from ceiling, etc.
o Day-to-day operational data
  o Root cause analysis
  o How to prevent in the future
  o Azure diagnostics
o Fine tune the alerts – reduce false alarms and noise
Set Priorities
Define Key Metrics
o Compute node resource usage
o Windows Event logs
o Database queries response times
o Application specific exceptions
o Database connection & cmd failures
o Microsoft Azure Storage Analytics
Process for Azure hosted solutions is not that different from traditional, on-premises solutions.
o Log all calls to external services. Challenge an SLA?
o Log details of transient faults
o Partition telemetry data by date (or hour) – reduce impact of data aggregation or reporting
o Use a different storage account!
o Remove old / non-relevant telemetry data
o Apply to development, test, and QA versions – validate performance & ensure telemetry systems operating correctly
Considerations
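One possible way to act on "log all calls to external services" is a small wrapper that times each call and records the outcome. The sketch below uses only System.Diagnostics. ExternalCall and Measure are hypothetical names. It assumes the Azure diagnostics trace listener is configured for the role, so that Trace output ends up with the other trace log statements.

```csharp
using System;
using System.Diagnostics;

// Hypothetical wrapper: times an external call and logs success or failure.
public static class ExternalCall
{
    public static T Measure<T>(string serviceName, Func<T> call)
    {
        Stopwatch timer = Stopwatch.StartNew();
        try
        {
            T result = call();
            Trace.TraceInformation(
                "External call {0} succeeded in {1} ms",
                serviceName, timer.ElapsedMilliseconds);
            return result;
        }
        catch (Exception ex)
        {
            // Record the failure details, then let the caller decide how to react.
            Trace.TraceError(
                "External call {0} failed after {1} ms: {2}",
                serviceName, timer.ElapsedMilliseconds, ex);
            throw;
        }
    }
}
```

A call site might look like `var order = ExternalCall.Measure("OrderService", () => client.GetOrder(id));`, where client and GetOrder stand in for your own code. The recorded durations give you evidence if you ever need to challenge an SLA.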
o Use declarative configuration (diagnostics.wadcfg) exclusively.
o Bring Azure diagnostic data into relational database
  o Easier reporting
  o Periodically fetch from Azure table and insert into SQL Database table. Use PK and keep most recent.
  o Custom code
o Supplement Azure diagnostics with other tools
  o New Relic or AppDynamics
  o Cerebrata Azure Management Studio
  o AzureWatch (Paraleap)
Considerations (cont.)
o Instrumentation and telemetry are key to successful projects
o Cloud metrics similar to metrics for traditional applications
o Be realistic and set priorities
o 3rd party tools can be essential for troubleshooting
Summary
o Diagnostics Configuration Order of Precedence – http://bit.ly/1eomek9
o Use the Azure Diagnostic Configuration File – http://bit.ly/1mVHN3u
o Cloud Service Fundamentals (wiki) – http://bit.ly/1k1YkjI
o Failsafe: Guidance for Resilient Cloud Architectures – http://bit.ly/Q33mkU
o Best Practices for the Design of Large-Scale Services on Windows Azure Cloud Services – http://bit.ly/1qp4omC
Resources
o Multi-part series on Azure diagnostics
o Many other fantastic articles:
  o Getting Started with Azure Search
  o Azure storage queues
  o Cloud Services
  o Automated testing in Azure
Just Azure
www.JustAzure.com
Questions?
Thank You!
Michael S. Collier
Principal Cloud Architect
[email protected]
@MichaelCollier
www.MichaelSCollier.com