HealthScores - Cisco - Global Home Page · HealthScores •UnderstandingHealthScores,onpage1...


Health Scores

• Understanding Health Scores

• Understanding Faults

• How Health Scores Are Calculated

• Health Score Use Cases

Understanding Health Scores

ACME's Operations team has been challenged on a regular basis to answer basic questions regarding the current status, performance, and availability of the system they are responsible for operating. To address these challenges, they can now use the Cisco Application Centric Infrastructure (ACI), whose health scores report on status, performance, and availability. While providing such answers might be easy for an independent device or link, this information by itself is of little to no value without additional data on its effect on the overall health of the network. Manually collecting and correlating this information would previously have been a long and tedious task, but with health scores, data throughout the fabric is collected, computed, and correlated by the Application Policy Infrastructure Controller (APIC) in real time and then presented in an understandable format.

Traditional network monitoring and management systems attempt to provide a model of the infrastructure that has been provisioned, and describe the relationship between the various devices and links. The object model at the heart of ACI is inherent to the infrastructure. A single consolidated health score therefore shows the current status of all of the objects including links, devices, their relationships, the real-time status of their utilization, and a quick at-a-glance assessment of the current status of the entire system or any subset of the system. This visibility has a number of practical use cases, and in this chapter we will classify these use cases as reactive and proactive. ACI also provides the flexibility to monitor some aspects of how the health scores are calculated, and how various faults impact the calculation of the health score.

Most objects in the model will have an associated health score, which can be found from the Dashboard or Policy tabs of the object from the GUI. To check the overall fabric health, in the APIC GUI, go to System > Dashboard. You can view the following information:

• The controller health

• Nodes with health less than 99

• Tenants with health less than 99

• A health graph depicting the health score of the system over a period of time


The health graph is a good indication of any system issues. If the system is stable, the graph will be constant; otherwise, it will fluctuate.

All health scores are instantiated from the healthInst class and can be extracted through the API.
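As a minimal sketch of what extracting health scores through the API might look like, the following Python builds a class-level query URL for healthInst and reads the scores out of a JSON response. The /api/node/class/ path, the "imdata" wrapper, and the "cur" and "dn" attribute names follow common APIC REST conventions but are assumptions here, as is the placeholder hostname. The sample payload is invented for illustration (it reuses DNs that appear in this chapter), and authentication is omitted.

```python
import json


def health_class_url(host: str, fmt: str = "json") -> str:
    """Build a class-level query URL for all healthInst objects (assumed path)."""
    return f"https://{host}/api/node/class/healthInst.{fmt}"


def extract_scores(response_text: str) -> dict:
    """Map each health object's DN to its current score.

    Assumes the response wraps objects in an "imdata" list and that each
    healthInst carries "dn" and "cur" attributes.
    """
    payload = json.loads(response_text)
    scores = {}
    for item in payload.get("imdata", []):
        attrs = item.get("healthInst", {}).get("attributes", {})
        if "dn" in attrs and "cur" in attrs:
            scores[attrs["dn"]] = int(attrs["cur"])
    return scores


# Invented sample response, for illustration only.
SAMPLE = json.dumps({
    "imdata": [
        {"healthInst": {"attributes": {
            "dn": "topology/pod-1/node-101/sys/health", "cur": "72"}}},
        {"healthInst": {"attributes": {
            "dn": "uni/tn-pineapple/health", "cur": "100"}}},
    ]
})

print(health_class_url("apic.example.com"))
print(extract_scores(SAMPLE))
```

Keeping the URL construction separate from the parsing makes it possible to test the parsing against saved responses without contacting a controller.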

In a reactive capacity, ACI health scores provide a quick check in which a newly occurred issue instantly results in a degradation of the health score. The root cause of the issue can be found by exploring the faults. Health scores also provide a real-time correlation in the event of a failure scenario, immediately providing feedback as to which tenants, applications, and EPGs are impacted by that failure.

Almost every object and policy has a Health tab. As an example, to check if a specific EPG has faults, in the APIC GUI you can go to Tenants > Tenant > Application Profile > YourProfile > YourEPG. In the work pane, look for the Health tab. You can also access the Health tab under History > Health. This tab shows the affected object and how it is tied into the larger model. By clicking on the +, you can explore the health tree of any affected object or policy to reveal the faults.

Figure 1: Object with a fault


Proactively, ACI health scores can help identify potential bottlenecks in hardware resources and bandwidth utilization, and can support other capacity planning exercises. Operations teams also stand a better chance of identifying issues before they impact customers or users.

Ideally, the health of all application and infrastructure components should always be at 100%. However, this is not always realistic given the dynamic nature of data center environments. Links, equipment, and endpoints have failures. Instead, the health score should be seen as a metric that will change over time, with the goal of increasing the average health score of a given set of components over time.

Viewing a Health Score Using the NX-OS-Style CLI

You can use the NX-OS-style CLI to view the health of specific objects.

To view the health of a tenant:
show health tenant tenant

To view the health of a bridge domain of a tenant:
show health tenant tenant bridge domain bd

To view the health of an endpoint group of an application within a tenant:
show health tenant tenant application app epg epg

To view the health of a leaf:
show health leaf leafnode

The following example views the health of tenant "tenant1":
apic1# show health tenant tenant1
Score  Change(%)  UpdateTS             Dn
-----  -----      -------------------  ------------------------------
100    0          2015-11-13T18:23:14  uni/tn-pineapple/health
                  .767-08:00

The following example views the health of leaf 101:
apic1# show health leaf 101
Score  Change(%)  UpdateTS             Dn
-----  -----      -------------------  ------------------------------
72     10         2015-11-11T12:55:52  topology/pod-1/node-101/sys/health
                  .847-08:00

Viewing a Health Score Using the iShell

You can use the iShell to view the health of specific objects.

To view the health of an APIC:
show health controller ID

To view the health of a switch:
show health switch node

The following example views the health of switch 101:
admin@apic1:~> show health switch 101
Current Score  Previous Score  Timestamp              Dn
-------------  --------------  ---------------------  -------------------
72             65              2015-11-               topology/pod-1/
                               11T12:55:52.847-08:00  node-101/sys/health


Understanding Faults

From a management point of view, we look at the Application Policy Infrastructure Controller (APIC) from two perspectives:

1. Policy controller - Where all fabric configuration is created, managed, and applied. It maintains a comprehensive, up-to-date run-time representation of the administrative or configured state.

2. Telemetry device - All devices (fabric switches, virtual switches, integrated Layer 4 to Layer 7 devices) in a Cisco Application Centric Infrastructure (ACI) fabric report faults, events, and statistics to the APIC.

Faults, events, and statistics in the ACI fabric are represented as a collection of Managed Objects (MOs) within the overall ACI Object Model/Management Information Tree (MIT). All objects within ACI can be queried, including faults. In this model, a fault is represented as a mutable, stateful, and persistent MO.

Figure 2: Fault Lifecycle

When a specific condition occurs, such as a component failure or an alarm, the system creates a fault MO as a child object of the MO that is primarily associated with the fault. For a fault object class, the fault conditions are defined by the fault rules of the parent object class. Fault MOs appear as regular MOs in the MIT; they have a parent, a DN, an RN, and so on. The fault "code" is an alphanumerical string in the form FXXX. For more information about fault codes, see the Cisco APIC Faults, Events, and System Messages Management Guide.


The following example is a REST query to the fabric that returns the health score for a tenant named "3tierapp":

https://hostname/api/node/mo/uni/tn-3tierapp.xml?query-target=self&rsp-subtree-include=health

The following example is a REST query to the fabric that returns the statistics for a tenant named "3tierapp":

https://hostname/api/node/mo/uni/tn-3tierapp.xml?query-target=self&rsp-subtree-include=stats

The following example is a REST query to the fabric that returns the faults for a leaf node:

https://hostname/api/node/mo/topology/pod-1/node-103.xml?query-target=self&rsp-subtree-include=faults

As you can see, MOs can be queried by class and DN, with property filters, pagination, and so on.
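The three queries above share one shape: a DN, a response format, and a subtree-include value. A minimal sketch of building them programmatically follows. The helper name mo_query_url is hypothetical, the option name is taken to be rsp-subtree-include, and "hostname" is the same placeholder used in the examples. No request is sent.

```python
from urllib.parse import urlencode


def mo_query_url(host: str, dn: str, include: str, fmt: str = "xml") -> str:
    """Build a DN-scoped query URL that returns the object itself
    plus the requested subtree information (health, stats, or faults)."""
    params = urlencode({
        "query-target": "self",
        "rsp-subtree-include": include,
    })
    return f"https://{host}/api/node/mo/{dn}.{fmt}?{params}"


# Health and statistics for the tenant named "3tierapp".
print(mo_query_url("hostname", "uni/tn-3tierapp", "health"))
print(mo_query_url("hostname", "uni/tn-3tierapp", "stats"))

# Faults for leaf node 103 in pod 1.
print(mo_query_url("hostname", "topology/pod-1/node-103", "faults"))
```

Using urlencode rather than string concatenation keeps the query string valid if a later filter value contains characters that need escaping.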

In most cases, a fault MO is automatically created, escalated, de-escalated, and deleted by the system as specific conditions are detected. There can be at most one fault with a given code under an MO. If the same condition is detected multiple times while the corresponding fault MO is active, no additional instances of the fault MO are created. In other words, if the same condition is detected multiple times for the same affected object, only one fault is raised while a counter for the recurrence of that fault will be incremented. A fault MO remains in the system until the fault condition is cleared. To remove a fault, the condition raising the fault must be cleared, whether by configuration, or a change in the run time state of the fabric. An exception to this is if the fault is in the cleared or retained state, in which case the fault can be deleted by the user by acknowledging it.
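The "at most one fault per code under an MO" rule can be illustrated with a small in-memory model. This is only a sketch of the behavior described above, not how the APIC implements it: the class and attribute names are hypothetical, the fault code is made up, and the lifecycle states (such as cleared and retained) are omitted.

```python
from dataclasses import dataclass


@dataclass
class Fault:
    code: str             # for example "F1234" (made-up code)
    severity: str
    occurrences: int = 1  # recurrence counter


class ManagedObject:
    """Toy stand-in for an MO that owns fault children keyed by code."""

    def __init__(self, dn: str):
        self.dn = dn
        self.faults = {}  # fault code -> Fault

    def raise_fault(self, code: str, severity: str) -> Fault:
        """Create the fault on first detection; otherwise bump its counter."""
        fault = self.faults.get(code)
        if fault is None:
            fault = Fault(code, severity)
            self.faults[code] = fault
        else:
            fault.occurrences += 1
        return fault

    def clear_condition(self, code: str) -> None:
        """Simplification: drop the fault once its condition clears."""
        self.faults.pop(code, None)


leaf = ManagedObject("topology/pod-1/node-101/sys")
leaf.raise_fault("F1234", "minor")
leaf.raise_fault("F1234", "minor")  # same condition detected again
print(len(leaf.faults), leaf.faults["F1234"].occurrences)
```

Keying the fault children by code is what enforces the single-instance rule in this model; a repeated detection can only find and update the existing entry.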

Severity provides an indication of the estimated impact of the condition on the capability of the system or component to provide service.

Possible values are:

• Warning (possibly no impact)

• Minor

• Major

• Critical (system or component completely unusable)

The creation of a fault MO can be triggered by internal processes such as:

• Finite state machine (FSM) transitions or detected component failures

• Conditions specified by various fault policies, some of which are user configurable

For example, you can set fault thresholds on statistical measurements such as health scores, data traffic, or temperatures.

How Health Scores Are Calculated

Health scores exist for systems and pods, tenants, and managed objects (such as switches and ports), and there is also an overall health score for the entire system. Each health score is calculated using the number and importance of the faults that apply to the object. System and pod health scores are a weighted average of the leaf health scores, divided by the total number of learned end points, multiplied by the spine coefficient, which is derived from the number of spines and their health scores. In other words:


Figure 3: Health Score calculation
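One way to read the description of the system and pod calculation is sketched below in Python. It assumes that each leaf health score is weighted by the number of end points learned on that leaf, and it takes the spine coefficient as an input because its derivation is not spelled out in the text. The function name and the sample numbers are illustrative only.

```python
def pod_health(leaves, spine_coefficient):
    """Estimate a pod or system health score.

    leaves: list of (leaf_health_score, learned_endpoints) pairs.
    spine_coefficient: factor derived from spine count and spine health,
        supplied by the caller in this sketch.
    """
    total_endpoints = sum(endpoints for _, endpoints in leaves)
    if total_endpoints == 0:
        raise ValueError("no learned end points to weight by")
    # Weight each leaf score by its end points (assumed weighting).
    weighted_sum = sum(score * endpoints for score, endpoints in leaves)
    return weighted_sum / total_endpoints * spine_coefficient


# Two leaves: a healthy one with 30 end points, a degraded one with 10.
print(pod_health([(100, 30), (72, 10)], spine_coefficient=1.0))  # 93.0
```

Under this reading, a degraded leaf with few end points pulls the pod score down less than a degraded leaf that carries most of the workload.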

Tenant health scores are similar, but contain the health scores of the logical components within that tenant. For example, a tenant health score is weighted only by the end points that are included in that tenant.

You can see how all of these scores are aggregated by looking at how managed object scores are calculated, which is directly from the faults associated with them. Each fault is weighted depending on its level of importance. Critical faults might have a high fault level at 100%, while warnings might have a low fault level at only 20%. Faults that have been identified as not impacting might even be reassigned a percentage value of 0% so that they do not affect the health score computation.

Luckily, there is really no need to understand the calculations of the health scores to use them effectively, but there should be a basic understanding of whether faults should have high, medium, low, or "none" fault levels. Though faults in ACI come with default values, it is possible to change these values to better match your environment.

Keep in mind that, because of role-based access control, not all administrators will be able to see all of the health scores. For example, a fabric admin will be able to see all health scores, but a tenant admin will only be able to see the health scores that pertain to the tenants to which they have access. In most cases, the tenant admin should be able to drill into the health scores that are visible to them, but it is possible that a fault is affecting more than that one tenant. In this case, the fabric administrator may have to start troubleshooting. The tenant and fabric admins may also see health scores of any Layer 4 through Layer 7 devices, such as firewalls, load balancers, and intrusion prevention/detection systems. These, along with faults within the VMM domains, all roll up into the tenant, pod, and overall system health scores.


For more information on how to use faults, see Troubleshooting Cisco Application Centric Infrastructure at http://aci-troubleshooting-book.readthedocs.org/en/latest/.

Health Score Use Cases

Using Health Scores for Proactive Monitoring

While ACME administrators have traditionally spent a lot of time reacting to issues on the network, ACI health scores will allow them to start preventing issues. Health scores not only act as indicators of faults; they are also baselines against which you can make comparisons later. If you see that one of the leaf switches is at 100% (green for good) one week, and the next week the leaf is showing a warning, you can drill down to see what changed. In this scenario, it is possible that the links are oversubscribed, so it may be time either to move some of the workload to another leaf or to add more bandwidth by connecting more cables. Since it is still only a warning, there is time to resolve the issue before any bottlenecks on the network are noticeable.

The same scenario can be observed with a load balancer or firewall that is getting overloaded. In these cases, adding another load balancer or firewall, or maybe even optimizing the rules, may be needed to make traffic flow more efficiently. As these examples show, this baselining method can be used as a capacity planning tool.
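The baselining idea can be sketched as a comparison between two snapshots of health scores. In the Python below, the snapshot format (a mapping from object name to score), the function name, and the 10-point threshold are all assumptions chosen for illustration; how the snapshots are collected is left open.

```python
def find_regressions(baseline, current, min_drop=10):
    """Return {name: (old_score, new_score)} for every object whose
    health score fell by at least min_drop points since the baseline."""
    regressions = {}
    for name, old_score in baseline.items():
        new_score = current.get(name)
        # Skip objects that are missing from the newer snapshot.
        if new_score is not None and old_score - new_score >= min_drop:
            regressions[name] = (old_score, new_score)
    return regressions


last_week = {"leaf-101": 100, "leaf-102": 98}
this_week = {"leaf-101": 72, "leaf-102": 97}

# leaf-101 dropped 28 points and is worth drilling into;
# leaf-102 moved by a single point and is ignored.
print(find_regressions(last_week, this_week))  # {'leaf-101': (100, 72)}
```

A threshold avoids flagging the small fluctuations that are normal in a dynamic data center, so attention goes to the objects whose health has changed materially.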

Health scores can also be used to proactively monitor your ACI environment by giving other groups visibility of certain components. Since you can export the scores and faults, it is possible to send these notifications to application owners, VMware administrators, database administrators, and so on. This provides monitoring of the environment across the network that has not previously been available and that cannot be retrieved by any other means.

Using Health Scores for Reactive Monitoring

Reactively, health scores can be used to diagnose problems with the ACI fabric. Upon notification that a health score has been degraded, an operator can use the GUI to easily navigate the relationships and faults that are contributing to that health score. Once the root cause faults have been identified, the fault itself will contain information about possible remediation steps.

Most objects will have a Health tab which can be used to explore the relationship between objects, and their associated faults. This provides the ability to "double-click to root cause".
