Golden Topology and Best Practices 2018 - IBM
Golden Topology and Best Practices 2018 — Simon Kapadia, Developer Portal Security Lead, APIC, IBM Development
Please note
• IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice and at IBM’s sole discretion.
• Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision.
• The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract.
• The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.
• Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user’s job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.
Who am I?
• Currently: Developer Portal Security Lead for API Connect Development
• Previously: Security Lead for IBM MobileFirst Platform Development, EMEA Security Lead for Software Services for WebSphere
• WebSphere Specialist since 1999, DataPower Specialist since 2005, APIC since APIMv3
• 15 years of implementing real-world distributed computing systems for all manner of customers and industries, including “the serious ones” (banks, governments, law enforcement, pharmaceuticals, etc.)
Who are you?
• A technical audience – so this is not a marketing presentation, and not targeted at executives!
• An interested audience – user group membership is optional! This presentation assumes that you will be building an APIC infrastructure and want my thoughts on how to do that.
• An informed audience – I assume that you already know APIC
• Hopefully, a difficult audience – I expect questions, so speak up; call me out on things I don’t explain well or if you disagree with me!
Agenda
Ø Runtime Goals
Ø APIC
Ø High Availability
Ø Disaster Recovery
  Ø Active/Passive
  Ø Active/Active
Ø Distributed Architectures
Ø Final thoughts
Runtime Goals
• What do you actually want to achieve with your topology?
• How much are you willing to pay to achieve it?
• Let’s start with some definitions:
  • HA – High Availability
  • DR – Disaster Recovery
  • CO – Continuous Operation
  • CA – Continuous Availability
HA – High Availability
• Ensuring that the system can continue to process work within one location after routine single component failures
• Usually we assume a single failure
• Usually the goal is very brief disruptions for only some users for unplanned events
CO – Continuous Operations
• Ensuring that the system is never unavailable during planned activities
• For example, if the application is upgraded to a new version, we do it in a way that avoids downtime
CA – Continuous Availability
• High Availability coupled with Continuous Operations
• No tolerance for planned downtime
• As little unplanned downtime as possible
• Very expensive to implement
• Note that while achieving CA almost always requires an aggressive DR plan, they are not the same thing
• Also referred to as “Always On”
DR – Disaster Recovery
• Ensuring that the system can be reconstituted and/or activated at another location and can process work after an unexpected catastrophic failure at one location
• Often multiple single failures (which are normally handled by high availability techniques) are considered catastrophic
• There may or may not be significant downtime as part of a disaster recovery
• This environment may be substantially smaller than the entire production environment, as only a subset of production applications demand DR
• DR Measurements:
  • Recovery Time Objective (RTO) = how quickly service must be restored, with little to no interruption
  • Recovery Point Objective (RPO) = how much data loss is acceptable during recovery
…but but but
• Surely with all this cloud and Kubernetes and Cassandra and other modern words, we don’t have to worry about any of this?
• Right?
• Well, no, you really do!
• You are still running a complex distributed computing system with non-functional requirements
• The key is understanding those requirements and designing an infrastructure which can meet them
• Back to my first questions: What do you actually want to achieve with your topology? How much are you willing to pay to achieve it?
Agenda
Ø Runtime Goals
Ø APIC
Ø High Availability
Ø Disaster Recovery
  Ø Active/Passive
  Ø Active/Active
Ø Distributed Architectures
Ø Final thoughts
So, let’s talk about APIC
• Other presentations have discussed the architecture, the components and how they work together
• I will be presenting deployment options and how the components scale and fail
• Remember the icons on the right
Gateway Instance
Analytics Instance
Portal Instance
Manager Instance
Obligatory IBM APIC 2018.1 Marketing Slide
Create: Automatically create and test APIs to expose data, microservices, enterprise applications and SaaS services.
Secure: Easily apply built-in and extensible policies to secure, control and mediate the delivery of APIs with unmatched scale.
Manage: Rapidly publish, lifecycle govern, socialize, analyze, monitor and monetize APIs with built-in capabilities.
The Scalable Multi-Cloud API Platform
A complete, modern and intuitive API lifecycle platform to create, securely expose and manage APIs across clouds to power digital applications
API Connect V2018.1: Component Scope
• Single Manager per API Connect Cloud, as it is the brain of the API Management system
• Manager can span multiple Availability Zones, giving flexibility in deployment scenarios
• Multiple Portal, Analytics and Gateway Services per Cloud; each is scoped to a single Availability Zone
• API Connect Cloud defined as 1 APIM with N Component Services; most customers have 2+ Cloud environments (Development, Staging, Production, etc.)
What’s an Availability Zone?
− A logical configuration construct
− Can be in one Datacentre or over multiple Datacentres
− Management Service can span multiple availability zones
High Availability in 2018.x
Why is there substantial change with APIC v2018.1?
Major advances in application development, deployment and management have led companies to begin pursuing multi-cloud application strategies.
API Connect 2018.1 ships Kubernetes in OVA installations, giving customers some of the benefits of a cloud-native solution without having to install their own Kubernetes environment. This, however, comes with additional high availability requirements.
API Connect leverages Kubernetes and other underlying technologies (data persistence such as Cassandra) to achieve the scalability and reliability needs of a modern multi-cloud API management platform. For example, if the database within the Portal fails, traffic is directed to the remaining members and the failed node is auto-restarted to support future traffic.
Kubernetes and the underlying APIC component technologies require quorum; without quorum the services will begin behaving abnormally. Quorum requirements are calculated as: node failure tolerance = (N-1)/2, where N is the number of instances or nodes in the cluster (ICP explanation).
***In 2018.x the API Connect team uses High Availability to refer to 3 instance deployments, while in v5 High Availability refers to 2 instances
© 2018 IBM Corporation
High Availability: v5 vs. 2018.x

Gateway
  v5: Cannot dynamically scale; slow upgrade process; gateways reliant on the Manager for gateway configurations; does not require quorum
  2018.x: Dynamically scales; drastically reduced upgrade time; gateways self-manage configurations at cluster level; requires quorum

Analytics
  v5: Bottlenecks the Manager instance; does not promote remote gateway deployments; only 1 Analytics cluster per API Connect Cloud
  2018.x: Separated from the Manager instance; optimized for remote gateway deployments by deploying analytics next to the gateway to reduce latencies; deploy multiple Analytics clusters per API Connect Cloud

Manager
  v5: No true active/active setup; cloud dissociation (split-brain scenarios); impacted by analytics functionality
  2018.x: True active/active cluster configurations; quorum avoids cloud dissociation; better performance and stability with analytics removed

Portal
  v5: Recommended to have 3+ Portal instances; only supports 1 Portal cluster per API Connect Cloud
  2018.x: No changes from v5; deploy multiple Portal clusters per API Connect Cloud
High Availability: Quorum
[Diagram: clusters of increasing size, alternately labelled “Cluster has Quorum” and “Cluster does not have Quorum”]
Odd numbers are better for Quorum!
• Cluster can scale to an even number of nodes under increased load, but it is better to always have an odd number of members
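The (N-1)/2 failure-tolerance rule is easy to check for any cluster size. A minimal sketch (not APIC code, just the arithmetic):

```python
def failure_tolerance(n_nodes: int) -> int:
    """Node failures a quorum-based cluster of n_nodes can survive.

    Quorum needs a strict majority (n_nodes // 2 + 1 members), so the
    cluster tolerates (n_nodes - 1) // 2 failures.
    """
    return (n_nodes - 1) // 2

# Why odd numbers are better: the 4th (or 6th) node adds cost but no
# extra failure tolerance.
print([(n, failure_tolerance(n)) for n in range(1, 8)])
# → [(1, 0), (2, 0), (3, 1), (4, 1), (5, 2), (6, 2), (7, 3)]
```

Note that a 3-node cluster and a 4-node cluster both survive exactly one failure, which is why the slides recommend odd member counts.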
OVA install 2018.1
Management Cluster (shown right)
Kubernetes requires 3 master nodes minimum:
o etcd requires quorum or will lose write abilities
o With no write abilities, it cannot update the API server and may send requests to failed or disconnected nodes
This impacts APIM, Portal, & Analytics OVA installs.
Quorum requirements for Kubernetes master nodes apply to both OVA and container deployments.
Link to k8s docs
[Diagram: three VMs, each running the Kubernetes master components (API Server, etcd, Controller, Scheduler) plus master and worker microservices, fronted by an Ingress]
Agenda
Ø Runtime Goals
Ø APIC
Ø High Availability
Ø Disaster Recovery
  Ø Active/Passive
  Ø Active/Active
Ø Distributed Architectures
Ø Final thoughts
High Availability (HA): Single DC, 3 Instance, HA Deployment
• 3 instances minimum for HA based on quorum
  o (N-1)/2 dictates that this HA setup can handle a single node failure
  o If more than 1 node or 1 instance fails, then the application begins behaving abnormally
• Nodes represent either physical machines or VMs
Data Center 1
Node 1 Node 2 Node 3
**Nodes represent physical machine or VMs
• Scenario depicts either a failure of node 3, a failure of the instances on node 3, or a lost connection from node 3
• Quorum is maintained between the instances running on Nodes 1 & 2
Data Center 1
Node 1 Node 2 Node 3
High Availability (HA): Single DC, 3 Instance, HA Deployment
**Nodes represent physical machine or VMs
Data Center 1
Node 1 Node 2 Node 3
High Availability (HA): Single DC, 3 Instance, HA Deployment
**Nodes represent physical machine or VMs
• Scenario depicts either a failure of nodes 2 & 3, a failure of the instances on nodes 2 & 3, or a lost connection from nodes 2 & 3
• Quorum is lost on Node 1 and thus the API Connect components begin behaving abnormally
• Cluster can scale to an even number of nodes under increased load, but it is better to always have an odd number of members
Agenda
Ø Runtime Goals
Ø APIC
Ø High Availability
Ø Disaster Recovery
  Ø Active/Passive
  Ø Active/Active
Ø Distributed Architectures
Ø Final thoughts
What is a Disaster?
[From a BBC News report on lightning strikes at a Google data centre:]

Some people have permanently lost access to the files on the affected disks as a result. A number of disks damaged following the lightning strikes did, however, later become accessible. Generally, data centres require more lightning protection than most other buildings. Google has said that lightning did not actually strike the data centre itself, but the local power grid, and the BBC understands that customers, through various backup technologies, were able to recover all lost data. While four successive strikes might sound unlikely, lightning does not need to repeatedly strike the same place or the actual building to cause damage. Justin Gale, project manager for the lightning protection service Orion, said lightning could strike power or telecommunications cables connected to a building at a distance and still cause disruptions. "The cabling alone can be struck anything up to a kilometre away, bring [the shock] back to the data centre and fuse everything that's in it," he said.

Unlucky strike
The Google Compute Engine (GCE) service allows Google's clients to store data and run virtual computers in the cloud. It's not known which clients were affected, or what type of data was lost. In an online statement, Google said that data on just 0.000001% of disk space was permanently affected. "Although automatic auxiliary systems restored power quickly, and the storage systems are designed with battery backup, some recently written data was located on storage systems which were more susceptible to power failure from extended or repeated battery drain," it said. The company added it would continue to upgrade hardware and improve its response procedures to make future losses less likely. A spokesman for data centre consultants Future-Tech commented that while data centres were designed to withstand lightning strikes via a network of conductive lightning rods, it was not impossible for strikes to get through. "Everything in the data centre is connected one way or another," said James Wilman, engineering sales director. "If you get four large strikes it wouldn't surprise me that it has affected the facility." Although the chances of data being wiped by lightning strikes are incredibly low, users do have the option of being able to back things up locally as a safety measure.
Some definitions
− Redundancy: The provision of additional or duplicate systems, equipment, etc., that function in case an operating part or system fails, as in a spacecraft.
− Isolated: Separated from other persons or things; alone; solitary
− Independent: Not dependent; not depending or contingent upon something else for existence, operation, etc.
− All of the above are fundamental for effective High Availability and Disaster Recovery
Disaster Recovery Objectives
− Recovery Time Objective
  • How quickly the system will be able to accept traffic after the disaster
  • Shorter times require progressively more expensive techniques
    • e.g., a tape backup and restore is relatively inexpensive
    • e.g., a fully redundant, fully operational data center is very expensive
− One challenge is detection time
  • It takes time to determine you are in a disaster state and trigger disaster procedures
  • While you are deciding if you are down, you are probably missing your SLA
  • Does the RTO include detection time?
Disaster Recovery Objectives
− Recovery Point Objective
  • How much data you are willing to lose when there is a disaster
  • Limiting data loss raises costs
    • e.g., restoring from tape is relatively inexpensive, but you'll lose everything since the last backup
    • e.g., asynchronous replication of data and system state requires significant network bandwidth to prevent falling far behind
    • e.g., synchronous replication to the backup data center guarantees no data loss but requires a VERY fast and reliable network and will significantly harm performance
      • Warning: in turn this results in increased latency, which means capacity must be increased at all layers
Disaster Recovery Objectives
− Most RTO and RPO goals will deeply impact application and infrastructure architecture and can't be done “after the fact”
  • e.g., if data is shared across data centers, your database and application design will have to be careful to avoid conflicting database updates and/or tolerate them
  • e.g., application upgrades have to account for multiple versions of the application running at once, which can affect user interface design, database layout, etc.
− Extreme RTO and RPO goals tend to conflict
  • e.g., using synchronous disk replication of data gives you a zero RPO, but that means the second system can't be operational, which raises RTO
− A zero RTO *and* a zero RPO are mutually exclusive goals
Data Center Utilization Urban Legends
− Legend:
  • Active/Active improves utilization
− Reality:
  • An active/active topology at 40-50% utilization in each DC is equivalent to an active/passive deployment with one DC active at 80-90% utilization and the other passive
  • Running active/active at greater than 50% of total (both data centers) capacity can often result in a complete loss of service when a data center outage occurs
  • Insufficient capacity in the remaining data center to absorb the full load results in:
    • Poor response time (at best)
    • Network and server overload, resulting in a complete crash
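The arithmetic behind that reality check can be sketched with a toy capacity model (illustrative numbers, two equal-sized DCs):

```python
# Toy capacity model for a two-DC active/active deployment.
# Each DC has capacity 1.0; "utilization" is load divided by capacity.

def survivor_utilization(per_dc_utilization: float) -> float:
    """Utilization of the remaining DC after one of two equal DCs fails."""
    total_load = 2 * per_dc_utilization
    return total_load  # the survivor (capacity 1.0) must carry all of it

# Active/active at 45% per DC: the survivor runs at 90% -- degraded but alive.
print(survivor_utilization(0.45))  # → 0.9

# Active/active at 65% per DC: the survivor is asked for 130% -- overload,
# and in practice a likely complete crash.
print(survivor_utilization(0.65))  # → 1.3
```

In other words, the "spare" 50-60% in each active DC is not waste; it is the failover headroom an active/passive pair keeps in one place.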
Agenda
Ø Runtime Goals
Ø APIC
Ø High Availability
Ø Disaster Recovery
  Ø Active/Passive
  Ø Active/Active
Ø Distributed Architectures
Ø Final thoughts
• Services are highly available in Data Center 1
• Data Center 2 is pre-configured and ready to have instances of the services deployed
• Periodic cron jobs are scheduled to back up data from each of the services
  o These backups are synced from DC1 to DC2
• Customer is only paying IBM for DC 1 services
Data Center 1: Active
Node 1
Node 2
Node 3
Data Center 2: Passive
Node 1
Node 2
Node 3
High Availability + Disaster Recovery (DR): 2 DC, Active/Passive, CA Deployment
**Nodes represent physical machine or VMs
Data Center 1: Active
Node 1
Node 2
Node 3
• In the event of DC 1 failure, scripts can be executed in DC 2 that will begin deploying instances of the APIC services
• This gives users an answer to catastrophic events that cause data center failure
• However… DC 2 can only recover the last backup
  • Any data not backed up is lost when failing over to DC 2
• Additionally, the customer would need to wait until the infrastructure and software are ready before operations could resume
Data Center 2: Passive
Node 1
Node 2
Node 3
High Availability + Disaster Recovery (DR): 2 DC, Active/Passive, CA Deployment
**Nodes represent physical machine or VMs
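With this pattern, worst-case data loss is bounded by the backup cadence. A toy sketch of that RPO arithmetic (hypothetical timestamps, not an APIC API):

```python
from datetime import datetime, timedelta

# Periodic backups are synced from DC1 to DC2; on failover, DC2 restores
# the latest synced backup. Anything written after that backup is lost,
# so the worst-case data loss (effective RPO) equals the backup interval.

last_backup = datetime(2018, 6, 1, 12, 0)   # hypothetical timestamps
disaster_at = datetime(2018, 6, 1, 17, 30)

loss_window = disaster_at - last_backup
print(loss_window)  # → 5:30:00 of writes lost on failover

# Check the schedule against an agreed RPO of 4 hours:
rpo = timedelta(hours=4)
print(loss_window <= rpo)  # → False: back up more often, or accept the loss
```

The same check applies to detection time on the RTO side: the clock starts at the disaster, not at the moment someone notices it.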
High Availability + Disaster Recovery (DR): 2 DC, Active/Passive, HA Deployment
IBM Sub-Capacity Licensing
“In the case of a program or system configuration that is designed to support a high availability environment by using various techniques (for example, duplexing, mirroring of files or transactions, maintaining a heartbeat, or active linking with another machine, program, database, or other resource), the program is considered to be doing work in both the warm and hot situation and license entitlements must be acquired.”

Backup | Entitlements Required
Hot    | Yes
Warm   | Yes**
Cold   | No

**Based on the definitions of “Doing Work”, APIC does not qualify for warm passive treatment; entitlements are needed for warm set-ups. See the bottom of page 8 of the IBM Software Licensing Guide.
Agenda
Ø Runtime Goals
Ø APIC
Ø High Availability
Ø Disaster Recovery
  Ø Active/Passive
  Ø Active/Active
Ø Distributed Architectures
Ø Final thoughts
• Can tolerate a node failure, connection drop, or DC failure
• Optionally deploy additional instances in any DC, as long as quorum is maintained in the event of a DC failure
• Cluster can scale to an even number of nodes under increased load, but it is better to always have an odd number of members
• Best option if the operations plan requires data center fault tolerance
Data Center 2: Active
Node 2
Data Center 1: Active
Node 1
Data Center 3: Active
Node 3
Low Latency Network
HA Active/Active 1: Odd Number of DCs (3 DC, Single AZ)
**Nodes represent physical machine or VMs
Data Center 2
Node 1
Node 2
Node 3
• If DC 2 fails:
  o All services maintain functionality
• If DC 1 fails, or its instances are unavailable:
  • Portal would be lost
• APIM is registered to 2 different AZs
• Not the best topology for continuous operation of the Portal
Data Center 1
Node 1
Node 2
Node 3
Low Latency Network
HA Active/Active 2: Multiple Services (2 DC Active/Active, Dual AZ)
**Nodes represent physical machine or VMs
• If DC 2 fails:
  o All services maintain functionality
• If DC 1 fails, or its instances are unavailable:
  • Quorum is lost and thus the cluster begins behaving abnormally
  • Gateways in DC 2 will be able to continue to handle traffic
• Not the best topology for continuous operation
Data Center 1: ActiveMain-Site
Node 1
Node 2
Node 3
Data Center 2: Active
Node 4
Node 5
Low Latency Network
HA Active/Active 3: Main-site (2 DC Active/Active, Single AZ)
**Nodes represent physical machine or VMs
• If DC 2 fails, the connection drops between the data centers, or one of the DCs is lost:
  • Quorum is lost and thus the cluster begins behaving abnormally
  • Gateway can still handle traffic
• This leaves the 3 options above as the best deployment options for data center failure for API Manager & Portal:
  1. 3 DCs
  2. Multiple Services
  3. Main-site
Data Center 1: Active
Node 1
Node 2
Node 3
Data Center 2: Active
Node 4
Node 5
Node 6
Low Latency Network
HA Active/Active: Not Recommended (2 DC, Single AZ)
**Nodes represent physical machine or VMs
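A quick way to see why even splits across two DCs are risky is to simulate losing each DC in turn and check whether the survivors still form a majority. A minimal sketch of that check:

```python
def survives_any_dc_loss(nodes_per_dc):
    """True if quorum survives the loss of ANY single data center.

    nodes_per_dc: list of node counts, one entry per DC.
    """
    total = sum(nodes_per_dc)
    majority = total // 2 + 1
    return all(total - lost >= majority for lost in nodes_per_dc)

print(survives_any_dc_loss([1, 1, 1]))  # → True:  3 DCs, any one can fail
print(survives_any_dc_loss([3, 2]))     # → False: losing the 3-node main site kills quorum
print(survives_any_dc_loss([3, 3]))     # → False: losing either DC kills quorum
```

This is why the 3-DC layout is the only one here that tolerates the loss of an arbitrary data center, and why the main-site pattern only protects against losing the smaller site.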
Networks are inherently unreliable
• Your network people are lying to you if they say otherwise
• Things do go wrong:
  • Routers get misconfigured or have bugs (yes, a router is a computer just like other computers, and they have bugs)
  • People dig holes in the wrong place and cut through your special magical custom-designed dark fiber links
  • Quality of service mechanisms get overwhelmed and misbehave
  • “It’s not a network problem, pings are fine” usually guarantees it really is a network problem!
• How sure are you that your “low latency network” is robust?
A word on Latency…
1Gb vs 10Gb: which is faster?
Both are exactly the same. You cannot change the laws of physics!

Distance             | 0.5 mile | 100 miles | 500 miles | 1000 miles
Estimated Round Trip | 2.1 ms   | 5.4 ms    | 12.5 ms   | 21.35 ms

A 10Gb link handles more traffic than a 1Gb link, but both are equally fast!
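The round-trip estimates above are dominated by distance, not bandwidth. A rough lower-bound calculation (assuming signal propagation at about two-thirds the speed of light in fiber; real networks add router, queuing and path overhead on top, which is why the estimates above are higher):

```python
# Propagation-only round-trip time over fiber, as a physical lower bound.

SPEED_IN_FIBRE_KM_PER_MS = 200.0  # roughly 2/3 of c, a common rule of thumb
KM_PER_MILE = 1.609

def min_round_trip_ms(distance_miles: float) -> float:
    """Theoretical minimum round-trip time for a given one-way distance."""
    return 2 * distance_miles * KM_PER_MILE / SPEED_IN_FIBRE_KM_PER_MS

for miles in (0.5, 100, 500, 1000):
    print(f"{miles:>6} miles: at least {min_round_trip_ms(miles):.2f} ms")
# The floor rises with distance regardless of 1Gb vs 10Gb link speed.
```

Upgrading the link raises throughput, never lowers this floor, so a "<30 ms" latency requirement is ultimately a constraint on how far apart your DCs can be.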
The CAP Theorem
• In a distributed environment, especially one spanning data centers across LANs and WANs, there are three core requirements for a service:
  • Consistency
    • Either the service works or it fails
    • Traditional ACID databases provide consistency and isolation
  • Availability
    • Extremely important in the web business model
    • In a large distributed system, one may have to compromise on consistency for the sake of availability
  • Partition Tolerance
    • Network partitions will happen when not all machines are connected
    • “No set of failures less than total network failure is allowed to cause the system to respond incorrectly” – Gilbert and Lynch
    • Quorum is used to guard against split-brain syndrome
• Brewer’s CAP conjecture states that one can achieve only two, not all three, of the above requirements
Multiple Active DCs and the CAP Theorem
• Active/Active requires you to sacrifice either consistency, availability or partition tolerance
  • All three aren’t possible
• If you choose full availability, then you are going to lose guaranteed consistency
  • So you need to design with this in mind, and build in mechanisms (typically involving queuing technologies) that enable your system to “tend towards” consistency
• Your data is going to be in two places, either partitioned or replicated
  • If the former, what happens when one site is down?
  • If the latter, what happens when users hitting each site see slightly different versions of the current state?
  • These are very complex problems
• Which is why I try to steer customers away from active/active and into an active/passive model with DR from active to passive
  • But they always feel like they are wasting hardware…
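The “tend towards consistency” idea can be illustrated with a toy replication queue (purely illustrative; real systems use message brokers, retries and conflict-resolution policies):

```python
from collections import deque

# Writes are accepted locally first (availability), queued, and applied to
# the second site asynchronously. Until the queue drains, readers at the
# two sites can see different states: eventual, not immediate, consistency.

site_a = {}
site_b = {}
replication_queue = deque()

def write(key, value):
    site_a[key] = value                     # acknowledge immediately
    replication_queue.append((key, value))  # replicate later

def drain():
    while replication_queue:
        key, value = replication_queue.popleft()
        site_b[key] = value

write("plan", "gold")
print(site_a == site_b)  # → False: the sites have diverged
drain()
print(site_a == site_b)  # → True: ...and converged once the queue drained
```

The design question active/active forces on you is what your users see in the window between the first print and the second, and what happens to the queue when the inter-site link is the thing that failed.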
Agenda
Ø Runtime Goals
Ø APIC
Ø High Availability
Ø Disaster Recovery
  Ø Active/Passive
  Ø Active/Active
Ø Distributed Architectures
Ø Final thoughts
Continuous Operation: Distributed Architecture
Requirements for APIC Services Spanning DCs
• Low-latency connection recommended to federate services across DCs
  o Latency of <30 ms round trip
  o True for Analytics, Portal and Gateway Services
• Manager is a singleton that can span AZs, as Cassandra works on eventual consistency
• 2 common patterns in distributed architectures:
  1. Applications/services across multiple clouds and environments
  2. Setting up geographic high availability of applications
DC 2
Node 2
DC 1
Node 1
DC 3
Node 3
Low Latency Network
DC 2
Node 2
DC 1
Node 1
DC 3
Node 3
Low Latency Network
Distributed Architecture Pattern 1: Hybrid Cloud Applications
• Single web & mobile app, with geographically dispersed target services
• Managed from a single API Management layer, optimized for cloud scale
• Remote Gateways deployed next to different components of the application
• Components could also be in the same AZ in different DCs
• Two Analytics options:
  1. Co-locate analytics in the same AZ or cloud to reduce network overhead
  2. Centralize to a single Analytics service
Target Service 1
Icons Represent Clusters
Target Services 2 &3
Target Services 4
Public Cloud On-Prem DCs (US East)
On-Prem DCs (US West)
Load Balancer
• Replicated services across geographic regions
• Co-locate API Gateways with runtimes to reduce application latency
• Client deploys a load balancer to route incoming requests to the AZ best suited to serve each request
Target Service 1
Target Service1
Target Service1
Public Cloud On-Prem DCs (US East)
On-Prem DCs (US West)
Distributed Architecture Pattern 2: Geographic HA of Applications
Icons represent clusters
Example Customer 1
DC 2
Node 2
DC 1
Node 1
DC 3
Node 3
Low Latency Network
DC 2
Node 2
DC 1
Node 1
DC 3
Node 3
Low Latency Network
Customer Requirements:
• Has an internal set of DCs already running internal apps
  o Add an APIM layer to these apps
• Use a 3rd-party cloud to deploy a new set of apps & services for external parties to consume
• No direct access to any components running in the internal network from anyone outside the company
• Separate Portals for internal and external API consumers
• Separate Analytics services for security and network latency reasons
• Single APIM layer for both the internal and externally facing environments
Customer Managed DCs: Internal Traffic
3rd Party Cloud (AWS): 3rd Party Traffic / BP Integrations
Agenda
Ø Runtime Goals
Ø APIC
Ø High Availability
Ø Disaster Recovery
  Ø Active/Passive
  Ø Active/Active
Ø Distributed Architectures
Ø Final thoughts
Am I trying to scare you off?
• I’m pretty sure I am going to be accused by marketing of trying to frighten customers (they haven’t seen these slides yet :))
• This is absolutely not the case.
• The product is designed to work around the issues involved in creating a large-scale distributed environment. We have years of experience in doing this at IBM!
• My point is that doing this requires thought, planning, and treating it as the difficult computing endeavor that it is.
• Focus on your goals – what do you want to achieve?
If you are planning a large deployment…
• We are actively searching for customers who are planning large scale deployments
• We have a Lab Advocacy program which can work with you to bring your feedback to development
• We are constantly striving to improve our products; feedback from customers is essential to that goal!
One thing we don’t talk about enough…
• What about your back-end systems?
• There is limited value in scaling your API gateway to multiple global data centers if they all connect to a single back-end application in one of them!
• Are your back-end systems capable of handling all of the load in the event of a disaster?
• Are all the subsidiary systems APIM relies upon (e.g. LDAP) available in all data centers and capable of handling the load?
• Is your database available and capable of handling everything if things go wrong?
Recovery time and Recovery Point Objectives
• Most RTO and RPO goals will deeply impact application and infrastructure architecture and can't be done “after the fact”
  • e.g. if data is shared across data centers, your database and application design will have to be careful to avoid conflicting database updates and/or tolerate them
  • e.g. application upgrades have to account for multiple versions of the application running at once, which can affect user interface design, database layout, etc.
• Extreme RTO and RPO goals tend to conflict
  • e.g., using synchronous disk replication of data gives you a zero RPO, but that means the second system can't be operational, which raises RTO
• A zero RTO *and* a zero RPO are mutually exclusive goals
Test your Disaster Recovery!
• Have a complete, detailed plan of what to do in a disaster
• Test it! Actually cause a disaster.
• No, OK, I don’t mean blow up a datacentre. That will get you in trouble.
• But you can simulate a disaster. Take a network link down. Pull the plug on a bunch of servers. At random.
• Have someone who doesn’t know the environment at all walk into a datacentre and just start pulling cables! Executives love doing this and it gets you brownie points (as long as everything goes according to plan).
Learn from mistakes
• Mistakes and failures will occur; learn from them
• What separates mediocre organizations from the good and great isn't so much perfection as it is the constant striving to get better – to not repeat mistakes
• After every outage, perform a root cause analysis:
  • Capture diagnostic information
  • Meet as a team, including all key players, to discuss
  • Determine precisely what went wrong
    • Wrong doesn't mean “Bob made an error.”
    • Find the process flaw that led to the problem
  • Determine a corrective action that will prevent this from happening again
    • If you can't, determine what diagnostic information is needed the next time this happens and ensure it is collected
  • Implement that corrective action
    • All too often this last step isn't done
  • Verify that the action corrected the problem
• A senior manager must own this process
Questions?