
SQLCAT: SQL Server 2012 AlwaysOn Lessons Learned from Early Customer Deployments
Sanjay Mishra
Program Manager
Microsoft Corporation

DBI360

Setting the Stage
Assumed prerequisites for this presentation: basic knowledge of
• AlwaysOn Failover Cluster Instances (FCI)
• AlwaysOn Availability Groups (AG)

There is much more to each of these deployments than we can discuss in this session. Come by the SQL Server Technical Learning Center (TLC) / Booth and discuss with us.

Setting the Stage

AlwaysOn ≠ Availability Groups

AlwaysOn = { SQL Server Failover Cluster Instances, Availability Groups }

Availability Groups ≠ Database Mirroring

Key Learnings from Early Customer Deployments
• Windows Cluster
  • is the foundation for HA and DR in SQL Server 2012 AlwaysOn
  • AlwaysOn inherits all “characteristics” of Windows Cluster
• Windows Cluster
  • every single AlwaysOn deployment is a Windows Cluster deployment
• Windows Cluster
  • understand Windows Cluster to successfully deploy, operate, monitor, troubleshoot, and administer AlwaysOn
  • key areas are: quorum model, cluster network communication, DR procedures, cluster.exe, PowerShell
• Windows Cluster
  • ≠ SQL Cluster (SQL Server Failover Cluster Instance)
  • therefore, is NOT necessarily a shared-storage cluster
• Windows Cluster
  • many key enhancements have been made to Windows Cluster specifically for SQL Server 2012 AlwaysOn
    • Asymmetric Disk
    • Node Votes
    • Asymmetric Disk as Quorum resource

Key Learnings from Early Customer Deployments
• Organizational structure
  • Typically, teams and skills are organized into separate groups: SQL Server DBA team and Windows Server Admin team
  • AlwaysOn reaches out beyond the SQL Server DBA
  • DBAs need to work closely with Windows / Network Administration teams
  • Not just for initial deployment, but for troubleshooting and disaster recovery as well
• Historical experience
  • need to unlearn and relearn a few things if you are already experienced with Windows Cluster, but new to AlwaysOn
  • For example, if you haven’t read the Windows Cluster documentation in the last few months, it is worth a re-read now
• New/Different Tools for administration and troubleshooting
  • Windows cluster log
  • Failover Cluster Manager
  • Knowledge of PowerShell and cluster.exe command lines will come in very handy

SQL Server 2012 AlwaysOn Customer Examples

#  Customer              SQL Server 2012 AlwaysOn HA+DR Solution
1  Microsoft IT          Availability Group for HA and DR
2  bwin.party            Availability Group for HA and DR
3  CareGroup             Availability Group for HA and DR
4  ServiceU Corporation  Failover Cluster Instance for local HA + Availability Group for DR
5  Edgenet               Multi-site Failover Cluster Instance (FCI) for HA and DR

Customer: Microsoft IT SAP ERP Deployment

Microsoft IT SAP ERP Usage Statistics
• ~6 TB in a single, central, row/page compressed database
• Database growth around 120 GB/month
• Live in ~85 countries
• 4,000 named GUI users
• ~100K internal web users plus external web users
• Up to 1500+ concurrent users
• 2 million dialog steps per business day
• 240K+ batch job executions per month
• 80+ million transaction steps per month (100+ million during Year End)
• 0.8 seconds user response time
• 99.995% availability since SQL Server 2005
• Database Servers: 4 x 8 cores, 256 GB of memory

http://www.microsoft.com/casestudies/Microsoft-SQL-Server-2012-Enterprise/Microsoft-IT/Microsoft-Ensures-Smooth-Operation-of-ERP-System-and-Cuts-Disaster-Recovery-Time/710000000493

HA/DR Deployment Prior to SQL Server 2012
Database Mirroring for local HA, Log Shipping for DR

[Diagram labels: Primary Site; Production; Witness; Synchronous DBM; Log Shipping; Test; DR Site; SAP Volume Test and Integration System (image of production)]

SQL Server 2012 AlwaysOn Deployment
Availability Group for HA and DR

[Diagram labels: Primary Site; Production; Test; DR Site; Sync; Async; node labels 1, 1, 1, 0; File share for Cluster Quorum; storage: 5+ TB, 7+ TB and 7+ TB on EMC CX3-80 SAN; SAP Volume Test and Integration System (image of production)]

• Production Availability Group on production DBMS cluster
• SAP production CI cluster containing File Share quorum for DBMS cluster
• Test Availability Group on test DBMS cluster
• SAP test CI cluster containing File Share quorum for test DBMS cluster

Customer: bwin.party digital entertainment plc

The System
• Online gaming and gambling
• Real money handling system for bwin.party
• Authoritative system for Responsible Gaming Limitations
• Includes a specialized Data Warehouse
• Multiple databases, and multiple availability groups in the topology
• 4 servers in the topology
  • Each server is hosting the primary replica of an AG, and secondary replica of other AGs
• Focus on 1 AG in this presentation

http://www.microsoft.com/casestudies/Microsoft-SQL-Server-2012/bwin.party/Company-Cuts-Reporting-Time-by-up-to-99-Percent-to-3-Seconds-and-Boosts-Scalability/710000000087

HA/DR objectives
• >99.99% availability in the last years
• >99.99% availability even with maintenance
• RPO: Zero data loss
• RTO: 10 seconds or less
• Plan for the worst case scenario: Loss of a complete datacenter
• Must still be able to do maintenance during the worst case

Deployment Architecture: Prior to SQL Server 2012

Pre-SQL Server 2012 Challenges
• Can’t easily glue together databases that need to run on the same node
• Data Warehouse load restrictions due to limitations in Log Shipping
• Maintaining database mirroring connection strings (failover_partner) in all applications is painful, and in some cases (some 3rd party applications) not even supported

Deployment Architecture: SQL Server 2012 AlwaysOn Availability Groups

Key Points
• Quorum Model: Node and FileShare Majority
• Each node has a vote
• FileShare in 3rd Datacenter
• Automatic Failover between Datacenters
  • Avoiding downtime in case of Datacenter failure
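
The vote arithmetic behind a Node and FileShare Majority model can be sketched as follows. This is a minimal illustration and not the Windows Cluster implementation: the `Voter` type, the `has_quorum` helper, the node names and the two-nodes-per-datacenter split are all hypothetical. The rule modeled is the simple one that the cluster keeps quorum while more than half of all configured votes remain reachable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Voter:
    name: str        # hypothetical label, not taken from the deployment
    site: str        # datacenter hosting this voter
    votes: int = 1   # a node vote can be set to 0 to exclude it from quorum


def has_quorum(voters, failed_sites):
    """Return True if the surviving voters hold a strict majority of all votes."""
    total = sum(v.votes for v in voters)
    alive = sum(v.votes for v in voters if v.site not in failed_sites)
    return alive * 2 > total


# Assumed layout: two nodes per datacenter plus a file share witness in a third.
cluster = [
    Voter("node1", "DC1"), Voter("node2", "DC1"),
    Voter("node3", "DC2"), Voter("node4", "DC2"),
    Voter("fileshare", "DC3"),
]

print(has_quorum(cluster, failed_sites={"DC1"}))         # True: 3 of 5 votes remain
print(has_quorum(cluster, failed_sites={"DC1", "DC3"}))  # False: 2 of 5 votes remain
```

Under these assumptions, losing either datacenter leaves three of five votes, which is why placing the file share in a third location allows automatic failover between the two datacenters.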

Gains
• Faster failover
• Maintenance now is easy to do during a failure condition
• Reduced system load on Primary due to backup offloading
• Ability to run read-only workload on the secondary, and not interfere with OLTP production

Considerations for Migration
• Migration involves other teams, not just the DBA team
• Need to change connection string, as the DBM connection string (with failover_partner) only works with one secondary
• Different machines used different OS versions before. This is no longer possible
• All machines in the topology now need to be in the same Active Directory domain
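
The connection string change mentioned above can be illustrated with a small sketch. It only builds strings and does not connect to anything. The server, listener and database names are hypothetical, and the keyword spelling (`Failover Partner`, `MultiSubnetFailover`, `ApplicationIntent`) follows the .NET SqlClient convention; other drivers spell these keywords differently, so treat this as an assumption to verify against the driver in use.

```python
def dbm_connection_string(principal, mirror, database):
    """Database Mirroring style: the client is told about exactly one partner."""
    return (
        f"Server={principal};Failover Partner={mirror};"
        f"Database={database};Integrated Security=SSPI;"
    )


def ag_connection_string(listener, database, read_only=False):
    """Availability Group style: connect to the listener name, not a server."""
    parts = [
        f"Server=tcp:{listener},1433",
        f"Database={database}",
        "Integrated Security=SSPI",
        "MultiSubnetFailover=True",
    ]
    if read_only:
        # Ask to be routed to a readable secondary, if routing is configured.
        parts.append("ApplicationIntent=ReadOnly")
    return ";".join(parts) + ";"


# Hypothetical names, for illustration only.
print(dbm_connection_string("SQLNODE1", "SQLNODE2", "Payments"))
print(ag_connection_string("ag-listener", "Payments"))
print(ag_connection_string("ag-listener", "Payments", read_only=True))
```

The design point is that the mirroring form names a single partner, while the listener form stays the same no matter how many replicas the Availability Group has.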

Customer: CareGroup Healthcare Systems

CareGroup Healthcare Systems
• Among Top 5 Large Healthcare Systems in the USA
• Four Hospitals located in Boston, MA
• 16,000 Employees
• 146 Mission Critical Clinical Applications
• 2+ Million Patient Medical Records
• Annual Revenue: $2 Billion
• All mission-critical applications are enabled for high availability and DR
• Ranked #1: Most Innovative Healthcare IT nationwide (InformationWeek)

http://www.microsoft.com/casestudies/Microsoft-SQL-Server-2012/Beth-Israel-Deaconess-Medical-Center/Hospital-Improves-Availability-and-Speeds-Performance-to-Deliver-High-Quality-Care/5000000011

CareGroup Database Classification and SLA
• 80+ databases rated “AAA”
  • RPO 0 & RTO 0
  • Standard HA/DR Solution: FCI + AG
  • Storage: Use EMC Clariion SAN with SSD disk
• 300+ databases rated “AA”
  • RPO ≤ 1 hour & RTO 1 hour
  • Standard HA/DR Solution: Hyper-V and AlwaysOn AG
  • Storage: Use EMC Clariion SAN
• Rest of the databases rated “A”
  • RPO & RTO 1 day
  • Selective HA/DR

SQL Server 2012 HA / DR Architecture for “AA” applications

[Diagram: Availability Group BillingSys across the Primary Site and the DR Site. Labels: Windows 2008 R2 Hosts Cluster; Windows 2008 R2 Guest Cluster; Hyper-V Node A; Hyper-V Node B; Node C; Denali_A; Denali_B; Denali_C; Primary; Sync; Async]

• HW & OS Failure Protection
• OS & SQL Failure Protection
• Disk & DB Failure Protection

Customer: ServiceU Corporation, part of the Active Network

ServiceU Solution Overview
• ServiceU provides web-based online scheduling, event management, payment processing, and other services to customers in 15 countries
• Architecture Goals:
  • 99.99% uptime (which means maximum allowable downtime of 52 minutes per year including scheduled maintenance)
  • Security – Level 1 PCI Service Provider
  • Performance
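
The downtime figure quoted for the 99.99% goal follows from simple arithmetic, sketched here. The function name is illustrative, and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_minutes(availability_percent):
    """Maximum yearly downtime, in minutes, for a given availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.9, 99.99, 99.995):
    print(f"{target}% -> {allowed_downtime_minutes(target):.1f} minutes per year")
```

For 99.99% this gives roughly 52.6 minutes per year, consistent with the 52 minutes stated above.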

Architecture Decision Drivers
• Technologies should provide more uptime – even if a few seconds
• Try to eliminate manual intervention
• Eliminate single points of failure
• Keep it simple!! Make sure troubleshooting can be done easily

Approach to High Availability
• Highly trained personnel, extensive monitoring, good documentation, standardization across the enterprise

http://www.microsoft.com/casestudies/Microsoft-SQL-Server-2012/ServiceU/Online-Company-Reduces-Downtime-and-Helps-Its-Customers-to-Improve-Service/4000011506

ServiceU FCI + DBM Solution (Pre-SQL Server 2012)
FCI for local HA, DBM for DR

[Diagram labels: SQL Server 2008 FCI #1 on Windows Server Failover Cluster #1 (Windows Server 2008, SQL Server 2008, Disk Only Quorum); SQL Server 2008 FCI #2 on Windows Server Failover Cluster #2 (Windows Server 2008, SQL Server 2008, Disk Only Quorum); Asynchronous Database Mirroring]

• 3 nodes in each FCI
• SQL Server is available with NO user intervention! (unless there is a disaster)
• “Last Man Standing”
• Disk Only Quorum provides benefits but the quorum disk must be fully protected and always available

ServiceU FCI + AG Solution (SQL Server 2012)
FCI for local HA, AG for DR

[Diagram labels: PRIMARY – SQL Server 2012 FCI #1; SECONDARY – SQL Server 2012 FCI #2; Availability Group (Asynchronous Secondary); Disk Only Quorum]

This is a single Windows cluster instead of a Windows cluster at each site. Asymmetric storage is the key to this architecture.

• Windows Server 2008 and later – support added for Asymmetric Disk Only Quorum
• Must be configured with cluster.exe; not supported in GUI or PowerShell
• Requires testing and thorough knowledge of clustering
• With a primary site loss, getting the cluster online at the remote site involves force quorum, changing to node majority, then disk only
• Allows “Last Man Standing”

Setup for Availability Groups across FCIs

In an FCI + AG setup, the SQL instance names must be unique within the Windows Cluster.

        Site 1  Site 2    Note
WRONG   INST01  INST01    This was correct with FCI+DBM configuration
RIGHT   INST01  DRINST01  This means default file paths are different for data and log files because the instance name is part of the path (discussed below)

In an FCI + AG setup, the data and log file paths should be the same between all instances; by default the instance name is part of the file path, making them different.

                 Site 1                        Site 2
NOT Recommended  F:\MSSQL11.INST01\MSSQL\DATA  F:\MSSQL11.DRINST01\MSSQL\DATA
RIGHT            F:\DATA                       F:\DATA
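
The two setup rules above lend themselves to a simple pre-deployment check. The following is a minimal sketch, not a Microsoft tool: the `Instance` type and `check_fci_ag_setup` function are hypothetical, and the case-insensitive comparison of names and paths is an assumption.

```python
from collections import Counter
from typing import NamedTuple


class Instance(NamedTuple):
    site: str
    name: str       # SQL Server instance name
    data_path: str  # directory holding data and log files


def check_fci_ag_setup(instances):
    """Return human-readable problems; an empty list means both rules hold."""
    problems = []

    # Rule 1: instance names must be unique within the Windows cluster.
    counts = Counter(i.name.upper() for i in instances)
    for name, n in sorted(counts.items()):
        if n > 1:
            problems.append(f"instance name {name} is used {n} times")

    # Rule 2: data and log file paths should match on every instance.
    paths = {i.data_path.rstrip("\\").upper() for i in instances}
    if len(paths) > 1:
        problems.append("file paths differ: " + ", ".join(sorted(paths)))

    return problems


wrong = [Instance("Site 1", "INST01", r"F:\MSSQL11.INST01\MSSQL\DATA"),
         Instance("Site 2", "INST01", r"F:\MSSQL11.INST01\MSSQL\DATA")]
default_paths = [Instance("Site 1", "INST01", r"F:\MSSQL11.INST01\MSSQL\DATA"),
                 Instance("Site 2", "DRINST01", r"F:\MSSQL11.DRINST01\MSSQL\DATA")]
right = [Instance("Site 1", "INST01", r"F:\DATA"),
         Instance("Site 2", "DRINST01", r"F:\DATA")]

for label, setup in (("wrong", wrong), ("default paths", default_paths), ("right", right)):
    print(label, check_fci_ag_setup(setup))
```

Only the last layout, with unique instance names and a shared `F:\DATA` path, returns an empty problem list.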

Customer: Edgenet, Inc.

About Edgenet
• Leader in Data Services, Guided Selling and Marketing Solutions
• Consumers and businesses want details about products. At Edgenet, we organize that product information to increase sales.
• Provide retail applications
  • Help retailers sell configurable products
  • Help consumers compare and purchase the right product for them
• Collect, certify and distribute product data
  • Google Search & Shopping
  • Bing Search & Shopping
  • Retailers
• One of Four Active US GDSN-certified pools
  • Rigorous certification and data quality scoring process

http://www.microsoft.com/casestudies/Microsoft-SQL-Server-2012/Edgenet/Data-Provider-Supports-Growth-and-Gains-Competitive-Advantage-with-Microsoft/4000011528

Edgenet Multi-site FCI Solution
Configuration
• SLA: 99.99% Annual uptime
• Provides high availability and disaster recovery for our data pool applications
  • Near real-time data replication with MSDTC support
  • Additional, read-only secondary to offload Exports & BI Workload
• Software / Hardware
  • SQL Server 2012 Enterprise
  • Windows Server 2008 R2 Datacenter
  • Brocade 5300 - 8 Gb FC Switches
  • EMC Clariion CX4-80
  • EMC RecoverPoint CE – Disk Based Replication
  • NEC Express 5800/A1080a-D GX

Edgenet HA / DR Topology Diagram
Multi-site FCI for HA/DR + AG readable secondary replica

[Diagram labels: Primary Site - Milwaukee; DR Site - Atlanta; WSFC Node A, FCI Active Node; WSFC Node B, FCI Passive Node; WSFC Node C, Availability Group Secondary Replica (Synchronous, Readable); EMC RecoverPoint CE Appliances (at both sites); Hardware Replicated LUNs (at both sites); LUNs for AG secondary; Asynchronous SAN Replication; 300 Mb Ethernet Connection; 850 Miles; subnets 10.10.10.0/24 and 11.11.11.0/24]

Edgenet HA / DR Solution
Cluster, Disk and Instances
• 3 Node Windows Server Failover Cluster
  • 2 Nodes (one at each data center) SQL Stretch Cluster (Multi-site FCI)
    • 850 mi. – Milwaukee to Atlanta
  • 1 Node in the primary DC hosting AG readable secondary
• 4 Clustered SQL instances, 1 Clustered MSDTC
• 11 TB of usable SAN replicated storage – 54 LUNs
• Multi-Subnet (two)
• TempDB on Local Disk
  • Saves money on storage replication licensing
  • Reduces cross-data center storage replication traffic
  • Enables use of local solid state storage to improve performance

… And there is more …
Please come by the booth if you would like a deep dive discussion on any of these or other customer deployments.

Sanjay Mishra
www.sqlcat.com
@sqlcat


© 2012 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.