
©2009 HP Confidential

Remember the NonStop Fundamentals?

Iain Liston-Brown

HP NED EMEA Presales Consulting

Agenda

• Customer requirements and design goals
• Architectural overview
  – hardware
  – system interconnect
  – operating system
  – higher-level software
• Self-healing HP NonStop® servers
• Benefits of HP NonStop

OLTP and self-service application and system profile

• Application is transactional
• Application is complex
• Growth of the application is unpredictable and explosive
• Customers will take their business to competitors if the system is down
• Company will lose revenue and market capitalization if the application is not available
• Any corruption of data will be disastrous for my business and my customers
• Secure e-business transactions are required

• 24 x 7 availability
  − avoid unplanned and planned outages, and shorten recovery time
  − end-to-end availability
  − a culture of 24 x 7 support
• Scalability
  − ability to scale without planned outages
  − scalability in multiple dimensions: processors, database, and software
• Complexity
  − transactions
  − mixed workloads
  − intense real-time database access

Design goals for HP NonStop™ servers

• Continuous availability
• Data integrity
• Linear scalability
• Open access
• Parallelism
• Distributed systems management

Availability

• It’s the user’s point of view that counts
  − is the application available to me?
  − is its response time acceptable?
  − is my data correct and consistent?
• Availability requires an end-to-end perspective
  − environment
  − power
  − network
  − … not just the systems
• The customer’s opinion: "The @#$*& system's down again!"

There’s more to availability than fault tolerance

• Data integrity
• Dynamic system configuration
  − adding processors, direct access storage devices, and network components
• Online manageability
  − database services: backup, reorganization, database maintenance, recovery, cache configuration
  − transaction services
  − networking services
• Availability services
• Disaster protection and recovery solutions

[Figure labels: faults; planned outages]


NonStop Architecture


“If anything can go wrong, it will.”
Captain Edward A. Murphy, US Air Force, Project MX981, North Base, Edwards Air Force Base, 1949

“Murphy was an optimist!”
O’Toole’s Commentary

“Expect that hardware and software will fail – always be prepared to handle such situations.”
HP NonStop Division Design Rule


NonStop Server Design Goals “out of the box”

• Continuous Availability
  − Hardware fault tolerance
  − Software fault tolerance
  − Disaster planning
  − Automated operations
  − Online repair
  − Online upgrades
  − Online reconfiguration
• Data Integrity
  − Mirrored disks
  − Security
  − Fail fast
  − Error checks
• Open Interfaces
  − Open System Services (OSS): POSIX.1, POSIX.2
• Performance and Scalability
  − Massive parallelism
  − Linear scalability
  − Parallel software

FUNDAMENTALS


NonStop™ server availability features

• Shared-nothing, loosely-coupled architecture

• N+1 redundancy

• All elements do real work all of the time

• Fail-fast design philosophy to minimize failure scope

• Software fault tolerance provided by process pairs and checkpointing

• Hardware fault tolerance built into every aspect of the system

• Takeover instead of failover

ServerNet


NonStop™ server architecture: integrated fault-tolerant hardware & software

[Figure labels: FC adapters on ServerNet expansion boards; two sets of lock-step CPUs, each with memory and a ServerNet Transfer Engine; ServerNet X and ServerNet Y fabrics; process pair $X_P and $X_B; communications or external I/O]

Connections to other ServerNet® routers create two fabrics for fault tolerance


Technology unique to HP NonStop server: reliability of communications

• Self-checked, shared-nothing redundant hardware
• Message-based OS
  – process pairs
  – software fault tolerance
  – transaction support
  – distributed single system
• Cluster-aware file system
• Fault tolerant parallel database
• Application server TP monitors



Hardware + Software fault tolerance

[Figure labels: SCSI adapters on ServerNet® expansion boards; two sets of lock-step CPUs, each with memory and a ServerNet Transfer Engine; ServerNet X and ServerNet Y fabrics]

– Process failure: backup process takes over
– Processor failure: other processors take over application and I/O operations
– Disk controller failure: reroute access to disk
– Disk drive failure: reroute access to mirror disk
– Fabric failure: reroute data packets


Data Integrity

[Figure labels: ServerNet; disk volume $X; four processors, each with SNet Interface, Memory, Core 0 and Core 1]

Note: modular construction means disks are isolated from CPUs and can have primary & backup DP2 in any CPU

– ECC on cache, memory and bus; parity checks on DIMMs and chipset; end-to-end CRC on PCIe; ServerNet™ Transfer Engine in each processor
– Fail-fast design philosophy
– Checksums for I/O and messages
– Dual or quad disk controllers
– Duplicate data written through alternate paths
– Checksum written on disk
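The last two items (duplicate writes and a checksum stored with the data) can be illustrated with a small sketch. It is a toy model, not the DP2 disk process: the `MirroredVolume` class, its in-memory "drives" and the use of `zlib.crc32` as the checksum are all assumptions made for illustration.

```python
import zlib


class MirroredVolume:
    """Toy model of a mirrored disk pair; each block is stored with a CRC-32."""

    def __init__(self):
        # Two dictionaries stand in for the primary and mirror drives (assumption).
        self.drives = [{}, {}]

    def write(self, block_no, data):
        record = (data, zlib.crc32(data))
        for drive in self.drives:            # duplicate the write on both drives
            drive[block_no] = record

    def read(self, block_no):
        for drive in self.drives:            # try the primary first, then the mirror
            data, checksum = drive[block_no]
            if zlib.crc32(data) == checksum:
                return data
        # Fail fast: never return data that cannot be verified.
        raise IOError(f"block {block_no}: both copies failed the checksum")


vol = MirroredVolume()
vol.write(7, b"account=42;balance=100")

# Simulate silent corruption on the primary drive only.
data, checksum = vol.drives[0][7]
vol.drives[0][7] = (b"account=42;balance=999", checksum)

recovered = vol.read(7)   # served from the mirror after the primary fails its check
```

The read path refuses to return unverifiable data, which is the fail-fast idea applied to storage: a detected error is contained rather than passed on to the application.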


NonStop Software Stack vs. build your own

[Figure: the same software stack compared for commercial server systems and NonStop systems. Layer labels: Hardware platform; Operating system; System/network management; Applications and solutions; DBMS/TP software; Clustering architecture; Middleware]

NonStop: less integration, less testing, less management, less complexity


Message Based Operating System using NB50000c

[Figure labels: four processors on ServerNet, each with SNet Interface, Memory, Core 0 and Core 1; $app; File system; Msg system; $data (shown twice)]


Software Fault tolerance using NB50000c

[Figure, shown in two steps: processors on ServerNet, each with SNet Interface and Memory, some with Core 0 and Core 1; process pair $X and $X’]

Remaining CPUs assume workload and continue to run without restart!


ServerNet architecture: scalable, self-checking, self-healing

– Scalability: from small systems to thousands of processors and I/O devices
– Single interconnect technology for inter-processor communication and I/O
– No idle backup resources
– Interconnect technology embedded in ASICs
  • minimal software stack overhead
  • more cycles available for applications
– Fault tolerance and fault isolation built in
– Data integrity built in
– Message content guaranteed
  • message integrity assured via end-to-end 32-bit cyclic redundancy check
  • CRC and routing info checked at every link
  • transparent detection, isolation and repair
– Message delivery guaranteed
  • automatic link check
  • automatic resend if no end-to-end ack received
  • link-level and end-to-end flow control
  • automatic takeover by other fabric
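The content and delivery guarantees listed above can be sketched in a few lines. ServerNet does this in hardware; the code below only mirrors the idea in software. The function names, the retry count and the use of `zlib.crc32` (which may not be the polynomial ServerNet uses) are assumptions for illustration.

```python
import zlib


def frame(payload: bytes) -> bytes:
    """Append a 32-bit CRC so the receiver can verify the payload end to end."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def unframe(packet: bytes) -> bytes:
    """Return the payload, or raise ValueError if the CRC does not match."""
    payload, crc = packet[:-4], int.from_bytes(packet[-4:], "big")
    if zlib.crc32(payload) != crc:
        raise ValueError("CRC mismatch")
    return payload


def send(payload, fabrics, retries=2):
    """Resend on a fabric until verified; move to the next fabric if it keeps failing."""
    for name, link in fabrics:
        for _ in range(retries):
            try:
                delivered = unframe(link(frame(payload)))
            except (ValueError, ConnectionError):
                continue                     # no valid acknowledgement: resend
            return name, delivered
    raise ConnectionError("no fabric delivered the message")


def broken_link(packet):
    raise ConnectionError("fabric X down")


def healthy_link(packet):
    return packet                            # delivers the packet unchanged


# Fabric X is down, so the message is taken over by fabric Y.
used, received = send(b"debit 100", [("X", broken_link), ("Y", healthy_link)])
```

The sender never sees the failure of fabric X; it only learns which fabric carried the message, which is the "takeover by other fabric" behaviour in miniature.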


Fail Fast: Keeping Faults from Propagating

[Figure: exposure plotted against time after a fault occurrence. Growing exposure means increasing probability of application failure and longer recovery times; on HP NonStop the error is contained.]

• Detection: immediate
• Containment: faulty element
• Recovery: employ alternate resources, paths
• Repair: Availability = f(MTTR, ...)

HP NonStop OS

– Designed for software fault tolerance
– Extensive built-in checking
  • software “fail fast” if ability to recover is in question or there is a significant risk of corrupting data
– Independent operating system instances in each processor that monitor each other
  • I’m alive (heartbeat messages)
  • regroup algorithm to reliably reestablish a quorum in case of failure or when resources are added or removed
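A minimal sketch of heartbeat monitoring follows. It is not the NonStop regroup algorithm, whose details are not given here: the timeout value, the function names and the simple-majority quorum rule are assumptions chosen to show the shape of the idea.

```python
def surviving_members(last_heartbeat, now, timeout):
    """Return the CPUs whose most recent heartbeat arrived within `timeout` seconds."""
    return sorted(cpu for cpu, seen in last_heartbeat.items() if now - seen <= timeout)


def has_quorum(members, total):
    """Simple majority rule (an assumption, not the actual regroup criterion)."""
    return len(members) > total // 2


# CPU number -> time its last "I'm alive" message was received (hypothetical values).
last_heartbeat = {0: 10.0, 1: 10.2, 2: 7.1, 3: 10.1}

members = surviving_members(last_heartbeat, now=10.5, timeout=2.0)   # CPU 2 is dropped
quorum = has_quorum(members, total=len(last_heartbeat))
```

Here CPU 2 has been silent for longer than the timeout, so the remaining three processors form the new membership and still hold a majority.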


Why Fast fail?

– Ensuring problems do not impact data integrity is key!
  • The more critical the nature of the application, the more important this becomes
– Problems happen, it is a fact of life!
– If you let problems go undetected or uncontained then recovering a database can take days or weeks!
– NonStop has the ability to check hardware components through error checking and correction routines end-to-end in the architecture, and Fast fail
– NonStop’s HW and SW fault tolerance allows the system to “Fast fail” a CPU to contain a potential data integrity issue
  • The HW & SW fault tolerance enables the system and application to keep running, and the SW fault tolerant “takeover” makes this transparent to the application
  • Basically, unlike other systems, NonStop does Fast Fail “because we can”; it’s a philosophy to contain and avoid problem perpetuation
– Repairing a HW component is simpler and quicker (a few minutes) than recovering a corrupt database (hours or days)
  • Whilst a database is “down” your system is down!


NonStop™ Kernel

– Each processor has a separate copy of the OS image for fault isolation
  • applications have own address spaces
  • communication is via messages
– Resources are implemented as processes, usually as process pairs
  • routing abstraction implements single system image
  • global update algorithm keeps routing tables in sync

[Figure labels: app; open(deviceA); OS; NSK message system; NSK checkpointing (file state, memory, application context, restart information); replay manager; devA primary and devA backup; CPU 0 . . . CPU 15]


Process pairs

[Figure labels: app; open(deviceA); OS; NSK message system; NSK checkpoint system (file state, memory, application context, restart information); replay manager; devA primary and devA backup; CPU 0 . . . CPU 15]

– During normal operation
  • primary process sends backup small amounts of critical state for use during recovery (restart context for request)
  • file and network open context kept in sync between primary and backup
– Backup process takes over when primary process fails
  • transparent to application
  • OS hides the redirection of application request to the backup service
– Takeover usually succeeds
  • deterministic software bugs can result in complete failure of the pair
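The checkpoint-and-takeover behaviour of a process pair can be sketched as below. This is an illustration only, not NonStop Kernel code: the `Process` and `ProcessPair` classes, the balance example and the use of a request id as restart context are assumptions.

```python
class Process:
    """One half of a process pair; holds only the state it has been given."""

    def __init__(self):
        self.balance = 0
        self.last_request = 0      # restart context: highest request id applied

    def checkpoint_from(self, other):
        self.balance, self.last_request = other.balance, other.last_request


class ProcessPair:
    def __init__(self):
        self.primary, self.backup = Process(), Process()

    def handle(self, request_id, amount):
        if request_id <= self.primary.last_request:
            return self.primary.balance          # retried request: do not reapply
        self.primary.balance += amount
        self.primary.last_request = request_id
        self.backup.checkpoint_from(self.primary)   # send critical state to the backup
        return self.primary.balance

    def fail_primary(self):
        # Takeover: the backup becomes the primary and a fresh backup is started.
        self.primary, self.backup = self.backup, Process()
        self.backup.checkpoint_from(self.primary)


pair = ProcessPair()
pair.handle(1, 100)
pair.handle(2, -30)
pair.fail_primary()                  # the primary's CPU is lost
after_retry = pair.handle(2, -30)    # the client retries request 2; it is not applied twice
after_new = pair.handle(3, 5)
```

Because the backup already holds the restart context, the retried request is recognised and the caller sees a consistent balance (70, then 75) with no restart of its own.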


Failover vs. process pairs

– Failover (HA)
  • application is aware of failure and must restart
  • relies on DBMS to recover DB and transactions
  • may take many minutes for application recovery
– Process pairs (FT)
  • hide the failure from the application
  • rely on NonStop™ SQL and TM/MP to recover DB and transactions
  • usually take very few seconds for application recovery

Takeover time: hardware fault tolerance, instantaneous; process pairs (sw ft), seconds; cluster failover, minutes


NonStop SQL DBMS Integration

[Figure labels: applications; ODBC/MX, JDBC/MX, etc.; NonStop™ SQL/MP; NonStop data manager]

– One data access manager / disk
  • process pair with helper processes
– SQL access methods
– Data aggregations
– Data functions
– Mixed workload management
– Locking, concurrency control
  • no distributed lock manager
– Transaction support
– Audit log management


NonStop™ SQL database manageability*

Online data and index create/load, data distribution (split, merge, add, drop), data reorganization, cache adjustments, update statistics, backup, restore, recovery

[Figure: a table partitioned by key range (A–J, K–R, S–Z), shown once for parallel operations and once for online operations running alongside updates]

* Full read/write access during these operations


Transaction management: NonStop™ TMF

[Figure labels: $app; audited/nonaudited files; unaudited volumes; audited volumes; audit trails; disk process; TMF subsystem]

Either all or none of your database updates are written with TMF:

• Begin transaction
• Commit transaction
• Abort transaction
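The all-or-none rule can be demonstrated with any transactional store. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in; it does not use TMF or NonStop SQL, and the `account` table and `transfer` function are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "id INTEGER PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO account VALUES (1, 100), (2, 0)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` between accounts; both updates are committed or neither is."""
    try:
        with conn:   # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, 1, 2, 60)        # commits: balances become 40 and 60
failed = transfer(conn, 1, 2, 60)    # would overdraw account 1, so it is aborted
balances = dict(conn.execute("SELECT id, balance FROM account"))
```

In the second call the credit to account 2 has already been applied when the debit fails; the abort undoes it, so the balances stay at 40 and 60.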


Pathway ACS - Architecture

[Figure: two Pathway domains, %PWYA and %PWYB, spread across CPU 0, CPU 1, CPU 2, CPU 4, …; Client 1, Client 2, Client 3 … Client n; server classes SC1, SC2 and SC3. Component labels: PATHMON, PATHMON backup, Pathcom, SCF, Persistence Manager, ACS, PB, RD (ROUT), CS, BC]

Key
– Initial setup & configuration changes
– Control of the process broker in which CPU
– PB connects client & app server code via ROUT
– Business logic server code which can be replicated for scaling
– Pathway domains

Note: Diagram shows symmetric Pathway domains; domains do not have to be identical


NonStop Server TP middleware

– transparent fault tolerance • continuous availability, process pair protected, instant recovery from HW or SW failures with no programming

– load balancing and scalability • automatic, dynamic, across the entire NonStop Server

– efficient transaction management • single transaction log; fewer log writes

– integrated with NonStop SQL database • single, common transaction manager, middleware leverages the database functionality (e.g. DB pub/sub)

– simpler installation and administration • one log file for the entire system for both TM and database; one install and administration process for the entire system

– standards based • standard tools, no special programming required


Transaction services

[Figure: distributed app servers access a distributed database (several DB partitions) through SQL/MX, with a single transaction log and automatic commit/rollback]

Takes advantage of virtualised, cluster-aware resources without special coding, utilising a single system and database image and a single transaction log for simplicity


NonStop software product ecosystem

[Figure: three-layer product stack framed by “Common Standards” and “Uncommon Advantages”, with Networking, Manageability and Security alongside]

• Application programming models (“Develop”): portable applications using standard APIs and protocols
  – Java (SASH) frameworks, J2EE APIs
  – SOA clients, SOAP/XML
  – HTTP
  – Messaging clients
  – Pathway APIs and protocols
  – SQL clients
  – Standard application development tools (NSDEE)
• Application infrastructure (“Deploy”)
  – NonStop JSP, BEA WLS, SOA services, iTP WebServer, NonStop SOAP, Pathway/iTS, IBM WebSphere MQ Series, NonStop SQL, NonStop Tuxedo, NonStop CORBA
• Platform infrastructure (“Enable”)
  – NonStop TS/MP, NonStop TMF, NonStop RDF: system-wide process and transaction management, business continuity
  – NonStop Kernel and OSS operating system


Disaster Tolerance: Remote Database Facility

[Figure: database updates are replicated over Expand to a remote duplicate database. Labels: Data; Audit log; Mirrored disks; Facilities for database recovery; Hot Standby, take over in minutes]

• Closely integrated with TS/MP auditing
• Extremely fast, unlimited distance
• RDF ZLT for RPO = 0
• Multiple configurations
  – Triple
  – Reciprocal
  – Many to one

Non-audited files instrumented via NonStop AutoTMF, which can also improve physical I/O

NonStop AutoSync replicates text, binary object, configuration and ‘edit’ files over TCP/IP or Expand


NonStop Benefits


NonStop self-healing support

• Close-to-source repair
• Automatic detection of hardware replacement
• Sanity checks before resource reintegration
• Process pairs (Fast-fail enabler)
• Per-processor processes, virtualised resources
• Automatic data repair, repaired HW automatically re-instated
• Automated path failover in storage and comm I/O
• Automatically maintained cluster membership

Fault tolerance and availability out of the box which is simple to manage


NonStop™ Architecture Linear Scalability

– Linear scalability
  • Hardware scalability
    − Processors, I/O bandwidth, interconnect bandwidth, etc.
  • Software scalability
    − Database
    − Transaction monitors
    − Transaction management
    − Networking, etc.
– Benchmark and customer proven!

[Figure: performance plotted against number of processors; annotated "98.8% scalability"]
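The slide does not say how the 98.8% figure was measured. One common way to express such a number is scaling efficiency: measured throughput with n processors divided by n times the single-processor throughput. The sketch below assumes that definition, and its throughput values are hypothetical, chosen only to show the arithmetic.

```python
def scaling_efficiency(throughput_1, throughput_n, n):
    """Fraction of ideal linear speedup achieved with n processors."""
    return throughput_n / (n * throughput_1)


# Hypothetical figures: 1,000 tps on one processor, 15,808 tps on sixteen.
eff = scaling_efficiency(throughput_1=1000.0, throughput_n=15808.0, n=16)   # 0.988
```

Perfectly linear scaling would give 1.0; values just below 1.0 mean that each added processor contributes almost its full capacity.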


LINK Growth 1987 - 2002

[Figure: chart with two vertical axes. Cards / ATMs axis from 0 to 100,000; Transactions axis from 0 to 160,000,000. Series: Cards (Thous), ATMs, Transactions]


Integrity NonStop - a winning cost model

[Figure: bar chart of annual TCO ($,000), scale $0 to $1,000, split into Basic Cost and Application Cost, for IBM AIX, Sun Solaris, Wintel and NS1000 systems]

Source: Standish Group, 2006

Identical ATM application workload running on Wintel, Sun and IBM.

Similar price/performance to clustered UNIX but with much higher service levels and far less complexity


Application Availability as Experienced in Live Production Sites

Based on data from the VirtualADVISOR database maintained by Standish Group, as of September 2007

Platform                         Average         “Best of breed” cluster *)   Downtime per year
Windows                          98.970 %                                     3.70 days
Windows Cluster                  98.733 %        99.873 %                     10.9 hrs. *)
Linux                            99.342 %                                     2.37 days
Linux Cluster                    99.207 %        99.920 %                     6.91 hrs. *)
Unix                             99.765 %                                     20.3 hrs.
Unix Cluster                     99.642 %        99.964 %                     3.11 hrs. *)
Mainframe                        99.958 %                                     3.63 hrs.
MF Cluster (Parallel Sysplex)    99.985 %                                     1.30 hrs.
HP NonStop                       99.999(99) %                                 5 min (3 sec)

*) Assumption: tenfold availability versus the average HA cluster installation, at threefold cost for system management and application maintenance
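The downtime figures follow directly from the availability percentages. The sketch below shows the arithmetic. The 360-day year is an assumption inferred from the numbers (it reproduces the published values, where a 365-day year gives slightly larger ones), and the function name is hypothetical.

```python
HOURS_PER_YEAR = 360 * 24   # assumption: a 360-day year matches the slide's figures


def downtime_hours(availability_percent):
    """Expected downtime per year, in hours, for a given availability percentage."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


unix_single = downtime_hours(99.765)            # about 20.3 hours
mainframe = downtime_hours(99.958)              # about 3.63 hours
nonstop_minutes = downtime_hours(99.999) * 60   # about 5.2 minutes
```

For the cluster rows, the downtime shown corresponds to the “best of breed” percentage rather than the average; for example, 99.873 % gives roughly 10.9 hours.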


“Deploying the NB54000c provides a reliable billing system at a significantly lower cost, so we are able to create new and innovative services faster than our competitors, empowering us to achieve and maintain a competitive advantage. We improve our solutions and services continuously to meet customer requirements. Daiwa looks to NonStop technology to ensure high availability and expandability of mission-critical and large data applications associated with billing, all with zero downtime since our installation in 1997.”

Takeshi Yonehana
Deputy General Manager, Telecom Systems Management Section, System Infrastructures Management Dept
Daiwa Institute of Research Business Innovation Ltd


Summary - NonStop Differentiation

– Several platforms have High Availability “redundancy” at a Hardware Level
– A few systems (e.g. Stratus) have Hardware Fault Tolerance
– ONLY NonStop has Hardware and Software Fault Tolerance as well as online reconfiguration
– Tightly integrated HW & SW Fault Tolerance is what sets NonStop apart and keeps the application and system “UP”
– HW and SW Fault Tolerance allows NonStop to use “Fast Fail” to preserve data integrity
– NonStop middleware and database allow your application to scale from 2 to 4080 CPUs as a single system and database image
– Developing to standard programming models and APIs through common development toolsets and deploying on NonStop middleware inherits availability and scalability without special programming
– TMF and RDF allow you to manage data integrity and disaster tolerance with a single virtual log across any of the middleware and database options
– NonStop Essentials and HP Operations Agents allow common tools and skill-sets such as HP SIM, HP Operations Manager and other management frameworks to be used with HP NonStop
