Grids@Work V
Oracle Coherence for Finance Applications
Ewan Slater
Senior Solution Specialist
EMEA Technology Fusion Middleware

Topics

• Scalability – why do we care?
• Scalability – what’s the problem?
• Traditional approaches and their drawbacks
• The Coherence approach
• What is Coherence?
• Where does Coherence fit?
• How Coherence works
• Using Coherence
• Coherence in Action
• Conclusion
• Q & A

Scalability – why do we care?

IT Initiatives Driving Scalability Demand

• XTP
  • Highest volume, low latency, absolute transactional integrity
• Virtualization
  • Increased demand on data sources
  • Application re-provisioning must occur transparently, without interruption of data access
  • Must handle multiple load increases at the same time
• SOA
  • Increasing common access to resources
  • Sharing access means continuous availability and absolute reliability
• EDA
  • Events driving transactions cause a massive increase in load
  • Pervasiveness drives data need across all systems affected

[Chart: demand and supply of resources over time]

Hardware Capacity Impact

• Compute Power: SMP/Multicore
• Memory Arrives: “In Memory Option”
• Network Speed: GbE/10G/IB
• Storage: Flexibility

The more people have, the more they want!

Enterprise Infrastructure Requirements

• Availability – Continuous
• Reliability – Transactional Integrity
• Scalability – Capacity on Demand
• Performance – Zero Latency

Enterprise Manageability Requirements

• Grid Automation
• Service Level Management
• Application Performance Mgmt
• Provisioning

Software Framework Pressures

• Service Oriented Architecture
• Web 2.0
• Event Driven Architecture
• Extreme Transaction Volumes

Scalability – what’s the problem?

In general, applications don’t scale well…

…what worked fine in development, or for 50 users…

…can’t cope with production demand…

…that increases over time…

Why don’t applications scale?

• Single points of failure (SPOF)
  • Database failure or pause = application failure or pause
  • One server fails, the entire system fails
  • One application or JVM fails, the application fails
• Single points of bottleneck (SPOB)
  • Shared resources
  • The “hub” of hub-and-spoke architectures
  • Heavy database or disk I/O
• Applications are not designed to scale
  • It works in single-user testing on a PC, but will it work in production?
  • Scaling is often an afterthought – “it’s the DBA’s problem”

Scaling the Application Tier: Traditional Approaches

Scale up (or even bigger boxes)

Approach: Scale-Up (“It’s an infrastructure problem”)

How:
• Buy big boxes
• Increase resources (CPU, memory, HDD capacity, speed and network, etc.)
• Buy specialized hardware (Azul, InfiniBand…)

Advantages:
• Simple (overnight)
• No development
• No impact on internal design

Disadvantages:
• Expensive
• Will hit physical limits
• Will have to redesign at limit
• Non-graceful deterioration at limit
• Stop, Add, Restart required to scale

Bigger box = Much Bigger price tag!!!

• High incremental cost
• Wasted capacity

At some point, even the biggest box has its limits!

Stateless application tier (or blame the DBA)

Approach: Stateless Scale-Out (“Push state scale-out into lower Data Source layer”; “It’s the DBA’s problem”)

How:
• Make application stateless (e.g. stateless sessions)
• Use lots of stateless servers
• Use load-balancing
• Use “big” and “scalable” Data Source to ensure application state scale-out

Advantages:
• Easy to develop (not overnight, but relatively simple as no state is managed)
• Scale-out is easy, just add more servers

Disadvantages:
• Only scales to match underlying Data Source performance
• When underlying limit is reached, have to redesign
• Network bottlenecks experienced as data is moved between layers

Performance Bottleneck Between Tiers

A HUGE performance bottleneck:

Volume / Complexity / Frequency of Data Access

[Diagram: Application (Java, object model) ↔ SQL ↔ Database (relational model)]

Performance Bottleneck Between Tiers

Solution: move relevant data to the middle tier

• One solution is to keep the object data in object form in a high-speed distributed memory cache
• Database remains the system of record (persistence)

[Diagram: Java application on several application servers, each holding objects in a memory cache, in front of the relational database]

Caching in the application


Approach: Caching (“Keep recent copies of state”; “We’ll save the DB and DBA by caching”)

How:
• Application keeps local copies (in memory or on local disk) of recently / commonly used state

Advantages:
• Seems simple
• Reduces Data Source and network load
• Significant application performance improvements

Disadvantages:
• Maintaining consistency of data between local and Data Source instances can be difficult
• Requires “messaging infrastructure” to ensure consistency across a cluster (and application development)
• Typically applicable to “read only” applications and not “write a lot” applications
• Easy to get wrong

Local Caching

Can be scaled out… Farm Caching

Inconsistent Local Cache

Farm Caching

• Benefits:
  • Same as Local Cache
  • May now scale out
• Constraints:
  • Same as Local Cache, but now worse: across the Farm!
  • Singularity broken between members (incoherent)
  • Members have own copies of entries
  • No cost savings in making copies to members
  • Cache capacity doesn’t increase with Farm size
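The "incoherent" constraint above can be made concrete with a minimal sketch. It assumes two farm members that each keep a private in-memory copy in front of a shared data source; the class names and the currency-pair key are illustrative only and are not part of any Coherence API.

```java
import java.util.HashMap;
import java.util.Map;

/** One farm member: a private local cache in front of a shared data source. */
final class FarmMember {
    private final Map<String, String> localCache = new HashMap<>();
    private final Map<String, String> dataSource;

    FarmMember(Map<String, String> dataSource) {
        this.dataSource = dataSource;
    }

    /** Read through the local cache; a hit never consults the data source. */
    String read(String key) {
        return localCache.computeIfAbsent(key, dataSource::get);
    }

    /** Update the data source and this member's own copy only. */
    void write(String key, String value) {
        dataSource.put(key, value);
        localCache.put(key, value);
    }
}

final class FarmCachingDemo {
    public static void main(String[] args) {
        Map<String, String> database = new HashMap<>();
        database.put("EUR/USD", "1.3400");

        FarmMember a = new FarmMember(database);
        FarmMember b = new FarmMember(database);

        b.read("EUR/USD");            // b caches the old rate
        a.write("EUR/USD", "1.3500"); // a updates the source and its own copy

        System.out.println("a sees " + a.read("EUR/USD")); // a sees 1.3500
        System.out.println("b sees " + b.read("EUR/USD")); // b sees 1.3400 (stale)
    }
}
```

Nothing tells member b that its copy is out of date, which is why farm caching needs extra messaging infrastructure once writes are involved.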

Scale out the Container (or blame the App Server)

Approach: Use an Application Container (“Our magical clustered container will scale our application infinitely”)

How:
• Believe the vendors & the marketing
• Follow a “scalability paradigm”
• Use a “Clustering Container”
• … It scaled the “Pet Store” linearly, therefore our X application will also scale linearly (where X ≠ “Pet Store”)

Advantages:
• Simple
• Well documented and communicable paradigm
• Easily scale development team

Disadvantages:
• Typically scales in-the-small
• Usually relies on “scale-up” rather than “scale-out”
• Requires specialized skills or products (outside of the standard paradigm) to really scale

Clustering is primarily about High-Availability, not Scalability!

Traditional Scale-Out Approaches…

#1. Avoid the challenge of maintaining consensus
• Opt for the “single point of knowledge”

#2. Have crude consensus mechanisms that typically fail and result in data integrity issues (including loss)

• Client + Server Model (Hub + Spoke)
• Master + Worker Model (Grid Agents)
• Active + Passive (High Availability)

Traditional Scale-Out Consequences…

• Have unbalanced / unfair load and task management
  • Some servers have greater system responsibility than others
• Have Single Points of Bottleneck (SPoB)
• Have Single Points of Failure (SPoF)
• “Micro outages” are magnified as you scale out
• Exhibit strong coupling to physical resources
  • Software completely dependent on individual physical servers
• Require specialized deployment and operation for individual resources
  • Some servers require “special attention” to operate

The Coherence Approach

So how does Coherence solve the problem?

Consensus is the key…

Imagine a team where some members…

• Have a different impression of the actual members of the team

• Allocate tasks and information to their members (from their perspective) but on behalf of the team

• Result?
  • Inconsistent views of team information
  • Without consensus some information will be inconsistent (at best) or be unavailable or lost (at worst / common)

Real Madrid before Capello

Membership Consensus

• Consensus between resources is fundamental to ensure integrity of information (and work) when scaling-out

Real Madrid after Capello

Coherence relies on Consensus

• Traditional scale-out approaches limit
  • Scalability, Availability, Reliability and Performance
• In Coherence…
  • Servers share responsibilities (health, services, data…)
  • No SPoB
  • No SPoF
  • Massively scalable by design
• Logically servers form a “mesh”
  • No Masters / Slaves etc.
  • Members work together as a team

The result?

Oracle Coherence: In-Memory Data Grid

What is Coherence?

(c) Copyright 2007. Oracle Corporation

Oracle Coherence…

• Is an enabling technology that…
  • Allows customers to build bullet-proof applications…
  • And achieve high performance and predictable scalability

Typical Coherence Customers

• Online gaming (e.g. trading system)
• Telcos (e.g. SMS backbone)
• Hospitality (e.g. flight reservation system)
• Insurance (e.g. user profile management)
• Financial Services (e.g. risk engine)
• Public sector (e.g. railway signalling)

Common theme: mission-critical, bullet-proof solutions

• Reliability
• Availability
• Scalability
• Performance

Coherence doesn’t need an app server

There is a .NET client library…and this is pure .NET

…and…

There is a C++ client library…and this is pure C++

Where does Coherence fit?

Look at the shape of the data

Application Layers

• Web Server

• App Server

• DB Server

Network

Data “Shape” across tiers

Web Tier
• Components: Web Cache, Web Servers
• Data shape: HTML data structures in memory
• Web Cache offloads Web Servers and improves network performance via compression

Application Tier
• Components: Application Servers, Coherence
• Data shape: Java data structures in memory
• Coherence caches Java structures in memory; very fast access to Java data in memory across the mid-tier grid

Database Tier
• Components: RAC, TimesTen
• Data shape: SQL data structures in memory
• TimesTen & RAC provide scalability to database data, improving query & transaction write performance

What is Coherence not?

• Plug and play - the application code will need to change.

• A database – persistent data will need to be written to a database (Oracle RAC is often an ideal fit).

• A Transaction Processing Monitor.
• A panacea for:
  • Inadequate hardware
  • Badly written applications
  • Poor database design

How Coherence Works


Coherence Works by Consensus

• Consensus is key
• Communication is more efficient (peer-to-peer)
• No outages for voting (no need – everyone is a peer)
• No SPoF, SPoB
• No need for broadcast traffic (yelling at each other)
• You can do many things once you have “consensus”.

Made possible by TCMP (the “secret sauce”)

Tangosol Cluster Management Protocol (TCMP)

• Coherence’s own protocol between cluster members
• TCMP utilizes UDP
  • Massively scalable
  • Asynchronous
  • Point-to-point
• UDP Multicast is used for:
  • New JVMs to join the cluster automatically
  • Maintaining cluster membership
  • Multicast is not required; it may be disabled with Well Known Addresses (WKA)
• UDP Unicast is used for most communication
  • Very fast and scalable
  • TCMP guarantees packet order and delivery
  • TCP/IP connections do not need to be maintained

Distributed caching for your data…

…and go faster stripes for your data

Hardware implications (Blades not Bludgeons)

Big Iron

• Buy based on predicted growth
• High incremental cost

Low cost clusters

• Buy as you grow
• Small increments at present day prices & clock speeds

Using Coherence

Building an Application

• Developers use the Coherence API to
  • Access Data
  • Listen for Events
  • Query Data
  • Process Data in the Grid

Setting up a grid

• Coherence clusters to form a grid out of the box (OOTB)
• A grid may contain many caches
• A cache structure is defined by a scheme
• Schemes are defined in config files

Distributed Data Management (access)


The Distributed Scheme

(one of many)

In-Process Data Management

Distributed Data Management (update)


Distributed Data Management (failover)


Distributed Data Management

• Members have logical access to all Entries
  • At most 2 network operations for Access
  • At most 4 network operations for Update
  • Regardless of Cluster Size
  • Deterministic access and update behaviour (performance can be improved with local caching)
• Predictable Scalability
  • Cache Capacity Increases with Cluster Size
  • Coherence Load-Balances Partitions across Cluster
  • Point-to-Point Communication (peer to peer)
  • No multicast required (sometimes not allowed)
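The access, update and failover behaviour described above can be sketched as a toy partitioned cache. This is plain JDK code written for illustration, not the Coherence implementation: the partition count, the "next member holds the backup" rule and the full rebalance on failure are all assumptions made to keep the sketch short. One plausible reading of the operation counts is that a read costs a request plus a response (2), while an update costs a request, a copy to the backup, the backup's acknowledgement and a response (4), independent of how many members exist.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy partitioned cache: every entry has one primary and one backup owner. */
final class PartitionedCacheSketch {
    static final int PARTITIONS = 257; // assumed fixed partition count

    /** A storage-enabled member holding primary and backup copies. */
    static final class Member {
        final Map<String, String> primary = new HashMap<>();
        final Map<String, String> backup = new HashMap<>();
    }

    private final List<Member> members = new ArrayList<>();

    PartitionedCacheSketch(int memberCount) {
        if (memberCount < 2) throw new IllegalArgumentException("need at least 2 members");
        for (int i = 0; i < memberCount; i++) members.add(new Member());
    }

    static int partitionOf(String key) {
        return Math.floorMod(key.hashCode(), PARTITIONS);
    }

    Member primaryOwner(String key) {
        return members.get(partitionOf(key) % members.size());
    }

    /** With two or more members the backup always lives on a different member. */
    Member backupOwner(String key) {
        return members.get((partitionOf(key) + 1) % members.size());
    }

    /** Update: write to the primary owner, then copy to the backup owner. */
    void put(String key, String value) {
        primaryOwner(key).primary.put(key, value);
        backupOwner(key).backup.put(key, value);
    }

    /** Access: any member can compute the owner and ask it directly. */
    String get(String key) {
        return primaryOwner(key).primary.get(key);
    }

    /** Lose one member, recover its primaries from backups, then rebalance. */
    void fail(int index) {
        if (members.size() < 2) throw new IllegalStateException("cannot lose the last member");
        members.remove(index);
        Map<String, String> recovered = new HashMap<>();
        for (Member m : members) {
            recovered.putAll(m.backup);  // includes copies of the lost member's primaries
            recovered.putAll(m.primary);
            m.primary.clear();
            m.backup.clear();
        }
        recovered.forEach(this::put);    // simplistic: redistributes everything
    }

    public static void main(String[] args) {
        PartitionedCacheSketch cache = new PartitionedCacheSketch(4);
        for (int i = 0; i < 1000; i++) cache.put("trade-" + i, "payload-" + i);
        cache.fail(1);
        int found = 0;
        for (int i = 0; i < 1000; i++) if (cache.get("trade-" + i) != null) found++;
        System.out.println("entries readable after failover: " + found); // 1000
    }
}
```

Because each member only stores its own share of primaries and backups, adding members adds capacity, which is the contrast with farm caching where every member holds a full copy.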


Data Distribution: Clients and Servers


“Clients” with storage disabled

“Servers” with storage enabled

Near Caching (L1 + L2) Topology
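As a rough illustration of the L1 + L2 idea, the sketch below puts a small in-process LRU map (L1) in front of a larger map that stands in for the distributed cache (L2). It is plain JDK code under stated assumptions: the class name, the capacity parameter and the explicit invalidate call are hypothetical, and the sketch is not thread-safe.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy near cache: bounded local front map (L1) over a shared back map (L2). */
final class NearCacheSketch {
    private final Map<String, String> back;  // L2: stands in for the distributed cache
    private final Map<String, String> front; // L1: small in-process LRU
    int backReads = 0;                       // counts trips to L2

    NearCacheSketch(Map<String, String> back, final int frontCapacity) {
        this.back = back;
        // access-ordered LinkedHashMap that evicts the least recently used entry
        this.front = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > frontCapacity;
            }
        };
    }

    /** Serve from L1 when possible; otherwise fetch from L2 and remember it. */
    String get(String key) {
        String value = front.get(key);
        if (value == null) {
            value = back.get(key);
            backReads++;
            if (value != null) front.put(key, value);
        }
        return value;
    }

    /** Writes go to L2 and refresh the local copy. */
    void put(String key, String value) {
        back.put(key, value);
        front.put(key, value);
    }

    /** Called when another member changes the entry, so L1 does not go stale. */
    void invalidate(String key) {
        front.remove(key);
    }

    public static void main(String[] args) {
        Map<String, String> grid = new HashMap<>();
        grid.put("acct-1", "100");
        NearCacheSketch near = new NearCacheSketch(grid, 2);
        near.get("acct-1");
        near.get("acct-1");
        System.out.println("trips to L2: " + near.backReads); // trips to L2: 1
    }
}
```

The invalidate hook is the part that separates this from the inconsistent local cache shown earlier: the front copy is dropped when the back tier reports a change.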


Observing Data Changes


Parallel Queries


Parallel Processing and Aggregation


Data Source Integration (read-through)


Data Source Integration (write-through)


Data Source Integration (write-behind)
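The three integration styles named on these slides can be contrasted in a short sketch. It is plain JDK code, not Coherence configuration: `BackingStore`, `MapStore` and `StoreBackedCache` are hypothetical names, and the manual `flush()` stands in for whatever timer or thread a real write-behind queue would use.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for the system of record (e.g. a database table). */
interface BackingStore {
    String load(String key);
    void store(String key, String value);
}

/** In-memory store that counts writes, so the demo can observe them. */
final class MapStore implements BackingStore {
    final Map<String, String> rows = new HashMap<>();
    int writes = 0;

    public String load(String key) { return rows.get(key); }

    public void store(String key, String value) {
        rows.put(key, value);
        writes++;
    }
}

final class StoreBackedCache {
    private final Map<String, String> cache = new HashMap<>();
    private final Map<String, String> writeQueue = new LinkedHashMap<>(); // pending writes
    private final BackingStore store;
    private final boolean writeBehind;

    StoreBackedCache(BackingStore store, boolean writeBehind) {
        this.store = store;
        this.writeBehind = writeBehind;
    }

    /** Read-through: a miss is loaded from the store and kept in the cache. */
    String get(String key) {
        String value = cache.get(key);
        if (value == null) {
            value = store.load(key);
            if (value != null) cache.put(key, value);
        }
        return value;
    }

    /** Write-through stores synchronously; write-behind only queues the write. */
    void put(String key, String value) {
        cache.put(key, value);
        if (writeBehind) {
            writeQueue.put(key, value); // repeated updates to one key coalesce
        } else {
            store.store(key, value);
        }
    }

    /** Write-behind flush; returns how many entries were written to the store. */
    int flush() {
        int written = writeQueue.size();
        writeQueue.forEach(store::store);
        writeQueue.clear();
        return written;
    }

    public static void main(String[] args) {
        MapStore db = new MapStore();
        StoreBackedCache cache = new StoreBackedCache(db, true);
        cache.put("acct-1", "100");
        cache.put("acct-1", "90");
        System.out.println("store writes before flush: " + db.writes); // 0
        cache.flush();
        System.out.println("store writes after flush: " + db.writes);  // 1
    }
}
```

The trade-off the sketch exposes is that write-behind takes the data source off the caller's critical path and coalesces repeated updates, at the cost of a window in which the cache is ahead of the store.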


Coherence*Extend

WAN Topology

Oracle Coherence in Action

Example Use Cases

• Mainframe Cost Reduction
  • Caching repeated queries
• Oracle Coherence with Compute Grid
  • Intra-day risk calculation
• Oracle Coherence Cloud
  • Message-based infrastructure replacement
• Eliminating SPoB
  • Trading Exchange Redevelopment

Mainframe Cost Reduction

Taming the MIP Monster

• Retail banking IT provider
  • Supports 400+ banks
  • 4 key systems – repeated queries to mainframe
  • 100,000 queries to mainframe each day
  • Large recurring cost to the business
• Coherence deployed as distributed cache
  • 100,000 queries → 1,600 queries
  • Saving ~€1,000,000 in 1st year

Oracle Coherence with Compute Grid

Compute Grid on Database

Traditional Compute Grid

[Diagram: Grid Manager, Grid Tasks, Grid Applications]

• Emphasis on orchestrating tasks out to compute nodes in grid
• Data set either loaded locally or pulled off of back end data source
• Applications highly customized for grid environment

Great processing scalability with inevitable data bottlenecking

Orchestration can be a point of bottleneck as well

Compute Grid on Data Grid

Oracle Coherence

Oracle RAC

Traditional Compute Grid with Data Scale Out

High Performance Computing (HPC)

[Diagram: Grid Manager, Grid Tasks, Grid Applications]

• Oracle Coherence Data Grid overlay onto Compute Grid
• Compute Grid scale out with data fault tolerance
• Massive persistent scale out with Oracle RAC

Customer Story: Wachovia

Scenario
• Wachovia Investment Bank introducing “Service Oriented Infrastructure (SOI)”
• Requires absolute data availability for complex grid computations

Problem
• Existing compute grid infrastructure suffering from data latency and throughput problems
• Complex calculations so lengthy as to be outdated

Solution
• Data Grid overlay on Compute Grid
• Enables risk calculations to fully utilize the grid hardware by having real-time access to in-memory data as well as parallelization
• Reduced critical risk computation from 50 days to under 1 hour!

Over 300 CPUs in Production!

Oracle Coherence Cloud

The challenge: Scale this...

• Domain: Retail Banking Infrastructure
  • Over 500 banks
  • 100,000+ teller staff desktop applications
  • 10,000+ cash machines (ATMs)
  • 10,000,000’s of Internet banking transactions/day
• Current Infrastructure
  • Java SE based (no J2EE – apart from Servlets)
  • Oracle RAC (not an issue – scaling across a WAN)
  • Messaging (serious challenges)
  • Processing Business Tasks (challenges approaching)
  • 30,000,000+ Business Tasks a day – minimum
    • Must do 100,000,000 per day effortlessly before going live


The challenge continued: Scale this...

• Execution of Business Tasks
  • Account Balance, Credit/Debit, Funds Transfer, Statement Processing, Batch Processing, Payment Processing
  • Tasks arrive from a variety of clients (thin, rich, cross-platform, mainframes...) – variety of languages
• Goal:
  • Tasks are executed by the “cloud”
  • Don’t want to build own “cloud” software
• Their knowledge:
  • Massive experience in scale-out. Could build it themselves, but budget (time/resources/money) will be saved by buying.


The Cloud

Architectural issue: Performance Bottleneck Between Tiers

A HUGE performance bottleneck:

Volume / Complexity / Frequency of Data Access

[Diagram: Application (Java, object model) ↔ SQL ↔ Database (relational model)]

(in some companies, this would be the time to blame the DBA)

Constraints...

• No Single Points of Failure
• No Single Points of Bottleneck
• No Service Registries
• No Masters + Workers
  • Already got one that is partitioned into over 200 separate clusters
• No Manual Partitioning
• Keep everything in memory
• Active + Active Sites
  • Across WAN
• Develop system on a notebook
• Scale to over 500 servers
• No reconfiguration outages
• No byte-code manipulation / proxies
• No Data or Task Loss
  • During failure
  • During server upgrade
  • During scale out
• No Transactions (XA)
• Support multiple versions
• Predictable response times
• Predictable scale-out costs
• Manage via JMX, from any point in the “Cloud”
• Pure Java Standard Edition
• Infrastructure adds a maximum of 3ms latency to tasks
• Integrate with existing applications (Java 1.4.2+)


Approach


• Business Tasks are regular Java objects (POJOs)
• Place Business Tasks into Coherence
  • Coherence dynamically distributes Tasks across the Cluster
  • Tasks are resilient in the Cluster
  • May use “affinity” to ensure related Tasks are processed together
  • Coherence triggers task processing
• Scaling out Coherence = Scaling out Task Processing
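The "affinity" point can be sketched as follows: if a task is routed by a shared business key rather than by its own identifier, all tasks for that key land on the same member and can be processed together. The code below is a plain JDK illustration under assumptions; the `BusinessTask` fields, the account-based affinity key and the modulo routing are hypothetical and do not show how Coherence itself assigns partitions.

```java
import java.util.ArrayList;
import java.util.List;

/** A business task as a plain Java object; field names are illustrative. */
final class BusinessTask {
    final String taskId;
    final String accountId; // affinity key: related tasks share it
    final String action;

    BusinessTask(String taskId, String accountId, String action) {
        this.taskId = taskId;
        this.accountId = accountId;
        this.action = action;
    }
}

/** Toy task grid: each member owns a list of the tasks routed to it. */
final class TaskGridSketch {
    private final List<List<BusinessTask>> members = new ArrayList<>();

    TaskGridSketch(int memberCount) {
        if (memberCount < 1) throw new IllegalArgumentException("need at least 1 member");
        for (int i = 0; i < memberCount; i++) members.add(new ArrayList<>());
    }

    /** Route by the affinity key, not by the task's own id. */
    int ownerOf(BusinessTask task) {
        return Math.floorMod(task.accountId.hashCode(), members.size());
    }

    /** Place the task on its owning member and report which one that was. */
    int submit(BusinessTask task) {
        int owner = ownerOf(task);
        members.get(owner).add(task);
        return owner;
    }

    List<BusinessTask> tasksOn(int member) {
        return members.get(member);
    }

    public static void main(String[] args) {
        TaskGridSketch grid = new TaskGridSketch(8);
        int first = grid.submit(new BusinessTask("t-1", "ACC-42", "DEBIT"));
        int second = grid.submit(new BusinessTask("t-2", "ACC-42", "CREDIT"));
        System.out.println("same member: " + (first == second)); // same member: true
    }
}
```

In this toy model, adding members simply adds more task lists to spread work over, which mirrors the slide's claim that scaling out the grid scales out task processing.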

List of the Performed tests

Scalability Test

Guaranteed Delivery Test

Failover Test

Server Joining Test

Unattended Long Term Test

Results


• While submitting Tasks (regular system load)
  • Test 1: Scale from 1 server to over 400
    • No reconfiguration
  • Test 2: Randomly kill servers
    • No reconfiguration
  • Test 3: Kill 1, 2, 4, 8, 16, 32, 64, 128, 160 servers at once
    • No data loss
• Possible execution capacity of 1,200,000,000 Tasks per day
• Client may reduce current hardware costs by 75%
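To put the capacity figure in perspective, the small calculation below converts the per-day numbers quoted on these slides into per-second rates and headroom factors. It assumes load spread evenly over 24 hours, which real banking traffic will not be, so treat the per-second values as averages only.

```java
/** Back-of-the-envelope arithmetic on the task volumes quoted in the slides. */
final class TaskCapacityMath {
    static final long SECONDS_PER_DAY = 24L * 60 * 60; // 86,400

    /** Average rate per second, rounded down, assuming an even spread over the day. */
    static long perSecond(long perDay) {
        return perDay / SECONDS_PER_DAY;
    }

    public static void main(String[] args) {
        long measuredCapacity = 1_200_000_000L; // tasks/day reported in the results
        long goLiveTarget = 100_000_000L;       // tasks/day required before going live
        long currentLoad = 30_000_000L;         // tasks/day minimum today

        System.out.println(perSecond(measuredCapacity));     // 13888
        System.out.println(perSecond(goLiveTarget));         // 1157
        System.out.println(measuredCapacity / goLiveTarget); // 12
        System.out.println(measuredCapacity / currentLoad);  // 40
    }
}
```

On these figures the tested capacity is roughly 12 times the go-live target and 40 times the current minimum load.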

Eliminating Single Point of Bottleneck

Trading Exchange


• Similar requirements and constraints
  • Order processing (Foreign Exchange)
  • 1,000’s per second (initial) per currency pair
  • No manual partitioning
  • No transactions
  • 10ms max latency for full accept, validate, match, respond
• Achieved with Coherence using BMLs (< 3ms)
• 14 weeks development (start to go live)

Previous Approach (failed to meet SLAs)


Coherence-based Solution


Conclusion

Oracle Coherence…

• Is an in-memory object data grid, providing
  • Scalability
  • Availability
  • Reliability
  • Performance
• Supports many mission-critical apps, especially in Financial Services
• Integrates with and supports other technologies:
  • Compute Grids
  • Database Grids
  • C++, .NET
• Is a key component of Oracle’s XTP platform
