Post on 31-May-2020

© COPYRIGHT 2014 MARKLOGIC CORPORATION. ALL RIGHTS RESERVED.

5 Fundamental Strategies for Building a Data-centered Data Center

June 3, 2014

Ken Krupa, Chief Field Architect

Gary Vidal, Solutions Specialist

Last generation

Diagram labels: OLTP, Warehouse, Data Marts, Archives, “Unstructured”, Video, Audio, Signals, Logs, Streams, Social, Documents, Messages, Metadata, Search, Reference Data

Summary – The Data-centered Data Center

Elastic: flexible, shared-nothing, scale-out architecture

Cost competitive: low-cost commodity hardware, lower TCO

Converged: single data layer for operational and analytical workloads

Managing data life-cycle in real-time: prioritize your data storage

Governed, not renegade: customizable, transparent, secure

ELASTIC

Data organization in MarkLogic

Data inserted into stands

One stand is in-memory

Many other stands are on disk

A collection of stands is a forest

Each forest is an atomic unit and can be managed and moved
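The relationship between stands and forests can be pictured with a small model. This is only an illustrative sketch, not MarkLogic code: the class names, the `max_in_memory_docs` threshold, and the rule that a full in-memory stand becomes an on-disk stand are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Stand:
    """A batch of documents; in_memory marks the single writable stand."""
    in_memory: bool
    docs: dict = field(default_factory=dict)


@dataclass
class Forest:
    """A collection of stands that is managed and moved as one unit."""
    name: str
    max_in_memory_docs: int = 2  # hypothetical flush threshold
    stands: list = field(default_factory=lambda: [Stand(in_memory=True)])

    def insert(self, uri, content):
        active = self.stands[-1]  # the one in-memory stand
        active.docs[uri] = content
        if len(active.docs) >= self.max_in_memory_docs:
            # Assumption: a full in-memory stand is turned into an on-disk stand.
            active.in_memory = False
            self.stands.append(Stand(in_memory=True))

    def get(self, uri):
        # Search the newest stand first so the latest version wins.
        for stand in reversed(self.stands):
            if uri in stand.docs:
                return stand.docs[uri]
        return None


forest = Forest("forest-in-database")
for n in range(3):
    forest.insert(f"/doc/{n}.xml", f"<doc>{n}</doc>")
```

After three inserts with a threshold of two, the model holds one on-disk stand and one in-memory stand, and the forest as a whole is the unit you would move.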

Servers have Multiple Forests

Scale out

Clustering

Migration

Two forests on one node

Bring a second node online

Replicate a forest

Disable the forest on the original node

Original forest on original node fails over

Enable the original forest as a replica

Migration in one step

$ cat forest-migrate.json
{
  "operation": "forest-migrate",
  "forest": ["forest-in-database", "another-forest-in-database"],
  "host": "destination-host"
}

$ curl --anyauth --user user:password -X PUT -d @./forest-migrate.json \
    -i -H "Content-type: application/json" \
    http://anyhostinthecluster:8002/manage/v2/forests
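For scripted use, the same request can be assembled programmatically, which also avoids hand-editing JSON quotes. The following is a minimal sketch using only the Python standard library; the helper name `build_forest_migrate_request` is hypothetical, the host and forest names are the placeholders from the slide, and the request is built but deliberately not sent (authentication, which curl handles with `--anyauth`, is left out).

```python
import json
import urllib.request


def build_forest_migrate_request(base_url, forests, host):
    """Build (but do not send) the PUT request from the curl example."""
    payload = {
        "operation": "forest-migrate",
        "forest": list(forests),
        "host": host,
    }
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/manage/v2/forests",
        data=body,
        method="PUT",
        headers={"Content-type": "application/json"},
    )


request = build_forest_migrate_request(
    "http://anyhostinthecluster:8002",
    ["forest-in-database", "another-forest-in-database"],
    "destination-host",
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Generating the body with `json.dumps` guarantees straight double quotes, which the endpoint's JSON parser requires.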

Cluster topologies XA

RDBMS

Knowing where you’re going - and where you’ve been

Business context

What are your SLAs?

How many requests per second does the application have to support?

How will the business grow?

What will drive growth - and how fast will it go?

As-Built Capacities

How does your system perform under different usage profiles (e.g., QPS tests)?

How often do you hit the cache?

What is your peak storage I/O?

What is your end-to-end recovery objective/capability?

Performance History

Performance History

To handle more requests:

• Fix configuration

• Add disk I/O via volumes or nodes

• Add RAM to decrease disk I/O

• Rewrite the query

Scaling out: questions to consider

Adding a node - or RAM

How do you know when you need to add a node?

CPU/Memory/IO: when you get close to hardware limits, time to grow

High Performance: SLAs may drive forest sizes; more docs, time to grow

High Capacity: running low on storage, time to grow

Easy (temporary?) fix: add RAM

Cheaper alternative

Increases cache hits for better performance

Migrating a forest

Fewer than three hosts, local forests MUST move across hosts

Use forest migrate to move forests from one host to another

Faster than backup/restore

Follow distribution pattern:

Don’t just swap masters/replicas on two hosts: if one goes down, load is not split evenly across cluster
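The point about the distribution pattern can be made concrete with a small placement sketch. It is a simplified model under stated assumptions: the function names and the forest naming scheme are hypothetical, every forest has exactly one replica, and "load" is just a count of forests served per host.

```python
def place_replicas(hosts, forests_per_host):
    """Return {forest_name: (master_host, replica_host)} with staggered replicas."""
    n = len(hosts)
    if n < 2:
        raise ValueError("replication needs at least two hosts")
    placement = {}
    for i, master in enumerate(hosts):
        for k in range(forests_per_host):
            # Cycle through the other hosts so no single peer holds every replica.
            replica = hosts[(i + 1 + k % (n - 1)) % n]
            placement[f"{master}-forest-{k + 1}"] = (master, replica)
    return placement


def load_after_failure(placement, failed_host):
    """Count the forests each surviving host serves once replicas take over."""
    load = {}
    for master, replica in placement.values():
        serving = replica if master == failed_host else master
        load[serving] = load.get(serving, 0) + 1
    return load


placement = place_replicas(["a", "b", "c"], forests_per_host=2)
print(load_after_failure(placement, "a"))
```

With three hosts and two forests each, losing host `a` leaves `b` and `c` serving three forests apiece. Had both of `a`'s replicas lived on `b`, that host would carry four forests while `c` stayed at two, which is the uneven split the slide warns about.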

LOWERING TCO THROUGH COMMODITY HARDWARE

Kryder’s Law: The density of hard drives increases by a factor of 1,000 every 10.5 years (doubling every 13 months).
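The two figures in Kryder's Law are consistent with each other, which a quick calculation confirms. The function name below is only illustrative.

```python
import math


def doubling_time_months(growth_factor, period_years):
    """Months needed to double, given growth_factor over period_years."""
    doublings = math.log2(growth_factor)  # a factor of 1,000 is about 9.97 doublings
    return period_years * 12 / doublings


months = doubling_time_months(1000, 10.5)
print(f"Doubling time: {months:.1f} months")
```

The result is about 12.6 months, which the slide rounds to 13.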

Moore’s Law: The density of transistors on integrated circuits doubles every 18 months.

The laws in action

At the end of a 3-year life cycle, one new server can do the job of four old servers.

At 1.5 years, you can add 100% more capacity for 50% of the original spend.

For the cost of storing 1TB in 1996, we will be able to store 1PB in 2016.
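These three claims follow from the doubling rates quoted earlier in the deck. The sketch below reproduces the arithmetic; it assumes that server capability tracks Moore's Law exactly and that cost per byte tracks the drive density in Kryder's Law, both of which are simplifications.

```python
def growth_factor(months_elapsed, doubling_months):
    """How many times capacity multiplies over the elapsed period."""
    return 2 ** (months_elapsed / doubling_months)


MOORE_DOUBLING_MONTHS = 18

# 3-year life cycle: two doublings, so one new server replaces four old ones.
servers_replaced = growth_factor(36, MOORE_DOUBLING_MONTHS)

# 1.5 years: one doubling, so the original capacity costs half as much.
relative_cost_same_capacity = 1 / growth_factor(18, MOORE_DOUBLING_MONTHS)

# 1996 to 2016: 1TB to 1PB is a 1,000x improvement over 20 years.
required_factor = 1000
kryder_factor_20_years = 1000 ** (20 / 10.5)

print(servers_replaced, relative_cost_same_capacity)
print(kryder_factor_20_years > required_factor)
```

Under these assumptions the storage claim is conservative: Kryder's rate sustained for 20 years would give far more than the 1,000x improvement that the slide relies on.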

Commodity hardware will reduce costs

Hardware/sizing recommendations

2U 25 SFF Chassis 2 Socket

8 Core/2.8GHz

128GB – 256GB RAM

10Gb Network

2 2GB RAID Cards

22 10K 900-1200GB Data Drives

VMware NetApp recommendations (preliminary)

1U 8SFF Chassis 2 Socket

8 Core/2.8GHz

128GB – 256GB RAM

10Gb Network

1 10Gb iSCSI

12-16 Spindles per Server, 10K SAS

Storage Economics

SAN/Scale-up: $2 - $10/Gigabyte; $1M gets 0.5 Petabytes, 200,000 IOPS, 1 Gbyte/sec

NAS Filers: $1 - $5/Gigabyte; $1M gets 1 Petabyte, 400,000 IOPS, 2 Gbytes/sec

Local Compute: $0.20/Gigabyte; $1M gets 10 Petabytes, 5,000,000 IOPS, 40 Gbytes/sec

Public cloud: $0.04/GB/month; $100K/month gets 1.25 Petabytes (HA), 1,500,000 IOPS, 150 Gbytes/sec

Categories: SAN (Scale Up), Commodity (Scale Out), Cloud
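The capacity figures can be checked against the quoted prices with simple division. The helper below is a sketch with assumptions: it uses decimal units (1 PB = 1,000,000 GB), takes the low end of each price range, and models the cloud "(HA)" figure as two stored copies, which is a guess about what the slide intends.

```python
GB_PER_PB = 1_000_000  # decimal units


def petabytes_for_budget(budget_usd, usd_per_gb, copies=1):
    """Usable petabytes a budget buys at a given price per gigabyte."""
    return budget_usd / usd_per_gb / copies / GB_PER_PB


san = petabytes_for_budget(1_000_000, 2.00)    # low end of $2 - $10/GB
nas = petabytes_for_budget(1_000_000, 1.00)    # low end of $1 - $5/GB
local = petabytes_for_budget(1_000_000, 0.20)
cloud = petabytes_for_budget(100_000, 0.04, copies=2)  # per month, assumed 2 copies

for name, pb in [("SAN", san), ("NAS", nas), ("Local", local), ("Cloud", cloud)]:
    print(f"{name}: {pb:.2f} PB")
```

The SAN, NAS, and cloud results match the slide (0.5, 1, and 1.25 PB). For local compute, $0.20/GB alone yields 5 PB rather than the 10 PB shown, so the slide may be assuming a lower effective price or counting capacity differently; the source does not say which.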

Signs of war: cloud prices have dropped recently

Google Cloud: $0.04 per GB-month for 1000GB

Amazon EBS: $0.055 per GB-month (standard), $0.138 per GB-month (provisioned)

Leveraging Scale-out Economics

Run on existing infrastructure today

Leverage Scale-Out Commodity Hardware as you grow

Leverage Cloud today or tomorrow

SAME DATABASE, SAME CODE

DATA LAYER CONVERGENCE

FEWER MOVING PARTS = MORE AGILITY

Last generation

Diagram labels: OLTP, Warehouse, Data Marts, Archives, “Unstructured”, Video, Audio, Signals, Logs, Streams, Social, Documents, Messages, Metadata, Search, Reference Data

RDBMS: One Tool, Many Contortions

OLTP: 3rd normal form, updates, simple query

Reporting DB: Because the OLTP app slowed down during heavy query use

Enterprise Data Warehouse: Because we needed a unified view of the enterprise - star schema enters the picture

Data Marts: Because the EDW didn’t have everything - also star schema

Federated: Because it took too long to agree on a standard model

Hybrid: Because Federated is too slow

The old consensus: mixing is bad

If I run analytics in my OLTP DB then...

Won’t meet my SLAs

Too expensive

No common data model

Cache won’t ever be right

Too expensive to keep around context necessary for analytics

If I run transactions in my Analytical DB...

Transaction locks will block aggregate reads

Too expensive

Why constrain ad-hoc query? We need to investigate

The new wisdom: mixing is good

Operational with Analytics: Risk calculations, Underwriting, Compliance, Content Discovery, Fraud

Analytics with Operations: Operational BI, Archival/E-Discovery, Personalization, Situational Awareness

SINGLE SOURCE OF TRUTH

Mixing workloads in MarkLogic - how it works

ML as an analytic database - examples and possibilities

Range indexes: in-memory columnar

Query load separation

Tiered storage and real-time replication

Hadoop MapReduce and HDFS

Transactions and ACID help manage and prioritize data - better performance, lower TCO

Operations and analytics in MarkLogic

COPIES, NOT ETL

INFORMATION LIFE-CYCLE MANAGEMENT (FOR REAL)

Understanding the life cycle

The older your database, the more data you have

The older the data, the less likely you will reuse it

Storage requirements increase, but much of what is stored will go untouched

Data life cycle management, in three easy steps

1. Move data off active system to cheaper system.

2. Keep track of what you moved.

3. Provide facility for getting it back.
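The three steps map onto a very small data model. The sketch below is a toy illustration, not a MarkLogic feature: the class name, the dictionary-backed stores, and the example URI are all hypothetical.

```python
class LifecycleStore:
    """Toy model of the three steps: move, track, and retrieve."""

    def __init__(self):
        self.active = {}    # expensive, fast storage
        self.archive = {}   # cheaper storage
        self.catalog = {}   # uri -> current location, so nothing is lost track of

    def put(self, uri, doc):
        self.active[uri] = doc
        self.catalog[uri] = "active"

    def move_to_archive(self, uri):
        # Steps 1 and 2: move the data and record where it went.
        self.archive[uri] = self.active.pop(uri)
        self.catalog[uri] = "archive"

    def restore(self, uri):
        # Step 3: bring archived data back to the active system.
        self.active[uri] = self.archive.pop(uri)
        self.catalog[uri] = "active"

    def get(self, uri):
        location = self.catalog[uri]
        store = self.active if location == "active" else self.archive
        return store[uri]


store = LifecycleStore()
store.put("/orders/2009/1.xml", "<order/>")
store.move_to_archive("/orders/2009/1.xml")
print(store.catalog["/orders/2009/1.xml"], store.get("/orders/2009/1.xml"))
```

Even in this toy form, the catalog is what makes step 3 possible: without a record of where each document went, retrieval would require searching every archive.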

CERN: implementation is hard in the RDBMS world

DBAs / database developers cannot easily implement these policies by themselves.

Database admins, application developers, and application owners must work together to:

Reduce amount of data produced

Allow for database structure that can facilitate archiving

Define data availability requirements for online data and archive

Identify how to leverage database features

CERN: archiving RDBMS data is also difficult

The DBA removes old partitions from the production database and moves them to the archive.

One option: use partition exchange to table

Post-move jobs can implement compression, drop indexes

Sticking points:

Set of data must be consistent

Must build support in the application

Have to validate access to archived data

Archived data must remain readable in future versions

Tiered Storage

With Tiered Storage, you can…

Define data tiers based on a range index

Have content balanced into forests by tier

Move an entire tier to different storage

Attach a tier to a different database

Query one database on one tier…

…or the other database on the other tier…

…or both at once

All with no downtime, and 100% consistency
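Defining tiers by a range index amounts to assigning each document to a tier according to where an indexed value falls. The following sketch illustrates the idea only; the tier names, date boundaries, and document URIs are hypothetical, and it does not use any MarkLogic API.

```python
from datetime import date

# Hypothetical tiers: (name, inclusive lower bound, exclusive upper bound).
TIERS = [
    ("archive", date(2000, 1, 1), date(2012, 1, 1)),
    ("warm", date(2012, 1, 1), date(2014, 1, 1)),
    ("hot", date(2014, 1, 1), date(2015, 1, 1)),
]


def tier_for(value, tiers=TIERS):
    """Return the tier whose range contains the indexed value."""
    for name, lower, upper in tiers:
        if lower <= value < upper:
            return name
    raise ValueError(f"no tier covers {value}")


def query(docs, tier_names):
    """Return URIs of documents stored in any of the named tiers."""
    return sorted(uri for uri, d in docs.items() if tier_for(d) in tier_names)


docs = {
    "/trade/1.xml": date(2010, 6, 1),
    "/trade/2.xml": date(2013, 3, 15),
    "/trade/3.xml": date(2014, 5, 20),
}
print(query(docs, {"hot"}))
print(query(docs, {"hot", "archive"}))
```

Because tier membership is derived from the indexed value, querying one tier, the other, or both is just a matter of which tier names are passed in.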

Chart: Effective Cost/TB for Production Storage (all copies)

Tier-1 (axis 0 to 60,000): Tier 1 SAN, Exadata, ML using DAS

Tier-2 (axis 0 to 9,000): FlexPod/VCE, NetApp, ML using DAS

GOVERNANCE + PROVENANCE

Data Governance Considerations

Security

Privacy

Provenance

Retention

Continuity

Compliance

Last Generation

Diagram labels: OLTP, Warehouse, Data Marts, Archives, “Unstructured”, Video, Audio, Signals, Logs, Streams, Social, Documents, Messages, Metadata, Search, Reference Data

New Generation

Application

Summary

Elastic systems let you respond rapidly to changing loads - and let you keep costs in line with usage.

Scale-out systems on commodity hardware are much less expensive and more powerful than scale-up systems.

Converging transactional and analytical workloads into a single data layer is not only possible - it is often a great idea. A single data layer can increase agility.

Managing information throughout its life cycle means more than choosing the cheapest storage possible - it means being able to manage and query data in real time.

Proper data governance is simpler in an enterprise NoSQL system.

Q&A