© COPYRIGHT 2014 MARKLOGIC CORPORATION. ALL RIGHTS RESERVED.
5 Fundamental Strategies for Building a Data-centered Data Center
June 3, 2014
Ken Krupa, Chief Field Architect
Gary Vidal, Solutions Specialist
Last generation
[Diagram labels: OLTP; Warehouse; Data Marts; Archives; “Unstructured”; Video; Audio; Signals, Logs, Streams; Social; Documents, Messages; Metadata; Search; Reference Data]
Summary – The Data-centered Data Center
Elastic: flexible, shared-nothing, scale-out architecture
Cost competitive: low-cost commodity hardware, lower TCO
Converged: single data layer for operational and analytical workloads
Managing data life-cycle in real-time: prioritize your data storage
Governed, not renegade: customizable, transparent, secure
ELASTIC
Data organization in MarkLogic
Data inserted into stands
One stand is in-memory
Many other stands are on disk
A collection of stands is a forest
Each forest is an atomic unit and can be managed and moved
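The relationship between stands and forests can be pictured with a small toy model. This is only a sketch for illustration, not MarkLogic code: the `Stand` and `Forest` classes, the `memory_limit` parameter, and the rule that a full in-memory stand is written out as a new on-disk stand are all assumptions made here.

```python
from dataclasses import dataclass, field


@dataclass
class Stand:
    """A group of documents; either the in-memory stand or an on-disk one."""
    documents: dict = field(default_factory=dict)
    in_memory: bool = False


@dataclass
class Forest:
    """A collection of stands: one in memory, any number on disk."""
    name: str
    memory_limit: int = 2  # hypothetical: documents held before writing to disk
    memory_stand: Stand = field(default_factory=lambda: Stand(in_memory=True))
    disk_stands: list = field(default_factory=list)

    def insert(self, uri, content):
        # New data always lands in the in-memory stand first.
        self.memory_stand.documents[uri] = content
        if len(self.memory_stand.documents) >= self.memory_limit:
            self.flush()

    def flush(self):
        # Assumed behavior: a full in-memory stand becomes an on-disk stand.
        self.disk_stands.append(Stand(dict(self.memory_stand.documents)))
        self.memory_stand = Stand(in_memory=True)

    def stand_count(self):
        return 1 + len(self.disk_stands)


forest = Forest("forest-1")
for i in range(3):
    forest.insert(f"/doc{i}.xml", "<doc/>")
print(forest.stand_count(), len(forest.disk_stands))
```

Because the forest owns all of its stands, moving or managing the forest moves everything in it as one unit.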
Servers have Multiple Forests
Scale out
Clustering
Migration
Two forests on one node
Bring a second node online
Replicate a forest
Disable the forest on the original node
Original forest on original node fails over
Enable the original forest as a replica
Migration in one step
$ cat forest-migrate.json
{
  "operation": "forest-migrate",
  "forest": ["forest-in-database", "another-forest-in-database"],
  "host": "destination-host"
}
$ curl --anyauth --user user:password -X PUT -d @./forest-migrate.json \
    -i -H "Content-type: application/json" \
    http://anyhostinthecluster:8002/manage/v2/forests
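The same request can be assembled from a script. The sketch below only builds the request object and does not send it; the helper name `build_forest_migrate_request` and its parameters are hypothetical, while the payload fields, port, and path are copied from the slide. Authentication (the `--anyauth --user` part of the curl call) is deliberately left out and would need to be added before sending.

```python
import json
import urllib.request


def build_forest_migrate_request(forests, destination_host,
                                 manage_host="anyhostinthecluster", port=8002):
    """Build (but do not send) the PUT request shown in the curl example."""
    payload = {
        "operation": "forest-migrate",
        "forest": list(forests),
        "host": destination_host,
    }
    return urllib.request.Request(
        url=f"http://{manage_host}:{port}/manage/v2/forests",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-type": "application/json"},
        method="PUT",
    )


request = build_forest_migrate_request(
    ["forest-in-database", "another-forest-in-database"],
    "destination-host",
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Building the payload with `json.dumps` avoids the quoting mistakes that are easy to make in a hand-edited JSON file.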
Cluster topologies
[Diagram labels: XA; RDBMS]
Knowing where you’re going - and where you’ve been
Business context
• What are your SLAs?
• How many requests per second does the application have to support?
• How will the business grow?
• What will drive growth - and how fast will it go?
As-Built Capacities
• How does your system perform under different usage profiles (e.g., QPS tests)?
• How often do you hit the cache?
• What is your peak storage I/O?
• What is your end-to-end recovery objective/capability?
Performance History
To handle more requests:
• Fix Configuration
• Add Disk IO via Volumes or Nodes
• Add RAM to decrease Disk IO
• Rewrite Query
Scaling out: questions to consider
Adding a node - or RAM
How do you know when you need to add a node?
• CPU/Memory/IO: when you get close to hardware limits, time to grow
• High Performance: SLAs may drive forest sizes; more docs, time to grow
• High Capacity: running low on storage, time to grow
Easy (temporary?) fix: add RAM
• Cheaper alternative
• Increases cache hits for better performance
Migrating a forest
Fewer than three hosts, local forests MUST move across hosts
Use forest migrate to move forests from one host to another
• Faster than backup/restore
Follow distribution pattern:
• Don’t just swap masters/replicas on two hosts: if one goes down, load is not split evenly across cluster
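The advice about distribution patterns can be checked with a small simulation. This is a hedged sketch rather than anything from MarkLogic: the function names, the four-host cluster, and the three forests per host are all made up for illustration. It compares keeping every replica of a host on a single partner host with spreading the replicas across all other hosts, then counts active forests per surviving host after one host fails.

```python
from collections import Counter


def paired_replicas(hosts, forests_per_host):
    """Each host keeps all its replicas on one partner (needs an even host count)."""
    placement = {}
    for i, host in enumerate(hosts):
        partner = hosts[i + 1] if i % 2 == 0 else hosts[i - 1]
        for forest in range(forests_per_host):
            placement[(host, forest)] = partner
    return placement


def striped_replicas(hosts, forests_per_host):
    """Each host spreads its replicas across all the other hosts in turn."""
    placement = {}
    for i, host in enumerate(hosts):
        others = hosts[i + 1:] + hosts[:i]
        for forest in range(forests_per_host):
            placement[(host, forest)] = others[forest % len(others)]
    return placement


def load_after_failure(hosts, forests_per_host, placement, failed):
    """Count active forests on each surviving host once `failed` goes down."""
    load = Counter({h: forests_per_host for h in hosts if h != failed})
    for (master_host, _), replica_host in placement.items():
        if master_host == failed:
            load[replica_host] += 1  # the replica takes over as master
    return dict(load)


hosts = ["a", "b", "c", "d"]
paired = load_after_failure(hosts, 3, paired_replicas(hosts, 3), "a")
striped = load_after_failure(hosts, 3, striped_replicas(hosts, 3), "a")
print("paired: ", paired)
print("striped:", striped)
```

With paired placement, the partner of the failed host ends up carrying twice the forests of its peers; with striped placement the extra load is shared.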
LOWERING TCO THROUGH COMMODITY HARDWARE
Kryder’s Law: The density of hard drives increases by a factor of 1,000 every 10.5 years (doubling every 13 months).
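The two figures in this statement are consistent with each other, which a short calculation shows. The helper name `doubling_time_months` is introduced here only for illustration.

```python
import math


def doubling_time_months(growth_factor, period_years):
    """Months per doubling, given total growth over a period of years."""
    doublings = math.log2(growth_factor)  # how many doublings make up the growth
    return period_years * 12 / doublings


kryder_months = doubling_time_months(1000, 10.5)
print(round(kryder_months, 1))
```

A factor of 1,000 is just under ten doublings, so spreading 126 months across them gives a little under 13 months per doubling.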
Moore’s Law: The density of transistors on integrated circuits doubles every 18 months.
The laws in action
At the end of a 3-year life cycle, one new server can do the job of four old servers.
At 1.5 years, you can add 100% more capacity for 50% of original spend
For the cost of storing 1TB in 1996, we will be able to store 1PB in 2016.
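The server figures follow from an 18-month doubling period. The sketch below works through them, assuming (this is an assumption, not something the slide states) that the price of a unit of capacity falls in direct proportion to the growth in capability; `growth_factor` is a name chosen here.

```python
MOORE_DOUBLING_MONTHS = 18


def growth_factor(years, doubling_months=MOORE_DOUBLING_MONTHS):
    """Capability multiple after `years`, doubling every `doubling_months`."""
    return 2 ** (years * 12 / doubling_months)


# After a 3-year life cycle: two doublings, so one new server matches four old ones.
servers_replaced = growth_factor(3)

# After 1.5 years: one doubling, so matching the original capacity
# (adding 100% more) costs half the original spend under the stated assumption.
cost_fraction_for_same_capacity = 1 / growth_factor(1.5)

print(servers_replaced, cost_fraction_for_same_capacity)
```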
Commodity hardware will reduce costs
Hardware/sizing recommendations
2U 25 SFF Chassis 2 Socket
8 Core/2.8GHz
128GB – 256GB RAM
10GB Network
2 2GB RAID Cards
22 10K 900-1200GB Data Drives
VMware NetApp recommendations (preliminary)
1U 8SFF Chassis 2 Socket
8 Core/2.8GHz
128GB – 256GB RAM
10GB Network
1 10GB iSCSI
12-16 Spindles per Server, 10K SAS
Storage Economics
SAN/Scale-up: $2 - $10/Gigabyte. $1M gets: 0.5 Petabytes, 200,000 IOPS, 1 Gbyte/sec
NAS Filers: $1 - $5/Gigabyte. $1M gets: 1 Petabyte, 400,000 IOPS, 2 Gbyte/sec
Local Compute: $0.20/Gigabyte. $1M gets: 10 Petabytes, 5,000,000 IOPS, 40 Gbytes/sec
Public cloud: $0.04/GB/month. $100K/month: 1.25 Petabytes (HA), 1,500,000 IOPS, 150 GB/sec
Categories shown: SAN (Scale Up), Commodity (Scale Out), Cloud
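The capacity figures can be roughly reproduced by dividing the budget by the price per gigabyte. The sketch below makes several assumptions that are not on the slide: decimal units (1 PB = 1,000,000 GB), the low end of each price range, and "HA" meaning two full copies of the data. Local Compute is left out because $1M at $0.20/GB divides out to 5 PB rather than the 10 PB shown, so that figure presumably rests on an input not stated here.

```python
GB_PER_PB = 1_000_000  # assumption: decimal units


def petabytes_for_budget(budget_dollars, dollars_per_gb, copies=1):
    """Usable petabytes a budget buys at a given price, storing `copies` copies."""
    raw_gb = budget_dollars / dollars_per_gb
    return raw_gb / copies / GB_PER_PB


san = petabytes_for_budget(1_000_000, 2.00)             # low end of $2 - $10/GB
nas = petabytes_for_budget(1_000_000, 1.00)             # low end of $1 - $5/GB
cloud = petabytes_for_budget(100_000, 0.04, copies=2)   # per month, two copies assumed

for name, value in [("SAN", san), ("NAS", nas), ("Public cloud (HA)", cloud)]:
    print(f"{name}: {value:.2f} PB")
```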
Signs of war: cloud prices have dropped recently
Google Cloud:
- $0.04 GB-month for 1000GB
Amazon EBS:
- $0.055 GB-month (standard)
- $0.138 GB-month (provisioned)
Leveraging Scale-out Economics
Run on existing Infrastructure today
Leverage Scale-Out Commodity Hardware as you grow
Leverage Cloud today or tomorrow
SAME DATABASE, SAME CODE
DATA LAYER CONVERGENCE
FEWER MOVING PARTS = MORE AGILITY
Last generation
[Diagram labels: OLTP; Warehouse; Data Marts; Archives; “Unstructured”; Video; Audio; Signals, Logs, Streams; Social; Documents, Messages; Metadata; Search; Reference Data]
RDBMS: One Tool, Many Contortions
OLTP: 3rd normal form, updates, simple query
Reporting DB: Because the OLTP app slowed down during heavy query use
Enterprise Data Warehouse: Because we needed a unified view of the enterprise – Star schema enters the picture
Data Marts: Because the EDW didn’t have everything – Also star schema
Federated: Because it took too long to agree on a standard model
Hybrid: Because Federated is too slow
The old consensus: mixing is bad
If I run analytics in my OLTP DB then...
• Won’t meet my SLAs
• Too expensive
• No common data model
• Cache won’t ever be right
• Too expensive to keep around context necessary for analytics
If I run transactions in my Analytical DB...
• Transaction locks will block aggregate reads
• Too expensive
• Why constrain ad-hoc query? We need to investigate
The new wisdom: mixing is good
Operational with Analytics
• Risk calculations
• Underwriting
• Compliance
• Content Discovery
• Fraud
Analytics with Operations
• Operational BI
• Archival/E-Discovery
• Personalization
• Situational Awareness
SINGLE SOURCE OF TRUTH
Mixing workloads in MarkLogic - how it works
ML as an analytic database - examples and possibilities
• Range indexes: in-memory columnar
• Query load separation
• Tiered storage and real-time replication
• Hadoop MapReduce and HDFS
• Transactions and ACID help manage and prioritize data - better performance, lower TCO
Operations and analytics in MarkLogic
COPIES, NOT ETL
INFORMATION LIFE-CYCLE MANAGEMENT (FOR REAL)
Understanding the life cycle
The older your database, the more data you have
The older the data, the less likely you will reuse it
Storage requirements increase, but much of what is stored will go untouched
Data life cycle management, in three easy steps
1. Move data off active system to cheaper system.
2. Keep track of what you moved.
3. Provide facility for getting it back.
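The three steps can be sketched as a toy in-memory store. Everything here is hypothetical and for illustration only: the `ArchivingStore` class and its two dictionaries stand in for an active system and a cheaper archive, and the `catalog` is the record of what was moved.

```python
class ArchivingStore:
    """Toy model of the three steps: move, keep track, get it back."""

    def __init__(self):
        self.active = {}    # stands in for the active (expensive) system
        self.archive = {}   # stands in for the cheaper system
        self.catalog = {}   # step 2: where each key currently lives

    def put(self, key, value):
        self.active[key] = value
        self.catalog[key] = "active"

    def archive_key(self, key):
        # Step 1: move data off the active system.
        self.archive[key] = self.active.pop(key)
        self.catalog[key] = "archive"

    def get(self, key):
        # The catalog tells us which system to read from.
        location = self.catalog[key]
        return self.active[key] if location == "active" else self.archive[key]

    def restore(self, key):
        # Step 3: bring archived data back to the active system.
        self.active[key] = self.archive.pop(key)
        self.catalog[key] = "active"


store = ArchivingStore()
store.put("order-2009", {"total": 120})
store.archive_key("order-2009")
print(store.catalog["order-2009"], store.get("order-2009"))
store.restore("order-2009")
print(store.catalog["order-2009"])
```

Even in this toy form, the catalog is what keeps archived data reachable; losing track of what moved is the failure the second step guards against.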
CERN: implementation is hard in the RDBMS world
DBAs / database developers cannot easily implement these policies by themselves.
Database admins, application developers, and application owners must work together to:
• Reduce amount of data produced
• Allow for database structure that can facilitate archiving
• Define data availability requirements for online data and archive
• Identify how to leverage database features
CERN: archiving RDBMS data is also difficult
The DBA removes old partitions from the production database and moves them to the archive.
• One option: use partition exchange to table
• Post-move jobs can implement compression, drop indexes
Sticking points:
• Set of data must be consistent
• Must build support in the application
• Have to validate access to archived data
• Archived data must remain readable in future versions
Tiered Storage
With Tiered Storage, you can…
Define data tiers based on a range index
Have content balanced into forests by tier
Move an entire tier to different storage
Attach a tier to a different database
Query one database on one tier…
…or the other database on the other tier…
…or both at once
All with no downtime, and 100% consistency
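The idea of assigning content to tiers by a range on an indexed value, then querying one tier or several, can be illustrated with a toy example. This is not the MarkLogic API: the tier names, the date boundary, and the functions `tier_for`, `assign`, and `query` are all invented for this sketch.

```python
from datetime import date

# Hypothetical tiers: (name, start inclusive, end exclusive or None for open-ended).
TIERS = [
    ("archive", date.min, date(2013, 1, 1)),
    ("current", date(2013, 1, 1), None),
]


def tier_for(value):
    """Pick the tier whose range covers the indexed value."""
    for name, start, end in TIERS:
        if value >= start and (end is None or value < end):
            return name
    raise ValueError(f"no tier covers {value}")


def assign(docs):
    """Place each document in the tier that matches its date."""
    tiers = {name: [] for name, _, _ in TIERS}
    for doc in docs:
        tiers[tier_for(doc["date"])].append(doc)
    return tiers


def query(tiers, tier_names, predicate):
    """Run one query over a single tier or over several at once."""
    return [doc for name in tier_names for doc in tiers[name] if predicate(doc)]


docs = [
    {"id": 1, "date": date(2011, 5, 1), "type": "trade"},
    {"id": 2, "date": date(2014, 2, 1), "type": "trade"},
    {"id": 3, "date": date(2014, 3, 1), "type": "memo"},
]
tiers = assign(docs)
is_trade = lambda doc: doc["type"] == "trade"
print([d["id"] for d in query(tiers, ["current"], is_trade)])
print([d["id"] for d in query(tiers, ["archive", "current"], is_trade)])
```

Because the tier is derived from the value itself, the same query code works whether it targets one tier or both.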
Effective Cost/TB for Production Storage (all copies)
[Bar charts. Tier-1: Tier 1 SAN, Exadata, ML using DAS (axis 0 to 60,000). Tier-2: FlexPod/VCE, NetApp, ML using DAS (axis 0 to 9,000).]
GOVERNANCE + PROVENANCE
Data Governance Considerations
Security
Privacy
Continuity
Provenance
Compliance
Retention
Last Generation
[Diagram labels: OLTP; Warehouse; Data Marts; Archives; “Unstructured”; Video; Audio; Signals, Logs, Streams; Social; Documents, Messages; Metadata; Search; Reference Data]
New Generation
Application
Summary
Elastic systems let you respond rapidly to changing loads - and let you keep costs in line with usage.
Scale-out systems on commodity hardware are much less expensive and more powerful than scale-up systems.
Converging transactional and analytical workloads into a single data layer is not only possible - it is often a great idea. A single data layer can increase agility.
Managing information throughout its life cycle means more than choosing the cheapest storage possible - it means being able to manage and query data in real time.
Proper data governance is simpler in an enterprise NoSQL system.
Q&A