Post on 07-Jul-2020
Storage Decisions 2012 | © TechTarget
Randy Kerns
Senior Strategist
Evaluator Group
IT and Storage for
Big Data Analytics
Overview
● “Big data” can mean two different things
- Storage for large amounts of data
- Analytics against very large amounts of data
● Usually from machine-to-machine data
- Called pervasive computing
● So, what does this mean for storage?
What It Means for IT
The Storage Way to Say Big Data
● Defined by architectural platform, big data storage is:
‒ Scale-out NAS
‒ Global NameSpace File System
‒ NAS gateway to SAN and Scale-out SAN
● Defined by application, big data storage is:
‒ Storage for applications that handle large files and require high performance
‒ Storage for an extremely large number of files
‒ Examples: Media & entertainment, oil & gas exploration, life sciences, etc.
The Analytics Way to Say Big Data
● Big data analytics is:
- A term for business intelligence (BI) processes that are different from traditional data warehousing
- The ability to tap unstructured data as a source for BI processes
- Information delivered to users in real or near real-time (but not an absolute requirement)
- Convergence of multiple data sources
● Latency introduced by storage, including networked storage, is often assiduously avoided
● Cost is minimized
[Figure: batch and low-latency paths for predictions on buying behavior. Labels: Logs, Tweets, Location; HDFS; NoSQL DB; Customer Profiles; High Scale Data Reductions; BI and Analytics; POS; Expert System; NoSQL DB; Batch; Low Latency. Numbered steps:]
1) Identify User
2a) Lookup User Profile
2b) Lookup Location
3) Input Into Data Analytics Model
4) Real-time: Determine Best Offer For This User
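The numbered steps in the figure can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the presenter's implementation: the dictionaries stand in for the low-latency NoSQL lookups, and `build_context`, `score_offer`, `best_offer`, the profile fields, and the offers are all hypothetical placeholders for a model that would be built on the batch side.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for the low-latency NoSQL stores in the figure.
USER_PROFILES = {"u42": {"segment": "coffee", "visits": 12}}
USER_LOCATIONS = {"u42": "store-7"}

# Hypothetical offers; a real catalogue would come from the batch/BI side.
OFFERS = [
    {"name": "10% off espresso", "segment": "coffee", "store": "store-7"},
    {"name": "free bagel", "segment": "bakery", "store": "store-7"},
    {"name": "2-for-1 latte", "segment": "coffee", "store": "store-9"},
]


@dataclass
class Context:
    user_id: str
    profile: dict
    location: str


def build_context(user_id: str) -> Context:
    """Steps 1, 2a and 2b: identify the user, then look up profile and location."""
    return Context(user_id, USER_PROFILES[user_id], USER_LOCATIONS[user_id])


def score_offer(ctx: Context, offer: dict) -> int:
    """Step 3: toy stand-in for the analytics model (assumed scoring rule)."""
    score = 0
    if offer["segment"] == ctx.profile["segment"]:
        score += 2
    if offer["store"] == ctx.location:
        score += 1
    return score


def best_offer(user_id: str) -> str:
    """Step 4: pick the highest-scoring offer for this user in real time."""
    ctx = build_context(user_id)
    return max(OFFERS, key=lambda offer: score_offer(ctx, offer))["name"]


print(best_offer("u42"))
```

The point of the split is that the lookups and scoring in steps 1 to 4 must be fast, while whatever produces the scoring rule can run as a batch job.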
Why Should Storage Professionals Care?
● Distributed computing for analytics (Hadoop, for example) is moving from science experiment to mission-critical
● As this happens, data encompassed by these applications becomes the responsibility of people who worry about:
- Security
- Data protection/disaster recovery/business continuance
- Data governance and compliance
- Digital records management and archiving
Shared Storage for the Traditional
Data Warehouse
[Figure: traditional data warehouse on shared storage. Labels: Files / XML data, Log Files, OLTP, Operational Data, Warehouse, Reports, Dashboards, Notifications, Archive, Extract, Transform, Load (ETL), Schedules, Ad hoc Queries.]
Distributed, Shared-Nothing Architectures for Big Data Analytics
[Figure: NODE 1 through NODE n, each with direct-attached storage (DAS), plus a CONTROL node, connected by a network switch. Layers: Network, Compute, Storage.]
CAP Theorem
● It is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:
- Consistency (all nodes see the same data at the same time)
- Availability (a guarantee that every request receives a response about whether it was successful or failed)
- Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)
● A distributed system can satisfy any two of these guarantees at the same time, but not all three
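The trade-off can be made concrete with a toy model. The sketch below is an illustration only, assuming a single value replicated on two nodes and treating a refused write as a loss of availability; the `TwoNodeStore` class and its "CP"/"AP" mode labels are hypothetical names, not part of any real system.

```python
class Replica:
    """One copy of a single replicated value."""

    def __init__(self):
        self.value = None


class TwoNodeStore:
    """Toy two-replica register; `mode` is "CP" or "AP" (labels assumed here)."""

    def __init__(self, mode: str):
        self.mode = mode
        self.a, self.b = Replica(), Replica()
        self.partitioned = False  # True when A and B cannot exchange messages

    def write(self, value) -> bool:
        """Write through replica A; return True if the request was accepted."""
        if not self.partitioned:
            self.a.value = self.b.value = value
            return True
        if self.mode == "CP":
            return False  # refuse: stay consistent, give up availability
        self.a.value = value  # AP: accept locally, replica B is now stale
        return True

    def consistent(self) -> bool:
        """True when both replicas hold the same value."""
        return self.a.value == self.b.value


for mode in ("CP", "AP"):
    store = TwoNodeStore(mode)
    store.write("v1")          # network healthy: both replicas updated
    store.partitioned = True   # the network splits
    accepted = store.write("v2")
    print(mode, "accepted:", accepted, "consistent:", store.consistent())
```

Once the partition occurs, the CP store rejects the write and stays consistent, while the AP store accepts it and the replicas diverge. Neither mode keeps both properties during the partition.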
Issue for IT
● How to store information for big data
- How much data is there?
- Where did this idea come from?
● What are the requirements?
● Is it from analytics operations?
- Store original data – capture in flight as part of the analytics operation?
- Store as secondary process?
- Don’t save anything, except results?
● What about Rental Data?
Shared Storage as Secondary Storage
● Is there a place for shared storage in shared-nothing? If so, what does it look like?
[Figure: NODE 1 through NODE n and a CONTROL node connected by a network switch, with SAN/NAS shared storage. Layers: Network, Compute, Storage.]
Shared Storage as Primary Storage
[Figure: NODE 1 through NODE n and a CONTROL node connected by a network switch. Layers: Network, Compute, Storage. The storage layer is SAN or NAS, but more commonly scale-out NAS.]
Shared Primary/Secondary Storage
● Advantages
- Can reduce latency for queries that span nodes
- Enhances system availability
- Addresses the enterprise storage requirements:
   - Security
   - Data protection/disaster recovery/business continuance
   - Data governance and compliance
   - Digital records management and archiving
● Disadvantages
- Additional cost
- Crosses a “cultural” boundary
Why Not Shared Storage?
Big Data Storage for Big Data Analytics
● Shared storage as secondary storage for big data analytics
- Data Protection, Database of Record, Archive
- Examples: NetApp and ParAccel, EMC Data Domain/VMAX and Greenplum, RainStor
● Shared storage as primary storage for big data analytics
- Examples: Calpont, Red Hat Gluster, IBM GPFS, Nexenta ZFS, Hadoop nodes in Virtual Machines
Is Hadoop a Storage Device?
● NO
- It’s a distributed computing platform
● YES
- 1K node cluster w/ 1TB RAM per node = 1PB of very high performance storage
- Data protection built-in (multiple data copies but not RAID)
- HDFS - Embedded, distributed file system (like scale-out NAS)
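The 1PB figure is simple multiplication, which the short check below reproduces. It assumes decimal units (1 TB = 10^12 bytes, 1 PB = 10^15 bytes), which the slide does not specify; the variable names are illustrative.

```python
TB = 10**12  # decimal terabyte in bytes (assumed; the slide does not say)
PB = 10**15  # decimal petabyte in bytes

nodes = 1_000            # "1K node cluster"
ram_per_node = 1 * TB    # "1TB RAM per node"

total_ram = nodes * ram_per_node
print(f"Aggregate RAM: {total_ram / PB} PB")
```

This counts aggregate memory across the cluster only. It says nothing about how much unique data fits once multiple copies are kept.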
HDFS – Hadoop Distributed File System
● Very large Distributed File System (DFS)
– 10K nodes, 100 million files, 10 PB
● Uses standard servers with direct attached storage
– Files are replicated to handle hardware failure – 3 copies
– Detects failures and recovers from them
● Optimized for batch processing
– Data locations exposed so that computations can move to where data resides
– Provides very high aggregate bandwidth
● Runs in user space - heterogeneous OS
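The replication, recovery, and data-location points above can be illustrated with a toy model. This is a sketch under stated assumptions, not the real HDFS placement policy or API: `TinyCluster` and its methods are hypothetical, placement is random, and node failure is signalled by an explicit call rather than detected.

```python
import random

REPLICATION = 3  # the slide's "3 copies"


class TinyCluster:
    """Toy model of replica placement; not the real HDFS placement policy."""

    def __init__(self, nodes, seed=0):
        self.live = set(nodes)
        self.blocks = {}  # block id -> set of nodes holding a copy
        self.rng = random.Random(seed)

    def store(self, block_id):
        """Place REPLICATION copies of a block on distinct live nodes."""
        chosen = self.rng.sample(sorted(self.live), REPLICATION)
        self.blocks[block_id] = set(chosen)

    def locations(self, block_id):
        """Expose replica locations so work can be scheduled next to the data."""
        return self.blocks[block_id] & self.live

    def fail(self, node):
        """Mark a node dead, then re-replicate every block that lost a copy."""
        self.live.discard(node)
        for holders in self.blocks.values():
            holders.discard(node)
            spare = sorted(self.live - holders)
            while len(holders) < REPLICATION and spare:
                holders.add(spare.pop(self.rng.randrange(len(spare))))


cluster = TinyCluster([f"node{i}" for i in range(1, 6)])
for block in ("blk_1", "blk_2", "blk_3"):
    cluster.store(block)

cluster.fail("node2")  # simulate a hardware failure
for block in sorted(cluster.blocks):
    print(block, sorted(cluster.locations(block)))
```

After the simulated failure every block is back to three copies on surviving nodes, and `locations` is what a batch scheduler would consult to move the computation to the data rather than the data to the computation.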
Hadoop File System on Standard Servers
Source: Matt Foley
Typical Hadoop Configuration
[Figure: NODE 1 through NODE n, each with direct-attached storage (DAS), plus a CONTROL node, connected by a network switch. Layers: Network, Compute, Storage.]
Hadoop Key Milestones
● Dec 2004 – Google MapReduce paper published (the GFS paper appeared in 2003)
● July 2005 – MapReduce first used
● Feb 2006 – Becomes Lucene subproject
● Apr 2007 – Yahoo! on 1000-node cluster
● Jan 2008 – Apache Top Level Project
● May 2009 – Hadoop sorts a Petabyte in 17 hours
● Aug 2010 – World’s largest Hadoop cluster at Facebook
- 2900 nodes
- 30+ Petabytes
Evaluating Hadoop as a Storage Device
● Snapshots?
● Scale capacity and performance concurrently?
● SSD and automated tiering?
● Dedupe?
● Insert your hot-button storage feature here: __________
IT and Big Data Analytics
● There will be big data
● Circumstances may vary... and change
● Participate early
- Data scientists may not have the same concerns or requirements
- Decisions can limit choices
● Understand options
- Products / software
Randy Kerns: randy@evaluatorgroup.com
Twitter: @rgkerns
Blog: http://itknowledgeexchange.techtarget.com/storage-soup/
Thank You! Questions?