Lassoing Big data - Chris & Greg Tinker, HP Master Technologists
© Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
Chris & Greg Tinker – HP Master Technologist
BIG Data and IT Solutions
Lassoing Big data
Chris & Greg Tinker, HP Master Technologists
June 2012
Lassoing Big Data
Agenda
• Defining Big Data
• Challenges
• Solution design
• Scenarios
• Take away & closing statements
Defining Big Data
“Big Data” originating with analytics – business intelligence (BI)
Defining
• Traversing enormous, diverse data types to spot patterns
− 10s to 100s of terabytes (TB), petabytes (PB), and yes, even exabytes (EB)
• Business needs faster, “real time” (seconds to minutes vs. hours to days) analytic results
− Combining data from silos
− Analyzing diverse data types
− Connecting data from various business units (cross-analyze, cross-access, and cross-reference)
• Growing at an exponential rate
− Structured data: data stored in databases
− Unstructured data: all other data, including emails, social media, blogs, free-form feedback, documents, transactions, multimedia (images, videos, etc.)
− 90% of enterprise information is unstructured
Big Data: growing at a massive scale
Defining
Today’s “Big Data” will not be considered the same in 5 years. By 2020, there will be 4 billion people online creating 50 trillion gigabytes of data*
Data and its management is not just a concern for IT departments.
• ~4 trillion SMS messages a month, ~4 PB per month worldwide
• ~30 billion pieces of content shared on Facebook every month
• ~48 hours of video uploaded onto YouTube every minute
In sixty seconds:
• 1,820 TB of data is created; that’s enough data to fill up 2.6 million CDs**
• 1.1 million conversations take place via instant messenger**
*http://www.hpl.hp.com/research/intelligent_infrastructure.html
**http://www.go-gulf.com/60scs_v2.jpg
          Structured   Unstructured
Amount    10%          90%
Growth    22%          62%
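The volume figures on this slide can be sanity-checked with a little arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes) and a 700 MB CD; neither assumption is stated in the deck, and the helper names are illustrative only.

```python
# Back-of-envelope check of the slide's volume figures (decimal units assumed).
GB = 10**9
TB = 10**12
PB = 10**15
ZB = 10**21
CD_BYTES = 700 * 10**6  # assumption: one CD holds 700 MB


def cds_needed(total_bytes, cd_bytes=CD_BYTES):
    """Number of CDs needed to hold total_bytes."""
    return total_bytes / cd_bytes


def bytes_per_item(total_bytes, items):
    """Average size of one item when total_bytes is spread over items."""
    return total_bytes / items


cds_per_minute = cds_needed(1820 * TB)              # 2,600,000 CDs
bytes_per_sms = bytes_per_item(4 * PB, 4 * 10**12)  # 1,000 bytes per SMS
zettabytes_2020 = 50 * 10**12 * GB / ZB             # 50 ZB

print(cds_per_minute, bytes_per_sms, zettabytes_2020)
```

Under these assumptions the "2.6 million CDs" figure follows directly from 1,820 TB, and "50 trillion gigabytes" is the same quantity as 50 zettabytes.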
Big Data: Landscape
Defining
[Figure: data sources feeding Big Data, which is turned into actionable intelligence]
• Unstructured Data (e.g., social media): benchmarks, trends
• Silo Data (e.g., business units): counts, sums
• Structured Data (e.g., databases): averages, rates
• Cloud Compute and Storage platforms
• HPCC + software: programming with more math and statistics
• Result: actionable intelligence
Challenges
Scale
Challenges
Big Clouds – compute and storage platforms to transform data into actionable intelligence
• Fluctuating asset valuations
− Convergence & virtualization
− Identifying untapped resources: utilization factors
• Cross-access, cross-analyze, and cross-reference
− Reconcile data silos
− Massive data
• HPCC solutions
− Hyperscale cluster solutions
Governance
Challenges
Compliance
• SOX
• Privacy directives – Data access
− US Federal (HIPAA, FCRA, GLBA, DPPA, DOT, etc.); don’t forget the State addendums
− UK (DPA, …)
• Data retention and archives
• Purging of data after expiration of legal retention
• Ability to prove compliance upon request and to prove data has not been manipulated, changed, or deleted
• Restrictions and permutations of data models
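One common way to show that retained data has not been changed or deleted is to record a cryptographic digest of every record and re-check it later. The following is a minimal sketch of that idea using SHA-256; the record names and helper functions are hypothetical, and a real compliance setup would also need the manifest itself kept somewhere tamper-evident.

```python
import hashlib


def digest(data):
    """SHA-256 digest of a record's bytes, as a hex string."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(records):
    """Map each record name to the digest of its current content."""
    return {name: digest(content) for name, content in records.items()}


def audit(records, manifest):
    """Report records that were deleted, changed, or added since the manifest."""
    findings = {}
    for name, expected in manifest.items():
        if name not in records:
            findings[name] = "deleted"
        elif digest(records[name]) != expected:
            findings[name] = "changed"
    for name in records:
        if name not in manifest:
            findings[name] = "added"
    return findings


# Hypothetical records, for illustration only.
archive = {"claim-001": b"patient A", "claim-002": b"patient B"}
manifest = build_manifest(archive)

archive["claim-001"] = b"patient A (edited)"  # simulate manipulation
del archive["claim-002"]                      # simulate deletion
print(audit(archive, manifest))
```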
Architecture
Challenges
• Large, single name space file system(s)
− Parallel access file system
• Clustered file systems
− Proprietary cluster volumes
• Distance between data sources
• Protocol(s)
− iSCSI, iFCP, FCP, …
− CIFS, NFS, …
• Metadata management
• Backups
Analytics – software
Challenges
HP software
• HP Autonomy & Intelligent Data Operating Layer (IDOL) 10
− Natural language processing
− Unstructured
• HP Vertica
− Structured
Other software (examples)
• Hadoop (both a file system and a map/reduce engine)
− Hadoop map/reduce on HP IBRIX parallel single namespace file system
− Data processing (no built-in natural-language processing)
• Apache, Cassandra, Cloudera, Lucene/Solr and many others
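For readers unfamiliar with the map/reduce model mentioned above, here is a minimal in-process sketch of the idea: a word count in plain Python. It does not use Hadoop or any Hadoop API; the function names are illustrative, and a real engine would distribute the map and reduce phases across many nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data big clouds", "data at rest"])
print(counts)
```

Because each document is mapped independently and each key is reduced independently, both phases can run in parallel, which is what makes the model attractive for very large data sets.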
Solution designs
Capacity
Solution design
Big Data solutions must deliver optimal utilization of assets while remaining agile enough to support rapid scaling
• Hierarchical storage management methods
− Recent data must be readily available for real-time analytics
− Performance
− Reliability
• Disaster recovery
• Archival / backup management
• Leverage open standards – prevent lock-in
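Hierarchical storage management boils down to a placement policy: recent data stays on fast storage and older data migrates to cheaper tiers. A minimal sketch of such a policy follows; the tier names and age thresholds are hypothetical and are not taken from the slides.

```python
from datetime import date

# Hypothetical tiers: (maximum age in days, tier name). None means "no limit".
TIERS = [(30, "ssd"), (365, "disk"), (None, "archive")]


def choose_tier(last_access, today, tiers=TIERS):
    """Pick the first tier whose age limit covers this data's age."""
    age_days = (today - last_access).days
    for limit, name in tiers:
        if limit is None or age_days <= limit:
            return name
    return tiers[-1][1]


today = date(2012, 6, 4)
print(choose_tier(date(2012, 6, 1), today))  # recent data
print(choose_tier(date(2012, 1, 1), today))  # a few months old
print(choose_tier(date(2010, 1, 1), today))  # years old
```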
Near real-time
Solution design
Historically, analytics were derived from archived or aged data; today’s analytics require Cloud Compute and Storage platforms to achieve “nearly real-time” results
• Limit data movement
• Hyper-scale solutions: High Performance Compute Clusters (HPCC)
• Cloud and virtualization
• Capacity scalability: just-in-time scalability
• Parallel work streams
Get the analytics closer to the data…
Performance
Solution design
Need fast and scalable access to data
How tightly coupled the data is to the applications
• Application bottlenecks
− Message passing interfaces, network stacks, IO subsystem, data layout
• Parallelism – aging applications which do not make use of threading
• Data set size, quantity of objects, access patterns
− How random is random?
− File system(s)
− Storage subsystem
• Network
• Processing
Performance requirements influence scale constraints
Solution design
Service time ~1ms per I/O, Throughput ~8,000MB/sec
Transactions ~600,000 per day/hour/minute? (end to end?)
Tempering IT solutions with Business Realities
• Determine speed at which consumption and indexing of data types needs to take place
• Close to real-time, seconds, minutes, hours, days
• Utilize Enterprise Solutions
− Compress enormous volumes of data (via compression or de-duplication)
• Volume of data available encumbers analysis – SCOPE of data set
• Capture of data: real-time/low latency
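The requirement figures on this slide (service time, throughput, and transaction counts) interact, and a short calculation shows how. The sketch below applies Little's law (I/Os in flight = arrival rate x service time) and assumes an 8 KB I/O size, which the slide does not specify; the function names are illustrative.

```python
SECONDS = {"day": 86_400, "hour": 3_600, "minute": 60}


def required_iops(throughput_mb_s, io_kb):
    """I/Os per second needed to sustain a throughput at a given I/O size."""
    return throughput_mb_s * 1024 / io_kb


def outstanding_ios(iops, service_ms):
    """Little's law: I/Os that must be in flight to sustain iops."""
    return iops * service_ms / 1000


def rate_per_second(transactions, per):
    """Convert a transaction count per day/hour/minute into a per-second rate."""
    return transactions / SECONDS[per]


iops = required_iops(8000, 8)        # assumed 8 KB per I/O
in_flight = outstanding_ios(iops, 1.0)
rates = {unit: rate_per_second(600_000, unit) for unit in SECONDS}

print(iops, in_flight)
print(rates)
```

Under the 8 KB assumption, 8,000 MB/sec at 1 ms per I/O needs roughly a thousand I/Os in flight at all times, and "600,000 transactions" means anything from about 7 to 10,000 per second depending on whether the unit is a day or a minute, which is why the slide asks the question.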
Scenarios
Power
Scenario 1
Production jobs had been executing for 4+ days when jobs began to fail… 1,000+ users light up the phone bank
Though “near real-time” is a great sound bite, most complex analytics of large-scale research projects take hours if not days, during which data at rest is expected to remain at rest.
• 500TB
• 2,000,000 directories
• 60,000,000 files
• Single file system
• Storage subsystem experiences multiple component failures (PDU and UPS failure)
[Figure: linear space representation of the file system]
Power – Big Data File system complications
Scenario 1
Extremely High Aggregate Performance from a Single Directory (and Single File)
[Figure: a directory (Dir) holding files F1, F2, F3, … Fn and a subdirectory (Subdir) holding S1, S2, S3, … Sn, distributed across segments 1 to 100 and served by segment servers S1, S2, … Sn. US Patent # 6,782,389]
Power
Scenario 1
File system metadata corruption
• Part of disk subsystem failed while file system remained operational
• Application IO errors combined with file system and SCSI IO errors
• Disk subsystem was restored
• No offline file system check was performed to fix metadata
Solution/mitigation
• Production offline required to perform full check
• Restore individual files which were marked for deletion and placed in lost+found
• Replication
Corruption
Scenario 2
A case that recently reached our desk – corruption of a 700TB Oracle database
Challenge
• Production down
− Database down (corruption; would not start)
− Application pointed to disk subsystem
− 32 node farm
− 50,000 LUNPATHS (we are seeing systems in excess of 200,000 LUNPATHS)
• Restoration
− Exactly what area is corrupt: data or temp/redo space?
• Why/How?
Corruption: layers
Scenario 2
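[Figure: storage layers beneath an Oracle database, from tables down to bits on disk (011100000100…)]
Without ASM
• Tables
• Tablespaces
• Files
• File Systems (VxFS, .., CFS)
• Logical Volumes, Volume Groups, Physical Volumes ((S)LVM, VxVM, CVM)
With ASM
• Tables
• Tablespaces
• Files and Disk Groups managed by ASM, displayable in Oracle views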
Corruption: layers
Scenario 2
[Figure: Linux I/O stack]
User Space
• Applications
• GNU C lib
Kernel Space
• System Call Interface
• VFS (ext3, NTFS, VxFS, etc.)
• Buffer Cache
• MPIO – device mapper
• RAW
• LVM, VxVM, Oracle ASM
• Blkdev: SCSI, IDE, etc…
• SCSI Upper Layer: SD, ST, SR, SG
• SCSI Mid Layer: glue
• SCSI Lower Layer: FC, iSCSI, SAS, etc…
User corruption
Scenario 2
Persistent naming is achieved by taking the unique WWID reported by scsi_id -g -u -s and placing it into the multipaths{} section of the multipath.conf file
multipaths {
    multipath {
        wwid 360060e8005709a000000709a000000c4
        alias Oracle_vote1
    }
}
Example:
#> multipath -ll
Oracle_vote1 (360060e8005709a000000709a000000c4) dm-11 HP,OPEN-V
[size=513M][features=1 queue_if_no_path][hwhandler=0][rw]
\_ round-robin 0 [prio=0][active]
\_ 0:0:0:4 sde 8:64 [active][ready]
\_ 1:0:0:4 sdk 8:160 [active][ready]
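When many LUNs need persistent names, the multipaths{} section is usually generated rather than typed by hand. The following is a minimal sketch that renders such a section from an alias-to-WWID mapping; the function name is hypothetical, and the only values used are the alias and WWID shown on this slide.

```python
def render_multipaths(aliases):
    """Render a multipaths{} section from a mapping of alias -> WWID."""
    lines = ["multipaths {"]
    for alias, wwid in sorted(aliases.items()):
        lines += [
            "    multipath {",
            f"        wwid  {wwid}",
            f"        alias {alias}",
            "    }",
        ]
    lines.append("}")
    return "\n".join(lines)


stanza = render_multipaths(
    {"Oracle_vote1": "360060e8005709a000000709a000000c4"}
)
print(stanza)
```

Keeping the mapping in one reviewed source and generating the configuration from it avoids hand edits to files that multipath manages itself.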
Corruption – result of human error
Scenario 2
Administrators manually modified /var/lib/multipath/bindings
This file is a cache file created by multipath for persistent mapping of device files
• /var was its own filesystem
• /var not mounted at boot time
• / filesystem had its own /var/, which was covered up later in bootstrap
• Identification
• Mitigation (establish definitions within multipath.conf)
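The failure described here comes from two copies of the bindings file disagreeing: the one on the root filesystem that is visible early in boot, and the one on the separately mounted /var. A minimal sketch of how such a disagreement could be detected follows. It assumes the bindings format of one "alias wwid" pair per line with "#" comments; the function names, the alias mpath5, and the second WWID in the example are hypothetical.

```python
def parse_bindings(text):
    """Parse 'alias wwid' lines into a wwid -> alias mapping; '#' starts a comment."""
    mapping = {}
    for line in text.splitlines():
        parts = line.split("#", 1)[0].split()
        if len(parts) < 2:
            continue  # blank, comment-only, or malformed line
        alias, wwid = parts[0], parts[1]
        mapping[wwid] = alias
    return mapping


def binding_conflicts(root_copy, var_copy):
    """Return WWIDs that carry a different alias in each copy of the file."""
    root = parse_bindings(root_copy)
    var = parse_bindings(var_copy)
    return {
        wwid: (root[wwid], var[wwid])
        for wwid in root.keys() & var.keys()
        if root[wwid] != var[wwid]
    }


# Illustrative contents only; the second WWID is made up.
root_fs_copy = """# copy on / (seen before /var is mounted)
mpath0 360060e8005709a000000709a000000c4
mpath1 360060e8005709a000000709a000000c5
"""
var_fs_copy = """# copy on the mounted /var
mpath5 360060e8005709a000000709a000000c4
mpath1 360060e8005709a000000709a000000c5
"""
conflicts = binding_conflicts(root_fs_copy, var_fs_copy)
print(conflicts)
```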
Performance
Scenario 3
10Gb High Availability Network connections not meeting expectations
Previous solution was a 1GE port channel environment on aging infrastructure
Upgraded servers to
• Single blade w/ 64 cores (4 x 16-core processors)
• 256GB memory
• 4 x 10Gb flex fabric interface ports
• FCoE & iSCSI
Application performance was expected to be nearly 8X faster
Performance: applications
Scenario 3
Aging applications are becoming bottlenecks
Few older applications make use of parallel work streams that leverage the overall bandwidth capacity of today’s servers
• IT infrastructure at the time of application design
• Home-grown application scaled to unforeseen and unpredicted use
− Production use throttled development initiatives
− Closed-source application vendor no longer exists
Performance: the stack
Scenario 3
[Figure: the application stack]
• Layers: Application Layer, Integration Layer, Data Layer, Infrastructure Layer
• Elements shown: program design, process model, stability, scalability, data access, structured/unstructured data, bus, mapping, message queues, OS, server, storage, clustering, networking
Performance
Scenario 3
Throughput is achieved by optimizing parallel streams
TCP
# of streams  TCP RTT (ms)  KB/IO  MTU    TCP Segments/IO  SCSI RTT  KB/sec        Mbit/sec   Calculated MB/sec  RTT (s)
1             0.009         1.46   1,500  1                0.01      162,222.22    1,267.36   158                0.000009
2             0.009         1.46   1,500  1                0.01      324,444.44    2,534.72   317                0.000009
3             0.009         1.46   1,500  1                0.01      486,666.67    3,802.08   475                0.000009
4             0.009         1.46   1,500  1                0.01      648,888.89    5,069.44   634                0.000009
5             0.009         1.46   1,500  1                0.01      811,111.11    6,336.81   792                0.000009
6             0.009         1.46   1,500  1                0.01      973,333.33    7,604.17   951                0.000009
7             0.009         1.46   1,500  1                0.01      1,135,555.56  8,871.53   1,109              0.000009
8             0.009         1.46   1,500  1                0.01      1,297,777.78  10,138.89  1,267              0.000009
iSCSI
# of streams  TCP RTT (ms)  KB/IO  MTU    TCP Segments/IO  SCSI RTT  KB/sec        Mbit/sec   Calculated MB/sec  SCSI SVC (s)
1             0.100         8.00   1,500  6                0.60      13,333.33     104.17     13                 0.000600
2             0.100         8.00   1,500  6                0.60      26,666.67     208.33     26                 0.000600
3             0.100         8.00   1,500  6                0.60      40,000.00     312.50     39                 0.000600
4             0.100         8.00   1,500  6                0.60      53,333.33     416.67     52                 0.000600
5             0.100         8.00   1,500  6                0.60      66,666.67     520.83     65                 0.000600
6             0.100         8.00   1,500  6                0.60      80,000.00     625.00     78                 0.000600
7             0.100         8.00   1,500  6                0.60      93,333.33     729.17     91                 0.000600
8             0.100         8.00   1,500  6                0.60      106,666.67    833.33     104                0.000600
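The numbers in the TCP and iSCSI tables on this slide can be reproduced with a simple model. The sketch below is one reading of the tables, not the authors' own calculation: it assumes each I/O is split into MTU-sized segments, each segment costs one TCP round trip, streams scale linearly, and 1 MB = 1,024 KB. The function name is illustrative.

```python
import math


def stream_model(streams, tcp_rtt_ms, kb_per_io, mtu_bytes=1500):
    """Estimate throughput when every segment of an I/O waits one round trip."""
    segments = math.ceil(kb_per_io * 1024 / mtu_bytes)  # TCP segments per I/O
    service_s = segments * tcp_rtt_ms / 1000            # time to complete one I/O
    kb_per_sec = streams * kb_per_io / service_s        # linear scaling by stream
    return {
        "segments": segments,
        "service_s": service_s,
        "kb_per_sec": kb_per_sec,
        "mbit_per_sec": kb_per_sec * 8 / 1024,
        "mb_per_sec": kb_per_sec / 1024,
    }


tcp_one = stream_model(1, 0.009, 1.46)
iscsi_one = stream_model(1, 0.100, 8.0)
iscsi_eight = stream_model(8, 0.100, 8.0)

for label, row in [("TCP x1", tcp_one), ("iSCSI x1", iscsi_one), ("iSCSI x8", iscsi_eight)]:
    print(label, row["segments"], round(row["kb_per_sec"], 2), round(row["mb_per_sec"]))
```

The model makes the slide's point concrete: a single-threaded application doing 8 KB iSCSI I/O at 0.1 ms per round trip tops out near 13 MB/sec regardless of link speed, and only additional parallel streams or lower latency raise that ceiling.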
Performance: the solution
Scenario 3
Problem identified with the single-threaded nature of the application
To achieve the desired performance, the latency between the data and the analytics had to be reduced, because application rework was not an option (source lost).
• Critical business application and data placed on local storage
− Latency maintained at or below ~0.1 msec where data set size allowed for such latency
• HP PCI-based Smart Array battery-backed disk controllers with SSD disks
• Tiered storage model adopted
• Utilized capacity of local server resources for spatial locality of application to reduce network latency
− More cores and memory allow for application and OS virtualization on same physical machine
− NOTE: a virtual switch allows network communication to stay on the adapter when talking between guests on the same vswitch
Take away
Storage Design considerations
Take away
Storage Tech: NAS
• Work load: Mixed work loads
• Scale: Depends on change rate
• File types: Many (millions) smaller to medium sized files
• Aggregate throughput requirement: < 5 Gbytes/sec
• Protocols: CIFS, NFS
Storage Tech: NAS, FC
• Work load: Mixed random and high sequential throughput
• Scale: Depends on change rate
• File types: Some large files (most < 100 Gbyte) and some smaller files
• Aggregate throughput requirement: 5 to 10 Gbytes/sec
• Protocols: CIFS, NFS, FTP, HTTP, WebDAV, iSCSI/block access, FC
Storage Tech: FC, NAS
• Work load: Very high sequential bandwidth access to a single file
• Scale: Depends on change rate
• File types: Very large files (most over 100 Gbyte); structured databases (data warehouses)
• Aggregate throughput requirement: 10s to 100s of Gbytes/sec
• Protocols: FC; NAS w/ IB for low latency and throughput
Scalability Factors
Take away
• Next Generation Data Centers
− Power/heat
− Scalable storage and compute power: cloud platforms
• Solution Designs
− Availability
− Scalability
− Recovery
− Performance
Always on support from HP
Take away
Who does your IT staff call?
• Several levels
− Foundation Care
− Proactive Care
− Datacenter Care
− Lifecycle Event Services
Complex Solution Team
• Multi-vendor
• Multi-solution
Thank you