Infrastructure Requirements for Discovery Research
Chris Dagdigian
2009 PRISM Forum
Kicking Things Off
• Thanks for inviting me!
• Warning:
– I speak very fast
– Infamous for massive slide decks
• Goals for today
– Objective assessment of the state of IT for discovery informatics
– Some specifics on:
• Green IT & Virtualization
• Storage
• Compute & HPC trends
• Networking
• Cloud Computing
– Some practical tips & advice
– Plenty of time for questions and conversation
BioTeam Inc.
Independent Consulting Shop
• Vendor/technology agnostic
Staffed by:
• Scientists forced to learn
High Performance IT
• Many years of industry &
academic experience
Our specialty:
• Bridging the gap between
Science & IT
I am not a “visionary”
cluster building in ‘02
Recipe for current career arc:
1. Find people smarter than myself
2. Watch what they do
3. Try to understand the ‘why’
4. Shamelessly copy them
What this means for today’s talk:
• I have no interest in being an ‘expert’, ‘visionary’, ‘talking-head’ or pundit
• My interest lies in practical methods for applying IT to solve research problems
• I’ll tell you what I think along with what I’ve seen, done and broken in my daily work
• Happy to be challenged & questioned
First: Infrastructure Picture Tour
Single instrument scale: Self-contained lab-local cluster & storage for Illumina
Live Cell Imaging: High-speed confocal microscope rig
Workgroup Scale:
100 terabyte storage system and 10-node / 40-core Linux cluster supporting multiple NGS instruments
Large Core Facility
Computing, Virtualization & Scale-out
Compute Power, continued
• Trend
– Physical size of compute infrastructure is shrinking rapidly
– Virtualization, consolidation & multi-core are among the reasons
– Why build large clusters when you can couple a smaller number of 8- or 16-core servers/blades, each supporting 0.5 TB of RAM?
• Result
– Shrinking floorspace needs for HPC
– Facility issues are now the primary compute constraint
– High density computing is now restricted by available power density and cooling capacity
– Research IT staff need to comprehend facility, power and cooling issues in 2009 & beyond. This is critical.
Virtualization In Research IT
Still the lowest hanging fruit in most shops
• Tremendous benefits for:
• Operators, end-users,
environment & budgets
• Tipping point for me was:
• Live migration of running VMs
without requiring a proprietary
file system underneath
Virtualization In Research IT
Seen in 2009:
• Campus “Virtual Colocation Service”
• Deployed when HVAC/power hit facility limits
• Available campus-wide to all researchers & groups
• Built with VMware & NetApp on Sun hardware
• Aggressive thin-provisioning & content optimization
• Truly significant payoff:
• ~400 servers currently virtualized
• Large # of physical servers retired & shut down
• Storage savings from de-dup, compression & thin provisioning
• Significant electrical & HVAC savings
• Full delegation of administrative control to owners
Virtualization Can Greatly Assist Research IT
Another benefit for Research IT shops:
• Lets scientists design, deploy and manage apps and services that are
not part of the “enterprise” portfolio
• Solves a common problem in large IT environments:
• Scientists routinely building web apps and services to satisfy
individual or workgroup level requirements
• Often need administrative control over the web server and
elevated access permissions on the base OS
• Apps and services do not meet enterprise standards for support,
security, documentation & lifecycle management
Learn from the big guys: “Trickle-Down” Scale-out Tips
• Google, Microsoft & Amazon all operate at extreme scales that few of us come close to matching
• Datacenter scaling & efficiency measures taken by these companies are tightly held trade secrets for competitive reasons
• This is starting to change and we will all benefit
• Both from “trickle-down” best practices & hard data
• And with vendor products improved to meet these demanding customers
“Trickle-Down” Scale-out Tips
April 2009:
• Google Datacenter Efficiency Summit
• Presentations now online
• Google video on YouTube showing ’04-era 780W/sq ft containerized datacenter
• If Google was doing this stuff in 2004, what are they up to now?
And from server vendors:
• Dell warranting servers at 95F input temperature
• Rackable Systems warranting its C2 rack at 104F inlet temperature
September 2009 Report:
• Microsoft Dublin Datacenter
• Evaporative cooling & air-side economization (NO CHILLERS!)
• 95F Operating Temperature
October 2009 Report:
• Microsoft Chicago datacenter (reported operating since July)
• One floor for containers (224,000 servers)
• 112 containers, each with 2,000 servers - 56MW critical load
• Two container stacks moved manually on air skates by 4 humans
Green IT for Cynics
Hyped beyond all reasonable measure
• Reminds me of dotcom-era WAN-scale grid computing PR
– Promising more than can be realistically delivered
Marketing aside, still worth pursuing:
• Shrinking physical footprint, reducing power consumption and better managing HVAC are better for the planet
• … and have tangible results on the bottom line
Green IT for Cynics
Where Green IT matters to me:
• Putting more capability into a smaller space
• Putting capability into spaces that previously could not support it
• Telco closets, underneath wet-lab benches, etc.
• Working within power- or HVAC-starved machine rooms
• Reducing power draw & increasing power use efficiency
• Reducing cooling costs & increasing air handling efficiency
Green IT for Cynics
My 2008 “aha!” moment:
• NexSan SATABeast Storage Arrays
• 48x 1TB SATA disks in 4U enclosure w/ FC interconnects
• “AutoMAID” options built into management console
• What we had:
• Three 48-terabyte SATABeasts in two racks
• APC PDUs with per-outlet monitoring features
• What we saw:
• 30% reduction in power draw
• No appreciable impact on cluster throughput
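A rough sketch of what a 30% reduction can be worth on one rack of arrays. The baseline wattage and electricity rate below are illustrative assumptions, not measured BioTeam figures:

```python
# Hypothetical back-of-envelope: annual savings from a 30% reduction
# in power draw. Baseline wattage and utility rate are assumptions.
baseline_watts = 4000.0   # assumed combined draw of the three arrays
reduction = 0.30          # reduction observed via per-outlet PDU metering
usd_per_kwh = 0.12        # assumed electricity rate

saved_kwh_per_year = baseline_watts * reduction / 1000.0 * 24 * 365
print("~{:,.0f} kWh/year saved (~${:,.0f}/year, before HVAC savings)".format(
    saved_kwh_per_year, saved_kwh_per_year * usd_per_kwh))
```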
Storage
The stakes are high if you don’t solve the research storage problem …
~200 TB of USB disks stored on lab benches
Storage Trends
Seen in 2008-2009
• First 100TB single-namespace project
• First Petabyte+ storage project
• First large Cloud data transit project
• 4x increase in “technical storage audit”
work
• First time witnessing 10+TB catastrophic
data loss
• First time witnessing job dismissals due to
data loss
• Data Triage discussions are spreading
well beyond cost-sensitive industry
organizations
82TB Folder. Still satisfying. Single-namespace is good for science.
1PB volume - Even more satisfying
Data Drift - Real World Example
• Non-scalable storage islands add complexity
• Example:
– Volume “Caspian” hosted on server “Odin”
– “Odin” replaced by “Thor”
– “Caspian” migrated to “Asgard”
– Relocated to “/massive/”
• Resulted in file paths that look like this:
/massive/Asgard/Caspian/blastdb
/massive/Asgard/old_stuff/Caspian/blastdb
/massive/Asgard/can-be-deleted/do-not-delete…
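As one way to surface this kind of drift, a minimal sketch (not a BioTeam tool) that walks a namespace such as /massive and reports directory names appearing in more than one place:

```python
#!/usr/bin/env python
# Minimal drift-spotting sketch: report directory names that appear at
# more than one path under a root, e.g. several "blastdb" copies.
import os
from collections import defaultdict

def find_drift(root):
    seen = defaultdict(list)          # basename -> every path it occurs at
    for dirpath, dirnames, _ in os.walk(root):
        for name in dirnames:
            seen[name].append(os.path.join(dirpath, name))
    return {n: p for n, p in seen.items() if len(p) > 1}

if __name__ == "__main__":
    for name, paths in sorted(find_drift("/massive").items()):
        print(name)
        for path in paths:
            print("   " + path)
```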
Data Management
• Very difficult
• Lab protocols changing faster than LIMS development cycles
• We have seen many different workarounds
– Notebooks, spreadsheets, file structures & wikis
• BioTeam is actively using the MediaWiki platform for cost-effective
data management in rapidly changing research IT environments
General Observations
Storage is a commodity in 2009
Cheap storage is easy
Big storage getting easier every day
Big, cheap & SAFE is much harder …
Data movement & management can
be the hardest problem
Traditional backup methods may no
longer apply
• Or even be possible …
Observations cont.
• End users still have no clue about
the true costs of keeping data
accessible & available
• “I can get a terabyte from Costco for $220!” (Aug 08)
• “I can get a terabyte from Costco for $160!” (Oct 08)
• “I can get a terabyte from Costco for $124!” (April 09)
• IT needs to be involved in setting
expectations and educating on true
cost of keeping data online &
accessible
• Organizations need forward-looking
research storage roadmaps
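To make the gap concrete, a back-of-envelope sketch; every multiplier below is an illustrative assumption, not a BioTeam figure:

```python
# Back-of-envelope: consumer shelf price vs. a fully loaded cost of
# keeping the same terabyte online. All multipliers are assumptions.
raw_usd_per_tb = 124.0     # the April '09 retail quote above

raid_overhead = 1.25       # assumed parity / hot-spare capacity
protected_copies = 2.0     # assumed snapshot/backup copy
power_cooling = 1.5        # assumed facility, power & HVAC burden
admin_support = 1.5        # assumed staff time, monitoring, migrations

loaded = (raw_usd_per_tb * raid_overhead * protected_copies
          * power_cooling * admin_support)
print("Raw disk:     ${:.0f}/TB".format(raw_usd_per_tb))
print("Kept online: ~${:.0f}/TB".format(loaded))   # roughly 5-6x shelf price
```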
Observations cont.
• The rise of “terabyte instruments” is
already having a major disruptive influence
on existing environments
• We see individual labs deploying
100TB+ systems
• Data movement is hard, especially to/from
wet labs
• I was wrong when I said
• “petabyte-scale storage needs will appear within the decade …”
• That time is now for some large
organizations
Homework Exercise
• Select three vendors
• Build quotes for 100TB single-
namespace NAS solution
• My results:
• $100K to $1.5M range
• Repeat every six months
Follow-up:
Price a Petabyte Disk Array:
$125,000 - $4M USD
Capacity Dilemma: Data Triage
Data Triage
• The days of unlimited storage
for research are over
• Rate of consumption
increasing unsustainably
• First saw triage acts in 2007
(industry client)
• Becoming acceptable practice
in 2008
• Absolutely a given in 2009 for
most projects we see
Architecting Storage & Data Movement For Research IT
• First principle:
– Understand the data you will produce
– Understand the data you will keep
– Understand how the data will move
• Second principle:
– One platform or many?
– One vendor or many?
– One lab/core or many?
Final thoughts on storage for 2009
• Yes, the problem is real
• More and more “terabyte instruments” are coming to market
• Some of us have peta-scale storage issues today
• “Data Deluge” & “Tsunami” are apt terms
But:
• The problem does not feel as scary as it once did
• Chemistry, reagent cost & human factors are natural bottlenecks
• Data Triage is an accepted practice, no longer heresy
• Data-reduction starting to happen within instruments
• Customers starting to trust instrument vendor software more
• We see large & small labs dealing successfully with these issues
• Many ways to tackle IT requirements
Mix and match solutions to fit local need …
Stolen Broad Slide - Future trend regarding downstream data …
Utility / Cloud Computing
Cloud/Utility Computing - Setting the stage
• Burned by “OMG!! GRID Computing” hype: in 2009 I will try hard never to use the word “cloud”
in any serious technical conversation. Vocabulary matters.
• Understand My Bias:
– Speaking of “utility computing” as it resonates with infrastructure people
– My building blocks are servers or groups of systems, not software
stacks, developer APIs or commercial products
• Goal:
• Replicate, duplicate, improve or relocate complex systems
Let’s Be Honest
• Not rocket science
• Easy to learn, prototype &
understand
• Easy to learn
strengths/weaknesses
While I’m being honest
• Amazon Web Services (“AWS”) IS the cloud
– Simple, practical, understandable and usable today by just about anyone
– Rollout of features and capabilities continues to be impressive
• Competitors are years behind
– … and tend to believe too much of their own marketing materials
• The cloud is real, usable and useful TODAY
• If you are just starting out in this space:
– Lowest hanging fruit: dev, test & pilot projects
• Almost immediate payback in many cases
– Next step: CPU bound scientific problems
– Future: Archive/deep storage with cloud providers
Amazon Web Services
• A collection of agile infrastructure services available on-demand
• New products and features added almost monthly
• Recent enhancements:
– Two-factor Authentication & Rotating Credentials
– Virtual Private Cloud (“VPC”) Product
– EC2 auto-scaling & load-balancing
– http://aws.amazon.com/about-aws/whats-new/
AWS Products/Services
• EC2 - Elastic Compute Cloud
– Scalable on-demand virtual servers
• SimpleDB - Simple Database Service
– Simple queries on structured data
• S3 - Simple Storage Service
– Bucket/object based storage
• EBS - Elastic Block Store
– Persistent block storage (looks like a disk)
AWS Products/Services
• SQS - Simple Queue Service
– Message passing service
• Elastic MapReduce
– Hadoop on AWS
– Terabyte scale data mining & processing
• VPC - Virtual Private Cloud
– Connect your infrastructure to AWS via VPN tunnel
– (more important than it sounds …)
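For flavor, a minimal sketch of driving two of these services from Python. It uses the boto3 SDK, which postdates this talk; the AMI ID, instance type and bucket/key names are placeholder assumptions:

```python
# Sketch: launch one EC2 virtual server and drop a file into S3
# object storage. boto3 is a modern SDK (anachronistic for 2009);
# AMI, instance type and bucket/key names are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
s3 = boto3.client("s3")

# EC2: one on-demand virtual server.
resp = ec2.run_instances(
    ImageId="ami-12345678",        # placeholder machine image
    InstanceType="m5.large",       # placeholder instance type
    MinCount=1,
    MaxCount=1,
)
print("Launched", resp["Instances"][0]["InstanceId"])

# S3: bucket/object storage for results.
s3.upload_file("results.tar.gz", "my-research-bucket", "run42/results.tar.gz")
```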
Cloud Sobriety
McKinsey presentation “Clearing the Air on Cloud Computing” is a must-read
• Tries to deflate the hype a bit
• James Hamilton has a nice reaction:
• http://perspectives.mvdirona.com/
Both conclude:
• IT staff needs to understand “the cloud”
• Critical to quantify your own internal costs
• Perform your own due diligence
Cloud Security
• Lots of overblown fears (and some political posturing)
• My personal take:
• Amazon, Google & Microsoft quite probably have better internal operating controls than you do
• All of them are happy to talk as deeply as you like about all issues relating to security
• Do your own due diligence & don’t let politics or IT empire issues cloud decision making
• Biggest issue for me may be per-country data protection and patient privacy rules
http://aws.amazon.com/security/
State of Amazon AWS
New features are being rolled out fast and furious
But …
– EC2 nodes still poor on disk IO operations
• EBS service can use some enhancements
• Many readers, one-writer on EBS volumes would be fantastic
• Poor support for latency-sensitive applications and workflows that prefer tight network topologies
This matters because:
• Compute power is easy to acquire
• Life science tends to be IO bound
• Life science is currently being buried in data
AWS & HPC: A whole new world
• For cluster people, some radical changes
– Years spent tuning systems for shared access
– Utility model offers dedicated resources
– EC2 not architected for our needs
– Best practices & reference architectures will change
• Current State: Transition Period
– Still hard to achieve seamless integration with local clusters & remote utility clouds
– Most people are moving entire workflows into the cloud rather than linking grids
– Some work being done on ‘transfer queues’
HPC Informatics & AWS: Summary
• Virtualized networking is ‘reasonable’ but there are certainly issues that
need to be worked around
• Network latency can be high
• Virtualized storage I/O is far slower than anything we can do with local
resources. Absolute fact.
• Still hard to share data/storage across many systems
• Inability to currently request EC2 nodes that are “close” in network
topology terms is problematic (but likely to change)
• MapReduce is not a viable solution for everyone
• Amazon has a deep interest in HPC workflows, expect them to address
all of our concerns
Cloud Storage
• It is quite probable that the “internet-scale” providers can provide
storage far more cheaply than we can ourselves
– Especially if we are honest about facility, power, continuity and operational costs
• Some people estimate the cost at ~$0.80/GB/year and falling fast for Amazon
and others to provide 3x geographically replicated raw storage (a back-of-envelope sketch follows this slide)
– Can you seriously match this?
• These prices come from operating at extreme efficiency scales that we
will never be able to match ourselves
• Question: how best to leverage this?
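The back-of-envelope comparison, at 100 TB; the in-house rate is a pure assumption for illustration:

```python
# Back-of-envelope: utility storage at the quoted ~$0.80/GB/year vs. a
# hypothetical fully loaded in-house rate, for 100 TB. Both rates are
# assumptions for illustration.
capacity_gb = 100 * 1000            # 100 TB in (decimal) gigabytes
cloud_rate = 0.80                   # quoted estimate, 3x replicated
inhouse_rate = 2.50                 # assumed fully loaded local cost

print("Utility:  ${:,.0f}/year".format(capacity_gb * cloud_rate))
print("In-house: ${:,.0f}/year".format(capacity_gb * inhouse_rate))
```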
When the ingest problem is solved …
• I think there may be petabytes of life science data that would flock to
utility storage services
• Public and private data stores
• Massive amounts of grant-funded study data
• Archive store, HSM target and DR store
• “Downloader Pays” model is compelling for people required to
share large data sets
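The “downloader pays” idea maps onto what Amazon S3 calls Requester Pays. A minimal sketch, again using boto3 (which postdates this talk); the bucket name is a placeholder:

```python
# Sketch: turn on S3 "Requester Pays" (Amazon's name for the
# downloader-pays model) so downloaders, not the bucket owner, pay
# data transfer costs. Bucket name is a placeholder.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_request_payment(
    Bucket="public-study-data",
    RequestPaymentConfiguration={"Payer": "Requester"},
)
```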
Putting It All Together
Putting It All Together - Agile IT for Biomarker Discovery
• Understand that science changes faster than IT infrastructures
• Designing for flexibility is key
• Large groups: consider copying the large Next-Gen labs
– IT budgeted via “cost per gigabase sequenced”
– IT represents ~27% of sequencing cost on a per-gigabase basis (a toy sketch follows this slide)
– Storage/compute for production is fixed in sequencing budget
– Storage/compute for post-capture analysis is a consumable
• Compute power is easy to acquire in 2009
– Concentrate on: Density, Flexibility, Virtualization, “Green IT” gains
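A toy illustration of the per-gigabase budgeting model above; the sequencing cost and throughput are made-up inputs, and only the ~27% IT share comes from the slide:

```python
# Toy per-gigabase IT budget. Sequencing cost and throughput are
# hypothetical; the ~27% IT share is the figure quoted above.
usd_per_gigabase = 1000.0        # assumed all-in sequencing cost
it_share = 0.27                  # IT fraction of per-gigabase cost
gigabases_per_quarter = 500      # assumed sequencing throughput

it_budget = gigabases_per_quarter * usd_per_gigabase * it_share
print("Quarterly IT budget: ${:,.0f}".format(it_budget))  # -> $135,000
```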
Putting It All Together - Agile IT for Biomarker Discovery
• Storage is often our biggest headache
– Concentrate on: speed, scaling and large namespaces
• Data Movement & Management equally hard
– Concentrate on: data movement (networks), data lifecycle & curation
– Can your LIMS handle rapid protocol changes and new data?
• Cloud Computing is real & useful today
– In 2009, Amazon Web Services IS “the cloud”
– Lowest hanging fruit: dev, pilot & test programs
– Potential use case: long term, deep archive store + “downloader pays”
model for distribution of public-funded data sets
End;
• Thanks!
• Presentation slides will appear here:
– http://blog.bioteam.net
• Comments/feedback:
– chris@bioteam.net