High Performance Computing in Our Everydays
Peter Wittek
Swedish School of Library and Information Science, University of Borås
10/10/11
Outline
1 What Is New in HPC?
2 Supporting Frameworks
3 Computational Requirements of Digital Libraries
4 A Workflow in Cloud HPC
5 Experimental Results
6 Open Issues
7 Conclusions
What Is New in HPC?
Cloud HPC
Cloud computing: think of it as a utility
E.g., you get to use 10 small computer instances for $0.82 an hour
Your computer instances do not necessarily correspond to actual computers
Virtualization
Demo: ReactOS
Latest contestant in cloud computing: HPC
Not ordinary computer instances
What Is New in HPC?
Massive Parallelism
Figure: Floating-Point Operations per Second for the CPU and GPU
What Is New in HPC?
Massive Parallelism
Figure: CPU versus GPU architecture. The CPU spends die area on a large cache and control logic with a few ALUs; the GPU packs many ALUs with minimal control and cache. Both are backed by DRAM.
Streaming hardware
Explicit memory management
What Is New in HPC?
Massive Parallelism
Parallel versus distributed computing
Distributed nodes do not share memory:
Connected through a network;
Calculations may run in a parallel fashion;
Other nodes do not see what one node has computed;
Nodes may fail.
What Is New in HPC?
Why You Should Care
Digital libraries and HPC?
No need for upfront investment;
Go beyond full-text search;
Machine learning;
Pattern matching;
Social media and graph mining.
You can define a new field
Freedom
Supporting Frameworks
Why Is Distributed Computing Hard?
Take an example: creating an inverted index
An inverted index is at the core of search engines
A simple example:
term1: (doc1, freq11), (doc5, freq51)
term2: (doc1, freq12), (doc3, freq32), (doc6, freq62)
A naïve approach to parallelize:
Have an indexer at each node;
Distribute documents to nodes;
Let nodes broadcast the lists (Message Passing Interface, MPI).
Supporting Frameworks
MapReduce
Published in 2004 by Google researchers
Since then it has become widespread in data-intensive processing
Core idea: keep things simple; you can do two things:
Map: send out chunks of data and then do something on them
Reduce: collect chunks of data and do something on them while collecting
Intermediate data structure: key-value pairs
The framework should also take care of the mundane tasks, such as failing nodes, network latency, etc.
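The map/reduce idea above can be sketched in plain Python with the canonical word-count example. This is a single-process illustration, not any real framework: the function names and the driver loop are stand-ins for what a framework would run across nodes.

```python
from collections import defaultdict
from itertools import chain

def map_fn(chunk):
    # Map: emit a (key, value) pair for each word in this chunk of text
    return [(word, 1) for word in chunk.split()]

def reduce_fn(key, values):
    # Reduce: merge everything collected under one key
    return key, sum(values)

def run_mapreduce(chunks):
    # Shuffle: group intermediate key-value pairs by key, then reduce each group
    groups = defaultdict(list)
    for key, value in chain.from_iterable(map_fn(c) for c in chunks):
        groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())

counts = run_mapreduce(["to be or", "not to be"])
# counts == {"to": 2, "be": 2, "or": 1, "not": 1}
```

A real framework adds exactly the mundane parts elided here: distributing the chunks, moving the intermediate pairs over the network, and restarting failed tasks.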
Supporting Frameworks
A MapReduce Inverted Indexer
The task is: formulate your problem in MapReduce terms
Map: gets a chunk of text. Emits:
Key: term
Value: document id and corresponding frequency
Reduce: merges by key
There might be a different number of map and reduce tasks
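The formulation above can be sketched in framework-free Python; `map_fn`, `reduce_fn`, and the driver are illustrative stand-ins for what a MapReduce framework would execute on each node.

```python
from collections import Counter, defaultdict

def map_fn(doc_id, text):
    # Map: for each distinct term in this document,
    # emit key = term, value = (doc_id, frequency)
    for term, freq in Counter(text.split()).items():
        yield term, (doc_id, freq)

def reduce_fn(term, postings):
    # Reduce: merge all postings emitted for one term into a sorted list
    return term, sorted(postings)

def index_documents(docs):
    # Driver: group intermediate pairs by key, then reduce each group
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        for term, posting in map_fn(doc_id, text):
            groups[term].append(posting)
    return dict(reduce_fn(t, p) for t, p in groups.items())

index = index_documents({"doc1": "term1 term2", "doc5": "term1"})
# index["term1"] == [("doc1", 1), ("doc5", 1)]
```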
Supporting Frameworks
Another MapReduce Example
Sometimes it is worth bypassing the reduce phase
Then we do not need to emit key-value pairs at all
Distributed GPU random projection
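A map-only random projection might look like the sketch below, using NumPy on the CPU as a stand-in for a GPU kernel; the matrix shapes and the shared seed are illustrative assumptions. Each map task projects its own chunk of rows with the same random matrix and writes the result out directly, so there is no reduce phase and no key-value pairs.

```python
import numpy as np

def make_projection(dim_in, dim_out, seed=42):
    # Every node derives an identical random matrix from a shared seed,
    # so no node needs to see another node's data
    rng = np.random.default_rng(seed)
    return rng.standard_normal((dim_in, dim_out)) / np.sqrt(dim_out)

def map_only_task(chunk, projection):
    # Map: project this node's chunk of rows into the lower-dimensional
    # space; each task writes its output independently, nothing to merge
    return chunk @ projection

data = np.ones((8, 100))           # one node's chunk: 8 docs, 100 terms
proj = make_projection(100, 10)    # shared 100 -> 10 projection
reduced = map_only_task(data, proj)
assert reduced.shape == (8, 10)
```

On a GPU the matrix multiply would be handed to streaming hardware (e.g. a BLAS routine), which is exactly the kind of dense arithmetic GPUs excel at.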
Supporting Frameworks
Exploiting GPU Resources
Low-level frameworks: CUDA and OpenCL
They certainly do not make GPUs much friendlier
Higher-level libraries: BLAS, cuSPARSE
As long as you know maths...
Supporting Frameworks
Overcoming GPU Obstacles
GPU MapReduce
Academic projects: Mars, GPMR
GPU-aware MapReduce: extend existing frameworks
Develop extensive middleware
Computational Requirements of Digital Libraries
Digital Preservation
Future-proofing document collections
Emulation
Migration
Workflows are often tremendously compute-intensive
Computational Requirements of Digital Libraries
Machine Learning and Advanced Services
Digital collections and social networks
A step towards digital curation
SaaS approach to digital curation
Indexing by Lucene/Nutch
Collection-level metadata extraction by Mahout
A Workflow in Cloud HPC
A Middleware Architecture
Figure: Middleware architecture. Support services (document processes, context search, data mining) sit on top of a MapReduce engine, policy enforcement, and an archival storage interface; the middleware runs on grid or cloud storage and grid or cloud computing.
A middleware to make adoption by DL practitioners easier
Moving towards computational science
Experimental Results
Cost
Figure: Comparison of average cost of computations with different collection sizes (100, 1,000, and 10,000 documents); average cost in USD versus number of processing cores (1 to 80).
Experimental Results
Running time
Figure: Comparison of running times with different collection sizes (100, 1,000, and 10,000 documents); running time in minutes versus number of processing cores (1 to 80).
Open Issues
Obstacles to Adoption
Persistence and high reliability
MapReduce
Not just a technological issue
Service-level agreement
Particularly problematic
Another EU FP7 project working on it: SLA@SOI
Niche for alternative cloud providers
Difficulty of integration
Conclusions
Acknowledgment
Work has been funded by Sustaining Heritage Access through Multivalent ArchiviNg (SHAMAN), an EU FP7 large integrated project.
http://shaman-ip.eu/shaman/
Additional funding has been received from Amazon Web Services.
http://aws.amazon.com/
Conclusions
Summary
Cloud and HPC: a solution looking for a problem
Digital libraries
Computational requirements
Expertise
Complexity and integration
Contact: [email protected]