Elastic r sc10-tutorial

Elastic-R: A Virtual Collaborative Environment for Scientific Computing and Data Analysis in the Cloud, Karim Chine


Transcript of Elastic r sc10-tutorial

Page 1: Elastic r sc10-tutorial

Elastic-R: A Virtual Collaborative Environment for Scientific Computing and Data Analysis in the Cloud

Karim Chine

Page 2: Elastic r sc10-tutorial

o Open-source (GPL) software environment for statistical computing and graphics

o Lingua franca of data analysis.

o Repositories of contributed R packages related to a variety of problem domains in life sciences, social sciences, finance, econometrics, chemometrics, etc. are growing at an exponential rate.

o R Website: http://www.r-project.org/

o CRAN Task View: http://cran.r-project.org/web/views/

o CRAN packages: http://cran.cnr.berkeley.edu/

o Bioconductor: http://www.bioconductor.org/

o R Metrics: https://www.rmetrics.org/

Scientific Computing Environments

www.scilab.org

http://root.cern.ch

www.sagemath.org

www.sas.com

office.microsoft.com

www.mathworks.com

www.scipy.org

www.spss.com

www.wolfram.com

Page 3: Elastic r sc10-tutorial

From: John Fox, Aspects of the Social Organization and Trajectory of the R Project, R Journal-Feb 2009

R's Success Story

Page 4: Elastic r sc10-tutorial

"Give me a place to stand, and I shall move the earth with a lever"

Scientific Computing Software, HPC and Usability

Page 5: Elastic r sc10-tutorial

Extract from the NetSolve/GridSolve Description Document

The emergence of Grid computing as the prototype of a next generation cyberinfrastructure for science has excited high expectations for its potential as an accelerator of discovery, but it has also raised questions about whether and how the broad population of research professionals, who must be the foundation of such productivity, can be motivated to adopt this new and more complex way of working. The rise of the new era of scientific modeling and simulation has, after all, been precipitous, and many science and engineering professionals have only recently become comfortable with the relatively simple world of the uniprocessor workstations and desktop scientific computing tools. In that world, software packages such as Matlab and Mathematica represent general-purpose scientific computing environments (SCEs) that enable users, totaling more than a million worldwide, to solve a wide variety of problems through flexible user interfaces that can model in a natural way the mathematical aspects of many different problem domains.

Moreover, the ongoing, exponential increase in the computing resources supplied by the typical workstation makes these SCEs more and more powerful, and thereby tends to reduce the need for the kind of resource sharing that represents a major strength of Grid computing [1]. Certainly there are various forces now urging collaboration across disciplines and distances, and the burgeoning Grid community, which aims to facilitate such collaboration, has made significant progress in mitigating the well-known complexities of building, operating, and using distributed computing environments. But it is unrealistic to expect the transition of research professionals to the Grid to be anything but halting and slow if it means abandoning the SCEs that they rightfully view as a major source of their productivity. We therefore believe that Grid computing’s prospects for success will tend to rise and fall according to its ability to interface smoothly with the general purpose SCEs that are likely to continue to dominate the toolbox of its targeted user base.

Arnold, D. and Agrawal, S. and Blackford, S. and Dongarra, J. and Miller, M. and Seymour, K. and Sagi, K. and Shi, Z. and Vadhiyar, S.

Page 6: Elastic r sc10-tutorial

Elastic-R targets major e-Science use cases

o Lower the barriers for accessing cyber infrastructures.

o Bridge the gap between existing SCEs and grids/clouds

o Help deal with the data deluge (take the computation to the data)

o Enable collaboration within computing environments

o Simplify the science gateways creation and delivery process

o Provide the building blocks of a user-friendly platform for reproducible computational research

o Provide a cloud-based e-Learning environment for computational sciences and statistical education

o Lower the barriers for using distributed computing, make HTC accessible to all research professionals

o Bridge the gap between different mainstream SCEs and between SCEs and workflow workbenches

o Provide a universal computing toolkit for scientific applications and scalability frameworks.

o Provide a portal for scientific computing on demand, collaboration and computational resources sharing

o Merge on demand HPC capabilities, real time collaboration and social networking

Page 7: Elastic r sc10-tutorial

Computational Components

R packages: CRAN, Bioconductor, wrapped C, C++, Fortran code

Scilab modules, Matlab Toolkits, etc.

Open source or commercial

Computational Resources

Hardware & OS agnostic computing engine: R, Scilab, ...

Clusters, grids, private or public clouds

Free (academic grids) or pay-per-use (EC2, Azure)

Computational User Interfaces

Workbench within the browser

Built-in views / Plugins / Spreadsheets

Collaborative views

Open source or commercial

Computational Scripts

R / Python / Groovy

On client side: interactivity ...

On server side: data transfer ...

Stateful or stateless, automatic mapping of R data objects and functions

Computational Application Programming Interfaces

Java / SOAP / REST, Stateless and stateful

Computational Data Storage

Local, NFS, FTP, Amazon S3, Amazon EBS

Free or commercial

Generated Computational Web Services

Elastic-R

Elastic-R is a ubiquitous plug-and-play platform for scientific and statistical computing.

Elastic-R lowers the barriers of accessing cyberinfrastructures.

Page 8: Elastic r sc10-tutorial

Elastic-R on Infrastructure-as-a-Service style Cloud

Page 9: Elastic r sc10-tutorial

Anatomy of an Elastic-R machine instance on Amazon EC2

Heartbeat Restful WS over SSL

Page 10: Elastic r sc10-tutorial

Public Clouds

Private Cloud

Elastic-R portal: single facade to public and private clouds

Page 11: Elastic r sc10-tutorial

Exercise 1: Get started with Amazon EC2

Task 1: Sign up for Amazon Elastic Compute Cloud
- Open the following link: http://aws.amazon.com/ec2/
- Click on [button image] and follow the instructions

Task 2: Log on to the AWS Management Console
- Open the following link: http://aws.amazon.com/console/
- Choose Amazon EC2 in the combo box and click on [button image]

Task 3: Run an Ubuntu Machine Instance
- Select the AMIs panel, select « All Images » and enter the following AMI ID: ami-1437dd7d
- Right-click on the unique line displayed and choose « Launch Instance »

AMIs Panel

Page 12: Elastic r sc10-tutorial

- In the « Request Instances Wizard », click on « Continue » twice
- At the « CREATE KEY PAIR » step, choose « Proceed without a Key Pair » and click on « Continue »
- At the « CONFIGURE FIREWALL » step, choose « Create a new Security Group », fill in the name and the description (with values of your choice) and click on « Continue »
- At the final « REVIEW » step, click on « Launch », then click on « Close »
- Open the « Instances » panel and wait for the Status icon to change from orange to green

Task 4: Shut down the Machine Instance
- Right-click on the newly created instance, choose « Terminate » and confirm (« Yes, terminate »)

Page 13: Elastic r sc10-tutorial

Exercise 2: Get started with Elastic-R

- Open the following link: www.elastic-r.org
- Log on using the login/password you received by email before the tutorial. Alternatively, use one of the following accounts:
    login: sc10-1 / passwd: sc10-1
    login: sc10-2 / passwd: sc10-2
    login: sc10-3 / passwd: sc10-3
    login: sc10-4 / passwd: sc10-4
- In the « Cloud Console » panel, provide your Amazon.com login and password
- Keep the default value for machine: « R 2.11.0 Ubuntu 10.04 Lucid – 32 bits », keep the default value for capacity: « Small–32 bits–1 core–Memory 1.7 G », keep Disk set to « None » and provide a label for the machine
- Click on « Launch New Machine »

Page 14: Elastic r sc10-tutorial

-The Console displays « Retrieving AWS Keys.. » then « Launching.. » then « Machine started successfully …»

-Click « ok » and wait for a new line to appear in the list of Available Machines (with the label you provided)

- When the machine is ready, you are connected automatically. You can now interact with the R session using the R console and the R graphics

Page 15: Elastic r sc10-tutorial

- Type the following expressions in the R console:
    x=rnorm(100)
    var(x)
    plot(x)
    hist(x)
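The four expressions of this step can also be run in any plain R session. The sketch below is the same sequence with two additions that are assumptions and not part of the tutorial: a fixed seed, so that every attendee obtains the same numbers, and a temporary PDF device, so that the script also runs on a machine without a display.

```r
set.seed(42)             # assumption: fixed seed, for reproducibility only
x <- rnorm(100)          # 100 draws from a standard normal distribution
v <- var(x)              # sample variance, expected to be close to 1
print(v)

out <- tempfile(fileext = ".pdf")
pdf(out)                 # assumption: file device instead of the R graphics view
plot(x)                  # values plotted against their index
hist(x)                  # histogram of the values
invisible(dev.off())     # close the device and flush the file
```

In the Elastic-R console the `pdf()` and `dev.off()` calls are not needed, since the plots appear in the R graphics view.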

Page 16: Elastic r sc10-tutorial

- Open the menus and views and try the different available features

Page 17: Elastic r sc10-tutorial

Elastic-R is a collaborative Virtual Research Environment. Users can share their machine instances, stateful remote engines, data, etc.

Page 18: Elastic r sc10-tutorial

Amazon Virtual Private Cloud

Subnet 2

Subnet 3

Subnet 1

The Elastic-R portal itself is an EC2 machine instance. Any number of portals can be run on EC2 for decentralized and private collaboration

Page 19: Elastic r sc10-tutorial

Exercise 3: Share a machine instance and collaborate with other attendees.

- Open the « Collaboration » menu
- In the « Sharing Rights Editor », enter the logins of the Elastic-R users you will share your machine with and click on « Refresh »
- Providing only a login grants all possible rights on all available engines. Specific sharing rights can be appended to the login
- Examples:
    john:rw allows user john to view and update
    peter:r:EngineTest allows user peter to view only the session data on the engine EngineTest

- The new sharing rights appear in the « Machine Sharing Rights » panel
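The sharing-right entries used in this exercise follow the pattern login, optional rights, optional engine name, separated by colons. The sketch below shows how such an entry decomposes. The helper `parse_sharing_right`, the login `mary` and the placeholder value "all" are hypothetical and only serve the illustration; this is not how the portal itself is implemented.

```r
# Hypothetical helper: splits "login[:rights[:engine]]" into its parts.
parse_sharing_right <- function(entry) {
  parts <- strsplit(trimws(entry), ":", fixed = TRUE)[[1]]
  list(
    login  = parts[1],
    # Assumption: a missing field is recorded as "all", since a bare login
    # grants all possible rights on all available engines.
    rights = if (length(parts) >= 2) parts[2] else "all",
    engine = if (length(parts) >= 3) parts[3] else "all"
  )
}

entries <- c("john:rw", "peter:r:EngineTest", "mary")
parsed <- lapply(entries, parse_sharing_right)
str(parsed)
```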

Page 20: Elastic r sc10-tutorial

- Your collaborators can see your machine in their lists of available machines and can connect to it
- Try the different collaboration features (console broadcast, graphic annotation, whiteboard, spreadsheets, ...)

Page 21: Elastic r sc10-tutorial

-Open the Collaboration Menu and click on “Lead”, change the layout of your environment (resize menus, open and close views). Your collaborators’ browsers follow your Ajax workbench layout

Page 22: Elastic r sc10-tutorial

Visual Graphic User Interface Builder

Elastic-R Java Workbench

Plugins Repository

• myPlugin
• myDashboard

Upload plugin

Elastic-R AJAX Workbench

Standalone Application Accessible From a URL

Users can easily create Java user interfaces (plugins) that use the full capabilities of a stateful, remote R engine and share them as URLs

Page 23: Elastic r sc10-tutorial

Software + Services = Applications convergence + ubiquitous collaboration. The server-side toolkit: R + spreadsheet models + virtual GUI widgets.

Page 24: Elastic r sc10-tutorial

Reproducible research: A scientist can snapshot her computational environment and her data. She can archive the snapshot or share it with others.

Elastic-R Amazon Machine Images
    Elastic-R AMI 1: R 2.10 + BioC 2.5
    Elastic-R AMI 2: R 2.9 + BioC 2.3
    Elastic-R AMI 3: R 2.8 + BioC 2.0

Amazon Elastic Block Stores
    Elastic-R EBS 1: Data Set XXX
    Elastic-R EBS 2: Data Set YYY
    Elastic-R EBS 3: Data Set ZZZ
    Elastic-R EBS 4: Data Set VVV

Combination shown in the diagram: Elastic-R AMI 2 (R 2.9 + BioC 2.3) + Elastic-R EBS 4 (Data Set VVV)

Elastic-R.org

Page 25: Elastic r sc10-tutorial

Exercise 4: Create collaboratively and distribute a scientific dashboard using the Elastic-R server-side widgets designer

- Open the « Views » menu and check « R Panels »; a Java applet is loaded
- Right-click on the empty panel and choose “New Slider”
- Keep the default R expression: slider.make("n", x=93,y=15,panel="ssprimary") and click “Apply”
- A slider appears. Change the value of the slider and check the value of the R variable “n” in the console
- Change the value of n using the R console and notice the change in the slider: the slider is a bidirectional mirror of n
- Right-click on the panel and choose “New Chart”, keep the default expression and click “Apply”
- A bar chart mirroring the R variable “y” appears
- Right-click on the panel and choose “New Macro”, keep the default expression: the macro is triggered when “n” changes and updates “y” as a function of n

-Change the slider and notice the chart update
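The macro in this exercise keeps y up to date whenever n changes. That dependency can be sketched in plain R with an active binding. This is only an analogy: it is not the mechanism or the API behind the Elastic-R widgets, and the formula used for y is an arbitrary assumption.

```r
env <- new.env()
env$n <- 5

# y is re-evaluated from the current value of n every time it is read.
# Assumption: y holds the squares of 1..n, chosen only for illustration.
makeActiveBinding("y", function() seq_len(env$n)^2, env)

print(env$y)   # values derived from n = 5
env$n <- 3     # plays the role of moving the slider
print(env$y)   # y now reflects n = 3 without being reassigned
```

In the dashboard the direction is reversed (a change of n pushes a new y), but the effect for the reader of y is the same: the bar chart always reflects the current n.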

Page 26: Elastic r sc10-tutorial

- Click on « Show Direct URL » and copy the displayed URL to the clipboard
- Open a new browser window and paste the URL in the browser’s address bar
- Notice the Java applet being loaded and the interactive standalone dashboard
- Send the URL by email to your neighbour

Page 27: Elastic r sc10-tutorial

Exercise 5: call Elastic-R cloud engines from Office documents, mirror an R-enabled server-side spreadsheet to Excel (Windows users only).

Exercise 5 bis: Install R packages, generate variables, create new files, spreadsheets and virtual widgets’ panels. Snapshot your environment and share it with others.

- Open the « Tools » menu and click on the Word icon, choose « Open Document with Word » and read the displayed message
- Inside the Word document, type an R expression, select it and press Alt-F1
- Type an R expression producing a graphic, select it and press Alt-F1

Page 28: Elastic r sc10-tutorial

- Open the « Tools » menu and click on the Excel icon, choose « Open Document with Excel »
- Open the « Spreadsheet » view in the browser. Update the cells in the browser and notice what happens in Excel, then update the cells in Excel and notice what happens in the browser

Page 29: Elastic r sc10-tutorial

The scientist can control any number of stateful R engines from within an R session on the cloud or on his machine. He can use them for parallel computing

Page 30: Elastic r sc10-tutorial

Exercise 6: Run two 64-bit Elastic-R machine instances with 4 engines each. Use the 8 engines from within a main R session to apply a function to a large dataset.

- Open the « Portal Info » menu and choose the profile « expert »
- In the cloud console view, enter:
    your Amazon login and your Amazon password
    “R.2.11.1 – Ubuntu 10.04 – 64 bits”
    “Large-64 bits-2 cores-..”
    Shutdown after “1” Hour
    Disk “None”
    an empty label
    Number “2”
    Engine Names “A,B,C,D”
    Memory Min “1024”
    Memory max “1024”
- Check “Use SSL” and click “Launch New Machine”

Page 31: Elastic r sc10-tutorial

- Select the two new machines and click on « Get Workers Links »

Page 32: Elastic r sc10-tutorial

- In the R console, notice the R expressions that create the Rlinks and the notification of their successful creation
- Type the following R commands in the R console:

> R
To display the vector of Rlinks mapping all the engines in the two Large instances

> rlink.console(R[1], 'ls()')
To send the expression 'ls()' to the R engine referenced by R[1]

> rlink.put(R[1], 'n')
To send the variable n to the first engine

> cl <- cluster.make(R)
To create a logical cluster holding all the R engines in the two Large instances

> cluster.console(cl, 'ls()')
To send the expression 'ls()' to the consoles of all the engines

> library(vsn)
> data(kidney)
> l = list()
> for (i in c(1:100)) l[[as.character(i)]] = kidney
To build a list of 100 items, each holding an ExpressionSet

> cluster.console(cl, 'library(vsn)')
To load the library vsn on all workers

> cluster.apply(cl, 'l', 'justvsn')
To normalize the 100 ExpressionSets in parallel using the 8 engines
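The same pattern (prepare the workers, then apply one function to every item of a list in parallel) can be tried on a single machine with the `parallel` package that ships with R. The sketch below is a local analogy, not the Elastic-R API. As a loose, assumed correspondence, `clusterEvalQ` plays the role of `cluster.console` and `parLapply` the role of `cluster.apply`. The numeric vectors and the `normalise` function are stand-ins for the ExpressionSets and for `justvsn`, and two local workers replace the eight cloud engines.

```r
library(parallel)

# Stand-in for the list of 100 datasets built in the exercise.
set.seed(1)
l <- lapply(1:100, function(i) rnorm(50, mean = i))

# Stand-in for justvsn: centre and scale one dataset.
normalise <- function(v) (v - mean(v)) / sd(v)

cl <- makeCluster(2)                         # two local worker processes
invisible(clusterEvalQ(cl, library(stats)))  # evaluate an expression on every worker
res <- parLapply(cl, l, normalise)           # apply the function to each item in parallel
stopCluster(cl)                              # release the workers

print(length(res))
```

With real Bioconductor data, the call to `library(stats)` would be replaced by the package the workers need, as `library(vsn)` is in the exercise.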

Page 33: Elastic r sc10-tutorial

Elastic-R Portal :

www.elastic-r.org

Articles about the project:

Chine K. (2010). Open Science in the Cloud: Towards a Universal Platform for Scientific and Statistical Computing. In Handbook of Cloud Computing (Chapter 19). Springer US.

Karim Chine, "Learning Math and Statistics on the Cloud, Towards an EC2-Based Google Docs-like Portal for Teaching / Learning Collaboratively with R and Scilab," ICALT, pp. 752-753, 2010 10th IEEE International Conference on Advanced Learning Technologies, 2010

Karim Chine, "Scientific Computing Environments in the age of virtualization, toward a universal platform for the Cloud," pp. 44-48, 2009 IEEE International Workshop on Open Source Software for Scientific Computation (OSSC), 2009

Karim Chine, "Biocep, Towards a Federative, Collaborative, User-Centric, Grid-Enabled and Cloud-Ready Computational Open Platform," eScience, pp. 321-322, 2008 Fourth IEEE International Conference on eScience, 2008

Linkedin Group:

http://www.linkedin.com/groups?home=&gid=2345405

Links

Page 34: Elastic r sc10-tutorial

ACS: Madi Nassiri Amazon: Simone Brunozzi, Deepak Singh AT&T Research Labs: Simon Urbanek ATUGE: Imen Essafi, Béchir

Tourki, Ilyes Gouja, Hatem Hachicha, Amine Elleuch Auckland Centre for eResearch: Nick Jones Banca d'Italia: Giuseppe Bruno

Bio-IT World: Kevin Davies BNP Paribas: Ousseynou Nakoulima Cambridge Healthtech Institute: Cindy Crowninshield City

University of New York: Mario Morales, Makram Talih Columbia University: Omar Besbes Dassault Systèmes: Omri Ben Ayoun,

Patrick Johnson Dataspora: Michael E. Driscoll EDF: Alejandro Ribes EBI: Alvis Brazma, Wolfgang Huber, Kimmo Kallio, Misha

Kapushesky, Michael Kleen, Alberto Labarga, Philippe Rocca-Serra, Ugis Sarkans, Kirsten Williams, Eamonn Maguire EPFL:

Darlene Goldstein ESPRIT: Farouk Kammoun, Tahar Benlakhdar e-Taalim: Nadhir Douma ETH Zürich: Yohan Chalabi, Diethelm

Würtz, Martin Mächler European Commission: Konstantinos Glinos, Enric Mitjana, Monika Kacik, Ioannis Sagias FHCRC: Martin

Morgan, Nianhua Li, Seth Falcon Google: Olivier Bosquet FVG LLC: Lisa Wood Harvard University: Tim Clark, Sudeshna Das,

Douglas Burke, Paolo Ciccarese IBM: Jean-Louis Bernaudin, Pascal Sempe, Loic Simon, Lea A Deleris, Alex Fleischer, Alain Chabrier

Imperial College London: Asif Akram, Vasa Curcin, John Darlington, Brian Fuchs Indiana University: Michael Grobe INRIA: David

Monteau, Christian Saguez, Claude Gomez, Sylvestre Ledru JISC: John Wood, David Flanders Johnson & Johnson - Janssen

Pharmaceutica: Patrick Marichal KXEN: Eric Marcade Lancaster University: Robert Crouchley, Daniel Grose Leibniz Universität

Hannover: Kornelius Rohmeier LIAMA: Baogang Hue, Kang Cai Limagrain: Zivan Karaman Mekentosj: Alexander Griekspoor,

Matt Wood Microsoft: Eric Le Marois, Tony Hey Mubadala: Ghazi Ben Amor Nature Publishing Group: Ian Mulvany, Steve Scott

NCeSS: Peter Halfpenny, Rob Procter, Marzieh Asgari-Targhi, Alex Voss, YuWei Lin, Mercedes Argüello Casteleiro, Wei Jie, Meik

Poschen, Katy Middlebrough, Pascal Ekin, June Finch, Farzana Latif, Elisa Pieri, Frank O'Donnell New York Java User Group:

Frank D Greco OeRC: Dimitrina Spencer, Matteo Turilli, David Wallom, Steven Young OMII-UK: Neil Chue Hong, Steve Brewer

OpenAnalytics: Tobias Verbeke Oracle: Dominique van Deth, Andrew Bond OSS Watch: Ross Gardler Platform Computing:

Christopher Smith Royal Society: James Wilsdon San Diego Supercomputer Center: Nancy R. Wilkins-Diehr Sanger Institute: Lars

Jorgensen, Phil Butcher Shell: Wayne W. Jones, Nigel Smith Société Générale: Anis Maktouf Stanford University: John Chambers,

Balasubramanian Narasimhan, Gunter Walther SYSTEM@TIC: Karim Azoum Technische Universität Dortmund: Uwe Ligges,

Bernd Bischl Technoforge: Pierre-Antoine Durgeat Tekiano: Samy Ben Naceur Télécom-ParisTech: Isabelle Demeure, Georges

Hebrail, Nesrine Gabsi The Generations Network: Jim Porzak Total: Yannick Perigois Tunisian Ministry of Communication

Technologies: Naceur Ammar, Lamia Chaffai-Sghaier, Mohamed Saïd Ouerghi, Syrine Tlili Tunisian Ecole Polytechnique: Riadh

Robbana UC Berkeley: Noureddine El Karoui, Terry Speed UC Davis: Rudy Beran, Debashis Paul, Duncan Temple Lang UCL:

Daniel Jeffares UCLA: Ivo Dinov, Jeroen Ooms UC San Diego: Anthony Gamst UCSF: Tena Sakai Université Catholique de

Louvain: Christian Ritter University of Cambridge: Ian Roberts, Robert MacInnis, Peter Murray-Rust, Jim Downing University of

Manchester: Carole Goble, Len Gill, Simon Peters, Richard D Pearson, Iain Buchan, John Ainsworth University of Plymouth: Paul

Hewson University of Split: Ivica Puljak UTK: Ajay Ohri World Bank Group-IFC: Oualid Ammar Yahoo: Laurent Mirguet, Rob

Weltman Independent: Charles Dallas, Romain François

Acknowledgments