ODSC and iRODS
1
Open Data Science Conference and iRODS User Group meeting
Raminder Singh
Research Data Services
Research Technologies, Indiana University
July 7th, 2016
2
ODSC East 2016
https://www.odsc.com/boston
3
Technologies Discussed
• Julia is a high-level, high-performance dynamic programming language for technical computing with familiar syntax. It provides a sophisticated compiler, distributed parallel execution, numerical accuracy, and an extensive mathematical function library.
• Stan is a platform for statistical modeling, data analysis, and prediction in the social, biological, and physical sciences, engineering, and business.
• Scikit-learn is a Python library with classification, regression, and clustering algorithms, including support vector machines, random forests, gradient boosting, k-means, and DBSCAN. It is designed to interoperate with other libraries such as NumPy and SciPy.
• Apache Spark is a cluster-computing framework whose application programming interface is centered on a data structure called the resilient distributed dataset (RDD): a read-only multiset of data items distributed over a cluster of machines and maintained in a fault-tolerant way.
• Apache Hadoop implements a computational paradigm named MapReduce, in which the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster.
• Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.
• Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.
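The MapReduce paradigm mentioned above can be illustrated with a minimal, single-process sketch in plain Python. This is not Hadoop code and uses no Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are assumptions chosen for illustration. It only shows how work on independent fragments is mapped to key/value pairs, grouped by key, and reduced.

```python
from collections import defaultdict
from itertools import chain


def map_phase(fragment):
    """Map: turn one input fragment into (word, 1) pairs."""
    return [(word.lower(), 1) for word in fragment.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values collected for one key."""
    return key, sum(values)


def word_count(fragments):
    # Each fragment is mapped independently, which is what lets a real
    # cluster run (or re-run) the fragments on any node.
    mapped = chain.from_iterable(map_phase(f) for f in fragments)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


# Hypothetical input: three small fragments of text.
fragments = ["data science data", "open data", "science"]
counts = word_count(fragments)
print(counts)
```

In a real cluster the map and reduce calls would run on different machines; here they run in one process so the data flow is easy to follow.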
4
Keynote Speakers
5
About Companies of Keynote Speakers
• Booz Allen Hamilton: Its core business is providing management, technology, and security services to civilian government agencies. http://www.boozallen.com/datascience
• RapidMiner: An integrated environment for machine learning, data mining, text mining, predictive analytics, and business analytics. https://rapidminer.com/
• CrowdFlower: Data enrichment and data mining offered as software as a service. https://www.crowdflower.com/
6
Other Interesting Speakers
7
Topics for Training Workshops
• Using R for Data Analytics
– https://github.com/zachmayer/forecast
• Building a Real-time Recommender System with Spark ML, Kafka, and the PANCAKE STACK
– http://advancedspark.com/
• Analyzing Open Data in Healthcare using Public APIs and Reproducible Workflows
– https://github.com/jhajagos/health-open-data-workshop
8
List of Good Talks Available Online
• Kirk Borne – “2 Most Important Things in Data Science”
– https://www.opendatascience.com/conferences/odsc-east-2016-kirk-borne-the-2-most-important-things-in-data-science/
– The two things: experiment and data collection
• Tomorrow’s Map Room: Data Portals
– https://www.opendatascience.com/blog/tomorrows-map-room-data-portals/
• Interactive Data Visualizations in R with Shiny and ggplot2
– https://www.opendatascience.com/conferences/odsc-east-2016-joe-cheng-zev-ross-interactive-data-visualizations-in-r-with-shiny-and-ggplot2/
• Bokeh is a Python interactive visualization library that targets modern web browsers for presentation, comparable to Shiny in R or D3 in JavaScript. http://bokeh.pydata.org
– https://www.opendatascience.com/conferences/odsc-east-2016-peter-wang-interactive-viz-of-a-billion-points-with-bokeh-datashader/
• Exaptive Xap Store is an 'app store' for data applications. They are standardizing a set of libraries to be used to create networks. http://www.exaptive.com/data-application-gallery
9
10
Objectives for Attending
• iRODS features and architecture
• User Community
• Use Cases and Solutions built over iRODS
• Future development and directions
Questions
• Can I write rules in other languages?
• Is it possible to attach it to existing storage?
• What does it take to implement data policy rules for Research Data Alliance (RDA) practical policy recommendations?
11
12
13
iRODS Implements Four Main Functions
Data Virtualization: iRODS provides a logical representation of files stored in physical storage locations. We call this logical view a virtual file system, and the capabilities it provides data virtualization.
Data Discovery: iRODS keeps information about the data it stores in a catalog. This information, called metadata, is extremely useful for data discovery, that is, locating relevant data within large data sets.
Workflow Automation: Once data is stored and available in the catalog, it often needs to be migrated, secured, or otherwise processed. Automating these steps is the purpose of workflow automation.
Secure Collaboration: Data is most useful when it’s in the hands of the right people. There is a recognized need in the public research community to publish data sets that accompany written articles.
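The first two functions above, data virtualization and data discovery, can be pictured with a toy catalog in Python. This is a minimal sketch and not the iRODS API: the `Catalog` class, its methods, and every path and metadata value are hypothetical. It only shows the idea of a logical path that hides the physical location, and of metadata used to locate data.

```python
from dataclasses import dataclass, field


@dataclass
class DataObject:
    logical_path: str    # the name users see
    physical_path: str   # where the bytes actually live
    metadata: dict = field(default_factory=dict)


class Catalog:
    """Toy catalog: maps logical paths to physical storage and metadata."""

    def __init__(self):
        self._objects = {}

    def register(self, logical_path, physical_path, **metadata):
        self._objects[logical_path] = DataObject(
            logical_path, physical_path, dict(metadata)
        )

    def resolve(self, logical_path):
        """Virtualization: users ask by logical path only."""
        return self._objects[logical_path].physical_path

    def find(self, **criteria):
        """Discovery: return logical paths whose metadata match all criteria."""
        return sorted(
            obj.logical_path
            for obj in self._objects.values()
            if all(obj.metadata.get(k) == v for k, v in criteria.items())
        )


# Hypothetical entries spread over different storage back ends.
catalog = Catalog()
catalog.register("/zone/home/alice/run1.csv", "/mnt/disk3/a1b2.dat",
                 project="genomics", year="2016")
catalog.register("/zone/home/alice/run2.csv", "s3://bucket/x9.dat",
                 project="genomics", year="2015")
catalog.register("/zone/home/bob/notes.txt", "/mnt/disk1/n7.dat",
                 project="survey", year="2016")

print(catalog.resolve("/zone/home/alice/run2.csv"))
print(catalog.find(project="genomics"))
```

The point of the design is that moving a file to other storage changes only `physical_path`; the logical path and the metadata queries keep working.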
14
15
16
18
EMC2 Case of Adaptive Hierarchical Metadata Using MetaLnx
19
20
Getting R to talk to iRODS
Bernhard Sonderegger, Nestlé Institute of Health Sciences
• The R language is an environment with a large and highly active user community in the field of data science. At NIHS we have developed the R-irods package, which allows user-friendly access to iRODS data objects and metadata from the R language. Information is passed to the R functions as native R objects (e.g. data frames) to facilitate integration with existing R code and to allow data access using standard R constructs.
• To maximize performance and maintain a simple architecture, the implementation relies heavily on the icommands C++ code wrapped using Rcpp bindings.
• The R-irods package has been engineered to have semantics equivalent to the icommands and can easily be used as a basis for further customization. At the NIHS we have created an ontology-aware package on top of R-irods to ensure consistent metadata annotations and to facilitate query construction.
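The slides do not show the R-irods functions, so the sketch below does not reproduce them. It only illustrates, in Python, the general idea described above: turning a command-line style listing of attribute/value/units metadata into native tabular records. The sample listing is made up, and its layout is an assumption loosely modeled on icommands-style output; `parse_avus` and `to_columns` are hypothetical helper names.

```python
# Made-up listing; the "key: value" layout with "----" separators is an assumption.
SAMPLE_LISTING = """\
attribute: species
value: human
units:
----
attribute: read_length
value: 150
units: bp
"""


def parse_avus(text):
    """Parse the listing into a list of {attribute, value, units} records."""
    records = []
    for block in text.split("----"):
        record = {}
        for line in block.strip().splitlines():
            key, _, value = line.partition(":")
            record[key.strip()] = value.strip()
        if record:
            records.append(record)
    return records


def to_columns(records, names=("attribute", "value", "units")):
    """Column-oriented view, similar in spirit to a data frame."""
    return {name: [r.get(name, "") for r in records] for name in names}


avus = parse_avus(SAMPLE_LISTING)
columns = to_columns(avus)
print(columns)
```

Handing results back as native tabular structures is what lets existing analysis code filter and join metadata without any knowledge of the storage system.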
21
22
23
24
Review
Questions
• Can I write rules in other languages?
– YES
• Is it possible to attach it to existing storage?
– YES. There are tools to load the data.
• What does it take to implement data policy rules for Research Data Alliance (RDA) practical policy recommendations?
– A reference implementation of the RDA recommendations is available at https://github.com/DICE-UNC/policy-workbook. It needs some work to update and test it with the latest version of iRODS.
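To give a feel for what a data policy rule does, here is a small Python sketch of an integrity policy: record a checksum when data is stored and verify it later. It is not written in the iRODS rule language, does not use any iRODS API, and is not taken from the policy workbook; the `PolicyEngine` class and the event names `post_put` and `audit` are hypothetical.

```python
import hashlib


def checksum(data: bytes) -> str:
    """SHA-256 digest of the content, as a hex string."""
    return hashlib.sha256(data).hexdigest()


class PolicyEngine:
    """Toy engine: runs every rule registered for an event."""

    def __init__(self):
        self._rules = {}

    def on(self, event):
        def decorator(func):
            self._rules.setdefault(event, []).append(func)
            return func
        return decorator

    def fire(self, event, record):
        for rule in self._rules.get(event, []):
            rule(record)


engine = PolicyEngine()


@engine.on("post_put")  # hypothetical event: data has just been stored
def record_checksum(record):
    record["metadata"]["sha256"] = checksum(record["content"])


@engine.on("audit")  # hypothetical event: periodic integrity check
def verify_checksum(record):
    expected = record["metadata"].get("sha256")
    record["metadata"]["intact"] = checksum(record["content"]) == expected


record = {"path": "/zone/home/alice/run1.csv",
          "content": b"a,b\n1,2\n", "metadata": {}}
engine.fire("post_put", record)
engine.fire("audit", record)
first_audit = record["metadata"]["intact"]

record["content"] = b"corrupted"  # simulate silent corruption
engine.fire("audit", record)
second_audit = record["metadata"]["intact"]
print(first_audit, second_audit)
```

Keeping the policy in small functions attached to events means a policy can be added or changed without touching the code that stores the data.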
25
iRODS User Group Meeting notes and slides
• http://irods.org/documentation/articles/irods-user-group-meeting-2016/ : Use Case slides
• http://irods.org/wp-content/uploads/2016/06/technical-overview-2016-web.pdf : Tech report
• http://slides.com/irods/ : Workshop Slides
• https://github.com/DICE-UNC/policy-workbook : RDA policies implementation
• http://www.cyverse.org/ : iRODS as a service
• http://irods.org/documentation/articles/ : Other Articles
• http://www.odum.unc.edu/
• http://datafed.org/about/use-cases/
• http://renci.org/news/virtual-institute-for-social-research/