R The unsung hero of Big Data
Dhafer Malouche, CEAFE, Beit El Hikma, June 21st, 2018
ESSAI-MASE, Carthage University
http://dhafermalouche.net
What’s R
• Free software environment for statistical computation
• Created in 1992 by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand
• Statistical computing:
  • Data extraction
  • Data cleaning
  • Data visualization
  • Modeling
• almost 13,000 packages
• IDE: RStudio
• One of the most popular statistical software environments
Some other features
• Reporting with R Markdown: HTML, PDF, Word...
• Dynamic data visualization¹: plotly, highcharter, rbokeh, dygraphs, leaflet, googleVis...
• Dashboards with flexdashboard
• Sophisticated statistical web apps with Shiny
• R can be called from Python, Julia...
¹ https://www.htmlwidgets.org
However
• R is not well suited for working with data structures larger than about 10-20% of a computer's RAM.
• Data exceeding 50% of available RAM are essentially unusable.
• We therefore call a data set large if it exceeds 20% of the RAM on a given machine and massive if it exceeds 50%.
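These rules of thumb are easy to encode. The sketch below uses the 20% and 50% cutoffs quoted above; classify_by_ram is an illustrative name (not from any package), and the RAM size is passed in explicitly because base R has no portable way to query it:

```r
# A minimal sketch of the 20% / 50% rules of thumb above.
# 'classify_by_ram' is an illustrative helper, not a function from any package;
# RAM size is passed in because base R cannot query it portably.
classify_by_ram <- function(data_bytes, ram_bytes) {
  ratio <- data_bytes / ram_bytes
  if (ratio > 0.5) "massive"
  else if (ratio > 0.2) "large"
  else "manageable"
}

classify_by_ram(1.5 * 1024^3, 8 * 1024^3)  # a 1.5 GB file on an 8 GB machine: "manageable"
classify_by_ram(5 * 1024^3, 8 * 1024^3)    # a 5 GB file on the same machine: "massive"
```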
Solutions offered by R
• Within R:
  • ff, ffbase, ffbase2, and bigmemory to enhance out-of-memory performance
  • Apply statistical methods to large R objects through biglm, bigalgebra, bigmemory...
  • The bigvis package for large-data visualization
  • Faster data manipulation methods available in the data.table package
• Connecting R to famous Big Data tools
Types of data
• Medium-sized files that can be loaded into R (within the memory limit, but processing is cumbersome; typically in the 1-2 GB range): read.csv, read.table...
• Large files that cannot be loaded into R due to R/OS limitations. Two further groups:
  • Large files (from 2 to 10 GB): they can be processed locally using some workaround solutions: read.table.ffdf, fread.
  • Very large files (> 10 GB) that need distributed large-scale computing: Hadoop, H2O, Spark...
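This decision rule can be sketched as a small base R helper; choose_reader is an illustrative name, and the 2 GB and 10 GB cutoffs are the ones given above:

```r
# Pick an import strategy from the file size on disk, following the cutoffs above.
# 'choose_reader' is an illustrative helper, not part of any package.
choose_reader <- function(path) {
  gb <- file.size(path) / 1024^3
  if (gb > 10) "distributed: Hadoop, H2O, Spark"
  else if (gb > 2) "workaround: read.table.ffdf, fread"
  else "in-memory: read.csv, read.table"
}

tmp <- tempfile(fileext = ".csv")
write.csv(head(iris), tmp)
choose_reader(tmp)  # a tiny file: "in-memory: read.csv, read.table"
```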
Comparing three methods to import a medium-sized data set
• Standard read.csv
> system.time(DF1 <- read.csv("airline_20MM.csv", stringsAsFactors = FALSE))
   user  system elapsed
162.832  12.785 180.584
• Optimized read.csv
> ptm <- proc.time()
> length(readLines("airline_20MM.csv"))
[1] 20000001
> proc.time() - ptm
   user  system elapsed
 26.097   0.588  26.766
> classes <- c("numeric", rep("character", 3), rep("numeric", 22))
> system.time(DF2 <- read.csv("airline_20MM.csv", header = TRUE, sep = ",",
+   stringsAsFactors = FALSE, nrow = 20000001, colClasses = classes))
   user  system elapsed
 68.232   3.672  72.154
• fread (from the data.table package)
> system.time(DT1 <- fread("airline_20MM.csv"))
Read 20000000 rows and 26 (of 26) columns from 1.505 GB file in 00:00:18
   user  system elapsed
 15.113   2.443  23.715
Large datasets with size 2-10 GB
• Too big for in-memory processing, but not big enough to require distributed computing
• Two solutions:
  • The big... packages: bigmemory, bigalgebra, biganalytics
  • The ff family of packages
ff, ffbase and ffbase2 packages
• Created in 2012 by Adler, Glaser, Nenadic, Oehlschlägel, and Zucchini. Already more than 340,000 downloads.
• It chunks the data set and stores it on the hard drive, loading only the chunks in use into RAM.
• It includes a number of general data-processing functions.
• The ffbase package allows users to apply a number of statistical and mathematical operations.
ff, ffbase and ffbase2 packages, Example
• Create a directory for the chunk files
> system("mkdir air20MM")
> list.dirs()
...
[121] "./air20MM"
...
• Set the path to this newly created folder, which will store the ff data chunks
> options(fftempdir = "./air20MM")
ff, ffbase and ffbase2 packages, Example
• Import the data into R
> air20MM.ff <- read.table.ffdf(file = "airline_20MM.csv",
+   sep = ",", VERBOSE = TRUE,
+   header = TRUE, next.rows = 400000,
+   colClasses = NA)
read.table.ffdf 1..400000 (400000) csv-read=3.224sec ffdf-write=0.397sec
read.table.ffdf 400001..800000 (400000) csv-read=3.174sec ffdf-write=0.205sec
read.table.ffdf 800001..1200000 (400000) csv-read=3.033sec ffdf-write=0.198sec
...
read.table.ffdf 20000001..20000000 (0) csv-read=0.045sec
csv-read=141.953sec ffdf-write=67.208sec TOTAL=209.161sec
• Memory size and dimensions
> format(object.size(air20MM.ff), units = "MB")
[1] "0.1 Mb"
> class(air20MM.ff)
[1] "ffdf"
> dim(air20MM.ff)
[1] 20000000 26
• One binary file for each variable
> list.files("./air20MM")
[1] "ffdf2c9103fa5e4.ff" "ffdf2c915cd46aa.ff" "ffdf2c919345992.ff"
[4] "ffdf2c919f020c5.ff" "ffdf2c91b4e0b28.ff" "ffdf2c91fdfba1f.ff"
[7] "ffdf2c920be7d19.ff" "ffdf2c922e00bb9.ff" "ffdf2c92321b092.ff"
[10] "ffdf2c9263bfa45.ff"
...
ff, ffbase and ffbase2 packages, Example
• Size of the binary files (80 MB each)
> file.size("./air20MM/ffdf2c9103fa5e4.ff")
[1] 8e+07
• The binary file of a given variable
> basename(filename(air20MM.ff$DayOfWeek))
[1] "ffdf2c92babdb9f.ff"
• Many other operations:
  • Saving and loading ff objects
  • Computing tables with table.ff
  • Converting a numeric vector to a factor with cut.ff
  • Value matching with ffmatch
  • bigglm.ffdf for generalized linear models (GLM)
...and many others!!
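As a hedged illustration of two of these operations, the sketch below assumes the air20MM.ff object built on the previous slides; DepDelay is an assumed column name, and the code is a sketch rather than output reproduced from the talk:

```r
# Sketch only: requires the ff and ffbase packages and the air20MM.ff
# object created earlier; 'DepDelay' is an assumed column name.
library(ff)
library(ffbase)

# Frequency table computed chunk by chunk, without loading the column into RAM
table.ff(air20MM.ff$DayOfWeek)

# Convert a numeric ff vector to a factor, again chunk-wise
delay_cat <- cut.ff(air20MM.ff$DepDelay, breaks = c(-Inf, 0, 15, Inf))
```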
bigmemory, Example
• Reading big matrices
> ptm <- proc.time()
> air20MM.matrix <- read.big.matrix("airline_20MM.csv",
+   type = "integer", header = TRUE, backingfile = "air20MM.bin",
+   descriptorfile = "air20MM.desc", extraCols = NULL)
> proc.time() - ptm
   user  system elapsed
109.665   2.425 113.741
• Size and dimensions
> dim(air20MM.matrix)
[1] 2.0e+07 2.6e+01
> object.size(air20MM.matrix)
696 bytes
• Files
> file.exists("air20MM.desc")
[1] TRUE
> file.exists("air20MM.bin")
[1] TRUE
> file.size("air20MM.desc")
[1] 753
> file.size("air20MM.bin")/1024^3
[1] 1.937151
Apache Spark
• Speed: runs workloads up to 100x faster than Hadoop MapReduce.
• Ease of use: write applications quickly in Java, Scala, Python, R, and SQL.
• Generality: combine SQL, streaming, and complex analytics.
sparklyr: R interface for Apache Spark
• Connect to Spark from R. The sparklyr package provides a complete dplyr backend.
• Filter and aggregate Spark datasets, then bring them into R for analysis and visualization.
• Use Spark's distributed machine learning library from R. Create extensions that call the full Spark API and provide interfaces to Spark packages.
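The sc object used in the examples on the next slides must be created first with spark_connect. A minimal local connection, assuming Spark is installed (sparklyr can download one via spark_install()), looks like:

```r
# Sketch only: requires the sparklyr package and a local Spark
# installation (spark_install() can download one).
library(sparklyr)

sc <- spark_connect(master = "local")  # local mode; use a cluster URL in production
# ... work with Spark through sc ...
spark_disconnect(sc)
```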
Managing data in Spark from R
• Copying data from R to Spark with the dplyr package
> library(dplyr)
> iris_tbl <- copy_to(sc, iris)
• Reading csv files
> airline_20MM_sp <- spark_read_csv(sc, "airline_20MM", "airline_20MM.csv")
• Munging and managing data on Spark from R: quickly getting statistics on massive data.
• Execute SQL queries directly against tables within a Spark cluster.
> library(DBI)
> query1 <- dbGetQuery(sc, "SELECT * FROM airline_20MM WHERE MONTH = 9")
Managing data in Spark from R
• Machine Learning procedures on Spark:
  • ml_decision_tree for decision trees
  • ml_linear_regression for regression models
  • ml_gaussian_mixture for fitting Gaussian mixture models via the EM algorithm
• ....
• Example
> mtcars_tbl <- copy_to(sc, mtcars)
> partitions <- mtcars_tbl %>%
+   filter(hp >= 100) %>%
+   mutate(cyl8 = cyl == 8) %>%
+   sdf_partition(training = 0.5, test = 0.5, seed = 1099)
> fit <- partitions$training %>%
+   ml_linear_regression(response = "mpg", features = c("wt", "cyl"))
More things to do on Spark from R
• Reading and writing data in CSV, JSON, and Parquet formats: spark_write_csv, spark_write_parquet, spark_write_json
• Execute arbitrary R code across your cluster using spark_apply
> spark_apply(iris_tbl, function(data) {
+   data[1:4] + rgamma(1, 2)
+ })
• View the Spark web console using the spark_web function:
> spark_web(sc)
H2O
• Software for machine learning and data analysis.
• Ease of Use
• Open source (the liberal Apache license)
• Scalable to big data
• Well-documented and commercially supported.
• Website: https://www.h2o.ai/h2o/
How to install H2O?²
It takes a few minutes; the download is ∼134 MB.
# The following two commands remove any previously installed H2O packages for R.
if ("package:h2o" %in% search()) { detach("package:h2o", unload = TRUE) }
if ("h2o" %in% rownames(installed.packages())) { remove.packages("h2o") }
# Next, we download packages that H2O depends on.
pkgs <- c("RCurl", "jsonlite")
for (pkg in pkgs) {
  if (!(pkg %in% rownames(installed.packages()))) { install.packages(pkg) }
}
# Now we download, install and initialize the H2O package for R.
install.packages("h2o", type = "source",
  repos = "http://h2o-release.s3.amazonaws.com/h2o/rel-wright/2/R")
# Finally, let's load H2O and start up an H2O cluster
library(h2o)
h2o.init()
² Procedure available at http://h2o-release.s3.amazonaws.com/h2o/rel-wright/2/index.html
Munging data and ML in H2O from R
• Importing data files: h2o.importFile
• Importing multiple files: h2o.importFolder
• Combining data sets by columns and rows: h2o.cbind and h2o.rbind
• Grouping one or more columns and applying a function to the result: group_by
• Imputing missing values: h2o.impute
• And the most important machine learning algorithms: PCA, Random Forests, regression models and classification, Gradient Boosting Machines...
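As a hedged sketch of this workflow, assuming a running H2O cluster and the airline file from the earlier slides (ArrDelay, DepDelay, and Distance are assumed column names, not confirmed by the talk):

```r
# Sketch only: requires the h2o package and a running H2O cluster.
library(h2o)
h2o.init()

# Import the airline data directly into the H2O cluster (path assumed)
air <- h2o.importFile("airline_20MM.csv")

# Fit a GLM predicting arrival delay; the column names are assumptions
fit <- h2o.glm(x = c("DepDelay", "Distance"), y = "ArrDelay",
               training_frame = air, family = "gaussian")
h2o.performance(fit)
```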
Hadoop and RHadoop
RHadoop is a collection of five R packages:
• rhdfs: basic connectivity to the Hadoop Distributed File System. R programmers can browse, read, write, and modify files stored in HDFS from within R.
• rhbase: basic connectivity to the HBASE distributed database.
• plyrmr: data manipulation operations.
• rmr2: allows R developers to perform statistical analysis in R via Hadoop MapReduce functionality on a Hadoop cluster.
• ravro: read and write Avro files from the local and HDFS file systems.
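A classic rmr2 sketch, squaring integers via MapReduce; it assumes a configured Hadoop backend (rmr2 also offers a local backend for testing without a cluster):

```r
# Sketch only: requires the rmr2 package and a Hadoop backend
# (or rmr.options(backend = "local") for a cluster-free test run).
library(rmr2)

small_ints <- to.dfs(1:1000)                        # push data into (H)DFS
result <- mapreduce(input = small_ints,
                    map = function(k, v) keyval(v, v^2))
head(from.dfs(result)$val)                          # squares computed by MapReduce
```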