
R The unsung hero of Big Data

Dhafer Malouche, CEAFE, Beit El Hikma, June 21st, 2018

ESSAI-MASE-Carthage University, http://dhafermalouche.net

What’s R

• Free software environment for statistical computation

• Created in 1992 by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand

• Statistical computing

• Data Extraction

• Data Cleaning

• Data Visualization

• Modeling

• almost 13,000 packages

• IDE: RStudio

• One of the most popular statistical software packages

R Environment

RStudio

Some other features

• Reporting with R Markdown: HTML, PDF, Word...

• Dynamic data visualization [1]: plotly, highcharter, rbokeh, dygraphs, leaflet, googleVis...

• Dashboards with flexdashboard

• Sophisticated statistical web apps with Shiny (see the minimal sketch below)

• R can be called from Python, Julia...

[1] https://www.htmlwidgets.org
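To make the Shiny bullet concrete, here is a minimal sketch (not from the slides; the sample-size slider and random-normal histogram are purely illustrative):

library(shiny)

# UI: one slider controlling the sample size, one plot output
ui <- fluidPage(
  sliderInput("n", "Sample size", min = 10, max = 1000, value = 100),
  plotOutput("hist")
)

# Server: redraw the histogram whenever the slider changes
server <- function(input, output) {
  output$hist <- renderPlot({
    hist(rnorm(input$n), main = paste("Histogram of", input$n, "draws"))
  })
}

shinyApp(ui, server)   # launches the app in the browser / RStudio viewer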


However

• R is not well-suited for working with data structures larger than about 10-20% of a computer's RAM.

• Data exceeding 50% of available RAM are essentially unusable.

• We consider a data set large if it exceeds 20% of the RAM on a given machine and massive if it exceeds 50%, as sketched below.
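A rough sketch of this rule of thumb, using the on-disk size of the airline file used later in the talk as a proxy for the loaded object (the 8 GB of RAM is an assumption, not a value from the slides):

ram_gb  <- 8                                            # assumed total RAM of the machine
file_gb <- file.size("airline_20MM.csv") / 1024^3       # size of the data on disk, in GB
if (file_gb > 0.5 * ram_gb) {
  cat("massive: use out-of-memory or distributed tools\n")
} else if (file_gb > 0.2 * ram_gb) {
  cat("large: in-memory work becomes cumbersome\n")
} else {
  cat("fits comfortably in memory\n")
}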


Big Data and R

Can we then handle Big Data in R?


Solutions offered by R

• Within R:

• ff, ffbase, ffbase2, and bigmemory to enhance out-of-memory performance

• Apply statistical methods to large R objects through the biglm, bigalgebra, bigmemory... packages

• The bigvis package for large data visualization

• Faster data manipulation methods available in the data.table package (see the sketch below)

• Connecting R to famous Big Data tools
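As an illustration of the data.table bullet above (not from the slides; ArrDelay and Month are assumed column names of the airline file used later in the talk):

library(data.table)

DT <- fread("airline_20MM.csv")     # fast, multi-threaded CSV reader
# grouped aggregation with data.table's DT[i, j, by] syntax
DT[, .(mean_delay = mean(ArrDelay, na.rm = TRUE)), by = Month]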


Types of data

• Medium-sized files that can be loaded into R (within the memory limit, but processing is cumbersome), typically in the 1-2 GB range: read.csv, read.table...

• Large files that cannot be loaded into R due to R/OS limitations; two further groups:

• Large files, from 2 to 10 GB: they can be processed locally using some workaround solutions: read.table.ffdf, fread.

• Very large files (> 10 GB) that need distributed large-scale computing: Hadoop, H2O, Spark...

Medium sized files

Airline Data

airline_20MM.csv, ∼ 1.6 GB, 20 million observations, 26 variables.


Comparing three methods to import a medium-sized data set

• Standard read.csv

> system.time(DF1 <- read.csv("airline_20MM.csv", stringsAsFactors = FALSE))
   user  system elapsed
162.832  12.785 180.584

• Optimized read.csv

> ptm <- proc.time()
> length(readLines("airline_20MM.csv"))
[1] 20000001
> proc.time() - ptm
   user  system elapsed
 26.097   0.588  26.766
> classes <- c("numeric", rep("character", 3), rep("numeric", 22))
> system.time(DF2 <- read.csv("airline_20MM.csv", header = TRUE, sep = ",",
+   stringsAsFactors = FALSE, nrow = 20000001, colClasses = classes))
   user  system elapsed
 68.232   3.672  72.154

• fread

> library(data.table)
> system.time(DT1 <- fread("airline_20MM.csv"))
Read 20000000 rows and 26 (of 26) columns from 1.505 GB file in 00:00:18
   user  system elapsed
 15.113   2.443  23.715


Large datasets with size 2-10 GB

• Too big for in-memory processing, yet small enough that distributed computing is not required

• Two solutions:

• big... packages: bigmemory, bigalgebra, biganalytics

• ff packages


ff, ffbase and ffbase2 packages

• Created in 2012 by Adler, Glaser, Nenadic, Oehlschlägel, and Zucchini. Already more than 340,000 downloads.

• It chunks the dataset and stores it on the hard drive.

• It includes a number of general data-processing functions.

• The ffbase package allows users to apply a number of statistical and mathematical operations.


ff, ffbase and ffbase2 packages, Example

• Create a directory for the chunk files

> system("mkdir air20MM")
> list.dirs()
...
[121] "./air20MM"
...

• Set the path to this newly created folder, which will store the ff data chunks:

> options(fftempdir = "./air20MM")


ff, ffbase and ffbase2 packages, Example

• Import the data into R

> library(ff)
> air20MM.ff <- read.table.ffdf(file = "airline_20MM.csv",
+                                sep = ",", VERBOSE = TRUE,
+                                header = TRUE, next.rows = 400000,
+                                colClasses = NA)
read.table.ffdf 1..400000 (400000) csv-read=3.224sec ffdf-write=0.397sec
read.table.ffdf 400001..800000 (400000) csv-read=3.174sec ffdf-write=0.205sec
read.table.ffdf 800001..1200000 (400000) csv-read=3.033sec ffdf-write=0.198sec
...
read.table.ffdf 20000001..20000000 (0) csv-read=0.045sec
csv-read=141.953sec ffdf-write=67.208sec TOTAL=209.161sec

• Memory size, class, and dimensions

> format(object.size(air20MM.ff), units = "MB")
[1] "0.1 Mb"
> class(air20MM.ff)
[1] "ffdf"
> dim(air20MM.ff)
[1] 20000000       26

• One binary file for each variable

> list.files("./air20MM")
 [1] "ffdf2c9103fa5e4.ff" "ffdf2c915cd46aa.ff" "ffdf2c919345992.ff"
 [4] "ffdf2c919f020c5.ff" "ffdf2c91b4e0b28.ff" "ffdf2c91fdfba1f.ff"
 [7] "ffdf2c920be7d19.ff" "ffdf2c922e00bb9.ff" "ffdf2c92321b092.ff"
[10] "ffdf2c9263bfa45.ff"
...


ff, ffbase and ffbase2 packages, Example

• Size of the binary files (80 MB each)

> file.size("./air20MM/ffdf2c9103fa5e4.ff")
[1] 8e+07

• The binary file of a given variable

> basename(filename(air20MM.ff$DayOfWeek))
[1] "ffdf2c92babdb9f.ff"

• Many other operations:
• Saving and loading ff objects
• Computing tables with table.ff
• Converting a numeric vector to a factor with cut.ff
• Value matching with ffmatch
• bigglm.ffdf for fitting a Generalized Linear Model (GLM); see the sketch below

...and many others!!
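A sketch of the bigglm.ffdf point (not from the slides): ffbase lets biglm's bigglm fit a GLM chunk by chunk on the ffdf created above. DayOfWeek appears on the previous slide; ArrDelay and Distance are assumed column names of the airline data.

library(ffbase)    # provides the bigglm method for ffdf objects
library(biglm)

fit <- bigglm(ArrDelay ~ Distance + DayOfWeek,
              data      = air20MM.ff,    # the ffdf imported above
              family    = gaussian(),
              chunksize = 500000)         # rows processed per chunk
summary(fit)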


bigmemory, Example

• Reading big matrices

> library(bigmemory)
> ptm <- proc.time()
> air20MM.matrix <- read.big.matrix("airline_20MM.csv",
+   type = "integer", header = TRUE, backingfile = "air20MM.bin",
+   descriptorfile = "air20MM.desc", extraCols = NULL)
> proc.time() - ptm
   user  system elapsed
109.665   2.425 113.741

• Size and dimensions

> dim(air20MM.matrix)
[1] 2.0e+07 2.6e+01
> object.size(air20MM.matrix)
696 bytes

• Files

> file.exists("air20MM.desc")
[1] TRUE
> file.exists("air20MM.bin")
[1] TRUE
> file.size("air20MM.desc")
[1] 753
> file.size("air20MM.bin") / 1024^3
[1] 1.937151
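A sketch of what the descriptor and backing files buy you (not from the slides): in a later session the matrix can be re-attached instantly from disk, and biganalytics can compute column statistics without loading the data into RAM.

library(bigmemory)
library(biganalytics)

# Re-attach the file-backed matrix via its descriptor file (no re-import needed)
air20MM.matrix <- attach.big.matrix("air20MM.desc")

# Column means computed on the memory-mapped data
colmean(air20MM.matrix, na.rm = TRUE)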

Large Scale Computing

Apache Spark

• Speed: runs workloads up to 100x faster than Hadoop MapReduce (the Spark project's headline claim).

• Ease of use: write applications quickly in Java, Scala, Python, R, and SQL.

• Combine SQL, streaming, andcomplex analytics.


sparklyr: R interface for Apache Spark

• Connect to Spark from R. The sparklyr package provides a complete dplyr backend.

• Filter and aggregate Spark datasets, then bring them into R for analysis and visualization.

• Use Spark's distributed machine learning library from R. Create extensions that call the full Spark API and provide interfaces to Spark packages.


Connecting Spark to R

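The slides under this title were not transcribed; as a minimal sketch (an assumption, not the slide content), connecting to a local Spark instance with sparklyr creates the sc connection object used on the next slides:

library(sparklyr)

# spark_install(version = "2.3.0")      # one-time download of a local Spark (version is an assumption)
sc <- spark_connect(master = "local")   # or the URL of a cluster / "yarn-client"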

Managing data in Spark from R

• Copying data from R to Spark: the dplyr package

> library(dplyr)
> iris_tbl <- copy_to(sc, iris)

• Reading csv files

> airline_20MM_sp <- spark_read_csv(sc, "airline_20MM","airline_20MM.csv")

• Munging and managing data in Spark from R: quickly getting statistics on massive data (see the dplyr sketch below).

• Execute SQL queries directly against tables within a Spark cluster.

> library(DBI)
> query1 <- dbGetQuery(con, "SELECT * FROM airline_20MM WHERE MONTH = 9")
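A sketch of the munging bullet above (not from the slides; the Month and ArrDelay column names are assumptions about the CSV header): dplyr verbs are translated to Spark SQL, and only the aggregated result is brought back into R with collect().

library(dplyr)

airline_20MM_sp %>%
  group_by(Month) %>%
  summarise(mean_delay = mean(ArrDelay, na.rm = TRUE)) %>%
  collect()        # pulls only the small summary table into R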


Managing data in Spark from R

• Machine Learning procedures on Spark:
• ml_decision_tree for decision trees
• ml_linear_regression for regression models
• ml_gaussian_mixture for fitting Gaussian mixture distributions via the EM algorithm

• ....

• Example

> mtcars_tbl <- copy_to(sc, mtcars)
> partitions <- mtcars_tbl %>%
+   filter(hp >= 100) %>%
+   mutate(cyl8 = cyl == 8) %>%
+   sdf_partition(training = 0.5, test = 0.5, seed = 1099)
> fit <- partitions$training %>%
+   ml_linear_regression(response = "mpg", features = c("wt", "cyl"))
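A possible follow-up (an assumption, not shown on the slide): scoring the held-out partition, assuming a sparklyr version that provides ml_predict.

pred <- ml_predict(fit, partitions$test)      # adds a "prediction" column, computed in Spark
pred %>% select(mpg, prediction) %>% collect()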


More things to do on Spark from R

• Reading and writing data in CSV, JSON, and Parquet formats: spark_write_csv, spark_write_parquet, spark_write_json and the matching spark_read_* functions (see the sketch below)

• Execute arbitrary R code across your cluster using spark_apply

> spark_apply(iris_tbl, function(data) {
+   data[1:4] + rgamma(1, 2)
+ })

• View the Spark web console using the spark_web function:

> spark_web(sc)
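A sketch of the read/write bullet above (not from the slides; the output path is illustrative): writing the airline table out as Parquet and reading it back.

# Persist the Spark table as Parquet (columnar, compressed)
spark_write_parquet(airline_20MM_sp, path = "airline_20MM_parquet")

# Read it back into a new Spark table
airline_pq <- spark_read_parquet(sc, name = "airline_20MM_pq",
                                 path = "airline_20MM_parquet")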


H2O

• Software for machine learning and data analysis.

• Ease of Use

• Open source (the liberal Apache license)

• Scalable to big data

• Well-documented and commercially supported.

• Website: https://www.h2o.ai/h2o/


How to install H2O? [2]

It takes a few minutes; ∼ 134 MB to download.

# The following two commands remove any previously installed H2O packages for R.
if ("package:h2o" %in% search()) { detach("package:h2o", unload = TRUE) }
if ("h2o" %in% rownames(installed.packages())) { remove.packages("h2o") }

# Next, we download packages that H2O depends on.
pkgs <- c("RCurl", "jsonlite")
for (pkg in pkgs) {
  if (! (pkg %in% rownames(installed.packages()))) { install.packages(pkg) }
}

# Now we download, install and initialize the H2O package for R.
install.packages("h2o", type = "source",
                 repos = "http://h2o-release.s3.amazonaws.com/h2o/rel-wright/2/R")

# Finally, let's load H2O and start up an H2O cluster
library(h2o)
h2o.init()

[2] Procedure available at http://h2o-release.s3.amazonaws.com/h2o/rel-wright/2/index.html


Munging data and ML in H2O from R

• Importing data files: h2o.importFile

• Importing multiple files: h2o.importFolder

• Combining data sets by columns and rows: h2o.cbind and h2o.rbind

• Grouping by one or more columns and applying a function to each group: h2o.group_by

• Imputing missing values: h2o.impute

• And the most important machine learning algorithms: PCA, random forests, regression and classification models, gradient boosting machines... (see the sketch below)
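A sketch tying these together (not from the slides; ArrDelay and Distance are assumed column names, DayOfWeek appears earlier in the deck): importing the airline file into the H2O cluster and fitting a GLM on it.

library(h2o)
# h2o.init()   # already started in the installation step above

air.hex <- h2o.importFile("airline_20MM.csv")    # data is parsed and held in the H2O cluster
dim(air.hex)

fit <- h2o.glm(x = c("Distance", "DayOfWeek"),   # predictor columns (assumed names)
               y = "ArrDelay",                   # response column (assumed name)
               training_frame = air.hex,
               family = "gaussian")
fit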


Hadoop and RHadoop

RHadoop is a collection of five R packages:

• rhdfs: Basic connectivity to the Hadoop Distributed File System. R programmers can browse, read, write, and modify files stored in HDFS from within R.

• rhbase: Basic connectivity to the HBASE distributed database.

• plyrmr: Data manipulation operations.

• rmr2: Allows R developers to perform statistical analysis in R via Hadoop MapReduce functionality on a Hadoop cluster (see the sketch below).

• ravro: Read and write Avro files from the local and HDFS file systems.
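A sketch of the rmr2 point (an assumption based on the package's standard introductory example, not from the slides): a minimal MapReduce job that squares the numbers 1 to 1000 on the Hadoop cluster.

library(rmr2)

small.ints <- to.dfs(1:1000)                           # write the input to HDFS
out <- mapreduce(input = small.ints,
                 map = function(k, v) keyval(v, v^2))  # map step: emit (v, v^2) pairs
res <- from.dfs(out)                                   # read the result back into R
head(values(res))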


Resources

https://spark.rstudio.com/
