Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Muglurmath
![Page 1: Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Muglurmath](https://reader031.fdocuments.in/reader031/viewer/2022030305/5875b65b1a28ab8b618b785f/html5/thumbnails/1.jpg)
Petabyte scale data science using Spark & R
Sridhar Alla, Kiran Muglurmath, Comcast
![Page 2: Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Muglurmath](https://reader031.fdocuments.in/reader031/viewer/2022030305/5875b65b1a28ab8b618b785f/html5/thumbnails/2.jpg)
Who we are
• Sridhar Alla, Director, Solution Architecture, Comcast: focuses on architecting and building solutions to meet the needs of the Enterprise Business Intelligence initiatives.
• Kiran Muglurmath, Executive Director, Data Science, Comcast: focuses on architecting and building solutions to meet the needs of the Enterprise Business Intelligence initiatives.
![Page 3: Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Muglurmath](https://reader031.fdocuments.in/reader031/viewer/2022030305/5875b65b1a28ab8b618b785f/html5/thumbnails/3.jpg)
Top Initiatives
• Customer Churn Prediction
• Clickthru Analytics
• Personalization
• Customer Journey
• Modeling
![Page 4: Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Muglurmath](https://reader031.fdocuments.in/reader031/viewer/2022030305/5875b65b1a28ab8b618b785f/html5/thumbnails/4.jpg)
Spark Stack
![Page 5: Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Muglurmath](https://reader031.fdocuments.in/reader031/viewer/2022030305/5875b65b1a28ab8b618b785f/html5/thumbnails/5.jpg)
SparkR
• Enables using R packages to process data
• Can run machine learning and statistical analysis algorithms
![Page 6: Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Muglurmath](https://reader031.fdocuments.in/reader031/viewer/2022030305/5875b65b1a28ab8b618b785f/html5/thumbnails/6.jpg)
Spark MLlib
• Implements various machine learning algorithms
• Classification, regression, collaborative filtering, clustering, decomposition
• Works with Streaming, Spark SQL, GraphX, or SparkR
![Page 7: Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Muglurmath](https://reader031.fdocuments.in/reader031/viewer/2022030305/5875b65b1a28ab8b618b785f/html5/thumbnails/7.jpg)
Using PySpark & SparkR
![Page 8: Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Muglurmath](https://reader031.fdocuments.in/reader031/viewer/2022030305/5875b65b1a28ab8b618b785f/html5/thumbnails/8.jpg)
Hidden Markov Model (HMM)
![Page 9: Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Muglurmath](https://reader031.fdocuments.in/reader031/viewer/2022030305/5875b65b1a28ab8b618b785f/html5/thumbnails/9.jpg)
Dataset Preparation: Training Data
![Page 10: Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Muglurmath](https://reader031.fdocuments.in/reader031/viewer/2022030305/5875b65b1a28ab8b618b785f/html5/thumbnails/10.jpg)
Dataset Preparation: Raw Data
![Page 11: Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Muglurmath](https://reader031.fdocuments.in/reader031/viewer/2022030305/5875b65b1a28ab8b618b785f/html5/thumbnails/11.jpg)
Baum-Welch algorithm for state detection
1. Given the download/upload levels (observations) for a given time interval, the model detects the hidden streaming state for that interval.
2. Given a set of observations (i = 1 .. n), the ith hidden variable depends only on the (i - 1)th hidden variable (the Markov property). For a discrete random variable Xt with N possible values, assume that P(Xt | X{t-1}) is independent of time t.
3. From the observations, estimate the transition probabilities between the N possible states. Then recursively compute likelihoods over all observations, forwards and backwards, to identify the most probable state for each observation.
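The forward and backward recursion in step 3 can be sketched in base R as follows. This is a minimal illustration for a discrete-observation HMM whose parameters are already known, not the RHmm implementation and not the authors' model: the two states, three observation levels, and all probabilities below are made-up assumptions. Baum-Welch would repeat this pass as its expectation step while re-estimating the parameters.

```r
# Forward-backward smoothing for a discrete HMM (illustrative sketch).
# init:  initial state probabilities (length k)
# trans: k x k transition matrix, trans[i, j] = P(next = j | current = i)
# emis:  k x m emission matrix, emis[i, o] = P(observation = o | state = i)
forward_backward <- function(obs, init, trans, emis) {
  n <- length(obs)
  k <- length(init)
  alpha <- matrix(0, n, k)      # scaled forward probabilities
  beta <- matrix(0, n, k)       # scaled backward probabilities
  norm_const <- numeric(n)      # per-step normalizers

  # Forward pass: propagate through trans, weight by the emission probability.
  alpha[1, ] <- init * emis[, obs[1]]
  norm_const[1] <- sum(alpha[1, ])
  alpha[1, ] <- alpha[1, ] / norm_const[1]
  for (i in seq_len(n)[-1]) {
    alpha[i, ] <- as.vector(alpha[i - 1, ] %*% trans) * emis[, obs[i]]
    norm_const[i] <- sum(alpha[i, ])
    alpha[i, ] <- alpha[i, ] / norm_const[i]
  }

  # Backward pass, reusing the forward normalizers for numerical stability.
  beta[n, ] <- 1
  for (i in rev(seq_len(n - 1))) {
    beta[i, ] <- as.vector(trans %*% (emis[, obs[i + 1]] * beta[i + 1, ])) /
      norm_const[i + 1]
  }

  # Posterior probability of each state at each step.
  gamma <- alpha * beta
  gamma <- gamma / rowSums(gamma)
  list(gamma = gamma,
       states = max.col(gamma, ties.method = "first"),
       loglik = sum(log(norm_const)))
}

# Hypothetical parameters: state 1 = idle, state 2 = streaming;
# observation levels 1 = low, 2 = medium, 3 = high bandwidth.
init <- c(0.5, 0.5)
trans <- matrix(c(0.9, 0.1,
                  0.2, 0.8), nrow = 2, byrow = TRUE)
emis <- matrix(c(0.7, 0.2, 0.1,
                 0.1, 0.3, 0.6), nrow = 2, byrow = TRUE)
obs <- c(1, 1, 3, 3, 3, 1)

result <- forward_backward(obs, init, trans, emis)
print(round(result$gamma, 3))
print(result$states)
```

Each row of `gamma` holds the posterior state probabilities for one interval, and `states` picks the most probable state per interval, which is what step 3 describes.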
![Page 12: Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Muglurmath](https://reader031.fdocuments.in/reader031/viewer/2022030305/5875b65b1a28ab8b618b785f/html5/thumbnails/12.jpg)
Sample Code (R):
library('RHmm')
indata <- read.csv(file.choose(), header = FALSE, sep = ",", quote = "\"", dec = ".")
testdata <- read.csv(file.choose(), header = FALSE, sep = ",", quote = "\"", dec = ".")
downloads <- c(as.numeric(indata$V4))
downloadModel <- HMMFit(downloads, nStates=3)
testdownloads <- c(as.numeric(testdata$V4))
tVitPath <- viterbi(downloadModel, testdownloads)
# Forward-backward procedure, compute probabilities
tfb <- forwardBackward(downloadModel, testdownloads)
# Plot implied states
layout(1:3)
plot(testdownloads[1:100],ylab="Down Bandwidth",type="l", main="Download bytes")
plot(tVitPath$states[1:100],ylab="Download States",type="l", main="Download States")
![Page 13: Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Muglurmath](https://reader031.fdocuments.in/reader031/viewer/2022030305/5875b65b1a28ab8b618b785f/html5/thumbnails/13.jpg)
Output for a test dataset
![Page 14: Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Muglurmath](https://reader031.fdocuments.in/reader031/viewer/2022030305/5875b65b1a28ab8b618b785f/html5/thumbnails/14.jpg)
Parallelizing in Hadoop
Steps:
• Create a sample dataset to build the model. This can be a small sample (~2000 - 5000 rows), or a size sufficient to build a generalized model.
• Script the model as an R file, except that it should use streamed input instead of reading from CSV files. Separate map.R and reduce.R can be created if a reduction stage is required to create unified output datasets.
• Test that the code works from the command line with the structure below, where dataset.csv is the input dataset with the structure shown before:
cat dataset.csv | map.R | reduce.R > output.csv
• Ensure that Hive tables are in delimited text format. Deploy and run the model using Hadoop streaming with the sample command line below:
hadoop jar /usr/hdp/2.2.6.4-1/hadoop-mapreduce/hadoop-streaming.jar \
-D mapred.min.split.size=268435456 \
-D mapreduce.task.timeout=300000000 \
-D mapreduce.map.memory.mb=3584 \
-D mapreduce.reduce.memory.mb=8092 \
-input /user/hive/warehouse/ebidatascience.db/ipdr/local_day_id=$NEXT_DATE \
-output /user/hive/warehouse/ebidatascience.db/ipdr_flagged/ \
-file ./map.R \
-file <sample dataset to build model.csv> \
-mapper ./map.R
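A mapper that uses streamed input rather than a CSV file might look like the sketch below. It is only an illustration of the structure: `classify_state()` and its thresholds are hypothetical stand-ins for the fitted HMM, the comma delimiter is an assumption about the table format, and the choice of the fourth column simply mirrors `V4` in the sample code.

```r
#!/usr/bin/env Rscript
# Sketch of a streaming mapper: read delimited records, append a state
# label to each record, and write the records back out.

# Hypothetical stand-in for the fitted model (assumption): bucket the
# download value into three states using fixed, made-up thresholds.
classify_state <- function(download_bytes) {
  findInterval(download_bytes, c(1e5, 1e7)) + 1L
}

# input and output may be connections or file paths.
run_mapper <- function(input, output, sep = ",") {
  lines <- readLines(input, warn = FALSE)
  lines <- lines[nzchar(lines)]                 # skip blank lines
  if (length(lines) == 0) return(invisible(0L))
  fields <- strsplit(lines, sep, fixed = TRUE)
  # Column 4 is assumed to hold the download level, as V4 does in the sample.
  downloads <- vapply(fields,
                      function(f) suppressWarnings(as.numeric(f[4])),
                      numeric(1))
  states <- classify_state(downloads)
  writeLines(paste(lines, states, sep = sep), output)
  invisible(length(lines))
}

# Local check with made-up records instead of a real stream.
input <- textConnection(c("a,b,c,500", "a,b,c,2000000", "a,b,c,50000000"))
out_path <- tempfile(fileext = ".csv")
run_mapper(input, out_path)
close(input)
flagged <- readLines(out_path)
print(flagged)
```

In an actual map.R, the local check would be replaced by a single call such as `run_mapper(file("stdin"), stdout())`, so that the script can sit in the `cat dataset.csv | map.R` pipeline shown above.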
![Page 15: Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Muglurmath](https://reader031.fdocuments.in/reader031/viewer/2022030305/5875b65b1a28ab8b618b785f/html5/thumbnails/15.jpg)
Flagged output
![Page 16: Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Muglurmath](https://reader031.fdocuments.in/reader031/viewer/2022030305/5875b65b1a28ab8b618b785f/html5/thumbnails/16.jpg)
Performance
• 1.7B observations/day
• About 30 minutes processing time/day
• 380 shared nodes
• 92% accuracy in detecting streaming events
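As a rough back-of-the-envelope reading of these figures, the average throughput can be worked out as below. The variable names are illustrative, and because the nodes are shared, the per-node value is only an average over the cluster, not the capacity of a dedicated node.

```r
# Average throughput implied by the figures above.
observations_per_day <- 1.7e9
processing_seconds <- 30 * 60   # about 30 minutes
nodes <- 380

per_second <- observations_per_day / processing_seconds
per_node_per_second <- per_second / nodes

print(round(per_second))
print(round(per_node_per_second))
```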
![Page 17: Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Muglurmath](https://reader031.fdocuments.in/reader031/viewer/2022030305/5875b65b1a28ab8b618b785f/html5/thumbnails/17.jpg)
Output for a test dataset
![Page 19: Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Muglurmath](https://reader031.fdocuments.in/reader031/viewer/2022030305/5875b65b1a28ab8b618b785f/html5/thumbnails/19.jpg)
We are hiring!
• Big Data Engineers (Hadoop, Spark, Kafka…)
• Data Analysts (R, SAS…)
• Big Data Analysts (Hive, Pig…)
![Page 20: Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Muglurmath](https://reader031.fdocuments.in/reader031/viewer/2022030305/5875b65b1a28ab8b618b785f/html5/thumbnails/20.jpg)
THANK YOU.