Bayesian Networks with R and Hadoop
![Page 1: Bayesian Networks with R and Hadoop](https://reader034.fdocuments.in/reader034/viewer/2022042613/547853b1b4af9f157e8b479b/html5/thumbnails/1.jpg)
© Hortonworks Inc. 2014
Hortonworks
Bayesian Networks with R and Hadoop
Hadoop Summit, June 2014
Ofer Mendelevitch
![Page 2: Bayesian Networks with R and Hadoop](https://reader034.fdocuments.in/reader034/viewer/2022042613/547853b1b4af9f157e8b479b/html5/thumbnails/2.jpg)
A bit about me
Ofer Mendelevitch
Director, Data Science @ Hortonworks
Previously: Nor1, Yahoo!, Risk Insight, Quiver
Personal blog: www.achessdad.com
![Page 3: Bayesian Networks with R and Hadoop](https://reader034.fdocuments.in/reader034/viewer/2022042613/547853b1b4af9f157e8b479b/html5/thumbnails/3.jpg)
What I will cover today…
• What is a Bayesian Network?
• Why I think it’s cool
• Bayesian networks with R: the bnlearn package
• Bayes Networks Inference with R and Hadoop
![Page 4: Bayesian Networks with R and Hadoop](https://reader034.fdocuments.in/reader034/viewer/2022042613/547853b1b4af9f157e8b479b/html5/thumbnails/4.jpg)
Introduction to Bayesian Networks
(with examples using R)
![Page 5: Bayesian Networks with R and Hadoop](https://reader034.fdocuments.in/reader034/viewer/2022042613/547853b1b4af9f157e8b479b/html5/thumbnails/5.jpg)
Example: “Asia” Bayesian Network
Each node is a random variable: yes/no
[Figure: the “Asia” network. Nodes: Visit to Asia, Smoking, Tuberculosis, Lung cancer, Bronchitis, Tuberculosis or cancer, X-ray result, Shortness of breath]
![Page 6: Bayesian Networks with R and Hadoop](https://reader034.fdocuments.in/reader034/viewer/2022042613/547853b1b4af9f157e8b479b/html5/thumbnails/6.jpg)
Example: “Asia” Bayesian Network
Graph structure reflects “causal” relationships
[Figure: the “Asia” network graph]
![Page 7: Bayesian Networks with R and Hadoop](https://reader034.fdocuments.in/reader034/viewer/2022042613/547853b1b4af9f157e8b479b/html5/thumbnails/7.jpg)
Example: “Asia” Bayesian Network
Node CPT: P(node | parents)
[Figure: the “Asia” network graph]
CPT for SoB:

| Tub or Cancer | Bronchitis | SoB = T | SoB = F |
|---|---|---|---|
| T | T | 0.7 | 0.3 |
| F | T | 0.4 | 0.6 |
| T | F | 0.45 | 0.55 |
| F | F | 0.05 | 0.95 |
![Page 8: Bayesian Networks with R and Hadoop](https://reader034.fdocuments.in/reader034/viewer/2022042613/547853b1b4af9f157e8b479b/html5/thumbnails/8.jpg)
What is a (discrete) Bayesian Network?
(also called Bayes Nets, Belief Nets, etc.)
• A network structure (DAG):
– Nodes => random variables, taking discrete values
– Edges => conditional dependencies
• E.g., lung cancer is statistically dependent on smoking
• A set of conditional probability tables (CPTs):
– Each node has a set of parents, determined by the graph
– CPT holds P(node | parent-A, parent-B, …) for each node
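The two ingredients above (a DAG plus one CPT per node) define a joint distribution as the product of P(node | parents) over all nodes. The following is a minimal base-R sketch of that factorization for a two-node network Smoking → LC. The probabilities are made-up illustration values, not numbers from the talk.

```r
# Hypothetical two-node network: Smoking -> LC (illustrative numbers only)
p_smoking <- c(yes = 0.5, no = 0.5)            # P(Smoking)
p_lc_given_smoking <- matrix(                  # P(LC | Smoking), one row per parent state
  c(0.10, 0.90,    # Smoking = yes: P(LC = yes), P(LC = no)
    0.01, 0.99),   # Smoking = no
  nrow = 2, byrow = TRUE,
  dimnames = list(Smoking = c("yes", "no"), LC = c("yes", "no")))

# Joint from the factorization P(Smoking, LC) = P(Smoking) * P(LC | Smoking).
# Recycling multiplies row i of the matrix by p_smoking[i].
joint <- p_lc_given_smoking * p_smoking
print(joint)
print(sum(joint))                              # a valid joint distribution sums to 1
```

Each row of the CPT sums to 1, which is why the product over nodes is a proper distribution without any extra normalization.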
![Page 9: Bayesian Networks with R and Hadoop](https://reader034.fdocuments.in/reader034/viewer/2022042613/547853b1b4af9f157e8b479b/html5/thumbnails/9.jpg)
Why are Bayesian Networks cool?
• Intuitive/adaptive modeling tool:
– Graphs are natural for modeling relationships
– Easy to combine data-driven learning with expert know-how
– You can start small, and add knowledge as it is acquired
• “Naturally” addresses inference with missing values
• Inference can be applied to any variable/node
– As opposed to a single (target) variable in supervised learning
![Page 10: Bayesian Networks with R and Hadoop](https://reader034.fdocuments.in/reader034/viewer/2022042613/547853b1b4af9f157e8b479b/html5/thumbnails/10.jpg)
Bayesian networks have been successfully used for a variety of real-world applications
• Healthcare: medical diagnosis, genetic modeling
• Security: crime pattern analysis, terrorism risk management
• Education: student modeling
• Finance: credit rating, predicting defaults
• Tech support: troubleshooting for computers/printers
See “Bayesian networks: a practical guide to applications”, Pourret et al.
![Page 11: Bayesian Networks with R and Hadoop](https://reader034.fdocuments.in/reader034/viewer/2022042613/547853b1b4af9f157e8b479b/html5/thumbnails/11.jpg)
Bayesian networks with R
• http://cran.r-project.org/web/views/Bayesian.html
• We will focus on “bnlearn” (by Marco Scutari)
– Implements various structure learning algorithms (hc, tabu, gs, iamb, mmhc, rsmax2, etc.)
– Provides automated learning of CPTs
– Approximate inference: “logic sampling” and “likelihood weighting”
– Supports snow/parallel for some algorithms
![Page 12: Bayesian Networks with R and Hadoop](https://reader034.fdocuments.in/reader034/viewer/2022042613/547853b1b4af9f157e8b479b/html5/thumbnails/12.jpg)
Step 1: Constructing the graph
[Figure: the “Asia” network graph]
• Manually (expert knowledge)
• Automatically from data
![Page 13: Bayesian Networks with R and Hadoop](https://reader034.fdocuments.in/reader034/viewer/2022042613/547853b1b4af9f157e8b479b/html5/thumbnails/13.jpg)
Manual graph construction: Asia
> library(bnlearn)
> varnames = c("Asia", "Smoking", "Tub", "LC", "Bronchitis", "Tub-or-LC", "X-ray", "SoB")
> ag = empty.graph(varnames)
> # The ignore.cycles argument from the slide is omitted here: the arcs below form a DAG,
> # so the default cycle check passes
> arcs(ag) = data.frame(
+   "from" = c("Asia", "Smoking", "Smoking", "Tub", "LC", "Bronchitis", "Tub-or-LC", "Tub-or-LC"),
+   "to" = c("Tub", "LC", "Bronchitis", "Tub-or-LC", "Tub-or-LC", "SoB", "X-ray", "SoB"))
> graphviz.plot(ag)
![Page 14: Bayesian Networks with R and Hadoop](https://reader034.fdocuments.in/reader034/viewer/2022042613/547853b1b4af9f157e8b479b/html5/thumbnails/14.jpg)
Automated graph construction: Asia
> library(bnlearn)
> varnames = c("Asia", "Smoking", "Tub", "LC", "Bronchitis", "Tub-or-LC", "X-ray", "SoB")
> data(asia); names(asia) = varnames
> bg = hc(asia)
> graphviz.plot(bg)
![Page 15: Bayesian Networks with R and Hadoop](https://reader034.fdocuments.in/reader034/viewer/2022042613/547853b1b4af9f157e8b479b/html5/thumbnails/15.jpg)
Automated learning does not always work perfectly…
For example:
• May not learn all the “expected” edges
• May learn an edge in the wrong direction
Therefore, in practice it helps to:
• Provide whitelist and blacklist to the algorithm
• Pre-seed with a manual network structure, and let the algorithm learn from there
• Ensemble learning of structure (see boot.strength)
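As a rough sketch of the first and third suggestions, the snippet below passes a whitelist and a blacklist to hc() and then bootstraps the structure search with boot.strength(). It assumes bnlearn is installed and uses the asia data with its original column names (A, S, T, L, B, E, X, D). The specific arcs, the 50 bootstrap replicates and the 0.8 / 0.5 thresholds are illustrative choices, not values from the talk.

```r
library(bnlearn)
data(asia)   # original column names: A, S, T, L, B, E, X, D

# Force Smoking -> Lung cancer, and forbid any arc pointing into "Visit to Asia"
wl <- data.frame(from = "S", to = "L")
bl <- data.frame(from = setdiff(names(asia), "A"), to = "A")
net <- hc(asia, whitelist = wl, blacklist = bl)
print(arcs(net))

# Re-learn the structure on bootstrap samples to see which arcs are stable
set.seed(42)
strength <- boot.strength(asia, R = 50, algorithm = "hc",
                          algorithm.args = list(whitelist = wl, blacklist = bl))

# Keep arcs that show up often and mostly in the same direction (illustrative cutoffs)
stable <- strength[strength$strength >= 0.8 & strength$direction >= 0.5, ]
print(stable)
```

Whitelisted arcs always appear in the learned graph and blacklisted arcs never do, so the search only decides the remaining edges.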
![Page 16: Bayesian Networks with R and Hadoop](https://reader034.fdocuments.in/reader034/viewer/2022042613/547853b1b4af9f157e8b479b/html5/thumbnails/16.jpg)
Step 2: Learning the CPT / probabilities
[Figure: the “Asia” network graph]
CPT for SoB:

| Tub or Cancer | Bronchitis | SoB = T | SoB = F |
|---|---|---|---|
| T | T | 0.85 | 0.15 |
| F | T | 0.79 | 0.21 |
| T | F | 0.73 | 0.27 |
| F | F | 0.1 | 0.9 |
![Page 17: Bayesian Networks with R and Hadoop](https://reader034.fdocuments.in/reader034/viewer/2022042613/547853b1b4af9f157e8b479b/html5/thumbnails/17.jpg)
Learning CPT for each node in the graph
> fitted = bn.fit(ag, asia)
> print(fitted$SoB)

  Parameters of node SoB (multinomial distribution)

Conditional probability table:

, , Tub-or-LC = no

     Bronchitis
SoB           no        yes
  no  0.90017286 0.21373057
  yes 0.09982714 0.78626943

, , Tub-or-LC = yes

     Bronchitis
SoB           no        yes
  no  0.27737226 0.14592275
  yes 0.72262774 0.85407725
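For discrete nodes, a maximum-likelihood CPT is just a table of conditional relative frequencies. The base-R sketch below shows that computation on a tiny, made-up data frame with a single parent (Bronchitis) and a child (SoB). It illustrates the idea only and is not a reproduction of bnlearn's internals.

```r
# Toy data (hypothetical, for illustration only): 8 observations
toy <- data.frame(
  Bronchitis = c("yes", "yes", "yes", "yes", "no", "no", "no", "no"),
  SoB        = c("yes", "yes", "yes", "no",  "yes", "no", "no", "no"))

# Joint counts of child (rows) by parent (columns)
counts <- table(SoB = toy$SoB, Bronchitis = toy$Bronchitis)

# Normalize within each parent state so every column sums to 1: P(SoB | Bronchitis)
cpt <- prop.table(counts, margin = 2)
print(cpt)
```

With more than one parent, the same normalization is applied within every combination of parent states, which is why the printed CPT for SoB has one slice per value of Tub-or-LC.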
![Page 18: Bayesian Networks with R and Hadoop](https://reader034.fdocuments.in/reader034/viewer/2022042613/547853b1b4af9f157e8b479b/html5/thumbnails/18.jpg)
Using the BN for inference
• Given evidence: (1) visit to Asia, (2) SoB, (3) Bronchitis
• What is the probability of “lung cancer”?
[Figure: the “Asia” network graph]
![Page 19: Bayesian Networks with R and Hadoop](https://reader034.fdocuments.in/reader034/viewer/2022042613/547853b1b4af9f157e8b479b/html5/thumbnails/19.jpg)
Inferring with missing values
• We provide evidence (“yes” or “no” in this case) only for those nodes where we have such evidence
• If a value is “missing” it’s just not included in the evidence when doing inference…
This is in contrast to supervised learning, where ALL values are typically needed for inference.
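One way to see why missing values are unproblematic: any node without evidence is simply summed out of the joint distribution. The base-R sketch below does this by brute-force enumeration on a hypothetical three-node chain Smoking → LC → Xray. The CPT values and the helper name posterior are assumptions made for illustration. Real packages use much more efficient algorithms than full enumeration.

```r
# Hypothetical chain S -> LC -> X with made-up CPTs (rows = parent state)
p_s  <- c(yes = 0.5, no = 0.5)                          # P(S)
p_lc <- rbind(yes = c(yes = 0.10, no = 0.90),           # P(LC | S = yes)
              no  = c(yes = 0.01, no = 0.99))           # P(LC | S = no)
p_x  <- rbind(yes = c(yes = 0.90, no = 0.10),           # P(X | LC = yes)
              no  = c(yes = 0.05, no = 0.95))           # P(X | LC = no)

# Enumerate the full joint P(S, LC, X), one row per configuration
grid <- expand.grid(S = c("yes", "no"), LC = c("yes", "no"), X = c("yes", "no"),
                    stringsAsFactors = FALSE)
grid$p <- p_s[grid$S] * p_lc[cbind(grid$S, grid$LC)] * p_x[cbind(grid$LC, grid$X)]

# Posterior of LC given whatever evidence is available; nodes that are
# absent from the evidence list are summed out
posterior <- function(evidence) {
  keep <- rep(TRUE, nrow(grid))
  for (node in names(evidence)) keep <- keep & grid[[node]] == evidence[[node]]
  sub <- grid[keep, ]
  tapply(sub$p, sub$LC, sum) / sum(sub$p)
}

print(posterior(list(X = "yes")))              # Smoking is missing
print(posterior(list(X = "yes", S = "yes")))   # Smoking is observed
```

The same function answers both queries. The only difference is which rows of the joint survive the evidence filter.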
![Page 20: Bayesian Networks with R and Hadoop](https://reader034.fdocuments.in/reader034/viewer/2022042613/547853b1b4af9f157e8b479b/html5/thumbnails/20.jpg)
Exact Inference with gRain
• The gRain package implements exact inference for discrete Bayesian Networks using the “Junction Tree” belief propagation algorithm
• bnlearn and gRain cooperate nicely
> library(gRain)
> jtree = compile(as.grain(fitted))
> jp = setFinding(jtree, nodes = c("Asia", "SoB", "Bronchitis"), states = c("yes", "yes", "yes"))
> print(querygrain(jp, nodes="LC")$LC)
LC
   no   yes
0.934 0.066
![Page 21: Bayesian Networks with R and Hadoop](https://reader034.fdocuments.in/reader034/viewer/2022042613/547853b1b4af9f157e8b479b/html5/thumbnails/21.jpg)
Approximate inference with bnlearn
bnlearn implements approximate inference: logic sampling (aka rejection sampling) and likelihood weighting
> # Infer probability P(SoB | Asia, Bronchitis) using logic sampling
> p1 = cpquery(fitted, event = (SoB == 'yes'), evidence = (Asia == 'yes' & Bronchitis == 'yes'), method="ls")
> print(p1)
[1] 0.8014706
> # Infer probability P(SoB | Asia, Bronchitis) using likelihood weighting
> evidence = list("yes", "yes")
> names(evidence) = c("Asia", "Bronchitis")
> p2 = cpquery(fitted, event = (SoB == 'yes'), evidence = evidence, method="lw")
> print(p2)
[1] 0.795404
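To show what the two methods do under the hood, here is a minimal base-R sketch on a hypothetical two-node network Bronchitis → SoB, estimating P(Bronchitis = yes | SoB = yes). All probabilities are made up, and the code illustrates the general techniques rather than bnlearn's actual implementation.

```r
set.seed(1)
n <- 100000
p_b   <- 0.3                        # P(B = yes), made up
p_sob <- c(yes = 0.8, no = 0.1)     # P(SoB = yes | B), made up

# Exact answer by Bayes' rule, for reference
exact <- p_b * p_sob[["yes"]] /
  (p_b * p_sob[["yes"]] + (1 - p_b) * p_sob[["no"]])

# Logic (rejection) sampling: sample every node in topological order,
# then discard the samples that contradict the evidence SoB = yes
b   <- ifelse(runif(n) < p_b, "yes", "no")
sob <- runif(n) < p_sob[b]          # TRUE means SoB = yes
ls_est <- mean(b[sob] == "yes")

# Likelihood weighting: sample only the non-evidence node, and weight each
# sample by the probability of the evidence given its sampled parents
b2 <- ifelse(runif(n) < p_b, "yes", "no")
w  <- p_sob[b2]
lw_est <- sum(w[b2 == "yes"]) / sum(w)

print(c(exact = exact, logic_sampling = ls_est, likelihood_weighting = lw_est))
```

Logic sampling throws away every sample where SoB is "no", while likelihood weighting uses all of them. This is also why two runs of cpquery give slightly different numbers.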
![Page 22: Bayesian Networks with R and Hadoop](https://reader034.fdocuments.in/reader034/viewer/2022042613/547853b1b4af9f157e8b479b/html5/thumbnails/22.jpg)
Large scale Bayes Networks Inference with R and Hadoop
![Page 23: Bayesian Networks with R and Hadoop](https://reader034.fdocuments.in/reader034/viewer/2022042613/547853b1b4af9f157e8b479b/html5/thumbnails/23.jpg)
What is large?
• Number of nodes:
– 10s: Medium
– 100s: Large
– 1000s: Very large
• Number of instances:
– 100,000s to millions
![Page 24: Bayesian Networks with R and Hadoop](https://reader034.fdocuments.in/reader034/viewer/2022042613/547853b1b4af9f157e8b479b/html5/thumbnails/24.jpg)
Manually constructing large graphs is hard
![Page 25: Bayesian Networks with R and Hadoop](https://reader034.fdocuments.in/reader034/viewer/2022042613/547853b1b4af9f157e8b479b/html5/thumbnails/25.jpg)
Large scale learning in practice: manual + automated
• Define nodes
• Seed with some known edges, based on expert knowledge
• Augment with automated learning (e.g., hc, tabu, rsmax2, etc.)
![Page 26: Bayesian Networks with R and Hadoop](https://reader034.fdocuments.in/reader034/viewer/2022042613/547853b1b4af9f157e8b479b/html5/thumbnails/26.jpg)
Large scale inference: Exact or Approximate?
| Method | Pros | Cons |
|---|---|---|
| Exact (Jtree), gRain | Fast inference time | Computational complexity determined (exponentially) by largest clique size |
| Approximate (LS, LW), bnlearn | Can be used for any graph; not limited by “clique” size | Inference is often much slower; not accurate for rare events |
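A back-of-the-envelope sketch of the two "Cons" entries, in base R. It assumes binary variables and uses arbitrary example sizes. A clique table over k binary nodes has 2^k entries, and logic sampling keeps on average only n × P(evidence) of its n samples.

```r
# Exact inference: entries in a clique potential table over k binary variables
clique_sizes  <- c(5, 10, 20, 30)
table_entries <- 2^clique_sizes
print(data.frame(clique_sizes, table_entries))

# Logic sampling: expected number of samples that survive rejection,
# for n draws and different probabilities of the observed evidence
n <- 1e6
p_evidence    <- c(0.1, 0.001, 0.00001)
expected_kept <- n * p_evidence
print(data.frame(p_evidence, expected_kept))
```

With evidence that occurs once in 100,000 cases, a million samples leave only about ten usable ones, so the resulting estimate is very noisy.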
![Page 27: Bayesian Networks with R and Hadoop](https://reader034.fdocuments.in/reader034/viewer/2022042613/547853b1b4af9f157e8b479b/html5/thumbnails/27.jpg)
About RHadoop/RMR
• An open source project, supported by Revolution Analytics
• Various sub-projects: RMR, RHDFS, RHBASE, plyrmr, etc.
• We will focus on RMR
– Implement mapper/reducer code using R
• RHadoop: https://github.com/RevolutionAnalytics/RHadoop/wiki
• Installing RMR on HDP:
– http://www.slideshare.net/Hadoop_Summit/enabling-r-on-hadoop
– http://www.research.janahang.com/install-rhadoop-on-hortonworks-hdp-2-0/
![Page 28: Bayesian Networks with R and Hadoop](https://reader034.fdocuments.in/reader034/viewer/2022042613/547853b1b4af9f157e8b479b/html5/thumbnails/28.jpg)
Large scale inference with R and Hadoop
Infer with RMR
Inference is embarrassingly parallel
Hadoop determines # of mappers, based on file size
So we’ll use reducers to parallelize cpquery
![Page 29: Bayesian Networks with R and Hadoop](https://reader034.fdocuments.in/reader034/viewer/2022042613/547853b1b4af9f157e8b479b/html5/thumbnails/29.jpg)
Example: Adult dataset
• Donated by Ronny Kohavi and Barry Becker, 1996 - http://archive.ics.uci.edu/ml/datasets/Adult
• Extracted from 1994 census data
• 48842 instances, 14 features such as:
– Age, country, occupation, marital status, capital gain, etc.
– Goal: predict if income is >50K or not
…
53, Private, 234721, 11th, 7, Married-civ-spouse, Handlers-cleaners, Husband, Black, Male, 0, 0, 40, United-States, <=50K
28, Private, 338409, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, Black, Female, 0, 0, 40, Cuba, <=50K
37, Private, 284582, Masters, 14, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 40, United-States, <=50K
49, Private, 160187, 9th, 5, Married-spouse-absent, Other-service, Not-in-family, Black, Female, 0, 0, 16, Jamaica, <=50K
52, Self-emp-not-inc, 209642, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, >50K
…
![Page 30: Bayesian Networks with R and Hadoop](https://reader034.fdocuments.in/reader034/viewer/2022042613/547853b1b4af9f157e8b479b/html5/thumbnails/30.jpg)
Sample learned network structure for “adult”
![Page 31: Bayesian Networks with R and Hadoop](https://reader034.fdocuments.in/reader034/viewer/2022042613/547853b1b4af9f157e8b479b/html5/thumbnails/31.jpg)
Inference with RMR on adult dataset
library(rmr2)
NUM_REDUCERS = 4
opt = rmr.options(backend = "hadoop",
  backend.parameters = list(hadoop = list(
    D = "mapreduce.reduce.memory.mb=1024",
    D = paste0("mapreduce.job.reduces=", NUM_REDUCERS))))
inpFile = 'adult.test'
outFile = 'adult.out'
mapreduce(input=inpFile, input.format="text", output=outFile, output.format="csv", map=map_func, reduce=reduce_func)
![Page 32: Bayesian Networks with R and Hadoop](https://reader034.fdocuments.in/reader034/viewer/2022042613/547853b1b4af9f157e8b479b/html5/thumbnails/32.jpg)
© Hortonworks Inc. 2014 Page 32
Our mapper: passing on to reducer…
map_func <- function(., values) {
  out_klist = list(); out_vlist = list()
  for (v in values) {
    fvec = unlist(strsplit(v, ',', fixed=T))  # Read row and split into columns
    if (length(fvec) < 15) { next }           # deal with row not in expected format
    key = floor(runif(1, 0, NUM_REDUCERS))    # random key spreads rows across reducers
    out_klist = c(out_klist, key)
    out_vlist = c(out_vlist, v)
  }
  return (keyval(out_klist, out_vlist))
}
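The random key is what spreads the rows across reducers. The base-R sketch below imitates that flow locally, with no Hadoop or rmr2 involved: split() stands in for the shuffle phase and sapply() for the independent reducer calls. The row strings and the row count are placeholders chosen for illustration.

```r
set.seed(7)
NUM_REDUCERS <- 4
rows <- sprintf("row-%03d", 1:1000)          # stand-ins for input lines

# "Map": tag every row with a random reducer key in 0..NUM_REDUCERS-1
keys <- floor(runif(length(rows), 0, NUM_REDUCERS))

# "Shuffle": group rows by key, as the framework does before calling reducers
groups <- split(rows, keys)

# "Reduce": each group is processed independently (here we only count rows)
sizes <- sapply(groups, length)
print(sizes)
```

Because keys are drawn uniformly, each reducer receives roughly the same number of rows, so the expensive per-row query is balanced across them.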
![Page 33: Bayesian Networks with R and Hadoop](https://reader034.fdocuments.in/reader034/viewer/2022042613/547853b1b4af9f157e8b479b/html5/thumbnails/33.jpg)
Our reducer: where all the action happens
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
reduce_func <- function(., values) {
  out_klist = list(); out_vlist = list()
  for (v in values) {
    increment.counter('bn-demo', 'row', 1)  # to let MR know we are still active
    fvec = sapply(strsplit(v, ',', fixed=T), trim)  # read row and split into columns
    names(fvec) = c("age", "type_employer", "fnlwgt", "education", "education_num",
                    "marital", "occupation", "relationship", "race", "sex",
                    "capital_gain", "capital_loss", "hr_per_week", "country", "income")
    pv = dataprep(fvec)  # transform to “learned” features
    evidence = as.list(pv[1, setdiff(colnames(pv), 'income')])
    prob = cpquery(fitted, event = (income == ">50K"), evidence = evidence, method="lw")
    out_klist = c(out_klist, v)
    out_vlist = c(out_vlist, format(prob, digits=2))
  }
  return (keyval(out_klist, out_vlist))
}
![Page 34: Bayesian Networks with R and Hadoop](https://reader034.fdocuments.in/reader034/viewer/2022042613/547853b1b4af9f157e8b479b/html5/thumbnails/34.jpg)
Example output: adult.out
26, Private, 191573, Assoc-acdm, 12, Married-civ-spouse, Prof-specialty, Wife, White, Female, 0, 0, 40, United-States, >50K. ,0.37
52, Private, 203635, HS-grad, 9, Married-civ-spouse, Handlers-cleaners, Husband, White, Male, 0, 0, 40, United-States, >50K. ,0.14
36, Private, 68798, HS-grad, 9, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 30, United-States, <=50K., 0.019
34, Private, 31752, HS-grad, 9, Divorced, Machine-op-inspct, Other-relative, White, Female, 0, 0, 40, ?, <=50K. ,0.14
59, ?, 291856, HS-grad, 9, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 40, United-States, <=50K. ,0.074
26, Private, 135848, Bachelors, 13, Never-married, Sales, Not-in-family, White, Male, 0, 0, 10, Guatemala, <=50K. ,0.03
50, Local-gov, 237356, Some-college, 10, Married-civ-spouse, Protective-serv, Husband, White, Male, 7298, 0, 40, United-States, >50K.,0.89
56, Self-emp-not-inc, 140729, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 0, 0, 45, United-States, <=50K.,0.14
22, Private, 54560, HS-grad, 9, Married-civ-spouse, Transport-moving, Husband, White, Male, 0, 0, 60, United-States, <=50K., 0.21
45, Self-emp-inc, 88500, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 7298, 0, 40, United-States, >50K., 0.94
![Page 35: Bayesian Networks with R and Hadoop](https://reader034.fdocuments.in/reader034/viewer/2022042613/547853b1b4af9f157e8b479b/html5/thumbnails/35.jpg)
More information
• Detailed step-by-step guide and code used can be found on: https://github.com/ofermend/bayes-net-r-hadoop
• Download Hortonworks Sandbox: http://hortonworks.com/products/hortonworks-sandbox/
• Further reading/learning:
– http://www.bnlearn.com/
– PGM class on Coursera: https://www.coursera.org/course/pgm
– PGM Ebook from UCL: http://web4.cs.ucl.ac.uk/staff/D.Barber/textbook/250214.pdf
– Many others…
![Page 36: Bayesian Networks with R and Hadoop](https://reader034.fdocuments.in/reader034/viewer/2022042613/547853b1b4af9f157e8b479b/html5/thumbnails/36.jpg)
Thank you!
Any Questions?
Ofer Mendelevitch, [email protected], @ofermend
We’re hiring! www.hortonworks.com/careers
Hortonworks training: www.hortonworks.com/training
Hortonworks blog: www.hortonworks.com/blog