Karl Broman · script to combine results combined results ReadMe describing the point 16. One...
Transcript of Karl Broman · script to combine results combined results ReadMe describing the point 16. One...
![Page 1: Karl Broman · script to combine results combined results ReadMe describing the point 16. One Makefile to rule them all Separate directory for each batch of big computations. Makefile](https://reader035.fdocuments.in/reader035/viewer/2022070722/5f01dcf27e708231d4016703/html5/thumbnails/1.jpg)
Big jobs/simulations
Karl Broman
Biostatistics & Medical Informatics, UW–Madison
kbroman.orggithub.com/kbroman
@kwbromanCourse web: kbroman.org/AdvData
![Page 2: Karl Broman · script to combine results combined results ReadMe describing the point 16. One Makefile to rule them all Separate directory for each batch of big computations. Makefile](https://reader035.fdocuments.in/reader035/viewer/2022070722/5f01dcf27e708231d4016703/html5/thumbnails/2.jpg)
But first…
Suppose I’ve just written an R function and it seems to work, and suppose Inoticed a simple way to speed it up.
What should I do first?
2
![Page 3: Karl Broman · script to combine results combined results ReadMe describing the point 16. One Makefile to rule them all Separate directory for each batch of big computations. Makefile](https://reader035.fdocuments.in/reader035/viewer/2022070722/5f01dcf27e708231d4016703/html5/thumbnails/3.jpg)
But first…
Suppose I’ve just written an R function and it seems to work, and suppose Inoticed a simple way to speed it up.
What should I do first?
▶ Commit it to a git repository
2
![Page 4: Karl Broman · script to combine results combined results ReadMe describing the point 16. One Makefile to rule them all Separate directory for each batch of big computations. Makefile](https://reader035.fdocuments.in/reader035/viewer/2022070722/5f01dcf27e708231d4016703/html5/thumbnails/4.jpg)
But first…
Suppose I’ve just written an R function and it seems to work, and suppose Inoticed a simple way to speed it up.
What should I do first?
▶ Commit it to a git repository▶ Make it an R package
2
![Page 5: Karl Broman · script to combine results combined results ReadMe describing the point 16. One Makefile to rule them all Separate directory for each batch of big computations. Makefile](https://reader035.fdocuments.in/reader035/viewer/2022070722/5f01dcf27e708231d4016703/html5/thumbnails/5.jpg)
But first…
Suppose I’ve just written an R function and it seems to work, and suppose Inoticed a simple way to speed it up.
What should I do first?
▶ Commit it to a git repository▶ Make it an R package▶ Write a test or two
2
![Page 6: Karl Broman · script to combine results combined results ReadMe describing the point 16. One Makefile to rule them all Separate directory for each batch of big computations. Makefile](https://reader035.fdocuments.in/reader035/viewer/2022070722/5f01dcf27e708231d4016703/html5/thumbnails/6.jpg)
So what’s the big deal?
▶ You don’t want knitr running for a year.▶ You don’t want to re-run things if you don’t have to.
3
![Page 7: Karl Broman · script to combine results combined results ReadMe describing the point 16. One Makefile to rule them all Separate directory for each batch of big computations. Makefile](https://reader035.fdocuments.in/reader035/viewer/2022070722/5f01dcf27e708231d4016703/html5/thumbnails/7.jpg)
Unix basicsnice +19 R CMD BATCH input.R output.txt &
fgctrl-zbg
ps uxtop
killkill -9pkill
4
![Page 8: Karl Broman · script to combine results combined results ReadMe describing the point 16. One Makefile to rule them all Separate directory for each batch of big computations. Makefile](https://reader035.fdocuments.in/reader035/viewer/2022070722/5f01dcf27e708231d4016703/html5/thumbnails/8.jpg)
Disk thrashing
In computer science, thrashing occurs when a computer’s virtual memorysubsystem is in a constant state of paging
, rapidly exchanging data inmemory for data on disk, to the exclusion of most application-levelprocessing.
– Wikipedia
5
![Page 9: Karl Broman · script to combine results combined results ReadMe describing the point 16. One Makefile to rule them all Separate directory for each batch of big computations. Makefile](https://reader035.fdocuments.in/reader035/viewer/2022070722/5f01dcf27e708231d4016703/html5/thumbnails/9.jpg)
Disk thrashing
In computer science, thrashing occurs when a computer’s virtual memorysubsystem is in a constant state of paging, rapidly exchanging data inmemory for data on disk, to the exclusion of most application-levelprocessing.
– Wikipedia
5
![Page 10: Karl Broman · script to combine results combined results ReadMe describing the point 16. One Makefile to rule them all Separate directory for each batch of big computations. Makefile](https://reader035.fdocuments.in/reader035/viewer/2022070722/5f01dcf27e708231d4016703/html5/thumbnails/10.jpg)
Biggish jobs in knitr
▶ Manual caching▶ Built-in cache=TRUE
▶ Split the work and write a Makefile
6
![Page 11: Karl Broman · script to combine results combined results ReadMe describing the point 16. One Makefile to rule them all Separate directory for each batch of big computations. Makefile](https://reader035.fdocuments.in/reader035/viewer/2022070722/5f01dcf27e708231d4016703/html5/thumbnails/11.jpg)
Manual caching
```{r a_code_chunk}file <- "cache/myfile.RData"
if(file.exists(file)) {load(file)
} else{
....
save(object1 , object2 , object3 , file=file)}```
7
![Page 12: Karl Broman · script to combine results combined results ReadMe describing the point 16. One Makefile to rule them all Separate directory for each batch of big computations. Makefile](https://reader035.fdocuments.in/reader035/viewer/2022070722/5f01dcf27e708231d4016703/html5/thumbnails/12.jpg)
Manual caching
```{r a_code_chunk}file <- "cache/myfile.rds"
if(file.exists(file)) {object <- readRDS(file)
} else{
....
saveRDS(object, file)}```
7
![Page 13: Karl Broman · script to combine results combined results ReadMe describing the point 16. One Makefile to rule them all Separate directory for each batch of big computations. Makefile](https://reader035.fdocuments.in/reader035/viewer/2022070722/5f01dcf27e708231d4016703/html5/thumbnails/13.jpg)
Chunk references
```{r not_shown , eval=FALSE}code_here <- 0```
```{r a_code_chunk , echo=FALSE}file <- "cache/myfile.RData"
if(file.exists(file)) {load(file)
} else{<<not_shown >>
save(code_here , file=file)}```
8
![Page 14: Karl Broman · script to combine results combined results ReadMe describing the point 16. One Makefile to rule them all Separate directory for each batch of big computations. Makefile](https://reader035.fdocuments.in/reader035/viewer/2022070722/5f01dcf27e708231d4016703/html5/thumbnails/14.jpg)
A cache gone bad
9
![Page 15: Karl Broman · script to combine results combined results ReadMe describing the point 16. One Makefile to rule them all Separate directory for each batch of big computations. Makefile](https://reader035.fdocuments.in/reader035/viewer/2022070722/5f01dcf27e708231d4016703/html5/thumbnails/15.jpg)
Knitr’s cache system
```{r chunk_name , cache=TRUE}load("a_big_file.RData")med <- apply(object, 2, median, na.rm=TRUE)```
▶ Chunk is re-run if edited.▶ Otherwise, objects from previous run are loaded.▶ Don’t cache things with side effects
e.g., options(), par()
10
![Page 16: Karl Broman · script to combine results combined results ReadMe describing the point 16. One Makefile to rule them all Separate directory for each batch of big computations. Makefile](https://reader035.fdocuments.in/reader035/viewer/2022070722/5f01dcf27e708231d4016703/html5/thumbnails/16.jpg)
Cache dependencies
Manual dependencies
```{r chunkA, cache=TRUE}Sys.sleep(2)x <- 5```
```{r chunkB, cache=TRUE, dependson="chunkA"}Sys.sleep(2)y <- x + 1```
```{r chunkC, cache=TRUE, dependson="chunkB"}Sys.sleep(2)z <- y + 1```
11
![Page 17: Karl Broman · script to combine results combined results ReadMe describing the point 16. One Makefile to rule them all Separate directory for each batch of big computations. Makefile](https://reader035.fdocuments.in/reader035/viewer/2022070722/5f01dcf27e708231d4016703/html5/thumbnails/17.jpg)
Cache dependencies
Automatic dependencies
```{r setup, include=FALSE}opts_chunk$set(autodep = TRUE)dep_auto()```
12
![Page 18: Karl Broman · script to combine results combined results ReadMe describing the point 16. One Makefile to rule them all Separate directory for each batch of big computations. Makefile](https://reader035.fdocuments.in/reader035/viewer/2022070722/5f01dcf27e708231d4016703/html5/thumbnails/18.jpg)
Parallel computing
If your computer has multiple processors, use library(parallel) to makeuse of them.
▶ detectCores()
▶ RNGkind("L'Ecuyer-CMRG") and mclapply (Unix/Mac)▶ makeCluster, clustersetRNGStream, clusterApply, and stopCluster
(Windows)
13
![Page 19: Karl Broman · script to combine results combined results ReadMe describing the point 16. One Makefile to rule them all Separate directory for each batch of big computations. Makefile](https://reader035.fdocuments.in/reader035/viewer/2022070722/5f01dcf27e708231d4016703/html5/thumbnails/19.jpg)
Systems for distributed computing
▶ HTCondor and the UW-Madison CHTC▶ Other condor-like systems▶ “By hand”
– e.g., perl script + template R script
14
![Page 20: Karl Broman · script to combine results combined results ReadMe describing the point 16. One Makefile to rule them all Separate directory for each batch of big computations. Makefile](https://reader035.fdocuments.in/reader035/viewer/2022070722/5f01dcf27e708231d4016703/html5/thumbnails/20.jpg)
Simulations
▶ Computer simulations require RNG seeds (.Random.seed in R).▶ Multiple parallel jobs need different seeds.▶ Don’t rely on the current seed, or on having it generated from the clock.▶ Use something like set.seed(91820205 + i)
▶ An alternative is create a big batch of simulated data sets in advance.
15
![Page 21: Karl Broman · script to combine results combined results ReadMe describing the point 16. One Makefile to rule them all Separate directory for each batch of big computations. Makefile](https://reader035.fdocuments.in/reader035/viewer/2022070722/5f01dcf27e708231d4016703/html5/thumbnails/21.jpg)
Save everything
▶ RNG seeds▶ input▶ output▶ version numbers, with sessionInfo()
▶ raw results▶ script to combine results▶ combined results▶ ReadMe describing the point
16
![Page 22: Karl Broman · script to combine results combined results ReadMe describing the point 16. One Makefile to rule them all Separate directory for each batch of big computations. Makefile](https://reader035.fdocuments.in/reader035/viewer/2022070722/5f01dcf27e708231d4016703/html5/thumbnails/22.jpg)
One Makefile to rule them all
▶ Separate directory for each batch of big computations.▶ Makefile that controls the combination of the results (and everything
else).▶ KnitR-based documents for the analysis/use of those results.
17
![Page 23: Karl Broman · script to combine results combined results ReadMe describing the point 16. One Makefile to rule them all Separate directory for each batch of big computations. Makefile](https://reader035.fdocuments.in/reader035/viewer/2022070722/5f01dcf27e708231d4016703/html5/thumbnails/23.jpg)
Potential problems
▶ Forgetting save() in your distributed jobs▶ A bug in the save() command▶ make clobbers some important results
Scripts should refuse to overwrite output files
18
![Page 24: Karl Broman · script to combine results combined results ReadMe describing the point 16. One Makefile to rule them all Separate directory for each batch of big computations. Makefile](https://reader035.fdocuments.in/reader035/viewer/2022070722/5f01dcf27e708231d4016703/html5/thumbnails/24.jpg)
Summary
▶ Careful organization and modularization.▶ Save everything.▶ Document everything.▶ Learn the basic skills for distributed computing.
19