Interacting with Remote Systems - Princeton University · 2 non-parameteric bootstrap confidence...
Transcript of Interacting with Remote Systems - Princeton University · 2 non-parameteric bootstrap confidence...
TIGRESS Interacting with Remote Systems Single-Node Jobs Exercise Wrap-Up
Interacting with Remote Systems
Advanced Statistical Programming CampJonathan Olmsted (Q-APS)
Day 2: May 28th, 2014AM Session
ASPC Interacting with Remote Systems Day 2 AM 1 / 44
TIGRESS Interacting with Remote Systems Single-Node Jobs Exercise Wrap-Up
Yesterday . . .
Topics
1 monitoring performance
2 vectorization, built-infunctions
3 convenience wrappers
4 a new looping construct
5 parallel foreach
6 parallel RNG
Examples1 pairwise distance calculation2 non-parameteric bootstrap
confidence intervals (voterturnout)
3 parametric bootstrapconfidence intervals (voterturnout)
4 cross-validation (civil waronset)
ASPC Interacting with Remote Systems Day 2 AM 2 / 44
TIGRESS Interacting with Remote Systems Single-Node Jobs Exercise Wrap-Up
This session . . .
1 Overview of TIGRESS Systems
2 Interacting with Remote Systems
3 Running Single-Node Jobs on Adroit
4 Group Exercise
ASPC Interacting with Remote Systems Day 2 AM 3 / 44
TIGRESS Interacting with Remote Systems Single-Node Jobs Exercise Wrap-Up
TIGRESS
• TIGRESS is a collection of computational resources for PrincetonResearchers
• Computational clusters• Visualization hardware• Visualization laboratory with visualization wall
• 4096 x 2160 pixels• 150 sq. ft.
ASPC Interacting with Remote Systems Day 2 AM 4 / 44
TIGRESS Interacting with Remote Systems Single-Node Jobs Exercise Wrap-Up
Computational Clusters
ASPC Interacting with Remote Systems Day 2 AM 5 / 44
TIGRESS Interacting with Remote Systems Single-Node Jobs Exercise Wrap-Up
Logging In to Adroit
• Connect to cluster through the SSH protocol.
• On Windows: use PuTTY• On Mac/Linux: use the terminal and ssh
• ssh usersystem.princeton.edu• If using a wireless connection (with a laptop), use the SRA client.
ASPC Interacting with Remote Systems Day 2 AM 6 / 44
TIGRESS Interacting with Remote Systems Single-Node Jobs Exercise Wrap-Up
Logging In to Adroit
The machine gives you a prompt, waiting for input (like R)
ASPC Interacting with Remote Systems Day 2 AM 7 / 44
TIGRESS Interacting with Remote Systems Single-Node Jobs Exercise Wrap-Up
Interacting with Adroit
Everything is a file or directory and organized in a hierarchy. Get yourcurrent location with pwd.
ASPC Interacting with Remote Systems Day 2 AM 8 / 44
TIGRESS Interacting with Remote Systems Single-Node Jobs Exercise Wrap-Up
Interacting with Adroit
ls lists the contents of the current (working) directory.
ASPC Interacting with Remote Systems Day 2 AM 9 / 44
TIGRESS Interacting with Remote Systems Single-Node Jobs Exercise Wrap-Up
Interacting with Adroit
mkdir creates a directory.
ASPC Interacting with Remote Systems Day 2 AM 10 / 44
TIGRESS Interacting with Remote Systems Single-Node Jobs Exercise Wrap-Up
Interacting with Adroit
rm removes/deletes a file. rmdir removes/deletes a directory.
ASPC Interacting with Remote Systems Day 2 AM 11 / 44
TIGRESS Interacting with Remote Systems Single-Node Jobs Exercise Wrap-Up
Interacting with Adroit
cd changes the directory to the specified location. cd .. changes thedirectory up one level.
ASPC Interacting with Remote Systems Day 2 AM 12 / 44
TIGRESS Interacting with Remote Systems Single-Node Jobs Exercise Wrap-Up
Interacting with Adroit
You can make simple edits to files with Emacs. Start it with emacs -nwfilename.
ASPC Interacting with Remote Systems Day 2 AM 13 / 44
TIGRESS Interacting with Remote Systems Single-Node Jobs Exercise Wrap-Up
Interacting with Adroit
Once in Emacs, you can type like you would in a GUI text editor.
ASPC Interacting with Remote Systems Day 2 AM 14 / 44
TIGRESS Interacting with Remote Systems Single-Node Jobs Exercise Wrap-Up
Interacting with Adroit
To save, type Ctrl+x Ctrl+s. To quit, type Ctrl+x Ctrl+c.
ASPC Interacting with Remote Systems Day 2 AM 15 / 44
TIGRESS Interacting with Remote Systems Single-Node Jobs Exercise Wrap-Up
Interacting with Adroit
cp copies files from one location (or file name) to another.
ASPC Interacting with Remote Systems Day 2 AM 16 / 44
TIGRESS Interacting with Remote Systems Single-Node Jobs Exercise Wrap-Up
Interacting with Adroit
mv moves/renames files or directories.
ASPC Interacting with Remote Systems Day 2 AM 17 / 44
TIGRESS Interacting with Remote Systems Single-Node Jobs Exercise Wrap-Up
Interacting with Adroit
cat displays the contents of a file.
ASPC Interacting with Remote Systems Day 2 AM 18 / 44
TIGRESS Interacting with Remote Systems Single-Node Jobs Exercise Wrap-Up
Interacting with Adroit
head displays the beginning of a file.
ASPC Interacting with Remote Systems Day 2 AM 19 / 44
TIGRESS Interacting with Remote Systems Single-Node Jobs Exercise Wrap-Up
Interacting with Adroit
tail displays the end of a file.
ASPC Interacting with Remote Systems Day 2 AM 20 / 44
TIGRESS Interacting with Remote Systems Single-Node Jobs Exercise Wrap-Up
Submitting Jobs
• With these pieces, we can submit jobs.
• The resources on Adroit (and other systems) us a scheduler(SLURM).
• It’s like taking a ticket at the deli counter.• Once the machine is ready to run what you’ve requested, your job
comes out of the queue.
• Requesting resources requires specific commands and a specificformat of the request.
ASPC Interacting with Remote Systems Day 2 AM 21 / 44
TIGRESS Interacting with Remote Systems Single-Node Jobs Exercise Wrap-Up
Submitting Jobs
#!/ bin/sh#SBATCH -N 1#SBATCH --ntasks -per -node=1#SBATCH -t 10:00#SBATCH -J Ex1#SBATCH -o log.%j#SBATCH --mail -type=begin#SBATCH --mail -type=end
Rscript -e 'a <- rnorm(1e3)'
Listing 1: Example SLURM Job Script
ASPC Interacting with Remote Systems Day 2 AM 22 / 44
TIGRESS Interacting with Remote Systems Single-Node Jobs Exercise Wrap-Up
Submitting Jobs
sbatch submits a SLURM script to the queue.
ASPC Interacting with Remote Systems Day 2 AM 23 / 44
TIGRESS Interacting with Remote Systems Single-Node Jobs Exercise Wrap-Up
Submitting Jobs
squeue displays information about the jobs in the queue.
ASPC Interacting with Remote Systems Day 2 AM 24 / 44
TIGRESS Interacting with Remote Systems Single-Node Jobs Exercise Wrap-Up
Submitting Jobs
A busy queue will have a lot of information.
ASPC Interacting with Remote Systems Day 2 AM 25 / 44
TIGRESS Interacting with Remote Systems Single-Node Jobs Exercise Wrap-Up
Omitted Variable Bias
In the handout, we demonstrate the effect of omitted variable bias in alinear model.
• Simulate data according to
y = β0 + β1x1 + β2x2 + β3x3 + ε (1)
y ∼ x1 + x2 + x3 (2)
β0 = β1 = β2 = β3 = 1
• Estimate the (mis-specified) linear model according to
E [y |x ] = α̂0 + α̂1x1 + α̂2x2. (3)
y ∼ x1 + x2 (4)
• Estimate the correct linear model according to
E [y |x ] = β̂0 + β̂1x1 + β̂2x2 + β̂3x3 (5)
y ∼ x1 + x2 + x3 (6)
ASPC Interacting with Remote Systems Day 2 AM 26 / 44
TIGRESS Interacting with Remote Systems Single-Node Jobs Exercise Wrap-Up
Omitted Variable Bias
## Error: object ’form’ not found
Canonical result: estimates from the mis-specified model are biased.
ASPC Interacting with Remote Systems Day 2 AM 27 / 44
TIGRESS Interacting with Remote Systems Single-Node Jobs Exercise Wrap-Up
Omitted Variable Bias
Computational strategy:1 Define a function to represent the data-generating process.
2 Define a data.frame of test conditions indicating how to generatedata and estimate a model with expand.grid:
• Monte Carlo iterations• sample size• specification
3 Run the simulation with foreach in parallel.
4 Collect results in a data.frame of output.
ASPC Interacting with Remote Systems Day 2 AM 28 / 44
TIGRESS Interacting with Remote Systems Single-Node Jobs Exercise Wrap-Up
Omitted Variable Bias
Practical steps:1 Write the R code locally.
2 Debug locally using a small scale simulation.
3 Copy files up to Adroit.
4 Submit the job to the scheduler.
5 Copy the output back to the local machine.
6 Summarize and visualize results.
ASPC Interacting with Remote Systems Day 2 AM 29 / 44
TIGRESS Interacting with Remote Systems Single-Node Jobs Exercise Wrap-Up
Omitted Variable (Un?)Bias
• It turns out the degree of bias is not actually linked to the“correctness” of the specification in an intuitive way.
• Sometimes, fewer variables are “better”.• What really matters is the correlation between the included
regressors and the omitted regressors.
• Goal: Demonstrate this (perhaps) counter-intuitiveresult using a similar framework with a partner.
ASPC Interacting with Remote Systems Day 2 AM 30 / 44
TIGRESS Interacting with Remote Systems Single-Node Jobs Exercise Wrap-Up
Omitted Variable (Un?)Bias
• Simulate data according to
y = β0 + β1x1 + β2x2 + β3x3 + ε (7)
• We already have this function in R.• Estimate three different specifications:
1 y ∼ x1 + x2 + x32 y ∼ x1 + x23 y ∼ x1
• You can and should use the framework from handout as atemplate.
ASPC Interacting with Remote Systems Day 2 AM 31 / 44
TIGRESS Interacting with Remote Systems Single-Node Jobs Exercise Wrap-Up
Omitted Variable (Un?)Bias
The data-generating function (from the handout):genNormData <- function(.n, .scale = 1, .rho = .5) {
cN <- .n ; cK <- 3cRho <- .rhocSigma <- diag(cK)cSigma[2, 1] <- cRho ; cSigma[1, 2] <- cRhocSigma[3, 1] <- -cRho ; cSigma[1, 3] <- -cRhomL <- chol(cSigma)mZ <- matrix(rnorm(cN * cK, sd = .scale), nrow = cN)mX <- cbind(1, mZ %*% mL)mE <- matrix(rnorm(cN), nrow = cN)vB <- c(1, 1, 1, 1) ;mMu <- mX %*% vBvY <- rnorm(cN,
mean = as.vector(mMu))
ret <- data.frame(y = vY, mu = mMu,x1 = mX[, 2], x2 = mX[, 3],x3 = mX[, 4])
return(ret)}
ASPC Interacting with Remote Systems Day 2 AM 32 / 44
TIGRESS Interacting with Remote Systems Single-Node Jobs Exercise Wrap-Up
Omitted Variable (Un?)Bias
A Monte Carlo dataset:
head(genNormData(50))
## y mu x1 x2 x3## 1 2.3508 0.1520 -0.56048 -0.06086 -0.22668## 2 2.2590 0.9466 -0.23018 -0.13981 0.31659## 3 2.0426 2.3078 1.55871 0.74223 -0.99315## 4 2.9103 2.3671 0.07051 1.22050 0.07606## 5 -0.3227 0.0916 0.12929 -0.13088 -0.90681## 6 3.9531 4.4294 1.71506 2.17083 -0.45653
ASPC Interacting with Remote Systems Day 2 AM 33 / 44
TIGRESS Interacting with Remote Systems Single-Node Jobs Exercise Wrap-Up
Omitted Variable (Un?)Bias
Simulation Conditions:
• Monte Carlo simulations:
vMC <- 1:5000 # remotelyvMC <- 1:20 # locally
• Sample size:• Specification:
f1 <- formula(y ~ x1)f2 <- formula(y ~ x1 + x2)f3 <- formula(y ~ x1 + x2 + x3)lForms <- list(f1, f2, f3)
ASPC Interacting with Remote Systems Day 2 AM 34 / 44
TIGRESS Interacting with Remote Systems Single-Node Jobs Exercise Wrap-Up
Omitted Variable (Un?)Bias
Construct the grid for test conditions:
dfCases <- expand.grid(mc = vMC,n = vSize,form = c(1, 2, 3))
ASPC Interacting with Remote Systems Day 2 AM 35 / 44
TIGRESS Interacting with Remote Systems Single-Node Jobs Exercise Wrap-Up
Omitted Variable (Un?)Bias
Set up the parallel backend:
library("doParallel")
## Loading required package: foreach## Loading required package: iterators## Loading required package: parallel
cl <- makeCluster(2, "PSOCK")registerDoParallel(cl)
ASPC Interacting with Remote Systems Day 2 AM 36 / 44
TIGRESS Interacting with Remote Systems Single-Node Jobs Exercise Wrap-Up
Omitted Variable (Un?)BiasRun the simulation:
dfOut <- foreach(case = 1:nrow(dfCases),.combine = rbind,.export = c("lForms", "dfCases", "genNormData")) %dopar% {
cN <- dfCases$n[case]cForm <- dfCases$form[case]cMC <- dfCases$mc[case]dfTmpData <- genNormData(cN,.scale = .75)mod <- lm(formula = lForms[[cForm]],
data = dfTmpData)
mOut <- matrix(c(cMC, cForm, cN,coef(mod), rep(NA, times = 3 - cForm)),
nrow = 1)
ret <- as.data.frame(mOut)return(ret)
}
ASPC Interacting with Remote Systems Day 2 AM 37 / 44
TIGRESS Interacting with Remote Systems Single-Node Jobs Exercise Wrap-Up
Omitted Variable (Un?)Bias
Add useful column names:
names(dfOut) <- c("mc", "form", "n", "x0", "x1", "x2", "x3")
ASPC Interacting with Remote Systems Day 2 AM 38 / 44
TIGRESS Interacting with Remote Systems Single-Node Jobs Exercise Wrap-Up
Omitted Variable (Un?)BiasDoes the simulation output make sense?
head(dfOut)
## mc form n x0 x1 x2 x3## 1 1 1 100 1.1061 1.1070 NA NA## 2 2 1 100 1.1488 1.2008 NA NA## 3 3 1 100 0.9752 1.0529 NA NA## 4 4 1 100 1.2431 1.3084 NA NA## 5 5 1 100 1.1132 0.4977 NA NA## 6 6 1 100 1.2134 1.0090 NA NA
tail(dfOut)
## mc form n x0 x1 x2 x3## 355 15 3 2000 0.9576 1.0422 0.9793 1.0178## 356 16 3 2000 0.9465 0.9551 1.0409 1.0061## 357 17 3 2000 1.0203 1.0248 1.0081 1.0118## 358 18 3 2000 1.0139 1.0031 1.0210 1.0347## 359 19 3 2000 0.9934 0.9279 1.0038 0.9614## 360 20 3 2000 0.9950 0.9798 0.9995 0.9870
ASPC Interacting with Remote Systems Day 2 AM 39 / 44
TIGRESS Interacting with Remote Systems Single-Node Jobs Exercise Wrap-Up
Omitted Variable (Un?)Bias
Close down the parallel backend:
stopCluster(cl)
Save the results:
write.csv(dfOut, "ov_unbias.csv")
ASPC Interacting with Remote Systems Day 2 AM 40 / 44
TIGRESS Interacting with Remote Systems Single-Node Jobs Exercise Wrap-Up
Omitted Variable (Un?)Bias
1 Save your R code as ov_unbias.R.2 Write a SLURM Job Script:
#!/bin/sh#SBATCH --nodes 1 # << number of compute nodes#SBATCH --ntasks -per -node=4 # << number of processors per node#SBATCH -t 00:10:00 # << max time required (hh:mm:ss)#SBATCH -J ovunbias # << brief , descriptive job label#SBATCH -o log.%j # << file to save output/error
Rscript ov_unbias.R
Listing 2: ov_unbias.slurm
ASPC Interacting with Remote Systems Day 2 AM 41 / 44
TIGRESS Interacting with Remote Systems Single-Node Jobs Exercise Wrap-Up
Omitted Variable (Un?)Bias
1 Copy your code to Adroit.2 Log in to Adroit.3 Navigate to your SLURM script.4 Submit your job: sbatch ov_unbias.slurm .5 Check the queue: squeue -u netid .6 Check your log file: cat log.452 .7 Check that your output seems sensible: head ov_unbias.csv .8 Copy your output file back to your local machine.
ASPC Interacting with Remote Systems Day 2 AM 42 / 44
TIGRESS Interacting with Remote Systems Single-Node Jobs Exercise Wrap-Up
Omitted Variable (Un?)Bias
Visualize!
## Error: object ’variable’ not found
ASPC Interacting with Remote Systems Day 2 AM 43 / 44
TIGRESS Interacting with Remote Systems Single-Node Jobs Exercise Wrap-Up
Wrap-Up
• The entire development process using parallel computing onAdroit
• With practice, you will become more efficient.• Countless tools to make each of these steps a little easier.
• But, don’t want to learn to many new things at once.
• Developing and debugging locally is dramatically easier• Set up your workflow so that this is possible.
• Drawback: The current approach only allows us to parallelizeacross processors within a single computational node on Adroit
• This afternoon we discuss a more general approach.
ASPC Interacting with Remote Systems Day 2 AM 44 / 44